
Parallel Algorithms for Matrix Computations


Parallel Algorithms for Matrix Computations

K. A. Gallivan
Michael T. Heath
Esmond Ng
James M. Ortega
Barry W. Peyton
R. J. Plemmons
Charles H. Romine
A. H. Sameh
Robert G. Voigt

Society for Industrial and Applied Mathematics
Philadelphia

Copyright ©1990 by the Society for Industrial and Applied Mathematics.

All rights reserved. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Printed in the United States of America.

10 9 8 7 6 5 4

SIAM is a registered trademark.

Library of Congress Cataloging-in-Publication Data

Parallel algorithms for matrix computations / K. A. Gallivan ... [et al.].
    p. cm.
    Includes bibliographical references.
    ISBN 0-89871-260-2
    1. Matrices—Data processing. 2. Algorithms. 3. Parallel processing (Electronic computers) I. Gallivan, K. A. (Kyle A.)
    QA188.P367 1990
    512.9'434—dc20        90-22017

List of Authors

K. A. Gallivan, Center for Supercomputing Research and Development, University of Illinois, Urbana, IL 61801

Michael T. Heath, Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2009, Oak Ridge, TN 37831-8083

Esmond Ng, Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2009, Oak Ridge, TN 37831-8083

James M. Ortega, Applied Mathematics Department, University of Virginia, Charlottesville, VA 22903

Barry W. Peyton, Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2009, Oak Ridge, TN 37831-8083

R. J. Plemmons, Department of Mathematics and Computer Science, Wake Forest University, Winston-Salem, NC 27109

Charles H. Romine, Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2009, Oak Ridge, TN 37831-8083

A. H. Sameh, Center for Supercomputing Research and Development, University of Illinois, Urbana, IL 61801

Robert G. Voigt, ICASE, NASA Langley Research Center, Hampton, VA 23665


Preface

This book consists of three papers that collect, describe, or reference an extensive selection of important parallel algorithms for matrix computations. Algorithms for matrix computations are among the most widely used computational tools in science and engineering. They are usually the first such tools to be implemented in any new computing environment. The main driving force for the development of vector and parallel computers has been scientific and engineering computing. Due to recent trends in the design of computer architectures, the scientific and engineering research community is becoming increasingly dependent upon the development and implementation of efficient parallel algorithms for matrix computations on modern high-performance computers.

The purpose is to provide an overall perspective on parallel algorithms for both dense and sparse matrix computations in solving systems of linear equations, as well as for dense or structured problems arising in least squares computations, eigenvalue and singular-value computations, and rapid elliptic solvers. Since the amount of literature in these areas is quite large, an attempt has been made to select representative work. The volume contains two broad survey papers and an extensive bibliography. Short descriptions of the contents of each of the three papers in this book are provided in the following paragraphs.

The first paper (by Gallivan, Plemmons, and Sameh) contains a general perspective on modern parallel and vector architectures and the way in which they influence algorithm design. The paper also surveys associated algorithms for dense matrix computations: algorithms for dense or structured matrix computations in direct linear system solvers, direct least squares computations, eigenvalue and singular-value computations, and rapid elliptic solvers are considered. The authors concentrate on approaches to computations that have been used on shared-memory architectures with a modest number of (possibly vector) processors, as well as distributed-memory architectures, such as hypercubes, having a relatively large number of processors. Major emphasis is given to computational primitives whose efficient execution on parallel and vector computers is essential to attaining high-performance algorithms.

The second paper (by Heath, Ng, and Peyton) is primarily concerned with parallel algorithms for solving symmetric positive definite sparse linear systems, perhaps the most common problem that arises in practice. The authors focus their attention on direct methods of solution, specifically by Cholesky factorization. Parallel algorithms are surveyed for all phases of the solution process for sparse systems, including ordering, symbolic factorization, numeric factorization, and triangular solution. Architectures considered here include both shared-memory systems and distributed-memory systems, as well as combinations of the two. The architectures considered include both commercially available machines and experimental research prototypes.

The final paper (by Ortega, Voigt, and Romine) consists of an extensive bibliography on parallel and vector numerical algorithms. Over 2,000 references, collected by the authors over a period of several years, are provided in this work. Although this is primarily a bibliography on numerical methods, also included are a number of references on machine architecture, programming languages, and other topics of interest to computational scientists and engineers.

The book may serve as a reference guide on modern computational tools for researchers in science and engineering. It should be useful to computer scientists, mathematicians, and engineers who would like to learn more about parallel and vector computations on high-performance computers. The book may also be useful as a graduate text in scientific computing. For instance, many of the algorithms discussed in the first two papers have been treated in courses on scientific computing that have been offered recently at several universities.

R. J. Plemmons
Wake Forest University

Contents

Parallel Algorithms for Dense Linear Algebra Computations        1
K. A. Gallivan, R. J. Plemmons, and A. H. Sameh
(Reprinted from SIAM Review, March 1990)

Parallel Algorithms for Sparse Linear Systems        83
Michael T. Heath, Esmond Ng, and Barry W. Peyton

A Bibliography on Parallel and Vector Numerical Algorithms        125
James M. Ortega, Robert G. Voigt, and Charles H. Romine


PARALLEL ALGORITHMS FOR DENSE LINEAR ALGEBRA COMPUTATIONS*

K. A. GALLIVAN†, R. J. PLEMMONS‡, AND A. H. SAMEH†

Abstract. Scientific and engineering research is becoming increasingly dependent upon the development and implementation of efficient parallel algorithms on modern high-performance computers. Numerical linear algebra is an indispensable tool in such research and this paper attempts to collect and describe a selection of some of its more important parallel algorithms. The purpose is to review the current status and to provide an overall perspective of parallel algorithms for solving dense, banded, or block-structured problems arising in the major areas of direct solution of linear systems, least squares computations, eigenvalue and singular-value problems, and rapid elliptic solvers. A major emphasis is given here to certain computational primitives whose efficient execution on parallel and vector computers is essential in order to obtain high performance algorithms.

Key words. matrix computations, numerical linear algebra, parallel computation

AMS(MOS) subject classifications. 65-02, 65F05, 65F15, 65F20, 65N20

* Received by the editors March 6, 1989; accepted for publication (in revised form) October 31, 1989.
† Center for Supercomputing Research and Development, University of Illinois, Urbana, Illinois 61801. This research was supported by the Department of Energy under grant DE-FG02-85ER25001 and the National Science Foundation under grants NSF-MIP-8410110 and NSF-CCR-8717942.
‡ Departments of Computer Science and Mathematics, North Carolina State University, Raleigh, North Carolina 27695-8206. The work of this author was supported by the Air Force Office of Scientific Research under grant AFOSR-88-0285 and by the National Science Foundation under grant DMS-85-21154.

1. Introduction. Numerical linear algebra algorithms form the most widely used computational tools in science and engineering. Matrix computations, including the solution of systems of linear equations, least squares problems, and algebraic eigenvalue problems, govern the performance of many applications on vector and parallel computers. With this in mind we have attempted in this paper to collect and describe a selection of what we consider to be some of the more important parallel algorithms in dense matrix computations. In this paper, dense problems are taken to include block tridiagonal matrices in which each block is dense, as well as algorithms for banded matrices which are dense within the band.

Since the early surveys on parallel numerical algorithms by Miranker [133], Sameh [153], and Heller [91] there has been an explosion of research activities on this topic. Some of this work was surveyed in the 1985 article by Ortega and Voigt [138]. Their main emphasis, however, was on the solution of partial differential equations on vector and parallel computers. More recently, Ortega [137] published a textbook containing a discussion of direct and iterative methods for solving linear systems on vector and parallel computers, and Ortega, Voigt, and Romine produced an extensive bibliography of parallel and vector numerical algorithms [139]. We also point to the textbook by Hockney and Jesshope [100], which includes some material on programming linear algebra algorithms on parallel machines.

Our purpose in the present paper is to provide an overall perspective of parallel algorithms for dense matrix computations in linear system solvers, least squares problems, eigenvalue and singular-value computations, and rapid elliptic solvers. In particular, we concentrate on approaches to these problems that have been used on available, research and commercial, shared memory multivector architectures

with a modest number of processors and distributed memory architectures such as the hypercube. Given recent developments in numerical software technology, we believe this is appropriate and timely. Of course, the topics and the level of detail at which each is treated cannot help but be biased by the authors' interests. In particular, considerable attention is given here to the discussion and performance analysis, including blocksize analysis, of certain computational primitives and algorithms for high performance machines with hierarchical memory systems, e.g., a CRAY-2 or Cedar [117].

Many important topics relevant to parallel algorithms in numerical linear algebra are not discussed in this survey. Iterative methods for linear systems are not mentioned since the recent text by Ortega [137] contains a fairly comprehensive review of that topic, especially as it relates to the numerical solution of partial differential equations. Parallel algorithms using special techniques for solving generally sparse problems in linear algebra will also not be considered in this particular survey. Although significant results have recently been obtained, the topic is of sufficient complexity and importance to require a separate survey for adequate treatment.

The organization of the rest of this paper is as follows. Section 2 briefly discusses some of the important aspects of the architecture and the way in which they influence algorithm design. Section 3 contains a discussion of the decomposition of algorithms into computational primitives of varying degrees of complexity. Matrix multiplication and triangular system solvers are emphasized. Algorithms for LU and LU-like factorizations on both shared and distributed memory systems are considered in §4. Parallel factorization schemes for block-tridiagonal systems, which arise in numerous application areas, are discussed in detail. Section 5 concerns parallel orthogonal factorization methods on shared and distributed memory systems, e.g., on local memory hypercube architectures, for solving least squares problems. Recursive least squares computations are also discussed in terms of applications to computations in control and signal processing. Eigenvalue and singular value problems are considered in §6. Finally, §7 contains a review of parallel techniques for rapid elliptic solvers of importance in the solution of separable elliptic partial differential equations. Block cyclic reduction, recent domain decomposition, and boundary integral domain decomposition schemes are examined. Since the amount of literature in these areas is very large we have attempted to select representative work in each.

2. Architectures of interest. To satisfy the steadily increasing demand for computational power by users in science and engineering, supercomputer architects have responded with systems that achieve the required level of performance via progressively complex synergistic effects of the interaction of hardware, system software (e.g., restructuring compilers and operating systems), and system architecture (e.g., multivector processors and multilevel hierarchical memories). As a result, the algorithm designer encounters a complex relationship between performance, architectural parameters (cache size, number of processors), and algorithmic parameters (method used, blocksizes). Algorithm designers are faced with a large variety of system configurations even within a fairly generic architectural class such as shared memory multivector processors. As a result, for any particular system in the architectural class, codes for scientific computing such as numerical linear algebra take the form of a parameterized family of algorithms that can respond to changes within a particular architecture, e.g., changing the size of cluster or global memory on Cedar, or when moving from one member of an architectural family to another, e.g., Cedar to CRAY-2. The latter adaptation may, of course, involve changing the method used completely, say from Gaussian elimination with partial pivoting to a decomposition based on pairwise pivoting. Algorithm designers must therefore be sensitive to architecture/algorithm mapping issues, and any discussion of parallel numerical algorithms is incomplete if these issues are not addressed. Obviously, if scientific computing is to reap the full benefits of parallel processing, one of the main thrusts of parallel computing research must be to change this situation: cooperative research involving expertise in the areas of parallel software development (debugging, restructuring compilers, etc.), numerical algorithms, and parallel architectures is required to develop parallel languages and programming environments, along with parallel computer systems, that mitigate this architectural sensitivity. Such cooperative work is underway at several institutions.

The architecture that first caused a widespread and substantial algorithm redesign activity in numerical computing is the vector processor.¹ Such processors exploit the concept of pipelining computations. This technique decomposes operations of interest, e.g., floating point multiplication, into multiple stages and implements a pipelined functional unit that allows multiple instances of the computation to proceed simultaneously, one in each stage of the pipe. From the point of view of the functional unit, each of the stages operates on a set of vector elements that are moving through the pipe. Such parallelism is typically very fine-grain and requires the identification in algorithms of large amounts of homogeneous work applied to vector objects. Fortunately, numerical linear algebra is rich in such operations and the vector processor can be used with reasonable success.

¹ The details of the architectural tradeoffs involved in a vector processor are somewhat surprisingly subtle and complex. For an excellent discussion of some of them see [174].

The basic algorithmic parameter that influences performance is the vector length, i.e., the number of elements on which the basic computation is to be performed. Architectural parameters that determine the performance for a particular vector length include cycle time and the number of stages of the pipeline, as well as any other startup costs involved in preparing the functional unit for performing the computations. Various models have been proposed in the literature to characterize the relationship between algorithmic and architectural parameters that determine the performance of vector processors. Perhaps the best known is that of Hockney and Jesshope [100].

Obviously, the influence of the functional unit on algorithmic parameter choices is not the only consideration required. The Cyber 205 is a memory-to-memory vector processor that has been successfully used for scientific computation. On every cycle of a vector operation multiple operands are read from memory, and an element of the result of the operation is written to memory. Heavy demands are placed on the memory system in that it must process two reads and a write (along with any other control I/O) in a single cycle. Typically, such demands are met by using a highly interleaved or parallel memory system with M > 1 memory modules whose aggregate bandwidth matches or exceeds that demanded by the pipeline. Elements of vectors are then assigned across the memory modules in a simple interleaved form, i.e., v(i) is in module i mod M, or using more complex skewing schemes [193]. As a result, the reference pattern to the memory modules generated by accessing elements of a vector is crucial in determining the rate at which the memory system can supply data to the processor. The algorithmic parameter that encapsulates this information is the stride of vector access. For example, accessing a column of an array stored in column-major order results in a stride of 1, while accessing a row of the same array requires a stride of lda, where lda is the leading dimension of the array data object.

Not all vector processors are implemented with the three computational memory ports (2 reads/1 write) required by a memory-to-memory processor. The CRAY-1, one CPU of a CRAY-2, and one computational element of an Alliant FX/8 are examples of register-based vector processors that have a single port to memory and, to compensate for the loss in data transfer bandwidth, provide a set of vector registers internal to the processor to store operands and results.² Each of the registers can hold a vector of sufficient length to effectively use the pipelined functional units available. The major consequence of this, considered in detail below, is that such processors require careful management of data transfer between memory and register in order to achieve reasonable performance. In particular, care must be taken to reuse a register operand several times before reloading the register, or to accumulate as many partial results of successive computations in the same register as possible before storing the values to memory, i.e., reducing the number of loads and stores.

² Some register-based vector processors also have multiple ports to memory in an attempt to have the best of both worlds.

For example, on a machine without chaining, loading a vector from memory into a register and adding it to another register would require two distinct nonoverlapped vector operations and therefore two startup periods. The technique called chaining allows the result of one operation to be routed into another operation as an operand while both operations are active. Chaining allows the elements of the vector loaded into the first register to be made available, after a small amount of time, for use by the adder before the load is completed, thereby chaining the memory port and the appropriate functional units. For processors that handle chaining of instructions automatically at runtime, e.g., one CPU of a CRAY X-MP, careful consideration of the order of instructions used in implementing an algorithm or kernel is required. Some other vector processors make the chaining of functional units and the memory port an explicit part of the vector instruction set. For example, the Alliant FX/8 allows one argument of a vector instruction to be given as an address in memory; essentially, it appears as if the vector addition were taking one of its operands directly from memory. The best example of this is the workhorse of its instruction set, the triad, which computes v1 ← v2 + ax, where v1 and v2 are vector registers, a is a scalar, and x is a vector in memory. This instruction explicitly chains the floating point multiplier, the adder, and the memory port. Such instruction constructs greatly simplify the exploitation of the chaining capabilities of a vector processor at the cost of the loss of a certain amount of flexibility.

Some register-based vector processors also use two other techniques to improve performance. The first is the use of parallelism across functional units and ports: multiple instructions that have no internal resource conflict, e.g., adding two vector registers with the result placed in a third and loading of a fourth register from memory, are executed simultaneously. The second technique is essentially functional unit parallelism with certain resource dependences managed by the hardware at runtime. This influences kernel design in that careful ordering of assembler level instructions can improve the exploitation of the processor, making as much use of the available resources as possible.

While vector processors have been used and can deliver substantial performance for many computations, the quest for even more speed led to the availability and continuing development of MIMD multiprocessors and multivector processors. The processors on such machines are capable of executing arbitrary code segments in parallel and therefore subsume, assuming appropriate overhead levels, the fine-grain parallelism of vector processors.

Shared memory architectures have the generic structure shown in Fig. 1(a). They are characterized by the fact that the interconnection network links all of the processors to all of the memory modules. That is, a user-controlled processor can access any element of memory without the aid of another user-controlled processor. There is no concept of a direct connection between a processor and some subset of the remaining processors, i.e., a connection that does not involve the shared memory modules. Of course, in practice, few shared memory machines strictly adhere to this simple characterization. Many have a small amount of local memory associated with, and only accessible by, each processor; this connection can take on several forms. The aggregate size of these local memories is usually relatively insignificant compared to the large shared memory available. As local memory sizes increase, the architecture moves toward the distributed end of the architectural spectrum. The generic organization in Fig. 1(a) shows a highly interleaved or parallel memory system connected to the processors. Not surprisingly, the ability of the network/memory system to supply data to the multiple processors at a sufficient rate is one of the key components of performance of shared memory architectures. As a result, the organization and proper exploitation of this system must be carefully considered when designing high-performance algorithms.

FIG. 1. Some memory/processor topologies.

Ideally. in practice. On the other hand. in vector operations. SAMEH For a small number of processors and memory modules. The idea is similar to the use of registers within vector processors in that data can be kept for reuse in small fast memory private to each processor. The 17-network of Lawrie [119] can connect p = sk processors and memory modules with k network stages. AND A. however. PLEMMONS. As mentioned above. the addition of architectural features such as data prefetch mechanisms and local memory can provide some mitigation of this problem. One difference between local memories/caches and vector registers. p. Each stage comprises sk~l (s x s)-crossbars. The discussion above clearly shows that the optimization of algorithms for shared memory multivector architectures involve the consideration of the tradeoffs concern3 A computation is said to have high data locality if the ratio of the data elements to the number of operations is small. Some provide special purpose hardware for controlling small grain tasks on a moderate number of processors and simple TEST-AND-SET4 synchronization in memory. . is that registers have a prescribed shape and must be used. local memory or caches can contain. GALLIVAN.. up to a point. Unfortunately. one of the ways in which the performance of a large shared memory system can be improved is the introduction of local memories or caches with each processor. arbitrary data objects with no constraint on type or use. Fortunately. it is necessary to build scalable networks out of several smaller completely connected switches such as (s x s)-crossbars. the latency for each memory access grows as O(k). and setting the location if the test succeeds.6 K. for a total of O(p\ogsp) switches. testing its value. As p increases. 4 The TEST-AND-SET operation allows for the indivisible action of accessing a memory location. 
These mechanisms are required for the assignment of parallel work to a processor and enforcing data dependences to ensure correct operation once the assignment is made. H. Others provide more complex synchronization processors at the memory module or network level with capabilities such as FETCH-AND-OP or the Zhu-Yew primitives used on Cedar [196]. a high-performance bus or crossbar switch can provide complete connectivity and reasonable performance. there are some which are oriented toward large-grain task parallelism which rely more on system-software-based synchronization mechanisms with relatively large cost to coordinate multiple tasks within a user's job. Finally. A. For larger systems. If sufficient data locality3 is present in the computations the processor can proceed at a rate consistent with the data transfer bandwidth of the cache rather than the lower effective bandwidth of the large shared memory due to latency and conflicts. data skewing schemes and access stride manipulation are important in balancing the memory bandwidth achieved with the aggregate computational rate of the processors. for the most part. they must contain and be operated on as a vector v e 3?m where ra is the vector length. J. the Alliant FX/8. however. keeping the two within a small multiple is achievable for numerical linear algebra computations via the skewing and stride manipulations or with the introduction of local memory (discussed below). These differences can strongly affect the way that these architectural features influence algorithm parameter choices. It can be used as the basic building block of most synchronization primitives. The support found on the various multiprocessors varies considerably. R. the two should balance perfectly. Another feature which can significantly influence the performance of an algorithm on a shared memory machine is the architectural support for synchronization of processors. e. such networks quickly become too costly as p increases.g. 
often at the same time with the tasks of other users. As with vector processors.

ing the influence of architectural features, such as parallelism, vector computation, load balancing, synchronization, and parallel or hierarchical memory systems, on the choice of algorithm or kernel organization. Many of these are potentially contradictory. For example, increasing data locality by reorganizing the order of computations can directly conflict with the attempt to increase the vector length of other computations. The modeling and tradeoff analysis of these features will be discussed in detail below for selected topics.

Many shared memory parallel and multivector processors are commercially available over a wide range of price and performance. These include the Encore, Sequent, and Alliant FX series, and supercomputers such as the CRAY X-MP and Y-MP, CRAY-2, and NEC. The Alliant FX/8 possesses most of the interesting architectural features that have influenced linear algebra algorithm design on shared memory processors recently. Most of the detailed performance information for shared memory machines given below was obtained on this machine. It consists of up to eight register-based vector processors or computational elements (CE's), each capable of delivering a peak rate of 11.75 Mflops for calculations using 64-bit data (two operations per cycle), implying a total peak rate of approximately 94 Mflops. In particular, the vector triad instruction v ← v + αx (the preferred instruction for achieving high performance in many codes) has a maximum performance of 68 Mflops. The startup times for the vector instructions can reduce this rate significantly. Each CE has eight 32-element vector registers and eight floating point scalar registers as well as other integer registers. There is only one memory port on each CE, like the CRAY-1 and a single CPU of the CRAY-2; therefore management of the vector registers is crucial. The CE's are connected by a concurrency control bus (used as a synchronization facility). This mechanism allows an iteration of a parallel loop to be assigned to a processor in time equivalent to a few floating point operations and provides synchronization support from lower iterations to higher iterations with a cost of a few cycles. As a result, the CE's can cooperate efficiently on parallel loops with iterations with a granularity of a small number of floating point operations. The CE's share the physical memory as well as a write-back cache that allows up to eight simultaneous accesses per cycle. The size of the cache can be configured from 64KB up to 512KB. The cache and the four-way interleaved main memory are connected through the main memory bus.

Distributed memory architectures can be roughly characterized in a fashion similar to that used above for shared memory; there are, however, two major factors that distinguish them from shared memory architectures. These are the mode of memory access and the mode of synchronization. On p-processor distributed memory machines with an aggregate memory size M, each user-controlled processor has direct access to its local memory only, typically of size M/p. Accessing any other memory location requires the active participation of another user-controlled processor. As a result of this idea of direct interaction between processors to exchange data, distributed memory architectures are often identified by the topology of the connections between processors. Figure 1 illustrates three popular connection schemes. The ring topology (b) uses a linear nearest-neighbor bidirectional connection and is essentially a linear array with a wrap-around connection between the first and last processor, while the mesh connection (c) provides two-dimensional nearest neighbor connections (wrap-around meshes are also used extensively). Both of these simple topologies work quite well for many numerical linear algebra algorithms; several algorithms are presented below for ring architectures.

The hypercube connection is perhaps the most discussed distributed memory topology recently. The connection patterns are, as the name implies, local connections in an arbitrarily dimensioned space. A four-dimensional cube is illustrated in (d). In general, a k-dimensional cube has 2^k processors (vertices), each of which is connected to k other processors. It can be constructed from two (k − 1)-dimensional cubes by simply connecting corresponding vertices. As a result of this construction, the nodes have a very natural binary numbering scheme based on a Gray code. This construction also demonstrates one of the basic scalability problems of the hypercube in that the number of connections for a particular processor grows as the size of the cube increases, as opposed to the constant local connection complexity of the simpler mesh and ring topologies. Many of the more common topologies, such as rings and meshes, can be embedded into a hypercube of appropriate dimension. In fact, many of the hypercube algorithms published use the cube as if it were one of the simpler topologies. Commercially available hypercubes include those by Ametek, Intel, and NCUBE.

Synchronization on a distributed memory architecture, due to the memory accessing paradigm, is accomplished via a data flow mechanism rather than the indivisible update used in a large shared memory system.

FIG. 2. The Cedar multiprocessor.
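The Gray-code numbering and ring embedding described above can be sketched in a few lines (an illustration, not taken from the text; the function names are ours):

```python
# Sketch: the binary-reflected Gray code that gives hypercube nodes their
# natural numbering, and its use to embed a ring into the cube.

def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)

def are_neighbors(a, b):
    """Two hypercube nodes are connected iff their labels differ in one bit."""
    return bin(a ^ b).count("1") == 1

def ring_in_hypercube(k):
    """Order the 2**k hypercube nodes so that consecutive nodes in the list
    (including last -> first) are hypercube neighbors, i.e., a ring is
    embedded in the k-cube."""
    return [gray(i) for i in range(2 ** k)]
```

For example, `ring_in_hypercube(3)` visits the 3-cube's nodes in the order 0, 1, 3, 2, 6, 7, 5, 4, each step crossing exactly one cube edge.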

cessor when, due to its position in its local code, the processor decides a computation is to be performed and all of the memory transactions involving operands for the computation in remote memory modules are complete. (These transactions are the interaction between the processors associated with the local memory and the remote memory modules mentioned above.) As we would expect, the algorithm/architecture mapping questions for a distributed memory machine change appreciably from those of shared memory. Since the synchronization is so enmeshed in the control and execution of interprocessor communication, the major algorithmic reorganization that can alter the efficiency of the synchronization on distributed memory machines is the partitioning of the computations (or similarly the data) so as to reduce the synchronization overhead required.

To reduce total execution time, a key aspect of algorithm organization is the exposure of large amounts of parallelism, since the machines tend to have more, but less powerful, processors. Once this is accomplished the major task is the partitioning of the data and the computations onto the processors. This partitioning must address several tradeoffs. One indicator of the efficient partitioning of the computations and data is the relationship between the load balance across processors and the amount of communication between processors. Clearly, ignoring for a moment delays due to communication, a more balanced load produces a more parallel execution of the computations. On the other hand, dispersing the computations over many processors may increase the amount of communication required and thereby negate the benefit of parallelism. Typically, although not necessarily, there is a tradeoff between data locality and the amount of exploitable parallelism, as is shown below, and a suitable balance must be achieved between the amount of communication required and efficient spreading of the parallel computations across the machine.

Ideally, we would like to partition the data and computations across the processors and memory modules in such a way that a small amount of data is exchanged between processors at each stage of an algorithm, followed by the use of the received data in operations on many local data. As a result the cost of communication is completely amortized over the subsequent computations that make use of the data. If the partitioning of the computations and data also results in a balanced computational load, the algorithm proceeds near the aggregate computational rate of the machine. This is, of course, identical to the hierarchical memory problem of amortizing a fetch of a data operand from the farthest level of memory by combining it with several operands in the nearest. Therefore many of the discussions to follow concerning the data-transfer-to-operations ratios that are motivated by shared hierarchical memory considerations are often directly applicable to the distributed memory case. The property of data locality, which was very significant for shared memory machines in the management of registers and hierarchical memory systems, is also very important for some distributed memory machines in achieving the desired balance.

Of course, there is a spectrum of architectures, and a particular machine tends to have characteristics of both shared and distributed memory architectures. For these hybrid architectures, efficient algorithms often involve a combination of techniques used to achieve high performance on the two extremes. An example of such an architecture that is used in this paper to facilitate the discussion of these algorithms is the Cedar system being built at the University of Illinois Center for Supercomputing Research and Development (see Fig. 2). It consists of clusters of vector processors connected to a large interleaved shared global memory — access to which can be accelerated by data prefetching hardware.
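The communication/computation tradeoff discussed above can be illustrated with a toy surface-to-volume count (the model and both formulas here are our simplifying assumptions, not the paper's analysis): for a matrix product with n × n operands on p processors, a two-dimensional block partitioning moves fewer words per floating point operation than a one-dimensional one.

```python
import math

# Toy model: words communicated per flop, per processor, for C <- C + A*B
# with n x n matrices on p processors under two partitionings.

def one_d_ratio(n, p):
    # Block rows: each processor owns n/p rows of A and C, but must see all
    # of B (n*n words) while performing 2*n*n*n/p flops.
    comm = n * n
    flops = 2 * n ** 3 / p
    return comm / flops

def two_d_ratio(n, p):
    # sqrt(p) x sqrt(p) block grid (p a perfect square): each processor
    # receives a block row of A and a block column of B, 2*n*n/sqrt(p)
    # words, for the same number of flops.
    q = math.isqrt(p)
    assert q * q == p, "p must be a perfect square in this sketch"
    comm = 2 * n * n / q
    flops = 2 * n ** 3 / p
    return comm / flops
```

The ratio shrinks as the local blocks grow, which is the amortization effect described above: the two-dimensional partitioning communicates O(1/√p) of the data per processor while the one-dimensional one communicates O(1).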

As indicated earlier, the Cedar machine is characterized by its hierarchical organization in both memory and processing capabilities. The memory hierarchy consists of: vector registers private to each vector processor; cache and cluster memory shared by the processors within a cluster; and global memory shared by all processors in the system. Three levels of parallelism are also available: vectorization at the individual processor level, concurrency within each cluster, and global concurrency across clusters. Each cluster is, in turn, a slightly modified Alliant FX/8, i.e., a shared memory multivector processor, whose cluster memory is accessible only by its CE's. At this level it looks much like a conventional shared memory processor. The size of the cluster memory is fairly large and therefore the aggregate makes up a considerable distributed memory system.

Control and synchronization mechanisms between clusters are supported at two levels of granularity. The larger consists of large-grain tasks and multitasking synchronization primitives such as event waiting and posting, similar to CRAY large-grain primitives. These primitives are relatively high cost in that they affect the state of the task from the point of view of the operating system, e.g., a task waiting for a task-level event is marked as blocked from execution and removed from the pool of tasks considered by the operating system when allocating computational resources. The second and lower-cost control mechanism is the SDOALL loop (for spread DOALL), which provides a self-scheduling loop structure whose iterations are grabbed and executed at the cluster level by helper tasks created at the initiation of the user's main task. Each iteration can then use the smaller grain parallelism and vectorization available within the cluster upon which it is executing. The medium grain SDOALL loop is ideal for moderately tight intercluster communication such as that required at the highest level of control in multicluster primitives with BLAS-like functionality that can be used in iterations such as the hybrid factorization routine presented in §4. Hardware support for synchronization between clusters on a much tighter level than the task events is supplied by synchronization processors, one per global memory module, which implement the Zhu-Yew synchronization primitives [196].

3. Computational primitives.

3.1. Motivation. The development of high-performance codes for a range of architectures is greatly simplified if the algorithms under consideration can be decomposed into computational primitives of varying degrees of complexity. These algorithms can be expressed in terms of computational primitives ranging from operations on matrix elements to those involving submatrices. As new architectures emerge, primitives with the appropriate functionality which exploit the novel architectural features are chosen and used to develop new forms of the algorithms. Over the years, such a strategy has been applied successfully to the development of dense linear algebra codes. As the pursuit of high performance has increased the complexity of computer architectures, the need to exploit this richness of decomposition has been reflected in the evolution of the Basic Linear Algebra Subroutines (BLAS). Consequently, for many presently available machines the computational granularity represented by single instances of the BLAS primitives from one of the levels, or multiple instances executing simultaneously, is sufficient for investigating the relative strengths and weaknesses of the architecture with respect to dense matrix computations.

The investigation of dense matrix algorithms in terms of decomposition into lower-level primitives such as the three levels of the BLAS has several advantages. First, since the primitive's computational complexity is manageable, it is possible to probe at an architecture/software level5 which is free of spurious software considerations, such as ways of tricking a restructuring compiler/code generator combination into producing the code we want.
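The self-scheduling behavior of an SDOALL-style loop described above can be sketched with ordinary threads (a toy illustration of the scheduling idea only, not Cedar's actual mechanism; all names here are ours):

```python
import threading

# Self-scheduling loop sketch: helper workers repeatedly grab the next
# unclaimed iteration index from a shared counter, so faster workers
# automatically take more iterations (dynamic load balance).

def sdoall(n_iters, body, n_workers=4):
    counter = [0]
    lock = threading.Lock()

    def helper():
        while True:
            with lock:                 # grab the next iteration index
                i = counter[0]
                counter[0] += 1
            if i >= n_iters:
                return
            body(i)                    # the iteration itself may use finer-grain parallelism

    workers = [threading.Thread(target=helper) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Each `body(i)` call plays the role of one SDOALL iteration; in the Cedar setting an iteration would in turn exploit vectorization and concurrency within the cluster that grabbed it.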

This allows meaningful conclusions to be reached about the most effective way to use a new machine, and about the limits of performance improvement possible via a particular technique such as blocking, e.g., on hierarchical memory systems [63].

Second, detailed knowledge of the efficient mapping of primitives to different architectures provides a way of thinking about algorithm design that facilitates the rapid generation of new versions of an algorithm by the direct manipulation of its algebraic formulation [75]. Additionally, preferences within a set of primitives can be identified by such an analysis; for example, on certain architectures a rank-1 BLAS2 primitive does not perform as well as a matrix-vector multiplication. (See the discussion of triangular system solvers below for a simple example.)

Third, it simplifies the design of numerical software for nonexpert users. Ideally, code is designed in terms of a sequential series of calls to primitives which use all of the resources of the machine in the best way to achieve high performance. This typically occurs through the use of total primitives, i.e., primitives which hide all of the architectural details crucial to performance from the user. When such a strategy is possible a certain amount of performance portability is achieved as well. Unfortunately, many important architectures do not lend themselves to total primitives. Even in this case, however, the hiding of parts of the architecture via partial primitives is similarly beneficial; a user need only deal with managing the interaction of the partial primitives, which may or may not execute simultaneously.

Fourth, exposing the weaknesses of an architecture for the execution of basic primitives provides direction for architectural development.

Finally, it aids in the identification of directions in language and restructuring technologies that would help in the implementation of high-performance scientific computing software. For example, matrix-manipulation constructs are already included in many proprietary extensions to Fortran due to the need for higher-level constructs to achieve high performance on some machines. Ideally, the analysis should also yield insight into techniques a compiler could use to automatically restructure code to improve performance.

3.2. Architecture/algorithm analysis methodology. The design of efficient computational primitives and algorithms that exploit them requires an understanding of the behavior of the algorithm/primitive performance as a function of certain system parameters (cache size, number of processors, etc.). It is particularly crucial that the analysis of this behavior identifies any contradictory trends that require tradeoff consideration. In this paper we are mostly concerned with analyses that consider the effects of hierarchical (registers, cache or local memory, global memory) or distributed memory systems. Based on the discussion in §2, which indicates that the investigation of data locality is of great importance for both shared and distributed memory machines, special attention is given to identifying the strengths and weaknesses of each primitive in this regard and its relationship to the amount of exploitable parallelism. In this section, computational primitives from each level of the BLAS hierarchy are discussed and analyses of their efficiency on the architectures of interest in this paper are presented in various degrees of detail. Indeed, the consideration of data locality and its relationship to the exploitable parallelism in an algorithm is a key activity in developing high-performance algorithms for both hierarchical shared memory and distributed memory architectures.

5 Very loosely speaking this is usually the assembler level, i.e., the level at which the user has direct control over performance-critical algorithm/architecture tradeoffs.

Several papers have appeared recently which discuss modeling the influence of a hierarchical memory on numerical algorithms. Of particular interest here are studies by the groups at the University of Illinois on shared memory multivector processors (the Cedar Project) [9], [66], [67], [76], [99], [101], [105], [186], and at the California Institute of Technology on hypercubes (the Caltech Concurrent Computation Program) [59], [61]. Earlier work on virtual memory systems also discusses similar issues [3], e.g., the work of McKellar and Coffman [131] and of Trivedi [185]. In fact, the work of Trivedi performs many of the analyses for virtual memory systems that were later needed for both BLAS2 and BLAS3, such as the effect of blocking and prefetching. The details and assumptions for the hierarchical memory case, however, differ enough to require the further investigation that has taken place. In this section, we point out some performance modeling efforts concerning these tradeoffs that have appeared in the literature and present a summary of the techniques used on hierarchical shared memory architectures to produce some of the results discussed in later sections.

Gallivan, Jalby, Meier, and Sameh [67], [105] proposed the use of a decoupling methodology to analyze, in terms of certain architectural parameters, the trends in the relationship between the performance and the blocksizes used when implementing BLAS3 primitives and block algorithms on a shared memory multivector processor. In these studies performance analyses were developed to express the influence of the blocksizes, used in both the matrix multiplication primitives and the block algorithms built from them, e.g., loop orderings in the LU factorization, on performance in terms of architectural parameters. They considered an architecture comprising a moderate number (p) of vector processors that share a small fast cache or local memory and a larger slower global memory. (The analysis is easily altered for the private cache or local memory case.) An example of such an architecture is the Alliant FX/8.

In their methodology, two time components, whose sum is the total time for the algorithm, are analyzed separately. A region in the parameter space, i.e., the space of possible blocksize choices, that provides near-optimal behavior is produced for each time component. The intersection of these two regions yields a set of blocksizes that should give near-optimal performance for the time function as a whole.

The first component considered is called the arithmetic time and is denoted Ta. This time represents the raw computational speed of the algorithm and is derived by ignoring the hierarchical nature of the memory system: it is the time required by the algorithm given that the cache is infinitely large. The second component of the time function considered is the degradation of the raw computational speed of the algorithm due to the use of a cache of size CS and a slower main memory. This component is called the data loading overhead and is denoted Δl. The components Ta and Δl are respectively proportional to the number of arithmetic operations and data transfers, from memory to cache, required by the algorithm; the total time for the algorithm is

T = Ta + Δl = na τa + nl τl,

where na and nl are the number of operations and data transfers, and τa and τl are the associated proportionality constants or the "average" times for an operation and data load. Note that no assumptions have been made concerning the overlap (or lack thereof) of computation and the loading of data in order to write T as a sum of these two terms. The effect of such overlapping is seen through a reduction in τl.
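The decoupled time model above can be evaluated numerically; the parameter values in the usage below are arbitrary illustrations, not measurements from any machine:

```python
# T = Ta + Dl = na*ta + nl*tl, following the decoupled model above:
#   na, nl : number of arithmetic operations and data transfers
#   ta, tl : average times for one operation and one data load

def total_time(na, nl, ta, tl):
    Ta = na * ta        # arithmetic time (infinite-cache speed)
    Dl = nl * tl        # data loading overhead
    return Ta + Dl

def relative_overhead(na, nl, ta, tl):
    # Dl/Ta = mu * lam, where mu = nl/na is the cache-miss ratio
    # and lam = tl/ta is the cost ratio.
    mu, lam = nl / na, tl / ta
    return mu * lam
```

For instance, an algorithm with na = 10^6 operations, nl = 5·10^4 transfers, ta = 1, and tl = 8 has Δl/Ta = 0.05 · 8 = 0.4: even a cache-miss ratio of only 5% costs 40% extra time when a data load is eight times slower than an operation, which is exactly why the analysis works to shrink μ.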

This overlap effect can cause τl to vary from zero, for machines which have a perfect prefetch capability from memory to cache, to t1, for machines which must fetch data on demand sequentially from memory to cache, where t1 is the amount of time it takes to transfer a single data element.

The analysis of Ta considers the performance of the algorithm with respect to the architectural parameters of the multiple vector processors and the register-cache hierarchy under the assumption of an infinite cache. For some machines, the register-cache hierarchy is significant enough to require another application of the decoupling methodology with the added constraint of the shape of the registers. Typically, the analysis involves questions similar to those discussed concerning the BLAS2 below.

The data loading overhead can be analyzed so as to produce a region in the parameter space where the relative cost of the data loading Δl/Ta is small. Rather than considering Δl directly, the second portion of the analysis attempts a more modest goal. For the purposes of qualitative analysis, this is accomplished by expressing Δl/Ta in terms of two ratios, Δl/Ta = μλ, where μ = nl/na is the cache-miss ratio and λ = τl/τa is the cost ratio. In some cases, Δl can be bounded under various assumptions (average case, worst case, etc.) and trends in the behavior of the primitive or algorithm derived in terms of architectural parameters via the consideration of the behavior of the cache-miss ratio μ as a function of the algorithm's blocksizes. The utility of the results of the decoupling form of analysis depends upon the fact that the intersection of the near-optimal regions for each term is not empty, or at least that the arithmetic time does not become unacceptably large when using parameter values in the region where small relative costs for data loading are achieved. For some algorithms this is not true; reducing the arithmetic time may directly conflict with reducing the relative cost of data loading. In some cases, a technique known as multilevel blocking can mitigate these conflicts [67]. In other cases, more machine-specific tradeoff studies must be performed. These studies typically involve probing the interaction of data motion to and from the various levels of memory and the underlying hardware to identify effective tradeoffs [64], [65].

On distributed memory machines, analyses in the spirit of the decoupling methodology can be performed. Fox, Otto, and Hey [59], [61] analyzed the efficiency of the broadcast-multiply-roll matrix multiplication algorithm and other numerical linear algebra algorithms on hypercubes in terms of similar parameters. In particular, they expressed efficiency in terms of the number of matrix elements per node (blocksize), the number of processors, and a cost ratio t_comm/t_flop which gives the relative cost of communication to computation. Johnsson and Ho [110] presented a detailed analysis of matrix multiplication on a hypercube with special attention to the complexity of the communication primitives required and the associated data partitioning.

3.3. First and second-level BLAS. The first level of the BLAS comprises vector-vector operations such as the dotproduct, α ← x^T y, and the vector triad (SAXPY), y ← y ± αx [121]. This level was used to implement the numerical linear algebra package LINPACK [38]. These primitives possess a simple one-dimensional parallelism especially suitable for vector processors with sufficient memory bandwidth to tolerate the high ratio of memory references to operations. Typically, μ ≈ 3/2 for the triad and μ ≈ 1 for the dotproduct.
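These BLAS1 ratios follow directly from counting transfers and operations, as the following sketch (ours, not from the text) makes explicit:

```python
# Transfer-per-operation ratios for the two BLAS1 kernels discussed above,
# on a single processor with vectors of length n.

def mu_triad(n):
    # y <- y + alpha*x : read x and y (2n words), write y (n words); 2n flops.
    loads, stores, flops = 2 * n, n, 2 * n
    return (loads + stores) / flops          # -> 3/2

def mu_dot(n):
    # alpha <- x^T y : read x and y (2n words), write one scalar; 2n flops.
    loads, stores, flops = 2 * n, 1, 2 * n
    return (loads + stores) / flops          # -> 1 as n grows
```

The dotproduct's advantage is visible in the counts: it stores a single accumulated scalar where the triad must store an entire result vector.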

For vector processors, performance tuning is limited to adjusting the vector length and stride of access. On multivector processors, both primitives are easily decomposed into several smaller versions of themselves for parallel execution. For small and moderate p, where p is the number of processors, μ ≈ 3/2 + p/(2n) for the triad (note that the fetch of α becomes more significant) and μ ≈ 1 + p/(2n) for the dotproduct. The superiority of the dotproduct is due to the fact that it is a reduction operation that writes a scalar result after accumulating it in a register. The triad, on the other hand, produces a vector result and must therefore write n elements to memory in addition to reading the 2n elements of the operands. As the number of processors increases to a maximum of n, however, the preference for the dotproduct over the triad is reversed. For p = n the triad requires O(1) time with μ ≈ 2, while the dotproduct requires O(log n) time with μ ≈ 2. The dependence graph of the reduction operation and the properties that produced a small μ for a limited number of processors scale very poorly as p increases, and translate directly into a relative increase in the amount of memory traffic required on a shared memory architecture and interprocessor communication on a distributed memory machine. Such a reversal often occurs when considering large numbers of processors relative to the dimension of the primitive. (For a distributed memory machine, whether or not the reversal of preference occurs can depend strongly on the initial partitioning of the data.)

The advent of architectures with more than a few processors, and of high-performance register-based vector processors with limited processor-memory bandwidth such as the CRAY-1, exposed the limitations of the first level of the BLAS. New implementations of dense numerical linear algebra algorithms were developed which paid particular attention to vector register management, and an emphasis on matrix-vector primitives resulted [24], [56]. This problem was later analyzed in a more systematic way in [42] and resulted in the definition of the extended BLAS or BLAS2 [40].

The second level of the BLAS includes computations involving O(n²) operations, such as a matrix-vector multiplication, y ← y ± Ax, and a rank-1 update, A ← A ± xy^T. Note that these primitives subsume the triad and dotproduct BLAS1 primitives and become those primitives in the limit as one of the dimensions of A tends to 1. These primitives improve data locality in the sense that the number of memory references per operation can be reduced by accumulating the results of several vector operations in a vector register before writing to memory, as in matrix-vector multiplication, or by keeping in registers operands common to successive vector operations, as in a rank-1 update. The two techniques, however, do not result in similar improvements in data locality; in general, it is preferable to write algorithms for register-based multivector processors in terms of matrix-vector multiplications rather than rank-1 updates.

To see this, consider first the efficiency of implementing the two BLAS2 primitives as a set of BLAS1 primitives, each of the order of the matrix. (For the rank-1 it is only possible to use the triad; the matrix-vector multiplication allows a choice of primitives.) If the matrix dimensions n1 and n2 are larger than the register size6 of any of the processors there is no possibility of efficient register reuse and the value of μ remains at the disappointing BLAS1 level.

6 The term register size does not necessarily mean the vector length of a single vector register. It can also refer to the aggregate size of all of the vector registers used in a processor in a given implementation of the primitive.

For problems where either n1 or n2 is smaller than the register size, it is possible to reuse the registers in such a way that both primitives achieve their theoretical minimum values of μ. In this case, each element of the matrix and the two vectors is loaded into the processor exactly once and the elements of the matrix are written exactly once — the optimal data transfer behavior for a rank-1 update; every data element is read and written the minimum number of times. The values are μ = 1 + 1/(2n1) + 1/(2n2) for the rank-1 update and μ = 1/2 + 1/(2n1) + 1/n2 for the matrix-vector product. For the rank-1 update, this local optimum is achieved by reading the small vector into a vector register once and reusing it to form a triad with each row or column of the matrix in turn. For the matrix-vector product, the technique depends upon whether n1 or n2 is the small dimension. If it is n2 then a technique similar to the rank-1 update is used: the vector x is loaded into a register once, each row of A is read in turn and used in an inner product calculation with x in the register, and the result is then added to the appropriate element of y and written back to memory. If the small dimension is n1 then a slightly different technique is used: the result of the operation, y, is accumulated in a vector register, thereby suppressing the writes back to memory of partial sums.

If these results could be maintained for arbitrary n1 and n2, the superiority of the BLAS2 on register-based multivector processors would be established. To show that this is indeed possible, we will exploit the richness of structure present in linear algebra computations and partition the primitives into smaller versions of themselves. This is accomplished by partitioning A into k1k2 submatrices Aij ∈ ℜ^{m1×m2}, where it is assumed for simplicity that ni = ki mi with ki and mi integers, and partitioning x and y conformally. The rank-1 update is thus reduced to k1k2 independent small rank-1 updates, and the algorithm requires O(m1m2) time. The blocksizes which determine the partitioning are chosen so that the smaller instances of the primitives are locally optimal with respect to their values of μ. Since the register size determines the largest vector object we can work with, and extra transfers to and from registers translate directly into additional time, we make m1⁻¹ + m2⁻¹ as small as possible under the constraint that either m1 ≤ r or m2 ≤ r, where r is the register length, depending on the implementation chosen for the register-based smaller rank-1 update. It follows that μ = 1 + 1/(2r) + 1/(2m2) for the small rank-1; the values are an improvement compared to their limiting first level primitives. The resulting global μ value for the entire rank-1 update is μ = 1 + 1/(2m1) + 1/(2m2).

Now consider its behavior as p, the number of register-based vector processors used, increases. We set p = k1k2, which balances the computational load and the amount of data required by each processor. For small and moderate p, one of the blocksizes, say m1, could be taken equal to the corresponding dimension of the matrix, n1 (the choice of m1 or m2 simply depends upon the shape of the matrix and the exact number of processors). As long as n1 or n2 do not get very small, the rank-1 update still has a value of μ similar to the dotproduct BLAS1 primitive. As p increases further, a true two-dimensional partitioning must be used. At the limit of available parallelism, p = n1n2, and the rank-1 update requires O(1) time with μ = 2, which implies that the primitive is degenerating into a first level primitive; this is the same as the best BLAS1 primitive, but it has the advantage of more exploitable parallelism. This is not surprising since in the limit each processor is doing essentially the same scalar computation as the BLAS1 triad.
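The transfer ratios derived above are easy to tabulate; note in particular how each formula collapses to its BLAS1 limit as one dimension of A shrinks to 1 (a sketch of ours, simply evaluating the formulas in the text):

```python
# Register-reuse transfer ratios for the BLAS2 primitives, as functions of
# the matrix dimensions n1 x n2, following the formulas derived above.

def mu_rank1(n1, n2):
    # A <- A + x*y^T : mu = 1 + 1/(2*n1) + 1/(2*n2)
    return 1 + 1 / (2 * n1) + 1 / (2 * n2)

def mu_matvec(n1, n2):
    # y <- y + A*x : mu = 1/2 + 1/(2*n1) + 1/n2
    return 0.5 + 1 / (2 * n1) + 1 / n2
```

For a large square matrix, `mu_matvec` approaches 1/2 while `mu_rank1` approaches 1, which is the factor-of-two preference for the matrix-vector product noted below; with n2 = 1 the rank-1 ratio degenerates to the triad's 3/2, and with n1 = 1 the matrix-vector ratio degenerates to the dotproduct's 1.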

The only difference is that in the BLAS2 case there is much more exploitable parallelism.

A similar decomposition technique can be used for the matrix-vector product primitive y ← y ± Ax. The matrix is partitioned into submatrices Aij ∈ ℜ^{m1×m2}, and x and y are partitioned conformally. The resulting algorithm is

do i = 1,k1
   yi ← yi + Ai1x1 + ... + Aik2xk2
end do

All of the basic computations z ← z + Aijxj can proceed in parallel, with a fan-in dependence graph required on the update of the yi if k2 > 1. If k2 = 1 then no synchronization is required and the loop can execute in parallel; in the latter case, where k2 is greater than 1, synchronization is required. Due to the reduction nature of the matrix-vector product, its time has a lower bound of O(log n2). However, since the number of processors is assumed small, the partial sums from local matrix-vector products can be accumulated in a vector of length n1 private to each processor (not necessarily a register). After all processors are finished accumulating their partial sums, a simple fan-in of the results can be done. As before, for a small to moderate number of processors one of the mi can be set to the register length and the other to the remaining dimension of A. Note that on a moderate number of processors the matrix-vector primitive is twice as efficient as the rank-1 primitive of the same size.

As with the rank-1 update, it is possible to derive an estimate of the time and the value of μ for the case where a two-dimensional partitioning is used with p = k1k2. In this case, not only must the transfers be computed for the small matrix-vector products performed by each of the processors, but also the transfers associated with the k1 independent fan-in trees which sum together the partial sums into the final values of yi for 1 ≤ i ≤ k1. The time required is O(m1m2) + O(m1 log2 k2), with a resulting global μ that depends on the vector length r and the blocksizes. Note also that at some point while increasing the number of processors the vector length used by each processor will fall below the breakeven point for the use of the vector capability of the processor, and the switch should be made to scalar mode. Finally, as with all of the other primitives, when p is as large as possible, in this case p = n1n2, the value of μ increases to approximately 2.

The results above demonstrate several important points about first- and second-level BLAS primitives.
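The partitioned matrix-vector product with fan-in summation described above can be sketched sequentially in plain Python (a stand-in for the parallel execution; the helper names are ours, and A is a list of row lists):

```python
# Blocked y <- y + A*x with a k1 x k2 block partitioning: each block-times-
# subvector product is an independent task, and the k2 partial results for
# each block row are combined by a pairwise fan-in tree, mirroring the
# dependence graph described in the text.

def blocked_matvec(A, x, k1, k2):
    n1, n2 = len(A), len(A[0])
    m1, m2 = n1 // k1, n2 // k2          # assume ni = ki*mi exactly
    y = [0.0] * n1
    for i in range(k1):                  # block rows: fully independent
        partials = []
        for j in range(k2):              # each z <- A_ij * x_j could run in parallel
            z = [sum(A[i*m1 + r][j*m2 + c] * x[j*m2 + c] for c in range(m2))
                 for r in range(m1)]
            partials.append(z)
        while len(partials) > 1:         # fan-in tree over the k2 partial sums
            merged = [[a + b for a, b in zip(u, v)]
                      for u, v in zip(partials[::2], partials[1::2])]
            if len(partials) % 2:
                merged.append(partials[-1])
            partials = merged
        y[i*m1:(i+1)*m1] = partials[0]
    return y
```

The inner block products and the fan-in levels are exactly the units that would be distributed across processors; the sequential loop here only fixes their dependence order.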

The most important is that for register-based multivector processors with a moderate number of processors, by far the preferred primitive is the matrix-vector product, due to its superior register management. Indeed, there can be a significant difference between the performance of a given algorithm when implemented in terms of the four primitives discussed above. This performance order is given from worst to best in terms of decreasing values of μ. The triad, with μ ≈ 3/2, does far too many spurious data transfers to be of use on a processor with a single port to memory. The dotproduct improves the ratio to μ ≈ 1, but not all processors have high-performance reduction capabilities. The BLAS2 rank-1 update primitive also has μ ≈ 1, but it does not depend upon efficient reduction operations on vector registers being available on a processor, and its extra dimension of parallelism makes it more flexible than the previous primitives.

The second observation from the results above is how the preferences can reverse when the architecture used is radically altered. In this case we considered increasing the number of register-based vector processors available to the maximum needed. It was shown that in the limit all have similar register-memory transfer behavior and the nonreduction operations have a distinct advantage if it is assumed that the data and computations have been partitioned ideally. This last point is crucial. Our discussions implicitly assumed a shared memory architecture when increasing the number of processors. While the results do hold for certain distributed memory architectures, they can be very sensitive to the assumptions concerning initial data partitioning. If for some reason the data had been partitioned in a different way the trends need not be the same.

3.4. Third-level BLAS.

3.4.1. Motivation. The highest level of the BLAS is motivated by the use of memory hierarchies. On such systems, only the lowest level of the hierarchy (or in some cases the two lowest, e.g., registers and cache) is able to supply data at the computational bandwidth of the processors. Hence, data locality must be exploited to allow computations to involve mostly data located in the lowest levels. This allows the cost of the data transfer between levels to be amortized over several operations performed at the computational bandwidth of the processors.

This problem of data reuse in the design of algorithms has been studied since the beginning of scientific computing. Early machines, which had small physical memories, required the use of secondary storage such as tape or disk to hold all of the data for a problem. The block algorithms developed for such architectures relied on transferring large submatrices between different levels of storage and localizing operations to achieve acceptable performance. Similar considerations were also needed on later machines with paged virtual memory systems, with prepaging in some cases. Of course, as Calahan points out [23], the use of matrix-matrix modules was considered when developing algorithms for the CRAY-1, and the resulting matrix-matrix primitives could have been used in algorithms for the machines which motivated the BLAS2. The memory hierarchy, however, was not distinct enough to achieve a significant advantage over BLAS2 primitives. The introduction of the CRAY X-MP and its additional memory ports delayed even further the move to the next level of the BLAS. The move was finally caused by the availability of high-performance architectures which rely on the use of a hierarchical memory system, with more profound performance consequences when it is not used correctly.

Agarwal and Gustavson designed matrix multiplication primitives and block algorithms for solving linear systems to exploit the cache memory on the IBM 3090 in the latter part of 1984. These evolved into the algorithms contained in ESSL, first released in the middle of 1985 for the IBM 3090 with vector processing capabilities [1].

In 1985, Bischof and Van Loan developed the use of block Householder reflectors in computing the QR factorization and presented results on an FPS-164/MAX [16]. At approximately the same time, Calahan developed block LU factorization algorithms for one CPU of the CRAY-2 [23]. A numerical linear algebra library based on block methods was developed, and its performance analyzed in terms of architectural parameters, in 1985 and early 1986 for a single cluster of the Cedar machine, the multivector processor Alliant FX/8 [9], and more recently for the multiprocessor version of the architecture [2]. The development of these routines and numerical linear algebra libraries clearly demonstrated that a third level of primitives, or BLAS3, based on matrix-matrix computations was required to achieve high performance on the emerging architectures. In 1987, an effort began to standardize for Fortran 77 the BLAS3 primitives and block methods for numerical linear algebra [35]. Since the reawakening of interest in block methods for linear algebra, many papers have appeared in the literature considering the topic on various machines, e.g., [5], [37], [39], [44], [84], [130], [149]. The techniques have become so accepted that some manufacturers now provide high-performance libraries which contain block methods and matrix-matrix primitives; some, such as Alliant, provide matrix multiplication intrinsics in their concurrent/vector processing extensions to Fortran [156].

Third-level primitives perform O(n^3) operations on O(n^2) data. Such primitives achieve a significant improvement in data locality, and they increase the parallelism available by yet another dimension, by essentially consisting of multiple independent BLAS2 primitives. The most basic BLAS3 primitive is a simple matrix operation of the form

    C ← C + AB,

where C, A, and B are n1 x n3, n1 x n2, and n2 x n3 matrices, respectively. Clearly, this primitive subsumes the rank-1 update (n2 = 1) and matrix-vector multiplication (n3 = 1). However, it is most often used as a rank-ω update (n2 = ω, ω much smaller than n1 and n3) or to multiply a matrix by several vectors (n3 = ω, ω much smaller than n1 and n2). The analysis of the parallel complexity of such a computation has been the subject of much study and will not be repeated here. In this section we give a brief summary of some generic algorithms and mention some implementations on various machines that have appeared recently in the literature.

There are three basic generic approaches to performing these computations, which correspond to different choices of orderings of the loops in the basic scalar computation

    do r = 1, n1
      do s = 1, n3
        do t = 1, n2
          c(r,s) = c(r,s) + a(r,t) * b(t,s)
        end do
      end do
    end do

where c(r,s), a(r,t), and b(t,s) denote the elements of C, A, and B, respectively.

They are called the inner, middle, and outer product methods, due to the fundamental kernels used, and correspond to the following code segments:

inner product:
    do r = 1, n1
      do s = 1, n3
        c(r,s) = c(r,s) + inner_prod(a(r,:), b(:,s))
      end do
    end do

middle product:
    do s = 1, n3
      c(:,s) = c(:,s) + A b(:,s)
    end do

outer product:
    do t = 1, n2
      C = C + a(:,t) b(t,:)
    end do

All have immediate generalizations involving submatrices. Each has its advantages and disadvantages for various problem shapes and architectures. Madsen, Rodrigue, and Karush considered, for use on the CDC STAR-100 vector processor, a slightly more exotic matrix multiplication based on storing and manipulating the diagonals of matrices [127]. Their motivation was mitigating the performance degradation of the algorithms above for banded matrices and the difficulties in accessing the transpose of a matrix on some machines. These issues are discussed in several places in the literature, e.g., [67], [100], [137], and will not be repeated here. We do note, however, that for register-based vector and multivector processors with one port to memory, the middle product algorithm facilitates the efficient use of the vector registers and of the data bandwidth to cache of each processor, and exploits the chaining of the multiplier, adder, and data fetch available on many systems. When the vector processors are such that register-register operations are significantly faster than chained operations from local memory or cache, a more sophisticated two-level generalization of the blocking strategy discussed below can be used to achieve high performance.

The BLAS3 primitive implemented for a single cluster of the Cedar machine [66], [105], and applicable to machines with a moderate number of reasonably coupled multivector processors with a shared cache, implements a block version of the basic matrix multiplication loops. It proceeds by partitioning the matrices C, A, and B into submatrices Cij, Aik, and Bkj whose dimensions are m1 x m3, m1 x m2, and m2 x m3, respectively. Each block operation is accomplished by performing, possibly in parallel, multiple matrix-vector products, the preferred BLAS2 primitive for vector register management.
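As an illustration, the three generic orderings can be rendered as follows (a pure-Python sketch, not the text's Fortran-style pseudocode; all three accumulate the same C and differ only in the kernel of the innermost work: dot product, matrix-vector product, or rank-1 update).

```python
def inner_product_form(C, A, B):
    n1, n2, n3 = len(A), len(A[0]), len(B[0])
    for r in range(n1):
        for s in range(n3):          # c(r,s) += <row r of A, column s of B>
            C[r][s] += sum(A[r][t] * B[t][s] for t in range(n2))

def middle_product_form(C, A, B):
    n1, n2, n3 = len(A), len(A[0]), len(B[0])
    for s in range(n3):              # column s of C += A * (column s of B),
        for t in range(n2):          # accumulated as a sequence of SAXPYs
            for r in range(n1):
                C[r][s] += A[r][t] * B[t][s]

def outer_product_form(C, A, B):
    n1, n2, n3 = len(A), len(A[0]), len(B[0])
    for t in range(n2):              # C += (column t of A) x (row t of B)
        for r in range(n1):
            for s in range(n3):
                C[r][s] += A[r][t] * B[t][s]
```

The three functions perform identical arithmetic; on real vector hardware the distinction lies in register traffic and chaining, as discussed above.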

The block operations Cij = Cij + Aik * Bkj possess a large amount of concurrent and vectorizable computation. The kernel block multiplication can be computed by any of the basic concurrent/vector algorithms. As noted above, the middle product algorithm, which performs several multiplications of Aik and columns of Bkj in parallel, is well suited for register-based architectures like the Alliant FX/8; hence it is assumed in the analysis below. The basic loop is of the form

    do i = 1, k1
      do k = 1, k2
        do j = 1, k3
          Cij = Cij + Aik * Bkj
        end do
      end do
    end do

where n1 = k1 m1, n2 = k2 m2, and n3 = k3 m3, and k1, k2, and k3 are assumed to be positive integers for simplicity.7 As is shown below, this particular ordering (or one trivially related to it) is appropriate for use in the block algorithms discussed in later sections. If the processors are not tightly coupled enough, parallelism can be moved to the block level; for the single-cluster algorithm considered here, however, the algorithm proceeds by dedicating the full resources of the p vector processors to each of the block operations in turn. There are, of course, several possible orderings of the block loops and several other kernels that can be used for the block operations. When developing a robust BLAS3 library, kernels for the block operations which differ from those discussed below and alternate orderings must be analyzed so that selection of the appropriate form of the routine can be done at runtime based on the shape of the problem, guaranteeing smooth performance characteristics as the shapes become BLAS2-like. This is especially important for cases with extreme shapes.

7 The i-j-k ordering of the block loops, for example, produces distinctly different blocksizes and shapes [105]. Its use can be motivated by the desire to keep a block of C in cache while accumulating its final value; this implies that a block of A must also reside in the cache simultaneously, thereby altering the optimal shapes.

If, for example, the number of processors is increased to p = n1 n2 n3, the inner product form of the algorithm can generate the result in O(log2 n2) time (see, e.g., [95]). Clearly, such an approach would place tremendous strain on a highly interleaved or parallel memory system. For a shared memory machine, one way that such strain is mitigated is by assigning the elements of structured variables to the memory banks in such a way as to minimize the chance of conflicts when accessing certain subsections of the data. For the inner product algorithm it is particularly important that the rows and columns of matrices be accessible in a conflict-free manner. One of the easiest memory module mapping strategies that achieves this goal dates back to the ILLIAC IV ([114], [115]; also see [116]). The technique is called the skewed storage scheme. In it, the elements of each row of a matrix are assigned in an interleaved fashion across the memory modules; however, the first element of each row is placed in the memory module that is skewed by one from the module that contained the first element of the previous row. Any row or column of a matrix can then be accessed in a conflict-free fashion. Matrix multiplication algorithms for the distributed memory ILLIAC IV were developed based on this scheme and can be easily adapted to the shared memory situation. This can also be useful in the case of private caches or local memories for each processor.

If we are willing to sacrifice some numerical stability, fast schemes which use fewer than O(n^3) operations can be used to multiply two matrices. Recently, Higham has analyzed this loss of stability for Strassen's method [175] and concluded that it does not preclude the effective use of the method as a BLAS3 kernel. Bailey has considered the use of Strassen's method to multiply matrices on the CRAY-2 [7].
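For reference, a minimal sketch of Strassen's scheme for square matrices whose order is a power of two, showing the seven recursive products that replace the usual eight (illustrative only; a practical BLAS3 kernel would switch to an ordinary blocked multiply below a crossover size rather than recurse to 1 x 1 blocks).

```python
def strassen(A, B):
    # Multiply two n x n matrices (n a power of two) with 7 recursive
    # products per level instead of 8.
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, i, j):
        return [row[j*h:(j+1)*h] for row in M[i*h:(i+1)*h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A,0,0), quad(A,0,1), quad(A,1,0), quad(A,1,1)
    B11, B12, B21, B22 = quad(B,0,0), quad(B,0,1), quad(B,1,0), quad(B,1,1)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The additions and subtractions of subblocks are the source of the mild stability loss analyzed by Higham.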

The increased performance compared to CRAY's MXM library routine is achieved via the reduced operation count implicit in the method and the careful use of local memory via an algorithm due to Calahan. Speedups as high as 2.01 are reported compared to CRAY's library routine on a single CPU. Bailey also notes that the algorithm is very amenable to use on multiple CPUs of the CRAY-2, although no such results are presented.

The broadcast-multiply-roll algorithm for matrix multiplication described and analyzed by Fox et al. is representative of distributed memory algorithms [59]-[61]. (For other distributed memory algorithms see [78], [129], [135].) Consider the calculation of C ← C + AB, where A, B, and C are n x n. Assume the processors are connected as a two-dimensional wrap-around mesh and that the square subblock with index (i, j) of each matrix starts out in the memory of the processor correspondingly indexed. The algorithm consists of √n steps, each of which consists of broadcast, multiply, and roll phases. On step i (i = 0, ..., √n − 1), the processor in row j of the mesh owning the subblock A with index (j, (j+i) mod √n) broadcasts it to the rest of the processors in the row, which store it in a local work array T. Each processor then multiplies T by the subblock of B presently in its memory and adds the result to the subblock of C that it owns. The final phase of each step consists of rolling the matrix B up one row in the mesh, with appropriate wrap-around at the ends of the mesh. The repetition of this three-phase step √n times corresponds to the number of steps required to let each subblock of B return to its original processor.

Johnsson and Ho have considered the implementation of matrix multiplication on a hypercube [110]. In this work they consider the implementation of the computational primitive in terms of communication primitives, some of which implicitly perform computations as the data move through the cube. As a result, users can write their algorithms as a sequence of calls to these data motion primitives in a fashion similar to the method advocated with respect to the computational primitives discussed above.

3.4.3. Blocksize analysis. Values of m1, m2, and m3 which yield near-optimal values of the arithmetic time for the kernel can be determined by an analysis similar to those presented above for the BLAS2. In this section we summarize the application of the decoupling methodology to the matrix multiplication algorithm for the single cluster of the Cedar machine described above. Recall that the block level loops were

    do i = 1, k1
      do k = 1, k2
        do j = 1, k3
          Cij = Cij + Aik * Bkj
        end do
      end do
    end do

where n1 = k1 m1, n2 = k2 m2, and n3 = k3 m3, and k1, k2, and k3 are assumed to be positive integers. Each block operation Cij = Cij + Aik * Bkj uses the resources of the p vector processors by performing matrix-vector products in parallel. The essential tradeoffs require balancing the parallel and vector processing capabilities and the bandwidth restrictions due to the single port to memory on each processor. For the Alliant FX/8, the values of m1, m2, and m3 chosen according to the preceding reasoning are: m1 a multiple of 32 or large, m3 a multiple of 8 or large, and m2 at least 16 to 32, depending on the overhead surrounding the accumulation.
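The broadcast, multiply, and roll phases described above can be simulated serially. The following sketch (hypothetical helper names, pure Python) applies the three phases to a q x q grid of square blocks, with the roll expressed as a cyclic shift of the block rows of B.

```python
def fox_multiply(A, B, q):
    # A, B are q x q grids of square dense blocks (block (i, j) "lives" on
    # processor (i, j)); returns the q x q grid of blocks of C = A * B.
    def matmul(X, Y):
        return [[sum(X[r][t] * Y[t][s] for t in range(len(Y)))
                 for s in range(len(Y[0]))] for r in range(len(X))]
    def matadd(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    m = len(A[0][0])
    C = [[[[0.0] * m for _ in range(m)] for _ in range(q)] for _ in range(q)]
    B = [row[:] for row in B]                 # working copy that gets rolled
    for step in range(q):
        for i in range(q):                    # broadcast phase, by block row
            T = A[i][(i + step) % q]          # block chosen on this step
            for j in range(q):                # multiply phase on (i, j)
                C[i][j] = matadd(C[i][j], matmul(T, B[i][j]))
        B = B[1:] + B[:1]                     # roll B up one row, wrap around
    return C
```

After `step` rolls, processor (i, j) holds the B block that started in row (i + step) mod q, so over the q steps each processor accumulates every term of its block of C.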

The reduction of the data loading overhead reduces to a simple constrained minimization problem: the minimization of the number of loads performed by the BLAS3 primitive, where CS is the cache size and p is the number of processors. Constraints for the optimization of the terms involving m1 and m2 are generated by determining what amount of data must fit into cache at any given time and requiring that this quantity be bounded by the cache size CS. Since the submatrices Aik do not change during the inner (j) loop, it is assumed that each Aik is loaded once and kept in cache for the duration of the j loop; similarly, it is assumed that each of the Cij and Bkj is loaded into cache repeatedly. The value of m3 is arbitrary and taken to be n3. The final set of constraints comes from the fact that the submatrices cannot be larger than the matrices being multiplied. The constraints trace a rectangle and a hyperbola in the (m1, m2)-plane. It is easily seen, by considering the number of transfers required, that the cache-miss ratio, μ, decreases hyperbolically in m1 and m2; the theoretical minimum, given an infinite cache, corresponds to transferring each matrix element only once. Note that the conservative approach is taken in that no distinction is made between reads and writes: λ is set under the pessimistic assumption that anything loaded has to be written back, whether or not it was updated. The solution to the minimization problem separates the (n1, n2) plane into four distinct regions, two of which are of interest for the rank-ω update and matrix-times-ω-vectors primitives, and another for general large dense matrix multiplication (see [67] for details). Some cases where this distinction becomes important are discussed below.

Note that since the near-optimal region for the arithmetic time component was unbounded in the positive direction, there is a nontrivial intersection between it and the near-optimal region for the data loading component. Hence, the decoupling methodology does yield a strategy which can be used to choose near-optimal blocksizes for BLAS3 primitives, except for some boundary cases where n1, n2, and/or n3 become small. (The troublesome boundary cases can be handled by altering the block-loop ordering or choosing a different form of the block multiplication kernel.) The key observation with respect to the behavior of μ for BLAS3 primitives is that it decreases hyperbolically as a function of m1 and m2. Also note that each block of the matrix C is read and written exactly once, implying that this blocking maintains the minimum number of writes back to main memory. (This assumes this particular block loop ordering, but similar statements can be made about the others.) The i-j-k block loop ordering can be used and analyzed in a similar fashion if it is desirable to accumulate a block of C in local memory. Note, however, that the blocks of C are then written to several times; in general, these writes are not significant, since the blocksizes have been chosen to reduce the significance of all transfers (including these writes) to a negligible level. The blocksizes that result are, of course, different from the ones shown above (see [105]).

For the rank-ω primitive this results in a partitioning in which the blocksizes are given by the case above with n2 = ω and small, and the block loops simplify to

    do i = 1, k1
      do j = 1, k3
        Cij = Cij + Ai * Bj
      end do
    end do

where, once again, block parallelism is obviously exploitable when needed. For the matrix-times-ω-vectors primitive, the block loops reduce to

    do i = 1, k1
      Ci = Ci + Ai * B
    end do

and parallelism at the block-loop level becomes trivially exploitable when necessary. Similar considerations apply for large dense matrix multiplication and for the matrix-times-ω-vectors primitive with other shapes.
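The load-counting argument above can be made concrete. Under the i-k-j block ordering, each block of A is loaded once while the B and C blocks are reloaded on every pass, so the total load count can be tallied and the constrained minimization solved by enumeration. This is an illustrative sketch with a hypothetical cache size; the model deliberately ignores the refinements discussed in the text (e.g., the distinction between reads and writes).

```python
def loads(n1, n2, n3, m1, m2):
    # A blocks: loaded once each        -> n1*n2 elements in total.
    # B blocks: reloaded for each of the n1/m1 block rows of A.
    # C blocks: reloaded for each of the n2/m2 block columns of A.
    return n1 * n2 + (n1 // m1) * n2 * n3 + (n2 // m2) * n1 * n3

def best_blocksizes(n1, n2, n3, CS):
    # Enumerate feasible (m1, m2) with m1*m2 <= CS (a block of A must fit
    # in cache) and m1 | n1, m2 | n2; return (loads, m1, m2) at the minimum.
    best = None
    for m1 in range(1, n1 + 1):
        if n1 % m1:
            continue
        for m2 in range(1, min(n2, CS // m1) + 1):
            if n2 % m2:
                continue
            cost = loads(n1, n2, n3, m1, m2)
            if best is None or cost < best[0]:
                best = (cost, m1, m2)
    return best
```

Because the reload terms decay like 1/m1 + 1/m2 under the constraint m1*m2 <= CS, the search lands on square blocks of side roughly the square root of CS, in line with the hyperbolic behavior of μ noted above.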

It follows that the relative cost of transferring data decreases rapidly and reaches a global minimum of the form O(1/√CS), indicating the superiority of square blocks for this type of matrix multiplication algorithm. Therefore, assuming that n2 is much larger than √CS (large dense matrix multiplication), the data loading overhead can be reduced to O(1/√CS); if ω is small compared to the other dimensions, the data loading overhead is a satisfactory O(1/ω). This limit on the cache-miss ratio reduction due to blocking is consistent with the bound derived in Hong and Kung [101]. On the other hand, the hyperbolic nature of the data loading overhead implies that reasonable performance can be achieved without increasing the blocksizes to the near-optimal values given above; exactly how large m1 and m2 must be in order to reduce the data loading overhead to an acceptable amount depends on the cost ratio λ of the machine under consideration. The existence of a lower bound on the cache-miss ratio achievable by blocking does, however, have implications with respect to the blocksizes used in block versions of linear algebra algorithms.

For BLAS3 primitives where one of the dimensions is smaller than the others, the actual optimization process must be altered. The use of registers imposes shape constraints on blocksize choices, and it is often more convenient not to decouple the two components of time. For the most part, however, the conclusions stated here still hold. The expression for the data loading overhead based on (2) and (4) is also of the correct form for matrix multiplication primitives blocked for register usage, in that hyperbolic behavior is also seen. For hypercubes, the analysis of Fox, Otto, and Hey [61] derives a result in the same spirit as (8). They show that the efficiency (speedup divided by the number of processors) of the broadcast-multiply-roll matrix multiplication algorithm depends on tcomm and tflop, the costs of communicating a datum and of a floating point operation, and on n, the number of matrix elements stored locally in each processor (hence bounded by the local memory size). The constant c in their expression is 1 for the square subblock decomposition but larger for the row decomposition.

3.4.4. Preferred BLAS3 primitives. The preceding analysis also allows the issue of the superiority of one BLAS3 primitive compared to the others to be addressed. Consider the comparison of the rank-ω update primitive to the primitive which multiplies a matrix by ω vectors. If ω = 1, this is the BLAS2 comparison discussed earlier, and for the shared memory multivector processor analyzed above the matrix-vector multiplication primitive should be superior. Of course, if ω = n, the two primitives are identical and no preference should be predicted by the analysis. Hence, the analysis should result in a preference which is parameterized by ω, with end conditions consistent with these two observations. To make such a comparison we will restrict ourselves to the multivector shared hierarchical memory case considered above and to four partitionings of the primitives which exploit the knowledge that ω is small compared to the other dimensions of the matrices involved (denoted h and l below).

We will also distinguish between elements which are only read from memory into cache and those which require reading and writing; this allows us to be more precise than the conservative bounding of the cost of data transfer presented above. Also note that this affects the value of the cost ratio λ, in that it need not be as large as required above. (The qualitative conclusions of the previous analysis do not change.) It is assumed that the primitives make use of code to perform the basic block operations which has been optimized for register-cache transfer and is able to maintain efficient use of the lowest levels of the hierarchy as the shape of the problem changes; the arithmetic time Ta has been parameterized according to ω and the code adjusted accordingly. Such a strategy was proposed in [105] and has been demonstrated effective on the Alliant FX/8. In this case, the source of differences in the performance of the two primitives is the amount of data transfer required between cache and main memory, which is given by the ratio μ. Below we derive and compare the value of μ for each of four implementations of the primitives. Each is appropriate under various assumptions about the architecture and the shape of the problem; they differ in the constraints placed on the blocksizes.

The rank-ω update computes C ← C ± AB, where C is h x l, A is h x ω, and B is ω x l; the computation requires 2hlω operations. The partitioning used is shown in (6), where Ai is m x ω, Ci is m x l, km = h, and m is the blocksize which must be determined. Note that we have used the knowledge of the analysis above to fix two blocksizes at ω and l. The block loops are of the form

    do i = 1, k
      Ci = Ci ± Ai * B
    end do

This partitioning/primitive combination is denoted Form-1. The primitive requires hl + hω + klω loads from memory and hl writes back to memory.

Three different partitionings are considered for the primitive which multiplies a matrix by ω vectors. This primitive also computes C ← C ± AB, but with C of size h x ω, A of size h x l, and B of size l x ω. The first two partitionings are of the form shown in (7). The first version, denoted Form-2, results from applying the analysis of the previous section to the i-k-j loop ordering of the original triply nested loop form of the matrix multiplication primitive. The partitioning is such that Ai is m1 x m2, with k1 m1 = h and k2 m2 = l, and the Ci and Bi are dimensioned conformally. The blocksizes m1 and m2 are determined under the simplified constraint m1 m2 ≤ CS. Form-2 requires hl + hlω(1/m1 + 1/m2) loads and

k1 hω writes to memory.

The second version, denoted Form-3, applies the i-k-j ordering to the transpose of the matrix multiplication to determine the blocksizes. In [105] it is shown that this partitioning sets m2 to the value r, where r is determined via the analysis of register-cache transfer cost; this simplifies the minimization problem and leaves only m1 to be determined, and the constraint mω ≤ CS is applied. The resulting partitioning is of the form shown in (7), with Ai of size h x m, Bi of size m x ω, and km = l. The block loops simplify to

    do i = 1, k
      C = C + Ai * Bi
    end do

Form-3 requires hl + hω + hlω/m loads and hω writes to memory.

The third version, denoted Form-4, results from analyzing the i-j-k loop ordering of the original triply nested loop form of the matrix multiplication primitive, as in [105]. The partitioning is such that Ai is m1 x m2, with k1 m1 = h and k2 m2 = l, and the Ci and Bi are dimensioned conformally. As before, one of the blocksizes is fixed at ω. The blocksizes m1 and m2 are determined under the constraint m1(m2 + ω) ≤ CS; this constraint is generated by requiring the accumulation in cache of a block Ci, which implies that a Ci and the Aij contributing to the product must fit in cache simultaneously. Form-4 requires hl + lω + 2khω loads and khω writes to memory. Additionally, it requires (k2 − 1)hω writes to cache due to the local accumulation of Ci. This form is valuable for certain architecture/shape combinations.

TABLE 1
Comparison of the four forms of the BLAS3 primitives.

Note that if parallelism across the blocks is used, this form requires synchronization (which is typically done on a subblock level). Table 1 lists the results of analyzing each of the four forms presented above. The generic form of μ is given in terms of the dimensions of the problem and the blocksizes used, as well as its optimal value. Since the results of the analysis of the primitives given above, and of the analysis of the block methods which use them, indicate that ω = √CS represents a limit point on performance improvement, the optimal μ evaluated there is also given. Finally, the values of the blocksizes which give the optimal data loading cost are also listed.

The values show clearly the well-known inferiority of the rank-ω update in the near-BLAS2 regime, i.e., by a factor of 2 when ω is near 1. However, as ω increases, the fact that one is up to a factor of two more than the other (though this multiple rapidly reduces as well) quickly becomes irrelevant, since the relative cost of data transfer compared to computational work becomes an insignificant performance consideration. Consequently, we would not expect significant performance differences between the two primitives when ω and the size of the matrices are large enough. As a result, given these partitionings and an architecture satisfying the assumptions of the analysis, one would not expect the performance of the block algorithms that use the two BLAS3 primitives to be significantly different for sufficiently large problems.8 Such observations have been verified on an Alliant FX/8.

FIG. 3. Performance of square matrix multiplication on an Alliant FX/8.

3.4.5. Experimental results. The performance benefits of using BLAS3 primitives, and of carefully selecting blocksizes in their implementation, have been demonstrated in the literature, e.g., for a block LU algorithm. In this section, we report briefly on experimental results on the Alliant FX/8. The experiments were performed by executing the particular kernel many times and averaging to arrive at an estimate of the time spent in a single instance of the kernel. This technique was used to minimize the experimental error present on the Alliant when measuring a piece of code of short duration.

Figure 3 illustrates the effect of blocksize on the performance of the BLAS3 primitive C ← C − AB, where all three matrices are square and of order n. The blocksizes used for the curves are, from low to high performance, combinations drawn from m1 in {32, 64, 96, 128}, m2 in {32, 64}, and m3 in {32, 64, n}. In this case an asymptotic rate of just below 52 Mflops is achieved on a machine with a peak rate, including vector startup, of 68 Mflops. It is clear from the asymptotic performance of the top curve that a significant portion of peak performance can be achieved by choosing the correct blocksizes. Note that the curves have two distinct parts. The first is characterized by a peak of high performance; this is the region where the kernel operates on a problem which fits in cache, and the performance rate in this region gives some idea of the arithmetic component of the time function. It is instructive to compare this peak to the rest of the curve, which corresponds to the kernel operating on a problem whose data is initially in main memory. When the asymptotic performance in the second region is close to the peak in cache, the number of loads is being managed effectively.

Figures 4 and 5 show the performance of various rank-k updates, with their arguments in main memory. The parameter m1 is taken to be 96 and 128 in the two figures, respectively, and is kept constant within each figure to allow a fair comparison between the performances of the various kernels. In fact, for the values of k considered here, the BLAS3 analysis recommends m1 = (CS/k) − p. Further, if m1 ≥ 96, then the term in the expression for the number of loads for the rank-k kernel which involves m1 is not significant compared to the term involving m2. The parameters m2 and m3 are taken as k and n, as recommended by the analysis of the BLAS3 primitive. These curves clearly show that increasing k yields increased performance, and that a significant portion of the effective peak computational rate is achievable. Also note that as k increases, the difference in performance of two successive rank-k kernels diminishes; indeed, the k = 96 curve was not included in Fig. 4 since it delivers performance virtually identical to that of the k = 64 kernel.

It is instructive to compare the performance of the rank-k kernel to typical BLAS and BLAS2 kernels with their arguments in main memory. The BLAS kernels α ← x^T y and y ← y ± αx achieve 11 Mflops and 7 Mflops, respectively, and the BLAS2 matrix-vector product kernel achieves 18 to 20 Mflops. It would also be expected that the trend in preference for nonreduction types of computations as the number of processors or the cost of processor synchronization increases, seen with BLAS2 primitives, carries over to the BLAS3.

FIG. 4. Performance of rank-k update with m1 = 96 on an Alliant FX/8.

3.5. Triangular system solvers. Solving triangular systems of linear equations, whether dense or sparse, is encountered in numerous applications. Even though the solution process consumes substantially less time than the associated factorization stage,8 it is vital to solve such systems as efficiently as possible on the architecture at hand. Further, we often wish to solve these triangular systems repeatedly, with different right-hand sides but with the same triangular matrix. There are two classical sequential algorithms for solving a lower triangular system Lx = f, where L = [λij], x = [ξi], and f = [φi]. They differ in the fact that one is oriented towards rows, and the other towards columns. These algorithms are:

Row-oriented:
    ξ1 = φ1/λ11
    do i = 2, n
      do j = 1, i − 1
        φi = φi − λij ξj
      end do
      ξi = φi/λii
    end do

Column-oriented:

8 As is discussed later, when the ratio of the blocksize to the problem size, ω/n, is small, other tradeoffs must be considered in the performance of block algorithms.

3. J. If the vector processor has adequate bandwidth then. PLEMMONS. Performance of rank-k update with m\ — 128 on an Alliant FX/8. R. 5. these two algorithms are the basis for many adaptations suitable for various vector and parallel architectures. A.and column-oriented versions vectorize trivially to yield algorithms based respectively on the BLAS operations of SAXPY and DOTPRODUCT. Each step of the row-sweep algorithm requires less data motion than the corresponding step in the column-sweep algorithm. the DOTPRODUCT primitive reads two vectors and produces a scalar while the SAXPY reads two vectors and writes a third back to memory.5.n — 1 £j = ^ j l ^ j j do i = j + l. GALLIVAN.30 K. theoretically . We refer to these algorithms as the row-sweep or forward-sweep. and the column-sweep [116]. H. Shared-memory triangular system solvers.n </>* = & — \j£j end do end do sn = fin/Ann As is shown below. AND A.1. SAMEH FIG. The inner loops of the row. do j = l.

In practice, however, for register-based vector processors with limited bandwidth to memory, such as the CRAY-1 or a single processor of the Alliant FX/8 (each of which has a single port to memory), the performance degradation due to excessive register transfers of the vector algorithms described above can be severe, and the reduced data traffic of the DOTPRODUCT may be preferable. (This assumes, of course, that the implementation of the DOTPRODUCT is not particularly expensive. On some machines this is not necessarily a good assumption. The Alliant FX/8 has a considerable increase in the startup cost of the dotproduct compared to that of the triad instruction. CRAY machines implement the dotproduct in a two-stage process: the first stage accumulates 64 partial sums in a vector register, and the second reduces these sums to a scalar. The first phase has the memory reference pattern mentioned above, but the second is memory intensive and its cost can be significant for smaller vectors.) The row-sweep algorithm can also suffer from the fact that it accesses rows of the matrix. This can be remedied by storing the transpose of the lower triangular matrix, although in some cases this may not be an option, e.g., when the data placement has been determined by some other portion of the algorithm of which the triangular solve is a component.

To use the registers efficiently, block forms of the algorithms must be considered. Let L^(0) = L and f^(0) = f, and let each of L^(j), x^(j), and f^(j) be of order (n − jν). Partition x = (x_1^T, ..., x_p^T)^T and f = (f_1^T, ..., f_p^T)^T, with each x_j and f_j of order ν (we assume that ν divides n), and partition L so that each block row is of the form [C_i, L_i, 0], where C_i ∈ ℝ^{ν×(i−1)ν} and L_i ∈ ℝ^{ν×ν}. The block column-sweep algorithm may then be described as:

B-Col-Sweep:
    do j = 0, p − 2
        solve L_{j+1} x_{j+1} = f^(j)_{j+1} via Col_Sweep or Row_Sweep
        f^(j+1) = g^(j) − C^(j) x_{j+1}
    enddo
    solve L_p x_p = f^(p−1) via Col_Sweep or Row_Sweep

where f^(j) = (f^(j)T_{j+1}, g^(j)T)^T and C^(j) denotes the portion of block column j + 1 of L below the diagonal block L_{j+1}. Note that this blocking allows the registers to be used efficiently. Using the notation above, the column-sweep algorithm can accumulate the solution of the triangular system L_j x_j = f_j in vector registers, resulting in a data flow between registers and memory identical to that of a ν × ν block of the matrix-vector product, with the exception that the vector length reduces by one for each of the ν operations. The matrix-vector product which updates the right-hand side vector is blocked in the fashion described above to allow the accumulation in a vector register of the result of ν vector operations before writing the register to memory, rather than the one write per two reads of the triads in the nonblocked column-sweep. Similarly, a block row-sweep algorithm, B-Row-Sweep, can be derived which reduces the amount of register-memory traffic even further.
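A minimal plain-Python sketch of the block column-sweep (illustrative only; it assumes the blocksize nu divides n): the diagonal block is solved with the scalar column-sweep, and the remaining right-hand side receives a rank-nu update per block step.

```python
def col_sweep(L, f):
    """Unblocked column-oriented forward substitution for L x = f."""
    n = len(f)
    phi = list(f)
    x = [0.0] * n
    for j in range(n):
        x[j] = phi[j] / L[j][j]
        for i in range(j + 1, n):
            phi[i] -= L[i][j] * x[j]
    return x

def block_col_sweep(L, f, nu):
    """B-Col-Sweep: solve nu x nu diagonal blocks, then update the trailing rhs."""
    n = len(f)
    p = n // nu                                  # assumes nu divides n
    phi = list(f)
    x = [0.0] * n
    for jb in range(p):
        lo, hi = jb * nu, (jb + 1) * nu
        # solve the diagonal block system L_{jb+1} x_{jb+1} = current rhs block
        xb = col_sweep([row[lo:hi] for row in L[lo:hi]], phi[lo:hi])
        x[lo:hi] = xb
        # rank-nu update of the remaining right-hand side: g = g - C x_{jb+1}
        for i in range(hi, n):
            phi[i] -= sum(L[i][lo + k] * xb[k] for k in range(nu))
    return x
```

The inner update loop is the blocked matrix-vector product discussed above; on a vector machine its ν partial results would be accumulated in a register before a single write back to memory.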

The block row-sweep algorithm, with n = pν, is:

B-Row-Sweep:
    solve L_1 x_1 = f_1 via Col_Sweep or Row_Sweep
    do j = 2, p
        f_j = f_j − C_j (x_1^T, ..., x_{j−1}^T)^T
        solve L_j x_j = f_j via Col_Sweep or Row_Sweep
    enddo

This algorithm requires only one or two vector writes per block row computation, depending upon whether or not the result of the matrix-vector product is left in registers for the triangular-solve primitive to use. The algorithm is characterized by the use of short and wide matrix-vector operations rather than the tall and narrow shapes of the block column-sweep. It is, of course, quite straightforward to combine the two approaches to use a more consistent shape throughout the algorithm.

Another triangular solver, which is also suited for both shared and distributed memory multiprocessors, is that based on the DO-ACROSS notion. For example, in the above sequential form of the column-oriented algorithm, the main point of a DO-ACROSS is that computing each ξj need not wait for the completion of the whole inner iteration i = j + 1, ..., n: one processor may compute ξj soon after another processor has computed φj := φj − λ_{j,j−1} ξ_{j−1}. To minimize the synchronization overhead in a DO-ACROSS and efficiently use registers or local memory, the computation is performed by blocks. For example, if L = [L_pq], f = {f_p}, x = {x_p}, p, q = 1, ..., 4, where each block is of order n/4, then the DO-ACROSS on two processors may be illustrated as shown in Fig. 6. Vectorization can be exploited in each of the calculations shown if each processor has vector capabilities. The performance of the particular parallel schedule used in the DO-ACROSS approach is, of course, highly dependent on the efficiency of the synchronization mechanisms provided on the multiprocessor of interest.

FIG. 6. Two processor DO-ACROSS synchronization pattern.

All of the methods presented thus far in this section can be viewed as reorganizations of the task graph in Fig. 7. The row-oriented algorithm executes each row in turn starting from the top, and tasks within each row from left to right. The column-oriented algorithm, on the other hand, executes each column in turn starting from the left, and tasks within a column from top to bottom. The row and column sweeps, on a vector machine, merely vectorize the tasks within a row or column.
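The DO-ACROSS dependence pattern can be mimicked in plain Python with threads and a condition variable (an illustrative sketch, not the authors' implementation; for simplicity each column's updates are performed under the lock, so it demonstrates the synchronization pattern rather than actual speedup): the worker holding column j proceeds as soon as all updates from columns 1, ..., j−1 have reached position j.

```python
import threading

def do_across_col_sweep(L, f, p=2):
    """DO-ACROSS column sweep for L x = f with columns assigned round-robin."""
    n = len(f)
    phi = list(f)
    x = [0.0] * n
    applied = [0] * n            # number of column updates already applied to phi[j]
    cv = threading.Condition()

    def worker(tid):
        for j in range(tid, n, p):                # this worker's columns, in order
            with cv:
                # xi_j may be computed as soon as columns 1..j-1 have updated phi[j]
                cv.wait_for(lambda: applied[j] == j)
                x[j] = phi[j] / L[j][j]
                for i in range(j + 1, n):         # SAXPY-style updates to later rows
                    phi[i] -= L[i][j] * x[j]
                    applied[i] += 1
                cv.notify_all()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```

The counter `applied[j]` plays the role of the synchronization flags in Fig. 6: column j's solve is released exactly when its row of the task graph has been fully swept.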

FIG. 7. Triangular system solution dependence graph.

Block versions of the algorithm interpret each node as corresponding to computations involving a submatrix rather than a single element. Careful consideration of the task graph, however, reveals certain limitations of all methods based upon it. Suppose that each node represents the operation on a submatrix of order m and that n = km. The dependence graph implies that the maximum number of processors that can ever be active at the same time is k − 1. Further, the dependence graph has a critical path of O(k) length, which establishes a fundamental limit to the speed at which these algorithms can solve a triangular system. To go faster we need a new dependence graph which relates the solution x to the data L and f. The new dependence graph can be generated by recognizing the algebraic characterization of the column- and row-sweep algorithms.

The algorithms can be easily described algebraically in terms of elementary unit lower triangular matrices. Assuming without loss of generality that λii = 1, let N_i = I − l_i e_i^T and M_j = I − e_j v_j^T, where e_i is the ith column of the identity, l_i is the vector corresponding to column i of L with the 1 on the diagonal removed, and v_j is similarly constructed from row j of L. It follows immediately that the column-sweep and row-sweep algorithms are specified algebraically by (here with n = 8):

    x = N_7 N_6 N_5 N_4 N_3 N_2 N_1 f

and

    x = M_8 M_7 M_6 M_5 M_4 M_3 M_2 f,

respectively. It is easy to see from the algebraic structure of N_i and M_j that multiplying them by a vector corresponds to the computational primitives of a triad and a dotproduct, respectively.
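The elementary-matrix characterization can be checked directly in a few lines of Python (an illustration on a fixed 4 × 4 unit lower triangular matrix, not the authors' code): applying N_1, then N_2, then N_3 to f reproduces forward substitution.

```python
def mat_vec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def elementary_N(L, i):
    """N_i = I - l_i e_i^T, with l_i taken from column i of L below the diagonal."""
    n = len(L)
    N = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for r in range(i + 1, n):
        N[r][i] = -L[r][i]
    return N

# unit lower triangular example; x = N_3 N_2 N_1 f, applied right to left
L = [[1.0, 0, 0, 0], [2.0, 1.0, 0, 0], [0.5, 1.0, 1.0, 0], [1.0, 3.0, 2.0, 1.0]]
f = [1.0, 2.0, 3.0, 4.0]
x = list(f)
for i in range(len(L) - 1):
    x = mat_vec(elementary_N(L, i), x)
```

Each multiplication by N_i is exactly one triad per remaining row, which is the column-sweep's inner loop in matrix form.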

Note that thus far we have assumed only one right-hand side vector. The BLAS3 primitive triangular solver assumes that multiple right-hand side vectors and solutions are required. The generalization of the algorithms above is straightforward, and the blocksizes (the number and order of right-hand sides solved in a stage of the algorithm) can be analyzed in a fashion similar to the matrix multiplication primitives. This, of course, provides the necessary data locality for high performance on a hierarchical memory system.

The grouping of computations makes clear the source of the O(n) critical path in the dependence graph. A simple application of associativity can also generate algorithms that have a much shorter critical path. Instead of performing the product (N_{n−1} · · · N_2 N_1) f in (n − 1) stages, we may form it in O(log2 n) stages, with the last stage consisting of only one matrix-vector multiplication; the column-sweep expression is transformed into a balanced product of pairwise factors. Note the logarithmic nature of the critical path. The algorithm specified is called the product form and is due to Sameh and Brent [159]. It can be shown, by careful consideration of the structure of the matrices at each stage, that the critical path has a length of k²/2 + 3k/2 floating point operations, where k = log2 n. Such an improvement is not without cost, however. The algorithm requires approximately n³/10 + O(n²) operations and n³/68 + O(n²) processors. It is therefore typically not appropriate for an architecture with a limited number of processors such as those of interest here. For a discussion of the numerical stability of this algorithm see [187].

For banded lower triangular systems in which the bandwidth m (the number of subdiagonals with nonzero entries) is small, column-sweep algorithms are ineffective on vector or parallel computers. An algorithm in the spirit of the product form offers opportunities for both vector and parallel computers; see [159]. Consider such a system Lx = f, where L is partitioned as a block-bidiagonal matrix with diagonal submatrices L_i and subdiagonal submatrices R_i, each of order m, where the L_i and R_i are lower and upper triangular, respectively. Premultiplying both sides of Lx = f by D = diag(L_i^{-1}), we obtain the system L^(0) x = f^(0), where L^(0) is block bidiagonal with identities of order m on the diagonal and matrices G_i = L_i^{-1}R_i on the subdiagonal, and f^(0) = D f. Note that we do not invert the L_i's, but obtain f^(0) and the G_i by solving triangular systems using one of the above parallel algorithms, for (m + 1) right-hand sides except for the first system, which has only one right-hand side. We repeat the process by multiplying both sides of L^(0) x = f^(0) by D^(0) = diag((L_j^(0))^{-1}). Eventually, L^(log(n/m)) = I_n and f^(log(n/m)) = x. At the first stage we have n/m triangular systems to solve, each of order m, i = 1, ..., n/m. In the subsequent stages we have several matrix-matrix and matrix-vector multiplications, in which the matrix is of order (n/2 × m). The required number of arithmetic operations is O(m²n log(n/2m)), resulting in a redundancy of O(m log(n/2m)). Given m²n/2 + O(mn) processors, those operations can be completed in O(log m log n) time. See [189] for a discussion of the performance of this algorithm applied to lower bidiagonal systems and the attendant numerical stability properties.
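The associativity idea behind the product form can be sketched in plain Python (illustrative only; a dense O(n³)-work demonstration, not an efficient implementation): the elementary factors are combined pairwise in ⌈log2(n−1)⌉ stages, and a single matrix-vector product finishes the solve.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def product_form_solve(L, f):
    """Solve a unit lower triangular system via x = (N_{n-1} ... N_1) f,
    with the product formed by pairwise (logarithmic-depth) combination."""
    n = len(L)
    Ns = []
    for i in range(n - 1):                      # elementary factors N_i = I - l_i e_i^T
        N = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
        for r in range(i + 1, n):
            N[r][i] = -L[r][i]
        Ns.append(N)
    while len(Ns) > 1:                          # one parallel stage per iteration
        nxt = [mat_mul(Ns[k + 1], Ns[k]) for k in range(0, len(Ns) - 1, 2)]
        if len(Ns) % 2:                         # odd factor carried to the next stage
            nxt.append(Ns[-1])
        Ns = nxt
    return [sum(a * b for a, b in zip(row, f)) for row in Ns[0]]
```

All products within one `while` iteration are independent, which is the source of the O(log n) critical path; the extra matrix-matrix multiplications are the redundancy the text attributes to the scheme.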

An alternative scheme, introduced by Chen, Kuck, and Sameh [27], may be described as follows. Let the banded lower triangular matrix L be partitioned into p block rows, where each diagonal block L_i is of order (n/p) ≥ m and the only nonzero portion of each subdiagonal block is an upper triangular matrix R_i of order m. Partition the right-hand side f and the solution x accordingly. After solving the triangular systems associated with the L_i, the original system is reduced to a system L̃x = g in which each block row of L̃ couples only the last m unknowns of the preceding block, involving matrices W_i of order m and vectors z_j of m elements each. Thus, solving the above system reduces to solving a smaller triangular system of order mp. After solving this system by the previous parallel scheme, we can retrieve the rest of the elements of the solution vector x by obvious matrix-vector multiplications. The algorithm requires approximately 4m²n operations which, given p̃ = mp processors, can be completed in time 2p^{-1}m²n + 3p^{-1}mn + O(m²).
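For the simplest banded case, m = 1 (a lower bidiagonal system with diagonal d and subdiagonal e), the partitioned idea can be sketched in a few lines of Python (an illustration of the structure, not the authors' implementation; it assumes p divides n): each chunk expresses its unknowns affinely in the value entering from the previous chunk, leaving only a length-p reduced recurrence to solve sequentially.

```python
def partitioned_bidiag_solve(d, e, f, p):
    """Solve x_i = (f_i - e_i * x_{i-1}) / d_i by p independent chunk scans
    plus a reduced sweep over chunk boundaries (e[0] is unused)."""
    n = len(d)
    nu = n // p
    coeff = []                       # per chunk: (a, b) with x_i = a_i + b_i * x_in
    for c in range(p):               # these p scans are independent of one another
        a, b = [], []
        x_a, x_b = 0.0, 1.0          # incoming boundary value represented symbolically
        for i in range(c * nu, (c + 1) * nu):
            e_i = e[i] if i > 0 else 0.0
            x_a, x_b = (f[i] - e_i * x_a) / d[i], (-e_i * x_b) / d[i]
            a.append(x_a)
            b.append(x_b)
        coeff.append((a, b))
    x = [0.0] * n
    x_in = 0.0                       # reduced system: chain the boundary values
    for c in range(p):
        a, b = coeff[c]
        for k in range(nu):
            x[c * nu + k] = a[k] + b[k] * x_in
        x_in = x[(c + 1) * nu - 1]
    return x
```

The chunk scans do redundant work relative to the sequential recurrence, which mirrors the O(m log ...) redundancy factors quoted above for the general banded schemes.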

3.2. Distributed-memory triangular system solvers. A large number of papers have appeared for handling triangular systems on distributed memory architectures (mainly rings and hypercubes); see, for example, Heath and Romine [89], Li and Coleman [122], Romine and Ortega [151], and Eisenstat et al. [53]. Most are variations on the basic algorithms above, adapted to exploit the distributed nature of the architectures. For such architectures
it is necessary to distinguish whether a given triangular matrix L is stored across the individual processor memories by rows or by columns. For example, suppose that the matrix [L, f] is stored by rows; then the above column-sweep algorithm becomes:

Row-Storage:
    do j = 1, n
        if j is one of my row indices then ξj = φj/λjj
        communicate (broadcast, fan-out) ξj to each processor
        do i = j + 1, n
            if i is one of my row indices then φi = φi − λij ξj
        enddo
    enddo

Note first that the computations in the inner loop can be executed in parallel, and that on a hypercube with p = 2^ν processors the fan-out communication can be accomplished in ν stages.

If the lower triangular matrix L is stored by columns, then the column-sweep algorithm will cause excessive interprocessor communication. A less communication-intensive column-storage-oriented algorithm, based upon the classical sequential row-sweep algorithm shown above, has been suggested in [150] and [151]. The parallel column storage algorithm can be described as follows:

Col-Storage:
    do i = 1, n
        τ = 0
        do j = 1, i − 1
            if j is one of my column indices then τ = τ + ξj λij
        enddo
        η = fan_in(τ, i)
        if i is one of my column indices then ξi = (φi − η)/λii
    enddo

Here, during stage i of the algorithm, the pseudo-routine fan_in(τ, i) collects and sums the partial inner products τ from each processor, leaving the result η in the processor containing column i. Such an operation enables the processor whose memory contains column i to receive the sum of all the τ's over all processors. In implementing the column storage algorithm on an Intel iPSC hypercube (see, for example, Sameh [158]), a communication scheme which allows for ring embedding into a hypercube is emphasized.
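The fan-in column-storage algorithm can be simulated in plain Python (an illustration only: the "processors" are just slots of a list, and `sum(tau)` stands for the logarithmic hypercube fan-in): columns are assigned to processors in wrap (cyclic) fashion.

```python
def col_storage_solve(L, f, p):
    """Simulate Col-Storage with p processors owning columns cyclically."""
    n = len(f)
    owner = [j % p for j in range(n)]        # wrap mapping of columns to processors
    x = [0.0] * n
    for i in range(n):
        tau = [0.0] * p                      # per-processor partial inner products
        for j in range(i):                   # each owner accumulates its local terms
            tau[owner[j]] += x[j] * L[i][j]
        eta = sum(tau)                       # fan_in(tau, i): nu stages on a 2^nu cube
        x[i] = (f[i] - eta) / L[i][i]        # performed by the owner of column i
    return x
```

Only the scalar η must reach the owner of column i at each stage, which is the communication saving over broadcasting the growing solution vector.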

Further modifications to the basic row- and column-oriented triangular solvers on distributed memory systems have been studied in [122], [89], and [53]. Finally, the study in [53] has improved upon the cyclic type algorithms in [89].

4. Shared-memory algorithms for dense systems. In this section we review some of these techniques for the LU and LU-like factorizations for dense and block tridiagonal linear systems.

4.1. LU factorization algorithms. The goal of the LU decomposition is to factor an n × n matrix A into a lower triangular matrix L and an upper triangular matrix U. This factorization is certainly one of the most used of all numerical linear algebra computations, and the algorithms and techniques needed to achieve high performance for both shared and distributed memory systems have been considered in great detail in the literature. In this subsection we consider some of the approaches used in the literature for implementing the LU factorization of a matrix A ∈ ℝ^{n×n} on shared-memory multivector processors such as the CRAY-2, CRAY X-MP, and Alliant FX/8.

There are several ways to organize the computations for calculating the LU factorization of a matrix. These reorganizations are typically listed in terms of the ordering of the nested loops that define the standard computation. The essential differences between the various forms are: the set of computational primitives required, the distribution of work among the primitives, and the size and shape of the subproblems upon which the primitives operate. Since architectural characteristics can favor one primitive over another, the choice of computational organization can be crucial in achieving high performance; this choice in turn depends on a careful analysis of the architecture/primitive mapping.

Systematic comparisons of the reorganizations have appeared in various contexts in the literature. Trivedi considered them in the context of virtual memory systems in combination with other performance enhancement techniques [185], [186]. Dongarra, Gustavson, and Karp [42], and more recently Ortega [137], compare the orderings for vector machines such as the CRAY-1, where the key problem is the efficient exploitation of the register-based organization of the processor and the single port to memory. Ortega has also considered the problem on highly parallel computers [137]. Papers have also appeared that are concerned with comparing the reorderings given a particular machine/compiler/library combination, e.g., [162]. In general, most of the conclusions reached in these papers can be easily understood and parameterized by analyses of the computational primitives and the algorithms in the spirit of those in the previous section and below.

4.1.1. The algorithms. The classical LU factorization [83] can be expressed in terms of any of the three levels of the BLAS. Four different organizations of the computation of the classical LU factorization without pivoting are presented, with emphasis on identifying the computational primitives involved in each. To simplify the discussion of the effects of hierarchical memory organization, we move directly to the block versions of the algorithms. Throughout the discussion ω denotes the blocksize used, and the more familiar BLAS2-based versions of the algorithms can be derived by setting ω = 1. The addition of partial pivoting is then considered, and a block generalization of the LU factorization (L and U being block triangular) is presented for use with diagonally dominant matrices. In addition, the results of an analysis of the architecture/algorithm mapping of this latter algorithm for a multivector processor with a hierarchical memory are examined, along with performance results from the literature.

4.1.1.1. Version 1. Version 1 of the algorithm assumes that at step i the LU factorization of the leading principal submatrix of dimension (i − 1)ω, A_{i−1} = L_{i−1}U_{i−1}, is available,

and that the transformations necessary to compute this information have been applied to the submatrix Â_i ∈ ℝ^{(n−ℓ)×(n−ℓ)} in the lower right-hand corner of A that has yet to be reduced, where ℓ = (i − 1)ω. The algorithm proceeds by producing the next ω columns and rows of L and U, respectively, and computing Â_{i+1}. The basic step of the algorithm can be deduced by considering the following partitioning of the factorization of the matrix A_i ∈ ℝ^{iω×iω}:

    ( A_{i−1}  C )   ( L_{i−1}   0  ) ( U_{i−1}  G  )
    (  B^T     H ) = (  M^T     L_2 ) (   0     U_2 )

where H is a square matrix of order ω and the rest of the blocks are dimensioned conformally. The basic step of the algorithm consists of four phases:

(i) Solve for G: C ← L_{i−1}G = C.
(ii) Solve for M: B ← U_{i−1}^T M = B.
(iii) Update: H ← H − M^T G.
(iv) Factor: H ← L_2U_2 = H.

(The arrow is used to represent the portion of the array which is overwritten by the new information obtained in each phase.) Clearly, repeating this step on successively larger submatrices will produce the factorization of A ∈ ℝ^{n×n}; after k = n/ω such steps the factorization LU = A results. With h = (i − 1)ω, solving the triangular systems requires 2ωh² operations, the update of H requires 2hω², and the factorization requires O(ω³). Early stages of the algorithm are thus dominated by the factorization primitive; the later stages, where most of the work is done, are dominated by solving triangular systems with ω right-hand side vectors. This dominance is particularly pronounced when the BLAS2 (ω = 1) version of the algorithm is used. Note also that when ω = 1 the use of the triangular solver allows efficient use of the vector registers on vector processors like the CRAY-1 or a single CE of the Alliant FX/8, which have single ports to memory.

4.1.1.2. Version 2. Version 2 of the algorithm assumes that the first ℓ = (i − 1)ω columns of L and ℓ rows of U are known at the start of step i. Assume that the factorization of the matrix Â_i ∈ ℝ^{(n−ℓ)×(n−ℓ)} is partitioned as follows:

    ( A_{11}  A_{12} )   ( L_{11}   0    ) ( U_{11}  U_{12} )
    ( A_{21}  A_{22} ) = ( L_{21}  L_{22} ) (   0    U_{22} )

where A_{11} is square and of order ω, and the other submatrices are dimensioned conformally. The next ω rows of L and ω columns of U are computed during step i. The basic step of the algorithm consists of:

(i) Factor: A_{11} ← L_{11}U_{11} = A_{11}.
(ii) Solve for U_{12}: A_{12} ← L_{11}U_{12} = A_{12}.
(iii) Solve for L_{21}: A_{21} ← U_{11}^T L_{21}^T = A_{21}^T.
(iv) Update: A_{22} ← A_{22} − L_{21}U_{12}.

Clearly, L_{21} and U_{12} are the desired ω columns and rows of L and U, and the identity in (iv) defines Â_{i+1}; the algorithm proceeds by repeating the above four phases. This is a straightforward block generalization of the standard rank-1-based Gaussian elimination algorithm.
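A compact plain-Python sketch of Version 2 (right-looking block LU without pivoting; illustrative, with L stored as unit lower triangular multipliers below the diagonal of the array): factor the ω × ω diagonal block, solve for the block row and column, then apply the rank-ω update to the trailing submatrix.

```python
def block_lu(A, omega):
    """In-place blocked LU (Doolittle form, no pivoting): A becomes [L\\U]."""
    n = len(A)
    for s in range(0, n, omega):
        e = min(s + omega, n)
        # (i) factor A11 = L11 U11 by rank-1 eliminations within the block
        for k in range(s, e):
            for i in range(k + 1, e):
                A[i][k] /= A[k][k]
                for j in range(k + 1, e):
                    A[i][j] -= A[i][k] * A[k][j]
        # (ii) solve L11 U12 = A12 (forward substitution, unit diagonal)
        for k in range(s, e):
            for j in range(e, n):
                for i in range(k + 1, e):
                    A[i][j] -= A[i][k] * A[k][j]
        # (iii) solve L21 U11 = A21 for the next omega columns of L
        for j in range(s, e):
            for i in range(e, n):
                A[i][j] /= A[j][j]
                for k in range(j + 1, e):
                    A[i][k] -= A[i][j] * A[j][k]
        # (iv) rank-omega update of the trailing submatrix A22
        for i in range(e, n):
            for j in range(e, n):
                A[i][j] -= sum(A[i][k] * A[k][j] for k in range(s, e))
    return A
```

Phase (iv) is where the bulk of the operations occur, which is why this organization is dominated by the rank-ω update primitive.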

This version of the algorithm is dominated by the rank-ω update of the submatrix A_{22} ∈ ℝ^{(n−iω)×(n−iω)}. As is well known, and obvious from the analysis of the previous section, the BLAS2 version, based on the rank-1 update, is not the preferred form for register-based vector or multivector processors with a single port to memory, due to poor register usage. As in Version 1, the factorization primitive operates on systems of order ω. Note, however, that the triangular systems that must be solved are of order ω with many right-hand sides, as opposed to the large systems which are solved in Version 1.

4.1.1.3. Version 3. Version 3 of the algorithm can be viewed as a hybrid of the first two versions. Like Version 2, it assumes that the first (i − 1)ω columns of L and rows of U are known at the start of step i. It also assumes, like Version 1, that the transformations that produced these known columns and rows must still be applied to the elements of A which are to be transformed into the next ω columns and rows of L and U; that is, Version 3 does not update the remainder of the matrix at every step. Consider the factorization

    ( A_{11}  A_{12} )   ( L_{11}   0    ) ( U_{11}  U_{12} )
    ( A_{21}  A_{22} ) = ( L_{21}  L_{22} ) (   0    U_{22} )

where A_{11} is a square matrix of order (i − 1)ω and the rest are partitioned conformally. By our assumptions, at the start of step i the blocks L_{11}, U_{11}, L_{21}, and U_{12} are known, and the first ω columns of L_{22} and the first ω rows of U_{22} are to be computed. Let L_{21} = [M_1^T, M_2^T]^T and U_{12} = [M_3, M_4], where M_1 and M_3 consist of the first ω rows and columns of the respective matrices. Since Version 3 assumes that none of the update A_{22} ← A_{22} − L_{21}U_{12} has occurred in the first i − 1 steps of the algorithm, the first part of step i is to perform the update to the portion upon which the desired columns of L_{22} and rows of U_{22} depend. To derive the form of the computations, suppose the update of A_{22} and its subsequent factorization are partitioned so that H and its updated counterpart H̃ are square matrices of order ω, with the other submatrices dimensioned conformally. Step i then has two major phases. The first phase computes the pending updates of the first block row and block column of A_{22}, e.g., H̃ = H − M_1M_3, together with the corresponding updates B̃ = B − M_1M_4 and C̃ = C − M_2M_3 of the remainder of that row and column. In the second phase, the first ω rows and columns of the factorization of the updated Â_{22} are then given by factoring H̃ = L̃_{11}Ũ_{11} and solving the two small triangular systems

L̃_{11}Ũ_{12} = B̃ and L̃_{21}Ũ_{11} = C̃. By our assumptions, at the end of stage i the first iω rows and columns of the triangular factors of A are known. The work in this version of the algorithm is split between a matrix multiplication primitive, a triangular solver, and a factorization primitive, the latter two of which are applied to systems of order ω. Note that the matrix multiplication primitive is applied to a problem which has the shape of a large matrix applied to ω vectors (or the transpose of such a problem). Hence, for ω = 1 this version of the algorithm becomes a form which uses the preferred BLAS2 primitive — matrix-vector multiplication. Although, as noted above, the factorization kernel operates on a system of order ω, when ω is nontrivial the preference for this block form over Version 2 does not necessarily follow.

4.1.1.4. Version 4. Version 4 of the algorithm assumes that at the beginning of step i the first (i − 1)ω columns of L and U are known. Step i computes the next ω columns of the two triangular factors. (These are also columns (i − 1)ω + 1 through iω of L and U.) Consider the factorization partitioned as above, where A_{11}, L_{11}, and U_{11} are square matrices of order (i − 1)ω and the rest are partitioned conformally. Let L_ω, U_ω, and A_ω be the matrices of dimension n × ω formed of columns (i − 1)ω + 1 through iω of L, U, and A, respectively, and partition A_ω = [A_1^T, A_2^T, A_3^T]^T, where A_1 has (i − 1)ω rows and A_2 is square of order ω. Step i calculates L_ω and U_ω by applying all of the transformations from steps 1 to i − 1 to A_ω and then factoring a rectangular matrix. Specifically, step i comprises the computations:

(i) Solve for M: A_1 ← L_{11}M = A_1.
(ii) Update: [A_2^T, A_3^T]^T ← [A_2^T, A_3^T]^T − L_{21}M.
(iii) Factor: A_2 ← LU = A_2, where L and U are triangular of order ω.
(iv) Solve for G: A_3 ← U^T G^T = A_3^T.

This version of the algorithm requires the solution of a large triangular system with ω right-hand sides as well as a small triangular system of order ω with many right-hand sides. As with Version 3, the matrix multiplication primitive operates on a problem with the shape of a large matrix times ω vectors and the factorization on a system of order ω, and the same observations apply. This version also has the feature that it works exclusively with columns of A, which can be advantageous in some Fortran and virtual memory environments.

4.1.1.5. Partial pivoting. Partial pivoting can be easily added to Versions 2, 3, and 4 of the algorithm. Step i of each of the versions requires the LU factorization of a rectangular matrix M ∈ ℝ^{h×ω}, where h = n − (i − 1)ω.
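The rectangular panel factorization that all pivoted versions share can be sketched in plain Python (an illustration, not the authors' code): an h × ω panel is factored with partial pivoting, returning the pivot sequence for later application to the rest of the array.

```python
def panel_lu(M):
    """LU with partial pivoting of a rectangular h x w panel (h >= w), in place.
    Returns the modified panel [L\\U] and the list of pivot row indices."""
    h, w = len(M), len(M[0])
    piv = []
    for k in range(w):
        p = max(range(k, h), key=lambda r: abs(M[r][k]))   # pivot search in column k
        piv.append(p)
        M[k], M[p] = M[p], M[k]                            # row interchange
        for i in range(k + 1, h):                          # rank-1 elimination step
            M[i][k] /= M[k][k]
            for j in range(k + 1, w):
                M[i][j] -= M[i][k] * M[k][j]
    return M, piv
```

The returned `piv` plays the role of the permutations P_i discussed below: it may be applied immediately to the rest of the array, or kept and applied incrementally.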

In the versions above without pivoting, this calculation could be split into two pieces: the factorization of a system of order ω, and the solution of a triangular system of order ω with h − ω right-hand sides. When partial pivoting is added to the versions of the algorithm, these computations at each step cannot be separated, and are replaced by a single primitive which produces the factorization of a rectangular matrix with permuted rows, PM = [L_{11}^T, L_{21}^T]^T U_{11}, where P is a permutation matrix and L_{11} and U_{11} are, respectively, lower and upper triangular matrices of order ω. This is, of course, a fundamental difference compared to the nonpivoting versions, which produce the classical LU factorization. This primitive is usually cast as a BLAS2 version of one of the versions above; see, e.g., [23], [67]. (The ability to split the factorization of the tall matrix into smaller BLAS3-based components in the nonpivoting case has benefits with respect to hierarchical memory usage, since ω is usually taken so that such systems fit in cache or local memory.)

The information contained in the permutations associated with each step, P_i, can be applied in various ways: to the transformations of the previous steps, which are stored in the elements of the array A to the left of the active area for step i, and to the elements of the array A which have yet to reach their final form, which appear to the right of the active area for step i. (These computations are: (i) and (ii) in Version 2; (i) and (ii) of the second phase of Version 3; and (iii) and (iv) of Version 4.) The permutation can be applied immediately to the transformations of the previous steps; the application to either portion of the matrix may also be delayed. For example, the application to the transformations from steps 1 through i − 1 could be suppressed, and the P_i kept separately and applied incrementally in a modified forward and backward substitution routine. Similarly, the update of the elements of the array which have yet to reach their final form could be delayed by maintaining a global permutation matrix which is then applied to only the elements required for the next step. (Although in the latter case these operations are performed via BLAS2 primitives repeatedly updating a matrix which cannot be kept locally; as a result, the arithmetic component of time and the data transfer overhead both increase, and a conflict between their reductions occurs. In the case of pivoting, the source of difficulties is slightly different, but this situation is similar to that seen in the block version of Modified Gram-Schmidt and in Version 5 of the factorization algorithm, both discussed below along with a solution.)

4.1.1.6. Version 5: a block generalization. In some cases it is possible to use a block generalization of the classical LU factorization in which L and U are lower and upper block triangular matrices, respectively. The use of such a block generalization is most appropriate when considering systems which do not require pivoting for stability, e.g., diagonally dominant or symmetric positive definite systems. Assume that A is diagonally dominant, and consider the factorization partitioned as above, where A_{11} is a square matrix of order ω. This algorithm decomposes A into a lower block triangular matrix L_ω and an upper block triangular matrix U_ω with blocks of size ω by ω (it is assumed for simplicity that n = kω, k > 1). The block LU algorithm is given by:
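A plain-Python sketch of the block LU idea (illustrative, assuming ω divides n and that A is diagonally dominant so no pivoting is needed): each diagonal block is inverted explicitly, here by an unblocked Gauss-Jordan, the block row of U is formed by matrix multiplication, and the trailing matrix receives a rank-ω update. The factors returned are block lower triangular and unit block upper triangular.

```python
def gauss_jordan_inverse(B):
    """Invert a small dense block by Gauss-Jordan on the augmented matrix."""
    m = len(B)
    W = [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
         for i, row in enumerate(B)]
    for k in range(m):
        d = W[k][k]
        W[k] = [v / d for v in W[k]]
        for i in range(m):
            if i != k:
                c = W[i][k]
                W[i] = [a - c * b for a, b in zip(W[i], W[k])]
    return [row[m:] for row in W]

def block_lu_v5(A, omega):
    """Return (Lb, Ub) with A = Lb * Ub, Lb block lower / Ub block upper triangular."""
    n = len(A)
    Lb = [[0.0] * n for _ in range(n)]
    Ub = [[0.0] * n for _ in range(n)]
    A = [row[:] for row in A]
    for s in range(0, n, omega):
        e = s + omega
        inv = gauss_jordan_inverse([row[s:e] for row in A[s:e]])
        for i in range(s, e):                 # diagonal blocks: Lb gets A11, Ub gets I
            for j in range(s, e):
                Lb[i][j] = A[i][j]
                Ub[i][j] = 1.0 if i == j else 0.0
        for i in range(e, n):                 # block column of Lb is A21 unchanged
            for j in range(s, e):
                Lb[i][j] = A[i][j]
        for i in range(s, e):                 # block row of Ub: U12 = A11^{-1} A12
            for j in range(e, n):
                Ub[i][j] = sum(inv[i - s][k] * A[s + k][j] for k in range(omega))
        for i in range(e, n):                 # rank-omega update: A22 -= A21 U12
            for j in range(e, n):
                A[i][j] -= sum(A[i][s + k] * Ub[s + k][j] for k in range(omega))
    return Lb, Ub
```

The three primitives discussed in the text are visible directly: the Gauss-Jordan inversion, the matrix multiplication forming the block row of U, and the rank-ω update of the trailing submatrix.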

Statements (i) and (ii) can be implemented in several ways. Since A is assumed to be diagonally dominant, explicit inversion of the diagonal blocks can be done either via the Gauss-Jordan algorithm [143] or via an LU decomposition without pivoting. (The Gauss-Jordan scheme, historically frowned upon, has recently been the subject of renewed interest and can be applied, with appropriate modifications, to general nonsymmetric systems of equations; see [34] for a discussion of its application.) If the Gauss-Jordan kernel is used, as is assumed below, the above block algorithm uses three primitives: a Gauss-Jordan inversion (or LU decomposition), a matrix multiplication, A ← AB, and a rank-ω update. In the latter case, the computations in step (i) above are replaced by solving two triangular systems of order ω with many right-hand sides. Note that when ω = 1 this form of the algorithm becomes the BLAS2 version based on rank-1 updates. As with Versions 1-4, the computations of Version 5 can be reorganized so that different combinations of BLAS3 primitives and different shapes of submatrices are used; for example, the main BLAS3 primitive can be changed from a rank-ω update into a matrix multiplying ω row or column vectors. As with the other versions, the importance of such a reorganization depends highly on the architecture in question. Gallivan et al. have applied the decoupling methodology to Version 5 [67]; their results demonstrate many of the performance trends observed in the literature for the various forms of block methods. A summary of the important points follows.

4.1.2. Performance analysis. There are two general aspects of the block LU decomposition through which the blocksize ω = n/k influences the arithmetic time: the number of redundant operations (applicable when the Gauss-Jordan approach is used), and the relationship, as a function of ω, between the performance of each of the primitives and the distribution of work among the primitives. Due to the redundant operations, the block LU algorithm is more expensive by a factor of approximately (1 + 2/k²) than the classical LU factorization, which requires about 2n³/3 operations. The redundancy factor of (1 + 2/k²), and the fact that the number of operations performed in the Gauss-Jordan primitive is an increasing function of ω, cause the arithmetic time component to prefer smaller blocksizes for small and moderately sized systems. For such systems, increasing ω (and therefore decreasing k) clearly exacerbates the two problems noted above to such a degree that the effect is dominant compared to the reduction in data transfer overhead gained by increasing the blocksize. As the order of the system increases, however, these effects become secondary to data transfer considerations.

The data transfer overhead of the algorithm is most conveniently analyzed by writing the algorithm's cache-miss ratio as the weighted average of the cache-miss ratios of the various instances of each primitive. The weights are the ratio of the number of operations in the particular instance of the primitive to the total number of operations required. (Due to parallel processing and the interaction between the instances of the primitives, some of the local cache-miss ratios are zero; this occurs when the remaining part of the matrix to be decomposed approaches the size of the cache, and later instances of primitives find an increasing proportion of their data left in cache by earlier instances.)

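To make the structure of the block algorithm concrete, the following is a minimal Python sketch of a block solve for a diagonally dominant system (so no pivoting is required), built from the three primitives named above: a Gauss-Jordan inversion of the diagonal block, a matrix multiplication, and a rank-ω update of the trailing submatrix. It is an illustration under our own naming, not the original Fortran/assembler implementation.

```python
# Sketch of the block LU/Gauss-Jordan elimination described in the text for a
# diagonally dominant matrix A (lists of lists).  All names are illustrative.

def mat_mul(A, B):
    """Dense product of two lists-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gauss_jordan_inverse(A):
    """Invert a small omega x omega block via Gauss-Jordan (no pivoting)."""
    n = len(A)
    M = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for k in range(n):
        piv = M[k][k]
        M[k] = [x / piv for x in M[k]]
        for i in range(n):
            if i != k:
                f = M[i][k]
                M[i] = [a - f * b for a, b in zip(M[i], M[k])]
    return [row[n:] for row in M]

def block_lu_solve(A, f, omega):
    """Solve A x = f by block elimination with blocksize omega."""
    n = len(A)
    A = [list(row) for row in A]
    f = list(f)
    for k in range(0, n, omega):
        e = min(k + omega, n)
        # primitive 1: explicit inversion of the diagonal block
        Dinv = gauss_jordan_inverse([row[k:e] for row in A[k:e]])
        # primitive 2: matrix multiplication scales the pivot block row
        panel = mat_mul(Dinv, [row[k:] for row in A[k:e]])
        for i, r in zip(range(k, e), panel):
            A[i][k:] = r
        fk = mat_mul(Dinv, [[v] for v in f[k:e]])
        for i, r in zip(range(k, e), fk):
            f[i] = r[0]
        # primitive 3: rank-omega update of the trailing submatrix
        for i in range(e, n):
            mults = A[i][k:e]
            for j in range(e, n):
                A[i][j] -= sum(m * A[k + t][j] for t, m in enumerate(mults))
            f[i] -= sum(m * f[k + t] for t, m in enumerate(mults))
    # back substitution on the reduced block upper triangular system
    x = list(f)
    for i in range(n - 1, -1, -1):
        x[i] -= sum(A[i][j] * x[j] for j in range(i + 1, n))
    return x
```

When ω = 1 this degenerates to the rank-1 (BLAS2) version mentioned above; larger ω shifts work into the rank-ω update.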
PARALLEL DENSE LINEAR ALGEBRA ALGORITHMS 43

In practice, some of the local cache-miss ratios are zero due to the interaction between the instances of the primitives; this occurs when the remaining part of the matrix to be decomposed approaches the size of the cache and later instances of primitives find an increasing proportion of their data left in cache by earlier instances. In [67] the results are derived using the conservative assumption of no interaction between instances of primitives. Note that without a model of the data transfer properties of the primitives such an analysis at the algorithmic level is impossible. This does not imply that blocksizes cannot be set effectively based on observed performance data of the primitives for various shapes and sizes of problems. Such a black box tuning approach is quite useful in practice, but it does not provide any explanation as to why the performance is as observed. This can only be done by considering the architecture/algorithm mapping of the primitives and the implications of combining them in the manner specified by the particular version of the factorization algorithm used.

The behavior of the cache-miss ratio μ, as a function of ω, roughly separates into three regimes on the interval 1 ≤ ω ≤ √CS, where CS denotes the cache size. For small values of ω, i.e., ω ≤ 16, the cache-miss ratio has a form in which η_1 is proportional to 1/n and the remaining term is a function of ω that is bounded by a small constant. In the middle of the interval of interest the cache-miss ratio has a form in which η_2 is proportional to 1/n; this result is expected, since the computations are dominated by the rank-ω update, which achieves a similar cache-miss ratio. Finally, when ω ≈ √CS, the cache-miss ratio has a form in which η_3 is proportional to 1/n. The ratio μ becomes a rapidly increasing function once ω exceeds √CS, until at the point ω = n it reaches the cache-miss ratio of a BLAS2-based version of the Gauss-Jordan algorithm; it is clear that the data locality of such a BLAS2 version is very poor. The exact point where this transition to a rapidly increasing μ occurs is dependent on the implementation of the Gauss-Jordan primitive, but any decrease in μ between ω = √CS and the transition point is typically insignificant.

4.1.3. Experimental results. The various versions of the algorithms have appeared in different contexts in the literature. Here we list some representative papers and then consider in more detail the performance of Version 5 and its relationship to the trends predicted via the blocksize analysis presented above. The results of using a BLAS2 form of Version 3 on the CRAY-1 and one CPU of a CRAY X-MP were given by Dongarra and Eisenstat in [41]. The column-oriented BLAS2 form of Version 4 was used by Fong and Jordan on the CRAY-1 [56]. Calahan demonstrated the power of the block form of Version 3 (with and without pivoting) on the hierarchical memory system of one CPU of a CRAY-2. Dongarra and Hewitt discuss the use of a rank-3-based approach on four CPU's of a CRAY X-MP [45]. Agarwal and Gustavson have extended their work, which led to single-CPU block algorithms for the IBM 3090, by considering parallel forms of the BLAS3 primitives and LU factorization on an IBM ES/3090 model 600S [2].

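The bookkeeping behind the decoupling analysis above, an algorithm-level cache-miss ratio formed as the operation-count-weighted average of the per-instance primitive miss ratios, can be sketched in a few lines of Python; the primitive instances and counts below are purely illustrative.

```python
# Weighted-average cache-miss bookkeeping of the decoupling methodology.
# Each primitive instance contributes its local miss ratio, weighted by the
# fraction of the total operation count it performs.

def algorithm_miss_ratio(instances):
    """instances: iterable of (operation_count, local_miss_ratio) pairs."""
    total_ops = sum(ops for ops, _ in instances)
    return sum(ops * mu for ops, mu in instances) / total_ops

# e.g. one Gauss-Jordan inversion, one matrix multiplication, one rank-omega
# update (counts and ratios are made up for illustration)
instances = [
    (2_000, 0.30),   # BLAS2-like Gauss-Jordan on the omega x omega block
    (8_000, 0.10),   # matrix multiplication primitive
    (40_000, 0.04),  # rank-omega update of the trailing submatrix
]
mu = algorithm_miss_ratio(instances)  # -> 0.06
```

The same helper makes the trade-off visible: enlarging ω shrinks the rank-ω term's local ratio but shifts weight toward the poorly cached Gauss-Jordan instance.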
In particular, they discuss the use of parallel block methods in a multitasking environment where the user is not necessarily guaranteed control of all (or any fixed subset) of the six processors in the system. Radicati, Robert, and Sguazzero have presented the results of a rank-k-based code on an IBM 3090 multivector processor for one to six processors [149]. The block form of Version 4 was also considered in a virtual memory setting by Du Croz et al. in [50] and used as a model of a block LU factorization in the BLAS3 standard proposal by Dongarra et al. [39]. Finally, Dayde and Duff have compared the performance of the different organizations of the block computations on a CRAY-2, an ETA-10P, and an IBM 3090-200/VF.

The performance trends for Version 5 predicted via the decoupling analysis summarized above have been verified in [67]. Figure 8 illustrates the performance of the block LU algorithm for diagonally dominant matrices for various blocksizes on an Alliant FX/8 [67]. The performance was computed using the nonblock version operation count; the actual rate is, therefore, higher for the block methods. The curves in Fig. 8 clearly show the trends predicted by the analysis above.

FIG. 8. Performance of block LU on an Alliant FX/8.

Version 3 code. As expected, for any fixed order of the system, performance improves as ω is increased until an optimum is reached; increasing ω beyond this value causes performance degradation due to the conflict between reducing μ and efficiently distributing work among the primitives. For small systems this conflict is acute; for larger systems the conflict reduces and performance is maintained until ω exceeds √CS.

The conflict between arithmetic time and data loading overhead minimization, which produces the shifting of the preferred blocksize as a function of n, can be mitigated somewhat by using a double-level blocking [67]. (This conflict has been deliberately exacerbated in these experiments by using a Fortran implementation of the Gauss-Jordan primitive and assembler-coded BLAS3 routines.) There are two basic approaches to double-level blocking: inner-to-outer and outer-to-inner. Both require a pair of blocksizes (θ, ω). The inner-to-outer approach begins with a block LU factorization with blocksize θ, which is determined largely by the arithmetic time analysis and which is typically smaller than the single-level load analysis would recommend. Several rank-θ updates are then grouped together into a rank-ω update in order to improve the data loading overhead. The outer-to-inner approach replaces the operation of the Gauss-Jordan primitive on a system of order ω with a block LU factorization using the inner blocksize θ.

FIG. 9. Performance of double-level block LU on an Alliant FX/8.

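The inner-to-outer idea, grouping several rank-θ updates into a single rank-ω update so that the trailing matrix is traversed once instead of several times, can be sketched as follows. This is a hedged illustration with our own helper names, not the Cedar implementation.

```python
# Inner-to-outer double-level blocking, in miniature: p rank-theta panel
# updates C <- C - A_i * B_i are fused into one rank-omega update
# (omega = p * theta) by concatenating the panels.

def rank_update(C, A, B):
    """C <- C - A * B for dense lists-of-lists (A: n x r, B: r x n)."""
    r = len(B)
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] -= sum(A[i][t] * B[t][j] for t in range(r))
    return C

def grouped_update(C, panels):
    """Concatenate several rank-theta panels into one rank-omega update."""
    A = [sum((Ai[i] for Ai, _ in panels), []) for i in range(len(C))]
    B = [row for _, Bi in panels for row in Bi]
    return rank_update(C, A, B)
```

Arithmetically the two forms are identical (C minus the sum of the same outer products); the payoff of the grouped form is purely in data traffic, since C is loaded through the cache once per rank-ω update rather than once per rank-θ update.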
The decoupling methodology can be used to show that these techniques do mitigate the conflict between reducing the arithmetic time component and the data loading overhead (see [67] for details). Figure 9 demonstrates that the use of the inner-to-outer form of double-level blocking can indeed improve performance. Note that the double-level version yields performance higher than all of the single-level implementations of Fig. 8 over the entire interval.

4.2. Distributed-memory algorithms.

4.2.1. LU factorization. Our objective here is to describe the effects that the data-storage and pivoting schemes have on the efficiency of the LU factorization of a dense matrix A = (a_ij) on distributed memory systems. We also describe some LU-like factorization schemes that are useful on distributed memory and hybrid architectures. The related parallel Cholesky schemes will not be discussed in this section. A number of papers have appeared in recent years describing various parallel LU factorization schemes on such architectures; see, e.g., Ipsen, Saad, and Schultz [104], Chu and George [28], Geist and Heath [77], [78], and Geist and Romine [79]. We will concentrate here only on the work of Geist and Romine.

Consider the two basic storage schemes: storage of A by rows and by columns. In most of the early work, row storage for the coefficient matrix was chosen principally because no efficient parallel algorithms were then known to exist for the subsequent forward and backward sweeps if the coefficient matrix were to be stored by columns. But, as discussed earlier, recent triangular solvers for distributed memory multiprocessors have removed the main reason for preferring row storage; see Heath [88] for an example.

The row storage case is considered first. Adopting the terminology of Geist and Romine [79], we refer to the following scheme as RSRP (Row Storage with Row Pivoting).

RSRP: each processor executes the following
  do k = 0, n - 1
    determine row pivot
    update permutation vector
    if (I own pivot row)
      fan-out (broadcast) pivot row
    else
      receive pivot row
    for (all rows i > k that I own)
      update row i
    enddo
  enddo

Next, the Column Storage with Row Pivoting (CSRP) scheme is given by:

CSRP: each processor executes the following
  do k = 0, n - 1
    if (I own column k)
      determine pivot row
      interchange
      broadcast the column just computed and pivot index
    else
      receive the column just computed and pivot index
      interchange
    for (all columns j > k that I own)
      update column j
    enddo
  enddo

Row Storage with Column Pivoting. A modification of RSRP, which we refer to as RSCP, consists of searching the current pivot row for the element with maximum modulus and then exchanging columns to bring this element to the diagonal. (Such a pivoting strategy dates back to Wilkinson's work on Gaussian elimination using the ACE computer with its limited amount of memory [62].) The RSCP algorithm can be readily seen as nothing more than the dual of algorithm CSRP. Geist and Heath [78] indicate that both RSCP and CSRP yield essentially identical speedup on an Intel iPSC hypercube. Geist and Heath conclude that, in the absence of such techniques as loop unrolling, LU factorization with partial pivoting is most efficient when pipelining is used to mask the cost of pivoting. In particular, the two schemes that can most easily be pipelined are: pivoting by interchanging rows when the matrix is distributed across the processors by columns (algorithm CSRP), and pivoting by interchanging columns when the matrix is distributed across the processors by rows (algorithm RSCP).

4.2.2. Pairwise pivoting. Gaussian elimination with pairwise pivoting is an alternative to LU factorization which is attractive on a variety of distributed memory architectures, including systolic arrays, since it introduces parallelism into the pivoting strategy. The main idea is rather simple. If u^T = [μ_1, ···, μ_n] and v^T = [ν_1, ···, ν_n] are two row vectors, then we can choose a stabilized elementary transformation so as to annihilate either μ_1 or ν_1, whichever is smaller in magnitude. Here, P is either the identity of order 2 or the interchange (e_2, e_1), chosen so that the multiplier used in the elimination does not exceed one in magnitude. One of the many possible annihilation schemes for reducing a nonsingular matrix A of order n to upper triangular form is illustrated in Fig. 10 for n = 8. (The elements marked with i can all be eliminated simultaneously on step i.) Such a triangularization scheme requires 2n - 3 stages, in which each stage consists of a maximum of ⌊n/2⌋ independent stabilized transformations. Pairwise pivoting can also be useful on shared memory machines to break the bottleneck caused by partial pivoting discussed earlier.
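The communication pattern of the CSRP scheme above can be made concrete with a sequential Python simulation: columns are wrap-mapped onto p logical processors, the owner of column k selects the pivot and forms the multipliers, and the "broadcast" column is then applied by every processor to the columns j > k that it owns. The wrap mapping and all names are illustrative assumptions.

```python
# Sequential simulation of the CSRP factorization sketched above.
# A is overwritten with L (multipliers, below the diagonal) and U.

def csrp_lu(A, p):
    """In-place LU with row pivoting, organized as CSRP over p owners."""
    n = len(A)
    owner = [j % p for j in range(n)]          # wrap (cyclic) column map
    perm = list(range(n))
    for k in range(n):
        # processor owner[k]: determine pivot row, interchange, and form
        # the multiplier column, then "broadcast" it with the pivot index
        piv = max(range(k, n), key=lambda i: abs(A[i][k]))
        if piv != k:
            A[k], A[piv] = A[piv], A[k]
            perm[k], perm[piv] = perm[piv], perm[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
        col = [A[i][k] for i in range(k + 1, n)]   # the broadcast message
        # every processor updates the columns j > k that it owns
        for proc in range(p):
            for j in range(k + 1, n):
                if owner[j] != proc:
                    continue
                for i in range(k + 1, n):
                    A[i][j] -= col[i - k - 1] * A[k][j]
    return perm
```

The inner `proc` loop is where a real implementation would run p processes concurrently; here it only makes the ownership of each update explicit.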

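The stabilized pairwise elimination step can be sketched as follows: of the two rows, the one with the larger leading entry (in modulus) is kept as the pivot of the pair (a swap plays the role of the 2 x 2 permutation P), and a single bounded multiplier annihilates the other leading entry. The simple bottom-up sweep ordering below is our own illustration, not the Fig. 10 schedule.

```python
# Pairwise pivoting in miniature: stabilized 2-row eliminations, applied
# column by column, reduce A to upper triangular form.  All names are
# illustrative.

def pairwise_eliminate(u, v, k):
    """Annihilate entry k of one of the two rows; |multiplier| <= 1."""
    if abs(v[k]) > abs(u[k]):
        u, v = v, u                      # stabilizing interchange
    if u[k] != 0.0:
        m = v[k] / u[k]
        v[:] = [b - m * a for a, b in zip(u, v)]
    return u, v

def pairwise_triangularize(A):
    n = len(A)
    for k in range(n):
        for i in range(n - 1, k, -1):    # bubble zeros up column k
            A[i - 1], A[i] = pairwise_eliminate(A[i - 1], A[i], k)
    return A
```

Because each step touches only two adjacent rows, disjoint pairs can be eliminated simultaneously, which is exactly the parallelism exploited by the annihilation schedule of Fig. 10.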
FIG. 10. Annihilation scheme for n = 8.

It is ideally suited for a ring of processors [157] or other systolic arrays [80]. Note, however, that it does not produce an LU factorization of the matrix; L is replaced by a product of matrices in which each one can be readily inverted. One possible drawback of this pivoting strategy is that the upper bound on the growth factor is the square of that of partial pivoting [168], [169]. Our extensive numerical experiments indicate that, as is the case with partial pivoting, such growth is rarely encountered in practice; however, our experience contradicts some conclusions of Trefethen and Schreiber [184], indicating that some further work is required to reconcile this seeming inconsistency. The above annihilation scheme was originally motivated by a parallel Givens reduction introduced in [161] and now used extensively in applications such as signal processing for recursive least squares computations. This parallel Givens reduction was later generalized for a ring of processors [158].

4.2.3. A hybrid scheme. The shared memory block LU algorithm and the BLAS3 primitives discussed above are concerned with achieving high performance on an architecture like a single Cedar cluster. While these algorithms and kernels form an invaluable building block for algorithms on the Cedar system, and the conclusions of the analysis are applicable over a fairly wide range of multivector architectures, care must be taken not to generalize these conclusions too far. For example, on a single Cedar cluster (and similar architectures) routines for many of the basic linear algebra tasks encountered in practice can be designed as a series of calls to BLAS3 kernels and BLAS2-implemented algorithms, thereby masking all of the architectural considerations of parallelism, vectorization, and communication. This method of algorithm design, however, cannot be generalized to all hierarchical shared memory machines. One of the main reasons for this is the fact that an algorithm designed via this method may have problems with an inappropriate choice of task granularity and the resulting excessive communication requirements. The need to introduce double-level blocking forms of the algorithm indicated the onset of such a problem on a Cedar cluster: the attempt to spread the BLAS2-implemented kernel across the processors in a cluster introduced serious limitations on the performance of the block algorithm.

In order to design factorization schemes for multicluster machines, such as Cedar, in which each cluster is a parallel computer with tightly coupled processors, we must combine the strategies outlined above for both shared and distributed memory models. In that sense, Cedar's advantage is the existence of a shared global memory. Breaking the problem among the clusters so as to minimize intercluster communication while maintaining load balancing is an issue faced by users of distributed memory architectures. When this problem

becomes extreme, other forms of the algorithm must be used which typically involve reorganizing the block computations to more efficiently map the algorithm to the architecture via tasks of coarser granularity, with more attention focused on minimizing the required communication. Typically this involves some notion of pipelining (possibly multidimensional) at the block level. An example of such a situation is the solution of a dense linear system using more than one cluster of Cedar (possibly a subset of the total number available). In this case the algorithm design must take into account that intercluster communication is rather costly.

There are several possible designs for such an algorithm. One of the most straightforward is based on the outer-to-inner double-level block form presented above, using a block-LU scheme with partial pivoting. The block computations can be pipelined across clusters using the necessary Cedar synchronization primitives; see, e.g., [14]. A second possibility uses the control structure of the pipelined Givens factorization on a ring of processors described in [158]: a block of rows rather than a single row is communicated between processors, and the row rotation is replaced with a block Gaussian elimination procedure. The remainder of this section discusses another algorithm, due to Sameh [157], for solving dense linear systems on a multiple cluster architecture which requires a relatively small amount of intercluster communication. For simplicity a four-cluster Cedar is assumed.

Let A, a nonsingular matrix of order n, be partitioned into four block rows A_1, A_2, A_3, A_4, where A_i resides in the ith cluster memory. The algorithm consists of two major stages. In the first stage, each A_i is factored, with pairwise pivoting, into the form P_i A_i = L_i U_i, for i = 1, ···, 4, where P_i is a permutation, L_i is unit lower triangular, and U_i is upper trapezoidal. Assuming, without loss of generality, that each U_i has a nonsingular upper triangular part, the factorization of A may be completed in the second stage, which consists of 3n/4 computational waves pipelined across the four clusters. These computational waves comprise three groups of n/4 waves. Let U_k = [μ_ij]. During the kth group the latest values for the rows of U_k are used by clusters k + 1 to 4 in a pipelined fashion to further reduce their segments of the decomposition. The first group of n/4 computational waves, which use the rows of U_1 produced by cluster 1, is described below; the pattern of the remaining two groups follows trivially.

Wave 1. The first row of U_1 is transmitted via the global memory to cluster 2, where it is used, with pairwise pivoting, to annihilate the first element of the (possibly new due to pairwise pivoting) first row of U_2. The updated first row of U_1 is then transmitted to cluster 3 so as to annihilate the first element of the first row of U_3, and then to cluster 4, where the first element of the first row of U_4 is eliminated, with the final version of the first row of U_1 residing in global memory. It should be noted that cluster k is idle during the kth group of waves and the remainder of the algorithm, since the other clusters will update the rows of U_k that it has produced and placed in global memory. Hence cluster 1 only performs the initial reduction of A_1 and is then released for other tasks within the application code of which solving the system is a part, or the tasks of other users, since Cedar is a multiuser system.

Waves 2 ≤ j ≤ n/4. Similar to the first wave, the jth row of U_1 is transmitted to clusters 2, 3, and 4 to annihilate the leading elements of the jth rows of U_2, U_3, and U_4, respectively. As soon as the appropriate element is annihilated in cluster k, k = 2, 3, 4, each cluster reduces U_k, which at this point is upper Hessenberg, to upper trapezoidal form. After these annihilations occur, the nonzero portion of U_k is an n/4 x (n - 1) upper Hessenberg matrix (for example, for n = 24 it is a 6 x 23 upper Hessenberg matrix). The cluster then proceeds to reduce U_k to upper trapezoidal form through a pipelined Gaussian elimination process using pairwise pivoting. Note that after this first group of computational waves U_1 is in its final form in global memory. The matrix U_2 is in its penultimate form since it will only change due to the pairwise pivoting done by clusters 3 and 4 in the second group of computational waves. The second and third computational groups proceed in the same way as the first did, with each cluster fetching the appropriate row from the source matrix, U_2 followed by U_3, transforming U_k to upper Hessenberg form and then reducing it back to an upper trapezoidal matrix. This implies that, once U_2 is finished, cluster 2 is available for other work.

This basic form of the algorithm possesses many levels of communication and computation granularity and can be modified to improve utilization of a multicluster architecture. For example, if the whole Cedar machine were devoted to such a dense solver, simple interleaving of block rows of A would enhance load balancing among the clusters.

4.3. Block tridiagonal linear systems. Block tridiagonal systems arise in numerous applications — one example being the numerical handling of elliptic partial differential equations via finite element discretization. Often, solving such linear systems constitutes the major computational task. Hence, efficient algorithms for solving these systems on vector and parallel computers are of importance. Some of the early work may be found in [191] and the survey by Heller [90]; see also [120], [132] and the recent book by Ortega [137].

Using block versions of Gaussian elimination for block tridiagonal systems seems a natural extension of the efficient dense solvers discussed above; e.g., see [46]. If the size of the blocks is small, however, such forms of Gaussian elimination offer little potential vectorization and parallelization. A more recent study of block Gaussian elimination on the Alliant FX/8 for solving such systems [11] indicates the importance of efficient dense solvers and the underlying BLAS3 as components for block tridiagonal solvers.

Similar to the above discussions for banded triangular systems, a partitioning scheme, referred to as the spike algorithm below, for handling tridiagonal systems on vector or parallel computers was introduced in [161], where Givens reductions were used to handle the diagonal blocks. Wang [192] considered the simpler problem of diagonally dominant systems and gave essentially the same form of the algorithm, modified to use Gaussian elimination (made possible by the assumption of diagonal dominance) and a different method for the elimination of the spikes. Several studies have generalized this partitioning scheme to narrow-banded systems; see, e.g., [46], [47].

The main idea of this partitioning scheme may be outlined as follows. Let the linear system under consideration be denoted by Ax = f, where A is a banded diagonally dominant matrix of order n. It is assumed that the number of superdiagonals m ≪ n is equal to the number of subdiagonals and that, for simplicity of presentation, n = pq. On a sequential machine such a system would be solved via Gaussian elimination; see [38] for example. The algorithm described below assumes p CPU's of a CRAY X-MP or CRAY-2, or a Cedar system with p clusters. Here, for the sake of illustration, p is taken to be 4. Let the matrix A be partitioned into the block-tridiagonal form with block row [C_i, A_i, B_i], and conformally x and f, where each A_i, 1 ≤ i ≤ p, is a banded matrix of order q = n/p and bandwidth 2m + 1 (same as A), and B_i and C_{i+1}, 1 ≤ i ≤ p - 1, are lower and upper triangular matrices, respectively, each of order m. The algorithm consists of three stages.

Stage 1. If both sides of Ax = f were premultiplied by diag(A_1^{-1}, A_2^{-1}, ···, A_p^{-1}), we obtain a system in which E_i and F_i, the "spikes," are matrices of m columns each, obtained by solving linear systems with the coefficient matrices A_i, and will, in general, be full.

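A pure-Python sketch of the three stages for the simplest case, a tridiagonal system (bandwidth m = 1) with p partitions, may help fix the ideas: each "cluster" independently solves its diagonal block against the right-hand side and the two coupling columns (the spikes E_i, F_i); the 2p interface unknowns form the reduced system, solved here densely; the interior unknowns are then retrieved with no further coupling. All names and the q ≥ 2 restriction are illustrative assumptions, not the original formulation.

```python
# Spike (partition) algorithm for a diagonally dominant tridiagonal system.
# sub[i-1], diag[i], sup[i] are the entries of row i; n = p * q, q >= 2.

def thomas(sub, diag, sup, rhs):
    """Solve a diagonally dominant tridiagonal system (no pivoting)."""
    n = len(diag)
    d, r = list(diag), list(rhs)
    for i in range(1, n):
        w = sub[i - 1] / d[i - 1]
        d[i] -= w * sup[i - 1]
        r[i] -= w * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - sup[i] * x[i + 1]) / d[i]
    return x

def dense_solve(M, b):
    """Small dense solver (partial pivoting) for the reduced system."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(M, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            w = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= w * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j]
                              for j in range(i + 1, n))) / M[i][i]
    return x

def spike_solve(sub, diag, sup, f, p):
    n = len(diag)
    q = n // p
    blocks = []
    for i in range(p):                      # stage 1: independent block solves
        lo, hi = i * q, (i + 1) * q
        args = (sub[lo:hi - 1], diag[lo:hi], sup[lo:hi - 1])
        g = thomas(*args, f[lo:hi])
        E = thomas(*args, [sub[lo - 1]] + [0.0] * (q - 1)) if i > 0 else None
        F = thomas(*args, [0.0] * (q - 1) + [sup[hi - 1]]) if i < p - 1 else None
        blocks.append((g, E, F))
    # stage 2: reduced system in the 2p interface unknowns x_i[0], x_i[q-1]
    K = [[0.0] * (2 * p) for _ in range(2 * p)]
    h = [0.0] * (2 * p)
    for i, (g, E, F) in enumerate(blocks):
        for r, row in ((0, 2 * i), (q - 1, 2 * i + 1)):
            K[row][row] = 1.0
            h[row] = g[r]
            if E is not None:
                K[row][2 * (i - 1) + 1] = E[r]   # couples to x_{i-1}[q-1]
            if F is not None:
                K[row][2 * (i + 1)] = F[r]       # couples to x_{i+1}[0]
    y = dense_solve(K, h)
    # stage 3: retrieval, with no coupling between blocks
    x = []
    for i, (g, E, F) in enumerate(blocks):
        xl = y[2 * (i - 1) + 1] if E is not None else 0.0
        xr = y[2 * (i + 1)] if F is not None else 0.0
        x.extend(g[r] - (E[r] * xl if E else 0.0) - (F[r] * xr if F else 0.0)
                 for r in range(q))
    return x
```

Stages 1 and 3 are embarrassingly parallel across the p partitions; only the small reduced system in stage 2 requires any coupling, mirroring the communication structure described in the text.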
In stage 1, each cluster k, 2 ≤ k ≤ 3, solves 2m + 1 linear systems of the form A_k V = r, while clusters 1 and 4 each solve m + 1 linear systems of the same form. Note that no intercluster communication is needed. The method of solution used on each cluster (Alliant FX/8) for these systems with multiple right-hand sides, in turn, varies with m: for m ≤ 8 a variant of the spike algorithm is used; for 8 < m ≤ 16 (approximately), block cyclic reduction is the most effective; and for larger m a block Gaussian elimination is recommended [11].

Stage 2. Let E_i and F_i be partitioned, respectively, so that P_i, Q_i, S_i, T_i ∈ R^{m x m} denote their top and bottom m x m blocks, and let g_i and x_i be conformally partitioned. The structure of the resulting partitioned system is such that the unknown vectors y_j, 1 ≤ j ≤ 6 (each of order m), are disjoint from the rest of the unknowns. In other words, the m equations above and the m equations below each of the 3 partitioning lines form an independent system of order 6m, which is referred to as the reduced system Ky = h. Note, however, that the bandwidth of the reduced system has doubled compared to that of the original system. Block-column permutations can reduce the bandwidth back to its original value, but this destroys diagonal dominance, and pivoting will usually be required to solve the permuted reduced system.

Since A is diagonally dominant, it can be shown that the reduced system is also diagonally dominant, and hence there are a number of options available for solving it. Typically, it is small enough to be sent to a single Cedar cluster and solved with an appropriate algorithm. When it is large enough to warrant a multicluster approach, the reduced-system approach could be applied again. It is also possible to use all of the clusters to solve the reduced system via an iterative technique such as Orthomin(k) [47]. Finally, if the original linear system is sufficiently diagonally dominant, we can ignore the matrices Q_i and S_i, as ||S_i||_∞ and ||Q_i||_∞ are much smaller than ||T_i||_∞ and ||P_i||_∞. This results in a block-diagonal reduced system in which each block is of the form

Stage 3. Once the y_j's are obtained, the rest of the components of the solution vector of the original system may be retrieved within each cluster from g_i, E_i, F_i, and the appropriate y_j's. Provided that the y_j's are stored in the global memory, this stage requires no intercluster communication.

In addition to reporting on the performance results for this algorithm on the Alliant FX/8, where a detailed performance analysis of the problem is provided, [11] also reports on the performance achieved on four CPU's of a CRAY X-MP/416. Using four partitions on a system of order 16384 with blocksize 32, a speedup relative to itself of 3.8 was achieved, indicating an efficient use of the microtasking capabilities and memory system of the machine. The speedup compared to a block LU algorithm on one CPU was approximately 2.

There are several modifications and reorganizations possible of the spike algorithm for solving banded systems discussed above. These can be used to alter the form of the algorithm to more efficiently map to a variety of shared memory architectures; for one such alternative see [155]. Also, if the system is symmetric positive definite, Dongarra and Johnsson [46] have discussed how the algorithm can be modified to obtain a reduced system that is symmetric positive definite as well. In [60], Fox et al. have also considered the problem of banded systems on hypercubes. The work by Johnsson [108], [109] is representative of the organization of concurrent algorithms for solving tridiagonal and narrow-banded systems on distributed memory machines with various connection topologies, e.g., two-dimensional arrays, shuffle-exchange networks, and boolean cubes. An analysis of the parallel and numerical aspects of a two-sided Gaussian elimination for solving tridiagonal systems has been given recently by van der Vorst [188].

5. Least squares. In solving the linear least squares problem min ||Ax - b||_2, where A is an m x n matrix of rank n (m > n), it is often necessary to obtain the factorization A = QR, in which Q is an orthogonal matrix and R is a nonsingular upper triangular matrix of order n. Such a factorization may be realized on multiprocessors via plane rotations, elementary reflectors, or the Modified Gram-Schmidt algorithm; see, e.g., [9], [16], [48], [158], and [161]. (The latter algorithm is more commonly associated with the calculation of an orthogonal basis of the range of A.)

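As a concrete instance of the plane-rotation approach to this least squares problem, the following Python sketch annihilates each subdiagonal entry with a Givens rotation, applies the same rotation to b, and back-substitutes in the triangular factor. It is a plain sequential illustration with our own names, not any of the parallel organizations cited here.

```python
# Least squares via Givens (plane) rotations: rotate A to upper triangular
# form, carrying b along, then solve R x = (Q^T b)[:n] by back substitution.

import math

def givens_ls(A, b):
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]
    b = b[:]
    for j in range(n):
        for i in range(m - 1, j, -1):        # zero column j below the diagonal
            if A[i][j] == 0.0:
                continue
            r = math.hypot(A[j][j], A[i][j])
            c, s = A[j][j] / r, A[i][j] / r
            for k in range(j, n):            # rotate rows j and i of A
                t = c * A[j][k] + s * A[i][k]
                A[i][k] = -s * A[j][k] + c * A[i][k]
                A[j][k] = t
            t = c * b[j] + s * b[i]          # ... and of b
            b[i] = -s * b[j] + c * b[i]
            b[j] = t
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][k] * x[k]
                           for k in range(i + 1, n))) / A[i][i]
    return x
```

Each rotation touches only two rows, which is why disjoint rotations can proceed simultaneously in the parallel Givens schemes discussed below.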
In the section concerning shared memory multiprocessors, block versions of Householder reduction and the modified Gram-Schmidt algorithm are presented, as well as a pipelined Givens reduction for updating matrix factorizations. For distributed memory multiprocessors, organizations of Givens and Householder reductions on a ring of processors, see [158], convey the main ideas needed for implementation on hypercubes and locally connected distributed memory architectures; see also the related papers [15], [19], [36], [146], [163].

5.1. Shared-memory algorithms.

5.1.1. A block Householder reduction. If A = A_1 = [a_1^(1), ···, a_n^(1)], then it is possible to generate elementary reflectors P_k = I - α_k u_k u_k^T such that forming P_k A_k produces the kth row of R and the (m - k) x (n - k) matrix A_{k+1} = [a_j^(k+1)], a_j^(k+1) = P_k a_j^(k), j = k + 1, ···, n. The two basic tasks in such a procedure are [170]: (i) generation of the reflector P_k such that P_k a_k^(k) = (ρ_kk, 0, ···, 0)^T, k = 1, ···, n; and (ii) updating the remaining (n - k) columns. On a parallel computer, reflector P_{k+1} may be generated even before task (ii) for stage k is finished. While an organization that allows such an overlap is well suited for some shared memory machines and for a distributed memory multiprocessor such as a ring of processors, see [158], it does not offer the data locality needed in a hierarchical shared memory system such as that of an Alliant FX/8. A block scheme proposed by Bischof and Van Loan [16] offers such data locality. This scheme depends on the fact that the product of k elementary reflectors, Q_k = (P_k ··· P_1), where P_i = I_m - w_i u_i^T, can be expressed as a rank-k update of the identity of order m.

The block algorithm may be described as follows. Let the m x n matrix (m ≥ n) whose orthogonal factorization is desired be given by A = [A_1, B], where A_1 consists of the first k columns of A. First, proceed with the usual Householder reduction scheme by generating the k elementary reflectors P_1 through P_k that reduce A_1, so that R_1 is upper triangular of order k, without modifying the matrix B. If we accumulate the product Q_k = P_k ··· P_1 = I - W_k U_k^T as each P_i is generated, i.e., W_j = (P_j W_{j-1}, w_j) and U_j = (U_{j-1}, u_j) for j = 2, ···, k, with W_1 = w_1 and U_1 = u_1, the matrix B can then be updated via B ← Q_k B = B - W_k (U_k^T B), which relies on the high efficiency of one of the most important kernels in BLAS3, that of a rank-k update. The process is then repeated on the modified B with another well-chosen block size, and so on until the factorization is completed. It may also be desirable to accumulate the various Q_k's to obtain the orthogonal matrix Q that triangularizes A.

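The accumulation identity at the heart of this block scheme, that P_k ··· P_1 equals I - W_k U_k^T when W and U are built column by column, can be checked directly in a few lines of Python. The recurrence below follows the description in the text; the reflector construction and names are our own illustration.

```python
# Accumulating k Householder reflectors P_i = I - w_i u_i^T into a single
# rank-k update I - W U^T via  W_j = (P_j W_{j-1}, w_j),  U_j = (U_{j-1}, u_j).

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def reflector(u):
    """P = I - w u^T with w = (2 / u^T u) u, a genuine Householder reflector."""
    m = len(u)
    beta = 2.0 / sum(x * x for x in u)
    return [[(1.0 if i == j else 0.0) - beta * u[i] * u[j]
             for j in range(m)] for i in range(m)]

def accumulate_wy(us):
    """Return columns (W, U) such that P_k ... P_1 = I - W U^T."""
    W, U = [], []
    for u in us:
        beta = 2.0 / sum(x * x for x in u)
        w = [beta * x for x in u]
        P = reflector(u)
        W = [matvec(P, col) for col in W] + [w]   # W_j = (P_j W_{j-1}, w_j)
        U = U + [u[:]]                            # U_j = (U_{j-1}, u_j)
    return W, U
```

Once W and U are in hand, applying the accumulated transformation to a block B costs one rank-k update, which is exactly the BLAS3 kernel the text emphasizes.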
FIG. 11. Performance of block Householder algorithm on an Alliant FX/8.

The block scheme, however, requires roughly (1 + 2/p) times the arithmetic operations needed by the classical sequential scheme, where p = n/k is the number of blocks (assuming a uniform block size throughout the factorization). It was shown in [16] that this block algorithm is as numerically stable as the classical Householder scheme. Bischof and Van Loan report the performance of the block algorithm at 18 Mflops for large square matrices (n = 1000) on an FPS-164/MAX with a single MAX board, and note that an optimized LINPACK QR running on an FPS-164 without MAX boards would achieve approximately 6 Mflops. An example of the performance achieved by a BLAS3 implementation of the block Householder algorithm (PQRDC) compared to a BLAS2 version (DQRDC) on an Alliant FX/8 [85] is shown in Fig. 11. The performance shown is computed using the nonblock algorithm operation count.

Most recently, Schreiber and Van Loan have considered a more efficient storage scheme for the product of Householder matrices [164]. They describe the compact WY representation of the orthogonal matrix Q, which is of the form Q = I + Y T Y^T, where Y ∈ R^{m x n} is a lower trapezoidal matrix and T ∈ R^{n x n} is an upper triangular



matrix. The representation requires only mn storage locations and can be computed in a stable fashion.

5.1.2. A block-modified Gram-Schmidt algorithm. The goal of this algorithm is to factor an m x n matrix A of maximal rank, where m ≥ n, into an orthonormal m x n matrix Q and an upper triangular R of order n. Let A be partitioned into two blocks A_1 and B, where A_1 consists of ω columns of order m, with Q and R partitioned accordingly:

The algorithm is given by:

If n = kω, step (i) is performed k times and steps (ii) and (iii) are each performed k - 1 times. Three primitives are needed for the jth step of the algorithm: a QR decomposition (assumed here to be a modified Gram-Schmidt routine — MGS); a matrix multiplication AB; and a rank-ω update of the form C ← C - AB. The primitives allow for ideal decomposition for execution on a limited-processor shared memory architecture. The BLAS2 version of the modified Gram-Schmidt algorithm is obtained when ω = 1 or ω = n, and a double-level blocking version of the algorithm is derived in a straightforward manner by recursively calling the single-level block algorithm to perform the QR factorization of the m x ω matrix A_1. Jalby and Philippe have considered the stability of this block algorithm [106] and Gallivan et al. have analyzed the performance as a function of blocksize [67]. Below, a summary of this blocksize analysis is presented along with experimental results on an Alliant FX/8 of single and double-level versions of the algorithm.

The analysis is more complex than that of the block LU algorithm for diagonally dominant matrices discussed above, but the conclusions are similar. This increase in complexity is due to the need to apply a BLAS2-based MGS primitive to an m x ω matrix at every step of the algorithm. As with the block version of the LU factorization with partial pivoting, this portion of each step makes poor use of the cache and increases the amount of work done in less efficient BLAS2 primitives. The analysis of the arithmetic time component clearly shows that the potential need for double-level blocking is more acute for this algorithm than for the diagonally dominant block LU factorization on problems of corresponding size. The behavior of the algorithm with respect to the number of data loads can be discussed most effectively by considering approximations of the cache-miss ratios. For the interval 1 ≤ ω ≤ l ≈ CS/m the cache-miss ratio is

where η1 is proportional to 1/n; this ratio achieves its minimum value m/(2CS) at ω = l. Under certain conditions the cache-miss ratio continues to decrease on the interval

PARALLEL DENSE LINEAR ALGEBRA ALGORITHMS


l ≤ ω ≤ n, where it has the form

where η2 is proportional to 1/n; this ratio reaches its minimum at a point less than √CS and increases thereafter, as expected (see [67] for details). When ω = n the cache-miss ratio for the second interval is 1/2, corresponding to the degeneration from a BLAS3 method to a BLAS2 method. The composite cache-miss ratio function over both intervals behaves like a hyperbola before reaching its minimum; therefore the cache-miss ratio does not decline as rapidly in the latter parts of the interval as it does near the beginning.

FIG. 12. Performance of one-level block MGS on an Alliant FX/8.

A load analysis of the double-level algorithm shows that double-level blocking either reduces or preserves the cache-miss ratio of the single-level version while improving the performance with respect to the arithmetic component of time. Figures 12 and 13 illustrate, respectively, the results of experiments run on an Alliant FX/8, using single-level and double-level versions of the algorithm applied to square matrices. The cache size on this particular machine is 16K double precision words.
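The structure of the single-level algorithm evaluated in these experiments can be sketched in a few lines. The following NumPy illustration is ours (function and variable names are not taken from [67]); the three primitives are marked as steps (i)-(iii):

```python
import numpy as np

def block_mgs_qr(A, omega):
    """One-level block Gram-Schmidt QR, A = Q R, with panel width omega.

    Step (i): modified Gram-Schmidt on the current m-by-omega panel.
    Step (ii): matrix multiplication against the trailing columns.
    Step (iii): rank-omega update of the trailing columns.
    """
    A = A.astype(float).copy()
    m, n = A.shape
    R = np.zeros((n, n))
    for j in range(0, n, omega):
        hi = min(j + omega, n)
        # (i) MGS on the panel A[:, j:hi]
        for k in range(j, hi):
            R[k, k] = np.linalg.norm(A[:, k])
            A[:, k] /= R[k, k]
            for l in range(k + 1, hi):
                R[k, l] = A[:, k] @ A[:, l]
                A[:, l] -= R[k, l] * A[:, k]
        if hi < n:
            # (ii) inner products of the panel with the trailing block
            R[j:hi, hi:] = A[:, j:hi].T @ A[:, hi:]
            # (iii) rank-omega update C <- C - A B
            A[:, hi:] -= A[:, j:hi] @ R[j:hi, hi:]
    return A, R  # A now holds the orthonormal factor Q
```

The two BLAS3 calls in steps (ii) and (iii) dominate the work for moderate ω, while step (i) is the BLAS2-bound portion discussed above; the double-level variant would simply call this routine recursively on each panel.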


K. A. GALLIVAN, R. J. PLEMMONS, AND A. H. SAMEH

For the range of n, the order of the matrix, shown in Fig. 12, the single-level optimal blocksize due to the data loading analysis starts at ω = 64, decreases to ω = 21 for n = 768, and then increases to ω = 28 at n = 1024. Analysis of the arithmetic time component recommends the use of a blocksize between ω = 16 and ω = 32. Therefore, due to the hyperbolic nature of η and the arithmetic time component analysis, it is expected that the performance of the algorithm should increase until ω ≈ 32. The degradation in performance as ω increases beyond this point to, say, ω = 64 or 96, should be fairly significant for small and moderately sized systems due to the rather large portion of the operations performed by the BLAS2 MGS primitive.

FIG. 13. Performance of two-level block MGS on an Alliant FX/8.

The results of the experiments confirm the trends predicted by the theory. The version using ω = 32 is clearly superior. The performance for ω = 8 is uniformly dismal across the entire interval since the blocksize is too small for both data loading overhead and arithmetic time considerations. Note that as n increases the gap in performance between the ω = 32 version and the larger blocksize versions narrows. This is due to both arithmetic time considerations as well as data loading. As noted above, for small systems, the distribution of operations reduces the performance of the larger blocksize versions; but, as n increases, this effect decreases in importance. (Note that this narrowing trend is much slower than that observed for the block LU



algorithm. This is due to the fact that the fraction of the total operations performed in the slow primitive is ω/n for the block Gram-Schmidt algorithm and only ω²/n² for the block LU.) Further, for larger systems, the optimal blocksize for data loading is an increasing function of n; therefore, the difference in performance between the three larger blocksizes must decrease. Figure 13 shows the increase in performance which results from double-level blocking. Since the blocksize indicated by arithmetic time component considerations is between 16 and 32, these two values were used as the inner blocksize θ. For θ = 16 the predicted outer blocksize ranges from ω = 64 up to ω = 128; for θ = 32 the range is ω = 90 to ω = 181. (Recall that the double-level outer blocksize is influenced by the cache size only by virtue of the fact that √CS is used as a maximum cutoff point.) For these experiments the outer blocksize ω = 96 was used for two reasons. First, it is a reasonable compromise for the preferred outer blocksize given the two values of θ. Second, the corresponding single-level version of the algorithm, i.e., (θ, ω) = (96, 96), did not yield high performance, and a large improvement due to altering θ would illustrate the power of double-level blocking. (To emphasize this point the curve with (θ, ω) = (96, 96) is included.) The curves clearly demonstrate that double-level blocking can improve the performance of the algorithm significantly. (See [67] for details.)

5.1.3. Pipelined Givens rotations. While the pipelined implementation of Givens rotations is traditionally restricted to distributed memory and systolic type architectures, e.g., [80], it has been successful on shared memory machines in some settings. In [48] a version of the algorithm was implemented on the HEP and compared to parallel methods based on Householder transformations.
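All of these rotation-based schemes are built on the same serial primitive: merging an incoming row into the triangular factor R with plane rotations. A minimal NumPy sketch of that primitive (our own illustration; it models neither the HEP synchronization nor any particular register or cache strategy):

```python
import numpy as np

def givens(a, b):
    """Rotation parameters (c, s) with [[c, s], [-s, c]] zeroing b against a."""
    r = np.hypot(a, b)
    return a / r, b / r

def row_merge_qr(A):
    """Reduce A to upper triangular R by rotating each new row into R."""
    m, n = A.shape
    R = np.zeros((n, n))
    for row in A:
        w = row.astype(float).copy()
        for i in range(n):
            if w[i] == 0.0:
                continue
            if R[i, i] == 0.0:
                R[i, i:] = w[i:]   # first nonzero in this pivot position
                break
            c, s = givens(R[i, i], w[i])
            Ri, wi = R[i, i:].copy(), w[i:].copy()
            R[i, i:] = c * Ri + s * wi   # updated pivot row
            w[i:] = -s * Ri + c * wi     # w[i] is now annihilated
    return R
```

In the parallel versions discussed in the text, many such row merges proceed concurrently, with synchronization (or message passing) guarding the shared rows of R.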
Rather than using the standard row-oriented synchronization pattern, the triangular matrix R was partitioned into a number of segments which could span row boundaries. Synchronization of the update of the various segments was enforced via the HEP's full-empty mechanism. The resulting pipelined Givens algorithm was shown to be superior to the Householder-based approaches. Gallivan and Jalby have implemented a version of the traditional systolic algorithm (see [80]) adapted to efficiently exploit the vector registers and cache of the Alliant FX/8. The significant improvement in performance of a structural mechanics code due to Berry and Plemmons, which uses weighted least squares methods to solve stiffness equations, is detailed in [10] (see also [144], [145]). The hybrid scheme for LU factorization discussed earlier for cluster-based shared memory architectures converts easily to a rotation-based orthogonal factorization; see [157]. Chu and George have considered a variation of this scheme for shared memory architectures [31]. The difference is due to the fact that Sameh exploited the hybrid nature of the clustered memory and kept most of the matrix stored in a distributed fashion while pipelining between clusters the rows used to eliminate elements of the matrix. Chu and George's version keeps these rows local to the processors and moves the rows with elements to be eliminated between processors.

5.2. Distributed memory multiprocessors. 5.2.1. Orthogonal factorization. Our purpose in this section is to survey parallel algorithms for solving (10) on distributed memory systems. In particular, we discuss some algorithms for the orthogonal factorization of A. Several schemes have been proposed in the past for the orthogonal factorization of matrices on distributed memory systems. Many of them deal with systolic arrays and require the use of O(n²)

processors, where n is the number of columns of the matrix. Theoretical studies and comparisons of such algorithms for Givens reduction have been given by Pothen, Somesh, and Vemulapati [148] and by Elden [54]. In chronological order: Ahmed, Delosme, and Morph [4], Bojanczyk, Brent, and Kung [17], Sameh [158], and Gentleman and Kung [80] all consider Givens reduction and require a triangular array of O(n²) processors, while Luk [125] uses a mesh connected array of O(n²) processors. Modi and Clarke [134] have suggested a greedy algorithm for Givens reduction and the equivalent ordering of the rotations, and Cosnard, Muller, and Robert [32] have shown that the greedy algorithm is optimal in the number of timesteps required. These studies, however, do not consider a specific architecture or communication pattern. We now briefly survey some of these algorithms that have been implemented on current commercially available distributed memory multiprocessors.

We begin with the work of Chamberlain and Powell [25], who consider both Givens and Householder reduction on a ring of processors in which the number of processors is independent of the problem size. Each processor possesses a local memory, with one processor only handling the input and output. Figure 14 shows the organization of Givens reduction on three processors for a rectangular matrix of seven rows and five columns on such a ring. An entry ij indicates the rotation of rows i and j so as to annihilate the jth element of row i, and each column of the figure depicts the operations taking place in each processor.

FIG. 14. Givens reduction on a three processor ring.

Recall that the classical Householder reduction may be described as follows. Let Ak+1 = Qk Ak, in which Qk = diag(Ik−1, Pk); here Ak is upper triangular in its first (k − 1) rows and columns, with Pk being the elementary reflector of order (m − k + 1) that annihilates all the elements below the diagonal of the kth column of Ak, and aj(k) denotes the jth column of Ak. Then Householder reduction on the same matrix and ring architecture as above may be organized as shown in Fig. 15, where a Pk alone indicates generation of the kth elementary reflector. In this study the coefficient matrix A is stored by rows across the processors in the usual wrap fashion, and most of the rotations involve rows within one processor in a type of divide-and-conquer scheme. However, it is necessary to carry out rotations

involving rows in different processors; however, the data is always exchanged between neighboring processors. Their tests seem to indicate that Givens reduction is superior on such an architecture.

FIG. 15. Householder reflectors on a three processor ring.

Chu and George [30] have also suggested and implemented algorithms for performing the orthogonal factorization of a dense rectangular matrix on a hypercube multiprocessor. They describe two ways of implementing the merges and compare them in terms of load balance and communication overhead. Their recommended scheme involves the embedding of a two-dimensional grid in the hypercube network, and their analysis of the algorithm determines how the aspect ratio of the embedded processor grid should be chosen in order to minimize the execution time or storage usage.

Kim, Agrawal, and Plemmons [113] have developed and tested a row-block version of the Householder transformation scheme which is based upon the divide-and-conquer approach suggested by Golub, Plemmons, and Sameh [81] and developed further in [145]. Another feature of the algorithm is that redundant computations are incorporated into a communication scheme which takes full advantage of the hypercube connection topology. Extensive computational experiments which are reported by the authors on a 64-processor Intel hypercube support their theoretical performance analysis results. We note that Katholi and Suter [112] have also adopted this approach in developing an orthogonal factorization algorithm for shared memory systems, and have performed tests on a 30 processor Sequent Balance computer.

Finally in this section we mention two studies which directly compare the results of implementations of Givens rotations with Householder transformations on local memory systems. Pothen and Raghavan [147] have compared the earlier work of Pothen, Somesh, and Vemulapati [148] on a modified version of a greedy Givens scheme with a standard row-oriented version of Householder transformations. The schemes used here are very similar to the basic approach suggested originally by Golub, Plemmons, and Sameh [81] (see also [29]). Numerical tests were made on an Intel iPSC hypercube with 32 processors based on 80287 floating point coprocessors to illustrate the practicality of their algorithms. The tests by Kim, Agrawal, and Plemmons on a 64-processor Intel hypercube clearly favor their modified Householder transformation scheme.

5.2.2. Recursive least squares. In recursive least squares (RLS) it is required to recalculate the least squares solution vector x when observations (i.e., equations)

are successively added to or deleted from (10) without resorting to complete refactorization of the matrix A. For example, in many applications information continues to arrive and must be incorporated into the solution x; this is called updating. Alternatively, it is sometimes important to delete old observations and have their effects excised from x; this is called downdating. Applications of RLS updating and downdating include robust regression in statistics, modification of the Hessian matrix in certain optimization schemes, and estimation methods in adaptive signal processing and control.

There are two main approaches to solving RLS problems: the information matrix method, based on modifying the triangular matrix R in (11), and the covariance matrix method, based instead on modifying the inverse R⁻¹ and x at each recursive timestep. In theory, the information matrix method is based on modifying the normal equations matrix AᵀA, while the covariance matrix method is based on modifying the covariance matrix P. The covariance matrix P measures the expected errors in the least squares solution x to (10). Various algorithms for modifying R in the information matrix approach due to updating or downdating have been implemented on a 64-node Intel hypercube by Henkel, Heath, and Plemmons [92]. They make use of either plane rotations or hyperbolic type rotations.

The process of modifying least squares computations by updating the covariance matrix P has been used in control and signal processing for some time in the context of linear sequential filtering, and the Cholesky factor R⁻ᵀ for P is readily available in control and signal processing applications. Recently Pan and Plemmons [140] have described the following parallel scheme. We begin with estimates for P = R⁻¹R⁻ᵀ and x. Given the current least squares estimate vector x, the current factor L = R⁻ᵀ of P = (AᵀA)⁻¹, and the observation yᵀx = α being added, the algorithm computes the updated factor L̄ = R̄⁻ᵀ of P̄ and the updated least squares estimate vector x̄ as follows.

Algorithm (Covariance Updating).
1. Form the matrix-vector product (12);
2. Choose plane rotations Qᵢ to form … and update L;
3. Form … and update R⁻¹ to R̄⁻¹.
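The flavor of such recursive updating can be illustrated in the simpler information-matrix form, in which the new observation row is rotated directly into R. This NumPy sketch is our own illustration and is not the Pan-Plemmons covariance scheme (which updates L = R⁻ᵀ instead):

```python
import numpy as np

def rls_update(R, z, y, beta):
    """Rotate a new observation (y, beta) into the triangular factor.

    R is n-by-n upper triangular and nonsingular, z = Q^T b from the
    current factorization. Returns the updated (R, z) and the new
    least squares solution x of the augmented problem.
    """
    n = R.shape[0]
    R = R.copy(); z = z.copy()
    w = y.astype(float).copy(); b = float(beta)
    for i in range(n):
        r = np.hypot(R[i, i], w[i])
        c, s = R[i, i] / r, w[i] / r
        Ri, wi = R[i, i:].copy(), w[i:].copy()
        R[i, i:] = c * Ri + s * wi       # row i of the updated factor
        w[i:] = -s * Ri + c * wi         # w[i] is annihilated
        z[i], b = c * z[i] + s * b, -s * z[i] + c * b
    x = np.linalg.solve(R, z)
    return R, z, x
```

Each of the n rotations touches only the trailing part of one row of R, which is what makes pipelined and segmented parallel variants of this update natural.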

As the recursive least squares computation proceeds, a new equation is added, one complete recursive update is performed, L̄ replaces L, x̄ replaces x, and the process returns to step 1. Steps 1 and 3 are highly parallelizable, and effective implementation details of step 2 on a hypercube are given in [93]. An efficient parallel implementation of this algorithm on the hypercube distributed-memory system, making use of bidirectional data exchanges and some redundant computation, is given in [93]. Table 2 shows the speedup, with the corresponding efficiency, on an iPSC/2 hypercube (4 MB of memory for each processor) for a single phase of the algorithm on a test problem of size n = 1024. Here the speedup is given by

    speedup = (time on 1 processor) / (time on p processors),

with efficiency = speedup / p.

TABLE 2. Speedup and efficiency on the iPSC/2 for a problem of size n = 1024.

    Number of processors p    1      4      16     64
    Speedup                   1      3.90   15.06  48.60
    Efficiency                1      0.98   0.94   0.76

An alternative hypercube implementation of the RLS scheme of Pan and Plemmons [140] has been given by Chu and George [31].

6. Eigenvalue and singular value problems. 6.1. Eigenvalue problems. Solving the algebraic eigenvalue problem, either standard Ax = λx or generalized Ax = λBx, is an important and potentially time-consuming task in numerous applications. In this brief review, only the dense case is considered for both the symmetric and nonsymmetric problems. Most of the parallel algorithms developed for the dense eigenvalue problem have been aimed at the standard problem. Algorithms for handling the generalized eigenvalue problem on shared or distributed memory multiprocessors are very similar to those used on sequential machines. Reduction of the symmetric generalized eigenvalue problem to the standard form is achieved by a Cholesky factorization of the symmetric positive definite matrix B, which is well-conditioned in most applications. On a shared memory multiprocessor, the most efficient stage is the initial one of reducing B to the upper triangular form. This reduction process can be made efficient, for example, by adopting a block Cholesky scheme similar to the block LU decomposition discussed earlier to obtain the Cholesky factor L of B and to explicitly form the matrix L⁻¹AL⁻ᵀ using the appropriate BLAS3. For the nonsymmetric generalized eigenvalue problems, where the matrix B is known to be often extremely ill-conditioned in many applications, there is no adequate substitute for Moler and Stewart's QZ-scheme [136]. Dispensing thus with the generalized eigenvalue problems, the

remainder of the section will be divided between procedures that depend on reduction to a condensed form, and Jacobi or Jacobi-like schemes for both the symmetric and nonsymmetric standard eigenvalue problems. We start with the nonsymmetric case.

6.1.1. Reduction to a condensed form. For the standard problem the first step, after balancing, is the reduction to upper Hessenberg form via orthogonal similarity transformations. These usually consist of elementary reflectors, which could yield high computational rates on vector machines provided appropriate BLAS2 kernels are used. On parallel computers with hierarchical memories, block versions of the classical scheme yield higher performance than BLAS2-based versions, see, e.g., [16], [44], [86], and their use does not sacrifice numerical stability. Such block schemes are similar to those discussed above for orthogonal factorization, and block sizes can be as small as 2 for certain architectures. This block scheme requires more arithmetic operations than the classical algorithm using elementary reflectors by a factor of roughly 1 + 1/k, where we assume that the matrix A is of order n with n = kν + 2. For the sake of illustration we present a simplified scheme for this block reduction to the upper Hessenberg form:

    do j = 1, k
       do i = (j − 1)ν + 1, jν
          obtain an elementary reflector Pi = I − wi wiᵀ such that Pi
          annihilates the last n − i − 1 elements of the ith column of A;
          construct: …

Here, Zm consists of the last (n − m) rows of Vm. Performance of the block scheme on the Alliant FX/8 is shown in Fig. 16 [86]. The performance shown is based on the operation count of the nonblock algorithm.

The next stage is that of obtaining the eigenvalues of the resulting upper Hessenberg matrix via the QR-algorithm with an implicit shifting strategy. This algorithm consists mainly of chasing a bulge, represented by a square matrix of order 3 whose diagonal lies along the subdiagonal of the upper Hessenberg matrix. This in turn affects only 3 rows and columns of the Hessenberg matrix, leaving little that can be gained from vectorization and, to a lesser extent, parallelization. Stewart has considered the implementation of this basic iteration on a linear array of processors [172]. More recently, a block implementation with multiple QR shifts was proposed by Bai and Demmel [6] which yields some advantage for vector machines such as the Convex C-1 and Cyber 205. If we are seeking all of the eigenvectors as well, the performance of the algorithm is enhanced since the additional work required consists of computations that are

amenable to vector and/or parallel processing, namely, that of updating the orthogonal matrix used to reduce the original matrix to Hessenberg form. Such an update can be achieved by a minor modification of the above block reduction to the Hessenberg form. Figure 18 shows a comparison of the performance of this BLAS3-based block reduction with a BLAS2-based reduction on the Alliant FX/8 [86]. As before, the performance is computed based on the nonblock version operation count.

FIG. 16. Reduction to Hessenberg form on Alliant FX/8.

Similarly, the most common method for handling the standard dense symmetric eigenvalue problem on sequential machines consists of first reducing the symmetric matrix to the tridiagonal form via elementary reflectors, followed by handling the tridiagonal eigenvalue problem for obtaining its eigenvalues and eigenvectors. The classical procedure is inherently sequential, offering nothing in the form of vectorization or parallelism. On 1 CPU of a CRAY X-MP with an 8.5 ns clock, a BLAS2 implementation of Householder tridiagonalization using rank-2 updates (see [43]) yields a computational rate of roughly 200 Mflops for matrices of order 1000 (see Fig. 17 [87]). The performance of Eispack's TRED2 is also presented in the figure for comparison.

FIG. 17. Reduction to tridiagonal form on CRAY X-MP (1 CPU).

Once the tridiagonal matrix is obtained, two approaches have been used. If all the eigenvalues are required, a QR-based method is used. Recently, Dongarra and Sorensen [49] adapted an alternative due to Cuppen [33] for use on multiprocessors. This algorithm obtains all the eigenvalues and eigenvectors of the symmetric tridiagonal matrix. In its simplest form, the main idea of the algorithm may be outlined as follows. Let T = (βi, αi, βi+1) be the symmetric tridiagonal matrix under consideration, where we assume that none of its off-diagonal elements βi vanishes. Assuming that it is of order 2m, it can be written as the direct sum of two tridiagonal matrices T1 and T2, each of order m, plus a rank-one correction, in which τ is a "carefully" chosen scalar and ei is the ith column of the identity of order m. This in turn can be written in a form in which a scalar γ and a column vector v can be readily derived. Now we have two tasks, namely, obtaining the spectral decompositions of T1 and T2, i.e., Ti = Qi Di Qiᵀ, i = 1, 2. Thus, if Q = diag(Q1, Q2) and D = diag(D1, D2), where Qi is an orthogonal matrix of order m and Di is diagonal, then T is orthogonally similar to a rank-1

perturbation of a diagonal matrix, where ρ and z are trivially obtained from γ and v. The eigenvalues of T are thus the roots of the resulting secular equation, and its eigenvectors are given explicitly in terms of those roots; see [118] and [102]. This module may be used recursively to produce a parallel counterpart to Cuppen's algorithm [33], as demonstrated in [49]. For example, if the tridiagonal matrix T is of order 2^k m, then the algorithm will consist of obtaining the spectral decomposition of 2^k tridiagonal matrices each of order m, followed by k stages in which stage j consists of applying the above module simultaneously to 2^(k−j) pairs of tridiagonal matrices in which each is of order 2^(j−1) m.

If eigenvalues only (or all those lying in a given interval) or selected eigenpairs are desired, then a bisection-inverse iteration combination is used; see Wilkinson and Reinsch [195] or Parlett [141]. Such a combination has been adapted for the Illiac IV parallel computer, see [123], and later for the Alliant FX/8, see [58]. This modification depends on a multisectioning strategy in which the interval containing the desired eigenvalues is divided into (p − 1) subintervals, where p is the number of processors. Using the Sturm sequence property we can simultaneously determine the number of eigenvalues contained in each of the (p − 1) subintervals. This is accomplished by having each processor evaluate the well-known linear recurrence leading to the determinant of the tridiagonal matrix T − μI, or the corresponding nonlinear recurrence so as to avoid over- or underflow; see [141]. This process is repeated until all the eigenvalues are separated. This "isolation" stage is followed by the "extraction" stage, where the separated eigenvalues are evaluated using a root finder which is a hybrid of pure bisection and the combination of bisection and the secant methods, namely the ZEROIN procedure due to Brent and Dekker; see [167] or [195]. If eigenvectors are desired, then the final stage consists of a combination of inverse iteration and orthogonalization for those vectors corresponding to poorly separated eigenvalues.

This scheme proved to be the most effective on the Alliant FX/8 for obtaining all or a few of the eigenvalues only. Compared to its execution time on one CE, it achieves a speedup of 7.9 on eight CE's. Even if all the eigenpairs of the above tridiagonal matrix are required, e.g., for the tridiagonal matrix [−1, 2, −1] of order 500, this multisectioning scheme is more than 13 times faster, with the same achievable accuracy for the eigenvalues, than the best BLAS2-based version of Eispack's TQL2, and is more than four times faster than Eispack's TQL1. For the well-known Wilkinson matrices W±21, which have pairs of very close eigenvalues, it is 27 times faster than Eispack's pair Bisect and Tinvit, and five times faster than its nearest competitor, the parallel Cuppen's procedure [49], with the same accuracy in the computed eigenpairs. The multisectioning algorithm may not be competitive, however, if all the eigenpairs are required with high accuracy: for matrices with clusters of poorly separated eigenvalues, or clusters of computationally coincident eigenvalues, the multisectioning method requires roughly twice the time required by the parallel Cuppen's procedure in order to achieve the same accuracy for all the eigenpairs.
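The Sturm sequence evaluation at the heart of the multisectioning strategy can be sketched as follows. This is a serial NumPy illustration of ours; in the parallel scheme each processor evaluates such counts at the endpoints of its own subinterval simultaneously:

```python
import numpy as np

def sturm_count(d, e, mu):
    """Number of eigenvalues less than mu for the symmetric tridiagonal
    matrix with diagonal d and off-diagonal e, via the Sturm sequence
    recurrence, guarded against breakdown of the pivot q."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - mu - off / q
        if q == 0.0:       # avoid division by zero at the next step
            q = 1e-300
        if q < 0.0:
            count += 1
    return count

def bisect_eigenvalue(d, e, k, lo, hi, tol=1e-12):
    """k-th smallest eigenvalue (k = 1, 2, ...) in [lo, hi] by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) >= k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

The count function is exactly the "isolation" primitive: evaluating it at the (p − 1) interior points of a multisection immediately tells each processor how many eigenvalues its subinterval contains.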

FIG. 18. Reduction to tridiagonal form on Alliant FX/8.

Further studies by Simon [166] demonstrate the robustness of the above multisectioning strategy compared to other bisection-inverse iteration combinations proposed in [8]. Also, comparisons between the above multisectioning scheme and the parallel Cuppen's algorithm have been given by Ipsen and Jessup on hypercubes [103], indicating the effectiveness of multisectioning on distributed memory multiprocessors for cases in which the eigenvalues are not pathologically clustered.

6.1.2. Jacobi and Jacobi-like schemes. An alternative to reduction to a condensed form is that of using one of the Jacobi schemes for obtaining all the eigenvalues or all the eigenvalues and eigenvectors. Work on such parallel procedures dates back to the Illiac IV distributed memory parallel computer; see [152]. Algorithms for handling the two-sided Jacobi scheme for the symmetric problem, which are presented in that work, exploit the fact that independent rotations can be applied simultaneously. Also, several ordering schemes of these independent rotations are presented that minimize the number of orthogonal transformations (i.e., direct sums of rotations) within each sweep; see [194]. Much more work has been done since on this parallel two-sided

Jacobi scheme for the symmetric eigenvalue problem, primarily for the Illiac IV; see Brent and Luk [18]. More recent related parallel schemes, aimed at distributed memory multiprocessors as well, have been motivated primarily by the emergence of systolic arrays. Related schemes have been developed by Stewart [171] and Eberlein [52] for the Schur decomposition of nonsymmetric matrices. Also, in [152] a Jacobi-like algorithm for solving the nonsymmetric eigenvalue problem due to Eberlein [51] has been modified for parallel computations.

Unlike the two-sided Jacobi scheme, the one-sided Jacobi scheme due to Hestenes [94] requires only accessing of the columns of the matrix under consideration. This feature makes it more suitable for shared memory multiprocessors with hierarchical organization such as the Alliant FX/8. This procedure may be described as follows. Given a symmetric nonsingular matrix A of order n with columns ai, 1 ≤ i ≤ n, obtain through an iterative process an orthogonal matrix V such that AV = S, where S has orthogonal columns within a given tolerance. The orthogonal matrix V is constructed as the product of plane rotations, in which each is chosen to orthogonalize a pair of columns (ai, aj), i < j, so that the transformed columns satisfy s̃iᵀs̃j = 0 and ||s̃i||2 ≥ ||s̃j||2. This is accomplished via the quantities α = 2aiᵀaj, β = ||ai||2² − ||aj||2², and γ = √(α² + β²), with one formulation of the rotation parameters used if β is nonnegative and another otherwise. A most important byproduct of such investigation of parallel Jacobi schemes is a result due to Luk and Park [126], where they show the equivalence of various parallel Jacobi orderings to the classical sequential cyclic-by-row ordering, for which Forsythe and Henrici [57] proved convergence of the method. Several schemes can be used to select the order of the plane rotations; one sweep pattern for a matrix of order n = 8, related to the annihilation schemes recommended in [152], orders the rotations so that independent column pairs are handled simultaneously,
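A serial sketch of the one-sided scheme follows. This NumPy illustration is our own, and it uses a simple cyclic-by-pairs ordering rather than the parallel orderings discussed here; the rotation formulas are the standard ones built from the quantities α, β, and γ above:

```python
import numpy as np

def one_sided_jacobi(A, tol=1e-12, max_sweeps=30):
    """Hestenes' one-sided Jacobi: find orthogonal V with A V = S,
    the columns of S orthogonal within the given tolerance."""
    S = A.astype(float).copy()
    n = S.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = 2.0 * (S[:, i] @ S[:, j])
                beta = S[:, i] @ S[:, i] - S[:, j] @ S[:, j]
                gamma = np.hypot(alpha, beta)
                if abs(alpha) <= tol * gamma:
                    continue          # pair already orthogonal enough
                converged = False
                # rotation chosen so the new columns have zero inner product
                if beta >= 0.0:
                    c = np.sqrt((gamma + beta) / (2.0 * gamma))
                    s = alpha / (2.0 * gamma * c)
                else:
                    s = np.sign(alpha) * np.sqrt((gamma - beta) / (2.0 * gamma))
                    c = alpha / (2.0 * gamma * s)
                for M in (S, V):      # apply the same column rotation to S and V
                    Mi = M[:, i].copy()
                    M[:, i] = c * Mi + s * M[:, j]
                    M[:, j] = -s * Mi + c * M[:, j]
        if converged:
            break
    return S, V
```

Because each rotation touches only two columns, disjoint pairs can be processed by different processors, which is exactly what the parallel orderings exploit.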

where each sweep consists of n orthogonal transformations, each being the direct sum of no more than ⌊n/2⌋ independent plane rotations. An integer k = 8, for example, denotes that the column pairs (2,8), (3,7), (4,6) can be orthogonalized simultaneously by 3 independent rotations. The application of the subsequent plane rotations has to proceed sequentially, but some benefit due to vectorization can still be realized. After convergence of this iterative process, usually in a few sweeps, the matrix V yields a set of approximate eigenvectors from which the eigenvalues may be obtained via Rayleigh quotients. If the matrix A is positive definite, then its eigenvalues are taken as the 2-norms of the columns of S. Note that if A is not known to be nonsingular, we treat the eigenvalue problem Āx = (λ + α)x, where Ā = A + αI, with α being the smallest number chosen such that Ā is positive definite. On the Alliant FX/8, for matrices of size less than 150 or for matrices that have few clusters of almost coincident eigenvalues, this Jacobi scheme is faster than algorithms that depend on tridiagonalization.

6.2. Singular-value problems. Several algorithms have been developed for obtaining the singular-value decomposition on vector and parallel computers. Luk has used the one-sided Jacobi scheme to obtain the singular-value decomposition on the Illiac IV [124], and block variations of Jacobi's method have been attempted by Bischof on IBM's LCAP system [13]. Similarly, a block generalization of the two-sided Jacobi scheme has been considered by Van Loan [190] and Bischof [13] for distributed memory multiprocessors. The convergence of cyclic block Jacobi methods has been discussed by Shroff and Schreiber [165]. Jacobi schemes may also be used to efficiently obtain the singular-value decomposition of the upper triangular matrix resulting from the orthogonal factorization via block Householder transformations.

The most robust of these schemes are those that rely first on reducing the matrix to the bidiagonal form by using the sequential algorithm due to Golub and Reinsch [82]. The most obvious implementation of the reduction to the bidiagonal form on a parallel or vector computer follows the strategy suggested by Chan [26]. The matrix is first reduced to the upper triangular form via the block Householder reduction suggested in the previous section. This is then followed by the chasing of zeros via rotation of rows and columns to yield a bidiagonal matrix. Once the bidiagonal matrix is obtained, a generalization of Cuppen's algorithm (e.g., see [107]) may be used to obtain all the singular values and vectors. The same one-sided Jacobi scheme discussed above has proved to be most effective on the hierarchical memory system of the Alliant FX/8, leading to the achievement of high performance. On an Alliant FX/8, such a procedure results in a performance that is superior to the best vectorized version of Eispack's or LINPACK's routines, which are based on the algorithm in [82]. Experiments showed that the block-Householder reduction and one-sided Jacobi scheme combination is up to five times faster, with the same size residuals, than the best BLAS2-version of LINPACK's routine for matrices of order 16000 × 128 [12]. For tall and narrow matrices with certain distributions of clusters of singular values and/or extreme rank deficiencies, a generalization of the multisectioning algorithm may be used to obtain selected singular values and vectors.

7. Rapid elliptic solvers. In this section, we review parallel schemes for rapid elliptic solvers. We start with the classical Matrix Decomposition (MD) and Block-Cyclic Reduction (BCR) schemes for separable elliptic P.D.E.'s on regular domains. This is then followed by a Boundary Integral-based Domain Decomposition method for

handling the Laplace equation on irregular domains that consist of regular domains; examples of such domains are the right-angle or T-shapes.

Efficient direct methods for solving the finite-difference approximation of the Poisson equation on the unit square have been developed by Buneman [20], Hockney [96], [97], and Buzbee, Golub, and Nielson [22]; see also [142] and [173]. Excellent reviews of these methods on sequential machines have been given by Swarztrauber [177] and Temperton [182], [183]. The most effective sequential algorithm combines the block cyclic reduction and Fourier analysis schemes. This is Hockney's FACR(l) algorithm [97]. In [177] it is shown that the asymptotic operation count for FACR(l) on an n x n grid is O(n^2 log2 log2 n), and is achieved when the number l of block cyclic reduction steps preceding Fourier analysis is taken approximately as log2 log2 n. Using only cyclic reduction, or only Fourier analysis, to solve the problem on a sequential machine would require O(n^2 log2 n) arithmetic operations. Ericksen [55] considered the implementation of FACR(l) on the ILLIAC IV, and Hockney [98] compared the performance of FACR(l) on the CRAY-1, Cyber-205, and the ICL-DAP.

7.1. A domain decomposition MD-scheme. Buzbee [21] observed that Fourier analysis, or the matrix decomposition Poisson solver (MD-Poisson solver), is ideally suited for parallel computation. On a parallel computer consisting of n^2 processors, with an arbitrarily powerful interconnection network, the MD-Poisson solver for the two-dimensional case requires O(log2 n) parallel arithmetic steps [160]. It can be shown, however, that a perfect shuffle interconnection network is sufficient to keep the communication cost to a minimum.

We consider first the MD-algorithm for solving the 5-point finite difference approximation of the Poisson equation on the unit square with a uniform n x n grid, where for the sake of illustration we consider only Dirichlet boundary conditions. The multiprocessor version of the algorithm presented below can be readily modified to accommodate Neumann and periodic boundary conditions. Using natural ordering of the grid points, we obtain the well-known linear system of order n^2:

    tridiag[-I_n, T, -I_n] u = f,

where T = [-1, 4, -1] is a tridiagonal matrix of order n. The parallel MD-scheme consists of 3 stages [154]: it consists of performing sets of independent sine transforms and solving sets of independent tridiagonal systems.

Stage 1. Each cluster j, 1 ≤ j ≤ 4 (a four cluster Cedar is assumed), forms the subvectors f_{(j-1)q+1}, ..., f_{jq} of the right-hand side, where q = n/4. Next each cluster j obtains g_j = (g_{(j-1)q+1}, ..., g_{jq}), where g_k = Q f_k, in which Q = [(2/(n+1))^{1/2} sin(lmπ/(n+1))] is the eigenvector matrix of T. This amounts to performing in each cluster q sine transforms, each of length n.
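The multiple-sine-transform kernel of Stage 1 can be realized with one batched FFT over an odd extension of the data; a serial sketch (NumPy assumed; a parallel version would distribute the columns across clusters) is:

```python
import numpy as np

def dst1_columns(F):
    """Apply the orthogonal sine-transform matrix
    Q = (2/(n+1))^{1/2} [ sin(l*m*pi/(n+1)) ] to every column of F at
    once, via one batched FFT of length 2(n+1) on an odd extension.
    Serial sketch of the Stage-1 kernel."""
    n, m = F.shape
    ext = np.zeros((2 * (n + 1), m))
    ext[1:n + 1, :] = F            # odd extension of each column:
    ext[n + 2:, :] = -F[::-1, :]   # ext[2(n+1)-j] = -F[j]
    Y = np.fft.fft(ext, axis=0)
    S = -0.5 * np.imag(Y[1:n + 1, :])   # the pure sine sums
    return np.sqrt(2.0 / (n + 1)) * S
```

Because Q is symmetric and orthogonal (Q^2 = I), the same routine serves for both the forward transforms of Stage 1 and the inverse transforms of Stage 3.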

Stage 2. Setting v_k = Q u_k, with v_j = (v_{(j-1)q+1}, ..., v_{jq}) for 1 ≤ j ≤ 4, we obtain the transformed system

(7.1)    tridiag[-I_n, Λ, -I_n] v = g,

where each cluster memory contains one block row; here Λ = diag(λ_1, ..., λ_n), in which λ_k = 4 - 2 cos(kπ/(n+1)), and the portion M of the matrix held by each cluster is a block tridiagonal matrix of order qn. Observing that M consists of n independent tridiagonal matrices T_k = [-1, λ_k, -1], each of order q, the vectors h_j = (h_{(j-1)q+1}, ..., h_{jq}) and the matrices F and G, defined by M h_j = g_j, M F = E, and M G = Ê, in which the columns of E and Ê are those columns of the identity that correspond to the partition boundaries, may all be computed via n independent tridiagonal solves per cluster. In particular, in order to obtain F and G we need only solve in each cluster the n independent systems T_k c_k = e_1 and T_k d_k = e_q, for k = 1, 2, ..., n, where e_i is the ith column of I_q. Since T_k is a Toeplitz matrix, we have c_k = J d_k, in which J = [e_q, ..., e_1]; see [111], for example. As a result, only the n systems T_k d_k = e_q need actually be solved.
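Each T_k = tridiag(-1, λ_k, -1) is diagonally dominant, so the Thomas algorithm applies without pivoting; a minimal serial sketch of one such solve (NumPy assumed; in the MD-scheme the n solves for k = 1, ..., n are independent and proceed in parallel) is:

```python
import numpy as np

def solve_Tk(lam, b):
    """Thomas-algorithm sketch for T_k x = b, where
    T_k = tridiag(-1, lam, -1) of order q and lam = lambda_k
    = 4 - 2 cos(k*pi/(n+1)) > 2, so no pivoting is needed."""
    q = len(b)
    d = np.empty(q)                    # modified diagonal
    y = np.array(b, dtype=float)
    d[0] = lam
    for i in range(1, q):              # forward elimination
        m = -1.0 / d[i - 1]            # multiplier for the -1 subdiagonal
        d[i] = lam + m
        y[i] -= m * y[i - 1]
    x = np.empty(q)
    x[-1] = y[-1] / d[-1]
    for i in range(q - 2, -1, -1):     # back substitution (-1 superdiag)
        x[i] = (y[i] + x[i + 1]) / d[i]
    return x
```

The Toeplitz-persymmetry remark in the text can be checked directly: the solution for right-hand side e_1 is the reversal of the solution for e_q, so c_k = J d_k and one solve suffices.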

From the structure of (7.1) it is seen that the three pairs of n equations above and below each partition are completely decoupled from the rest of the n^2 equations [161]. This reduced system, of order 6n, consists of interlocking blocks and comprises, in turn, n independent pentadiagonal systems, each of order 6, which can be solved in a very short time. Now, assuming that the subvectors v_{kq} and v_{kq+1}, k = 1, 2, 3, are available, each cluster j obtains

    v_{(j-1)q+i} = h_{(j-1)q+i} - (τ_i v_{(j-1)q} + τ_{q-i+1} v_{jq+1}),   i = 2, ..., q - 1,

where v_0 = v_{4q+1} = 0.

Stage 3. Finally, each cluster j retrieves the q subvectors u_{(j-1)q+i} = Q v_{(j-1)q+i} of the solution via q sine transforms, each of length n. Note that one of the key computational kernels in this algorithm is the calculation of multiple sine transforms. In order to design an efficient version of this kernel it is necessary to perform an analysis of the influence of the memory hierarchy similar to that presented above for the block LU algorithm. Such an analysis is contained in [74].

7.2. A modified block cyclic reduction. Block cyclic reduction (BCR) dates back to the work of Hockney and was presented in [22] in its stabilized form due to Buneman. The discretization of the separable elliptic equation with Dirichlet boundary conditions and a five-point stencil on a naturally ordered n x m grid defined on a rectangular region leads to a block tridiagonal system of the form tridiag[-I, A, -I] u = f, where A and I are respectively tridiagonal and identity matrices of order m. BCR is a rapid elliptic solver (RES) having sequential computational complexity O(nm log n). The work in [176], [178], [180] resulted in the development of FISHPAK, a package based on BCR for the solution of (16) and extensions thereof. Assuming that n = 2^k - 1, the idea of the method at reduction step r, r = 1, 2, ..., k - 1, is to combine the current 2^{k-r+1} - 1 vectors into 2^{k-r} - 1 ones, and then to solve a system of the form

    p_{2^{r-1}}(A) Y = C,

where Y ∈ R^{m x 2^{r-1}} and p_{2^{r-1}}(A) is a Chebyshev polynomial of degree 2^{r-1} in A. Since its roots λ_i are known, p_{2^{r-1}}(A) can be written in product form, where each factor is tridiagonal; hence the system to be solved becomes a sequence of tridiagonal solves, one for each factor.
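Rather than sweeping through the tridiagonal factors sequentially, the inverse of the factored polynomial can also be applied as a sum of mutually independent shifted solves via a partial-fraction expansion, which is the basis of the parallel BCR variant. A dense NumPy sketch (np.linalg.solve standing in for the tridiagonal solves; distinct roots assumed) is:

```python
import numpy as np

def apply_poly_inverse(A, roots, y):
    """Apply [p(A)]^{-1} y via partial fractions, where p is monic with
    known, distinct roots:
        [p(A)]^{-1} y = sum_j a_j (A - roots[j] I)^{-1} y,
    with a_j = 1/p'(roots[j]) = 1/prod_{mu != roots[j]}(roots[j] - mu).
    The shifted solves are independent (hence parallel); in BCR each
    A - roots[j] I is tridiagonal, so a tridiagonal solver would replace
    the dense solve below."""
    m = A.shape[0]
    x = np.zeros(m)
    for lj in roots:
        aj = 1.0 / np.prod([lj - mu for mu in roots if mu != lj])
        x += aj * np.linalg.solve(A - lj * np.eye(m), y)
    return x
```

The sequential product form requires one solve after another, while the partial-fraction form performs all 2^{r-1} shifted solves at once, trading a modest amount of extra arithmetic for parallelism.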

In summary, the method is based on expressing the matrix rational function [p_{2^{r-1}}(A)]^{-1} as a partial fraction, i.e., as a linear combination of the 2^{r-1} components (A - λ_j^{(r)} I)^{-1}. The coefficients a_j^{(r)} are equal to 1/p'_{2^{r-1}}(λ_j^{(r)}) and can be derived analytically. Clearly, as r increases, the effectiveness of a parallel or vector machine to handle (17) decreases rapidly. A parallel version of BCR was recently discovered [70]; see [69], [71], [72] for details. For a discussion of parallel BCR on distributed memory machines see [73]. Figure 19 shows the performance of the parallel and standard BCR on the Alliant FX/8. Partial fraction decomposition can also be applied to the parallel solution of parabolic equations [179], [181].

FIG. 19. Parallel and standard BCR on n x n grid on Alliant FX/8.

7.3. Boundary integral domain decomposition. A new method (BIDD) was recently proposed for the solution of Laplace's equation [68]. The method is

characterized by the decoupling of the problem into independent subproblems on subdomains. An approximation ũ to the solution u is sought as a finite linear combination of N fundamental solutions [128], φ_j(z) = -(1/2π) log|z - w_j|, of ∇²u = 0:

    ũ(z) = Σ_{j=1}^{N} a_j φ_j(z).

For a given set of N points w_j lying outside the domain, a ∈ R^N is computed to minimize ‖g - Ga‖_p for some norm p, where g ∈ R^n consists of boundary values for u, and G ∈ R^{n x N} is the influence matrix consisting of fundamental solutions based at the w_j for each boundary point. Once a has been computed, the solution at any μ points on the domain is ũ = Ha, with H ∈ R^{μ x N} being the influence matrix for the μ points. Choosing these n points to be subdomain boundary points, we can compute the solution by applying the elliptic solvers most suitable for each subdomain.

REFERENCES

[1] IBM Engineering and Scientific Subroutine Library Guide and Reference, IBM, 1986.
[2] R. AGARWAL AND F. GUSTAVSON, A parallel implementation of matrix multiplication and LU factorization on the IBM 3090, in Aspects of Computation on Asynchronous Parallel Processors, M. Wright, ed., North-Holland, Amsterdam, 1989, pp. 217-221.
[3] A. AGGARWAL, B. ALPERN, A. CHANDRA, AND M. SNIR, A model for hierarchical memory, in Proc. 19th ACM Symp. Theory of Computing, New York, 1987, pp. 305-314.
[4] H. AHMED, J. DELOSME, AND M. MORPH, Highly concurrent computing structures for matrix arithmetic and signal processing, Computer, 15 (1982), pp. 65-82.
[5] J. ARMSTRONG, Algorithm and performance notes for block LU factorization, in Proc. 1988 Intl. Conf. Par. Processing, IEEE Computer Society Press, 1988, pp. 161-164.
[6] Z. BAI AND J. DEMMEL, On a block implementation of Hessenberg multishift QR iterations, Tech. Rep., Courant Institute of Mathematical Sciences, New York University, New York, NY, 1988.
[7] D. BAILEY, Extra high speed matrix multiplication on the CRAY-2, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 603-607.
[8] H. BERNSTEIN AND M. GOLDSTEIN, Optimizing Givens' algorithm for multiprocessors, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 601-602.
[9] M. BERRY AND A. SAMEH, Multiprocessor schemes for solving block tridiagonal linear systems, Intl. J. Supercomputer Appl., 2 (1988), pp. 37-57.
[10] M. BERRY, K. GALLIVAN, W. HARROD, W. JALBY, S. LO, U. MEIER, B. PHILIPPE, AND A. SAMEH, Parallel numerical algorithms on the Cedar system, in CONPAR 86, Handler et al., eds., Lecture Notes in Computer Science, Springer-Verlag, Berlin, 1986.
[11] M. BERRY AND R. PLEMMONS, Algorithms and experiments for structural mechanics on high performance architectures, Comput. Methods Appl. Mech. Engrg., 64 (1987), pp. 487-507.
[12] M. BERRY AND A. SAMEH, Parallel algorithms for the singular value and dense symmetric eigenvalue problems, CSRD Rept. 761, Center for Supercomputing Research and Development, University of Illinois, Urbana, IL, 1988.
[13] C. BISCHOF, Computing the singular value decomposition on a distributed system of vector processors, Tech. Rep. TR86-798, Dept. of Computer Science, Cornell University, Ithaca, NY, 1986.
[14] ——, QR factorization algorithms for coarse-grained distributed systems, Tech. Rep. TR 88-939, Dept. of Computer Science, Cornell University, Ithaca, NY, 1988.
[15] ——, A pipelined block QR decomposition algorithm, in Proc. of Third SIAM Conf. on Par. Processing for Scientific Computing, G. Rodrigue, ed., Society for Industrial and Applied Mathematics, Philadelphia, 1988.
[16] C. BISCHOF AND C. VAN LOAN, The WY representation for products of Householder matrices, SIAM J. Sci. Statist. Comput., 8 (1987), pp. s2-s13.

Chr. pp. 1984. SESEK. CHEN. Amsterdam. Updating and downdating the inverse of a Cholesky factor on a hypercube multiprocessor. HOFFMAN. [37] J. A fast Poisson solver amenable to parallel computation. ACM Trans. A collection of equation solving codes for the CRAY-1. Argonne National Laboratory. 48 (1986). D O N G A R R A . Oak Ridge National Lab. Michelsen Institute. Waterloo. SEL 133. Rep. [24] D. Dept. pp. Engrg. C H A M B E R L A I N AND M. TR86-28. J. 10 (1984). Bergen. D O N G A R R A . Tech. pp. 3 (1974). QR factorization of a dense matrix on a hypercube multiprocessor. BRENT. of Computer Science. Tech. ElSENSTAT. D. Comput. TM-88. Statist. [26] T. 5 (1984). Argonne. Intl.. Math. TM-41. J. Parallel Comput. J. [29] .. Tech.. Argonne National Laboratory. GALLIVAN. BRENT AND F. Philadelphia. Argonne. DUFF. 1988. [41] J. CCS 86/10. Comput. Statist. Rep.. C A L A H A N . Ann Arbor. 1986. TM-89. 91-112. pp. CS-88-45. University of Waterloo. Numerically stable solution of dense systems of linear equations using mesh-connected processors. 1979. Squeezing the most out of an algorithm in CRAY Fortran. H A M M A R L I N G . B O J A N C Z Y K . SORENSEN. 273-280. Comput. AND L. 5 (1987). KARP. L. B R O N L U N D AND T. pp. A. D O N G A R R A . [31] . R. Dept. H A M M A R L I N G . Engrg. Sci. QR-factorization of partitioned matrices. Tech. 7 (1970). HANSON. [43] J. 1986. IL. Tech. Numer. 793-796. A compact non-iterative Poisson solver. TM-46. Anal. Methods Appl. Numer. pp. [38] J. Squeezing the most out of eigenvalue solvers on high-performance computers. S. A proposal for a set of Level 3 basic linear algebra subprograms. CROZ. Workshop on the Level 3 BLAS. 4 (1978). Norway. I. pp.. . Argonne. AND S.. Block-oriented local-memory-based linear equation solution on the CRAY-2: uniprocessor algorithms. Rep. Parallel QR decomposition of a rectangular matrix. CS-88-46. New York. K A U F M A N . Mech. An improved algorithm for computing the singular value decomposition. 
QR factorization for linear least squares on the hypercube. LUK. DEMMEL. AMES. Software. CROZ. Tech. Par. pp. Mathematics and Computer Science Div. GEORGE. A divide and conquer method for the symmetric tridiagonal eigenproblem. 270-277. Processing. The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays. Argonne National Laboratory. Sci. CUPPEN. Report 294. DEKKER AND W. 9 (1976). H. BUZBEE.76 K. On direct methods for solving Poisson's equation. 239-249. J. Oak Ridge. IL. HAMMARLING. Practical parallel band triangular system solvers. D. Canada. MI. 65-74. SIAM Rev. AND G.. Numer. SIAM J. J. CROZ. AND A. 1986. pp. Tech. A proposal for an extended set of Fortran basic linear algebra subprograms. Gaussian elimination with partial pivoting and load balancing on a multiprocessor. TM-97. CHU AND A.. C-22 (1973). [34] T. Rehabilitation of the Gauss-Jordan algorithm. ACM Trans. 8 (1982). Math. 6 (1985). 1969. [39] J.. [25] R. J. SIAM J. GOLUB. DONGARRA. ORNL/TM-10691. Rep. D O N G A R R A . Prospectus for the development of a linear algebra library for highperformance computers. AND D. of Mathematics. DIETRICH. 1979. Waterloo. COSNARD. Stanford University Institute for Plasma Research. CA. W. SIAM J. SAMEH [17] A. AND E. W. S. Faculty of Mathematics. J. Tech. Tech. HAMMARLING.. pp. 219-230. Mathematics and Computer Science Div. C. Argonne. IEEE Trans. Rep. 1987. CALAHAN. Tech. Rep. AND A. Stanford. TN. BUZBEE.. [18] R. 1987. [20] O. DONGARRA. Math. BUNCH. 95-104. R. S. [30] . 177-195. [40] J. Comput. Rep. SAMEH. 375-378. MULLER. pp. pp. [36] G. University of Amsterdam. NIELSON. D. Math. Society for Industrial and Applied Mathematics. 36 (1981). Tech. Rep. Tech. A new formulation of the hypermatrix Householder-QR decomposition. KUCK. University of Michigan. ACM Trans. C H A N . IL. AND Y. [21] B. POWELL.. IL. Rep. 1988. ROBERT. 26 (1984). AND A. AND C. DONGARRA AND S. Software. pp. IL. 
Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. D O N G A R R A . [27] S. Mathematics and Computer Science Div.. [22] B. J O H N S E N . PLEMMONS. Conf. UNPACK User's Guide. 1988. [28] E. University of Waterloo. Software. [32] M. 153-172. F. Canada. [33] J. 1985. AND R. Argonne National Laboratory. 72-83. GUSTAVSON. Math. AND H.. Argonne National Laboratory. [23] D. [19] O. G. A. MOLER. K U N G . 627-656. Rep. Rep.. Methods Appl. Mathematics and Computer Science Div. B U N E M A N . 1987. D. Argonne. Rep. Comput. Mathematics and Computer Science Div. [42] J. pp. GREENBAUM. [35] J. 69-84. in Proc. A balanced submatrix merging algorithm for multiprocessor architectures. IEEE Computer Society Press. Mech. STEWART.

E. S. Conf. Argonne National Laboratory. 9 (1988). Statist. JALBY. Tech. IL. Linkoping. s!39-s!54. W. pp. 488-499. HEWITT. 17-31. IEEE Trans. 167-174. W. [69] . 1988. Tech. SALMON. 297: Proc. Rep. 1989. [50] J. 1. 433-442. Supercomputing. 1972. FORSYTHE AND P. Urbana.. AND U. SAMEH. pp. 4 (1987). TAYLOR. in Lecture Notes in Comput. A fully parallel algorithm for the symmetric eigenvalue problem. Solving Problems on Concurrent Processors. pp. 94 (1960). FOX. IL..PARALLEL DENSE LINEAR ALGEBRA ALGORITHMS 77 [44] J. The cyclic Jacobi method for computing the principal values of a complex matrix. Computer Methods for Mathematical Computations. pp. MALONY. Springer-Verlag. First edition. LYZENGA. GALLIVAN. Berlin. HENKEL.. A. FOX. Sweden. Software. ElSENSTAT. Argonne. J. OTTO. 10 (1962). [63] K. Supercomputing. 1987. [51] P. [65] K. New York. D. Fox. AND D. 10791084.. WALKER. SORENSEN. Iterative and direct methods for solving Poisson's equation and their adaptability to Illiac IV. Center for Advanced Computations. DONGARRA. Modern Computing Methods. of . MEIER. Comput.. [59] G. Statist. G A N N O N . Lith-MAT-R-1988-02. MALCOLM. A. HEATH. [46] J. F. January 1987. Philosophical Library. M. pp. Argonne National Laboratory. Implementing dense linear algebra algorithms using multitasking on the CRAY X-MP/4. MOLER. 1--23. AND A. on Measuring and Modeling Computer Systems.. GALLIVAN. MICHEL. AND H. [49] J. Block reduction of matrices to condensed form for eigenvalue computations. [53] S. On some parallel banded system solvers. SIAM J. 589-600. Parallel Comput. [58] G. [52] . [55] J. Comput. The use 0/BLAS3 in linear algebra on a parallel processor with a hierarchical memory. JALBY.. A parallel QR decomposition algorithm. Soc. Presented at the Level 3 BLAS Workshop. GALLIVAN. MEIER. EBERLEIN. ACM Press. Parallel Comput. LA-6774. Impact of hierarchical memory systems on linear algebra algorithm design. AND D. 7 (1981). DONGARRA AND L. University of Illinois. GALLOPOULOS AND D. 
ACM Press. New York. Behavioral characterization of multiprocessor memory systems. Boundary integral domain decomposition on hierarchical memory multiprocessor. Matrix algorithms on a hypercube I: matrix multiplication. [56] K.. Rep. C-36 (1987). ROMINE. J. AND D. NJ. Rep. [48] J. H A M M A R L I N G . SIAM J. 25-34. WIJSHOFF. AND A. S. 79-89. [45] J. 219-246. Linkoping University. Statist. GALLIVAN. S. [67] K. Implementation of some concurrent algorithms for matrix factorization. Some linear algebra algorithms and their performance on CRAY1. LEE. U. Solving large full sets of linear equations in a paged virtual store. G A N N O N . 8 (1987). pp. 8 (1987). 347-350. 1961. HENRICI.. Conf. Parallel Dist. 1988. 527-536. pp. Indust. 74-88. 1042-1073. M. DU CROZ. [62] L. in Proc. AND J. pp. in Proc. [57] G. 1 (1984). Mathematics and Computer Science Div. Parallel Comput. TM-99. Modified cyclic algorithms for solving triangular systems on distributed memory multiprocessors. Englewood Cliffs. 1987. ACM Trans. W. C. NUGENT. 223-235. Math. Strategies for cache and local memory management by global program transformation. Domain decomposition in distributed and shared memory environments. pp. D O N G A R R A . AND H. Sci. pp. Appl. Fox. OLVER. Cornput. W I L K I N S O N . NM. Englewood Cliffs. [47] J. [68] E. JALBY. pp. J. Statist. WIJSHOFF. AND W. ERICKSEN. Los Alamos. JORDAN. Trans. [60] G. [61] G. Parallel Comput. Conf. ACM Press. CAC Doc. REID. Solving banded systems on a parallel processor. J. Comput. pp. JOHNSSON. Fast Laplace solver by boundary integral-based domain decomposition. 1989 ACM SIGMETRICS Conf. Tech. Comput.. D. Performance prediction of loop constructs on multiprocessor hierarchical memory systems. Intl. SIAM J. 60. DONGARRA AND A. GALLIVAN. JOHNSON. 1988 Intl. 587616. [64] K. Supercomputer Appl. pp. A Jacobi-like method for the automatic computation of eigenvalues and eigenvectors of an arbitrary matrix. MALONY. S. Sci. 
On the Schur decomposition of a matrix for parallel computation. AND C. FORSYTHE. FONG AND T. Sci. pp. Los Alamos National Laboratory. ELDEN. Amer. New York. [54] L. JALBY.. NJ. A. SORENSEN. Math. HEY. New York. DONGARRA AND T. pp. in Proc. Prentice-Hall. J.. Tech. 1989 Intl. 5 (1988). in Proc. SAMEH. W. 1987. [66] K. SAMEH. OTTO. pp. Sci. AND C. 12-48. 7 (1986). SIAM J. Supercomputing. pp. G. J. Computing. Sci. GOODWIN. pp. 3 (1986). 2 (1988). M. Math. 1977. Rep. DONGARRA AND D. Vol. 5 (1987). Prentice-Hall. 1989.. AND D. SORENSEN. Soc. 1977. JALBY. 1987 Intl.

1985. Recursive least squares on a hypercube multiprocessor using the covariance factorization. Wilhelmson. Fox. in Hypercube Multiprocessors 1986. Tech. Parallel block cyclic reduction algorithm for the fast solution of elliptic equations. IL. W. ORNL-6190.. Singular value decomposition and least squares solutions. A. Comput. D. . Parallel Processing for Scientific Computing. The influence of memory hierarchy on algorithm organization: programming FFTs on a vector multiprocessor. 277-301. 1988. HEATH AND C. Parallel Comput. Philadelphia. R. Cholesky downdating on a hypercube. SIAM Rev. Dept. 484-496. SAMEH. HEATH. In- [70] [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] . R. Processing for Scientific Computing. J. 9 (1988). W. private communication. LU factorization algorithms on distributed memory multiprocessor architectures. SIAM J. Urbana. D. A survey of parallel algorithms for numerical linear algebra. Parallel block schemes for large-scale least squares computations. PLEMMONS. pp. Tech. Tech. C. 403-420. VAN LOAN. Cambridge. and R. Oak Ridge National Laboratory. C. GOLUB AND C. Rep. Center for Supercomputing Research and Development. AND K. pp. 1981. ed. Rep. 15921598. 180-195. A. Oak Ridge National Lab. Parallel solution of triangular systems on distributed memory multiprocessors.. Society for Industrial and Applied Mathematics. GEIST AND M. ORNL-6150. 161-180. June 1989. IL. NC.. G A N N O N . Efficient parallel solution of parabolic equations: implicit methods.. CSRD Rept. Scientific Applications and Algorithm Design. T. TN. 1988. HARTEN.. Efficient parallel solution of parabolic equations: explicit methods.. Rep. Inversion of matrices by biorthogonalization and related results. ACM Press. G. Statist. 20 (1978). G. To be presented at the Fourth SIAM Conf. Oak Ridge. pp. 1988. University of Illinois. 1988. Matrix triangularization by systolic arrays. GOLUB AND C. Int'l. 1988. 
G A N N O N AND J. M. eds. AND A. SAMEH Third SIAM Conf. HARROD. 1986. 1985. 143-160. pp. Urbana. M. pp. KUNG. 19-26. pp. GUSTAVSON. Oak Ridge. 740-777. Baltimore. SIAM J. 14 (1970). June 1989. GOLUB. On the impact of communication complexity on the design of parallel numerical algorithms. CA. G. SPIE 298. GALLIVAN. Comput. pp. University of Illinois. 253276. 1988 Intl. F. University of Illinois Press. MIT Press. SAAD. D. T.78 K. R. The Johns Hopkins University Press. M. 11 (1990). . JALBY. Douglass. San Diego. Center for Supercomputing Research and Development. Computer Science. Society for Industrial and Applied Mathematics. Athens. M. Comput. PLEMMONS. IL. pp. Tech. JALBY. Conf. HEATH. ed. 696. MIT Press.. pp.. 639-649.. Parallel Cholesky factorization in message passing multiprocessor environments. Math. A block scheme for reduction to condensed form. J. Rep.. University of Illinois. HESTENES. . in Proc. 13 (1976). Some fast elliptic solvers for parallel architectures and their complexities. GALLIVAN. PLEMMONS. Heath. J. Soc. L. pp. Conf. Supercomputing. 11801195.. G. On the problem of optimizing data transfers for complex memory systems. Sci. Real Time Signal Processing. H. ACM Press. private communication. G A N N O N AND W. 1987. Philadelphia. AND A. D. TN. ed. Rep. 1 (May 1989). in The Characteristics of Parallel Algorithms. ROMINE. W. New York. G. Rep. Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems. ed. SIAM J. GENTLEMAN AND H.. Matrix factorization on a hypercube. A. on Par. HELLER. Cambridge. in Proc.. in High Speed Computing. pp. Jamieson. Tech... AND R. IL. 558-588. L. GALLOPOULOS AND Y. to appear. To be presented at the Fourth SIAM Conf. REINSCH. SIAM J. Anal. Argonne.. Statist. Comput.. in Hypercube Multiprocessors 1988. pp. pp. HEATH. GEIST AND C. Parallel Cholesky factorization on a hypercube multiprocessor. Numer. Rodrigue. 9 (1988). North Carolina State University. D. IEEE Trans.. Programming with the BLAS. 
PLEMMONS. and R. Sci. on Supercomputing. . HENKEL AND R. Statist. pp. High Speed Comput. P. Greece. Gannon. 1988. Tech. Numer. Gannon. 1983. in The Characteristics of Parallel Algorithms. Jamieson. Douglass. . D. Matrix Computations. 1987.. E. Raleigh. 238-253. Urbana. Sci. . C-33 (1984). ROMINE. VAN ROSENDALE. Also presented at 1987 Int'l. 113-141. pp. HENKEL. eds. Center for Supercomputing Research and Development.. Parallel Processing for Scientific Computing. M. 10 (1989). New York.

[110] L. Tech. Computer Science.. 95-113. [114] M. pp. Tech. pp. Matrix operations on [115] D. Loss of orthogonality in a Gram-Schmidt process. AND R. 271-288. in Performance Evaluation of Supercomputers. 205-239. Yale University. 1967. 135 211. 1987. [105] W. LAWSON. K I N C A I D . 1972. Comput. DAVIDSON. Amsterdam. Statist. pp. pp. Tech. 1982. Optimizing matrix operations on a parallel mulitprocessor with a hierarchical memory system. Conf. SAAD. Par. pp. pp. [101] J. IL. JALBY AND U. D. 1988. pp. 10 (1984). 1987. AL. Martin. JOHNSSON AND C. IEEE Trans. The Structure of Computers and Computations. Elsevier Science Publishers B. OKAWA. IL. R. A fast direct solution of Poisson's equation using Fourier analysis. North-Holland. Exploiting fast matrix multiplication with the Level 3 BLAS. J. Raleigh.. ViEIRA. 20 (1978). Tech. IRISA. Par. 185-195. New Haven. France. JALBY AND B. SAMEH. MEIER. New Haven. IPSEN AND E. K U N G . JACM. ILLIAC IV Doc. Software. ILLIAC IV software and application programming. JOHNSSON. Theory of Computing. Yale University. Rep. Rep. [96] R. 1989. SORENSEN. KUCK A N D A. Parallel computations of eigenvalues of real matrices. HANSON. Linear Algebra Appl. IEEE Trans. Sci. pp. New York. . 13th ACM Symp. Y. Mathematics and Computer Science Div. IPSEN. Appl. Comput. E. Science. [102] H. Dept. Rep.. [99] . 1266 1272. Rep. innovations and orthogonal polynomials. LAWRIE. Rep. 1987. Dept. Solving tridiagonal systems on ensemble architectures. AGRAWAL. Math. Tech. Processing. I/O complexity: the red-blue pebble game. Basic linear algebra subprograms . AND R. 8 (1987). Methods Comput. [103] I. 1986. PLEMMONS. Adam Hilger. 1988. Tech. ACM Trans Math. Rep. New York. 1987. SUTER. Rep. KATHOLI AND B. University of Illinois. [108] L. QR factorization of a rectangular matrix. TR 89-984. 231 (1986). T. [120] D. 118. Recursive least squares filtering for signal processing on distributed memory multiprocessors. to appear. JESSUP. 
North Carolina State University. Algorithms for multiplying matrices of arbitrary shapes using shared memory primitives on Boolean cubes. KlM. Parallel Proc. KROGH. HlGHAM. Bristol. Phys. Intl. [116] [117] D. Inter. ed. of Computer Science. 1988. Amsterdam.. HOCKNEY. Math. Dept. D. First edition. 1978. [Ill] T.. Access and alignment of data in an array processor. IEEE Computer Society Press. Parallel Computers. [118] D. Ithaca. University of Illinois. LAWRIE AND A. Inversion of Toeplitz operators. Intl.PARALLEL DENSE LINEAR ALGEBRA ALGORITHMS 79 dust. IL. pp. Tech.V. YALEU/DCS/TR-569. CAC Doc. Dept. 1145-1155. [95] N. HOCKNEY AND C. NC. pp. Parallel supercomputing today and the Cedar approach. Center for Advanced Computation. AND M. 354-392.. AND M. pp. D. PHILIPPE. H U A N G . WILHELMSON.. IFIP Congress 1971. University of Alabama at Birmingham. SAMEH. Argonne National Laboratory. Y. in Proc. The potential calculation and some applications. B. 11 (1985). Rennes. T. [107] E. [119] D. Software. [109] . pp. 77 (1986). J. SCHULTZ. KNOWLES. [113] S. TR88-07. SAMEH. Processing. LAWRIE.. Optimizing the FACR(l) Poisson solver on parallel computers. Rep. JESSHOPE. Birmingham. 1981. [112] C. [104] I. MORE. Complexity of dense linear system solution on a multiprocessor ring. SIAM J. Rep. CT. Tech. 215 235. A. pp. C-17 (1968).. RR-548. pp. MURAOKA. in Proc. HONG AND H. 109. A parallel algorithm for symmetric tridiagonal eigenvalue problems. 6 (1958). Vol. AND A.. Solving the symmetric tridiagonal eigenvalue problem on the hyper cube. Solving narrow banded systems on ensemble architectures. [106] W. 1. Comput. 1974. Dept. Conf. 12 (1965). Problem related performance parameters for supercomputers. 9 (1970). A parallel algorithm for computing the singular value decomposition of a matrix.. (1990). C-24 (1975). in Proc. in Proc. of Computer Science. Urbana. of Computer Science. John Wiley. [100] R. [97] . JESSUP AND D. 967-974. of Computer Science. AND F. pp. 1981. 326-333. 
IEEE Computer Society Press. Urbana. 758-770. Argonne. KAILATH. New York. 106-119. ACM Trans. Cornell University. NY. ILLIAC IV. KUCK. [98] . 429-432.. The computations and communication complexity of a parallel banded system solver. KUCK. HO. TM-102. SIAM Rev. [121] C. Tech. 51 90.




PARALLEL ALGORITHMS FOR SPARSE LINEAR SYSTEMS

MICHAEL T. HEATH*, ESMOND NG* AND BARRY W. PEYTON*

Abstract. In this paper we survey recent progress in the development of parallel algorithms for solving sparse linear systems on computer architectures having multiple processors. We focus our attention on direct methods for solving sparse symmetric positive definite systems, specifically by Cholesky factorization. We survey recent progress on parallel algorithms for all phases of the solution process, including ordering, symbolic factorization, numeric factorization, and triangular solution.

Key Words. sparse linear systems, parallel algorithms, Cholesky factorization

AMS(MOS) subject classifications. 65F, 65W

1. Introduction. Dense matrix computations are of such central importance in scientific computing that they are usually among the first algorithms implemented in any new computing environment. The need for high performance on common operations such as matrix multiplication and solving systems of linear equations has had a strong influence on the design of many architectures, and such computations have become standard benchmarks for evaluating the performance of new computer systems. A survey of parallel algorithms for dense matrix computations is given in [34]. Sparse matrix computations are equally as important and pervasive, but both their performance and their influence on computer system design have tended to lag those of their dense matrix counterparts. Nevertheless, there have been some notable successes in attaining very high performance (e.g., [14]), and the needs of sparse matrix computations have had some effect on the design of computer systems and compilers (e.g., the inclusion of scatter/gather instructions on some vector supercomputers).

In a sense this relative lack of attention and success is not surprising: sparse matrix computations involve more complex algorithms, sophisticated data structures, and irregular memory reference patterns, making efficient implementations on novel architectures substantially more difficult to achieve than for dense matrix computations. One could plausibly argue, however, that the greater complexity and irregularity of sparse matrix computations make them much more realistic representatives of typical scientific computations, and therefore even more useful as design targets and benchmark criteria than the dense matrix computations that have usually played this role. Despite the difficulty and relative neglect of sparse matrix computations on advanced computer architectures, it is ironic that sparse matrix computations contain more inherent parallelism than the corresponding dense matrix computations (in a sense to be discussed below).

In this paper we will examine the reasons for this state of affairs, reviewing the major issues and progress to date in sparse matrix computations on parallel computer architectures. In addition to surveying the literature in this area, we will try to sketch the conceptual framework in which this work has taken place. To keep the scope of the article within reasonable bounds, we will focus our attention on the solution of sparse symmetric positive definite linear systems by Cholesky factorization. There has, of course, also been progress

* Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2009, Oak Ridge, Tennessee 37831-8083. This research was supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems Inc.

on parallel algorithms for other matrix problems (e.g., other factorizations (e.g., LU and QR), nonsymmetric linear systems, least squares, eigenvalues) and on other basic approaches (e.g., iterative methods), but there are many additional issues associated with parallel iterative algorithms or parallel sparse eigenvalue algorithms that we do not specifically address, and a comprehensive treatment of all of these topics would easily require an entire book. Our discussion of sparse Cholesky factorization illustrates some of the major issues that also arise in other parallel sparse matrix factorizations as well.

An outline of the paper is as follows. First, we will sketch briefly some necessary background material on serial algorithms for solving sparse symmetric positive definite linear systems. We then survey the progress to date in developing parallel implementations for each of the major phases of the solution process. We will see that the same graph theoretic tools originally developed for analyzing sequential sparse matrix algorithms also play a critical role in understanding parallel algorithms as well. We conclude with some observations on future research directions.

2. Background. Consider a system of linear equations

    Ax = b,

where A is an n x n symmetric positive definite matrix, b is a known vector, and x is the unknown solution vector to be computed. One way to solve the linear system is first to compute the Cholesky factorization

    A = LL^T,

where the Cholesky factor L is a lower triangular matrix with positive diagonal elements. Then the solution vector x can be computed by successive forward and back substitutions to solve the triangular systems

    Ly = b,    L^T x = y.

If A is a sparse matrix, meaning that most of its entries are zero, then usually many zero entries in the lower triangle of A remain zero in L; however, during the course of the factorization some entries that are initially zero in the lower triangle of A may become nonzero entries in L. These entries of L are known as fill or fill-in. For efficient use of computer memory and processing time, it is desirable for the amount of fill to be small, and to store and operate on only the nonzero entries of A and L. For a much more complete treatment, the reader should consult [25] or [47].

It is well known that row or column interchanges are not required to maintain numerical stability in the factorization process when A is positive definite. Furthermore, when roundoff errors are ignored, a given linear system yields the same solution regardless of the particular order in which the equations and unknowns are numbered. More precisely, let P be any permutation matrix. Since PAP^T is also a symmetric positive definite matrix, we can choose P based solely on sparsity considerations; that is, we can often choose P so that the Cholesky factor of PAP^T has less fill than L. The permuted system is equally useful for solving the original linear system, with the triangular solution phase simply becoming the solution of the two triangular systems arising from the Cholesky factor of PAP^T. This freedom in choosing the ordering can be exploited to enhance the preservation of sparsity in the Cholesky factorization process.
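The factor-and-solve process just described can be sketched as follows (a minimal dense Python illustration for exposition only; in the sparse setting that is the subject of this paper, only the nonzero entries of A and L would be stored and operated on):

```python
import numpy as np

def cholesky(A):
    """Return the lower triangular L with A = L L^T (A symmetric positive definite)."""
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        # diagonal entry: pivot after subtracting contributions of prior columns
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

def solve_with_factor(L, b):
    """Solve A x = b via forward substitution (L y = b), then back substitution (L^T x = y)."""
    n = L.shape[0]
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (y[i] - L[i + 1:, i] @ x[i + 1:]) / L[i, i]
    return x
```

Note that no row or column interchanges appear anywhere: positive definiteness guarantees that the quantity under the square root remains positive throughout.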

The direct solution of Ax = b thus consists of the following sequence of four distinct steps:

1. Ordering: Find a good ordering P for A; that is, determine a permutation matrix P so that the Cholesky factor L of PAP^T suffers little fill.
2. Symbolic Factorization: Determine the structure of L and set up a data structure in which to store A and compute the nonzero entries of L.
3. Numeric Factorization: Insert the nonzeros of A into the data structure and compute the Cholesky factor L of PAP^T.
4. Triangular Solution: Solve Ly = Pb and L^T z = y, and then set x = P^T z.

Note that the first two steps are entirely symbolic, involving no floating-point computation. Since pivoting is not required in the factorization process, the precise locations of all fill entries in L can be predicted in advance¹, once the ordering is known, so that a data structure can be set up to accommodate L before any numeric computation begins. This data structure need not be modified during subsequent computations, which is a distinct advantage in terms of efficiency. The process by which the nonzero structure of L is determined in advance is called "symbolic factorization." Several software packages [17,29] for serial computers use this basic approach to solve sparse symmetric positive definite linear systems. We now briefly discuss algorithms and methods for performing each of these steps on sequential machines.

2.1. Ordering. Graph theory has proved to be an extremely helpful tool in modeling the symbolic or structural aspects of sparse elimination algorithms, and has now come to permeate the subject. The use of a graph theoretic model dates to the early work of Parter [91] and Rose [97]. The graph of an n x n symmetric matrix A, denoted by G(A), is a labelled undirected graph having n vertices (or nodes), with an edge between two vertices i and j if the corresponding entry a_ij is nonzero in the matrix. The structural effect of Gaussian elimination on the matrix is then easily described in terms of the corresponding graph. The fill introduced into the matrix as a result of eliminating a variable adds fill edges to the corresponding graph precisely so that the neighbors of the eliminated vertex become a clique. This fact suggests that fill can be limited, or at least postponed, by eliminating first those vertices having fewest neighbors (i.e., vertices of lowest degree). The elimination or factorization process can thus be modeled by a sequence of graphs, each having one less vertex than the previous graph but possibly gaining edges, until only one vertex remains. We will also have occasion to refer to the filled graph, denoted by F(A), which is the graph of A with all fill edges added (i.e., there is an edge between two vertices i and j of F(A), with i > j, if l_ij != 0 in the Cholesky factor matrix L).

Unfortunately, finding a permutation P that minimizes fill is a very difficult combinatorial problem (an NP-complete problem) [107]. As one might expect from the combinatorial nature of the ordering problem for sparse factorization, a great deal of research effort has been devoted to developing good heuristics for limiting fill in sparse Cholesky factorization, including the nested dissection algorithm [39,45] and the minimum degree algorithm [48,98]. Limiting fill is also the primary motivation for a number of methods based on reducing the bandwidth or profile of A. These band-oriented methods have been less successful, however, than the more general sparse ordering techniques, and as we shall see, they are at an even greater disadvantage in a parallel context.

¹ We assume that exact cancellation never occurs; thus fill refers to the structural nonzeros of L, that is, every location of the factor that is occupied by a nonzero entry at some point in the factorization.
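A single step of the graph model described above, removing a vertex and turning its neighborhood into a clique, can be sketched as follows (illustrative Python; `graph_of` and `eliminate` are hypothetical helper names, not from any standard package):

```python
def graph_of(A):
    """G(A): vertices 0..n-1, with an edge {i, j} whenever a_ij != 0 (i != j)."""
    n = len(A)
    return {v: {u for u in range(n) if u != v and A[v][u] != 0} for v in range(n)}

def eliminate(G, v):
    """Remove vertex v from G and add fill edges so that v's neighbors
    become a clique. Returns the set of fill edges introduced."""
    nbrs = G.pop(v)
    fill = set()
    for u in nbrs:
        G[u].discard(v)
        for w in nbrs - {u} - G[u]:   # missing edges among the neighbors
            fill.add(frozenset((u, w)))
            G[u].add(w)
    return fill
```

Eliminating the center of a star graph, for instance, pairwise connects all of its leaves, exactly the worst case that fill-reducing orderings try to avoid by eliminating low-degree vertices first.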

The foregoing discussion provides the basis for the minimum degree algorithm, which is the most successful and widely applicable heuristic developed to date for limiting fill in sparse Cholesky factorization. At each step of the elimination process, this simple heuristic selects as the next node to be eliminated a node of minimum degree in the current elimination graph. Despite its simplicity, the minimum degree algorithm produces reasonably good orderings over a remarkably broad range of problem classes, but its success is not well understood. Another strength is its efficiency: as a result of a number of refinements over several years, current implementations are extremely efficient on most problems. George and Liu [48] review a series of enhancements to implementations of the minimum degree algorithm and demonstrate the consequent reductions in ordering time.

As might be expected from the "greedy" nature of the algorithm, however, several weaknesses of the minimum degree ordering heuristic are well documented in the literature. Attempts to make the selection more intelligent or less myopic have proven to be computationally expensive. Experiments in both [24] and [48] illustrate the sensitivity of the quality of minimum degree orderings to the way ties are broken when there is more than one node of minimum degree from which to choose. No tie-breaking scheme proposed to date is both effective and efficient, though some interesting results using deficiency (the number of fill edges created by the elimination step) to break ties are reported in [15]. Berman and Schnitger [13] show that for a model problem there exists a minimum degree tie-breaking scheme for which the time and space complexity of the factorization is worse than that of known asymptotically optimal orderings. More importantly, no robust and efficient way is known for dealing with the wide variability in the quality of the orderings it produces. To summarize, minimum degree is, on balance, an effective and efficient ordering heuristic.

Another effective algorithm for limiting fill in Cholesky factorization is nested dissection, which is based on a divide-and-conquer paradigm. Let S be a set of nodes (called a separator) whose removal, along with all edges incident on nodes in S, divides the graph into at least two remaining pieces. If the matrix is reordered so that the variables in each piece are numbered contiguously and the variables in the separator are numbered last, then the matrix will have a bordered block diagonal nonzero pattern, and fill is restricted to the diagonal blocks and the border [47,99]. Moreover, elimination of a node within one of the pieces cannot introduce fill into any of the other pieces. This idea can be applied recursively, breaking the pieces into smaller and smaller pieces with successive sets of separators, giving a nested sequence of dissections of the graph.

The effectiveness of nested dissection in limiting fill is highly dependent on the size of the separators used to split the graph. For highly regular, planar problems (e.g., two-dimensional finite difference or finite element grids), suitably small separators can usually be found [68,69]. For problems in dimensions higher than two, or for highly irregular problems with less localized connectivity, nested dissection is much less effective. Nevertheless, nested dissection is important not only for its practical usefulness on suitable problems, but also for its asymptotically optimal fill properties for certain model problems, which serves as a kind of theoretical benchmark for the quality of orderings [39,60].

2.2. Symbolic Factorization. A naive approach to symbolic factorization is simply to carry out Cholesky factorization on the structure of A symbolically; such an algorithm would then have the same time complexity as the numeric factorization itself (i.e., it would require the same number of symbolic operations as the number of floating-point operations required by the numeric factorization). With

a little care, however, the complexity of symbolic factorization can be reduced to O(η(L)), where η(L) denotes the number of nonzeros in L, as follows.

For a given sparse matrix M, define

    Struct(Mi*) := the sparsity structure of row i of the strict lower triangle of M

and

    Struct(M*j) := the sparsity structure of column j of the strict lower triangle of M.

For a given lower triangular Cholesky factor matrix L, define the function p as follows:

    p(j) := min {i ∈ Struct(L*j)},

when there is at least one off-diagonal nonzero in column j of L. Thus, p(j) is the row index of the first off-diagonal nonzero in that column. It is easy to show that

    Struct(L*j) - {p(j)} ⊆ Struct(L*p(j)).

Moreover, it can be shown that the structure of column j of L can be characterized as follows [47]:

    Struct(L*j) = Struct(A*j) ∪ ( ∪_{p(i)=j} Struct(L*i) ) - {j}.

That is, the structure of column j of L is given by the structure of the lower triangular portion of column j of A, together with the structure of each column of L whose first off-diagonal nonzero is in row j. This characterization leads directly to an algorithm for performing the symbolic factorization, shown in Figure 1, in which the sets Rj are used to record the columns of L whose structures will affect that of column j of L.

    for j := 1 to n do
        Rj := ∅
    for j := 1 to n do
        S := Struct(A*j)
        for i ∈ Rj do
            S := S ∪ Struct(L*i) - {j}
        Struct(L*j) := S
        if Struct(L*j) ≠ ∅ then
            p(j) := min {i ∈ Struct(L*j)}
            Rp(j) := Rp(j) ∪ {j}

    FIG. 1. Symbolic factorization algorithm.

This simple symbolic factorization algorithm is already very efficient, with time and space complexity O(η(L)), but it is subject to further refinement.
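The algorithm of Figure 1 translates almost line for line into executable form (a Python sketch for illustration only; columns are indexed from 0, and Struct(A*j) is supplied as a set of row indices in the strict lower triangle):

```python
def symbolic_factorization(struct_A):
    """struct_A[j]: set of row indices i > j with a_ij != 0 (strict lower triangle).
    Returns struct_L, the predicted structure of L including fill."""
    n = len(struct_A)
    R = [[] for _ in range(n)]   # R[j]: columns whose first off-diagonal nonzero is in row j
    struct_L = [set() for _ in range(n)]
    for j in range(n):
        S = set(struct_A[j])
        for i in R[j]:
            S |= struct_L[i] - {j}
        struct_L[j] = S
        if S:
            p = min(S)           # p(j): row of the first off-diagonal nonzero of column j
            R[p].append(j)
    return struct_L
```

On the small example used in the test below, eliminating column 0 links rows 1 and 3, and eliminating column 1 then links rows 2 and 3, so the predicted structure contains the fill entries l_31 and l_32 in addition to the nonzeros of A.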

For example, if Rj contains only one column, say i, and Struct(A*j) ⊆ Struct(L*i), then clearly Struct(L*j) = Struct(L*i) - {j}. As the columns tend to become more dense toward the end of the factorization, these conditions are often satisfied when j is relatively large. This shortcut can be used to speed up the symbolic factorization algorithm and to reduce the storage requirements using a technique known as "subscript compression" [104]. Since only the nonzero entries of the matrix are stored, additional indexing information must be stored to indicate the locations of the nonzeros. Although this integer overhead potentially rivals the space requirements for the nonzeros themselves, in practice the subscript compression technique mentioned above greatly reduces this overhead storage [46]. An efficient implementation of the symbolic factorization algorithm is presented in [47]. Once the structure of L is known, a compact data structure is set up to accommodate all of its nonzero entries. With its low complexity and an efficient implementation, the symbolic factorization step usually requires less computation than any of the other three steps in solving a symmetric positive definite system by Cholesky factorization.

2.3. Numeric Factorization. In its simplest form, Gaussian elimination on a dense matrix A can be described as a triple nested loop around the single statement

    a_ij := a_ij - (a_ik / a_kk) * a_kj.

The loop indices i, j, and k can be nested in any order, and each has its advantages and disadvantages in a given context. Specializing to Cholesky factorization, where symmetry is exploited so that only the lower triangle of the matrix is accessed, we see that there are three basic types of algorithms, each with a different pattern of memory access, depending on which of the three indices is placed in the outer loop:

1. Row-Cholesky: Taking i in the outer loop, successive rows of L are computed one by one, with the inner loops solving a triangular system for each new row in terms of the previously computed rows.
2. Column-Cholesky: Taking j in the outer loop, successive columns of L are computed one by one, with the inner loops computing a matrix-vector product that gives the effect of previously computed columns on the column currently being computed.
3. Submatrix-Cholesky: Taking k in the outer loop, with the inner loops applying the current column as a rank-1 update to the remaining partially-reduced submatrix.

These three families of algorithms have markedly different memory reference patterns in terms of which parts of the matrix are accessed and modified at each stage of the factorization (see Figure 2). This freedom can be exploited to take better advantage of particular architectural features of a given machine (cache, vectorization, virtual memory, etc.) [21]. For sparse Cholesky factorization, row-Cholesky is seldom used because of the difficulty in designing a compact row-oriented data structure for storing the nonzeros of L that can also be accessed efficiently in the numerical factorization phase [71]. Efficient implementation of sparse row-Cholesky is even more difficult on vector and parallel architectures, since it is difficult to vectorize or parallelize sparse triangular solutions (see discussions in Sections 2.4 and 3.5). We will therefore concentrate our attention on the two column-oriented methods, column-Cholesky and submatrix-Cholesky. In these column-oriented Cholesky factorization algorithms, there are two fundamental types of subtasks:

1. cmod(j, k) : modification of column j by column k, k < j;
2.

cdiv(j) : division of column j by a scalar.

These sparse matrix operations correspond to saxpy and sscal in the terminology of the BLAS [64] for dense linear algebra, but we use different notation to emphasize that we are dealing with their sparse counterparts. In terms of these basic operations, high-level descriptions of the column-Cholesky and submatrix-Cholesky algorithms are given in Figures 3 and 4.

FIG. 2. Three forms of Cholesky factorization.
FIG. 3. Sparse column-Cholesky factorization algorithm.
FIG. 4. Sparse submatrix-Cholesky factorization algorithm.

In column-Cholesky, column j of A remains unchanged until the index of the outer loop takes on that particular value. At that point the algorithm updates column j with a nonzero multiple of each column k < j of L for which l_jk != 0. After all column modifications have been applied to column j, the diagonal entry l_jj is computed and used to scale the completely updated column to obtain the remaining nonzero entries of L*j. Column-Cholesky is sometimes said to be a "left-looking" algorithm, since at each stage it accesses needed columns to the left of the current column in the matrix. It can also be viewed as a "demand-driven" algorithm, since the inner products that affect a given column are not accumulated until actually needed to modify and complete that column. It is also sometimes referred to as a "fan-in"

algorithm, since the basic operation is to combine the effects of multiple previous columns on a single subsequent column. The column-Cholesky algorithm is the most commonly used method in commercially available sparse matrix packages [17,27,29].

In submatrix-Cholesky, as soon as column k is completed, its effects on all subsequent columns are computed immediately. Submatrix-Cholesky is sometimes said to be a "right-looking" algorithm, since at each stage columns to the right of the current column are modified. It can also be viewed as a "data-driven" algorithm, since each new column is used as soon as it is completed to make all modifications to all the subsequent columns it affects. It is also sometimes referred to as a "fan-out" algorithm, since the basic operation is for a single column to affect multiple subsequent columns. We will see that these characterizations of the column-Cholesky and submatrix-Cholesky algorithms have important implications for parallel implementations.

Having stated the "pure" column- and submatrix-Cholesky algorithms, we note that many variations and hybrid implementations of these schemes are possible, which essentially amount to different ways of amalgamating partial results. For example, frontal methods [61], and their generalizations to multifrontal methods [28], are essentially variations on submatrix-Cholesky. But while the cmod(j, k) updating operations are computed in the order shown in Figure 4, they are not applied directly to the column j being updated. Instead they are accumulated and passed on through a succession of update matrices until finally they are incorporated into the target column. The reason for this approach is that in the frontal method most of the matrix is kept out of core on auxiliary storage, with only a relatively small "frontal" matrix representing currently "active" columns kept in main memory. To minimize I/O traffic, access to inactive portions of the matrix, both columns already completed and columns yet unreduced, must be kept to a minimum. Similarly, the out-of-core version of the multifrontal method can be implemented so that only a few small "frontal" matrices are kept in main memory. One of the main motivations for frontal and multifrontal methods is that the frontal matrices can be treated as dense, and therefore one can take advantage of vectorization more readily on hardware that supports it [3,11]. Moreover, the localization of memory references in these methods is advantageous in exploiting cache [100] or on machines with virtual memory and paging [76]. For further details on multifrontal methods, see [28] or [78].

Before leaving the general topic of sparse factorization, we introduce two additional concepts that are useful in analyzing and efficiently implementing sparse factorization algorithms. A supernode is a set of contiguous columns in the Cholesky factor L that share essentially the same sparsity structure. More specifically, the set of contiguous columns j, j+1, ..., j+t constitutes a supernode if Struct(L*k) = Struct(L*,k+1) ∪ {k+1} for j ≤ k ≤ j+t-1. A set of supernodes for an example matrix is shown in Figure 5. Columns in the same supernode can be treated as a unit for both computation and storage. Supernodes have long played an important role in enhancing the efficiency of both the minimum degree ordering [50] and the symbolic factorization [104]. More recently, supernodes have been used to organize sparse factorization algorithms around matrix-vector or matrix-matrix operations that reduce memory traffic by making more efficient use of vector registers [5,11] or cache [3,19]. The cited reports document the substantial gains in performance obtained by using these techniques.
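The two column-oriented schemes issue exactly the same cmod and cdiv operations and differ only in their order. A dense Python sketch for exposition only (a sparse implementation would restrict k to Struct(Lj*) and j to Struct(L*k) using the symbolic structure):

```python
import numpy as np

def cmod(L, j, k):
    """Modify column j by a multiple of column k (k < j); a sparse saxpy in practice."""
    L[j:, j] -= L[j, k] * L[j:, k]

def cdiv(L, j):
    """Complete column j: square root of the diagonal, then scale the subdiagonal."""
    L[j, j] = np.sqrt(L[j, j])
    L[j + 1:, j] /= L[j, j]

def column_cholesky(A):
    """Left-looking ('fan-in'): accumulate all updates into column j, then cdiv."""
    L, n = np.tril(A.astype(float)), len(A)
    for j in range(n):
        for k in range(j):
            if L[j, k] != 0.0:
                cmod(L, j, k)
        cdiv(L, j)
    return L

def submatrix_cholesky(A):
    """Right-looking ('fan-out'): finish column k, then immediately update columns j > k."""
    L, n = np.tril(A.astype(float)), len(A)
    for k in range(n):
        cdiv(L, k)
        for j in range(k + 1, n):
            if L[j, k] != 0.0:
                cmod(L, j, k)
    return L
```

Both routines produce the same factor; what differs, as discussed above, is when each cmod is applied, which is precisely what matters for parallel scheduling.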

FIG. 5. Supernodes for 7 x 7 nine-point grid problem ordered by nested dissection. (x and • refer to nonzeros in A and fill in L, respectively. Numbers over diagonal entries label supernodes.)

The elimination tree T(A) [71,103] associated with the Cholesky factor L of a given matrix A has {1, 2, ..., n} as its node set, and has an edge between two vertices i and j, with i > j, if i = p(j), where p is the function defined in Section 2.2. In other words, node i is said to be the parent of node j, and node j is a child of node i. Figure 6 shows the elimination tree for the matrix shown in Figure 5.

Let T[j] denote the subtree of T(A) rooted at node j. It is shown in [71] and [103] that the node set Struct(L*j) is a subset of the ancestors of j in the tree. It also follows that the columns that receive updates from column j are ancestors of j in T(A), which has obvious implications for implementing the factorization in parallel. Moreover, the set of columns/nodes that modify column/node j (namely, the set Struct(Lj*)) is a subset of T[j], denoted by Tr[j]. Tr[j] is called the row subtree of j; Tr[j] is also a subtree of T(A) rooted at node j. It follows that column j can be completed only after every column in Tr[j] has been computed.

Liu [79] discusses the many uses of elimination trees in sparse matrix computations. Among these is their use in managing the frontal and update matrices in the multifrontal method. Another key role is in the analysis of data dependencies that must be observed when factoring the matrix. We will discuss these issues in greater detail in Section 3.2.

2.4. Triangular Solution. There is relatively little to be said about the triangular solution step. The structure of the forward and back substitution algorithms is more or less dictated by the sparse data structure used to store the triangular Cholesky factor L and by the structure of the elimination tree T(A). Because triangular solution requires many fewer floating-point operations than the factorization step that precedes it, the triangular solution step usually requires only a small fraction of the total time to solve a sparse linear system on conventional sequential computers. These proportions can change, however, with more advanced computer architectures, since it is often more difficult to take full advantage of vector or parallel processors in performing triangular solutions.

3. Parallel Algorithms. In this section we summarize the progress to date in adapting direct methods for the solution of sparse symmetric positive definite linear systems to perform well on the various parallel architectures that have become available in recent years. Parallel architectures display an enormous variation in the number and power of processors, organization of memory, control mechanisms, and synchronization and communication overhead, so it is not surprising that they demand a comparable range of algorithmic techniques to achieve good efficiency in the various settings. The most widely available and commercially successful parallel architectures thus far fall into three rough categories: shared-memory MIMD (multiple-instruction, multiple-data stream) architectures typically having 30 or fewer processors; distributed-memory MIMD architectures typically having on the order of 32 to 1024 processors; and SIMD (single-instruction, multiple-data stream) architectures typically having tens of thousands of processors. Some machines have an additional level of parallelism in the form of vector units within each individual processor. Nevertheless, we will try to concentrate on general principles that are widely applicable, while focusing occasionally on implementation issues that may arise in a more specific context.

In exploiting parallelism to solve any problem, the computational work must be broken into a number of subtasks that can be assigned to separate processors. The most appropriate number and size of these tasks (e.g., a small number of large tasks
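Computing the elimination tree from the structure of L is immediate from the definition i = p(j) (an illustrative Python sketch; roots are marked with parent -1, and `descendants` is a hypothetical helper for listing the vertices of T[j]):

```python
def elimination_tree(struct_L):
    """struct_L[j]: set of row indices i > j with l_ij != 0.
    parent[j] = p(j), the row of the first off-diagonal nonzero of column j;
    columns with no off-diagonal nonzero are roots (parent -1)."""
    return [min(s) if s else -1 for s in struct_L]

def descendants(parent, j):
    """Vertex set of T[j], the subtree of the elimination tree rooted at j."""
    children = {}
    for v, p in enumerate(parent):
        children.setdefault(p, []).append(v)
    out, stack = set(), [j]
    while stack:
        v = stack.pop()
        out.add(v)
        stack.extend(children.get(v, []))
    return out
```

For the structure struct_L = [{2}, {2}, {3}, {}], for example, the parents are [2, 2, 3, -1]: columns 0 and 1 are independent leaves, and column 3 can be completed only after every other column.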

PARALLEL ALGORITHMS FOR SPARSE LINEAR SYSTEMS

FIG. 6. Elimination tree for the matrix shown in Figure 5. Bold-face numbers label supernodes. Ovals enclose supernodes that contain more than one node. Nodes not enclosed by an oval are singleton supernodes.
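The elimination tree can be generated directly from the structure of A by a very efficient parent-finding algorithm (cf. [79]). The following is a minimal Python sketch, not the book's own code; all function and variable names are ours, and the small matrices in the usage below are hypothetical examples.

```python
def etree(n, lower_adj):
    """Elimination tree T(A) from the structure of A alone.
    lower_adj[i] lists the neighbors k < i of vertex i (the below-row
    structure of row i of A).  parent[r] == -1 marks a root."""
    parent = [-1] * n
    ancestor = [-1] * n          # path-compressed "virtual" forest
    for i in range(n):
        for k in lower_adj[i]:
            r = k
            while ancestor[r] not in (-1, i):
                # compress the path from r toward the current vertex i
                ancestor[r], r = i, ancestor[r]
            if ancestor[r] == -1:
                ancestor[r] = i  # r was a root: i becomes its parent
                parent[r] = i
    return parent
```

For a tridiagonal matrix the result is a chain, while a more branched structure yields a genuine tree; e.g. `etree(5, [[], [0], [1], [2], [3]])` returns `[1, 2, 3, 4, -1]`.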

The most appropriate number and size of these tasks (e.g., a small number of large tasks or a large number of small tasks) depend on the target parallel architecture and the levels at which parallelism naturally occurs in the problem. The term often used to denote the size of computational tasks in a parallel implementation is granularity. In sparse factorization, a number of levels of computational granularity can potentially be exploited. Liu [72] uses the elimination tree to analyze the following levels of parallelism in Cholesky factorization:
1. large-grain parallelism, in which each task is the completion of all columns in a subtree of the elimination tree;
2. medium-grain parallelism, in which each task is a single cmod or cdiv column operation;
3. fine-grain parallelism, in which each task is a single floating-point operation or multiply-add pair.

Here, large-grain parallelism refers to the independent work done in computing columns in disjoint subtrees. Consider two disjoint subtrees T[j] and T[i], where neither root node is a descendent of the other. All work required to compute the columns of T[j] is completely independent of all work required to compute the columns of T[i]. For example, in Figure 6 the columns of T[9] (columns 1-9) are completely independent of the columns of T[18] (columns 10-18). This type of parallelism is available only in sparse factorization; it is not available in the dense case.

But of course we are not limited to exploiting only parallelism of this nature. There is much more parallelism to be found at the medium-grain level of the individual cmod operations. Let j1 and j2 be two column indices whose subtrees T[j1] and T[j2] are not disjoint, and suppose that k1 and k2 are indices of columns that must be used to modify columns j1 and j2, respectively. Clearly, the updates cmod(j1,k1) and cmod(j2,k2) can be performed in parallel. This is the primary source of parallelism in the dense case, and it is an extremely important source of parallelism in the sparse case as well.

Fine-grain parallelism can be exploited in two distinctly different ways: 1. vectorization of the column operations cmod and cdiv on vector supercomputers; 2. parallelizing the rank-one update that constitutes a major step of submatrix-Cholesky on an SIMD machine. Exploiting vectorization requires some changes and refinement of the basic sequential algorithms [3,11,19], but it does not require changes as extensive and basic as those required to exploit higher levels of parallelism. Developing parallel sparse submatrix-Cholesky algorithms for SIMD machines presents a more difficult challenge, and research on this topic is still in its infancy [54].
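The column operations cmod and cdiv discussed above can be sketched concretely. The following simplified, unoptimized Python illustration operates on a dense symmetric positive definite matrix stored as a full array (in the sparse case these loops would instead run over compressed column structures); the function names follow the cmod/cdiv notation of the text, but the code itself is ours.

```python
def cmod(a, j, k):
    """cmod(j,k): update column j by the completed column k, i.e.
    subtract the multiple a[j][k] * a[i][k] from a[i][j] for i >= j."""
    for i in range(j, len(a)):
        a[i][j] -= a[i][k] * a[j][k]

def cdiv(a, j):
    """cdiv(j): scale column j by the square root of its diagonal entry."""
    a[j][j] = a[j][j] ** 0.5
    for i in range(j + 1, len(a)):
        a[i][j] /= a[j][j]

def column_cholesky(a):
    """Left-looking column-Cholesky: each column receives all of its
    cmod updates and is then completed by a single cdiv."""
    for j in range(len(a)):
        for k in range(j):
            cmod(a, j, k)
        cdiv(a, j)
    return a
```

In the dense case every cmod(j,k) with k < j is performed; in the sparse case only the columns k in Struct(Lj*) contribute.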
While we will have a great deal to say about algorithms that employ the first two sources of parallelism, we will have little to say about finer grain parallelism.

To date, implementation on parallel architectures has caused no fundamental change in the overall high-level approach to solving sparse symmetric positive definite linear systems. On parallel machines the same sequence of four distinct steps is performed: ordering, symbolic factorization, numeric factorization, and triangular solution. However, both shared-memory and distributed-memory MIMD machines require an additional step to be performed: the tasks into which the problem is decomposed must be mapped onto the processors. In sparse factorization, as in most problems, one of the goals in mapping the problem onto the processors is to ensure that the work load is balanced across all processors. Moreover, it is desirable to schedule the problem so that the amount of synchronization and/or communication is low. On shared-memory machines the scheduling problem is relatively easy to deal with: a shared queue of tasks can be used to achieve dynamic load balancing.

Dynamic load balancing tends to be inefficient on current distributed-memory machines, however, so a static assignment of tasks to processors must be determined in advance. We now proceed to discuss the progress made in developing parallel algorithms for each of these five steps.

3.1. Ordering. There are two distinct issues associated with the ordering problem in a parallel environment:
1. Determining an ordering appropriate for performing the subsequent factorization efficiently on the parallel architecture in question.
2. Computing the ordering itself in parallel.

3.1.1. Orderings for parallel factorization. On sequential or vector machines, the primary goal of reordering the matrix is simply to lower the work and space required for the factorization step. While there are sometimes other secondary considerations, experience and intuition suggest that the two almost inevitably rise and fall together, so that the goal can be further simplified to lowering fill only. Simply lowering fill, however, may not provide an ordering appropriate for parallel factorization.

FIG. 7. Factor matrices and corresponding elimination trees for a tridiagonal matrix using the natural ordering and a nested dissection reordering (even-odd reduction). X and • refer to original nonzeros and fill nonzeros, respectively.

Orderings for a tridiagonal system serve to illustrate the point. Let us call the ordering that preserves the tridiagonal structure the natural ordering. Under the natural ordering, the matrix incurs no fill during factorization; indeed, both the fill and work are minimized. Nevertheless, the natural ordering is the poorest possible ordering for parallel factorization. First, note that the natural ordering results in an elimination tree that is a chain (see Figure 7), so there is no large-grain (subtree-level) parallelism to exploit. Moreover, each column j, 2 ≤ j ≤ n, requires a single column modification cmod(j, j−1) before it can be completed with the cdiv(j) operation, then and only then becoming available for the subsequent column modification cmod(j+1, j).

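The chain elimination tree of the natural ordering, and the short balanced tree produced by reordering the same tridiagonal system via recursive even-odd reduction (nested dissection, as in Figure 7), can be checked directly. The following is a hedged Python sketch, not code from the text: all names are ours, and the elimination tree is computed from the structure of A by the standard parent-finding algorithm with path compression.

```python
def etree(n, lower_adj):
    """Elimination tree from the structure of A; lower_adj[i] lists
    the neighbors k < i of vertex i."""
    parent, anc = [-1] * n, [-1] * n
    for i in range(n):
        for k in lower_adj[i]:
            r = k
            while anc[r] not in (-1, i):
                anc[r], r = i, anc[r]     # path compression toward i
            if anc[r] == -1:
                anc[r], parent[r] = i, i
    return parent

def tree_height(parent):
    """Number of nodes on the longest root-to-leaf path."""
    depth = [0] * len(parent)
    def d(j):
        if depth[j] == 0:
            depth[j] = 1 if parent[j] == -1 else d(parent[j]) + 1
        return depth[j]
    return max(d(j) for j in range(len(parent)))

def nd_path_perm(lo, hi, label, perm):
    """Nested dissection numbering of a path graph: number the two
    halves first, the middle (separator) vertex last."""
    if lo > hi:
        return label
    mid = (lo + hi) // 2
    label = nd_path_perm(lo, mid - 1, label, perm)
    label = nd_path_perm(mid + 1, hi, label, perm)
    perm[mid] = label
    return label + 1

n = 63
# natural ordering of the path: the elimination tree is a chain of height n
natural = [[i - 1] if i > 0 else [] for i in range(n)]
nat_h = tree_height(etree(n, natural))

# nested dissection (even-odd reduction) reordering of the same path
perm = [0] * n
nd_path_perm(0, n - 1, 0, perm)
nd = [[] for _ in range(n)]
for i in range(n - 1):
    a, b = perm[i], perm[i + 1]
    nd[max(a, b)].append(min(a, b))
nd_h = tree_height(etree(n, nd))
```

For n = 63 the natural ordering gives height 63, while the nested dissection reordering gives height 6, approximately log2 n, matching the contrast described above.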
Thus, there is no medium-grain (column-modification level) parallelism to exploit. There is also no fine-grain parallelism to exploit; the floating-point work is strictly sequential. Under the natural ordering, then, there is no parallelism at all to exploit in the floating-point computation.

But it is well known that even-odd reduction schemes for these systems, though they introduce more work, also greatly increase the parallelism. These solution schemes are equivalent to reordering the system with a nested dissection ordering (again, see Figure 7). Using the nested dissection ordering, the height of the elimination tree is approximately log2 n, which is much shorter than the height n−1 obtained using the natural ordering. While the total floating-point work (ignoring square roots) increases by a factor between two and three, parallel completion time using the nested dissection ordering is ideally O(log n) compared with O(n) using the natural ordering. This example is an extreme illustration of how inappropriate the goal of fill reduction can be in the parallel setting.

Thus far, there have been no systematic attempts to develop metrics for measuring the quality of parallel orderings, and there is not yet an agreed upon objective function for the parallel ordering problem. Most work on the parallel ordering problem has used elimination tree height as the criterion for comparing orderings, with short trees assumed to be superior to taller trees [62,80], but with little more than intuition as a basis for this choice. For massively parallel SIMD machines, it has been suggested that small elimination tree height may indeed be a suitable goal [54,65]. This contention is based on the assumption of a submatrix-Cholesky parallel factorization algorithm that requires roughly uniform time for the elimination of each column. It remains to be shown that this assumption is in fact realized for sparse problems on available SIMD machines, and the assumption is more doubtful on other parallel architectures. In [77], Liu suggests some more realistic measures of parallel completion time. Finally, it is worth noting that the problem of ordering a matrix to minimize its elimination tree height, like the problem of minimizing fill, is a very difficult combinatorial problem [93].

3.1.2. Computing the ordering in parallel. A separate problem is the need to compute the ordering in parallel on the same machine on which the other steps of the solution process are to be performed. The highly sophisticated ordering algorithms discussed earlier, namely minimum degree and nested dissection, are extremely efficient and normally constitute only a small fraction of the total execution time in solving a sparse system on sequential computers. Despite the limited potential for any gain in execution time, there is still motivation for adapting these algorithms, or developing new ones, to run on parallel architectures, especially in the case of distributed-memory machines. Otherwise, the ordering step will remain a bottleneck limiting the size of problems that can be solved on distributed-memory parallel architectures. Moreover, a distributed implementation of the ordering step is necessary to take advantage of the large amount of local memory available on such machines in solving very large problems.

The basic minimum degree algorithm has an inherently sequential outer loop, with a single node eliminated at each stage. However, the search for nodes of minimum degree and the necessary graph transformations and degree updates could conceivably be spread across multiple processors, and multiple elimination of independent nodes of minimum or near-minimum degree [70,92] could potentially be exploited to permit parallel execution.

However, there are several problems with this approach. First, the highly successful enhancements incorporated into current implementations of the method [48] have resulted in an intricate and extremely efficient algorithm: there is very little work to be partitioned among the processors, and that work is of a highly irregular and somewhat sequential nature. Second, it is not clear that minimum degree orderings would be particularly appropriate for parallel factorization: as noted above, applying the basic minimum degree algorithm to the tridiagonal system discussed above produces an elimination tree that is a chain, and even with multiple elimination the resulting elimination tree height would be at least ⌊n/2⌋. Duff et al. [26] contain several suggestions for dealing with these problems, the most promising of which increases the size of the independent sets by allowing all nodes whose degrees are within a constant factor a of the current minimum degree, 1.1 < a < 1.5, to be candidates for inclusion in the next independent set. A different approach for computing independent sets for parallel elimination is described in [66], and an algorithm based on this approach has been developed for use on a massively parallel SIMD machine [54]. It is doubtful, however, that this approach would have acceptable efficiency on distributed-memory MIMD machines. It is possible that such an approach could also be reasonably effective on some shared-memory MIMD machines, but we know of no such implementations.

The standard nested dissection ordering heuristic [45] would appear to offer much greater opportunity for an effective parallel implementation: the divide-and-conquer paradigm introduces a natural source of parallelism, due to the independence of the successive pieces into which the graph is split. Unfortunately, there are also difficulties with this approach. First, the divide-and-conquer approach provides only a logarithmic potential speedup, with relatively little parallelism in the first few levels of the dissection. Second, the nested dissection heuristic (based on the generation of level structures) is effective in reducing fill for a much more restricted class of problems than minimum degree. Third, nested dissection is similar to minimum degree in that it enjoys a very efficient sequential implementation, and its primary subtask (generating a level structure via breadth-first search) is inherently serial. Finally, for a distributed-memory implementation there is something of a bootstrapping problem: in order to utilize all of the local memory and simultaneously minimize interprocessor communication costs, the original graph should be distributed across the processors in some intelligent way before the dissection process is begun. These problems are very difficult to deal with, and we are not aware of any attempt to produce such an implementation.

To summarize this discussion, it is evident that the problem of computing effective parallel orderings in parallel is very difficult and remains largely untouched by research efforts to date, and much work remains to be done before mature, reliable algorithms and software become available. It is ironic that much of the research on parallel algorithms for sparse factorization has been performed on distributed-memory machines, yet it is on this class of machines that the ordering problem seems most difficult to address.

3.1.3. Parallel ordering algorithms. We now turn our attention to the problem of ordering for parallel factorization and/or executing the ordering algorithms on the target parallel machine. We focus our attention in the remainder of this section on the effectiveness of various ordering strategies in facilitating the subsequent parallel factorization, with little regard for whether the ordering can itself be computed effectively in parallel.

Tree restructuring for parallel elimination. One approach to generating low-fill orderings that are suitable for parallel sparse factorization is to decouple the reduction of fill and the enhancement of parallelism into separate phases. First a standard ordering technique, such as minimum degree, is applied to produce a low-fill ordering for the matrix; then, based on this initial ordering, an equivalent reordering is produced that is more suitable for parallel factorization. By "equivalent" we mean an ordering that generates the same fill edges but may substantially restructure the elimination tree. Roughly speaking, an equivalent ordering is simply a different perfect elimination ordering for the filled graph F(A) that models the sparsity structure of L determined by the initial fill-reducing ordering.

In [77], Liu uses tree rotations [73] to find equivalent orderings that reduce elimination tree height, where the initial ordering used is a minimum degree ordering. The heuristic provides a mechanism for restructuring the elimination tree, together with a computable criterion for determining when a given reordering will in fact reduce the subsequent parallel factorization time; he reports substantial reductions for a number of test problems. In the same report, Liu proves that the Jess and Kees algorithm [62] produces an equivalent ordering whose associated elimination tree height is minimum among all equivalent orderings. In [80] Liu and Mirzaian present a practical O(η(L)) implementation of the Jess and Kees algorithm. Tests comparing Liu's tree rotations heuristic with their implementation of the Jess and Kees algorithm showed that the heuristic almost always produces a minimum-height tree; their timings showed the tree rotations heuristic to be far more efficient than their implementation of the Jess and Kees algorithm. In [67] a more efficient implementation of the Jess and Kees algorithm is presented; the latter implementation is linear in the number of compressed subscripts used to represent the structure of L. Tests of this implementation indicate that a Jess and Kees ordering can usually be obtained in roughly the same amount of time as an ordering using the tree rotations heuristic.

Of course, implementation of the equivalent ordering approach requires an initial fill-reducing ordering, and the effectiveness of this approach depends in part on whether there is in fact a good parallel ordering within the class of orderings equivalent to the initial low-fill ordering. The tridiagonal example cited earlier demonstrates that there may be no such ordering. Very little is known, however, about the conditions under which good equivalent parallel orderings might exist for realistic classes of problems. Our intuition based on limited experience is that equivalent orderings have the capacity to modify only relatively minor features of the parallelism possessed by the initial fill-reducing ordering. Moreover, elimination trees produced by minimum degree orderings typically have height already close to the minimum, so that the potential gain from restructuring may be quite small. This interesting phenomenon is not fully understood: since some of the parallelism in sparse factorization is due specifically to sparsity, low-fill would seem to enhance potential parallelism rather than suppress it. On the other hand, the height of the elimination tree may not be a very accurate indicator of the actual parallel factorization time. Perhaps the primary problem with this approach is that it fails to get at the heart of the problem: it may be able to fine-tune an ordering for use in parallel factorization, but the key question of how much parallelism might be available in the original underlying problem goes unanswered.
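Any reordering in which every node of the elimination tree is numbered after all of its descendants generates the same fill and is therefore an equivalent ordering in the above sense. The following small Python sketch (names ours, not from the text) produces one such reordering, a postorder of the tree:

```python
def postorder(parent):
    """Return a postorder of the elimination forest given by `parent`
    (parent[r] == -1 for roots): every node appears after all of its
    descendants, so relabeling by position gives an equivalent ordering."""
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for j in range(n):
        (children[parent[j]] if parent[j] != -1 else roots).append(j)
    order = []
    stack = [(r, False) for r in reversed(roots)]
    while stack:
        j, expanded = stack.pop()
        if expanded:
            order.append(j)          # all descendants already emitted
        else:
            stack.append((j, True))
            stack.extend((c, False) for c in reversed(children[j]))
    return order
```

The list returned satisfies order.index(j) < order.index(parent[j]) for every non-root j, which is exactly the topological property an equivalent reordering must preserve.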

Nested dissection and graph partitioning heuristics. Given the natural divide-and-conquer parallelism exhibited by nested dissection, several researchers have explored various implementations of nested dissection in an effort to generate good orderings for parallel factorization. The effectiveness of nested dissection in reducing fill and enhancing parallelism depends on graph partitioning heuristics to find small node separators for the graph, or more specifically, a small set of nodes whose removal separates the graph into two portions of roughly equal size. Some of the graph partitioning heuristics employed in fact produce edge separators, or more specifically, a small set of edges whose removal from the graph separates the graph into two vertex sets of roughly equal size, which then must be converted into node separators.

The basic scheme in nested dissection is as follows:
1. Use a graph partitioning heuristic to obtain a small edge separator of the graph.
2. Transform the small edge separator into a small node separator.
3. Number the nodes of the separator last in the ordering, subdividing the problem into two or more independent subgraphs, and recursively apply the scheme to the subgraphs produced in step 2.

Associated with an edge separator are wide and narrow node separators, defined as follows. Let P1 and P2 be the two sets of nodes into which the edge separator partitions the graph. Let V1 contain the nodes in P1 incident on at least one edge in the separator set, and define V2 ⊆ P2 in the same way. The set V = V1 ∪ V2 is the associated wide separator, and both V1 and V2 are the associated narrow separators. We now review some specific implementations of this approach.

Level structures. In [44] the adaptation of an automatic nested dissection algorithm [45] for execution on distributed-memory MIMD machines is discussed. The algorithm first generates a level structure by means of a breadth-first search; then one of the middle levels is chosen as a node separator. This method generates a node separator directly, and therefore omits step 1 from the general scheme given above. An advantage of this method is that it is simple and generally inexpensive to compute. The choice of starting node in the search can be crucial; see [45] for details. But the automatic nested dissection heuristic is generally not as effective at reducing fill as the minimum degree heuristic, and thus the quality of the ordering is poorer on many, but not all, problems. As with most nested dissection algorithms, there is little parallelism to exploit until the ordering algorithm is several levels down into the recursion, where there are adequately many independent subproblems to work on.

Kernighan-Lin and Fiduccia-Mattheyses. Gilbert and Zmijewski [55] use the Kernighan-Lin heuristic [63] to generate a small edge separator, and a greedy heuristic to generate node separators from edge separators. Their heuristic is guaranteed to find a minimal node separator among the nodes belonging to V = V1 ∪ V2. Gilbert and Zmijewski ran tests using both kinds of separators and report ordering times and factorization times on an Intel iPSC/1 hypercube. Lewis and Leiserson [65] use a variant of the Kernighan-Lin heuristic due to Fiduccia and Mattheyses [31] to generate edge separators. In their tests they use elimination tree height to compare the quality of their orderings with those obtained by using tree rotations to reduce the elimination tree height of minimum degree orderings. They report fairly substantial and consistent reductions in tree height for their test problems. However, they did not implement their algorithm on a parallel machine; all their tests were run on an unspecified sequential machine and no timing results were reported.
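The recursive scheme can be illustrated with the level-structure variant described above: take the middle level of a breadth-first level structure as the node separator, number it last, and recurse on the remaining pieces. The following Python sketch is a simplified illustration under those assumptions (all names ours), not a production ordering code:

```python
from collections import deque

def bfs_levels(adj, nodes, start):
    """Level structure of the subgraph induced by `nodes`, from `start`."""
    nodes = set(nodes)
    level = {start: 0}
    levels = [[start]]
    q = deque([start])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v in nodes and v not in level:
                level[v] = level[u] + 1
                if level[v] == len(levels):
                    levels.append([])
                levels[level[v]].append(v)
                q.append(v)
    return levels

def nested_dissection(adj, nodes=None):
    """Return the vertices in elimination order: the two remaining
    pieces first, a middle-level node separator last."""
    if nodes is None:
        nodes = list(range(len(adj)))
    if len(nodes) <= 2:
        return list(nodes)
    levels = bfs_levels(adj, nodes, nodes[0])
    if len(levels) < 3:                    # no useful separator: stop here
        return list(nodes)
    mid = len(levels) // 2
    sep = set(levels[mid])                 # middle level = node separator
    depth = {v: d for d, row in enumerate(levels) for v in row}
    lower = [v for v in nodes if v not in sep and depth.get(v, 0) < mid]
    lowset = set(lower)
    upper = [v for v in nodes if v not in sep and v not in lowset]
    return (nested_dissection(adj, lower)
            + nested_dissection(adj, upper)
            + sorted(sep))
```

On a path of seven vertices this numbers the middle vertex last, reproducing the even-odd reduction ordering of the tridiagonal example.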

Spectral separators. Pothen, Simon, and Liou [95] study the use of spectral partitions [32,33] in the framework described above. To generate an edge separator, they first compute the eigenvector y associated with the smallest positive eigenvalue of the Laplacian matrix associated with the graph G(A). Then the median entry ym of y is found; the vertices in P1 are taken to be those corresponding to entries yi of y for which yi < ym, while the vertices in P2 are those corresponding to entries yi for which yi ≥ ym. The authors then use matching theory for bipartite graphs, in particular the Dulmage-Mendelsohn decomposition, to generate from the edge separator a minimum-cardinality node separator [94]. Their bipartite-matching method for transforming an edge separator into a node separator is optimal in the sense that it minimizes the size of the node separator over all possible node separators that can be obtained from the given edge separator (i.e., over all separators contained in the set of nodes incident on the separator edges). They use an implementation of the Lanczos algorithm to compute the required eigenvector for general sparse graphs. Since most of the time is spent performing Lanczos iterations, which can be parallelized in a fairly straightforward manner, their method should run efficiently in parallel even in the top few levels of the nested dissection recursion. The timings and ordering statistics reported indicate that the method obtains good orderings in a reasonably efficient manner on a sequential machine, although the report cited here does not include statistics for complete nested dissection orderings based on this technique; it includes statistics for the top-level separator only.

A hybrid approach. In [74] and [75] Liu presents a hybrid approach that combines elements of both the minimum degree and nested dissection algorithms. It is thus a hybrid of two very different ordering techniques: it generates a nested dissection ordering (a top-down ordering), but uses a minimum degree ordering (a bottom-up ordering), along with some matching theory, to obtain the separators. The method proceeds as follows. After a standard minimum degree ordering algorithm is initially applied to the problem, a "middle" separator determined by the minimum degree ordering is chosen. A technique based on matching theory for bipartite graphs is then used to improve (i.e., shrink) this separator. The nodes of the new separator are numbered last in the ordering, and then the process is applied recursively to the subproblems remaining to be ordered. The primary emphasis of the two papers is simply to produce improved fill-reducing orderings, but the application of the method to parallel factorization is noted in both papers. Again, computing the ordering in parallel with this approach appears to be very difficult.

3.2. Task Partitioning and Scheduling. 3.2.1. Shared-memory MIMD machines. In implementing sparse column-Cholesky on a shared-memory MIMD machine, the problem of partitioning the factorization into tasks for concurrent execution on multiple processors is fairly simple. Each column j corresponds to a task Tcol(j) defined by

Tcol(j) = { cmod(j,k) : k ∈ Struct(Lj*) } ∪ { cdiv(j) }.

That is, Tcol(j) consists of all column modifications to be applied to column j, as well as the final scaling operation. The tasks Tcol(j) are maintained in a queue and doled out to processors as they complete previous tasks. Since all necessary data are globally accessible by all processors, there need be no concern over which specific processor picks up a given task. This approach achieves good load balancing dynamically, an ideal arrangement for the highly irregular task profile usually generated by sparse problems.
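The shared-queue idea can be illustrated with a small simulation: worker threads repeatedly pop column tasks from a shared queue, and Tcol(j) is enqueued once every child of j in the elimination tree has completed. This is a hedged Python sketch (all names ours; the cmod/cdiv work itself is elided), not the implementation discussed in the text:

```python
import threading
import queue

def parallel_column_tasks(parent, n, nworkers=4):
    """Simulate dynamic scheduling of column tasks from a shared queue.
    A task j becomes ready when all its elimination-tree children finish."""
    children = [[] for _ in range(n)]
    for j in range(n):
        if parent[j] != -1:
            children[parent[j]].append(j)
    pending = [len(children[j]) for j in range(n)]   # unfinished children
    ready = queue.Queue()
    for j in range(n):
        if pending[j] == 0:
            ready.put(j)                  # leaf tasks are ready at once
    lock = threading.Lock()
    done = []

    def worker():
        while True:
            j = ready.get()
            if j is None:                 # poison pill: no more work
                return
            # ... the cmod/cdiv work for column j would go here ...
            with lock:
                done.append(j)
                p = parent[j]
                if p != -1:
                    pending[p] -= 1
                    if pending[p] == 0:   # last child of p just finished
                        ready.put(p)
                if len(done) == n:        # all columns complete: stop workers
                    for _ in range(nworkers):
                        ready.put(None)

    workers = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return done
```

Whichever thread happens to be free picks up the next ready column, so the load balances itself dynamically, exactly the property the shared queue is meant to provide.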

In short, on these machines uniform access to main memory permits the use of dynamic load balancing and a fairly simple restructuring of a sequential sparse Cholesky algorithm to obtain a good parallel algorithm.

Efficient scheduling of the tasks Tcol(j) on shared-memory MIMD machines is also easily accomplished. An ordering of the elimination tree is a topological ordering if each node is numbered higher than all of its descendants. Performance usually is not very sensitive to which topological ordering is used to schedule the column tasks, and it is often adequate to use the fill-reducing ordering to schedule the tasks. In this case, the task queue Q is given by

Q = ( Tcol(1), Tcol(2), ..., Tcol(n) ).

However, scheduling columns by their height in the elimination tree usually improves performance by reducing synchronization delays, as shown in [88]. The ordering of the elimination tree shown in Figure 8 is particularly appropriate. Scheduling the column tasks in this manner is especially worthwhile, since the overhead required to do so is trivial — a single n-vector computed in O(n) time. A more dynamic queue management strategy is to initialize the queue to contain only the tasks corresponding to the leaf nodes, with additional column tasks appended to the queue after their descendants have been completed. See Section 3.4.1 and [41,88] for parallel implementations of sparse Cholesky based on these ideas.
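The height-based schedule is indeed cheap to compute: since parent[j] > j in an elimination tree, one ascending pass over the parent vector yields all node heights, and a stable sort by height lists leaf tasks first while preserving the fill-reducing order within each level. A minimal Python sketch (function name ours):

```python
def height_order(parent):
    """Order column tasks by height in the elimination tree, leaves
    first.  The result is a topological ordering, since a parent's
    height always exceeds the height of each of its children."""
    n = len(parent)
    height = [0] * n
    for j in range(n):                 # parent[j] > j: children come first
        if parent[j] != -1:
            height[parent[j]] = max(height[parent[j]], height[j] + 1)
    return sorted(range(n), key=lambda j: height[j])   # stable sort
```

The whole computation is the single n-vector pass mentioned above, plus a sort of the column indices.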
3.2.2. Distributed-memory MIMD machines. The situation is much more difficult on distributed-memory MIMD machines, the target architecture for much of the algorithm development for parallel sparse factorization reported in the literature. On these machines, the lack of globally accessible memory means that issues concerned with data locality are dominant considerations. Currently, there is no efficient means of implementing dynamic load balancing on these machines for problems of this type. Thus, a static assignment of tasks to processors is normally employed in this setting, based on the trade-offs between load balancing and the cost of interprocessor communication, and such a mapping must be determined in advance of the factorization.

Elimination trees. As we have seen, the elimination tree contains information on data dependencies among tasks and the corresponding communication requirements. Thus, the elimination tree is an extremely helpful guide in determining an effective assignment of columns (and corresponding tasks) to processors in the distributed-memory case. It is desirable to compute the elimination tree in parallel, but again we face the recurring problem of having very little work to distribute over the processors. Moreover, for large problems, if a single processor cannot store the adjacency structure of A, then the structure of A must be distributed among the processors, which also requires distributed computation of the elimination tree. In attempting to compute the elimination tree, we appear to be confronted by a bootstrapping problem: prior to symbolic factorization, we do not yet know the structure of L on which the definition of T(A) is based. Fortunately, T(A) can be generated directly from the structure of A by an extremely efficient algorithm [79].

In [110], Gilbert and Zmijewski present an algorithm for computing the elimination tree in parallel on a distributed-memory multiprocessor. Roughly speaking, their algorithm proceeds as follows. Each processor uses its portion of the adjacency structure of A to compute a "local" version of the elimination tree; in essence, this "local" tree contains in a compressed form the contribution of each processor's local adjacency list to the final elimination tree. The final phase of the algorithm combines these "local" trees to obtain the final elimination tree.

FIG. 8. A good ordering of column tasks in the task queue used by the parallel column-Cholesky algorithm for shared-memory MIMD machines.

All communication associated with the algorithm is restricted to this final "combining" operation. In the experiments reported in [110], the parallel algorithm takes considerably more time than the sequential algorithm, though the differences are not unreasonable.

Mapping the problem onto the processors. After the elimination tree has been generated, the next step is to use it in mapping the columns onto the processors. The primary goals of the mapping are good load balance and low interprocessor communication; these goals can be in conflict, however. In the early work on this problem, successive levels in the elimination tree were wrap-mapped to the processors, as shown in Figure 9. This resulted in good load balancing for the model problem, but it also often results in unnecessarily high message volume, especially for highly irregular problems.

FIG. 9. A wrap-mapping of the factor columns onto four processors numbered 0, 1, 2, and 3.
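A level-by-level wrap-mapping of the kind shown in Figure 9 can be sketched as follows (a simplified Python illustration; all names are ours):

```python
def wrap_map_levels(parent, P):
    """Wrap-map successive levels of the elimination tree onto P
    processors: within each level, columns are dealt out round-robin."""
    n = len(parent)
    depth = [0] * n
    for j in range(n - 1, -1, -1):      # parent[j] > j: roots seen first
        if parent[j] != -1:
            depth[j] = depth[parent[j]] + 1
    proc = [0] * n
    seen = {}                           # columns assigned so far per level
    for j in sorted(range(n), key=lambda j: depth[j]):
        d = depth[j]
        proc[j] = seen.get(d, 0) % P
        seen[d] = seen.get(d, 0) + 1
    return proc
```

Each level is spread evenly over the processors, which balances the work per level but takes no account of which subtree a column belongs to; that is the source of the high message volume noted above.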

The "subtree-to-subcube" mapping, introduced in [49], does an excellent job of reducing communication while maintaining good load balance for model grid problems and other problems with similar regularity in their structure. The basic idea is quite simple. If P is the number of processors, one selects an appropriate set of P subtrees of the elimination tree, say T0, T1, ..., TP−1, and then assigns the columns corresponding to Ti to processor i (0 ≤ i ≤ P−1). Where two subtrees merge together into a single subtree, their processor sets are merged together and wrap-mapped onto the nodes/columns of the separator that begins at that point; nodes belonging to the same separator in the elimination tree are thus assigned to the processors in wrap fashion. The root separator is wrap-mapped onto the set of all available processors. Figure 10 shows this mapping for our model problem. Although the use of subcubes is specific to hypercube architectures, a similar processor clustering concept is applicable to most distributed-memory architectures.

FIG. 10. Subtree-to-subcube mapping of the columns of the matrix to four processors numbered 0, 1, 2, and 3.
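A simplified sketch of the subtree-to-subcube idea for a binary elimination tree (as arises from nested dissection of a model problem) follows. This is an illustration under those assumptions only, with names of our choosing, not a general-purpose mapping code:

```python
def subtree_to_subcube(parent, procs):
    """Assign columns to processors: separator chains are wrap-mapped
    onto their processor set, and where the tree branches the processor
    set is split between the two subtrees (binary tree assumed)."""
    n = len(parent)
    children = [[] for _ in range(n)]
    root = None
    for j in range(n):
        if parent[j] == -1:
            root = j
        else:
            children[parent[j]].append(j)
    assign = [None] * n

    def walk(j, procs, offset=0):
        assign[j] = procs[offset % len(procs)]   # wrap along the separator
        kids = children[j]
        if len(kids) == 1:                       # same separator continues
            walk(kids[0], procs, offset + 1)
        else:
            half = max(1, len(procs) // 2)       # split set among subtrees
            for i, c in enumerate(kids):
                sub = procs[:half] if i == 0 else procs[half:]
                walk(c, sub or procs)

    walk(root, procs)
    return assign
```

On a balanced binary tree the leaves of the P chosen subtrees end up on distinct processors while each separator is shared, in wrap fashion, by exactly the processors owning the subtrees beneath it.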



Gao and Parlett [35] prove the slightly stronger result that the communication volume for each processor is O(k²), which indicates that the overhead associated with communication is, in some sense, balanced among the processors. Closely related results can be found in two papers by Naik and Patrick [86,87]. It is quite easy and natural to obtain a good subtree-to-subcube mapping for elimination trees obtained by applying standard nested dissection orderings to model problems. It is difficult, however, to generalize the subtree-to-subcube mapping to more irregular problems. Progress in that direction is reported in [38] and [101], but an adequate understanding of the trade-offs between communication and load balance for more realistic problems will require further study.

3.3. Symbolic Factorization. On a distributed-memory MIMD multiprocessor, it is necessary to compute Struct(L*j) for every column j of L and to store Struct(L*j) on the processor responsible for computing that column. Thus, a distributed algorithm for computing the symbolic factorization is required. The sequential algorithm for this step is remarkably efficient, so once again we find ourselves with little work to distribute among the processors, and good efficiency is difficult to achieve in a parallel implementation. As we have seen, Struct(L*j) depends on Struct(A*j) and on Struct(L*k) for every k such that p(k) = j (i.e., for every child k of j in the elimination tree). In [42] a column-oriented parallel symbolic factorization algorithm is presented. At any point during the execution of this algorithm, the number of tasks available for parallel execution is limited to the number of leaves in the subtree of the elimination tree induced by the nodes whose structures are not yet complete. Limited parallelism, small task sizes, and communication overhead make it difficult to attain good speed-ups.
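The structural recurrence just described, in which Struct(L*j) is assembled from Struct(A*j) and the structures of j's children, can be sketched sequentially as follows (the function name and input layout are assumptions for illustration; the elimination tree is taken as given, via its parent vector):

```python
def symbolic_factorization(n, struct_A, parent):
    """Column structures of the Cholesky factor L from those of A.

    struct_A[j] holds the row indices i > j with A[i, j] nonzero;
    parent[j] is the parent of column j in the elimination tree
    (None for a root).  Returns struct_L with struct_L[j] = Struct(L*j).
    """
    struct_L = [set() for _ in range(n)]
    children = [[] for _ in range(n)]
    for j in range(n):
        if parent[j] is not None:
            children[parent[j]].append(j)
    for j in range(n):                  # children precede parents
        s = set(struct_A[j])
        for k in children[j]:           # merge each child's structure
            s |= struct_L[k]
        s.discard(j)                    # the diagonal is not stored
        struct_L[j] = {i for i in s if i > j}
    return struct_L
```

Note how column 1 in the test below acquires a fill entry in row 3 that is present in neither Struct(A*1) nor the original matrix, inherited from its child column 0.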
Moreover, the subscript compression technique so critical to the space and time efficiency of the sequential symbolic factorization algorithm can be only partially realized on these machines. For example, let columns j and j+1 of L be two columns belonging to the same supernode but assigned to two distinct processors, say p0 and p1, respectively. The sequential algorithm exploits the fact that Struct(L*(j+1)) = Struct(L*j) - {j+1} to save both time and storage, as discussed earlier in Section 2.2. The parallel algorithm, however, must store Struct(L*j) on processor p0 and Struct(L*(j+1)) on processor p1. Good mappings typically wrap-map columns belonging to the same supernode, so the situation in our illustration is typical, even pervasive; hence parallel symbolic factorization necessarily requires more total work and storage on distributed-memory MIMD multiprocessors, although the parallel completion time will usually still be less. The test results reported in [42] confirm that currently only modest speed-ups are attainable. It is possible to improve parallel symbolic factorization on distributed-memory MIMD multiprocessors if the supernodal structure is known in advance [81]. The key observation is that only the structure of the first column of each supernode needs to be computed. Processors holding other columns of that supernode do not have to compute the structures of those columns; they need only retrieve the structure from the processor responsible for computing the structure of the first column. In [110] Zmijewski and Gilbert present a row-oriented parallel symbolic factorization algorithm that has more potential parallelism, but it is more complicated and requires rearrangement of the output into a column-oriented format. Timing results for this algorithm are not presented, but the authors indicate that its cost is high.


M.T. HEATH, E. NG AND B.W. PEYTON

However, the problems they experimented with were quite small, so it remains unclear how competitive the algorithm might be on larger problems. In a study [53] that may be applicable on massively parallel machines, Gilbert and Hafsteinsson show, using a shared-memory CRCW (concurrent-read, concurrent-write) PRAM (parallel random access machine) model of computation, that there is a parallel algorithm for symbolic factorization that requires O(log² n) time using η(L) processors.

3.4. Numeric Factorization. On sequential machines, numeric factorization is typically much more expensive than the other steps in the solution process. As a result, parallel numeric factorization has received considerably more attention than the other steps in the parallel solution process. It is also more amenable to parallelization than the other solution steps, though it is still much more difficult to deal with than dense factorization. Development of reasonably good parallel sparse Cholesky algorithms has taken longer than development of their dense counterparts. The bookkeeping and irregular structure dealt with in the sparse algorithms present a greater challenge to the algorithm developer; consequently, many more issues and difficulties remain to be addressed in future work. Most of the work has been directed toward the development of parallel algorithms that exploit medium- and large-grain parallelism on shared-memory or distributed-memory MIMD machines. Some exceptions are work on vectorizing sparse Cholesky factorization on powerful vector supercomputers [3,5,11,19], work on fine-grained algorithms for massively parallel SIMD machines [54], and work on systolic-like algorithms for multiprocessor grids [18,106]. We will restrict our discussion to algorithms designed for MIMD machines.

3.4.1. Parallel column-Cholesky for shared-memory machines. Of the three formulations of sparse Cholesky, column-Cholesky is in many ways the simplest to implement.
As noted earlier, it has been more commonly used in sparse matrix software packages [17,27,29] than other methods, such as the multifrontal method, and it is probably better known to a broader audience than the other methods. George et al. [41] show that the algorithm can be adapted in a straightforward manner to run efficiently in parallel on shared-memory MIMD machines. For all these reasons this algorithm is an ideal place to begin our discussion of parallel sparse Cholesky algorithms.

A parallel algorithm. To facilitate our discussion, we introduce a more detailed version of the column-Cholesky algorithm shown earlier in Figure 3. In particular, we need to indicate how the row structure sets Struct(Lj*) are generated by the algorithm. The more detailed version of the algorithm shown in Figure 11 requires the following new notation. Let next(j, k), k < j, be the lowest-numbered column greater than j that requires updating by column k; that is, next(j, k) is the row index of the first nonzero in column k after row j. (Note that next(j, j) is merely the parent of j in the elimination tree.) The column index sets Si (1 ≤ i ≤ n) are initially empty, but when column j is processed, Sj = Struct(Lj*), as required. For simplicity and brevity, the algorithm in Figure 11 does not detail how to handle the case in which there is no "next" column to be updated. The use of the index sets Si and other implementation details of the serial algorithm are discussed in [47]. However, we note one particular detail of the implementation. Since each completed column k appears in no more than one set Si at any time during the algorithm's execution, a single n-vector link suffices to maintain all of the sets Si (1 ≤ i ≤ n) as singly linked lists [47].
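The single-link-vector trick rests on the invariant just stated: because each completed column lives in at most one set Si at a time, one n-vector of "next" pointers represents all n lists at once. A small sketch (class and method names are illustrative):

```python
class RowStructureSets:
    """Maintain the sets Si as singly linked lists threaded through
    one shared n-vector `link`, as in the serial algorithm of [47]."""

    def __init__(self, n):
        self.head = [-1] * n    # head[i]: first column in Si, or -1
        self.link = [-1] * n    # link[k]: next column after k in its list

    def add(self, i, k):
        """Si := Si ∪ {k}; column k must not belong to any other set."""
        self.link[k] = self.head[i]
        self.head[i] = k

    def pop_all(self, j):
        """Enumerate and empty Sj, as done when column j is processed."""
        k, out = self.head[j], []
        while k != -1:
            out.append(k)
            k = self.link[k]
        self.head[j] = -1
        return out
```

Total storage is two integer n-vectors regardless of how the sets evolve, which is why the serial algorithm pays essentially nothing for this bookkeeping.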



for j = 1 to n do
    Sj := ∅
for j = 1 to n do
    for k ∈ Sj do
        cmod(j, k)
        i := next(j, k)
        Si := Si ∪ {k}
    cdiv(j)
    i := next(j, j)
    Si := Si ∪ {j}

Fig. 11. Sparse column-Cholesky factorization algorithm, showing the computation of the row structure sets Struct(Li*) in the sets Si, 1 ≤ i ≤ n.

This algorithm can be implemented in parallel on a shared-memory MIMD machine in a fairly straightforward manner [40]. Each column j corresponds to a task Tcol(j), as discussed in Section 3.2. Initially, the task queue, denoted by Q, contains all column tasks Tcol(j), ordered by some topological ordering of the elimination tree. For ease of notation, we assume that the elimination ordering and the schedule-prescribed ordering are the same, so we have Q = {Tcol(1), Tcol(2), ..., Tcol(n)}.

As the computation proceeds, a processor obtains (and removes) the column task currently at the front of the queue and proceeds to compute that task. After completing the task, the processor obtains from Q another column task to compute, and it continues in this manner, as do all the other processors, until the factorization is complete. This simple "pool of tasks" approach does an excellent job of dynamically balancing the load, even though the column task profile for typical sparse problems is quite irregular. Obviously, access to this queue must be synchronized to ensure that each column task Tcol(j) is executed by one and only one processor. The parallel algorithm also must synchronize access to the n-vector link in which the sets Si (1 ≤ i ≤ n) are maintained. Only one processor at a time can modify this array, and thus the two sequences of instructions that manipulate link must be critical sections in the algorithm. A high-level description of the parallel algorithm is given in Figure 12.

Recent improvements. The algorithm in Figure 12 has two significant drawbacks. First, the number of synchronization operations (obtaining and relinquishing a lock) is O(n + η(L)), which is quite high. Second, since the algorithm does not exploit supernodes, it will not vectorize well on vector supercomputers with multiple processors, natural target machines for the algorithm. The introduction of supernodes into the algorithm deals quite effectively with both problems [88]. The use of supernodes to improve computational rates on vector supercomputers is well documented [3,5,11,19]. The duplicate sparsity structure found in columns within the same supernode enables one to organize the computation around level-2 or level-3 BLAS-like computational kernels. Such block operations reduce memory traffic by retaining and reusing data in cache, vector registers, or whatever limited rapid-access memory resource is provided on the particular machine in question.



Q := {Tcol(1), Tcol(2), ..., Tcol(n)}
for j = 1 to n do
    Sj := ∅
while Q ≠ ∅ do
    pop Tcol(j) from Q
    while column j requires further cmod's do
        if Sj = ∅ do
            wait until Sj ≠ ∅
        obtain k from Sj
        i := next(j, k)
        lock
            Si := Si ∪ {k}
        unlock
        cmod(j, k)
    cdiv(j)
    i := next(j, j)
    lock
        Si := Si ∪ {j}
    unlock

Fig. 12. Parallel sparse column-Cholesky factorization algorithm for shared-memory MIMD machines.
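The pool-of-tasks scheduling used in Figure 12 is easy to sketch with a shared queue. This is a simplified illustration only: the task body is a hypothetical stand-in, and the locking of the link vector and the waiting on the sets Si that the real algorithm requires are omitted, leaving just the dynamic load-balancing idea:

```python
import queue
import threading

def pool_of_tasks(n, num_workers, do_column_task):
    """Dynamic load balancing via a shared task queue: each worker
    repeatedly pops the next column task and executes it until the
    queue is empty.  `do_column_task(j)` stands in for the cmod/cdiv
    work on column j (an assumption for illustration)."""
    q = queue.Queue()
    for j in range(n):              # tasks in elimination order
        q.put(j)

    def worker():
        while True:
            try:
                j = q.get_nowait()  # obtain (and remove) the next task
            except queue.Empty:
                return              # no tasks left: this worker is done
            do_column_task(j)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because idle workers always take the next available task, an irregular task profile is balanced automatically, which is the property the text attributes to this scheme.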

In the following discussion, we will let bold-face integers 1, 2, ..., N stand for the supernodes; thus N ≤ n is the number of supernodes. We will also use bold-face capital letters such as J and K to denote supernodes by their indices, and lower-case letters such as i, j, and k to denote individual columns by their numbers. Let K be a supernode comprising the set of contiguous columns {k, k+1, k+2, ..., k+t}. Because of the sparsity structure shared by the columns of K, every column of K modifies column j, j > k+t, if and only if at least one column of K modifies column j. For example, column 40 in Figure 5 is modified by each of columns 37, 38, and 39 in the previous supernode, but it is modified by none of the columns 19, 20, and 21 that compose supernode 15. The block operation used to improve the algorithms in Figures 3 and 12 is a level-2 BLAS-like kernel, cmod(j, K), which modifies column j with a multiple of the appropriate entries of each column k ∈ K. In particular, the modifications from the columns in K can be accumulated as dense saxpy operations, and no indirect addressing is required until the result is applied to column j. For a column k+i ∈ K, we let cmod(k+i, K) denote the operation of updating column k+i with every column of K numbered earlier than k+i; that is, cmod(k+i, K) is given by cmod(k+i, K) := {cmod(k+i, k), cmod(k+i, k+1), ..., cmod(k+i, k+i-1)}. For the matrix in Figure 5, cmod(30, 22) is given by cmod(30, 22) := {cmod(30, 28), cmod(30, 29)}. Since columns k, k+1, ..., k+i-1 in supernode K have the same structure below row k+i-1, the modifications to column k+i can again be performed by dense saxpy

operations, with no indirect addressing required. The next column to be updated by supernode K after it has updated column j is denoted by next(j, K), and similarly the first column outside supernode K requiring modification by the columns of K is denoted by next(K, K). Using this notation, Figures 13 and 14 display supernodal versions of the sequential and parallel column-Cholesky algorithms shown in Figures 11 and 12, respectively.

for j = 1 to n do
    Sj := ∅
for J = 1 to N do
    for j ∈ J do
        for K ∈ Sj do
            cmod(j, K)
            i := next(j, K)
            Si := Si ∪ {K}
        cmod(j, J)
        cdiv(j)
        if j is the last column of supernode J do
            i := next(J, J)
            Si := Si ∪ {J}

Fig. 13. Sequential sparse supernodal column-Cholesky factorization algorithm.

Q := {Tcol(1), Tcol(2), ..., Tcol(n)}
for j = 1 to n do
    Sj := ∅
while Q ≠ ∅ do
    pop Tcol(j) from Q
    let J be the supernode containing column j
    while column j requires further cmod's do
        if Sj = ∅ do
            wait until Sj ≠ ∅
        obtain K from Sj
        i := next(j, K)
        lock
            Si := Si ∪ {K}
        unlock
        cmod(j, K)
    cmod(j, J)
    cdiv(j)
    if j is the last column of supernode J do
        i := next(J, J)
        lock
            Si := Si ∪ {J}
        unlock

Fig. 14. Parallel sparse supernodal column-Cholesky factorization algorithm for shared-memory MIMD machines.
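The dense accumulation at the heart of cmod(j, K) can be illustrated with a small NumPy sketch. The argument layout and function name are assumptions for illustration, not the format of any cited code: the supernode's columns are held as a dense block on their shared row structure, the updates are accumulated densely (one matrix-vector product standing in for a run of saxpys), and indirect addressing occurs only once, when the accumulated vector is scattered into column j:

```python
import numpy as np

def cmod_supernode(Lj_vals, Lj_rows, LK_vals, LK_rows, j):
    """Apply cmod(j, K): update column j with every column of supernode K.

    LK_vals is a dense (len(LK_rows) x |K|) block holding the supernode's
    columns on their shared row structure LK_rows; Lj_vals holds column j
    on structure Lj_rows (which includes row j itself).  The rows of K at
    and below row j are assumed to appear in column j's structure, as the
    supernodal theory guarantees for Cholesky.
    """
    jpos = LK_rows.index(j)                 # position of row j within K
    mults = LK_vals[jpos, :]                # multipliers L[j, k], k in K
    u = LK_vals[jpos:, :] @ mults           # dense accumulation, no indices
    row_to_pos = {r: t for t, r in enumerate(Lj_rows)}
    for r, val in zip(LK_rows[jpos:], u):   # single indirect scatter
        Lj_vals[row_to_pos[r]] -= val
    return Lj_vals
```

The matrix-vector product is exactly the level-2 BLAS-like kernel the text describes; replacing one target column by a block of them would turn it into a level-3 (matrix-matrix) kernel.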

The use of supernodes reduces the number of synchronization operations to a number proportional to σ(L), the number of subscripts in the supernodal representation of the sparsity structure of L, which is often much less than η(L), sometimes by as much as an order of magnitude [46].

3.4.2. Distributed fan-out algorithm. Algorithms for distributed-memory machines are usually structured around some prior distribution of the data to the processors. The distributed fan-out, fan-in, and multifrontal algorithms are typical examples of this type of distributed algorithm (the fan-in and multifrontal algorithms will be discussed in the following subsections). The differences among these algorithms stem from the various formulations of serial sparse Cholesky upon which they are based. These three distributed algorithms are all designed within the following framework.
• All three require assignment of the matrix columns to the processors. Each column L*k is stored on one and only one of the P available processors. An n-vector map records the distribution of columns to processors: if column k is stored on processor p, then map[k] := p. We let mycols(p) denote the set of columns owned by processor p.
• All three use the column assignment to distribute among the processors the tasks found in the outer loop of one of the serial implementations of sparse Cholesky factorization. The fan-out algorithm is based on submatrix-Cholesky; it partitions each task Tsub(k) among the processors. The fan-in algorithm is based on column-Cholesky; it partitions each task Tcol(j) among the processors. The distributed multifrontal algorithm partitions among the processors the tasks upon which the sequential multifrontal method is based: partial dense submatrix-Cholesky factorization and the assembly operations.
In order to keep the cost of interprocessor communication at acceptable levels, it is essential for such an algorithm to make use of local data as much as possible.

The algorithm introduced in [43], now known as the fan-out algorithm, was the first sparse Cholesky factorization algorithm developed for distributed-memory machines. It is a parallel version of the submatrix-Cholesky factorization algorithm shown in Figure 4. The fan-out algorithm is a data-driven algorithm: the data sent from one processor to another are the completed factor columns. We will denote by Tsub(k) the k-th task performed by the outer loop of the algorithm; Tsub(k) first obtains L*k by performing the cdiv(k) operation and then performs all column modifications that use the new column. The outer loop of the fan-out algorithm constantly checks the message queue for incoming columns. When a processor p receives a column L*k, it uses it to modify every column j ∈ mycols(p) for which cmod(j, k) is required. In other words, each task Tsub(k) is partitioned among the processors by the partition defined by the column mapping. We now detail this partitioning.

The column partition induces the partition of Tsub(k) into subtasks Tsub(k, 1), ..., Tsub(k, P), where Tsub(k, p) := {cmod(j, k) | j ∈ Struct(L*k) ∩ mycols(p)}, with each non-empty task Tsub(k, p) assigned to processor p, the owner of the columns updated by the task. Of course, many of these tasks will be empty. Only the processors in the set procs(L*k) := {map[j] | j ∈ Struct(L*k)} require column L*k. When processor p = map[j] has completed all column modifications required by column j, it performs cdiv(j) and sends L*j to every processor in procs(L*j), where it eventually is used to modify later columns in the matrix. The algorithm is shown in Figure 15.

for j ∈ mycols(p) do
    if j is a leaf node in T(A) do
        cdiv(j)
        send L*j to the processors in procs(L*j)
        mycols(p) := mycols(p) - {j}
while mycols(p) ≠ ∅ do
    receive any column of L, say L*k
    for j ∈ Struct(L*k) ∩ mycols(p) do
        cmod(j, k)
        if column j requires no more cmod's do
            cdiv(j)
            send L*j to the processors in procs(L*j)
            mycols(p) := mycols(p) - {j}

Fig. 15. Fan-out Cholesky factorization algorithm for processor p of a distributed-memory MIMD machine.

Historically, the fan-out algorithm was the first to be implemented on a distributed-memory machine, but due to several weaknesses it has since been superseded by fan-in algorithms and distributed multifrontal algorithms. The distributed fan-out algorithm incurs greater interprocessor communication costs than the other two methods, both in terms of total number of messages and total message volume. It simply does not exploit a good communication-reducing column mapping, such as the subtree-to-subcube mapping, as effectively as the other methods do. Ashcraft et al. [9] and Zmijewski [109] have independently improved the algorithm by having it send aggregated update columns rather than individual factor columns for columns belonging to a subtree that has been mapped to a single processor. Though the resulting improvement in performance is substantial, it still is insufficient to make the method competitive.

Another problem with the method is the expense of mapping the entries of the updating column k to the corresponding entries of the updated column j when performing cmod(j, k). The set Struct(L*k) must accompany the factor column L*k when it is sent to other processors to enable these processors to complete column modifications of the form cmod(j, k), k ∈ Struct(Lj*). This roughly doubles the communication volume and creates a more complicated message that must be packed by the sending processor and unpacked by the receiving processor. Moreover, each cmod(j, k) requires that both index sets Struct(L*j) and Struct(L*k) be searched in order to match indices, which results in poor serial efficiency. These weaknesses have provoked efforts to develop better distributed factorization algorithms.

3.4.3. Distributed fan-in algorithm. One of the improved distributed factorization algorithms is the fan-in algorithm, introduced by Ashcraft et al. in [10]. Note that throughout this subsection we freely use the notation introduced in the previous subsection. The fan-in algorithm is a demand-driven algorithm: the data required are aggregated update columns, computed by the sending processor using columns it owns and needed by the receiving processor to update a target column. Based on the sparse column-Cholesky algorithm, it distributes each column task Tcol(j) among the processors in a manner similar to the distribution of the tasks Tsub(k) in the fan-out algorithm. Let u(j, k) denote the scaled column accumulated into the factor column by the cmod(j, k) operation. More precisely, the column partition {mycols(1), mycols(2), ..., mycols(P)} induces the partition of Tcol(j) into subtasks of the form Tcol(j, p), where Tcol(j, p) aggregates into a single update vector every update vector u(j, k) for which k ∈ mycols(p) ∩ Struct(Lj*), with each non-null task Tcol(j, p) assigned to processor p, the owner of the updating columns used by the task. As with the fan-out algorithm, each processor p is responsible for computing cdiv(j) for every column j ∈ mycols(p). Of course, cdiv(j) cannot be computed until all modifications cmod(j, k), k ∈ Struct(Lj*), have been performed.

The outer loop of the algorithm processes every column j of the matrix in ascending order by column number. When processor p processes column j, it aggregates into a single update vector u every update vector u(j, k) for which k ∈ Struct(Lj*) ∩ mycols(p). If processor p does not own column j, it then sends the resulting aggregated update column to processor q = map[j], which will eventually incorporate it into column j. If, on the other hand, processor p does own column j, it must receive and process any aggregated update columns required by column j from other processors before it can complete the cdiv(j) operation. The fan-in algorithm is given in Figure 16. It is interesting to note that a column j ∈ mycols(p) will receive an aggregated update column from every processor in the set {map[k] | k ∈ Struct(Lj*), map[k] ≠ map[j]}. Viewed in a more general way, the fan-in method is analogous to the standard parallel algorithm for a dot product, in which each processor first locally reduces the data assigned to it down to a single number and then participates in a global phase during which the processors cooperate in reducing down to a single number the P local reductions generated during the preceding "perfectly parallel" phase. Indeed, the name "fan-in" is taken from the fan-in distributed algorithm for dense triangular solution [58], which computes a series of inner products in precisely this manner.

PARALLEL ALGORITHMS FOR SPARSE LINEAR SYSTEMS 113 for j := 1 to n do if j € mycols(p) or Struct(Lj+) PI mycols(p) / 0 do w:=0 for k £ Struct(Lj*) n mycols(p) do u := u + u(j. The divideand-conquer nature of the nested dissection ordering is clearly visible in Figure 18. k) if mapL/] / p do send « to processor g = map[. such that map[fc] / map\j]. Compute-ahead fan-in algorithm. A more visual comparison of the communication patterns of the fan-out and fanin algorithms is given in Figures 17 and 18. with time on the horizontal axis. The problem being solved is the factorization of a matrix of order 225 derived from a model finite element problem on a 15 x 15 grid. depending on whether the processor is busy or idle. In contrast. the fan-out algorithm sent the factor column L+J to every processor in the processor set procs(L*j). observe that processor p will fall idle if. which also illustrates the ability of the fan-in algorithm. Consider the communication costs incurred by the two algorithms during the computation of columns that constitute a subtree of the elimination tree that has been mapped to a single processor by a subcube-to-subtree mapping. Processor activity is shown by horizontal lines and interprocessor communication by slanted lines. These diagrams were produced using a package developed at Oak Ridge National Laboratory for visualizing the behavior of parallel algorithms [57]. Each message sent between processors is shown by a line drawn from the sending processor at the time of transmission to the receiving processor at the time of reception of the message. The horizontal line corresponding to each processor is either solid or blank. On the other hand. Struct(Lj+) also belongs to the subtree. using a nested dissection ordering and subtree-to-subcube mapping on eight processors. while receiving aggregated update columns destined for a column j £ . In Figure 16. By contrast.] 
else incorporate u into the factor column j while any aggregated update column for column j remains unreceived do receive in u another aggregated update column for column j incorporate u into the factor column j cdiv(j) FlG. respectively. 16. given an appropriate mapping. the fan-out algorithm must send L+j to another processor if there is a column index k e Struct(L*j) for some column j in the subtree. These figures illustrate snapshots of the execution of the two algorithms on an Intel iPSC/2 hypercube. For the fan-in algorithm there will be no communication during this portion of the computation. to exploit this structure to reduce communication. This observation is an informal indication of why the fan-in algorithm is better than the fan-out algorithm at exploiting a good mapping to reduce interprocessor communication. because for every column j in the subtree. Fan-in sparse Cholesky factorization algorithm for processor p of a diatributed-memory MIMD machine. even under the ideal conditions represented here. the fan-out algorithm shown in Figure 17 exhibits much greater communication traffic as well as a less regular communication pattern.

Fig. 17. Communication pattern of fan-out algorithm for a model problem.

Fig. 18. Communication pattern of fan-in algorithm for a model problem.

One straightforward enhancement to the method is to probe the queue for such messages and, when there are none, to proceed with useful work on later factor columns. There are two types of compute-ahead tasks to be performed on later columns of the factor:
1. For some column i > j, aggregate into a work vector the update vector u(i, k) for each completed column k ∈ Struct(Li*) ∩ mycols(p).
2. Receive an aggregated update column for some column i > j, and incorporate it into the factor column.
When unable to complete the current column j, the algorithm toggles between performing these so-called compute-ahead tasks on columns i > j and detecting and processing incoming aggregated updates for the current column j. Compute-ahead aggregation of update columns is limited to target columns i > j that belong to the same supernode as the current column j. Compute-ahead tasks of the first type have priority over compute-ahead tasks of the second type; tasks of the second type are performed only when the algorithm has exhausted its supply of tasks of the first type. Since the aggregated update columns are managed so that they share the same sparsity structure as the target column, no indirect indexing is required to incorporate them into the factor column; thus compute-ahead tasks of the second type require merely a receive, followed by a saxpy.

Though supernodes play an important role in organizing the compute-ahead fan-in algorithm, current implementations of both the basic and compute-ahead fan-in algorithms do not exploit supernodes to reduce memory traffic in the inner loops of the computation, one of their key roles in the parallel shared-memory column-Cholesky algorithm. This is due primarily to the ease and "naturalness" with which successive aggregated update columns sharing the same sparsity pattern can be computed. There is no reason why supernodes cannot serve in this role in the fan-in algorithm also. More generally, it is interesting to note that the potential exploitation of supernodes in distributed-memory algorithms is somewhat limited, because good mappings typically distribute the columns of a supernode among several processors. For details concerning these and other implementation issues, consult [8].

3.4.4. Parallel multifrontal algorithms. The original motivation for developing frontal methods was more effective use of auxiliary storage in the out-of-core solution of sparse systems. The fundamental idea in frontal methods is to keep only a relatively small portion of the matrix in main memory at any given time and to use a full matrix representation for this "active" portion of the matrix. The active matrix is dense, so that computations involving it are more efficient on scalar machines and more readily vectorized on vector machines; the method obtains more efficient inner-loop computations by avoiding the indirect addressing that is characteristic of general sparse data structures. Although the data structure for the active matrix is very simple, the overall data management required in frontal methods is quite complicated, involving the assembly of matrix elements, their insertion into the proper location in the full active matrix data structure, and the writing of completed portions of the factor to disk, all of which must account for the fact that the active matrix constitutes a moving "window" through the problem. The success of frontal methods is dependent on keeping the size of the active matrix small, which in turn depends on the structure of the problem and the ordering used in solving it. In structural analysis, for example, a long thin truss is ideal for a frontal solution technique in that, with an appropriate ordering, a single narrow "front" passes along the length of the truss.

The front then stays small, essentially minimizing the use of indirect addressing, which is one of the major strengths of the method. If a single front would become unacceptably large, then multiple fronts can be employed, leading to multifrontal methods; in this sense, multifrontal methods are generalizations of single-front methods. Of course, the various fronts must eventually merge before the problem can be completed, but the hope is that with an appropriate ordering such mergers can be postponed as late as possible in the computation. The use of multiple fronts seems to suggest an obvious parallel implementation: simply assign a separate front to each processor. As we shall see, however, the situation is not quite so straightforward.

Background. A self-contained presentation of parallel multifrontal algorithms would occupy more space than we can afford in an article of this scope; this section is limited to a brief overview of the literature on the subject and a short discussion of some of the problems that arise in parallel implementations. The reader should consult [28] or [78] for background material on multifrontal methods. The difficulties in producing a brief but clear description stem primarily from the complexity of the method: a sequential multifrontal code is considerably more complicated than a sequential sparse column-Cholesky code, though it is by no means unmanageable. As might be expected, modifying the method to run on MIMD machines is also more difficult and complicated. Nevertheless, there have been implementations on both shared-memory [12,36,105] and distributed-memory [9,23,82] machines. We should also point out that some of the codes and algorithms cited in this section are designed for nonsymmetric linear systems, and at least one includes pivoting for stability. For instance, the work in [22] and [23] is based on the Harwell MA37 code, which solves nonsymmetric systems and pivots for stability. Nevertheless, such codes can be discussed within the framework of this article because they perform a symbolic factorization of the structurally symmetric matrix A + A^T and compute a structurally symmetric numerical factorization of A within the resulting data structure. Therefore, much of the material in [22] and [23] is directly applicable to sparse multifrontal Cholesky factorization.

As noted in Section 2.2, the multifrontal method is a sophisticated variant of the sparse submatrix-Cholesky factorization algorithm (Figure 4) in which the cmod(j, k) operations are not applied directly to column j of the factor matrix. Instead, each update is accumulated and passed on through a succession of update matrices until it is finally incorporated into the target column. The outer loop of the serial multifrontal algorithm processes the supernodes 1, 2, ..., N in order. The order in which the supernodes are processed is critical: for reasons discussed below, they are processed in the order in which they are visited by a postorder traversal of a supernodal elimination tree. A supernodal elimination tree with 31 supernodes is displayed in Figure 6. Every supernode K has associated with it a frontal matrix in which the factor and update columns associated with the supernode are computed; these columns are stored in a dense matrix format. The algorithm performs two tasks within this frontal matrix:
1. The assembly step inserts the required data into the frontal matrix.
2. After the assembly step, dense partial submatrix-Cholesky factorization within the frontal matrix generates the factor and update columns.
We discuss first the partial submatrix-Cholesky factorization step and then the assembly step in more detail. Suppose that K contains r columns of the matrix, and assume that the assembly step for supernode K's frontal matrix has been completed.

Subdividing the working storage in an effort to localize the garbage collection operations and reduce their negative effect on parallelism proved to be too complicated and ineffective [22]. A parallel multifrontal code developed by . For any given update matrix. During the course of the computation. limiting both the storage and time required to store them. A parallel multifrontal code developed by Lucas [83] for the CRAY 2 allocates subtrees to individual processors and has each processor manage a local stack for its assigned subtree. 3. they process the supernodes in some order that allows greater exploitation of the large-grained (subtree-level) parallelism. the update matrices can be stored efficiently on a stack. 2. The postordering of supernodes used in the sequential algorithm severely limits the parallelism available. Garbage collection to reclaim unused storage requires a critical section that seriously inhibits parallelism. there are eventually more processors than independent subtrees." obtain its associated update matrix and add each entry to its corresponding entry in the frontal matrix. after which the first r columns of the frontal matrix contain the r factor columns of K. Instead. In [22] and [23]. The assembly step consists of the following three steps: 1. it successively processes the remaining tasks Ts«6(K). Duff considered several strategies for dealing with the resulting fragmentation of working storage. the code abandons the use of subtree-level parallelism. Henceforth. while update matrices for child supernodes are popped off the stack as needed during each assembly step. For each "child supernode. Zero out the frontal matrix. In [23]. We are aware of two other parallel multifrontal codes designed to run on parallel shared-memory MIMD machines. To create more independent processes. These trailing columns constitute the update matrix generated by this block elimination step.105]. 
One key problem associated with parallel multifrontal algorithms for shared-memory MIMD multiprocessors is the management of auxiliary storage for the update matrices. algorithm developers have abandoned the postordering and the stack of update matrices. Insert the required entries of A into the appropriate locations of the matrix. Instead. The scheme is guaranteed to waste no more than half the working storage. using CRAY autotasking to partition each task Tsub(K) among all the processors. Duff used the buddy system to manage the storage for update matrices. and the trailing columns in the frontal matrix contain aggregated update columns for later columns of the matrix.PARALLEL ALGORITHMS FOR SPARSE LINEAR SYSTEMS 117 computes r major steps of dense submatrix-Cholesky factorization within the frontal matrix. the buddy system obtains a free block of storage of length 2*. Shared-memory MIMD machines. increasing both the storage and time required by this part of the algorithm [23.83. it limits exploitation of the parallelism that exists among the many disjoint subtrees of the elimination tree available in most realistic problems. Because the supernodes are ordered by a postorder traversal of the elimination tree. The update matrix is stored and assembled later into the frontal matrix of its "parent supernode" in the elimination tree. but which complicates management of the working storage for update matrices. New update matrices are pushed onto the stack as soon as they are generated. At that point. Breaking up individual update matrices to make better use of free storage was not considered because it would destroy the data locality vital for efficient use of cache — one of the important strengths of the multifrontal method and a key consideration on the Alliant FX/8 [23]. in particular. where k is the smallest power of two that provides enough contiguous storage locations to hold the matrix. we will denote this task by Tsub(K).
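To make the serial kernel concrete, the following is a minimal NumPy sketch of processing one supernode: the frontal matrix is assembled from the entries of A and the children's update matrices, and then r major steps of dense partial submatrix-Cholesky are performed. The function name, data layout, and index bookkeeping are illustrative assumptions, not taken from any of the codes cited in the text.

```python
import numpy as np

def process_supernode(A_entries, rows, r, child_updates):
    """One serial multifrontal step for a supernode whose frontal matrix
    touches the global row indices `rows` (the first r of them are the
    supernode's own columns).  `A_entries` is the dense |rows| x |rows|
    matrix of original entries of A for this front; `child_updates` is a
    list of (child_rows, child_matrix) pairs popped from the update stack.
    Illustrative sketch only, not any particular cited code."""
    n = len(rows)
    pos = {g: i for i, g in enumerate(rows)}   # global row -> local index

    # Assembly: zero the frontal matrix, insert the entries of A, then
    # scatter-add each child's update matrix into its proper locations.
    F = np.zeros((n, n))
    F += A_entries
    for crows, U in child_updates:
        idx = np.array([pos[g] for g in crows])
        F[np.ix_(idx, idx)] += U

    # r major steps of dense partial submatrix-Cholesky (right-looking).
    for j in range(r):
        F[j, j] = np.sqrt(F[j, j])                 # cdiv: scale column j
        F[j+1:, j] /= F[j, j]
        F[j+1:, j+1:] -= np.outer(F[j+1:, j], F[j+1:, j])  # cmod updates

    factor_cols = F[:, :r]       # the r completed factor columns of L
    update_matrix = F[r:, r:]    # aggregated updates passed to the parent
    return factor_cols, update_matrix
```

With no trailing columns (r equal to the front size), the lower triangle of the returned factor columns coincides with the ordinary dense Cholesky factor of the assembled front.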

A second issue discussed in [23] is partitioning the work among the processors for execution in parallel. Duff parameterizes his code so that it can spawn tasks of any granularity between two extremes, the largest being the tasks Tsub(K), and the smallest being individual cmods and cdivs. His results indicate that working with small blocks of columns is most effective. The algorithms of Lucas [83] and Vu [105] instead use the autotasking capabilities of their target machines to partition the tasks Tsub(K) among the processors.

Distributed-memory MIMD machines. Lucas [82,84] developed the first implementation of the multifrontal method for distributed-memory MIMD machines. Since then, Ashcraft [9] has also developed parallel multifrontal codes for such machines. Lucas's code and the first code developed by Ashcraft implement essentially the same distributed multifrontal algorithm [6,83]. This section contains a brief discussion of a few features of this algorithm. Here, we restrict our attention to issues associated with distributing the tasks Tsub(1), Tsub(2), ..., Tsub(N) among the processors.

The situation is not as simple as it is for parallel column-Cholesky, where simply dealing out the column tasks Tcol(j), with some care in the scheduling, is very effective in exploiting both subtree- and column-level parallelism (see Section 3.4.1). As with other distributed factorization algorithms, parallelism decreases as the computation proceeds toward the root supernode and disappears altogether when the root supernode is reached. Near the root of the supernodal elimination tree, most of the work is performed in the larger frontal matrices associated with supernodes near the root. If the multifrontal method distributes indivisible tasks Tsub(K) among the processors in a similar fashion, then, as noted in [22] and [26], smaller granularity is required for acceptable performance. Thus, the tasks Tsub(K) for supernodes K near the root supernode must be partitioned into smaller tasks and distributed among the processors.

More precisely, as with other distributed factorization algorithms, each column k of the matrix is typically assigned to and stored on one processor, map[k]. Consider a supernode K and let map(K) denote the set of processors that own at least one column of K or a descendant of a column of K in the elimination tree. The key feature of the algorithm is the distribution of all the columns of K's frontal matrix among the processors in map(K). The processors in map(K) work together to perform the task Tsub(K), that is, dense submatrix-Cholesky factorization on the first |K| columns of the distributed frontal matrix. The algorithm used to perform this task can be viewed as a straightforward adaptation of the parallel dense submatrix-Cholesky algorithm presented in [37]. When processor p = map[k] ∈ map(K) completes factor column k ∈ K, it broadcasts L*k to the other processors in map(K). The other processors in map(K) at some point receive L*k and use it to modify every column of the frontal matrix that they own. Thus, this phase of the algorithm is very similar to the fan-out algorithm shown in Figure 15. That is, both the factor columns and the aggregated update columns generated by the task Tsub(K) are distributed among the processors in map(K).

Before the task Tsub(K) can be performed, supernode K's distributed frontal matrix must be assembled. Contributions from distributed update matrices for any children of K must be sent to the appropriate processors and scatter-added into the appropriate frontal matrix locations. Thus, if an update column from a "child" update matrix is located on a different processor than the corresponding column of its "parent" frontal matrix, then the aggregated update column must be sent to its "new owner," where it is incorporated into the appropriate column of the frontal matrix.

Both phases of the factorization require interprocessor communication. The factorization phase performs a restricted broadcast of completed factor columns, while the assembly phase moves aggregated update columns from one processor to another. The two forms of communication result in somewhat higher communication cost for the multifrontal algorithm than that incurred by the fan-in algorithm, but its extra communication overhead is far smaller than that incurred by the pure fan-out algorithm. Further enhancements to the algorithm have been made, and preliminary results indicate similar performance for the fan-in and distributed multifrontal algorithms [9]; a systematic comparison of all the distributed-memory factorization algorithms will appear in [7].

3.5. Triangular solution. There is relatively little to say about parallel algorithms for forward and backward triangular solutions. Data dependencies and a paucity of work to distribute among the processors make it very difficult to achieve high computational rates, even for dense problems. Heath and Romine [58] and Eisenstat et al. [30] have shown that intricate pipelining techniques are required to achieve computational rates as high as 50% efficiency for large dense problems on distributed-memory hypercube multiprocessors.

Two factors make the situation even more difficult in the sparse case. First, due to preservation of sparsity in the factor matrix, there is usually far less work to distribute among the processors: approximately η(L) flops rather than the n(n-1)/2 flops available in the dense case. Second, the successful pipelining techniques used in [30,58] appear to require the extremely regular structure of a dense matrix. Loss of this regularity in sparse Cholesky factors increases the difficulty of using these complicated techniques to speed up sparse triangular solution. These difficulties are mitigated somewhat by the subtree-level parallelism that is available only in the sparse case. Unfortunately, though the parallel algorithms for sparse forward and back triangular solutions contained in [44] exploit this parallelism, they nonetheless performed rather poorly. Generalizing the dense pipelining techniques so that they can be incorporated into a parallel sparse triangular solution algorithm is a possible avenue for future improvement. A step in this direction has been made by Zmijewski [108], who considered the use of cyclic algorithms for solving sparse triangular systems. Other work on parallel sparse triangular solution algorithms [4,56,85,102] has been directed primarily toward use in the preconditioned conjugate gradient algorithm; some of the work in [4] is applicable to complete, as well as incomplete, Cholesky factorizations.

4. Concluding remarks. In this paper, we have provided a summary of parallel algorithms currently available for the four phases in the solution of sparse symmetric positive definite systems, namely, ordering, symbolic factorization, numerical factorization, and triangular solution. It is clear from the relative length of the discussions that much of this research has been focused on the design and implementation of parallel numerical factorization algorithms. Although there have been attempts at developing parallel algorithms for the other phases, these algorithms have generally been less successful and lacking in efficiency. Some of these algorithms have exhibited reasonable speed-up ratios, particularly on shared-memory MIMD multiprocessors, but much research is needed in those areas. The ordering problem seems particularly problematic in a distributed-memory environment because of the difficulty of partitioning the graph of the matrix among the processors in an intelligent way before the ordering is determined.
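The flop-count point can be made concrete with a minimal column-oriented sparse forward solve: it performs roughly one multiply-add per nonzero of L, about η(L) operations in all, and its tight dependence of later entries on earlier columns illustrates why so little parallel work is available. The CSC-style storage layout and names below are illustrative assumptions, not a description of any cited code.

```python
def sparse_forward_solve(n, colptr, rowind, values, b):
    """Column-oriented forward solve L x = b for a sparse lower-triangular
    factor stored in compressed-column (CSC-like) form: column j keeps its
    nonzeros in rows rowind[colptr[j]:colptr[j+1]], diagonal entry first.
    Roughly one multiply-add per nonzero of L, i.e. about eta(L) flops."""
    x = list(b)
    for j in range(n):
        start, end = colptr[j], colptr[j + 1]
        x[j] /= values[start]              # divide by the diagonal L[j,j]
        for p in range(start + 1, end):    # cmod-style updates to later rows
            x[rowind[p]] -= values[p] * x[j]
    return x
```

Each x[j] must be finished before any later column that L couples to it can proceed, so the available concurrency is governed by the structure of L (in the sparse case, by the elimination tree) rather than by the problem size alone.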

. 11 (1989). High Speed Comput. there may not be enough memory on a single processor to carry out symbolic factorization and/or triangular solution serially. which have improved the presentation of the material. Boeing Computer Services. J.96]. This may be true for MIMD multiprocessors with globally shared memory. J. many algorithms require multiple triangular solutions. 1985. Appl.59. AsHCRAFT. AMESTOY AND I. Parallel algorithms for sparse least squares problems are discussed in [16. A compute-ahead implementation of the fan-in sparse distributed factorization scheme.. 201-221. 41-59.120 M.20. TN. ALAGHBAND. Solving sparse triangular linear systems on parallel computers. The authors would like to thank Eduardo D'Azevedo. REFERENCES [l] G. Yale University. 3 (1989). Oak Ridge. Tech. ASHCRAFT. [3] P. JORDAN. 1990. Engineering Technology Applications Division.52]. Internat. PEYTON It may be argued that current sequential algorithms for symbolic factorization and triangular solution are so efficient that perhaps they can be used on one processor in a multiprocessor environment instead of developing parallel versions. symmetric positive definite linear systems. Some recent examples can be found in [1. Report ETA-TR-51. parallel algorithms are still needed to make use of the large (collectively) local memory available on distributed-memory parallel machines for solving large problems. First. J. Report ORNL/TM-11496. 73-95. [8] C. PEYTON. ANDERSON AND Y. Work has also been done on parallel algorithms for other matrix computations. Parallel pivoting combined with parallel reduction and fill-in control. Second. ICASE. pp. research on the design of efficient parallel algorithms for symbolic factorization and triangular solution will be necessary eventually. S. Personal communication. Virginia. Dept. [7] . . In the case of direct methods for solving sparse nonsymmetric linear systems. Internal. A vector implementation of the multifrontal method for large sparse. 
It may be argued that current sequential algorithms for symbolic factorization and triangular solution are so efficient that perhaps they can be used on one processor in a multiprocessor environment instead of developing parallel versions, even if they are somewhat inefficient. This may be true for MIMD multiprocessors with globally shared memory. On MIMD multiprocessors with local memory, however, there are at least two reasons why parallel algorithms are needed for symbolic factorization and triangular solution, even if these algorithms may be less efficient than their sequential counterparts. First, there may not be enough memory on a single processor to carry out symbolic factorization and/or triangular solution serially; parallel algorithms are still needed to make use of the large (collectively) local memory available on distributed-memory parallel machines for solving large problems. Second, many algorithms require multiple triangular solutions. Third, although symbolic factorization and triangular solution are often the least expensive phases in the solution process on serial machines, they may become the dominant phases as more efficient parallel numerical factorization algorithms are developed. Thus, research on the design of efficient parallel algorithms for symbolic factorization and triangular solution will be necessary eventually.

Our emphasis in this paper has been on parallel direct methods for solving sparse symmetric positive definite systems. In the case of direct methods for solving sparse nonsymmetric linear systems, much of the research has been carried out on shared-memory MIMD multiprocessors; some recent examples can be found in [1,2,20,51,52]. There has been a great deal of research on parallel iterative methods for solving large sparse linear systems as well. For a summary of such work and references to this extensive literature, see the book by Ortega [89]. Work has also been done on parallel algorithms for other matrix computations. Parallel algorithms for sparse least squares problems are discussed in [16,96]. Many additional references on all aspects of parallel matrix computations can be found in [90].

Acknowledgment. The authors would like to thank Eduardo D'Azevedo, Alan George, and Joseph Liu for their suggestions and comments, which have improved the presentation of the material.

REFERENCES

[1] G. ALAGHBAND, Parallel pivoting combined with parallel reduction and fill-in control, Parallel Computing, 11 (1989), pp. 201-221.
[2] G. ALAGHBAND AND H. JORDAN, Multiprocessor sparse L/U decomposition with controlled fill-in, Tech. Report 85-48, ICASE, NASA Langley Research Center, Hampton, Virginia, 1985.
[3] P. AMESTOY AND I. DUFF, Vectorization of a multiprocessor multifrontal code, Internat. J. Supercomp. Appl., 3 (1989), pp. 41-59.
[4] E. ANDERSON AND Y. SAAD, Solving sparse triangular linear systems on parallel computers, Internat. J. High Speed Comput., 1 (1989), pp. 73-95.
[5] C. ASHCRAFT, A vector implementation of the multifrontal method for large sparse, symmetric positive definite linear systems, Tech. Report ETA-TR-51, Engineering Technology Applications Division, Boeing Computer Services, Seattle, Washington, 1987.
[6] ——, A compute-ahead implementation of the fan-in sparse distributed factorization scheme, Tech. Report ORNL/TM-11496, Oak Ridge National Laboratory, Oak Ridge, TN, 1990.
[7] ——, Personal communication, 1990.
[8] C. ASHCRAFT, S. C. EISENSTAT, AND A. H. SHERMAN, Tech. Report, Dept. of Computer Science, Yale University, 1990.

KUCK. [18] J. R. MATTHEYSES. On George's nested dissection method. SHERMAN. Modified cyclic algorithms for solving triangular systems on distributed-memory multiprocessors. K. Appl. Sparse orthogonal decomposition on a hypercube multiprocessor. 1987. J. The multifrontal method in a parallel environment. Oxford University Press. DUFF. 78-35. [22] I. pp. [19] A. A. 1 (1987).. 26 (1984). Meth. SIAM J. J.-H. (To appear). KENNEDY. M. 1984. Report PAM-436. GUSTAVSON. Math. YEW. GALLIVAN. BERMAN AND G. pp. R. J. Supercomp. CONROY. Software. SIAM J. Tiebreaking the minimum degree algorithm for ordering sparse symmetric positive definite matrices. 1982. Department of Computer Science. [31] C. 54-135. Department of Computer Science.. FIEDLER. ElSENSTAT. 298-305. Math.. CAVERS. pp. pp. SAMEH. A nondeterministic parallel algorithm for general untymmetric sparse LU factorization. DUFF AND J. ROMINE. in Advances in Numerical Computation. The multifrontal solution of indefinite sparse symmetric linear equations. Comput. Appl. A. [25] I. in Hypercube . HENKEL. SIAM J. ElSENSTAT. University of Waterloo. SIAM Review. Report YALEU/DCS/RR-810. 1145-1151. Hammarling. J. J.-Y. J. LESCRENIER. 1 (1987). Internat. cache. F. Stat. Engrg. Appl. SLAM J. CHU AND A. [32] M.-H. 1990. University of British Columbia. Maryland 20742. Center for Pure and Applied Mathematics. Tech. DONOARRA. Parallel Computing. [13] P. Concurrent multifrontal methods: shared memory. GEORGE. N. M. DUFF. Comput. SIAM Review. Matrix Anal. Math. DAVE AND I. GURSKY. DONGARRA. [26] I. Matrix Anal. [34] K. S. [29] S. [33] . Stat.. G. 686-695.. Sparse matrix calculations on the Cray-2. Solving finite element problems with parallel multifrontal schemes. GAO AND B. AND A. [36] G. AND J. pp. 593-599. S. Berkeley. Liu. LlU. Waterloo. Communication cost of sparse Cholesky factorization on a hypercube. A fan-in algorithm for dirtributcd tparte Gaussian elimination. Appl. Department of Computer Science University of Maryland. pp. 
in Proceedings of the 19th Design Automation Conference. 9 (1983). 23 (1973). Report CS-84-36. K. 1987. SIMON. Direct Methods for Sparse Matrices. [15] I. LEWIS. 26-44. J. pp.. GOULD. Numer. KARP. Parallel algorithms for dense linear algebra computations. G. Math. England. C. [21] J. Sci. Supercomp. [16] E. CHU. University of California. A linear-time heuristic for improving network partitions. Tech. M. SIAM J. GRIMES. Report AERE R 10533. [14] J.. SCHNTTQER. Anal. AND J. H. DUFF. [20] T. pp.. AND A. W. ASHCRAFT. DUFF. pp. AND C. Parallel direct solution of sparse linear systems of equations. Parallel Computing. GEIST. AND E. 383-402. DAVIS AND P. ASHCRAFT. [35] F. 5 (1987). AND A. 229-239. DUFF. California. SHERMAN. Tech. 32 (1990). J. ACM Trans. MONTRY. GEORGE. the symmetric codes.. 11 (1990). KARP. Multiprocessing a tparte matrix code on the Alliant FX/8. ERISMAN. IEEE Software. 1986. Appl. Report TR 1714. A comparison of three column-bated attributed tparte factorization scheme*. 27 (1989). [27] I. ASHCRAFT. M. pp. 25 (1975). 11 (1990). New Haven. [24] I.a set of Fortran subroutines for solving sparse symmetric sets of linear equations. pp. [10] C. W. WEIOAND. Parallel implementation of multifrontal schemes. Master's thesis. Matrix Anal. 91-112. J. A. The Yale sparse matrix package I. 11 (1990). 1990. MA27 . pp. 13 (1976). 619-633. ElSENSTAT. 589-600. Harwell. 11 (1990). Cox and S. BENNER. SLAM J. 193-204. pp. 6 (May 1989). A property of eigenvectors of non-negative symmetric matrices and its application to graph theory. M. PEYTON. ElSENSTAT. pp. PLEMMONS. pp. A. pp. and frontwidth issues. Appl. 1988 Gordon Bell prize. J.. No. User's guide for SPARSPAK-A: Waterloo sparse linear equations package. REID. pp. pp. B. Czech. CT. PARLETT. AND D. [23] . REID. College Park. Progress in sparse matrix methods for large linear systems on vector supercomputers. REID. Internal.PARALLEL ALGORITHMS FOR SPARSE LINEAR SYSTEMS 121 numerical factorization. Numer. Internat. 
ERISMAN. 453-465. 83-88. AND G. [11] C. Oxford University Press. On the performance of the minimum degree ordering for [9] C. AND H.Yale University.. [30] S. HEATH. [17] E. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. pp. Tech. Comput. 302-325. 1988. [28] . 1982. AND J. eds. FlDUCCIA AND R. Ontario. [12] R. K. 175-181. BROWNE. AND A. ScHULTZ. 55-64. Oxford. AND J. 18 (1982). Liu. Czech. REID. 9 (1988). Tech. Sci. 10-30. Algebraic connectivity of graphs. 3 (1986).

W. 77 (1986). Tech. A. Sci. HEATH. W.-H. 287-298. April 1986. Parallel solution of triangular systems on distributed-memory multiprocessors.-Y. ROSE. Math. J.-H. pp. 189-203. B. GILBERT AND E. LEISERSON AND J. Anal. GILBERT AND R. Basic linear algebra subprograms for Fortran usage. J. pp. Numer. KERNIGHAN AND S. 14 (1990). [39] A. GEORGE. 18 (1989). Dept. An automatic nested dissection algorithm for irregular finite element problems. A parallel graph partitioning algorithm for a message-passing multiprocessor. AND F. pp. Symbolic Cholesky factorization on a local-memory multiprocessor. NG. Num. KINCAID. HANSON.-H. MclNTYRE. Stat. Stat. Solution of sparse positive definite systems on a shared memory multiprocessor. Ne$ted dissection of a regular finite element mesh. 2 (1970). 291-314.. HEATH AND C. HEATH. J. Meth. LlU. Anal. Engng. Parallel Programming. ACM Trans. A. Charleston. 9 (1988). HAFSTEINSSON. Sci. Anal. W. J. SIAM J. pp. pp. C.). 231-239. (1986). 1-19. LAWSON. Parallel Cholesky factorization on a hypercube multiprocessor. On the application of the minimum degree algorithm to finite element systems. pp. 1987. Comput.. 151-162. Report Tech. 15 (1978). Parallel Programming. pp. NG. Bell System Technical Journal. ROMINE. Comput. Tech. Norway. ZMUEWSKI. Sci. GEIST AND M. Xerox Palo Also Research Center. M. New Jersey. pp. Int. 85-95. GEORGE. 10 (1973). M. KROGH. Parallel symbolic factorization of sparse linear systems. [40] A. 15 [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62] [63] [64] [65] . C-31 (1982). Appl. An efficient heuristic procedure for partitioning graphs. 15 (1978). 129-156. J. JESS AND H.W. LEWIS. . Software. Report CMI No. M. pp.122 M. SIAM J. 309-325. 656-661. pp. GILBERT. Task scheduling for parallel sparse Cholesky factorization. MARTIN. J. HOFFMAN. An optimal algorithm for symbolic factorization of symmetric matrices. 308-371. 165-187. SIAM J. 90-111. 
A frontal solution program for finite element analysis. Parallel sparse Gaussian elimination with partial pivoting. pp. NG AND B. 22 (1990). 5 (1979). SCHREIBER. HEATH. A pipelined Givens method for computing the QR factorization of a sparse matrix. R. J. Math. Anal. 1985. B. 99. An efficient parallel sparse partial pivoting algorithm. Annals of Operations Research. 27 (1989). SIAM. Sparse Cholesky factorization on a local-memory multiprocessor.. A. AND J. Internal. SIAM J. No. 1990. 583-593. LlU. 16 (1987).. D. GEORGE. M. 1988. GEORGE. Oak Ridge. Linear Alg. pp. [37] G. 1990. 31 (1989). A data structure for parallel L/U decomposition. Solving sparse triangular linear systems using fortran with extensions on the NYU Ultracomputer prototype. GEORGE AND D. J. 1981. Complexity bounds for regular finite difference and finite element grids. Tech. Numer. IEEE Trans... 364-369. 427-449. SC. Philadelphia. . Englewood Cliffs. Tennessee. AND E. (Submitted to SIAM J. pp. Linear Alg. pp. Numer. Michelsen Institute. Appl. HEATH. C.. Centre for Computer Science. J. T. Visual animation of parallel algorithms for matrix computations. New York University. Oak Ridge National Laboratory. J. M. The evolution of the minimum degree ordering algorithm. 77 (1986). Comput. 5 (1987). Parallel Cholttky factorization on a sharedmemory multiprocessor.. Parallel Computing. 291-307. Highly parallel sparse Cholesky factorization. M. NYU Ultracomputer Note. Tech. pp. 5-32. PEYTON Multiprocessors 1987. 9 (1980). Heath. 345-363. Comp. pp. SIAM J. 9 (1988).. 49 (1970). Bergen. pp. AND D. GEORGE AND J. 327-340. Numer. Comput. pp. Communication results for parallel sparse Cholesky factorization on a hypercube. A. Appl. Parallel Programming. A. Solution of sparse positive definite systems on a hypercube. pp. Report CSL-90-7.. AND E. SIAM J. G. pp.. 10 (1989). E. . W. G.-Y. 88/45052-1. Orderings for parallel sparse symmetric factorization. pp. . M. KEES. Internat. 219-240. HEATH AND D. J. Chr. [38] G. 
in Parallel [41] A. IRONS. pp. Liu. J. LlU.. HEATH. G. NG.. Rept. G. . Fifth Distributed Memory Computng Conf. .-H.. 558-588.-Y. Intern. Stat. Report ORNL-6211. of Science and Technology.. SIAM Review. ed. GEORGE AND E.-Y. Prentice-Hall Inc. 10 (1973). in Proc. GREENBAUM. SORENSEN. SIAM J. Parallel Computing. Comput. pp. 1053-1069. GEIST AND E. (To appear). LiN.T. Computer Solution of Large Sparse Positive Definite Systems. A. Parallel Computing. GILBERT AND H.

The Pennsylvania State University. Report CS-90-04. 1990. A separator theorem for planar graphs. Sci. 310-325. 36 (1979).PARALLEL ALGORITHMS FOR SPARSE LINEAR SYSTEMS 123 [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] Processing for Scientific Computing. TN.. PA. Math. 165-184. A. ROSE. Oak Ridge. Parallel Computing. Parallel Computing. 1989. Reordering sparse matrices for parallel elimination. A parallel solution method for large sparse systems of equations. ACM Trans. 1988. LlPTON AND R. Tech. F.. 73-91. R.. 6 (1988). Virginia. Report CS-88-51. J. Computational models and task scheduling for parallel sparse Cholesky factorization. LEUZB. Software. A supernodal Cholesky factorization algorithm for shared-memory multiprocessors. Stat. 11 (1990). Parallel Computing. 141-153. TARJAN. Generalized nested dissection.-J. The role of elimination trees in sparse factorization. Appl. SIAM J. J. University Park. LEWIS. V. POTHEN. 1136-1145. 2 (1989). 100-107. J. Hampton. G. BLANK. ed. NG AND B. E. A supernodal symbolic Cholesky factorization on a localmemory multiprocessor. R. Math. VOIGT. B. New York. Numer. Personal communication. NASA Langley Research Center. SLAM J. Department of electrical engineering. Sci. pp.. In preparation. pp. Parallel pivoting algorithms for sparse symmetric matrices. . Stat. pp. 177-199. SIAM. Comput. MlRZAIAN. PARTER.-Y. Stanford University. Solving planar systems of equations on distributed-memory multiprocessors. 3 (1961). R. pp. NAIK AND M. TARJAN. ROMINE. M. Rodrigue. Matrix Anal. A. LlU AND A. A compact row storage scheme for Cholesky factors using elimination trees. D. Report CS-88-16. pp. Ontario. R. Tech. Disc. 10 (1989). Independent tet ordering* for parallel matrix factorization by Gaussian elimination. R. pp. 364-369. 15 (1989). Philadelphia. pp. W. . Tech. 1990. 1 (1984). Parallel solution of linear systems with striped sparse matrices. 
Computing the block triangular form of a sparse matrix. A graph partitioning algorithm by node separators. . In preparation. The minimum degree ordering with constraints. Parallel Computing. PA. 1989. pp. Hampton. 10 (1989).. 134-172. IEEE Trans.. PEYTON. PhD thesis. 981-991. S. AND R. POTMEN. . A bibliography on parallel and vector numerical algorithms. 9 (1988). . SIAM J. Introduction to parallel and vector solution of linear systems. Computer Aided Design. pp. W. Report ICASE Report No. Comput. LUCAS. 1989. 15 (1989). POTHEN AND C. . The use of linear graphs in Gaussian elimination. Report ICASE Report No. FAN. J.-H. AND C. Virginia. LUCAS. pp. NASA Langley Research Center. North York. pp. SIAM J. ACM Trans. ACM Trans. MBLHEM. Plenum Press. 12 (1986). A fatt algorithm for reordering sparge matrices for parallel factorization. SIAM J. . 99-110. Liu AND E. Tech. 1987. W. Software. G. Report ORNL/TM-10998. 127-148. ORTEGA. 89-40. 16 (1979). 424-444. . 27-32. 11 (1989). 10 (1989). The complexity of optimal elimination trees. pp. Equivalent sparse matrix reordering by elimination tree rotations. The multifrontal method and paging in sparse Cholesky factorization. ICASE. . . SIAM J. 198-219. Sci. The Pennsylvania State University. SIAM J. Oak Ridge National Laboratory. The multifrontal method for sparse matrix solution: theory and practice. pp. Parallel Computing. 1156-1173. 1988. 1990. Dept. J. Comput. Math. Math. 88-14. TIEMAN. Tech. Department of Computer Science. PEYTON. Math. PETERS. A linear reordering algorithm for parallel pivoting of chordal graphs. Appl. SIAM Review. Data traffic reduction schemes for sparse Cholesky factorization. 327-342. PATRICK. ACM Trans..-H. 1988. pp. of Computer Science. AND J. AND A. ORTEGA. 3 (1986). W. LlPTON. pp. pp. Stat. pp. ICASE. NG. CAD-6 (1987). Data traffic reduction schemes for Cholesky factorization on asynchronous multiprocessor systems. Anal. 346-358. pp. R. Math. pp. 1990. LlU. 11 (1985). 
Department of Computer Science. Software. Modification of the minimum degree algorithm by multiple elimination. . Software.-H. Tech. 177-191.. York University. . J.

New York 14853-7501. [97] D. Comput. [108] E. Parallel Computing. 7 (1988). 1972. pp. SIAM Symposium on Sparse Matrices. POTMEN. Math. Math. SALTZ. Stanford. Computing the minimum fill-in is NP-complete. [109] . 199-210. J. LlOU. LUEKER. 2 (1981). PhD thesis. 1988. California. Yale University. [100] E. Academic Press. Limiting communication in parallel sparse Cholesky factorization. Wouk. Parallel orthogonal factorization. pp. Aggregation methods for solving sparse triangular systems on •multiprocessors. GILBERT. pp. A parallel algorithm for sparse symbolic Cholesky factorization on a multiprocessor. ROSE..T. 430-452. Software). . 183-217. [104] A. SlMON. pp. Cornell University. On the efficient solution of sparse systems of linear and nonlinear equations. University of California. 597-609. AppL. NG AND B. 1989. ROSE. RAGHAVAN AND A. SLAM J. Software. 1989. C. 256-276. PhD thesis. pp. Sparse Cholesky Factorization on a Multiprocessor. 123-144. Gleneden Beach.. Partitioning sparse matrices with eigenvectors of graphs. ACM Trans. [96] P. 11 (1990). Alg. 1986. pp. PEYTON University Park. Vector. Department of Computer Science. [95] A. A new implementation of sparse Gaussian elimination. SCHREIBER. [106] P. Matrix Anal. RoTHBERG AND A. SHERMAN. Gleneden Beach. SIAM J. Triangulated graphs and the elimination process. 5 (1976). 1989. [102] J. Personal communication. Tech. Distributed sparse factorization of circuit matrices via recursive E-tree partitioning. AppL. 1975. A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations. 77-79. Stat. [107] M. TARJAN. WORLEY AND R. SIAM J. Comput. 32 (1970). in Graph Theory and Computing. California 93106...W. [105] P. SIAM Symposium on Sparse Matrices. pp. Report STAN-CS-89-1286. Stanford University. Sci. Department of Computer Science. ed. [110] E. ZMIJEWSKI AND J. VISVANATHAN. Read. A. in New Computing Environments: Parallel. SCHREIBER. SIAM J. POTHBN. August 1987. YANNAKAKIS. R. 
1990. E. Vu. AND G. Santa Barbara.. (To appear in ACM Trans. Oregon. HEATH. [98] . [103] R. GUPTA. ed. R. pp. [99] D. [101] P. Fast sparse matrix factorization on modern workstations.124 M. Tech. SADAYAPPAN AND V. and Systolic. Meth. 11 (1990). pp. 266-283. Algorithmic aspects of vertex elimination on graphs. Disc. 1989. Philadelphia. H. Anal. 8 (1982). PA. ZMIJEWSKI. Ithaca. 8-38. Oregon. Nested dissection on a mesh-connected processor array. Math. Report TRCS89-18. SIAM Publications. AND K.

A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS

JAMES M. ORTEGA*, ROBERT G. VOIGT† AND CHARLES H. ROMINE‡

Since parallel and vector computation is expanding rapidly, we hope that the references we have collected over the years will be of some value to researchers entering the field. Although this is a bibliography on numerical methods, we have included a number of other references on machine architecture, programming languages, and other topics of interest to scientific computing. Naturally, any such collection will be incomplete. Our apologies in advance to authors whose works we have missed.

Certain conference proceedings and anthologies that have been published in book form we list under the name of the editor (or editors) and then list individual articles with a pointer back to the whole volume. For example, the reference

[225] A. BRANDT [1981], Multigrid solvers on parallel computers, in Schultz [1752],

refers to the article by Brandt in the volume listed under [1752] M. SCHULTZ, ed. Note that the cross-reference is by reference number, not by year.

This bibliography was also published in January of 1989 by Oak Ridge National Laboratory as ORNL-TM/10998 and by the Institute for Computer Applications in Science and Engineering as ICASE Interim Report 6. For further information about access to an electronic copy of the bibliography, send email to either romine@msr.epm.ornl.gov or rgv@icase.edu.

* The work of this author was supported in part by the National Aeronautics and Space Administration under NASA Contract No. NAS-1-46-6.
† The work of this author was supported by the National Aeronautics and Space Administration under NASA Contract Nos. NAS1-18107 and NAS1-18605 at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23665.
‡ The work of this author was supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems Inc.

REFERENCES

[1] H. ABDEL-WAHAB AND T. KAMEDA [1978], Scheduling to minimize maximum cumulative cost subject to series-parallel precedence constraints, Oper. Res., 26, pp. 141-158.
[2] I. ABSAR [1983], Vectorization of a penalty function algorithm for well scheduling, in Gary [700], pp. 361-370.
[3] I. ABU-SHOMAYS [1985], Comparison of methods and algorithms for tridiagonal systems and for vectorization of diffusion computation, in Numrich [1469], pp. 29-56.
[4] W. ABU-SUFAH AND A. MALONY [1986], Vector processing on the Alliant FX/8 multiprocessor, Proc. 1986 Int. Conf. Par. Proc., pp. 559-566.
[5] W. ABU-SUFAH AND A. MALONY [1986], Experimental results for vector processing on the Alliant FX/8, Tech. Report 539, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign.
[6] T. ADAM, K. CHANDY, AND J. DICKSON [1974], A comparison of list schedules for parallel processing systems, Comm. ACM, 17, pp. 685-690.
[7] G. ADAMS, R. BROWN, AND P. DENNING [1985], On evaluating parallel computing systems, Tech. Report TR-85, RIACS, NASA Ames Research Center.

[8] G. ADAMS, R. BROWN, AND P. DENNING [1985], Report on an evaluation study of data flow computation, Tech. Report, RIACS, NASA Ames Research Center.
[9] L. ADAMS [1982], Iterative Algorithms for Large Sparse Linear Systems on Parallel Computers, PhD dissertation, The University of Virginia, Charlottesville, VA. Also published as NASA CR-166027, NASA Langley Research Center, Hampton, VA.
[10] L. ADAMS [1983], An M-step preconditioned conjugate gradient method for parallel computation, Proc. 1983 Int. Conf. Par. Proc., pp. 36-43.
[11] L. ADAMS [1985], M-step preconditioned conjugate gradient methods, SIAM J. Sci. Statist. Comput., 6, pp. 452-463.
[12] L. ADAMS [1986], Reordering computations for parallel execution, Comm. Appl. Numer. Meth., 2, pp. 263-271.
[13] L. ADAMS AND T. CROCKETT [1984], Modeling algorithm execution time on processor arrays, Computer, 17(7), pp. 38-43.
[14] L. ADAMS AND H. JORDAN [1986], Is SOR color-blind?, SIAM J. Sci. Statist. Comput., 7, pp. 490-506.
[15] L. ADAMS AND E. ONG [1987], Additive polynomial preconditioners for parallel computers.
[16] L. ADAMS AND J. ORTEGA [1982], A multi-color SOR method for parallel computation, Proc. 1982 Int. Conf. Par. Proc., pp. 53-56.
[17] L. ADAMS AND R. VOIGT [1984], A methodology for exploiting parallelism in the finite element process, in Kowalik [1116], pp. 373-392.
[18] L. ADAMS AND R. VOIGT [1984], Design, development and use of the Finite Element Machine, in Parter [1529], pp. 301-321.
[19] An analytic model for parallel Gaussian elimination on a binary N-cube, in Fox et al. [651].
[20] T. AGERWALA AND ARVIND [1982], Data flow systems, Computer, 15(2), pp. 10-13.
[21] A. AGGARWAL, B. CHAZELLE, L. GUIBAS, C. O'DUNLAING, AND C. YAP [1985], Parallel computational geometry, Proc. IEEE Conference on Foundations of Computer Science, pp. 468-477.
[22] A. AGRAWAL, S. DHALL, AND S. LAKSHMIVARAHAN [1985], A parallel algorithm for solving large scale sparse linear systems using block preconditioned conjugate gradient method on an MIMD machine, Report OU-PPI-TR-85-02, Schools of Electrical Engineering and Computer Science, University of Oklahoma.
[23] A vector elastic model for the Cyber 205.
[24] R. AHLBERG AND B. GUSTAFSSON [1984], A note on parallel algorithms for partial differential equations.
[25] H. AHMED, J.-M. DELOSME, AND M. MORF [1982], Highly concurrent computing structures for matrix arithmetic and signal processing, Computer, 15(1), pp. 65-82.
[26] G. ALAGHBAND [1987], Parallel pivoting combined with parallel reduction, Tech. Report, ICASE, NASA Langley Research Center, Hampton, VA.
[27] G. ALAGHBAND [1988], Sparse Gaussian elimination with controlled fill-in on a shared memory multiprocessor.
[28] G. ALAGHBAND AND H. JORDAN [1983], Parallelization of the MA28 sparse matrix package for the HEP, Report CSDG-83-3, Department of Electrical and Computer Engineering, University of Colorado, Boulder.
[29] G. ALAGHBAND AND H. JORDAN [1985], Multiprocessor sparse L/U decomposition with controlled fill-in, ICASE Report 85-48, NASA Langley Research Center, Hampton, VA.
[30] G. ALAGHBAND AND H. JORDAN [1986], Multiprocessor sparse LU decomposition with controlled fill-in, Tech. Report, Department of Electrical and Computer Engineering, University of Colorado, Boulder.
[31] T. ALLEN AND G. CYBENKO [1987], Recursive binary partitions, Tech. Report, Department of Computer Science, Tufts University.
[32] E. ALLROTH [1984], Minimization of the processing time of parallel computers, Physics Letters, 106A(7).
[33] V. ALMEIDA AND L. DOWDY [1986].
[34] R. ALT [1985], Computing roots of polynomials on vector processing machines.
[35] H. AMANO, T. BOKU, T. KUDOH, AND H. ONO [1985], A new version of the sparse matrix solving machine, Proc. 12th International Symposium on Computer Architecture.
[36] G. AMDAHL [1967], The validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conf. Proc., 30, pp. 483-485.
[37] G. AMDAHL [1988], Limits of expectation, Int. J. Supercomputer Appl., 2(1), pp. 88-97.
[38] D. ANDERSON, A. FRY, R. GRUBER, AND A. ROY [1987], Gigaflop speed algorithm for the direct solution of large block-tridiagonal systems in 3D physics applications, Submitted to J. Comput. Phys.
[39] D. ANDERSON, A. FRY, R. GRUBER, AND A. ROY [1987], Parallel cyclic reduction algorithm for the direct solution of large block-tridiagonal systems, Proc. Third Symposium on Science and Engineering on Cray Supercomputers, Minneapolis, MN.
[40] D. ANDERSON, R. GRUBER, AND A. ROY [1987], Measurements and estimates of the PAMS plasma equilibrium solver on existing and near-term supercomputers, Proc. American Physical Society Division of Plasma Physics Meeting, San Diego, CA.
[41] D. ANDERSON, A. KONIGES, M. McCOY, AND A. MIRIN [1987], Plasma physics at gigaflops on the CRAY-2, Paper 6S10, Tech. Report UCRL-96034, Lawrence Livermore National Laboratory, Livermore, CA.
[42] D. ANDERSON AND M. McCOY [1986], A survey of linear systems solvers on the NMFECC system, Proc. Twelfth Conf. on Numerical Simulation of Plasmas, San Francisco, CA.
[43] G. ANDERSON AND E. JENSEN [1975], Computer interconnection structures: Taxonomy, characteristics, and examples, ACM Computing Surveys, 7, pp. 197-213.
[44] J. ANDERSON [1965], Program structures for parallel processing, Comm. ACM, 8, pp. 786-788.
[45] Performance comparison of the CRAY X-MP/24 and the CRAY-2.
[46] F. ANDRE, D. HERMAN, AND J.-P. VERJUS [1985], Synchronization of Parallel Programs, MIT Press, Cambridge, MA.
[47] G. ANDREWS AND F. SCHNEIDER [1983], Concepts and notations for concurrent programming, ACM Computing Surveys, 15, pp. 3-43.
[48] M. ANNARATONE, E. ARNOULD, T. GROSS, H. KUNG, M. LAM, AND O. MENZILCIOGLU [1986], Warp architecture and implementation, Proc. 13th Annual International Symposium on Computer Architecture.
[49] M. ANNARATONE, E. ARNOULD, T. GROSS, H. KUNG, M. LAM, O. MENZILCIOGLU, AND J. WEBB [1987], The Warp computer: Architecture, implementation, and performance, IEEE Trans. Comput., C-36, pp. 1523-1538.
[50] M. ANWAR AND M. EL TARZI [1985], Asynchronous algorithms for Poisson's equation with nonlinear boundary conditions.
[51] N. ARENSTORF AND H. JORDAN [1987], Comparing barrier algorithms, ICASE Report 87-65, NASA Langley Research Center, Hampton, VA.
[52] ARMSTRONG [1987], Solving equations of motion on a virtual tree machine.
[53] C. ARNOLD [1982], Performance evaluation of three automatic vectorizer packages, Proc. 1982 Int. Conf. Par. Proc., pp. 235-242.
[54] C. ARNOLD [1983], Vector optimization on the CYBER 205, Proc. 1983 Int. Conf. Par. Proc., pp. 530-536.
[55] C. ARNOLD [1984], Machine independent techniques for scientific supercomputing, Proc. COMPCON 84.
[56] C. ARNOLD, M. PARR, AND M. DEWE [1983], An efficient parallel algorithm for the solution of large sparse linear matrix equations, IEEE Trans. Comput., C-32, pp. 265-273.
[57] D. ARPASI AND E. MILNER [1986], Mathematical model partitioning and packing for parallel computer calculation.
[58] ARVIND AND R. BRYANT [1979], Parallel computers for partial differential equations simulation, Proc. Scientific Computer Information Exchange Meeting, Livermore, CA.
[59] Early experience with the SCS-40.
[60] Parallel computing and multitasking.
[61] Data organization in large numerical computations.
[62] ARVIND AND V. KATHAIL [1981], A multiple processor data flow machine that supports generalized procedures, Proc. 8th Annual Symposium on Computer Architecture, pp. 291-302.
[63] S. ARYA AND D. CALAHAN [1981], Optimal scheduling of assembly language kernels for vector processors, Proc. 19th Allerton Conf. on Comm., Control and Computers.
[64] C. ASHCRAFT [1985], A moving computation front approach for vectorizing ICCG calculations, Tech. Report GMR-5174, General Motors Research Lab.
[65] C. ASHCRAFT [1985], Domain decoupled incomplete factorizations, Tech. Report GMR-5299, General Motors Research Lab.
[66] C. ASHCRAFT [1987], A vector implementation of the multifrontal method for large sparse, symmetric positive definite linear systems, Tech. Report ETA-TR-51, Boeing Computer Services.
[67] C. ASHCRAFT [1987], The influence of relaxed supernode partitions on the multifrontal method, Tech. Report ETA-TR-60, Boeing Computer Services.
[68] C. ASHCRAFT AND R. GRIMES [1987], A computational survey of the conjugate gradient preconditioners on the CRAY 1-S, Tech. Report ETA-TR-49, Boeing Computer Services.
[69] C. ASHCRAFT AND R. GRIMES [1988], On vectorizing incomplete factorization and SSOR preconditioners, SIAM J. Sci. Statist. Comput., 9, pp. 122-151.
[70] C. ASHCRAFT, R. GRIMES, J. LEWIS, B. PEYTON, AND H. SIMON [1987], Recent progress in sparse matrix methods for large linear systems on vector supercomputers, Int. J. Supercomputer Appl., 1, pp. 10-30.
[71] A supernodal implementation of general sparse factorization for vector computers.
[72] M. ASHWORTH AND A. LYNE [1988], A segmented FFT algorithm for vector computers, Parallel Computing, 6, pp. 217-224.
[73] C. ASKEW AND F. WALKDEN [1984], On the design and implementation of a package for solving a class of partial differential equations.
[74] V. ASRIELI [1985], Base language of the programming system of a vector processor, Computational Processes and Systems, Izdatel'stvo Nauka, Moscow.
[75] V. ASRIELI AND P. BORISOV [1985], Experience with programming a vector processor for the solution of the Navier-Stokes equations in a three-dimensional region, Computational Processes and Systems, Izdatel'stvo Nauka, Moscow.
[76] W. ATHAS AND C. SEITZ [1988], Multicomputers: Message-passing concurrent computers, Computer, 21(8), pp. 9-24.
[77] J. AVILA AND J. TOMLIN [1979], Solution of very large least squares problems by nested dissection on a parallel processor, Proc. Computer Science and Statistics: Twelfth Annual Symposium on the Interface.
[78] A. AVIZIENIS, M. ERCEGOVAC, T. LANG, P. SYLVAIN, AND A. THOMASIAN [1977], An investigation of fault-tolerant architectures for large scale numerical computing, in Kuck et al. [1133], pp. 159-183.
[79] T. AXELROD [1986], Effects of synchronization barriers on multiprocessor performance, Parallel Computing, 3, pp. 129-140.
[80] O. AXELSSON [1985], Incomplete block matrix factorization preconditioning methods. The ultimate answer?, J. Comput. Appl. Math., 12/13, pp. 3-18.
[81] O. AXELSSON [1985], A survey of vectorizable preconditioning methods for large scale finite element matrix problems, BIT, 25, pp. 166-187.
[82] O. AXELSSON [1986], Analysis of incomplete matrix factorizations as multigrid smoothers for vector and parallel computers, Appl. Math. & Comp., 19(1-4).
[83] O. AXELSSON [1988], Parallel reduction methods for the solution of banded systems of equations.
[84] O. AXELSSON AND V. EIJKHOUT [1986], A note on the vectorization of scalar recursions, Parallel Computing, 3.
[85] C. AYKANAT AND F. OZGUNER [1987], Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, Proc. 1987 Int. Conf. Par. Proc., pp. 641-644.
[86] C. AYKANAT, F. OZGUNER, S. MARTIN, AND S. DORAIVELU [1986], Parallel computers and finite element analysis, Proc. 1986 ASME Int. Computers in Engineering Conf.
[87] C. AYKANAT, F. OZGUNER, P. SADAYAPPAN, AND S. DORAIVELU [1987], Parallelization of a finite element application program on a hypercube multiprocessor, in Heath [860].
[88] R. BABB [1984], Parallel processing with large-grain data flow techniques, Computer, 17(7), pp. 55-61.
[89] R. BABB [1986], Parallel processing on the CRAY X-MP with large-grain data flow techniques.
[90] R. BABB, L. STORC, AND R. HIROMOTO [1986], Developing a parallel Monte Carlo transport algorithm using large-grain data flow, Tech. Report LA-UR-86-2080, Los Alamos National Laboratory.
[91] I. BABUSKA AND H. ELMAN [1988], Some aspects of parallel implementation of the finite element method on message passing architectures, Tech. Report CS-TR-2030, University of Maryland.
[92] S. BADEN [1986], Dynamic load balancing of a vortex calculation running on multiprocessors, Tech. Report LBL-22584, Lawrence Berkeley Laboratory.
[93] J.-L. BAER [1973], A survey of some theoretical aspects of multiprocessing, ACM Computing Surveys, 5, pp. 31-80.
[94] J.-L. BAER [1977], Multiprocessing systems, IEEE Trans. Comput., C-25, pp. 1271-1277.
[95] J.-L. BAER [1980], Computer Systems Architecture, Computer Science Press.
[96] J.-L. BAER [1984], Computer architecture, Computer, 17(10), pp. 77-87.
[97] D. BAILEY [1987], A high performance fast Fourier transform algorithm for the CRAY-2, J. Supercomputing, 1, pp. 43-60.
[98] D. BAILEY [1988], Extra high speed matrix multiplication on the Cray-2, SIAM J. Sci. Statist. Comput., 9, pp. 603-607.
[99] C. BAJAJ, W. DYKSEN, E. HOUSTIS, AND J. RICE [1987], Computing about physical objects, Tech. Report TR-696, Purdue University.
[100] Reducing communication overhead: A parallel code optimization.
[101] K. BAKER AND L. SCHRAGE [1978], Finding an optimal sequence by dynamic programming: An extension to precedence-related tasks, Oper. Res., 26, pp. 111-120.
[102] W. BALLHAUS [1984], Computational aerodynamics and supercomputers, Proc. COMPCON 84.
[103] I. BAR-ON [1987], A practical parallel algorithm for solving band symmetric positive definite systems of linear equations.
[104] D. BARKAI AND A. BRANDT [1983], Vectorized multigrid Poisson solver for the CDC CYBER 205, Appl. Math. Comput., 13(3-4), pp. 215-228.
[105] D. BARKAI AND K. MORIARTY [1986], Vectorization of the multigrid method: The two-dimensional Poisson equation.
[106] D. BARKAI, K. MORIARTY, AND C. REBBI [1984], A highly optimized vectorized code for Monte Carlo simulation of SU(3) lattice gauge theories, Comput. Phys. Comm., 32, pp. 1-9.
[107] D. BARKAI, K. MORIARTY, AND C. REBBI [1984], A highly optimized vectorized code for Monte Carlo simulation of SU(3) lattice gauge theories, Tech. Report.
[108] D. BARKAI, K. MORIARTY, AND C. REBBI [1985], A modified conjugate gradient solver for very large systems, Proc. 1985 Int. Conf. Par. Proc., pp. 284-290.
[109] D. BARKAI, K. MORIARTY, AND C. REBBI [1985], A modified conjugate gradient solver for very large systems, Comput. Phys. Comm., 36.
[110] D. BARKAI, M. CAMPOSTRINI, K. MORIARTY, AND C. REBBI [1987], A modified conjugate gradient solver for very large systems.
[111] Application development on the CDC Cyber 205.
[112] Applications development of the ETA-10, Tech. Report ETA-TR-57.
[113] J. BARLOW AND I. IPSEN [1987], Scaled Givens rotations for the solution of linear least squares problems on systolic arrays, SIAM J. Sci. Statist. Comput., 8, pp. 716-733.
[114] R. BARLOW AND D. EVANS [1982], Synchronous and asynchronous iterative parallel algorithms for linear systems, Comput. J., 25, pp. 56-60.
[115] R. BARLOW, D. EVANS, AND J. SHANEHCHI [1982], Parallel multisection applied to the eigenvalue problem, Comput. J., 25.
[116] R. BARLOW, D. EVANS, AND J. SHANEHCHI [1983], Sparse matrix vector multiplication on the DAP, in Paddon [1512].
[117] R. BARLOW, D. EVANS, AND J. SHANEHCHI [1984], Comparative study of the exploitation of different levels of parallelism on different parallel architectures, Proc. 1984 Int. Conf. Par. Proc., pp. 34-40.
[118] G. BARNES, R. BROWN, M. KATO, D. KUCK, D. SLOTNICK, AND R. STOKES [1968], The Illiac IV computer, IEEE Trans. Comput., C-17, pp. 746-757.
[119] K. BATCHER [1974], STARAN parallel processor system hardware, Proc. AFIPS NCC, 43, pp. 405-410.
[120] K. BATCHER [1976], The Flip network in STARAN, Proc. 1976 Int. Conf. Par. Proc., pp. 65-71.
[121] K. BATCHER [1979], MPP — A Massively Parallel Processor, Proc. 1979 Int. Conf. Par. Proc., p. 249.
[122] K. BATCHER [1980], Design of a Massively Parallel Processor, IEEE Trans. Comput., C-29, pp. 836-840.
[123] K. BATCHER [1985], The Massively Parallel Processor system overview, in Potter [1584], pp. 142-149.
[124] G. BAUDET [1977], Iterative methods for asynchronous multiprocessors, in Kuck et al. [1133], pp. 309-310.
[125] G. BAUDET [1978], Asynchronous iterative methods for multiprocessors, J. ACM, 25, pp. 226-244.
[126] D. BAXTER, J. SALTZ, S. EISENSTAT, AND M. SCHULTZ [1988], An experimental study of methods for parallel preconditioned Krylov methods, Tech. Report RR-629, Yale University.
[127] G. BEHIE AND P. FORSYTH [1984], Incomplete factorization methods for fully implicit simulation of enhanced oil recovery, SIAM J. Sci. Statist. Comput., 5, pp. 543-561.
[128] M. BEKAKOS AND D. EVANS [1987], A rotating and folding algorithm using a two-dimensional "systolic" communication geometry, Parallel Computing, 5, pp. 221-228.
[129] C. G. BELL [1985], Multis: A new class of multiprocessor computers, Science, 228, pp. 462-467.
[130] M. BEN-ARI [1982], Principles of Concurrent Programming, Prentice-Hall, Inc., Englewood Cliffs, NJ.
[131] V. BENES [1962], Heuristic remarks and mathematical problems regarding the theory of connecting systems, Bell System Tech. J., 41, pp. 1201-1247.
[132] V. BENES [1965], Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, Inc., New York.
[133] R. BENNER [1986], Shared memory, cache, and frontwidth considerations in multifrontal algorithm development, Tech. Report SAND85-2752, Sandia National Laboratories, Albuquerque, NM.
[134] R. BENNER AND G. MONTRY [1986], Overview of preconditioned conjugate gradient (PCG) methods in concurrent finite element analysis, Tech. Report SAND-85-2727, Fluid and Thermal Sciences Department, Sandia National Laboratory, Albuquerque, NM.
[135] M. BENSON AND P. FREDERICKSON [1987], Fast parallel algorithms for the Moore-Penrose pseudo-inverse, in Heath [860], pp. 597-604.
[136] M. BENSON AND P. FREDERICKSON [1988], Fast pseudo-inverse algorithms on hypercubes, in McCormick [1312].
[137] P. BENYON [1985], Exploiting vector computers for simulation, Math. Comput. Simul., 27.
[138] H. BERENDSEN, W. VAN GUNSTEREN, AND J. POSTMA [1984], Molecular dynamics on CRAY, CYBER and DAP.
[139] M. BERGER AND S. BOKHARI [1985], A partitioning strategy for PDE's across multiprocessors, Proc. 1985 Int. Conf. Par. Proc., pp. 166-170.
[140] M. BERGER AND S. BOKHARI [1987], Partitioning strategy for non-uniform problems on multiprocessors, IEEE Trans. Comput., C-36, pp. 570-580.
[141] M. BERGER AND J. OLIGER [1984], Adaptive mesh refinement for hyperbolic partial differential equations, J. Comput. Phys., 53.
[142] D. BERGMARK, J. FRANCIONI, B. HELMINEN, AND D. POPLAWSKI [1987], On the performance of the FPS T-series hypercube, in Heath [860].
[143] F. BERMAN AND L. SNYDER [1987], On mapping parallel algorithms into parallel architectures, J. Par. Dist. Comput., 4, pp. 439-458.
[144] P. BERGER, P. BROUAYE, AND J. SYRE [1982], A mesh coloring method for efficient MIMD processing in finite element problems, Proc. 1982 Int. Conf. Par. Proc., pp. 41-46.
[145] P. BERNARD AND C. FRABOUL [1985], Experience in parallelizing numerical algorithms for MIMD architectures: use of asynchronous methods, La Recherche Aerospatiale.
[146] D. BERNSTEIN AND M. GOLDSTEIN [1986].
[147] D. BERNSTEIN AND M. GOLDSTEIN [1988].
[148] M. BERRY, K. GALLIVAN, W. HARROD, W. JALBY, S. LO, U. MEIER, B. PHILIPPE, AND A. SAMEH [1986], Parallel algorithms on the Cedar system, Tech. Report 581, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign.
[149] M. BERRY AND R. PLEMMONS [1985], Computing a banded basis of the null space on the Denelcor HEP multiprocessor, Contemporary Math., 47, pp. 7-23.
[150] M. BERRY AND R. PLEMMONS [1987], Algorithms and experiments for structural mechanics on high performance architectures, Comput. Meth. Appl. Mech. Engrg., 64, pp. 487-508.
[151] M. BERRY AND A. SAMEH [1986], Multiprocessor Jacobi algorithms for dense symmetric eigenvalue and singular value decompositions, Proc. 1986 Int. Conf. Par. Proc., pp. 433-440.
[152] M. BERRY AND A. SAMEH [1986], A multiprocessor scheme for the singular value decomposition.
[153] D. BERTSEKAS [1982], Distributed dynamic programming, IEEE Trans. Automat. Control, AC-27, pp. 610-616.
[154] D. BERTSEKAS [1983], Distributed asynchronous computation of fixed points, Math. Programming, 27, pp. 107-120.
[155] M. BERZINS, T. BUCKLEY, AND P. DEW [1984], Path Pascal simulation of multiprocessor lattice architectures for numerical computations.
[156] M. BERZINS, T. BUCKLEY, AND P. DEW [1984], Systolic matrix iterative algorithms, in Feilmeier et al. [623], pp. 483-488.
[157] R. BEVILACQUA, B. CODENOTTI, AND F. ROMANI [1988], Parallel solution of block tridiagonal linear systems, Lin. Alg. & Appl., 104, pp. 39-58.
[158] V. BHAVSAR [1981], Some parallel algorithms for Monte Carlo solutions of partial differential equations, Advances in Computer Methods for Partial Differential Equations, vol. 4, R. Vichnevetsky and R. Stepleman, eds., IMACS, New Brunswick, Canada, pp. 135-141.
[159] V. BHAVSAR AND U. GUJAR [1984], VLSI algorithms for Monte Carlo solutions of partial differential equations, Advances in Computer Methods for Partial Differential Equations, vol. 5, R. Vichnevetsky and R. Stepleman, eds., IMACS, New Brunswick, Canada, pp. 268-276.
[160] V. BHAVSAR AND J. HELTON [1982].
[161] V. BHAVSAR AND J. ISAAC [1987], Design and analysis of parallel Monte Carlo algorithms, SIAM J. Sci. Statist. Comput., 8, pp. s73-s95.
[162] V. BHAVSAR AND V. KANETKAR [1977], A multiple microprocessor system (MMPS) for the Monte Carlo solution of partial differential equations, Advances in Computer Methods for Partial Differential Equations, vol. 2, R. Vichnevetsky, ed., IMACS, New Brunswick, Canada, pp. 205-213.
[163] V. BHAVSAR AND A. PADGAONKAR [1979], Effectiveness of some parallel computer architectures for Monte Carlo solution of partial differential equations, Advances in Computer Methods for Partial Differential Equations, vol. 3, R. Vichnevetsky and R. Stepleman, eds., IMACS, New Brunswick, Canada, pp. 259-264.
[164] Monte Carlo neutron transport on the Alliant FX/8.
[165] S. BHATT AND I. IPSEN [1985], How to embed trees in hypercubes, Tech. Report YALEU/DCS/RR-443, Department of Computer Science, Yale University.
[166] L. BHUYAN AND D. AGRAWAL [1984], Generalized hypercube and hyperbus structures for a computer network, IEEE Trans. Comput., C-33, pp. 323-333.
[167] A vectorizable eigenvalue solver for sparse matrices, Tech. Report 690, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign.
[168] Optimizing Givens' algorithm for multiprocessors.
[169] D. BINI [1984], Parallel solution of certain Toeplitz linear systems, SIAM J. Comput., 13.
[170] S. BIRINGEN [1983], A numerical simulation of transition in plane channel flow, AIAA Paper 83-47, Reno, NV, January.
[171] S. BIRINGEN [1983], Simulation of late transition in plane channel flow, Proceedings of the Third International Conference on Numerical Methods in Laminar and Turbulent Flow, Seattle, WA, August.
[172] Pipelined iterative methods for shared memory machines.
[173] G. BIRKHOFF AND A. SCHOENSTADT, eds. [1984], Elliptic Problem Solvers II, Academic Press, Orlando, FL.
[174] L. BIRTA AND O. ABOU-RABIA [1987], Parallel block predictor-corrector methods for ODE's, IEEE Trans. Comput., C-36, pp. 299-311.
[175] C. BISCHOF [1986], The two-sided block Jacobi method on a hypercube.
[176] A parallel ordering for the block Jacobi method on a hypercube architecture.
[177] C. BISCHOF [1987], Computing the singular value decomposition on a ring of array processors.
[178] C. BISCHOF AND C. VAN LOAN [1986], The WY representation for products of Householder matrices, Tech. Report TR 86-740, Department of Computer Science, Cornell University, Ithaca, NY.
[179] C. BISCHOF AND C. VAN LOAN [1987], The WY representation for products of Householder matrices, SIAM J. Sci. Statist. Comput., 8, pp. s2-s13.
[180] P. BJ0RSTAD [1987], A large scale, sparse, secondary storage, direct linear equation solver for structural analysis and its implementation on vector and parallel architectures, Parallel Computing, 5, pp. 3-12.
[181] P. BJ0RSTAD AND A. HVIDSTEN [1988], Iterative methods for substructured elasticity problems in structural analysis, in Glowinski et al. [762].
[182] P. BJ0RSTAD AND O. WIDLUND [1984], Solving elliptic problems on regions partitioned into substructures, in Birkhoff and Schoenstadt [173], pp. 245-255.
[183] P. BJ0RSTAD AND O. WIDLUND [1986], Iterative methods for the solution of elliptic problems on regions partitioned into substructures, SIAM J. Numer. Anal., 23, pp. 1097-1120.
[184] Programming parallel numerical algorithms in Ada.
[185] M. BLUMENFELD [1984], Preconditioning conjugate gradient methods on vector computers.
[186] A. BODE, G. FRITSCH, W. HANDLER, W. HENNING, F. HOFMANN, AND J. VOLKERT [1985], Multigrid oriented computer architecture, Proc. 1985 Int. Conf. Par. Proc., pp. 89-95.
[187] J. BOISSEAU, M. ENSELME, AND D. GUIRAUD [1982], Potential assessment of a parallel structure for the solution of partial differential equations, Rech. Aerosp.
[188] A. BOJANCZYK [1984], Optimal asynchronous Newton method for the solution of nonlinear equations, J. ACM, 31, pp. 792-803.
[189] A. BOJANCZYK AND R. BRENT [1985], Tridiagonalization of a symmetric matrix on a square array of mesh-connected processors.
[190] A. BOJANCZYK, R. BRENT, AND H. KUNG [1984], Numerically stable solution of dense systems of linear equations using mesh-connected processors, SIAM J. Sci. Statist. Comput., 5, pp. 95-104.
[191] S. BOKHARI, M. HUSSAINI, AND S. ORSZAG [1982], Fast orthogonal derivatives on the STAR.
[192] S. BOKHARI, M. HUSSAINI, J. LAMBIOTTE, AND S. ORSZAG [1982], Navier-Stokes solution on the CYBER-205 by a pseudospectral technique, Proc. Second IMACS International Symposium on Parallel Computation.
[193] S. BOKHARI [1979], On the mapping problem, Proc. 1979 Int. Conf. Par. Proc., pp. 239-248.
[194] S. BOKHARI [1981], On the mapping problem, IEEE Trans. Comput., C-30, pp. 207-214.
[195] S. BOKHARI [1984], Finding maximum on an array processor with a global bus, IEEE Trans. Comput., C-33, pp. 133-139.
[196] S. BOKHARI [1988], Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput., C-37, pp. 48-57.
[197] D. BOLEY [1978], Vectorization of block relaxation techniques: Some numerical experiments, Proc. 1978 LASL Workshop on Vector and Parallel Processors, Los Alamos, NM.
[198] D. BOLEY [1984], A parallel method for the generalized eigenvalue problem, Tech. Report 84-21, Department of Computer Science, University of Minnesota.
[199] D. BOLEY [1986], Solving the generalized eigenvalue problem on a synchronous linear processor array, Parallel Computing, 3, pp. 153-166.
[200] A. BONDARENKO [1985], Parallelization of methods for the modification of matrix factorizations, Computational Processes and Systems, Izdatel'stvo Nauka, Moscow.
[201] L. BONEY AND R. SMITH [1979], A vectorization of the Hess-McDonnell-Douglas potential flow program NUED for the STAR-100 computer, NASA Tech. Report TM-78816, NASA Langley Research Center, Hampton, VA.
[202] D. BOOK, ed. [1981], Finite Difference Techniques for Vectorized Fluid Dynamics Calculation, Springer-Verlag, New York.
[203] A domain decomposition Laplace solver for internal combustion engine modeling.
[204] J. BORIS [1976], Vectorized tridiagonal solvers, Tech. Report 3048, Naval Research Laboratory.
[205] J. BORIS [1986], A vectorized "near neighbors" algorithm of order n using a monotonic logical grid, J. Comput. Phys., 66.
[206] J. BORIS AND N. WINSOR [1982], Vectorized computation of reactive flow, in Rodrigue [1643], pp. 173-215.
[207] A. BORODIN AND I. MUNRO [1975], Computational Complexity of Algebraic and Numeric Processes, American Elsevier, New York.
[208] A. BOSSAVIT [1982], On the vectorization of algorithms in linear algebra, Proc. 10th IMACS World Congress on Systems Simulation and Scientific Computation, pp. 95-97.
[209] S. BOSTIC [1984], Solution of a tridiagonal system of equations on the Finite Element Machine, NASA Tech. Report TM-85710, NASA Langley Research Center, Hampton, VA.
[210] S. BOSTIC AND R. FULTON [1985], A concurrent processing approach to structural vibration analysis, Proc. 26th AIAA Structures, Structural Dynamics and Materials Conf., Orlando, FL.
[211] S. BOSTIC AND R. FULTON [1987], Implementation of the Lanczos method for structural vibration analysis on a parallel computer, Computers and Structures.
[212] A. BOUDOUVIS AND L. SCRIVEN [1985], Explicitly vectorized frontal routine for hydrodynamic stability and bifurcation analysis by Galerkin/finite element methods, in Numrich [1469], pp. 197-213.
[213] W. BOUKNIGHT, S. DENENBERG, D. McINTYRE, J. RANDALL, A. SAMEH, AND D. SLOTNICK [1972], The Illiac IV system, Proc. IEEE, 60, pp. 369-388.
[214] B. BOWEN AND R. BUHR [1980], The Logical Design of Multiple-Microprocessor Systems, Prentice-Hall, Inc., Englewood Cliffs, NJ.
[215] G. BOWGEN AND J. MODI [1985], Implementation of QR factorization on the DAP using Householder transformations.
[216] K. BOWLER AND G. PAWLEY [1984], Molecular dynamics and Monte Carlo simulations in solid-state and elementary particle physics, Proc. IEEE, 72, pp. 42-55.
[217] Vectorized schemes for conical flow using the artificial density method, AIAA Paper 84-0162.
[218] BRADLEY, SIEMERS, AND WEILMUENSTER [1982], Journal of Spacecraft and Rockets, 19.
[219] I. BRAILOVSKAYA [1965], A difference scheme for numerical solution of the two-dimensional non-stationary Navier-Stokes equations for a compressible gas, Soviet Physics Doklady, 10, pp. 107-110.
[220] J. BRAMBLE, J. PASCIAK, AND A. SCHATZ [1986], The construction of preconditioners for elliptic problems by substructuring, I, Math. Comp., 47, pp. 103-134.
[221] J. BRAMBLE, J. PASCIAK, AND A. SCHATZ [1987], The construction of preconditioners for elliptic problems by substructuring, II, Math. Comp., 49(179), pp. 1-16.
[222] J. BRANDENBURG AND D. SCOTT [1986], Embeddings of communication trees and grids into hypercubes, Intel iPSC Tech. Report 1, Intel.
[223] A. BRANDT [1977], Multi-level adaptive solutions to boundary-value problems, Math. Comp., 31, pp. 333-390.
[224] On block relaxation techniques, Tech. Report CSD-TR-688, Department of Computer Science, Purdue University.
Pipelined iterative methods for shared memory machines. [223] J. in Rodrigue [1643]. [205] J. pp. 173-215. Vectorized schemes for conical flow using the artificial density method. BONEY AND R. BOWEN AND R. Report TM-85710. Structural Dynamics and Materials Conf. [201] L. 26th AIAA Structures. P. Computers and Structures. WINSOR [1982]. Parallelization of methods for the modification of matrix factorizations. DWOYER. The Illiac IV system. A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 133 [198] D. 10. [199] D. AND S. BUHR [1980]. AND J. BRAMBLE. PASCIAK. AND A. WIDLUND [1987]. 95-97. [209] A. 369-379. Report 1. Parallel Computing. Naval Research Laboratory. Purdue University. January. Inc. S. Vectorized computation of reactive flow. Comm. 3. [218] K. 197-213. Tech. MUNRO [1975]. New York University. Moscow. Soviet Physics Doklady. 66. pp. 31. pp. BOOK. Tech. Phys. BORIS [1986]. Comp. MODI [1985]. Phys. BONDARENKO [1985]. 419-422. 10th IMACS World Congress on Systems Simulation and Scientific Computation. NASA Langley Research Center. iPSC Tech. 395-404. AND D. 1-16. pp. BOUKNIGHT. DENENBERG. MCINTYRE. AND S. SLOTNICK [1972]. Englewood Cliffs. PAWLEY [1984]. Finite Difference Techniques for Vectorized Fluid Dynamics Calculation. 37. BRAMBLE. BOSTIC AND R. BOSTIC [1984]. [210] A. SOUTH [1984]. BRADLEY. BORODIN AND I. WEILMUENSTER [1982]. Report TM-78816. Proc. Department of Computer Science. SAMEH. Report 1860. 60. FULTON [1985]. Solution of a tridiagonal system of equations on the Finite Element Machine. [200] E. Intel. IEEE. RANDALL. SMITH [1979]. A concurrent processing approach to structural vibration analysis. Report 315. 333-390. NASA Tech. Vectorized tridiagonal solvers. J. Naval Research Laboratory. Molecular dynamics and Monte Carlo simulations in solid-state and elementary particle physics. Comm. 3. Solving the generalized eigenvalue problem on a synchronous linear processor array. Tech. Report 3237. The Logical Design of Multiple-Microprocessor Systems. Proc. Orlando. Prentice-Hall. BORIS [1976]. BOLEY. BRADLEY. [216] B. J. New York. Computational Complexity of Algebraic and Numeric Processes. BORDERS AND O. pp. [214] A. Comput. DYKSEN [1987]. WIDLUND [1987]. 48-57. A vectorized "near neighbors" algorithm of order n using a monotonic logical grid. 228-264. 72. Izdatel'stvo Nauka. On block relaxation techniques. A domain decomposition Laplace solver for internal combustion engine modeling. Multigrid adaptive solutions to boundary value problems. On the vectorization of algorithms in linear algebra. BOUDOUVIS AND L. [202] J. [206] J. 42-55. 107-110. BOWGEN AND J. BRANDT [1977]. A difference scheme for numerical solution of the two-dimensional non-stationary Navier-Stokes equations for a compressible gas. BRAILOVSKAYA [1965]. [221] I. pp. NASA Tech. BUZBEE. Embeddings of communication trees and grids into hypercubes. 167-170. BOWLER AND G. Tech. [208] J. Report CSD-TR-688. [224] A. Tech. SCOTT [1986]. BRANDENBURG AND D. Report 3048. NJ. [207] J. pp. Flux-corrected transport modules for solving generalized continuity equations. pp. American Elsevier. [211] S. AND A. [217] G. Implementation of QR factorization on the DAP using Householder transformations. NY. [213] S. [212] S. SIEMERS. Springer-Verlag. pp. [203] D. 49(179). FULTON [1987]. A vectorized "near neighbors" algorithm of order n using a monotonic logical grid. A parallel ordering for the block Jacobi method on a hypercube architecture. FARTER [1978]. [204] C. 25. J.
Comp. BONOMO AND W. D. 1-16. pp. AND K. Comparison of shuttle flight pressure data to computational and wind-tunnel results. 1-22. University of Wisconsin. Mathematics Research Center. [220] P. [219] P. Flux-corrected transport modules for solving generalized continuity equations. pp. American Elsevier.

QUINLAN [1987]. North-Holland. December. in Schultz [1752]. 7th Int. [230] R. Sci. A systolic array for the linear time solution of Toeplitz systems of equations. Multilevel computations: Review and recent developments. AND C. [243] W. Report TR-82-522. Fast Poisson solvers for MIMD computers. in McCormick [1312]. L. Report LA-UR-87-2163. Concurrent programming concepts. A systolic architecture for the singular value decomposition. ACM Computing Surveys. BRODE [1981]. HIROMOTO. Computer. 21. S. [229] A. 6. BRINCH HANSEN [1978]. Conf. BROOKS [1984]. September. A systolic VLSI array for integer GCD computation. BRICKNER AND R. Ithaca. Cornell University. Department of Applied Mathematics. 6. Some linear-time algorithms for systolic arrays. [247] P. Los Alamos National Laboratory. BRINCH HANSEN [1973]. H. Local and multi-level parallel processing mill. Amsterdam. Livermore. Communication and control costs of domain decomposition on loosely coupled multiprocessors. 200-205. H. Comput. pp. BRENT, AND F. Two and three dimensional FFTs on highly parallel computers. [252] E. BRINCH HANSEN [1977]. RPS processor memory element. pp. BRENT AND F. pp. Australian National University. Weizmann Institute. Comput. pp. TURNBULL [1988]. Englewood Cliffs. pp. The Architecture of Concurrent Programs. [227] A. pp. A multitasking kernel for the C and Fortran programming languages. Israel. 3. Proc. Department of Computer Science. 1985 Int. Los Alamos National Laboratory. 39-83. BRANDT [1981]. Report TR 82-521. [249] L. NJ. A systolic architecture for almost linear-time solution of the symmetric eigenvalue problem. L. 12(5). LUK [1982]. Computer. Report UCID-20167. pp. s27-s42. Report TR-CS-82-10. Computation of the singular value decomposition using mesh connected processors. 265-275. Distributed processes: A concurrent programming concept. Cornell University. 69-84. pp. Tech. [235] R. [233] R. Multitasking a two-dimensional (R,Z)-geometry discrete ordinates neutron transport algorithm.
Tech. [251] E. 242-270. Prentice-Hall. Tech. LUK [1983]. Computing the Cholesky factorization using a systolic architecture. [240] R. ORTEGA. [236] R. Multiprocessor FFT methods. Proc. SPIE vol. 431: Real Time Signal Processing VI. MCCORMICK. Australian National University. Report. SWEET. ACM. Statist. [241] R. [251] E. 6th Australian Computer Science Conf. Tech. [231] R. LUK [1985]. [231] R. Multigrid solvers on parallel computers. Proc. KUNG [1982]. 295-302. Prentice-Hall. Tech. LUK [1983]. Computing the Cholesky factorization using a systolic architecture. Department of Computer Science. Tech. [242] W. in McCormick [1312]. pp. PATERNOSTER [1987]. 50-56. pp. WIENKE [1987]. [237] R. AND J. 6. BRENT, pp. Report LA-UR-87-2164. BRIGGS. AND D. 865-876. Proc. 6. BRENT AND F. A keynote address on concurrent programming. Berlin. Dist. BRASS AND G. Parallel iterative transport algorithms and comparative performance on distributed and common memory systems. September. G. F. Lawrence Livermore National Laboratory. 223-245. BRENT. 8. Tech. 1. SIAM J. [250] B. BRICKNER. HART. McAULIFFE. 1-22. Department of Computer Science. [234] R. [226] A. Multigrid methods on a hypercube. Proc. IFIP 9th World Computer Congress. pp. Rehovot. 35-63. 167-184. Comp. Proc. [246] P. Precompilation of Fortran programs to facilitate array processing. 782-789. 1. 63-83. The solution of singular-value and symmetric eigenvalue problems on multiprocessors. HART. VAN LOAN [1983]. Tech. Parallel Computing. [228] W. ROMINE [225] A. Australian Computer Science Communications 5. [245] P. BRENT AND F. [244] W. BROOKS [1985]. PARRLEY [1986]. LUK [1983]. Tech. [232] R. pp. Par. BROCHARD [1984]. AND B. BRIGGS. KUNG. Syst. LUK. SIAM J. Statist. Comput. LUK. AND C. BRANDT [1984]. pp. [238] R. BRIGGS AND T. BRANTLEY. O'GALLAGHER [1987]. VAN LOAN [1985]. BRENT AND F. The WY representation for products of Householder matrices. BRANDT [1988]. BROOKS [1984]. Sci. pp. LUK [1982]. Inc. Comm. Proc. Conf. 46-51. Report TR-CS-82-11. LUK [1982]. Department of Computer Science. [248] P. Proc. Conf. 1-22. BRENT AND F. Par. Dist. BRINCH HANSEN [1979]. pp.
Computation of the generalized singular value decomposition using mesh-connected processors. BRENT AND H. Performance of the Butterfly processor-memory interconnection in a vec-

BUCHER AND T. [259] J. Report 87002. BROWNE [1986]. 793-796. Remarks for the IFIP congress '83 panel on how to obtain high performance for high-speed processors. pp. Proc. on Measurement and Modeling of Computer Systems. NEUMANN. Physics Today. Tech. [270] P. BROOKS [1988]. 85-98. [1979]. [273] P. 293-315. Proc. ELSNER [1988]. [264] J. pp. Multidimensional numerical simulation of reactive flow in internal combustion engines. B. 339-348. 1985 Int. A compact non-iterative Poisson solver. Conf. pp. 151-165. CO. RIACS. Tech. [261] J. The indirect k-ary n-cube for a vector processing environment. BRUNO [1986]. 7. eds. [254] E. BROOKS [1988]. KUCK [1971]. Physics Today. Tech. Report 86. BRU. [278] B. M. Vectorization of implicit Navier-Stokes codes on the CRAY-1 computer. BRUNO [1984]. Report. pp. [266] I. Fort Collins. Final report on the feasibility of using the Massively Parallel Processor for large eddy simulations and other computational fluid dynamics applications. BROOMELL AND J. pp. 175-192. HEATH [1983]. Parallel Computing. Proc. EDWARDS. CLOUTMAN. AND R. 2835. Vectorized Monte Carlo radiative heat transfer simulation of the laser isotope separation process. BROWNE [1984]. Department of Aeronautics and Astronautics. [276] T. Energy Combust. [275] R. BUNING AND J. Classification categories and historical development of circuit switching topologies. C-20. [274] BURROUGHS CORP. BURKE AND L. Parallel logic programming for numeric applications. OVERBEEK [1985]. Report 294. BUNEMAN [1969]. pp. A fast Poisson solver amenable to parallel computation. JORDAN [1984]. 340-345. Comput. 1979-81. pp.
The computational speed of supercomputers. The shared memory hypercube. Proceedings of the International Conference on Vector and Parallel Processors in Computational Science. 55. Some Research Applications on the CRAY-1 Computer at the Daresbury Laboratory. Par. Lin. 1981. Tech. pp. 1-10. 37(5). BROOKS [1987]. Argonne National Laboratory. [258] J. BROWNE [1984]. [260] J. eds. 624-631. 235-246. [268] P. tor environment. Par. Report on the feasibility of hypercube concurrent processing systems in computational fluid dynamics. Lawrence Livermore National Laboratory. March. Comp. [256] E. Prog. [277] B. The shared memory hypercube. COMPCON 84. [263] J. Tech. BUZBEE [1983]. Chester. England. pp. [262] R. Solving very large elliptic problems on a supercomputer with solid state disk. Report 84. Parallel Computing. Contractor Report NAS2-9897. BUCHER [1983]. Parallel architecture for computer systems. pp. AND J. BUCHER AND T. BUZBEE [1981]. Soc. [276] T. pp. Alg. [265] I. A butterfly processor-memory interconnection for a vector processing environment. Proc. IEEE Comp. Tech. Linear algebra programs for use on a vector computer with a secondary solid state storage device. [255] E. Models of parallel chaotic iteration methods. Formulation and programming of parallel computations: A unified approach. pp. [267] I. Institute for Plasma Research. MCS Tech. NAS facility feasibility study. BURKE. BUDNIK AND D. BUZBEE [1973]. TRAC: An environment for parallel computing. pp. AND L. Daresbury Laboratory. [269] O. in Vichnevetsky and Stepleman [1920]. in Schultz [1752]. 1985 Int. Report. 546-550.
Final report. [271] P. BUTLER. [272] P. IEEE Trans. BUTLER. LUSK. McCUNE. AND R. OVERBEEK [1985]. The organization and use of parallel memories. IEEE Trans. C-22. 6. pp. Framework for formulation and analysis of parallel computation structures. BROWNE [1985]. TRAC: An environment for parallel computing. pp. 103-110. BROOKS [1985]. Proc. pp. 294-299. Conf. IL. 95-134. 21-24. BUZBEE [1984]. pp. 4. Par. Tech. Report 86.7. BURNS AND D. Proc. pp. Conf. McCUNE. March. Comp. 103. RAMSHAW [1981]. Proc. pp. DAVIES. AND R. Tech. Fort Collins. RIACS. 15. Par. Report 86.2. NASA Ames Research Center. Argonne. [279] B. Stanford University. pp. 1566-1569. CA. A very fast parallel processor. NASA. Solving very large elliptic problems on a supercomputer with solid state disk. PRYOR [1987]. Implementing techniques for elliptic problems on vector processors. [257] G. England. June. NASA Ames Research Center. Stanford University. 6. Los Alamos National

CALAHAN [1981]. 729-738. H. pp. AND J. 1978 LASL Workshop on Vector and Parallel Processors. Report. Proc. pp. 558-564. Conf. pp. IEEE. HOWELL [1977]. CALAHAN [1975]. 1189-93. Gaining insight from supercomputing. 7. pp. AND J. in Glowinski et al. [301] D. 255-271. Report LA-UR-84-2004. 375-378. Livermore. BUZBEE [1984]. Proc. Computer Information Exchange Meeting. [281] B. BUZBEE [1983]. AND C. SHARP [1985]. Vectorization of algorithms for solving systems of elliptic difference equations. pp. in Glowinski et al. [1808]. pp. Parallel Computing. VOIGT AND C. ROMINE
On direct methods for solving Poisson's equation. [307] D. Two parallel formulations of particle-in-cell models. pp. [289] B. Applications of block relaxation. [298] D. Resevoir Simulation. Tech. CRAY Channels. Proc. pp. Proc. 1984 Int. pp. Anal. [290] B. [287] B. Proc. Vectorizations for the CRAY-1 of some methods for solving elliptic difference equations. pp. Proc. pp. 1986 Int. 278-284. Tech. MICHAEL. BUZBEE [1985]. pp. 19-21. CRAY Channels. 489-506. pp. 976-979. 223-232. eds. NIELSON [1970]. pp. 1979 AIME Fifth Symposium on Reservoir Simulation. [294] D. G. Science. 1979 Int. Proc. pp. Vectorized direct solvers of 2-D grids. pp. [286] B. Direct solution of linear equations on the CRAY-1. AMES [1979]. Influence of task granularity on vector multiprocessor performance. pp. 1-5. NM. CALAHAN [1977]. Vectorized sparse elimination. Parallel solution of sparse simultaneous linear equations. PARTER [1979]. A block-oriented sparse equation solver for the CRAY-1. pp. Japanese supercomputer technology. SPE Journal. Block-oriented, local-memory-based linear equation solution on the CRAY-2: Uniprocessor algorithms. [291] B. GOLUB. Tasking studies in solving a linear algebra problem on a CRAY-class multiprocessor. DOE research in utilization of high performance systems. J. Par. University of Michigan. CALAHAN [1981]. Proc. Proc. 1979 Int. pp. [284] B. Science. 218(17). WORLTON. [294] D. Pergamon Press. Applications of MIMD machines. [288] B. CALAHAN [1985]. pp. [306] D. [303] D.
[292] J. 6th Symp. October. in Kuck et al. [1133]. pp. 241-245. High performance banded and profile equation-solvers for the CRAY-1: The unsymmetric case. Systems Engineering Laboratory. A collection of equation solving codes for the CRAY-1. BUZBEE [1986]. pp. Perspectives on computing. in Noor [1450]. Proc. Complexity of vectorized solution of two-dimensional finite element grids. 591-597. University of Illinois at Urbana-Champaign. CALAHAN [1984]. CAS-26. [280] B. Tech. Proc. On some difficulties occurring in the simulation of incompressible fluid flows by domain decomposition methods. CAHOUET [1988]. Tech. Report 160. pp. 1984 Int. Systems Engineering Laboratory. A strategy for vectorization. Los Alamos National Laboratory. Report LA-8609-MS. pp. Proc. University of Michigan. pp. WORLTON [1982]. Report SARL-2. Sparse vectorized direct solution of elliptic problems. Multi-level vectorized sparse solution of LSI circuits. BUZBEE. [285] B. pp. IEEE Conf. on Circuits and Computers. Proceedings of the llth Allerton Conference on Circuit and System Theory. Conf. Tough problems in reactor design. 627-656. 715-776. Supercomputer Algorithm Research Laboratory. Tech. Report 91. Application of MIMD machines.

299-308. Numer. Comp. [311] P. Bergen. Tech. McLAY. QR factorization for linear least squares problems on the hypercube. [336] T. pp. Sci. 6. Par. [316] G. ORBITS [1976]. Proc. pp. CAUGHEY. Experiences with the Intel iPSC hypercube. Report CNA-198. fast Fourier transforms. 831-849. R. Center for Numerical Analysis. [320] D. Statist. 6. CHAN [1987]. pp. A mesh automaton for solving dense linear systems. HOU [1988]. [334] T. pp. Comm. Department of Computer Science. CHAMBERLAIN [1986]. SAAD [1985]. Element by element vector and parallel computations. 305-310. 2. [315] G. LINDHEIM. PETERSEN [1987]. CHAN [1987]. NASA Tech. CHEN [1985]. Meth. Preliminary report on results of matrix benchmarks on vector processors. February. Yale University. in Heath [860]. CHAMBERLAIN [1988]. Nonlinear Finite Element Analysis — Structural Mechanics. Meth. CHAN AND D. JAMESON [1978]. IEEE. pp. Conf. CAREY [1986]. Proceedings of the First Copper Mountain Conference on Multigrid Methods. Proc. POWELL [1986]. 969-977. CAPPELLO [1985]. J. 418-425. 756-761. CAPPELLO [1987]. Multigrid algorithms on the hypercube multiprocessor. [328] R. [324] R. pp. [313] C. Iterative solution of linear equations on a parallel processor system. NASA Langley Research Center. On the implementation of kernel numerical algorithms for computational fluid dynamics on hypercubes. [333] T. 1985 Int. CAREY [1985]. WETHERALD [1967]. in Heath [860]. Proc. Multigrid algorithms on the hypercube multiprocessor. [325] R. Conf. CAREY [1981]. s14-s26. [323] D. [330] T. Par. 1985 Int. Proc. pp. 569-575.
A new parallel algorithm for solving a complex function f(z) = 0. pp. [322] D. CHAN AND D. Gray codes. and hypercubes. 1985 Int. Conf. Par. SHARMA [1988]. J. CAM Report 88-16. Analysis of preconditioners for domain decomposition. 382-390. [329] T. eds. pp. RESASCO [1988]. Domain decomposition preconditioners for general second order elliptic problems. Comput. [332] T. Report YALEU/DCS/RR-368. Gaussian elimination on a hypercube automaton. pp. 14. Parallel Computing. AND P. in Glowinski et al. [762]. CAM Report 87-16. Appl. Acoustooptic linear algebra processors — architectures. algorithms and applications. 30-40. [331] T. Report. Department of Mathematics. CARLSON AND K. 217-230. pp. Comput. 18(12). CARDELMO AND P. High speed processors and implications for algorithms and methods. 4. in Heath [860]. in Heath [860]. CHAMBERLAIN. [327] R. [314] G. A high level library for hypercubes. Department of Science and Technology. 281-287. Proc. pp. J. CARROLL AND R. 16. Inherent and induced parallelism in finite element computations. Numer. 651-655. [335] T. A domain-decomposed fast Poisson solver on a rectangle. Tech. [319] A. Application of parallel processing to numerical weather prediction. Tech. [318] W. pp. 591-614. Par. Dist. CATHERASOO [1987]. The vortex method on a hypercube concurrent processor. in Heath [860]. pp. 241-260. 72. FREDERICKSON. Chr. Michelsen Institute. An alternative view of LU factorization with partial pivoting on a hypercube multiprocessor. UCLA. Copper Mountain. Hypercube implementation of domain decomposed fast Poisson solvers. pp. Parallelism in finite element modeling. Berlin. A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 137 [310] D.
[322] D. Systems Engineering Laboratory. CHAN AND Y. SAAD [1986]. Multigrid algorithms on the hypercube multiprocessor. IEEE Trans. C-35. pp. CHAN AND Y. CHAN AND T. Norway. Gray codes, fast Fourier transforms, and hypercubes. pp. 738-746. CHAMBERLAIN AND M. P. RESASCO [1987]. 747-755. University of Texas at Austin. Supercomputer. [317] G. Stein. Wunderlich. Bathe. and K. Trottenberg. eds. [321] C. CAUGHEY [1983]. Appl. & Comp. 8. Multigrid calculation of three-dimensional transonic potential flows. Recent experiences with three dimensional transonic potential flow calculations. [326] R. Meth. CHAN AND D. RESASCO [1987]. A framework for the analysis and construction of domain decomposition preconditioners. SIAM J. pp. pp. Springer-Verlag. New Haven. CT. SIAM J. Sci. Statist. Comput. Report CCS 86/10. CHAMBERLAIN [1987]. BARRAGY. Comm. Numer. Math. Appl. CHEN AND S. 13(3-4). pp. 24-29. 225-234. (Special Issue. McCormick and U. eds.). ACM. CONROY AND S. February. Tech. Report TM-78733. pp. 288-308. Anal. 27. AND A. Appl. 4. 107-110. pp. 2. Proc. NEWMAN. JOY. W. CASASENT [1984]. pp. IEEE Trans. Comput. Proc. Par. Conf. 1985 Int. pp. 48. Par.

Department of Mathematics. CHEN AND D. A survey of parallel multigrid algorithms. 22-31. 6(3). 1. pp. RAPP [1988]. [352] D. 564-573. Par. PhD dissertation. MIRANKER [1970]. AND M. Yale University. [353] D. Processor allocation in an n-cube multiprocessor using Gray codes. Time and parallel processor bounds for linear recurrence systems. 81-88. Tech. 1396-1407. Tech. [343] T. TUMINARO [1987]. [342] T. CAM Report 87-16. Implementation and evaluation of multigrid algorithms on hypercubes. PhD dissertation. pp. 26. GUSTAFSON [1986]. HSIUNG [1984]. CHAN AND R. CHAN AND R. J. Lin. pp. [346] H. CHEN [1984]. Comp. DONGARRA. RAPP [1988]. [347] H. [351] Y. Chaotic relaxation. PhD dissertation. pp. 698-711. Comput. 6-9. CRAY Channels. Comp. 8. [357] M. CHAPMAN [1979]. 17th Aerospace Sciences Meeting. Dist. Comput. Eng. Appl. Alg. CHEN [1982]. IEEE Trans. 6. Statist. Multiprocessing linear algebra algorithms on the CRAY X-MP-2: Experiences with small granularity. CHAN AND R. CHAN AND R. pp. A parallel Householder tridiagonalization stratagem using scattered row decomposition. CHAN AND R. pp. Chaotic relaxation. Department of Computer Science. [364] S. 417-424. University of Illinois at Urbana-Champaign. CHEN [1975]. IRANI [1980]. Proc. Parallel Computing. March. Report 368. 297312. A parallel Householder tridiagonalization stratagem using scattered square decomposition. Yale University. [354] A. Par. pp. Comput. [355] K. KUCK [1975]. Conjugate Gradient Methods for Partial Differential Equations. pp. SALAMA. CHANG [1982]. A Jacobi algorithm and its implementation on parallel computers. [344] T. Multigrid algorithms on the hypercube multiprocessor. VOIGT AND C. ROMINE [337] T.
in Control Data Corporation [411]. UCLA. Tech. Dist. A non-gradient and parallel algorithm for unconstrained minimization. 101-115. J. Comp. [349] D. Department of Computer Science. UTKU. Wu [1984]. Speedup of Iterative Programs in Multi-Processing Systems. 101-117. pp. MIRANKER [1969]. PhD dissertation. pp. Design and implementation of parallel multigrid algorithms. Department of Computer Science. TUMINARO [1988]. Solving elliptic partial differential equations on the hypercube multiprocessor. CHAUVET [1984]. Comp. [345] R. Parallel networks for multigrid algorithms: Architecture and complexity. Borehole acoustic simulation on vector computers. 59-67. SCHULTZ [1987]. Y. California Institute of Technology. 3. 19(3). pp. Control. [356] M. IEEE Trans. C-36. pp. pp. [358] M. PhD dissertation. CHAZAN AND W. CHAZAN AND W. pp. Num. J. Comm. CHEN AND C. 857874. Polynomial Scaling in the Conjugate Gradient Method and Related Topics in Matrix Scaling. Pennsylvania State University. 3. 207-217. in Kowalik [1116]. [360] S. Meth. [348] S. Large-scale and high-speed multiprocessor system for scientific applications: CRAY X-MP-2 series. Computational aerodynamics development and outlook. AIAA paper 79-0129. 10-23.
[350] A. [363] S. 81-88. Tech. 6. Multitasking a vectorized Monte Carlo algorithm on the CRAY XMP/2. CHARLESWORTH AND J. in McCormick [1312]. [338] T. Center for Supercomputing Research and Development. 730-737. pp. CHEN AND K. HAN [1987]. SCHREIBER [1985]. CHEN. Y. Introducing replicated VLSI to supercomputing: The FPS-164/MAX scientific computer. Optimum solution to dense linear systems of equations. Par. pp. Meth. 2. 18th Allerton Conf. on Comm. Cont. and Comp. [359] M. CHANG. CHEN AND K. A design methodology for synthesizing parallel algorithms and architectures. [341] T. [340] T. SAAD. SCHULTZ [1985]. TUMINARO [1988]. Implementation of multigrid algorithms on hypercubes. SALAMA. AND D. TUMINARO [1988]. Solving elliptic partial differential equations on the hypercube multiprocessor. AND M. [361] S. CHEN [1986]. [362] S. UTKU. Space-time Algorithms: Semantics and Methodology. pp. SAAD. AND M. SCHREIBER [1985]. I. [339] T. CHAN. CHAN. AND C. M. M. 3. pp. 199222. S. S. Comput. Parallel Computing. 6. CHEN [1983]. CHANDRA [1978]. 461-491. Numer. Control. SIAM J. Statist. Sci. J. Comput. Department of Computer Science. Department of Computer Science. Yale University. Report YALEU/DCS/RR-373. CHEN AND A. A parallel quasi-Newton method for partially separable large scale minimization. Proc. 1984 Int. Conf. pp. [363] S.

Comput. Proc. 4. pp. 4. Report RM-86-13. [383] A. October. Philadelphia. SIEWERT [1986]. Tech. Report UCRL-94464. CHU AND A. Department of Applied Mathematics. CLAUSING. CHRIST AND A. pp. 613-622. CHENG AND S. 342-353. Statist. AND D. 10. KUCK. HEDSTROM. multiple-scale problems. AND A. 899-913. pp. MURATA [1983]. 65-74. Par. HAMILTON [1987]. Conf. pp. Models for dynamic load balancing in a heterogeneous multiple processor system. April. CHENG AND O. KOHLER [1979]. pp. Comput. Proc. HEDSTROM. [389] H. Sci. Proc. Center for Supercomputing Research and Development. on Supercomputing. CHEUNG AND J. QR factorization of a dense matrix on a hypercube multiprocessor. CHERN AND T. Seismics Acous. Athens. C-35. pp. pp. [381] C. Par. Conf. pp. 10. IEEE Trans. GEAR [1987]. Parallel Algorithms/Architectures for the Solution of Elliptic Partial Differential Equations. pp. A fast algorithm for concurrent LU decomposition and matrix inversion. CHIN. HOUSTIS. AND R. pp. [386] E. QR factorization of a dense matrix on a shared memory multiprocessor. PhD dissertation. Department of Computer Science. University of Waterloo. CHIMA AND G. [390] J. Tech. Report ORNL/TM-10691. pp. Gaussian elimination with partial pivoting and load balancing on a multiprocessor. University of Illinois at Urbana-Champaign. [366] S. [385] E. On the use of the FN method for radiative transfer problems. Parallel Computing. in Wouk [1999]. Proc. Report 657. Softw. A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 139 [365] S.
PhD dissertation. Department of Computer Science. [368] T. Parallel solution of ODE's by multiblock methods. 136-153. pp. 79-86. [367] K. Vector. GEORGE [1987]. pp. 2. Implementation of preconditioned s-step conjugate gradient methods on a multiprocessor system with memory hierarchy. CHRONOPOULOS AND C. Proc. SIAM J. pp. 90-94. CHIN. IEEE Trans. A technique for achieving portability among multiprocessors: Implementation on the Lemur. [371] M. CHEN. [375] R. The performance of software managed multiprocessor cache on parallel numerical algorithms. A simulation study of the CRAY X-MP memory system. Comput. Sci. [370] M. Parallel computation of multiple-scale problems. [378] R. CHIN. G. HEDSTROM. HOWES. AND J. MCGRAW [1986]. 136-153. GEORGE [1987]. Tech. [388] M. 1983 Int. Parallel solution of ODE's by multiblock methods. Statist. Tech. [372] G. Efficient solution of the Euler and Navier-Stokes equations with a vectorized multiple-grid algorithm. June. Report 1346. Tech. Oak Ridge National Laboratory. [376] R. 5. CHEONG AND A. 3D vector forward modeling. pp. SAMEH [1975]. Practical parallel band triangular system solvers. ACM Trans. [392] A. Orthogonal Decomposition of Dense and Sparse Matrices on Multiprocessors. SAMEH [1978]. Tech. pp. pp. GEORGE [1988]. Parallel programming in ANSI standard Ada. Conf. Springer-Verlag. On parallel triangular solvers. CHERN AND T. [384] E. 550-556. Parallel Computing. [380] N. CHERRY [1984]. J. CHIN. SORENSEN [1987]. [391] J. G. CHIN. SCROGGS. Paper 83-1893. [379] Y. CHU [1988]. Proc. CHU AND A. LEV-ARI [1987]. Models for dynamic load balancing. Lab. New Computing Environments: Parallel. Vector and Systolic. [387] E. HEDSTROM. HOWES. AND J. MCGRAW [1986]. AIAA. Proc. Proc. 1983 Int. KAILATH. AND H. Fast parallel algorithms for QR and triangular factorization. [373] T. OVERBEEK [1985]. 5th year Prog. Report ORNL/TM-10581. Oak Ridge National Laboratory. HARRAR. AND J. [382] C. CHRISTARA. HAGSTROM. AND J. [374] R. CHRISTARA [1988]. A parallel spline collocation-capacitance method for elliptic partial differential equations. CHU AND A. Par. CLEARY. Int. 1987 Int. Conf. Par. 8. 239-258. Proc. Par. Conf. 5. 137-162. Tech. Report CSD-TR-735. Purdue University. RICE [1988]. Purdue University. Report CSD-TR-735. pp. 344-350. Parallel computation of a domain decomposition method. 210-228. 8. Lawrence Livermore National Laboratory. pp. 354-. CHOW AND W. 33. Rev. Comput. Reston. [377] R. CHIN. CHUANG AND L. putation of multiple-scale problems. 136-153.
Efficient matrix multiplication on a concurrent dataloading array processor. C-28. AND J. A fixed size systolic array for arbitrarily large eigenvalue problems. CHEONG AND A. IEEE Trans. CHU AND H. Gaussian elimination and Choleski factorization on the FLEX/32. TERRANO [1984]. Par. VLSI systems for band matrix multiplication. VEIDENBAUM [1987]. A very fast parallel processor. IEEE Trans. CHUN. Proc. SAHNI [1987]. Conf. 270-77. C-28. pp. pp. 33. 354-. pp. 237-38. Statist. SIAM J. Sci. Conf. JOHNSON [1983]. JOHNSON [1982]. Par.

& Math. January. J. C. Acta Informatica. 838-847. CODENOTTI AND P. December. TROTTENBERG [1988]. in Paddon [1512]. 617-621. Comp. CLEMANS-AUGUST AND U. GNUDI. Math. COSNARD. GOLUB. [400] J. October. pp. CODENOTTI [1988]. [416] M. CLINT. 712-723. AND J. Ft. [408] V. pp. A short note on standard parallel multigrid algorithms for 3D problems. TCHUENTE. I. [403] B. 91-106. Parallel Algorithms and Architectures. pp. 1714. Comp. 265-288. AND P. University of Minnesota Supercomputer Institute. MARRAKCHI. J. Appl. Comp. pp. Iterative solution of linear equations on a parallel processor system. Math. & Math. Appl. [418] G. Report No. pp. 631-634. CLEARY. CHIN. [413] M. D. Resolution parallele de systemes lineaires denses par diagonalisation. Bulletin de la Direction des Etudes et des Recherches. Proceedings of the 29th AIAA Structures. Structural Dynamics and Materials Conference. Russo. Math. Prog. [411] CONTROL DATA CORPORATION [1982]. pp. VAUGHAN [1988]. A parallel perturbed biharmonic solver. PhD dissertation. COSNARD. CALTABIANO. FAVATI [1987]. AND Y. A chordal preconditioner for large scale optimization. Low rank modification of Jacobi and JOR iterative methods. Parallel Computing. Appl. Math. [409] J. SLOTNICK [1958]. Parallel QR-decomposition of a rectangular matrix. NASA. CO. [413] M. COSNARD. STEWART [1984]. Parallel Computing. CLINARD AND A. 13-44. 239-249. COLLIER. ROBERT [1986]. A study of non-blocking switching networks. PhD dissertation. Comparaison des methodes paralleles de diagonalisation pour la resolution de systemes lineaires denses. Proceedings Symposium CYBER 205 Applications. Sci. COSNARD. 27. Complexity of parallel QR factorization. NASA Tech. [398] C. STORAASLI. Solution of structural analysis problems on a parallel computer. [413] M. POOLE. AND D. 13. COSNARD. AND G. COSNARD. [416] M. Block preconditioning for the conjugate gradient method. pp. CONCUS. Fort Collins. Tech. Report No. 86-4. March. pp. CODENOTTI AND P. A. FAVATI [1987]. HOLT. LOGAN. pp. 596-605. QUINTON. Amsterdam. Y. AND M. TRYSTRAM [1988]. The use of parallelism in numerical calculations. IBM. Research Memorandum RC-55. [396] J. COCHRANE AND D. Large-scale computations on a scalar. vector and parallel "supercomputer". pp. in Feilmeier et al. [623].
Fast parallel algorithms for matrix inversion and linear systems solution. J. [397] M. Appl. COCKE AND D. Tough problems in reactor design. pp. 220-252. Bell System Tech. J. A. J. CONRAD AND Y. WALLACH [1977]. Iterative methods for the parallel solution of linear systems. pp. 406-424. 32. Comp. Comparison des methodes paralleles de diagonalisation pour la resolution de systemes lineaires denses. pp. 681-86. [415] M. Paris. TRYSTRAM [1985]. [398] C. CLOS [1953]. ROBERT. HEUN. Optimal scheduling for two-processor systems. COFFMAN AND R. GRAHAM [1972]. pp. 200-213. Block preconditioning for the conjugate gradient method. MEURANT [1985]. pp. 275-296. ROBERT. Y. COSNARD. Parallel Gaussian elimination on an MIMD computer. TRYSTRAM [1986]. [414] M. FOLSOM. MCCALLIEN. AND C. Feasibility study for NASF. 1. 48. pp. Acad. [417] M. ROBERT. Y. ROBERT [1986]. Algorithms for the parallel computation of eigensystems. MULLER. D. [405] T. Numer. 33-36. COSNARD AND Y. [402] B. Strategies and performance norms for efficient utilization of vector pipeline computers as illustrated by the classical mechanical simulation of rotationally inelastic collisions. C(2). University of Maryland. [406] W. 1301(16). TRUHLAR [1986]. 1. CORONGIU. North-Holland. [410] CONTROL DATA CORPORATION [1979]. pp. 101-116. Appl. [407] P. 67-87. COFFMAN AND R. 13. pp. 13. Implementing fracture mechanics analysis on a distributed-memory parallel processor. Oak Ridge National Laboratory. Tech. Report ORNL/TM-10367. [404] E. 5. [394] T. 33. pp. 324-353. [393] A. GEIST [1987]. Parallel Direct Solution of Sparse Linear Systems of Equations. E. [401] B. 40. [399] D. AND A. eds. Contractor Report NAS2-9896. Final report. [412] M. AND D. ROBERT. pp. 48-57. CONROY [1986]. 275-296. DETRICH. Tech.
Bell System Tech. J. A. J. Iterative methods for the parallel solution of linear systems. pp. [412] M. AND A. eds. pp. in Feilmeier et al. Report 86-4. PERROTT. pp.

A vectorizing C compiler. E. 324-353. COWELL AND C. PSolve: A concurrent algorithm for solving sparse systems of linear equations. 1987 Int. Softw. Tech. Proc. pp. Watson Research Center. pp. VA. [436] A. Comput. GOODHUE. 531-540. March. 19. To appear. STARR. Performance of a subroutine library on vector processing machines. 1-37. France. AND A. IEEE Trans. 7. M. A 3-D earthquake model. DAVIS [1983].. Current progress and prospects in numerical techniques for weather prediction models. ton. Center for Supercomputing Research and Development. 50. [426] B. Report 577. Comput. Department of Mathematical Sciences. Conf. CVETANOVIC [1986]. Supercomputer. KUCK. Cox [1988]. A divide and conquer method for the symmetric tridiagonal eigenproblem. CRUTCHPIELD [1987]. IEEE Spectrum. [432] G. L. August. 5.. pp. Clemson University. CRANE. CROCKETT [1987]. Phys. pp. GEIST [1987]. Proc. December. 67-82. 94-99. JEDYNAK. DAY AND B.. DAVYDOVA AND I. pp. Report NASA SP-347.. DAVY AND W.. AND T. Proc. Par. SIAM J. NASA Ames Research Center. Center for Supercomputing Research and Development. Features characterizing the solution of computational problems on current and projected highly efficient computing systems. May. [424] T. BLACKADAR [1985]. 27-36. Ocean modeling on the Cyber tOS at GFDL. R. in Control Data Corporation [411][446] M. DuCROZ [1988]. [427] L. KING [1986]. FRANK. Symposium. 177-195. Performance of Fortran floating-point operations on the Flex/32 multicomputer. PSolve: A concurrent algorithm for solving sparse systems of linear equations. W. FUNDERLIC. [435] K. Proceedings of a Cray Research Inc. . THOMPSON [1986]. pp. Institut National Polytechnique de Toulouse. Proc. R. [444] I. Performance analysis of the FFT algorithm on a shared memory parallel architecture. A hypercube implementation of the implicit double shift QR algorithm. J. [443] W. Supercomputing trade-offs and the Cedar system. MINKOPP. CVETANOVIC [1987]. [428] M. CSANKY [1976]. Parallel Computing. 
REINHARDT [1975]. pp. pp. Comm. 483-490. 36. CYBENKO. DALY AND J. INC. 162-172. Parallel complexities and computations of Cholesky's decomposition and QR factorization. DAVIDSON [1987]. D. 20(11). 165-172. Phys. [431] Z. Tech. [438] A. pp. Implementation of a divide and conquer cyclic reduction algorithm on the FPS T-SO hypercube. 618623. C-36. DAVIS. LAWRIE. ICASE Interim Report 4. [442] T. Int. C. COX [1983]. pp. Tech.. J. Computation of shuttle non-equilibrium flow fields on a parallel processor. in Heath [860]. Report ANL/MLS-TM-63. IBM T. Par. 55-64. [422] R. D. [420] C. The effects of problem partitioning allocation and granularity on the performance of multiple-processor systems. [430] Z. [421] M. K. AND S. Transforming Fortran DO loops to improve performance on vector architectures. [441] T. Science. SAMEH [1986]. 12. Computer architecture. CULLEN [1983]. [433] W. Tech. pp. A. VENKATARAMAN [1987]. Math. SHKOLLER [1982]. MILLIKEN. pp. 538-550. pp. pp. Tech. University of Illinois at Urbana-Champaign. PhD dissertation. KRUMME. Math. Performance measurements on a It8-node Butterfly parallel processor. AND A. 11. Fixed hypercube embedding. Comp. J. Comput. SIAM J. DATTA [1985]. Comput.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 141 [419] W. DUFF [1987]. January. Department of Computer Science. 15. DAVIS [1986]. Engineering and the CRAY-1. DAVYDOV [1985]. Tech. DAVE AND I. Math. 619-626. Tech. 5. [440] G.. AND V. DAVIS. 1985 Int. Numer. Tufts University. Simulation. Report URI-037. Moscow. ICASE. RIDEOUT [1977]. Argonne National Laboratory. [434] C. 27-32. Izdatel'stvo Nauka. Sparse matrix calculations on the CRAY-2. Computational Processes and Systems. J. pp. M. Report 612. [439] G. Performance modeling of large-grained parallelism. ACM Trans. [429] J. Report. Column LU factorization with pivoting on a hypercube multiprocessor. Conf. AND K. University of Illinois at Urbana-Champaign. Report RC11719(52739). NASA Langley Research Center. 
Algebraic Discrete Methods. CYRE. CUPPEN [1981]. Hampton. [425] W. pp. [1982]. Parallélisation d'algorithmes d'optimisation pour des problèmes d'optimum design. [445] S. DAVIDSON. WISPAC: A parallel array computer for large scale system simulation. Fast parallel matrix inversion algorithms. DAYDE [1986]. CROWTHER. D. [423] CRAY RESEARCH. DAVIS AND E. HILLSTROM. 421-432. THOMAS. REDMOND. in Gary [700]. DAVIS [1986]. [437] E.

ADAMS [1987].2. Proc. DEMINET [1982]. Geometry-defining processors for partial differential equations. [1985]. PhD dissertation. pp. Research questions for performance analysis of supercomputers. Tech. Bristol. DEIWERT AND H. 322-323. Report YALEU/DCS/RR-540. [455] P. KAMP [1980]. ORTEGA. Univ. New York. Plenum Publishing Corp.-M. ROMINE [447] P. TODD [1984]. [448] C. R. Proc. [470] J. C-33. DEWEY AND A. DENNING AND G. American Scientist. DENNIS [1980]. Report. CA. University of California. [466] J. H. pp. [468] J. A method of matrix inverse triangular decomposition based on contiguous principal submatrices. MORF [1981]. Berkeley.. IEEE Trans. IEEE Trans. pp. SAMBA.. Proc. pp. DELOSME AND I. Soc.. DELOSME AND M.. Alg. 10. [459] P. On a real-time scheduling problem. 31. Lin. B. Algorithms and Architecture for Multiprocessor-Based Circuit Simulation. Lin. G. VOIGT AND C. Electronics Research Laboratory. Y. Naval Research Laboratory. NEWTON [1984]. Also in Gary [700]. 37-46. [473] S. Report 86-21.. WENG [1977]. AND S. Academic Press. Memorandum Report 5504. [460] P. Computation Structures Group Memo 225. pp. SAHNI [1981]. Exploring Applications of Parallel Processing to Power System Analysis. in Kuck et al. AND J. Massachusetts Institute of Technology Laboratory for Computer Science. Conf. pp. 278-288. Proceedings of the International Symposium on Large Scale Scientific Computation. eds. Oper. pp. [454] J. M. 1984 Int. DE VoRE [1984]. ed. DENNIS [1982]. LlU [1978]. Alder. Vectorization and implementation of an efficient multigrid algorithm for the solution of elliptic partial differential equations. 13(11). A. 21st Design Automation Conference. HENDRY [1984]. NEWTON [1984].. Electric Power Research Institute. DEUTSCH AND A. Department of Computer Science. pp.-M. Experience with multiprocessor algorithms. DELOSME AND I. Proc. Systolic Arrays.-M. [1133]. 75-111. ROTHMUND [1983]. 167-183. [463] J. Architecture and Performance of Specialized Computer Systems. GENIN. 
DEVREESE AND P. 207-214. [462] J. Ltd. D. Report EE 566SR. pp. 15-20. [461] P. DELOSME [1987]. Computer. Data flow supercomputers. A processor for two-dimensional symmetric eigenvalue and singular value arrays. C-31. [467] J. Supercomputers in Theoretical and Experimental Science. DEMBART AND K. Inst. January. & Appl. DELVES. Par. DENNIS AND K.-M. Amsterdam. Band matrices on the DAP. RIACS. Amsterdam. Comput. Data flow ideas for supercomputers. Yale University. NEVES [1977]. [456] L. SLAM J. Netherlands.. Conf. Evaluating supercomputers. 194212. [471] J. Res. pp. pp. [457] B. IEEE Comp. Symp. DENNIS [1984]. DENNIS [1984]. A one-tided Jacobi algorithm for computing the singular value decompotition on a vector computer. Comput. May. VAN CAMP. [469] J. 73. 657-673. GAO. Massachusetts Institute of Technology Laboratory for Computer Science. IPSEN [1987]. DENNIS. NY. pp. Scattering arrays for matrix computations. [449] G. Alg. High speed data flow computer architecture for the solution of the NavierStokes equations. San Diego. Modeling the weather with a dataflow supercomputer. Adam Hilger. [451] J. Tech. High speed data flow computer architecture for the solution of the NavierStokes equations. DENNING [1987]. Math. [464] J. AND K. [458] J.142 J. Report TR-87.. DE RlJK [1986]... A multiprocessor implementation of relaxation based electrical circuit simulation. IPSEN [1986]. Parallel matrix and graph algorithms. in Paddon [1512]. Parallel solution of symmetric positive definite systems with hyperbolic rotations. Application of data flow computation to the weather problem. NASSIMI. AIAA 16th Fluid and Plasma Dynamics Conference. [450] E. DEUTSCH AND A. [472] D. SPIE 25th Tech. COMPCON 84. AND Y. 592-603. Tech. pp. DHALL AND C. MSLICE: A multiprocessor based circuit simulator. 187-200. 143-157. 26. Tech. DEKEL. PATERA [1987]. 48-56. pp. DELSARTB. DENNING [1985]. NASA Ames Research Center. 127- . North-Holland. Sparse triangular factorization on vector computers.. 
DEUTSCH [1985]. Comput.. Efficient systolic arrays for the solution of Toeplitz systems: An illustration of a methodology for the construction of systolic architectures in VLSI. [453] J. Three dimensional flow over a conical afterbody containing a centered propulsive jet: A numerical simulation. & Appl. 77. [452] J. [465] J. Parallel computation.. G.

in Cray Research. Symp. STEIHANO [1985]. Proc. pp. DONGARRA [1985]. Proc. pp. Performance of various computers using standard linear equations software in a Fortran environment. 235. 51-59. in Feilmeier et al. DIAMOND [1975]. 13-15. Herts. France. Redesigning linear algebra algorithms. 22(3). ACM SIGNUM Newsletter. 5ome UNPACK timings on the CRAY-1. Block diagonal scaling for iterative methods — Thermal simulation. 92. JlNES. AND S. [479] R. [499] J.. PATEL [1982]. PERRIER [1981]. DINH. Proc. [488] J. DODSON AND J. [494] J. [484] D. [490] J. January. pp. et M. Versailles. DlAZ. Stillwater. [423].. DONGARRA [1983]. Methods. 21(4). Report OU-PPI-TR-85-05. A proposal for an extended set of Fortran basic linear algebra subprograms. DODSON AND J. To appear. Herts. [1987]. Dallas. LEWIS [1985]. Squeezing the most out of an algorithm in CRAY- . 19(3). Issues relating to extension of the basic linear algebra subprograms. Symp. Technical Memo 41. HAMMARLING. Boeing Computer Services. OK. December. HANSON [1988]. A proposal for a set of level S basic linear algebra subprograms. p. LEWIS [1982]. pp. Simul. Internal Memorandum G4550-CM-39. EISENSTAT [1984]. DONGARRA AND I. DoNGARRA [1978]. [496] J. 503-509. TX. pp. Sixth Int. DONGARRA AND I. HARADA [1987]. [480] Q. Proc. [483] L. The stability of a parallel algorithm for the solution of tridiagonal linear systems. DONGARRA. University of Oklahoma. BETTE. Report 125. W. 2-14. MANTEL. Harwell Report. HAMMARLING. P. [489] J. AND P. NOG. MCDONALD. JlNES. Argonne National Laboratory. W. PhD dissertation. [491] J. Hat field. North-Holland. 293-298. Increasing the performance of mathematical software through highlevel modularity. pp. [493] J. Workshop on Applied Computing in the Energy Field. SlNGH [1982]. AND R. Preliminary timing study for the CRAYPACK library. 2-18. HANSON [1986]. 1978LASL Workshop on Vector and Parallel Processors. North-Holland. Subdomain solutions of nonlinear problems in fluid dynamics on parallel processors. 
Calculating the block preconditioner on parallel multivector processors. [477] J. AND T. DUFF [1987]. pp. Hatfield. Computer. 1-17. Paris. DONGARRA. On « convergence criterion for linear (inner) iterative solvers for reservoir simulation. Argonne National Laboratory. ACM SIGNUM Newsletter. February. p. W. Report 132. DucKSBURY. DUCROZ. Tech. Report ANLMCS-TM-57 (Revision 1). B. S. JlNBS. HAMMARLING [1987]. [486] D. J. DONGARRA [1984]. A preconditioning algorithm for solving nonsymmetric linear systems suitable for supercomputers. R. North-Holland. 14. DoDSON [1981]. Report MCA-TM-23. Mathematics and Computer Science Division. Simulation Numerique en Elements Finis d'ecoulements de Fluides Visqueux Incompressibles par Une Methode de Decomposition de Domaines Sur Processeurs Vectoriels. Tech. S. 5th International Symposium on Computational Methods in Applied Sciences and Engineering. ed. DUFF [1986]. GLOWINSKI. pp. School of Electrical Engineering and Computer Science. AND R. Tech. SPE 1985 Res. [492] J. ACM SIGNUM Newsletter. pp.. France. 1975 Sagamore Conf. J. [474] M. [498] J. 239-248. DlNH [1982]. Curie. Softw. DlEKKAMPER [1984]. How do the mini-supers stack up?. DlAZ. [495] J. Bulletin de la Direction des Etudes et des Recherches. pp. Versailles. Comm. DUCROZ. AND R. Improving the performance of a sparse matrix solver on the CRAY-1. [476] J. Experimental Parallel Computing Architectures. DlXON AND K. 2-4. DlAZ. Argonne National Laboratory. AND P. ACM Trans. in Kartashev and Kartashev [1055]. [485] D. Applied Numer. 41-47.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 143 140. [482] L. I. DONGARRA. Par. [481] Q. An update notice on the extended BIAS. Tech. Performance of vector computers for direct and indirect addressing in Fortran. HANSON [1984]. Proc. pp. Harwell Laboratory. [497] J. J. PERIAUX. The place of parallel computing in numerical optimization: Four parallel algorithms for nonlinear optimization. [478] J. HAMMARLING. Inc. Math. [487] S. E. 
STEIHANO [1985]. & Applied Sciences. Doi AND N. 20(1). Advanced architecture computers. NOC. P.D. Comp. S. Methods in Eng. Development and performance of a block pre-conditioned iterative solver for linear systems in thermal simulation. An extended set of basic linear algebra subprograms. Vectorized finite element analysis of nonlinear problems in structural analysis. DlAZ [1986]. C(l). DUCROZ. AND T. [500] J. Tech. Proc. Univ.F. DONGARRA [1986]. [475] J. 58-75. A parallel version of the conjugate gradient algorithm for finite element problems. J. DONGARRA. A. [623]. AND T. DUCROZ. S. DlXON. DUFF. DONGARRA. DONGARRA AND S. STEIHANO [1986]. J.

Steinberg. EL. AND J. MA. S. Computer. [524] J. Unrolling loops in FORTRAN. Sci. MIRANKER [1988]. 77. W. pp. Sci. DONGARRA AND A. A collection of parallel linear equations routines for the Denelcor HEP. pp. ROMINE FORTRAN. Tech. DONGARRA AND R.. F. AND L. DONGARRA AND D. pp. Research Report RC13429. SIAM J. Effects ofRayleigh accelerations applied to an initially moving fluid. Report ANL-8579. DONGARRA AND R. Lee. 5. Tech. SAMEH. in McCormick [1312]. MIRANKER [1988]. Math. SPEISER. 175-186. Numer. [512] J. DONGARRA AND D. pp. [506] J. [504] J. Lin. A portable environment for developing parallel FORTRAN programs. 25. SIAM J. R. 7. 9. 26. University of Oklahoma. Urbana. DOWERS. Approaching the gigaflop). J. DONGARRA. 45—47. H. DONGARRA. 221-230. in McCormick [1312]. A. S. Report ANL/MCS/TM-33. [508] J. HICKS. SYMANSKI [1987]. Report ORNL-6335. Generating parallel algorithms through multigrid and aggregation/disaggregation techniques. [523] B. SORENSEN [1987].. 91-112. Comparison of the CRAY X-MP-4. Parallel Computing. LUK. CARRERAS. DOUGLAS AND W. 10. DOUGLAS. Oak Ridge National Laboratory. [518] C.. ROBERTSON. Appl. pp. SORENSEN [1987].. Statist. R. eds. GUSTAVSON. IEEE Proceedings of the 7th Symposium on Computer Arithmetic. ORTEGA. Report ANL/MCS-TM-15. SIAM J. DONGARRA AND D. Exper. Anal. DONOARRA. Yorktown Heights. & Comp. SORENSEN [1986]. pp. A parallel linear algebra library for the Denelcor HEP. pp. Schultz. S. AND W. Generating parallel algorithms through multigrid and aggregation/disaggregation techniques. Argonne National Laboratory. 5. 25-34. VOIGT AND C. SIAM Rev. [509] J. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. [515] J. LAWKINS. [520] C. DOUGLAS AND W. SMITH [1988]. Watson Research Center. AND A. DONGARRA AND T. The interaction of numerics and machines. 219-246. MIRANKER. HINDS [1985]. HORIGUCHI. HIROMOTO [1984]. WINKLER [1988]. Pract. 3. AND A. HEWITT [1986].J. Tech. 113-136. SIAM J. 8. M. S. 219-229. 
North-Holland. B. L.. DONOARRA AND A. LAKSHMIVARAHAN. January. Parallel Computing. pp. KARP [1984]. M. pp. SAMEH [1984]. On the comparison of the performance of Alliant FX/8. Comput. [507] J. School of Electrical Engineering and Computer Science. DONGARRA AND D. Parallel Computing.. and M. [517] C. A fast algorithm for the symmetric eigenvalue problem. Anal. F. & Appl. DONGARRA AND L. [502] J. [522] K. Softw. AND S. Implementing dente linear algebra algorithms using multitasking on the CRAYX-MP-4 (or. AND D. Statist. Using symmetries and antisymmetries to analyze a parallel multigrid method. [503] J. A fully parallel algorithm for the symmetric eigenvalue problem. Argonne National Laboratory. SMITH. To appear. pp. Parallel Computing. DRESSLER. Tech. Amsterdam-New York. DOUGLAS AND W. Argonne.144 J. SORENSON [1984]. Proc. pp.. 376-398. SORENSEN [1986]. Some non-telescoping parallel algorithms based on serial multigrid/aggregation/disaggregation techniques. B. B. A collection of parallel linear equation routines for the Denelcor HEP. VAX 11/780. LYNCH [1987]. JOHNSSON [1987]. DOUGLAS. Argonne National Laboratory. pp. Numer. ACM Trans. October. Implementation of a 3-D nonlinear MHD calculation on the Intel hypercube. IBM T. DRAKE. HAMMARLING [1986]. 1. [519] C. DOUGLAS AND B. AND S. Materials Processing hi the Reduced Gravity Envi- . 133-142. pp. SORENSEN [1985]. DHALL [1987]. Report ANL/MCS-TM-27. pp. [511] J. [516] C. MIRANKER [1988]. KAUFMAN. 20(7). [501] J. [525] R. HINDS [1979]. sl39-s!54. Tech. Squeezing the most out of eigenvalue solvers on high performance computers. and IBM 3081 in solving linear tri-diagonal systems. Softw. DRAKE. SLAPP: A systolic linear algebra parallel processor. Solving banded systems on a parallel processor. DONGARRA AND D. HENDERSON.. Tech. SPRADLEY [1982]. [505] J. Math.. DONOARRA AND A. H. 347-350. Comput. Fujitsu VP-tOO and Hitachi S-810/tO. Constructive interference in parallel algorithms. D. [510] J. 
Implementation of some concurrent algorithms for matrix factorization. Linear algebra on high performance computers. 167176. September. First IMACS Symposium on Computational Acoustics. 57-88. MIRANKER [1987]. pp. [514] J. An Argonne perspective. [513] J. [521] C. Argonne National Laboratory. 338-342. HIROMOTO [1983]. G. AND V. On some parallel banded system solvers. Report. Alg. NY. 20.

AND S. eds. DuCROZ AND J. 293-302. WASMEWSKI [1987]. The solution of sparse linear equations on the CRAY-l. Hypercube performance. Report ORNL/TM-10400. JOHNSSON [1986]. Mathematics and Computer Science Division. [545] I. Tech. GREENBAUM. Tech. DUFF AND L. Numerical study o/ a ramjet dump combustor flow field. DULLER AND D. IEEE Trans. DUFF [1986]. [535] J. Argonne. Summer Meeting. 2nd Int. DUFF AND J. pp. algorithms in multiprocessors. The solution of sparse linear equations on the CRAY-l. GOULD. [530] P. TALUKDAR [1979]. The influence of vector and parallel processors on numerical analysis. REID [1982]. Harwell Laboratory. N. 414-423. 299-305. Loen. Basic linear algebra computations on the Sperry ISP. in Kowalik [1116]. England. Softw. Parallel Array Processing. Computer Science and Systems Division.O. [547] I. DUNIGAN [1987]. Tech. REID. DUFF AND J.. [548] I. June 2-6. 45-54. WEIDNER [1982]. Oxon. Parallel Computing. ed. DUFF [1982]. Performance of synchronized iterative processes in multiprocessor systems. [527] J. DUNGWORTH [1979]. 131-136. DUFF [1986]. 419-431. Harwell Laboratory. AIAA Journal.R. Parallel implementation of multifrontal schemes. Conf. in Rodrigue [1643]. DUBOIS AND G. Approximating the inverse of a matrix for use in iterative algorithms on vector processors. pp. [423]. 51-76. [546] I. JOHNSSON [1986]. Proc. [541] I. Use of vector and parallel computers in the solution of large sparse linear equations. in Jesshope and Hockney [976]. Tech. An algorithm for power system simulation by parallel processing. pp. Comput. Report ANL/MCS-TM-84. Proc. DUFF AND L. RODRIGUE [1977]. Wiley. [544] I. DUFF [1986]. Conf. DURHAM. August 1984. pp. A. pp. DUFF [1986]. Rindone. pp. Haupt. Vector and Parallel Processors in Computational Science. Phys. Experience of sparse matrix codes on the CRAY-l. Report UCID-17515. pp. in Kuck et al. [552] M. AND J. 129-151. Juling. RODRIGUB [1979]. 4(3). [533] P. 76. Norway. G. [526] J. DUBOIS AND G. DUFF [1984]. pp. pp. Oxon. 
in Feilmeier et al. [1985]. in Heath [860]. Numerical study of a icramjet engine flow field. Computing. England. I.. AND G. [532] P. RODRIGUE [1977]. Computer Science and Systems Division. and Lange. [542] I. DUBOIS AND F. Elsevier Science Publishing Co. May. [553] T. Computer Science and Systems Division. 20/21. Report AERE-R 12393. Argonne National Laboratory.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 145 ronment of Space. in Cray Research. IEEE Power Eng. [529] M. PADDON [1984]. CRAY Channels. Comm. [550] R. DRUMMOND AND E. Paper 830421. Report CSS-211. [540] I. 20. The multifrontal method in a parallel environment. Technical Memorandum. Vector and Parallel Computing. Operator splitting on the STAR without transposing.. DUFF [1982]. pp. 17-39. Node orderings and concurrency in sparse problems: An experimental investigation. 293-309. [531] P. 3. [554] T. DUCKSBURY [1986]. England. DUFF [1986]. Supercomputer. pp. Computer Science and Systems Division. DUGAN. Tech. 1182-1187. Processor arrays and the finite element method. [538] I. Swimming upstream: Table lookups and the evaluation of piecewise defined functions on vector computers. Performance of S. North-Holland. Report CSS-210. [623]. Harwell Laboratory. SE-8. LESCRENIER. [534] P. [539] I. Inc. Multiprocessing a sparse matrix code on the Alliant FX/8. 193-204. England. DUBOIS. REID [1987]. Report AERE-R 12329. 257-268. pp. DUBOIS [1987]. AIAA. DUNIGAN [1987]. IL. Oxford. Soc. Performance of three hypercubes. DUFF [1987]. [543] I. [549] I. Lawrence Livennore National Laboratory.. Proc. Tech. [551] A. Oxon. 22. Oxon. Handler. An analysis of the recursive doubling algorithm. DUFF. [537] I. DRUMMOND [1983]. Argonne National Laboratory. Harwell Laboratory. The effect of orderings on the parallelization of sparse code. The parallel solution of sparse linear equations. 178-192. DUBOIS [1982]. The use of vector and parallel computers in the solution of large sparse linear equations. [528] M. Eng. pp. 
Int. Jeltsch. pp. BRIGGS [1982]. IEEE Power Eng. in Kartashev and Kartashev [1055]. The CRAY-1 computer system.

ElJKHOUT [1985]. ERHEL. ScHULTZ [1981]. Sci. EBERLEIN [1987]. [575] A. SIAM J. pp. [558] D. E. Comput. Supercomputer. Numer. Phys. Comput. AND F. IMAJNA. 7. LECA [1984]. 22. Report RC-12686. WATSON [1984]. AND C. Description of the ENIAC and comments on electronic digital computing machines. BRAINERD [1945]. Tech. [565] V. IEEE Power Eng. R. and Department of Mathematics. Algebraic Discrete Methods. College Park. 29-57. Oak Ridge National Laboratory. North-Holland. AND K. [581] J. EMMEN. Tech. Comput.2R. Scalar recurrences on chain able pipeline architectures. [562] J. 589-600. GOLDSTEIN. ELLIS AND L. ENSLOW [1977]. Report CNA-202. A parallel algorithm for simple roots of polynomials. THOMASETT [1982]. BAGANOFP. Study of the mapping of NavierStokes algorithms onto multiple-instruction/multiple-data-stream computers. 103-129. LICHNEWSKY. Report ORNL/TM-10881. EL TARAZI [1982]. [564] L. [580] J. [561] P. Report 1704. pp. Modified cyclic algorithms for solving triangular systems on distributed-memory multiprocessors. AGRON [1988]. D. September. [563] O. [582] J. R. ACM Computing Surveys. 9(3). EGECIOGLU. [568] K. EMMEN [1987]. GALLOPOULOS. Tech. Tech. in Vichnevetsky and Stepleman [1920]. EKANADHAM AND ARVIND [1987]. [569] M. pp. Tech. University of Pennsylvania. Comp. A parallel QR decomposition algorithm. 99-114. [1985]. University of Maryland. Uppsala University. Report 189. Parallelisation d'an algorithme de gradient conjugue preconditionne. IBM. ROMINE [555] T. STEVENS [1984]. [566] S. EL TARAZI [1985]. pp. Linkoping University. pp. ELDEN [1987]. Some convergence results for asynchronous algorithms. EBERLEIN [1987]. Multiprocessor organization: A survey. Quelques progress en calcul parallele et vectoriel. Inf. Tech. Performance of • second generation hypercube. A. Iterative methods for systems of first order differential equations. 325-340. Report.. Report 671. in Heath [860]. Statist. [570] M. Power system simulation on a multiprocessor. AND S. C-36. 
in Schultz [1752]. Tech.. HEATH. Summer Meeting. [556] I. ELMAN [1986]. 502-509.. ELMAN AND E. SIAM J. 13. FRABOUL. 790-796. Supercomputer Applications. DUGAN.. C. SIMPLE: PART I — An exercise in future scientific programming. [579] M. KOC [1987].. [578] P. Soc. Ordering techniques for the preconditioned conjugate gradient method on parallel computers. Tech. ORTEGA. Yorktown Heights. Statist. DUNIGAN [1988]. EHRLICH [1986]. JALBY. H. Report TM-85945. pp. VOIGT AND C. [573] H. EASTWOOD AND C. On the Schur decomposition of a matrix for parallel computation. ERHEL. The numerical Schwartz alternating procedure and SOR. HENKEL. [567] S. C.. Vector processing.. [571] L. 9. DURHAM. University of Illinois at Urbana-Champaign. EREEGOVAC AND T. 167-174. in Fernbach [630]. Comput. G. [560] P. JESSHOPE [1977]. Parallelism in finite element computations. Presented at the IBM Symposium on Vector Computers and Scientific Com- . Department of Scientific Computing. EBERLEIN [1987]. Trends in elliptic problem solvers. pp. AND J. THOMASETT [1983]. JONES. IEEE Trans. 5. LANG [1986]. Department of Computer Science. 4-6. ERHEL [1983]. Math. MD. pp. University of Maryland. NASA Ames Research Center.146 J. Tech. MAUCHLY. pp. AND P. H. Amsterdam. [559] P. Applied Mathematics Panel Report 171. Report TR-88-53. 233-239. ROMINE [1988]. The solution of elliptic partial differential equations using number theoretical transforms with applications to narrow or computer hardware. NY. ed. December. [572] G. [574] H. [576] P. Department of Computer Science. W. 8. Comm. 989-993. 39. Center for Numerical Analysis. J. 10. M. M. pp. EBERHARDT. A. ETA-10: A "poor man's" supercomputer for 1 million dollars. Coll. [557] J. September. On using the Jacobi method on the hypereube. TALUKDAR [1979]. AND F. pp. ECKERT JR. INRIA. On one-sided Jacobi methods for parallel computation. AND C. ENSELME. ElSENSTAT. sur des Methodes de Calcul Scientifique et Technique. pp. 
Parallel Hermite interpolation: An algebraic approach. April. & Math. pp. SIAM J. University of Texas at Austin. EISENSTAT AND M.

6. AND R. Report LA-UR-87-3635. [586] C. Center for Advanced Computation. Parallel Processing Systems. Block relaxation strategies. [590] D. University of Illinois at Urbana-Champaign. J. 111-130. Parallel Computing. [592] D. in Feilmeier et al. [593] D. J. EVANS [1982]. ERLEBACHER. EVANS AND A. J. 313-324. J. 7. 10. EVANS AND A. [602] D. 151-62. A modification of the Quadrant Interlocking Factorisation parallel method. Parallel S. on Sys. pp. Parallel iterative methods for solving linear equations. Math. Optical processing of banded matrix algorithms using outer product concepts. pp. Math. D UNBAR [1983]. S. EVANS AND M. [591] D. pp. Parallel solution to certain banded symmetric and centro-symmetric systems by using the Quadrant Interlocking Factorization method. ed. NASA Langley Research Center. EWCKSEN AND R. 33—48. Whiteman. Int. AND M. ETHRIDGB. Int. LUK. pp. Iterative and direct method* for tolving Poisson's equation and their adaptability to JLLIAC IV. [605] L. [587] D. in Evans [589]. HATZOPOLOUS [1979]. EVANS [1984]. Tech. [1982]. ACM Trans. Simul. [597] D.. HUSSAINI [1987]. [598] D. Comput. 23. pp. VA. EWERBRING. Tech. 227-38. and Testing Symposium. BOKHARI. pp. [601] D. Math. HADJIDIMOS [1980].F. pp. 119—126. Tech. JIANPING. MOORE. 3-18.. Rome. [608] V. Parallel numerical algorithms for linear systems. 4. [609] E. Los Alamos National Laboratory. HADJIDIMOS. 11. EVANS. J. 187-195. 357384. [603] D. EVANS AND R. 139-152. Parallel Computing. 9. pp. A. 201-204. Academic Press. [607] V.. 448-58. [610] V.D. IEEE Trans. 21st Annual Hawaii International Conf. AND V. S. [623]. J. EVANS AND R. Control. Kibernetica. Parallel Computing. in Feilmeier et al. Int. [588] D. New York. pp. [596] D. 247-284. EVANS AND S. OKOLIE [1981]. Comput. C-32. 3-56. Comput. 28-40. EVANS.. [606] V. pp. WILHELMSON [1976]. Los Alamos National Laboratory. Math. Parallel Computing. pp... Sci. Comp. 6.. Parallel Computing. pp. FADEEV [1977]. Report 87-41. pp. 
An efficient parallel algorithm for the simulation of three-dimensional compressible transition on a 20 processor Flex/32 multicomputer. EVANS AND M.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 147 puting. [594] D. Implementation of the conjugate gradient and Lanczos algorithms for large sparse banded matrices on the ICL DAP. Parallel computations in linear algebra. . in Schultz [1752]. pp. FADEEVA AND D. HADJIDIMOS [1981]. LISHAN [1988]. pp. [600] D. Proc. [589] D. J. Global communication algorithms for hypercubes and other Cayley coset graphs.R. Comput. AND A. [585] G. The solution of linear systems by the QIF algorithm on a wavefront array processor. 149-166. 61-69. iterative methods. MAGARITIS [1988]. 271-275. A parallel linear systems solver. RUTTENBERT [1988]. NOUTSOS [1981]. EVANS [1984]. The parallel solution of banded linear equations by the new Quadrant Interlocking Factorisation (Q. BARLOW [1984]. New parallel algorithms in linear algebra. FADDEN [1980]. New parallel algorithms for partial differential equations. Cambridge University Press. SVD computation on the Connection Machine. 6. J. pp. Int.O. Math. [604] D. EWCKSEN [1972]. Experimental parallel microprocessor system. 2. BEKAKOS [1988]. Math. [583] J. Tech. The AD-10: A digital computer approach to time critical simulation. Los Alamos National Laboratory. Softw. [595] D. ICASE. EVANS. E. EVANS [1979]. On the numerical solution of sparse systems of finite element equations. EVANS AND G. [584] J.. SOJOODI-HAGHIGHI [1982]. Construction of extrapolation tables by systolic arrays for solving ordinary differential equations.) method. The parallel solution of triangular systems of equations. Comput. EVANS [1983]. F. Bulletin de la Direction des Etudes et des Recherches.F. [623]. 8. Int. pp. Mafelap 1978 Conference Proceedings. EVANS AND K. Tech. AND D. pp. pp. pp. 180-187. 143-151. FABER [1981]. 4th Power Plant Dynamics. 7. ed. SHANEHCHI. Hampton. Report LA-UR-87-3136.. 
A recursive decoupling algorithm for solving banded linear systems. Implementation of a convective problem requiring auxiliary storage. MEGSON [1987].

FEILMEIER. [619] R. RONSCH [1982]. Multiprocessors in Computational Mechanics. FARHAT AND E. Your favorite parallel algorithm may not be as fast as you think. AND C. Purdue University. 285-300. [624] M. FARMWALD [1984]. FEO [1988]. 113-128. SNYDER [1983]. VA. FEILMEIER AND W. 7. ICASE. Comput. Pringle: A parallel processor to emulate chip computers. Tech. 12-27. Matrix computation on processors in one.. [631] W. 745-751. [620] G. [623] M. 1987 Int. University of California at Berkeley. pp. FATOOHI AND C. PETERSON. LUK. NorthHolland. Computer. G. Eng. Meth. AND L. pp. IEEE Trans. 77-92. pp. IEEE. [1986]. 35-43. ed. Department of Civil Engineering. Systolic array computation of the singular value decomposition. D'ARCY [1984]. IEEE Trans. R. Numer. [640] S. Computers. FARHAT AND E. POTTLE [1982]. Cornput. TROTTENBERG [1981]. AND D. 145-155. FlSHER [1985]. [612] C. Tech. 405-413. Solution of finite element systems on concurrent processing computers. Tech. Implementation and analysis of a Navier-Stokes algorithm on parallel computers. Appl. . JORDAN [1977]. Department of Computer Science. PARKINSON [1977]. FERNANDEZ AND B. in Jesshope and Hockney [976]. 76[632] J. [633] A. FINN. 147-165. FELIPPA [1981]. FENG [1981]. Report 87-34. Sci. J. An analysis of the computational and parallel complexity of the Livermore loops. eds. pp. [638] M. NASA Langley Research Center. 72. M. North-Holland. FEILMEIER. SCHENDEL. Proc. pp. Comm. IEEE. FOERSTER. pp. Appl. 424-426. A. 112. University of Maryland.. pp. Some computer organizations and their effectiveness. Comput. [639] H. 13. Computers and Structures. [622] M. KAPAUAN. Eng. Comput. H. 37. WILSON [1987]. A survey of interconnection networks. G. PENUMALLI. 285-338. Proc. BUSSEL [1973]. VOIGT AND C. 2. August. GROSCH [1988]. 319-326. [613] C. 1901-1909. pp. D. [635] D. ed. Department of Computer Science. [1977]. K. ORTEGA. Parallel Computers — Parallel Mathematics. Flex/Si and CRAY-t. [1133]. Proc. NorthHoLland. [625] M. [641] K. 
FATOOHI AND G. Par. 211-214. New York. Solving the Cauchy-Riemann equations on parallel computers. pp. 54. ICASE. NASA Langley Research Center. FARHAT [1986]. PhD dissertation. GROSCH [1987]. Report CSD-TR-433. GROSCH [1987]. Concurrent iterative solution of large finite element systems. REDDAWAY. eds. FATOOffl AND C. Bounds on the number of processors and time for multiprocessor optimal schedules. pp. FICHTNER. [1984]. Parallel Computing 83: Proceedings of the International Conference on Parallel Computing. STEUBEN. AND J. KASCIC [1986]. FEILMEIER. R. The ILLIAC IV. ROMINE [611] C. WILSON [1987]. HUNT. G.148 J. AND U. [636] P. New York. 335-348. L.. JOUBERT. Parallel Computing 85: Proceedings of the International Conference on Parallel Computing. AND U. Efficient high speed computing with the distributed array processor. pp. 76. F. pp. [621] M. Parallel nonlinear algorithms. Proc. Proceedings of the IMACS Symposium. Computations. W. Report 88-5. [630] S. WILSON [1987]. [615] P. GROSCH [1987]. Numer. FARHAT AND E. NAGEL. pp. Proc. Architecture of a distributed analysis network for computational mechanics. Report LA-6774. [629] E. pp. Implementation of a four color cell relaxation scheme on the MPP. 239-243. in Schultz [1752]. A marching method for solving Poisson's equation on the ETA-10. JOUBERT. Hampton. 948-960. C-22. Comp. Tech. FIELD. [626] C. Conf. Very high speed computing systems.. 175-193. FLYNN [1966]. [618] R. [617] R. [637] M. pp. SPIE Symposium. LOB Alamos National Laboratory. FERNBACH. Vol. FLANDERS. FATOOHI AND C. [614] C. in Kuck et al. SCHENDEL. 3. Nonstandard multigrid techniques using checkered relaxation and intermediate grids. Modal superposition dynamic analysis on concurrent multiprocessors. 2. C-21. The impact of supercomputers on 1C technology development and design. Phys. pp. in Evans [589]. Comm. pp. Amsterdam. pp.. FlSHER [1988].. IEEE Trans. Comm. 341 (Real Time Processing V). pp. pp.. in Kowalik [1116]. FBIBRBACH AND D. FLYNN [1972]. 
Supercomputers. S. FONG AND T. 2. FEILMEIER [1982]. [634] D. [1986]. FOLLIN AND M. Report 1556. The S-l Mark IIA supercomputer. Parallel Computing. 163-186. two and three dimensions. Some linear algebraic algorithms and their performance on the CRAY-1. Tech. STEVENSON [1979]. [627] T.. [616] R. North-Holland. [628] J. Parallel numerical algorithms. 14(12). AND U. Meth. Implementation of an ADI method on parallel computers.

[656] F. pp. FORNBERG [1983]. pp. Concurrent processing for scientific calculations. MCBRYAN [1983]. Conf. 358-363. Concurrent computation and the theory of complex systems. pp. Physics Today. 66(2). AND I. FOULSER AND R. [1988]. IEEE Trans. 413-470. Conf. [668] P. [657] J. 13th IEEE Computer Soc. in McCormick [1312]. pp.. A parallel nonlinear Jacobi algorithm for solving nonlinear equations. March. J. FRANCIONI AND J. in Heath [860]. FREDERICKSON AND O.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 149 [642] R. Multi-microprocessors: An overview and working example. IEEE Trans. Vectorized incomplete Cholesky conjugate gradient (ICCG) package for the CRAY-1 computer. Matrix multiplication. [649] G. 3. [659] M.. University of Maryland. Square matrix decomposition — Symmetric. Conf. 50-59.. J. (To be published).. FONTECILLA [1987]. [660] P. 223-238. in Gary [700]. GAJSKI [1979]. JONES. California Institute of Technology. SAMEH [1983]. pp. AND A. The implementation of a dynamic load balancer. [666] R. D. G. Department of Computer Science. [670] D. Report ADA050135. OLEINICK [1976]. KOWALA. 35-43. CA. CalTech Publication Hm-97. Algorithms for concurrent processors. 102. 179-196. SUNDHU. pp. Proc. The impact of parallel computing on finite element computations. R. pp. [665] S. pp. A parallel Monte Carlo transport algorithm using a psuedo-random tree to guarantee reproducibility. AND R. Cornput. Tech. Tech. Content Addressable Parallel Processors. Par. [653] G. Tridiagonal factorizations of Fourier matrices and application to parallel computations of discrete Fourier transforms. 190-206. pp. in Heath [858].. IEEE Comp. CMU Cm* review. D. NJ. WILLIAMS [1987]. Steady viscous flow past a circular cylinder. [643] B. An implementation of a 2d-section root finding method for the EPS T-series hypercube. Reliability of Methods for Engineering Analysis. Proc. A. Fox. [658] M. . Volume I: General Techniques and Regular Problems. Proc. DURHAM [1980]. pp. 
Cedar — A large scale multiprocessor. Parallel Computing. A. 216228. Proc. 4. Parallel solution of ordinary differential equations. Comput. Laser Program Annual Report UCRL500021-81. C-25. 201-224. Pasadena. 149-163. DHAR [1986]. RUBINFELD. [654] G. Solving Problems on Concurrent Processors. M. local. [664] S. FUNDERLIC AND A. in Heath [860]. pp. & Appl. [669] D. FRANKLIN [1978]. Carnegie-Mellon University. An algorithm for solving linear recurrence systems on parallel and pipelined machines. FULLER. pp. Proc. KERSHAW [1982]. Alg. pp. Matrix algorithms on a hypercube I. FOX AND S. Conf. OTTO. [647] G. [652] G. pp. pp. pp. van Nostrand Reinhold. Fox AND S. FOSTER [1976]. scattered. [667] R. FULTON [1986]. Dist. P. pp.. COMPCON 84. 189-191. 495-500. HiROMOTO. 17-32. S. 70-73. Torus data flow for parallel computation of missized matrix problems. pp. Resolution of a general partial differential equation on a fixed size SIMD/MIMD large cellular processor. AND J. Proc. Swansea. Englewood Cliffs. JACKSON [1987]. 1983 Int. OTTO [1986]. 195-210. AND A. L. & Appl. The Saxpy Matrix-1: A general-purpose systolic computer. [645] C. FRAILONG AND J. [662] A.. 1979 Int. 114-121. FULLER AND P. Int. Parallel superconvergent multigrid. RASKIN. The Caltech concurrent computation program... FREDERICKSON. FOX [1987]. A vector implementation of the fast Fourier transform algorithm. 352-372. 77. Solving banded triangular systems on pipelined machines. Inc. Sorente. Comp. IEEE. OTTO. Communication algorithms for regular convolutions and matrix problems on the hypercube. eds. 37(5). P. [671] D. LARSON [1987]. Pineridge Press. FOX [1984]. Fox [1985]. FRANKLIN AND S. Lin. [663] S. pp. Alg. Sci. Math. C-30. Par. OUSTERBOUT. GEIST [1986]. JOHNSON. 308-319. Proceedings of the IMACS International Congress. SALMON. pp. Fox. 524-529. Fox AND W. S. Lin. FRIEDMAN AND D. LYZBNGA. SWAN [1978]. 244-268. 36. GADER [1988]. 281-290. FORNBERG [1981]. [655] G. PAKLEZA [1979]. Prentice-Hall. 
Department of Computer Science. pp. [648] G. Proc. KUCK. 20(7). SCHREIBER [1987]. Parallel Computing. 169-210. [650] G. in Heath [860]. [644] B. [651] G. FOX. [646] D. Initial measurements of parallel programs on a multiminiprocessor. 4. /nlerconnecfton networks: Physical design and performance analysis. AND R. Par. GAJSKI [1981]. 353-381. FURMANSKI [1987]. GAJSKI. LAWRIE. Comp. in Heath [860]. FULLER. Report 1807. Computer.. OTTO [1984]. [661] P. pp. HEY [1987]. AND J. pp. Lawrence Livermore National Laboratory.

3.. PAULI [1983]. [684] D. G. 3. Athens. [681] E. Center for Supercomputing Research and Development. J.. JALBY [1987]. Par. Res. The influence of memory hierarchy on algorithm organization: Programming FFTs on a vector multiprocessor. in Heath [860]. PANETTA [1986].. Soc. CSRD Report 625. GALLIVAN. [678] E. Conf. Statist. GAJSKI AND J. COMPCON 84. 3. GANNON [1986]. GALLOPOULOS [1985]. in Potter [1584]. PDE. GALLOPOULOS AND Y. On the structure of parallelism in a highly concurrent PDE solver. G. H. 305-321. Proc. GANNON. ZlLLI [1988]. [697] J. Numerical experiments with the Massively Parallel Processor. 252-259. VOIGT AND C. Int. AND J. W. pp. Programming substructure computations for elliptic problems on the CHiP system. WISNIENSKI [1982]. pp. J. ORTEGA. pp. Par. Proc. 1079-1084. Tech. C-33. Proc. Fluid dynamics modeling. GARDINER AND A. IL. Par. AND U. Par. VAN ROSENDALE [1984]. England. Dist. May.. GAO [1987]. GAO [1986]. GAJSKI. [696] Q. University of Illinois at Urbana-Champaign. pp. GAO [1986]. 1982 Int. KUCK. pp. Comp. 286-299. GANNON [1981]. 106-135. [679] E. IEEE Comp. Parallel Computing. R. Comput. GAMBOL ATI. Numer. Vector computer for sparse matrix operations. ed. Restructuring SIMPLE for the CHiP architecture. Center for Supercomputing Research and Development. M. JOHNSON [1978]. [677] K. [676] K. Oper.-K. LAUB [1987]. L. 1980 Int. 26. R. Sci. [688] D. Conf. [689] D. Conf. [694] G. Conf. Greece. block structured grids. pp. Tech. pp. [695] G. To appear in Proc. Iterative algorithmt for iridiagonal matrices on a WSI-multiprocessor. 93-104. pp. 85-103. Oxford. 306-309. Kai. 4. Proc. pp. LAWRIE. SAMEH [1987]. 30. ACM. GANNON AND J. Proc. pp. [687] D.. 1981 Int. Par. pp. pp. 3-21. pp. Proc. 82-89. A pipelined solution method of tridiagonal linear equation systems. SAMEH [1984]. 87-89.150 J. Proc. On the impact of communication complexity in the design of parallel numerical algorithms. Computer. Proc. 
Restructuring nested loops on the Alliant Cedar cluster: A case study of Gaussian elimination of banded matrices. D. ROMINE [673] D. 4. GANNON AND W. IEEE Trans. SNYDER. Urbana. 1986 Int.-Q. 1983 Int. [685] D. [691] D. GANNON AND J. GALLWAN. pp. JALBY. A stability classification method and its application to pipelined solution of linear recurrences. [693] G. 305-326. 18(6). Proc. University of Illinois at Urbana-Champaign.. Proc.. Comparison of preconditionings for large sparse finite element problems. Cedar. Conf. GAO AND R. pp.-S. SLAM J. Dist. Supercomputing. Essential issues in multiprocessor systems. H. [698] M. GAJSKI. [683] D. Meth. A. Parallel Computing. pp. SAAD [1987]. AND D. A parallel block cyclic reduction algorithm for the fast solution of elliptic equations... GALIL AND W. Comput. D. 1180-1194. PlNI. pp. April. U. W. University of Illinois at Urbana-Champaign. 29-35. 100-105. Proc.. . 8. pp. Par. GRAHAM. [680] E. Comp. 9-27. Par. AND J. 65-80. GANNON [1980]. February. WANG [1983]. J. MEIER. PBIR [1985]. MEIER [1987]. Conf. Report 663. GALLOPOULOS AND S. VAN ROSENDALE [1984]. Conf. Performance guarantees for scheduling algorithms. in Noor [1450]. [672] D. [690] D. 512-519. Implementation of two control system design algorithms on a message-passing hypercube. Vector and Parallel Processors in Computational Science II Conference. GANNON AND J. Conf. pp. The use of BLASS in linear algebra on a parallel processor with a hierarchical memory. [675] Z. SAMEH. [686] D. Par. pp. VAN ROSENDALE [1983]. An efficient general-purpose parallel computer. Center for Supercomputing Research and Development. Proc. [682] G. On the structure of parallelism in a highly concurrent PDE solver. pp. 139-157. CAREY. A maximally pipelined tridiagonal linear equation solver. A note on pipelining a mesh connected multiprocessor for finite element problems by nested dissection. The impact of hierarchical memory systems on linear algebra algorithm design. 197-204. McEWAN [1983]. 
Parallel architectures for iterative methods on adaptive. Proc. Center for Super-computing Research and Development. The Massively Parallel Processor for problems in fluid dynamics. Report 659. On mapping non-uniform PDE structures and algorithms onto uniform array architectures. AND A.. GALLOPOULOS [1984]. GANNON [1985]. GANNON AND J. Report 543. 84-91. AND G. JALBY. [674] D.. VAN ROSENDALE [1986]. AND A. pp. 215-235. Proceedings of the 7th Symposium on Computer Arithmetic.. [692] D. Tech. University of Illinois at Urbana-Champaign. Proc. 1983 Int. in Birkhoff and Schoenstadt [173].

3. [714] A. Italy. GARY. The Cm* testbed. Comput. in Kuck et al. & Appl. Bedford. G. HEATH [1985]. [709] A. 639-649. Appl. [1133]. pp. pp. GAUTZSCH. 1157-1165. Hypercube Concurrent Comput. Efficient parallel LU factorization with pivoting on a hypercv. NG [1987]. Tech. GEIST [1985]. Tech. ROMINE [1989]. pp. 329-354. in Heath [858]. C-31.. 1577-1582. Matrix triangularization by systolic arrays. 59-66. eds. pp. [701] J. Design of numerical algorithms for parallel processing. Englewood Cliffs. J.. eds. Fox. G. MA. Oak Ridge National Laboratory. ed. Douglass. Paris. Lin. Finding eigenvalues and eigenvectors of unsymmetric matrices using a hypercube multiprocessor. McCormick and U.. GEIST AND E. Third Conf. pp. S. Ada Concurrent Programming. D. NASA-CP-S295. pp. Math. Some complexity results for matrix computations on parallel processor*. 198S. WARD. Jamieson. GARY. pp. GARY [1977]. GENTLEMAN [1975]. ch. Report ORNL-6190.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 151 [699] J. R. Gannon. MIT Press. DAVIS. 25.be multiprocessor. CYBER 200 Application* Seminar. German Research and Testing Establishment for Aerospace. pp. ed. HEATH [1986]. [723] W. Successive overrclaxation. [704] W. JONES. McCORMICK. D. Comput. AND D. 112-115. Proc.. [716] A. Conf. GENTLEMAN [1978]. Report. GEIST AND M. AND Z. GEIST [1987]. Parallel Cholesky factorization on a hypercube multiprocessor. Tech. [713] A. August.. SIEWIOREK. New York.. LU factorization algorithms on distributed-memory multiprocessor architectures. GENTZSCH [1984]. Oak Ridge National Laboratory. Oak Ridge National Laboratory. S EG ALL [1987]. Parallel algorithms for matrix computations. GENTLEMAN AND H. NJ. 9(4). Report ORNL/TM-10938. GEHRINGER. [1984]. LICHNEWSKY. AND E. SEGALL [1982]. [724] W. AND R. October.. 13(3-4). in Kowalik [1116]. & Comp. G. University of Illinois at UrbanaChampaign. Presented at the Parallel Processing Conference at Bargains. Parallel Processing: The Cm* Experience. 
Vector and Parallel Methods in Scientific Computing. 161-180. Tech. Computer. Alg. GEIST AND G. AND Z. October. Analysis of application* program* and to f t ware requirements for high tpeed computer*. Computer. February. [712] A. HUNG [1981]. Inc. IEEE Trans. STAPHYLOPATIS [1982]. 40-53. Solving finite element problem* with parallel multifrontal schemes. Proc. Digital Equipment Corp. pp. FUNDERLIC [1988]. CO. AND R. GEIST. [715] A. Parallel Processing for Scientific Computing. Tech. [721] W. VBCRIEST [1987]. Department of Computer Science. Association for Computing Machinery. 15(10). pp. Oak Ridge National Laboratory. SWEET [1983]. R. pp. Proceeding* of seminar held at NASA Goddard Space Flight Center. [706] D. [719] W. [700] J. 233-251. Trottenberg. Error analysis of the QR decomposition by Givens transformations. and L. S. WEILAND. GENTZSCH [1983]. Report R-86-1246. GEIST AND M. [708] A. GEHANI [1984]. [717] A. 285-310. Society for Industrial and Applied Mathematics. [702] M.. pp. pp. October. [705] N. multigrid. [710] A. GEHRINGER. AND A. 189-197. Prentice-Hall. 10. [711] A. in Heath [860]. GELENBE. [720] W. Digital Press. Appl.). Benchmark results on physical flow problems. Philadelphia. A partitioning strategy for parallel sparse Cholesky factorization. Sci. Cambridge. Finding eigenvalues and eigenvectors of unsymmetric matrices using a hypercube multiprocessor. Experience with the parallel solution of partial differential equations on a distributed computing system. Proc. 19-26. SPIE 298. LU factorization on distributed-memory multiprocessors. Proceedings of the First Copper Mountain Conference on Multigrid Methods. Matrix triangularization u»ing array* of integrated optical Given* rotation device*. GEAR [1986]. Pottibilitie* and problem* with the application of vector computer*. M. GENTLEMAN [1981]. Copper Mountain. Report ORNL-6211. 20(12). MULLER-RICHARDS [1980]. Real-time Signal Processing IV. [707] E. GEIST. A. . SIAM J. Statist. A. 
The Characteristics of Parallel Algorithms. GEIST AND C. How to maintain the efficiency of highly serial algorithms involving recursions on vector computers. 15-18. ROMINE [1988]. [718] E. NG [1988]. HEATH. [722] W. GAYLORD AND E. GEIST AND C.. DAVIS [1988]. Matrix factorization on a hypercube multiprocessor. 656-661. Report ORNL/TM-10937. and preconditioned conjugate gradient* algorithm* for tolving a diffusion problem on a vector computer. The potential for parallelitm in ordinary differential equation*. September. (Special Issue. Tech. [703] T. pp. ACM.

pp. 211-228.
[725] W. GENTZSCH [1984]. Numerical algorithms in computational fluid dynamics on vector computers, Parallel Computing, 1, pp. 19-33.
[726] W. GENTZSCH [1984]. Vectorization of Computer Programs with Applications to Computational Fluid Dynamics, Heyden & Son, Philadelphia, PA.
[727] W. GENTZSCH [1987]. A fully vectorizable SOR variant, Parallel Computing, 4, pp. 349-354.
[728] W. GENTZSCH AND G. SCHAPER [1984]. Solution of large linear systems on vector computers, in Feilmeier et al. [623], pp. 159-166.
[729] A. GENZ AND D. SWAYNE [1984]. Parallel implementation of ALOD methods for partial differential equations, in Feilmeier et al. [623], pp. 167-172.
[730] A. GEORGE, M. HEATH, AND J. LIU [1986]. Parallel Cholesky factorization on a shared memory multiprocessor, Lin. Alg. & Appl., 77, pp. 165-187.
[731] A. GEORGE, M. HEATH, J. LIU, AND E. NG [1986]. Solution of sparse positive definite systems on a shared memory multiprocessor, Int. J. Par. Prog., 15, pp. 309-325.
[732] A. GEORGE, M. HEATH, J. LIU, AND E. NG [1988]. Solution of sparse positive definite systems on a hypercube, Tech. Report ORNL/TM-10865, Oak Ridge National Laboratory, October. (Submitted to J. Comput. Appl. Math.).
[733] A. GEORGE, M. HEATH, J. LIU, AND E. NG [1988]. Sparse Cholesky factorization on a local memory multiprocessor, SIAM J. Sci. Statist. Comput., 9, pp. 327-340.
[734] A. GEORGE, M. HEATH, E. NG, AND J. LIU [1987]. Symbolic Cholesky factorization on a local-memory multiprocessor, Parallel Computing, 5, pp. 85-96.
[735] A. GEORGE, J. LIU, AND E. NG [1987]. Communication reduction in parallel sparse Cholesky factorization on a hypercube, in Heath [860], pp. 576-586.
[736] A. GEORGE, J. LIU, AND E. NG [1988]. Communication results for parallel sparse Cholesky factorization on a hypercube. Submitted to Parallel Computing.
[737] A. GEORGE AND E. NG [1988]. Parallel sparse Gaussian elimination with partial pivoting, Tech. Report ORNL/TM-10866, Oak Ridge National Laboratory. (To appear in Annals of Operations Research).
[738] A. GEORGE AND E. NG [1988]. Some shared memory is desirable in parallel sparse matrix computations, SIGNUM Newsletter, 23(2), pp. 9-13.
[739] A. GEORGE, W. POOLE, AND R. VOIGT [1978]. Analysis of dissection algorithms for vector computers, Comput. Math. Appl., 4, pp. 287-304.
[740] A. GEORGE, W. POOLE, AND R. VOIGT [1978]. A variant of nested dissection for solving n by n grid problems, SIAM J. Numer. Anal., 15, pp. 662-673.
[741] A. GERASOULIS, N. MISSIRLIS, I. NELKEN, AND R. PESKIN [1988]. Implementing Gauss Jordan on a hypercube multicomputer, Proc. 3rd Conf. on Hypercube Multiprocessors.
[742] I. GERTNER AND M. SHAMASH [1987]. VLSI architectures for multidimensional Fourier transform processing, IEEE Trans. Comput., C-36, pp. 1265-1274.
[743] A. GHOSH [1987]. Realization of conjugate gradient algorithm on optical linear algebra processors, Applied Optics, 26(2), pp. 611-613.
[744] H. GIETL [1987]. The conjugate gradient method with vectorized preconditioning on the Siemens VP-200 vector processor, Supercomputer, 19, pp. 43-51.
[745] E. GILBERT [1958]. Gray codes and paths on the n-cube, Bell System Tech. J., 37, pp. 815-826.
[746] E. GILBERT [1982]. Algorithm partitioning tools for a high-performance multiprocessor, Tech. Report UCRL-53401, Lawrence Livermore National Laboratory, Livermore, CA, December.
[747] J. GILBERT [1988]. An efficient parallel sparse partial pivoting algorithm, Tech. Report CMI 88/45052-1, Dept. of Science and Technology, Chr. Michelsen Institute, August.
[748] J. GILBERT AND H. HAFSTEINSSON [1986]. A parallel algorithm for finding fill in a sparse symmetric matrix, Tech. Report TR 86-789, Department of Computer Science, Cornell University.
[749] J. GILBERT AND E. ZMIJEWSKI [1987]. A parallel graph partitioning algorithm for a message-passing multiprocessor, Tech. Report TR 87-803, Department of Computer Science, Cornell University.
[750] D. GILL AND E. TADMOR [1988]. An O(N^2) method for computing the eigensystem of N x N symmetric tridiagonal matrices by the divide and conquer approach, Tech. Report 88-19, ICASE, NASA Langley Research Center, Hampton, VA.
[751] S. GILL [1958]. Parallel programming, Comput. J., 1, pp. 2-10.
[752] P. GILMORE [1968]. Structuring of parallel algorithms, J. ACM, 15, pp. 176-192.
[753] P. GILMORE [1971]. Numerical solution of partial differential equations by associative processing, Proc. 1971 FJCC, AFIPS Press, Montvale, NJ, pp. 411-418.
[754] P. GILMORE [1971]. Parallel relocation, Tech. Report, Goodyear Aerospace Corporation, Akron, OH.
[755] R. GINOSAR AND D. HILL [1985]. Design and implementation of switching systems for parallel processors, Proc. 1985 Int. Conf. Par. Proc., pp. 674-680.
[756] M. GINSBURG [1982]. Some observations on supercomputer computational environments, Proc. 10th IMACS World Congress on Systems Simulation and Scientific Computation, vol. 1, IMACS, pp. 297-301.
[757] E. GIROUX [1977]. A large mathematical model implementation on the STAR-100 computer, in Kuck et al. [1133], pp. 287-298.
[758] B. GLICKFELD AND R. OVERBEEK [1985]. Quasi-automatic parallelization: A simplified approach to multiprocessing, Tech. Report ANL-85-70, Argonne National Laboratory, Argonne, IL.
[759] J. GLOUDEMAN [1984]. The anticipated impact of supercomputers on finite element analysis, Proc. IEEE, 72, pp. 80-84.
[760] J. GLOUDEMAN, C. HENNRICH, AND J. HODGE [1984]. The evolution of MSC/NASTRAN and the supercomputer for enhanced performance, in Kowalik [1116], pp. 393-402.
[761] J. GLOUDEMAN AND J. HODGE [1982]. The adaption of MSC/NASTRAN to a supercomputer, Proc. 10th IMACS World Congress on Systems Simulation and Scientific Computation, vol. 1, IMACS, pp. 302-304.
[762] R. GLOWINSKI, G. GOLUB, G. MEURANT, AND J. PERIAUX, eds. [1988]. Proceedings of the First International Symposium on Domain Decomposition Methods for Partial Differential Equations, Society for Industrial and Applied Mathematics, Philadelphia, PA.
[763] R. GLOWINSKI AND M. WHEELER [1988]. Domain decomposition and mixed finite element methods for elliptic problems, in Glowinski et al. [762], pp. 144-172.
[764] P. GNOFFO [1982]. A vectorized, finite-volume, adaptive-grid algorithm for Navier-Stokes calculations, Numerical Grid Generation, J. Thompson, ed., Elsevier Science Publishing Corp.
[765] I. GOHBERG, T. KAILATH, I. KOLTRACHT, AND P. LANCASTER [1987]. Linear complexity parallel algorithms for linear systems of equations with recursive structure, Lin. Alg. & Appl., 88, pp. 271-316.
[766] R. GOKE AND G. LIPOVSKI [1973]. Banyan networks for partitioning on multiprocessor systems, Proc. 1st Ann. Symp. Computer Arch., pp. 21-30.
[767] M. GOLDMANN [1988]. Vectorization of the multiple shooting method for the nonlinear boundary value problem in ordinary differential equations, Parallel Computing, 7, pp. 97-110.
[768] G. GOLUB AND D. MAYERS [1983]. The use of preconditioning over irregular regions, Proc. 6th Int. Conf. Computing Methods in Science and Engineering, Versailles, France.
[769] G. GOLUB, R. PLEMMONS, AND A. SAMEH [1986]. Parallel block schemes for large scale least squares computations, Tech. Report 574, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, April.
[770] G. GOLUB AND C. VAN LOAN [1989]. Matrix Computations, The Johns Hopkins University Press, Baltimore. (In press).
[771] M. GONZALEZ [1977]. Deterministic processor scheduling, ACM Computing Surveys, 9, pp. 173-204.
[772] R. GONZALEZ [1986]. Domain Decomposition for Two-Dimensional Elliptic Operators on Vector and Parallel Machines, PhD dissertation, Rice University.
[773] R. GONZALEZ AND M. WHEELER [1987]. Domain decomposition for elliptic partial differential equations with Neumann boundary conditions, Parallel Computing, 5, pp. 257-263.
[774] GOODYEAR AEROSPACE CORP. [1974]. Application of STARAN to fast Fourier transforms, Tech. Report GER-16109, Goodyear Aerospace Corp., May.
[775] K. GOSTELOW AND R. THOMAS [1980]. Performance of a simulated dataflow computer, IEEE Trans. Comput., C-29, pp. 905-919.
[776] A. GOTTLIEB [1984]. Avoiding serial bottlenecks in ultraparallel MIMD computers, Proc. COMPCON 84, IEEE Comp. Soc. Conf., pp. 354-359.
[777] A. GOTTLIEB, R. GRISHMAN, C. KRUSKAL, K. MCAULIFFE, L. RUDOLPH, AND M. SNIR [1983]. The NYU Ultracomputer — Designing an MIMD shared memory parallel computer, IEEE Trans. Comput., C-32, pp. 175-189.
[778] A. GOTTLIEB, B. LUBACHEVSKY, AND L. RUDOLPH [1983]. Basic techniques for the efficient coordination of very large numbers of cooperating sequential processors, ACM Trans. Program. Lang. Syst., 5, pp. 164-189.
[779] A. GOTTLIEB AND J. SCHWARTZ [1982]. Networks and algorithms for very-large-scale parallel

computation, Computer, 15(1), pp. 27-36.
[780] D. GOTTLIEB AND R. HIRSH [1988]. Parallel pseudospectral domain decomposition techniques, Tech. Report 88-15, ICASE, NASA Langley Research Center, Hampton, VA.
[781] D. GOTTLIEB, M. HUSSAINI, AND S. ORSZAG [1984]. Theory and applications of spectral methods, in Voigt et al. [1925], pp. 1-54.
[782] G. GOUDREAU, R. BAILEY, J. HALLQUIST, R. MURRAY, AND S. SACKETT [1983]. Efficient large-scale finite element computations in a Cray environment, in Noor [1450], pp. 141-154.
[783] W. GRAGG AND L. REICHEL [1987]. A divide and conquer algorithm for the unitary eigenproblem, in Heath [860], pp. 639-650.
[784] M. GRAHAM [1976]. An Array Computer for the Class of Problems Typified by the General Circulation Model of the Atmosphere, PhD dissertation, University of Illinois at Urbana-Champaign, Department of Computer Science.
[785] R. GRAHAM [1969]. Bounds on multiprocessing timing anomalies, SIAM J. Appl. Math., 17, pp. 416-429.
[786] R. GRAHAM, E. LAWLER, J. LENSTRA, AND A. RINNOOY KAN [1979]. Optimization and approximation in deterministic sequencing and scheduling: A survey, Ann. Discrete Math., 5, pp. 169-.
[787] R. GRAVES [1973]. Partial implicitization, J. Comp. Phys., 13, pp. 439-444.
[788] J. GRCAR AND A. SAMEH [1981]. On certain parallel Toeplitz linear system solvers, SIAM J. Sci. Statist. Comput., 2, pp. 238-256.
[789] A. GREENBAUM [1986]. A multigrid method for multiprocessors, Appl. Math. & Comp., 19(1-4), pp. 75-88. (Special Issue, Proceedings of the Second Copper Mountain Conference on Multigrid Methods, Copper Mountain, CO, S. McCormick, ed.).
[790] A. GREENBAUM [1986]. Solving sparse triangular linear systems using Fortran with parallel extensions on the NYU Ultracomputer prototype, Ultracomputer Note 99, New York University, April.
[791] A. GREENBAUM [1986]. Synchronization costs on multiprocessors, Ultracomputer Note 98, New York University, April.
[792] A. GREENBAUM AND G. RODRIGUE [1977]. The incomplete Choleski conjugate gradient method for the STAR (5 point operator), Tech. Report, Lawrence Livermore National Laboratory.
[793] A. GREENBERG, R. LADNER, M. PATERSON, AND Z. GALIL [1982]. Efficient parallel algorithms for linear recurrence computation, Info. Proc. Letters, 15, pp. 31-35.
[794] D. GREENSPAN [1988]. Particle modeling of cavity flow on a vector computer, Comput. Meth. Appl. Mech. Engrg., 66, pp. 291-300.
[795] J. GRIFFIN AND H. WASSERMAN [1985]. Parallel debugging: A preliminary proposal, Tech. Report LA-UR-85-3967, Los Alamos National Laboratory.
[796] R. GRIMES [1988]. Solving systems of large dense linear equations, J. Supercomputing, 1, pp. 291-300.
[797] R. GRIMES AND H. SIMON [1987]. Dynamic analysis with the Lanczos algorithm on the SCS-40, Tech. Report ETA-TR-43, Boeing Computer Services, January.
[798] R. GRIMES AND H. SIMON [1987]. Solution of large dense symmetric generalized eigenvalue problems using secondary storage, Tech. Report ETA-TR-53, Boeing Computer Services, May.
[799] D. GRIT AND J. McGRAW [1983]. Programming divide and conquer on a multiprocessor, Tech. Report UCRL-88710, Lawrence Livermore National Laboratory.
[800] W. GROPP [1986]. Dynamic grid manipulation for PDE's on hypercube parallel processors, Tech. Report YALEU/DCS/RR-458, Department of Computer Science, Yale University, March.
[801] W. GROPP [1987]. Solving PDEs on loosely-coupled parallel processors, Parallel Computing, 5, pp. 165-174.
[802] W. GROPP [1988]. Local uniform mesh refinement on loosely-coupled parallel processors, I, J. Comp. Math. Appl., 15, pp. 375-389.
[803] W. GROPP AND I. IPSEN [1988]. Recursive mesh refinement on hypercubes, Tech. Report RR-616, Department of Computer Science, Yale University.
[804] W. GROPP AND D. KEYES [1988]. Complexity of parallel implementation of domain decomposition techniques for elliptic partial differential equations, SIAM J. Sci. Statist. Comput., 9, pp. 312-327.
[805] W. GROPP AND E. SMITH [1987]. Computational fluid dynamics on parallel processors, Tech. Report YALEU/DCS/RR-570, Department of Computer Science, Yale University.
[806] C. GROSCH [1978]. Poisson solvers on a large array computer, Proc. 1978 LASL Workshop

on Vector and Parallel Processors, pp. 98-132. [807] C. GROSCH [1979]. Performance analytit of Poitton advert on array computers, in Jesshope and Hockney [976], pp. 147-181. [808] C. GROSCH [1979]. Performance analytit of tridiagonal equation solvers on array computer*, Tech. Report TR 79-4, Department of Mathematical and Computing Sciences, Old Dominion University, Norfolk, VA. [809] C. GROSCH [1980], The effect of the data transfer pattern of an array computer on the efficiency of some algorithms for the tridiagonal and Poitton problems. Presented at the Conference on Array Architectures for Computing in the 80's and 90's. [810] C. GROSCH [1987]. Adapting a Navier-Stoket code to the ICL-DAP, SIAM J. Sci. Statist. Comput., 8, pp. s96-sll7. [811] D. GRUNWALD AND D. REED [1987]. Benchmarking hypercube hardware and software, in Heath [860], pp. 169-177. [812] R. GuiLILAND [1981]. Solution of the shallow water equations on the sphere, 3. Comp. Phys., 43, pp. 79-94. [813] A. GUPTA, B. MOSSBERG, G. POPE, AND K. SEPEHRNOORI [1985]. Application of vector processors to chemical enhanced oil recovery simulation, Tech. Report 85-5, Center for Enhanced Oil & Gas Recovery Research, University of Texas at Austin. [814] D. GUPTA, G. POPE, AND K. SEPEHRNOORI [1986]. Application of vector processors to chemical-enhanced oil recovery simulation, Comm. Appl. Numer. Meth., 2, pp. 297303. [815] J. GURD, C. KlRKHAM, AND I. WATSON [1985]. The Manchester prototype dataflow computer, Comm. ACM, 28, pp. 34-52. [816] J. GURD AND I. WATSON [1982]. Preliminary evaluation of a prototype dataflow computer, Proc. IFIP World Computer Congress, North-Holland, pp. 545-551. [817] J. GUSTAFSON [1986]. Subdivision of PDE's on FPS scientific computers, Comm. Appl. Numer. Meth., 2, pp. 305-310. [818] J. GUSTAFSON [1988]. Reevaluating Amdahl's law, Comm. ACM., 31, pp. 532-533. [819] J. GUSTAFSON, S. HAWKINSON, AND K. SCOTT [1986]. The architecture of a homogeneous vector supercomputer, Proc. 1986 Int. 
Conf. Par. Proc., pp. 649-652. [820] J. GUSTAFSON, G. MONTRY, AND R. BENNER [1988], Development of parallel methods for a 1024-processor hypercube, SIAM J. Sci. Statist. Comput., 9, pp. 609-638. [821] J. HACK [1986]. Peak vs. sustained performance in highly concurrent vector machines, Computer, 19(9), pp. 11-19. [822] W. HACKBUSCH [1978]. On the multigrid method applied to difference equations, Computing, 20, pp. 291-306. [823] W. HACKBUSCH AND U. TROTTENBERG, eds. [1982]. Multigrid Methods, Springer-Verlag, Berlin. [824] M. HAFEZ AND D. LOVELL [1983]. Improved relaxation schemes for transonic potential calculations, Paper 83-0372, AIAA. [825] M. HAFEZ AND E. MuRMAN [1978]. Artificial compressibility methods for numerical solution of transonic full potential equation, AIAA llth Fluid and Plasma Dynamics Conference, Seattle, WA. [826] M. HAFEZ AND J. SOUTH [1979]. Vectorization of relaxation methods for solving transonic full potential equations, Flow Research Report, Flow Research, Inc., Kent, WA. [827] B. HAILPERN [1982]. Concurrent processing, Tech. Report RC 9582 (42314), IBM, San Jose, CA, September. [828] L. HALADA [1980]. A parallel algorithm for solving band systems of linear equations, Proc. 1980 Int. Conf. Par. Proc., pp. 159-160. [829] L. HALADA [1981]. A parallel algorithm for solving band systems and matrix inversion, CONPAR 81, Conf. Proc., Lecture Notes in Computer Science III, W. Handler, ed., SpringerVerlag, pp. 433-440. [830] L. HALCOMB AND D. DIESTLER [1986]. Integration of a large set of coupled differential equations on the Cyber 205 vector processor, Comput. Phys. Comm., 39, pp. 27-36. [831] H. HALIN, R. BUHRER, W. HALG, H. BENZ, B. BRON, H. BRUNDIERS, A. ISACCSON, AND M. TADIAN [1980]. The ETHM multiprocessor project: Parallel simulation of continuous system, Simulation, 35, pp. 109-123. [832] S.-P. HAN AND G. Lou [1988]. A parallel algorithm for a class of convex programs, SIAM J. Control Optim., 26, pp. 345-355. [833] W. HANDLER, E. 
HOFMANN, AND H. SCHNEIDER [1976]. A general purpose array with a broad spectrum of applications, Informatik-Fachberichte, Springer-Verlag, Berlin-Heidelberg.

in Numrich [1469]. Int. COLLEY. pp. pp. Comput. HAYES [1985]. Int. [840] D. 653-660. [849] L. Proc. Hypercube Multiprocessors.. Num. [1987]. [1986]. C-36. ROMINE [1988]. Hypercube applications at Oak Ridge National Laboratory. SIEWIOREK. [842] L. pp. WIRL [1985]. Solving almost block diagonal linear equations on the CDC Cyber 205. J. HATZOPOULOS AND D. PA. [860] M. 1986 Int. p. J. Society for Industrial and Applied Mathematics. HEATH [1985]. [843] M. pp. ROMINE [834] W.). & Comp. 204-207. CO. M. E. in Heath [860]. [838] U. 9-24. Comput. J. Parallel solution of triangular systems on distributed- . Vector processors and CFD. 19(1-4). [855] M... S. Inc. pp. 1986. University of Manchester. HATZOPOULOS [1983]. HAYES [1984]. MCCORMICK. HATZOPOULOS [1982]. 603-634.. [857] M. 133-141. The physical mapping problem for parallel architectures. ed. Asynchronous adaptive methods on parallel computers. Master's thesis. HEATH [1987]. The three-dimensional solution of the equations of flow and heat transfer in glass-melting tank furnaces: Adapting to the DAP. HAYES. Vector access performance in parallel memories using a skewed storage scheme. pp. THOMAS [1986]. GLADWELL [1985].. [846] M. 103-126. AND D. pp. Experiences in benchmarking the three supercomputers CRAY-1M. University of Texas at Austin. Advantages for solving linear systems in an asynchronous environment. LAU. May. Fvjitt* VP-tOO compared with the CYBER 76. The Fast Adaptive Composite-grid method (FAC): Algorithms for advanced computers. Par. Eng. Report. University of Maryland. 1440-1449. Math. AND K. [853] L. WlRGAN [1978]. Society for Industrial and Applied Mathematics. PALMER [1986]. in McCormick [1312]. Proc. 35. [839] D. 259. AND J. A vectorized matrix vector multiply and overlapping block iterative method. [841] L. HEATH. [837] A. A. VOIGT AND C. IEEE Trans. EVANS [1988]. A symmetric parallel linear system solver. S. 6. D. AND B. NASA/NSF Workshop on Parallel Computation in Heat Transfer and Fluid Flow. pp. 
PIELA [1986]. MABHLE. A. HAPP. McCormick. Department of Mathematics. [861] M. pp. 13. MIZELL [1982]. H. HAYNES. [836] H. 389-394. 15(1). HEATH AND C. Report RM-88-17. HANKEY AND J. Alternating Direction method on vector processors. Math. Met. 1985 Int. S. in Cray Research. Report ORNL-6150. HAY AND I. 23. MUDGE. Int. HEAD-GORDON AND P. O'GALLAGHER. 91-100. JUMP [1987]. Appl. HART [1988]. HARMS AND H. January. pp. Math. ORTEGA [1988]. [858] M. IEEE Can. STOUT. DIRMU multiprocessor configurations. (Special Issue. J. CARLING [1984]. [847] R. HARPER AND J. pp. October. Q. Parallel processing for large scale transient stability. Appl. pp. Parallel Computing. HATZOPOULOS AND N. Parallel Computing. Par. Conf. 652-656. Department of Aerospace Engineering and Engineering Mechanics. Philadelphia. J. Parallel linear system solvers for tridiagonal matrices. [845] M.. An overlapping block iterative scheme for finite element methods. Parallel Cholesky factorization in message-passing multiprocessor environments. Power. Tech. [423]. HEATH.. Tech. 49-66. Proc. Math. CRAY-X/MP. HAYES [1974]. MISSIRLIS [1985]. Architecture of a hypercube supercomputer. pp. Comput. Comm. [835] W. The University of Virginia. 331-340. 987-990. Conf. ed. [854] L. DEVLOO [1986]. HANDLER. Numerical Analysis Report 98. Hypercube Multiprocessors. pp. Comparative analysis of iterative techniques for solving Laplace's equation on the unit square on a parallel processor. Proc. 1043-56. HARRAR AND J. pp. HART. Computer. in Evans [589]. pp. Proceedings of the Second Copper Mountain Conference on Multigrid Methods. SMITH [1988]. T. POTTE. 1987. November. HAYES AND P. [859] M. pp. 6. Parallel algorithms for solving linear equations using Givens transformations. Philadelphia. C. R. 395-417. [848] J. 115-133. in Paddon [1512]. ACM. ORTEGA.. [856] L. A survey of highly parallel computing. Comments on the paper "A short proof of the existence of the W-Z factorization". 12/13. SHANG [1982]. ROSENBERG. ed. HAYES AND P. 
[850] L. Copper Mountain. R. [851] L. [844] M. [852] L. Proc. G. DEVLOO [1984]. HEATH. AND J. Solution of three-dimensional generalized Poisson equations on vector computers. Conf. HARDING AND J. Oak Ridge National Laboratory. LUTTERMAN [1988]. AND K. University of Texas at Austin. 373-382. Tech.156 J. Comput. 12A. A vectorized version of a sparse matrix-vector multiply.

[866] D. IEEE Comp. [879] L. R. Soc. Comput. pp. G. pp. HELLER AND I. tan.. D. A survey of parallel algorithms in numerical linear algebra. 1980 Int. SIAM J. D. Proc. HEATH. J. AND R. Numerical computation on Cm*. GEM calculations on the DAP. 559-568.. Exploiting fast matrix multiplication within the level 3 BLAS. Conf. 189-203. Smedsaas. pp. Engquist and T. 9(3).). (To appear in SIAM J. New York. HENDRY AND L. Sci. TOTE [1972]. Algorithms for matrix transposition on Boolean n-cube configured ensemble architectures. SIAM J. [865] D. [884] K. 261-269.. 113-122. Conf. 1987 Int. HEATH AND D. Department of Computer Science. Ho AND L. Statist. Computer Science Tech. Appl. pp. [868] D. Proc.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 157 memory multiprocessors. CT. Proc. Cambridge. Conf. [869] R. STEVENSON. HlROMOTO [1984]. [880] N. HlBBARD AND N. Par. Comm. pp. & AppL. CRAY Research Inc. Conf. A pipelined Given* method for computing the QR factorization of a sparse matrix. WESSELING. 26. Par. TRAUB [1976]. pp. Proc. POR. HELLER. IPSEN [1982]. Fox. 253-260.. & Comp... HEMKER. HlLLIS [1985]. HELLIER [1982]. 2240207. Proc. HlGHAM AND R. Phys. Accelerated iterative methods for the solution of tridiagonal linear systems on parallel computers.-T. Conf. Proc. Copper Mountain. [862] M. KETTLER. [876] C. pp. HELLER [1974]. [881] N. in McCormick [1312]. A determinant theorem with application! to parallel algorithms.). DAP implementation of the WZ algorithm. Comm. A hardware design of the SIGMA-1. SIAM J. The Connection Machine. [882] W. P. pp. HEMKER [1984]. HlGBlE [1978]. DE ZEEUW [1983]. 4.. 640-648. JoHNSSON [1986]. M. Some aspect* of the cyclic reduction algorithm for block tridiagonal linear systems. Hypercube Concurrent Comput. B. Phys. eds. Los Alamos National Laboratory. Comput. 29-40. 636-654. 1984 Int. Conf. [871] P. pp. [888] C. HlROMOTO [1985]. 1986 Int. Cornell University. 13(3-4). 11. HEMPEL [1988]. a data flow computer for scientific computations... 
McCormick and U. Fast polar decomposition of an arbitrary matrix. 188-191. AND P. Statist. 20. HELLER [1978]. ed. pp. Comput. Proc. SGHREIBER [1988]. HlGHAM [1989]. [873] R.. PLEMMONS [1988]. Proc. 23. [877] L. HEMPEL [1988]. [886] R. Performance of multigrid software on vector machines. Computer Science Tech. Amsterdam. [889] C. Par. Proc. Numer. Math. Department of Computer Science. [864] D. WESSELING. Report ANL-87-23. Comput. New Haven. [883] R. Trottenberg. 77. Comput. A portable vector code for autonomous multigrid modules. Report 88-942. Conference on Advanced Research in VLSI. GOSMAN. [875] J. Systolic networks for orthogonal decompositions. Distributed routing algorithm for broadcasting and personalized communication in hypercubes. Control Data STAR-100 processor design. Tech. Report 89-984. KIEFT. pp.. G. S. [872] P. pp. 197206. [885] R. Par. Pub. Cholesky downdating on a hypercube. PDE Software: Modules. Proc. pp. Numer. Argonne National Laboratory. DE ZEEUW [1984]. On the embedding of arbitrary meshes in Boolean cubes with expansion two dilation two. Multigrid methods: Development of fast solvers. COMPCON 72. AND L. SLAM J. SORENSEN [1986].. Sci. Third Conf. 185194. Par. Supercomputer. T. Appl. SIAM Rev. SHIMADA. Cornell University. Alg.. 135-136. pp. CO. HEMKER. Proc. . Anal. AND K. pp. pp. WIGGERS [1981]. MIT Press.-T. [863] D. 621-629. Systolic network for orthogonal equivalence transformations and their application. pp. HlNTZ AND D. DELVES [1984]. Sci. Association for Computing Machinery. 1592—1598. 321-323. pp. in Paddon[l512]. Experiences with the Denelcor HEP. ACM. 13. 558-588. (Special Issue. 484-496. MIT Press. HELLER AND I. Ho AND L. OSTLUND [1980]. Parallel processing a plasma simulation problem using the particle-incell method.. Speeding up FORTRAN (CFT) programs on the CRAY-1. M. [867] D. Proc. HELLER [1976]. pp. 1-4. NISHIDA [1984]. HERTZBERGER. 22. 1. P. The Suprenum communications subroutine library for grid-oriented problems. AND P. 
AND J. Parallel Computing. IPSEN [1983]. pp. Ho AND L. Parallel multigrid algorithms for the biharmonic and the Stokes equations. FAMP system. [874] R. Anal. Statist.. Tech. JOHNSSON [1987]. implementation and performance. Report LA-UR-85-2393. eds. pp. Proceedings of the First Copper Mountain Conference on Multigrid Methods. JoHNSSON [1987]. Interfaces and Systems. [870] P. 740-777. [878] P. 1987 Int. G. Proc. North-Holland. [887] C.. 524-531. HlRAKI. 311-326. HENKEL.-T. SCHOOREL..

Conf. Performance characterization of the HEP. Programming and Algorithms. vol. Engquist and T. Parallel Computing. HOBBS. AND M. Comm. in Paddon [1512]. Copper Mountain. D. [903] R. The large parallel computer and university research. pp. [912] W. pp. KAWAZOE. Appl. e. Optimizing the FACE (I) Poisson solver on parallel computers. [894] R. Phys. KAMIMURA. R.158 J. Parallel Computers: Architecture. HOCKNEY [1983]. 269-271.. Comput. VOIGT AND C. Proc. [895] R. 521-526.. Parametrization of computer performance. pp. pp. International Association for Mathematics and Computers in Simulation. S. 5. HOCKNEY [1984]. pp. Elsevier. 127-144. eds. [902] R. Parallel Computing. Computer Science Press. A parallel algorithm for the integration of ordinary differential equations. Phys. IIDA. 34. MIMD computing in the USA — 1984. HOCKNEY [1984]. Comput. [892] R. pp. Proc. Optimizing the FACR (I) Poisson-solver on parallel computers. United Kingdom. Characterizing MIMD computers. on Future Systems. Proceedings of the Second Copper Mountain Conference on Multigrid Methods. HOCKNEY [1982]. pp.. Par. Performance of parallel computers. HOCKNEY [1982]. HOCKNEY [1982]. Parallel Processor Systems: Technologies and Applications. (V<x>. LAZOWSKA. G. Vectorized computation of correlation functions from phase space trajectories generated by molecular dynamic calculations. pp. B. Parallel Computing. HOHEISEL. HOSHINO.. [913] H. [914] R. pp. Phys. pp. G. 59-90. The Illiac IV: The First Supercomputer. NARA [1984]. VOGELSANG [1984]. 26. HOCKNEY [1983]. Proceedings of World Congress on System Simulation and Scientific Computation. pp. Comput. pp.g. A fast direct solution of Poisson's equation using Fourier analysis. Vectorized multigrid solvers for the two-dimensional diffusion equation. Y. HOPPE AND H. R. pp. Adam Hilger. 269-288. SEKIGUCHI. MAJIMA [1987]. Addison-Wesley.). SHIRAKAWA [1985]. HOCKNBY [1977]. Reading. J.. & Comp. CO. Characterization of parallel computers and algorithms. TRIMBLE. 
Interfaces and Systems. ROMINE [890] L. Comm. Structured Concurrent Programming. European Joint Comp. 149-185. [919] T. [896] R. HOCKNEY [1985]. 465-469. HOCKNEY [1984]. HOCKNEY [1985]. nj^. in McCormick [1312]. To appear. 19(1-4). TITUS. Presented at the IBM Symposium on Vector Computers and Scientific Computing. Mapping schemes of the particle-in-cell method implemented on the PAX computer.. Comp. J. [623]. the Denelcor HEP. Phys. HORD [1982]. SCOTT [1978]. C-32. M. Parallel adaptive full-multigrid methods on message-based multiprocessors. Smedsaas. S. ORTEGA. in Kowalik [1116]. HOLT. HOCKNBY [1965]. Super-computer architecture. H. pp. pp. HOCKNEY AND D. [899] R. AND S. Particle codes and the Cray-2. [911] B. Los Alamos National Laboratory. [907] R. H. Ltd. Infotech State of the Art Conf.. Poisson solving on parallel computers. AND H. A universal computer capable of executing an arbitrary number of subprograms simultaneously. M. AND T. 933-941. HOROWITZ [1986]. Proc. [904] R. Parallel Computing. T. Proc.ai 72 J measurements on the S-CPU CRAY X-MP. pp. [915] S. AND R. Par. Proc. Spartan Books. [898] R. Cont. HlROMOTO. pp. T. 285-291. 9-14. [893] R. pp.-C. in Parallel MIMD Computation: HEP Supercomputer and Its Applications [1117].. Bristol. HORIGUCHI. Math. [901] R. HOCKNEY AND C. Lawrence Livermore National Laboratory. 108-113. Parallelized ADI scheme . [897] R./ [917] E. HOCKNEY [1985]. Report UCRL-95055. [910] R.. 45-65. [906] R. 1-14. 20. pp. HOLLAND [1959]. 97-104. [918] T. The nj/ 2 method of algorithm analysis. SCHOEN. 2. 3. HOLTER [1986]. HOCKNEY [1979]. 159-176. GRAHAM. 119136. Conf. Characterizing computers and optimizing the FACR (/) Poisson solver on parallel unicomputers. Vectorizing the interpolation routines of particle-in-cell codes. IEEE Trans. Rome. HOROWITZ [1987]. J. [900] R. Characterization of parallel computers. [916] E. [909] J. 2. Tech. (Special Issue. Conf. 62-71. Report LA-UR-872879. HIGHBERG [1970]. Tech. MA. E. in Feilmeieret al. ACM. 
1. 12. McCormick. MUHLENBEIN [1986]. AND D. PDE Software: Modules. 1982 Int.. Proc. HOLTER [1988]. THBIS.. 95-113. HOSHINO. HOCKNEY [1987]. [908] C. A vectorized multigrid solver for the three-dimensional Poisson equation. 1984 Int. JESSHOPE [1981]. 429—444. ed. SNELLING [1984]. [891] R. [905] R.

AND J. Tosmo [1983].. IEEE Power Engineering Society Summer Meeting. J. 1. Highly parallel procesor array UPAX" for wide scientific applications. [924] S. 95-105. Amsterdam. TAKENOUCHI. Interfaces and Systems. Partitioning PDE computations: Methods and performance evaluation. pp. 61. HOUSTIS. Efficient parallel algorithms for processor arrays. B. Conf. ACM Trans. North-Holland. J. pp. Sys. Institute of Electrical and Electronics Engineers. . S.. VAVALIS [1988]. pp. ABE. RlCE [1984]. HOUSTIS. AND Y. HOUSORS AND O. T. KAGEYAMA. [937] K. 287-296. Oper. WiNG [1979]. PACS: A parallel microprocessor array for icientific calculations. DAVIES. HUANG [1974]. Proc. [938] R.. T. 215-248. Monte Carlo simulation of a spin model on the parallel computer PAX.. [923] T. AIAA. Evaluation of a vectorizable t-D transonic finite difference algorithm. [941] D. R. AND K. HOUSTIS. RlCE [1986]. Report 109. pp. HOSHINO AND K. HOSHINO. [929] C.. [928] C. HOUSTIS. Proc. 1983 Int. HOUSTIS. Tech. Department of Computer Science. Par. RlCE. Performance evaluation models for distributed computing. [927] C. T. Application techniques for parallel hardware.. AND E. 5. in Jesshope and Hockney [976]. in Rodrigue [1643]. Processing of the molecular dynamic model by the parallel computer PAX. Inc. MACKE. HOUSTIS. E. Conf. Proc. AND J. T. pp. FERENCZ. 67-87. T. HUGHES. WING [1978]. Proc. University of Illinois at UrbanaChampaign. pp. M. AND J. pp. Paper 79-0276. HUANG AND O. AND K. Conf. [762]. eds. A78-567-0).. HOSHINO. 1984 Int. pp. [942] D. ABRAHAM [1982]. Purdue University. Engquist and T. Department of Computer Science. Hu [1961]. OYANAGI. Fault-tolerant algorithms and their application to solving Laplace equations.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 159 using GECR (Gauss-Elimination-Cyclic Reduction) method and implementation of Navier-Stokes equation on the PAX computer. HOUSTIS. AND M. A Schwartz splitting variant of cubic spline collocation methods for elliptic PDEs. Meth. 
J. 426433. Center for Advanced Computation. [943] C. in Vichnevetsky and Stepleman [1921]. HUNT. KAMIMURA. 841-848. HUNT [1979]. SHIRAKAWA. E. ITO. On minimum completion time and optimal scheduling of parallel triangulation of a sparse matrix. Comm. RlCE. CAS-26. DAWSON. Report CSD-TR-576.. 365-396. S. VAVALIS [1987]. 169-180. Comput. Pseudo-conjugate directions for the solution of the nonlinear unconstrained optimization problem on a parallel computer. IEEE Trans. [926] C. HUANG AND J. pp. Parallel sequencing and assembly line problem. WING [1984]. in Schultz [1752]. 34. S. HOUSTIS. J. HUFF. Par. T. [920] T. Smedsaas. Y. in Vichnevetsky and Stepleman [1921]. Partitioning and allocation of PDE computations in distributed systems. [940] T. HUANG AND J. 271-279. pp. H. [933] H. OYANAGI [1984]. February.. [931] E. pp. pp. AND G. HUSON. YAMAOKA. 195-221. 726-732. shell and contact/impact problems. A. Proc. KAWAI. Proc. [930] E. pp. in Glowinski et al.. HUGHES AND R. Phys. Tech. pp. Plasma physics on an array processor. 1985 Int. A parallel algorithm for symmetric tridiagonal eigenvalue problems. TAKENOUOCHI. HOUSTIS. E. FERENCZ [1988]. Benchmarking of bus multiprocessor hardware on large scale scientific computing. [921] T. Parallel Computing. SAMARTZIS [1987]. CULLER [1982]. SAWADA [1983]. Optimal parallel triangulation of a sparse matrix. Proc. HUANG AND O. Applications of a parallel processor to the solution of finite difference problems. [939] T. J. DlCKSON [1979]. ABRAHAM [1984]. Comput. AND B. HOUSTIS. RlCE [1987]. 31[922] T. HIOASHINO. Report CSD-TR-745. 141-164. Comput. pp. T. Proc. 261-280. Optimization Theory and Applications. Circuits and Syst. WEBB. K. Par. [925] E. WILSON [1981]. SHIRAKAWA. LEASURE [1986]. on Comp. Large scale vectorized implicit calculations in solid mechanics on a CRAY-X-MP/48 utilizing EBE preconditioned conjugate gradient. 
Fully vectorized EBE preconditioned for nonlinear solid mechanics: Applications to large-scale three-dimensional continuum. Purdue University. [934] J. [935] J. Mech. AND J. SATO. K. (IEEE Pes Abstract No. [932] T. HOTOVY AND L. HOSHINO. Conf. AND E. Par. 167-174. Res. Parallelization of a new class of cubic spline collocation methods. 1982 Int. Appl. pp. 9. 42. MAJIMA. 31. January. AND A. pp. pp. SEKIGUCm.. TAKENOUCHI [1984]. H. 117-122. RlCE. [936] K. J. Phys. WOLFE. Engrg. 205-219. E. Comm.-M. Tech. 339-344. Los Angeles. J. PDE Software: Modules. HALLQUIST [1987]. The KAP/205: An 38.

MEIER. Conf. I. 1986 Int. M. 57-73. JACOBS. pp. GANNON. W. pp. Photo-Optical Eng. 18(6). HWANG AND J. pp. Nl [1981]. H. Second USA — Japan Comp. No. McGraw Hill. Proc. Sci. Solving the symmetric tridiagonal eigenvalue problem on the hypercube.160 J. pp. Report YALEU/DCS/RR-299. AND H. Soc. Carnegie-Mellon University. HWANG [1984]. Par. pp. HYAFIL AND H. in Heath [860]. BeUingham. Proc. HWANG AND Y. pp. A parallel QR method using fast Givens' rotations. Bounds on the speed-ups of parallel evaluation of recurrences. Report 607.. Eng. 80. K. JESSUP [1987]. CHENG [1980]. K. The communication complexity of parallel algorithms. Vectorization for solving the neutron diffusion equations — Some numerical experiments. Proc. Department of Computer Science.. Parallel algorithms for solving triangular linear systems with small parallelism. 1986 Int. Su. Yale University. Loui [1987]. Department of Computer Science. 205-239. Tech. The Characteristics of Parallel Algorithms. eds. SAAD [1985]. M. Parallel Processing. Proc. HWANG AND Z. HWANG AND F. VOIGT AND C. Computer Architecture and Parallel Computing. IPSBN [1984]. FFT algorithms for SIMD parallel processing systems. SWARTZLANDER. 20S2. 115-197. Proc. KUNG [1975]. Yale University. 102-111. Conf. M. I. BWOGS [1984]. 24. M. JAYASIMHA AND M. McGraw Hill.. Tech. ROMINE advanced source-to-source vectorizer /or ike Cyber 205 supercomputer. [1977]. Hypemet: A communication-efficient architecture for constructing massively parallel computers. L. Tech. AND L. Report. G. A comparative analysis of static and dynamic load balancing strategies. SCHULTZ [1986]. Workshop at NASA-Ames.. pp. Par. BOKHARI [1986]. ISHIGURO AND Y. Nuc. AND A. Conf. GHOSH [1987]. JAMIESON. pp. ACM. pp. Comput. 322-328. 513-521. K. Advances in Computers. Computer. 77. pp. K. SAAD. Tech. W. I. 217-227. The behaviour of conjugate gradient based algorithms on a multi-vector processor with a memory hierarchy. J. University of Illinois at UrbanaChampaign. Y. May. 
Yale University. Par. IEEE Trans. S. IPSBN [1987]. JAMIESON. Proc. IPSBN AND E. Center for Supercomputing Research and Development. Department of Computer Science. Lin. KUNG [1974]. MIT Press. Report YALEU/DCS/RR-444. Remps: A reconfigurable multiprocessor for scientific supercomputing. 178-182. AND E. 1450-1466. Tech. Department of Computer Science. IPSEN. & Appl. AND S. HWANG [1985]. February. Publ. NY. VLSI computing structures for solving large scale linear systems of equations. Proc. [1986]. Par. The impact of parallel architectures on the solution of eigenvalue problems. [1987]. C-31.. Singular value decomposition with systolic arrays. Comp. I. KUNG [1977]. I. eds. I. JESSUP [1987]. pp. Dist. WA. HWANG. I. Future Computer Requirements for Computational Aerodynamics. IEEE Trans. J. Department of Computer Science. Comput. MEIER [1986]. K. IQBAL. 827-835. November. L. 1985 Int. December. JALBY AND U. Two methods for solving the symmetric tridiagonal eigenvalue problem on the hypercube. C-36. 1215-1224. Multiprocessor supercomputer* for scientific/engineering applications. ed. MUELLER. IPSBN [1984]. S. Alg. 1980 Int. Computer Architecture and Parallel Processing. K. L. New York. K. SAMBH [1986].. JALBY. KoSHl [1982]. pp. HVAFIL AND H. AND R. IPSEN AND Y. NY. INOUYB. K. Conf. L. 20. Optimizing matrix operations on a parallel multiprocessor with a memory hierarchy.. The complexity of parallel evaluation of linear recurrences. U. Conf. University of Illinois at Urbana-Champaign. P.-H. 3. R. K. NorthHolland. Vector computer architecture and processing techniques. Systolic algorithms for the parallel solution of dense symmetric positivedefinite Toeplitz systems. Report 555. Tech. HWANG [1982]. [944] [945] [946] [947] [948] [949] [950] [951] [952] [953] [954] [955] [956] [957] [958] [959] [960] [961] [962] [963] [964] [965] [966] [967] [968] [969] [970] . IPSBN AND E. Partitioned matrix algorithm* for VLSI arithmetic systems. HWANG. L.. 627-638. Proc. SIBGEL [1986]. 
DOUGLAS. D. pp. pp. HYAFIL AND H. ORTEGA.. D. Report YALEU/DCS/RR-548. Complexity of dense linear system solution on a multiprocessor ring. Par. Xu [1985]. Proc. New York. Report YALEU/DCS/RR-539. J. Conf.. Yale University. Tech.. Center for Supercomputing Research and Development. AND M. Proc. SALTZ. 48-71. 1040-1047.

Concurrent processing adaptation of a multiple-grid algorithm. vol. JOHNSON. L. 221-236. 74-86. pp. Report 629. Computational arrays for band matrix equations. NY. Australia. Evaluation of Illiac: Overlap. ZIEBARTH [1986]. pp. pp. pp. J. Comput. 5. O. pp. JOHNSSON [1985]. 105-126. Optimal parametrized incomplete inverse preconditioning for conjugate gradient calculations. pp. JlANPINO AND K. IEEE Trans. 7. Tech. O. JOHNSON AND M. Tech. C. February. Parallel Computing. PAUL [1983]. 231-239. L. C-31. Yale University. C. AND G. Institute for Scientific Computing. Yorktown Heights. Lecture Notes in Physics. Some principles of parallelism in particle and mesh modeling. PPCG software for the CDC CYBER 205. 659-662. Proc. O. Band matrix systems solvers on ensemble architectures. JOHNSSON [1984]. Report RC-8644. Proc. in Birkhoff and Schoenstadt [173]. PAUL [1981]. May. pp. Cyclic reduction on a binary tree. L. Comput. 4-5. L. JOHNSON [1981]. 345-351. L. Ltd. eds. 313-321. 29(10). pp. Microelectronics 1982. G. Penfield. JOHNSSON [1985]. A data structure for parallel L/U decompaction. Austin. MICCHELLI. Center for Supercomputing Research and Development. Vector algorithms for elliptic partial differential equations based on the Jacobi method. Multigrid for parallel-processing supercomputers. Tech. on Advanced Res. C-29. PRYOR. The implementation of the fast radix £ transform* on array processors. JOHNSSON [1981]. L. pp. Comput. [1979]. Algorithms. pp. 20. Institute for Advanced Computation Newsletter. ed. Seismic Acoustics Lab. pp. Comput. JESSHOPE AND R. LEWITT [1982]. SWISSHELM. Report 4287:TR:81. 362-376. JOHNSSON [1984].. Report YALEU/DCS/RR-339. JESSHOPE [1980]. California Institute of Technology. Pipelined linear equation solvers and VLSI. in Control Data Corporation [411]. . PAUL [1981]. 243-245. IEEE Trans. IMACS. JOHNSON [1983]. vol. Odd-even cyclic reduction on ensemble architectures and the solution of tridiagonal systems of equations. Phys. KUMAR [1985]. 
Vector function chainer software for banded preconditioned conjugate gradient calculations. Some results concerning data routing in array processors. O. Fourth year Semi-Annual Prog. Report YALEU/DCS/RR-367. Highly concurrent algorithms for solving linear systems of equations.. IBM. Data permutations and basic linear algebra computations on ensemble architectures. Proc. 37. Report 87003. O. JOHNSON. AND S.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 161 [971] [972] [973] [974] [975] [976] [977] [978] [979] [980] [981] [982] [983] [984] [985] [986] [987] [988] [989] [990] [991] [992] [993] [994] [995] [996] [997] Tech. JOHNSSON [1985]. Department of Computer Science. Progress on the 3D wave equation program for the CDC Cyber 205. 90-95. D. vol. IEEE Trans. AND J. pp. EDWARDS [1981]. non-overlap. Tech. TX. P. SWISSHELM. Berlin. JOHNSON AND G. Tech. C. 42-46. 1.. JOHNSON AND J. JOHNSON AND G. JOHNSON. Department of Computer Science.. Institution of Electrical Engineers. JESSHOPE [1980]. Proc. JESSHOPE [1977]. JOHNSSON [1982]. Infotech State of the Art Report: Supercomputers. JESS AND H. Comm. MIT Conf. vol. 264. HOCKNEY. 1 & i. Polynomial preconditioners for conjugate gradient calculations. O. in McCormick [1312]. 72.. C. JOHNSON [1987]. J. 350-356. Fort Collins. JESSHOPE AND J. J. University of Texas Press. Architectures and the Future of Scientific Computation. J. pp. 20-27. AIAA J. An asynchonous parallel mixed algorithm for linear and nonlinear equations. Numer. Anal. University of Illinois at Urbana-Champaign.IX. 10th IMACS World Congress on Systems Simulation and Scientific Computation. Yale University. Datamation. in Schultz [1752]. LlSHAN [1987]. C. A computational array for the QR-method. 195203. Advances in Computer Methods for Partial Differential Equations . S. JOHNSSON [1982]. CO. pp. Parallel processing in fluid dynamics. KEES [1982]. CRAIGIE [1979]. Three-dimensional wave equation computations on vector computers. G. L. 
Multitasked embedded multigrid for three-dimensional flow simulation. SIAM J. Rep. Springer-Verlag. pp. in VLSI. JOHNSON [1984]. JOHNSON AND M. C-29. IEEE. Maidenhead: Infotech Int. 1. G. May. O.. C. Department of Computer Science. pp. 123-129. SWISSHELM [1988]. Artech House. L. ETA leaves home. G. in Jesshope and Hockney [976].

[1024] T.. SIAM J. JORDAN [1987]. JORDAN. [1014] H. Band matrix systems solvers on ensemble architecture. [1017] H. A conjugate gradient program for the Finite Element Machine. Report YALEU/DCS/TR532. Dist. Tech. JORDAN [1987]. 195-216. SAAD. Trends in Computerized Structural Analysis and Synthesis. A new parallel algorithm for diagonally dominant tri-diagonal matrices. The Force. North-Holland. Report 86-54. pp.. Sci. eds. JOHNSSON AND C. [999] L. Report 85-45. 21-29. D. Report CSDG. A. [1013] H. PODSIADLO [1980]. 419—454. [1018] H. ICASE. L. Tech. 1978 Int. shared memory environment. Report 87-1-1. M. VA. Department of Electrical and Computer Engineering. JORDAN [1978]. Tech. pp. SAWYER [1979]. Proc. Department of Computer Science. Conf. 59-76. Structuring parallel algorithms in an MIMD. Interpreting parallel processor performance measurements. Supercomputers. Boulder. Ho. [1023] T. The Force on the Flex: Global parallelism and portability. Tech. pp.229-237. [1019] H.. Comput. ed. ROMINE [998] L. 133-172. ACM Computing Surveys. SIAM J. The Cm* multiprocessor project: A research review. JOHNSSON [1986]. Parallel Processing and Medium Scale Multiprocessors. Yale University. NASA Langley Research Center. pp. IEEE. s220-s226. JONES AND P. August. 686-700. Wouk. JOHNSSON [1987]. Arch. P. Performance Evaluation of Numerical Software. Multiple tridiagonal systems. Carnegie-Mellon University. SIAM J. VOIGT AND C. [1015] H. A performance evaluation of linear algebra software in parallel architectures. Tech. Statist. JORDAN [1983]. pp. [1007] A. ACM Trans. Y. Report. ed. C. Matrix Anal. Experience using multiprocessor systems: A status report. I. SIAM J. Report YALEU/DCS/RR-552. Boulder. JORDAN [1986]. Proc. Alternating Direction methods on multiprocessors. 66(2). To appear. JOHNSSON [1987]. AND S. 263-66. Int. Tech. Solving narrow banded systems on ensemble architectures. Report CMU-CS-80-131. Society for Industrial and Applied Mathematics. 
The Finite Element Machine programmer's reference manual. [1012] H. Fast linear algebra routines on hypercubes. 1979 Int. Tech. [1000] L. Conf. [1004] L. [1006] L. Comp. Department of Computer Science. [1022] H. 10th Ann. Solving multiple tridiagonal systems. R. Department of Computer Science. 12. Programming issues raised by a multi-microprocessor. 3. 8.162 J. Computer Systems Design Group. 231-238. Performance measurements on HEP — A pipelined MIMD computer. SCELZA. McComb.. 271-288. 8. Softw. JOHNSSON [1985]. JONES AND E. [1009] A. NY. VA. AND M. R. pp. Ho. GBHRINGER. Symp. JORDAN [1986]. Statist. IEEE. Tech. NASA Langley Research Center. JORDAN [1984]. Comput. Proc. Par. J. Statist. [1020] H. Tech. Pergammon Press. SAIED [1987]. Experience with pipelined multiple instruction streams. SCALABRIN. [1010] H. Par. Proc. CHANSLER.. SCHWARTZ [1980]. University of Colorado. Ho [1987]. [1021] H. JOHNSSON. [1002] L. [1016] H. JORDAN [1981].. Proc. [1005] L. Yale University.. JONES.-T. Department of Computer Science. pp. Sci. Algorithms for matrix transposition on Boolean N-cube configured ensemble architectures.. SCHULTZ [1987]. JORDAN [1978]. CALVERT [1979]. eds. JOHNSSON. Department of Computer Science. 9. M. 113-123. Parallelizing a sparse matrix package. A multimicroprocessor system for finite element structural analysis. Report CSDG 78-2.-T. pp. Hampton. the Alternating Direction method. [1001] L. Los Alamos National Laboratory. January. and Boolean cube configured multiprocessors. JORDAN AND P. University of Colorado. Parallel computation with the Force. University of Texas Press. New York. 4. pp. June. Tech. Comp. AND F. JORDAN [1974]. pp. DURHAM. . Hampton. Fosdick. Appl. G. eds. 72. H. Communication efficient basic linear algebra computations on hypercube architectures. C. pp. [1011] H. University of Colorado. October. K. Proc. 121-165. [1008] A. VEGDAHL [1978]. 93-110. University of Colorado. Noor and H. Proc. JOHNSSON. Comput. ORTEGA. 
A special purpose architecture for finite element analysis. Solving tridiagonal systems on ensemble architectures. Tajima. Boulder. ICASE. FEILER. SAIED [1986]... AND W. Parallel Computing. the Alternating Direction methods and Boolean cube configured multiprocessors. Par. A comparison of three types of multiprocessor algorithms. [1003] L. 8. Matsen and T. 11. pp. AND F.. JOHNSSON [1988]. [1980].-T. Boulder. F. pp. JORDAN [1985]. Sci. pp. SCHWANS. 354-392. Report CSDG 81-1. JORDAN [1979]. Math. A. pp. JORDAN AND D..

Comput. pp. SNYDER [1984]. KlMURA [1978]. KALE [1985]. A simple atmospheric model on the sphere with 100% parallelism. Conf. [1030] A. AND J. Proc. The preconditioned conjugate gradient algorithm on a multiprocessor. KAMATH AND A. Hamsburg. SIAM J. 68. Moscow. Solution of nonsymmetric systems of equations on a multiprocessor. 7. Lattice Mesh: A multi-bus architecture. KAMOWrrz [1987]. A. 92-99. [1133]. KAPITZA AND D. [1050] R. SAMBH. 96-100. in McCormick [1312]. 1984 GAMM Conference. A short proof for the existence of the WZ-factorization. [1029] G. 474-484.. Info. CUNY. in Cray Research. 320322. Proc. in Kuck et al. CtOETH [1984]. pp. GANNON. in Birkhoff and Schoenstadt [173]. 4. Japan. pp. 20. [1031] S. pp. Tech. [1027] T. Center for Supercomputing Research and Development. KAPS AND M. [1026] T. 5-8. Comp. J.. Block tridiagonal linear systems on a reconfigurable array computer. TAKOCS [1982]. pp.. Proc. University of Illinois at Urbana-Champaign. 1-50. 1985 Int. pp. in Rodrigue [1643]. Proc. PONG [1977]. 117-142. NASA-Goddard Modeling and Simulation Facility. A numerical conformal mapping method for simply connected domains. Proc. Report 507. 1981 Int. The Pringle: An experimental system for parallel algorithm and software testing. [1038] C. Phys. [1032] L. Conf. Techniques for solving block tridiagonal systems on reconfigurable array computers. Trans. University of Illinois at Urbana-Champaign. A projection method for solving nonsymmetric linear systems on multiprocessors. Arch. August. Parallel Computing. KANT AND T. K. Research Review [1980-81]. Parallel Computing. in Vichnevetsky and Stepleman [1920]. Parallel Computing. Performance comparison of the CRAY-2 and CRAY XMP on a class of seismic data processing algorithms. Germany. pp. [1035] C. A 3-D Poisson solver based on conjugate gradients compared to standard iterative methods and its performance on vector computers. JORDAN [1982]. [1041] D. Statist. 700-702. Comp. EPPBL [1987]. JORDAN AND K. NGUYEN [1988]. 
YANG. Experimental results for multigrid and transport problems. G. 26. pp. [423]. STORCH [1976]. KAMATH [1986]. [1036] C. Structural computations on the Cedar system. University of Illinois at Urbana-Champaign. [1037] C. Proc. 383-385. Proc. [1044] Y. KAMOWITZ [1988]. AND L. KAMATH AND A. Computers and Structures. [1033] E. SAMEH [1984]. 127-139. pp. KAPUR AND J. Tech. Report 611. pp. BROWNE [1981]. SAMEH [1985]. HOSHINO [1985]. Center for Supercomputing Research and Development. SAMEH [1986]. D. A two-layered mesh array for matrix multiplication. CALMATH: Some problems and applications. pp. KAMGNIA AND A. [1048] M. 229-232. pp. KALNAY AND L. 19-24. 6. KOHATA [1982]. Report 591. 140-153. JORDAN [1982]. Conjugate gradient preconditioners for vector and parallel processors. 1984 Int. Conf. Proc. Some linear algebraic algorithms and their performance on the CRAY-1. [1047] H. KAMPE AND T. 4. [1039] E. pp. KAIIABV [1985]. pp. Parallel Computing. KAMIMURA AND T. Highly parallel computing of linear equations on the matrix-broadcast memory connected array processor system. Multiprocessor super systems with programmable architecture based on the data-stream principle. Sci. KANEDA AND M. Processing of Alternating Direction Implicit (ADI) method by parallel computer PAX. on Simulation of Large-Scale Atmospheric Processes. JOUBERT AND E. pp. Izdatel'stvo Nauka. J. Proc. Par. Experiments with the fourth order GISS model of the global atmosphere. Tech. October. [1028] T. Center for Supercomputing Research and Development. [1045] R. [1034] E. [1040] T. [1043] F. Conf. Conf. Proc. [1042] D. 5th Annual Symp. WANG. A guide to parallel computation and some CRAY-1 experiences. September. KAPUR AND J. Computational Processes and Systems. 210-217. BROWNE [1984]. Math. Soc. 41-54. 1. KANT AND T. pp. BAYLISS.
10th IMACS World Congress on Systems Simulation and Scientific Computation. The solution of tridiagonal linear systems with an MIMD parallel computer. Decentralized parallel algorithms for matrix computations. [1049] R. KAK [1988]. Proc. A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 163 [1025] T. Par. [1046] A. pp. 5. SCHLEGL [1987]. 47-54. 701-719. Par. pp. Z. Angew.

A performance survey of the CYBER 205. 281-294. [1056] S. KERCKECFFS [1986]. SIAM J. Appl. J. KERSHAW [1982]. in Kowalik [1116]. in Vichnevetsky and Stepleman [1920]. Syntactic and semantic vectorization: Whence cometh intelligence in Supercomputing?. Conf. KASCIC [1984]. Parallel Computing. Parallel Computing. KAUFMAN AND N. [1078] M. 59-67. in Parter [1529]. Proc. [1051] A. pp. 10. SEN [1977]. 73-86. KASCIC [1986]. Sci. Parallel minimax search for a maximum. Practical multiprocessor scheduling algorithms for efficient parallel processing. A direct Poisson solver on STAR. 3. Int. 117125. IEEE. AND J. SPE. KENDALL. MIRANKER [1968]. Softw. [1079] E. Vancouver. Sci. KARP AND W. KASAHARA AND S. [1065] M. [1060] M. Two strategies for root finding on multiprocessor systems. Tech. [1081] D. Banded eigenvalue solvers on vector machines. Interplay between computer methods and partial differential equations: Iterative methods as exemplar. pp. [1062] M. SPE Middle East Oil Tech. [1053] R. ORTEGA. The impact of vector processors on petroleum reservoir simulation. pp. GREENSTADT [1987]. Paper 11483. C-23. 1169-1174. J. An improved parallel Jacobi method for diagonalizing a symmetric matrix. [1072] L. [1067] M. Math. KATZ AND M. [1068] H. Optimally stable parallel predictors for AdamsMoulton correctors. [1052] A. Comput. 379-382. New York. KARTASHEV AND S. Theory. KARTASHEV AND S. International Supercomputing Institute. KASCIC [1978]. Vector processing on the CYBER 200 and vector numerical linear algebra. [1061] M. KARP AND R. Petersburg. KASCIC [1979]. IEEE Trans. Comput. Comb in. SILLIMAN. 5. [1073] M. 1978 LASL Workshop on Vector and Parallel Processors. A vector-oriented finite-difference scheme for calculating threedimensional compressible laminar and turbulent boundary layers on practical wing configurations. [1063] M. Combin. [1055] L. pp. [1054] R. ACM Trans. [1064] M. pp. [1070] I. [1077] R. International Supercomputing Institute. [1058] H. in Jesshope and Hockney [976]. AND P. KENICHI [1981]. Japanese super-speed computer project. St. KASCIC [1984]. Paper 81-1020. M. [1074] S. queuing. eds. KAUFMAN [1974]. Anatomy of a Poisson solver. KARTASHEV. Statist. on Numeric Mathematics in Fluid Dynamics. ICASE. [1075] J. J. [1069] I. AIAA. [1057] A. KELLER AND A. Parallel algorithms for ordinary differential equations — An introductory review. 4. pp. An almost-optimal algorithm for the assembly line scheduling problem. Appl. Proc. KEELING [1987]. pp. Vorton dynamics: A case study of developing a fluid dynamics model for a vector processor. IEEE Trans. in Kowalik [1116]. pp. A comparison of domain decomposition techniques for elliptic partial differential equations and their parallel implementation. C-33. 1983 Summer Computer Simulation Conf. KASHIWAGI [1984]. pp. 85-89. NARITA [1984]. KASCIC [1979]. 1. eds. 173-179. Solving two dimensional partial differential equations on vector and scalar machines. 10-33. Math. Comput. FRANKLIN [1985]. [1066] M. KATZ.
1978 LASL Workshop on Vector and Parallel Processors. Vectorization as intelligent processing. 3(1). KARP AND J. A vector-oriented finite-difference scheme for calculating threedimensional compressible laminar and turbulent boundary layers on practical wing configurations. [1063] M. Combin. SILLIMAN. 5. [1987]. Proc. 35-44. JAMESON [1978]. 85-89. 1023-1029. NOLEN. VOIGT AND C.. H. 43-57. VA. 314-333. pp. [1071] L. W. pp. Report 87-58. pp. [1054] R. ACM Trans. [1064] M. pp. [1070] I. [1077] R. International Supercomputing Institute. D. 947-952. pp. 85-126. Recent mathematical and computational developments in numerical weather prediction. Comput. pp. pp. 1390-1411. KAUFMAN [1984]. 72. KARTASHBV. KASCIC [1979]. SIAM J. WATTS [1983]. 3rd GAMM Conf. 14. • [1080] D. STANAT [1984]. KEYES AND W. Proc. Hampton. KASCIC [1984]. Paper 81-1020. M. [1074] S. Comput. 6. SCHRYER [1989]. pp. KARP [1987]. R. Vector processing on the CYBER 200. KENICHI [1981].. MORRELL. Computer. Development of a multiple application reservoir simulator for use on a vector computer. 217-233. 191-210. Programming for parallelism.. pp. FRANKLIN. queuing. eds. KAUFMAN [1974]. [1058] H. in Jesshope and Hockney [976]. AND P. G. AIAA. St. KASAHARA [1984]. Anatomy of a Poisson solver. Math. Supercomputing '87: Proceedings of the Second International Conference on Supercomputing. KARTASHEV. Statist. on Numeric Mathematics in Fluid Dynamics. ICASE. [1075] J. J. [1069] I. AIAA.. G. [1057] A. KELLER AND A. Parallel algorithms for ordinary differential equations — An introductory review.. 4. pp. An almost-optimal algorithm for the assembly line scheduling problem. Appl. Proc. KEELING [1987]. pp. Vorton dynamics: A case study of developing a fluid dynamics model for a vector processor. IEEE Trans. in Kowalik [1116]. pp. A comparison of domain decomposition techniques for elliptic partial differential equations and their parallel implementation. C-33. 1983 Summer Computer Simulation Conf. 
Solution of single tridiagonal linear systems and vectorization of the ICCG algorithm on the CRAY-1. [1986]. Supercomputer Appl. GROPP [1987]. in Feilmeieret al. KASCIC [1983]. Paper 78-12. KASCIC [1984]. PEACEMAN. 19-35. termination. pp. [1059] M.237-270. Bahrain. Properties of a model for parallel computations: Determinacy. in Fernbach [630]. Proceedings of the 1986 Summer Simulation Conference. pp. KASHIWAGI [1984]. pp. 85-89. NARITA [1984]. KASCIC [1979]. 1. eds. 173-179. Solving two dimensional partial differential equations on vector and scalar machines. 10-33. Math. Comput. FRANKLIN [1985]. [1066] M. M. KATZ.

KOBER AND C. Paper 83-0223. KOHLER [1975]. Gauss-Jordan elimination by VLSI mesh-connected processors. 5MS — A multiprocessor architecture for high speed numerical computations. IBM J. POOLE. McGraw Hill Book Company. KNIGHT. T. AND D. [1107] O. KlNCAID AND D. KlNCAID. OPPE. Department of Mathematical Sciences. [1090] D. Efficient multigrid algorithms for locally constrained parallel systems. ITPACK on supercomputers. KlRKPATRICK. Comm. KOLP AND H. Comput. AIAA 21st Aerospace Sciences Meeting. T. pp. KOGGE [1974]. KLAWE. 1235-1238. Processing efficiency of a class of multicomputer systems. pp. KlLLOUOH [1979]. 147-160. OPPE [1988]. pp. M. 151-161. KlGHTLEY AND I. W. Proceedings of the Second Copper Mountain Conference on Multigrid Methods. [1104] P. K. pp. Appl. Comm. McCormick. [1083] J. Meth... in Gary [700]. AIAA. ILUBCG2: A preconditioned biconjugate gradient .. KlNCAID AND T. Sci. 1978 Int. STONE [1973]. Tech. YOUNG [1982].. [1088] D. [1098] J. [1095] D. Nevada. Proc. Algebraic Discrete Methods. Comm. CO. [1106] W. pp. 1975 ACM National Conference. IEEE Trans. on Comp. pp. & Comp. CAREY. pp. Statist. KlNCAID. Reno. 375-378. The Architecture of Pipelined Computers. A companion of conjugate gradient preconditionings for tkree-dimentional problems in a CRAY-1. KlLLOUOH [1986]. [1099] J. S. Appl. A parallel algorithm for the efficient solution of a general class of recurrence equations. YOUNG [1984]. MlERENDORFF [1986]. PIPPENGER [1985]. THOMPSON [1987]. NASA Langley Research Center. Rice University. A multi-level domain decomposition algorithm suitable for the solution of three-dimensional elliptic partial differential equations. 8. 5(2). SPE Symposium Resevoir Simulation. KlNCAID. C-22. 169-200. New York. KlNCAID AND T. Meth. Conf. [1097] J. Numer.). Vectorized iterative methods for partial differential equations. OPPE [1983]. pp. Phys. Hampton. Symp. VA.. Proc. First Ann. Eckman. A performance analysis of the PASLIB version 2. 
The use of vector processors in reservoir simulation. SIAM J. System balance analysis for vector computers. SIAM J. 99-112. Combining finite element and iterative methods for solving partial differential equations on advanced computer architectures. Appl. pp. pp. Comput. Parallel solution of recurrence problems. [1102] P. Analysis of a parallelized nonlinear elliptic boundary value problem solver with application to reacting flows. OPPE. Comput. KOGGE AND H. A hybrid explicit-implicit numerical algorithm for the three-dimensional compressible Navier-Stokes equations. SIAM J. Some graph coloring theorems with application to generalized connection networks. Comput. in Control Data Corporation [411]. [1091] D. pp. 163-168. [1094] D. Copper Mountain. ICASE. pp. Mini Microprocessors. J. 8. 37. Report 87-21. OPPE. Tech. A parallel algorithm for the general LU factorization. 786-793. pp. SMOOKE [1987]. On the design of a special purpose scientific programming language. [1084] J. Int. 71-76. KONIGES AND D. Proc. Comput. 6. 4. 18. 205-214. KIMURA [1979]. YOUNG [1984]. Arch. 893-907. KINCAID. in Jesshope and Hockney [976]. pp. sl66-s202. Adapting iterative algorithms for solving large sparse linear systems for efficient use of the CDC CYBER 205. [1103] P. 7. (Special Issue. 701-715. Vector computations for sparse linear systems. pp. [1089] D.
[1085] J. DUNLOP [1983]. Numerical Methods. G. 789-796. pp. [1100] R. pp. 28-33. 18-23. Adapting ITPACK routines for use on a vector computer. eds. KNIGHT [1983]. AND D. 893-907. Exp. JONES [1985]. sl66-s202. Dold and B. [1093] D. pp. pp. KINCAID. in Vichnevetsky and Stepleman [1920]. ed. KUZNIA [1978].. [1082] D. T.. Adapting iterative algorithms for solving large sparse linear systems for efficient use of the CDC CYBER 205.. [1103] P. 7. (Special Issue. 701-715. Vector computations for sparse linear systems. pp. [1089] D. Denver. NASA Langley Research Center.

AND S. Parallel supercomputing today and the Cedar approach. G. J. Information Processing '71. Computational Complexity of Sequential and Parallel Algorithms. New York. D. [1782]. KOWALIK. D. 15. NASA Langley Research Center. M. D. J. pp. pp. pp. S. 47-52. KUCK. in Miklosko and Kotov [1363]. D. NY. D. J. Automatic construction of parallel programs. Phys. LAWRIE. Physics Today. D. SAMEH. HAN. Vectorised finite-element stiffness generation: Tuning the Noor-Lambiotte algorithm. KRATZ [1984]. Vectorized and multitasked software packages for solving nonsymmetric matrix equations. 29-59. WOLFE [1984]. VOIGT AND C. KOWALIK [1983]. G. 109-141. John Wiley and Sons. vol. KRIST AND T. 1091-1098. KONIGES AND D. KRUSKAL [1983]. Y. KUCK [1978]. KRASKA. ANDERSON [1987]. Sagamore Conf. Comput. Algorithm implementation on the Navier-Stokes computer. D. D. 1266-1272. H. Berlin. V. C. An efficient parallel block conjugate gradient method for linear equations. D. Comm. KRUSKAL [1984]. McGRAW. 33. A debate: Retire FORTRAN?. KRUSKAL AND M. Math. pp. 967-974. Proc. Optimized matrix solution packages for use in plasma physics codes. pp. Comp. Computing the fast Fourier transform on a vector computer. Advances in Computers. 97-123. KUCK. D. SAMEH. AND M. KOWALIK [1985]. Parallel MIMD Computation: HEP Supercomputer and Its Applications. M. MIT Press. KOTOV [1984]. pp. 977-992. A survey of parallel machine organization and programming. ZANG [1987]. C-32(12). KUMAR [1982]. West Germany. 349-354. POLYCHRONOPOULOS. CA. LAWRIE. pp. KUMAR [1984]. pp. 257-276. VALKOUSKn [1984]. KUCK. Parallel computation of eigenvalues of real matrices. A. D. GAJSKI [1986]. pp. 65-107. Proc. Numerical weather forecast with the multi-microprocessor system SMS201. Cambridge. NY. J. KOWALIK. SNIR [1983]. Parallel Computing. in Noor [1450]. pp. 119-179. Cedar project. VEIDENBAUM. AND C. SAMEH. pp. 23-36. C. C-32(10). KUCK AND A. The Structure of Computers and Computation. in Kowalik [1116]. GAJSKI [1984]. pp. 229-244. NATO ASI Series. Design and performance of algorithms for MIMD parallel computers. D. SAMEH [1972]. Some aspects of using vector computers for finite element analyses. pp. DAVIS. San Diego. LAWRIE. New York. ROMINE routine for the solution of linear nonsymmetric matrix equations arising from 9-point discretizations. R. J. Tech. D. A. in Sharp et al. LEASURE. DAVIES. STREHENDT. DAVIDSON. The effects of program restructuring algorithm change and architecture choice on program performance. Annual Controlled Fusion Theory Conference. Academic Press. TOWLE [1973]. in Kowalik [1116]. C. Paper 2D12.
NATO ASI Series. NY. P. KOWALIK. KRUSKAL AND M. KuCK AND A. Parallel Computing. NY. J. The Structure of Computers and Computation. in Noor [1450]. IEEE Trans. 119-179. SAMEH [1972]. AND C. 229-244. Parallel computation of eigenvalues of real matrices. A. Some aspects of using vector computers for finite element analyses. pp. DAVIS. 23-36. C-32(10). KUCK. D. KRATZ [1984]. H. D. San Diego.. Design and performance of algorithms for MIMD parallel computers. in Kowalik [1116]. LAWRIE. New York. ROMINE routine /or Me tolvtion of linear asymmetric matrix equations arising from 9-point discretizations. R. R. J. Tech. D. A. in Sharp et al. LEASURE. DAVIES. STREHENDT. DAVIDSON. The effects of program restructuring algorithm change and architecture choice on program performance. Annual Controlled Fusion Theory Conference. Academic Press. TOWLE [1973]. in Kowalik [1116]. C. Paper 2D12. [1109] [1110] [1111] [1112] [1113] [1114] [1115] [1116] [1117] [1118] [1119] [1120] [1121] [1122] [1123] [1124] [1125] [1126] [1127] [1128] [1129] [1130] [1131] [1132] [1133] [1134] [1135] [1136] . ACM Computing Surveys. AND R. CHEN. E. KONIOBS AND D. ANDERSON [1987]. 121-132. 1982 Int. J. J. KuCK [1976]. in Miklosko and Kotov [1363]. M. Academic Press. Proc. Science. vol. D. 1984 Int. in Vichnevetsky and Stepleman [1921]. Conf. Hampton. p. CYTRON. 118. MA. KUCK. 942-946. S. High Speed Computer and Algorithm Organization. AND A. KUCK. Searching. Parallel processing of ordinary programs. KRONSJO [1986]. pp. R. 297-. KOTOV AND V. KUCK [1977]. Formal models of parallel computations. D. AND A. L.. 129-138. KOPP [1977]. LAWRIE. LAMBIOTTE [1979]. 49-54. pp. MURAOKA. KOWALIK AND S. KORN AND J. eds. Preliminary experience with multiple-instruction multiple data computation. LEE. 265-268. 66-75. 43. McDANIEL. pp. Parallel Processing. P. CYTRON. pp. V. Comput. 1. pp. A. Par. IEEE Trans. in Peilmeier [621]. pp. Proc. AND D. pp. VA. BUDNICK. [1977]. J. Par. D. 37(5). 231. New York. pp. KUCK AND D. Proc. 
Parallel processing of sparse structures. Measurement of parallelism in ordinary Fortran programs. The performance of multistage interconnection networks for multiprocessors. in Feilmeier et al. BECK MAN. 9. Report NASA-TM-89119. SAMEH [1986]. R.

Comput. GAL-EZAR [1985]. in Kuck et al. STOKES [1982]. KUMAR. Conf... Traub. H. Proceedings of 1985 AIAA Computers in Aerospace V Conference. C-31. Hu [1983]. H. AND D. CHAN [1988]. pp. 80192. AND G. 363-376. Computer Science Press. 3. PADUA [1981]. KUNG [1984]. ARUN. 37-46. Sparse Matrix Proceedings (1978). Presented at the IBM Symposium on Vector Computers and Scientific Computing. Symp. pp. AIAA. Speech and Signal Processing. Washington State University. GAL-EZAR [1982]. H. Very Large Scale Integration. on Mini. S. C. S. Comput. KUNG AND Y. DRUMMOND. AND K. Lo. Yu [1982]. S. UCLA. IEEE Trans. 410-416. Systolic algorithms. IEEE. pp. in Control Data Corporation [411]. S. 256-282. Los Angeles. in Parter [1529]. [1133]. eds. Parallel Processing. 65-90. S. J. KUCK AND R. H. Rome. KUNG. Computer. 209-218. GRAVES. 423433. KUMAR AND J. ASSP31. Proc. Par. H. I. in Snyderet al. Fast and parallel algorithms for solving Toeplitz systems. KUNG [1984]. GAL-EZAR. [1987]. WEBB [1985]. 1984 Int. KUNG [1976]. eds. Global operations on the CMU Warp machine. LEISERSON [1979]. Society for Industrial and Applied Mathematics.. KUNG. Comp. WEILMUENSTER [1980]. KUNG AND Y. S. 37(9). may. KUHN AND D. ed.. Stewart. Let's design algorithms for VLSI systems. Kuo AND S. pp. Algorithms and Complexity. Two-color Fourier analysis of iterative algorithms for elliptic problems with red/black ordering.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 167 [1137] D. STEELE. 384-391. California Institute of Technology. H. IEEE Trans. KUNG [1982]. pp. pp. R. pp. S. pp. Youvits. Memo. S. Par. Internal. IEEE Trans. A software technique for reducing the routing time on a parallel computer with a fixed interconnection network. 15(1). KUNG AND S.. 867-884. Proc. KUNG AND R. H. USC Workshop on VLSI and Modern Signal Processing. Triangularization of a positive definite matrix on a parallel computer.. 66-76. S. Academic Press. 127-140. KUMAR [1982]. 
Synchronized and asynchronous parallel algorithms for multi-processors. Systolic arrays (for VLSI). 18-33. K. KUNG AND J. [1138] R. Wavefront array processor. pp. Eigenvalue. pp. pp. Acoustics. NASA Tech. KUNG AND C. KOWALIK [1986]. pp. [1981]. MD. Proc. A highly concurrent algorithm and pipelined architecture for solving Toeplitz systems. singular value and least squares solvers via the wavefront array processor. H. Rockville. Comput. VLSI Systems and Computations. inKartashev and Kartashev [1055]. 201-212. Proc. On supercomputing with systolic/wavefront array processors. H. Solving linear recurrences on pipelined computers. Dist.and Micro-computers in Control and Measurement.-C. M. Wavefront array processors — Concept to implementation. KUMAR [1986]. 1088-1098. 65-112. 10541066. October. The Burroughs Scientific Processor (BSP). 72. User's guide for vectorized code EQUIL for calculating equilibrium chemistry on Control Data STAR-100 computer. R. Parallel Algorithms for Solving Linear Equations on MIMD Computers. IEEE Trans. HWANG [1987]. ed. J. AND J. JEAN. C-31. H. KUMAR [1988]. KUMAR. KOWALIK [1984]. 153-200. KUNG AND D. pp. AND J. J. language and application. KUMAR AND J. IEEE Computer Society Press.. Academic Press. Computer. S. KUNG [1979]. Proc. Proc. S. San Francisco. R. [1140] A. 450-460. PhD dissertation. The structure of parallel algorithms. BHASKARRAO [1982]. [1808]. pp. pp. Department of Computer Science. 20(7). KUNG. Parallel factorization of a positive definite matrix on an MIMD computer. Hu [1981]. M. KUNKEL AND S. Department of Mathematics. 19. Measuring parallelism in computation-intensive scientific/engineering applications. S. pp. pp. HARRIS [1982]. Integrating high-performance special-purpose devices into a system. 89-98. Conf. New York. D. KUNG [1980]. Advances in Computers. pp. CAM Report 88-15. RUDY. pp.-C. 163-168. vol. STEVENSON [1977]. Why systolic architectures?. pp. [1139] A. Duff and G. NASA Langley Research Center. 
Experiences with explicit finite difference schemes for complex fluid dynamics problems on STAR-100 and CYBER 203 computers. J. Kuo AND T. SPROULL. Linear or sparse array for eigenvalue and singular value decompositions?. Architecture. H. GRAVES. AND K. Lo. Yu [1982]. S. UCLA. IEEE Trans. 410-416. Systolic algorithms. IEEE. pp. in Control Data Corporation [411]. S. 256-282. Los Angeles. in Parter [1529]. [1133]. eds. Parallel Processing. 65-90. S. J. KUCK AND R. Hu [1983]. H. AND D. CHAN [1988]. pp. 80192. AND G. 363-376. Computer Science Press. 3. PADUA [1981]. KUNG [1984]. ARUN. 37-46. Sparse Matrix Proceedings (1978). Presented at the IBM Symposium on Vector Computers and Scientific Computing. Speech and Signal Processing. Washington State University. GAL-EZAR [1982]. H. Very Large Scale Integration. on Mini. S. C. S. Comput. KUNG AND Y. DRUMMOND. AND K. Parallel Processing.

The Solution of Linear Systems of Equations on a Vector Computer.. pp. Vectorization on the STAR computer of several numerical methods for a fluid flow problem. Proc. 227-242. KRUSKAL. ROY [1983]. LARSON [1984]. LAWRIE [1975]. Proc. Softw. Efficient sparse matrix multiplication scheme for the CYBER 203. [1172] J. IEEE Trans. 319. LEBLANC. [1185] D. AND D. pp. BABB [1987]. Oxford University Press. pp. Math. Access and alignment of data in an array processor. LlDDBLL [1987]. D. LAMBIOTTE AND R. Computers and Structures. Tech. SIGPLAN Notices. Adaptation of a large-scale computational chemistry program for the Intel iPSC concurrent computer. Multitasking on the CRAY X-MP-2 multiprocessor. Report UCID-18619. LAKSHMTVARAHAN AND S. pp. Comput. PhD dissertation. Systolic systems for finite element methods. H. 62-69. LAWRIE AND A. Lawrence Livermore National Laboratory. Math.. Carnegie-Mellon University. 550-573. RANDALL [1975]. R. An empirical study of automatic restructuring of nonnumerical programs for parallel processors. [1133]. p. August. C-24. Int. DHALL [1986]. 12. LEE [1980]. LAWRIE. BAER. A review of parallel finite element methods on the DAP. LARRABEE AND R. [1167] A. Center for Supercomputing Research and Development. 23(9). in Gary [700]. The multiprocessor modified Pease FFT algorithm. 927933. University of Illinois at UrbanaChampaign. The University of Virginia. [1174] J. J. 17(7). 243-256. pp. pp. Cornput. ACM Trans. Math. on Vector and Parallel Methods. pp. LECA AND P. pp. SLAM J. G. 10. Lawrence Livermore Lab.. [1175] J. Simul. pp. BROWN [1988]. [1189] P. Tech. University of Rochester. Mod. Report NASA TN D-7545. Appl. [1178] A. pp. Tech. AND B. January. IEEE Trans.. Conf. SPITERIC [1986]. Department of Civil Engineering. C-34. Comm. KUCK [1985]. C. AND C. 464-472. 18. 17. in Heath [860]. A new hierarchy of hypercube interconnection schemes for parallel computers: Theory and applications. [1181] LAWRENCE LIVERMORE NATIONAL LABORATORY [1979]. [1184] D. 
Report.. 560-568. Applications of structural mechanics on large-scale multiprocessor computers. A performance analysis of architectural scalability. pp. 161-172. LAMBIOTTE [1984]. Compt. Tech. [1169] S. pp. Coll. 8. Numer. 330-341. Paris. The computation and communication complexity of a parallel banded system solver. B. LAMBIOTTE [1975]. MUSKUS [1987]. Shared memory versus message passing in a tightly coupled multiprocessor: A case study. SAMEH [1983]. ORTEGA. LAMBIOTTE AND L. ACM. Large-scale parallel programming: Experience with the BBN Butterfly parallel processor. Glypnir — A programming language for Illiac IV. AND J. 308-329..168 J. VOIGT [1975]. T. Tech. LEVY. pp. pp. Report 679. VOIGT AND C. Supercomputers and Their Use. LAMPORT [1974].. ROMINE ers. pp. [1173] J. 11. [1177] B. ACM Trans. in Noor [1450]. M. LAI AND H. LANG. LEBLANC [1986]. First. AND P. [1166] A. Kuo. 83-93. [1190] G. KWOK [1986]. LAMBIOTTE [1979]. [1179] J. SCOTT. The solution of tridiagonal linear systems on the CDC STAR-100 computer. Tech. SAMEH [1984]. LAWRIE AND A. Par. Comm. 157-164. The development of a STAR-100 code to perform a 2-D FFT. Three-dimensional finite element analysis of layered fiber-reinforced composite materials. Department of Computer Science. LAKSHMTVARAHAN AND S. [1168] C. . in Kucket al. August. [1191] J. LEE. Conf. Statist. 55-64. A local relaxation method for solving elliptic PDEs on mesh connected array*. 1. The S-l project. pp. Sci. Performance bounds in parallel processor organizations. 441-443. 453-455. University of Illinois at UrbanaChampaign.. HOWSER [1974]. NASA Langley Research Center. Simulation numerique de la turbulence sur un systeme multiprocessor. [1182] D. Tech. [1171] J. [1188] T. 185-195. 1145-1155. LAZOU [1987]. [1176] L. The parallel execution of DO loops. Comput. Sci. ACM. Report. University of Oklahoma. LEE [1977]. 28. [1170] S. [1183] D. Proc. 1986 Int. LAW [1982]. [1165] J. [1180] K. [1186] C. MIELLOU. Softw. Computer. 
A lower bound on the communication complexity in solving linear tridiagonal systems on cube architectures. Center for Super-computing Research and Development. pp. [1187] T. LAYMAN. in Heath [860]. Report R-82-139. [1192] R. Asynchronous relaxation algorithms for optimal control problems. KWOK [1987]. M. Comp. DHALL [1987]. Report. Department of Applied Mathematics and Computer Science.

Comput. IEEE Trans. PARKINSON. Report 119. Gesellschaft fur Mathematik und Datenverarbeitung. LEOBNDI. LEWIS [1988].. EVANS [1988]. Comput. in Parter [1529]. Parallel Computing (to appear). 26. March. 343-359. 485-502. Norfolk. Suitability of a data flow architecture for problems involving simple operations on large arrays. WAH [1985]. Arbeitspapiere der GMD. CA. Computer-Aided Design of Integrated Circuits and Systems. WOLF. 304-311. pp. . LEISERSON [1985].. Memory Access Patterns in Vector Computers with Application to Problems in Linear Algebra. Sci.. Li AND B. pp. COLEMAN [1988]. [1210] J.. Orderingt for parallel tparte symmetric factorization. CoLEMAN [1987]. LEVIN [1985]. Tech. LEISBRSON [1985]. A survey of problems and preliminary results concerning parallel processing and parallel processors. A Guidebook to Fortran on Supercomputers. [1209] J. 54. [1207] R. in Kowalik [1116]. 77. 9. Statist. SIAM J. 241-258. Proc. The impact of hardware gather/scatter on sparse Gaussian elimination. [1202] M. [1218] A. [1220] D. LlCHNEWSKY [1983].A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 169 [1193] T. pp. LEUZE [1986]. St. IEEE Trans. INRIA. LlCHNEWSKY [1984]. Comput. A parallel triangular solver for a hypercube multiprocessor. PhD dissertation. On minimum parallel computing times for Gaussian elimination. 366-368. A new method for solving triangular systems on distributed memory message-passing multiprocessors. Comput. Department of Computer Science. Report ETA-TR-85. [1205] J. [1212] G. Ll-SHAN AND D. 393-402. 23. Conf. 1985 Int. LlM AND R. 40. [1201] M. pp. LEUZE [1981]. [1200] J. LENSTRA AND A. 539-551. 118-135. pp. Proc. Int. Par. Proc. Li AND T. LlLES. Presented at the SIAM Conference on Parallel Processing for Scientific Computing. Proc. J. Sci. 518-520. [1221] D. A fast implementation of the Jess and Kees algorithm. RJNNOOY KAN [1978].. A survey of ADI implementations on hypercubes. Tech. SIMON [1988]. Proc. Proc... pp. Boeing Computer Services. 
Parallel triangularization of substructured finite element problems. [1216] K. EVANS [1988]. Comput. in Parter [1529]. Parallel Computing (to appear). 26. March. 343-359. 485-502. Norfolk. Suitability of a data flow architecture for problems involving simple operations on large arrays. WAH [1985]. Arbeitspapiere der GMD. LEGENDI. LEWIS [1988]. Comput. Computer-Aided Design of Integrated Circuits and Systems. 304-311. pp. LEISERSON [1985]. Memory Access Patterns in Vector Computers with Application to Problems in Linear Algebra. Sci. Li AND B. pp. COLEMAN [1988]. [1210] J. Orderings for parallel sparse symmetric factorization. CoLEMAN [1987]. A Guidebook to Fortran on Supercomputers. [1209] J. 77. 9. Statist. SIAM J. 241-258. Proc. The impact of hardware gather/scatter on sparse Gaussian elimination. [1202] M. [1218] A. [1220] D. LEVIN [1985]. Tech. 54. [1207] R. in Kowalik [1116]. LICHNEWSKY [1983]. CA. A parallel triangular solver for a hypercube multiprocessor. PhD dissertation. On minimum parallel computing times for Gaussian elimination. 366-368. Department of Computer Science. Report ETA-TR-85. [1205] J. [1212] G. A new method for solving triangular systems on distributed memory message-passing multiprocessors. Comput. Report ETA-TR-90. pp. 23. LEUZE [1986]. Comput. 40. [1201] M. LEUZE [1981]. [1200] J. LENSTRA AND A. RINNOOY KAN [1978]. 539-551. 118-135. pp. Proc. Li AND T. A fast implementation of the Jess and Kees algorithm. Int. Par. Proc. A survey of ADI implementations on hypercubes. Tech. SIMON [1988]. Proc. Boeing Computer Services. [1205] J. Sur la resolution de systems liniaires issus de la methode des elements finis par une machine multiprocesseurs. The design of optimal systolic algorithms. Cornell University. Fat-trees: Universal networks for hardware-efficient supercomputing. Congressus Numerantium. Duke University. Fat-trees: Universal networks for hardware-efficient supercomputing. Cray X-MP and Fujitsu VP 200. [1205] J. [1212] G. LILES. Presented at the SIAM Conference on Parallel Processing for Scientific Computing. Proc. Lin. Math. [1201] M. Li AND T. 1985 Int. LICHNEWSKY [1984]. Comput. 1889-1901. 892-901. pp. IEEE Trans. LEUZE AND L. 179. LEWIS AND H. Some vector and parallel implementations for preconditioned conjugate gradient algorithms. [1216] K. Experiments with a vectorized multigrid Poisson solver on the CDC CYBER 205. Par. The wavefront relaxation method for time-domain analysis of large scale integrated circuits. [1216] K. LELARASMEE. SIAM J. Some vector and parallel implementations for linear systems arising in PDE problems. [1204] M. Sci. The convergence rate of the Schwartz alternating procedure (V) — for more than two subdomains. D. Systolic processing for dynamic programming problems. 131-145. RUEHLI. CAD-1. A parallel triangular solver for a distributed-memory multiprocessor. Conf. [1208] J. LEWIS AND H. pp. Lin. & Appl. Amer. pp. [1215] G. LEHMAN [1966]. GIGUERE [1984]. C-34. [1211] G. Alg. 1985 Int. [1199] M. CoLEMAN [1987]. Some graph coloring theorems. 674-679. [1198] E. pp. [1197] C. 1985 Int. Li AND B. pp. [1204] M. LEMKE [1985]. Complexity of scheduling under precedence constraints. San Diego. C-34. 141-160. Parallel Processing by Cellular Automata and Arrays. An approach to fluid mechanics calculations on serial and parallel computer architectures. 1986 Int. November. pp. Tech. R. Independent set orderings for parallel Gaussian elimination. Statist. The impact of hardware gather/scatter on sparse Gaussian elimination. pp. 295-314. Proc. Par. VA. Li AND T. pp. LEVINE [1982]. Academic Press. D. SANGIOVANNI-VINCENTELLI [1982]. IEEE. Oper. Res. SAXTON [1983]. pp. August in. WILLIAMSON [1989]. Supercomputers. North-Holland. SIMON [1986]. pp. 22-35. 169-179. in Heath [860]. AND P. VOLLMAN. LEISERSON AND J. Parallel Processing by Cellular Automata and Arrays. LEUZE [1988]. eds. WAH [1985]. Conf. [1213] G. 246. May. [1214] G. Tech. Report TR 87-812. [1217] A. A. pp. pp. LEWIS AND B. [1203] M. MAHAFFY. [1198] E. pp. [1197] C. Boeing Computer Services. PEYTON [1988]. in Heath [860]. LICHNEWSKY [1982]. LEVESQUE AND J. [1219] A. [1221] D. 434-441. Department of Computer Science. [1206] E. in Heath [860]. [1986]. 66-77. Par. Conf. INRIA. 518-520. 9. Sci. pp. Proc. 393-402.
141-160. Parallel Processing by Cellular Automata and Array$. An approach to fluid mechanics calculations on serial and parallel computer architectures. 1986 Int. November. pp. Tech. R... Independent set orderings for parallel Gaussian elimination. Statist. The impact of hardware gather/scatter on sparse Gaussian elimination. pp. WAH [1985]. Conf. [1213] G. 295-314. Proc. Par. VA. Li AND T. pp. LEVINE [1982]. Academic Press. The convergence rate of the Schwartz alternating procedure (V) — for more than two subdomains. D..

Adapting scientific programs to the MIDAS multiprocessor system. ACM. 10. Solving elliptic boundary value problems on parallel processors by approximate inverse matrix semi-direct methods based on the multiple explicit Jacobi iteration. Comput. MALEK [1987]. AND A. SMITHERS [1985]. pp. D. Par. [1248] M. LORD. 1980 Int. Parallel Computing. University of Illinois at Urbana-Champaign. pp. Conf. pp. [1243] H. J. TRIPATHI [1977]. Liu AND A. MODI. pp. AGERWALA [1981]. 15-24. Communication issues in the design and analysis of parallel algorithms. 30. January. Department of Computer Science.. pp. Liu [1986]. Tech. pp. M. The symmetric eigenvalue problem on a multiprocessor. LlNT AND T. Comput. NJ. New York. Computer.. ACM. [1238] S. sl55-s!65. UK. Comp. IEEE Trans. [1240] D. 1984 Int. Scheduling algorithms for multiprogramming in a hard-realtime environment. Comp. Report CS-87-01. 1977 Int. [1247] H. R. EVANS [1987]. [1808]. York University. pp. [1227] E. KOWALIK. Report 590. A reconfigurable varistructure array processor. [1234] J. LUBACHEVSKY AND D. 327-342.248. [1245] R. Lo AND B.. pp. Proc. Explicit semi-direct methods based on approximate inverse matrix techniques for solving boundary value problems on parallel processors. Canada. Eng. Canada. Englewood Cuffs.. LINCOLN [1983]. Ontario. Math. Proc. LlPlTAKIS AND D. 205-210. [1226] B. SAMEH [1987]. 38-47. 165-174. pp.-C. G. Parallel Computing. 143-166. pp. [1231] J. Waterloo University. Statist. C. Liu AND J. RAGSDELL [1988]. Proc. [1223] T. RATHBUN [1984].170 J. Prentice-Hall. ORTEGA. Proc. Conf. Development and use of an asynchronous MIMD computer for finite element analysis. Liu [1987]. [1225] N. 20. pp. pp. [1246] R. LoMAX [1981]. April. 29. Cambridge. Report CUED/F-CAMS/TR. [1241] D. LOOTSMA AND K. in Kartashev and Kartashev [1055]. Center for Super-computing Research and Development. [1230] G. MlRZAIAN [1987]. Solving linear algebraic equations on a MIMD computer. AIAA Comp. 
Cambridge University Engineering Department. Conf. 11(8). Proc. 133-156. H. Tradeoffs in mapping algorithms to array processors. LINCOLN [1982]. LlN [1987]. NY. 16(5). Parallel Computing. Tech. LlPlTAKIS [1984]. KUMAR [1980]. Proc. Ontario. pp. Reordering sparse matrices for parallel elimination. AND T. Conference. Tech. State-of-the-art in parallel nonlinear optimization. Supercomputers = colossal computations + enormous expectations + renowned risk. MlTRA [1984]. Report CS-87-02. Report CS-78-19. J. [1236] J. [1224] N. LORD. 217-250. LOGAN. 6. [1232] C. [1233] J. Some prospects for the future of computational fluid dynamics. Softw. 719-726.-S. KUMAR [1983]. J.. Computational models and task scheduling for parallel sparse Cholesky factorization.. Chaotic parallel computations of fixed points of nonnegative matrices of unit spectral radius. A fully implicit factored code for computing three dimensional flows on the Illiac IV. AND S. February. Tech. [1229] G. pp. pp. LlVESLBY. AND S. SE-7. 103-117. LlPOVSKI AND M. The use of parallel computation for finite element calculations. LOUTER-NOOL [1987]. Department of Computer Science. WEAVER.. LlPOVSKI AND A. Fluid Dyn. Proc. SIAM J. pp. Parallel Computing. LIN AND D. 3. Appl. 171-184. Conf. Par. Sci. J. 54-67. Conf.. 46-61. LORIN [1972]. 4. Computer. [1244] F. A multiprocessor algorithm for the symmetric tridiagonal eigenvalue problem. KOWALIK. . 109116. [1239] S. pp. PHILLIPPE [1986]. 213-222.. LlPOVSKI AND K. SimuL. 8. Par. pp. 1-18. in Rodrigue [1643]. Par. MAPLES. A linear reordering algorithm for parallel pivoting of chordal graphs. DOTY [1978]. Technology and design tradeoffs in the creation of a modern supercomputer.-S. PULLIAM [1982]. VOIGT AND C. [1249] B. Lo. ROMINE [1222] A. PHILLIPPE. Developments and directions in computer architecture. LOMAX AND T. IEEE Trans. Department of Computer Science. Liu [1978]. 497-502.. Par. AND W. Proc. Inc. C-31. 349-362. John Wiley and Sons. [1235] J. J. in Snyder et al. pp. 
Solving linear algebraic equations on an MIMD computer. LAYLAND [1973]. Basic linear algebra subprograms (BLAS) on the CDC CYBER 205. The solution of mesh equations on a parallel computer. Parallelism in Hardware and Software. June. [1228] E. Proc. B. York University. 1985 Int. 174-188. MOLDORAN [1985]. Tech. pp. & Math. Proc. [1237] R. 1984 Int. [1242] H. Parallel and supercomputing of elliptic problems. LOENDORF [1985].

Hitachi S810/20. A. HAGER [1988]. [1268] S.. LUNDSTROM AND G. Par. Computing the singular value decomposition on the Illiac IV. pp. Cornell University. A benchmark comparison of three supercomputers: Fujitsu VP-200. 130-150. Cornell University. Par. [1255] B. 19-27. Department of Mathematical Sciences. SIAM J. s203-s219. 524-539. MOORE. PhD dissertation. LUK [1986]. FABER [1987]. Par. Cornell University. Department of Electrical and Computer Engineering. LUNDQUIST [1987]. Tech. A performance evaluation of three supercomputers. To appear in Proc. pp. LUCAS [1987]. 250-260. Solving Problems on Concurrent Processors. Cornell University. [1256] F. pp. Proc. Computing the C-S decomposition on systolic arrays. Fujitsu VP-200. 1980 Int. J. [1263] F. ACM. LUK [1985]. Fault-tolerant matrix triangularization on systolic arrays. Proc. Volume II: Scientific and Engineering Applications. A parallel adaptive numerical scheme for hyperbolic systems of conservation laws. 1292-1309. Report URI-044. Tech. Fluid dynamics applications of the ILLIAC IV computer. 143-144.. Statist. OVERBEEK [1987]. LUK [1985]. Tech. Department of Electrical and Computer Engineering. LUK AND S. pp. 75-1. Department of Electrical and Computer Engineering. LYZENGA. AND G. BARNES [1980]. 6. MACE AND R. LUK [1986]. Conf. Solving Planar Systems of Equations on Distributed-Memory Multiprocessors. Department of Electrical Engineering. Department of Electrical and Computer Engineering. Applications considerations in the system design of highly concurrent multiprocessors. Proc. AND R. OVERBEEK [1983].. To be published. Conf. Computer. LUSK AND R. SPIE. Finite elements and the method of conjugate gradient on concurrent processors. A controllable MIMD architecture. 564. 614: Highly Parallel Signal Processing Architectures. LUK [1986]. 10-24. AND R. [1267] S. PARK [1986]. QIAO [1986]. Clemson University. NJ. 77. Math. . 33. Tech. [1273] M. 7. RAEFSKY. SPIE vol. [1262] F. 452-459. 259-273. LUCIER AND R. Comput. Comp. LUBECK. 18(12). Report ANL-83-97. [1254] R. Academic Press. ASME Int. Lin. 20. & Appl. Comput. CRAY X-MP/24. LYZENGA. Report LA-UR-87-1522. RAEFSKY. Report EE-CEG-86-7. Statist. December. Computers in Engineering. A. Algorithm-based fault tolerance for parallel matrix equation solvers. Math. C-36. AND B. MACCORMACK AND K. MENDEZ [1985]. [1259] F.. LUK [1980]. RODRIGUE [1975]. Department of Electrical and Computer Engineering. MADSEN AND G. 401-406. [1269] E. J. LUBECK AND V.. A triangular processor array for computing singular values. ACM Trans.. vol. 8. Appl. Report EE-CEG-86-5. Globally optimum selection of memory storage patterns. MOORE. LUNDSTROM [1987]. Comput. [1251] O. Two notes on algorithm design for the CDC STAR-100. Real Time Signal Processing VIII. & Comp. Fox and G. LUK AND H. Analysis of a recursive least squares signal processing algorithm. pp. QIAO [1986]. Tech. Finite elements and the method of conjugate gradients on a concurrent processor. Cornell University. pp. Prentice-Hall. December. pp. Tech. Alg. pp. Computational Methods and Problems in Aeronautical Fluid Dynamics. Report EE-CEG-86-1. Sci. 2. J. 448-465. LYZENGA. pp. pp. SIAM J. [1274] N. Report EE-CEG-85-2. [1264] F. pp. 264-271. 1121-1125. [1257] F. Stanford University. [1270] G. MENDEZ [1986]. [1258] F. To appear in Proc. A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 171 [1250] B. Tech. A rotation method for computing the QR-decomposition. Comput. [1265] F. Englewood Cliffs. Statist. [1261] F. 7. Sci. Implementation of monitors with macros: A programming aid for the HEP and other parallel processors. LUK [1986]. [1260] F. IEEE Trans... [1266] M. 
Los Alamos National Laboratory. [1253] O. Memo. pp. LUBACHEVSKY AND D. MITRA [1986]. Tech. Report EE-CEG-86-2.. On parallel Jacobi orderings. J. pp. STEVENS [1976]. Report 87-52. Dist. [1271] G. eds. A chaotic asynchronous algorithm for computing the fixed point of a nonnegative matrix of unit spectral radius. [1252] O. 
Modeling the performance of hypercubes: A case study using the particle-in-cell application. LUK AND S. HAGER [1985]. February. Tech. SIAM J. Proc.. Architectures for computing eigenvalues and SVDs. Argonne National Laboratory.. A chaotic asynchronous algorithm for computing the fixed point of a nonnegative matrix of unit spectral radius. Conf. A parallel method for computing the generalized singular value decomposition. 1985 Int. An implementation of the preconditioned conjugate gradient algorithm on the FPS T-20 hypercube. LUBECK. G. WAGNER [1985]. [1272] R. Hitachi S810/20 and CRAY X-MP/2.

[1293] H. Par. S. [1284] C. MAPLES. BOULDIN. AND W. PARC AS [1982]. Report TR-683. pp. 514-519. 586 processors. 87-104. Department of Computer Science. Synchronization analysis of parallel algorithms. M. A distributed implementation method for parallel programming. vol. T. Solving partial differential equations on a cellular tree machine. [1300] O. [1277] N.. American Mathematical Society. Lectures in Applied Mathematics. Numerical computation on massively parallel hypercubes. MADSEN AND G. LOGAN [1984].. [1279] G. pp. WEAVER. RODRIGUE. Proc. pp. Conf. in Feilmeier [621]. R. Comp. Purdue University. Supercomputers: Algorithms. in Heath [860]. 179-187. [1289] D. 5. MARINESCU AND J. A cellular computer architecture for functional programming. AND A. RODRIGUE [1977]. PEPE. Conf. WEAVER. Information Processing 80. MADSEN. Proc. Pyramids. [1287] D. [1301] O. University of Illinois at Urbana-Champaign. University of Texas Press. Proc. Preliminary results on multiprocessor modeling and analysis using stochastic. 8. D. [1299] O. Lett. [1280] G. Proc. MAGO [1980]. 101-112. [1298] O. 681-688. MAGO AND R. [1286] C. The Connection Machine: PDE solution on 65,536 processors. Architectures and Scientific Computation. 1985 Int. D. RATHBUN [1983]. J. Control and Computing.-C. eds. Presented at the Twenty-Fourth Allerton Conference on Communication. MANDELL [1987]. Array processing of the 3-dimensional Navier-Stokes equations.. 1987 Int. Information Sciences. in Kuck et al. Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing. Proc. AND T. WAN. Performance of a modular interactive data analysis system (MIDAS). McBRYAN [1986]. Int. pp. 
Numerical computation on massively parallel hypercubes. high level Petri nets. Tech. June. pp.. [1278] G. Proc. [1291] D. MAGO [1979]. pp. Matrix multiplication by diagonals on a vector/parallel processor. pp. [1276] N. Odd-even reduction for pentadiagonal matrices. [1295] W. Proc. COMPCON Spring. J. Proc. Par. December. [1290] D. Soc. MADSEN AND G. 1. [1288] D. MARINESCU AND J. RICE [1987]. Sci. Nonhomogeneous parallel computation I. MAPLES [1985]. pp. vol. pp. 22. The operation and utilization of the MIDAS multiprocessor architecture. POLAND. RICE [1987]. IEEE Comp. Proc. [1296] F. MARTIN AND D. 1983 Int. Amsterdam. Proc. [1282] D. H. KARUSH [1976]. [1281] A. 3-24. Proc. [1285] C. RICE [1987]. 309-314. Supercomputer performance evaluation: Status and directions. 1984 Int. MANHARDT. 349-385 and 435-471. G. A network of microprocessors to execute reduction languages. AND D. 1. Odd-even reduction for pentadiagonal matrices.? Synchronization of nonhomogeneous parallel computations. pp. [1297] O. March. On the effects of synchronization in parallel computing. Proc. Tech. Research Report LA-UR-86-4219. Purdue University. Los Alamos National Laboratory. Preprint UCRL-76993. Report CS-TR-750. Los Alamos National Laboratory. Using supercomputers as attached processors. Proc. Tech. Domain oriented analysis of PDE splitting algorithms. pp. May. AND J. Conf. H. Experiences and results multitasking a hydrodynamics code on global and local memory machines. [1292] A. MARTIN [1980]. Los Alamos National Laboratory. Monte Carlo photon transport on the NCUBE. Report LA-UR-86-4218. 
A discourse on a new supercomputer. Report CSD-TR-683.. Proc. pp. Hackbusch and U. [1294] J. Supercomputing. Computational methods for discontinuities in fluids. MALONY [1986]. 454-463. [1275] N. 368-373. [1286] C. Department of Computer Science. pp. 415-420. A comparison of direct methods for tridiagonal systems on the CDC STAR-100. Matrix multiplication by diagonals on a vector/parallel processor. MARINESCU AND C. 1987. ABDEL-RAHMAN [1987]. Par. LEWIS. Phase I Final Reports. NSF SBIR. LIN [1986].. BAKER [1982]. pp. Lavington. MUDGE. MARTIN [1977]. McBRYAN [1985]. in Heath 

[1988]. Proc. pp. Hackbusch and U. 1928. VAN DE VELDE [1986]. McGLYNN AND L. GEHRINGER [1985]. McBRYAN AND E. Pure. Comp. eds. Language concepts for distributed processing of large arrays. 2. pp. MEIER [1985]. [1319] J. Report 511. [1302] O. Math. Parallel algorithms for elliptic equation solution on the HEP computer. AXELROD [1984]. VAN DE VELDE [1987]. A parallel partition method for solving banded systems of linear equations. [1326] U. [1315] T. Appl. Performance of MSC/NASTRAN on the CRAY computer. Ottawa. Addison-Wesley. 769-795. [1305] O. and Arch. MIT Press. pp. Tech. CONWAY [1979]. [1311] S. McCLELLAND AND D. McDONALD [1980]. 1985 Int. Report CSD-TR-661. CO. 38. [1322] P. Non-linear recurrences and EISPACK. [1327] R. Superlinear speed-up through randomized algorithms. Comput. Sci. 6911. vol. [1309] J. McDANIEL [1985]. Purdue University. [1321] C. Formal analysis of a systolic system for finite element matrices. McGRAW AND T. [1303] O. Multigrid methods for multiprocessor computers. Hypercube algorithms and implementations. Marcel Dekker. s227-s287. Tech. Tech. University of Illinois at Urbana-Champaign. Copper Mountain. McCORMICK [1988]. Finite element computation with parallel VLSI. ed. Multigrid Methods II. Multigrid Methods. 
Numerical linear algebra on the CEDAR multiprocessor. 73-89. Lawrence Livermore Laboratory. [1306] O. McCORMICK [1982]. 311-316. Advanced Alg. [1313] S. Center for Supercomputing Research and Development. [1320] J. VAN DE VELDE [1987]. Hypercube programs for computational fluid dynamics. October. in Heath [858]. 33-43. Lawrence Livermore National Laboratory. Par. The Chebyshev method for solving non-self-adjoint elliptic equations on a vector computer. Elliptic equation algorithms on parallel computers. [1308] O. 1228 of Lecture Notes in Mathematics. Comm. SAMEH [1987]. pp. System Sci. 291-300. Unpublished manuscript under NASA Ames Research Center Contract No. 6911. Parallel and vector problems on the FLEX/32. McCORMICK. The multigrid method on parallel computers. Department of Computer Science. Canada. Numer. Trottenberg. McFADDIN AND J. SPIE. for Signal Processing. pp. 540-553. PA. Phys. Two parallel SOR variants of the Schwarz alternating procedure. McCULLEY AND G. MEAD AND L. SAMEH? MEHROTRA AND T. pp. 35. Report UCRL-91734. [1310] C. McBRYAN AND E. [1314] L. McCORMICK AND G. 706-719. pp. McCULLEY AND G. ZAHER [1974]. Explorations in Parallel Distributed Processing. Reading. McBRYAN AND E. pp. of Symp. Adaptive multilevel algorithms on advanced computers. in McCormick [1312]. pp. Springer-Verlag. University of Houston. Meth. 205-215. Comput. [1304] O. McBRYAN AND E. in Paddon [1512]. pp. 88-98. An abstract systolic model and its application to the design of finite element systems. MEHROTRA AND E. pp. pp. McGREGOR AND M. Heat shield response to conditions of planetary entry computed on the ILLIAC IV. in McCormick [1312]. pp. 
Institute for Computational Mathematics and Applications. [1307] O. McBRYAN AND E. VAN DE VELDE [1986]. pp. Parallel Computing. Parallel algorithms for elliptic equations. Elec. 2. Comm. [1317] H. 1987. RICE [1987]. 1-27. Proc. 5. 3. October. MELHEM [1985]. 31. pp. University of Oklahoma. On making the NAG run faster. [1323] R.

Numer. Comput. 758-761. MELSON [1986]. & Comp. 5. [1342] L.. 4. August. Parallel Computing. Conf. pp. in Kartashev and Kartashev [1055]. Paper 83-0500. S. M.ctv. ed. [1339] R. IEEE Trans. Proc. Multitasking the conjugate gradient method on the CRAY X-MP/48. RHEINBOLDT [1984]. [1338] R. Anal. pp.. Supercomputer Appl. Computers and Structures. [1335] R. Application of data flow concepts to a multigrid solver for the Euler equations.. Par. Comput. 231-249. Proceedings of the Second Copper Mountain Conference on Multigrid Methods.). Proc. Proc. Proc. 2.. Benchmark on Japanese-American supercomputers — Preliminary results. MELHEM [1987].. [1341] R. An efficient implementation of the SSOR/PCCG method on vector computers. Comp. SYMES [1987]. Rice University. pp. [1344] N. Toward efficient implementations of PCCG methods on vector supercomputers. MERRIAM [1985]. 623-633. 4. 6. 70-98. pp. J. 117-138. MEZO AND W. 3. pp. C-36. Comput. 19(1-4). Parallel Computing. Appl. MEURANT [1988]. On the design of a pipelined/systolic finite element system. pp. 117-132. 1. G. 470-477. 19(1-4). (Special Issue. Par. McCormick. 13. 6. A parallel first-order linear recurrence solver. Parallel Computing. 24. An efficient implementation of the SSOR/PCCG method on vector computers? Parallel Gauss-Jordan elimination for the solution of dense linear systems. 36. Appl. MEYER [1977]. Vectorizable multigrid algorithms for transonic-flow calculations. The block preconditioned conjugate gradient method on vector computers. KELLER [1983]. [1350] G. pp. Comput. Copper Mountain. C-35. 339-344. MEURANT [1988]. January. Tech. Comput. CO. pp. Domain decomposition algorithms for linear hyperbolic equations. [1347] M. pp. 374. p. Report 87-20. Statist. CO. Multicolor reordering of sparse matrices resulting from irregular grids. March. [1345] R. pp. 20. in Glowinski et al. 182-192. 
R. SIAM J. pp. Math. Comput.. Determination of stripe structures for finite element matrices. [1354] R. SIAM J. Domain decomposition versus block preconditioning. A study of data interlock in computational networks for sparse matrix multiplication.. Iterative solution of sparse linear systems on systolic arrays. 615-619. ROMINE [1329] R. 289-304. MELHEM [1986]. 145-184. Sci. MEYER AND L.. pp. Parallel solution of linear systems with striped sparse matrices. MELSON AND J. [1348] G. Statist. 2. Math. 9. Parallel implementations of gradient based iterative algorithms for a class of discrete optimal control problems. [762]. Experiences in using the CYBER 203 and CYBER 205 for three-dimensional transonic flow calculations. [1330] R. ORTEGA. MERRIAM [1986].. Comput. [411]. Effectiveness of multiprocessor networks for solving the nonlinear Poisson equation. 323-326. ed. [1346] M. On the factorization of block-tridiagonals without storage constraints.. Report ICMA-87-105. MELHEM AND K. Comput. 67-76. MEURANT [1987].. 
Complexity of matrix product on a class of orthogonally connected systolic arrays. ACM Trans. [1343] D. pp. pp. pp. An expanded version appeared in the SIAM News 17. pp. Numer.? Iterative solutions of sparse linear systems on systolic arrays. A mathematical model for the verification of systolic networks. A modified frontal technique suitable for parallel systems. MELHEM [1988]. 1419-1433. 14. pp. pp. Math.. 1101-1107. Proc. Par. 1986 Int. MEYER AND L. PODRAZIK [1987]. Application of data driven networks to sparse matrix multiplication. BIT.. Conf. 560-563. MELKEMI AND M. TCHUENTE [1987]. 267-280. Tech. Parallel Computing. 1987 Int. Conf. Meth.. University of Pittsburgh. Numerical algorithms on the Crystal multicomputer. 1984. 341-365. MEYER [1986]. pp. pp. pp. Proceedings of the Second Copper Mountain Conference on Multigrid Methods. pp. pp. pp. 

parallel and analog computations. 13-43. 197-215. MIRANKER AND A. in Kowalik [1116]. [623]. Parallel methods for solving equations. pp. in Miklosko and Kotov [1363]. Paper 658. Comp. Complexity of parallel algorithms. IEEE Trans.. pp. 16. EVANS [1984]. [1367] M. MIERENDORFF [1988]. [1377] A. MISSIRLIS [1987]. Computing. Report 87-04. [1386] J. Software and Hardware of Parallel Systems. in McCormick [1312]. Tech. [1385] K. San Francisco. CA. American Physical Society Division of Plasma Physics Meeting. MODI [1982]. pp. pp. Computing. Annual Controlled Fusion Theory Conference. ACM. Elsevier. MIRIN [1988]. pp. 16. A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS [1356] P. Comp. 23. Tech. 
[1382] N. MISSIRLIS [1985]. pp. 175 [1356] P. pp. MIKLOSKO AND V. KOTOV. Springer-Verlag. MISSIRLIS AND F. TJAFERIS [1988]. Comp. Complexity of parallel algorithms. in Miklosko and Kotov [1363]. pp. MIKLOSKO [1984]. Correlation of algorithms. software and hardware of parallel computers. Comm. San Diego. CA. Jacks.. Delft. The Netherlands? MIURA AND K. UCHIDA [1984]. FACOM vector processor VP-100/VP-200. Int. 1985 Int. Conf. MICHIELSE AND H. VAN DER VORST [1988]. pp. Data transport in Wang's partition method. Delft University of Technology. MINSKY AND S. PAPERT [1971]. On some associative. MIKLOSKO [1984]. Synthesis of parallel numerical algorithms. MILLER [1974]. A parallel iterative system solver. Parallel matrix factorizations on a shared memory MIMD computer. Oxford University Press. A second order iterative scheme suitable for parallel implementation. MINSKY [1970]. Form and content in computer science. Varying diameter and problem size in mesh-connected computers. MILLER AND Q. STOUT [1985]. Jacobi Methods for Eigenvalue and Related Problems in a Parallel Computing Environment. PhD dissertation. University of London. MODI [1988]. Parallel Algorithms and Matrix Computation. MIRANKER AND W. LINIGER [1967]. Parallel methods for the numerical integration of ordinary differential equations. MIRANKER [1971]. A survey of parallelism in numerical analysis. MIRANKER [1979]. Hierarchical relaxation. MIRANKER [1978]. Parallel methods for solving equations. MITRA [1987]. Asynchronous relaxations for the numerical solution of differential equations by parallel processors. 
MISSIRLIS [1984]. Scheduling parallel iterative methods on multiprocessor systems. MISSIRLIS AND D. EVANS [1984]. A second order iterative scheme suitable for parallel implementation. Jacobi? MICHIELSE AND H. VAN DER VORST [1986]. Data transport in Wang's partition method. Tech. Report 86-32. Delft University of Technology. Delft. MICHIELSE [1987]. Parallelization of multigrid methods with local refinements for a class of nonshared memory systems. Tech. Report 87-04. MIKLOSKO [1984]. Complexity of parallel algorithms. in Miklosko and Kotov [1363]. MILLSTEIN [1973]. Control structures in Illiac IV Fortran. Comm. ACM. MIRIN [1987]. Predicting multiprocessing efficiency on the Cray multiprocessors in a timesharing environment/application to a 3-D magnetohydrodynamics code. Paper CMS. Lawrence Livermore National Laboratory. Tech. Report UCRL-97580. MIRIN [1987]. Experiences parallelizing a 3-D MHD code. Submitted to Computers in Phys. MIRIN [1987]. Predicting multitasking overlap on the NMFECC Cray-2. Twelfth Conf. on Numerical Simulation of Plasmas. San Francisco. CA. MIRIN [1988]. Multiprocessing efficiency of 3-D MHD calculations on the NMFECC Cray-2. MISSIRLIS [1985]. A parallel iterative method for solving a class of linear systems. MISSIRLIS AND D. EVANS [1984]. Solution methods for bidiagonal and tridiagonal linear systems for parallel and vector computers. s43-s58. SIAM J. Sci. Statist. Comput. MIURA [1971]. The block iterative method for Illiac IV. MIURA AND K. UCHIDA [1984]. 295-302. 303-320. 359-395. 449-465. 32. 20. 17. 267-285. Doc. 

Phys. in Feilmeier et al. [623]. Comput. 7. The scattered decomposition for finite elements. MUHLENBEIN AND S. Cambridge. Mini & Microcomputers in Control and Measurements.. Izdatel'stvo Nauka. Unstructured sparse matrix vector multiplication on the DAP. OTTO [1988]. North-Holland. in Heath [858]. pp. 181-195. Department of Computer Science. Amsterdam. 34. MORGAN AND L. [1398] D. Efficient implementation of Jacobi's method on the DAP. MOHAN [1984]. 157-166. Comp. PA. Solving polynomial systems of equations on a hypercube. 1985 Int. Wu. in Kowalik [1116]. 1-8. Phys.. IEEE Trans. MORICE [1972]. Oxford. 209-226. in Paddon [1512].. [1392] J. Efficient implementation of the SU(3) lattice gauge theory algorithm on the Fujitsu VP-200 vector processor. Tech. pp. Massachusetts Institute of Technology. Cambridge. MOONEY [1986]. 197-206. Tech. [1394] J. Japanese project on fifth generation computer systems. [1395] D. pp. pp. 209-228. in Paddon [1512]. [1406] K. [1412] T. Comput. 46. C-31. MORF AND J. DELOSME [1981]. Massively parallel? MODI AND D. An alternative Givens ordering. [1982]. Calcul parallele et decomposition dans la resolution d'equations aux derivees partielles de type elliptique. IRIA. Rocquencourt. France. MORIARTY AND D. [1396] J. pp. MODI AND M. CLARKE [1984]. [1400] R. MONTOYE AND D. LAWRIE [1982]. pp. 501-511. [1411] T. Comput. 43. DAVIES. Math. MODI. MODIANO [1987]. pp. MOORE AND K. pp. Report CFDL-TR-87-11. 351-362. in Heath [860]. Symp. pp. 
Efficiency of parallel processing in the solution of Laplace's equation. pp. MODI AND G. BOWGEN [1984]. A practical algorithm for the solution of triangular systems on a parallel processing system. Comput. 28. Extension of the parallel Jacobi method to the generalized eigenvalue problem. WARHANT [1985]. Efficient multi-tasking of the SU(3) lattice gauge theory algorithm on the CRAY X-MP. Study of Jacobi methods for eigenvalues and singular value decomposition on DAP. MODI AND J. PRYCE [1984]. Tech. Report CUED/F-CAMS/TR.241. Cambridge University Engineering Department. Cambridge. UK. MODI AND J. PRYCE [1985]. Efficient implementation of the SU(3) lattice gauge theory algorithm on the Fujitsu VP-200 vector processor. MODI AND I. PARKINSON [1982]. QR factorization and singular value decomposition on the DAP. MODI AND I. PARKINSON [1984]. Tech. Report CUED/F-CAMS/TR.242. MORJARIA AND G. MAKINSON [1984]. Simulation of a reaction-diffusion system on large dimpled surfaces using a vector computer. MULLER. SCHONAUER. Design considerations for the linear solver LINSOL on a CYBER 205. MORIARTY AND D. pp. 99-116. pp. 143-146. pp. 235-250. pp. 83-90. pp. 252-257. pp. 59-76. pp. 99-108. pp. 191-197. 
Parallel algorithms for solving systems of nonlinear equations. MOLDOVAN. D. AND J. FORTES [1984]. Mapping an arbitrarily large QR algorithm into a fixed size VLSI array. MOLER [1986]. Matrix computation on distributed memory multiprocessors. in Heath [860]. MOLCHANOV [1985]. Computational Processes and Systems. Izdatel'stvo Nauka. Moscow. MOTO-OKA [1984]. Fifth Generation Computer Systems. HARAGUCHI. TAJIMA. AND O. MORIARTY. HIROMOTO. AND D. LUBECK [1984]. MORISON AND S. OTTO. The scattered decomposition for finite elements. MUKAI [1981]. Parallel algorithms for solving systems of nonlinear equations. Comput. & Math. Appl. MORGAN AND L. WATSON [1987]. Performance of parallel programs: Model and analyses. MOTO-OKA. R. STEIGLITZ [1984]. Performance of a common CFD loop on two parallel architectures. PANGALI [1984]. Concurrent multigrid methods in an object-oriented environment. SCHNEPP [1985]. Comm. Math. Phys. 26. pp. 443-454. pp. 365-373. pp. 1076-1082. pp. 39-49. Mobile Jacobi schemes for parallel computation. 

NAVARRO. PATRICK [1987]. NICOL [1987]. VA.. Tech.. Report 88-2. P. NASA Langley Research Center. NAGEL [1979]. Goettingen. West Germany. Gesellschaft fur Informatik. Mitteilungen Nr. Parallel-Algorithmen und -Rechnerstrukturen (PARS). Hampton. VA. August. V. TUTTLE [1987]. IEEE Trans. Military Electronics Defense EXPO 78. Comput. pp. Vergleich 177 ver- [1417] [1418] [1419] [1420] [1421] [1422] [1423] [1424] [1425] [1426] [1427] [1428] [1429] [1430] [1431] [1432] [1433] [1434] [1435] [1436] [1437] [1438] [1439] [1440] [1441] [1442] schiedener Losungsverfahren fur lineare Gleichungen mit Diagonalenspeicherung auf der CYBER 205. Proc. NASH AND S. Hampton. VA. NASA Langley Research Center. ICASE. Tech. Report 87-37. 
in Vichnevetsky and Stepleman [1921]. Weather simulation with the multi-microprocessor system SMS 201. NICOL AND J. SALTZ [1987]. Principles for problem aggregation and assignment in medium scale multiprocessors. NICOL AND J. SALTZ [1988]. Dynamic remapping of parallel computations with varying resource demands. NICOL AND P. REYNOLDS [1987]. Optimal dynamic remapping of parallel computations. NICOL [1988]. Parallel algorithms for mapping pipelined and parallel computations. NAIK AND M. PATRICK [1988]. Data traffic reduction schemes for sparse Cholesky factorizations. NAIK AND S. TA'ASAN [1987]. Performance studies of the multigrid algorithms implemented on hypercube multiprocessor systems. NAIK AND S. TA'ASAN [1987]. Implementation of multigrid methods for solving Navier-Stokes equations on a multiprocessor system. NICOL AND J. SALTZ [1987]. Mapping a battlefield simulation onto message-passing parallel architectures. NICOL [1987]. Analysis of communication requirements of sparse Cholesky factorization with nested dissection ordering. NETA AND H. TAI [1985]. LU factorization on parallel computers. NEUMANN AND R. PLEMMONS [1987]. Convergence of parallel multisplitting iterative methods for M-matrices. Lin. Alg. & Appl. 88. pp. NAVON. PHUA. AND M. RAMAMURTHY [1987]. Vectorization of conjugate-gradient methods for large-scale minimization. Supercomputer Computations Research Institute. Florida State University. Tallahassee. FL. Report FSU-SCRI-87-43. NANDAKUMAR [1986]. Polynomial preconditioning of symmetric indefinite systems. NAKAJIMA [1984]. MULLER-WICHARDS AND W. GENTZSCH [1982]. Performance comparisons among several parallel and vector computers on a set of fluid flow problems. DFVLR. Tech. Report IB 262-82 RO1. MYERS [1986]. 
Weather simulation with the multi-microprocessor system SMS 201. ICASE. Tech. Report 87-52. Hampton. VA. NASA Langley Research Center. NICOL AND J. SALTZ [1987]. ICASE. Tech. Report 87-39. Hampton. VA. NICOL [1987]. Performance issues for domain-oriented time-driven distributed simulations. Proceedings of 1984 Conf. on Information Science and Systems. pp. NI AND K. HWANG [1983]. Vectorization of scientific software. NEVES [1982]. Mathematical libraries for vector computers. NEVES [1984]. Getting the cycles out of a supercomputer. Proc. COMPCON? Schedules for mapping irregular parallel computations. ICASE. Tech. Report 87-49. Tech. Report 87-50. Tech. Report 87-51. Tech. Report 88-14. Tech. Report 580. Center for Supercomputing Research and Development. University of Illinois at Urbana-Champaign. Proc. 1983 Int. Conf. Proc. 1986 Int. Conf. pp. 60-67. pp. 77-89. pp. 129-137. pp. 161-166. pp. 277-291. pp. 303-310. pp. 537-544. pp. 559-575. pp. 573-580. pp. 676-683. pp. 720-729. pp. 1073-1087. Comput. 37(9). Comput. 19(3). Comput. 20(7). Comput. 26. J. Par. Dist. Comput. 3. J. Par. Dist. Comput. 11. Math. Comput. 37. Proc. of the Third SIAM Conference on Parallel Processing for Scientific Computing. NICOL AND F. 

parallel architecture and optimal speedup. Proc. 1987 Int. The American Society of Mechanical Engineers.. NOOR AND S. Sci. Evaluation of element stiffness matrices on CDC STAR-100 computer. on Future Directions in Computational Mechanics. Numer. Comm. [1447] T. ACM. 772-781. Parallelization and performance analysis of the Cooley-Tukey FFT algorithm for shared-memory architectures. [1448] J. On multigrid methods for the Navier-Stokes computer. NOOR AND S. BROWNING [1979]. Proceedings of Symposium at Purdue University. SPE Middle East Oil Tech. pp. 61. Meth.. Par. 40. AND W. NOSENCHUCK AND M. The Navier-Stokes computer. FLANNERY [1986]. [1451] A. Comput. 9. 9.4. [1460] A. [1455] A. [1446] J. Element stiffness computation on CDC Cyber 205 computer. NOUR-OMID. On a polynomially stable fast parallel algorithm? NIEVERGELT [1964]. Parallel methods for integrating ordinary differential equations. NORTON AND A. NOOR AND R. FULTON [1983]. NOLEN AND P. ASCE. 101(ST4). Computers and Structures. [1465] D. [1463] D. C-36. IEEE Trans. PFISTER [1985]. A methodology for predicting multiprocessor performance. NORTON AND G. [1464] D. STANAT [1981]. Kennedy Space Center. Florida. MOBILE AND V. Two-dimensional nonsteady viscous flow simulation on the Navier-Stokes computer mini-node. M. LITTMAN. pp. 347-354. [1452] A. Tokyo. NOSENCHUCK. December 13-18. NOOR AND J. pp. 151-161. in Noor [1450]. IEEE Trans. North-Holland Publishing Co. [1449] J. pp. 17(3). Solving structural mechanics problems on the Caltech hypercube. [1456] A. University of Tokyo Press. SPE. [1457] A. A polynomially stable fast parallel algorithm for tridiagonal systems. Proc. Comput. Phys. [1985]. Supercomputers for superproblems: An architectural introduction. NORRIE [1984]. Decomposition? [1453] A. pp. NOUR-OMID AND K. PARK [1987]. Solving finite element equations on concurrent computers. AND M. 1-32. [1454] A. Finite element dynamic analysis on the CDC STAR-100 computer. pp. 7-19. NOOR. KRIST. AND R. ZANG [1988]. Computers and Structures. [1461] A. NOSENCHUCK. LITTMAN [1986]. AND M. [1467] B. Comm. pp. NOLEN. Fifth Symposium on Reservoir Simulation. Proc. 23rd Annual Space Conf. O'DONNELL. GEIGER. AND M. SCHULTZ [1983]. PCG method for four color ordered finite difference schemes. Proc. [1462] D. Efficient implementation of multidimensional and fast Fourier transforms on a CRAY X-MP. NOCEPURENKO [1988]. USSR Comput. & Math. Phys. [1466] B. [1468] B. NODERA [1984]. [1469] R. Hypermatrix scheme for the STAR-100 computer. NUMRICH. [1470] W. OAKES AND R. [1471] S. SILBERGER [1987]. pp. 1-5. pp. 250-260. pp. 222-228. pp. 287-296. pp. 317-328. pp. 62-74. pp. 161-176. pp. 189-202. pp. 731-733. pp. 947-954. Paper 9649. Manama. Bahrain. October 31-November 1. AIAA. ed. in Glowinski et al. [762]. in McCormick [1312]. in Vichnevetsky and Stepleman [1920]. ASME Winter Meeting. 4th Int. Symp. on Finite Element Methods in Flow Problems. ADINA Conf. Experience running ADINA on CRAY-1. Report 82448-9. Impact of new computing systems on finite element computations. The finite element method and large scale computation. The coming of age of the parallel processing supercomputer. Application of vector processors to the solution of finite difference equations. KASCIC [1979]. HARTLEY [1978]. KAMEL. PARLETT. RAEFSKY. AND G. LYZENGA [1987]. WILLARD [1987]. PETERS [1986]. ROBERTO [1986]. 23rd? Supercomputer Applications Symposium. 
Impact of the CDC-STAR-100 computer on finite-element systems. 7. Parallel methods for integrating ordinary differential equations. NouR-OMID. NlCOL AND F. Phys. GEIGER. [1456] A. OAKES AND R. 731-733. Comput. 101(ST4). University of Tokyo Press. SPE. [1457] A. Proc. ZANG [1988]. NOOR. Appl. Supercomputers for superproblems: An architectural introduction. NoRRIE [1984]. A methodology for predicting multiprocessor performance. AND M. PARLETT. Impact of new computing systems on finite element computations. pp. KAMEL. [762]. Conf. Massachusetts Institute of Technology. M. 189-202. Solving finite element equations on concurrent computers. VOIGT [1975]. Symp. VOIGT AND C. FULTON [1978]. Fifth Symposium on Reservoir Simulation. 7-19. LrTTMAN [1986]. KRIST. 1-32. [1983]. Comput. MOBILE AND V. SCHULTZ [1983]. NOOR. 1-5. 250—260. The finite element method and large scale computation. 8. Computers and Structures. Proc. [1471] S. Finite element dynamic analysis on the CDC STAR100 computer. pp. [1453] A. December. pp. 23rd Annual Space Conf. Computer. NODERA [1984]. pp. Mech. ORTEGA. 1985 Int. ASME Winter Meeting. Report 82448-9. 4th Int. AND R. D. [1454] A. 1984. 222-228. 1. ed. in McCormick [1312]. P. O. PCG method for four color ordered finite difference schemes. AND M. J. Hypermatrix scheme for the STAR-100 computer. AND G. O'DoNNELL. G. Computers and Structures. pp. NOOR. & Math. HARTLEY [1978]. PETERS [1986]. 287-296. NOLEN. Experience running ADINA on CRAY-1. [1467] B. USSR Comput.November 1.. Comparison of Lanczos with conjugate gradient using element preconditioning. [1470] W. Paper 9649. Conf.. 161-176. pp. Structural Div. Proc. 317-328. LrrrMAN [1986]. Supercomputer Applications Symposium. pp. 62-74. in Glowinski et al. NOCEPURENKO [1988]. pp. RAEFSKY. ADINA Conf. [1462] D. Efficient implementation of multidimensional and fast Fourier transforms on a CRAY X-MP. Par. 2. WlLLARD [1987]. SILBERGER [1987]. Comm. pp. Manama.. [1445] A. October 31 .. RAEFSKY [1988]. Comm. 
H. Comput. [1450] A. Computers and Structures. 26. J. 581-591. Problem size. NOSENCHUCK AND M. LAMBIOTTE [1978]. pp. KUBA. Proceedings of the ASME Symposium on Parallel Computations and their Impact on Mechanics. [1459] D. Reservoir simulation on vector processing computers. 287-296. Substructuring techniques — Status and projections. Solving the Poisson equation on the . pp. Impact of New Computing Systems on Computational Mechanics. STORAASLI. Conf. NORRIE [1984]. NOOR AND J. 5. 621-632.

ROKHLIN [1987]. TAKECHI [1986]. OLEINICK AND S. Carnegie-Mellon University. Comput. LANGE [1983]. D. OPPE AND D. ORBITS [1978]. SIAM J. T. FULLER [1978]. Par. 275-300. A wavefront algorithm for LU decomposition of a partitioned matrix on VLSI processor arrays. PhD dissertation. Numerical experiments with a parallel conjugate gradient method. O'LEARY AND G. OED AND O. Systolic arrays for matrix transpose and other reorderings. and simulation of memory interference in the CRAY X-MP. 23-30. University of Michigan. 3. Center for Advanced Computation. Report YALEU/DCS/RR-554. LANGE [1985]. T. Comput. O'LEARY AND G. Conf. Algebraic Discrete Methods. pp. T. 545-547. W. Comput. Conf. Sci. From determinacy to systolic arrays. ACM. A study of the efficiency of ILLIAC IV in hydrodynamic calculations. 3. STEWART [1985]. D. 369-374. OGURA. DAP-TRAC: A Practical Application of Parallel Processing to a Large Engineering Code. pp. Report TR-1648. Report 98. C-36, pp. D. OLEINICK [1978]. OED AND O. Parallel Algorithms on a Multiprocessor. International Congress on Computational and Applied Mathematics. Parallel Computing. 77. P. Department of Computer Science. D. OLEINICK [1982]. Parallel implementation of the block conjugate gradient algorithm. Comm. O'LEARY AND G. On the effective bandwidth of interleaved memories in vector processor systems. P. W. Proc. in Feilmeier et al. Comput. ERICKSEN [1972]. On the effective bandwidth of interleaved memories in vector processor systems. SIAM J. D. A fast algorithm for the numerical evaluation of conformal mappings. KINCAID [1987]. 117-122. Department of Computer Science. Yale University. W. 1983 Int. Modeling. OPPE AND D. Parallelism and uncertainty in scientific computations. S. Comput. Comm. Proc. Statist. Tech. Tech. IEEE Trans. Tech. Tech.
ONAGA AND T. 840-853. Dist. Proc. The implementation of a parallel algorithm on C. D. O'LEARY AND G. Tech. Report 59. University of Texas at Austin. Department of Computer Science. OPSAHL [1984]. The Implementation of Parallel Algorithms on an Asynchronous Multiprocessor. Lin. W. C-34. pp. 3. IEEE Trans. pp. AND R. Multi-splittings of matrices and parallel solution of linear systems. Belgium. Proc. M. 343-358. CALAHAN [1976]. 1355-1359. PARKINSON [1986]. pp. D. pp. Meth. Data flow considerations in implementing a full matrix solver with backing store on the CRAY-1. VAN DE GEIJN [1986]. O'LEARY [1984]. A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 179 [1472] [1473] [1474] [1475] [1476] [1477] [1478] [1479] [1480] [1481] [1482] [1483] [1484] [1485] [1486] [1487] [1488] [1489] [1490] [1491] [1492] [1493] [1494] [1495] [1496] [1497] FPS-164. UMI Research Press. K. Numer. Systems Engineering Laboratory. J. T. measurement. Assignment and scheduling in parallel matrix factorization. 211-216. PhD dissertation. The performance of ITPACK on vector computers for solving large sparse linear systems arising in sample oil reservoir simulation problems. G. O'LEARY AND R. STEWART [1987]. C-36. Par. Proc. 1985 Int. University of Leuven. Carnegie-Mellon University. 5. Par. DOMINO: A message passing environment for parallel computations. Tech. Tech. Par.
Department of Computer Science. pp. Conf. STEWART [1986]. A Cray-1 timing simulator. Report CMU-CS-78-125. An algorithm for solving sparse sets of linear equations with an almost tridiagonal structure on SIMD computers. 33-40. W. 630-640. 137-157. 1986 Int. University of London. KINCAID [1987]. OPSAHL AND D. Data-flow algorithms for parallel matrix computations. Parallel Computing. P. Report 118. The solution of linear recurrence relations on pipelined processors. ORBITS AND D. 949-957. [623]. pp. Appl. pp. University of Michigan.

Efficient parallel solution of linear systems. PATERA [1981]. CALAHAN [1978]. Computational Methods in Classical and Quantum Physics. PARKINSON [1984]. 347-385. Parallel Computing. Academic Press. PARKINSON [1976]. A case study in the application of a tightly coupled multi-processor to scientific computations. Vector computers. ORTEGA AND C. WUNDERLICH [1984]. pp. ORSZAG AND A. R. [1527] D. Heiberger. [1508] N. [1520] D. Parallel Solution of Elliptic Partial Differential Equations on a Tree Machine. 832-835. PARKINSON [1984]. 347-385. American Statistical Association. Hampton. Academic Press. pp. ORTEGA AND C. VOIGT [1985]. 7(2). SIAM Rev. 19. WHITESIDE [1982]. pp. LASL Workshop on Vector and Parallel Processors. 109-117. The measurement of performance on a highly parallel system. Math. Using the ICL DAP. 17th Annual ACM Symposium on Theory of Computing. R. 42. [1525] D. Adv. The ICL Distributed Array Processor DAP. Comm. Comm. IEEE Trans. pp. C-29. pp. Parallel systems. [1500] S. [1516] V. P. [1504] J. Comput. 26. Comm. pp. Application of microprocessor networks for the solution of diffusion equations. ORTEGA [1988]. Solution of partial differential equations on vector and parallel computers. AND R. [1515] J. Conjugate Direction Methods and Parallel Computing. [1984]. 135-148. VOIGT [1977]. Comm. OSTLUND [1985]. LIDDELL [1982]. 128. pp. 28. [1523] D. PARGAS [1982]. 75-84. pp. Parallel Computing. PARKINSON [1982]. 28. PARKINSON [1984]. Plenum Press. [1499] S. PARKER [1980]. PhD dissertation. Phys. [1510] G. pp. Oxford. 5. [1501] S. Fluid Mech. pp. Comput. Parallel computing on a hypercube: An overview of the architecture and some applications. PhD dissertation. Lett. [1521] D. [1507] J. 149-162. 3. IEEE Trans. [1518] D. ICASE. pp. 32-37. ORTEGA AND R. Comput. Secondary instability of wall bounded shear flows. 19th Symp. Proc. Comp. pp. Parallel efficiency can be greater than unity. Comput. PARKINSON AND H. Notes on shuffle/exchange type switching networks. 127-146. Report 1-3. New York. Organizational aspects of using parallel computers. [1521] D. Subcritical transition to turbulence in planar shear flows. in Kowalik [1116]. VA. PARKINSON [1987]. New York. [1505] J. pp. [1503] J. in Rodrigue [1643]. Parallel Computing. Clarendon Press. PALMER [1974]. A compact algorithm for Gaussian elimination . [1498] D. ORBITS AND D.
University of North Carolina. The ijk forms of factorization II. VOIGT [1987]. pp. pp. Parallel Computing. The ijk forms of factorization methods I. VOIGT [1987]. Department of Computer Science. A bibliography on parallel and vector numerical algorithms. PARKINSON [1986]. ORTEGA [1987]. pp. An incomplete LDU decomposition of lattice fermions and its application to conjugate residual methods. [1512] D. 261-262. PhD dissertation. 27-32. [1519] D. OVANGAI [1986]. The Distributed Array Processor (DAP). Experience in exploiting large scale parallelism. ed. Supercomputers and Parallel Computation. [1522] D. A CRAY-1 simulator and its application to development of high performance codes. D. 37. PATERA [1981]. Waterloop V2/64: A highly parallel machine for numerical computation. [1509] N. in Feilmeier et al. Comput. Rev. [1506] J. The solution of N linear equations using P processors. [1517] R. 143-152. 149-240. Hooper. Simul. 7(2). [1511] Y. Solution of partial differential equations on vector computers. Comput. M. 1977 Army Numerical Analysis and Computers Conference. 23-27. [1502] J. on the Interface of Computer Science and Statistics. PATERA [1983]. 27. Tech. ed. PARKINSON [1982]. G. ORTEGA AND R. 333-344. pp. M. Department of Computer Science. J. Phys. Proc. Parallel Computing. ROMINE [1988]. OSTROUCHOV [1987]. NASA Langley Research Center. Phys. Proc. Proc.
VOIGT AND C. PAKER [1983]. pp. [1526] D. 247-256. Meyer. C-31. pp. Transition and Turbulence.. Washington. OSTLUND. H. pp. PAKER [1977]. ed. ORSZAG AND A. [1513] Y. Phys. REIF [1985]. pp. Calculation of Von Karman's constant for turbulent channel flow. pp.

223-252. pp. [1540] M. PASCIAK [1988]. [1539] A. Tree machines and divide-and-conquer algorithms. PERROTT [1987]. pp. 26(5). PERROTT [1979]. pp. Fast direct Poisson solvers for high order finite element discretization in rectangularly decomposable domains. pp. Parallel Computing. 255-261. NOUR-OMID. PATEL [1984]. over GF(2) implemented on highly parallel computers. Evans. THOMAS [1982]. Application of supercomputers to computational aerodynamics. A Fully Vectorized Numerical Solution of the Incompressible Navier-Stokes Equations. PhD dissertation. Phys. 207-222. [1532] S. Ultracomputer implementation of odd-even cyclic reduction. in Paddon [1512]. PETERS [1984]. 65-73. Impact of computers on aerodynamics research and development. PATEL [1983]. Herts. On k-line and k X k block iterative schemes for a problem arising in 3-D elliptic difference equations. January. The implementation of lattice calculations on the DAP. D. [1533] H. pp. VOIGT [1987]. 823-839. 2. [1528] B. Mississippi State University. AND R. [1557] V. 1. Berlin. Parallel Computing. Anal. 285-301. Implementation of Lanczos algorithms on vector computers. PETERS [1981]. ACM Computing Surveys. [1537] N. 17. 6. PETERSON [1984]. SIAM J. D. NASA Ames Research Center. Parallel Computing. [762]. PRATT [1986]. J. pp. 474-480. Parallel computation and numerical optimization. [1559] V. 1. PETERSON [1984]. ACM. 929-940. Appl. [1556] J. [1554] F. PEASE [1967]. pp. Meth. Anal. 229-249. 62-72. Domain decomposition preconditioners for elliptic problems in two and three dimensions: First approach. Anal. PETERS [1985]. D. NASA Ames Research Center. [1548] W. PARTER AND S. pp. architectures for computational chemistry. Computer. NASA Ames Research Center. [1541] M. J. December. Springer-Verlag. [1545] M. 757-764. 1-18. pp. pp. PAUL AND W. J. pp. PETERSON. 9. 19. Proc. [1553] F. PATEL [1982]. pp. Sparsity and Its Applications. PATTON [1985]. pp. 99-110. PAWLEY AND G. STEUERWALT [1980]. PATRICK. [1559] V. pp. CONPAR 81. SIAM J. [1551] C. 19. Proc. [1550] R. [1543] G. PARTER. [1538] N. Fluids. pp. 71-73. [1984]. ed. Cambridge University Press. PATERA [1986]. Ultracomputer Note 19. 1. A standard for supercomputer languages. ACM. Numer. Meth. pp. JORDAN [1984]. Report CR-2032. Academic Press. Parallel Computing. pp. 211-218. LIEBERMAN. Comp. 1. Communication oriented programming of parallel iterative solutions of sparse linear systems. [1544] G. pp. The indirect binary n-cube microprocessor array. in Jesshope and Hockney [976]. 1926. 252-264. [1547] M. 1985 Int. J. [1530] S. NOC. 15. 22. Comm. Handler. 
Lecture Notes in Computer Science 111. The impact of domain partitioning on the performance of a shared memory multiprocessor. PATTON [1985]. pp. 25-36. Phys. pp. in Numrich [1469]. STEUERWALT [1982]. pp. 99-110. PAWLEY AND G. JATVIG [1985]. PEASE [1968]. An adaptation of the fast Fourier transform for parallel processing. PELKA AND A. 211-218. LIEBERMAN. PESKIN [1981]. Comp. pp. Large Scale Scientific Computation. pp. The Mark III hypercube ensemble concurrent computer. Parallel pivoting algorithms for sparse symmetric matrices. ed. pp. 913-926. PARTER AND M. pp. pp. PETERSON [1977]. Tech. Report TM-85965. Computational aerodynamics and the NASF. Algorithms vs. [1558] V. WILSON [1978]. pp. Block iterative methods for elliptic and parabolic difference equations. pp. [1552] F. Parallel Computing. [1549] R. Petri nets. pp. [1542] P. PEASE [1977]. Conf. 47. Tech. Report 129. in Glowinski et al. [1546] M. 165-178. PETERSON [1978]. An introduction to VECTRAN and its use in scientific applications programming. Tech. Report TR 86. RIACS. pp. [1555] J. Proc. of LASL Workshop on Vector and Parallel Processors. Orlando. FL. PETERS [1986]. Numer. 18(6). Multiprocessors: Architecture and applications. [1529] S. [1531] S. pp. 1173-1195. PATRICK AND T. 291-308. New York University. pp.
Implementation of a parallel (SIMD) modified Newton algorithm on the ICL DAP. A parallelized point rowwise successive over-relaxation method on a multiprocessor. D. PESKIN [1981]. 14. Block iterative mcthodt for elliptic finite element equations. Orlando. Department of Computer Science. BAUSCHLICHER [1986]. PETERS [1986]. Numer. of LASL Workshop on Vector and Parallel Processors. B. 18(6). Multiprocessors: Architecture and applications. [1529] S. [1531] S. Peiri net*. pp. SLAM J. pp. W. pp. IEEE Trans. Matrix inversion using parallel processing. 19. PATRICK AND T. Par. 65. [1534] J. [1535] K. 1173-1195. AND M. PARTER AND M. 291-308. New York University. pp.

Tech. Comp. [1565] D. Yale University. Report YALEU/DCS/RR-542. Distributed orthogonal factorization: Givens and Householder algorithms. scheduling. POLYCHRONOPOULOS AND U. Multicolor ICCG methods for vector computers. PIERCE [1987]. pp. [1568] R. Report CS-88-13. 1394-1418. POTHEN AND P. Tech. 26. PHILIPPE [1987]. pp. POTTER [1983]. pp. 62-67. 396-403. POTHEN [1988]. Mapping rings and grids onto the FPS T-series hypercube. J. Comm. 587-596. KLEINFELDER. Solution of sparse linear equations arising from power system simulation on vector and parallel processors. Numerical algorithms for parallel and vector computers: An annotated bibliography. H. [1579] T. Amer. RAGHAVAN [1987]. 302-312. PLANDER [1984]. Report YALEU/DCS/RR-543. NORTON [1985]. pp. 1008-1021. POTTLE [1979]. Implicit finite difference simulation of an internal flow in a nozzle: An example of a physical application on a hypercube. pp. 790-797. Conf. 764-771. MCAULIFFE. A nongradient minimization algorithm having parallel structure. 1985 Int. Report ETA-TR-61. Department of Computer Science. POLYCHRONOPOULOS [1986]. K. Implementing domain decoupled incomplete factorizations and a parallel conjugate gradient method on the Sequent Balance 21000. Department of Computer Science. 273-321. ISA Trans. 1985 Int. POTHEN. Pennsylvania State University. Speedup bounds and processor allocation for parallel programs on multiprocessors. 1. PIERRE [1973]. Implicit finite-difference simulation of an internal flow on a hypercube. Tech. Report CNA-208. PLATZMAN [1979]. Amer. Soc. PORTA [1987]. MELTON. University of Illinois at Urbana-Champaign. G. 68-79. pp. pp. [1563] B. Implicit finite-difference simulation of an internal flow on a hypercube. PmLLIPPE.
The Massively Parallel Processor. NORTON. R. Par. Boston. Tech. PORTA [1987]. Proc. pp. pp. WEISS [1985]. AND U. 337-347. Parallel processors and multicomputer systems. [1562] G. pp. pp. ORTEGA. Soc. PORTA [1987]. Yale University. pp. [1583] J. PORTA [1987]. E. Report YALEU/DCS/RR-553. PORTA [1987]. 1. D. Tech. in Heath [860]. Algebraic Discrete Methods. [1582] A. A parallel block iterative scheme applied to computations in structural analysis. Department of Computer Science. POTHEN. PhD dissertation. ACM. [1574] W. Boeing Computer Services. Parallel Computing. [1575] D. The IBM ret earch parallel proccttor prototype (RPS): Introduction and architecture. Pennsylvania State University. Orthogonal factorization on a distributed memory multiprocessor. The University of Virginia. Report YALEU/DCS/RR-594. [1572] E. Yale University. Elect. POOLE AND J. [1584] J. Engrg. pp. M. Bull. Yale University. 16(1). Numer. PFISTER. [1573] E. 1985 Int. Report 595. 379-388. Report CS-87-24. Incomplete Choleski conjugate gradient on the CYBER 203/205. 1986 Int. 15. ROMINE IEEE. [1571] E. W. An algorithm to improve nearly orthonormal sets of vectors on a vector processor. Department of Applied Mathematics. W. Proc. August. Anal. [1567] G. POTTER. 60. PLEMMONS [1986]. The ENIAC computations of 1950 — Gateway to numerical weather prediction. Tech. SIAM J. VOIGT AND C. pp. Rev. PooLE [1986]. [1564] D. 3-21. 19-28. MA. [1570] C. pp. Matrix computations. Par. BRANTLEY. 961-968. The symmetric eigenvalue problem. Department of Computer Science. POPLAWSKI [1988]. Conf. 7. Proc. Simplicial cliques. ORTEGA [1984]. ORTEGA [1987]. Hot tpot contention and combining in multistage interconnection networks. Department of Computer Science.. Par. V. in Miklosko and Kotov [1363]. On program restructuring. [1569] C. AND J. BANERJEE [1986]. GEORGE. Image processing on the Massively Parallel Processor. Computer. SIAM J. Multi-color Incomplete Cholesky Conjugate Gradient Methods on Vector Computers. ed. 
[1580] A. Orthogonal factorization on a distributed memory multiprocessor. The University of Virginia. Report YALEU/DCS/RR-553. PORTA [1987]. pp. pp. 961-968. PooLE [1986]. [1564] D. 3-21. 19-28. The ENIAC computations of 1950 — Gateway to numerical weather prediction. Tech. in Heath [860]. Algebraic Discrete Methods. [1582] A. A parallel block iterative scheme applied to computations in structural analysis. Pennsylvania State University. Orthogonal factorization on a distributed memory multiprocessor. The IBM research parallel processor prototype (RP3): Introduction and architecture. [1572] E. 1985 Int. Report 595. Elect. POOLE AND J. [1584] J. Engrg. pp. Boeing Computer Services. [1575] D. Incomplete Choleski conjugate gradient on the CYBER 203/205. Rev. PFISTER. Hot spot contention and combining in multistage interconnection networks. Computer. SIAM J. Multi-color Incomplete Cholesky Conjugate Gradient Methods on Vector Computers. in Miklosko and Kotov [1363]. On program restructuring. [1569] C. Image processing on the Massively Parallel Processor. SIAM J. Vector Fortran for numerical problems on CRAY-1. BANERJEE [1986]. pp. 1-10. [1985]. Proc. Conf. and communication for parallel processor systems. 81-88. SIAM J.

. [1606] J. [1611] S. PRICE AND K. CREWS [1982]. architecture and performance. M. Performance evaluation of multiple processor systems. Special purpose computer for non-linear differential equations. 148-150. Comm. [1607] L. 17. Comput. pp. Optimal scheduling strategies in a multiprocessor system. An improved parallel processor bound in fast matrix inversion. Eng. 12. 41-47. Supercomputer. in Kowalik [1116]. pp. 1983 Int. BURNS [1987]. in Gary [700]. 9. pp. QUARTERONI AND G. pp. [1615] D. Conf. [1600] I. PYLE AND S. Pet. 7. RANADE [1985]. AIAA J. 41-51. CHEN. Multilevel load balancing for hypercubes. L. Concurrent processing: A new direction in scientific computing. Distributed array processor. K. RADEHAUS. 89-98. pp. Li [1977]. Tech. [1590] D. Carnegie-Mellon University. Letts. Proc. pp. 952-958. Application of concurrent processing to structural dynamic response computations. [1593] G. [1604] M. ACM Computing Surveys. WIESEMAN. pp. Tech. Institute for Scientific Computing. KARDELL. COATS [1974]. [1595] D. NASA CP 2335. [1601] I. A parallel Monte Carlo model for radiative heat transfer. 211-216. RAMAKRISHNAN AND P. NASA. PURVINS [1985]. [1598] S. Par. 345-350. Proc. WHEAT [1984]. QUINN [1987]. AFIPS. PREPARATA AND J. VUILLEMIN [1980]. pp. 116-124. Simulation of three-dimensional compressible viscous flow on the Illiac IV computer. 87-89. A software debugging aid for the Finite Element Machine. Comm. NASA Langley Research Center. 54. Comput. PREPARATA AND D. REA [1983]. RAMAMURTHY [1987]. [1597] C. McGraw-Hill. Cyberplus: A multiparallel operating system. PhD dissertation. 18. Department of Computer Science. LOMAX [1979]. The cube-connected cycles: A versatile network for parallel computation. Report CMU-CS-78-141. Three-dimensional analysis of [0/90]s and [90/0]s laminates with a central circular hole. Comput. VARMAN [1985]. A. Performance improvement beyond vectorization on the Cyber 205. CO. C-36. HOTOVY [1979]. RAMAMOORTHY. QUINLAN [1988]. New approach to the 3D transonic flow analysis using the STAR-100 computer. C-21. pp. [1603] C. Proc. 36. RASKIN [1978]. QING-SHI AND W. Fort Collins. 137-146. Optimal integrated-circuit implementation of triangular matrix inversion. Presented at the Los Alamos Workshop on Operating Systems and Environments for Parallel Processing. [1605] A. PULLIAM AND H. Proc. 1985 Nat. [1614] D. [1613] D. ADAMS. Research in Structures and Dynamics — 1984. 309-329. REDDAWAY [1984]. 3. Proc. NM. REED. Los Alamos. Report 87001. pp. STORAASLI. A parallel algorithm for high-speed subsonic compressible flow over a circular cylinder. M. ACM. 98-99. Report 88-11. 157-166. WALDOWSKI. Composite Tech. RAMAMOORTHY AND H. AND M. Soc. A Kosloff/basal method 3D migration program implemented on the CYBER 205 supercomputer. [1592] L. Domain decomposition preconditioner for the spectral collocation method. Phys. Proc. 1980 Int. August 7-9. pp. 61-102. pp. University of Colorado. [1608] J. 22. AND S. PREPARATA AND J. Interconnection networks and parallel memory organizations for array processing. 159-167. [1596] M. IEEE Trans. Par. Conf. RONG-QUAN [1983]. An optimal family of matrix multiplication algorithms on linear arrays. IEEE Trans. Modular matrix multiplication on a linear array. SACCHI-LANDRIANI [1988]. 159-167. [1588] F. Conf. pp. 1985 Int. Conf. RADEHAUS. C-33. Proc. Par. Proc. Stencils and problem partitionings: Their influence on the performance of multiple processor systems. [1594] A. REDDAWAY [1979]. [1612] S. NASA. AND H. pp. AIAA J. VA. Rev. pp. 295-308. Inf. Tech. [1610] G. 4(4). AND R. ICASE. [1602] C. vol. GONZALEZ [1972]. pp. 534-552. Par. Proc. [1609] W. Pipeline architecture. Proc. 31-44. [1599] I. Hampton. Conf. Tech. RANSON. Direct methods in reservoir simulation. Fifth Symposium on Reservoir Simulation. RAJAN [1972]. FULTON [1984]. 300-309. RAJU AND J. BERKEMEIER. [1589] H. Designing Efficient Algorithms for Parallel Computers. New York. [1591] T. IEEE Trans. RATTNER [1985]. RAMAKRISHNAN AND P. VARMAN [1984]. 376-383. [1587] F. PRYOR AND P. PATRICK [1987]. Tech. REDHED. VUILLEMIN [1981]. [1586] F. Par. Proc.
in Jesshope and Hockney [976]. The DAP approach. pp. 309-329. REDDAWAY [1984]. Distributed array processor. Tech. Report. pp. 327-358. SARWATE [1978]. Performance Based Design and Analysis of Multimicrocomputer Networks. PhD dissertation. Purdue University. Tech. Report 82448-9. pp. 222-228. REED [1983].
SARWATE [1978]. Comp. 36. pp. 24. FULTON [1984]. Comput. 22. Proc. Conf. pp. 14. The cube-connected cycles: A versatile network for parallel computation. Comm. J. pp. 295-308. Stencils and problem partitionings: Their influence on the performance of multiple processor systems. IEEE Trans. pp. Par. Proc.

ROBERT AND M. 19. 874-881. C. NASA Langley Research Center. ROMINE pp. R. [1628] J. Some analysis techniques for asynchronous multiprocessor algorithms. NASA-CP 2216. Purdue University. Proc. PATRICK [1985]. HENDRICKSON. [1623] G. [1617] D. Computers and Structures. Using supercomputers today and tomorrow. 540-546. Supercomputing. IEEE Trans. PATRICK [1985]. Iterative solution of large sparse linear systems on a static data flow architecture: Performance studies. pp. 295-316. REIJNS AND M. CHARNAY. An incomplete Choleski factorization by a matrix partition algorithm. [1643] G. Department of Computer Science. [1624] B. M. Math. Appl. 271. Methodes iteratives serie-parallel. pp. Par. Proceedings of the Fall Joint Computer Conference. Parallel Computations. Parallel iterative methods for the solution of Fredholm integral equations of the second kind. Proc. Computer. pp. 2. 161-173. Par. Conf. Tech. 2. 1985 Int. Report 1-140. Large scale flowfield simulation using the Cyber 205. [1644] G. J. Report. Washington. AND Z. REHAK. RODRIGUE [1984]. AND R. [1642] J. ZMOB: Hardware from a user's viewpoint. pp. in Vichnevetsky and Stepleman [1921]. Analysis and modeling of Schwartz splitting algorithms for elliptic PDE's. Conf. RICE [1985]. 159-177. [1621] L. Inner/outer iterative methods and numerical Schwartz algorithms. HODOUS [1985]. Softw. [1638] F. 17-30. on Applied Math and Computing. ROBINSON [1979]. pp. [1637] F. eds. [1634] J. SE-5. Multi-FLEX machines: Preliminary report. pp. [1640] Y. pp. 179-194 and 315-326. [1982]. RODRIGUE [1986]. New York. Proc. Evaluation of finite element system architectures. RICE [1986]. May. RIEGER [1981]. [1633] C. Iterative solution of large sparse linear systems on a static data flow architecture: Performance studies. [1632] J. [1626] J. RICE [1985]. Tech. [1630] J. Report UCRL-93792. Tech. H. 1-38. Paris. RIZZI [1985]. Report CSD-TR-612. Acad. [1618] D. Softw. pp. RIGANATI AND P. RICE [1987]. [1646] G. [1645] G. pp. Proc. 1984 Int. Mate. Purdue University. Tech. RILEY. RICE [1986]. On implementing the Monte Carlo evaluation of the Boltzmann collision integral on ILLIAC IV. Pattern Recognition and Image Processing. REITER AND G. [1633] C. Parallel iterative solution of sparse linear systems: Models and architectures. [1631] J. RICE [1987]. May. Report CSD-TR 516. 2. 97-113. pp. pp. 399-408. SCHNECK [1984]. REILLY [1970]. Evaluation of the SPAR thermal analyzer on the CYBER-205 computer. pp. 24-31. Comput. 405-424. A model of asynchronous iterative algorithms for solving large sparse linear systems. Parallelism in solving PDE's. REED AND M. 19. IEEE Trans. A parallel first-order method for parabolic partial differential equations. REED AND M. PATRICK [1984]. [1625] E. Highly Parallel Computers. 329-342. Coordinated Science Laboratory. [1636] A. pp. KEIROUZ. Block LU decomposition of a band matrix on a systolic array. DC. [1629] J. pp. 845-858. Iterations chaotiques serie-parallel pour des equations non-lineares de point fixe. in Birkhoff and Schoenstadt [173]. [1635] A. pp. 25-32. Proc. pp. 540-546. 17(10). in Kowalik [1116]. RIZZI AND M. BARTON. University of Illinois at Urbana-Champaign. [1621] L. [1622] J. Amsterdam. [1620] D. C-34. pp. Supercomputer. [1616] D. REED AND M. [1619] D. pp. 8-18. Math. [1627] J. TCHUENTE [1985]. Systolic resolution of dense linear systems. ROBERT [1985]. M. [1639] Y. MUSY [1975]. RAIROMMNA. Academic Press. ROBERT [1970]. [1641] J. North-Holland. Eng. ROBERT. C. [1987]. CENDRES [1985]. Proc. Third US Army Conf. pp. 520-529. IEEE Comput. Soc. Press. Supercomputing about physical objects. RELJNS AND M. HARTKA [1982]. Parallel methods for PDE's. in Heath [860]. pp. 20. RODRIGUE [1985]. RODRIGUE [1984]. 205-218. pp. 45-67. ROBINSON. Adelman. ed. RICE [1985]. pp. IEEE Computer Society Press. Parallel scientific computing: Philosophy and directions. Computational Aspects of Heat Transfer and Structures. REID [1987]. [1982]. 847-850. pp. 402-409. The exploitation of parallelism by using Fortran 8X features. 24-31. ed. MARINESCU [1987]. Int. Evaluation of the SPAR thermal analyzer on the CYBER-205 computer.
Vector coding the finite volume procedure for the Cyber 205. Comput. pp. 25-32. Problems to test parallel and vector languages. Proc. pp. 161-173. RODRIGUE [1986]. Tech. Report TR-708. pp. 271. Department of Computer Science. REILLY [1970]. pp. 399-408. Supercomputer.
Problems to test parallel and vector languages. 25-32. Coordinated Science Laboratory. [1635] A. Parallel methods for PDE's. Soc. [1616] D. W. [1619] D. Department of Computer Science. RlCE [1987]. C-34. C. pp. Purdue University. in Kowalik [1116]. . G. Proc. pp. pp. Parallel scientific computing: Philosophy and directions. in Heath [860]. RAIROMMNA. Supercomputer. North-Holland. Lawrence Livermore National Laboratory. VOIGT AND C. 20. pp. The exploitation of parallelism by using Fortran 8X features. pp.184 J. 295-312.

Gaussian elimination on hypercubes. Architecture. Conf. pp. N. Montvale. [1659] C. AND M. Lawrence Livermore National Laboratory. A new class of explicit methods for parabolic partial differential equations. IMACS. [1663] A. 1. SAAD [1983]. Report TM-73203. [1657] R. Department of Computer Science. Computer. ACM. Comm. [415]. Evaluating computer program performance on the CRAY-1. 245-249. Parallel and Large Scale Computers: Performance. North-Holland. Applications. Proc. ROMINE AND J. RONSCH [1984]. in Cosnard et al. [1669] R. 13(12). Parallel solution of triangular systems on a hypercube. ROSENFELD [1969]. 383-386. HENDRICKSON. MADSEN. PhD dissertation. France. GIROUX. Comm. Communication complexity of the Gaussian elimination algorithm on multiprocessors. Proc. ROMINE [1986]. 188-191. [1649] G. Stability aspects in using parallel algorithms. pp. Least squares polynomials in the complex plane with applications to solving sparse non-symmetric matrix problems. Comput. [1653] G. AMES. Tech. 21. SAAD [1987]. 552-559. Comm. Applications. [1984]. in Heath [860]. Proc. [1667] T. pp. Jacobi splittings and the method of overlapping domains for elliptic PDEs. [1668] M. NJ. Amsterdam. [1650] C. On the design of parallel numerical methods in message passing and shared memory environments. 63-72. 2. RODRIGUE AND D. Meth. [1650] G. RUDY [1980]. 65-80. AFIPS Press. [1662] W. A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 185 [1647] G. RODRIGUE. Some ideas for decomposing the domain of elliptic partial differential equations in the Schwartz process. [1653] G. Lawrence Livermore National Laboratory. Tech. Report UCRL-78652. Analysis of a 2-D code on the CRAY-1. 75-98. Jacobi splittings and the method of overlapping domains for elliptic PDEs. [1660] C. Domain decomposition and inner/outer iteration for elliptic partial differential equations II. in Feilmeier et al. pp. Comm. RODRIGUE AND D. WOLITZER [1984]. Sci. Comp. Comm. Math. RODRIGUE [1986]. Report YALEU/DCS/RR-276. pp. 63-72. 6. 2. RODRIGUE AND D. 549-565. Math. RUDINSKI AND G. [1661] W. Department of Applied Mathematics. 6. [1984]. Lawrence Livermore National Laboratory. Paris. RUSSELL [1978]. PhD dissertation. Tech. [1666] J. pp. 65-80. SAAD [1985]. Report 79-9. pp. Domain decomposition and inner/outer iteration for elliptic partial differential equations II. Comm. [1670] Y. pp. RODRIGUE AND J. On the design of parallel numerical methods in message passing and shared memory environments. USSR Comput. Report UCID-18549. An Illiac program for the numerical simulation of homogeneous incompressible turbulence. [1672] Y. Incomplete block cyclic reduction. Lawrence Livermore National Laboratory. Odd-even reduction for banded linear equations. [1648] G. RODRIGUE. PIEPER [1979]. An asynchronous iteration method of solving nonlinear equations using parallel approximation of an inverse operator. 1. M. 315-340. [1674] Y. Report UCRL-95669. A production implementation of an associative array processor STARAN. Preconditioning by incomplete block cyclic reduction. Tech. Report UCRL-92077-II. [1655] G. [1651] G. Statist. [1664] J. A case study in programming for parallel processors. Perspectives on large-scale scientific computation. pp. 101-103. Phys. [1665] L. Factorization Methods for the Parallel Solution of Linear Systems. [1673] Y. pp. 101-128. Tech. Timing and stability analysis of summation algorithms. NASA Tech. pp. 225-231. RUDOLPH [1972]. [1658] C. 10th IMACS World Congress on Systems Simulation and Scientific Computation. pp. 229-241. pp. Some new parallel methods for solving the heat equation. ROGALLO [1977]. NASA Ames Research Center. Practical use of polynomial preconditionings for the conjugate gradient method. SIAM J. [1671] Y. The CRAY-1 computer system. Parallel Computing. Argonne National Laboratory. [1656] G. CHONTENSEN. Proceedings of International Seminar on Scientific Supercomputers. February 2-6. pp. 645-655. Fall Joint Comp. [1652] G. [1654] G. RODRIGUE AND P. The University of Virginia. RUSCHITZKU. VICHNEVETSKY. ROMINE [1987]. in Rodrigue [1643]. in Vichnevetsky and Stepleman [1920]. RODRIGUE AND D. WOLITZER [1986]. Lin. Alg. & Appl. SIMON [1984]. ORTEGA [1988]. 42. WOLITZER [1984]. pp. 865-882. Numer. SAAD [1986]. Tech. Report UCRL-95278. ROOZE [1988]. pp. 109-114. Phys. Appl. Math. Yale University. 26. 77. KARUSH [1976]. & Math. pp. 229-241. SAAD [1986]. 12. pp.
Parallel solution of triangular systems on a hypercube. ROMINE [1986]. AND R. RODRIGUE [1986]. Report 79-9. pp. 26. .A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 185 [1647] G. [1670] Y.. RUSSELL [1978]. Parallel solution of triangular systems of equations. [1671] Y. Lawrence Livermore National Laboratory. 77. ROOZE [1988]. Tech. Numer. CHONTENSEN. 645-655.. Fall Joint Comp. 109-114. pp. [1666] J. Report UCRL-95669. An implicit numerical solution of the two dimensional diffusion equation and vectorization experiments. ROGALLO [1977]. NASA Tech. AND M. Practical use of polynomial preconditionings for the conjugate gradient method. pp. ROMINE [1987]. in Rodrigue [1643]. [1648] G. SAAD [1986]. PlEPER [1979]. An asynchronous iteration method of solving nonlinear equations using parallel approximation of an inverse operator. 1. M. 315-340. [1674] Y. Report UCID-18549. [623]. 865-882. Proceedings of International Seminar on Scientific Supercomputers. ACM. 229-241. pp. pp. February 2-6. pp. 225-231. An Illiac program for the numerical simulation of homogeneous incompressible turbulence. Incomplete block cyclic reduction. Lawrence Livermore National Laboratory. & Math. Timing and stability analysis of summation algorithms. NASA Ames Research Center. Alg. [1672] Y. pp. Tech. WOLITZER [1986]. Some new parallel methods for solving the heat equation. Preconditioning by incomplete block cyclic reduction. Lin. Argonne National Laboratory. RODRIGUE AND D. SAAD [1986]. KARUSH [1976]. 10th IMACS World Congress on Systems Simulation and Scientific Computation. Tech. Report UCRL-92077-II. [1655] G. [1651] G. Statist. [1664] J. A production implementation of an associative array processor STARAN. RUDOLPH [1972]. [1658] C. M. [1652] G. Odd-even reduction for banded linear equations. [1656] G. Appl. Tech. C. 101-103. Phys. [1665] L. Factorization Methods for the Parallel Solution of Linear Systems. [1673] Y. 101-128. Tech. [1654] G. RODRIGUE AND D. R6NSCH [1984]. 
Parallel Computing.. pp. Yale University. Report UCRL-95278. RODRIGUE AND J. 12. USSR Comput. The CRAY-1 computer system. SIAM J. Parallel Computing. Perspectives on large-scale scientific computation. Math. A case study in programming for parallel processors. pp. VICHNEVETSKY. RODRIGUE AND P. in Vichnevetsky and Stepleman [1920]. pp. RUSCHITZKU. SIMON [1984]. ORTEGA [1988]. 42. Tech.

S. Yale University. pp. Sci. CHEN. April. 1030-1038. SALTZ [1987]. Comp. 395-411. [1685] P. in Snyder et al. Parallel Poisson and biharmonic solvers. AND D. SCHULTZ [1987]. IEEE Trans. SAMEH [1984]. SAMEH [1971]. SAMEH. July. [1687] M. Comput. Yale University. C.. Yale University. KUCK [1976]. March.-T. [1699] A. NASA Langley Research Center. 1101-1113. Lin. 9th Annual Allerton Conf. Report YALEU/DCS/RR-461. Solving elliptic difference equations on a linear array of processors. Parallel solution of finite element equations. 191-200. W. [1133]. Data communication in hypercubes. Tech. SAMEH [1981]. & Appl. ORTEGA. pp. Hampton. Analysis of parameterized methods for problem partitioning. [1681] Y. Automated problem scheduling and reduction of synchronization delay effects. pp. pp. Fitzgibbon. Towards developing robust algorithms for solving partial differential equations on MIMD machines. SCHULTZ [1985]. Nearest-neighbor mapping of finite element graphs onto processor meshes. SAAD AND M. M. AND P. Proc. ICASE Report 86-46. SAAD AND M.. pp. Parallel algorithms in numerical linear algebra. SAAD AND A. SAMEH [1985]. [1684] P. R. Comput. pp. [1695] A. NAIK [1988].186 J. SAMEH AND R. Tech. Topological properties of hypercubes. A parallel block Stiefcl method for solving positive definite systems. 25. SADAYAPPAN. 405-411. SAMEH [1984]. 207228. SCHULTZ [1985]. Math. 526-539. 311-328. in Heath [860]. MARTIN [1987]. Numer. [1701] A. Statist. Tech. 579-590. May. Conf. [1689] J. On some parallel algorithms on a ring of processors. October. On Jacobi and Jacobi-like algorithms for a parallel computer. [1690] J. VA. Department of Computer Science. 6. Tech. [1702] A. On two numerical algorithms for multiprocessors. Yale University. Numerical parallel algorithms — A survey. Comput. SAMEH. SAAD AND M. Statist. SALAMA. [1704] A. Cornm. ERCAL. 1987 Int..F. A fast Poisson solver for multiprocessors. AND D. Hampton. Proc. 
Alternating Direction methods on multiprocessors: An extended abstract. Reduction of the effects of the communication delays in scientific algorithms on message passing MIMD architectures. [1696] A. 623-650. Statistical methodologies for the control of dynamic remapping. CHEN [1987]. 88. Report YALEU/DCS/RR-428. Parallel implementations of preconditioned conjugate gradient methods. [1692] J.. in Schultz [1752]. Alg. AND S. CONPAR 81. V. AND R. ed. Solving triangular systems on a parallel computer. Parallel direct methods for solving banded linear systems. Sci. Tech. 1408-1424.. SAMEH [1983]. pp. ICASE. Yale University. Solving the linear least squares problem on a linear array of processors. [1678] Y. E. VA.. SALTZ [1987]. F. SAMEH [1971]. [1680] Y. MELOSH [1983]. 192-195. [1683] Y.. pp. SCHULTZ [1986]. SADAYAPPAN AND F. SAAD AND M. [1700] A. [1688] J. Department of Computer Science. 680-691. 130-140. [1682] Y. ROMINE [1675] Y. 19—44. VOIGT AND C. [1698] A. SAMEH [1981]. Iterative methods for the solution of elliptic difference equations on multiprocessors. UTKU. 129-134.. [1694] A. [1677] Y. 14. in Birkhoff and Schoenstadt [173]. SAAD. An overview of parallel algorithms in numerical linear algebra. Parallel Computing. pp. Anal. C(l). pp. Data communications in parallel architectures. Solving Schrodinger's equation on the Intel iPSC by the Alternating Direction method. AND M. SAAD AND M. sll8-sl38. pp. Par. [1808]. Proc. JOHNSSON. Presented at the CREST Conference.D. Com- . 37. Automated problem mapping: The Crystal run-time system. SALTZ AND V. SAMEH [1981]. Comm. A. Department of Computer Science. SATED. [1679] Y. SCHULTZ [1986]. SAMEH [1977]. [1693] J. ASCE. in Kowalik [1116]. BRENT [1977]. SCHULTZ [1987]. Phys. Department of Computer Science. SAYLOR [1985]. SAMEH [1985]. H. pp. SAAD AND A. 8. SALTZ AND D. Proceedings of the 8th Conference on Electronic Computation. Report YALEU/DCS/RR-381. 6. SALTZ AND M. June. 175-186. S. Circuit System Theory. pp. pp. 
NASA Langley Research Center. Mathematical and Computational Methods in Seismic Exploration and Reservoir Modeling. 1049-1063. Report YALEU/DCS/RR-389. pp. Bulletin de la Direction des Etudes et des Recherches. G. pp. Tech. in Kuck et al. Report 87-22. SIAM J. SAAD AND M. SIAM J. [1676] Y. C-36. pp. [1686] F. Department of Computer Science. NICOL [1987]. Ho. L. Mapping finite element graphs onto processor meshes. NAIK. NICOL [1986]. SCHULTZ [1985]. SALTZ. pp. [1703] A. Report YALEU/DCS/RR-537. Illiac IV applications. SIAM J. in Heath [860]. pp. [1697] A. pp. 159-166. ERCAL [1987]. [1691] J.

A. SCHYSKA [1984]. North-Holland. Schendel. SCHONAUER AND E. H.F. [1986]. KOLAR. A new systolic array for the singular value decomposition. IEEE. pp. pp. Phys. Springer-Verlag. 422-437. IMACS. 26. Introduction to the workshop: Some bottlenecks and deficiencies of existing vector computers and their consequences for the develop- . RAITH [1982]. [1730] W. Comm. NY. Proc. [1720] R. and U. in Feilmeier [621]. in Feilmeier et al. [1725] W. KUCK [1977]. Concurrent function evaluations in local and global optimization. Presented at the 1982 Sparse Matrix Symposium. ed. 19(9). 1. 537-552. John Wiley and Sons. SCHONAUER. ed. SCHONAUER. 64. Applications of the Crau-1 for quantum chemistry calculations. R. pp. SAUNDERS AND M. KAMEL [1985]. 72. pp. The efficient solution of large linear systems resulting from the FDM for 3-D PDE's on vector computers. [1719] R. C-26. Emmen. pp. Tech. [1710] N. University of Illinois at Urbana-Champaign. 758779. Department of Computer Science. SAMBH AND D. Parallelle algorithmen in der nichtlinearen optimierung. T380-T382. ed. Conolly). Appl.. [1709] J. Numerical experiments with instationary Jacobi-OR methods for the iterative solution of linear equations. C. Meth. Digital optical computing. Comput. STRAND [1984]. SAYLOR [1987]. MULLER. M. Berlin. W. SCHAEFER [1987]. 963-974. no. Performance of the PDE black box solver FIDISOL on a CYBER 205. SANGUINETTI [1986]. Pseudo-residual type methods for the iterative solution of large linear systems on vector computers. [1722] E. eds. Leapfrog variants of iterative methods for linear algebraic equations. pp. 8191. Schittkowski. Joubert. AND E. [1715] M. Bulletin de la Direction des Etudes et des Recherches. SCHONAUER. JlN. A polyalgorithm with diagonal storing for the solution of very large indefinite linear banded systems on a vector computer. 47-55.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 187 puting. Coll. [1708] A. GENTZSCH.. 
10th IMACS World Congress on Systems Simulation and Scientific Computation. pp. Space-time tradeoffs for banded matrix problems. Bassanut. 17. 135-142. 389-395.. [1711] V. SCHONAUER. K. SCHONAUER AND W. Cornell University.. [623]. pp. The Efficient Use of Vector Computers with Emphasis on Computational Fluid Dynamics. SAMBH AND C. Feilmeier. ZAMM. AND H. pp. Preprint 161/84. SCHENDEL AND M. [1713] A. pp. ACM. 10. Amsterdam. 51-59. Center for Supercomputing Research and Development. New York. [1717] U.. 31. Report 661. J. Mech. December. Report R-87-1373. KUCK [1978]. SCHONAUER [1983]. [1723] E. Preconditioning strategies for the conjugate gradient algorithm on multiprocessors. [1987]. ScHNABEL [1985]. Parallel Computing 85. [1727] W. A polynomial based iterative method for linear parabolic equations. eds. SCHENDEL [1984]. NorthHolland. [1729] W. [1721] E. GUEST [1982]. Proc. 12 of Notes on Numerical Fluid Mechanics. Amsterdam. J. 25-30. Tech. Report EE-CEG-85-7. [1724] W. on Vector and Parallel Computing in Scientific Applications. Comput. Parallelization of PDE software for vector computers.D. Tech. [1706] A. Computers and Structures. SAMBH AND D. SCHNABEL [1987]. Comput. E. 25. SCHONAUER [1983]. Ser. [1716] U. 63. KUCK [1977]. pp. J. pp. Design of array processor software for nonlinear structural analysis. pp. Parallel computing in optimization. Department of Electrical and Computer Engineering. SCHNEPF [1986]. 357-382. [1714] P. LUK [1985]. MULLER [1985]. May. 219-230. Supercomputer Applications. SCHNEPF. SAMBH AND D. 193198. SCHONAUER AND K.. Parallel direct linear system solvert — A survey. First Intern. pp. 1.. vol. W. North-Holland. H. IEEE Trans. 326-328. New York. SAVAGE [1984]. Computational Mathematical Programming. pp. pp. Computer. University of Illinois at UrbanaChampaign. Proc. Freie Universitat Berlin. A parallel QR algorithm for symmetric tridiagonal matrices. [1707] A. B. Supercomputer. [1718] D. (Translator. MULLER [1985]. AND H. W. 
SCHIMMEL AND F. Engrg. SCHNEPF. TAFT [1982]. [1712] J. vol. SCHONAUER [1983]. Scientific Computing on Vector Computers. SCHNEPF [1986]. SARIGUL. [1726] W. AND H. 20.. M. ACM. Introduction to Numerical Methods for Parallel Computers. SCHNEPF AND W. pp. On stable parallel linear system solvers. Fachbereich Mathematik. Applications of the PDE solver FIDISOL on different vector computers. [1705] A. Performance of a message based multiprocessor. 21-28. John Wiley and Sons. ed. 147-153. [1728] W. SAWCHUK AND T.

ScHWANDT [1985]. May. Vieweg. Luk and Van Loan for the symmetric eigenvalue and singular value problems. The Efficient Use of Vector Computers with Emphasis to Computational Fluid Dynamics. 6. AND K. WIETSCHORKE [1987].. Prentice-Hall. New York. Royal Institute of Technology. SCHREIBER [1983]. pp. Comput. SCHULTZ [1984]. Inc. A storage efficient WY representation for products of Householder transformations. SCHREIBER [1982]. W. vol. SCHREIBER AND P. Los Angeles. Solving eigenvalue and singular value problems on an undersized systolic array. Comp. SADAYAPPAN. G. SCHREIBER [1988]. pp. Systolic linear algebra machines in digital signal processing. 13. SPIE Symp. 189-205. PDE software for vector computers. RATTH [1984]. H. M. W. R. 258-267. pp. Department of Numerical Analysis and Computer Science. in Control Data Corporation [411]. 25. 341. Conference of the IFIP Working Group 2. in Birkhoff and Schoenstadt [173]. Cholesky factorization by systolic array. 37. SCHREIBER [1986]. R. S. Newton-like interval methods for large nonlinear systems of equations on vector computers. 333-349. pp. On the systolic arrays of Brent. SCHONAUER. FIDISOL: A black box solver for partial differential equations.. Multiple array processors for ocean acoustic problems. Prentice-Hall. Submitted to 1987 Meeting of ASME. E. Statist. pp. Software considerations for the "black box" solver FIDISOL for partial diferential equations. 459-487. SCHONAUER AND E. Proc. Springer-Verlag. 10. 26. R. ZAMM. 77-92. Modularization of PDE software for vector computers. 4. K.. Comput. SCHNEPF. Report TRITA-NA-8311. Designing PDE software for vector computers as a data flow algorithm. BAUMAN. Systolic arrays for eigenvalue computation. pp. Tech. W. MULLER [1984]. SCHONAUER AND E. AND K.. M. SCHREIBER AND W. 185-194. ERCAL [1987]. SCHNEPP. Paris. pp. eds. 13 of Volumes in Mathematics and its Applications.. 53-57. ORTEGA. Tech. SCHWANDT [1987]. Phy. Inc. Signal Processing. 141-154. pp.. Yale University. H. SIAM J. 
On systolic array methods for band matrix factorizations.. R. SCHREIBER [1983]. SCHONAUER AND H. R. Bo. Report YALEU/DCS/RR-363. SCHONAUER. SCHONAUER. [1731] [1732] [1733] [1734] [1735] [1736] [1737] [1738] [1739] [1740] [1741] [1742] [1743] [1744] [1745] [1746] [1747] [1748] [1749] [1750] [1751] [1752] [1753] [1754] [1755] [1756] [1757] . SIAM J. USC Workshop on VLSI and Modern Signal Processing. pp. NJ. [1981]. R. 441—451. VOIGT AND C. KUEKES [1982]. NY. Numer. Department of Computer Science. February. March.. SCHREIBER [1984]. pp. Schonauer and W. First Intern. Interval arithmetic block cyclic reduction on vector computers. R. SIAM J. E. ACM Trans. Berlin. Proc.188 J. Department of Computer Science. E. Systolic linear algebra machines: A survey. BIT. Comput. ROMINE ment of general PDE software. W.. Mapping parallel applications to a hypercube. M. Comm. AND H. 197-207. pp. R. Sci. SCHREIBER [1983]. Softw. East 1982. 7. SCHNBPF. R. Block reflectors: Theory and computation. Solving elliptic problems on an array processor system. pp. ed. Phys. RAJTH [1983]. MOLLER [1985]. AND H. M. E. Statist. College Vector and Parallel Computing in Scientific Appl. W. Numerical Algorithms for Modern Parallel Computer Architectures. M. Rensselaer Polytechnic Institute. SCHULTZ. R. T309-T312. 223-232. Schultz. W. Par. Tech. SCHREIBER [1987]. R.5 on Numerical Software.. Englewood Cliffs. Vectorizing the conjugate gradient method. N. R. W. Anal. Comp. in Heath [860]. SCHNBPP [1988]. 303-316. Braunschweig. The questions of accuracy. Sci. pp. Real Time Signal Processing V. SCHREIBER AND B. Report 87-14. in Vichnevetsky and Stepleman [1920]. VAN LOAN [1989]. W. H. SCHNBPP [1987]. SCHREIBER AND C. R. Proc. ed. SCHULTZ [1985]. Systolic arrays: High performance parallel machines for matrix computation. Hay kin. SCHONAUER. Computing generalized inverses and eigenvalues of symmetric matrices using systolic arrays. 
The redesign and vectorization of the SLDGL-program package for the self-adaptive solution of nonlinear systems of elliptic and parabolic PDE's. SCHREIBER [1986]. Proceedings of the Sixth International Conference on Computer Methods in Science and Engineering. Elliptic Problem Solvers. Academic Press. Comm. SCHWAN. R. pp. ed. 233-237. W.. TANG [1982]. R. PARLETT [1988]. 1-34. Block algorithms for parallel machines. Math. J. NJ. P.. 187-194. Parallel Computing. Englewood Cliffs. SCHNEPP. pp. AND F. SCHREIBER [1987]. geometrical flexibility and vectorability for the FDM. pp. in Birkhoff and Schoenstadt [173]. Sweden. 37. A systolic architecture for singular value decomposition. Dist. Gentzsch. 64.

KAPUR. 341-344. Courant Institute. WIRTH [1980]. 35-48. pp. 4. AFIPS Conf. [1785] D. J. SCOTT. [1784] D. CHARLU. AND H. Int. CA. Comm.. MATISOO [1984]. HANKEY. MANHARDT [1985]. [1788] M. RILEY [1987]. Computational Processes and Systems. P. 26. pp. SHEDLER AND M. Report 206. Comput. [1760] J. BUNING. Appl. Phys. IEEE Trans. SCHWIEGELSHOHN AND L. [1770] M. Numer. Math. 6. SCHWARTZ [1983]. C-33. SNIR [1986]. Distributed data structures for scientific computation.. SCHWANDT [1987]. D. Lang. [1764] M. [1778] A. SHAH [1980]. 3. 424—432. 1247-. ACM. Department of Computer Science. Comm. in Schultz [1752].. 143-154. Comput.. Lin. [1768] S. SEAGER [1986]. SEJNOWSKI. Appl. based on 55 designs. Design and implementation of a VLSI . 273-279. CONPAR 81 Conf. H. J. Frontiers of Supercomputing. in VLSI. pp. On the choice of discretization for solving PDE's on a multi-processor. [1772] C. 129139. [1762] D. Comput. Sys. Tech. P. Further analysis of the QIF method. BOULDIN. Report CSDG-80-1. Parallel Computing. Syst. pp. 55-66. pp. University of California Press. Comm. Program. [1782] D. & Appl. 493-507. R. SEAGER [1986]. 130-135. SCOTT. AND M. Artech House. M. [1769] M. [1765] N. AND P. pp. SEITZ [1984]. Penfield. 18. AND G. Springer-Verlag. [1775] C. MILLIGAN. [1786] G. [1766] R. 135-150. ed. 345-355. N.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 189 [1758] H. AND J. SHELL. ScOTT [1986]. Comp.. IBM Systems J. Izdatel'stvo Nauka. 286-291. SHARP [1987]. [1777] M. 83-98. 142-149. Parallel numerical methods for the solution of equations. E. pp. An Introduction to Distributed and Parallel Processing. in Heath [860]. 77. Tech. SHASHA AND M. [1761] U. A systolic array for cyclic-by-rows Jacobi algorithms. Overhead considerations for parallelizing conjugate gradient. Ultracomputer Laboratory. Avoiding the square-root bottleneck in the Choleski factorization of a matrix on a parallel computer. NCC. Parallel Computing.. pp. eds. LIPOVSKI [1980]. 2. WORLTON. Math. 
BAGHERI [1987]. The parallel computation of Racah coefficients using transputers. pp. [1773] C. ScOTT. 22-33. Group broadcast mode of interprocessor communications for the finite element machine. 28. New variants of the Quadrant Interlocking Factorization (QIF) method. 77. Blackwell Scientific Publications. 37(5). UPCHURCH. ed. Dist. D. HALEY [1988]. [1787] G. An interval arithmetic method for the solution of nonlinear systems of equations on a vector computer. The computing structures of algorithms and VLSI-based computer architecture. SIMD and MSIMD variants of the NON-VON supercomputer. [1781] J. [1763] D. pp. pp. Parallelizing conjugate gradient for the CRAY X-MP. Parallel block Jacobi eigenvalue algorithms using systolic arrays. SEITZ [1984]. Conf. Moscow. pp. [1780] J. 46. SEITZ [1982]. AND R. SEITZ [1985]. [1771] S. pp. AND J. MIT Conf. HEATH. [1783] J. [1776] C. Handler. Alg. W. 334-340. on Advanced Res. ACM Trans. pp. [1986]. An overview of the Texas Reconfigurable Array Computer. The cosmic cube.. London. Lin. Some experiments in multitasking on an ELXSI system 6400. pp. March. pp.. SCHWARTZ [1980]. P. Ultracomputer Note 69. pp. 419-422. DEMUTH. in Kartashev and Kartashev [1055]. Comparison of parallel SOR algorithms for solution of sparse matrix problems. A taxonomic table of parallel computer*. Lecture Notes in Computer Science III. pp. pp. University of Colorado. [1767] R.. ACM. pp. Evaluation of redundancy in a parallel algorithm. 360-363. 38-45. pp. Physics Today. pp. METROPOLIS. SHAW [1984]. W. THIELE [1987]. 631-641. New York University. SCOTT AND G. [1759] J. Meth. SEITZ AND J. 10. SCOTT.. 10731079. AIAA J. SHANG. LEHMAN [1967]. SHEDLER [1967].. & Comp. Alg. Comm. Soc. Proc. COMPCON 84. Concurrent VLSI architectures. MONTRY [1988]. Par. 484-521. EVANS [1982]. Berkeley. WARD [1986]. New York University. Performance of a vectorized three-dimensional Navier-Stokes code on the CRAY-1 computer.. 2. 323-338. pp. EVANS [1981].. Proc. 
Efficient and correct execution of parallel programs that share memory. Engineering limits on computer performance. & Appl. J. BOYLE.. Experiments with VLSI ensemble machines. SEDUKEIN [1985]. Proc. SHANEHCHI AND D. 11.. AND B. [1774] C. 4. Ultracomputcrs. IEEE Comp. [1779] J. VLSI and Comp.. 1(3). SCOTT [1981]. J. SHANEHCHI AND D. SHARP. Proc. pp. Ensemble architectures for VLSI — A survey and taxonomy.

An approach to complexity: Numerical computations. IMACS. 15(1). R. VOIGT AND C. 97-107. pp. IEEE Trans. NV. AND G. H. [1789] [1790] [1791] [1792] [1793] [1794] [1795] [1796] [1797] [1798] [1799] [1800] [1801] [1802] [1803] [1804] [1805] [1806] [1807] [1808] [1809] [1810] [1811] [1812] [1813] [1814] [1815] [1816] [1817] . D. August. 319-331. pp. University of Illinois at Urbana-Champaign. AND C. 1. KELLER. pp. SNYDER [1982]. SOUTH. Interconnection networks for SIMD machines. pp. BORCK. An inquiry into the benefits of multigauge parallel computation. SMITH AND J. IEEE Trans. R. SORENSEN [1985]. 108-114. L. Computer. M. Benchmark of the Convex C-l mini supercomputer.190 J. Op. SNYDER [1985]. 1978 Int. AND P. LlN. AND H. H. C-32. Buffering for vector performance on a pipelined MIMD machine. Incomplete LU preconditioners for conjugate gradient type iterative methods.. 7th Symp. AFIPS. The Roscoe operating system. The multistage cube: A versatile interconnection network. SRINIVAS [1983]. R. Optimal parallel scheduling of Gaussian elimination DAG's. Orlando. NISHTOA [1984]. L. Proc. J. pp.. 786-792. 1985 Int. SOLL. SNYDER. JAMIESON. 403-408. H. U. 143-164. AIAA. 23rd Aerospace Sciences Meeting. SlEGEL [1985]. Los Alamos National Laboratory. J. on Reservoir Simulation. Paper 85-0366. SORENSEN [1984]. Preprint. 311-312. Advances in Computer Methods for Partial Differential Equations-III. 274-278. L. HABRA. Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies. AND K. 33-48. 5765. AND M. TX. ROMINE systolic array for solving nonlinear parallel differential equations. Conf. Department of Computer Science. eds. State-of-the-art in parallel computing. in Noor [1450]. Proc. H. Proc. SKEEL [1987]. pp.Caughey NYU transonic swept-wing computer program FLO-22-VI for the STAR-100 computer. SOUTH [1985]. Performance measurements for evaluating algorithms for SIMD machines. pp. Report 700. S. B. KELLER. Parallel Computing. S. HAFEZ [1980]. 
Analysis of pair wise pivoting in Gaussian elimination. LAMBIOTTE [1978]. pp. H. SlMON [1985]. AND R. 8th SPE Symp. N. SNYDER [1986]. 96-98. An architecture of a data flow computer and its evaluation. C-34. W. SOLOMON AND R. A vectorization of the Jameson. Optimising finite element matrix calculations using the general technique of element vcctorization. J. pp. SHBU. ARO Report 80-3. D. FJCC. Tech. shared resource MIMD computer. in Kartashev and Kartashev [1055]. Rept.. SLOTNICK. 22. SMITH. J. Computational transonics on a vector computer. Conf. 6. D. AND J. M. GANNON. SWAIN [1982]. pp. Proc. February. pp. 486-490. Report LA-10005. SE-8. SHIM AD A. 245-252. 228. T. L. AIAA J. J. J. SlEGEL AND R. LUBECK [1986]. SIMMONS AND O. [1985]. NASA Langley Research Center. Vector processor algorithms for transonic flow calculations. HIRAKI. Reno. ORTEGA. pp. Proc. pp. pp. SMITH [1978]. MECA: A supercomputer for Monte Carlo. J. 65-76. IEEE Comp. November. D. Proc. pp. Science. 12(6). RUSSELL [1977]. FL. Introduction to the configurable highly parallel computer. SMARR [1985]. pp. 47-56. 6-8. SOLEM [1984]. Proc. Conf. August.. Par. P. NASA Tech. COMPCON 84. Soc. AND M. Proc. Recent advances in computational aerodynamics. PITTS. SlEGEL. An efficient parallel algorithm of conjugate gradient method. Comput. [1133]. Computer. Algorithmically Specialized Parallel Computers.. DAS [1987].. Tech. Academic Press. Lexington Books. MCREYNOLDS [1962]. D. The solution of the three-dimensional viscous compressible Navier-Stokes equations on a vector computer. FINKEL [1979]. shared memory and the corollary of modest potential. L. Computer. Report LA-UR-86-2890. The SOLOMON computer. M. pp. SlEGEL [1979]. Experience with a vectorized general circulation climate model on STAR-100. SIEWIOREK [1983]. Comput. pp. Proc. Princ. Sys. pp. W. Dallas. in Kuck et al.. H. pp. Par. Par. PITTS [1979]. Softw. Los Alamos National Laboratory. Conf. L. 1109-1117. Army Numerical Analysis and Computers Conference. 
K. SIEGEL. A pipelined. 1985 Int. R.. G. Proc. D. Waveform iteration and the shifted Picard splitting. IEEE Trans. 157-164. M. Tech. SOUTH. SILVESTER [1988]. L. TM-78665. SlEGEL. Eng. 488-497. McMlLLEN [1981]. 18. 488-496. Parallel Computing. Type architectures.. 14. HAFEZ [1980].

SULLIVAN AND T. pp. CROCKETT. [1844] P. A large scale. STEVENS [1979]. Proc. 45-63. News. VA. STONE [1971]. STORAASLI. Introduction to Computer Architecture. S. AND R. Numerical aerodynamics simulation facility project. Proc. AND L. Inc. BASHKOW [1977]. [1836] J.. pp. in Rodrigue [1643]. Structural Dynamics and Materials Conf. [1821] K. in Jesshope and Hockney [976]. [1825] H. 6th SPE Symposium Reservoir Simulation. 27-38. in McCormick [1312]. pp. 7. [1831] O. IEEE Spectrum. second ed. Addison-Wesley. Performance comparisons for reservoir simulation problems on three supercomputers. NASA Langley Research Center. [1829] H. [1847] P. Parallel Computing. FULTON [1984]. 201-217. 2245. pp. pp. SWAN. pp. [423]. NASA Conf. Hampton. Parallel Computing. Efficient algorithms for pipeline and parallel computers. PEEBLES. AFIPS Nat. [1826] H. pp. pp. Arch. 115-121. 5. A parallel implementation of the QR-algorithm. STEFFEN [1988]. Superpower computers. Arch. Comput. STEVENS [1975]. SE-3. Computer Conf. 14671484.. 251-267. Hampton. of Conf. 1. SIGPLAN Notices. in Structures and Solid Mech. Theory Appl. 72-80. The Finite Element Machine: An experiment in parallel processing. [1835] J. [1820] B. [1832] P. Comput. Su AND A. STRINGER [1982]. NASA Technical Note D-7329. A parallel Jacob son. A comparative survey of concurrent programming languages. pp. 89-104.. AND D. [1839] H. pp. [1830] O. STONE [1975]. IEEE Trans. Sigplan Notices.. pp. Proc. Comput. in Jesshope and Hockney [976]. STRAETER [1973]. 33. A time split difference scheme for the compressible Navier-Stokes equations with applications to flows in slotted nozzles. 4th Annual Symp.. STONE [1973]. VA. on Res. STOTTS [1982]. 153-161. pp. Parallel tridiagonal equation solvers. NOLEN [1982]. J. 185-199. STANAT AND J. T. Math. NASA Langley Research Center. 1. Inc. 55-64. A large scale homogeneous fully distributed parallel machine. FULLER. H. [1822] K. Cm* — A modular multi-microprocessor. 
Structural dynamic analysis on a parallel computer: The Finite Element Machine. [1837] S. 39.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 191 [1818] P. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. The solution of tridiagonal systems on the CRAY-1.. pp. pp. MARKOS [1975]. SWARZTRAUBER [1983]. 465-488. [1842] R. homogeneous. SULLIVAN AND T. S. KNOTT. SWARZTRAUBER [1979]. Math. Multiprocessor scheduling with the aid of network flow algorithms. ed. IEEE Trans. FFT algorithms for vector computers. 5.. 28-34. [1843] P. ACM Trans. [1838] R. CFD — A Fortran-like language for the Illiac IV. . High Performance Computer Architecture. STEFFEN [1988]. VA. 105-117. pp. Parallel Computing. Comp. SWARZTRAUBER [1984]. Optim. Nongradient minimization methods for parallel processing computers. 187-196. C-36. 105-117. NASA Langley Research Center. Comp. Pub. New York. Science Research Associates. pp. Hampton. Montvale. NASA Technical Note D-8020. C-20. [1845] P.. Efficiency of D4 Gaussian elimination on a vector computer. [1846] P. [1828] H. RANSON. [1819] B. pp. [1823] G. SUGARMAN [1980]. 289-307. AFIPS Press. 85-94. pp. J. [1827] H. STRAETER AND A. pp. 84-0966-CP. 637-644. pp. Parallel processing with the perfect shuffle. SUTTI [1983]. fully distributed parallel machine. 343-358. NJ. 25th AIAA Structures. STORAASLI. 17(9). CA. 20. STONE [1980]. THAKORE [1987]. [1840] H. pp. Implementation of a resonant cavity package on MIMD computers. Softw.. SWARZTRAUBER [1979]. Matrix operations on a multicomputer system with switchable main memory modules and dynamic control. IEEE Trans. [1834] T. 76-87. Palm Springs... STONE [1987].. Softw. [411]. ADAMS [1982]. 331-342. A parallel variable metric optimization algorithm. pp.Oksman optimization algorithm. Eng. J.. Vectorizing the FFTs. A parallel algorithm for solving general tridiagonal equations. Parallel computation. [1833] T. 
Multigrid methods for calculation of electromagnets and their implementation on MIMD computers. STRIKWERDA [1982]. 363-425. STEWART [1987]. J. pp. in Noor [1450]. SWARZTRAUBER [1982]. 51-83. STONE [1977]. pp. in Rodrigue [1643]. SIEWIOREK [1977]. [1841] C. [1824] H. Also in Control Data Corp. in Cray Research. Stone. ACM. BASHKOW [1977].

[1848] P. SWARZTRAUBER [1987]. Multiprocessor FFTs, Parallel Computing, 5, pp. 197-210.
[1849] R. SWEET [1987]. A parallel and vector variant of the cyclic reduction algorithm, Supercomputer, 22, pp. 18-25.
[1850] R. SWEET [1988]. A parallel and vector variant of the cyclic reduction algorithm, SIAM J. Sci. Statist. Comput., 9, pp. 761-765.
[1851] J. SWISSHELM AND G. JOHNSON [1985]. Numerical simulation of three-dimensional flowfields using the CYBER 205, in Numrich [1469], pp. 179-195.
[1852] J. SWISSHELM, G. JOHNSON, AND S. KUMAR [1986]. Parallel computation of Euler and Navier-Stokes flows, Appl. Math. & Comp., 19(1-4), pp. 321-332. (Special Issue, Proceedings of the Second Copper Mountain Conference on Multigrid Methods, Copper Mountain, CO, S. McCormick, ed.).
[1853] C. TAFT [1982]. Preconditioning strategies for solving elliptic equations on a multiprocessor, Tech. Report, Department of Computer Science, University of Illinois at Urbana-Champaign.

[1854] H.-M. TAI AND R. SAEKS [1984]. Parallel system simulation, IEEE Trans. Syst. Man. Cybern., SMC-14, pp. 177-183.

[1855] Y. TAKAHASHI [1982]. Partitioning and allocation in parallel computation of partial differential equations, Proc. 10th IMACS World Congress on Systems Simulation and Scientific Computation, vol. 1, pp. 311-313.
[1856] W. TANG [1986]. Schwartz Splitting, A Model for Parallel Computations, PhD dissertation, Stanford University, Department of Computer Science.
[1857] O. TELEMAN AND B. JONSON [1986]. Vectorizing a general-purpose molecular dynamics simulation program, J. Comp. Chem., 7, pp. 58-66.
[1858] C. TEMPERTON [1979]. Direct methods for the solution of the discrete Poisson equation: Some comparisons, J. Comp. Phys., 31, pp. 1-20.
[1859] C. TEMPERTON [1979]. Fast Fourier transforms and Poisson solvers on CRAY-1, in Jesshope and Hockney [976], pp. 359-379.
[1860] C. TEMPERTON [1979]. Fast Fourier transforms on CRAY-1, Tech. Report 21, European Centre for Medium Range Weather Forecasts.
[1861] C. TEMPERTON [1980]. On the FACR(l) algorithm for the discrete Poisson equation, J. Comp. Phys., 34, pp. 314-329.
[1862] C. TEMPERTON [1984]. Fast Fourier transforms on the CYBER 205, in Kowalik [1116], pp. 403-416.
[1863] C. TEMPERTON [1988]. Implementation of a prime factor FFT algorithm on CRAY-1, Parallel Computing, 6, pp. 99-108.
[1864] G. TENNILLE [1982]. Development of a one-dimensional stratospheric analysis program for the CYBER 203, in Control Data Corporation [411].
[1865] A. THAKORE AND S. SU [1987]. Matrix inversion and LU decomposition on a multicomputer system with dynamic control, in Kartashev and Kartashev [1055], pp. 291-301.
[1866] C. THOLE [1988]. Parallel multigrid algorithms on a message-based MIMD system, in McCormick [1312].
[1867] C. THOLE [1988]. The SUPRENUM approach: MIMD architecture for multigrid algorithms, in McCormick [1312].
[1868] W. THOMAS AND E. LEWIS [1983]. Two vectorized algorithms for the solution of three-dimensional neutron diffusion equations, Nuc. Sci. Eng., 84, pp. 67-71.
[1869] W. THOMPKINS AND R. HAIMES [1983]. A minicomputer/array processor/memory system for large-scale fluid dynamic calculations, in Noor [1450], pp. 117-126.
[1870] K. THURBER [1976]. Large Scale Computer Architectures: Parallel and Associative Processors, Hayden Book Co.
[1871] K. THURBER AND L. WALD [1975]. Associative and parallel processors, ACM Computing Surveys, 7, pp. 215-245.
[1872] G. THURSTON [1987]. A parallel solution for the symmetric eigenproblem, Tech. Report NASA-TM-89082, NASA Langley Research Center, Hampton, VA.
[1873] J. TIBERGHIEN, ed. [1984]. New Computer Architectures, Academic Press, Orlando, FL.
[1874] D. TOLLE AND W. SIDDALL [1981]. On the complexity of vector computations in binary tree machines, Inform. Process. Lett., 13, pp. 120-124.
[1875] S. TOMBOULIAN, T. CROCKETT, AND D. MIDDLETON [1988]. A visual programming environment for the Navier-Stokes computer, ICASE Report 88-6, NASA Langley Research Center, Hampton, VA.
[1876] J. TRAUB, ed. [1974]. Complexity of Sequential and Parallel Numerical Algorithms, Academic Press.
[1877] J. TRAUB [1974]. Iterative solution of tridiagonal systems on parallel or vector computers,

in Traub [1876], pp. 49-82.
[1878] R. TRAVASSOS AND H. KAUFMAN [1980]. Parallel algorithms for solving nonlinear two-point boundary-value problems which arise in optimal control, J. Optim. Theory Appl., 30, pp. 53-71.
[1879] P. TRELEAVEN [1979]. Exploiting program concurrency in computing systems, Computer, 12(1), pp. 42-50.
[1880] P. TRELEAVEN [1984]. Decentralised computer architecture, in Tiberghien [1873].

[1881] S. TRIPATHI, S. KABLER, S. CHANDRAN, AND A. AGRAWALA [1986]. Report of the workshop on design and performance issues in parallel architectures, Tech. Report CS-TR-1705, Department of Computer Science, University of Maryland, September.
[1882] J. TUAZON, J. PETERSON, M. PNIEL, AND D. LIEBERMAN [1985]. Caltech/JPL Mark II hypercube concurrent processor, Proc. 1985 Int. Conf. Par. Proc., pp. 666-673.

[1883] L. TUCKER AND G. ROBINSON [1988]. Architecture and applications of the Connection Machine, Computer, 21(8), pp. 26-38. [1884] L. UHR [1984]. Algorithm Structured Computer Arrays and Networks, Academic Press, Orlando, FL. [1885] J. ULLMAN [1983]. Some thoughts about supercomputer organization, Tech. Report STAN-CS-83-987, Department of Computer Science, Stanford University, October. [1886] S. UNGER [1958]. A computer oriented towards spatial problems, Proc. IRE, 46, pp. 1744-1750. [1887] S. UTKU, Y. CHANG, M. SALAMA, AND D. RAPP [1986]. Simultaneous iterations algorithm for generalized eigenvalue problems on parallel processors, Proc. 1986 Int. Conf. Par. Proc., pp. 59-66. [1888] S. UTKU, M. SALAMA, AND R. MELOSH [1986]. Concurrent Cholesky factorization of positive definite banded Hermitian matrices, Int. J. Num. Meth. Eng., 23, pp. 2137-2152. [1889] M. VAJTERSIC [1979]. A fast parallel method for solving the biharmonic boundary value problem on a rectangle, Proc. First European Conference on Parallel Distributed Processing, Toulouse, pp. 136-141. [1890] M. VAJTERSIC [1981]. Solving two modified discrete Poisson equations in 7 log n steps on n^2 processors, CONPAR 81, pp. 473-432. [1891] M. VAJTERSIC [1982]. Parallel Poisson and biharmonic solvers implemented on the EGPA multiprocessor, Proc. 1982 Int. Conf. Par. Proc., pp. 72-81. [1892] M. VAJTERSIC [1984]. Parallel marching Poisson solvers, Parallel Computing, 1, pp. 325-330. [1893] R. VAN DE GEIJN [1987]. Implementing the QR-Algorithm on an Array of Processors, PhD dissertation, University of Maryland, Department of Computer Science. Also Tech. Report TR-1897, Department of Computer Science, University of Maryland, August. [1894] E. VAN DE VELDE AND H. KELLER [1987]. The design of a parallel multigrid algorithm, in Kartashev and Kartashev [1055], pp. 76-83. [1895] H. VAN DER VORST [1982]. A vectorizable variant of some ICCG methods, SIAM J. Sci. Statist. Comput., 3, pp. 350-356. [1896] H. VAN DER VORST [1983].
On the vectorization of some simple ICCG methods, First Int. Conf. Vector and Parallel Computation in Scientific Applications, Paris. [1897] H. VAN DER VORST [1985]. Comparative performance tests of Fortran codes on the CRAY-1 and CYBER 205, Parallel Computers and Computations, J. van Leeuwen and J. Lenstra, eds., CWI, Amsterdam. CWI Syllabus 9. [1898] H. VAN DER VORST [1986]. Analysis of a parallel solution method for tridiagonal systems, Tech. Report 86-06, Department of Mathematics and Information, Delft University of Technology. [1899] H. VAN DER VORST [1986]. (M)ICCG for 2D problems on vector computers, Tech. Report 86-55, Department of Mathematics and Information, Delft University of Technology. [1900] H. VAN DER VORST [1986]. The performance of Fortran implementations for preconditioned conjugate gradients on vector computers, Parallel Computing, 3, pp. 49-58. [1901] H. VAN DER VORST [1987]. Analysis of a parallel solution method for tridiagonal linear systems, Parallel Computing, 5, pp. 303-311. [1902] H. VAN DER VORST [1987]. ICCG and related methods for 3D problems on vector computers, Tech. Report A-18, Data Processing Center, Kyoto University, Japan. [1903] H. VAN DER VORST [1987]. Large tridiagonal and block tridiagonal linear systems on vector and parallel computers, Parallel Computing, 5, pp. 45-54. [1904] H. VAN DER VORST AND J. VAN KATS [1983]. Comparative performance tests on the CRAY-1 and Cyber 205. Preprint, May. [1905] P. VAN LAARHOVEN [1985]. Parallel variable metric algorithms for unconstrained optimization, Math. Programming, 33, pp. 69-81. [1906] C. VAN LOAN [1986]. The block Jacobi method for computing the singular value decomposition, Computational and Combinatorial Methods in Systems Theory, C. Byrnes and A. Lindquist, eds., Elsevier Science Publishers B.V. (North-Holland), Amsterdam, pp. 245-255. [1907] C. VAN LOAN [1988]. A block QR factorization scheme for loosely coupled systems of array processors, Numerical Algorithms for Modern Parallel Computer Architectures, M. Schultz, ed., vol. 13 of IMA Volumes in Mathematics and its Applications, Springer-Verlag, Berlin, pp. 217-232. [1908] R. VAN LUCHENE, R. LEE, AND V. MEYERS [1986]. Large scale finite element analysis on a vector processor, Computers and Structures, 24, pp. 625-635. [1909] J. VAN ROSENDALE [1983]. Algorithms and data structures for adaptive multigrid elliptic solvers, Appl. Math. & Comp., 13(3-4), pp. 453-470. (Special Issue, Proceedings of the First Copper Mountain Conference on Multigrid Methods, Copper Mountain, CO, S. McCormick and U. Trottenberg, eds.). [1910] J. VAN ROSENDALE [1983]. Minimizing inner product data dependencies in conjugate gradient iteration, Proc. 1983 Int. Conf. Par. Proc., pp. 44-46. [1911] J. VAN ROSENDALE AND P. MEHROTRA [1985]. The BLAZE language: A parallel language for scientific programming, ICASE Report 85-29, NASA Langley Research Center, Hampton, VA. [1912] F. VAN SCOY [1977]. Some parallel cellular matrix algorithms, Proc. ACM Comp. Sci. Conf. [1913] E. VAN WEZENBECK AND W. RAVENEK [1987]. Vectorization of the natural logarithm on the Cyber 205, Supercomputer, 19, pp. 37-42. [1914] S. VANKA [1987]. Vectorized multigrid fluid flow calculations on a CRAY X-MP/48, Int. J. Num. Meth. Fluids, 7, pp. 635-648. [1915] C. VAUGHAN AND J. ORTEGA [1987]. SSOR preconditioned conjugate gradient on a hypercube, in Heath [860], pp. 692-705. [1916] S. VAVASIS [1986]. Parallel Gaussian elimination, Tech. Report CS 367A, Department of Computer Science, Stanford University, Stanford, CA. [1917] A. VEEN [1986]. Dataflow machine architecture, ACM Computing Surveys, 18, pp. 365-396.

[1918] V. VENKAYYA, D. CALAHAN, P. SUMMERS, AND V. TISCHLER [1983]. Structural optimization on vector processors, in Noor [1450], pp. 155-190.

[1919] C. VERBER [1985]. Integrated optical architecture for matrix multiplication, Optical Engineering, 24, pp. 19-25. [1920] R. VICHNEVETSKY AND R. STEPLEMAN, eds. [1984]. Advances in Computer Methods for Partial Differential Equations - V, Proceedings of the Fifth IMACS International Symposium, New Brunswick, Canada. [1921] R. VICHNEVETSKY AND R. STEPLEMAN, eds. [1987]. Advances in Computational Methods for Partial Differential Equations - VI, Proceedings of the Sixth IMACS International Symposium, New Brunswick, Canada. [1922] V. VOEVODIN [1985]. Mathematical problems in the development of supercomputers, Computational Processes and Systems, Izdatel'stvo Nauka, Moscow, pp. 3-12. [1923] V. VOEVODIN [1986]. Mathematical Models and Methods for Parallel Processes, Izdatel'stvo Nauka, Moscow. [1924] R. VOIGT [1977]. The influence of vector computer architecture on numerical algorithms, in Kuck et al. [1133], pp. 229-244. [1925] R. VOIGT, D. GOTTLIEB, AND M. HUSSAINI, eds. [1984]. Spectral Methods for Partial Differential Equations, Society for Industrial and Applied Mathematics, Philadelphia, PA. [1926] R. VOITUS [1981]. A multiple process software package for the Finite Element Machine, Tech. Report, Department of Computer Science, University of Colorado. [1927] J. VOLKERT AND W. HENNING [1986]. Multigrid algorithms implemented on EGPA multiprocessor, Proc. 1986 Int. Conf. Par. Proc., pp. 799-805. [1928] J. VON NEUMANN [1966]. A system of 29 states with a general transition rule, Theory of Self-Reproducing Automata, A. Burks, ed., University of Illinois Press, pp. 305-317. [1929] D. VRSALOVIC, D. SIEWIOREK, A. SEGALL, AND E. GEHRINGER [1984]. Performance prediction for multiprocessor systems, Proc. 1984 Int. Conf. Par. Proc., pp. 139-146. [1930] D. VU AND C. YANG [1988]. Comparing tridiagonal solvers on the CRAY X-MP/416 system, CRAY Channels, 9(4), pp. 22-25. [1931] E. WACHSPRESS [1984]. Navier-Stokes pressure equation iteration, in Birkhoff and Schoenstadt [173], pp. 315-322. [1932] R. WAGNER [1983]. The Boolean vector machine, 1983 IEEE Conference Proc. 10th Annual Int. Symp. Comp. Arch., pp. 59-66.


[1933] R. WAGNER [1984]. Parallel solution of arbitrarily sparse linear systems, Tech. Report CS-1984-13, Department of Computer Science, Duke University. [1934] R. WAGNER AND M. PATRICK [1988]. A sparse matrix algorithm on the Boolean vector machine, Parallel Computing, 6, pp. 359-372. [1935] D. WALKER, G. FOX, A. HO, AND G. MONTRY [1987]. A comparison of the performance of the Caltech Mark II hypercube and the Elxsi 6400, in Heath [860]. [1936] Y. WALLACH [1982]. Alternating sequential-parallel calculation of eigenvalues for symmetric matrices, Computing, 28, pp. 1-16. [1937] Y. WALLACH [1984]. On two more eigenvalue methods for an alternating sequential parallel system, Computing, 32, pp. 33-42. [1938] Y. WALLACH AND V. CONRAD [1976]. Parallel solution of load flow problems, Arch. Elektrotechnik, 57, pp. 345-354. [1939] Y. WALLACH AND V. CONRAD [1980]. On block parallel methods for solving linear equations, IEEE Trans. Comput., C-29, pp. 354-359. [1940] J. WALLIS AND J. GRISHAM [1982]. Petroleum reservoir simulation on the CRAY-1 and on the FPS-164, Proc. 10th IMACS World Congress on Systems Simulation and Scientific Computation, vol. 1, pp. 308-310. [1941] J. WALLIS AND J. GRISHAM [1982]. Reservoir simulation on the CRAY-1, in Cray Research, Inc. [423], pp. 122-139. [1942] A. WALLQVIST, B. BERNE, AND C. PANGALI [1987]. Exploiting physical parallelism using supercomputers: Two examples from chemical physics, Computer, 20(5), pp. 9-21. [1943] S. WALTON [1987]. Performance of the one-dimensional fast Fourier transform on the hypercube, in Heath [860], pp. 530-538. [1944] H. WANG [1981]. A parallel method for tridiagonal equations, ACM Trans. Math. Softw., 7, pp. 170-183. [1945] H. WANG [1982]. On vectorizing the fast Fourier transform, BIT, 20, pp. 233-243. [1946] H. WANG [1982]. Vectorization of a class of preconditioned conjugate gradient methods for elliptic difference equations, Tech. Report, IBM Scientific Center, Palo Alto, CA. [1947] W. WARE [1973].
The ultimate computer, IEEE Spectrum, 10(3), pp. 89-91. [1948] H. WASSERMAN, M. SIMMONS, AND A. HAYES [1987]. A benchmark of the SCS-40 computer: A mini-supercomputer compatible with the Cray X-MP/24, Tech. Report LA-UR-87-659, Los Alamos National Laboratory, May. [1949] P. WATANABE, J. FLOOD, AND S. YEN [1974]. Implementation of finite difference schemes for solving fluid dynamic problems on Illiac IV, Tech. Report T-11, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign. [1950] T. WATANABE [1987]. Architecture and performance of NEC supercomputer SX system, Parallel Computing, 5, pp. 247-256. [1951] I. WATSON AND J. GURD [1982]. A practical data flow computer, Computer, 15(2), pp. 51-57. [1952] W. WATSON [1972]. The TI-ASC, a highly modular and flexible super computer architecture, Proc. AFIPS, 41, pt. 1, pp. 221-228. [1953] J. WATTS [1979]. A conjugate gradient truncated direct method for the iterative solution of the reservoir simulation pressure equation, Proc. SPE 54th Annual Fall Technical Conference and Exhibition, Las Vegas. [1954] S. WEBB [1980]. Solution of partial differential equations on the ICL distributed array processor, ICL Technical Journal, pp. 175-190. [1955] S. WEBB, J. MCKEOWN, AND D. HUNT [1982]. The solution of linear equations on a SIMD computer using a parallel iterative algorithm, Comput. Phys. Comm., 26, pp. 325-329. [1956] R. WEED, L. CARLSON, AND W. ANDERSON [1984]. A combined direct/inverse three-dimensional transonic wing design method for vector computers, Tech. Report 84-2156, AIAA, Seattle, WA, August. [1957] E. WEIDNER AND J. DRUMMOND [1982]. Numerical study of staged fuel injection for supersonic combustion, AIAA Journal, 20, pp. 1426-1431. [1958] J. WEILMUNSTER AND L. HOWSER [1976]. Solution of a large hydrodynamic problem using the STAR-100 computer, Tech. Report TM X-73904, NASA Langley Research Center. [1959] J. WELSH [1982]. Geophysical fluid simulation on a parallel computer, in Rodrigue [1643], pp. 269-277. [1960] P. WHITE [1985]. Vectorization of weather and climate models for the Cyber 205, in Numrich [1469], pp. 135-144. [1961] R. WHITE [1985]. Inversion of positive definite matrices on the MPP, in Potter [1584], pp. 730. [1962] R. WHITE [1986]. A nonlinear parallel algorithm with application to the Stefan problem,

Department of Computer Science. Parallel Sn iteration schemes. Yale University. WILLIAMS [1979]. in Book [203]. Par. 50.. Chaotic iteration and parallel divergence. 425—434. Formulation and Computational Algorithms hi Finite Element Analysis. Soc. Finite element analysis on microcomputers. WlNG AND J. pp. Lettrs. IEEE Trans. pp. 152-163. pp. Proc. IEEE Trans. [1963] R. 2-25. IBM. Algebraic Discrete Methods.. [1968] B. Anal. ORTEGA. Inf. Comput. M. 1133—44. 381-94. 279-314.. WONG [1988]. pp. Tech. New York. The calculation of the latent roots and vectors of matrices on the Pilot model of the ACE. [1985] D.. Department of Computer Science. WlLHELMSON [1974]. A parallel triangulation process of sparse matrices. 92-99. 64. A parallel variational method for certain elliptic spectral problems. [1976] E. WONG [1987]. WlDLUND [1988]. [1980] O. AND P. [1989] Y. Iterative sub structuring methods: Algorithms and theory for elliptic problems in the plane. WILLIAMSON [1983]. WILKINSON [1954]. Eckmann. [1972] E. Comput. Transformation of broadcasting into pipelining. Courant Institute. Appl. WHITE [1987]. [1967] O. Par. Report 101. HlROMOTO [1986]. 207-214. 1977 Int. [1992] P. pp. C-33. pp. [1977] E. Comm.. Engrg. Tech. 56-67. [1983] A. pp. Representing matrices as quadtrees for parallel processing. WILLIAMS AND F. Report YALEU/DCS/RR-544. WILSON [1982]. pp. pp. Comp. [1964] R. pp. 7. New York University. WILSON [1983]. 105-116. in Noor [1450]. Report 265. Supercomputers. Tajima. in Jesshope and Hockney [976]. [1978] E. [1973] S. [1986] D. WILSON [1976]. [1979] K. Multisplittings of a symmetric positive definite matrix.. WlNG [1985]. Woo AND J. FARHAT [1988]. WHITESIDE. BOBROWICZ [1985]. [1971] J. [1988] L. Tech. C-29. Approximate polynomial preconditioning applied to biharmonic equations on vector supercomputers. J. Linear and nonlinear finite element analysis on multiprocessor computer systems. Constructive and Computational Methods for Differential and Integral Equations. 
ROMINE SIAM J. N. Architectures for large networks of microcomputers. A parallel Jacobi diagonalization algorithm for a loop multiple processor system. in Rodrigue [1643]. 127-140. New York. F. Benchmarking a sparse elimination routine on the Cyber .. 170-181. NASA Lewis Research Center. April. Numer. pp. Proc. 72. Solving large elliptic difference equations on Cyber 205. A content addressable systolic array for sparse matrix computation. R.196 J. 409-413. Tech. Phil. SWARZTRAUBER [1984]. eds. Par. A numerical weather prediction model — Computational aspects. 2. Report NASA TM100217. pp. [1981] O. MA. IEEE. Comput. Proc. Pt. Appl. eds. Parallel Computing. 4. Proc. Conf. Proc. Experience with an FPS array processor. A computational model of parallel solutions of linear equations. pp. 23. Comput. WlNG AND J. 536-566. VAN TILBOUG [1980]. IEEE Trans.. pp. [1991] Y. Proc. Matsen and T. HlROMOTO [1985]. WILSON AND C. 453—476. Computational aspects of numerical weather prediction on the Cray computer. G. 567-578.. Tech. Mech. C-29. 195-199. WONG AND J. [1965] R. [1974] D. WISE [1985].. Meth. University of Texas Press.. Camb. H. The portability of programs and languages for vector and array processors. a reconfigurable network computer. Conf. Parallel decomposition of matrix inversion using quadtrees. Micros. WlNKLER [1987]. Los Alamos National Laboratory. A. Special numerical and computer techniques for finite element analysis. LfiVEQUE [1982]. WlDLUND [1984]. pp. WHITE [1986]. HUANG [1980]. [1984] N. Cambridge. 639-652. WlENKE AND R. 399-414. [1987] L. pp. MIT Press. in Noor [1450]. DELOSME [1987]. 195-208. Speedup predictions for large scientific parallel programs on CRAY-XM-P-like architectures. Solving partial differential equations using ILLIA C IV. Parallel algorithm* for nonlinear problems. WISE [1986]. [1969] B. [1970] R. 632-638. Numer. WlENKE AND R. WnTIE AND A. 541-543. 
Iterative methods for elliptic problems on regions partitioned into substructures and the biharmonic dirichlet problem. Par. Conf. Proc. WlNSOR [1981]. 137-149.. Report LA-UR-85-3597.. pp. pp. Dist. pp. pp. pp. Department of Computer Science. [1982] O. pp.. WILLIAMSON AND P. a distributed operating system for Micronet. 4. VOIGT AND C. SIAM J. HUANG [1977]. Dold and B. 31-40. [1966] O.-M. WlTTlE [1980]. Proc. Meth. 1986 Int. OSTLUND. Vectorization of fluid codes. pp. 20. Research Report RC12878. [1975] D. 1985 Int. Springer-Verlag. pp. HraBARD [1984]. Proc. 6. [1990] Y. Workshop in Interconnection Networks for Parallel and Distributed Processing.

227-239. Understanding supercomputer benchmarks. Iterative Solution of Large Linear Systems. Philadelphia. Parallelism and array processing.-C. KINCAID. Proc.121-130. Department of Computer Science. Report LA-8849-MS. Tech. C. August. Tech. Proc. WORLTON [1984]. YOUNG. 5. Y. Vector. FERZIGER.. RHEINBOLDT [1979]. HARBISON [1978]. HAYES [1985]. [2004] M. Report. New Computing Environments: Parallel. Vector and Systolic. New York. A parallel algorithm for sparse symbolic Cholesky factorization on a multiprocessor. pp. [2007] D. YEW [1986]. [1998] J. Softw. ROGALLO [1984]. YUAN [1987]. Stanford University. pp. II System solution. CHAPMAN. ZMLJEWSKI [1987]. Wu.A BIBLIOGRAPHY ON PARALLEL AND VECTOR NUMERICAL ALGORITHMS 197 205 and the CRAY-1. Implementation of capacitance calculation program CAP2D on iPSC. Navier-Stokes simulation of homogeneous turbulence on the CYBER 205. Loughborough University. 4578. WULF AND S. AND Y.. Limits on parallelism in the numerical solution of linear PDEs. Datamation. J. VA. Comput. 765-777.'m Heath [860]. Report CNA-199.. OPPE. [2005] P. [1993] P. October. WORLTON [1981]. August. Also published as STAN-CS-88-1212. PhD dissertation. AND L. Reflections in a pool of processors. Society for Industrial and Applied Mathematics. . [2001] W. TANAKA. BELL [1972]. pp. Par. pp. 7(2). WUNDERLICH [1985]. Parallel Algorithms for Asynchronous Multiprocessors. ed. in Cray Research. WORLEY [1988]. 153-171. [2014] E. Tech. [2016] D. YOUNG [1971]. [1986]. Compiling algorithms and techniques for the S-810 vector processor. 28. 9.. Parallel Computing. pp. PhD dissertation. pp. University of Texas at Austin. Comp. pp. WOODWARD [1982]. in Rodrigue [1643]. [2009] N. PA. WOUK. 30(14). Carnegie-Mellon University. Oak Ridge National Laboratory. A. D. YASUMURA. pp. AFIPS Press. [2015] E. Center for Numerical Analysis. Transonic flow simulations for 3D complex configurations. Department of Computer Science. 41-47. 199-210. Conf.. [423]. D. Reston. in Gary [700]. 
IEEE Trans. Computers and Structures. Department of Computer Science. Proc. 1-17. pp. Wouk. ed. Math. Los Alamos National Laboratory. Report ORNL/TM-10945. KANADA [1984]. pp.. ACM Trans. Softw.mmp — A multiminiprocessor. A quantitative evaluation of the feasibility of and suitable hardware structures for an adaptive parallel finite element system. [1999] A. AND R. Department of Computer Science. 271-292. Conf. 251-260. [2010] C. RuBBERT [1982]. ZAVE AND W. Philadelphia. pp. 6th SPE Symposium on Reservoir Simulation. pp. WULF AND C. YOUSIF [1983]. Cornell University. 485-494. 1984 Int. pp. ZAKHAROV [1984]. AFIPS Fall Joint Comp. 8-38. GILBERT [1987]. Trade-offs in designing explicit hydrodynamic schemes for vector computers. [2002] W. Sparse Cholesky Factorization on a Multiprocessor. 285-290. WORLEY [1988]. Math. 44. Academic Press. [2000] C. [1994] P. T. [1995] P.. Proc. C-33. Stanford University. Implementing the continued fraction factoring algorithm on parallel machines. Math. Nested dissection on a mesh-connected processor array. COLE [1983]. University of Illinois at UrbanaChampaign. Design of an adaptive parallel finite element system. Architecture of the Cedar parallel supercomputer. [2011] V.-P. [1996] P. Yu AND P. ZMLJEWSKI AND J. Society for Industrial and Applied Mathematics. Tech. pp. ZAVE AND G. Center for Supercomputing Research and Development. [2013] P. [2006] D. PhD dissertation. WORLEY AND R. ACM Trans. Information Requirements and the Implications for Parallel Computation. [1997] J. Report 609. On the use of vector computers for solving large sparse linear systems. New Computing Environments: Parallel. [2003] M. A philosophy of supercomputing. SCHREIBER [1986]. Stiffness loads and stresses evaluation. ZoiS [1988]. Tech. Inc. [2012] P. [2008] N. and Systolic. Parallel processing techniques for FE analysis I. 247-274.
