# The Design and Analysis of Bulk-Synchronous Parallel Algorithms

Alexandre Tiskin
Christ Church
Trinity Term 1998

Qualifying dissertation submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy at the University of Oxford

Programming Research Group
Oxford University Computing Laboratory

To the memory of my father


Abstract

The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. This thesis presents a systematic approach to the design and analysis of BSP algorithms. We introduce an extension of the BSP model, called BSPRAM, which reconciles shared-memory style programming with efficient exploitation of data locality. The BSPRAM model can be optimally simulated by a BSP computer for a broad range of algorithms possessing certain characteristic properties: obliviousness, slackness, granularity. We use BSPRAM to design BSP algorithms for problems from three large, partially overlapping domains: combinatorial computation, dense matrix computation, graph computation. Some of the presented algorithms are adapted from known BSP algorithms (butterfly dag computation, cube dag computation, matrix multiplication, Gaussian elimination without pivoting and with column pivoting). Other algorithms are obtained by application of established non-BSP techniques (sorting, randomised list contraction, algebraic path computation), or use original techniques specific to the BSP model (deterministic list contraction, Gaussian elimination with nested block pivoting, communication-efficient multiplication of Boolean matrices, synchronisation-efficient shortest paths computation). The asymptotic BSP cost of each algorithm is established, along with its BSPRAM characteristics. We conclude by outlining some directions for future research.

Contents

1 Introduction
2 BSP computation
  2.1 Historical background
  2.2 The BSP model
  2.3 The BSPRAM model
3 Combinatorial computation in BSP
  3.1 Complete binary tree computation
  3.2 Butterfly dag computation
  3.3 Cube dag computation
  3.4 Sorting
  3.5 List contraction
  3.6 Tree contraction
4 Dense matrix computation in BSP
  4.1 Matrix-vector multiplication
  4.2 Matrix multiplication
  4.3 Fast matrix multiplication
  4.4 Fast Boolean matrix multiplication
  4.5 Triangular system solution
  4.6 Gaussian elimination without pivoting
  4.7 Column pivoting and Householder reflections
  4.8 Nested block pivoting and Givens rotations
5 Graph computation in BSP
  5.1 Algebraic path computation
  5.2 Algebraic paths in acyclic graphs
  5.3 All-pairs shortest paths computation
  5.4 Single-source shortest paths computation
  5.5 Minimum spanning tree computation
6 Conclusions
7 Acknowledgements
A The Loomis-Whitney inequality and its generalisations

Chapter 1

Introduction

The model of bulk-synchronous parallel (BSP) computation (see Val90a, McC93, McC95, McC96c, McC98]) provides a simple and practical framework for general-purpose parallel computing. Its main goal is to support the creation of architecture-independent and scalable parallel software. The key features of BSP are the treatment of the communication medium as an abstract fully connected network, and explicit and independent cost analysis of communication and synchronisation.

Many other models have been proposed for parallel computing. One of the main divisions among the models is by the type of memory organisation: distributed or shared. On an abstract level, BSP is defined as a distributed memory model with point-to-point communication between the processors. Models based on shared memory are appealing from the theoretical point of view, because they provide the benefits of natural problem specification, convenient design and analysis of algorithms, and straightforward programming. For this reason, the PRAM model has dominated the theory of parallel computing. However, this model is far from being realistic, since the cost of supporting shared memory in hardware is much higher than that of distributed memory. Consequently, much effort was put into the development of efficient methods for simulation of the PRAM on more realistic models. In contrast with the PRAM model, the BSP model accurately reflects the main design features of most existing parallel computers.

Paper Val90b] shows how shared-memory style programming, with all the associated benefits, can be provided in BSP by PRAM simulation. However, this approach does not allow the algorithm designer to exploit data locality, and therefore in many cases may lead to inefficient algorithms. In this thesis we propose a new model, called BSPRAM, which stands between BSP and PRAM. The BSPRAM model is based on a mixture of shared and distributed memory, and allows one to specify, design, analyse and program shared-memory style algorithms that exploit data locality. The cost models of BSPRAM and BSP are based on the same principles, but there are important differences connected with concurrent memory access in the BSPRAM model. The two models are related by efficient simulations for a broad range of algorithms. We identify some properties of a BSPRAM algorithm that suffice for its optimal simulation in BSP. Algorithms possessing at least one of these properties (obliviousness, high slackness, high granularity) are abundant in scientific and technical computing. In view of our simulation results, BSPRAM here plays the role of a methodology for generic BSP algorithm design. In the subsequent chapters we demonstrate the meaning and use of such properties by systematically designing algorithms for problems from three large, partially overlapping domains: combinatorial computation, dense matrix computation, graph computation.

The algorithms presented in this thesis, as well as many other BSP algorithms, are defined for input sizes that are sufficiently large with respect to the number of processors. A typical form of such a condition is n >= poly(p), where n is the size of the input, p is the number of processors, and poly is a low-degree polynomial. Practical problems usually satisfy such conditions. Apart from simplifying the BSPRAM algorithms, this condition provides the slackness and granularity necessary for their efficient BSP simulation. We present the algorithms in their simplest form, without trying to adapt them for lower values of n. Instead, we only note where such optimisation is possible, and give references to papers that address this problem.

For the sake of simplicity, throughout the thesis we ignore small irregularities that arise from imperfect matching of integer parameters. For example, when we write "divide an array of size n into p regular blocks", the value n may not be an exact multiple of p, and therefore the blocks may differ in size by 1. Sometimes we use square bracket notation for matrices, referring to an element of an n x n matrix A as A[i, j], 1 <= i, j <= n. When the matrix is partitioned into regular blocks of size m x m, we refer to each individual block as A[[s, t]], 1 <= s, t <= n/m.
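The conventions above can be made concrete with a short sketch (Python is used here purely for illustration; none of this code appears in the thesis). It shows a regular partition of an array of size n into p blocks whose sizes differ by at most 1, and the corresponding m x m blocking A[[s, t]] of an n x n matrix.

```python
# Illustrative only: the "regular blocks" convention described above.

def regular_blocks(n, p):
    """Sizes of p regular blocks covering n elements (sizes differ by <= 1)."""
    q, r = divmod(n, p)
    # The first r blocks receive one extra element.
    return [q + 1] * r + [q] * (p - r)

def matrix_block(A, s, t, m):
    """The block A[[s, t]] of an n x n matrix A partitioned into m x m blocks.

    Blocks are indexed from 1, following the notation 1 <= s, t <= n/m.
    """
    rows = A[(s - 1) * m : s * m]
    return [row[(t - 1) * m : t * m] for row in rows]

sizes = regular_blocks(10, 4)          # [3, 3, 2, 2]
A = [[i * 4 + j for j in range(4)] for i in range(4)]
top_right = matrix_block(A, 1, 2, 2)   # [[2, 3], [6, 7]]
```

The hypothetical helper names `regular_blocks` and `matrix_block` are introduced here only for the sake of the example.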

Chapter 2

Bulk-synchronous parallel computation

2.1 Historical background

The last fifty years have seen the tremendous success of sequential computing. As pointed out in Val90a, McC93, McC96c], this was primarily due to the existence of a single model, the von Neumann computer, which was simple and realistic enough to serve as a universal basis for sequential computing. No such basis existed for parallel computing. Instead, there was a broad variety of hardware designs and programming models. One of the main traditional divisions among models of parallel programming is the organisation of memory: distributed versus shared. Shared memory is much costlier to support in hardware than distributed memory. However, shared memory has some important advantages:

- natural problem specification: computational problems have well-defined input and output, that are assumed to reside in the shared memory. In contrast, algorithms for a distributed memory model have to assume a particular distribution of input and output. This distribution effectively forms a part of the problem specification, thus restricting the practical applicability of an algorithm.

- convenient design and analysis of algorithms: the computation can be described at the top level as a sequence of transformations to the global state determined by the contents of the shared memory. In contrast, algorithms for distributed memory models have to be designed directly in terms of individual processors operating on their local memories.

- straightforward programming: the shared memory is uniformly accessible via a single address space using two basic primitives: reading


and writing. In contrast, programming for distributed memory models is more complicated, typically involving point-to-point communication between processors via the network.

The computational model most widely used in the theory of parallel computing is the Parallel Random Access Machine (PRAM) (see e.g. CLR90, KR90, JaJ92, McC93]). The PRAM consists of a potentially infinite number of processors, each connected to a common memory unit with potentially infinite capacity. The computation is completely synchronous. Accessing a single value in the memory costs the same as performing an arithmetic or Boolean operation on a single value. Several variants of the PRAM model have been introduced. Among them are the exclusive read, exclusive write PRAM (EREW PRAM), which requires that every memory cell is accessed by not more than one processor in any one step, and the concurrent read, concurrent write PRAM (CRCW PRAM), which allows several processors to access a cell concurrently in one step. For the CRCW PRAM, a rule to resolve concurrent writing must be adopted. One of the possibilities, realised in the combining CRCW PRAM (see e.g. CLR90, pages 690-691]), is to write some specified combination of the values being written and (optionally) the value stored previously at the target cell. A typical choice of the combining function is some commutative and associative operator such as the sum or the maximum of the values.

Another major model of parallel computation is the circuit model (see e.g. KR90, McC93]). A circuit is a directed acyclic graph (dag) with labeled nodes. We call a node terminal if it is either a source (a node of indegree 0), or a sink (a node of outdegree 0). In a circuit, source nodes are labeled as input, sink nodes are labeled as output, and nonterminal nodes are labeled by arithmetic or Boolean operations. Algorithms that can be represented by circuits are oblivious, i.e.
perform the same sequence of operations for any input (although the arguments and results of individual operations may, of course, depend on the inputs). Such algorithms are simpler to analyse than non-oblivious ones. Circuits also provide a useful intermediate stage in the design of algorithms for PRAM-type models: the problem of designing a circuit is separated from the problem of scheduling its underlying dag. For example, while the question of an optimal solution to the matrix multiplication problem remains open, one can find optimal scheduling for particular circuits representing e.g. the standard Θ(n^3) method, or Strassen's Θ(n^{log 7}) method. In this thesis we study the scheduling problem for several classes of dags.

Both the PRAM and the circuit model are simple and straightforward. However, these models do not take into account the limited computational resources of existing computers, and therefore are far from being realistic. The first step in making them more realistic was to introduce a new complexity measure, efficiency, depending on the number of processors used by the


algorithm (see KRS90]). New parallel models were gradually introduced to account for resources other than the number of processors. Currently, dozens of such models exist; see LMR96, MMT95, ST98] for a survey. Among the computer resources measured by these models are, according to LMR96], the number of processors, memory organisation (distributed or shared), communication latency, degree of asynchrony, bandwidth, message handling overhead, block transfer, memory hierarchy, memory contention, network topology, and many others. Models that include many different resource metrics tend to be too complex. A useful model should be concise and concentrate on a small number of crucial resources.

One of the simplest and most elegant parallel models is the BSP model; see Val90a, McC95, McC96c, McC98] for the description of BSP as an emerging paradigm for general-purpose parallel computing. The BSP model is defined by a few qualitative characteristics (uniform network topology, barrier-style bulk synchronisation), and by three quantitative parameters: the number of processors, communication throughput, and latency. The main principle of BSP is to regard communication and synchronisation as separate activities, possibly performed by different mechanisms. The corresponding costs are independent and compositional, i.e. can be simply added together to obtain the total cost. It is easy to extend the BSP model to account for memory efficiency as well. Such an extension is considered in MT], where memory-efficient BSP algorithms for matrix multiplication are analysed.

In this thesis we propose a variant of BSP, called BSPRAM, intended to support shared-memory style BSP programming. The memory of BSPRAM has two levels: local memory of individual processors, and a shared global memory. We compare BSPRAM with similar existing models. We then study the relationship between BSPRAM and BSP by means of simulation. Let n denote the size of the input to a program.
Following Val90b], we say that a model A can optimally simulate a model B if there is a compilation algorithm that transforms any program with cost T(n) on B to a program with cost O(T(n)) on A. If the compilation algorithm yields a randomised program for A, we call the simulation optimal if the expected cost of the randomised program is O(T(n)). Sometimes the simulation may be restricted to programs from a particular class. We assume that we are free to define a suitable distribution of the input and output data to simulate a shared memory model on a distributed memory one. If the described compilation is defined only for a particular class of algorithms, we say that A can optimally simulate B for that class of algorithms. We show that BSP can optimally simulate BSPRAM for several large classes of algorithms.

2.2 The BSP model

A BSP computer, introduced in Val89, Val90a], consists of p processors connected by a communication network (see Figure 2.1). Each processor has a fast local memory. The processors may follow different threads of computation.

Figure 2.1: A BSP computer

A BSP computation is a sequence of supersteps (see Figure 2.2). A superstep consists of an input phase, a local computation phase and an output phase. In the input phase, a processor receives data that were sent to it in the previous superstep; in the output phase, it can send data to other processors, to be received in the next superstep. The processors are synchronised between supersteps. The computation within a superstep is asynchronous.

Figure 2.2: A BSP computation

Let the cost unit be the cost of performing a basic arithmetic operation or a local memory access. The value g is the communication throughput ratio (also called "bandwidth inefficiency" or "gap"); the value l is the communication latency (also called "synchronisation periodicity"). If, for a particular superstep, w is the maximum number of local operations performed by each processor, h' (respectively, h'') is the maximum number of data units received (respectively, sent) by each processor, and h = h' + h'' (another possible definition is h = max(h', h'')), then the cost of the superstep is defined as w + h g + l. Here g and l are the BSP parameters of the computer.

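As a quick illustration of the cost model just defined (a sketch, not part of the thesis; the parameter values are made up), the cost of a superstep, and of a whole computation as a sum over supersteps, can be computed directly from the definitions:

```python
# Illustrative sketch of the BSP cost model: the cost of one superstep is
# w + h*g + l with h = h' + h''; costs are summed over all supersteps.

def superstep_cost(w, h_in, h_out, g, l):
    h = h_in + h_out              # h = h' + h''
    return w + h * g + l

def computation_cost(supersteps, g, l):
    # supersteps: list of (w, h_in, h_out) triples; the total is W + H*g + S*l
    return sum(superstep_cost(w, hi, ho, g, l) for w, hi, ho in supersteps)

# Hypothetical two-superstep computation on a machine with g = 2, l = 30:
total = computation_cost([(100, 10, 10), (50, 5, 0)], g=2, l=30)
```

Because the communication and synchronisation terms are compositional, the total cost is exactly the sum of the per-superstep costs.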
We write BSP(p; g; l) to denote a BSP computer with the given values of p, g and l. If a computation consists of S supersteps with costs w_s + h_s g + l, 1 <= s <= S, then its total cost is W + H g + S l, where W = w_1 + ... + w_S is the local computation cost, H = h_1 + ... + h_S is the communication cost, and S is the synchronisation cost. The values of W, H and S typically depend on the number of processors p and on the problem size; a typical BSP program regards the values p, g and l as configuration parameters. We define the local computation volume W' as the total number of local operations, and the communication volume H' as the total number of data units transferred between the processors. We call a BSP computation balanced if W = O(W'/p) and H = O(H'/p). In order to utilise the computer resources efficiently, algorithm design should aim to minimise local computation, communication and synchronisation costs for any realistic values of these parameters. For most problems, a balanced distribution of data and computation work will lead to algorithms that achieve optimal cost values simultaneously. However, for some other problems a need to trade off the costs will arise.

An example of a communication-synchronisation tradeoff is the problem of broadcasting a single value from a processor. It can be performed with H = S = O(log p) by a balanced binary tree, or with H = O(p) and S = O(1) by sending the value directly to every processor (this was observed in Val90a]). On the other hand, a technique known as two-phase broadcast allows one to achieve perfect balance for the problem of broadcasting n >= p values from one processor. By dividing the values into p blocks of size n/p, scattering the blocks so that each one gets to a distinct processor, and then performing total exchange of the blocks, the problem can be solved with H = O(n) and S = O(1); both cost values are obviously optimal. Communication-optimal broadcasting of n values, 1 < n < p, can be performed in 1 + log p / log n phases. The values are scattered so that each one gets to a distinct processor; then each value is broadcast by a balanced tree of degree n and height log p / log n. The communication and synchronisation costs of such simultaneous broadcast are H = O(n log p / log n), S = O(log p / log n). For n = p^α, where α is a constant, 0 < α < 1, both cost values are trivially optimal. For any asymptotically smaller n, there is a communication-synchronisation tradeoff. Matrix computations provide further examples of problems with and without tradeoffs: for instance, matrix multiplication can be done optimally in communication and synchronisation, but matrix inversion presents a tradeoff between communication and synchronisation.

The BSP model does not directly support shared memory, broadcasting or combining. These facilities can be obtained by simulating a PRAM on a BSP computer, in which the network is implemented as a random-access shared memory unit. Such simulation is also called the automatic mode of BSP programming, as opposed to the direct mode, i.e. programming with explicit control over memory management. In the automatic mode, each step of a PRAM is implemented as a superstep, retaining the parameters g and l and the bulk-synchronous structure of the computation. Because PRAM processors do not have any local memory, in order to achieve efficient simulation of a PRAM on a BSP computer, the PRAM must have more processors than the BSP computer. For a BSP computer with a fixed value of p, we say that a PRAM algorithm has slackness σ if at least σp PRAM processors perform reading or writing at every step. Slackness measures the "degree of communication parallelism" achieved by the algorithm, and is typically a function of the problem size n and the number of BSP processors p. Memory access in the randomised simulation is made uniform by hashing: each memory cell of the simulated PRAM is represented by a cell in the local memory of one of the BSP processors, chosen according to some easily computable hash function which ensures nearly random and independent distribution of cells. Virtual processor allocation is equal and non-repeating, but otherwise arbitrary, with at least σ virtual PRAM processors allocated to each of the p BSP processors. Paper Val90b] provides the following result.

Theorem 1. Let g = O(1), l = O(σ). An optimal randomised simulation on BSP(p; g; l) can be achieved for

(i) any EREW PRAM algorithm with slackness σ >= log p;

(ii) any CRCW PRAM algorithm with slackness σ >= p^ε for a constant ε > 0.

Proof. See Val90b].

The simulation allows one to write PRAM programs for BSP computers and to predict their performance accurately. Most practical problems possess the slackness necessary for efficient simulation. However, the automatic mode does not allow the programmer to exploit data locality. This lack of data locality may be insignificant for highly irregular problems (e.g. multiplication of sparse matrices with a random pattern of nonzeros). On the other hand, data locality should be preserved when dealing with more structured problems (e.g. multiplication of dense matrices, or sparse matrices with a regular nonzero pattern). Efficient BSP solution of such problems cannot be achieved via the automatic mode. In contrast, the direct mode (pure BSP) allows one to exploit data locality, but only in a distributed memory paradigm. The next section aims to reconcile the exploitation of data locality with shared-memory style programming.
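The broadcast tradeoff discussed in this section can be sketched numerically (an illustration under simplifying assumptions, with constant factors ignored; the function names are invented for the example). The three strategies give different (H, S) pairs:

```python
# Illustrative (H, S) costs of broadcasting from one processor on p processors.
# Constant factors are simplified away; this is not code from the thesis.

import math

def direct_broadcast(p):
    # send the single value directly to every processor: H = O(p), S = O(1)
    return p, 1

def tree_broadcast(p):
    # balanced binary tree: H = O(log p), S = O(log p)
    s = max(1, math.ceil(math.log2(p)))
    return s, s

def two_phase_broadcast(n, p):
    # broadcast n >= p values: scatter p blocks of size ~n/p, then perform
    # a total exchange of the blocks: H = O(n), S = O(1)
    assert n >= p
    block = math.ceil(n / p)
    h = n + block * (p - 1)   # phase-1 sends plus phase-2 exchange
    return h, 2

h, s = two_phase_broadcast(1024, 8)   # H stays O(n) while S is constant
```

For a single value the two strategies trade H against S, whereas for n >= p values the two-phase technique achieves both optimal H and constant S, exactly as described above.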
We introduce a new BSP-type model, called BSPRAM. We present a randomised BSP simulation of BSPRAM, based on a suitably adapted concept of slackness. We also describe a deterministic simulation, based on additional properties of obliviousness and granularity.

2.3 The BSPRAM model

In the previous section we described two alternative approaches to BSP programming. The automatic mode (PRAM simulation) enables shared-memory style BSP programming with all its benefits; however, it does not allow one to exploit data locality. The direct mode allows one to exploit data locality, but only in a distributed memory paradigm. The aim of this section is to introduce a new BSP programming method, allowing both shared-memory style programming and exploitation of data locality. This might be called a "semi-automatic mode" of BSP programming. The new method is similar to the PRAM simulation method mentioned in the previous section. The simulation mechanism is used to model the global shared memory, which in the new model replaces the BSP communication network. The key difference is that a BSP superstep is no longer fragmented into independent steps of p individual virtual PRAM processors; the structure of computation in the local memories of BSP processors is preserved. We call the new computational model BSPRAM. The new model is designed to combine the best features of both automatic and direct BSP programming modes.

Formally, a BSPRAM consists of p processors with fast local memories (see Figure 2.3). In addition, there is a single shared main memory.

Figure 2.3: A BSPRAM

As in BSP, the computation proceeds by supersteps (see Figure 2.4). A superstep consists of an input phase, a local computation phase, and an output phase. In the input phase a processor can read data from the main memory; in the output phase it can write data to the main memory. The processors are synchronised between supersteps. The computation within a superstep is asynchronous.

Figure 2.4: A BSPRAM computation

As with the PRAM, concurrent access to the main memory in one superstep can be either allowed or disallowed. In this thesis we consider an exclusive-read, exclusive-write BSPRAM (EREW BSPRAM), in which every cell of the main memory can be read from and written to only once in every superstep, and a concurrent-read, concurrent-write BSPRAM (CRCW BSPRAM), that has no restrictions on concurrent access to the main memory. For convenience of algorithm design, we assume that if a value x is being written to a main memory cell containing the value y, the result may be determined by any prescribed function f(x; y) computable in time O(1). Similarly, if values x1, ..., xm are being written concurrently to a main memory cell containing the value y, the result may be determined by any prescribed function f(x1 ⊕ ... ⊕ xm; y), where ⊕ is a commutative and associative operator, and both f and ⊕ are computable in time O(1). This corresponds to resolving concurrent writing in PRAM by combining (see e.g. CLR90]).

In a similar way to the BSP model, the cost of a BSPRAM superstep is defined as w + h g + l. Here w is the maximum number of local operations performed by each processor, and h = h' + h''. The value of h' (respectively, h'') is defined as the maximum number of data units read from (respectively, written to) the main memory by each processor in the superstep. As in BSP, the values g and l are fixed parameters of the computer. We write BSPRAM(p; g; l) to denote a BSPRAM with the given values of p, g and l. The cost of a computation consisting of several supersteps is defined as W + H g + S l, where W, H and S have the same meaning as in the BSP model.

One of the early models similar to BSPRAM was the LPRAM model proposed in ACS90]. The model consists of a number of synchronously working processors with large local memories and a global shared memory. The only mode of concurrent memory access considered in ACS90] is CREW. The model has an explicit bandwidth parameter, corresponding to g in BSP and BSPRAM. There is no accounting for synchronisation cost, although it is suggested as a possible extension of the model. Thus, a p-processor LPRAM is equivalent (up to a constant factor) to CREW BSPRAM(p; g; 1). Another model similar to BSPRAM, called the Asynchronous PRAM, was proposed in Gib93] (an earlier version of this model was called the Phase


PRAM). Like BSPRAM, the Asynchronous PRAM consists of processor-memory pairs communicating via a global shared memory. The computation structure is bulk-synchronous, with EREW communication. The model charges unit cost for a global read/write operation, d units for communication startup and B units for barrier synchronisation. Thus, a p-processor Asynchronous PRAM is equivalent (up to a constant factor) to EREW BSPRAM(p; 1; d + B). A bulk-synchronous parallel model QSM is proposed in GMR99, GMR, Ram99] (an earlier version of this model was called QRQW PRAM). The model has a bandwidth parameter g. A p-processor QSM machine is similar to BSPRAM(p; g; 1) with a special mode of concurrent access to the main memory: any k concurrent accesses to a cell cost k units. Such a model is more powerful than EREW BSPRAM(p; g; 1), but less powerful than CRCW BSPRAM(p; g; 1). An interesting partial alternative to shared memory bulk-synchronous parallel models is offered by array languages with implicit parallelism. An example of such a language is ZPL (see Sny98]), based on the CTA/Phase Abstractions model described in AGL+98]. The developers of ZPL have announced their plans for an extension, called Advanced ZPL, which is likely to be similar to the BSPRAM model.

Just as for PRAM simulation, some "extra parallelism" is necessary for efficient BSPRAM simulation on BSP. We say that a BSP or BSPRAM algorithm has slackness σ if the communication cost of every one of its supersteps is at least σ. We adapt the results on PRAM simulation mentioned in the previous section to obtain an efficient simulation of BSPRAM.

Theorem 2. An optimal randomised simulation on BSP(p; g; l) can be achieved for

(i) any EREW BSPRAM(p; g; l) algorithm with slackness σ >= log p;

(ii) any CRCW BSPRAM(p; g; l) algorithm with slackness σ >= p^ε for a constant ε > 0.

Proof. Immediately follows from Theorem 1.

In contrast with Theorem 1, no conditions on g and l are necessary, since the simulated and the simulating machines have the same BSP parameters. Apart from randomised simulation by hashing, in some cases an efficient deterministic simulation of BSPRAM is possible. We consider two important classes of algorithms for which such deterministic simulation exists. We say that a BSPRAM algorithm is oblivious if the sequence of operations executed by each processor is the same for any input of a given size (although the arguments and results of individual operations may depend on the inputs). An oblivious algorithm can be represented as a computation of a uniform family of circuits (for the definition of a uniform family of circuits,


see e.g. KR90]). We say that a BSPRAM algorithm is communication-oblivious if the sequence of communication and synchronisation operations executed by a processor is the same for any input of a given size (no such restriction is made for local computation). We say that a set of cells in the main memory of BSPRAM constitutes a granule if in any input (output) phase each processor either does not read from (write to) any of these cells, or reads from (writes to) all of them. Informally, a granule is treated as "one whole piece of data". We say that a BSPRAM algorithm has granularity γ if all main memory cells used by the algorithm can be partitioned into granules of size at least γ. The slackness of a BSPRAM algorithm will always be at least as large as its granularity: σ >= γ.

Communication-oblivious BSPRAM algorithms, and BSPRAM algorithms with sufficient granularity, allow optimal deterministic BSP simulation. Randomised hashing is not necessary for communication-oblivious algorithms, since their communication pattern is known in advance. Therefore, an optimal distribution of main memory cells across BSP processor-memory pairs can be found off-line. For algorithms with granularity at least p, hashing is not necessary either, since every granule can be split up into p equal parts that are evenly distributed across BSP processor-memory pairs. This makes all communication uniform. In both cases randomised hashing is replaced by a simple deterministic data distribution. Moreover, for communication-oblivious algorithms with slackness at least p^ε, and for algorithms with granularity at least p, concurrent memory access can be simulated by mechanisms similar to the two-phase and (1 + 1/ε)-phase broadcast described in the previous section. Below we formally state the results on deterministic BSPRAM simulation, published previously in Tis96, Tis98].

Theorem 3. An optimal deterministic simulation on BSP(p; g; l) can be achieved for

(i) any communication-oblivious EREW BSPRAM(p; g; l) algorithm;

(ii) any communication-oblivious CRCW BSPRAM(p; g; l) algorithm with slackness σ >= p^ε for a constant ε > 0;

(iii) any CRCW BSPRAM(p; g; l) algorithm with granularity γ >= p.
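The deterministic data distribution underlying case (iii) can be sketched as follows (an illustration, not the thesis's construction; the helper name is invented): each granule of size at least p is cut into p near-equal parts, with part i of every granule stored by BSP processor i, so that access to any granule generates uniform communication.

```python
# Illustrative sketch of the deterministic distribution for granularity >= p.

def distribute_granule(granule, p):
    """Split one granule into p near-equal parts, part i going to processor i."""
    n = len(granule)
    q, r = divmod(n, p)
    parts, start = [], 0
    for i in range(p):
        size = q + (1 if i < r else 0)
        parts.append(granule[start:start + size])
        start += size
    return parts  # parts[i] lives in the local memory of processor i

parts = distribute_granule(list(range(12)), 4)
# every processor holds an equal share (here 3 cells) of this granule
```

Since each processor reads or writes either all of a granule or none of it, every access touches all p parts equally, which is why no randomised hashing is needed in this case.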

Proof. (i) Since the communication pattern of a communication-oblivious algorithm is known in advance, we only need to show that any computation of EREW BSPRAM (i.e. a particular run of an algorithm) can be performed in BSP at the same asymptotic cost. First, we modify each BSPRAM superstep so that each processor both reads and writes any main memory cell that it either reads or writes in the original superstep. This increases the

communication cost of the computation at most by a factor of 2, and does not change the synchronisation cost. The above modification essentially transforms the computation into a form of message passing, in which main memory cells represent messages, and writing or reading a value corresponds to sending or receiving a message. This message-passing version of BSPRAM was referred to as "BSP+" in [Tis96]. Its difference from direct BSP mode is that a message can be "delayed", i.e. its sending and receiving may occur in non-adjacent supersteps. It remains to show that the "delayed" messages can be simulated optimally by normal BSP messages.

We represent the whole BSPRAM computation by an undirected graph. Each superstep is represented by two nodes, one for the input phase and the other for the output phase. Messages are represented by edges. Two nodes v1 and v2 are connected by an edge e, if the message represented by e is sent in the output phase represented by v1, and received in the input phase represented by v2. The constructed graph is bipartite, with the two parts representing all input and output phases respectively. If an input or output phase has cost h, then the degree of its representing node is at most ph. It is known (see e.g. [Ber85, page 247]) that for any bipartite graph with maximum degree at most p, there is a colouring of its edges with not more than p colours, such that all the edges adjacent to the same node are coloured differently. As an easy corollary, for an arbitrary bipartite graph, an arbitrary p, and an arbitrary h, there is a colouring of the edges with not more than p colours, such that any node of degree at most ph has at most h adjacent edges of each colour. (This can be proved by splitting each node of degree at most ph into h nodes of degree at most p.) We use the above corollary to colour the computation graph. We then regard the colour of each edge as the identifier of a BSP processor that must obtain the corresponding message from the sending processor, keep it in its local memory for as long as necessary, and then transfer the message to the receiving processor. The communication and synchronisation costs of the computation are increased at most by a factor of 2.

(ii) The proof is similar to that of (i). The only difference is that, due to concurrent reading and writing, each message has to be combined from contributions of several processors before being sent, and broadcast to several processors after being received. Consider a particular superstep in the computation. By symmetry, we need to analyse only the input phase. Without loss of generality, we assume that the communication cost of the considered input phase is h = σ ≥ p^ε, 0 < ε < 1. Each message is broadcast by a tree of maximum degree h and height at most ⌈1/ε⌉ (the tree does not have to be balanced).

The broadcasting forest is partitioned among the processors so that on each level the total degree of the nodes computed in any processor is at most 2h. Such a partitioning can be easily obtained by a greedy algorithm. The communication cost of the computation is increased at most by a factor of 2⌈1/ε⌉, and the synchronisation cost at most by a factor of ⌈1/ε⌉.

(iii) Partition each granule into p equal subgranules. For each granule, choose an arbitrary balanced distribution of its subgranules across the BSP processor-memory pairs. An input phase of the BSPRAM algorithm is simulated by two BSP supersteps. In the first superstep, a processor broadcasts a request for each granule that it must read; thus all processors receive an identical set of requests. In the second superstep, a processor satisfies the received requests by sending the locally stored subgranules of the requested granules to the requesting processors. An output phase of the BSPRAM algorithm is simulated by one BSP superstep. In this superstep, a processor divides each granule that it must write into p subgranules, and sends to every processor the appropriate subgranules. Having received its subgranules, each processor combines any concurrently written data, and then updates the locally stored subgranules. Note that since the subgranules of every granule are distributed evenly, the resulting communication is uniform. The communication and synchronisation costs of the computation are increased at most by a factor of 2.

The proofs of Theorems 2 and 3 show that a BSP computer can execute many practical BSPRAM algorithms within a low constant factor of their cost. For two important classes of algorithms, communication-oblivious algorithms and algorithms with sufficient granularity, the simulation is deterministic and particularly simple. It is intuitively clear that, in general, not every BSP algorithm can be optimally simulated on a BSPRAM, due to different input-output conventions. However, the BSPRAM shared memory mechanism is at least as powerful as BSP message passing. The following result gives a simulation which is sufficient for most practical applications.

Theorem 4. An optimal deterministic simulation on an EREW BSPRAM(p, g, l) can be achieved for any BSP(p, g, l) algorithm with slackness σ ≥ p.

Proof. The main memory of the BSPRAM is partitioned into p² areas. Each area corresponds to a pair of communicating BSP processors. Sending a message from a processor q to a processor r is implemented by q writing the message, preceded by its length, to the area corresponding to the pair q, r. Receiving this message is implemented by r reading the length, and then the message (if the length is nonzero). The communication and synchronisation costs of the computation are increased at most by a factor of 2.

In the rest of this thesis, we develop and analyse BSP algorithms for some common computational problems. The BSPRAM model is used as the basis of our presentation. When analysing slackness and granularity, we will often ignore a part of the computation which is non-critical, i.e. does not affect the asymptotic cost of the whole algorithm. The reason for this is that such non-critical computation may be simulated non-optimally without reducing the overall performance. Every time such an omission is made, we will indicate explicitly the part of the computation which is considered non-critical.

Sometimes it is convenient to use BSP message passing for (some part of) the computation, while keeping BSPRAM shared memory for input-output. In such cases we extend the BSPRAM model by assuming that the processors, in addition to the shared memory, are connected by a BSP-style communication network. In this extended model, a computation is a mixture of BSPRAM-style and BSP-style supersteps; we will say that the computation switches between shared-memory mode and message-passing mode. Such mixed algorithms can be translated into pure BSP by the simulation mechanisms of Theorems 2 and 3, and into pure BSPRAM by Theorem 4.

Chapter 3

Combinatorial computation in the BSP model

One of the main divisions in computer science is between "algebraic" and "combinatorial" computation. Both terms are usually understood in a broad sense, and some algorithms fall under both categories. We will consider computational problems of an algebraic nature in Chapters 4 and 5. In this chapter, we concentrate on simple combinatorial objects, such as dags, arrays, linked lists and trees.

The first few sections deal with BSP computation of dags. As defined in Section 2.1, a dag is a directed acyclic graph. By computation of a dag we understand computation of a circuit based on that dag. The number of processors is unbounded. We will usually ignore terminal nodes when establishing the size and the depth of a dag. Also, when this does not create confusion, we will not show terminal nodes in pictures of dags.

In a BSP dag computation, a node may be computed, in general, more than once by different processors. We call an edge u → v local, if the set of processors computing v is a subset of the set of processors computing u. We call a node v local, if all its incoming edges are local, that is, if any processor computing v computes also all predecessors of v; a local node does not require communication. The communication cost of parallel dag computation has been analysed e.g. in [PU87, PY90, JKS93]. These papers adopt a synchronous communication cost model, where a nonlocal edge incurs a fixed communication delay. Paper [JKS93] shows that recomputation of nodes is necessary for an asymptotically optimal computation of certain dags in the given model. We also allow recomputation of nodes; however, it is not required by any of the algorithms in this chapter.

3.1 Complete binary tree computation

We begin with a simple problem of computing a circuit based on a complete binary tree.
This problem often occurs in practice, e.g. in computing all-prefix sums, and has important applications to other problems. The analysis of complete binary tree computation will help us to illustrate the BSPRAM model, as well as the concepts of communication-obliviousness, slackness and granularity.

We denote the complete one-input, n-output binary tree dag by cbtree(n). Figure 3.1 shows the dag cbtree(16). This dag has n − 1 nonterminal nodes, and depth log n. As in any circuit, each node in the tree represents an elementary operation. Depending on the nature of the computation, we can view the node's parent as the input and the children as the outputs, or vice versa. We will call these two versions of tree computation top-to-bottom and bottom-to-top, respectively. By symmetry, we need to analyse only one of the two. We choose top-to-bottom computation, which generalises the broadcast problem considered in Section 2.2 (the operation of each node in the case of broadcast is simple replication of the input).

Figure 3.1: Complete binary tree cbtree(16)

For a BSPRAM computation of the tree, we assume that the input and the outputs are stored in the main memory. The computation method resembles the broadcast techniques from Section 2.2. Figure 3.2 shows the computation for n = 16, p = 4.

Figure 3.2: BSPRAM computation of cbtree(16)
Algorithm 1. Computation of a complete binary tree, i.e. a circuit based on cbtree(n).

Parameters: integer n ≥ p^{1+ε} for some constant ε > 0.

Input: value x.

Output: values yi, 0 ≤ i < n, computed by the circuit.

Description. The computation is performed on an EREW BSPRAM(p, g, l) in shared-memory or message-passing mode. If ε ≥ 1, the computation proceeds in two supersteps. The first superstep computes the top log p levels of the tree, and the second superstep the remaining log n − log p levels. If 0 < ε < 1, the computation proceeds in 1 + ⌈1/ε⌉ supersteps. Each superstep computes ε log p levels of the tree, except the last superstep, which computes the remaining log n − log p levels.

Cost analysis. For ε ≥ 1, the local computation, communication and synchronisation costs are

W = O(n/p)
H = O(n/p)
S = O(1)

For 0 < ε < 1, the local computation, communication and synchronisation costs are

W = O(n/p)
H = O(⌈1/ε⌉ · n/p) = O(n/p)
S = O(⌈1/ε⌉) = O(1)

The algorithm is oblivious, with slackness and granularity σ = γ = 1. Each of the three cost values W, H, S, taken independently, is trivially optimal. Bottom-to-top complete binary tree computation is symmetric to the algorithm above.

An important application of complete binary trees is the problem of computing all-prefix sums on an array of size n (see e.g. [Ble93]). The problem is formulated as follows: given an input array (x0, x1, ..., x_{n−1}), compute the output (y0, y1, ..., y_{n−1}) = (x0, x0 • x1, ..., x0 • x1 • ⋯ • x_{n−1}), where • is an associative operator computable in time O(1). A standard method of computing all-prefix sums in parallel, proposed in [BK82] (see also [LD94]), can be represented by a dag allpref(n), shown in Figure 3.3 for n = 8. Here, the action of a node with inputs x, y is to output x and x • y.
To expose the symmetry of the method, we have introduced a dummy value, shown in Figure 3.3 by a dot. The action of a node one of whose inputs is the dummy value is to pass its other input through unchanged.

Figure 3.3: All-prefix sums dag allpref(8)

The dag allpref(n) consists of two linked complete binary trees, the first computed bottom-to-top, the second top-to-bottom. In total, the dag allpref(n) has 2n − 2 nonterminal nodes, and depth 2 log n. If n ≥ p^{1+ε} for a constant ε > 0, Algorithm 1 allows one to compute all-prefix sums with BSP cost W = O(n/p), H = O(n/p), S = O(1).
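The two-sweep structure of allpref(n) can be sketched sequentially. The following Python fragment (an illustrative sketch, not the thesis's algorithm; it simulates the dag's dependence pattern in one thread, assuming n is a power of two) performs a bottom-to-top sweep followed by a top-to-bottom sweep over an array, using only the associativity of the operator.

```python
from itertools import accumulate
import operator

def all_prefix(x, op=operator.add):
    """Inclusive all-prefix 'sums' via one bottom-to-top and one top-to-bottom
    tree sweep, mirroring the allpref(n) dag.  Assumes len(x) is a power of 2."""
    n = len(x)
    t = list(x)
    d = 1
    while d < n:                          # bottom-to-top: combine adjacent subtrees
        for i in range(2 * d - 1, n, 2 * d):
            t[i] = op(t[i - d], t[i])
        d *= 2
    d = n // 4
    while d >= 1:                         # top-to-bottom: propagate prefixes down
        for i in range(2 * d - 1, n - d, 2 * d):
            t[i + d] = op(t[i], t[i + d])
        d //= 2
    return t

assert all_prefix([1, 2, 3, 4, 5, 6, 7, 8]) == list(accumulate([1, 2, 3, 4, 5, 6, 7, 8]))
```

Only associativity of `op` is used, so the same sketch works for any operator of the kind required by the problem statement, e.g. `max`.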
CHAPTER 3.2 Butter y dag computation
The butter y dag represents the dependence pattern of the Fast Fourier Transform. 1 k log n. 0 i < n. the second top-to-bottom. and produces n outputs yi . The dag contains log n levels of nodes.

Input: values xi. Val90a] (see also GHSJ96]). 0 i < n. 0 i < n. Figure 3. uk . i j i j In total. the butter y dag computation can be split into two stages. The computation of a level in b y (n) consists of n=2 independent computations of b y (2). If n is su ciently large with respect to p. and with outputs uk+1 . each comprising 1=2 log n levels. and the depth of the dag is log n. l)
. computed by the circuit. each of the two stages can be completed in one superstep. Therefore. the algorithm is as follows. In level k.4: Butter y dag b y (16) denote the output of level k 1. Description. there are 1=2 n log n nonterminal nodes. Parameters: integer n p2.
and proceeds in two supersteps.4 shows the butter y dag b y (16). four independent computations of b y (4) are performed. BUTTERFLY DAG COMPUTATION
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
23
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Figure 3. In each superstep. In both
Algorithm 2. so that ulog n = yi. The computation is performed on an EREW BSPRAM(p. Output: values yi.2. the computation of any k consecutive levels consists of n=2k independent computations of b y (2k ). Figure 3. In general. if and only if jj ij = 2k . Similarly. a circuit based on b y (n). each comprising 1=2 log n levels and consisting of n1=2 independent tasks. Computation of the butter y dag b y (n). uk+1 . g.5 shows the two-superstep computation of b y (16). As observed in PY90. the butter y dag can be partitioned in a way suitable for bulk-synchronous parallel computation. there is a node i with inputs uk .3.

0 i. x(3) .
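The level-by-level structure of the butterfly dag can be sketched sequentially. The following Python fragment (an illustration, not the thesis's algorithm; the node operation is a hypothetical stand-in) computes a circuit with the dependence pattern of b y(n): in level k, each node combines the values at positions i and i XOR 2^{k−1}.

```python
# Sequential sketch of computing a circuit based on the butterfly dag.
# The node operation 'op' is a hypothetical placeholder; in the FFT the two
# outputs of a pair differ, but the dependence pattern is the same.

def butterfly(x, op):
    """Compute log n butterfly levels; len(x) must be a power of two."""
    n, u, d = len(x), list(x), 1
    while d < n:
        u = [op(u[i], u[i ^ d]) for i in range(n)]   # pairs at distance d
        d *= 2
    return u

# With addition, every output depends on every input, so all outputs equal
# the total sum of the inputs.
assert butterfly([1] * 8, lambda a, b: a + b) == [8] * 8
```

The first half of the levels only pairs indices differing in their low-order bits, which is why the dag splits into independent sub-butterflies of size n^{1/2}, one superstep per half, as in Algorithm 2.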
3.k+1 whenever such node exists (1) (2) (3) vn 1. yij . with slackness = n=p. vij0 input respectively x(1) .j. The three-dimensional cube dag cube 3 (n) with inputs x(1) . Here we describe a BSPRAM version of the BSP cube dag algorithm from McC95]. The asymptotic BSP costs of Algorithm 2 are independently optimal.j+1.j.k .5: BSPRAM computation of b y (16) supersteps. The local computation. communication and synchronisation costs are W = O(n log n=p) H = O(n=p) S = O(1) The algorithm is oblivious. k < n.1) vi+1. yik .k . such that v0jk . vi0k . vi. x(3) jk ik ij vijk is connected to each of the nodes (3.k.24
CHAPTER 3.n 1.3 Cube dag computation
The cube dag de nes the dependence pattern that characterises a large number of scienti c algorithms. The algorithm for other dimensions is similar. E cient BSP computation of a butter y dag for 1 n < p2 is considered in Val90a]. j. we consider the computation of a three-dimensional cube dag. and granularity = n=p2 .n 1 produce respectively yjk .j. vi. vi. and jk ik ij (1) (2) (3) outputs yjk . vi. x(2) . yik . COMBINATORIAL COMPUTATION IN BSP
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
b y (4)
b y (4)
b y (4)
b y (4)
b y (4)
0 4 8 12
b y (4)
1 5 9 13
b y (4)
2 6 10 14
b y (4)
3 7 11 15
Figure 3. contains n3 nonterminal nodes vijk . For simplicity.j. each processor is assigned n1=2 =p independent butter y dags of size n1=2 . x(2) . Cost analysis. yij
.k .

CUBE DAG COMPUTATION
000 300 303
25
030
033
333
Figure 3. the array V = (vijk ) is partitioned into p3=2 regular cubic blocks of size n=p1=2 . j. Vi. k < p1=2 . The diagonal layer of simultaneously computed independent blocks forms the computation \wavefront". In this algorithm.6: Cube dag cube 3 (4) 000
m00
m0m
cube 3 (n=p1=2 )
0m0
mmm 0mm
Figure 3.3.6 shows the cube dag cube 3 (4).k 1 become available. j. a circuit based on cube 3(n). Vi.k . Parameters: integer n p1=2 .j. 0 i.
Algorithm 3.k . k < n. Figure 3. Input: values x(1) . yij . The shaded diagonal layer of blocks is the current wavefront. 0 i.7 shows a stage in the BSP computation of cube 3 (n). k < n. Computation of the cube dag cube 3(n). The algorithm computes a block Vijk as soon as the data from its predecessors Vi 1. therefore the computation can be completed in O(p1=2 ) supersteps. The BSP algorithm for computing the dag cube 3 (n) is given in McC95].3. x(3) . Figure 3. jk ik ij (1) (2) (3) Output: values yjk . Each block de nes a dag isomorphic to cube 3 n=p1=2 .7: BSPRAM computation of cube 3 (n). j. 0 i. The total number of layers is 3p1=2 2. yik .
. m = n 1 The depth of the dag is 3n 2. x(2) .j 1. computed by the circuit.j. We denote these blocks by Vijk .

The computation of a block requires the communication of O(n2 =p) values on the surface of the block.26
CHAPTER 3. However. Consider two orthogonal partitionings of the base plane into n parallel lines. Therefore. the BSP cost of computing d 2 1 a cube dag cube d (n) is W = O(nd =p). l) computation of cube 3(n) with local computation cost W 5=36 n3 requires communication volume H 1=6 n2 . we will call it the base plane. Any BSPRAM(p. The algorithm is oblivious. More than 4=9 n2 1=6 n2 = 5=18 n2 of the lines consist of local nodes only. The intersection of these two line families contains more than (2=3 n)2 = 4=9 n2 nodes. To prove the optimality in H . g. we need the following lemma. if the local computation cost is at most 5=36 n3 . 0 s < 3p1=2 2. The conditional optimality of Algorithm 3 can now be demonstrated as follows. Suppose the communication volume is less than 1=6 n2. The computation is performed on an EREW BSPRAM(p. S = O(p). Proof. H = O(n). It can be computed by an algorithm similar to Algorithm 3 with BSP cost W = O(n2=p). the blocks Vijk with i + j + k = s are computed. The de nition of a three-dimensional cube dag can be naturally generalised to any dimension. Therefore. In total. COMBINATORIAL COMPUTATION IN BSP
and proceeds in 3p1=2 2 stages. The maximum number of blocks computed in any one stage is 3p=4. At least one of the planes in the higher-indexed half contains less than (1=6 n2 )=(1=2 n) = 1=3 n nonlocal nodes. We call this intersection the base diamond. and the highest-indexed node in the base diamond the base node. By construction. 5=36 n3 of which are in the lower-indexed half of the dag. l)
. each comprising a constant number of supersteps. Lemma 1. The local computation cost is W = O(n3=p). we have more than 5=18 n3 local nodes. there are more than 2=3 n lines consisting of local nodes only. In general. Cost analysis. g. The middle plane divides the dag into a lower-indexed and a higher-indexed half. The total communication cost is H = h p1=2 = O n2=p1=2 . A two-dimensional dag is called the diamond dag. Then the dag must have less than 1=6 n2 nonlocal nodes. In each partitioning. Consider the set of lines intersecting the base plane orthogonally at the base diamond. H and S are optimal for any computation with W = O(n3 =p). The synchronisation cost is S = O p1=2 . S = O p d 1 . The BSP cost values of Algorithm 3 are not independently optimal. H = O nd 1 =p d 1 . with slackness and granularity = = n2 =p. Partition the dag into n parallel planes.
Description. the local computation cost is more than 5=36 n3 . the communication volume is at least 1=6 n2 . each of these nodes must be computed by at least the same processors as the base node. In stage s. the communication cost of each stage is h = O(n2 =p). Conversely.

(i ) Trivial. By Lemma 1. 0 s < S . (iii ) S = p1=2 . Here we consider comparisonbased sorting of an array x = (xi ). (ii ) H =
27
Theorem 5.4. Let W = O(n3 =p). the total communication volume must be at least (1=6 n2 =p) p3=2 = n2 p1=2 .
(ii ) The proof is an extension of the proof given in PU87] for the diamond dag. Since the local computation cost of each chain is O(n3 =p). g. there are p1=2 blocks in each \long" chain with local computation cost at most 5=36 n3=p3=2 . GR88. the total communication cost H = n2 p1=2 =p = n2 =p1=2 . 0 i < n. where 0 s<S ms = n.
n = Σ_{0≤s<S} m_s = Σ_{0≤s<S} 1 · m_s ≤ S^{2/3} · ( Σ_{0≤s<S} m_s³ )^{1/3} ≤ S^{2/3} · W^{1/3}
Hence by assumption S n3=2 =W 1=2 = (p1=2 ). each containing at least 1=2 p1=2 blocks. We denote the sizes of P these segments by ms. The total cost of s local computation is bounded from below by the sum over all supersteps: P 3 0 s<S ms .g. we will ignore all computation in a block after the highest-indexed node has been computed once.
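The Hölder-inequality step above can be checked numerically. The following Python fragment (an illustration only, not part of the proof) verifies that n = Σ m_s ≤ S^{2/3} (Σ m_s³)^{1/3} holds for arbitrary segment sizes m_s.

```python
import random

# Numeric sanity check of the Hoelder bound used in the proof:
#   sum(m_s) <= S^(2/3) * (sum(m_s^3))^(1/3)   for any sizes m_s >= 0.

def holder_bound(m):
    S = len(m)
    return S ** (2 / 3) * sum(v ** 3 for v in m) ** (1 / 3)

random.seed(0)
for _ in range(1000):
    m = [random.randint(0, 50) for _ in range(random.randint(1, 20))]
    assert sum(m) <= holder_bound(m) + 1e-9   # small tolerance for float error
```

Equality is approached when all segments have the same size, which is why balanced supersteps give the weakest, i.e. tight, bound.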
3. Even when communication is perfectly balanced. The total number of such blocks is (3=4 p) p1=2 = p3=2 .
. the computation of a block can start only after each node in the previous block has been computed at least once. There are 3=4 p \long" chains. In every chain. Consider 3p chains of blocks parallel to the main diagonal. SORTING
requires (i ) W = (n3 =p). rst computed in each particular superstep. (iii ) Nodes of the main diagonal viii . 1 i n. Any BSPRAM(p. TB95] and references therein). Without loss of generality. Since the communication within a superstep is not allowed. For the purpose of a lower bound. l) computation of cube 3 (n) with W = O(n3=p)
n2 =p1=2 . Theorem 5 can be easily generalised to other dimensions. JaJ92. By Holder's inequality. Col93. Therefore. form a consecutive segment. Many parallel sorting algorithms of di erent complexity have been proposed (see e. Partition the cube dag into p3=2 cubic blocks of size n=p1=2 . the communication volume of such a block is at least 1=6 n2 =p.
Proof. all nodes in the diagonal cubic block of size ms spanned by the segment s must be computed by the processor that rst computes the highest-indexed node of the block.3.4 Sorting
Sorting is a classical problem of parallel computing. The local computation cost of a block is therefore m3 .

: : : . note that the order of primary samples from each subarray is preserved. : : : . xq+1 \ xk . Probably the simplest parallel sorting algorithm is parallel sorting by regular sampling (PSRS). the size of a secondary block is at most n=p2 (p + 2p) = 3n=p. corresponding to the intervals x0 . the set of all x in x such that a < x < b. each of size n=p.28
CHAPTER 3. Each secondary block is distributed across the processors. x1 . if xq . and at most two boundary primary blocks in each subarray (because a boundary block must contain one or both secondary block boundaries). : : : . xk+1 . The dashed bars at the bottom show the elements of x assumed to lie between the samples. Let us show that any secondary block contains at most 3n=p elements. the array x is partitioned into p subarrays x1 . xp .. proposed in SS92] (see also its discussion in LLS+ 93]). The PSRS algorithm proceeds as follows. Paper HJB] describes an optimised version of the algorithm. xq . We denote the secondary samples by x0 . Primary samples are shown as white dots. and its e cient implementation on a variety of platforms. the elements of each secondary block can be collected in time O(n=p). Therefore. xq . The subarrays xq are sorted independently by an optimal sequential algorithm. We denote the primary blocks of xq by xq . In the second stage of merging. We denote the samples of the subarray xq by xq . xk+1 . i. p + 1 regularly spaced primary samples are selected from each subarray (the rst and the last elements of a subarray are included among the samples). With respect to any secondary block. xq+1 an inner i i block. xp 1 . we select p + 1 regularly spaced secondary samples from the sorted array of primary samples (the rst and the last elements are again included in the samples). The problem now consists in merging the p sorted subarrays. xq+1 xk . First. bi denote an open interval. there are at most p inner primary blocks in total (because there are only p primary samples inside the secondary block). We call a primary block xq . xq . xp . xk+1 = . we divide all the primary blocks of x into three categories.
.e. The method is illustrated in Figure 3. p (p + 1) primary samples are collected top 0 1 p gether and sorted by an arbitrary sequential algorithm. xq 1 . if xq . Let ha.8 for p = 3. The samples divide each subarray into p prip 0 mary blocks of size at most n=p2 . After that. xp . Now it remains to collect the elements of each secondary block in one particular processor. The secondary samples partition the elements of x into p secondary blocks. In the rst stage of merging. COMBINATORIAL COMPUTATION IN BSP
we may assume that the elements of x are distinct (otherwise. i i i i and a boundary block. an outer block. Dotted lines show the rearrangement of primary samples into a sorted array at the bottom. and then sorted by an e cient sequential algorithm. we should attach a unique tag to each element). Then. but the samples from di erent subarrays may be interleaved. if it is neither inner nor outer. : : : . For a xed secondary block de ned by xk . The state of the array x after local sorting of the subarrays is represented by three horizontal bars at the top. : : : .

boundary and outer for x1 . x2 are shown by cross-hatching. The processor merges the received primary blocks.8: BSPRAM sorting by regular sampling the number of elements between adjacent primary samples is not necessarily equal. x2 . SORTING
29
Figure 3. We assume that the input and output arrays are stored in the main memory. The secondary block x1. a processor picks a secondary block and collects its elements. E ciency of the above computation with samples is not critical. Primary blocks that are inner. g. Only inner and boundary blocks may contain elements from x1 . Black dots indicate the secondary samples. In the second superstep. 0 i < n. Sorting by regular sampling . Description. In the third superstep. l) and proceeds in three supersteps. Parameter: integer n p3. The number of such blocks is at most 3p.4. a processor receives from other processors (in message-passing mode) all primary blocks that may intersect with the assigned secondary block. x2 is shown by cross-hatching. Algorithm 4. In the rst superstep. sort them and select p secondary samples. since the number of samples does not depend on n. and writes them to the main memory. sorts it with an e cient sequential algorithm. The computation is performed on a CRCW BSPRAM(p. and their total size is at most 3n=p. Input: array x = (xi). In order to do this. reads it. The sorting algorithm based on PSRS can be easily implemented in the BSPRAM model. Output: x rearranged in increasing order. the processors perform an identical computation: read the p (p + 1) primary samples.3. discarding the values that do not belong
. selects p + 1 primary samples. a processor picks a subarray xq . with all xi distinct. simple hatching and no hatching respectively.

The most common problem on linked (or doubly-linked) lists is list ranking : for each item determine its distance from the head (or the tail) of the list (see e. in SS92. For n p3 . COMBINATORIAL COMPUTATION IN BSP
to the assigned secondary block. such as computing all-pre x sums on a list.g.5 List contraction
This and the following sections consider BSPRAM computation on pointer structures.g. The problem is rather more complicated on parallel models. = n=p2 (ignoring non-critical computations with samples). The easiest way to obtain
. The local computation. respectively. Cost analysis. The merged result is written to the main memory. it uses a pipelined tree merging technique similar to the one employed by Cole's algorithm (see e. We assume that the computation cost of merging two items is O(1). Paper Goo96] presents a more complex BSP sorting algorithm. communication and synchronisation costs are
W = O(n log n/p)
H = O(n/p)
S = O(1)
The algorithm is not communication-oblivious. The order of items is de ned by pointers: each item contains a pointer to the next item in the sequence. Col93]).30
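The PSRS method described above can be sketched as a sequential simulation. The following Python fragment (an illustration under simplifying assumptions: distinct keys, n a multiple of p², hypothetical helper names; it is not the thesis's implementation) carries out the three phases: local sorting with primary sampling, secondary sampling, and collection of secondary blocks.

```python
import random

# Sequential sketch of parallel sorting by regular sampling (PSRS) with p
# simulated processors.  Assumes distinct keys and n divisible by p.

def psrs(x, p):
    n = len(x)
    size = n // p
    # 1. local sort of the p subarrays
    subs = [sorted(x[q * size:(q + 1) * size]) for q in range(p)]
    # 2. p+1 regularly spaced primary samples per subarray (ends included)
    primary = []
    for s in subs:
        step = (len(s) - 1) / p
        primary += [s[round(k * step)] for k in range(p + 1)]
    primary.sort()
    # 3. p+1 regularly spaced secondary samples from the sorted primary samples
    step = (len(primary) - 1) / p
    secondary = [primary[round(k * step)] for k in range(p + 1)]
    secondary[0], secondary[-1] = float('-inf'), float('inf')
    # 4. collect, sort and concatenate the p secondary blocks
    out = []
    for k in range(p):
        lo, hi = secondary[k], secondary[k + 1]
        block = [v for s in subs for v in s if lo < v <= hi]
        out += sorted(block)
    return out

random.seed(1)
data = random.sample(range(10000), 256)
assert psrs(data, 4) == sorted(data)
```

In the real algorithm each secondary block is collected and sorted by one processor, and the regular sampling argument bounds every block by 3n/p elements, balancing the load.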
CHAPTER 3. S = O log n= log(n=p) . such as linked lists and trees.
3. contract the list to a single item. A linked list is a sequence of items. For smaller values of n. asymptotically optimal for any n p. JaJ92. ABK95]. CLR90. H = O n=p log n= log(n=p) . the algorithm is identical to PSRS. In a sequential model of computation. Despite its asymptotic optimality. Lower bounds on communication complexity of sorting for various parallel models can be found e. we view these problems as instances of an abstract problem of list contraction : given an abstract operation of merging two adjacent items as a primitive. List ranking can be applied to more general list problems. it usually involves pointer jumping and some payload operations (e.g. Its BSP costs are W = O(n log n=p). A more practical BSP sorting algorithm for small values of n is described in GS96]. A doubly-linked list contains backward as well as forward pointers. Its slackness and granularity are = n=p. The asymptotic BSP costs of Algorithm 4 are independently optimal. The rst and the last items of the list are called its head and tail. the algorithm from Goo96] is unlikely to be practical in the case of n p. a list can be contracted by a trivial algorithm that traverses the list of n items in (n) time. Following LM88]. RMMM93]). The last item contains the null pointer. summation of item values). Implementation of the merging primitive is problem-dependent.g.

Many attempts have been made to improve the PRAM time-processor efficiency of randomised list contraction, aiming at an algorithm optimal both in time and in the time-processor product. Paper [MR85] introduced a technique of random mating. The random mating algorithm proceeds in a sequence of rounds. In each round every item is marked either forward-looking or backward-looking by flipping an independent unbiased coin. Then pairs of adjacent items that "look at each other" merge. One round of the algorithm reduces the size of the list by about a quarter; the procedure is repeated until only one item is left, therefore the expected number of parallel steps is Θ(log n). The expected amount of work¹ performed by the above algorithm is optimal; however, in the PRAM model the time-processor product is still suboptimal, due to some processors being idle for some time. An algorithm from [RM96] is time-processor optimal. Although it is slightly suboptimal in time, it performs better in practice than the more sophisticated algorithm from [AM90].

Optimal efficiency for randomised list contraction is much easier to achieve in the BSPRAM model, given sufficient slackness. The following straightforward implementation of random mating is based on a BSP algorithm suggested by [McC96a].

Algorithm 5. Randomised list contraction.

Parameter: integer n ≥ p² log p.

Input: linked list of size n.

Output: input list contracted to a single item.

Description. The computation is performed on an EREW BSPRAM(p, g, l), in message-passing mode, and proceeds in two stages.

First stage. Each processor reads an equal number of input items from the main memory. We reduce the list from n to n/p items by repeated rounds of random mating. Each round is implemented by a superstep. The processor to hold each merged pair is chosen at random (other methods of choice are possible). An analysis along the lines of [LM88] shows that O(log p) rounds will suffice on the first stage with high probability.

Second stage. The remaining n/p items are collected in a single processor. This processor completes the contraction by local computation.

¹ Following [LM88], we use the term work for the actual number of operations performed, which may be smaller than the time-processor product; the time-processor product is also sometimes called "work".
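A round of random mating can be sketched sequentially. The following Python fragment (an illustration only, with hypothetical data representation: the list is a map from item id to successor id; it is not the thesis's implementation) merges every pair of adjacent items that "look at each other".

```python
import random

# One round of random mating on a linked list (sequential simulation).
# succ maps item id -> next item id, with None marking the tail.

def random_mating_round(succ, rng):
    """Mark each item forward- or backward-looking, merge mating pairs."""
    flip = {v: rng.random() < 0.5 for v in succ}   # True = forward-looking
    merged = set()
    new_succ = dict(succ)
    for v, w in succ.items():
        # v looks forward, its successor w looks backward: merge w into v
        if w is not None and flip[v] and not flip[w] \
                and v not in merged and w not in merged:
            new_succ[v] = succ[w]
            del new_succ[w]
            merged.update((v, w))
    return new_succ

rng = random.Random(42)
succ = {i: i + 1 for i in range(99)}
succ[99] = None                                    # a list of 100 items
while len(succ) > 1:
    succ = random_mating_round(succ, rng)
assert list(succ.values()) == [None]               # contracted to one item
```

Each round removes roughly a quarter of the items in expectation (a given adjacent pair mates with probability 1/4), which matches the Θ(log n) expected number of rounds quoted above.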
31
an e cient parallel list contraction algorithm is by randomisation. One round of the algorithm reduces the size of the list by about a quarter. The procedure is repeated until only one item is left. After that. Output: input list contracted to a single item. Optimal e ciency for randomised list contraction is much easier to achieve in the BSPRAM model. l). which may be smaller than the time-processor product. Parameter: integer n p2 log p. however.

g. The BSPRAM implementation of deterministic mating is as follows. The graph is used to mark each processor either forward-looking or backward-looking. communication and synchronisation costs of the whole algorithm are
Wexp = O(n=p)
Hexp = O(n=p)
Sexp = O(log p)
The algorithm is not communication-oblivious. Its expected slackness is exp = n=p2 . equal to O(n=p). dominates all subsequent communication with high probability. provided that the input size is su cient. e. JaJ92. At this stage. Therefore. After that.32
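One round of random mating can be simulated sequentially as follows (an illustrative Python sketch, not part of the original algorithm specification; the linked list is represented by a successor map, and a mating pair is absorbed into its leading item):

```python
import random

def random_mating_round(succ, rng):
    """One round of random mating on a linked list.

    succ maps each item to its successor (None at the tail).  Every item
    flips a fair coin; a forward-looking item merges with its successor
    exactly when that successor is backward-looking.  Mating pairs are
    automatically disjoint, and about a quarter of the items disappear
    per round in expectation.
    """
    forward = {v: rng.random() < 0.5 for v in succ}
    for v in list(succ):          # snapshot: each original item visited once
        w = succ.get(v)           # v may have absorbed nothing yet; w may be gone
        if w is not None and forward[v] and not forward[w]:
            succ[v] = succ[w]     # v absorbs its successor w
            del succ[w]
    return succ
```

Repeating the round until a single item remains gives the expected Θ(log n) contraction; in the BSPRAM algorithm each such round is one superstep.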
CHAPTER 3. Expected (with high probability) local computation. Parameter: integer n p3 log p. Each round starts with contracting all chains of adjacent items that are local to any particular processor.g. Let m be the total number of items before the current round. Another direction of research has been aimed at providing an optimal deterministic algorithm for list contraction. Such a marking always exists and can be easily computed from the graph by a greedy algorithm in sequential time O(p2 ). Its granularity is = 1. The weight of an edge v1 ! v2 is de ned as the number of adjacent pairs of items. the total number of remaining items is at most 3m=4. COMBINATORIAL COMPUTATION IN BSP
Another direction of research has been aimed at providing an optimal deterministic algorithm for list contraction. Known efficient deterministic algorithms for PRAM (see e.g. [JaJ92, RMMM93]) typically involve the method of symmetry breaking by deterministic coin tossing introduced in [CV86]. Such algorithms are complicated and often assume non-standard arithmetic capabilities of the computational model, e.g. bitwise operations on integers.

As in the case of randomised algorithms, it is much easier to design an optimal deterministic algorithm for list contraction in the BSP model. Our deterministic list contraction algorithm is based on the technique of deterministic mating, described below.

Let m be the total number of items before the current round. Each round starts with contracting all chains of adjacent items that are local to any particular processor. Any item that remains in the list has both neighbours outside its containing processor. After that, a complete weighted digraph is constructed. The graph has p nodes, each node representing a processor. The weight of an edge v1 → v2 is defined as the number of adjacent pairs of items, where the leading and the trailing item are contained in the processor represented by v1 and v2 respectively. The graph is used to mark each processor either forward-looking or backward-looking. The forward and backward marks are assigned in such a way that the number of adjacent pairs of items "looking at each other" is at least m/4. Such a marking always exists and can be easily computed from the graph by a greedy algorithm in sequential time O(p²). Each item assumes the mark of the containing processor. Then pairs of neighbours that "look at each other" merge. After that, the total number of remaining items is at most 3m/4, but their distribution across the processors may not be even. Therefore, it is necessary to redistribute the items so that each processor receives at most 3m/(4p) of them.

The BSPRAM implementation of deterministic mating is as follows.

Algorithm 6 (Deterministic list contraction).

Parameter: integer n ≥ p³ log p.
Input: linked list of size n.
Output: input list contracted to a single item.

Description. The computation is performed on an EREW BSPRAM(p, g, l) in message-passing mode and proceeds in two stages.

First stage. Each processor reads an equal number of input items from the main memory. We reduce the list from n to n/p items by repeated rounds of deterministic mating. Each round is implemented by three supersteps. In the first superstep, each processor reduces all local chains and computes the number of links from local items to items in each of the other processors. This establishes the weights of edges leaving the processor's node in the representing graph. After that, the whole graph is collected in a single processor. This processor computes the marks and tells each processor its mark. In the second superstep, pairs of adjacent items that "look at each other" merge. The processor to hold the merged pair is chosen arbitrarily between the two processors holding the original items. The third superstep redistributes the items so that each processor receives at most 3m/(4p) of them. This completes the current round. The total number of rounds necessary to reduce the list to n/p items in the first stage is O(log p).

Second stage. The remaining n/p items are collected in a single processor. This processor completes the contraction by local computation.

Cost analysis. Since the size of the list decreases exponentially, the communication cost of the first round, equal to O(n/p), dominates all subsequent communication. The local computation, communication and synchronisation costs of the whole algorithm are

    W = O(n/p)        H = O(n/p)        S = O(log p)

The algorithm is not communication-oblivious. Its slackness and granularity are σ = n/p², γ = 1 (ignoring non-critical computation of processor marks).

In Algorithms 5 and 6, the asymptotic values of W, H, S are not independently optimal. Intuitively, it seems likely that H and S are optimal for any computation with W = O(n/p), but the question of a proof remains open. Some lower bounds for parallel list contraction have been proved in [Sib97].
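The greedy computation of the marks can be realised by the method of conditional expectations (an illustrative Python sketch; this simple version recomputes the expectation from scratch and therefore costs O(p³) rather than the O(p²) stated above, which would require incremental updates — all names are ours):

```python
def greedy_marking(w):
    """Derandomised mating marks via conditional expectations.

    w[u][v] = number of adjacent list-item pairs whose leading item lives
    on processor u and whose trailing item lives on processor v.
    A pair on edge (u, v) mates iff u is marked forward ('F') and v is
    marked backward ('B').  A uniformly random marking mates each pair
    with probability 1/4; fixing marks one by one, always keeping the
    conditional expectation from dropping, captures at least a quarter
    of the total weight.
    """
    p = len(w)
    mark = [None] * p                      # None = still random, prob 1/2 each way

    def p_fwd(u):                          # probability that u is marked forward
        return 1.0 if mark[u] == 'F' else (0.0 if mark[u] == 'B' else 0.5)

    def expected():                        # expected mating weight, current marks
        return sum(w[u][v] * p_fwd(u) * (1.0 - p_fwd(v))
                   for u in range(p) for v in range(p) if u != v)

    for q in range(p):                     # fix processors one at a time
        mark[q] = 'F'; e_fwd = expected()
        mark[q] = 'B'; e_bwd = expected()
        mark[q] = 'F' if e_fwd >= e_bwd else 'B'
    return mark
```

Since the expectation is linear in each individual mark probability, the chosen branch is always at least the current expectation, so the final (deterministic) mating weight is at least the initial expectation, i.e. a quarter of the total edge weight.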
3.6 Tree contraction

Both the randomised and the deterministic versions of list contraction can be used to solve the problem of tree contraction. This well-studied problem (see e.g. [JaJ92, RMMM93]) generalises list contraction. Similarly to list contraction, the problem is defined in terms of an abstract operation of merging two adjacent nodes in a binary tree. The merging operation is considered primitive. In tree contraction, two kinds of merging are allowed:

raking, where a leaf is absorbed into its parent node. The parent and child of the resulting node are respectively the parent and the remaining child of the absorbing node;

compression, where a non-leaf with only one child absorbs its child. The parent and children of the resulting node are respectively the parent of the absorbing node and the children of the absorbed node.

When the absorbing node has only one child, which is a leaf, merging can be classified both as raking and compression. As before, the goal of tree contraction is to reduce the tree to a single node. Tree contraction provides an efficient solution to problems connected with parallel evaluation of arithmetic expressions. It is also used as a subroutine in some parallel graph algorithms, such as the minimum spanning tree computation.

Several approaches to tree contraction have been developed for the PRAM model. One method to obtain an efficient PRAM algorithm for tree contraction is by generalising the technique of random mating (see e.g. [LM88]). Another possibility is to reduce the problem to list contraction by considering lists associated with the tree, such as its Euler tour. The latter approach is followed in [GMT88] (see also [RMMM93]). Although not originally intended for the BSP model, the algorithm from [GMT88] (more precisely, its "m-contraction" phase) can be efficiently implemented on a BSPRAM. We sketch this implementation below.

The main idea of the method is to partition a tree of size n into edge-disjoint subtrees of size at most n/p, called bridges. Bridges possess an important characteristic property: each of them is attached to the rest of the tree by at most one leaf, and by the root (unless the bridge contains the root of the tree and all its children). Paper [GMT88] (see also [RMMM93]) shows that such a partitioning always exists, and can be obtained by cutting the tree in at most 2p − 1 nodes. Figure 3.9 shows a tree partitioned into seven bridges. The partitioning can be computed by list contraction (specifically, all-prefix sums computation) on the Euler tour of the tree. The BSPRAM algorithm is as follows.

Figure 3.9: BSPRAM tree contraction

Algorithm 7 (Tree contraction).

Parameter: integer n ≥ p² log p (respectively, n ≥ p³ log p) for the randomised (respectively, deterministic) version of the algorithm.
Input: tree of size n.
Output: input tree contracted to a single node.

Description. The computation is performed on an EREW BSPRAM(p, g, l) in message-passing mode and proceeds in four stages.

First stage. Each processor reads an equal number of input nodes from the main memory. The tree is partitioned into bridges. The partitioning is computed by several rounds of list contraction (specifically, all-prefix sums computation with varying basic operation) on the Euler tour of the tree. The list contraction is performed by Algorithm 5 or 6.

Second stage. A distribution of bridges across the processors is computed. The distribution is computed by another all-prefix sums computation on the Euler tour of the tree. By this distribution, each processor is assigned either a single bridge, or several bridges with a common root. The total size of the bridges assigned to any single processor is at most n/p.

Third stage. Each processor receives the assigned bridges and performs sequential tree contraction on each of them, reducing the bridges to their common root. This is made possible by the characteristic single-leaf attachment property of the bridges.

Fourth stage. The remaining tree of size p is collected in a single processor. This processor completes the contraction by local computation.

Cost analysis. The partitioning of the tree in the first stage and the distribution of the bridges in the second stage are computed by Algorithm 5 or 6. Their costs dominate (deterministically or with high probability) the cost of the remaining two stages. The costs of the whole algorithm (deterministic or expected with high probability) are

    W_det/exp = O(n/p)        H_det/exp = O(n/p)        S_det/exp = O(log p)

The algorithm is not communication-oblivious. Its slackness and granularity are the same as in Algorithm 5 or 6: σ_det/exp = n/p², γ = 1.

Thus, tree contraction can be performed in the BSPRAM model with the aid of list contraction; little extra effort is required. The obtained algorithm for tree contraction has the same asymptotic costs as the list contraction algorithm employed.
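The sequential contraction performed on each bridge in the third stage can be sketched as follows (an illustrative Python sketch; the dictionary representation of the tree and the function name are our assumptions):

```python
def contract(children, root):
    """Sequential tree contraction by raking and compression.

    children: dict mapping node -> list of children (a binary tree).
    Repeatedly (a) rakes a leaf into its parent, then (b) compresses any
    resulting non-leaf that has a single non-leaf child, until only the
    root remains.  Returns the number of merge operations, which is
    always n - 1 for a tree of n nodes.
    """
    parent = {c: v for v, cs in children.items() for c in cs}
    nodes = set(children) | set(parent)
    merges = 0
    while len(nodes) > 1:
        # rake: absorb some non-root leaf into its parent
        leaf = next(v for v in nodes if not children.get(v) and v != root)
        p = parent.pop(leaf)
        children[p].remove(leaf)
        nodes.remove(leaf)
        merges += 1
        # compress: a non-leaf with only one (non-leaf) child absorbs it
        while (cs := children.get(p)) and len(cs) == 1 and children.get(cs[0]):
            only = cs[0]
            children[p] = children.pop(only)   # inherit the absorbed node's children
            for c in children[p]:
                parent[c] = p
            parent.pop(only)
            nodes.remove(only)
            merges += 1
    return merges
```

On a bridge, the single-leaf attachment property guarantees that this local process never needs information from another processor until the bridge has shrunk to its root.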
Chapter 4

Dense matrix computation in the BSP model

4.1 Matrix-vector multiplication

Matrix-vector multiplication is a common operation in scientific computing, especially in iterative approximation methods. The general problem is defined as computation of the product A·b = c, where A is an n × n matrix, and b, c are n-vectors over a semiring. The method consists in straightforward computation of the family of linear forms

    c[i] = Σ_{j=1}^{n} A[i,j]·b[j],    1 ≤ i ≤ n        (4.1)

Following (4.1), we need to set

    c[i] ← 0    for i = 1, …, n        (4.2)

and then compute

    c[i] ← c[i] + A[i,j]·b[j]    for all i, j, 1 ≤ i, j ≤ n        (4.3)

Computation (4.3) for different pairs i, j is independent (although it requires concurrent reading from b[j] and concurrent writing to c[i]), and therefore can be performed in parallel.

The BSPRAM algorithm for matrix-vector multiplication is a straightforward adaptation of the BSP algorithm from [BM93, McC95]. We assume that matrix A has been initially distributed across the processors. This assumption is natural in iterative approximation, where the cost of initial matrix redistribution can be amortised over a long series of iterations. By assuming that matrix A is predistributed, we concentrate on the cost of matrix-vector multiplication, rather than the cost of input/output, ignoring the BSP cost of such distribution.
Matrix A is represented by a square of size n in the integer plane (see Figure 4.1). Arrays b, c are represented by projections of the square A onto the coordinate axes. Computation of the product A[i,j]·b[j] requires the input from node b[j], and the output to node c[i].

Figure 4.1: Matrix-vector multiplication dag

Figure 4.2: Matrix-vector multiplication in BSPRAM

In order to provide a communication-efficient BSP algorithm, matrix A must be divided into p regular square blocks of size n/p^{1/2} (see Figure 4.2):

    A = ( A[[1,1]] … A[[1,p^{1/2}]] ; … ; A[[p^{1/2},1]] … A[[p^{1/2},p^{1/2}]] )        (4.4)

Vectors b, c are divided into p^{1/2} conforming regular intervals b[[j]], c[[i]] of size n/p^{1/2}. Computation (4.2), (4.3) can be expressed in terms of blocks as

    c[[i]] ← 0    for i = 1, …, p^{1/2}        (4.5)
and then

    c[[i]] ← c[[i]] + A[[i,j]]·b[[j]]    for all i, j, 1 ≤ i, j ≤ p^{1/2}        (4.6)

The initial distribution of A is such that every processor holds a separate block A[[i,j]]. The algorithm is as follows.

Algorithm 8 (Matrix-vector multiplication).

Parameter: integer n ≥ p^{1/2}.
Input: n-vector b over a semiring; n × n matrix A over a semiring, predistributed across the processors.
Output: n-vector c = A·b.

Description. The computation is performed on a CRCW BSPRAM(p, g, l) and proceeds in one superstep. In the input phase, each processor reads the block b[[j]], and then computes the block product A[[i,j]]·b[[j]] sequentially by (4.6) for its particular pair i, j. The computed block is then written to c[[i]] in the main memory. Concurrent writing is resolved by addition of the written blocks to the previous content of c[[i]]. The resulting vector c is the product of A and b.

Cost analysis. After the initialisation step (4.5), each processor performs the computation (4.6) for a particular pair i, j. The local computation, communication and synchronisation costs are

    W = O(n²/p)        H = O(n/p^{1/2})        S = O(1)
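The one-superstep block algorithm can be simulated sequentially as follows (an illustrative Python sketch; the double loop over (i, j) plays the role of the p processors, and the use of numpy is an assumption):

```python
import numpy as np

def bsp_matvec(A, b, p):
    """Block matrix-vector product c = A @ b, as in Algorithm 8.

    The n x n matrix is cut into p square blocks of side n/p^(1/2); the
    "processor" holding block (i, j) reads the interval b[[j]], multiplies,
    and writes its partial result into c[[i]], with concurrent writes
    resolved by addition.
    """
    n, s = len(A), int(round(p ** 0.5))
    assert s * s == p and n % s == 0
    k = n // s                                   # block size n / p^(1/2)
    c = np.zeros(n)                              # "main memory" for the output
    for i in range(s):                           # each (i, j) is one processor
        for j in range(s):
            blk = A[i*k:(i+1)*k, j*k:(j+1)*k]    # predistributed block A[[i,j]]
            bj = b[j*k:(j+1)*k]                  # input phase: read b[[j]]
            c[i*k:(i+1)*k] += blk @ bj           # CRCW write resolved by addition
    return c
```

Each simulated processor touches O(n/p^{1/2}) input and output words, matching the communication cost H above.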
The algorithm is oblivious, with slackness and granularity σ = γ = n²/p^{1/2}.

It can be shown that Algorithm 8 is an optimal algorithm for matrix-vector multiplication. The proof is omitted, due to its similarity to the optimality proof for matrix multiplication (Theorem 7 in Section 4.3).

4.2 Triangular system solution

Triangular systems of linear equations play an important role in scientific computation. One of their most common applications is in solution of general linear systems, where the system matrix is decomposed into a product of triangular factors, and then the resulting triangular systems are solved. The problem is formulated as follows: given a lower triangular matrix A and a vector c, find a vector b such that A·b = c. For a nonsingular matrix A over a field, the solution is b = A⁻¹·c. The problem can be formulated more generally over a semiring, using the closure operation. The generalised problem is: given a lower triangular matrix A and a vector c over a semiring, find the vector b = A*·c,
assuming that the matrix closure A* = I + A + A² + ⋯ exists. The equivalence of the original and the generalised problem over a field follows from the identity A* = (I − A)⁻¹.

Explicit computation of the closure A* requires Θ(n³) operations. However, there is no need to compute A* explicitly: the standard substitution technique can be used to find b = A*·c in sequential time Θ(n²). Here,

    b[k] = A[k,k]*·( c[k] + Σ_{1 ≤ j < k} A[k,j]·b[j] ),    1 ≤ k ≤ n

Solution of a triangular system by substitution can be viewed as a dag computation. As before, matrix A is represented by the lower triangular part of a square of size n in the integer plane (see Figure 4.3). Arrays b, c are represented by projections of the square A onto the coordinate axes. A node A[i,j], i > j, represents the computation of the product A[i,j]·b[j]; it requires the input from node A[j,j]. A node A[k,k] represents the computation of b[k]; it requires the input from node c[k] and from all products A[k,j]·b[j], 1 ≤ j < k. The output of a node A[k,k] is to node b[k], and to all products A[i,k]·b[k], k < i ≤ n.

Figure 4.3: Triangular system solution dag

The above computation can also be arranged as a diamond dag (see e.g. [McC95]). Figure 4.4 shows the resulting dag. Here, the action of a node v_ij, i > j, is

    b[j] ↦ b[j],    c[i] ↦ c[i] + A[i,j]·b[j]

and the action of a node v_kk is

    c[k] ↦ b[k] = A[k,k]*·( c[k] + Σ_{1 ≤ j < k} A[k,j]·b[j] )

Figure 4.4: Triangular system solution by a diamond dag

The dag is similar to the diamond dag □²(n); nodes v_ij, i < j, are not used. It can be computed by Algorithm 3 with BSP cost W = O(n²/p), H = O(n), S = O(p) (see Section 3).

An alternative approach to triangular system solution is recursion. The recursive algorithm works by dividing matrix A into square blocks of size n/2,

    A = ( A11  0 ; A21  A22 )        (4.7)

dividing vectors b, c into conforming intervals of size n/2,

    b = ( b1 ; b2 )        c = ( c1 ; c2 )        (4.8)

and then applying block substitution:

    b1 = A11*·c1        b2 = A22*·( c2 + A21·b1 )        (4.9)
The procedure can be applied recursively to find the vectors A11*·c1 and A22*·(c2 + A21·b1). The resulting vector is

    A*·c = ( A11*·c1 ; A22*·( c2 + A21·A11*·c1 ) )        (4.10)

There is no substantial parallelism between block triangular system solution and block multiplication tasks in (4.9); therefore, we can only exploit the parallelism within block multiplication.

As in the previous section, we assume that matrix A has been initially distributed across the processors at no BSP cost. This assumption is natural in typical applications, such as solution of general linear systems, where the matrix A is obtained from a previous computation. By assuming that the matrix A is predistributed, we concentrate, as before, on the cost of solving the triangular system, rather than the cost of input/output. The initial distribution of A should allow one to perform the described computations without redistributing the matrix. The easiest way of achieving this is to partition matrix A into p² regular square blocks A[[i,j]], 0 ≤ i, j < p. A processor q can be assigned to hold e.g. all blocks A[[i,j]] with i − j = q (mod p), or all blocks A[[q,j]] for 0 ≤ j < p, or all blocks A[[i,q]] for 0 ≤ i < p. The algorithm is as follows.

Algorithm 9 (Triangular system solution).

Parameters: integer n ≥ p^{3/2}.
Input: n-vector c over a semiring; n × n lower triangular matrix A over a semiring, predistributed across the processors.
Output: n-vector b = A*·c.

Description. The computation is performed on a CRCW BSPRAM(p, g, l), and is defined by recursion on the size of the matrix and vectors. Denote the matrix size at the current level of recursion by m, keeping n for the original size. Let n0 = n/p. Value n0 is the threshold at which the algorithm switches from parallel to sequential computation. The recursion tree is computed in depth-first order. In each level of recursion, the matrix and vectors are divided into regular blocks of size m/2 as shown in (4.7), (4.8). We now describe the allocation of block triangular systems and block multiplication tasks in (4.9) to the BSPRAM processors. Computation (4.9) is performed by the following schedule.

Large blocks. If n0 < m ≤ n, each triangular system in (4.9) is solved in parallel by all processors available at that level; every block multiplication in (4.9) is also performed in parallel by all processors available at that level. Initially, all p processors are available to compute the triangular system solution A11*·c1: compute b1 by recursion; then compute A21·b1 by Algorithm 8; compute c2 + A21·b1; finally, compute A22*·(c2 + A21·b1) by recursion. Each of these computations is performed with all processors that are available without matrix redistribution.

Small blocks. If 1 ≤ m ≤ n0, the blocks have become sufficiently small, and triangular systems are solved sequentially by an arbitrarily chosen processor: compute (4.9) on the processor that holds the current block.

Cost analysis. The values for W = Wp(n), H = Hp(n), S = Sp(n) can be found from the following recurrence relations:

    for n0 < m ≤ n:    Wq(m) = 2·W_{q/2}(m/2) + O(m²/q)
                       Hq(m) = 2·H_{q/2}(m/2) + O(m/q^{1/2})
                       Sq(m) = 2·Sq(m/2) + O(1)
    for m = n0:        Wq(m) = O(n0²)    Hq(m) = O(n0)    Sq(m) = O(1)

as

    W = O(n²/p)        H = O(n)        S = O(p)

The algorithm is oblivious, with slackness and granularity σ = γ = n/p.

The above analysis shows that the recursive algorithm for triangular system solution has the same BSP cost as the diamond dag algorithm. The use of block multiplication in the recursive algorithm does not lead to an improvement in communication cost.
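Over the reals, the closure operation in (4.9) specialises to ordinary forward substitution, and the recursive scheme can be sketched as follows (an illustrative Python version; the `threshold` parameter stands in for the switch-over size n0, and numpy is assumed):

```python
import numpy as np

def tri_solve(A, c, threshold=2):
    """Recursive block substitution for a lower-triangular system A b = c.

    Field version of (4.9): b1 solves A11 b1 = c1, then b2 solves
    A22 b2 = c2 - A21 b1.  Below the threshold ("small blocks"), the
    system is solved by plain sequential forward substitution.
    """
    n = len(c)
    if n <= threshold:                    # small blocks: sequential solve
        b = np.zeros(n)
        for i in range(n):
            b[i] = (c[i] - A[i, :i] @ b[:i]) / A[i, i]
        return b
    m = n // 2                            # split as in (4.7), (4.8)
    b1 = tri_solve(A[:m, :m], c[:m], threshold)
    b2 = tri_solve(A[m:, m:], c[m:] - A[m:, :m] @ b1, threshold)
    return np.concatenate([b1, b2])
```

The two recursive calls are inherently sequential — b1 must be known before the right-hand side for b2 exists — which is exactly the lack of parallelism between the block triangular solves noted above.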
We now show that the asymptotic BSP cost of Algorithm 9 cannot be reduced by any computation of the substitution dag (Figure 4.3).

Theorem 6. Any BSPRAM(p, g, l) computation of the triangular system substitution dag with W = O(n²/p) requires (i) W = Ω(n²/p); (ii) H = Ω(n); (iii) S = Ω(p).

Proof. (i) Trivial.

(ii) Let m0 = 0. Let q0 be the first processor computing the node v00. Let m1 be the first row of G not containing a node computed by processor q0 (if every row contains a node computed by q0, then m1 = n). Define G0 as a subdag of G consisting of rows i with 0 ≤ i < m1, and F1 as a subdag of G consisting of columns j with m1 ≤ j ≤ n − 1. Let q1 be the first processor computing the node v_{m1,m1}. Let m2 be the first row of F1 not containing a node computed by processor q1. Define G1 as a subdag of F1 consisting of rows i with m1 ≤ i < m2, and F2 as a subdag of F1 consisting of columns j with m2 ≤ j ≤ n − 1. Repeat this process until some current m_{r+1} = n (and therefore F_{r+1} is empty). We have obtained a sequence of subdags G0, …, Gr along the main diagonal of the dag G (see Figure 4.5). The computation of each subdag can start only after each node in the previous subdag has been computed at least once. For the purpose of a lower bound, we will ignore all computation in a subdag after the highest-indexed node has been computed once. We may also assume that each computed value is used at least once.

Figure 4.5: Proof of Theorem 6, part (ii)

Consider a subdag Gs, 0 ≤ s < r. By construction of Gs, for each k, ms ≤ k < m_{s+1}, the node v_{m_{s+1},k} is not computed by processor qs. There are two cases:

• v_kk is computed by processor qs. Then processor qs must communicate the value computed in v_kk to the processor computing v_{m_{s+1},k}.

• v_kk is not computed by processor qs. By construction of Gs, there is some j, ms ≤ j < k, such that v_kj is computed by processor qs. Processor qs must communicate the value computed in v_kj to one of the processors computing v_kk.

In both cases, for each k, processor qs must send a distinct value; therefore the communication cost of computing Gs is at least m_{s+1} − ms.

Now consider the subdag Gr. For each k, mr ≤ k < n, there are three cases:

• v_kk is not computed by processor qr. Similarly to the first case above, processor qr must communicate the value computed in some v_kj, mr ≤ j < k, to one of the processors computing v_kk.

• v_kk is computed by processor qr, but for some j, mr ≤ j < k, the node v_kj is not computed by processor qr. In this case, the value computed in v_kj must be communicated to processor qr in order to compute v_kk.

• v_kk is computed by processor qr, and v_kj is computed by processor qr for all j, mr ≤ j < k.

In the first two cases, for each k, processor qr must send or receive a distinct value. Since W = O(n²/p), the total number of values k falling under the third case cannot exceed n/p^{1/2}. Therefore, the communication cost of computing Gr is at least n − mr − n/p^{1/2}. Since for every s, the computation of Gs must be completed before the computation of G_{s+1} can start, the total communication cost is at least

    (m1 − m0) + (m2 − m1) + ⋯ + (n − mr) − n/p^{1/2} = n − n/p^{1/2} = Ω(n)

(iii) The proof is a two-dimensional version of the proof for Theorem 5. Nodes of the main diagonal v_ii, 0 ≤ i < n, first computed in each particular superstep, form a consecutive segment. We denote the sizes of these segments by ms, 0 ≤ s < S, where Σ_{0 ≤ s < S} ms = n. Since communication within a superstep is not allowed, all nodes in the diagonal square block of size ms spanned by the segment s must be computed by the processor that first computes the highest-indexed node of the block. The local computation cost of such a block is therefore ms². The total cost of local computation is bounded from below by the sum over all supersteps: Σ_{0 ≤ s < S} ms². By Cauchy's inequality,

    n = Σ_{0 ≤ s < S} ms = Σ_{0 ≤ s < S} 1·ms ≤ S^{1/2}·( Σ_{0 ≤ s < S} ms² )^{1/2} ≤ S^{1/2}·W^{1/2}

Hence by assumption S ≥ n²/W = Ω(p).

Note that the recursive block substitution used in Algorithm 9 defines a dag different from the ordinary substitution dag in Figure 4.3. Hence, Theorem 6 does not cover all standard methods of triangular system solution in BSP, in particular the method used by Algorithm 9 itself. However, it may be possible to extend the theorem to a more general class of dags, including the ordinary substitution dag and the recursive block substitution dag, as well as the diamond dag. The standard algorithms for all the above dags have similar BSP costs, which suggests that these algorithms may be optimal for triangular system solution under the new extended definition.

4.3 Matrix multiplication

In this section we describe a BSPRAM algorithm for one of the most common problems in scientific computation: dense matrix multiplication. We aim to parallelise the standard Θ(n³) method, asymptotically optimal for sequential matrix multiplication over a general semiring (see [HK71]).
We deal with the general problem of computing the matrix product A·B = C, where A, B, C are n × n matrices over a semiring. The method consists in straightforward computation of the family of bilinear forms

    C[i,k] = Σ_{j=1}^{n} A[i,j]·B[j,k],    1 ≤ i, k ≤ n        (4.11)

Following (4.11), we need to set

    C[i,k] ← 0    for i, k = 1, …, n        (4.12)

and then compute

    C[i,k] ← C[i,k] + A[i,j]·B[j,k]    for all i, j, k, 1 ≤ i, j, k ≤ n        (4.13)

Computation (4.13) for different triples i, j, k is independent (although it requires concurrent reading from A[i,j] and B[j,k], and concurrent writing to C[i,k]), and therefore can be performed in parallel.

The BSPRAM algorithm implementing this method is derived from the BSP algorithm due to McColl and Valiant, described in [McC95, McC96c]. The algorithm combines the idea of two-phase broadcast (see Section 2.2) with symmetric three-dimensional problem partitioning, previously used e.g. in [ACS90] (see also [ABG+95] for further references and experimental results). The algorithm works by a straightforward partitioning of the problem dag. Array V is represented as a cube of volume n³ in integer three-dimensional space (see Figure 4.6). Arrays A, B, C are represented as projections of the cube V onto the coordinate planes. Computation of the node V[i,j,k] requires the input from nodes A[i,j], B[j,k], and the output to node C[i,k].

Figure 4.6: Matrix multiplication dag

In order to provide a communication-efficient BSP algorithm, the array V must be divided into p regular cubic blocks of size n/p^{1/3} (see Figure 4.7). Such partitioning induces a partition of the matrices A, B, C into p^{2/3} regular square blocks of size n/p^{1/3}:

    A = ( A[[1,1]] … A[[1,p^{1/3}]] ; … ; A[[p^{1/3},1]] … A[[p^{1/3},p^{1/3}]] )        (4.14)

and similarly for B, C. Computation (4.12), (4.13) can be expressed in terms of blocks as

    C[[i,k]] ← 0    for i, k = 1, …, p^{1/3}        (4.15)

and then

    V[[i,j,k]] ← A[[i,j]]·B[[j,k]]
    C[[i,k]] ← C[[i,k]] + V[[i,j,k]]    for all i, j, k, 1 ≤ i, j, k ≤ p^{1/3}        (4.16)
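The one-superstep blocked computation (4.15), (4.16) can be simulated sequentially as follows (an illustrative Python sketch; the triple loop enumerates the p processors, and numpy is assumed):

```python
import numpy as np

def bsp_matmul(A, B, p):
    """Three-dimensionally blocked product C = A @ B, as in (4.15), (4.16).

    The n^3 cube of elementary products is cut into p cubic blocks of
    side n/p^(1/3); the "processor" owning block (i, j, k) reads A[[i,j]]
    and B[[j,k]], multiplies them, and adds the result into C[[i,k]],
    concurrent writes being resolved by addition.
    """
    n, s = len(A), round(p ** (1 / 3))
    assert s ** 3 == p and n % s == 0
    m = n // s                                   # block size n / p^(1/3)
    C = np.zeros((n, n))                         # "main memory", CRCW by addition
    for i in range(s):
        for j in range(s):
            for k in range(s):                   # one processor per (i, j, k)
                V = A[i*m:(i+1)*m, j*m:(j+1)*m] @ B[j*m:(j+1)*m, k*m:(k+1)*m]
                C[i*m:(i+1)*m, k*m:(k+1)*m] += V
    return C
```

Each simulated processor reads two blocks and writes one, i.e. O(n²/p^{2/3}) words, matching the communication cost established below.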
Figure 4. k p1=3 .7: Matrix multiplication in BSPRAM in ACS90] (see also ABG+ 95] for further references and experimental results).13) can be expressed in terms of blocks as C i. p1=3 ]] C . The algorithm is as follows. 0 A 1. j ]. Computation of the node V i. k. : : : . In order to provide a communication-e cient BSP algorithm. Computation (4. (4. Array V is represented as a cube of volume n3 in integer threedimensional space (see Figure 4. k]] + V i. and the output to node C i.. j.14) @ A . 1]] 1=3 . the array V must be divided into p regular cubic blocks of size n=p1=3 (see Figure 4. k]] C i. 1=3 . k] requires the input from nodes A i. k]] sequentially by (4. (4.6).16) for all i.12).. k]] 0 for i. j.15) and then V i. A = B . . Arrays A. 1 i.

l) computation of the standard matrix multiplication dag requires (i ) W = (n3 =p). Concurrent writing is resolved by addition of the written blocks. Algorithm 10 will serve us as a building block for more advanced matrix algorithms developed in the following sections. therefore H = (n2 =p2=3 ). Any BSPRAM(p. this is not so for commutative
. j. (i ).4 Fast matrix multiplication
In Section 4. one of the three projections of the set must contain at least n2 =p2=3 nodes. k]]. (4.3 we considered the problem of matrix multiplication over a semiring.
4.16) for a particular triple i. k]] in the main memory. k. (iii ) S = (1). Since there are at least n3 =p nodes in the set. (iii ) Trivial. As mentioned before. g. k]] by (4.12).13). Then it computes the product V i. the standard (n3 ) sequential algorithm is optimal for a general semiring. Matrix multiplication . The local computation. the computation proceeds in one superstep. k]] = A i. Cost analysis. However. Description. j. In the input phase. (ii ) Since n3 nodes are computed by p processors.48
CHAPTER 4. Input: n n matrices A. We now prove that Algorithm 10 is an optimal parallel realisation of the standard matrix multiplication algorithm. B over a semiring. g.15).
W = O(n3 =p)
H = O(n2 =p2=3 )
S = O(1)
The algorithm is oblivious. the processor reads the blocks A i. Each processor performs the computation (4. (ii ) H = (n2 =p2=3 ). The computation is performed on a CRCW BSPRAM(p. j ]] B j. k]] is then written to C i. there is a processor that computes at least n3 =p nodes. communication and synchronisation costs are
Algorithm 10. with slackness and granularity = = n2 =p2=3 . DENSE MATRIX COMPUTATION IN BSP
After the initialisation step (4. The following theorem was suggested by Pat93]. Proof. Output: n n matrix C = A B . The block V i. Parameter: integer n p1=3 . j. We apply the discrete Loomis{Whitney inequality (see Appendix A) to this set of nodes. j ]] and B j. Theorem 7. l). The resulting matrix C is the product of A and B .

rings with unit, which allow "fast" matrix multiplication algorithms. The first such algorithm was proposed by Strassen in his groundbreaking paper [Str69]. Since then, much work has been done on the complexity of matrix multiplication over a commutative ring with unit. However, no lower bound asymptotically better than the trivial Ω(n²) has been found, nor is there any indication that the current O(n^{2.376}) algorithm from [CW90] is close to optimal.

The natural computational model for matrix multiplication over a commutative ring with unit is the model of arithmetic circuits. It is not difficult to see (see e.g. [HK71]) that without loss of generality, the model for matrix multiplication can be restricted to a special class of circuits, called bilinear. Let A, B, C be N × N matrices over a commutative ring with unit. A bilinear circuit for the matrix product A·B = C computes a family of bilinear forms

    C[i,k] = Σ_{r=1}^{R} γ^{(r)}_{ik} · ( Σ_{i,j=1}^{N} α^{(r)}_{ij}·A[i,j] ) · ( Σ_{j,k=1}^{N} β^{(r)}_{jk}·B[j,k] )        (4.17)

for 1 ≤ i, k ≤ N, where α^{(r)}_{ij}, β^{(r)}_{jk}, γ^{(r)}_{ik} are constant elements of the ring. We assume that all R terms in (4.17) are nontrivial, i.e. for each r, 1 ≤ r ≤ R, there are some α^{(r)}_{ij} ≠ 0, β^{(r)}_{jk} ≠ 0 and γ^{(r)}_{ik} ≠ 0. The number R is called the multiplicative complexity of the bilinear circuit.

We represent the bilinear circuit (4.17) by a dag that we call a bilinear dag. Each of the terms in (4.17) is represented by a node v^{(r)}, 1 ≤ r ≤ R. Following (4.17), we need to set

    x^{(r)} ← 0,  y^{(r)} ← 0    for r = 1, …, R;        C[i,k] ← 0    for i, k = 1, …, N        (4.18)

and then compute

    x^{(r)} ← x^{(r)} + α^{(r)}_{ij}·A[i,j]    for all i, j = 1, …, N
    y^{(r)} ← y^{(r)} + β^{(r)}_{jk}·B[j,k]    for all j, k = 1, …, N
    v^{(r)} ← x^{(r)}·y^{(r)}
    C[i,k] ← C[i,k] + γ^{(r)}_{ik}·v^{(r)}    for all i, k = 1, …, N        (4.19)

for all r, 1 ≤ r ≤ R. Computation (4.19) for different values of r is independent (although it requires concurrent reading from A[i,j] and B[j,k], and concurrent writing to C[i,k]), therefore it can be performed in parallel. Computation of the node v^{(r)} requires the input of A[i,j] for i, j such that α^{(r)}_{ij} ≠ 0, and of B[j,k] for j, k such that β^{(r)}_{jk} ≠ 0, as well as the output to C[i,k] for i, k such that γ^{(r)}_{ik} ≠ 0. The standard matrix multiplication method (4.11) has multiplicative complexity R = N³.
j ]] for i. The recursion tree is computed in breadthrst order.
1 C A
(4. C is divided into regular square submatrices of size n0 n0 . B . : : : . N Y (r) Y (r) + bjk V (r) X (r) Y (r) C i. until the current matrix size is reduced to
. Paper HK71] shows that for any arithmetic circuit solving the matrix multiplication problem over a commutative ring with unit in r nonscalar multiplications or divisions. j = 1. : : : .. Any bilinear circuit for multiplying matrices of size N N can be applied to matrices of size n N . .50
CHAPTER 4. N ik for all r.
A 1.21)
X (r) 0. The described data distribution allows one to compute the linear forms in (4.11) has multiplicative complexity R = N 3 . so that the distributions of each of the above submatrices are even and identical. k]] + c(r) V (r) for i.. k = 1. Examples of a suitable distribution of A. GvdG96. The algorithm uses a data distribution that allows one to compute the linear forms in (4.
0 A 1. : : : . GHSJ96]). C . The matrices A. 1]] A = B .. The computation (4. k = 1.20)
and similarly for B . there exists a bilinear circuit solving this problem with multiplicative complexity at most 2r.22) for j. k]] (4. The matrices are distributed across the processors. or any block-cyclic distribution with square blocks of size at most n0 =p1=2 . The procedure is applied recursively to compute the block product V (r) = X (r) Y (r) . C are divided into regular square blocks of size n=N . C are the cyclic distribution. N ]] A N. where ! = logN R. B . k]]
and then
0 for i. k = 1..
A BSP version of the algorithm was proposed in McC96b] (see also KHSJ95.22) without communication.19) can be expressed in terms of blocks as
C i. k]] C i.18). : : : . 1]]
. N
(4. (4. N ij (r) B j. B .22) in a constant number of supersteps. N ]]
. where n0 = n=p1=! . @
A N.. The rst non-standard bilinear circuit for matrix multiplication with N = 2 and R = 7 was proposed in Str69]. 1 r R. The resulting algorithm has sequential complexity (n! ). Y (r) 0 X (r) X (r) + a(r)A i. Each of the matrices A. DENSE MATRIX COMPUTATION IN BSP
A bilinear circuit based on the standard de nition of matrix product (4.

g. at which the data are redistributed among the processors. the recursion tree is evaluated in breadth.21). At
W = O(n! =p)
H = O n2 =p2!
1
S = O(1)
The algorithm is oblivious. generate R multiplication subproblems by executing the rst three lines of (4. The computation is performed on a CRCW BSPRAM(p. In each level of recursion. and then the computation (4. The data distribution ensures that the linear steps can be performed without communication. Parameter: integer n p1=! .rst order. so that each subproblem can be solved sequentially. the redistribution occurs only twice | rst on the down-sweep. Compute the result by executing the nal line of (4. the data are implicitly redistributed when the matrix size m passes the threshold n0 .22) in parallel for all r. FAST MATRIX MULTIPLICATION
51
this point the data are redistributed. Value n0 is the threshold.20). Fast matrix multiplication . The values for W = W (n). Solve the multiplication subproblems in parallel by R simultaneous recursive calls. and is de ned by recursion on the size of the matrix. and p independent matrix multiplication subproblems are generated. l).
. Large blocks. Therefore. We denote the matrix size at the current level of recursion by m.4. Let n0 = n=p1=! . If 1 m n0 . keeping n for the original size. At this stage. The algorithm is as follows: Algorithm 11.4. If n0 < m n. then on the up-sweep of the recursion tree. Input: n n matrices A. H = H (n).22) in parallel for all r. Small blocks. Output: n n matrix C = A B . In the above description. Description.22) by the following schedule. B over a commutative ring with unit. compute (4. Cost analysis. with slackness = n2 =p2! 1 and granularity = n2 =p2! 1 +1 . the matrix is divided into N 2 regular square blocks of size m=N as shown in (4.22) sequentially on the processor where the data are held. We perform the initialisation (4. S = S (n) can be found from the following recurrence relations: n0 < m n m = n0 W (m) = R W (m=N ) O(n! =p) 0 H (m) = R H (m=N ) O(n2 =p) 0 S (m) = S (m=N ) O(1) as
n0.
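The recursive use of a basic bilinear circuit can be illustrated by a short sequential sketch of the N = 2, R = 7 circuit of [Str69], applied to matrices whose size is a power of two. This is only an illustration of the recursion pattern behind Algorithm 11: the threshold n0, the BSPRAM data distribution, and the parallel execution of the R subproblems are not modelled, and all helper names are ours.

```python
# Sequential sketch of the recursive bilinear scheme with N = 2, R = 7
# (Strassen's circuit): the seven subproducts V^(r) = X^(r) * Y^(r) are
# formed from +/-1 linear combinations of the four blocks, as in (4.21)-(4.22).

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    m = len(A) // 2
    return ([r[:m] for r in A[:m]], [r[m:] for r in A[:m]],
            [r[:m] for r in A[m:]], [r[m:] for r in A[m:]])

def strassen(A, B):
    n = len(A)
    if n == 1:                       # base of the recursion
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # the seven bilinear terms of the circuit
    V1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    V2 = strassen(mat_add(A21, A22), B11)
    V3 = strassen(A11, mat_sub(B12, B22))
    V4 = strassen(A22, mat_sub(B21, B11))
    V5 = strassen(mat_add(A11, A12), B22)
    V6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    V7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    # the linear forms producing the output blocks
    C11 = mat_add(mat_sub(mat_add(V1, V4), V5), V7)
    C12 = mat_add(V3, V5)
    C21 = mat_add(V2, V4)
    C22 = mat_add(mat_sub(mat_add(V1, V3), V2), V6)
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

The recursion makes R = 7 calls on blocks of half the size, giving the Θ(n^{log 7}) sequential complexity quoted above.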
We now prove that Algorithm 11 is an optimal parallel realisation of the fast matrix multiplication algorithm. In contrast with standard matrix multiplication, we do not have any single dag underlying the computation. A statement that any BSP implementation of a bilinear circuit with parameter ω would require H = Ω(n²/p^{2/ω}) is obviously invalid: such a circuit might be obtained from a circuit with a lower exponent ω₁ < ω, e.g. by introducing "spurious" terms in (4.17) (duplicating other terms), and the resulting circuit could be computed in H = O(n²/p^{2/ω₁}) = o(n²/p^{2/ω}). Thus, a lower bound on the communication cost of a general bilinear matrix multiplication circuit is closely related to a lower bound on its computation cost, which is beyond the reach of current complexity theory. Therefore, we restrict our analysis to circuits obtained by recursive application of a fixed basic circuit (4.17). The analysis can be easily extended to the case where different levels of recursion are defined by different basic circuits of the form (4.17), but all such circuits must have a fixed maximum size N and a fixed minimum multiplicative complexity R (and therefore a fixed minimum exponent ω).

The dag for recursive multiplication of matrices of size n based on the circuit (4.17) consists of log n / log N levels. Each level is formed from disjoint copies of the bilinear dag corresponding to the basic circuit. The outermost level of the dag consists of (n/N)² disjoint copies of the basic dag. The rest of the dag consists of R disjoint copies of the fast matrix multiplication dag of size n/N.

Theorem 8. Any BSPRAM(p, g, l) computation of the recursive matrix multiplication dag based on (4.17) requires

    (i) W = Ω(n^ω/p);    (ii) H = Ω(n²/p^{2/ω});    (iii) S = Ω(1).

Proof. (i) Trivial. (ii) Induction on p and n. Induction base (p = 1, arbitrary n): trivial. Inductive step (p → R·p, n → N·n): the dag of size N·n consists of the outermost level, formed by n² disjoint copies of the basic dag, and of R disjoint copies of the dag of size n. For each of these copies, the induction hypothesis provides a lower bound on the communication, and combining the bounds over the copies gives the required bound for the dag of size N·n. (iii) Trivial.

Algorithm 11 performs input/output at level 0, and data redistribution at level log p / log R. The outermost levels are computed by Algorithm 11 without communication; therefore, the communication cost of Algorithm 11 is optimal.

It should be noted that Theorem 8 is not a replacement for the optimality proof of Algorithm 10 (Theorem 7). For standard matrix multiplication (a basic dag with N = 2, R = 8), the statement of Theorem 7 is more general than Theorem 8 applied to the dag based on standard 2 × 2 matrix multiplication. This is because the computation of the standard matrix multiplication dag (Figure 4.6) does not have to follow the recursive pattern of fast matrix multiplication.
4.5 Gaussian elimination without pivoting

This section and the next describe a BSPRAM approach to Gaussian elimination, a method primarily used for direct solution of linear systems of equations. More generally, Gaussian elimination and its variations are applied to a broad spectrum of numerical, symbolic and combinatorial problems. Gaussian elimination can be described in many ways. In this section we consider it as LU decomposition of a real matrix; the approach can be easily adapted to other forms of Gaussian elimination, such as QR decomposition by Givens rotations. Chapter 5 presents another form of Gaussian elimination without pivoting, used for algebraic path computation and matrix inversion.

Given a nonsingular matrix A, the algorithm produces the LU decomposition A = L · U, where L is an n × n unit lower triangular matrix, and U is an n × n upper triangular matrix. This decomposition can be computed in sequential time O(n³) by plain Gaussian elimination, or in time O(n^ω) by block Gaussian elimination using fast matrix multiplication.

In this section we consider the simplest form of Gaussian elimination, which does not involve the search for pivots. This basic form of elimination is not guaranteed to produce a correct result, or to terminate at all, when performed on arbitrary matrices. However, it works well for matrices over some particular domains, such as closed semirings, or for matrices of some particular types, such as symmetric positive definite matrices over real numbers. We will consider pivoting in Sections 4.6 and 4.7.

The parallel complexity of Gaussian elimination has been extensively studied in many models of parallel computation. Paper [McC95] proposes to reduce the problem to the computation of a cube dag cube³(n). The reduction is similar to the one described in Section 4.2 for triangular system solution. The BSP cost of the resulting computation is W = O(n³/p), H = O(n²/p^{1/2}), S = O(p^{1/2}) (see Section 3.3).

A lower communication cost for LU decomposition can be achieved by an alternative algorithm, based on recursive block Gauss–Jordan elimination (see e.g. [GPS90, DHS95]). This standard method was suggested as a means of reducing the communication cost in [ACS90] (for the transitive closure problem). Let A be an n × n real diagonally dominant or symmetric positive definite matrix. The algorithm computes the LU decomposition A = L · U, together with the inverse matrices L⁻¹ and U⁻¹. The algorithm works by dividing the matrices A, L, U into square blocks of size n/2:

    A = L · U:    ( A11 A12 )   ( L11  .  )   ( U11 U12 )
                  ( A21 A22 ) = ( L21 L22 ) · (  .  U22 )    (4.23)

where the dot indicates a zero block. First we find the LU decomposition of the block A11 = L11 · U11, along with the inverse blocks L11⁻¹ and U11⁻¹, by applying the algorithm recursively. Then we compute the blocks L21, U12:

    L21 ← A21 · U11⁻¹,    U12 ← L11⁻¹ · A12    (4.24)

A second recursive application of the algorithm yields the LU decomposition A22 − L21 · U12 = L22 · U22, and the inverse blocks L22⁻¹, U22⁻¹. We complete the computation by taking

    L⁻¹ = ( L11⁻¹                 .      )    U⁻¹ = ( U11⁻¹  −U11⁻¹ · U12 · U22⁻¹ )
          ( −L22⁻¹ · L21 · L11⁻¹  L22⁻¹ )          (   .            U22⁻¹        )    (4.25)

We now describe the allocation of block LU decomposition tasks and block multiplication tasks in (4.24)–(4.25) to the BSPRAM processors. Initially, all p processors are available to compute the LU decomposition. There is no substantial parallelism between block decomposition and block multiplication tasks in (4.24)–(4.25); therefore, the recursion tree has to be computed in depth-first order, and we can only exploit the parallelism within block multiplication. Every block multiplication in (4.24)–(4.25) is performed in parallel by all processors available at that level. Each block LU decomposition is also performed in parallel by all processors available at that level, if the block size is large enough. When blocks become sufficiently small, block LU decomposition is computed sequentially by an arbitrarily chosen processor. The depth at which the algorithm switches from p-processor to single-processor computation can be varied. This variation allows us to trade off the costs of communication and synchronisation in a certain range. In order to account for this tradeoff, we introduce a real parameter α, controlling the depth of parallel recursion. The algorithm is as follows.

Algorithm 12. Gaussian elimination without pivoting.

Parameters: integer n ≥ p; real number α, α_min = 1/2 ≤ α ≤ 2/3 = α_max.

Input: n × n real matrix A; we assume that A is diagonally dominant or symmetric positive definite.

Output: decomposition A = L · U, where L is an n × n unit lower triangular matrix, and U is an n × n upper triangular matrix.

Description. The computation is performed on a CRCW BSPRAM(p, g, l). Let n0 = n/p^α. Value n0 is the threshold, at which the algorithm switches from parallel to sequential computation. Denote the matrix size at the current level of recursion by m, keeping n for the original size. The computation is defined by recursion on the size of the matrix. In each level of recursion, the matrix is divided into regular square blocks of size m/2, as shown in (4.23). Computation (4.24)–(4.25) is performed by the following schedule.

Large blocks. If n0 < m ≤ n, compute L11, U11, L11⁻¹, U11⁻¹ by recursion. Then compute L21, U12 and A22 − L21 · U12 by Algorithm 10. After that, compute L22, U22, L22⁻¹, U22⁻¹ by recursion. Finally, compute −L22⁻¹ · L21 · L11⁻¹ and −U11⁻¹ · U12 · U22⁻¹ by Algorithm 10. The result is the decomposition (4.23). Each of these computations is performed with all available processors.

Small blocks. If 1 ≤ m ≤ n0, choose an arbitrary processor from all currently available, and compute (4.24)–(4.25) sequentially on that processor.

Cost analysis. The values for W = W(n), H = H(n), S = S(n) can be found from the following recurrence relations:

    for n0 < m ≤ n:    W(m) = 2 · W(m/2) + O(m³/p),    H(m) = 2 · H(m/2) + O(m²/p^{2/3}),    S(m) = 2 · S(m/2) + O(1)
    for m = n0:        W(n0) = O(n0³),                  H(n0) = O(n0²),                       S(n0) = O(1)

as

    W = O(n³/p)    H = O(n²/p^α)    S = O(p^α)

The algorithm is oblivious, with slackness and granularity σ = γ = n²/p^{2/3}.

For α = α_max = 2/3, the cost of Algorithm 12 is W = O(n³/p), H = O(n²/p^{2/3}), S = O(p^{2/3}). In this case, the communication cost is as low as in matrix multiplication (Algorithm 10). This improvement in communication efficiency is offset by a reduction in synchronisation efficiency. For α = α_min = 1/2, the cost of Algorithm 12 is W = O(n³/p), H = O(n²/p^{1/2}), S = O(p^{1/2}). This is asymptotically equal to the BSP cost of the cube dag method from [McC95].

For large n, the communication cost of Algorithm 12 dominates the synchronisation cost, and therefore the communication improvement should outweigh the loss of synchronisation efficiency. This justifies the use of Algorithm 12 with α = α_max = 2/3. Smaller values of α, or the cube dag method, should be considered when the problem is moderately sized.

As in triangular system solution (Section 4.2), Gaussian elimination without pivoting cannot be defined as a computation of any particular dag: ordinary elimination and block recursive elimination produce different dags. A lower bound on synchronisation cost can be proved for a general class of dags. A proof identical to the proof for the cube dag (Theorem 5, part (iii)) gives the bound S = Ω(p^{1/2}), provided that W = O(n³/p).

A lower bound on the communication cost can be obtained by the standard method of reducing the matrix multiplication problem to LU decomposition. For any n × n real matrices A, B, the product A · B can be computed by LU decomposition of a 2n × 2n matrix:

    ( I B )   ( I . )   ( I     B     )
    ( A I ) = ( A I ) · ( . I − A · B )

Therefore, the lower bound H = Ω(n²/p^{2/3}) for standard matrix multiplication (Theorem 7) holds also for standard Gaussian elimination without pivoting, which must be appropriately defined to exclude Strassen-type methods.

Fast matrix multiplication can be used instead of standard matrix multiplication for computing block products. The modified algorithm is as follows.

Algorithm 13. Fast Gaussian elimination without pivoting.

Parameters: integer n ≥ p^{3/ω}; real number α, α_min = 1/(ω−1) ≤ α ≤ 2/ω = α_max.

Input: n × n real matrix A; as before, we assume that A is diagonally dominant or symmetric positive definite.

Output: decomposition A = L · U, where L is an n × n unit lower triangular matrix, and U is an n × n upper triangular matrix.

Description. The computation is identical to Algorithm 12, except that block multiplication is performed by Algorithm 11, rather than Algorithm 10.

Cost analysis. The values for W = W(n), H = H(n), S = S(n) can be found from the following recurrence relations:

    for n0 < m ≤ n:    W(m) = 2 · W(m/2) + O(m^ω/p),    H(m) = 2 · H(m/2) + O(m²/p^{2/ω}),    S(m) = 2 · S(m/2) + O(1)
    for m = n0:        W(n0) = O(n0^ω),                  H(n0) = O(n0²),                        S(n0) = O(1)

as

    W = O(n^ω/p)    H = O(n²/p^α)    S = O(p^α)

The algorithm is oblivious, with slackness and granularity σ = γ = n²/p^{2/ω}. As ω approaches the value of 2, the range of the parameter α becomes tighter. If an O(n²) matrix multiplication algorithm is eventually discovered, the tradeoff between H and S will disappear.
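The recursive block Gauss–Jordan scheme (4.23)–(4.25) can be sketched sequentially as follows. This is only an illustration of the recursion, not Algorithm 12 itself: exact rational arithmetic is used instead of floating point, the matrix size is assumed to be a power of two, the BSPRAM task allocation and the threshold n0 are not modelled, and A is assumed to admit the decomposition without pivoting (e.g. diagonally dominant), so no zero pivot arises.

```python
# Sketch of recursive block LU via (4.23)-(4.25): returns L, U together with
# L^-1 and U^-1, computing block products exactly with Python fractions.
from fractions import Fraction

def mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def neg(A):
    return [[-a for a in r] for r in A]

def block_lu(A):
    """Return (L, U, Linv, Uinv) with A = L*U, L unit lower, U upper triangular."""
    n = len(A)
    if n == 1:
        a = Fraction(A[0][0])
        return [[Fraction(1)]], [[a]], [[Fraction(1)]], [[1 / a]]
    m = n // 2
    A11 = [r[:m] for r in A[:m]]; A12 = [r[m:] for r in A[:m]]
    A21 = [r[:m] for r in A[m:]]; A22 = [r[m:] for r in A[m:]]
    # first recursive call: A11 = L11 * U11, with inverse blocks
    L11, U11, L11i, U11i = block_lu(A11)
    L21 = mul(A21, U11i)                 # (4.24)
    U12 = mul(L11i, A12)
    # second recursive call, on the Schur complement A22 - L21 * U12
    L22, U22, L22i, U22i = block_lu(sub(A22, mul(L21, U12)))
    # assemble L, U and their inverses, cf. (4.25)
    X = neg(mul(L22i, mul(L21, L11i)))   # bottom-left block of L^-1
    Y = neg(mul(U11i, mul(U12, U22i)))   # top-right block of U^-1
    Z = [[Fraction(0)] * m for _ in range(m)]
    def glue(P, Q, R, S):
        return [p + q for p, q in zip(P, Q)] + [r + s for r, s in zip(R, S)]
    L  = glue(L11, Z, L21, L22)
    U  = glue(U11, U12, Z, U22)
    Li = glue(L11i, Z, X, L22i)
    Ui = glue(U11i, Y, Z, U22i)
    return L, U, Li, Ui
```

The two recursive calls are sequentially dependent through the Schur complement, which is why the text notes that only the parallelism within the block multiplications can be exploited.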
4.6 Nested block pivoting and Givens rotations

In this section we extend the results of the previous section, obtaining an efficient BSP algorithm for certain forms of Gaussian elimination with pivoting. Plain Gaussian elimination without pivoting may fail to find the LU decomposition of A, because some diagonal elements may be zero initially, or become zero during elimination. Block Gaussian elimination without pivoting may fail for a similar reason, if some diagonal blocks are singular, or become singular during elimination.

Let A be an n × n matrix over a finite field. A particular feature of computation over a finite field is that any nonzero element, or, more generally, any nonsingular block, can serve as a pivot. As we show below, this allows us to use a restricted version of pivoting, proposed in [Sch73] (see also [BCS97, section 16.5], [Mod88, Ort88]). We call this technique nested block pivoting: its approach is to find a sequence of nested nonsingular blocks, or, if the original matrix is singular, a sequence of nested blocks of maximum possible rank. We assume that the inverse of a nonzero field element can be computed in time O(1). A similar approach applies to computation of the QR decomposition of a real matrix by Givens rotations.

Another pivoting method suitable for block triangular decomposition has been proposed in [BH74]. Since the approach of [BH74] requires a search for the pivot along a matrix row, its BSP cost is higher than the cost of nested block pivoting.

To describe nested block pivoting, we consider Gaussian elimination on rectangular matrices of a special form. Let the input be a 2n × n matrix over a finite field, consisting of two stacked n × n blocks A and V. Here A is an arbitrary n × n matrix, and V an upper triangular n × n matrix. The problem consists in finding a full-rank 2n × 2n transformation matrix ( D E ; F G ), and an n × n upper triangular matrix U, such that

    ( D E )   ( A )   ( U )
    ( F G ) · ( V ) = ( . )    (4.26)

This problem is closely related to the problem of transforming the matrix ( A ; V ) to row echelon form (see [BCS97]). Alternatively, the problem can be solved by the cube dag method, using the standard elimination scheme (see e.g. [Mod88, Ort88]).

Matrices A, V, U are partitioned into regular square blocks of size n/2, and matrices D, E, F, G are partitioned conformally. In block form, (4.26) reads

    ( D11 D12 E11 E12 )   ( A11 A12 )   ( U11 U12 )
    ( D21 D22 E21 E22 )   ( A21 A22 )   (  .  U22 )
    ( F11 F12 G11 G12 ) · ( V11 V12 ) = (  .   .  )    (4.27)
    ( F21 F22 G21 G22 )   (  .  V22 )   (  .   .  )

We can compute the decomposition (4.26)–(4.27) by a recursive procedure. This procedure differs from the one described in Section 4.5 in that we cannot compute block inverses, since the blocks may not have full rank.
First the algorithm is applied recursively to the n × (n/2) matrix ( A21 ; V11 ). Taking the decomposition

    ( P11 P12 )   ( A21 )   ( Y11 )
    ( P21 P22 ) · ( V11 ) = (  .  )

we obtain

    ( I   .   .  . )   ( A11 A12 )   ( A11 A12 )
    ( . P11 P12  . )   ( A21 A22 )   ( Y11 Y12 )
    ( . P21 P22  . ) · ( V11 V12 ) = (  .  Y22 )    (4.28)
    ( .   .   .  I )   (  .  V22 )   (  .  V22 )

In the next stage, we apply the algorithm recursively to the matrices ( A11 ; Y11 ) and ( Y22 ; V22 ), obtaining

    ( Q11 Q12  .   .  )   ( A11 A12 )   ( U11 U12 )
    ( Q21 Q22  .   .  )   ( Y11 Y12 )   (  .  Z12 )
    (  .   .  Q33 Q34 ) · (  .  Y22 ) = (  .  Z22 )    (4.29)
    (  .   .  Q43 Q44 )   (  .  V22 )   (  .   .  )

Finally, we apply the algorithm recursively to the matrix ( Z12 ; Z22 ), obtaining

    ( I   .   .  . )   ( U11 U12 )   ( U11 U12 )
    ( . R11 R12  . )   (  .  Z12 )   (  .  U22 )
    ( . R21 R22  . ) · (  .  Z22 ) = (  .   .  )    (4.30)
    ( .   .   .  I )   (  .   .  )   (  .   .  )

The matrix ( D E ; F G ) can now be computed as the product of the three transformation matrices from (4.28)–(4.30).

The base of the recursion is the elimination on a 2 × 1 matrix. If both matrix elements are zero, the decomposition is trivial. Otherwise, an arbitrary nonzero element is used as a pivot. If the other element is also nonzero, it is eliminated by subtracting the pivot multiplied by an appropriate scaling factor.

However, the recursive procedure (4.28)–(4.30) is not efficient when implemented in BSP, due to a large number of recursive calls. As opposed to Gaussian elimination without pivoting, we can improve the efficiency by changing the order in which low-level block operations are executed. We represent the block elimination order defined by (4.28)–(4.30) schematically as

    2 3
    1 2

Here 1 denotes the elimination of V11, 2 denotes the elimination of Y11 and V22, and 3 denotes the elimination of Z22. For 4 × 4 matrices, the elimination scheme of the recursive algorithm is

    5 4 2 1
    6 5 3 2
    8 7 5 4
    9 8 6 5

In this schematic representation, vertical bars represent block multiplication; the length of a vertical bar corresponds to the size of the matrices being multiplied. Note that the elimination 4 in the bottom row has to be computed after the matrix multiplication immediately on the left, and therefore after elimination 3. In general, a block elimination can be performed after the block immediately below has been eliminated, and the block multiplication immediately to the left has been performed. A block multiplication can be performed after all blocks immediately to the left have been eliminated. However, we can eliminate any block as soon as the result of the multiplication immediately on its left is available. For 4 × 4 matrices, elimination 4 in the left column can then be computed immediately after elimination 2, in parallel with elimination 3. The optimised elimination scheme is

    4 3 2 1
    5 4 3 2
    7 6 5 4
    8 7 6 5

Similarly, the original, recursive elimination scheme for 8 × 8 matrices is

    14 13 11 10  5  4  2  1
    15 14 12 11  6  5  3  2
    17 16 14 13  8  7  5  4
    18 17 15 14  9  8  6  5
    23 22 20 19 14 13 11 10
    24 23 21 20 15 14 12 11
    26 25 23 22 17 16 14 13
    27 26 24 23 18 17 15 14
and the optimised scheme is

     8  7  6  5  4  3  2  1
     9  8  7  6  5  4  3  2
    11 10  9  8  7  6  5  4
    12 11 10  9  8  7  6  5
    16 15 14 13 12 11 10  9
    17 16 15 14 13 12 11 10
    19 18 17 16 15 14 13 12
    20 19 18 17 16 15 14 13

This method can be generalised to arbitrary matrix sizes. Decomposition of a 2n × n matrix can be computed by an r × r optimised elimination scheme, where each entry corresponds to a block of size n0 = n/r. We call such blocks elementary. The resulting algorithm is similar to Algorithm 12, in that a real parameter α controls the depth at which block decomposition is performed sequentially. As before, variation of α results in a tradeoff between the communication and synchronisation costs.

Algorithm 14. Gaussian elimination with nested block pivoting.

Parameters: integer n ≥ p^{1+δ} for some constant δ > 0; real number α, α_min = 1/2 + (log log p)/(2 log p) ≤ α ≤ 2/3 + (log log p)/(log p) = α_max.

Input: n × n matrix A over a finite field.

Output: decomposition D · A = U, where D is a full-rank n × n matrix, and U is an n × n upper triangular matrix.

Description. The computation is performed on a CRCW BSPRAM(p, g, l). We use an optimised elimination scheme of size r = p^α, where each entry corresponds to an elementary block of size n0 = n/r. We apply the optimised block elimination procedure to the matrix A.

The computation alternates between decomposition of elementary blocks and parallel multiplication of the resulting matrices. Each of these stages is implemented by a superstep. In each decomposition stage, at most one entry from every column of the elimination scheme is computed. In each matrix multiplication stage, at most one matrix multiplication from every column is performed. Elimination within an elementary block is performed sequentially by a single BSPRAM processor; since r = p^α ≤ p, all elementary block decompositions occurring in the same superstep can be performed in parallel. Matrix multiplication is performed by Algorithm 10. For any k, we allocate p/k processors to the multiplication of matrices of size n/k.

Cost analysis. We decompose the matrix using an optimised elimination scheme of size r = p^α. The value for S = S(n) can be found from the following recurrence relation:

    for n0 < m ≤ n:    S(m) = 2 · S(m/2) + O(log p)
    for m = n0:        S(n0) = O(log p)

as S = O(p^α log p).

Decomposition of elementary blocks is computed sequentially. The total cost of elementary block decompositions is

    W0 = S0 · O(n0³) = O(n³ log p / p^{2α})
    H0 = S0 · O(n0²) = O(n² log p / p^α)
    S0 = O(p^α log p)

The total computation and communication cost of all matrix multiplications is dominated by the cost of the largest matrix multiplication. Let n/K be the size of the largest matrix product occurring in a particular matrix multiplication superstep. In such a superstep, we compute at most one matrix product of size n/K, at most two matrix products of size n/(2K), at most four matrix products of size n/(4K), etc.; in general, for any k, at most k matrix products of size n/(kK), each computed on p/(kK) processors. The computation cost of this superstep is at most

    O( (n/K)³/(p/K) ⊔ (n/(2K))³/(p/(2K)) ⊔ (n/(4K))³/(p/(4K)) ⊔ ··· ) = O( n³/(K² p) )

and the communication cost is at most

    O( (n/K)²/(p/K)^{2/3} ⊔ (n/(2K))²/(p/(2K))^{2/3} ⊔ (n/(4K))²/(p/(4K))^{2/3} ⊔ ··· ) = O( n²/(K^{4/3} p^{2/3}) )

where ⊔ denotes the maximum operator. There are at most two supersteps with K = 1, at most four supersteps with K = 2, and in general at most 2K supersteps for any particular K. Therefore, the total cost of matrix multiplications is

    W1 = O( n³/p + 2 · n³/(2² p) + 4 · n³/(4² p) + ··· ) = O(n³/p)
    H1 = O( n²/p^{2/3} + 2 · n²/(2^{4/3} p^{2/3}) + 4 · n²/(4^{4/3} p^{2/3}) + ··· ) = O(n²/p^{2/3})
    S1 = O(p^α log p)
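The closed forms for W1 and H1 follow by summing the per-superstep bounds over K = 1, 2, 4, ...; a brief derivation:

```latex
W_1 = O\Bigl(\sum_{K = 1, 2, 4, \ldots} 2K \cdot \frac{n^3}{K^2 p}\Bigr)
    = O\Bigl(\frac{n^3}{p} \sum_{K} \frac{2}{K}\Bigr) = O(n^3/p),
\qquad
H_1 = O\Bigl(\sum_{K = 1, 2, 4, \ldots} 2K \cdot \frac{n^2}{K^{4/3} p^{2/3}}\Bigr)
    = O\Bigl(\frac{n^2}{p^{2/3}} \sum_{K} \frac{2}{K^{1/3}}\Bigr) = O(n^2/p^{2/3}).
```

Both sums run over powers of two and converge, being geometric with ratios 1/2 and 2^{−1/3} respectively.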
Since W0 = O(W1) and H1 = O(H0), the total BSP cost of Algorithm 14 is

    W = O(n³/p)    H = O(n² log p / p^α)    S = O(p^α log p)

The algorithm is oblivious, with slackness and granularity σ = γ = n²/p^{2/3}.

For α = α_max, the cost of Algorithm 14 is W = O(n³/p), H = O(n²/p^{2/3}), S = O(p^{2/3} (log p)²). In this case the communication cost is as low as in matrix multiplication (Algorithm 10) and Gaussian elimination without pivoting (Algorithm 12). This improvement in communication efficiency is offset by a reduction in synchronisation efficiency. For α = α_min, the cost of Algorithm 14 is W = O(n³/p), H = O(n² (log p)^{1/2}/p^{1/2}), S = O(p^{1/2} (log p)^{3/2}). This is slightly higher than the BSP cost of the cube dag method. Considerations similar to the ones discussed in Section 4.5 apply to the choice of a particular value of α.

Since the cube dag method is slightly better than Algorithm 14 for α = α_min, one might expect that a lower BSP cost could be achieved by a hybrid algorithm, performing elimination by an optimised scheme at the higher level, and decomposing elementary blocks by the cube dag algorithm at the lower level. This would be straightforward if Algorithm 14 were a purely recursive algorithm, similar to Algorithm 12. However, since Algorithm 14 computes the elimination scheme in an optimised non-recursive order, the problem of finding an efficient hybrid algorithm remains open.

To reduce slightly the cost of local computation and communication, one can use fast matrix multiplication instead of standard matrix multiplication for computing block products. In this case the recursion has to be deeper, and therefore the synchronisation cost will slightly increase.

Algorithm 14 can be used to compute the QR decomposition of a real matrix. For a 2 × 1 matrix ( a ; v ), the decomposition (4.26) is given by the Givens rotation

    (  c s )   ( a )   ( u )
    ( −s c ) · ( v ) = ( . )

where c = a/(a² + v²)^{1/2}, s = v/(a² + v²)^{1/2}, u = (a² + v²)^{1/2}. Recursive equations (4.28)–(4.30) and Algorithm 14 are then directly applied to the computation of the QR decomposition of a real matrix by Givens rotations.
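The Givens rotation above can be sketched directly. A minimal numerical version, with `math.hypot` used to compute the norm (a² + v²)^{1/2} in an overflow-safe way:

```python
# Givens rotation for a 2x1 column (a, v): an orthogonal matrix (c, s; -s, c)
# maps (a, v) to (u, 0), the real analogue of the 2x1 base case above.
import math

def givens(a, v):
    """Return (c, s, u) with  c*a + s*v = u  and  -s*a + c*v = 0."""
    u = math.hypot(a, v)
    if u == 0.0:
        return 1.0, 0.0, 0.0     # both entries zero: the rotation is the identity
    return a / u, v / u, u
```

Applying such rotations in the order prescribed by the optimised elimination scheme yields the QR decomposition by Givens rotations.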
4. since Algorithm 14 computes the elimination scheme in an optimised non-recursive order.7 Column pivoting and Householder re ections
In Sections 4. Algorithm 14 can be used to compute the QR decomposition of a real matrix. To reduce slightly the cost of local computation and communication. one might expect that a lower BSP cost may be achieved by a hybrid algorithm.6. However. This improvement in communication e ciency is o set by a reduction in synchronisation e ciency. S = O p2=3 (log p)2 .

Matrix A may not have full rank. r n. Householder re ections. A. if we regard operations on columns (column elimination and matrix-vector multiplication) as elementary operations. which is similar to nested block pivoting. and the sizes of other
blocks conform to the above. U12 are n=2 n=2. which are performed sequentially.31) D21 D22 A21 A22 U22 where blocks D11 . using matrix multiplication to perform these updates (see e. U into rectangular blocks
D11 D12 A11 A12 = U11 U12 (4. An alternative method of QR decomposition.32)
P11 P12 P21 P22
A11 A12 = U11 U12 Y22 A21 A22
. An alternative method is to combine the updates from several column eliminations. It is easy to obtain a BSPRAM algorithm for Gaussian elimination with column pivoting. Taking the decomposition A21
P11 P12 P21 P22
we obtain
A11 = U11 A21
(4. Since an elementary data unit is a column of size r. or even across the whole matrix (full pivoting ). A11 . we can apply either of the two methods described in Section 4. and both elementary operations have sequential time complexity O(r). and column pivoting is stable on the average.7. Here we give a BSPRAM algorithm based on this approach. or recursive substitution. the local computation and communication costs of the resulting algorithm are equal to O(r) times the cost of the original triangular system algorithm. U11 . is similar to column pivoting. such that D A = U . We also considered QR decomposition by Givens rotations. We consider Gaussian elimination with column pivoting on rectangular matrices.2: diamond dag algorithm. The problem consists in nding a full-rank r r transformation matrix D. Therefore. only full pivoting guarantees numerical stability. For square matrices (r = n) this yields W = O(n3 =p). COLUMN PIVOTING AND HOUSEHOLDER REFLECTIONS 63
a column (column pivoting ). As before. A12 . H = O(n2 ). A similar recursive procedure for sequential computation has been introduced in Tol97].4. and an r n upper triangular matrix U . the algorithm is applied recursively to matrix A11 . the algorithm can be described recursively. For an arbitrary nonsingular real matrix. in the ScaLAPACK library (see CDO+ 96]). We partition matrices D. e. First. Bre91]). The synchronisation cost S = O(p) remains unchanged.g. Let A be an r n real matrix. In this case the data dependency between elementary tasks is similar to that of triangular system solution.g. Such an approach is often used in parallel numerical software.

U is an r n upper triangular matrix.32){(4. Value n0 is the threshold. which involves the search of the largest column element as a pivot.33) is performed by the following schedule. Gaussian elimination with column pivoting . obtaining
Q22
U11 U12 = U11 U12 Y22 U22
Matrix D can now be computed as the product of the two transformation matrices from (4. Initially. the matrices are divided as shown in (4. we can only exploit the parallelism within block multiplication. Let n0 = n=p. In each level of recursion. Input: r n real matrix A. the recursion tree has to be computed in depth.
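The single-column base case can be sketched as follows. The helper name `eliminate_column` is ours, and the sketch works on a plain Python list: it returns the r × r transformation D (a row swap followed by elementary row operations) with D · a = (pivot, 0, ..., 0)ᵀ.

```python
# Base case of column pivoting, sketched: eliminate a single real column a of
# length r, choosing its largest-magnitude entry as the pivot, swapping it to
# the front, and subtracting scaled copies of the pivot row from the others.

def eliminate_column(a):
    r = len(a)
    k = max(range(r), key=lambda i: abs(a[i]))      # pivot search
    D = [[float(i == j) for j in range(r)] for i in range(r)]
    if a[k] == 0.0:
        return D                                    # zero column: nothing to do
    D[0], D[k] = D[k], D[0]                         # swap the pivot row to the front
    for i in range(1, r):
        entry = a[i] if i != k else a[0]            # column entry after the swap
        s = entry / a[k]                            # scaling factor
        for j in range(r):
            D[i][j] -= s * D[0][j]                  # subtract the scaled pivot row
    return D
```

Choosing the largest entry keeps every scaling factor at most 1 in magnitude, which is the source of the method's average-case stability noted above.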
Algorithm 15. When blocks become su ciently small. In each level of recursion.31). where D is a full-rank r r matrix. all p processors are available to compute the decomposition D A = U . and compute (4. Each block decomposition is also performed in parallel by all processors available at that level.33) is performed in parallel by all processors available at that level. Denote the matrix width at the current level of recursion by m.32){(4. if the block size is large enough. and
.33) to the BSPRAM processors. choose an arbitrary processor from all currently available. and elimination of all other elements by subtracting the pivot multiplied by an appropriate scaling factor. r n. We now describe the allocation of block decomposition tasks and block multiplication tasks in (4.32){(4.
Description.64 where
CHAPTER 4.33) on that processor. every block multiplication in (4. Parameter: integer n p.
Small blocks. block decomposition is computed sequentially by an arbitrarily chosen processor. l). g.33)
Then we apply the algorithm recursively to matrix Y22 . Output: decomposition D A = U . Therefore. at which the algorithm switches from parallel to sequential computation. If 1 m n0 .32){(4. There is no substantial parallelism between block decomposition and block multiplication tasks in (4. keeping n for the original width. The base of recursion is the elimination on a single column. Then.33). DENSE MATRIX COMPUTATION IN BSP
P11 P12 P21 P22 I A12 = U12 A22 Y22
(4.rst order.32){(4. The resulting algorithm is as follows.33). computation (4. The computation is performed on a CRCW BSPRAM(p.32){(4.
and is de ned by recursion on the size of the matrices.

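As an illustration, the recursion (4.32)-(4.33) can be modelled sequentially as follows. This is only a single-processor Python sketch with hypothetical names (eliminate), not the BSPRAM implementation of Algorithm 15; it returns a full-rank D and upper triangular U with D A = U.

```python
import numpy as np

def eliminate(A):
    # Returns (D, U) with D @ A = U upper triangular and D full-rank,
    # following the recursion of (4.32)-(4.33): eliminate the left block
    # column, apply the transformation to the right block column,
    # then recurse on the trailing block Y22.
    A = np.asarray(A, dtype=float)
    r, n = A.shape
    if n == 1:
        # Base case: the largest element of the column is the pivot;
        # swap it to the top, then eliminate the remaining elements.
        p = int(np.argmax(np.abs(A[:, 0])))
        P = np.eye(r)
        P[[0, p]] = P[[p, 0]]
        B = P @ A
        E = np.eye(r)
        if B[0, 0] != 0:
            E[1:, 0] = -B[1:, 0] / B[0, 0]
        D = E @ P
        return D, D @ A
    k = n // 2
    D1, _ = eliminate(A[:, :k])      # recurse on the left block column
    B = D1 @ A                       # apply P to the remaining columns
    D2, _ = eliminate(B[k:, k:])     # recurse on Y22
    D = np.block([[np.eye(k), np.zeros((k, r - k))],
                  [np.zeros((r - k, k)), D2]]) @ D1
    return D, D @ A
```

In the BSP algorithm the two recursive calls are executed depth-first by all available processors, and only the matrix products (Algorithm 10) are parallel.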
Large blocks. If n0 < m ≤ n, computation (4.32)-(4.33) is performed by the following schedule. First, compute U11, U12 by recursion. Then compute Y22 by Algorithm 10. Finally, compute U22 by recursion. Each of these computations is performed with all available processors.

COLUMN PIVOTING AND HOUSEHOLDER REFLECTIONS

Cost analysis. The local computation cost W and communication cost H are dominated by the decomposition of elementary blocks: W = O(p) · O(r n0^2), H = O(p) · O(r n0). The value for S = S(n) can be found from the following recurrence relation:

    S(m) = 2 S(m/2) + O(1)    for n0 < m ≤ n
    S(m) = O(1)               for m = n0

as S = O(p). For square matrices (r = n) we obtain:

    W = O(n^3/p)    H = O(n^2)    S = O(p)

The algorithm is communication-oblivious, with slackness and granularity n^2/p^2. The above analysis shows that the recursive algorithm for column pivoting has the same BSP cost as the column-based triangular system algorithm. Using block multiplication for matrix updates does not improve the asymptotic communication cost of column pivoting; the synchronisation cost S = O(p) remains unchanged. As before, using fast matrix multiplication in column pivoting will reduce slightly the cost of local computation, but the communication cost will remain unchanged; the recursion will have to be deeper, therefore the synchronisation cost will slightly increase.

A straightforward extension of the method is block column pivoting, where pivot search is performed locally in a rectangular block of columns, instead of a single column, thus improving the numerical properties of the algorithm.

Algorithm 15 can also be used to compute the QR decomposition of a real matrix by Householder reflections. In this method, a column u is eliminated by a symmetric orthogonal transformation matrix P = I - 2 v v^T, where v is a unit vector constructed from column u in linear time (see e.g. [Bre91]). In this case the data dependency between elementary tasks is similar to that of triangular system solution. Recursive equations (4.32)-(4.33) and Algorithm 15 are then directly applied to the computation of the QR decomposition of a real matrix by Householder reflections.

Finally, we consider Gaussian elimination with full pivoting. In this method, pivot search is done across the whole matrix on each elimination stage. For real matrices, full pivoting guarantees numerical stability; however, in practice it is used less frequently than column pivoting, due to a higher computation cost. In contrast with previous pivoting methods, there is no known block method for Gaussian elimination with full pivoting. The only viable approach to this problem seems to be fine-grain. Consider Gaussian elimination with full pivoting on an n × n matrix. The matrix is partitioned across the processors by regular square blocks of size n/p^(1/2). Each of the n elimination stages is implemented by O(1) supersteps. The cost of each stage is dominated by the cost of a rank-one matrix update: w = O(n^2/p), h = O(n/p^(1/2)). The total BSP cost of the algorithm is therefore

    W = O(n) · w = O(n^3/p)    H = O(n) · h = O(n^2/p^(1/2))    S = O(n)

Note that the communication cost H is lower than that of Algorithm 15, but the synchronisation cost S is much higher. The described fine-grain algorithm can also be applied to Gaussian elimination with column pivoting, in which case it presents a discontinuous communication-synchronisation tradeoff with Algorithm 15.
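One elimination stage of this fine-grain scheme can be modelled sequentially as follows. This is a sketch with hypothetical names; the BSP algorithm distributes the pivot search and the rank-one update of each stage over the p processors in O(1) supersteps.

```python
import numpy as np

def full_pivot_elimination(A):
    # Reference sequential Gaussian elimination with full pivoting:
    # each of the n stages searches the whole active submatrix for the
    # largest pivot, swaps it into position, and performs a rank-one
    # update of cost O((n - s)^2).  Returns the row and column
    # permutations and the upper triangular result.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    row_perm, col_perm = list(range(n)), list(range(n))
    for s in range(n):
        sub = np.abs(A[s:, s:])
        i, j = divmod(int(np.argmax(sub)), n - s)
        i += s
        j += s
        A[[s, i]] = A[[i, s]]           # row swap
        A[:, [s, j]] = A[:, [j, s]]     # column swap
        row_perm[s], row_perm[i] = row_perm[i], row_perm[s]
        col_perm[s], col_perm[j] = col_perm[j], col_perm[s]
        if A[s, s] != 0:
            m = A[s + 1:, s] / A[s, s]
            A[s + 1:, s:] -= np.outer(m, A[s, s:])   # rank-one update
    return row_perm, col_perm, A
```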

Chapter 5

Graph computation in the BSP model

5.1 Fast Boolean matrix multiplication

Graph computation is a large and particularly well-studied area of combinatorial computation. It has a strong connection to matrix computation: many graph algorithms have matrix analogues, and vice versa, since a graph can be represented by its adjacency matrix. In the simplest case of an unweighted graph, the adjacency matrix is Boolean. In this section, we consider the problem of multiplying two n × n Boolean matrices, using conjunction ∧ and disjunction ∨ as multiplication and addition respectively.

The straightforward method is standard matrix multiplication, of sequential complexity Θ(n^3). There are also subcubic methods, including Kronrod's algorithm (also known as the Four Russians' algorithm, see e.g. [AHU76] or [LD90]), and a recent algorithm from [BKM95]. The lowest known exponent is achieved by fast Strassen-type multiplication, in which the Boolean matrices are viewed as (0, 1)-matrices over the ring of integers modulo n + 1 (see e.g. [AHU76, pages 747-748] or [CLR90, pages 537-538]). As shown in Chapter 4, the fast matrix multiplication algorithm has BSP cost W = O(n^ω/p), H = O(n^2/p^(2/ω)), S = O(1), and these cost values are independently optimal (Theorem 8). It is natural to ask whether the asymptotic communication cost H can be reduced by using properties specific to Boolean matrices. In this section we give a positive answer to this question, describing an algorithm with communication cost H = O(n^2/p). In contrast with the standard and Strassen-type methods, the resulting algorithm is non-oblivious. The proposed algorithm is not practical, since it works only for astronomically large matrices, and involves huge constant factors; its cost is polynomial in n (more precisely, of order n^ω), but exponential in p. However, the method is of theoretical importance, because it indicates that the lower bound H = Ω(n^2/p^(2/ω)), which is easy to prove for a general semiring, cannot be extended to the Boolean case.

We first give an intuitive explanation of the method. Let us consider the standard Θ(n^3) computation of the product A ∧ B = C, where A, B, C are square Boolean matrices of size n. As before, we represent the n^3 computed Boolean products by a cube of volume n^3 in integer three-dimensional space. A matrix with m ones (or m zeros) can be communicated by sending the m indices of the ones (respectively, the zeros). If A contains at most n^2/p ones, the multiplication problem can be solved on a BSPRAM in W = O(n^3/p), H = O(n^2/p), S = O(1), by partitioning the cube into layers of size n × n × n/p, parallel to the coordinate plane representing matrix A. Symmetrically, if B contains at most n^2/p ones, the problem can be solved at the same cost by a similar partitioning into layers of size n/p × n × n, parallel to the coordinate plane representing matrix B. If both A and B are dense, the product C is likely to contain few zeros. Assuming for the moment that A and B are random matrices, we partition the cube into layers of size n × n/p × n, parallel to the coordinate plane representing matrix C; it is likely that the partial product computed by each of the layers contains at most n^2/p zeros, so again the problem can be solved in W = O(n^3/p), H = O(n^2/p), S = O(1).

The remaining case is when A and B have relatively many ones, but C still has relatively many zeros. We argue that in this case the triple A, B, C must have a special structure that allows us to decompose the computation into three matrix products corresponding to the three easy cases above. The main idea is to find the high-level structure of the two matrices and of their product. Computation of this structure requires no prior knowledge of C. As soon as this basic structure is determined, the full structure can be found with little extra communication.

The structure of a Boolean matrix product is best described in the language of graph theory. To expose the symmetry of the problem, we modify it as follows. For a Boolean matrix X, let X̄ denote the Boolean complement of the transpose of X: X̄[i, j] = ¬X[j, i]. We replace C by C̄, and look for a C̄ such that A ∧ B = C. Boolean matrices A, B, C̄ define a tripartite graph, if we consider them as adjacency matrices of its three bipartite connection subgraphs. We will denote this tripartite graph by G = (A, B, C̄). Since A, B, C̄ are square n × n matrices, the graph G is equitripartite, with the size of each partite class equal to n.

A simple undirected graph is called triangle-free if it does not contain a triangle (a cycle of length three). The graph G = (A, B, C̄) is triangle-free: existence of a triangle in G would imply that for some i, j, k we have A[i, j] = B[j, k] = C̄[k, i] = 1; but A[i, j] ∧ B[j, k] = 1 implies C[i, k] = 1, and therefore C̄[k, i] = 0. Note that the property of the graph G to be triangle-free is symmetric: matrix C̄ does not play any special role compared to A and B.
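The index-list representation of sparse Boolean matrices used above can be illustrated by a small sketch (the names pack and unpack are hypothetical): a matrix with m ones is communicated as its m index pairs, and a matrix with m zeros as the index pairs of its complement.

```python
import numpy as np

def pack(X):
    # A Boolean matrix with m ones is sent as its m index pairs;
    # a matrix with few zeros would be packed via its complement ~X.
    shape = X.shape
    ones = [(int(i), int(j)) for i, j in np.argwhere(X)]
    return shape, ones

def unpack(shape, ones):
    # Rebuild the Boolean matrix from the communicated index pairs.
    X = np.zeros(shape, dtype=bool)
    for i, j in ones:
        X[i, j] = True
    return X
```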
Let us call the factors A and B of the product A ∧ B = C maximal, if changing a zero to a one in any position of A or B results in changing some zeros to ones in the product C (and therefore some ones to zeros in C̄). A triangle-free graph is called maximal if the addition of any new edge creates a triangle. When we consider tripartite triangle-free graphs, we call such a graph maximal if we cannot add any new edge so that the resulting graph is still tripartite and triangle-free. Note that by this definition, a maximal tripartite triangle-free graph may not be maximal as a general triangle-free graph. We have the following lemma.

Lemma 2. Let A, B, C be arbitrary n × n Boolean matrices. The following statements are equivalent:
(i) A, B are maximal matrices such that A ∧ B = C;
(ii) A ∧ B = C, B̄ ∧ C̄ = Ā, C̄ ∧ Ā = B̄;
(iii) G = (A, B, C̄) is a maximal equitripartite triangle-free graph.

Proof. Straightforward application of the definitions.

Lemma 2 shows that in the product A ∧ B = C, the matrices A, B, C̄ are in a certain sense interchangeable: any matrix which is a Boolean product can also be characterised as a maximal Boolean factor. It also gives a characterisation of such matrices by maximal triangle-free graphs. The following simple result is not necessary for the derivation of our algorithm, and is given only to illustrate the connection between Boolean matrix multiplication and triangle-free graphs. One of the few references to maximal equitripartite triangle-free graphs appears in [Bol78], which on pages 324-325 states the problem of finding the minimum possible density of such a graph. It is then noted in [Bol78] that, as of the time of writing, the minimum density problem was "completely unresolved". Since then, a general approach to problems of this kind has been developed; it is easy to see from the discussion above that this problem is closely related to Boolean matrix multiplication.

5.1. FAST BOOLEAN MATRIX MULTIPLICATION

The basis of this approach is Szemeredi's Regularity Lemma (see e.g. [KS96, Die97]). Here we apply this lemma directly to the Boolean matrix multiplication problem; it might also be applicable to similar extremal graph problems, including the minimum density problem. In the definitions and theorems below, we follow [KS96, Die97].

Let G = (V, E) be a simple undirected graph. Let v(G) = |V|, e(G) = |E|. For disjoint X, Y ⊆ V, let e(X, Y) denote the number of edges between X and Y. We define the density of the bipartite subgraph (X, Y) as d(X, Y) = e(X, Y)/(|X| · |Y|). For disjoint A, B ⊆ V, we call (A, B) an ε-regular subgraph, if for every X ⊆ A, Y ⊆ B with |X| > ε|A|, |Y| > ε|B|, we have |d(X, Y) - d(A, B)| < ε.

Let d ≥ ε > 0. We say that G has an (ε, d)-partitioning of size m, if V can be partitioned into m disjoint subsets of equal size, called clusters, such that for any two clusters A, B, the bipartite subgraph (A, B) is either ε-regular with density at least d, or empty. Note that for any d' ≤ d, an (ε, d)-partitioning can be obtained from an (ε, d')-partitioning by simply excluding the ε-regular subgraphs with densities between d' and d. A cluster graph of an (ε, d)-partitioning is a graph with m nodes corresponding to the clusters, in which two nodes are connected by an edge if and only if the two corresponding clusters form a non-empty bipartite subgraph.

We will not apply the definition of ε-regularity directly. Instead, we will use the following theorem (stated in [KS96] in a slightly more general form, which the paper calls the Key Lemma; a further generalisation is known as the Blow-Up Lemma).

Theorem 9 (Komlos, Simonovits). Let d > ε > 0. Let G be a graph with an (ε, d)-partitioning, let R be the cluster graph of this partitioning, and let H be a subgraph of R with maximum degree Δ > 0. If ε ≤ (d - ε)^Δ/(2 + Δ), then G contains a subgraph isomorphic to H.

Proof. See [KS96, Die97].

Since we are interested in triangle-free graphs, we take H to be a triangle. By simplifying the condition on d and ε, we obtain the following corollary of Theorem 9: if ε ≤ d^2/4, and G is triangle-free, then its cluster graph R is also triangle-free. If G is an equitripartite graph, the partitioning can be chosen so that each cluster is a subset of one of the three parts, and the cluster graph is also equitripartite.

Our main tool is Szemeredi's Regularity Lemma. Informally, it states that any graph can be transformed into a graph with an (ε, d)-partitioning by removing a small number of nodes and edges. Its precise statement, slightly adapted from [KS96], is as follows.

Theorem 10 (Szemeredi). For every ε > 0 there is an M = M(ε) such that for any d, ε ≤ d ≤ 1, an arbitrary graph G contains a subgraph G0 with an (ε, d)-partitioning of size at most M, such that e(G \ G0) ≤ (d + ε) · v(G)^2.

Proof. See [KS96, Die97].

Note that if e(G) = o(v(G)^2), the statement of Theorem 10 becomes trivial, since the whole graph may be removed. The size of the resulting (ε, d)-partitioning is at most M = 2^(2^10 ε^-17). Paper [ADL+94] gives an algorithm which finds the subgraph G0 in sequential time O(2^(2^10 ε^-17) n^ω), where n ≥ 2^(2^12 ε^-18), and ω is the exponent of matrix multiplication (currently ω < 2.376 by [CW90]).

We are now able to describe the proposed communication-efficient algorithm for computing A ∧ B = C, where A, B, C are Boolean matrices of size n.
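The definitions of density and ε-regularity above can be made concrete by a small brute-force check. This is an illustration only: the check enumerates all large enough subsets and is therefore exponential, usable on toy graphs; the function names are hypothetical.

```python
import itertools

def density(adj, X, Y):
    # d(X, Y) = e(X, Y) / (|X| |Y|) for a bipartite adjacency dict.
    edges = sum(1 for x in X for y in Y if adj[x][y])
    return edges / (len(X) * len(Y))

def is_eps_regular(adj, A, B, eps):
    # Direct (exponential) check of the definition: for every X within A,
    # Y within B with |X| > eps|A|, |Y| > eps|B|, the density of (X, Y)
    # must differ from d(A, B) by less than eps.
    dAB = density(adj, A, B)
    for i in range(int(eps * len(A)) + 1, len(A) + 1):
        for j in range(int(eps * len(B)) + 1, len(B) + 1):
            for X in itertools.combinations(A, i):
                for Y in itertools.combinations(B, j):
                    if abs(density(adj, X, Y) - dAB) >= eps:
                        return False
    return True
```

A complete bipartite pair is ε-regular for every ε (all densities are equal), while a "star" with all edges leaving one node is irregular already for small subsets.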
We represent the n^3 elementary Boolean products as a cube of volume n^3 in the three-dimensional index space. The cube G is partitioned into p^3 regular cubic blocks of size n/p. Each block is local to a particular processor, and corresponds to an equitripartite triangle-free graph(*) G = (A, B, C̄), where A = A[[I, J]], B = B[[J, K]], C = C[[K, I]] for some 1 ≤ I, J, K ≤ p. We shall identify the cubic block with its graph G.

(*) We use the sans-serif font for blocks, in order to reduce the number of subscripts.

Let us choose positive real numbers ε and d, such that ε ≤ d^2/4 (necessary as a condition of Theorem 9), and d + ε ≤ 2/p^2 (necessary to ensure that the communication cost is H = O(n^2/p) after the application of Theorem 10). We take d = 1/p^2, ε = d^2/4 = 1/(4 p^4).

By Theorem 10, we can find a large subgraph G0 ⊆ G with an (ε, d)-partitioning of size 3m ≤ M(ε). Also let G̃ = G \ G0; we have e(G̃) ≤ (d + ε) n^2/p^2. Let G0 = (A0, B0, C̄0) and G̃ = (Ã, B̃, C̃), so that A = A0 ∨ Ã, B = B0 ∨ B̃, C̄ = C̄0 ∨ C̃, and C = A ∧ B = C00 ∨ C0, where C00 = A0 ∧ B0, and C0 = (Ã ∧ B) ∨ (A ∧ B̃). The (ε, d)-partitioning of the graph G0 is, up to a permutation of indices, a partitioning of G0 into m^3 regular cubic subblocks. Let us denote a subblock of G0 by G0[[i, j, k]] = (A0[[i, j]], B0[[j, k]], C̄0[[k, i]]), where 0 ≤ i, j, k < m, and let G[[i, j, k]] and G̃[[i, j, k]] denote similar subblocks of G and G̃.

The matrix C00 = A0 ∧ B0 need not be computed. Consider an arbitrary subblock C̄0[[k, i]]. If C̄0[[k, i]] is non-zero, then for any j, either A0[[i, j]] or B0[[j, k]] is a zero matrix, since otherwise G0 would contain a triangle by Theorem 9; therefore C00[[k, i]] is a zero matrix, and C[[k, i]] = C0[[k, i]]. If C̄0[[k, i]] is a zero matrix, then C̄[[k, i]] = C̃[[k, i]], and C[[k, i]] is determined by the sparse matrix C̃. The two cases (C̄0[[k, i]] zero or non-zero) can be distinguished by the cluster graph of G0 alone. The product A ∧ B = C can therefore be found by selecting each subblock of C either from C0, or as the complement of the corresponding subblock of C̃, where the choice is governed by the cluster graph.

In order to compute the (ε, d)-partitioning for all p^3 blocks of G at the communication cost H = O(n^2/p), blocks must be grouped into p layers of size n × n × n/p, with p^2 blocks in each layer. Each processor computes cluster graphs for p^2 blocks. The number of nodes in the cluster graph of each block is O(2^(2^10 ε^-17)) = O(2^(2^44 p^68)); therefore, each cluster graph contains O(2^(2^45 p^68)) edges, and the total number of cluster graph edges computed by a processor is at most O(2^(2^46 p^68)). To obtain an algorithm with low communication cost, we require that this number of edges is at most O(n^2/p); therefore it is sufficient to assume that n ≥ 2^(2^45 p^68). The cluster graphs must be exchanged between the processors. The computation of the two Boolean matrix products in C0 = (Ã ∧ B) ∨ (A ∧ B̃) requires that the same p^3 cubic blocks are divided into p similar layers of sizes n/p × n × n and n × n/p × n. The algorithm is as follows.

Algorithm 16. Boolean matrix multiplication.
Parameter: integer n ≥ 2^(2^45 p^68).
Input: n × n Boolean matrices A, B.
Output: n × n Boolean matrix C, such that A ∧ B = C.
Description. The computation is performed on a CRCW BSPRAM(p, g, l). The Boolean matrix product A ∧ B = C is represented as a cube of size n × n × n in integer three-dimensional space. This cube is partitioned into p^3 regular cubic blocks. Matrix C is initialised with zeros, and its blocks are partitioned equally among the processors. The algorithm proceeds in three stages.
First stage. The cube is partitioned into layers of size n × n × n/p, each containing p^2 cubic blocks. Every processor picks a layer, reads the necessary blocks of matrices A, B, and for each cubic block G computes the graph G0, the decompositions A = A0 ∨ Ã, B = B0 ∨ B̃, and the cluster graph of G. Then for each block G the processor writes back the matrices Ã, B̃ and the cluster graph of G.
Second stage. The Boolean products Ã ∧ B and A ∧ B̃ are computed by partitioning the cube into layers of size n/p × n × n and n × n/p × n, respectively, each containing p^2 cubic blocks. Then the Boolean sum C0 = (Ã ∧ B) ∨ (A ∧ B̃) is computed.
Third stage. Every processor reads the necessary cluster graphs, and then computes each of its blocks of C by selecting its subblocks from C0 or from the complement of C̃, as directed by the cluster graph.
The resulting matrix C is the Boolean matrix product of A and B.

Cost analysis. The local computation, communication and synchronisation costs are

    W = O(2^(2^46 p^68) n^ω)    H = O(n^2/p)    S = O(1)

Hence, for sufficiently large n, the local computation cost is polynomial in n with exponent ω. Moreover, for any δ, ω < δ < 3, and n ≥ 2^(2^46 (δ - ω)^-1 p^68), the local computation cost is W = O(n^δ/p); therefore, for sufficiently large n, the algorithm improves on the asymptotic cost of any Strassen-type BSP algorithm with exponent δ. Algorithm 16 is asymptotically optimal in communication, since examining the input already costs H = Ω(n^2/p), and trivially optimal in synchronisation. Although the algorithm is essentially non-oblivious, it is communication-oblivious. Its slackness is n^2/p, and its granularity is 1.
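The correspondence between Boolean products and triangle-free tripartite graphs that underlies this section can be illustrated by a small sketch (hypothetical names; the complement-of-transpose operator is written ~C.T): the graph G = (A, B, C̄) is triangle-free exactly when C covers the product A ∧ B, and adding a single spurious edge to C̄ creates a triangle.

```python
import numpy as np

def bool_mul(A, B):
    # Boolean matrix product: C[i, k] = OR_j (A[i, j] AND B[j, k]).
    return (A.astype(int) @ B.astype(int)) > 0

def has_triangle(A, B, Cbar):
    # A triangle in the tripartite graph G = (A, B, Cbar) is a triple
    # i, j, k with A[i, j] = B[j, k] = Cbar[k, i] = 1.
    n = A.shape[0]
    return any(A[i, j] and B[j, k] and Cbar[k, i]
               for i in range(n) for j in range(n) for k in range(n))

A = np.array([[1, 1, 0], [0, 1, 0], [1, 0, 1]], dtype=bool)
B = np.array([[0, 1, 0], [1, 0, 1], [0, 0, 1]], dtype=bool)
C = bool_mul(A, B)
```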

5.2 Algebraic path computation

In this section we consider the problem of finding the closure of a square matrix over a semiring. Let A be an n × n matrix over a semiring, and let A* = I + A + A^2 + ... be the closure of matrix A (it is not guaranteed to exist in a general semiring). This problem is also known as the algebraic path problem. It unifies many seemingly unrelated computational problems, such as graph connectivity, shortest paths computation, network capacity, network reliability, and regular language generation. All these tasks can be viewed as instances of the algebraic path problem for an appropriately chosen semiring. More information on applications of the algebraic path problem can be found in [Car79, GM84a, GM84b, Rot90, Zim81].

Let an n × n matrix A over a semiring represent a weighted graph with nodes 1, ..., n. The length of an edge i → j is defined as the semiring element A[i, j]. If the graph is not complete, we assume that non-edges have length zero. The distance between nodes i, j is defined as the semiring element A*[i, j]. Note that in this general setting, the distance does not have to correspond to any particular "shortest" path in the graph. In the special case where the semiring is the set of all nonnegative real numbers with ∞, and the operations min and + are used as semiring addition and multiplication respectively, ∞ plays the role of the zero, and lengths and distances have their standard graph-theoretic meaning; in particular, the distances are realised by shortest paths.

We assume that the closure of a semiring element can be computed in time O(1). Matrix closure A* can be computed by sequential Gaussian elimination in time Θ(n^3), provided that the computation terminates. This method is asymptotically optimal for matrices over a general semiring, which can be shown by a standard reduction of the matrix multiplication problem. Gaussian elimination over a general semiring is not guaranteed to terminate; guaranteed termination can be achieved by restricting the domain (e.g. considering closed semirings instead of arbitrary semirings), or by restricting the type of the matrix. In the case of numerical matrices, computation of the matrix closure corresponds to matrix inversion: A* = (I - A)^(-1). We will return to this special case in Section 5.5 for numerical matrices with certain special properties.

In order to compute the closure of a square matrix over a general semiring, we use Gaussian elimination without pivoting, as we did in Section 4.5. The BSP matrix closure computation is similar to LU decomposition (see Section 4.5). The problem can be solved by the cube dag method, giving BSP cost W = O(n^3/p), H = O(n^2/p^(1/2)), S = O(p^(1/2)), or by recursive block Gauss-Jordan elimination. We repeat the description of block Gauss-Jordan elimination from Section 4.5, making the changes necessary to adapt it to the algebraic path problem. The algorithm works by dividing the matrix into square blocks of size n/2,

    A = ( A11 A12 )
        ( A21 A22 )                              (5.1)

and then applying block Gauss-Jordan elimination:

    A11' = A11*            A12' = A11' A12
    A21' = A21 A11'        A22' = A22 + A21 A11' A12
                                                 (5.2)
    A22'' = A22'*          A21'' = A22'' A21'
    A12'' = A12' A22''     A11'' = A11' + A12' A22'' A21'

after which every primed block overwrites Aij. The procedure can be applied recursively to find the two block closures. The computation terminates, if all taken closures exist. The resulting matrix is

    A* = ( A11* + A11* A12 G* A21 A11*    A11* A12 G* )
         ( G* A21 A11*                    G*          )        (5.3)

where G = A22 + A21 A11* A12.

For convenience we assume that the resulting matrix A* must replace the original matrix A in the main memory of BSPRAM. There is no substantial parallelism between block closure and block multiplication tasks in (5.2); therefore, the recursion tree is computed in depth-first order. In each level of recursion, we can only exploit the parallelism within block multiplication, if the block size is large enough. Initially, all p processors are available to compute the closure A*. Every block multiplication in (5.2) is performed in parallel by all processors available at that level. When blocks become sufficiently small, block closure is computed sequentially by an arbitrarily chosen processor. The depth at which the algorithm switches from p-processor to single-processor computation can be varied. This variation allows us to trade off the costs of communication and synchronisation in a certain range. In order to account for this tradeoff, we introduce a real parameter δ, controlling the depth of parallel recursion. We now describe the allocation of block closure tasks and block multiplication tasks in (5.2) to the BSPRAM processors. The resulting algorithm is as follows.

Algorithm 17. Algebraic path computation.
Parameters: integer n ≥ p; real number δ, δmin = 1/2 ≤ δ ≤ 2/3 = δmax.
Input: n × n matrix A over a semiring.
Output: n × n matrix closure A* (assuming it exists), overwriting A.
Description. The computation is performed on a CRCW BSPRAM(p, g, l), and is defined by recursion on the size of the matrix. Denote the matrix size at the current level of recursion by m, keeping n for the original size. In each level of recursion, the matrix is divided into regular square blocks of size m/2, as shown in (5.1). Let n0 = n/p^δ. Value n0 is the threshold at which the algorithm switches from parallel to sequential computation.
Small blocks. If 1 ≤ m ≤ n0, choose an arbitrary processor from all currently available, and compute (5.2) on that processor.
Large blocks. If n0 < m ≤ n, computation (5.2) is performed by the following schedule. First, compute A11' by recursion. Then compute A12', A21', A22' by Algorithm 10. After that, compute A22'' by recursion. Finally, compute A12'', A21'', A11'' by Algorithm 10. Overwrite every Aij by the computed blocks, obtaining the matrix closure (5.3). Each of these computations is performed with all available processors.
Cost analysis. Recurrence relations, identical to the ones used for Algorithm 12, give the BSP cost

    W = O(n^3/p)    H = O(n^2/p^δ)    S = O(p^δ)

The algorithm is oblivious, with slackness and granularity n^2/p^(2/3). Algorithm 17 with δ = δmax = 2/3 is better suited for large values of n, and δ = δmin = 1/2 may perform better for a moderate n.

Lower bounds for the BSP cost of algebraic path computation are similar to the bounds given in Chapter 4. In particular, the lower bound H = Ω(n^2/p^(2/3)) for standard matrix multiplication (Theorem 7) holds also for matrix closure over a general semiring: the problem of computing the n × n matrix product A B can be reduced to algebraic path computation by considering the closure of a 3n × 3n lower triangular matrix,

    ( I         )*   ( I           )
    ( B   I     )  = ( B    I      )             (5.4)
    (     A   I )    ( A B  A   I  )

(see e.g. [Pat74, CLR90]).

If the ground semiring is a commutative ring with unit, fast matrix multiplication can be used instead of standard matrix multiplication for computing block products. The modified algorithm is as follows.

Algorithm 18. Fast algebraic path computation.
Parameters: integer n ≥ p^(3/ω); real number δ, δmin = 1/(ω - 1) ≤ δ ≤ 2/ω = δmax.

Input: n × n matrix A over a commutative ring with unit.
Output: n × n matrix closure A* (assuming it exists), overwriting A.
Description. The computation is identical to Algorithm 17, except that block multiplication is performed by Algorithm 11, rather than Algorithm 10.
Cost analysis. Recurrence relations, identical to the ones used for Algorithm 13, give the BSP cost

    W = O(n^ω/p)    H = O(n^2/p^δ)    S = O(p^δ)

The algorithm is oblivious, with slackness and granularity n^2/p^(2/(ω-1)).

If A is a real symmetric positive definite matrix, matrix inversion can be performed by a standard algorithm similar to Algorithms 12, 17, or by a fast algorithm similar to Algorithms 13, 18. The algorithm for computing the inverse A^(-1) can be obtained by transforming equations (5.2) into the form

    A11' = A11^(-1)        A12' = A11' A12
    A21' = A21 A11'        A22' = A22 - A21 A11' A12
                                                 (5.5)
    A22'' = A22'^(-1)      A21'' = -A22'' A21'
    A12'' = -A12' A22''    A11'' = A11' + A12' A22'' A21'

after which every primed block overwrites Aij. The resulting matrix is

    A^(-1) = ( A11^(-1) + A11^(-1) A12 G^(-1) A21 A11^(-1)    -A11^(-1) A12 G^(-1) )
             ( -G^(-1) A21 A11^(-1)                            G^(-1)              )        (5.6)

where G = A22 - A21 A11^(-1) A12. It is easy to see that matrices A11 and G are symmetric positive definite; therefore the procedure can be applied recursively to find A11^(-1) and G^(-1).

5.3 Algebraic paths in acyclic graphs

In the previous section we considered the algebraic path problem on an arbitrary directed graph. It is interesting to analyse the same problem in the special case where the underlying graph is acyclic. A directed acyclic graph can be topologically sorted by an all-pairs shortest paths algorithm with BSP cost W = O(n^3/p), H = O(n^2/p^(2/3)), S = O(log p) (see Section 5.4). From now on, we assume that the input graph is topologically sorted, so that the edges are directed from higher-indexed to lower-indexed nodes. Such a graph is represented by a lower triangular matrix.

Let A be an n × n lower triangular matrix over a semiring. As before, we solve the problem of computing the matrix closure A* by recursive block Gauss-Jordan elimination, and we assume that the resulting lower triangular matrix A* must replace the original matrix A in the main memory of BSPRAM. The algorithm works by dividing the matrix into square blocks of size n/2,

    A = ( A11      )
        ( A21  A22 )                             (5.7)

and then applying block Gauss-Jordan elimination:

    A11' = A11*    A22' = A22*    A21' = A22' A21 A11'        (5.8)

after which every primed block overwrites Aij. The procedure can be applied recursively to find A11* and A22*. The computation terminates, if all taken closures exist. The resulting matrix is

    A* = ( A11*                )
         ( A22* A21 A11*  A22* )                 (5.9)

An important difference from the case of arbitrary matrices is that the block closures A11* and A22* in (5.8) can be computed independently and simultaneously. In order to exploit this feature, in each step of recursion we partition the set of available processors into two subsets of equal size, each computing one of the two block closures. This partitioning takes place in every level of recursion, until p independent tasks are created; thus the recursion tree is computed in breadth-first order. Block multiplication in (5.8) is performed in parallel by all processors available at the current level of recursion. When blocks become sufficiently small, block closure is computed sequentially by a processor chosen arbitrarily from the available processors. The algorithm is as follows.

Algorithm 19. Algebraic path computation in an acyclic graph.
Parameter: integer n ≥ p^(2/3).
Input: n × n lower triangular matrix A over a semiring.
Output: n × n matrix closure A* (assuming it exists), overwriting A.
Description. The computation is performed on a CRCW BSPRAM(p, g, l), and is defined by recursion on the size of the matrix. Denote the matrix size at the current level of recursion by m, keeping n for the original size. In each level of recursion, the matrix is divided into regular square blocks of size m/2, as shown in (5.7). Let n0 = n/p^(1/3). Value n0 is the threshold at which the algorithm switches from parallel to sequential computation.

Input: n n lower triangular matrix A over a commutative ring with unit. and the synchronisation cost is higher than that of matrix multiplication by only a factor of log p. Overwrite every Aij by Aij . Compute A11 (respectively. partition the set of currently available processors into two equal subsets. Parameter: integer n p2=! . The asymptotic communication and local computation costs of Algorithm 19 are the same as for matrix multiplication.7). choose an arbitrary processor from the currently available. Then compute A21 by Algorithm 10. the second) subset. S = Sp(n) can be found from the following recurrence relations: n0 < m n m = n0 Wq (m) = Wq=2 (m=2) + O(m3 =q) O(n3 ) 0 Hq (m) = Hq=2 (m=2) + O m2 =q2=3 O(n2 ) 0 Sq (m) = Sq=2 (m=2) + O(1) O(1) as
W = O(n3 =p) n2 =p2=3 . using the processors of the rst (respectively. If A is a matrix over a commutative ring with unit. Fast algebraic path computation in an acyclic graph . given in Section 5. and compute (5. H = Hp(n). the matrix is divided into regular square blocks of size m=2 as shown in (5. The values for W = Wp(n). GRAPH COMPUTATION IN BSP
In each step of recursion. If 1 m n0 . with slackness and granularity
From the analysis of Algorithm 19. There is no communication-synchronisation tradeo . the algebraic path problem on an acyclic graph appears to be asymptotically easier than on a general graph. except that block multiplication is performed by Algorithm 11. Output: n n matrix closure A (assuming it exists). Large blocks. computation (5. rather than Algorithm 10.
. fast matrix multiplication can be used instead of standard matrix multiplication for computing block products. Algorithm 20.8) is performed by the following schedule.8) on this processor. The modi ed algorithm is as follows. using all processors available at the current level of recursion. obtaining the matrix closure (5. Small blocks. The computation is identical to Algorithm 19. Description.2. overwriting A. If n0 < m n. A22 ) by recursion. Then.78
CHAPTER 5.9).
    W = O(n^3/p)        H = O(n^2/p^{2/3})        S = O(log p)

The algorithm is oblivious, with slackness and granularity sigma = gamma = n^2/p^{2/3}. The proof of the lower bound on communication cost, H = Omega(n^2/p^{2/3}), is identical to the proof for general graphs. From this analysis, the asymptotic communication and local computation costs of Algorithm 19 are the same as for matrix multiplication, and the synchronisation cost is higher than that of matrix multiplication by only a factor of log p. There is no communication-synchronisation tradeoff; the algebraic path problem on an acyclic graph thus appears to be asymptotically easier than on a general graph.

If A is a matrix over a commutative ring with unit, fast matrix multiplication can be used instead of standard matrix multiplication for computing block products. The modified algorithm is as follows.

Algorithm 20. Fast algebraic path computation in an acyclic graph.

Parameter: integer n >= p^{2/omega}.

Input: n x n lower triangular matrix A over a commutative ring with unit.

Output: n x n matrix closure A* (assuming it exists).

Description. The computation is identical to Algorithm 19, except that block multiplication is performed by Algorithm 11, rather than Algorithm 10.
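The recursive scheme shared by Algorithms 19 and 20 can be illustrated by a minimal sequential sketch. The code below is ours, not the thesis's BSP implementation: it computes the closure of a lower triangular matrix over the (min, +) semiring, where in the BSP algorithm the two recursive calls would run on disjoint processor subsets and the triple product would be a parallel block multiplication.

```python
import math

INF = math.inf

def mat_mul(X, Y):
    # (min,+) semiring product: Z[i][j] = min_k (X[i][k] + Y[k][j])
    n, m, r = len(X), len(Y), len(Y[0])
    return [[min(X[i][k] + Y[k][j] for k in range(m))
             for j in range(r)] for i in range(n)]

def closure_lt(A):
    # Closure A* of a lower triangular matrix over the (min,+) semiring,
    # by the block recursion (5.7)-(5.9); the two recursive calls are
    # independent tasks.
    n = len(A)
    if n == 1:
        # closure of a 1x1 block is the semiring unit 0
        return [[0.0]]
    t = n // 2
    A21 = [row[:t] for row in A[t:]]
    C11 = closure_lt([row[:t] for row in A[:t]])     # A11*
    C22 = closure_lt([row[t:] for row in A[t:]])     # A22*
    C21 = mat_mul(mat_mul(C22, A21), C11)            # A22* A21 A11*
    top = [row + [INF] * (n - t) for row in C11]
    bot = [C21[i] + C22[i] for i in range(n - t)]
    return top + bot
```

On the matrix of a 3-node acyclic graph with edges of lengths 1, 2 and 10, the closure correctly combines the two-edge path of length 3.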

If A is a real nonsingular lower triangular matrix, then an algorithm for computing the inverse A^{-1} can be obtained by transforming equations (5.8) into the form

    A = ( A11       )        A^{-1} = ( A11^{-1}                           )
        ( A21   A22 )                 ( -A22^{-1} A21 A11^{-1}    A22^{-1} )        (5.10)

after which every Aij is overwritten by the corresponding block of A^{-1}. Matrices A11, A22 are themselves nonsingular lower triangular, therefore the procedure can be applied recursively to find A11^{-1} and A22^{-1}; the resulting matrix is the inverse A^{-1}.
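The block recursion (5.10) can be sketched in a few lines of sequential Python; this is our illustration of the arithmetic only, not the parallel algorithm.

```python
def mat_mul(X, Y):
    # ordinary real matrix product on nested lists
    n, m, r = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(m))
             for j in range(r)] for i in range(n)]

def inv_lt(A):
    # Inverse of a nonsingular lower triangular matrix by the block
    # recursion (5.10); the two recursive calls are independent.
    n = len(A)
    if n == 1:
        return [[1.0 / A[0][0]]]
    t = n // 2
    I11 = inv_lt([row[:t] for row in A[:t]])
    I22 = inv_lt([row[t:] for row in A[t:]])
    A21 = [row[:t] for row in A[t:]]
    M = mat_mul(mat_mul(I22, A21), I11)              # A22^{-1} A21 A11^{-1}
    I21 = [[-x for x in row] for row in M]
    top = [row + [0.0] * (n - t) for row in I11]
    bot = [I21[i] + I22[i] for i in range(n - t)]
    return top + bot
```

Multiplying the computed inverse by the original matrix recovers the identity, up to rounding.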
Cost analysis. For Algorithm 20, the values of W = Wp(n), H = Hp(n), S = Sp(n) can be found from the following recurrence relations:

    n0 < m <= n:  W_q(m) = W_{q/2}(m/2) + O(m^omega/q)         m = n0:  W = O(n0^omega)
                  H_q(m) = H_{q/2}(m/2) + O(m^2/q^{2/omega})            H = O(n0^2)
                  S_q(m) = S_{q/2}(m/2) + O(1)                          S = O(1)

as

    W = O(n^omega/p)        H = O(n^2/p^{2/omega})        S = O(log p)

The algorithm is oblivious, with slackness and granularity sigma = gamma = n^2/p^{2/omega}.
Thus, triangular matrix inversion can be performed by an algorithm similar to Algorithms 19 or 20.

5.4 All-pairs shortest paths computation

In Section 5.2 we considered the algebraic path problem over an arbitrary semiring. Here we deal with a special case, where the semiring is the set of real numbers with infinity, and the numerical operations min and + are used as semiring addition and multiplication respectively. Most algorithms for matrix closure in the (min, +) semiring, when applied to path lengths, can be extended to compute the shortest paths between all pairs of nodes, as well as the distances: it suffices to record, for every pair i, j, a path from i to j realising the entry A*[i, j], since this is one of the shortest paths from i to j. Therefore, in this section we use the term all-pairs shortest paths problem as a synonym for the matrix closure problem in the (min, +) semiring. Symbols + and . will always denote the semiring addition and multiplication, i.e. the numerical min and + respectively.

First, we consider the case where all edge lengths are nonnegative. In this case, the problem can be solved by Dijkstra's algorithm ([Dij59], see also [CLR90]). This greedy algorithm finds all shortest paths from a fixed source in order of increasing length; its sequential time complexity is Theta(n^2). To compute the shortest paths between all pairs of nodes in parallel, one can apply Dijkstra's algorithm independently to each node as a source (this approach is suggested e.g. in [Joh77, Fos95]). The resulting algorithm has BSP cost W = O(n^3/p), H = O(n^2), S = O(1). It thus has a higher communication cost, but a lower synchronisation cost, than the Floyd-Warshall algorithm described next.

The technique of Gauss-Jordan elimination, considered in Section 5.2, can be applied to the all-pairs shortest paths problem. In this context, Gauss-Jordan elimination is commonly known as the Floyd-Warshall algorithm (see e.g. [CLR90]). Its block recursive version, identical to Algorithm 17, solves the problem with BSP cost W = O(n^3/p), H = O(n^2/p^alpha), S = O(p^alpha), for an arbitrary alpha, 1/2 <= alpha <= 2/3. This tradeoff motivates us to look for an improved BSP algorithm, that would solve the all-pairs shortest paths problem efficiently both in communication and in synchronisation.

In order to design such an algorithm, we use the principle of path doubling. No shortest path may contain more than n edges, therefore A^n = A*, and matrix A^n can be obtained by repeated squaring in log n matrix multiplications. When run in parallel, this method computes A* with local computation cost W = Theta((n^3 log n)/p), a factor of log n away from Gauss-Jordan elimination. A refined version of path doubling was proposed in [AGM97, Tak98]. The main idea of the method is to perform path doubling keeping track not only of path lengths, but also of path sizes; we use the term path size for the number of edges in a path. By itself, the new method does not improve on the synchronisation cost; however, an improvement can be achieved by combining it with Dijkstra's algorithm.

We assume for simplicity that all edges and paths in the graph have different lengths (otherwise we can break the ties, e.g. by attaching unique identifiers); therefore, all shortest paths are unique. In a path matrix X, each entry X[i, j] is either infinity, or corresponds to a simple path from i to j. We call an entry X[i, j] trivial, if i = j, or X[i, j] = infinity. Addition and multiplication of path matrices are defined in the natural way; we assume that lengths and sizes are kept in a single data structure. For an integer k, let X(k) denote the matrix of all paths in X of size exactly k: X(k)[i, j] = X[i, j] if path X[i, j] has size k, and X(k)[i, j] = infinity otherwise. For integers k1, ..., ks, let X(k1, ..., ks) = X(k1) + ... + X(ks) (remembering that + denotes numerical min). Note that X(0) = I, and X = X(0, ..., m), where m is the maximum path size in X.
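The path-matrix arithmetic just described, lengths combined by numerical + and compared by min, with the size of the winning path carried along, can be sketched as follows; this is our sequential illustration of plain path doubling, not the refined BSP algorithm.

```python
import math

INF = math.inf

def pm_mul(X, Y):
    # Product of path matrices: each entry is a (length, size) pair;
    # lengths add and are minimised, the size of the chosen path is
    # carried along (ties cannot occur if all path lengths differ).
    n = len(X)
    Z = [[(INF, 0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            best = (INF, 0)
            for k in range(n):
                length = X[i][k][0] + Y[k][j][0]
                if length < best[0]:
                    best = (length, X[i][k][1] + Y[k][j][1])
            Z[i][j] = best
    return Z

def closure_by_squaring(L):
    # L: plain matrix of edge lengths (INF = no edge). Returns the
    # closure L* as a path matrix, by ceil(log2 n) squarings of I + L;
    # this is the "plain" path doubling with local computation
    # Theta(n^3 log n), assuming no negative cycles.
    n = len(L)
    B = [[(0, 0) if i == j else (L[i][j], 1) for j in range(n)]
         for i in range(n)]
    k = 1
    while k < n:
        B = pm_mul(B, B)
        k *= 2
    return B
```

For a 3-node path graph, the size-2 path of length 3 appears after the first squaring and is retained thereafter.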

For path matrices X, Y, we write X <= Y, if X[i, j] <= Y[i, j] for all i, j (ignoring path sizes). We call X and Y disjoint, if for all i, j either X[i, j] or Y[i, j] is trivial.

Suppose that we have computed A^k for some k, 1 <= k < n; matrix A^k contains all shortest paths of size at most k (and maybe some other paths). Our next goal is to compute all shortest paths of size at most 3k/2. Decompose the path matrix A^k into a disjoint semiring sum A^k = I + A^k(1) + ... + A^k(k). The total number of nontrivial entries in all these matrices is at most n^2 (since the matrices are disjoint). Consider the matrices A^k(k/2 + 1), ..., A^k(k): the average number of nontrivial entries per matrix is at most 2n^2/k, therefore for some l, k/2 < l <= k, matrix A^k(l) contains at most 2n^2/k nontrivial entries. Consider any shortest path of size in the range l + 1, ..., 3k/2. This path consists of an initial subpath of size l, and a final subpath of size at most k; both subpaths are among the shortest paths already computed. Therefore, the matrix product A^k(0, l) . A^k contains all shortest paths of size at most 3k/2:

    A^k(0, l) . A^k <= A^{3k/2}

The above product involves a sparse matrix A^k(0, l) = I + A^k(l), and a dense matrix A^k. Since A^k(0, l) has at most 2n^2/k nontrivial entries, computation of A^k(0, l) . A^k requires not more than 2n^3/k semiring multiplications. To compute A^k(0, l) . A^k efficiently in parallel, we need to partition the problem into p sparse-by-dense matrix multiplication subproblems, where all the sparse arguments have an approximately equal number of nontrivial entries.

First, partition the set of rows of A^k(0, l) into p^{1/3}/k^{1/3} equal subsets, such that each subset contains at most 4n^2/(k^{2/3} p^{1/3}) nontrivial entries. The partitioning can be computed by a greedy algorithm, the BSP cost of which is negligible. This partitioning defines, up to a permutation of rows, a decomposition of the matrix into p^{1/3}/k^{1/3} equal horizontal strips. Each strip defines an (n k^{1/3}/p^{1/3}) x n x n sparse-by-dense matrix multiplication subproblem. Then, partition the set of columns in each strip into p^{1/3}/k^{1/3} equal subsets, such that each subset contains at most 4n^2/(k^{1/3} p^{2/3}) nontrivial entries. This partitioning defines, up to a permutation of columns, a decomposition of the strip into equal square blocks. Each block defines an (n k^{1/3}/p^{1/3}) x (n k^{1/3}/p^{1/3}) x n sparse-by-dense matrix multiplication subproblem. Finally, by partitioning the set of columns of the second (dense) argument of each such subproblem into p^{1/3} k^{2/3} equal subsets, we obtain sparse-by-dense matrix multiplication subproblems of size (n k^{1/3}/p^{1/3}) x (n k^{1/3}/p^{1/3}) x n/(p^{1/3} k^{2/3}).

The total number of resulting sparse-by-dense matrix multiplication subproblems is p, and the sparse argument of each subproblem contains at most 4n^2/(k^{1/3} p^{2/3}) nontrivial entries. Therefore, the BSP cost of computing the matrix product A^k(0, l) . A^k is W = O(n^3/(k p)), H = O(n^2/(k^{1/3} p^{2/3})), S = O(1).

The path doubling process starts from A = A^1 and computes successively A^{3k/2} from A^k. It stops after at most log_{3/2} p rounds, when the matrix A^p (or some matrix A^{p'} with p' >= p, which is only better) has been computed.

We can further reduce the synchronisation cost by terminating the path doubling phase after fewer than log_{3/2} p rounds, and completing the computation by Dijkstra's algorithm. For 1 <= r <= p^{2/3}, at most log_{3/2} r rounds of path doubling yield the matrix A^r. Since A^r = I + A^r(1) + ... + A^r(r) is a disjoint decomposition, we can find a q, 0 < q <= r, such that the matrix A^r(q) has at most n^2/r nontrivial entries. The closure A^r(q)* contains all shortest paths of sizes that are multiples of q (and maybe some other paths). Any shortest path in A consists of an initial subpath of size that is a multiple of q, and a final subpath of size at most q <= r. Therefore, all shortest paths for the original matrix A can be computed as the matrix product A^r(q)* . A^r = A*. The resulting algorithm is as follows.

Algorithm 21. All-pairs shortest paths (nonnegative case).

Parameters: integer n >= p; integer r, 1 <= r <= p^{2/3}.

Input: n x n matrix A over the (min, +) semiring of nonnegative real numbers with infinity.

Output: n x n matrix closure A*.

Description. The computation is performed on a CRCW BSPRAM(p, g, l), and proceeds in three stages.

First stage. Compute A^r and A^r(q) by at most log_{3/2} r rounds of parallel path doubling.

Second stage. Broadcast A^r(q), and compute the closure A^r(q)* by n independent runs of Dijkstra's algorithm, n/p runs per processor: each processor receives the matrix A^r(q), picks n/p nodes, and computes all shortest paths originating in these nodes. The result of this computation across all processors is the matrix closure A^r(q)*.

Third stage. Compute the product A^r(q)* . A^r = A*.

Cost analysis. The local computation and communication costs of the first stage are dominated by the cost of its first round: W = O(n^3/p) and H = O(n^2/p^{2/3}); its synchronisation cost is S = O(log r). Matrix A^r(q) contains at most n^2/r nontrivial entries, therefore it can be broadcast, and Dijkstra's algorithm applied to find A^r(q)*, with communication cost H = O(n^2/r). The cost of the second stage is W = O(n^3/p), H = O(n^2/r), S = O(1). The cost of the third stage is W = O(n^3/p), H = O(n^2/p^{2/3}), S = O(1).

The local computation, communication and synchronisation costs of the whole algorithm are

    W = O(n^3/p)        H = O(n^2/r)        S = O(1 + log r)

The algorithm is communication-oblivious, with slackness sigma = n^2/r and granularity gamma = 1. The two extremes of Algorithm 21 are the communication-efficient algorithm with r = p^{2/3}, and the multiple Dijkstra algorithm with r = 1.

Algorithm 21 allows the following variation. For some q, 0 <= q < p/2, the disjoint sum A^p(q) + A^p(p - q) contains at most 2n^2/p nontrivial entries, and we can represent matrix A^p(p) as a product A^p(p) = A^p(q) . A^p(p - q). The second stage of the algorithm can then be replaced by broadcasting the matrices A^p(q) and A^p(p - q) (or, equivalently, their disjoint sum), recovering the product A^p(q) . A^p(p - q) = A^p(p), and computing the closure A^p(p)*. However, early termination of the parallel path doubling phase would now increase not only the communication cost, but also the local computation cost; therefore, we do not consider this option.

We now extend Algorithm 21 to graphs where edge lengths may be negative. Formally, the problem is defined as finding the closure A* of a matrix A over the (min, +) semiring of all real numbers with infinity. In contrast with the nonnegative case, the closure is defined if and only if the graph does not contain a cycle of negative length. We cannot use our original method to solve this more general problem, because Dijkstra's algorithm does not work on graphs with negative edge lengths. However, we can get round this difficulty by replacing Dijkstra's algorithm with an extra stage of sequential path doubling.

The extended algorithm has three stages. In the first stage, we compute the matrix A^{p^2} by at most 2 log_{3/2} p rounds of parallel path doubling. Let A^{p^2}((p)) = A^{p^2}(p, 2p, ..., p^2), and, for 0 <= q < p/2, let A^{p^2}((p) - q) = A^{p^2}(p - q, 2p - q, ..., p^2 - q). For some q, the disjoint sum A^{p^2}(q) + A^{p^2}((p) - q) contains at most 2n^2/p nontrivial entries; we represent A^{p^2}((p)) as a product A^{p^2}((p)) = A^{p^2}(q) . A^{p^2}((p) - q). In the second stage, we collect the matrices A^{p^2}(q) and A^{p^2}((p) - q) in a single processor, recover their product A^{p^2}((p)), and compute the closure A^{p^2}((p))* by sequential path doubling. In the third stage, it remains to compute the product A^{p^2}((p))* . A^{p^2} = A*. The resulting algorithm is as follows.

Algorithm 22. All-pairs shortest paths.

Parameter: integer n >= p^2.

Input: n x n matrix A over the (min, +) semiring of real numbers with infinity.

Output: n x n matrix closure A*.

Description. The computation is performed on a CRCW BSPRAM(p, g, l), and proceeds in three stages.

First stage. Compute A^{p^2} and A^{p^2}((p)) by at most 2 log_{3/2} p rounds of parallel path doubling.

Second stage. Represent A^{p^2}((p)) as A^{p^2}((p)) = A^{p^2}(q) . A^{p^2}((p) - q), where the disjoint sum A^{p^2}(q) + A^{p^2}((p) - q) contains at most 2n^2/p nontrivial entries. Collect A^{p^2}(q) + A^{p^2}((p) - q) in a single processor, recover the product A^{p^2}((p)) = A^{p^2}(q) . A^{p^2}((p) - q), and compute the closure A^{p^2}((p))* by sequential path doubling.

Third stage. Compute the product A^{p^2}((p))* . A^{p^2} = A*.

Cost analysis. The local computation and communication costs of the first stage are dominated by the cost of its first round: W = O(n^3/p) and H = O(n^2/p^{2/3}); its synchronisation cost is S = O(log p). The communication and synchronisation costs of the second stage are H = O(n^2/p), S = O(1); its local computation cost is dominated by the cost of the first round of sequential path doubling, equal to W = O(n^3/p). The cost of the third stage is W = O(n^3/p), H = O(n^2/p^{2/3}), S = O(1). The local computation, communication and synchronisation costs of the whole algorithm are

    W = O(n^3/p)        H = O(n^2/p^{2/3})        S = O(log p)

The algorithm is communication-oblivious, with slackness sigma = n^2/p and granularity gamma = 1.

The described method is applicable not only to the (min, +) semiring, but also to any semiring where addition is idempotent. Examples are the (or, and) semiring for the problem of transitive closure, the (max, min) semiring for paths of maximum capacity, and the (max, .) semiring for paths of maximum reliability. Note that in the case of transitive closure computation by Algorithm 21, Boolean matrix multiplication (Algorithm 16) cannot be used instead of general matrix multiplication (Algorithm 10), since the path doubling process involves the multiplication of path matrices, rather than ordinary Boolean matrices. It is not clear if an extension of Algorithm 16 can be obtained for path matrix multiplication.

5.5 Single-source shortest paths computation

In the previous section, we presented deterministic BSP algorithms for finding all-pairs shortest paths in a dense weighted graph. In this section, we deal with the problem of finding all shortest paths from a single distinguished node, called the source.

As before, the graph is weighted and directed; the considered graph may be sparse, and some edge lengths may be negative. We assume that the input is a sparse graph with m edges, n <= m <= n^2. We develop an efficient BSP algorithm for the single-source shortest paths problem, with arbitrary real edge lengths, by using a simple randomisation method proposed in [UY91].

The single-source shortest path problem in a graph corresponds to a system of linear equations in the (min, +) semiring. Let A be the graph matrix; without loss of generality, we assume that the source node has index 1. The standard Bellman-Ford single-source shortest paths algorithm (see e.g. [CLR90]) can be viewed as solving this linear system by Jacobi iteration. An iteration step consists in multiplying the system matrix by a vector, and has sequential cost O(m); a total of n steps may be required, therefore the sequential cost of the Bellman-Ford algorithm is O(mn).

First, we choose a random subset of s <= n sample nodes, which also includes the source node. If s >= n^delta for a constant delta > 0, then an arbitrary subset of n log n/s nodes contains a sample node with high probability (see e.g. [UY91]); the high-probability argument requires that m = Omega(n^{1+epsilon}) for some constant epsilon > 0. For every sample node, we then compute all outgoing shortest paths of size at most n log n/s. This corresponds to n log n/s steps of Jacobi iteration, performed independently on s initial vectors; the computation can be viewed as a sequence of n log n/s steps, each of which is a multiplication of a sparse n x n matrix by a dense n x s matrix. The sequential cost of this computation is O(mn log n).

Consider any shortest path beginning at the source. With high probability, this path is divided by the sample nodes into shortest subpaths of size at most n log n/s; these subpaths are among the shortest paths just computed. The next step of the algorithm is to compute all-pairs shortest paths in the sample graph, defined as the graph on the sample nodes, with edges corresponding to the shortest paths between the samples. After that, one matrix multiplication is sufficient to complete the computation of single-source shortest paths in the original graph. Since the algorithm needs to compute shortest paths from s different sources, we let s = min(m/n, n^{1/2}). Single-source shortest paths can therefore be computed by the following simple, synchronisation-efficient randomised BSP algorithm.
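The truncated Jacobi iteration underlying the first stage can be sketched as follows; the function name and the explicit step bound are ours. Each step relaxes the current distance vector through every edge, i.e. multiplies it by the graph matrix in the (min, +) semiring.

```python
import math

INF = math.inf

def bellman_ford(edges, n, source, max_size=None):
    # Bellman-Ford as Jacobi iteration over the (min,+) semiring:
    # edges is a list of (u, v, length) triples; max_size bounds the
    # path size (number of edges), as in the truncated runs used by
    # the sampling algorithm. Without max_size, n-1 steps suffice in
    # the absence of negative cycles.
    steps = max_size if max_size is not None else n - 1
    d = [INF] * n
    d[source] = 0.0
    for _ in range(steps):
        nd = d[:]
        for u, v, w in edges:
            if d[u] + w < nd[v]:
                nd[v] = d[u] + w
        if nd == d:
            break          # fixed point reached early
        d = nd
    return d
```

With `max_size=1`, only size-1 paths are found, illustrating the size truncation.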

Algorithm 23. Randomised single-source shortest paths in a sparse graph.

Parameter: integer n >= p^2.

Input: n x n matrix A over the (min, +) semiring of real numbers with infinity, containing m = Omega(n^{1+epsilon}) nontrivial entries, for a constant epsilon > 0.

Output: n-vector (0, inf, ..., inf) . A*, i.e. the vector of shortest distances from the source.

Description. The computation is performed on a CRCW BSPRAM(p, g, l), and proceeds in three stages. Without loss of generality, let the sample nodes have indices 1, ..., s, including the source node 1. Let J denote the n x s matrix such that J[i, j] = 0 if i = j, and J[i, j] = infinity otherwise.

First stage. Broadcast matrix A. Select s random sample nodes. Compute the s x n matrix B = J^T . A^{n log n/s} by s independent runs of the Bellman-Ford algorithm, up to path size n log n/s.

Second stage. Compute the s x s matrix C = B . J = J^T . A^{n log n/s} . J. Collect matrix C in a single processor, and compute its closure C* by an efficient sequential algorithm.

Third stage. Let the s-vector c be the first row of C*: c = (0, inf, ..., inf) . C*. Broadcast c, and compute the n-vector c . B. With high probability, this vector is equal to the first row of A*: c . B = (0, inf, ..., inf) . A*.

Cost analysis. The local computation cost of the first stage is determined by the cost of the Bellman-Ford algorithm: W = O(mn log n/p); the communication cost of the broadcast in the first stage is H = O(m). The local computation cost of finding the closure C* in the second stage is W = O(s^3); the communication cost of collecting the sample matrix C is H = O(s^2). The local computation cost of the third stage is W = O(sn/p); the communication cost of the broadcast in the third stage is H = O(s). Both the local computation and the communication costs of the whole algorithm are dominated by the respective costs of the first stage:

    W = O(mn log n/p)        H = O(m)        S = O(1)

The algorithm is communication-oblivious, with slackness and granularity sigma = gamma = s/p.

The local computation cost of Algorithm 23 differs from the cost of the sequential Bellman-Ford algorithm by a factor of log n; it is not known whether the single-source shortest path problem can be solved in parallel with local computation cost O(mn/p). The communication cost of Algorithm 23 can be reduced by partitioning matrix A in the first stage into regular square blocks, and performing the computation of matrix B in n log n/s supersteps. We do not consider this alternative fine-grain algorithm, because of its high synchronisation cost.
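The sampling scheme can be emulated sequentially as a sanity check. The sketch below is ours: the names are hypothetical, and we use the safe truncation bound n instead of the probabilistic bound ~ n log n/s, so the sketch is exact regardless of the random sample; the closure of the sample matrix is taken by Floyd-Warshall, standing in for "an efficient sequential algorithm".

```python
import math
import random

INF = math.inf

def truncated_bf(edges, n, src, steps):
    # distances realised by paths of at most `steps` edges
    d = [INF] * n
    d[src] = 0.0
    for _ in range(steps):
        nd = d[:]
        for u, v, w in edges:
            if d[u] + w < nd[v]:
                nd[v] = d[u] + w
        d = nd
    return d

def sampled_sssp(edges, n, s, source=0, seed=1):
    # Stage 1: truncated runs from each of s samples (matrix B);
    # stage 2: closure C* of the sample-to-sample matrix C;
    # stage 3: combine c = first row of C* with B.
    rng = random.Random(seed)
    others = [v for v in range(n) if v != source]
    samples = [source] + rng.sample(others, s - 1)
    limit = n        # the BSP algorithm uses ~ n log n/s; n is always safe
    B = [truncated_bf(edges, n, a, limit) for a in samples]
    C = [[B[a][samples[b]] for b in range(s)] for a in range(s)]
    for k in range(s):               # closure of the sample graph
        for i in range(s):
            for j in range(s):
                C[i][j] = min(C[i][j], C[i][k] + C[k][j])
    return [min(C[0][b] + B[b][v] for b in range(s)) for v in range(n)]
```

On a 4-node chain with a long shortcut edge, the combined distances agree with plain Bellman-Ford.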

5.6 Minimum spanning tree computation

Finally, we consider the problem of finding the minimum spanning tree (MST) in a weighted undirected graph. We assume that all edge weights are distinct (otherwise we can break the ties by attaching a unique identifier to each edge). A useful theorem from [MP88] relates the MST problem to the problem of finding shortest paths in a graph, where path weight is defined as the weight of the heaviest edge in the path.

Theorem 11 (Maggs, Plotkin). A path between two nodes in the minimum spanning tree is the lightest path between these nodes.

Proof. See [MP88].

As observed in [MP88], the MST problem can therefore be viewed as an instance of the algebraic path problem. The problem input is a symmetric matrix over the semiring of all real numbers with infinity, where the operations min and max are used as semiring addition and multiplication respectively. Due to the symmetry of the matrix, and the special structure of the (min, max) semiring, the MST can be found in sequential time O(n^2), in contrast with the Theta(n^3) complexity of standard algorithms for the general algebraic path problem.

Standard sequential algorithms for MST computation employ greedy techniques, such as the algorithms by Kruskal and Prim (see e.g. [AHU83, CLR90]). These greedy algorithms find the MST of a graph with n nodes and m edges in time O(m log n); if the input edges are sorted by weight, the greedy algorithms work in time O(m). Many algorithms with a lower asymptotic complexity have been proposed. Paper [KKT95] describes a randomised O(m) MST algorithm, but it is not known if an O(m) deterministic algorithm exists.

A standard PRAM solution to the problem is provided by another greedy algorithm, attributed to Boruvka and Sollin (see e.g. [Chr75, JaJ92]). The algorithm works by selecting for each node the shortest incident edge; the resulting set of edges is a subforest of the MST. Connected components of this forest are regarded as nodes of a new graph, where the weight of an edge between two nodes is defined as the minimum weight of an edge between the two corresponding components. The procedure is repeated until only one node is left; the MST of the original graph is the union of the forests constructed in all rounds. The number of nodes in the graph is reduced by at least a factor of two in each round, therefore O(log n) rounds are sufficient for a dense graph. The contraction of tree components in each round can take up to O(log n) steps, therefore the total PRAM complexity of the algorithm is O(log^2 n). Paper [JM95] presents a more efficient PRAM algorithm, with complexity O(log^{3/2} n).

The BSPRAM model suggests an alternative, coarse-grain approach to the problem. To find the MST, consider a partitioning of the m edges into p arbitrary subsets, m/p edges per processor. Each subset defines a subgraph of the original graph. We compute the MST of each subgraph separately by an efficient sequential algorithm, either deterministic or randomised, and then compute the MST of their union. Any edge that does not belong to one of the subgraph MSTs does not belong to the MST of the whole graph: indeed, such an edge cannot be the lightest path between its ends, since there is a lighter path in the MST of the subgraph that contains that edge. Therefore, the MST of the whole graph is contained in the union of the subgraph MSTs, and can be found as the MST of this union. The resulting BSPRAM algorithm is as follows.

Algorithm 24. Minimum spanning tree.

Parameters: integers n and m, with m >= n p^2 (respectively, m >= n p^2 log p) for the deterministic (respectively, randomised) version of the algorithm.

Input: undirected weighted graph G with n nodes and m edges. We assume that initially the edges of the graph are stored in the main memory.

Output: minimum spanning tree of G.

Description. The computation is performed on an EREW BSPRAM(p, g, l) in message-passing mode, and proceeds in two stages.

First stage. Edges of G are partitioned across the processors, m/p edges per processor. Each processor computes the MST of the subgraph of G defined by its m/p local edges, and then sorts the edges of the obtained local MST.

Second stage. The local MSTs are collected in a single processor. This processor merges the received edges into a sorted sequence, and computes the MST of their union, obtaining the MST of G.

Cost analysis. The communication cost of the initial distribution is H = O(m/p). The (expected) local computation cost of computing the subgraph MSTs in the first stage is O(m log n/p) for the deterministic version, and O(m/p) for the randomised version; the cost of sorting the edges of a local MST is O(n log n). The size of each local MST is O(n), therefore the communication cost of collecting the local MSTs in the second stage is O(n p), and the local computation cost of merging them is O(n p log p). The assumptions on the size of n and m imply that the BSP cost of the second stage is dominated by the cost of the first stage. Thus, the local computation cost of the deterministic algorithm is W_det = O(m log n/p), and the expected local computation cost of the randomised algorithm is W_exp = O(m/p). The communication and synchronisation costs of both versions are H = O(m/p), S = O(1). The algorithm is communication-oblivious; its slackness and granularity are sigma = gamma = m/p.
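The two-stage structure of Algorithm 24 can be sketched sequentially; the code below is our illustration (with Kruskal's algorithm as the "efficient sequential algorithm", and a simple round-robin edge partitioning standing in for the arbitrary partitioning).

```python
def kruskal(n, edges):
    # MST (spanning forest) of (weight, u, v) edges by Kruskal's
    # algorithm with path-halving union-find
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def two_stage_mst(n, edges, p):
    # Stage 1: split the edge list into p subsets and compute the MST
    # (spanning forest) of each subgraph locally.
    # Stage 2: the MST of the whole graph is the MST of the union of
    # the local forests.
    parts = [edges[i::p] for i in range(p)]
    local = [kruskal(n, part) for part in parts]
    merged = [e for forest in local for e in forest]
    return kruskal(n, merged)
```

With distinct weights, the two-stage result coincides with the directly computed MST.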

Papers [ADJ+98, DG98] present more complex BSP algorithms, that are efficient for smaller input sizes. For dense graphs, these algorithms are similar to Algorithm 24.

The BSP costs of the presented algorithms are summarised in Tables 6.1-6.4 (with constant factors omitted). Some of the algorithms exhibit a tradeoff between communication and synchronisation costs. Constant factors involved in the asymptotic costs of the presented algorithms are fairly small, with the exception of fast Boolean matrix multiplication, where these factors are astronomically large.
Problem             W            H            S
Complete tree       n/p          n/p          1
Butterfly dag       n log n/p    n/p          1
Cube dag            n^3/p        n^2/p^{1/2}  p^{1/2}
Sorting             n log n/p    n/p          1
List contraction    n/p          n/p          log p
Tree contraction    n/p          n/p          log p
Table 6. It was shown that the BSP and the BSPRAM models are related by e cient simulation for a broad range of algorithms. computation on dense matrices. with the exception of fast Boolean matrix multiplication.Chapter 6
Conclusions
In this thesis we have developed a systematic approach to the design and analysis of bulk-synchronous parallel algorithms, based on the BSPRAM model. This model enhances the standard BSP model with shared memory, while retaining the concept of data locality. The use of shared memory simplifies the design and analysis of BSP algorithms, and encourages natural specification of the problems: the input and output data are assumed to be stored in the main memory, and no assumptions on data distribution are necessary. It was shown that the BSP and the BSPRAM models are related by efficient simulation for a broad range of algorithms; we have identified some characteristic algorithm properties that enable such simulation: communication-obliviousness, slackness, granularity. We have presented BSPRAM algorithms for popular computational problems from three large domains: combinatorial computation, computation on dense matrices, and graph computation. It is assumed that the input size is sufficiently large with respect to the number of processors.

Problem                          W        H                          S
Matrix-vector multiplication     n^2/p    n/p^{1/2}                  1
Triangular system solution       n^2/p    n                          p
Matrix multiplication            n^3/p    n^2/p^{2/3}                1
Gaussian elimination
  no pivoting, min H             n^3/p    n^2/p^{2/3}                p^{2/3}
  no pivoting, min S             n^3/p    n^2/p^{1/2}                p^{1/2}
  block pivoting, min H          n^3/p    n^2/p^{2/3}                p^{2/3} (log p)^2
  block pivoting, min S          n^3/p    n^2/p^{1/2} (log p)^{1/2}  p^{1/2} (log p)^{3/2}
  column pivoting                n^3/p    n^2                        p

Table 6.2: Summary of matrix algorithms

Problem                          W        H              S
Matrix multiplication            n^omega/p    n^2/p^{2/omega}    1
Gaussian elimination
  no pivoting, min H             n^omega/p    n^2/p^{2/omega}    p^{2/omega}
  no pivoting, min S             n^omega/p    n^2/p^{1/2}        p^{1/2}
Boolean matrix multiplication
  normal                         n^omega/p    n^2/p^{2/omega}    1
  min H                          n^omega/p    n^2/p              1

Table 6.3: Summary of fast matrix algorithms
(The costs in the min H row of Boolean matrix multiplication hide an astronomically large constant factor, of order exp(p^68).)

Problem                          W            H              S
Algebraic paths
  general, min H                 n^3/p        n^2/p^{2/3}    p^{2/3}
  general, min S                 n^3/p        n^2/p^{1/2}    p^{1/2}
  acyclic                        n^3/p        n^2/p^{2/3}    log p
All-pairs shortest paths
  general                        n^3/p        n^2/p^{2/3}    log p
  nonnegative, min H             n^3/p        n^2/p^{2/3}    log p
  nonnegative, min S             n^3/p        n^2            1
Single-source shortest paths
  randomised                     mn log n/p   m              1
Minimum spanning tree
  randomised                     m/p          m/p            1
  deterministic                  m log n/p    m/p            1

Table 6.4: Summary of graph algorithms

Our presentation highlights certain algorithmic concepts that are important in developing future BSP languages and programming tools; apart from communication-synchronisation tradeoffs, these concepts include blocking and recursion. Providing their efficient support, both on the local and the inter-processor level, is a challenging task for future BSP software development. Implementation of the BSPRAM model and experimental evaluation of the algorithms presented in this thesis would be useful first steps in this direction.

Asymptotic optimality has been proven for some of the presented algorithms; for other algorithms, we have discussed possible methods of obtaining a lower bound. Lower bounds on communication and synchronisation were treated separately; a possible direction of future research consists in obtaining lower bounds on communication-synchronisation tradeoffs. The question remains whether a practical communication-efficient algorithm for Boolean matrix multiplication exists. Other research directions include a numerical stability study for the dense matrix algorithms obtained in Chapter 4, and an extension of these algorithms to sparse matrices.
Chapter 7

Acknowledgements

This work began in 1993, when I was a visiting student at Oxford University Computing Laboratory, funded by the Soros/FCO Scholarship. My subsequent DPhil studies were supported in part by ESPRIT Basic Research Project 9072 GEPPCOM (Foundations of General Purpose Parallel Computing). I am grateful to my supervisor Bill McColl for the exciting opportunity to participate in this project, and for many illuminating discussions. I thank my examiners Richard Brent and Mike Paterson for their attentive reading and valuable advice. Further thanks are due to Rob Bisseling, Radu Calinescu, Alex Gerbessiotis, Ben Juurlink, Jimmie Lawson, Constantinos Siniolakis, Karina Terekhova and Alex Wagner, as well as the anonymous referees of my several papers. Thanks to my Oxford colleagues for the stimulating atmosphere, to my friends for making my life in Oxford full and enjoyable, and to my wife Karina and to all my family for their love, support and understanding. The thesis was typeset in LaTeX 2e with packages AMS-LaTeX and PSTricks.

Appendix A

The Loomis–Whitney inequality and its generalisations

The optimality of several BSP matrix algorithms presented in Chapter 4 rests on the discrete Loomis–Whitney inequality, which relates the size of a finite set of points in $\mathbb{Z}^m$ to the sizes of its orthogonal projections onto $r$-dimensional coordinate subspaces, $1 \le r \le m$. The inequality is closely related to Cauchy's inequality and the classical isoperimetric inequality. In this appendix we propose a new simple proof and a generalisation of the discrete Loomis–Whitney inequality.

One of the most common inequalities in analysis is the classical Cauchy inequality (see e.g. [Mit70]): for two finite sequences $a_i, b_i \ge 0$, $1 \le i \le n$,
$$\sum_{i=1}^n a_i^{1/2} b_i^{1/2} \le \Big( \sum_{i=1}^n a_i \Big)^{1/2} \Big( \sum_{i=1}^n b_i \Big)^{1/2}. \quad (A.1)$$
A standard generalisation of (A.1) is Hölder's inequality, which can be further generalised by taking $m \ge 2$ sequences and exponents $\alpha_1, \ldots, \alpha_m \ge 0$ such that $\alpha_1 + \cdots + \alpha_m = 1$.

Another important result is the Loomis–Whitney inequality, relating the volume of a compact set in $\mathbb{R}^m$, $m \ge 2$, to the areas of its orthogonal projections onto $r$-dimensional coordinate subspaces, $1 \le r \le m$. It was introduced in [LW49] (see also [Had57, BZ88]) to simplify the proof of the classical isoperimetric "volume-to-surface" inequality. The discrete analogue of the Loomis–Whitney inequality relates the size of a finite set of points in $\mathbb{Z}^m$ to the sizes of its orthogonal projections onto $r$-dimensional coordinate subspaces. For simplicity, let us take $m = 3$, $r = 2$. Let $A$ be a finite set of points in $\mathbb{Z}^3$, and let $A_1$, $A_2$, $A_3$ be the orthogonal projections of $A$ onto the coordinate planes. The discrete Loomis–Whitney inequality states that
$$|A| \le |A_1|^{1/2} |A_2|^{1/2} |A_3|^{1/2}, \quad (A.2)$$
where $|\cdot|$ denotes the cardinality of a finite set.

Inequalities (A.1) and (A.2) seem at first to be unrelated. However, they are special cases of the following general inequality: for finite sequences $a_{ij}, b_{jk}, c_{ik} \ge 0$, $1 \le i, j, k \le n$,
$$\sum_{i,j,k=1}^n a_{ij}^{1/2} b_{jk}^{1/2} c_{ik}^{1/2} \le \Big( \sum_{i,j=1}^n a_{ij} \Big)^{1/2} \Big( \sum_{j,k=1}^n b_{jk} \Big)^{1/2} \Big( \sum_{i,k=1}^n c_{ik} \Big)^{1/2}. \quad (A.3)$$
The Cauchy inequality (A.1) follows from (A.3) when $a_{ij} = a_j$ for all $i$, $b_{jk} = b_j$ for all $k$, and $c_{ik} = 1$ for all $i, k$. The discrete Loomis–Whitney inequality (A.2) follows from (A.3) when $a_{ij}, b_{jk}, c_{ik} \in \{0, 1\}$.

Inequality (A.3) can be rewritten in matrix form as $\operatorname{tr}(ABC) \le \|A\| \cdot \|B\| \cdot \|C\|$, where $A = (a_{ij}^{1/2})$, $B = (b_{jk}^{1/2})$, $C = (c_{ki}^{1/2})$, and $\|\cdot\|$ is the Frobenius (Euclidean) matrix norm: $\|X\| = \big( \sum_{i,j} x_{ij}^2 \big)^{1/2}$, where $X = (x_{ij})$. In this form, the inequality can be proved straightforwardly from the properties of the Frobenius norm and of the Frobenius inner product $\langle X, Y \rangle = \operatorname{tr}(XY^T) \le \|X\| \cdot \|Y\|$. We have $\operatorname{tr}(ABC) \le \|A\| \cdot \|(BC)^T\| \le \|A\| \cdot \|B\| \cdot \|C\|$.

We now consider the multidimensional versions of the above inequalities. For $m - 1$ sequences $a_s[i] \ge 0$, $1 \le i \le n$, $1 \le s \le m - 1$, and exponents $\alpha_1 = \cdots = \alpha_{m-1} = \frac{1}{m-1}$, the generalised symmetric Hölder inequality is
$$\sum_{i=1}^n \prod_{s=1}^{m-1} a_s[i]^{\frac{1}{m-1}} \le \prod_{s=1}^{m-1} \Big( \sum_{i=1}^n a_s[i] \Big)^{\frac{1}{m-1}}. \quad (A.4)$$
For a finite set $A \subseteq \mathbb{Z}^m$, the discrete Loomis–Whitney inequality is
$$|A| \le \prod_{1 \le s_1 < \cdots < s_r \le m} |A_{s_1 \ldots s_r}|^{\frac{m}{r} \binom{m}{r}^{-1}}, \quad (A.5)$$
where $A_{s_1 \ldots s_r}$, $1 \le s_1 < \cdots < s_r \le m$, are the projections of $A$ onto the $\binom{m}{r}$ coordinate subspaces of dimension $r$.

Inequalities (A.4) and (A.5) are special cases of the following more general inequality.

Theorem 12. For any $a_{s_1 \ldots s_r}[i_{s_1}, \ldots, i_{s_r}] \ge 0$, $1 \le i_1, \ldots, i_m \le n$, $1 \le s_1 < \cdots < s_r \le m$, $1 \le r \le m$,
$$\sum_{1 \le i_1, \ldots, i_m \le n} \; \prod_{1 \le s_1 < \cdots < s_r \le m} a_{s_1 \ldots s_r}[i_{s_1}, \ldots, i_{s_r}]^{\frac{m}{r} \binom{m}{r}^{-1}} \le \prod_{1 \le s_1 < \cdots < s_r \le m} \Big( \sum_{1 \le i_{s_1}, \ldots, i_{s_r} \le n} a_{s_1 \ldots s_r}[i_{s_1}, \ldots, i_{s_r}] \Big)^{\frac{m}{r} \binom{m}{r}^{-1}}. \quad (A.6)$$
Inequality (A.3) corresponds to (A.6) with $m = 3$, $r = 2$, where $a_{12}[i_1, i_2] = a_{ij}$, $a_{23}[i_2, i_3] = b_{jk}$, $a_{13}[i_1, i_3] = c_{ik}$.
: : : .3) does not seem to generalise to
X
ijk ik k
a1=2 b1=2 c1=2 = ij jk ik c1=2 ik
j
X
X
j
a1=2 b1=2 ij jk
1=2
X
ik
X X
bjk
"X
i
c1=2 ik
j
X
c1=2 ik
X
k 1=2
aij
j
j 1=2 #
aij
1=2
X
j
bjk
1=2
=
X X
j
X
ij
aij
1=2
"X X
k
bjk cik
1=2 1=2
X
i
bjk
X
i
#
1=2
cik
1=2
X
ij
aij
1=2
=
n X ij
aij
n X jk
bjk
1=2
n X ik
cik
1=2
A generalisation for arbitrary r and m is straightforward. that can be easily generalised to a proof of (A.k+1:::m 1 i1 . im 2 ] = 1. : : : . ik+1 .6) when r = m 1.6) degenerates to an identity when r = 1 or r = m. : : : .
a1:::k
1. elementary proof of (A. 1g. Inequality (A.6) (although the inequality for tr(ABC ) can be extended to an arbitrary number of matrices). We have
Proof.1) to the left-hand side of (A. im 1 ] = ak im 1 ]
for 1 k < m 1. we give an alternative.4) follows from (A.6).
. We apply Cauchy's inequality (A. The matrix proof of inequality (A.6) when as1 :::sr is1 . The generalised symmetric Holder inequality (A.6). ik 1 . The subexpressions to which Cauchy's inequality is applied are denoted below by square brackets. : : : . The discrete Loomis{ Whitney inequality (A.3) three times. THE LOOMIS{WHITNEY INEQUALITY AND ITS GENERALISATIONS a proof of (A. isr ] 2 f0. and a1:::m 2 i1 .96APPENDIX A. To avoid dealing with the cumbersome notation of (A.3).5) follows from (A.
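As a cross-check of the derivation, the following Python sketch (ours, not from the thesis) evaluates the original trilinear sum and the three successive upper bounds produced by each application of Cauchy's inequality (A.1), confirming that the chain is monotone on random non-negative data.

```python
import random

random.seed(2)
n = 5

# Random non-negative data a_ij, b_jk, c_ik.
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]
c = [[random.random() for _ in range(n)] for _ in range(n)]

row_a = [sum(a[i][j] for j in range(n)) for i in range(n)]   # sums over j
col_b = [sum(b[j][k] for j in range(n)) for k in range(n)]   # sums over j
col_c = [sum(c[i][k] for i in range(n)) for k in range(n)]   # sums over i
total_a, total_b, total_c = sum(row_a), sum(col_b), sum(col_c)

# s0: the trilinear sum on the left-hand side of (A.3).
s0 = sum((a[i][j] * b[j][k] * c[i][k]) ** 0.5
         for i in range(n) for j in range(n) for k in range(n))

# s1: first Cauchy step, applied over j for each fixed pair (i, k).
s1 = sum(c[i][k] ** 0.5 * row_a[i] ** 0.5 * col_b[k] ** 0.5
         for i in range(n) for k in range(n))

# s2: second Cauchy step, applied over i for each fixed k.
s2 = total_a ** 0.5 * sum(col_b[k] ** 0.5 * col_c[k] ** 0.5
                          for k in range(n))

# s3: third Cauchy step, applied over k; the right-hand side of (A.3).
s3 = (total_a * total_b * total_c) ** 0.5

assert s0 <= s1 <= s2 <= s3
```

Each intermediate quantity corresponds to one line of the displayed derivation, so the monotone chain `s0 <= s1 <= s2 <= s3` is exactly the three bracketed applications of (A.1).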

Bibliography

[ABG+95] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development, 39(5):575–581, 1995.
[ABK95] M. Adler, J. W. Byers, and R. M. Karp. Parallel sorting with limited bandwidth. In Proceedings of ACM SPAA '95, pages 129–136, 1995.
[ACS90] A. Aggarwal, A. K. Chandra, and M. Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71(1):3–28, March 1990.
[ADJ+98] M. Adler, W. Dittrich, B. Juurlink, M. Kutyłowski, and I. Rieping. Communication-optimal parallel minimum spanning tree algorithms. In Proceedings of the 10th ACM SPAA, 1998.
[ADL+94] N. Alon, R. A. Duke, H. Lefmann, V. Rödl, and R. Yuster. The algorithmic aspects of the regularity lemma. Journal of Algorithms, 16(1):80–109, January 1994.
[AGL+98] G. A. Alverson, W. G. Griswold, C. Lin, D. Notkin, and L. Snyder. Abstractions for portable, scalable parallel programming. IEEE Transactions on Parallel and Distributed Systems, 9(1):71–86, January 1998.
[AGM97] N. Alon, Z. Galil, and O. Margalit. On the exponent of the all pairs shortest path problem. Journal of Computer and System Sciences, 54(2):255–262, April 1997.
[AHU76] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Series in Computer Science and Information Processing. Addison-Wesley, 1976.
[AHU83] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.
[AM90] R. J. Anderson and G. L. Miller. A simple randomized parallel algorithm for list ranking. Information Processing Letters, 33(5):269–273, January 1990.
[BCS97] P. Bürgisser, M. Clausen, and M. A. Shokrollahi. Algebraic Complexity Theory. Number 315 in Grundlehren der mathematischen Wissenschaften. Springer-Verlag, 1997.
[Ber85] C. Berge. Graphs, volume 6, part 1 of North-Holland Mathematical Library. North-Holland, second revised edition, 1985.
[BH74] J. R. Bunch and J. E. Hopcroft. Triangular factorization and inversion by fast matrix multiplication. Mathematics of Computation, 28(125):231–236, January 1974.
[BK82] R. P. Brent and H. T. Kung. A regular layout for parallel adders. IEEE Transactions on Computers, 31(3):260–264, March 1982.
[BKM95] J. Basch, S. Khanna, and R. Motwani. On diameter verification and Boolean matrix multiplication. Technical Report STAN-CS-95-1544, Department of Computer Science, Stanford University, 1995.
[Ble93] G. E. Blelloch. Prefix sums and their applications. In J. H. Reif, editor, Synthesis of Parallel Algorithms, chapter 1, pages 35–60. Morgan Kaufmann, 1993.
[BM93] R. H. Bisseling and W. F. McColl. Scientific computing on bulk synchronous parallel architectures. Preprint 836, Department of Mathematics, University of Utrecht, December 1993.
[Bol78] B. Bollobás. Extremal Graph Theory. Academic Press, 1978.
[Bre91] R. P. Brent. Parallel algorithms in linear algebra. In Proceedings of the Second NEC Research Symposium, August 1991.
[BZ88] Yu. D. Burago and V. A. Zalgaller. Geometric Inequalities. Number 285 in Grundlehren der mathematischen Wissenschaften. Springer-Verlag, 1988.
[Car79] B. Carré. Graphs and Networks. Oxford Applied Mathematics and Computer Science Series. Clarendon Press, 1979.
[CDO+96] J. Choi, J. J. Dongarra, S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. Clint Whaley. The design and implementation of the ScaLAPACK LU, QR and Cholesky factorization routines. Scientific Programming, 5:173–184, 1996.
[Chr75] N. Christofides. Graph theory: an algorithmic approach. Computer science and applied mathematics. Academic Press, 1975.
[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Electrical Engineering and Computer Science Series. The MIT Press and McGraw-Hill, 1990.
[Col93] R. Cole. Parallel merge sort. In J. H. Reif, editor, Synthesis of Parallel Algorithms, chapter 10, pages 453–495. Morgan Kaufmann, 1993.
[CV86] R. Cole and U. Vishkin. Deterministic coin tossing with application to optimal parallel list ranking. Information and Control, 70(1):32–53, July 1986.
[CW90] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3):251–280, March 1990.
[DG98] F. Dehne and S. Götz. Practical parallel algorithms for minimum spanning trees. In Proceedings of the Workshop on Advances in Parallel and Distributed Systems, 1998.
[DHS95] J. W. Demmel, N. J. Higham, and R. S. Schreiber. Block LU factorization. Numerical Linear Algebra with Applications, 2(2), 1995.
[Die97] R. Diestel. Graph Theory. Number 173 in Graduate Texts in Mathematics. Springer, 1997.
[Dij59] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.
[Fos95] I. Foster. Designing and Building Parallel Programs. Addison-Wesley, 1995.
[GHSJ96] S. K. S. Gupta, C.-H. Huang, P. Sadayappan, and R. W. Johnson. A framework for generating distributed-memory parallel programs for block recursive algorithms. Journal of Parallel and Distributed Computing, 34(2):137–153, May 1996.
[Gib93] P. B. Gibbons. Asynchronous PRAM algorithms. In J. H. Reif, editor, Synthesis of Parallel Algorithms, chapter 22, pages 957–997. Morgan Kaufmann, 1993.
[GM84a] M. Gondran and M. Minoux. Graphs and Algorithms. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons, 1984.
[GM84b] M. Gondran and M. Minoux. Linear algebra in dioids: A survey of recent results. Annals of Discrete Mathematics, 19:147–164, 1984.
[GMR] P. B. Gibbons, Y. Matias, and V. Ramachandran. Can a shared memory model serve as a bridging model for parallel computation? Theory of Computing Systems. To appear.
[GMR99] P. B. Gibbons, Y. Matias, and V. Ramachandran. The QRQW PRAM: Accounting for contention in parallel algorithms. SIAM Journal on Computing, 28(2):733–769, April 1999.
[GMT88] H. Gazit, G. L. Miller, and Shang-Hua Teng. Optimal tree contraction in an EREW model. In S. K. Tewksbury, B. W. Dickinson, and S. C. Schwartz, editors, Concurrent Computations: Algorithms, Architecture and Technology, pages 139–156. Plenum Press, 1988.
[Goo96] M. T. Goodrich. Communication-efficient parallel sorting. In Proceedings of the 28th ACM STOC, 1996.
[GPS90] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54–135, March 1990.
[GR88] A. Gibbons and W. Rytter. Efficient Parallel Algorithms. Cambridge University Press, 1988.
[GS96] A. V. Gerbessiotis and C. J. Siniolakis. Deterministic sorting and randomized median finding on the BSP model. In Proceedings of the 8th ACM SPAA, pages 223–232, 1996.
[GvdG96] B. Grayson and R. van de Geijn. A high performance parallel Strassen implementation. Parallel Processing Letters, 6(1):3–12, 1996.
[Had57] H. Hadwiger. Vorlesungen über Inhalt, Oberfläche und Isoperimetrie. Number 93 in Grundlehren der mathematischen Wissenschaften. Springer-Verlag, 1957.
[HJB] D. R. Helman, J. JáJá, and D. A. Bader. A new deterministic parallel sorting algorithm with an experimental evaluation. Journal of Experimental Algorithmics. To appear.
[HK71] J. E. Hopcroft and L. R. Kerr. On minimizing the number of multiplications necessary for matrix multiplication. SIAM Journal of Applied Mathematics, 20(1):30–36, January 1971.
[JaJ92] J. JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
[JKS93] H. Jung, L. Kirousis, and P. Spirakis. Lower bounds and efficient algorithms for multiprocessor scheduling of directed acyclic graphs with communication delays. Information and Computation, 105(1):94–104, July 1993.
[JM95] D. B. Johnson and P. Metaxas. A parallel algorithm for computing minimum spanning trees. Journal of Algorithms, 19(3):383–401, November 1995.
[Joh77] D. B. Johnson. Efficient algorithms for shortest paths in sparse networks. Journal of the ACM, 24(1):1–13, January 1977.
[KHSJ95] B. Kumar, C.-H. Huang, P. Sadayappan, and R. W. Johnson. A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction. Scientific Programming, 4(4):275–289, 1995.
[KKT95] D. R. Karger, P. N. Klein, and R. E. Tarjan. A randomized linear-time algorithm to find minimum spanning trees. Journal of the ACM, 42(2):321–328, March 1995.
[KR90] R. M. Karp and V. Ramachandran. Parallel algorithms for shared memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, chapter 17, pages 869–941. Elsevier, 1990.
[KRS90] C. P. Kruskal, L. Rudolph, and M. Snir. A complexity theory of efficient parallel algorithms. Theoretical Computer Science, 71(1):95–132, March 1990.
[KS96] J. Komlós and M. Simonovits. Szemerédi's Regularity Lemma and its applications in graph theory. Technical Report 96-10, DIMACS, 1996.
[LD90] S. Lakshmivarahan and S. K. Dhall. Analysis and design of parallel algorithms: Arithmetic and matrix problems. McGraw-Hill series in supercomputing and parallel processing. McGraw-Hill, 1990.
[LD94] S. Lakshmivarahan and S. K. Dhall. Parallel computing using the prefix problem. Oxford University Press, 1994.
[LLS+93] X. Li, P. Lu, J. Schaeffer, J. Shillington, and Y. Wong. On the versatility of parallel sorting by regular sampling. Parallel Computing, 19:1079–1103, October 1993.
[LM88] C. E. Leiserson and B. M. Maggs. Communication-efficient parallel algorithms for distributed random-access machines. Algorithmica, 3:53–77, 1988.
[LMR96] Zhiyong Li, P. H. Mills, and J. H. Reif. Models and resource metrics for parallel and distributed computation. Parallel Algorithms and Applications, 8:35–59, 1996.
[LW49] L. H. Loomis and H. Whitney. An inequality related to the isoperimetric inequality. Bulletin of the AMS, 55:961–962, 1949.
[McC93] W. F. McColl. General purpose parallel computing. In A. M. Gibbons and P. Spirakis, editors, Lectures on Parallel Computation, volume 4 of Cambridge International Series on Parallel Computation, chapter 14, pages 337–391. Cambridge University Press, 1993.
[McC95] W. F. McColl. Scalable computing. In J. van Leeuwen, editor, Computer Science Today: Recent Trends and Developments, volume 1000 of Lecture Notes in Computer Science, pages 46–61. Springer-Verlag, 1995.
[McC96a] W. F. McColl. Universal computing. In L. Bougé et al., editors, Proceedings of Euro-Par '96 (I), volume 1123 of Lecture Notes in Computer Science, pages 25–36. Springer-Verlag, 1996.
[McC96b] W. F. McColl. A BSP realisation of Strassen's algorithm. In M. Kara, J. R. Davy, D. Goodeve, and J. Nash, editors, Abstract Machine Models for Parallel and Distributed Computing, pages 43–46. IOS Press, 1996.
[McC96c] W. F. McColl. Private communication, 1996.
[McC98] W. F. McColl. Foundations of time-critical scalable computing. In Proceedings of the 15th IFIP World Computer Congress. Österreichische Computer Gesellschaft, 1998.
[Mit70] D. S. Mitrinović. Analytic Inequalities. Number 165 in Grundlehren der mathematischen Wissenschaften. Springer-Verlag, 1970.
[MMT95] B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of parallel computation: a survey and synthesis. In Proceedings of the 28th Hawaii International Conference on System Sciences, volume 2, pages 61–70. IEEE Press, 1995.
[Mod88] J. J. Modi. Parallel Algorithms and Matrix Computation. Oxford applied mathematics and computing science series. Clarendon Press, 1988.
[MP88] B. M. Maggs and S. A. Plotkin. Minimum-cost spanning tree as a path-finding problem. Information Processing Letters, 26(6):291–293, January 1988.
[MR85] G. L. Miller and J. H. Reif. Parallel tree contraction and its applications. In Proceedings of the 26th IEEE FOCS, pages 478–489, 1985.
[MT] W. F. McColl and A. Tiskin. Memory-efficient matrix multiplication in the BSP model. Algorithmica. To appear.
[Ort88] J. M. Ortega. Introduction to Parallel and Vector Solution of Linear Systems. Frontiers of Computer Science. Plenum Press, 1988.
[Pat74] M. S. Paterson. Complexity of product and closure algorithms for matrices. In Proceedings of the International Congress of Mathematicians, volume 2, pages 483–489, 1974.
[Pat93] M. S. Paterson. Private communication, 1993.
[PU87] C. H. Papadimitriou and J. D. Ullman. A communication-time tradeoff. SIAM Journal of Computing, 16(4):639–646, August 1987.
[PY90] C. H. Papadimitriou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal of Computing, 19(2):322–328, April 1990.
[Ram99] V. Ramachandran. A general purpose shared-memory model for parallel computation. In M. T. Heath, A. Ranade, and R. S. Schreiber, editors, Algorithms for Parallel Processing, volume 105 of IMA Volumes in Mathematics and Applications. Springer-Verlag, 1999.
[RM96] M. Reid-Miller. List ranking and list scan on the Cray C-90. Journal of Computer and System Sciences, 53(3):344–356, December 1996.
[RMMM93] M. Reid-Miller, G. L. Miller, and F. Modugno. List ranking and parallel tree contraction. In J. H. Reif, editor, Synthesis of Parallel Algorithms, chapter 3, pages 115–194. Morgan Kaufmann, 1993.
[Rot90] G. Rote. Path problems in graphs. Computing Supplementum, 7:155–189, 1990.
[Sch73] A. Schönhage. Unitäre Transformationen großer Matrizen. Numerische Mathematik, 20:409–417, 1973.
[Sib97] J. F. Sibeyn. Better trade-offs for parallel list ranking. In Proceedings of 9th ACM SPAA, pages 221–230, 1997.
[Sny98] L. Snyder. A ZPL programming guide (version 6.2). Technical report, University of Washington, January 1998.
[SS92] H. Shi and J. Schaeffer. Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing, 14(4):361–372, 1992.
[ST98] D. B. Skillicorn and D. Talia. Models and languages for parallel computation. ACM Computing Surveys, 30(2):123–169, June 1998.
[Str69] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.
[Tak98] T. Takaoka. Subcubic cost algorithms for the all pairs shortest path problem. Algorithmica, 20:309–318, 1998.
[TB95] A. Tridgell and R. P. Brent. A general-purpose parallel sorting algorithm. International Journal of High-Speed Computing, 7(2):285–301, 1995.
[Tis96] A. Tiskin. The bulk-synchronous parallel random access machine. In L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Proceedings of Euro-Par '96 (II), volume 1124 of Lecture Notes in Computer Science, pages 327–338. Springer-Verlag, 1996.
[Tis98] A. Tiskin. The bulk-synchronous parallel random access machine. Theoretical Computer Science, 196(1–2):109–130, April 1998.
[Tol97] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal of Matrix Analysis and Applications, 18(4):1065–1081, 1997.
[UY91] J. D. Ullman and M. Yannakakis. High-probability parallel transitive closure algorithms. SIAM Journal on Computing, 20(1):100–125, February 1991.
[Val89] L. G. Valiant. Bulk-synchronous parallel computers. In M. Reeve, editor, Parallel Processing and Artificial Intelligence, pages 15–22. John Wiley & Sons, 1989.
[Val90a] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.
[Val90b] L. G. Valiant. General purpose parallel architectures. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, chapter 18, pages 943–971. Elsevier, 1990.
[Zim81] U. Zimmermann. Linear and Combinatorial Optimization in Ordered Algebraic Structures. Volume 10 of Annals of Discrete Mathematics. North-Holland, 1981.