

Cilk vs MPI: Comparing two very different parallel programming styles


Sonny Tham and John Morris
School of Electrical, Electronic and Computer Engineering, The University of Western Australia, WA 6009, Australia. morris@ee.uwa.edu.au
Abstract: We measured the relative performance of two support systems for parallel programming on networks of workstations: Cilk - an extension of C with dataflow semantics - and MPI - a commonly used library for message passing. Although the two systems present significantly different parallel processing models, we attempted to code the benchmark problems in similar ways. The problems selected were matrix multiplication, travelling salesman problem, quick sort, Gaussian elimination, fast Fourier transform and finite differencing. We compared run times, speed-ups and coding efficiency as measured by lines of code in our implementations of the problems. Cilk showed a speed advantage when smaller numbers of large messages are transferred in a computation, enabling it to gain more from the underlying active messages implementation. Cilk code for algorithms with natural dataflow solutions was more compact, whereas algorithms which have simple iterative update-in-place styles (Gaussian elimination and finite differencing) were more efficiently expressed when MPI was used.
Keywords: MPI, Cilk, networks of workstations.

I. Introduction
Efficient parallel processing often depends on the identification and reduction of overheads: overheads in the communications hardware, synchronisation, algorithms and run-time systems result in lost CPU cycles and in speed-ups which may not only fail to approach the theoretical maximum but may even drop below 1.0. In this study, we measured the relative performance of two language and run-time system combinations: Cilk - an extension of C with dataflow semantics - and C programs calling routines from the Message Passing Interface (MPI). MPI and Cilk represent very different ways of achieving parallel execution, and an assessment of their strengths and weaknesses running real programs is thus valuable. The two systems have quite different characteristics: we outline their salient features in the following two sections.

A. MPI
The MPI library adds message passing capabilities to a language, allowing programmers to transmit packets of data between processors: it defines a common interface to an underlying message passing system[1]. Programs contain explicit send and receive calls. The language used has no implicit parallel capabilities: parallelism is achieved by simply starting multiple tasks on multiple processors. Receive commands block until data has been transmitted by a sender and thus provide synchronisation between parallel tasks on different processors. (Non-blocking send and receive calls are available, but they are more complex to program.) Many implementations are available[2], [3], [4]: we chose MPICH called from C[4] and wrote a custom runtime environment for our Achilles router[5].
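As a minimal illustration of this style (a sketch of our own, not taken from the benchmark implementations; the names, sizes and tags are arbitrary), rank 0 below sends a block of doubles to rank 1, which blocks in MPI_Recv until the data has arrived:

/* Minimal sketch of the blocking message-passing style described above.
   Illustrative only; not part of the benchmark codes. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char *argv[]) {
    double buf[N];
    int rank, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        for (i = 0; i < N; i++) buf[i] = (double)i;
        /* blocking send: returns once buf may safely be reused */
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive: provides the synchronisation described above */
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d doubles, last = %f\n", N, buf[N - 1]);
    }
    MPI_Finalize();
    return 0;
}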

B. Cilk
Cilk was designed in Leiserson's laboratory at MIT[6], [7]. It is a dialect of C augmented with a small number of additional keywords, such as the thread or cilk qualifier, which causes a function to be compiled for execution as a thread. The Cilk pre-processor converts a Cilk program into C, which is compiled and linked with the Cilk run-time library. The extensions provide dataflow semantics and allow a programmer to specify threads of computation - each instance of a thread is associated with a data structure known as a closure - and continuations, which point to data slots in closures. When a thread is spawned, an associated closure is created; the closure consists of:

- a pointer to the code of the computation,
- slots for data needed by the computation,
- space for local variables,
- continuations which point to slots in other closures, and
- a join counter: a count of the number of words of data still required before the computation can proceed. When it reaches zero, the closure is ready to run because it now has all the data which it requires.
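A closure can therefore be pictured roughly as the C structure below. This is an illustrative sketch only: the field names and the fixed-size slot arrays are our own simplifications, not the actual Cilk run-time definitions.

/* Illustrative sketch of a closure; field names are ours, not the
   Cilk run-time system's actual definitions. */
struct closure {
    void (*code)(struct closure *self); /* pointer to the thread's compiled code    */
    int   join_counter;                 /* words of data still missing; 0 => ready  */
    long  args[8];                      /* slots for data needed by the computation */
    long  locals[8];                    /* space for local variables                */
    struct continuation {               /* where this thread will send its results  */
        struct closure *target;         /* closure owning the destination slot      */
        int             slot;           /* index of the data slot in that closure   */
    } cont[2];
};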

Since full closures (ones for which the join counter has reached zero) contain all the data needed for a computation, they may be executed on any processor. This is the source of parallelism in Cilk: full closures are migrated to idle processors for execution. This is handled entirely by the run-time system; Cilk programmers do not need to code any parallelism explicitly. The run-time system uses a work-stealing strategy in which idle processors steal work (represented by closures) from busy processors. Work can also be explicitly distributed to specific processors, but the work-stealing load-balancing strategy has been shown to be efficient in its progress towards completion of a program. Cilk's dataflow model is well suited to NoW parallel processing because each closure, or parcel of work, consists of a description of the computation to be performed along with the actual data on which the work will be performed, and each parcel can be sent to and then executed on any host in a NoW. Threads can easily be programmed to perform relatively large amounts of computation, achieving efficient use of a NoW system even when the communication bandwidth is relatively low.
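A toy example of how this is expressed is shown below, written in the spawn/sync dialect that appears in figure 10 (the example itself is ours and is not one of the benchmark programs). Each spawn creates a closure that an idle host may steal; sync suspends the thread until the spawned children have filled in the missing results.

/* Toy example only: each spawn creates a closure that an idle host
   may steal; sync waits until both results have arrived. */
cilk int fib(int n) {
    int x, y;
    if (n < 2)
        return n;
    x = spawn fib(n - 1);
    y = spawn fib(n - 2);
    sync;
    return x + y;
}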


The dataflow model also embodies relaxed synchronization constraints: if a closure on processor A is waiting for data from several others (say B, C and D), the data can arrive in any order. Processors C and D can send their data before processor B has completed its computation, and neither C nor D need wait for A to synchronize with B before they can send data. Thus a slow response from B will not put C and D into busy-waiting loops: they are free to continue with other useful work. This is implicit in the dataflow programming model and the programmer does not need to add explicit code to allow for the possibility of different effective processing rates. The run-time system also does not have to allocate large buffers to allow large numbers of (possibly very large) messages to be received out of order: the program will allocate exactly the needed amount of space in closures waiting for data - without explicit directions from the programmer. Thus Cilk would be expected to tolerate heterogeneous processor networks and slower links better than MPI. Comparisons between a Fast Ethernet network and a much faster Achilles network in this and other work (which evaluated the Achilles network itself[8]) demonstrate Cilk's ability to tolerate slower networks. Examples of both MPI (figure 11) and Cilk code (figure 10) appear in section III.

II. Experiments
We used an MPI library (MPICH[1]) called from C. Since Cilk is based on C, the core processing code for each of the problems studied here can be made as similar as the differing parallel paradigms will allow.

A. The Myrmidons
All our experiments were conducted on the Myrmidons¹, a dedicated network of up to nine 150 MHz Intel processors. Two different communications systems were used: 1. links to a conventional Fast Ethernet hub, and 2. an Achilles cross-bar router[9], [5], [10] (see figure 1). The two network infrastructures provide very different capabilities: Fast Ethernet is readily available and very economical, but its performance is characterized by long latencies and low total bandwidth due to its bus architecture. Achilles, on the other hand, is a true cross-bar switch providing very low latency (a consequence of the simple cross-bar circuitry) and very high bandwidth (a consequence of the wide datapaths allowed by the stack's 3-D structure). A separate study compared the network capabilities of Achilles and Fast Ethernet[8], [10]. Use of the two networks in this study ensured that artefacts due to particular aspects (e.g. latency or raw bandwidth) of the communications infrastructure did not bias the comparison of the two programming styles.
¹ Achilles led an army of Myrmidons in the siege of Troy.

Fig. 1. The Achilles router stack: a high bandwidth, low latency cross-bar switch which can directly connect up to nine processors

B. Metrics
We chose a number of benchmark problems with various algorithmic styles (e.g. regular iterative, recursive, naive search, ...) and varying communications needs, so that the underlying run-time systems were exercised in various ways: they are listed in Table I and described in more detail in subsequent sections. Each problem was encoded in both Cilk and MPI. The two systems use distinct semantic models: this means that it was not possible to make the two implementations of each problem essentially identical, but we attempted to make them as similar as possible without sacrificing efficiency. Thus the basic algorithm used to solve each problem was the same wherever possible, but implementation details differed to take into account the different run-time systems. The salient details of each implementation are described in the following sections. The actual code is set out by Tham[8]. We used speed-up for multiple processors compared to execution on a single processor as the primary efficiency metric. In addition to execution speed, we were interested in the ease of programming using the two different models and counted the lines of code needed to solve the problem in each case. This is a rather basic metric and certainly not always representative of the programming effort (which includes design time) required to solve each problem. However, it does measure the relative expressiveness of the two systems. Lines of code would also be expected to correlate with the effort required to debug any program. Thus, whilst many valid criticisms of this simple metric can be made, we assert that lines of code nevertheless provides valuable information on the relative ease of use of competing systems or languages.

C. Achilles
In a set of separate experiments, Tham has measured the performance of implementations of the problems used here on our Achilles network and on Fast Ethernet[8]. Those experiments showed the benefits of Achilles' higher bandwidth and lower latency: this paper focuses on the relative performance of Cilk and MPI.
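For reference, the speed-up used as the primary metric in subsection B above is the usual ratio of single-processor to p-processor run time (the notation T_p for the measured execution time on p processors is ours):

\[
  S(p) = \frac{T_1}{T_p}
\]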



TABLE I: Benchmark problem summary

Problem                             | Parameters                             | Complexity | Number of messages | Message size | Regular? | Synchronisation
Matrix multiplication               | matrix size n                          | O(n³)      | O(p)               | O(n²)        | Yes      | End of computation
Travelling salesman                 | # cities n, sequential threshold s     | O(n!)      | O(nPs)             | O(n)         | No       | Every iteration
Quick sort                          | list length n, sequential threshold s  | O(n log n) | O(log_s n)         | O(n)         | No       | Every iteration
Gaussian elimination                | matrix size n                          | O(n³)      | O(n)               | O(n²)        | Yes      | Every iteration
Fast Fourier transform              | vector length n                        | O(n log n) | O(n)               | O(n)         | Yes      | Last log₂p iterations
Finite differencing (per iteration) | matrix size n                          | O(n²)      | O(p)               | O(n)         | Yes      | Every iteration

p denotes the number of processors participating in the computation. xPy denotes the number of permutations of y items which can be generated from a list of x items. Full details, code listings etc. may be found in Tham[8].

D. Problems and Implementations

D.1 Matrix multiplication
Matrix multiplication is a simple problem whose parallelism is easily realized, but only at the expense of communication overhead: with an O(n²) communication cost in an O(n³) algorithm, the computation of a matrix product C = AB only shows significant speed-up for large matrices (see figure 2). On the Myrmidons, matrix sizes of 300 × 300 were required before speed-ups approached the ideal values.
MPI version
The message passing version of this algorithm uses a master controlling process and p slave processes. Each of the p processes is spawned on a separate host. For simplicity, the master controlling process was assigned to a separate host. The master process distributes the B matrix and appropriate stripes of the A matrix to each of the p slave processes. The slave processes begin the computation of their stripe of the C matrix as soon as they have received all the required data. When the slave processes have completed their stripe of the C matrix, they immediately return the result to the master process, which will be waiting to collect the results.
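A simplified sketch of the master's side of this distribution pattern is given below. It is illustrative only: the names, message tags and the assumption that p divides the matrix dimension are our own choices, not the benchmark code itself.

#include <mpi.h>

#define N 300                       /* matrix dimension                            */

/* p slaves (ranks 1..p); slave r owns rows r*N/p .. (r+1)*N/p - 1 of A and C. */
void master(double A[N][N], double B[N][N], double C[N][N], int p)
{
    int r, rows = N / p;            /* assume p divides N for brevity              */
    MPI_Status status;

    for (r = 0; r < p; r++) {
        /* whole of B, then this slave's stripe of A                               */
        MPI_Send(&B[0][0],      N * N,    MPI_DOUBLE, r + 1, 0, MPI_COMM_WORLD);
        MPI_Send(&A[r*rows][0], rows * N, MPI_DOUBLE, r + 1, 1, MPI_COMM_WORLD);
    }
    for (r = 0; r < p; r++)         /* collect the finished stripes of C           */
        MPI_Recv(&C[r*rows][0], rows * N, MPI_DOUBLE, r + 1, 2,
                 MPI_COMM_WORLD, &status);
}

The slave side mirrors this: each slave receives B and its stripe of A, computes its stripe of C, and returns that stripe to the waiting master.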

Cilk version
The Cilk version of this algorithm starts with a single master thread that spawns p computation threads corresponding to the computation of the p stripes of the C matrix. The master thread then waits for all the spawned threads to finish before it exits and the program terminates. The computation threads are placed in a ready queue on the same host as soon as they are spawned. These threads can then be stolen from the host by idle nodes. At the problem size chosen, stealing is faster than computation of any sub-matrix of the product, and all p computation threads are stolen by idle processors before the master processor can extract a second sub-matrix computation from the ready queue. Note that Cilk allows explicit allocation of work to processors: this would be more efficient for a regular problem like this one. However, the default load-balancing strategy is both simpler for the programmer and more efficient when the individual processors have different processing capability, whether due to different CPU power or to effective power reduced by other work on a host. So, although it places Cilk at a slight disadvantage in this controlled environment (a homogeneous processor set and no competing work load), using the default strategy makes for a more realistic test.
We have plotted the effect of problem size on speed-up for the Cilk and MPI versions of the matrix multiplication problem (figure 2) and speed-up against the number of processors (figure 3). In both graphs, the Cilk version's better performance is evident. Cilk's active-message-based run-time system has lower overheads: it has one less layer of data buffering. Active messages also allow better overlap of computation and communication on a multitasking platform. In this case, there are large blocks of data to be transferred at startup and completion: messages do not need to arrive in a prescribed order and active messages deposit data directly into the destination data structure. Even in the relatively homogeneous experimental environment provided by the Myrmidons, operating system overheads will ensure that each host provides slightly different numbers of cycles for computation in any short period, meaning that there are always hosts that are effectively slower than others.


Fig. 2. Matrix multiplication: speed-up as a function of problem size on 3 processors

Fig. 3. Matrix multiplication: Mean speed-up for Cilk and MPI vs number of processors

Thus, although the computation cannot complete until the slowest processor has completed its sub-task, none of the faster processors need wait for a slower processor to finish before they can synchronize and transfer their data: they transfer a block of the result matrix directly into the data structure on the master processor as soon as they have finished.

D.2 Travelling salesman
The travelling salesman problem is a classic hard problem: its complexity is O(n!). The data set has only n integers, i.e. it is very small, and thus data transmission demands are relatively small. Almost perfect speed-ups are therefore easily obtained with large problems, in which n processors are set to work on problems of size n − 1. For our experiments, the problem was set up so that work was distributed to slave processors until a threshold number of cities, s, remained to be evaluated.

At this point, the algorithm completed the evaluation of the best tour for these s cities on a single processor. We included one optimisation to the standard brute-force search: evaluation of a subtour is immediately abandoned if its cumulative distance is greater than the minimum total tour distance already found. This optimisation requires processors to broadcast the cost of a tour when one is discovered which is cheaper than the previous global minimum cost.
MPI version
The message passing implementation uses a master process that generates subtours for sequential calculation and places them in a task queue from which they are extracted by worker processes. When a worker thread requests a parcel of work, it is sent the next viable subtour for sequential calculation and the current global minimum. Whenever results are passed back to the master process, the global minimum is updated if needed. For ease of implementation, the master process was run on its own host and each of the p slave processes was also assigned its own processor. The global minimum was updated at a regular synchronization point.
Cilk version
The Cilk version was very easily, yet efficiently, implemented. Cilk's dataflow model of computation maps very well to algorithms that are parallelised using recursion for task decomposition. An initial thread spawns n threads to evaluate the subtours corresponding to the n different possible starting cities. Each of those threads spawns another n − 1 threads, adding an additional unvisited city to the head of the tour to be evaluated, and so on recursively until the sequential threshold is reached. The remaining subtour is then evaluated sequentially and results are sent back up the recursion tree. The Cilk version of the algorithm was able to gain more from the global-minimum optimisation because, when a new global minimum was detected, active messages were sent immediately to update the global minimum on all nodes. The MPI version could only update the global minimum on computation nodes at the completion of each sequential search.
Our implementations allowed the communication:computation ratio to be adjusted easily, with small values of the threshold, s, generating large numbers of communications events and stressing the network. With 16 cities and a sequential threshold of 15, figure 4(a) shows very little dependence on the network capability, with the Achilles and Fast Ethernet networks showing similar speed-ups. However, there is a significant difference between Cilk and MPI: the Cilk implementation was better able to overlap communication and computation, so the optimisation which distributed new global minima to all processors was more efficient and individual processing units were able to abandon infeasible candidate tours more quickly.


Fig. 4. Travelling salesman problem: speed-up for Cilk and MPI using Fast Ethernet and Achilles networks, n = 16, s = 15

Fig. 5. Travelling salesman problem: speed-up vs sequential threshold (16 cities, 3 processors)

When the sequential threshold was reduced, the number of messages increased dramatically and the results are less conclusive (figure 5): for three processors, Cilk had only a slight edge for n = 16, s = 15, which was lost for n = 16, s = 14 and regained at s = 13. The Cilk implementation switches context between an I/O thread and a computation thread when messages are received: if the context switching rate becomes too high, we observed greatly increased overheads (see the next section), and this degrades Cilk's performance. Elimination of this problem should allow Cilk's performance to increase. However, in this problem the messages are extremely small, so the reduced data copying that active messages allow is not a major factor and both implementations perform similarly.

D.3 Quick Sort
A quick sort has a computational complexity of O(n log n). A problem size of 2,097,152 floating point elements (8 Mbytes of data, or half the RAM available on each workstation) was chosen. The benchmark implementation recursively partitions the full list into pairs of sublists which may be distributed to other processors; these continue to partition the data until a threshold is reached, at which point the sort is completed sequentially on the current processor². The threshold for reversion to sequential computation was set at 131,072 elements (512 kbytes) to allow approximately 16 sequential tasks to be generated for parallel sorting.
² Note that this may not be the optimum parallel sorting algorithm: our aim here was to compare Cilk and MPI in a communication-bound environment.
MPI version
A task queue model was used: a master process partitions the list to be sorted until sublist sizes have fallen below the sequential threshold. These sublists are added to a task queue for sorting by a slave process. The slave processes continually request work from the master process (at the same time returning results from previous tasks) until there are no more sublists to process.
Cilk version
The Cilk version of this algorithm was simply implemented. The initial thread partitions the list and spawns two child threads that in turn partition their lists, and so on. Spawning of threads continues in this fashion until the length of a sublist falls below the sequential threshold and the sublist is sorted sequentially. Threads are automatically stolen from the ready queues of busy machines by idle machines.

Fig. 6. Quick sort: speed-up for Cilk and MPI as a function of number of processors

Figure 6 shows the less than ideal speed-ups expected for quick sort: with 6 processors, we see only 2.3 for Achilles and 1.6 for Fast Ethernet. The speed-up exhibited by the Cilk implementation reached a ceiling at 1.6 (see figure 6).




This was caused by the Cilk run-time system running on a uniprocessor host poorly handling context switching between the two running user-level threads (one for computation and one for communication, to allow communication and computation to overlap). Our implementation of the Cilk run-time system for NoWs uses the Linux pthreads library[11] for multithreading. A number of problems were observed with the performance of pthreads, in particular when two processor-hungry threads are competing for CPU time: the speed of context switching slows to 40% of normal and, occasionally, a single context switch can take over 10 times the average switching time. This problem with the Cilk run-time system running on NoWs that rely on user-level multi-threading has been previously identified³[12]: it is one of the reasons that continued development of Cilk for NoWs has stalled in recent years. This aspect of the performance of the Cilk run-time system for NoWs could be improved by moving the communications handling functions into the operating system kernel or network device driver.
³ It is believed to originate in the kernel scheduler of Linux kernel 2.0.x.
The MPI implementation of the quick sort does not attempt to overlap computation with communication. Messages also became smaller as the number of processors increased, so that multiple data copying by MPI did not have as great an adverse effect, allowing MPI's speed-up to continue to increase.

D.4 Fast Fourier Transform
The FFT algorithm implemented here uses the O(n log n) Cooley-Tukey decimation-in-time algorithm[13], which recursively subdivides the problem into its even and odd components until the length of the input is 2. This base case is a 2-point discrete Fourier transform (DFT), whose output is a linear combination of its inputs. The Cooley-Tukey method requires a vector whose length is a power of 2. For a vector of n points, log₂n passes are required for the complete transform. The first pass performs n/2 2-point DFTs and in each subsequent iteration the number of 2-point DFTs is halved. A Cooley-Tukey FFT has a computational complexity of O(n log n): there are log₂n passes, with O(n) arithmetic operations performed per pass. A simple parallel decomposition for p processors (p = 2^k) allows the first log₂n − log₂p passes to proceed without interprocessor communication. Before the beginning of each of the last log₂p passes, each processor exchanges the results of the previous iteration with one other processor. The first log₂n − log₂p passes require no interprocessor communication except the transmission of the initial vector, so a communication cost of O(n) is amortized over O(n log n) computations, assuming n ≫ p. For each of the final log₂p passes, a total of n points must be transferred, on which O(n) arithmetic operations are performed. Overall, there are O(log n) computations per data point communicated: this is lower than, for example, matrix multiplication, but the computations are considerably more complex, allowing reasonable speed-ups to be achieved.
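For reference, the 2-point DFT base case referred to above combines the transforms of the even-indexed and odd-indexed sub-sequences (written E_k and O_k here; the notation is ours) with the standard decimation-in-time butterfly:

\[
  X_k = E_k + e^{-2\pi i k/n}\,O_k, \qquad
  X_{k+n/2} = E_k - e^{-2\pi i k/n}\,O_k, \qquad 0 \le k < n/2 .
\]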

Fig. 7. FFT: speed-up vs number of processors

MPI version
Each of the worker processes is assigned n/p data points. The worker processes all enter the first phase: log₂n − log₂p iterations which do not require interprocessor communication. They then wait on a synchronisation barrier before beginning the second phase. For each of the log₂p passes in the second phase, each processor exchanges data with one other processor at the beginning of the pass, followed by a local calculation loop.
Cilk version
The Cilk version of the algorithm is very similar to the MPI version. A master thread spawns p threads, each responsible for n/p points in the first phase of computation. The implementation relies on work stealing rather than static scheduling because the overhead of work stealing is small compared to the FFT computation time. The phase 1 threads perform the first log₂n − log₂p iterations of communication-free computation. The master thread waits for all phase 1 threads to complete before spawning p phase 2 threads. Explicit barriers and explicit data transfers were added to the Cilk code for phase 2. Again, the phase 2 threads are stolen by idle hosts, which then use explicit active messages to exchange data at the beginning of each of the log₂p passes in phase 2.
The speed-ups shown in figure 7 were measured with a vector of 2¹⁸ (262,144) points. An initial communication-intensive period is followed by a relatively long computation period, so we would not expect a significant difference between the two implementations in the first phase. Overall, the computation cost for FFT has a relatively high constant factor, resulting in close to perfect speed-ups and small differences between the two implementations.

D.5 Gaussian Elimination
This algorithm solves a system of n linear equations in n variables. The result is a square matrix decomposed into upper and lower triangular submatrices.



For an n × n matrix, the computation requires O(n³) time. The algorithm repeatedly eliminates elements of the matrix beneath successive diagonal elements. The number of rows and columns to be processed begins at n and falls by one on each successive pass. The pseudo-code for the sequential algorithm is:

  for i := 1 to n do
    for j := i+1 to n do
      for k := n+1 downto i do
        a[j][k] := a[j][k] - a[i][k]*a[j][i]/a[i][i];

Partial pivoting is usually employed to improve numerical stability. Before entering the j loop, the row with the largest absolute value in the currently active (i-th) column is swapped with the row currently containing a[i][i]. The aim is to make the active diagonal element as large as possible.
In the parallel version, stripes of columns are assigned to different processors. Each processor performs a part of the j loop in each iteration. For each iteration of the i loop, the processor that is responsible for the i-th column performs a local pivoting operation, then sends the index of the pivot row and the data in the pivot column to all other processors. The other processors then use this index to perform their own local pivoting operations and the pivot column data to compute the k loop. As the computation progresses, columns on the left side of the matrix are progressively completed and take no further part in the computation. Thus, if a simple striping technique is used, processors progressively drop out of the computation. A further optimization allocates narrow stripes to processors in a round-robin fashion to ensure that none become idle.
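For concreteness, a sequential C rendering of the elimination loop, including the partial pivoting step described above, might look as follows. This is a sketch under our own assumptions (0-based indexing, an N x (N+1) augmented matrix and illustrative names), not the benchmark code.

#include <math.h>

#define N 100                       /* number of equations (illustrative)           */

/* a is the N x (N+1) augmented matrix; eliminate below the diagonal. */
void gauss_eliminate(double a[N][N + 1])
{
    int i, j, k;
    for (i = 0; i < N; i++) {
        /* partial pivoting: bring the largest |a[r][i]|, r >= i, up to row i       */
        int pivot = i;
        for (j = i + 1; j < N; j++)
            if (fabs(a[j][i]) > fabs(a[pivot][i]))
                pivot = j;
        if (pivot != i)
            for (k = i; k <= N; k++) {
                double t = a[i][k]; a[i][k] = a[pivot][k]; a[pivot][k] = t;
            }
        /* elimination step corresponding to the pseudo-code's j and k loops        */
        for (j = i + 1; j < N; j++) {
            double f = a[j][i] / a[i][i];
            for (k = N; k >= i; k--)
                a[j][k] -= f * a[i][k];
        }
    }
}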

Fig. 8. Gaussian elimination: speed-up for Cilk and MPI vs number of processors

MPI version
The p worker processes were each assigned 10-column stripes in a round-robin fashion until all columns had been assigned. The i loop then starts on all processors. For each iteration, the processor responsible for the i-th column performs a pivoting operation locally and then sends the pivot row number and the entire pivot column to the other processors, which will all be waiting. The other processors can then execute a pivoting operation locally. All processors then perform the k loop on the columns assigned to them.
Cilk version
The Cilk version of this algorithm does not use a strict dataflow decomposition of the computation. Instead, p computation threads are spawned explicitly on p processors and data is transferred using explicit active messages. This makes the implementation very similar to the MPI version.
MPI consistently showed significantly better performance for Gaussian elimination: a key factor here is that the Cilk version was programmed with explicit active messages (rather than the implicit ones that are created for other problems) and is able to gain little from its efficient handling of active messages. It now wastes more time in context switches, which is reflected in the smaller measured speed-ups. Message sizes also decrease as the number of processors increases and as the computation proceeds, reducing the effective cost of additional copying by MPI.

D.6 Finite Differencing
Finite differencing is an iterative algorithm in which, on each iteration, each element in a matrix is updated to take the value of the average of its four nearest neighbours (above, below, left and right). Iteration continues until a convergence criterion is reached. The Jacobi algorithm[14] implemented in this study uses two stages for each iteration. In the first stage, new values are computed entirely from old values. Then the new values are copied back over the old values and used in the next iteration. (The alternative, or Gauss-Seidel, method just updates elements as the computation progresses[15] and requires only one pass.) For an initial matrix of size n × n and p processors, stripes of n/p elements are sent to each processor. In each iteration, O(n²) work is done and O(n) elements from the boundaries of each stripe are interchanged between processors. The initial O(n²) cost of distributing work to each processor is amortized over the many iterations needed to reach convergence and has little effect on the results. The computation per cell is simple and fast (compared to, say, matrix multiplication, which has the same computation:communication ratio) and communication is needed in each iteration (whereas matrix multiplication communicates only final results); therefore large problems of this type are needed for speed-ups to be measured: a matrix size of 900 × 900 elements was chosen here.
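A single Jacobi iteration of the kind described above can be sketched as follows. The names, the fixed boundary and the convergence test are our own illustrative choices; the parallel versions additionally exchange stripe boundary values between neighbouring processors, as described below.

#include <math.h>

#define N   900                     /* matrix dimension used in the experiments */
#define TOL 1e-6                    /* convergence tolerance (illustrative)     */

/* One Jacobi step: new values are computed entirely from old values;
   returns 1 if the largest change is below TOL (local convergence). */
int jacobi_step(double old[N][N], double new_[N][N])
{
    int i, j, converged = 1;
    for (i = 1; i < N - 1; i++)
        for (j = 1; j < N - 1; j++) {
            new_[i][j] = 0.25 * (old[i - 1][j] + old[i + 1][j] +
                                 old[i][j - 1] + old[i][j + 1]);
            if (fabs(new_[i][j] - old[i][j]) > TOL)
                converged = 0;
        }
    /* second stage: copy the new values back over the old ones */
    for (i = 1; i < N - 1; i++)
        for (j = 1; j < N - 1; j++)
            old[i][j] = new_[i][j];
    return converged;
}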



Fig. 9. Finite differencing: Cilk and MPI vs number of processors

MPI version
The message passing implementation of this algorithm uses a master process that distributes the stripes to the p processing units and then, on a per-iteration basis, collects local convergence status from each of the processing units and informs each of them of the global convergence status. When global convergence is obtained, the master process collects the resultant matrix stripes from the slave processing units. Between iterations, slave processing units exchange boundary information with each other independently of the master process.
Cilk version
The Cilk version of this algorithm does not use a strict dataflow decomposition of the computation. Instead, the master thread spawns p processing threads and assigns a stripe of the matrix to each of them. It then waits for all the processing threads to terminate before exiting. Active messages were used to update boundary values on every iteration. The processing threads work as follows: after receiving a stripe of the matrix from the master thread, each processing thread enters a loop which repeatedly performs one differencing step and a check for local convergence, and then sends active messages to update its status on all other processing units and boundary data on the processors responsible for neighbouring stripes. All processors then wait on a barrier, after which global convergence is checked. When global convergence is reached, result stripes are sent back to the master thread.
This problem has a very regular exchange of small messages (900 floating point values) and the Cilk implementation was again programmed to send explicit messages. Message size is small and is affected neither by the number of processors (except in the first and final iterations out of many) nor by progress through the computation, so the two implementations are essentially similar in all respects, leading to similar speed-ups for both, as shown in figure 9.

III. Code Complexity
To assess the ease of programming in the two systems, we counted the number of lines of code in each implementation (see Table II).

TABLE II: Lines of code in the MPI and Cilk implementations

Algorithm              | MPI | Cilk | Cilk:MPI ratio
Matrix multiplication  | 271 | 222  | 1:1.2
Travelling salesman    | 618 | 268  | 1:2.3
Quick sort             | 617 | 378  | 1:1.6
Fast Fourier transform | 290 | 327  | 1:0.9
Gaussian elimination   | 173 | 245  | 1:0.7
Finite differencing    | 521 | 577  | 1:0.9

The Cilk algorithm implementations were, in general, shorter, more concise, and hence easier to understand. In particular, the code contained far fewer communications-related statements. Some of the algorithms, such as the travelling salesman problem, quick sort, and matrix multiplication, could be decomposed easily using Cilk's dataflow approach, and the Cilk implementations of these algorithms were extremely concise. Figures 11 and 10 show code excerpts of the setup and communication sections of the MPI and Cilk implementations of the travelling salesman problem as an example. Note that the local calculation loop is identical in both versions. However, some algorithms, like the fast Fourier transform, Gaussian elimination and, in particular, finite differencing, could not be simply implemented using a strict dataflow approach. In these cases, explicit message passing using active messages was implemented in the Cilk run-time system and used to complement the dataflow decomposition techniques while retaining as much communications efficiency as possible (note the send_data_to call in the code, which has a corresponding receive_data_from handler in the run-time system). However, even in the implementations of these algorithms, the code was not significantly longer than the MPI implementations and was still very easy to understand.


int main(int argc, char *argv[]) {
  ...
  if (rank == 0) {
    ...
    do {
      worker = MPI_ANY_SOURCE;
      MPI_Recv(&mesg, 1, MPI_INT, worker, 0, MPI_COMM_WORLD, &status);
      worker = status.MPI_SOURCE;
      if (mesg == 0) {               // Worker returning results
        pending--;
        MPI_Recv(&numfound, 1, MPI_INT, worker, 30, MPI_COMM_WORLD, &status);
        MPI_Recv(&foundmin, 1, MPI_DOUBLE, worker, 30, MPI_COMM_WORLD, &status);
        MPI_Recv(&mybest, NUMCITIES+1, MPI_INT, worker, 30, MPI_COMM_WORLD, &status);
        if (foundmin < globmin) {
          globmin = foundmin;
          memcpy(&best[0], &mybest[0], (NUMCITIES+1)*sizeof(int));
        }
      } else {                       /* Worker is asking for work */
        if (numinqueue() > 0) {
          do {
            queueremove(&thistask, &thisdist);
          } while ((numinqueue() > 0) && (thisdist > globmin));
          if (thisdist > globmin) {
            mesg = 2;
            MPI_Send(&mesg, 1, MPI_INT, worker, 20, MPI_COMM_WORLD);
          } else {
            mesg = 1;                // Here is some work!
            MPI_Send(&mesg, 1, MPI_INT, worker, 20, MPI_COMM_WORLD);
            MPI_Send(&thistask[0], NUMCITIES+1, MPI_INT, worker, 20, MPI_COMM_WORLD);
            MPI_Send(&thisdist, 1, MPI_DOUBLE, worker, 20, MPI_COMM_WORLD);
            MPI_Send(&globmin, 1, MPI_DOUBLE, worker, 20, MPI_COMM_WORLD);
            pending++;
          }
        } else if (pending == 0) {
          mesg = 2;                  // No more work!
          MPI_Send(&mesg, 1, MPI_INT, worker, 20, MPI_COMM_WORLD);
        } else {
          mesg = 2;                  // Wait a while!
          MPI_Send(&mesg, 1, MPI_INT, worker, 20, MPI_COMM_WORLD);
        }
      }
    } while ((pending > 0) || (numinqueue() > 0));
    mesg = 3;                        // stop!
    for (i = 1; i < nproc; i++) {
      MPI_Send(&mesg, 1, MPI_INT, i, 20, MPI_COMM_WORLD);
    }
  } else {                           /* This is the code for the workers */
    ...
    morework = 1;
    do {
      mesg = 1;                      // ask for work
      MPI_Send(&mesg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      MPI_Recv(&mesg, 1, MPI_INT, 0, 20, MPI_COMM_WORLD, &status);
      if (mesg == 1) {
        MPI_Recv(&thistask[0], NUMCITIES+1, MPI_INT, 0, 20, MPI_COMM_WORLD, &status);
        MPI_Recv(&thisdist, 1, MPI_DOUBLE, 0, 20, MPI_COMM_WORLD, &status);
        MPI_Recv(&globmin, 1, MPI_DOUBLE, 0, 20, MPI_COMM_WORLD, &status);
        tsp_internal(thistask, thisdist, globmin,
                     &foundmin, mybest, &numfound, newtours, &thisdist);
        mesg = 0;
        MPI_Send(&mesg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(&numfound, 1, MPI_INT, 0, 30, MPI_COMM_WORLD);
        MPI_Send(&foundmin, 1, MPI_DOUBLE, 0, 30, MPI_COMM_WORLD);
        MPI_Send(&mybest[0], NUMCITIES+1, MPI_INT, 0, 30, MPI_COMM_WORLD);
      } else if (mesg == 3) {
        morework = 0;
      }
    } while (morework == 1);
  }
  ...
}

Fig. 11. Setup and communication sections of the MPI version of the travelling salesman problem.

IV. Conclusion
MPI implementations tended to outperform their Cilk counterparts when a relatively large number of small messages was transferred during the computation. This is evident in the results for Gaussian elimination and finite differencing. Large messages place a higher burden on the MPI run-time system in the overheads associated with buffering and copying; these overheads are much smaller with smaller messages. Cilk's use of active messages provided most benefit when the message sizes were larger, as in matrix multiplication and FFT. Certain algorithms coded with MPI have very clear, deterministic communications patterns and, when this is the case, as in finite differencing and Gaussian elimination, the program explicitly deals with each incoming message knowing the message content and does not need the header associated with an active message to identify the operation required on the data it has received. In such algorithms, the overhead associated with the active-message handler tags used by Cilk is unnecessary and degrades the performance of the Cilk implementations compared with their MPI counterparts.
The MPI algorithm implementations displayed fairly consistent relative performance gains when the underlying interconnection architecture was changed from Fast Ethernet to Achilles: MPI did not seem to favour one architecture over the other. The performance of the Cilk implementations was related more to the number and size of the messages transferred than to how well the algorithm was framed in Cilk's dataflow model of computation. Cilk algorithms performed better than their MPI counterparts when a relatively small number of large messages was transferred during computation, for example in the matrix multiplication and FFT algorithms. The Cilk implementation of the matrix multiplication algorithm running on Fast Ethernet performed nearly as well as the MPI version of the algorithm running on the much faster Achilles network.

The results from the FFT algorithm are also interesting: a Cilk FFT running on Fast Ethernet was slower than the MPI version running on Fast Ethernet, but on Achilles it performed as well as or better than the MPI version. In general, Cilk algorithms gained more performance from a change of underlying interconnect from Fast Ethernet to Achilles. Cilk's use of active messages has more effect when a large amount of computation can be performed on large amounts of data included in a message that can be quickly transferred between hosts.
Algorithms which are readily expressed in a dataflow form - matrix multiplication, travelling salesman and quick sort - have very compact, readily understood Cilk implementations. This is reflected in significantly reduced counts of lines of code: less than 50% of the MPI count in the travelling salesman problem. The finite differencing problem also has a very simple dataflow form, one in which a new matrix is created from the old one in every step. However, this form, directly implemented, would result in large closures which would need to be transferred between processors. A much more efficient implementation only transfers values at the boundaries of each stripe between processors using explicit messages: in Cilk, this requires explicit programming of messages, making the Cilk implementation essentially identical to the MPI one. For algorithms which have simple iterative solutions - Gaussian elimination and finite differencing (for which both the update-in-place iterative algorithm and the dataflow version are simple) - MPI implementations were less complex. Gaussian elimination is less regular (message sizes reduce as the computation proceeds) and gains more from the iterative, pure message-passing programming style than the regular finite differencing problem. Although FFT also has a natural dataflow solution, in Cilk this involves creating large numbers of closures on the master processor which are filled in from slave processors before they can migrate, making the master processor a significant bottleneck. Thus for FFT, the Cilk implementation was similar to its pure iterative MPI cousin and was slightly longer.
In summary, problems which have simple dataflow solutions and involve transfer of large blocks of data are simpler and faster in Cilk, whereas MPI handles problems with iterative solutions and smaller messages better. MPI was clearly more efficient than Cilk only in the iterative, irregular Gaussian elimination problem. It is relevant to note that later versions of Cilk[16] have provided shared-memory capabilities, which allow simple decompositions of some algorithms at significant expense in the maintenance of consistency of the shared memory.



cilk float tsp_spawn(parms p) {
  ...
  tsp_internal(p.map, p.thistour, p.cummdist,
               &mybest, besttour, &numfound,
               newtours, &newcummdist);
  if (mybest < foo)
    foo = mybest;
  if (numfound > 0) {
    for (i = 0; i < numfound; i++) {
      memcpy(&newparms.map, &p.map,
             sizeof(float)*2*NUMCITIES);
      memcpy(&newparms.thistour[0], &newtours[i][0],
             sizeof(int)*(NUMCITIES+1));
      newparms.cummdist = newcummdist;
      newres[i] = spawn tsp_spawn(newparms);
    }
    sync;
  }
  for (i = 0; i < numfound; i++) {
    if (newres[i] < mybest) {
      mybest = newres[i];
    }
  }
  return mybest;
}
Fig. 10. Setup and communication sections of the Cilk version of the travelling salesman problem

References
[1] M. Snir, MPI: The Complete Reference, MIT Press, Cambridge, MA, USA, 1996.
[2] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Computing, vol. 22, no. 6, pp. 789-828, Sep. 1996.
[3] M. Lauria, S. Pakin, and A. A. Chien, "Efficient layering for high speed communication: The MPI over Fast Messages (FM) experience," in Proceedings of Cluster Computing 2, 1999, pp. 107-116.
[4] N. J. Nevin, "The performance of LAM 6.0 and MPICH 1.0.12 on a workstation cluster," Tech. Rep. OSC-TR-1996-4, Ohio Supercomputing Center, Columbus, Ohio, USA, 1996.
[5] S. Tham, J. Morris, and R. Gregg, "Achilles: High bandwidth, low latency, low overhead communication," in Australasian Computer Architecture Conference, Auckland, New Zealand, Jan. 1999, pp. 173-184, Springer-Verlag, Singapore.
[6] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, "Cilk: An efficient multithreaded runtime system," in Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '95), Santa Barbara, California, USA, 1995.
[7] M. Frigo, C. E. Leiserson, and K. H. Randall, "The implementation of the Cilk-5 multithreaded language," in Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '98), 1998.
[8] C. K. Tham, "Achilles: A high bandwidth, low latency, low overhead network interconnect for high performance parallel processing using a network of workstations," Ph.D. thesis, The University of Western Australia, 2003.
[9] R. R. Gregg, D. Herbert, J. McCoull, and J. Morris, "Thetis: A parallel processor leveraging commercial technology," in Proceedings of the Australian Computer Science Conference, Adelaide, Feb. 1995.
[10] S. Tham and J. Morris, "Performance of the Achilles router," in Proceedings of the Asia-Pacific Computer Systems Architecture Conference, 2003.
[11] Livermore Computing, "POSIX threads programming," http://www.llnl.gov/computing/tutorials/pthreads/, 2003.
[12] MIT Laboratory for Computer Science, "Distributed Cilk home page," Cilk website (http://supertech.lcs.mit.edu/cilk/home/distcilk5.1.problems), 2002.
[13] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, Cambridge, UK, 1986.
[14] D. Young, Iterative Solution of Large Linear Systems, Academic Press, New York, USA, 1971.
[15] L. Hageman and D. Young, Applied Iterative Methods, Academic Press, New York, USA, 1981.
[16] R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall, "Dag-consistent distributed shared memory," in Proc. of the 10th Int'l Parallel Processing Symp. (IPPS '96), Apr. 1996, pp. 132-141.


