

By P F Leggett, S P Johnson and M Cross
Parallel Processing Research Group, Centre for Numerical Modelling and Process Analysis, University of Greenwich, London SE18 6PF, UK.

ABSTRACT

The Computer Aided Parallelisation Tools (CAPTools) [1] is a set of interactive tools aimed at providing automatic parallelisation of serial Fortran Computational Mechanics (CM) programs. CAPTools analyses the user's serial code and then, through stages of array partitioning, mask and communication calculation, generates parallel SPMD (Single Program Multiple Data) message-passing Fortran. The parallel code generated by CAPTools contains calls to a collection of routines that form the CAPTools Communications Library (CAPLib). The library provides a portable layer and user-friendly abstraction over the underlying parallel environment. CAPLib contains optimised message-passing routines for data exchange between parallel processes and other utility routines for parallel execution control, initialisation and debugging. By compiling and linking with different implementations of the library the user is able to run on many different parallel environments.

Even with today's parallel systems the concept of a single version of a parallel application code is more of an aspiration than a reality. However, for CM codes the data partitioning SPMD paradigm requires a relatively small set of message-passing communication calls. This set can be implemented as an intermediate thin layer library of message-passing calls that enables the parallel code (especially that generated automatically by a parallelisation tool such as CAPTools) to be as generic as possible. CAPLib is just such a thin layer message-passing library that supports parallel CM codes by mapping generic calls onto machine specific libraries (such as CRAY SHMEM) and portable general purpose libraries (such as PVM and MPI).

This paper describes CAPLib together with its three perceived advantages over other routes:
- as a high level abstraction, it is both easy to understand (especially when generated automatically by tools) and to implement by hand;
- for the CM community (who are not generally parallel computing specialists), the one parallel version of the application code is truly generic and portable;
- the parallel application can readily utilise whatever message-passing libraries on a given machine yield optimum performance.

1 Introduction
Currently the most reliable and portable way to implement parallel versions of computational mechanics (CM) software applications is to use a domain decomposition data partitioning strategy to ensure that data locality is preserved and inter-processor communication is minimised. The parallel hardware model assumes a set of processors, each with its own memory, linked in some specified connection topology. The parallelisation paradigm is single program multiple data (SPMD); that is, each processor runs the same application but on its own local data set. Of course, neighbouring processors (at least) will need to exchange data during the calculation, and this must usually be done in a synchronised manner if the parallel computation is to faithfully emulate its scalar equivalent.

One of the keys to enabling this class of parallel application is the message-passing library that enables data to be efficiently exchanged amongst the processors comprising the system. Up until the early 1990s, parallel vendors typically provided their own message-passing libraries, which were naturally targeted at optimising performance on their own hardware. This made it very difficult to port a CM application from one parallel system to another. In the early 1990s, portable message-passing libraries began to emerge. The two most popular such libraries are PVM [2] and MPI [3]. One or other, or both, of these libraries is now implemented on most commercial parallel systems. Although this certainly addresses the issue of portability, these generic message-passing libraries may give far from optimal performance on any specific system. On CRAY-T3D systems, for example, the PVM library performance is somewhat inferior to the manufacturer's own SHMEM library [4]. Hence, to optimise performance on such a system the parallel application needs to utilise the in-house library.
Although both PVM and MPI are powerful and flexible, they actually provide much greater functionality than is required by the CM community in porting their applications to commercial parallel hardware. This issue was recognised by the authors some years ago when they were working on the design phase of CAPTools [1,5,6,7,8,9], a set of automatic parallelisation tools for FORTRAN computational mechanics codes. The challenge was to produce generic parallel code that would run on any of the commercially available high performance architectures. The key factor that inhibited the generation of truly generic parallel code was the variety of the message-passing libraries and the structure of the information passed into the resulting calls as arguments.

From an extensive experience base of code parallelisation, the CAPTools team recognised that all typical inter-processor communications required by structured mesh codes (typical of CFD applications) could be addressed by a concise set of function calls. Furthermore, it transpired that these calls could be easily implemented as a thin software layer on top of the standard message-passing libraries PVM and MPI, plus a parallel system's own optimised libraries (such as Cray T3D/T3E SHMEM). Such a thin layer software library could have three distinct advantages over other routes:
- as a high level abstraction it is both easy to understand and to implement by hand;
- for the CM community (who are not generally parallel computing specialists), the one parallel version of the application code is truly generic and portable;
- the parallel application can readily utilise whichever message-passing libraries on a given machine yield optimum performance.

In this paper we describe the design, development and performance of the CAPLib message-passing software library that is specifically targeted at structured mesh CM codes. As such, we are concerned with ease of use by the CM community, portability, flexibility and computational efficiency. Such a library, even if it is a very thin layer, must represent some kind of overhead on the full scale message-passing libraries; part of the performance assessment considers this issue. For such a concept to be useful to the CM community its overhead must be minimal.

2 CAPLib Design and Fundamentals

CAPLib's primary design goal was to provide the initialisation and communication facilities needed to execute parallel Computational Mechanics code, whether parallelised manually or generated by the CAPTools semi-automatic parallelisation environment. A secondary goal is to provide a generic set of utilities that make the compilation and execution of parallel programs using CAPLib as straightforward as possible. The library is also supplied with a set of scripts to enable easy and standardised compilation of parallel code with different versions of CAPLib and simple execution of the compiled executable on different machines. This section discusses the design, features and fundamentals of the library.

2.1 Design

The different layers of software of CAPTools generated code are shown in Figure 1. CAPLib has been implemented over MPI [3] and PVM [2], the most important standard parallel communications libraries in current use, to provide an easy method of porting CAPLib to different machines. Where possible, versions of CAPLib have been developed for proprietary libraries in order to obtain maximum performance, for example the Cray SHMEM library [4] or Transtech's i860toolset library [11].

[Figure 1: software layers, from top to bottom: CAPTools generated parallel code; CAPLib API; underlying communication libraries, including MPI and the Transtech i860toolset.]
Figure 1 CAPLib Software layer

The library has been designed to meet the following criteria:

Efficient. Speed of communications is perhaps the most vital characteristic of a parallel message-passing library. Startup latency has been found to be a very important factor affecting the performance of parallel programs. The addition of layers of communication software over the hardware communication mechanism increases the startup latency of all communications. It is important, therefore, to access the communication mechanism of a machine at the lowest level possible. Each implementation of CAPLib attempts to utilise the lowest level communications API of each parallel machine in order to achieve low latency and therefore as fast communications as possible.

Portable. Code written to use CAPLib is portable across different machines. Only recompilation is necessary.

Correct. It is vitally important for parallelised computational mechanics programs to give the same answers in parallel as in serial. The commutative (global) message-passing functions provided by CAPLib are implemented so as to guarantee that the same result is seen on every processor. This can be of vital importance for the correct execution of parallel code and its successful completion. For example, a globally summed value may be used to determine the exit of an iterative loop. If the summed value is not computed in a consistent manner across all processors, then round-off error may cause some processors to continue executing the loop whilst others exit, resulting in communication deadlock.

Generic. The library is generic in the sense that decisions about which processor topology to execute on are taken at run time. CAPTools generated code compiled with CAPLib will run, for example, on 1 processor, a pipeline of 2 processors, a ring of 100 processors, or a torus of 64. The scripts provided with the library are also generic. For example, capmake and caprun are scripts that allow the user to compile and run parallel code without knowing system specific compiler and execution procedures.

Simple. The library itself has been kept as simple as possible, both in the design of the API and in its implementation. By keeping the library simple, with the minimum number of functions and the minimum number of arguments to those functions, the library is easily ported to different parallel machines. An uncomplicated interface is also more easily understood and assimilated by the user.

2.2 Parallel Hardware Model


CAPTools currently generates parallel code based on a Distributed Memory (DM) parallel hardware model, which is illustrated in Figure 2. In the CAPLib parallel hardware model, processors are considered to be arranged in some form of topology, where each processor is directly connected to several others, e.g. a pipe, ring, grid, torus or full (fully connected). Each processor is assigned a unique number (starting from 1). In the case of grid and torus topologies, each processor also has a dimensional processor number. Memory is considered local to each processor, and data is exchanged between directly connected processors via message passing of some form.

CAPTools generated parallel code can also be executed on Shared Memory (SM) systems provided, of course, that CAPLib has been ported to the system. On an SM system, each processor still executes the same SPMD program operating on different sections of the problem data. The main difference between this and operation on a DM system is that message-passing calls can be implemented inside CAPLib as memory copies to and from hidden shared memory segments. In this respect the CAPLib model differs from the usual parallelisation model used on SM machines, which assumes that every processor can directly access all memory of the problem. By restricting the memory each processor accesses and enforcing a strict and explicit ordering on the update of halo regions and calculation of global values, the CAPLib parallel hardware model ensures that there will be very little memory contention on SM systems and particularly on Distributed Shared Memory (DSM) systems.

As the number of processors becomes large, the localisation of communications becomes very important; some of the machines recently built for the Accelerated Strategic Computing Initiative (ASCI) [10], for example, have thousands of processors. Distributing data onto processors, taking into account the hardware processor topology, can localise communication between processors and thus minimise contention in the communications hardware.

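[Figure 2: a processor shown as a CPU with local memory, together with three example arrangements: a pipeline topology, a full topology, and a 2d grid topology whose processors carry dimensional numbers such as 1,1 and 2,2.]
Figure 2 CAPLib parallel hardware model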
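As a hedged illustration of the numbering used in this model, the following Python sketch derives the neighbours of a 1-based processor number in pipe and ring topologies and assigns a dimensional processor number in a 2d grid. The function names and the row-major grid numbering are assumptions made for illustration only; they are not taken from CAPLib.

```python
def pipe_neighbours(p, nproc):
    """Left and right neighbours of processor p (1-based) in a pipeline.

    None means that no neighbour exists at that end of the pipe.
    """
    left = p - 1 if p > 1 else None
    right = p + 1 if p < nproc else None
    return left, right


def ring_neighbours(p, nproc):
    """Left and right neighbours in a ring: the two ends wrap around."""
    left = nproc if p == 1 else p - 1
    right = 1 if p == nproc else p + 1
    return left, right


def grid_coords(p, ncols):
    """Dimensional (row, column) number of processor p, both 1-based.

    Row-major numbering is assumed here purely for illustration.
    """
    row, col = divmod(p - 1, ncols)
    return row + 1, col + 1
```

Under these assumptions, processor 1 of a three-processor pipe has no left neighbour, whereas in a ring of three its left neighbour is processor 3.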


2.3 Process Topologies

Knowledge of the processor topology of the parallel hardware on which a parallel code is to run is very important. It can be used to optimise the speed of, and the distance travelled by, messages between processes. CAPTools attempts to generate code that will minimise the amount of communication needed. However, to perform those communications that are required as quickly as possible, the process topology must be mapped onto the processor topology. CAPLib uses the concept of a process topology for this reason. An intelligent mapping of processes to processors will give better performance than would be possible from a random allocation. By placing processes so that most communications are needed only between directly connected neighbouring processors, the distance the communications have to travel is minimised, avoiding hot spots and maximising bandwidth. An awareness of process topology also allows for more efficient programming of global communications; for example, the use of a hyper-cube to maximise parallelism in global summations (see section 6.3).

By requiring that processes are connected in a pipe or grid type topology, it is possible for CAPTools to generate parallel code for structured mesh parallelisations using directional communications, i.e. where communication is specified as being up or down, left or right of a process rather than to a particular process id. This programming style can make it easier for the user to write and understand parallel code, especially for grids of two or more dimensions.

Where possible, CAPLib tries to use the fastest methods of communication that are available on a particular machine. It might be that communications to neighbouring processors can be made directly through fast, dedicated hardware channels.

The topology required for a particular run of a parallel program (e.g. pipe or ring) and the number of processes can be specified to the CAPLib utilities and to the parallel program at run time in a number of ways: via an environment variable; as a flag on the command line; in a configuration file; or, if none of the previous is set, by asking the user interactively. The topologies currently available from CAPLib are pipe, ring, grid, torus and full (all to all).

2.4 Messages

Each message sent and received using the CAPLib communication routines has a length, a type and a destination.

2.4.1 Message Length

The length is defined in terms of the number of items to be communicated. Zero or a negative number of items must result in no message being sent. All CAPLib communication routines check for length <= 0, as it is possible for zero or negative lengths to occur in a user's parallel code, perhaps due to values read into an application code at runtime. Putting the check inside the CAPLib calls minimises the differences between the serial and parallel application code and increases the readability of the generated code.

2.4.2 Message Type

In common with PVM and MPI, each of the communication routines in CAPLib has an integer argument specifying the type of the data being communicated. Internally, within the library, the number of items being sent is multiplied by the type size (in bytes) to calculate the length of a message. The type of a message must be known by the library functions for several reasons:

1. Generic code. Different machines have different default type sizes. Specifying the message length in terms of number of items and variable type is generic. If the message length was specified as a number of bytes this would not be generic, e.g. a message of 2 REALs on a Sun workstation would be 8 bytes in length, whilst on a Cray-T3D it would be 16 bytes. Specifying the variable type allows the library, rather than the user program, to accommodate the change in type size.

2. Compiler switches. Knowing the variable type allows the library to accommodate changes in type size caused by the use of compiler switches. If the size of types was set statically within the library then it would fail to cope with compiler switches changing a type size, e.g. real (four bytes) might be promoted to double precision (eight bytes).
For this reason, those routines within CAPLib which are dependent on the size assigned to floating point types, such as the commutative operation functions (see section 3.5), are held in a special file, capinc.F. This file is compiled alongside the user's code by the capmake utility (see section 7) with the same compiler switches, rather than being pre-compiled into the CAPTools library. The type sizes are determined dynamically at run time during the call to CAP_INIT (see section 3.1) by a call to a routine CAP_SET_TYPESIZES, also defined in capinc.F. The following code illustrates this process for the type INTEGER.
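The sketch below is not the capinc.F source; it is a minimal Python model of the same probing idea, written with the standard ctypes module. Three adjacent values of one type stand in for a three-element array, and an increasing number of bytes is copied into the middle value until the copy spills into the third. The names probe_type_size and Triple are hypothetical. The use of field offsets to locate the middle element is an artefact of modelling this in Python, since a Fortran compiler already knows where the second array element lives.

```python
import ctypes


def probe_type_size(ctype, max_bytes=64):
    """Estimate the storage size of ctype by copying ever more bytes."""

    class Triple(ctypes.Structure):
        # Three adjacent values of the same type, like a 3-element array.
        _fields_ = [("first", ctype), ("second", ctype), ("third", ctype)]

    si = Triple()
    # Fill the source with 0xFF bytes so that every copied byte is non-zero.
    ctypes.memset(ctypes.addressof(si), 0xFF, ctypes.sizeof(si))
    src = ctypes.addressof(si) + Triple.second.offset

    for nbytes in range(1, max_bytes + 1):
        ri = Triple()  # ctypes zero-initialises new instances
        dst = ctypes.addressof(ri) + Triple.second.offset
        ctypes.memmove(dst, src, nbytes)
        # Once the copy runs past "second" it alters "third", so the
        # first and third values differ: one byte too many was copied.
        if ri.first != ri.third:
            return nbytes - 1
    raise ValueError("type is larger than max_bytes")
```

For example, probe_type_size(ctypes.c_int) should agree with ctypes.sizeof(ctypes.c_int) on the machine where it is run.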


The value in SI(2) is copied to RI(2) byte by byte, with the number of bytes copied increased until RI(1) and RI(3) differ. As RI(1) and RI(3) are either side of RI(2) in memory, this means that the number of bytes copied is more than the number of bytes used to hold RI(2). This method has been found to be generic and works on every machine tested so far.

3. Heterogeneous computing. If a parallel program is sending messages within a heterogeneous environment then the size and storage of data types may differ between processors. One processor may use little endian (low bytes first) and another big endian (high bytes first) storage, i.e. bytes in a message may have to be swapped at destination or origin depending on the data type. Floating point representation may also be different; e.g. the default size might be 4 bytes on one machine and 8 bytes on another. For the library to be able to convert between different storage types it must know which type is being communicated in order to apply the correct translation. Currently the library makes the assumption that all processors are homogeneous, but the knowledge of the type of messages within the library allows heterogeneous capability to be added in the future if this is found to be desirable.

2.4.3 Message Destination

Message destination is determined by an integer argument passed in each communication call. A negative value indicates a direction; a positive value indicates a process number. The code generated by CAPTools for structured mesh parallelisations currently assumes a pipeline or grid process topology. The communication calls therefore use the negative values to indicate direction to the left or right (or up and down) of a process's position in the topology. These are available as predefined CAPLib constants such as CAP_LEFT and CAP_RIGHT for improved readability.

A characteristic of parallel SPMD code written for an ordered topology is a test for neighbour existence before communication. This is because the first processor does not have a neighbour to its left and the last processor does not have a neighbour to its right. CAPLib functions perform the necessary tests for neighbour processor existence internally to improve the readability of CAPTools generated parallel code.
Having the neighbour test within the library also reduces the possibility of error (and therefore deadlock) in any manually written parallel code. The functions also test for zero-length messages, as mentioned earlier, since this is often a possibility, so that the user avoids having to perform this chore as well. Typical hand written user code without these internal tests might look as follows:

      IF (N.GT.0) THEN
        IF (MYNUM.LT.NPROC) CALL ANY_RECEIVE(A,N*4,MYNUM+1)
        IF (MYNUM.GT.1) CALL ANY_SEND(A,N*4,MYNUM-1)
      ENDIF

where MYNUM is the processor number and NPROC is the number of processors. Using the CAPTools communications library the code becomes

where the receive communication will only take place if N is > 0 and a processor is present to the right, and similarly the send communication will only take place if N is > 0 and a processor is present to the left.
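The internal tests described above can be modelled in a few lines. The following Python sketch is an illustration under stated assumptions, not the CAPLib implementation: the numeric values given to CAP_LEFT and CAP_RIGHT are invented (the text says only that directions are negative), a pipeline topology is assumed, and the helper names are hypothetical.

```python
# Hypothetical values; the text states only that directions are negative.
CAP_LEFT = -1
CAP_RIGHT = -2


def resolve_destination(dest, mynum, nproc):
    """Return the target process number, or None if no neighbour exists.

    Models a pipeline: process 1 has no left neighbour and process
    nproc has no right neighbour. Positive values name a process directly.
    """
    if dest == CAP_LEFT:
        return mynum - 1 if mynum > 1 else None
    if dest == CAP_RIGHT:
        return mynum + 1 if mynum < nproc else None
    if dest > 0:
        return dest
    raise ValueError("unknown destination code")


def should_communicate(nitems, dest, mynum, nproc):
    """True only if the message is non-empty and its target exists."""
    if nitems <= 0:
        return False
    return resolve_destination(dest, mynum, nproc) is not None
```

With both tests folded into one place, the calling code needs no IF statements around its communication calls, which is the readability gain described in the text.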

3 Requirements For Message-Passing from Structured Mesh Based Computational Mechanics code
CAPLib satisfies the general requirements for message-passing from parallelisations of structured mesh based Computational Mechanics. The library has to provide for:

- Initialisation of required process topology
- Data partition calculation
- Termination of parallel execution
- Point to point communications
- Overlap area (halo) update operations
- Commutative operations, i.e. local value -> global value using some function
- Broadcast operations
- Algorithmic parallel pipelines

In the following sections, the general requirements for communication and parallel constructs for CM codes and the CAPLib calls that address these requirements are described, with particular emphasis on their novel aspects. To illustrate this discussion, a simple one-dimensional parallel Jacobi code (Figure 3) obtained using CAPTools is used. The CAPLib library routines are summarised in Table 1 below.

CAPTools Communication Library (CAPLib) Routine Summary
[Table 1 columns: Function Name; Type; Function Arguments; Blocking; Buffered; Cyclic. Row entries not recoverable.]

CAPLib Function Type Key
I  Initialisation, termination and control
P  Point to point communication
E  Ordered exchange communication between neighbours
S  Synchronisation on non-blocking communication
G  Global communication or commutative operation

Table 1 Summary of CAPLib Routines

Figure 3 CAPTools generated parallel code for simple 1-D Jacobi program


3.1 Initialisation, Partition Calculation and Termination

The routine CAP_INIT is called in the example code to initialise the library. It must be called before any other CAPLib function is used. This call sets up the internal channel arrays and other data structures that the library needs to access. In some implementations of the library (e.g. the PVM version) this routine is also responsible for starting all slave processes running. CAP_INIT is responsible for the allocation of processes to processors in such a manner as to minimise the number of hops between adjacent processes in the requested topology, and therefore the overall process to process communication latency, maximising communication bandwidth. CAP_INIT is also responsible for communicating information on the runtime environment, such as hostname and X Window display name, to all processes. The size of each data type is also dynamically determined by CAP_INIT.

A general requirement for message-passing SPMD code is for each parallel process to be assigned a unique number and also to know the total number of processors involved. CAP_INIT sets CAP_PROCNUM (the process number) and CAP_NPROC (the number of processes). Both variables are used internally, but can be referenced in the application code through a common block in the generated code.

The next stage is the calculation of the data assignment for each process. Adhering to the SPMD model, the partitioning of the arrays TNEW and TOLD for this example on 4 processes would require each process to be allocated a data range of 250 array elements in order for each processor to obtain a balanced workload (see, for example, Figure 4). The CAPLib function CAP_SETUPPART is passed the minimum and maximum of the accessed data range and the number of processes. It returns to each process its own unique minimum and maximum values for the partitioned data range (variables CAP_LTOLD and CAP_HTOLD in Figure 3). If the example was partitioned onto 4 processes then CAP_SETUPPART would return to process 1 the partition range 1 to 250, to process 2 the partition range 251 to 500, to process 3 the partition range 501 to 750 and to process 4 the partition range 751 to 1000.

Each process also requires an overlap region, because data assigned on one process may be used on a neighbouring process. This necessitates the communication of data assigned on one process into the overlap region of its neighbouring process. Due to the organised partition of the data, the overlap areas need only be updated from their neighbouring processes. The data partition of the partitioned array TOLD in comparison with the original un-partitioned array is shown in Figure 4.
[Figure 4: the un-partitioned array TOLD (elements 1 to 1000) shown above its partition across PE 1 to PE 4, with partition boundaries at 250/251, 500/501 and 750/751. Key: overlap area; update lower overlap; update higher overlap.]
Figure 4 Comparison of an un-partitioned and partitioned 1-D array.
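The block partition in this example can be reproduced with a short calculation. The Python sketch below is an assumption-laden illustration rather than the CAP_SETUPPART source: the function name setup_partition is hypothetical, and the rule for ranges that do not divide evenly (the lowest-numbered processes take one extra element) is a guess, since the text does not describe that case.

```python
def setup_partition(low, high, nproc, procnum):
    """Return the (low, high) index range owned by process procnum.

    procnum is 1-based. When the range does not divide evenly, the
    first few processes receive one extra element; this remainder
    rule is an assumption, not documented behaviour.
    """
    total = high - low + 1
    base, extra = divmod(total, nproc)
    # Number of elements owned by processes numbered below procnum.
    before = (procnum - 1) * base + min(procnum - 1, extra)
    size = base + (1 if procnum <= extra else 0)
    part_low = low + before
    return part_low, part_low + size - 1
```

For the range 1 to 1000 on 4 processes this yields 1 to 250, 251 to 500, 501 to 750 and 751 to 1000, matching the ranges quoted in the text.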

The routine CAP_FINISH must be called at the end of a program run to successfully terminate use of the library. On some machines, this call is necessary if control is to return to the user once the parallel run has completed.


3.2 Point to Point Communication

The CAP_SEND and CAP_RECEIVE functions perform point to point communications between two processors. Typically these functions appear in pipeline communications (see section 3.4) but are also used to distribute data across the processor topology during initialisation of scalars, arrays etc. CAPLib has a selection of communication routines that allow the user to perform point to point communications in a variety of ways. There are two main groups, blocking and non-blocking, which are discussed separately in the next sections. Each communication has the generic arguments of address (A), length (NITEMS), type (TYPE) and destination (PID), with additional arguments depending on the routine. All the point-to-point routines are summarised in Table 1.

3.2.1 Blocking Communication

Blocking communications do not return until the message has been successfully sent or received. The non-cyclic blocking communications will not communicate beyond the boundaries of the process topology when directional message destinations are given. Directions are indicated by a negative PID argument. For example, in a pipeline, the first process will not send to its left, nor the last process to its right. This is also true of ring, grid and torus (multi-dimensional ring) topologies. Where communications are required to loop around a topology such as a ring or torus, as is the case for programs with cyclic partitions, the cyclic routines can be used. These do not test for the end or beginning of a processor topology.

Buffered routines are provided so that data that is non-contiguous can be buffered and sent as a single communication. The extra arguments are STRIDE (the stride length in terms of ITYPE elements) and NSTRIDE (the number of strides). In other words, NSTRIDE lots of NITEMS elements, STRIDE elements apart, will be communicated in each call. This approach avoids the multiple startup latencies incurred by using a separate communication for each section of data.
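The buffered layout described above, NSTRIDE runs of NITEMS elements spaced STRIDE elements apart, can be sketched as a gather into one contiguous buffer and a matching scatter at the receiver. This Python model uses flat lists and 0-based offsets; the function names are hypothetical and the code is not the CAPLib packing routine.

```python
def pack_strided(a, start, nitems, stride, nstride):
    """Gather nstride runs of nitems elements, stride elements apart.

    a is a flat list and start is the 0-based offset of the first run.
    """
    buffer = []
    for s in range(nstride):
        first = start + s * stride
        buffer.extend(a[first:first + nitems])
    return buffer


def unpack_strided(buffer, a, start, nitems, stride, nstride):
    """Scatter a packed buffer back into a using the same layout."""
    for s in range(nstride):
        first = start + s * stride
        a[first:first + nitems] = buffer[s * nitems:(s + 1) * nitems]
```

For a 3 by 4 array stored row by row in a flat list, pack_strided(a, 0, 1, 4, 3) gathers the first column into a single three-element message instead of three one-element messages.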
On most platforms there is a message size dependent limit beyond which the time spent gathering and scattering data to and from buffers can be greater than the latency cost of using multiple communications. The buffered routines switch internally to non-buffered communications if this limit is exceeded. This limit is currently set statically, but in the future it is hoped to calculate an optimal limit during the call to CAP_INIT. CAPTools provides a user option to generate buffered or non-buffered communications.

3.2.2 Non-Blocking Communication

It is often the speed of communication that reduces the efficiency of parallel programs more than anything else. To improve code performance, many parallel computers allow programs to start sending (and receiving) several messages and then to proceed with other computation asynchronously whilst this communication takes place. CAPLib supports this approach by providing non-blocking sends and receives. Non-blocking communications are implemented in CAPLib using the underlying host system's non-blocking routines where possible. Where such routines are not available, non-blocking routines have been implemented using a variety of techniques, for example communication threads running in parallel with the main user code. Table 1 lists the non-blocking routines currently available in the library.

Non-blocking communication routines, e.g. CAP_ASEND, begin the non-blocking operation but return to the user program as soon as the communication has been initiated. The communication itself takes place in parallel with execution of the following user code. The arguments are the same as for the blocking communications, with the addition of a message synchronisation id as the last argument. To make sure a message has completed its journey, the user code calls a CAP_SYNC routine to test for completion, passing the destination and synchronisation id as arguments. The CAP_SYNC routines either return immediately, if a communication has finished, or wait for it to complete, if it has not. A particular communication is identified completely by the message destination and the synchronisation id.

Depending on the hardware and underlying communications library that CAPLib is ported to, the non-blocking routines can be implemented in several different ways. For some implementations, for example the PVM implementation, the synchronisation call is used to actually unpack the messages, because the underlying library does not provide a non-blocking receive using the same model as CAPLib.

Buffered non-blocking communications are also handled differently depending on the underlying library and hardware. Buffered non-blocking communications consist of two stages. A send must first pack the data into a buffer and then communicate the buffered data. A receive must first receive the buffered data and then unpack it. If the parallel processor node has a separate processor for communications that can be programmed to perform work asynchronously with the main processor, then the packing and unpacking can be performed by the communications processor and overlapped with computation done on the main processor. This relies on the communications processor having dual memory access to the main processor's memory. The benefit of this is that both stages of buffered communication are then performed in parallel with computation. The Transtech Paramid [11] is a good example of such a system.
However, it may be that the communications processor is of lower speed than the main processor and the time taken to unpack is actually longer than if the main processor had done the unpacking in serial mode itself. CAPLib therefore makes use of this approach only where it would improve performance. It is more often the case that parallel nodes consist of single processors and do not provide any direct hardware support for non-blocking buffered communications. On such systems, messages can still be received asynchronously, but the processor must do data unpacking and there is no real parallel overlapping during the packing/unpacking stage. Libraries such as MPI implement non-blocking communications on workstations using parallel threads. Although this provides the mechanism for non-blocking buffered communications, because the thread will run on the same processor, the unpacking is not actually performed in parallel, but through time slicing. Therefore no real parallel benefit on packing/unpacking is gained. If the underlying communications library used by CAPLib does not directly support buffered non-blocking communication then the unpacking must be performed at the synchronisation stage, once the buffered message has been received. CAPLib implements this by keeping a list of asynchronous communications and whenever a CAPLib synchronisation call is made, all outstanding messages from the list are unpacked. Because of the extra complexity of using non-blocking communications it is a common procedure to write or generate message-passing code that uses blocking communications as a first parallelisation attempt. Once this version has been tested thoroughly and proved to give the correct results, a non-blocking version can be produced to optimise the performance (In CAPTools this merely requires clicking on one button [8]). 
Before data that has been transmitted using non-blocking functions can be used (in the case of a receive) or re-assigned (in the case of a send), the completion of the communication involving the data must be verified. For maximum flexibility and efficiency, the model used by CAPLib for the ordering of message arrival and departure, and hence for synchronising on message completion, is as follows. Messages are sent in the order of the calls made to send to a particular destination, D. Messages are received in the order of the calls made to receive from a particular destination, D. This implies that:

Synchronisation on the sending of message Mi to destination D guarantees that messages Mi-1, Mi-2,... sent to destination D have arrived. In the example below the synchronisation using ISENDB by statement S3 on the message sent by S2 also guarantees that the message sent by S1 has arrived.

Synchronisation on the receiving of message Mj from a destination D guarantees that messages Mj-1, Mj-2... have been received at destination D. In the example below the synchronisation using IRECVB by statement S3 on the message requested by S2 also guarantees that the message requested by S1 has arrived.

Waiting for completion of a send to a destination does not guarantee that a particular receive has taken place from that destination and vice versa. In the example below the synchronisation using ISENDB by statement S3 on the message sent by S2 does not guarantee that the message requested by S1 has arrived.

Waiting for completion of a communication with a particular destination D does not guarantee that any other sends or receives to or from another destination have completed. In the example below the synchronisation using ISENDB by statement S4 on the message sent by S3 does not guarantee that the messages requested by S1 or sent by S2 have arrived.
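As a separate illustration of the four rules (it is not one of the examples referred to above), the ordering model can be sketched in a few lines of Python. The names `Channel`, `channel`, `start` and `sync` are hypothetical and are not CAPLib routines; the sketch assumes only what the text states, namely that each (destination, direction) pair delivers in call order and is independent of every other pair.

```python
from collections import deque


class Channel:
    """One direction (send or receive) of traffic with one destination."""

    def __init__(self):
        self.next_id = 0        # last synchronisation id handed out
        self.pending = deque()  # ids issued but not yet synchronised, in call order
        self.completed = 0      # every id <= completed is known to have finished

    def start(self):
        """Begin a non-blocking communication and return its synchronisation id."""
        self.next_id += 1
        self.pending.append(self.next_id)
        return self.next_id

    def sync(self, sync_id):
        """Complete sync_id; in-order delivery completes all earlier ids as well."""
        while self.pending and self.pending[0] <= sync_id:
            self.completed = self.pending.popleft()
        return self.completed >= sync_id


channels = {}  # (destination, direction) -> Channel, each fully independent


def channel(dest, direction):
    return channels.setdefault((dest, direction), Channel())


s1 = channel("left", "send").start()
s2 = channel("left", "send").start()
r1 = channel("left", "recv").start()
o1 = channel("right", "send").start()

channel("left", "send").sync(s2)  # a single synchronisation on the last send
```

After the single synchronisation on `s2`, both sends to "left" are complete, while the receive from "left" and the send to "right" are still outstanding. Testing an id that has already completed simply returns at once.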

This model is flexible enough to allow for the automatic generation of non-blocking communications within CAPTools [8]. The ability to synchronise several messages in a particular direction with one synchronisation (waiting for the last message to be sent is enough to guarantee that all previous messages have been sent) makes code generation a lot easier. It also reduces the overhead of synchronisation. The model also allows for overlapping both sends and receives simultaneously to a particular destination and for multiple tests on the same synchronisation id, which is essential for an automatic overlapping code generation algorithm. The flexibility of this model has allowed CAPTools to generate overlapping communications with synchronisation that guarantees correctness in a wide range of cases. These include loop unrolling transformations, and synchronisation and overlapping communications in pipelined loops. Code appearance is enhanced by the merger of synchronisation points, which is only possible with this communication model.

3.3 Exchanges (Overlap Area/Halo Updates)

For any array that is distributed across the process topology each process will have an overlap region in the array that is assigned on another process (see Figure 4). These overlap areas are updated when necessary. The overlap region is updated by invoking a call to CAP_EXCHANGE, which performs a similar function to the MPI call MPI_SENDRECV. This communication function will send data to a neighbouring process's overlap area as well as receiving data into its own overlap region from the neighbouring processor. CAP_EXCHANGE must ensure that no deadlock occurs and allow for non-communication beyond the edge of the process topology for the end processes. Most important is the fact that this type of communication is fully scalable, i.e. is not dependent on the number of processes, taking at most 2 steps to complete (see Figure 5). If the hardware allows non-blocking communication an exchange can be performed in 1 step by communicating in parallel.
Figure 5 Communication pattern for blocking exchange operation on 16 processors.
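The two-step pattern can be sketched as a small single-process simulation in Python. This is an illustration under stated assumptions, not the CAPLib implementation: the function `exchange` and the layout `[low_halo, owned..., high_halo]` are hypothetical, and a simple odd/even pairing of neighbours along a pipeline of processes is assumed.

```python
def exchange(blocks):
    """Update halos in place; blocks[p] = [low_halo, owned..., high_halo]."""
    nproc = len(blocks)
    steps = 0
    for first in (0, 1):  # step 1 pairs (0,1),(2,3),...; step 2 pairs (1,2),(3,4),...
        pairs = [(p, p + 1) for p in range(first, nproc - 1, 2)]
        for left, right in pairs:  # pairs within a step are disjoint, so they can run in parallel
            blocks[right][0] = blocks[left][-2]  # left's last owned value -> right's low halo
            blocks[left][-1] = blocks[right][1]  # right's first owned value -> left's high halo
        steps += 1 if pairs else 0
    return steps


NPROC, OWNED = 16, 2
blocks = [[None] + [p * OWNED + k for k in range(OWNED)] + [None] for p in range(NPROC)]
steps = exchange(blocks)
```

The step count stays at two however many processes are used, and the two end processes leave their outer halo untouched because nothing lies beyond the edge of the topology.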



A pipeline in a parallel code involves each processor performing the same set of operations on a successive stream of data. Pipelined loops are a common occurrence in parallel CM codes and are often essential to guarantee the correctness of the parallel code, for example when implementing recurrence relations. Because a pipeline serialises a loop, it must be surrounded by outer loop(s) in order to achieve parallel speed up. One disadvantage of pipelines is that some processors will be idle during the start up and shut down stages. Another is the potentially significant overhead of the numerous communication startup latencies. Figure 6 shows a simple example of a loop that has been parallelised using a pipeline.

Figure 6 Example of a Pipeline
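As a rough illustration of the start up and shut down cost, the Python sketch below simulates a pipelined recurrence A(I,J) = A(I,J) + A(I-1,J). The rows I are assumed to be block-partitioned over the processes and the outer loop over J supplies the stream of work. The function names and the simple step-counting model are assumptions made for this sketch; communication latency is not modelled.

```python
def serial_recurrence(a):
    """Reference serial loop nest."""
    for j in range(len(a[0])):
        for i in range(1, len(a)):
            a[i][j] += a[i - 1][j]
    return a


def pipelined_recurrence(a, nproc):
    """Same update with rows block-partitioned over nproc processes."""
    nrows, ncols = len(a), len(a[0])
    chunk = nrows // nproc              # assumes nproc divides nrows
    nsteps = ncols + nproc - 1          # fill, steady state and drain
    busy = 0
    for t in range(nsteps):
        for p in range(nproc):
            j = t - p                   # column that process p works on at step t
            if 0 <= j < ncols:
                # row p*chunk-1 of column j was finished by process p-1 at step t-1
                for i in range(max(1, p * chunk), (p + 1) * chunk):
                    a[i][j] += a[i - 1][j]
                busy += 1
    return nsteps, busy / (nsteps * nproc)


grid = [[1.0] * 6 for _ in range(8)]
reference = serial_recurrence([row[:] for row in grid])
nsteps, efficiency = pipelined_recurrence(grid, nproc=4)
```

With 6 outer iterations on 4 processes the pipeline needs 9 steps, so a third of the process-steps are idle in this model. A longer outer loop amortises the start up and shut down stages.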

With a low communication startup latency, good parallel efficiency can be achieved (see section 9.1).

3.5 Commutative Operations

The Jacobi example in Figure 3 uses a convergence criterion based on DIFMAX. Since the TNEW and TOLD arrays have been partitioned across the processors, each processor will calculate its own local value for DIFMAX; however, it is the global value that is needed. Collective parallel computation operations (minimum, maximum, sum, etc.) take a value or values assigned on each process and combine them with the values on all other processes into a global result that all processes receive. This is performed by the CAPLib function CAP_COMMUTATIVE, which is analogous to the MPI global reduction function MPI_ALLREDUCE. The latency for this type of communication is dependent on the number of processors. CAPLib minimises the effect of this by internally using a hyper-cube topology where possible to perform the commutative operation. The commutative routines in CAPLib are summarised in Table 1. The routine CAP_COMMUTATIVE performs the commutative operation defined by the passed routine FUNC on the data item VALUE for all processes. On entry to the routine, VALUE holds a local value on each processor; on exit it contains a global value computed from all local values across processors using routine FUNC to combine contributions. For example, the serial code to sum a vector is:

      SUM=0.0
      DO I = 1, N
        SUM=SUM+A(I)
      END DO

The parallel equivalent of this using CAP_COMMUTATIVE, given that the array A has been partitioned across processes is:

In this example CAP_LOW and CAP_HIGH are the local low and high limits of assignment to A on each processor. The procedure CAP_RADD is a predefined procedure to perform a real value addition. A list of all predefined commutative functions is given in the CAPLib user manual [12]. Each procedure has the same arguments, which means that CAP_COMMUTATIVE can call it generically. For example, CAP_RADD is defined as

The routine CAP_DCOMMUTATIVE is a derivation of CAP_COMMUTATIVE and performs a commutative operation in one dimension of a grid or torus process topology, the direction (e.g. left or up) being indicated by an additional argument. This type of commutative operation can be necessary when a structured mesh code is partitioned in more than one direction. The routine CAP_MCOMMUTATIVE provides for commutative operations on an array of data rather than one item. Combining several CAP_COMMUTATIVE calls to form one CAP_MCOMMUTATIVE call allows a corresponding reduction in latency.
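The idea of a local loop over each process's share of the array followed by a generic combining routine can be sketched in Python as follows. This is a single-process illustration only: `partition`, `commutative` and `radd` are hypothetical stand-ins, and their argument lists are not those of the CAPLib routines, which are defined in Table 1 and the user manual [12].

```python
def radd(x, y):
    """Stand-in for a predefined real addition such as CAP_RADD."""
    return x + y


def partition(n, nproc):
    """Inclusive 1-based (low, high) limits of each process's share of 1..n."""
    base, extra = divmod(n, nproc)
    bounds, low = [], 1
    for p in range(nproc):
        high = low + base + (1 if p < extra else 0) - 1
        bounds.append((low, high))
        low = high + 1
    return bounds


def commutative(local_values, func):
    """Combine one value per process in process order; all processes get the result."""
    result = local_values[0]
    for value in local_values[1:]:
        result = func(result, value)
    return [result] * len(local_values)


A = [float(i) for i in range(1, 11)]
bounds = partition(len(A), 3)
local_sums = []
for low, high in bounds:            # each iteration plays the role of one process
    s = 0.0
    for i in range(low, high + 1):  # the serial loop restricted to local limits
        s = s + A[i - 1]
    local_sums.append(s)
global_sums = commutative(local_sums, radd)
global_max = commutative([max(A[lo - 1:hi]) for lo, hi in bounds], max)
```

Because the combining function is passed in, the same `commutative` routine serves for a sum, a maximum or any other operation with the same two-argument form.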

CAPTools generates code with commutative operations whenever it can match the code in a loop to the pattern of a commutative operation. Without commutative communications, the code generated would involve complex and convoluted communication.

An interesting observation is that commutative operations performed in parallel will actually give answers with less round off than the corresponding serial code. For example, consider the summation of ten million small numbers in serial. As the summation continues, each small value is added to an increasingly large sum. Eventually the small numbers cease to have an impact on the sum because of the limited accuracy of adding a small value to a large one in computer arithmetic. The parallel version of the summation first has each processor sum its own section locally, then communicates the local sums and adds them to obtain a global sum. The accuracy will be greater since each local summation involves fewer numbers, and therefore smaller differences in magnitude, than the complete serial summation. In addition, the local sums that are added to obtain the global value will be of relatively similar size.

If this were not the case, commutative operations would not be acceptable to many users parallelising existing serial code. Part of the parallelisation process is to validate the parallel version against the serial version, and the parallelised code must produce the same results in order to pass the validation process. Although a parallel commutative operation may not produce exactly the same result as the serial one, it will at least be more accurate rather than less, and so most validation tests should be passed.

As well as getting as near as possible to the results of the serial version, commutative operations must also produce the same answer on all processes. For example, the sum of the difference between two vectors, i.e. a residual value, is often used to determine whether to terminate an iterative loop. If the calculated residual is not the same on all processors then the loop may terminate on some processes but continue on others, which will cause the parallel execution to lock. To obtain the same results on all processes, either the commutative operation must be performed in the same order on all processes, so that each incurs the same round off errors, or a single global value must be broadcast.

A common array operation is to find the maximum or minimum value and its location in the array. The equivalent commutative operation in parallel must be performed in the same order to return the same location as the serial code. If there are several occurrences of the maximum/minimum value in the array then several processes may each find their own local maximum/minimum. To resolve this, the commutative operation must know the direction in which the array is traversed. The routines CAP_COMMUPARENT and CAP_COMMUCHILD provide a mechanism for this. The argument FIRSTFOUND (see Table 1) determines how the commutative operation determines a location. If FIRSTFOUND is set to true then, for a maximum commutative calculation, it is the maximum value location found on the lowest numbered processor in the given dimension that is required on all processes. This would be the case for a serial loop running from low to high through an array. For example, consider the code in Figure 7 with data A=(/7, 9, 2, 2, 9, 5, 9/). Although there are maxima at positions 2, 5 and 7, the serial code will set MAXLOC to 2 because it uses a strict greater than test. The parallel code will similarly produce the result MAXLOC=2 on all processors.
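The effect of FIRSTFOUND on the data above can be checked with a short Python sketch. It is only a model of the behaviour described in the text: `local_maxloc` and `commutative_maxloc` are hypothetical names, the three-way partition of A is an assumption, and the (value, location) pairs are combined in ascending process order rather than through any CAPLib routine.

```python
def local_maxloc(a, low, high, first_found):
    """Scan a(low..high), 1-based inclusive, with .GT. if first_found else .GE."""
    best, loc = a[low - 1], low
    for i in range(low + 1, high + 1):
        v = a[i - 1]
        if v > best or (not first_found and v == best):
            best, loc = v, i
    return best, loc


def commutative_maxloc(a, bounds, first_found):
    """Combine per-process (value, location) pairs in ascending process order."""
    best, loc = local_maxloc(a, *bounds[0], first_found)
    for low, high in bounds[1:]:
        v, l = local_maxloc(a, low, high, first_found)
        # ties keep the lower-numbered process when first_found, else the higher
        if v > best or (not first_found and v == best):
            best, loc = v, l
    return best, loc


A = [7, 9, 2, 2, 9, 5, 9]
bounds = [(1, 3), (4, 5), (6, 7)]   # assumed partition over three processes
first = commutative_maxloc(A, bounds, True)
last = commutative_maxloc(A, bounds, False)
```

Here `first` is `(9, 2)` and `last` is `(9, 7)`, matching a serial scan over the whole array with a strict and a non-strict comparison respectively.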




If the test for the maximum value had been .GE. rather than .GT. then MAXLOC would be set to the location of the last maximum value rather than the first, and therefore FIRSTFOUND in CAP_COMMUPARENT would be set to .FALSE. CAP_COMMUPARENT works by sending the processor number of the current maximum value along with the maximum value as the commutative operation is performed among the processors. In the commutative communication algorithms, the location for the COMMUPARENT is also packed into the message. CAP_COMMUPARENT internally stores the processor that owns the desired location(s). This processor is then used in any number of calls to CAP_COMMUCHILD to broadcast the correct value to all processors.

3.6 Broadcast Operations

Broadcast operations are used to move data from one process to all other processes. The simplest of these is a broadcast of data from the first process to all others, termed a master broadcast. CAPLib provides the CAP_MBROADCAST routine to do this. In fact, rather than sending data directly to all processes from the master process, the master broadcast will use the same communication strategies as the CAP_COMMUTATIVE call. These strategies, described in section 6, take advantage of the internal process topology to reduce the number of communications and steps to complete the operation. A second type of broadcast is the communication of data from a particular process to all others. CAPLib provides the routine CAP_BROADCAST to do this. The OWNER argument is passed in set to true for the process owning the data and false for all others. CAP_BROADCAST is implemented currently as a COMMUTATIVE MAX style operation on the OWNER argument to tell every process which particular process is the owner of the data to be broadcast. The data is again transmitted from the owning process to the other processes in an optimal fashion using the internal process topology.

4 CAPLib on the Cray T3D/T3E

4.1 Implementation

CAPLib has been ported to the Cray T3D and T3E using PVM, MPI and the SHMEM library. The SHMEM library version is described below. Of the three, the SHMEM CAPLib is by far the fastest, its latency and bandwidth being a reflection of the performance of the SHMEM library. Typical latency is under 7 µs and bandwidth greater than 100 MB/s for large messages on the T3D, and 5 µs and 300 MB/s on the T3E. The SHMEM version of CAPLib is written in C rather than FORTRAN because of the need to do indirect accessing. Synchronous message passing was implemented using a simple protocol built on the Cray SHMEM_PUT library routine, which is faster than SHMEM_GET. Figure 8 shows this protocol used to send data between two processors.

Figure 8 CAPLib protocol used for communication on T3D/T3E using SHMEM library. (Panels: processor sending data, interconnect, processor receiving data.)

The receiving processor first writes the starting address it wishes to receive data into to a known location on the sending processor and then waits for the sending processor to write the data and send a write-data-complete acknowledgement. The sending processor waits on a tight spin lock (busy wait loop) for a non-zero value in the known location. When the address has arrived it uses SHMEM_PUT to place its data directly into the address on the receiving processor. The sending processor then calls SHMEM_QUIET to make sure the data has arrived and then sends a write-data-complete acknowledgement to the receiving processor. The pseudo code for this procedure is shown in Figure 9.
    send(a, n, cn)   /* send data a(n) to channel cn (processor cn2p(cn)) */
    {
        pe = cn2p(cn)
        /* wait for address from receiving processor to arrive in addr(pe) */
        while (!addr(pe))
            ;
        /* send data */
        shmem_put(*addr(pe), a, n, pe)
        /* wait */
        shmem_quiet()
        /* ack send complete */
        shmem_put(ack(mype), 1, 1, pe)
        /* reset address */
        addr(pe) = 0
    }

    recv(b, n, cn)   /* recv data b(n) from channel cn (processor cn2p(cn)) */
    {
        pe = cn2p(cn)
        /* place recv address in sending pe at address addr(pe) */
        shmem_put(addr(mype), &b, 1, pe)
        /* wait for data ack to arrive */
        while (!ack(pe))
            ;
        /* reset ack */
        ack(pe) = 0
    }

Figure 9 Pseudo code for send/recv using Cray SHMEM calls
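The handshake can be exercised without Cray hardware using two Python threads that share memory. This is a sketch of the protocol only: the lists `addr` and `ack` stand in for the symmetric arrays, plain assignments stand in for SHMEM_PUT, and no SHMEM call is used. One deliberate difference from Figure 9 is that the sketch resets the address slot before raising the acknowledgement, so that a receiver posting its next buffer immediately cannot have it overwritten.

```python
import threading
import time

NPES = 2
addr = [[None] * NPES for _ in range(NPES)]  # addr[pe][src]: buffer posted on pe by src
ack = [[0] * NPES for _ in range(NPES)]      # ack[pe][src]: flag set on pe by src


def send(mype, pe, data):
    while addr[mype][pe] is None:   # spin until receiver pe has posted a buffer
        time.sleep(0)
    buf = addr[mype][pe]
    addr[mype][pe] = None           # reset the address slot for the next message
    buf[:len(data)] = data          # stands in for the put of the payload
    ack[pe][mype] = 1               # stands in for the put of the completion flag


def recv(mype, pe, buf):
    addr[pe][mype] = buf            # post the receive buffer in the sender's memory
    while not ack[mype][pe]:        # spin until the write-data-complete flag arrives
        time.sleep(0)
    ack[mype][pe] = 0               # reset the flag


inbox = [0] * 4
receiver = threading.Thread(target=recv, args=(1, 0, inbox))
sender = threading.Thread(target=send, args=(0, 1, [1, 2, 3, 4]))
receiver.start()
sender.start()
sender.join(timeout=5)
receiver.join(timeout=5)
```

Indexing both arrays by the partner's processor number means each slot has exactly one writer on the remote side, which is the property the text relies on for all-to-all communication.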

To obtain maximum performance all internal arrays and variables involved in a communication are cache aligned using compiler directives. To avoid any conflicts in all-to-all communications the variables used to store addresses and act as acknowledgement flags are all declared as arrays with the T3D processor number being used to reference the array elements. In this way each send address and data acknowledgement can only be set by one particular processor. Asynchronous communication has been partially implemented by removing the wait on the write-data-complete acknowledgement in the receive and placing it in CAP_SYNC_RECV. The send operation is not currently fully asynchronous since it can not start until it receives an address from the receiving processor to send data to. Commutative operations have also been implemented using these low-level functions and the hyper-cube B method (see section 6.3) is the default commutative employed.

Pahud and Cornu [13] show that communication locality can influence the communication times in heavily loaded networks on the T3D. CAPLib uses the location of each processor within the processor topology shape allocated to a particular run to determine CAP_PROCNUM (the CAPTools processor number) in a way that minimises communication time. The numbering is chosen to provide a pipeline of processors through the 3-D topology shape so that the number of hops from processor CAP_PROCNUM to processor CAP_PROCNUM+1 is minimised. Another way of improving communication performance for some parallel programs (particularly those with all-to-all style communication) is to order the communications so that an optimum communication pattern is used, reducing the number of steps needed to perform a many-to-many operation. Unstructured mesh codes will often use this type of operation.

4.2 Performance

This section discusses the performance of the different CAPLib point-to-point and exchange message passing functions on the Cray T3D and T3E. The speed of other message passing libraries and CAPLib performance are compared where possible. Figure 10 and Figure 11 show the latency and bandwidth respectively on the T3D for SHMEM versions of CAP_SEND (synchronous), CAP_ASEND (non-blocking), CAP_EXCHANGE and CAP_AEXCHANGE (non-blocking). As a comparison, these graphs also show timings for MPI_SEND, MPI_SSEND and MPI_SENDRECV. Figure 12 and Figure 13 show similar graphs for the Cray T3E. An examination of these figures shows that CAPLib performs at least as well as the standard MPI implementation on each machine. The CAPLib SHMEM implementation is superior to using MPI or PVM calls both in latency and bandwidth. Generally the overhead of using CAPLib layered over MPI instead of direct calls to MPI is negligible. CAP_SEND implemented using SHMEM has a startup latency of around 7 µs.

The overall bandwidth obtained on the T3E for all communication measurements is far higher than that of the T3D. The bandwidth for CAP_SEND for messages of 64 Kbytes is around 116 MB/s on the T3D and 297 MB/s on the T3E. This is due to hardware improvements between the two systems. CAP_EXCHANGE has been implemented on the Cray systems under SHMEM to partially overlap the pair-wise send and receive communications it performs, and this is reflected in the bandwidth obtained: 143 MB/s on the T3D and 416 MB/s on the T3E. Note that the bandwidths for MPI_SENDRECV (50 MB/s on the T3D and 284 MB/s on the T3E) are very poor in comparison with CAP_EXCHANGE. Each performs a similar communication, a send and receive to other processors, but CAP_EXCHANGE is able to schedule its communications so as to overlap because it is based on directional communication, whereas MPI_SENDRECV communication is based on processor numbers only and is unable to do this.
The graphs for the figures are obtained by performing a Ping-Pong communication many times and taking average values. However, the non-blocking communication Ping-Pong test has synchronisation after each communication. In this respect, the non-blocking results are artificial in that they do not reflect the greater performance that will be obtained in real codes where synchronisation will generally be performed after many communications. The graphs for CAP_ASEND and CAP_AEXCHANGE therefore give a measure of the overhead of performing synchronisation on non-blocking communication and do not reflect the latency and bandwidth that is obtained in real use.

Figure 10 CAPLib communication latency on Cray T3D (data transfer time in µs against data size in REAL items)

Figure 11 CAPLib communication bandwidth on Cray T3D (bandwidth in Mbytes/sec against data size in REAL items)

Figure 12 CAPLib communication latency on Cray T3E (data transfer time in µs against data size in REAL items)

Figure 13 CAPLib communication bandwidth on Cray T3E (bandwidth in Mbytes/sec against data size in REAL items)

5 CAPLib on the Paramid

5.1 Implementation

The Transtech Paramid version of the CAPTools communications library uses the low-level Transtech/Inmos i860toolset communications library [11]. The Paramid's dual processor node architecture makes it ideal for non-blocking asynchronous communications since the Transputer part of a node can be performing communication whilst the i860 is computing. Non-blocking communications are implemented for the Paramid in CAPLib using an asynchronous router program that runs on the Transputer. To minimise latency for small non-blocking communications, the period of synchronisation between the Transputer and the i860 during initialisation of a non-blocking communication must be kept to a minimum. In addition, the effort required for the Transputer to send and receive asynchronously must be as small as possible, and as near as possible to the time for a normal direct synchronous send.

Figure 14 shows a process diagram of the router process that executes on the Transputers during runs that use the asynchronous version of CAPLib. The diagram shows the threads in the router process for sending data asynchronously down one channel. For each channel pair (IN and OUT channels), there will be two sets of these threads to allow independent communication in both directions. This arrangement is also duplicated for each channel connection to other nodes. The send and receive threads of the router process are linked by channels over the Transputer links to the destination node's corresponding send and receive threads (where more links are needed than are physically available, implicit use is made of the INMOS virtual routing mechanism [14]). For every channel the client send thread processes send requests and places them in the send request queue. A similar action is performed for receive requests by the client receive thread. The send thread removes requests from the send queue and communicates the data as soon as the corresponding receive thread on the other processor is ready to receive, i.e. when the receive thread has itself removed a request to receive from the receive request queue. The send and receive threads update an acknowledgement counter for each channel so that the user's program can synchronise on the completion of certain communications. It is worth emphasising that, using this model, communication down one channel is completely independent of communication down another. It is up to the user's program to synchronise at the correct point to guarantee the validity of data communicated in each direction.
Figure 14 Transputer router process for asynchronous communication on Transtech Paramid. (Each TTM200 TRAM (i860/Transputer) node runs a router process on the Transputer containing client send and receive request threads, a send request queue, a receive request queue, and send and receive threads. In the user's program on the i860, CAP_ASEND(A,N,1,-2,ISEND) sends a request packet (address A, length N, type 1) to the router process, and the synchronisation call CAP_SYNC_SEND(-2,ISEND) busy-waits until ISENDACK>=ISEND, where ISENDACK is incremented by one for every send. Likewise CAP_ARECEIVE(B,N,1,-1,IRECV) sends a request packet (address B, length N, type 1) to the router process, and CAP_SYNC_RECEIVE(-1,IRECV) busy-waits until IRECVACK>=IRECV, where IRECVACK is incremented by one for every receive.)
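The request queues and acknowledgement counters can be mimicked with Python's standard `queue` and `threading` modules. The sketch below is a simplification under stated assumptions: a single router thread stands in for the paired send and receive threads on two Transputers, only one direction of one channel is modelled, and the class `Link` with its methods is hypothetical rather than a CAPLib interface.

```python
import queue
import threading
import time


class Link:
    """One direction of one channel between two nodes (single merged router thread)."""

    def __init__(self):
        self.send_requests = queue.Queue()
        self.recv_requests = queue.Queue()
        self.sends_issued = self.recvs_issued = 0
        self.send_ack = self.recv_ack = 0  # play the roles of ISENDACK and IRECVACK
        threading.Thread(target=self._router, daemon=True).start()

    def _router(self):
        while True:
            data = self.send_requests.get()  # oldest outstanding send request
            buf = self.recv_requests.get()   # blocks until a receive has been queued
            buf[:len(data)] = data           # the transfer itself
            self.recv_ack += 1
            self.send_ack += 1

    def asend(self, data):
        self.sends_issued += 1
        self.send_requests.put(data)
        return self.sends_issued             # synchronisation id

    def arecv(self, buf):
        self.recvs_issued += 1
        self.recv_requests.put(buf)
        return self.recvs_issued

    def sync_send(self, sync_id):
        while self.send_ack < sync_id:       # busy wait until the counter catches up
            time.sleep(0)

    def sync_recv(self, sync_id):
        while self.recv_ack < sync_id:
            time.sleep(0)


link = Link()
b1, b2 = [0] * 3, [0] * 3
isend = link.asend([1, 2, 3])
isend = link.asend([4, 5, 6])
irecv = link.arecv(b1)
irecv = link.arecv(b2)
link.sync_send(isend)   # one synchronisation on the last id covers both sends
link.sync_recv(irecv)
```

Because the counters only ever increase, the user's program can test the same synchronisation id repeatedly, and a test on the latest id implies completion of all earlier requests on that channel.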



Figure 15 and Figure 16 give the latency and bandwidth characteristics of CAPLib on the Paramid. The best latency is around 33 µs, with the bandwidth approaching peak performance at around the 500-byte message size. Notice that the peak bandwidth of CAP_AEXCHANGE is roughly twice that of CAP_SEND, showing that it is performing its send and receive communication asynchronously in parallel. Its latency cost for small messages (~40 bytes) is higher than that of the synchronous CAP_EXCHANGE because of the extra complexity of setting up an asynchronous communication. However, in real applications the increased asynchronous latency will usually be hidden by the overall benefits of performing computation whilst communicating.

Figure 15 CAPLib latency on Transtech Paramid (data transfer time in µs against data size in REAL items)

Figure 16 CAPLib bandwidth on Transtech Paramid (bandwidth in Mbytes/sec against data size in REAL items)

6 Optimised Global Commutative Operations

As global commutative operations usually only involve the sending and receiving of very small messages, typically 4 bytes, it is the communication startup latency which will dominate the time taken to perform the commutative operation. This is because the communication startup latency is relatively expensive on most parallel machines. It is for this reason that in many parallelisations, commutative operations can be a governing factor affecting efficiency and speed up. It is extremely important, therefore, to implement commutative operations as efficiently as possible. In order to do this, the commutative routines in CAPLib take advantage of the processor topology, that is, how each processor may communicate with other processors. Many of the parallel machines on the market today are connected using some kind of topology to facilitate fast communication in hardware. For example, processors in the Cray T3D are connected to a communications network arranged as a 3-D torus. However, although the hardware is connected as a torus, there is in fact no limitation on what processors a particular processor may talk to at the hardware level; the communication hardware will route messages from one processor to another around the torus as needed. From the perspective of the methods used to perform commutative (and broadcast) operations it is this direct processor to processor topology that is important, not the underlying hardware topology that implements it. This means, for example, that although the Cray T3D is based on a 3D Torus, for commutative operations internally within CAPLib it is considered fully connected. The commutative topology used internally within CAPLib will therefore depend on the direct processor-to-processor routing available on the machine the program is running on. The commutative methods available are then directly related to the commutative topology. Currently CAPTools supports a pipe, ring, grid and two different hyper-cube commutative methods. 
In order to compare the efficiency of each method we define the following:

P = The number of processes.
C = The total number of communications for a commutative operation.
S = The total number of steps involved in the method.

We define a step as a number of communications performed in parallel such that the time/latency of all communications is equivalent to that of one communication. Some communication devices are serial devices, only allowing one communication at a time. For example, the Ethernet connecting workstations is a serial communications device since only one packet may be present on the Ethernet at any one time. For these devices, although we can consider the communications in one step as taking place in parallel for the purposes of analysis, they will in fact be serialised in practice.

The key to efficient commutative operations is to perform as much communication in parallel as possible: by minimising the number of parallel communication steps needed to perform the commutative operation, the effect of communication startup latency is minimised. The time for the commutative operation to take place is approximately proportional to the number of communication steps, S. This is the most important term to reduce. The communication time between processes is often affected by the number of communications occurring simultaneously. It is therefore important that both the overall number of communications and the number of communications per step are also minimised.

The type of communication taking place at each step also determines performance. If all the communications in a step are between neighbour processors then there will be little contention on the communication network as the communications take place. If the communications are not to nearest neighbours then the number of communications will affect the time to complete the step, since the routing mechanism of the hardware will be used to deliver messages and contention may occur. If the process topology has not been mapped well onto the hardware topology, it will often be the case that a communication between nearest neighbour processes is not, in hardware, a communication between nearest neighbour processors. For example, a ring topology implemented on a pipeline of processors will require messages on the connection between the last and first processors to be sent via a routing mechanism. Communications along this link will always be slower than along the other links, and in a commutative communication step the slowest communication will determine the time for the step.

6.1 Commutative Operation using a Pipeline

Figure 17 shows a diagram of a pipeline of processes and the communication pattern for a commutative operation. The number of communications and steps is proportional to the number of processes. This is because the value contributed from each process must be passed down to the last process and then the result is passed back up the pipeline again.









$$C = 2(P - 1), \qquad S = 2(P - 1)$$
Figure 17 Commutative operations using a pipeline connection topology.
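The counts for the pipeline can be confirmed by a direct simulation. The Python sketch below is illustrative only; `pipeline_commutative` is a hypothetical name, and each hop is counted as one communication and one step because it cannot start before the previous hop has delivered its value.

```python
def pipeline_commutative(values, func):
    """Reduce down the pipe to the last process, then pass the result back up."""
    nproc = len(values)
    comms = steps = 0
    acc = values[0]
    for p in range(1, nproc):            # process p-1 sends its partial result to p
        acc = func(acc, values[p])
        comms += 1
        steps += 1                       # each hop depends on the previous one
    result = [None] * nproc
    result[nproc - 1] = acc
    for p in range(nproc - 2, -1, -1):   # the result travels back up the pipe
        result[p] = result[p + 1]
        comms += 1
        steps += 1
    return result, comms, steps


result, comms, steps = pipeline_commutative(list(range(1, 9)), lambda x, y: x + y)
```

For P = 8 this gives 14 communications in 14 steps, in agreement with 2(P - 1) for both quantities.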

The number of steps for a ring commutative operation is the same as for a pipeline but the number of communications is higher. On some hardware, this will give a pipe topology the edge in performance over the ring. If the commutative operation can be performed around the ring using non-blocking communications then the number of steps can be halved. Communication around a ring requires all the values to be accumulated in an array in process order during communication, with the commutative computation performed using the array once all values have been communicated. This avoids round off problems and guarantees that each processor calculates the same result. Buffer space is required on each process to perform this operation, and for a very large parallel run, i.e. thousands of processes, this may be disadvantageous. If the hardware can communicate simultaneously in both directions then the performance can be even higher, since values can travel both ways around the ring at the same time, reducing the distance to the furthest process to P/2.

6.2 Commutative Operation using a Grid

Figure 18 shows a diagram of a 2D grid of processes and the communication pattern for a commutative operation. Each stage of the commutative operation is across one of the d dimensions of the grid. This method would be used when each processor in a grid can only talk directly to its grid neighbours; otherwise it is advantageous to use a hyper-cube method (see next section).
















$$C = 2 \sum_{i=1}^{d} (P_i - 1) \prod_{j=1,\, j \neq i}^{d} P_j, \qquad S = 2 \sum_{i=1}^{d} (P_i - 1)$$

Figure 18 Communication pattern for commutative operation using a grid (panels labelled "2,8 Stage 1" and "6,8 Stage 2").

Where $P_i$ is the number of processors in dimension $i$ and $d$ is the number of dimensions.
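To compare a grid with a pipeline numerically, the short Python sketch below counts communications and steps under one explicit assumption: each stage is an independent pipeline-style commutative along every line of processes in one dimension, with all lines of a stage running in parallel. The function `grid_counts` is hypothetical, and the assumption may not match every detail of the CAPLib grid method.

```python
from math import prod


def grid_counts(dims):
    """Return (C, S) for a grid with dims[i] processes in dimension i."""
    nproc = prod(dims)
    comms = steps = 0
    for p_i in dims:                 # one stage per dimension
        lines = nproc // p_i         # independent lines of p_i processes in this stage
        comms += lines * 2 * (p_i - 1)
        steps += 2 * (p_i - 1)       # all lines of a stage proceed in parallel
    return comms, steps


pipe_16 = grid_counts((16,))
grid_2x4 = grid_counts((2, 4))
grid_4x4 = grid_counts((4, 4))
```

Under this model a one-dimensional "grid" reduces to the pipeline figures, and arranging 16 processes as a 4 by 4 grid cuts the number of steps from 30 to 12 at the price of more communications in total.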

6.3 Commutative Operation using Hyper-cubes

In a hyper-cube topology of dimension d, each process is connected directly to d other processes. Algorithms implemented using a hyper-cube generally offer the best performance because the number of steps needed to perform a commutative operation is related to d, i.e. 2d, not to the number of processes, P. For non-trivial P, the hyper-cube offers far greater performance than any other topology.

There are a number of ways to implement a commutative operation on a hyper-cube. Two methods are currently implemented in CAPLib. Method A uses a pair-wise exchange between processes until every process has the result. Method B uses a binary tree algorithm. Both rely on the connectivity offered by the hyper-cube. Both methods guarantee that the order of computation will be the same on every process, and therefore that the values obtained will be the same on all processes. This is obviously the case with method B. In method A, it is guaranteed by always combining contributions with that from the lower-numbered processor on the left-hand side of the summation.

The pair-wise exchange of data that characterises method A can be further improved if non-blocking communications are used. Overlapping the exchange of data reduces the number of steps by a factor of two, but relies on two small non-blocking communications out-performing two small blocking communications. CAPLib does not currently implement a non-blocking version of method A.

[Figure 19 panel: Method A, with labels 4 3 2 1]

$C = Pd \ (d > 1)$, $S = 2d$ (blocking exchange), $S = d$ (non-blocking exchange)

Figure 19 Communication pattern for commutative operation using a hyper-cube (method A, d=4)
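The pair-wise exchange of method A can be sketched as a simulation. This is an illustrative Python model rather than the CAPLib implementation: the function name `hypercube_exchange` is hypothetical, and the partner of a process in each dimension is assumed to be the process whose number differs in exactly one bit.

```python
def hypercube_exchange(values, op):
    """Simulate a pair-wise exchange on a hyper-cube of len(values) = 2**d processes.

    Returns (results, communications). The contribution from the
    lower-numbered process is always placed on the left of op, so both
    partners evaluate exactly the same expression.
    """
    p = len(values)
    d = p.bit_length() - 1
    if p < 1 or p != 1 << d:
        raise ValueError("number of processes must be a power of two")
    current = list(values)
    comms = 0
    for bit in range(d):
        updated = []
        for rank in range(p):
            partner = rank ^ (1 << bit)  # neighbour across this dimension
            low, high = min(rank, partner), max(rank, partner)
            updated.append(op(current[low], current[high]))
            comms += 1  # each process sends one value to its partner
        current = updated
    return current, comms


if __name__ == "__main__":
    data = [0.1, 0.2, 0.3, 1e16, -1e16, 0.4, 0.5, 0.6]
    results, comms = hypercube_exchange(data, lambda a, b: a + b)
    # Every process holds a bit-identical result despite round-off.
    print(len(set(results)) == 1, comms)
```

With eight processes (d = 3) the simulation counts 24 communications, in agreement with $C = Pd$.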

[Figure 20 panel: Method B, with labels 8 7 6 5 and 4 3 2 1]

$C = 2^{d+1} - 2$, $S = 2d$

Figure 20 Communication pattern for commutative operation using a hyper-cube (method B, d=4).
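A binary tree operation of the kind used by method B can be modelled in the same way. The sketch below is an assumption-laden illustration in Python: it takes process 0 as the root of the tree, combines values towards the root and then sends the result back along the same links. The name `hypercube_tree` is hypothetical.

```python
def hypercube_tree(values, op):
    """Simulate a binary tree operation on len(values) = 2**d processes.

    Returns (results, communications, steps). Process 0 is assumed to be
    the root of the tree.
    """
    p = len(values)
    d = p.bit_length() - 1
    if p < 1 or p != 1 << d:
        raise ValueError("number of processes must be a power of two")
    current = list(values)
    comms = 0
    steps = 0
    # Combine towards the root: the number of messages halves at each step.
    for bit in range(d):
        stride = 1 << bit
        for rank in range(0, p, 2 * stride):
            current[rank] = op(current[rank], current[rank + stride])
            comms += 1
        steps += 1
    # Send the result back down the tree along the same links.
    for bit in reversed(range(d)):
        stride = 1 << bit
        for rank in range(0, p, 2 * stride):
            current[rank + stride] = current[rank]
            comms += 1
        steps += 1
    return current, comms, steps


if __name__ == "__main__":
    print(hypercube_tree(list(range(1, 17)), lambda a, b: a + b)[1:])
```

For d = 4 this counts 30 communications and 8 steps, matching $C = 2^{d+1} - 2$ and $S = 2d$.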

In order to use these methods in runs where the number of processes does not exactly make up a hyper-cube, the methods must be modified. For method A, if we take the number of processes left over beyond the largest complete hyper-cube to be k, then the last k processes send their values to the first k processes before the main part of the procedure begins. This ensures that the values from these processes are used. When the main procedure is complete, the last k processes receive the result. Method B handles the extra processes by extending the binary tree communication strategy to include the extra k processes.
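The modification to method A for process counts that do not form a complete hyper-cube can be sketched as follows. This Python model is illustrative only; the name `exchange_any_p` is hypothetical, and the pairing of surplus process `core + i` with process `i` is an assumption, since the text only states that the last k processes send to the first k.

```python
def exchange_any_p(values, op):
    """Pair-wise exchange for any number of processes (at least one).

    The largest complete hyper-cube is formed from the first 2**d processes;
    the k surplus processes fold their values in beforehand and receive the
    result afterwards.
    """
    p = len(values)
    d = p.bit_length() - 1
    core = 1 << d      # size of the largest complete hyper-cube
    k = p - core       # number of surplus processes
    current = list(values[:core])
    # Pre-step: surplus process core+i sends its value to process i.
    for i in range(k):
        current[i] = op(current[i], values[core + i])
    # Main procedure: pair-wise exchange across each dimension in turn,
    # with the lower-numbered process's value on the left.
    for bit in range(d):
        current = [
            op(current[min(r, r ^ (1 << bit))], current[max(r, r ^ (1 << bit))])
            for r in range(core)
        ]
    # Post-step: the surplus processes receive the result from the first k.
    return current + current[:k]


if __name__ == "__main__":
    print(exchange_any_p([1, 2, 3, 4, 5, 6], lambda a, b: a + b))
```

With six processes, processes 4 and 5 first send to processes 0 and 1, a two-dimensional exchange runs on processes 0 to 3, and the result is then returned to processes 4 and 5.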


6.4 Comparison of Commutative Methods

Table 2 shows a comparison of the number of steps and the number of communications needed for a commutative operation using the methods implemented inside CAPTools. In the table, A and B denote the hyper-cube methods A and B, and sync and async denote blocking and non-blocking exchange.

P       Pipe C  Pipe S   Ring C  Ring S     A C  A S sync  A S async    B C   B S
2            2       2        2       2       2         2          1      2     2
4            6       6       12       6       8         4          2      6     4
8           14      14       56      14      24         6          3     14     6
16          30      30      240      30      64         8          4     30     8
32          62      62      992      62     160        10          5     62    10
64         126     126     4032     126     384        12          6    126    12
128        254     254    16256     254     896        14          7    254    14
256        510     510    65280     510    2048        16          8    510    16
512       1022    1022   261632    1022    4608        18          9   1022    18
1024      2046    2046  1047552    2046   10240        20         10   2046    20

Table 2 Number of steps S, and communications C, for a commutative operation using different methods.
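The entries of Table 2 can be regenerated from closed-form expressions. The sketch below, in Python, uses the formulas given with Figures 17, 19 and 20; the ring communication count $P(P - 1)$ is an inference from the tabulated series and from the statement that every value travels around the ring, not a formula stated in the text. The name `table2_row` is hypothetical.

```python
def table2_row(d):
    """Return the step and communication counts for P = 2**d processes."""
    p = 2 ** d
    return {
        "P": p,
        "pipe": (2 * (p - 1), 2 * (p - 1)),    # (C, S)
        "ring": (p * (p - 1), 2 * (p - 1)),    # (C, S); C is inferred
        "hypercube_a": (p * d, 2 * d, d),      # (C, S blocking, S non-blocking)
        "hypercube_b": (2 ** (d + 1) - 2, 2 * d),  # (C, S)
    }


if __name__ == "__main__":
    for d in range(1, 11):
        print(table2_row(d))
```

Printing the rows for d = 1 to 10 makes the difference in growth visible: the pipeline and ring steps grow linearly with P, while the hyper-cube steps grow with log2 P.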

Obviously the hyper-cube methods are the best for P>4; the pipe and ring methods would only be used on machines where the hyper-cube is not available, for example, machines built of hard-wired, directly connected processors in a pipeline or grid. With blocking communications, each of the hyper-cube methods performs the operation in 2d steps, but B takes fewer communications overall than A, for P>2. For a large number of processes this factor becomes very important, as the time for a large number of simultaneous communications in one step can be affected by message contention across the hardware processor interconnect. For A, the number of messages at each step in a commutative operation remains constant at P/2. The number of communications in each successive step using method B reduces by a factor of 2, and therefore any contention is confined to the first few steps. The number of steps needed to complete the operation using A can, however, be halved if non-blocking communications are used.

Figure 21 shows a graph of communication latency for CAP_COMMUTATIVE using CAPLib over SHMEM on the Cray T3D using a pipeline and the two hyper-cube methods. The graph clearly demonstrates the effect of using different global communication algorithms. Global communication using a pipeline rapidly becomes more expensive as the number of processors increases. The best performance is given by the hyper-cube B algorithm. Note that in this case MPI_ALLREDUCE, which is the MPI equivalent to CAP_COMMUTATIVE, does not perform as well as the hyper-cube methods employed by CAP_COMMUTATIVE. Indeed, the CAP_COMMUTATIVE function has performed better than the corresponding MPI_ALLREDUCE function in all ports of CAPLib so far undertaken.





[Graph: Time (us) against Processors, processor axis from 1 to 1000]

Figure 21 CAP_COMMUTATIVE latency on Cray T3D

7 CAPLib Support Environment

One of the major reasons that parallel environments are often difficult to use is the amount of configuration and detail the user must know about the system in order to compile and run their parallel programs successfully. As part of the CAPTools parallelisation environment, a set of utilities is provided to aid users in compiling, running and debugging their parallel programs. The main utilities are capf77 and capmake, which allow compilation of the user's source code; caprun, which provides a mechanism for parallel execution of the user's compiled executable; and capsub, which provides a simple generic method for submitting jobs to parallel batch queues. The characteristics of the utilities are:

Simple to use. The utilities hide from the user as much as possible of the detail of compiling and executing parallel programs. Parallel compilation usually requires extra flags on the compile line and special libraries to be linked in. Many parallel environments require a complex initialisation process to begin the execution of a parallel program. Parallel execution often fails, not because the user's program is incorrectly coded, but because the user has wrongly configured the parallel environment in some way. By hiding the messy details of configuration from the user, execution becomes both quicker and more reliable. In many cases, users do not need a detailed knowledge of the parallel environment they are utilising at all.

Generic interface. Each utility uses a set of common arguments across the domains of parallel environment (e.g. MPI) and machine type (e.g. Cray T3D). This makes it easy for the user to migrate from one machine or parallel environment to another. The main generic arguments are:

-mach            Machine type, e.g. Sun, Paramid, T3D.
-penv            Parallel environment type, e.g. PVM, MPI, i860toolset, shmem.
-top             Parallel topology type, e.g. pipe2, ring4, full6, grid2x2.
-debug n1 n2..   Execute in debug mode on processors n1, n2 etc.

When a utility is executed it first checks for the existence of the environment variables CAPMACH and CAPPENV that provide default settings for the machine type and parallel environment type. These can be set manually by the user in their login script or by the execution of the usecaplib script, which attempts to determine these automatically from the host system. The command line argument versions of the environment variables can be used to over-ride any defaults.
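The precedence between the environment variables and the command line arguments can be sketched as follows. This is an illustrative Python model and not the source of the CAPLib utilities; the function name `resolve_settings` and the dictionary it returns are hypothetical. Only the names CAPMACH, CAPPENV, -mach and -penv are taken from the text.

```python
def resolve_settings(argv, environ):
    """Return the machine and parallel environment types a utility would use.

    Defaults come from the CAPMACH and CAPPENV environment variables;
    -mach and -penv on the command line override them.
    """
    settings = {
        "mach": environ.get("CAPMACH"),
        "penv": environ.get("CAPPENV"),
    }
    args = list(argv)
    i = 0
    while i < len(args) - 1:
        if args[i] in ("-mach", "-penv"):
            settings[args[i][1:]] = args[i + 1]  # command line wins
            i += 2
        else:
            i += 1
    return settings


if __name__ == "__main__":
    env = {"CAPMACH": "sun", "CAPPENV": "pvm3"}
    print(resolve_settings(["-penv", "mpi", "jac"], env))
```

In this example the machine type is taken from the environment, while the parallel environment type given on the command line replaces the default.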

8 Parallel Debugging
The debugging of parallel message passing code often requires the user to start up multiple debuggers in order to trace and debug the execution on several processes. The main disadvantage of having several debuggers running on the workstation screen is the large amount of resource, both in computer time and in physical memory, that this can require. Each debugger (with a graphical user interface) may require 40 Mbytes, and starting up several debuggers or attaching to several running processes can take minutes on a typical workstation.

Recently computer vendors and third party software developers have begun to address this issue by allowing the debugger to handle more than one process at a time and by allowing the user to switch quickly from one process to another. This dramatically reduces the memory cost since only one debugger is now running and, if the same executable is running on all processors, only a single set of debugging information need be loaded. Examples of commercial debuggers that provide such a facility are TotalView [15] and Sun Microsystems' Workshop development environment [16]. Cheng and Hood [17] describe the design and implementation of a portable debugger for parallel and distributed programs. Their client-server design allows the same debugger to be used on both PVM and MPI programs, and they suggest that the process abstractions used for debugging message-passing codes can be adopted to debug HPF programs at the source level. Recently the High Performance Debugging Forum [18] has been established to define a useful and appropriate set of standards relevant to debugging tools for High Performance Computers.

The caprun script has a -debug <process-set> argument that allows users to specify a set of processes that they wish to debug. On systems that do not yet provide a multi-process debugger, but do provide some mechanism to debug parallel processes, using this option will result in a set of debuggers appearing on the screen attached to the chosen process set.
CAPLib also provides a library routine called CAP_DEBUG_PROC that allows a debugger to be attached to an already running process where this is possible, perhaps following some error condition. When a process calls CAP_INIT, one of the tasks undertaken is to check command line arguments and environment variables. If -debug is found, then a call is made to CAP_DEBUG_PROC, which calls a machine-dependent system routine to run the script capdebug. This script is passed information such as the calling process-id, the DISPLAY environment variable and the executable pathname, which allows a debugger to be started up, attached to the calling process and displayed on the host machine's screen. The caprun script also has a capdbgscript argument that allows the user to specify a set of debugger commands to be executed by each debugger when starting up. As an example:
caprun -m sun -p pvm3 -top ring5 -debug 1-3 -dbxscript stopinsolve jac

This will start up three debuggers attached to processes 1-3 on the user's workstation. All debuggers will then execute the script stopinsolve, which might contain:
print cap_procnum
stop in solve
cont

This would print the CAPTools processor number, set a break point in routine solve and continue program execution.
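The way a process set such as 1-3 might be expanded into individual process numbers can be sketched as below. The exact syntax accepted by caprun is not specified here, so this Python fragment is an assumption: it accepts whitespace-separated process numbers and inclusive ranges written as first-last. The name `parse_process_set` is hypothetical.

```python
def parse_process_set(spec):
    """Expand a process set such as "1-3" or "1 4 6-7" into a sorted list.

    Assumed syntax: whitespace-separated process numbers, where a token of
    the form first-last denotes an inclusive range.
    """
    processes = set()
    for token in spec.split():
        if "-" in token:
            first, last = token.split("-", 1)
            processes.update(range(int(first), int(last) + 1))
        else:
            processes.add(int(token))
    return sorted(processes)


if __name__ == "__main__":
    print(parse_process_set("1-3"))
```

Under this assumed syntax, the set 1-3 used in the example above selects processes 1, 2 and 3, giving the three debuggers described.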

9 Results
This section gives a series of results obtained for parallelisations, using CAPTools and CAPLib, of two of the well-known NAS Parallel Benchmarks (NPB) [19], APPLU (LU) and APPBT (BT). The LU code is a lower-upper diagonal (LU) CFD application benchmark. It does not, however, perform an LU factorisation but instead implements a symmetric successive over-relaxation (SSOR) scheme to solve a regular-sparse, block lower and upper triangular system. BT is representative of computations associated with the implicit operators of CFD codes such as ARC3D at NASA Ames. BT solves multiple independent systems of non-diagonally dominant, block tridiagonal equations. The codes are characterised in parallel form by pipeline algorithms, making both codes sensitive to communication latency.

The results for the benchmarks refer to three different versions/revisions of the same code. Rev 4.3 is a serial version of the benchmarks written in 1994 as a starting point for optimised implementations. Version NPB2.2 is a parallel version of the codes written by hand by NASA and using MPI communication calls. Version NPB2.3, the successor to NPB2.2, has both a serial and a parallel version. The results presented here are for runs of CLASS A, 64x64x64 size problems. For each code, an SPMD parallelisation using a 1-D and, in some cases, a 2-D partitioning strategy was produced using CAPTools. The results for runs using these parallelisations on the Cray T3D, Transtech Paramid and the SGI Origin2000 are presented in the following sections together with results for runs of the NPB2.2/2.3 parallel MPI versions.
9.1 LU

The results for LU runs on the Cray T3D, T3E, SGI Origin 2000 and Transtech Paramid are shown in Figure 22 to Figure 25 respectively. The T3D and T3E results compare the performance of 1-D and 2-D parallelisations of LU using CAPTools. The 1-D version can only be run on a maximum of 64 processors because of the size of the problem being solved (64x64x64). The 2-D version was run on up to 8x8 processors and gives very reasonable results.

Figure 23 shows graphs of execution time for 1-D and 2-D parallelisations of LU using CAPTools on the Cray T3E with different versions of CAPLib. The best results are given, as expected, by the SHMEM version of CAPLib, although for the 2-D runs the differences are quite small. These small differences are in part due to the pipelines present in the LU code. The 1-D version has pipelines with a much longer startup and shutdown period than the 2-D version, and therefore its performance is more dependent on the startup latency of the communications. Another factor is the memory access pattern required for communication in the second dimension, which uses buffered CAPLib calls such as CAP_BSEND/BRECEIVE that gather data before sending and scatter data after receiving. The memory accesses are non-contiguous and therefore tend to create numerous cache misses. The resulting slower memory access is a large element of the communication time and reduces the bandwidth obtained, compared with contiguous memory accesses, by a factor of 4 for CAPLib/SHMEM, 2.6 for CAPLib/MPI and 2.2 for CAPLib/PVM. Since the scatter/gather part of the communication is the same regardless of the underlying communication library, the effect is to reduce the overall difference in performance between the different versions of CAPLib.

The Origin2000 graph compares performance figures for a version of LU parallelised at NASA Ames using MPI with the 1-D and 2-D code produced by CAPTools. All three codes exhibit super-linear speedup up to 32 processors. Thereafter performance falls considerably. This is probably due to the NUMA architecture of the Origin hardware, and in particular to its caching mechanisms. As the number of processors involved in the parallel execution of an application is increased, this appears to have an effect on caching and remote memory access time, resulting in the observed performance degradation and fluctuation.
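The gather and scatter steps performed by the buffered calls mentioned above can be illustrated with a strided copy. The following Python sketch is not the CAPLib interface: the function names `gather` and `scatter` and their arguments are hypothetical, and a flat list stands in for a Fortran array stored in column-major order.

```python
def gather(flat, start, stride, count):
    """Copy count strided elements of flat into a contiguous send buffer."""
    return [flat[start + i * stride] for i in range(count)]


def scatter(flat, start, stride, buffer):
    """Copy a contiguous receive buffer back into strided positions of flat."""
    for i, value in enumerate(buffer):
        flat[start + i * stride] = value


if __name__ == "__main__":
    rows, cols = 4, 3
    # Column-major storage of A(4,3): element A(i,j) sits at (i-1) + (j-1)*rows.
    a = list(range(rows * cols))
    # The second row of A is non-contiguous: one element every `rows` entries.
    send_buffer = gather(a, start=1, stride=rows, count=cols)
    b = [0] * (rows * cols)
    scatter(b, start=1, stride=rows, buffer=send_buffer)
    print(send_buffer, b)
```

Each element copied this way comes from a different part of memory, which is the access pattern that tends to cause the cache misses described above; the copy itself is the same whichever message passing library carries the buffer.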

[Graph "NAS LU Version 4.3 (Cray T3D)": Speed Up against Processors; series: Linear, CAPTools 1D, CAPTools 2D]

Figure 22 Speed up results for NAS LU benchmark (NPB 4.3) on Cray T3D

[Graph "NAS LU Version 4.3 64x64x64 (Cray T3E)": Time (Secs) against Processors; series: 1-D (CAPLib/SHMEM), 1-D (CAPLib/MPI), 1-D (CAPLib/PVM3), 2-D (CAPLib/SHMEM), 2-D (CAPLib/MPI), 2-D (CAPLib/PVM3)]

Figure 23 Execution times for NAS LU benchmark (NPB 4.3) on Cray T3E

[Graph "NAS LU Version 2.2/2.3 (SGI Origin 2000)": Speed Up against Processors; series: Linear, CAPTools 1D (2.3), NASA 2D (2.2), CAPTools 2D (2.3)]

Figure 24 Speed up results for NAS LU benchmark (NPB 2.2/2.3) on SGI Origin2000

[Graph "NAS LU Version 4.3 (Transtech Paramid)": Speed Up against Processors; series: Linear, 1-D Non-Blocking, 1-D Blocking]

Figure 25 Speed up results for NAS LU benchmark (NPB 4.3) on Transtech Paramid



9.2 BT

Results for CAPTools generated BT NPB4.3 and 2.3 runs on the Cray T3D, Transtech Paramid and the SGI Origin 2000 are shown in Figure 26 to Figure 27 respectively. The T3D results show near linear speed up on runs up to 256 processors. The Paramid results (for runs on a 32x32x32 problem using a 1-D parallelisation) demonstrate how the use of non-blocking communications, automatically generated using CAPTools, can improve the performance of codes on machines where non-blocking communication is supported.

The Origin2000 graph compares performance figures for the NPB2.2 version of BT parallelised at NASA Ames using MPI with the NPB2.3 code produced by CAPTools. The NPB2.2 parallel version uses over-decomposition to avoid idle times [20,21], achieved by distributing $p^2$ partitions onto $p$ processors. The better performance of the NASA parallelisation can probably be attributed to the use of the over-decomposition method. Both sets of results tail off and then fluctuate above 25 processors. Again, it is thought that this is due to the caching mechanisms of the Origin NUMA hardware. SGI has recently enhanced MPI in order to enforce processor-memory affinity, that is, to ensure that data resides in memory that is local to the processor using the data. However, new results using these enhancements have not yet been obtained. The T3D results are far superior, showing the benefit of a low latency, distributed memory architecture for these types of codes.

[Graph "NAS BT Version 4.3 (Cray T3D)": Speed Up against Processors; series: Linear, CAPTools 2D]

Figure 26 Speed up results for NAS BT benchmark (NPB 4.3) on Cray T3D

[Graph "NAS BT Version 2.2/2.3 (SGI Origin 2000)": Speed Up against Processors; series: Linear, NASA 2D (2.2), CAPTools 2D (2.3)]

Figure 27 Speed up results for NAS BT benchmark (NPB2.2/2.3) on SGI Origin2000

10 Conclusions
Computational mechanics application codes are one class of software that can really make good use of high performance parallel computing systems. With today's architectures and, at least, those of the immediate future, CM codes that are to map well onto these HPC systems will use SPMD paradigms to partition the scalar task. This approach requires frequent synchronised access to data generated on other processors, which makes the codes crucially dependent on highly efficient message passing libraries. Fortunately, in the past few years a number of such libraries have been developed. Although most systems have their own library that optimises inter-processor communication performance, all commercial systems have also implemented the standard PVM and/or MPI libraries. Of course, these libraries have been developed to facilitate a wide range of inter-processor communication, and so they are very flexible and general. By contrast, the type and function of communication calls required by CM codes is well specified and compact; they do not need the wide functionality of the generic libraries.

In this paper, we have described a set of message passing tools that are specifically targeted at servicing the needs of CM codes. This thin layer toolkit, CAPLib, uses both machine specific and generic libraries. The key advantages of CAPLib as the message-passing library for CM codes over other options are:

- Its performance is at least as good as that of the generic libraries.
- It is truly portable and provides the user with options to use the optimum library on any given system.
- It is easier for the CM community of users to use and understand.

CAPLib has been implemented over PVM, MPI and SHMEM, amongst others, and ported to a variety of parallel architectures, including the CRAY T3D and T3E, IBM SP2 and SGI Origin systems.
Recent work on CAPLib includes extension of the library to cover unstructured mesh applications, implementation in a shared memory and mixed shared/distributed environment plus implementation on other machines such as DEC Alpha Clusters, Fujitsu AP3000 and VX systems, NEC Cenju4 and SX4 and Linux clusters.

References

[1] C. Ierotheou, S.P. Johnson, M. Cross, P.F. Leggett, Computer Aided Parallelisation Tools (CAPTools) - conceptual overview and performance on the parallelisation of structured mesh codes, Parallel Computing, Vol. 22, pages 163-195, 1996.
[2] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine, A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, 1994.
[3] Message Passing Interface Forum, MPI: A message passing interface standard, Computer Science Dept. Technical Report CS-94-230, University of Tennessee, Knoxville, 1994.
[4] Cray Research Inc., SHMEM Technical Note for C, SG-2516 2.3, October 1994.
[5] S.P. Johnson, M. Cross, M.G. Everett, Exploitation of symbolic information in interprocedural dependence analysis, Parallel Computing, Vol. 22, pages 197-226, 1996.
[6] P.F. Leggett, A.T.J. Marsh, S.P. Johnson, M. Cross, Integrating user knowledge with information from the parallelisation tools to facilitate the automatic generation of efficient parallel FORTRAN code, Parallel Computing, Vol. 22, pages 259-288, 1996.
[7] S.P. Johnson, C. Ierotheou, M. Cross, Automatic parallel code generation for message passing on distributed memory systems, Parallel Computing, Vol. 22, pages 227-258, 1996.
[8] E.W. Evans, S.P. Johnson, P.F. Leggett, M. Cross, Automatic Code Generation of Overlapped Communications in a Parallelisation Tool, Parallel Computing, Vol. 23, pages 1493-1523, 1997.
[9] E.W. Evans, S.P. Johnson, M. Cross, P.F. Leggett, Automatic Generation of Multi-dimensionally Partitioned Parallel CFD code in a Parallelisation Tool, Parallel CFD 97, conference proc., pub. North Holland, 1997.
[10]
[11] Transtech, Paramid User Guide, Doc Ref No PMD M 312, Transtech Parallel Systems Ltd, High Wycombe, Bucks, UK, HP13 5RE, 1993.
[12] P.F. Leggett, CAPLib User Manual, Parallel Processing Research Group, University of Greenwich (in preparation).
[13] M. Pahud, T. Cornu, Measurement of the Contention Phenomena on the Cray T3D Architecture, SIPAR Workshop on Parallel Computing, Bienne, October 6, 1995.
[14] INMOS, ANSI C Toolset User Guide, IMS D0314-D0CA, INMOS Ltd., Bristol, UK, October 1992.
[15] Dolphin Interconnect Solutions Ltd., TotalView Multiprocess Debugger Users Guide, Version 3.7.7, September 1997.
[16] SunSoft Inc., 1997.
[17] D. Cheng, R. Hood, A Portable Debugger for Parallel and Distributed Programs, Supercomputing 94, IEEE Computer Soc., 1994.
[18] High Performance Debugging Forum, 1998.
[19] D. Bailey et al., The NAS Parallel Benchmarks, RNR Technical Report RNR-94-007, March 1994.
[20] J. Bruno, P.R. Cappello, Implementing the Beam and Warming Method on the hypercube, Proceedings of the 3rd Conference on Hypercube Concurrent Computers and Applications, Pasadena, CA (January 19-20, 1988).
[21] R.F. Van der Wijngaart, Efficient Implementation of a 3-Dimensional ADI method on the iPSC/860, Supercomputing 93, Portland, OR (November 15-19, 1993).