
Journal of Parallel and Distributed Computing 61, 96–114 (2001)

doi:10.1006/jpdc.2000.1666, available online at http://www.idealibrary.com

Parallel Multigrid for Anisotropic Elliptic Equations


M. Prieto, R. Santiago, D. Espadas, I. M. Llorente, and F. Tirado
Departamento de Arquitectura de Computadores y Automática, Facultad de Ciencias Físicas,
Universidad Complutense, 28040 Madrid, Spain
E-mail: mpmatias@dacya.ucm.es, rubensm@dacya.ucm.es, despadas@dacya.ucm.es,
llorente@dacya.ucm.es, ptirado@dacya.ucm.es

Received July 9, 1999; revised April 5, 2000; accepted April 28, 2000

In this paper two well-known robust multigrid solvers for anisotropic operators on structured grids are compared: alternating-plane smoothers combined with full coarsening and plane smoothers combined with semicoarsening. The study has taken into account not only numerical properties but also architectural ones, focusing on cache memory exploitation and parallel characteristics. Experimental results for the sequential algorithms have been obtained on two different systems based on the MIPS R10000 processor, but with different L2 cache sizes and memory bandwidths (an SGI O2 workstation and an SGI Origin 2000 system). Although the alternating-plane approach is the best choice for sequential implementations, experimental estimations show poor parallel efficiencies. For the semicoarsening alternative two different parallel implementations have been considered. The first one has optimal parallel characteristics, but due to deterioration of the convergence properties its realistic efficiency is not satisfactory. In the second one, some processors remain idle during a short period of time on every multigrid cycle. However, the second parallel algorithm is more efficient since it preserves the numerical properties of the sequential version. Parallel experiments have also been carried out on a Cray T3E system. © 2001 Academic Press

Key Words: parallel multigrid; robust smoothers; anisotropic partial differential equations.

1. INTRODUCTION

In the multigrid literature, several methods have been proposed for solving anisotropic operators and achieving robustness when the coefficients of a discrete operator can vary throughout the computational domain (due to grid stretching or variable coefficients):

• Alternating-direction plane smoothers in combination with full coarsening [4, 14]


• Block smoothing in combination with semicoarsening [5, 25]

• Recombination of the corrections of more than one semicoarsening grid (multiple semicoarsening) [17]

• Standard coarsening combined with a semicoarsened smoother (flexible multiple semicoarsening) [28]

• Point-wise smoothing combined with a fully adaptive coarsening process (algebraic multigrid) [24]

Structured grids are widely used in the numerical simulation of many physical systems because they allow a relatively easy sequential and parallel implementation. Furthermore, single-block algorithms are very efficient because the regular data structures of structured grids make both parallel implementation and cache memory exploitation possible. Single-block grids are commonly used as building blocks for multiblock grids, which are required to deal with complex geometries or to facilitate parallel processing [13].
We compare two common approaches to achieve robustness on single-block
structured grids: alternating-plane smoothers combined with standard coarsening
and plane smoothers with semicoarsening algorithms. We study their numerical and
architectural properties for the solution of a 3-D discrete anisotropic elliptic
problem on stretched cell-centered grids. This paper is organized as follows: The
numerical problem used in our simulations is described in Section 2. The numerical
properties (convergence factor) obtained for different anisotropic problems are
found in Section 3. Architectural properties, namely memory hierarchy exploitation
and parallel efficiency, are presented in Sections 4 and 5. The paper ends with some
conclusions.

2. THE NUMERICAL PROBLEM

2.1. Mathematical Model and Discretization

The test problem studied in this paper is the 3-D anisotropic diffusion equation:

$$\frac{\partial}{\partial x}\left(\alpha\,\frac{\partial \phi}{\partial x}\right)+\frac{\partial}{\partial y}\left(\beta\,\frac{\partial \phi}{\partial y}\right)+\frac{\partial}{\partial z}\left(\gamma\,\frac{\partial \phi}{\partial z}\right)=S_{\phi}(x, y, z). \qquad (1)$$

The solution φ(x, y, z) is a differentiable function defined on the open domain Ω, (x, y, z) ∈ [0, 1] × [0, 1] × [0, 1], with Dirichlet boundary conditions on the boundary ∂Ω, and S_φ(x, y, z) is a specified source function. Coefficients α, β, and γ in Eq. (1) can, in general, be functions of the spatial variables.
A finite-volume discretization of Eq. (1) is applied on a cell-centered computational grid, where the explicit form of the function φ(x, y, z) on the boundary ∂Ω is used to implement the Dirichlet boundary conditions. The discretization is given by the following system of equations:

$$\begin{aligned}
&\frac{2\alpha}{\Delta x_{ijk}}\bigl(X_p\,\phi_{i+1jk}-(X_p+X)\,\phi_{ijk}+X\,\phi_{i-1jk}\bigr)\\
&\quad+\frac{2\beta}{\Delta y_{ijk}}\bigl(Y_p\,\phi_{ij+1k}-(Y_p+Y)\,\phi_{ijk}+Y\,\phi_{ij-1k}\bigr)\\
&\quad+\frac{2\gamma}{\Delta z_{ijk}}\bigl(Z_p\,\phi_{ijk+1}-(Z_p+Z)\,\phi_{ijk}+Z\,\phi_{ijk-1}\bigr)=S_{\phi\,ijk}\\
&\qquad\qquad i, j, k = 1, 2, 3, \ldots, N, \qquad (2)
\end{aligned}$$

where, in order to get a less cumbersome expression, we have defined the discrete coefficients:

$$\Delta x_{ijk}=x_{i+1jk}-x_{ijk};\qquad \Delta y_{ijk}=y_{ij+1k}-y_{ijk};\qquad \Delta z_{ijk}=z_{ijk+1}-z_{ijk}$$

$$X_p=\frac{1}{x_{i+2jk}-x_{ijk}};\qquad Y_p=\frac{1}{y_{ij+2k}-y_{ijk}};\qquad Z_p=\frac{1}{z_{ijk+2}-z_{ijk}}$$

$$X=\frac{1}{x_{i+1jk}-x_{i-1jk}};\qquad Y=\frac{1}{y_{ij+1k}-y_{ij-1k}};\qquad Z=\frac{1}{z_{ijk+1}-z_{ijk-1}}.$$
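As an illustration of how the discrete operator (2) can be evaluated on a stretched cell-centered grid, the following C sketch computes its left-hand side at one interior cell. It is only a minimal sketch under assumed conventions (a tensor-product grid whose coordinate arrays depend on a single index, and a phi[i][j][k] storage layout); it is not the authors' code.

/* Minimal sketch (not the authors' code): left-hand side of the discrete
 * operator (2) at an interior cell (i, j, k), assuming a tensor-product
 * stretched grid so that x[], y[], z[] depend only on their own index. */
double lhs_cell(double ***phi, const double *x, const double *y, const double *z,
                double alpha, double beta, double gamma, int i, int j, int k)
{
    double dx = x[i+1] - x[i], dy = y[j+1] - y[j], dz = z[k+1] - z[k];
    double Xp = 1.0 / (x[i+2] - x[i]), X = 1.0 / (x[i+1] - x[i-1]);
    double Yp = 1.0 / (y[j+2] - y[j]), Y = 1.0 / (y[j+1] - y[j-1]);
    double Zp = 1.0 / (z[k+2] - z[k]), Z = 1.0 / (z[k+1] - z[k-1]);

    return (2.0*alpha/dx) * (Xp*phi[i+1][j][k] - (Xp+X)*phi[i][j][k] + X*phi[i-1][j][k])
         + (2.0*beta /dy) * (Yp*phi[i][j+1][k] - (Yp+Y)*phi[i][j][k] + Y*phi[i][j-1][k])
         + (2.0*gamma/dz) * (Zp*phi[i][j][k+1] - (Zp+Z)*phi[i][j][k] + Z*phi[i][j][k-1]);
}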

2.2. Robust Multigrid Methods

There are several methods for solving a system of linear equations such as the
one defined by (2). We have studied some relevant properties of one of the most
popular, known as multigrid. This technique is based on two fundamental prin-
ciples: relaxation and coarse grid correction. The relaxation procedure is based on
a standard iterative method, called a smoother in multigrid literature. The
smoother is used to damp the high-frequency, oscillatory components of the error, since it fails to attenuate the low-frequency components. The coarse-grid correction is used to damp those low-frequency components. On a suitable coarser grid, low-frequency components appear more oscillatory and so the smoother can be applied effectively.
Transfer grid operators are used in order to connect the grid levels. The prolonga-
tion operator maps data from the coarser level to the current one while the restric-
tion operators transfer values from the finer level to the current one. The smoother
and transfer grid operators have to be carefully chosen since the overall perfor-
mance of the algorithm can decrease dramatically [4].
Multigrid based on standard coarsening (cell-centered coarsening consists in joining fine-grid cells to obtain coarse-grid cells) and point-wise relaxation as a smoother properly solves the discrete operator (2) when anisotropy does not exist (i.e., α ≈ β ≈ γ). Brandt's fundamental block relaxation rule [4] states that all strongly coupled unknowns (coordinates with relatively larger coefficients) should be relaxed simultaneously in order to efficiently solve an anisotropic operator. So, if α ≫ β ≈ γ in Eq. (1), we could use x-line relaxation as an efficient smoother. However, if α ≈ β ≫ γ, then (x, y)-plane relaxation is needed to provide a good smoother. This approach has been successfully applied to solve a large number of

problems when the coupled variables in the discrete operator are known
beforehand, for example, the solution of the Navier–Stokes equations when the grid
stretching is normal to the body [26] or grids with nonunitary aspect ratios [16].
Consequently, a general rule in multigrid is to solve simultaneously those
variables which are strongly coupled by means of plane or line relaxation. However,
in a general situation the nature of the anisotropy is not known beforehand, so
there is no way to know which variables are coupled. Moreover, if the problem is
solved on a stretched grid or the equation coefficients differ from each other
throughout the domain (computational and physical anisotropy, respectively), the
values of the coefficients and their relative magnitudes vary for different parts of the
computational domain. In such cases the multigrid techniques based on point-wise
or plane-wise smoothers combined with full coarsening fail to smooth error com-
ponents.
Several approaches have been proposed in the literature to make multigrid a
robust solver that can deal with a wide range of problems (a more precise definition
of a robust solver can be found in [29]). We will focus on the two following alter-
natives:

Robust multigrid smoothers combined with full coarsening. One solution consists in exploring all the possibilities, i.e., use alternating-line relaxation (x-line smoothing sweep → y-line smoothing sweep) in 2-D and alternating-plane relaxation ((y, z)-plane smoothing sweep → (x, z)-plane smoothing sweep → (x, y)-plane smoothing sweep) in 3-D [14].

Plane smoothers combined with semicoarsening. Several robust methods have been proposed based on changing the coarsening strategy. Among them, we have considered the multigrid scheme presented in [5]. For a 2-D problem, the basic idea is to use x-line relaxation and y-semicoarsening (doubling the mesh size only in the y-direction) together, or, in the same way, to use y-line relaxation and x-semicoarsening. This approach extends naturally to 3-D in the form of xy-plane relaxation (xz-plane, yz-plane) and z-semicoarsening (y-semicoarsening, x-semicoarsening).
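To make the difference between the two coarsening strategies concrete, the following minimal sketch (an illustration under the assumption of power-of-two grid sizes, not code taken from the paper) builds the per-level grid dimensions for full coarsening and for z-semicoarsening:

/* Illustrative sketch: grid dimensions per level for full coarsening versus
 * z-semicoarsening (only the z mesh size is doubled, i.e., only nz is halved).
 * Grid sizes are assumed to be powers of two. */
void build_hierarchy(int nx, int ny, int nz, int levels,
                     int full[][3], int semi_z[][3])
{
    for (int l = 0; l < levels; ++l) {
        full[l][0]   = nx >> l;  full[l][1]   = ny >> l;  full[l][2]   = nz >> l;
        semi_z[l][0] = nx;       semi_z[l][1] = ny;       semi_z[l][2] = nz >> l;
    }
}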

2.3. Details about the Implementation

The test code has been developed in the C language. The system of equations is solved by the full approximation scheme (FAS) [4], which is more involved than the simpler correction scheme but can be applied to solve nonlinear equations. Both approaches (alternating-plane and semicoarsening) use V(γ1, γ2) cycles (γ1 and γ2 denote the number of presmoothing and postsmoothing sweeps, respectively) implemented in an "all-multigrid" way. The plane solver used in the 3-D smoothing procedure is a robust 2-D multigrid V(1, 1)-cycle employing full coarsening and alternating-line smoothers; the approximate 2-D solution obtained after one 2-D cycle is sufficient to provide robustness and good convergence rates in the 3-D solvers [14]. Thomas' algorithm or one 1-D V(1, 1)-cycle is used to solve the lines. Lexicographic Gauss–Seidel and zebra plane relaxation are used in the following sequential and parallel experiments, respectively. The restriction is done by means of a full-weighting operator, and trilinear (alternating-plane approach) or linear (semicoarsening approach) interpolation is used as the prolongation operator. (See [29] for detailed definitions of these intergrid operators.)
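Since Thomas' algorithm is the elementary line solver named above, a minimal textbook formulation is reproduced here for reference (the function name, argument order, and in-place overwriting of the right-hand side are illustrative choices, not the authors' implementation):

/* Thomas' algorithm for a tridiagonal system
 *   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1,
 * with a[0] and c[n-1] unused.  b and d are overwritten; on exit d holds x. */
void thomas(const double *a, double *b, const double *c, double *d, int n)
{
    for (int i = 1; i < n; ++i) {            /* forward elimination */
        double m = a[i] / b[i-1];
        b[i] -= m * c[i-1];
        d[i] -= m * d[i-1];
    }
    d[n-1] /= b[n-1];                        /* backward substitution */
    for (int i = n-2; i >= 0; --i)
        d[i] = (d[i] - c[i] * d[i+1]) / b[i];
}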
The experimental convergence factors presented in this paper are the asymptotic average residual reduction factors of one 3-D V(1, 1) FAS cycle:

$$\rho=\frac{\|r_{n}\|_{2}}{\|r_{n-1}\|_{2}},\qquad n\to\infty. \qquad (3)$$

They have been obtained for a homogeneous problem with right-hand side S_φ(x, y, z) = 0 and boundary condition φ(x, y, z) = 0, and starting with a random initial guess. The homogeneous problem reduces roundoff errors and thus allows an accurate measurement of asymptotic convergence factors.
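For illustration only (this is not the measurement code used for the paper, and the choice of averaging over the last cycles is an assumption), the factor (3) can be estimated from the residual 2-norms recorded after each cycle:

#include <math.h>

/* Estimate the asymptotic convergence factor (3) as the geometric mean of the
 * last "tail" residual reduction ratios; res[i] is the residual 2-norm after
 * cycle i, and 1 <= tail < ncycles is assumed. */
double convergence_factor(const double *res, int ncycles, int tail)
{
    double rho = 1.0;
    for (int i = ncycles - tail; i < ncycles; ++i)
        rho *= res[i] / res[i-1];
    return pow(rho, 1.0 / tail);
}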

3. CONVERGENCE RATE

This paper studies two sources of anisotropy: anisotropic equation coefficients


and nonunitary cell aspect ratios due to grid stretching. These sources reflect dif-
ferent situations: the former represents problems with a uniform anisotropy
throughout the domain, while the latter exhibits problems with an anisotropy that
varies from cell to cell. Grid stretching is commonly used in grid generation to pack
points in regions with large solution gradients while avoiding an excess of points in
more benign regions. In the present work, the stretching of the grid in a given direction is characterized by the geometric stretching factor β (the quotient between two consecutive mesh sizes in the same direction, h_k = βh_{k−1}).
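For illustration, a geometrically stretched coordinate array on the unit interval can be generated as sketched below (an assumption-laden sketch, with cells growing from the left boundary and the first mesh size scaled so that the last point lands on 1 up to roundoff; it is not the authors' grid generator):

/* Build n+1 coordinates on [0, 1] whose mesh sizes grow geometrically,
 * h_k = beta * h_{k-1}. */
void stretched_grid(double *x, int n, double beta)
{
    double sum = 0.0, h = 1.0;
    for (int k = 0; k < n; ++k) { sum += h; h *= beta; }  /* 1 + beta + ... + beta^(n-1) */
    double h0 = 1.0 / sum;                                /* first mesh size */
    x[0] = 0.0;
    for (int k = 1; k <= n; ++k) {
        x[k] = x[k-1] + h0;
        h0 *= beta;
    }
}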
In order to compare the robustness of both approaches we consider a 3-D
Cartesian grid with geometric stretching in the three dimensions (see Fig. 1 for a
2-D example). Robustness is absolutely necessary in this example since the
anisotropy varies throughout the whole domain.

FIG. 1. A 64×64 grid with geometric stretching factor β = 1.3.



FIG. 2. Convergence factor of one V(1, 1) cycle for the isotropic equation (u_xx + u_yy + u_zz = 0) on a 64×64×64 grid with different stretching factors (top chart), and the anisotropic equation (au_xx + bu_yy + cu_zz = 0) for several coefficient sets on a uniform 64×64×64 grid (bottom chart).

As Fig. 2 (top chart) shows, both methods solve the problem effectively. For the
semicoarsening approach, each coarsening direction exhibits the same performance
since a dominant direction does not exist. The alternating-plane smoother improves
its convergence factor as the anisotropy grows (as was shown in [14]).
On the other hand, when the anisotropy is located on one plane the best coars-
ening procedure is the one that maintains coupling of strongly connected
unknowns. For example, if we make the coefficients grow in the xy-plane on a
uniform grid (see Fig. 2, bottom chart), z-coarsening gets a better convergence fac-
tor because the smoother becomes an exact solver for high anisotropies. Alternat-
ing-plane smoothers present a similar behavior, becoming a direct solver when the unknowns are strongly coupled; i.e., the better the planes are solved, the better the convergence obtained.
The behavior exhibited by the convergence rate of the alternating-plane approach
can be explained in terms of the smoothing factor of the three alternating sweeps.
If the problem is essentially isotropic, then each of the three alternating sweeps con-
tributes to the smoothing factor. In problems with a moderate anisotropy in the
xy-plane, the high-frequency error is mainly reduced in the z-direction sweep. In

strongly anisotropic problems the z-direction sweep reduces the smooth error components as well, so it actually solves the problem rather than just smoothing the error.
In conclusion, we can state that the alternating-plane approach exhibits a better
convergence factor than the semicoarsening approach. However, in order to com-
pare different algorithms, we have to take into account not only their numerical
efficiency but also their architectural properties.

4. MEMORY HIERARCHY EXPLOITATION

Iterative algorithms, especially multigrid methods, reach merely a disappointingly small percentage of their theoretically available CPU performance when applied to large problems. One of the most important reasons for this phenomenon is that current memory technology cannot provide data fast enough to keep the CPU busy. The common approach to mitigating this problem is to use a hierarchical memory structure with fast but small caches at the top of the hierarchy and the slow but large main memory at the bottom.
Since data cache misses are one of the dominant factors in performance, it seems
reasonable to take a closer look at the cache behavior of the algorithms under
study. The measurements have been taken on two different systems based on the MIPS R10000 processor running at 250 MHz: an SGI O2 workstation and one processor of an SGI Origin 2000 system (O2K). The R10000 has a 32 Kbyte
primary data cache, but the external L2 cache varies: 1 Mbyte on the O2 and 4
Mbyte on the O2K. It is instructive to consider that, in the O2K for example, the
latency of the L2 cache is approximately 10 times larger than the L1 latency, with
this ratio growing to 60 times larger for the local main memory [12].
As is well known, caches are designed to exploit spatial and temporal locality of
memory references. Therefore, they can only speed up accesses to frequently and
recently used data. In order to improve data locality of iterative methods, some
authors have successfully employed different techniques such as data access trans-
formations (loop interchange, loop fusion, loop blocking, and prefetching) and data
layout transformations (array padding and array merging) [6, 22, 23]. Our codes
have been compiled using aggressive compiler options (-Ofast=ip3210k on the O2 and -Ofast=ip27 on the O2K), but we have not employed any data transformations
to optimize cache reuse. However, plane smoothers allow a better exploitation of
the temporal locality with regard to common point smoothers, which have to per-
form global sweeps through data sets that are too large to fit in the cache.
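As a generic illustration of one of the data layout transformations mentioned above (array padding; as just stated, no such transformation is applied in our code), a small pad added to the fastest-varying dimension prevents the corresponding elements of power-of-two-sized arrays from mapping onto the same cache sets:

/* Generic array padding example (not used in the paper's code): for
 * power-of-two grid sizes, arrays allocated back to back tend to map the
 * same (i, j, k) element of different arrays, and elements accessed with
 * power-of-two strides, onto the same cache sets.  A small pad in the
 * fastest-varying dimension breaks this alignment. */
#define N   64
#define PAD 8                        /* a few extra, unused elements */

static double u  [N][N][N + PAD];    /* solution        */
static double rhs[N][N][N + PAD];    /* right-hand side */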
Using the R10000 hardware counters and the SGI perfex profiling tool [1], we have recorded that the number of L1 and L2 cache misses of the alternating-plane smoother is in general greater than that of the semicoarsening. As expected, the x- and y-semicoarsening algorithms produce fewer misses due to the memory organization of the 3-D data structures. These algorithms exhibit more spatial locality than the z-semicoarsening and the alternating-plane approach.
For a 32×32×32 problem size, the number of L1 misses (it is approximately the same for both systems) for the x-semicoarsening approach is about 12% and 50%

FIG. 3. L2 cache misses for the resolution (in particular, to reach a residual norm equal to 10^{−12}) of the isotropic equations using V(1, 1) multigrid cycles on 32×32×32 and 64×64×64 grids.

less than those obtained for the z-semicoarsening and the alternating-plane
approach, respectively. These differences grow slightly for the 64×64×64 problem size.
Figure 3 shows the number of L2 cache misses for the resolution (in particular, to reach a residual norm equal to 10^{−12}) of the isotropic equations using V(1, 1) multigrid cycles on 32×32×32 and 64×64×64 grids. The number of misses only
depends on the problem size and the particular smoother employed since our code
has not been optimized for either constant coefficients or uniform grids. It is inter-
esting to note that the alternating-plane approach and the z-semicoarsening have
around 1.75 and 2.5 times more misses than the x-semicoarsening on the O2 system
for both problem sizes. However, on the O2K system, where the large second level
cache allows a better exploitation of the temporal locality, the differences grow with
problem size since, for smaller problems, the spatial locality has less impact on the
number of cache misses. The greatest differences between both systems are obtained
from the x- and y-semicoarsening (around 3.4 times more misses on the O2 system)
due to temporal locality effects, which are lower on the alternating-plane and
z-semicoarsening approaches where the spatial locality has more influence.

FIG. 4. Convergence factor per work unit for several 64×64×64 grids with different stretching factors.

FIG. 5. Amount of memory required by our code on an SGI Origin 2000 for 32×32×32, 64×64×64, and 128×128×128 grid sizes.

Therefore, in order to make a more realistic comparison between the two methods and the different semicoarsening directions, we have to measure the convergence factor per work unit (WU), where a work unit has been defined as the execution time needed to evaluate the equation metrics on the finest grid. As Fig. 4 shows, the alternating-plane approach exhibits a better behavior than the semicoarsening; i.e., it reduces the same error in less time. We should note that the program used to generate the present results is not fully optimized, since it was coded to deal with many different methods and situations, and so these results may be improved with some additional coding effort. In any case, the relative performance of the methods should not change. For example, for a grid with a stretching factor equal to 1.3, the convergence factor of the alternating-plane smoother is 7% better than the convergence factor of the x-semicoarsening, 6% better than that of the y-semicoarsening, and 9.5% better than that of the z-semicoarsening.
Finally, we should also mention that the memory requirements of the semicoar-
sening approach are about twice as large as those of the alternating-plane approach
(see Fig. 5).

5. PARALLEL PROPERTIES

5.1. Outline

One can adopt one of the following strategies to get a parallel implementation of
a multigrid method:

• Domain decomposition with multigrid. A domain decomposition is first applied to the finest grid. Then, a multigrid method is used to solve problems inside each block. Since interdomain connections are limited to the finest level, communications are only required on this level [3].

• Multigrid with grid partitioning. A multigrid method is used to solve the problem in the whole grid. Therefore, communications are required on each level [15].

Domain decomposition methods are often considered with finite element dis-
cretization on parallel computers. They are easier to implement and imply fewer
communications since they are only required on the finest grid. Additionally, they
can be applied to general multiblock grids. However, the decomposition methods lead to algorithms which are numerically different from the sequential version and have a negative impact on the convergence rate. Grid partitioning retains the con-
vergence rate of the sequential algorithm. However, it implies more communication
overheads since data exchange is required on each grid level. A hybrid approach
that applies the V-cycle on the entire multiblock domain while the smoothers are
performed inside each block (domain decomposition for the smoother) has been
proposed in [13].
We have limited this research to the grid partitioning technique. In this
approach, alternating-plane presents worse parallel properties than semicoarsening
because, regardless of the data partitioning applied, it requires the solution of
tridiagonal systems of equations distributed among the processors. Since the
semicoarsening approach does not need an alternating-plane smoother, an
appropriate 1-D data decomposition in the semicoarsening direction can be chosen
so that a parallel tridiagonal solver is not needed. Thus, for instance, an x-direction
partitioning is suitable for x-semicoarsening combined with yz-plane
smoothers.

5.2. Parallel Alternating-Plane Smoothers

The parallel implementation of the alternating-plane approach requires the solution of distributed planes, and so, as an alternating-line smoother is used in their multigrid resolution, it is necessary to solve tridiagonal systems of equations that are distributed among the processors. Using the SGI cvd profiling tool [1], we have found that solving the lines is the most time-consuming task of our code. For example, for a 32×32×32 problem size, around 80% of the execution time is spent in the line solver routines. Therefore, the study of the parallel implementation of an alternating-line 2-D smoother can be used to estimate the efficiency of our 3-D code. For simplicity, we have studied the most efficient and simple case, where only one of the plane directions is partitioned. In this way, one of the two line directions has to be solved in parallel and the other can be solved using a sequential algorithm since the whole line lies in one processor.
The parallel solution of tridiagonal systems has been widely studied; see, for
example, [2, 7, 10, 11]. In the following sections we study three different solvers for the solution of the decomposed sweep: the pipelined Gaussian elimination method (PGE), Wang's algorithm, and the matrix transposition method. Problem sizes are chosen so that they correspond to problems arising from the solution of the 3-D multigrid problems studied. We should also note that the parallel methods considered below lead to a Jacobi-like update order. For simplicity, other update orders, such as lexicographic or red–black orderings, are not analyzed here. In any case, the parallel efficiency that can be obtained with these schemes is always lower than that of the Jacobi ordering.

5.2.1. The pipelined Gaussian elimination method. The PGE, also known as the pipelined Thomas algorithm, for the solution of multiple tridiagonal systems in parallel, addresses the intrinsic sequentiality of the solution of a single system by pipelining the solution of many systems across the processors [9, 18, 19]. This algorithm is graphically represented in Fig. 6. Each processor has to wait for the completion of the forward or backward step computation of the Thomas algorithm on the preceding processors. In the forward step (Fig. 6, left-hand chart) the data to be transferred are the equation coefficients, while in the backward one the transferred data are the solutions (Fig. 6, right-hand chart). Communications which happen in the same phase are equally shaded.
As this figure illustrates, instead of using a separate message after completion of the forward or backward step for a single line (system), it is more convenient to gather data from more than one line in each message. Collecting data helps to reduce the communication delay, or overhead, since every message send implies appending a considerable amount of control information [20]. The number of systems each processor works on in a step, and hence the length of the message that each processor has to send to the following one, will be called the pipe size, and it determines the balance between communications and pipe delay. We use the term pipe delay to represent the fact that each processor has to wait idle until the first block of data comes from the preceding processor.
The trade-off between communications and pipe delay (increasing the pipe size increases the pipe delay, but decreasing the pipe size increases the communications overhead) is responsible for the existence of an optimal pipe size (a range of optimal pipe sizes, actually), which, as was proved in a previous paper [8], is a very important factor when optimizing this algorithm. From now on in this work, the optimal pipe size will be used with this algorithm.
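The following is a minimal sketch of the PGE idea for ny independent tridiagonal systems whose unknowns are block-distributed in the sweep direction (nloc unknowns per processor). The function name, the flattened array layout, the use of blocking MPI point-to-point calls, and the assumption that the pipe size divides ny are illustrative choices; the sketch is not the authors' implementation. Each message of the forward phase carries the eliminated coefficients of a group of pipe lines, and each message of the backward phase carries their solutions, as in Fig. 6.

#include <mpi.h>
#include <stdlib.h>

void pge_sweep(double *a, double *b, double *c, double *d,
               int nloc, int ny, int pipe, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    double *buf = malloc(3 * pipe * sizeof(double));   /* (c', b', d') per line */

    /* forward elimination, pipelined over groups of "pipe" lines */
    for (int j0 = 0; j0 < ny; j0 += pipe) {
        if (rank > 0)         /* carried coefficients from the preceding rank */
            MPI_Recv(buf, 3*pipe, MPI_DOUBLE, rank-1, 0, comm, MPI_STATUS_IGNORE);
        for (int j = j0; j < j0 + pipe; ++j) {
            double cp = 0.0, bp = 1.0, dp = 0.0;        /* values at column i-1 */
            if (rank > 0) {
                cp = buf[3*(j-j0)]; bp = buf[3*(j-j0)+1]; dp = buf[3*(j-j0)+2];
            }
            for (int i = 0; i < nloc; ++i) {
                int k = j*nloc + i;
                if (rank > 0 || i > 0) {                /* eliminate sub-diagonal */
                    double m = a[k] / bp;
                    b[k] -= m * cp;
                    d[k] -= m * dp;
                }
                cp = c[k]; bp = b[k]; dp = d[k];
            }
            buf[3*(j-j0)] = cp; buf[3*(j-j0)+1] = bp; buf[3*(j-j0)+2] = dp;
        }
        if (rank < size-1)    /* hand the carried coefficients to the next rank */
            MPI_Send(buf, 3*pipe, MPI_DOUBLE, rank+1, 0, comm);
    }

    /* backward substitution, pipelined in the opposite direction */
    for (int j0 = 0; j0 < ny; j0 += pipe) {
        if (rank < size-1)    /* solution at the first column of the next rank */
            MPI_Recv(buf, pipe, MPI_DOUBLE, rank+1, 1, comm, MPI_STATUS_IGNORE);
        for (int j = j0; j < j0 + pipe; ++j) {
            double xnext = (rank < size-1) ? buf[j-j0] : 0.0;
            for (int i = nloc-1; i >= 0; --i) {
                int k = j*nloc + i;
                double rhs = d[k];
                if (i < nloc-1 || rank < size-1) rhs -= c[k] * xnext;
                d[k] = rhs / b[k];
                xnext = d[k];
            }
            buf[j-j0] = d[j*nloc];                      /* x at the local first column */
        }
        if (rank > 0)
            MPI_Send(buf, pipe, MPI_DOUBLE, rank-1, 1, comm);
    }
    free(buf);
}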
In principle, the Y-sweep is done directly in one step, but this produces very low
data locality since cache lines are aligned along rows, not along columns. This can

FIG. 6. Pipelined Gaussian elimination scheme. The left-hand diagram represents equation coef-
ficients transfers in the forward step while the right-hand diagram shows the equation solutions transfers
in the backward step.

be improved by means of the blocking of this Y-sweep, i.e., calculating along each
column on a certain number of rows. Therefore, we have to adjust another
parameter of the algorithm, the block size, to take maximum advantage of this
blocking. From now on, the blocking of the Y-sweep will be applied in all the algo-
rithms studied and also on the sequential solver.
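A minimal sketch of the blocked forward elimination of the Y-sweep is given below (illustrative only; the array names, the row-major a[j][i] layout, and the way the block size is handled are assumptions). Within each block of bs rows, the elimination proceeds column by column, so the cache lines of those rows stay resident and are reused for all the columns they cover:

/* Illustrative sketch of the blocked Y-sweep forward elimination.  A plain
 * column-by-column sweep over all ny rows touches each cache line only once
 * per column; restricting the column loop to a small block of "bs" rows at a
 * time keeps the block's cache lines resident across consecutive columns. */
void ysweep_forward_blocked(double **a, double **b, double **c, double **d,
                            int nx, int ny, int bs)
{
    for (int jb = 1; jb < ny; jb += bs) {
        int jend = (jb + bs < ny) ? jb + bs : ny;
        for (int i = 0; i < nx; ++i)               /* columns (lines along y) */
            for (int j = jb; j < jend; ++j) {      /* rows inside the block   */
                double m = a[j][i] / b[j-1][i];
                b[j][i] -= m * c[j-1][i];
                d[j][i] -= m * d[j-1][i];
            }
    }
}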

5.2.2. Wang's algorithm. Wang's algorithm for the solution of a tridiagonal system of equations that is broken up into P = 2^p blocks (one on each processor) consists of four different steps where, unlike other parallel tridiagonal algorithms, global communications are not required. It starts by applying an independent Gaussian elimination on each processor. After sending the updated first equation of every block to the preceding processor, a new elimination step is applied. Then, the shape of the local coefficient matrices becomes similar to the letter "N." The algorithm follows this with a new unidirectional communication and produces an upper triangular matrix in the third step. In the last step, the system is diagonalized and the solution is obtained [27]. It can also be used to solve several tridiagonal systems by means of a pipeline-like scheduling, similar to the PGE algorithm. In this case it is also possible to block the algorithm in order to improve the data locality.

5.2.3. Matrix transposition. Transposing the data matrix prior to the Y-sweep, so that this Y-sweep can be carried out in the same way as the X-sweep, is another technique for the solution of this kind of problem. The main drawback of this approach is the extremely poor data locality of the transposition process. The main benefit is that it does not suffer from the pipe delay, as PGE and Wang's algorithm do, since the computation is totally concurrent, although there is a great amount of simultaneous communication. If the interconnection network is efficient enough to avoid contention, the performance of this algorithm is satisfactory [8]. This algorithm is graphically represented in Fig. 7. It should be noted that, after the movement of blocks, and before the calculation, the blocks have to be internally transposed in order to yield the totally transposed matrix.
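A minimal sketch of this scheme using MPI_Alltoall is shown below (illustrative only; the column-wise block distribution, the row-major layout, and the function name are assumptions, and the authors' code is not reproduced here). Each rank sends one square block to every other rank and then transposes every received block locally, as noted above:

#include <mpi.h>

/* Illustrative sketch: transpose an n x n plane distributed by columns over
 * P ranks (nb = n/P columns per rank, row-major storage local[row*nb + col]).
 * The nb x nb block destined for each rank is already contiguous, so a single
 * MPI_Alltoall repartitions the plane; each received block is then transposed
 * locally into localT, which holds the transpose in the same distribution. */
void transpose_plane(double *local, double *work, double *localT,
                     int n, int P, MPI_Comm comm)
{
    int nb = n / P;                                    /* block edge */

    MPI_Alltoall(local, nb*nb, MPI_DOUBLE,
                 work,  nb*nb, MPI_DOUBLE, comm);

    for (int s = 0; s < P; ++s) {                      /* block received from rank s */
        const double *blk = work + (long)s*nb*nb;
        for (int j = 0; j < nb; ++j)                   /* row inside the block    */
            for (int i = 0; i < nb; ++i)               /* column inside the block */
                localT[(s*nb + i)*nb + j] = blk[j*nb + i];
    }
}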
Although this method is not at all efficient, research inspired by it and avoiding the problem of the internal block transposition has been carried out,

FIG. 7. Matrix transposition scheme. Blocks are internally transposed prior to calculation, not only
moved.

FIG. 8. Efficiency of 2-D alternating-line smoothers for the 32 (top), 64 (middle), and 128 (bottom)
problems, on the CRAY (left) and the SGI Origin 2000 (right).

introducing efficient alternative approaches such as the mapping transposition scheme for large problems [8].

5.2.4. Comparison of results. Experimental results are presented in Fig. 8 for the solution of an alternating-line process on different planes (32×32, 64×64, and 128×128 two-dimensional problems), for both machines described above and for the three algorithms presented.
Regarding the efficiencies (as usual, the efficiency has been defined as T_s/(N × T_p), where T_s is the execution time of the fastest sequential algorithm, Gaussian elimination, N is the number of processors, and T_p is the execution time of the parallel algorithm), they are quite low, due to the small problem size, which makes the communications overhead very important. A better performance can be obtained for larger problems. However, memory limitations prevent

us from solving 3-D problems whose corresponding 2-D planes are big enough to
obtain satisfactory efficiencies on medium-sized parallel computers.

5.3. Parallel Semicoarsening

Since the semicoarsening approach does not need the alternating-plane smoothers, an appropriate 1-D data decomposition and plane-smoother combination can be chosen so that a parallel tridiagonal solver is not needed and satisfactory efficiencies can be obtained.
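For illustration, a minimal sketch of the halo exchange implied by such a 1-D decomposition in x is given below (the storage layout with one ghost yz-plane on each side, the function name, and the use of MPI_Sendrecv are assumptions, not the authors' code); the yz-plane smoother can then be applied locally to the interior planes:

#include <mpi.h>

/* Exchange the ghost yz-planes of a field u stored plane by plane in x:
 * plane 0 is the left halo, planes 1..nloc are interior, plane nloc+1 is the
 * right halo; each plane holds ny*nz values. */
void exchange_ghost_planes(double *u, int nloc, int ny, int nz, MPI_Comm comm)
{
    int rank, size, plane = ny * nz;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send the last interior plane right, receive the left halo */
    MPI_Sendrecv(u + nloc*plane,     plane, MPI_DOUBLE, right, 0,
                 u,                  plane, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
    /* send the first interior plane left, receive the right halo */
    MPI_Sendrecv(u + plane,          plane, MPI_DOUBLE, left,  1,
                 u + (nloc+1)*plane, plane, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}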
The parallel code has been developed using native MPI versions. The compiler options have been set to -O3 on the Cray T3E (for the O2K, see Section 4). As we have explained above, y-semicoarsening and x-semicoarsening are more efficient from an architectural point of view, since they exhibit a better spatial locality. For the same reasons, they are also the best choice from a message passing point of view, since message passing performance also depends on the spatial locality of the messages [21]. Using the Apprentice profiling tool, we have recorded that the time spent performing data cache operations on a two-processor simulation using z-partitioning, i.e., using z-semicoarsening, is about 17% larger than in the x-partitioning for a 64×64×64 problem size. This difference is lower than in the SGI systems analyzed above due to the better behavior of the T3E memory hierarchy when spatial locality does not exist [20]. Common grid transfer operators such as linear (semicoarsening approach) or trilinear (alternating-plane approach) interpolation and full-weighting restriction are parallel by nature. Hence, they do not suffer any change in the parallel code. However, the lexicographic Gauss–Seidel smoothers employed in the previous sections can no longer be applied, so they have been replaced by plane relaxation in a red–black (zebra) order of planes. This variation introduces a slight change in the numerical properties of our parallel code [14].
Since the linear system of equations has to be solved exactly on the coarsest grid
(by means of an iterative solver), it is usually chosen to be as coarse as possible to
reduce the computational cost. However, this decision may cause some processors
to remain idle in the coarsest grids, and the parallel efficiency of the multigrid algo-
rithm is then reduced.
Our first algorithm avoids idle processors using an approach known as the
parallel U-cycle method [30] where the number of grid levels has been fixed so that
the coarsest grid employed on each processor has one plane (this will be referred
to hereafter as the critical level). For example, for a 64×64×64 grid, the U-cycle employs five levels on a two-processor simulation, four levels on a four-processor simulation, etc. Figure 9 presents the parallel efficiency obtained on the T3E using
one U(1, 1) multigrid cycle with five iterations of the smoother on the coarsest grid.
However, with this approach, five iterations are not enough to solve the system
of equations on the coarsest level, since the required number of iterations grows
with the system size. Consequently, as the parallel algorithm may differ from the
sequential version in that it has worse numerical properties, it is more convenient
to express the efficiency using the execution time needed to solve the whole
problem, i.e., to reach a certain residual norm. In particular, we have chosen 10^{−12}, and the corresponding efficiency will be referred to hereafter as the realistic parallel

FIG. 9. Parallel efficiency of one U-cycle for 64×64×64 and 32×32×32 grids using up to 16 processors on a Cray T3E. The number of levels has been fixed so that the coarsest grid has only one plane on each processor and the smoother employs five iterations on the coarsest grid.

efficiency. Figure 10 shows the results using this realistic efficiency on a Cray T3E.
The number of iterations on the coarsest level has been fixed so that the residual
norm on this level was reduced by six orders of magnitude. In any case (six orders
of magnitude has been found as a trade-off between numerical and architectural
properties), the efficiency obtained is only satisfactory using up to eight processors.
Isotropic simulations present better parallel efficiencies than the anisotropic cases
due to convergence properties.
Our second approach implements a true V-cycle method. Thus, as we go down
below the critical level we have to dynamically rearrange the communication pat-
terns and grid distribution. Note that the number of idle processors is doubled on
every level below the critical one. In order to improve the algorithm below the criti-
cal level a damped Jacobi smoother has been employed, since it has better parallel
properties than the zebra scheme. This change has no important effect on the con-
vergence factor as long as the weighting is chosen properly [14].
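One plausible form of this damped Jacobi ordering of the plane relaxation is sketched below (an interpretation offered for illustration, not the authors' code: the plane solver solve_plane() is assumed to exist elsewhere, and the weight omega stands for the damping mentioned above). Every plane is relaxed against the old values of its neighbours, so all planes can be updated concurrently, and the update is then under-relaxed:

/* Assumed external routine: approximate 2-D multigrid solve of plane k with
 * the neighbouring planes of the old iterate frozen on the right-hand side. */
void solve_plane(double *unew_k, const double *uold_km1,
                 const double *uold_kp1, const double *rhs_k);

/* One damped Jacobi sweep over the local planes (illustrative sketch). */
void damped_jacobi_planes(double **uold, double **unew, double **rhs,
                          int nplanes, int plane_size, double omega)
{
    for (int k = 1; k <= nplanes; ++k) {
        solve_plane(unew[k], uold[k-1], uold[k+1], rhs[k]);
        for (int p = 0; p < plane_size; ++p)
            unew[k][p] = (1.0 - omega) * uold[k][p] + omega * unew[k][p];
    }
}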

FIG. 10. Realistic parallel efficiency of the U-cycle method for a 64×64×64 grid using up to 16 processors on a Cray T3E. The number of levels has been fixed so that the coarsest grid has only one plane on each processor. The smoother employs the necessary iterations on the coarsest grid to reduce the residual by six orders of magnitude on this level.

FIG. 11. Realistic parallel efficiency of the V-cycle method using five smoother iterations on the coarsest level for 32×32×32 (left-hand charts) and 64×64×64 (right-hand charts) problem sizes, on a Cray T3E (top charts) and an SGI O2K (bottom charts).

As Fig. 11 shows, the realistic parallel efficiency dramatically improves with respect to the first approach. Although results going down to the coarsest level are presented, it is obviously not necessary to reach this level, since the system can be solved on a previous grid without any significant worsening of the numerical properties.
It is interesting to note that the 16-processor simulation for the 64×64×64 problem size on the O2K does not decrease the parallel efficiency. A better memory hierarchy exploitation is the reason for this fact, since in this case the local problem sizes (around 6 Mbyte) are similar to the L2 cache size (4 Mbyte). Note that on the Cray T3E the L2 cache size is not large enough (96 Kbyte) to store all the local data, and so this behavior is not exhibited.
Similar results have been obtained for stretched grids since the algorithm pre-
serves the numerical properties of the sequential version.

6. CONCLUSIONS

Two common robust multigrid approaches for anisotropic operators have been
compared, taking into account both numerical and architectural properties:
• Both alternatives present similar convergence rates for low anisotropies. However, the alternating-plane smoothing process becomes an exact solver for high anisotropies and so the convergence factor tends to zero with increasing anisotropy. This optimal behavior is not shown by the semicoarsening approach when all the strongly connected unknowns (coordinates with relatively larger coefficients) are not relaxed simultaneously.
• With regard to the cost per cycle, the alternating-plane approach is about 38% more expensive. This difference depends not only on the number of operations (by means of the SGI perfex profiling tool [1], we have found that the number of floating-point operations used in the alternating-plane algorithm is 26% larger than in semicoarsening), but also on the memory hierarchy use. Among the three possibilities that can be employed in the semicoarsening approach, yz-plane relaxation with x-coarsening is the best choice because it exhibits a better spatial locality (the code is written in the C language). This fact is also important in the parallel setting because communication costs also depend on the spatial locality of the messages.
• Using the convergence factor per work unit as the definitive sequential metric that takes into account both numerical and architectural properties, the alternating-plane smoother presents a better behavior than the semicoarsening; i.e., it reduces the same error in less time.
• The memory requirements for the semicoarsening approach are about twice as large as for the alternating-plane approach.
• Parallel implementation of the alternating-plane approach involves the solution of distributed planes and lines, which makes it complicated and less efficient. These difficulties can be avoided by the semicoarsening approach and a proper linear decomposition. However, using a U-cycle technique in order to avoid idle processors, the parallel efficiency obtained is not satisfactory due to the number of iterations required to solve the system on the coarsest grid. Thus, a trade-off between numerical and parallel properties has to be considered, and a true V-cycle seems to be the best choice in spite of the fact that some processors remain idle during a short period of time on every multigrid cycle.

ACKNOWLEDGMENTS

This work has been supported by the Spanish research grants TIC 96-1071 and TIC 99-0474. We thank Ciemat and CSC (Centro de Supercomputación Complutense) for providing access to the systems that have been used in this research.

REFERENCES

1. "Origin 2000 and Onyx2 Performance Tuning and Optimization Guide," available at http://techpub.sgi.com.
2. J. C. Aguí and J. Jiménez, A binary tree implementation of a distributed tridiagonal solver, Parallel Comput. 21 (1995), 667–686.
3. B. Smith, P. Bjørstad, and W. Gropp, "Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations," Cambridge Univ. Press, Cambridge, 1996.
4. A. Brandt, "Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics," Technical Report GMD-Studien 85, May 1984.
5. J. E. Dendy, S. F. McCormick, J. Ruge, T. Russell, and S. Schaffer, Multigrid methods for three-dimensional petroleum reservoir simulation, in "Tenth SPE Symposium on Reservoir Simulation, February 1989."
6. C. C. Douglas, J. Hu, U. Rüde, and C. Weiß, Cache optimizations for structured and unstructured grid multigrid, in "Proceedings of the PDCS'98 Conference, July 1998."
7. C. C. Douglas, S. Malhotra, and M. H. Schultz, "Transpose Free Alternating Direction Smoothers for Serial and Parallel Methods," MGNET at http://www.mgnet.org, 1997.
8. D. Espadas, M. Prieto, I. M. Llorente, and F. Tirado, Solution of alternating-line processes on modern parallel computers, in "Proceedings of the 28th Annual Conference on Parallel Processing (ICPP '99), September 1999," pp. 208–215, IEEE Computer Society Press, Los Alamitos, CA.
9. F. F. Hatay, D. C. Jespersen, G. P. Guruswamy, Y. M. Rizk, C. Byun, and K. Gee, A multi-level parallelization concept for high-fidelity multiblock solvers, in "Proceedings of Supercomputing '97, November 1997," ACM/IEEE, New York.
10. J. López, O. Platas, F. Argüello, and E. L. Zapata, Unified framework for the parallelization of divide and conquer based tridiagonal systems, Parallel Comput. 23 (1997), 667–686.
11. A. Krechel, H. J. Plum, and K. Stüben, Parallelization and vectorization aspects of the solution of tridiagonal linear systems, Parallel Comput. 14 (1990), 31–49.
12. J. Laudon and D. Lenoski, The SGI Origin: A ccNUMA highly scalable server, in "Proceedings of the International Symposium on Computer Architecture (ISCA '97), June 1997."
13. I. M. Llorente, B. Diskin, and N. D. Melson, Alternating plane smoothers for multiblock grids, SIAM J. Sci. Comput. 22 (2000), 218–242.
14. I. M. Llorente and N. D. Melson, Behavior of plane relaxation methods as multigrid smoothers, Electron. Trans. Numer. Anal. 10 (2000), 92–114.
15. I. M. Llorente and F. Tirado, Relationships between efficiency and execution time of full multigrid methods on parallel computers, IEEE Trans. Parallel Distrib. Systems 8 (1997), 562–573.
16. C. W. Oosterlee, A GMRES-based plane smoother in multigrid to solve 3-D anisotropic fluid flow problems, J. Comput. Phys. 130 (1997), 41–53.
17. A. Overman and J. V. Rosendale, Mapping robust parallel multigrid algorithms to scalable memory architectures, in "Proceedings of the Sixth Copper Mountain Conference on Multigrid Methods," 1993.
18. A. Povitsky, "Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm," Technical Report 45, ICASE/NASA, 1998.
19. A. Povitsky, "Parallelization of the Pipelined Thomas Algorithm," Technical Report 48, ICASE/NASA, 1998.
20. M. Prieto, D. Espadas, I. M. Llorente, and F. Tirado, Message passing evaluation and analysis on Cray T3E and SGI Origin 2000 systems, in "Proceedings of the 4th Euro-Par Conference (Euro-Par '99), Lecture Notes in Computer Science, August 1999," pp. 173–182, Springer-Verlag, Berlin/New York.
21. M. Prieto, I. M. Llorente, and F. Tirado, Partitioning regular domains on modern parallel computers, in "Proceedings of the 3rd International Meeting on Vector and Parallel Processing" (J. M. L. M. Palma, J. Dongarra, and V. Hernández, Eds.), pp. 411–424, 1999.
22. D. Quinlan, F. Bassetti, and D. Keyes, Temporal locality optimizations for stencil operations within parallel object-oriented scientific frameworks on cache-based architectures, in "Proceedings of the PDCS'98 Conference, July 1998."
23. U. Rüde, Iterative algorithms on high performance architectures, in "Proceedings of the Euro-Par'97 Conference, 1997," pp. 57–71.
24. J. Ruge and K. Stüben, Algebraic multigrid (AMG), in "Multigrid Methods" (S. McCormick, Ed.), Frontiers in Applied Mathematics, Vol. 5, SIAM, Philadelphia, 1986.
25. S. Schaffer, A semicoarsening multigrid method for elliptic partial differential equations with highly discontinuous and anisotropic coefficients, SIAM J. Sci. Comput. 20 (1998), 228–242.
26. J. L. Thomas, B. Diskin, and A. Brandt, "Textbook Multigrid Efficiency for the Incompressible Navier–Stokes Equations: High Reynolds Number Wakes and Boundary Layers," Technical Report 99-51, ICASE, 1999.
27. H. H. Wang, A parallel method for tridiagonal equations, ACM Trans. Math. Software 7 (1981), 170–183.
28. T. Washio and K. Oosterlee, Flexible multiple semicoarsening for three-dimensional singularly perturbed problems, SIAM J. Sci. Comput. 19 (1998), 1646–1666.
29. P. Wesseling, "An Introduction to Multigrid Methods," Wiley, New York, 1992.
30. D. Xie and L. R. Scott, The parallel U-cycle multigrid method, in "Proceedings of the 8th Copper Mountain Conference on Multigrid Methods, 1996."
