Parallel Multigrid for Anisotropic Elliptic Equations
Received July 9, 1999; revised April 5, 2000; accepted April 28, 2000
1. INTRODUCTION
Structured grids are widely used in the numerical simulation of many physical
systems because they allow relatively easy sequential and parallel implementa-
tion. Furthermore, single-block algorithms are very efficient because their regular
data structures enable both parallel implementation and effective exploitation of
the cache memory. Single-block grids are commonly used as building blocks for
multiblock grids, which are required to deal with complex geometries or to
facilitate parallel processing [13].
We compare two common approaches to achieve robustness on single-block
structured grids: alternating-plane smoothers combined with standard coarsening
and plane smoothers with semicoarsening algorithms. We study their numerical and
architectural properties for the solution of a 3-D discrete anisotropic elliptic
problem on stretched cell-centered grids. This paper is organized as follows: The
numerical problem used in our simulations is described in Section 2. The numerical
properties (convergence factor) obtained for different anisotropic problems are
found in Section 3. Architectural properties, namely memory hierarchy exploitation
and parallel efficiency, are presented in Sections 4 and 5. The paper ends with some
conclusions.
The test problem studied in this paper is the 3-D anisotropic diffusion equation:
$$\frac{\partial}{\partial x}\left(\alpha\,\frac{\partial \phi}{\partial x}\right) + \frac{\partial}{\partial y}\left(\beta\,\frac{\partial \phi}{\partial y}\right) + \frac{\partial}{\partial z}\left(\gamma\,\frac{\partial \phi}{\partial z}\right) = S_\phi(x, y, z). \quad (1)$$
Discretizing Eq. (1) with second-order central differences on a stretched
cell-centered grid yields, at each cell $(i, j, k)$,

$$\frac{2\alpha}{\Delta x_{ijk}}\bigl(X_p\,\phi_{i+1jk} - (X_p + X)\,\phi_{ijk} + X\,\phi_{i-1jk}\bigr)$$
$$+\;\frac{2\beta}{\Delta y_{ijk}}\bigl(Y_p\,\phi_{ij+1k} - (Y_p + Y)\,\phi_{ijk} + Y\,\phi_{ij-1k}\bigr)$$
$$+\;\frac{2\gamma}{\Delta z_{ijk}}\bigl(Z_p\,\phi_{ijk+1} - (Z_p + Z)\,\phi_{ijk} + Z\,\phi_{ijk-1}\bigr) = S_{\phi,ijk}, \quad (2)$$
where, in order to get a less cumbersome expression, we have defined the discrete
coefficients:
$$\Delta x_{ijk} = x_{i+1jk} - x_{ijk}; \qquad \Delta y_{ijk} = y_{ij+1k} - y_{ijk}; \qquad \Delta z_{ijk} = z_{ijk+1} - z_{ijk}$$
$$X_p = \frac{1}{x_{i+2jk} - x_{ijk}}, \qquad Y_p = \frac{1}{y_{ij+2k} - y_{ijk}}, \qquad Z_p = \frac{1}{z_{ijk+2} - z_{ijk}}$$
$$X = \frac{1}{x_{i+1jk} - x_{i-1jk}}, \qquad Y = \frac{1}{y_{ij+1k} - y_{ij-1k}}, \qquad Z = \frac{1}{z_{ijk+1} - z_{ijk-1}}.$$
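To make the discretization concrete, the following C sketch assembles the x-direction weights of Eq. (2) for a single cell. It assumes x[] holds the face coordinates of the stretched grid (cell i spans [x_i, x_{i+1}]); the type and function names are our own illustration, not taken from the paper's code.

```c
/* x-direction weights of the Eq. (2) stencil for cell i on a stretched
 * cell-centered grid. x[] holds face coordinates, so the centers of cells
 * i and i+1 lie (x[i+2] - x[i])/2 apart, which is where X_p comes from. */
typedef struct { double west, center, east; } StencilX;

static StencilX stencil_x(const double *x, int i, double alpha)
{
    double dx = x[i + 1] - x[i];             /* cell width Delta x_i     */
    double Xp = 1.0 / (x[i + 2] - x[i]);     /* coupling with cell i+1   */
    double Xm = 1.0 / (x[i + 1] - x[i - 1]); /* coupling with cell i-1
                                                (the paper's plain X)    */
    double c  = 2.0 * alpha / dx;
    StencilX s = { c * Xm, -c * (Xp + Xm), c * Xp };
    return s;
}
```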
There are several methods for solving a system of linear equations such as the
one defined by (2). We have studied some relevant properties of one of the most
popular, known as multigrid. This technique is based on two fundamental prin-
ciples: relaxation and coarse grid correction. The relaxation procedure is based on
a standard iterative method, called a smoother in multigrid literature. The
smoother damps the high-frequency (oscillatory) components of the error, but it
fails to attenuate the low-frequency components. The coarse-grid correction is used
to damp these low-frequency components: on a suitable coarser grid, low-frequency
components appear more oscillatory, so the smoother can again be applied effectively.
Grid transfer operators connect the grid levels. The prolongation operator maps data
from the coarser level to the current one, while the restriction operator transfers
values from the finer level to the current one. The smoother and the transfer
operators have to be chosen carefully, since otherwise the overall performance of
the algorithm can degrade dramatically [4].
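These two principles combine into the well-known V-cycle. The following C skeleton is only a sketch of that structure under our own naming assumptions; Grid, smooth(), restrict_(), prolongate(), and solve_coarsest() are hypothetical placeholders, not interfaces of the paper's code:

```c
/* Schematic V(nu1, nu2) cycle over an array of grid levels, where level 0
 * is the finest. The helpers are placeholders; restrict_ carries a
 * trailing underscore because restrict is a C99 keyword. */
typedef struct Grid Grid;

void smooth(Grid *g, int sweeps);                /* damp oscillatory error */
void restrict_(const Grid *fine, Grid *coarse);  /* residual to coarser    */
void prolongate(const Grid *coarse, Grid *fine); /* correction to finer    */
void solve_coarsest(Grid *g);

void v_cycle(Grid *levels[], int l, int coarsest, int nu1, int nu2)
{
    if (l == coarsest) {
        solve_coarsest(levels[l]);
        return;
    }
    smooth(levels[l], nu1);                  /* presmoothing           */
    restrict_(levels[l], levels[l + 1]);     /* go down one level      */
    v_cycle(levels, l + 1, coarsest, nu1, nu2);
    prolongate(levels[l + 1], levels[l]);    /* coarse-grid correction */
    smooth(levels[l], nu2);                  /* postsmoothing          */
}
```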
Multigrid based on standard coarsening (cell-centered coarsening consists in
joining fine-grid cells to obtain coarse-grid cells) and point-wise relaxation as a
smoother properly solves the discrete operator (2) when no anisotropy exists
(i.e., α ≈ β ≈ γ). Brandt's fundamental block relaxation rule [4] states that all
strongly coupled unknowns (those along coordinates with relatively larger coefficients)
should be relaxed simultaneously in order to solve an anisotropic operator efficiently.
So, if α ≫ β ≈ γ in Eq. (1), we can use x-line relaxation as an efficient smoother.
However, if α ≈ β ≫ γ, then (x, y)-plane relaxation is needed to provide a good
smoother. This approach has been successfully applied to solve a large number of
problems when the coupled variables in the discrete operator are known
beforehand, for example, the solution of the Navier-Stokes equations when the grid
stretching is normal to the body [26] or grids with nonunitary aspect ratios [16].
Consequently, a general rule in multigrid is to solve simultaneously those
variables which are strongly coupled by means of plane or line relaxation. However,
in a general situation the nature of the anisotropy is not known beforehand, so
there is no way to know which variables are coupled. Moreover, if the problem is
solved on a stretched grid or the equation coefficients differ from each other
throughout the domain (computational and physical anisotropy, respectively), the
values of the coefficients and their relative magnitudes vary across different parts
of the computational domain. In such cases, multigrid techniques based on point-wise
or plane-wise smoothers combined with full coarsening fail to smooth the error
components.
Several approaches have been proposed in the literature to make multigrid a
robust solver that can deal with a wide range of problems (a more precise definition
of a robust solver can be found in [29]). We will focus on the two following alter-
natives:

• alternating-plane smoothers combined with standard coarsening, and

• plane smoothers combined with semicoarsening.
The test code has been developed in the C language. The system of equations is
solved by the full approximation scheme (FAS) [4], which is more involved than
the simpler correction scheme but can be applied to solve nonlinear equations. Both
approaches (alternating-plane and semicoarsening) use V(γ_1, γ_2) cycles (γ_1 and γ_2
denote the number of presmoothing and postsmoothing sweeps, respectively)
implemented in an ``all-multigrid'' way. The plane solver used in the 3-D smoothing
procedure is a robust 2-D multigrid V(1, 1)-cycle employing full coarsening and
alternating-line smoothers; the approximate 2-D solution obtained after one 2-D
cycle is sufficient to provide robustness and good convergence rates in the 3-D
solvers [14]. Thomas' algorithm or one 1-D V(1, 1)-cycle is used to solve the lines.
Lexicographic Gauss-Seidel and zebra plane relaxation are used in the following
experiments. The asymptotic convergence factor per cycle is measured as

$$\rho = \frac{\|r^n\|_2}{\|r^{n-1}\|_2}, \qquad n \to \infty, \quad (3)$$

where $r^n$ denotes the residual after the $n$th cycle.
These factors have been obtained for a homogeneous problem with right-hand side
$S_\phi(x, y, z) = 0$ and boundary condition $\phi(x, y, z) = 0$, starting from a random
initial guess. The homogeneous problem reduces roundoff errors and thus allows an
accurate measurement of asymptotic convergence factors.
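As a sketch of how such a measurement can be organized in C (residual_norm() and v_cycle_step() are hypothetical helpers, not the paper's code):

```c
/* Run ncycles V-cycles on the homogeneous problem, starting from a random
 * initial guess set up elsewhere, and return the last ratio of successive
 * residual norms, which approximates the asymptotic rho of Eq. (3). */
double residual_norm(void);   /* ||r||_2 on the finest grid (placeholder) */
void   v_cycle_step(void);    /* one V(1, 1) cycle (placeholder)          */

double convergence_factor(int ncycles)
{
    double rho = 0.0, prev = residual_norm();
    for (int n = 0; n < ncycles; n++) {
        v_cycle_step();
        double cur = residual_norm();
        rho  = cur / prev;    /* rho_n = ||r^n||_2 / ||r^(n-1)||_2 */
        prev = cur;
    }
    return rho;
}
```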
3. CONVERGENCE RATE
FIG. 2. Convergence factor of one V(1, 1) cycle for the isotropic equation ($u_{xx} + u_{yy} + u_{zz} = 0$) on a
64×64×64 grid with different stretching factors (top chart), and the anisotropic equation ($a u_{xx} + b u_{yy} +
c u_{zz} = 0$) for several coefficient sets on a uniform 64×64×64 grid (bottom chart).
As Fig. 2 (top chart) shows, both methods solve the problem effectively. For the
semicoarsening approach, each coarsening direction exhibits the same performance
since a dominant direction does not exist. The alternating-plane smoother improves
its convergence factor as the anisotropy grows (as was shown in [14]).
On the other hand, when the anisotropy is located on one plane the best coars-
ening procedure is the one that maintains coupling of strongly connected
unknowns. For example, if we make the coefficients grow in the xy-plane on a
uniform grid (see Fig. 2, bottom chart), z-coarsening gets a better convergence fac-
tor because the smoother becomes an exact solver for high anisotropies. Alternat-
ing-plane smoothers present a similar behavior, becoming a direct solver when the
unknowns are strongly coupled; i.e., the better the plane is solved, the better the
resulting convergence.
The behavior exhibited by the convergence rate of the alternating-plane approach
can be explained in terms of the smoothing factor of the three alternating sweeps.
If the problem is essentially isotropic, then each of the three alternating sweeps con-
tributes to the smoothing factor. In problems with a moderate anisotropy in the
xy-plane, the high-frequency error is mainly reduced in the z-direction sweep. In
strongly anisotropic problems the z-direction sweep reduces the smooth error com-
ponents as well, which actually solves the problem rather than just smoothing the
error.
In conclusion, we can state that the alternating-plane approach exhibits a better
convergence factor than the semicoarsening approach. However, in order to com-
pare different algorithms, we have to take into account not only their numerical
efficiency but also their architectural properties.
FIG. 3. L2 cache misses for the resolution (in particular, to reach a residual norm equal to $10^{-12}$)
of the isotropic equations using V(1, 1) multigrid cycles on 32×32×32 and 64×64×64 grids.
less than those obtained for the z-semicoarsening and the alternating-plane
approach, respectively. These differences grow slightly for the 64×64×64 problem
size.
Figure 3 shows the number of L2 cache misses for the resolution (in particular,
to reach a residual norm equal to $10^{-12}$) of the isotropic equations using V(1, 1)
multigrid cycles on 32×32×32 and 64×64×64 grids. The number of misses only
depends on the problem size and the particular smoother employed since our code
has not been optimized for either constant coefficients or uniform grids. It is inter-
esting to note that the alternating-plane approach and the z-semicoarsening have
around 1.75 and 2.5 times more misses than the x-semicoarsening on the O2 system
for both problem sizes. However, on the O2K system, where the large second level
cache allows a better exploitation of the temporal locality, the differences grow with
problem size since, for smaller problems, the spatial locality has less impact on the
number of cache misses. The greatest differences between both systems are obtained
from the x- and y-semicoarsening (around 3.4 times more misses on the O2 system)
due to temporal locality effects, which are lower on the alternating-plane and
z-semicoarsening approaches where the spatial locality has more influence.
FIG. 4. Convergence factor per work unit for several 64×64×64 grids with different stretching factors.
FIG. 5. Amount of memory required by our code on an SGI Origin 2000 for 32×32×32, 64×64×64,
and 128×128×128 grid sizes.
5. PARALLEL PROPERTIES
5.1. Outline
One can adopt one of the following strategies to get a parallel implementation of
a multigrid method:

• domain decomposition, in which the domain is split among the processors and a
multigrid cycle is applied inside each subdomain; and

• grid partitioning, in which every grid level is partitioned and distributed among
the processors.
Domain decomposition methods are often employed with finite-element dis-
cretizations on parallel computers. They are easier to implement and imply less
communication, since data exchange is required only on the finest grid. Additionally,
they can be applied to general multiblock grids. However, decomposition methods
lead to algorithms that are numerically different from the sequential version, which
has a negative impact on the convergence rate. Grid partitioning retains the con-
vergence rate of the sequential algorithm but implies more communication overhead,
since data exchange is required on each grid level. A hybrid approach
that applies the V-cycle on the entire multiblock domain while the smoothers are
performed inside each block (domain decomposition for the smoother) has been
proposed in [13].
We have limited this research to the grid partitioning technique. In this
approach, the alternating-plane method presents worse parallel properties than semicoarsening
because, regardless of the data partitioning applied, it requires the solution of
tridiagonal systems of equations distributed among the processors. Since the
semicoarsening approach does not need an alternating-plane smoother, an
appropriate 1-D data decomposition in the semicoarsening direction can be chosen
so that a parallel tridiagonal solver is not needed. Thus, for instance, an x-direction
partitioning is suitable for x-semicoarsening combined with yz-plane
smoothers.
5.2.1. The pipelined Gaussian elimination method. The PGE method, also known
as the pipelined Thomas algorithm, solves multiple tridiagonal systems in parallel;
it addresses the intrinsic sequentiality of a single-system solution by pipelining the
solution of the many systems across the processors [9, 18, 19]. This algorithm is
graphically represented in Fig. 6.
Each processor has to wait for the completion of the forward or backward step
computation of the Thomas algorithm on the preceding processors. In the forward
step (Fig. 6, left-hand chart) the data to be transferred are the equation coefficients
while in the backward one, the transferred data are the solutions (Fig. 6, right-hand
chart). Communications which happen on the same phase are equally shaded.
As this figure illustrates, instead of sending a separate message after completion of
the forward or backward step for a single line (system), it is more convenient to
gather data from several lines in each message. Collecting data helps to reduce the
communication delay, or overhead, since every message sent implies appending a
considerable amount of control information [20]. The number of systems each
processor works on in a step, and hence the length of the message that each
processor has to send to the following one, will be called the pipe size; it determines
the balance between communication and pipe delay. We use the term pipe delay for
the fact that each processor has to wait idle until the first block of data arrives
from the preceding processor.
The trade-off between communication and pipe delay (increasing the pipe size
increases the pipe delay, while decreasing the pipe size increases the communication
overhead) is responsible for the existence of an optimal pipe size (a range of optimal
pipe sizes, actually), which, as was shown in a previous paper [8], is a very
important factor when optimizing this algorithm. From now on in this work, the
optimal pipe size will be used with this algorithm.
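A minimal MPI sketch in C may clarify the pipelining of the forward step; the data layout, buffer handling, and all names are our own assumptions (each rank owns local_n consecutive rows of every system, and pipe systems travel per message), not the paper's implementation.

```c
#include <mpi.h>

#define MAX_PIPE 256  /* upper bound on the pipe size, for the buffer */

/* Forward-elimination step of PGE. Systems are processed in batches of
 * `pipe`: for each batch a rank receives the last modified row (b, c, d)
 * of every system from its predecessor, eliminates its own rows, and
 * forwards its last rows downstream. Requires pipe <= MAX_PIPE. */
void pge_forward(int nsys, int local_n, int pipe,
                 const double *a, double *b, const double *c, double *d,
                 int rank, int nprocs)
{
    double bnd[3 * MAX_PIPE];
    for (int s0 = 0; s0 < nsys; s0 += pipe) {
        int batch = (nsys - s0 < pipe) ? nsys - s0 : pipe;
        if (rank > 0)   /* waiting here is precisely the pipe delay */
            MPI_Recv(bnd, 3 * batch, MPI_DOUBLE, rank - 1, s0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int s = 0; s < batch; s++) {
            size_t off = (size_t)(s0 + s) * local_n;
            const double *as = a + off, *cs = c + off;
            double *bs = b + off, *ds = d + off;
            if (rank > 0) {               /* eliminate against received row */
                double m = as[0] / bnd[3 * s];
                bs[0] -= m * bnd[3 * s + 1];
                ds[0] -= m * bnd[3 * s + 2];
            }
            for (int i = 1; i < local_n; i++) {  /* purely local rows */
                double m = as[i] / bs[i - 1];
                bs[i] -= m * cs[i - 1];
                ds[i] -= m * ds[i - 1];
            }
            bnd[3 * s]     = bs[local_n - 1];    /* boundary row for  */
            bnd[3 * s + 1] = cs[local_n - 1];    /* the next rank     */
            bnd[3 * s + 2] = ds[local_n - 1];
        }
        if (rank < nprocs - 1)  /* pass boundaries downstream */
            MPI_Send(bnd, 3 * batch, MPI_DOUBLE, rank + 1, s0,
                     MPI_COMM_WORLD);
    }
}
```

The backward step mirrors this structure, with the solutions of the boundary rows traveling in the opposite direction.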
FIG. 6. Pipelined Gaussian elimination scheme. The left-hand diagram represents the equation
coefficient transfers in the forward step, while the right-hand diagram shows the solution transfers in
the backward step.

In principle, the Y-sweep is done directly in one step, but this produces very low
data locality since cache lines are aligned along rows, not along columns. This can
be improved by blocking the Y-sweep, i.e., calculating along each column over a
certain number of rows at a time. Therefore, we have to adjust another parameter
of the algorithm, the block size, to take maximum advantage of this blocking. From
now on, the blocking of the Y-sweep is applied in all the algorithms studied,
including the sequential solver.
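A C sketch of the blocked forward elimination along y, under the assumption of a row-major plane; BLOCK plays the role of the block size discussed above, and all array names are illustrative.

```c
#include <stddef.h>

#define BLOCK 16  /* rows per block; machine-dependent, like the pipe size */

/* Blocked Y-sweep: walking a whole column at once touches a new cache
 * line at every row, so instead each column is advanced over a BLOCK of
 * rows at a time and consecutive columns reuse the lines of those rows
 * while they are still cached. Row j = 0 is the unmodified starting row. */
void y_sweep_forward(int nx, int ny,
                     const double *a, double *b, const double *c, double *d)
{
    for (int j0 = 1; j0 < ny; j0 += BLOCK) {
        int jend = (j0 + BLOCK < ny) ? j0 + BLOCK : ny;
        for (int i = 0; i < nx; i++)          /* each column ...          */
            for (int j = j0; j < jend; j++) { /* ... over a block of rows */
                size_t k = (size_t)j * nx + i, km = k - nx;
                double m = a[k] / b[km];      /* eliminate row j against  */
                b[k] -= m * c[km];            /* row j-1 of column i      */
                d[k] -= m * d[km];
            }
    }
}
```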
FIG. 7. Matrix transposition scheme. Blocks are internally transposed prior to calculation, not
merely moved.
FIG. 8. Efficiency of 2-D alternating-line smoothers for the 32 (top), 64 (middle), and 128 (bottom)
problems, on the Cray T3E (left) and the SGI Origin 2000 (right).
us from solving 3-D problems whose corresponding 2-D planes are big enough to
obtain satisfactory efficiencies on medium-sized parallel computers.
FIG. 9. Parallel efficiency of one U-cycle for 64×64×64 and 32×32×32 grids using up to 16 pro-
cessors on a Cray T3E. The number of levels has been fixed so that the coarsest grid has only one plane
on each processor, and the smoother employs five iterations on the coarsest grid.
efficiency. Figure 10 shows the results using this realistic efficiency on a Cray T3E.
The number of iterations on the coarsest level has been fixed so that the residual
norm on this level is reduced by six orders of magnitude (a reduction found to be
a good trade-off between numerical and architectural properties). Even so, the
efficiency obtained is satisfactory only for up to eight processors. Isotropic
simulations present better parallel efficiencies than the anisotropic cases due to
their better convergence properties.
Our second approach implements a true V-cycle method. Thus, as we go down
below the critical level we have to dynamically rearrange the communication pat-
terns and grid distribution. Note that the number of idle processors is doubled on
every level below the critical one. In order to improve the algorithm below the criti-
cal level a damped Jacobi smoother has been employed, since it has better parallel
properties than the zebra scheme. This change has no important effect on the con-
vergence factor as long as the weighting is chosen properly [14].
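A damped Jacobi sweep is straightforward to sketch in C: every point is updated independently from the old iterate, so no ordering (and no extra communication phase, unlike the zebra scheme) is imposed. The stencil layout and names below are our own assumptions, not the paper's code.

```c
#include <stddef.h>

/* One damped Jacobi sweep over the interior of an nx*ny*nz grid using a
 * compact 7-point stencil: st stores, per point, the center, west, east,
 * south, north, bottom, and top coefficients; omega is the damping factor
 * (the weighting that must be chosen properly, per [14]). */
void damped_jacobi(int nx, int ny, int nz, double omega,
                   const double *st, const double *rhs,
                   const double *u_old, double *u_new)
{
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                size_t p = ((size_t)k * ny + j) * nx + i;
                const double *s = st + 7 * p;
                double sum = rhs[p]
                    - s[1] * u_old[p - 1]  - s[2] * u_old[p + 1]
                    - s[3] * u_old[p - nx] - s[4] * u_old[p + nx]
                    - s[5] * u_old[p - (size_t)nx * ny]
                    - s[6] * u_old[p + (size_t)nx * ny];
                /* weighted average of old value and Jacobi update */
                u_new[p] = (1.0 - omega) * u_old[p] + omega * sum / s[0];
            }
}
```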
FIG. 10. Realistic parallel efficiency of the U-cycle method for a 64×64×64 grid using up to 16 pro-
cessors on a Cray T3E. The number of levels has been fixed so that the coarsest grid has only one plane
on each processor. The smoother employs the necessary iterations on the coarsest grid to reduce the
residual by six orders of magnitude on this level.
FIG. 11. Realistic parallel efficiency of the V-cycle method using five smoother iterations on the
coarsest level for 32×32×32 (left-hand charts) and 64×64×64 (right-hand charts) problem sizes, on
a Cray T3E (top charts) and an SGI O2K (bottom charts).
6. CONCLUSIONS
Two common robust multigrid approaches for anisotropic operators have been
compared, taking into account both numerical and architectural properties:
• Both alternatives present similar convergence rates for low anisotropies.
However, the alternating-plane smoothing process becomes an exact solver for high
anisotropies, and so its convergence factor tends to zero with increasing anisotropy.
This optimal behavior is not shown by the semicoarsening approach when all the
directions exhibit a strong anisotropy.
ACKNOWLEDGMENTS
This work has been supported by the Spanish research grants TIC 96-1071 and TIC 99-0474. We
thank Ciemat and CSC (Centro de Supercomputacion Complutense) for providing access to the systems
that have been used in this research.
REFERENCES
1. ``Origin 2000 and Onyx2 Performance Tuning and Optimization Guide,'' available at
http://techpub.sgi.com.
2. J. C. Agüí and J. Jiménez, A binary tree implementation of a distributed tridiagonal solver, Parallel
Comput. 21 (1995), 667–686.
3. B. Smith, P. Bjørstad, and W. Gropp, ``Domain Decomposition: Parallel Multilevel Methods for Elliptic
Partial Differential Equations,'' Cambridge Univ. Press, Cambridge, 1996.
4. A. Brandt, ``Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics,'' Technical
Report GMD-Studien 85, May 1984.
5. J. E. Dendy, S. F. McCormick, J. Ruge, T. Russell, and S. Schaffer, Multigrid methods for three-
dimensional petroleum reservoir simulation, in ``Tenth SPE Symposium on Reservoir Simulation,''
February 1989.
6. C. C. Douglas, J. Hu, U. Rüde, and C. Weiß, Cache optimizations for structured and unstructured
grid multigrid, in ``Proceedings of the PDCS'98 Conference,'' July 1998.
7. C. C. Douglas, S. Malhotra, and M. H. Schultz, ``Transpose Free Alternating Direction Smoothers
for Serial and Parallel Methods,'' MGNet, http://www.mgnet.org, 1997.
8. D. Espadas, M. Prieto, I. M. Llorente, and F. Tirado, Solution of alternating-line processes on
modern parallel computers, in ``Proceedings of the 28th International Conference on Parallel Processing
(ICPP '99),'' September 1999, pp. 208–215, IEEE Computer Society Press, Los Alamitos, CA.
9. F. F. Hatay, D. C. Jespersen, G. P. Guruswamy, Y. M. Rizk, C. Byun, and K. Gee, A multi-level
parallelization concept for high-fidelity multiblock solvers, in ``Proceedings of Supercomputing '97,''
November 1997, ACM/IEEE, New York.
10. J. López, O. Platas, F. Argüello, and E. L. Zapata, Unified framework for the parallelization of
divide and conquer based tridiagonal systems, Parallel Comput. 23 (1997), 667–686.
11. A. Krechel, H. J. Plum, and K. Stüben, Parallelization and vectorization aspects of the solution of
tridiagonal linear systems, Parallel Comput. 14 (1990), 31–49.
12. J. Laudon and D. Lenoski, The SGI Origin: A ccNUMA highly scalable server, in ``Proceedings of
the International Symposium on Computer Architecture (ISCA '97),'' June 1997.
13. I. M. Llorente, B. Diskin, and N. D. Melson, Alternating plane smoothers for multiblock grids,
SIAM J. Sci. Comput. 22 (2000), 218–242.
14. I. M. Llorente and N. D. Melson, Behavior of plane relaxation methods as multigrid smoothers,
Electron. Trans. Numer. Anal. 10 (2000), 92–114.
15. I. M. Llorente and F. Tirado, Relationships between efficiency and execution time of full multigrid
methods on parallel computers, IEEE Trans. Parallel Distrib. Systems 8 (1997), 562–573.
16. C. W. Oosterlee, A GMRES-based plane smoother in multigrid to solve 3-D anisotropic fluid flow
problems, J. Comput. Phys. 130 (1997), 41–53.
17. A. Overman and J. V. Rosendale, Mapping robust parallel multigrid algorithms to scalable memory
architectures, in ``Proceedings of the Sixth Copper Mountain Conference on Multigrid Methods,''
1993.
18. A. Povitsky, ``Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas
Algorithm,'' Technical Report 45, ICASE/NASA, 1998.
19. A. Povitsky, ``Parallelization of the Pipelined Thomas Algorithm,'' Technical Report 48, ICASE/
NASA, 1998.
20. M. Prieto, D. Espadas, I. M. Llorente, and F. Tirado, Message passing evaluation and analysis on
Cray T3E and SGI Origin 2000 systems, in ``Proceedings of the 4th Euro-Par Conference (Euro-
Par '99), Lecture Notes in Computer Science, August 1999,'' pp. 173–182, Springer-Verlag, Berlin/
New York.
21. M. Prieto, I. M. Llorente, and F. Tirado, Partitioning regular domains on modern parallel
computers, in ``Proceedings of the 3rd International Meeting on Vector and Parallel Processing''
(J. M. L. M. Palma, J. Dongarra, and V. Hernández, Eds.), pp. 411–424, 1999.
22. D. Quinlan, F. Bassetti, and D. Keyes, Temporal locality optimizations for stencil operations within
parallel object-oriented scientific frameworks on cache-based architectures, in ``Proceedings of the
PDCS'98 Conference,'' July 1998.
23. U. Rüde, Iterative algorithms on high performance architectures, in ``Proceedings of the Euro-Par'97
Conference,'' 1997, pp. 57–71.
24. J. Ruge and K. Stüben, Algebraic multigrid (AMG), in ``Multigrid Methods'' (S. McCormick, Ed.),
Frontiers in Applied Mathematics, Vol. 5, SIAM, Philadelphia, 1986.
25. S. Schaffer, A semicoarsening multigrid method for elliptic partial differential equations with highly
discontinuous and anisotropic coefficients, SIAM J. Sci. Comput. 20 (1998), 228–242.
26. J. L. Thomas, B. Diskin, and A. Brandt, ``Textbook Multigrid Efficiency for the Incompressible
Navier-Stokes Equations: High Reynolds Number Wakes and Boundary Layers,'' Technical Report
99-51, ICASE, 1999.
27. H. H. Wang, A parallel method for tridiagonal equations, ACM Trans. Math. Software 7 (1981),
170–183.
28. T. Washio and K. Oosterlee, Flexible multiple semicoarsening for three-dimensional singularly
perturbed problems, SIAM J. Sci. Comput. 19 (1998), 1646–1666.
29. P. Wesseling, ``An Introduction to Multigrid Methods,'' Wiley, New York, 1992.
30. D. Xie and L. R. Scott, The parallel U-cycle multigrid method, in ``Proceedings of the 8th Copper
Mountain Conference on Multigrid Methods,'' 1996.