
Applied Numerical Mathematics 35 (2000) 67–86
Sparse approximate inverse and multilevel block ILU preconditioning techniques for general sparse matrices
Jun Zhang
Department of Computer Science, University of Kentucky, 773 Anderson Hall, Lexington, KY 40506-0046, USA
Abstract

We investigate the use of sparse approximate inverse techniques in a multilevel block ILU preconditioner to design a robust and efficient parallelizable preconditioner for solving general sparse matrices. The resulting preconditioner retains the robustness of the multilevel block ILU preconditioner (BILUM) and offers a convenient means to control the fill-in elements when large size blocks (subdomains) are used to form the block independent set. Moreover, the new implementation of BILUM with a sparse approximate inverse strategy affords maximum parallelism for operations within each level as well as for the coarsest level solution. Thus it has two advantages over the standard BILUM preconditioner: the ability to control sparsity and increased parallelism. Numerical experiments are used to show the effectiveness and efficiency of the proposed variant of BILUM. © 2000 IMACS. Published by Elsevier Science B.V. All rights reserved.

Keywords: Sparse matrices; Incomplete LU factorization; Multilevel ILU preconditioner; Sparse approximate inverse; Krylov subspace methods
1. Introduction

Preconditioned Krylov subspace methods have become a popular choice of iterative solution techniques, and some of them are especially aimed at solving nonsymmetric (unstructured) linear systems; see, e.g., [30]. The convergence rate of a preconditioned iterative method is usually dictated by the quality of the preconditioner. With the availability of a high quality preconditioner, the choice of the Krylov subspace accelerator is not that critical [16,30,38]. This observation shifts the focus of finding efficient iterative solvers from choosing accelerators to constructing and identifying robust preconditioners [16]. With the advent and gradual popularity of parallel and distributed architectures, the search for preconditioners that are suitable for high performance computers (parallelizable preconditioners) has become an important issue.
This research was supported in part by the University of Kentucky Center for Computational Sciences. E-mail address: jzhang@cs.uky.edu (J. Zhang). URL: http://www.cs.uky.edu/jzhang.
0168-9274/00/$20.00 © 2000 IMACS. Published by Elsevier Science B.V. All rights reserved. PII: S0168-9274(99)00047-1
The main tradeoffs among general purpose preconditioners are in their intrinsic efficiency, generality, parallelism, and robustness. In general, high accuracy incomplete LU (ILU) type preconditioners (with a large amount of fill-in) are more robust but cost more to construct than their lower accuracy counterparts. High accuracy preconditioners also have less inherent parallelism due to the increased coupling between rows and columns. For parallelizable preconditioners, several versions of sparse approximate inverse (SAI) techniques have been developed [2,6,9,11,17,19,41] in the past few years. These preconditioners afford maximum parallelism and have been shown to be efficient for certain types of problems. However, for large groups of general sparse matrices, their robustness does not yet compete with that of the more traditional sequential preconditioners such as ILUT [28]. On the other hand, the parallelism of a straightforward implementation of ILUT is limited, especially when a high accuracy factorization is required [28].

It has been noted by several authors [11,17,35] that most SAI techniques have certain drawbacks and limitations. One of these drawbacks is that they tend to work well mainly for small size problems due to their inherent local coupling property [11,35]. This is true at least for the Frobenius norm minimization based SAI techniques. As a result, these authors have proposed to use SAIs as local preconditioners in block type global preconditioners. The latter include block SSOR and preconditioners based on Schur complement techniques.

The multilevel block ILU preconditioner [32,33] (BILUM) generalizes the idea of successive independent set orderings with a multilevel structure and offers a good degree of parallelism. For certain types of problems, it may deliver a convergence rate that is almost independent of the problem size [33].
The solution with the independent blocks in BILUM is done with an exact inverse or a singular value decomposition based regularized inverse [32,33]. These strategies are efficient for blocks of small size, but their cost grows rapidly as the size of the blocks increases. For domain decomposition based implementations with large size blocks, the cost of constructing such preconditioners may be prohibitive. Another associated problem is the high cost of memory usage, since large size blocks are no longer dense but their inverses are.

We investigate the use of SAI techniques for inverting large size blocks in the construction of BILUM. An SAI is also constructed for the last reduced system and is used as a solver for an approximate solution on the last level. In fact, almost all multilevel methods based on Schur complement techniques employ some kind of SAI technique. Some of them are very simple. The utilization of specially designed SAIs in algebraic multilevel methods is common [25,26]. Although some SAIs based on high order polynomials are shown to be optimal for certain multilevel methods for solving some specially structured matrices, their practical use and efficient implementation are questionable. As a result, they are usually replaced by suboptimal SAIs for efficiency, especially on parallel computers [24]. In addition, SAIs have been used with multigrid methods [1,21,36]. The use of a general purpose SAI technique in a general purpose multilevel preconditioning method seems to be attractive and promising.

This paper is organized as follows. Section 2 outlines several SAI techniques for general sparse matrices and a particular method that will be used in our numerical implementation. Section 3 discusses multilevel block ILU preconditioning techniques. Section 4 investigates the benefits of using SAIs in BILUM (BILUM-SAI). Section 5 contains numerical experiments and a comparison of BILUM-SAI with several existing preconditioning methods. Section 6 includes some concluding comments.
2. Sparse approximate inverse

A sparse approximate inverse (SAI), often used as a preconditioner with Krylov subspace methods, is a sparse matrix $M$ which is a good approximation to $A^{-1}$, the inverse of a general sparse matrix $A$. The major driving force behind the search for efficient SAIs is their potential advantages in parallel computing. The idea is that if such an $M$ can be constructed by some means, the preconditioning process is just a matrix-vector operation and is therefore fully parallelizable. In addition, there is evidence that this type of preconditioner may solve certain problems that are not suitable for ILU type preconditioners [11], so SAIs offer an alternative to traditional ILU type preconditioners.

There exist several techniques to construct SAI preconditioners. They can be roughly categorized into three groups [3]: sparse approximate inverses based on Frobenius norm minimization [9,11,17,19], factored sparse approximate inverses [2,22,41], and sparse approximate inverses computed from an ILU factorization [30]. Each of these groups contains a variety of different constructions and each of them has its own merits and drawbacks. In other words, none of them is absolutely better than the others in all comparison metrics (construction cost, application cost, robustness, efficiency, etc.). A comprehensive survey and comparison of several existing SAI techniques can be found in [3].

In this paper, we only discuss in detail the one that we will use with our multilevel block ILU preconditioning (BILUM) technique and refer interested readers to the original papers for detailed discussions of other SAI techniques. Our choice of this SAI is based on the ready availability of the computer program. Other SAI techniques can be incorporated by just changing the lines of the program that call the SAI routine. The SAI technique that we discuss here is based on Frobenius norm minimization and was developed by Chow and Saad [11].
The basic idea is to compute a sparse matrix $M \approx A^{-1}$ as the solution of the following constrained minimization problem:

$$\min_{M \in \Omega} \|I - AM\|_F, \tag{1}$$
where $\Omega$ is the sparsity pattern (a set of sparse matrices) and $\|\cdot\|_F$ is the Frobenius norm of a matrix. An important feature of the minimization problem (1) is that it can be decoupled into the sum of the squares of the 2-norms of the individual columns of the residual matrix as

$$\|I - AM\|_F^2 = \sum_{j=1}^{n} \|e_j - A m_j\|_2^2, \tag{2}$$
in which $e_j$ and $m_j$ are the $j$th columns of the identity matrix and of the matrix $M$, respectively. Hence, the computation of $M$ is equivalent to minimizing the individual functions

$$f_j(m) = \|e_j - Am\|_2^2, \quad j = 1, 2, \ldots, n. \tag{3}$$

Thus the computations can be carried out in parallel. Note that the above approach yields a right approximate inverse; a left approximate inverse can be computed by solving a constrained minimization problem for $\|I - MA\|_F = \|I - A^T M^T\|_F$. This is equivalent to computing a right approximate inverse for $A^T$ and taking the transpose of the resulting matrix. While the distinction between left and right approximate inverses may be important for nonsymmetric matrices, most techniques are geared to construct a right approximate inverse. However, it makes no difference in our case whether we use a right or a left approximate inverse, since we do not construct an SAI preconditioner for the original matrix, but SAIs for the independent blocks.
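The column decoupling in (2) and the left/right identity above can be checked numerically. The following is a minimal NumPy sketch on a small dense random matrix; the test matrix, its size, and the diagonal choice of $M$ are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)   # illustrative test matrix
M = np.diag(1.0 / np.diag(A))                     # crude diagonal approximate inverse
I = np.eye(n)

# Left-hand side of (2): squared Frobenius norm of the residual matrix.
lhs = np.linalg.norm(I - A @ M, "fro") ** 2

# Right-hand side of (2): sum of squared 2-norms of the column residuals.
# Each term depends on one column of M only, so the n terms are independent.
rhs = sum(np.linalg.norm(I[:, j] - A @ M[:, j]) ** 2 for j in range(n))

# A left approximate inverse of A corresponds to a right one of A^T.
left = np.linalg.norm(I - M @ A, "fro")
right_T = np.linalg.norm(I - A.T @ M.T, "fro")

print(np.isclose(lhs, rhs), np.isclose(left, right_T))
```

Because every term of the sum involves a single column $m_j$, each column can be assigned to a different processor, which is the source of the parallelism mentioned above.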
The method proposed in [11] uses a few steps of an iterative method to reduce the residuals corresponding to each column of the approximate inverse (3). The minimization is performed by taking a sparse initial guess and solving approximately the $n$ linear subproblems

$$A m_j = e_j, \quad j = 1, 2, \ldots, n, \tag{4}$$

with several steps of a nonsymmetric descent type method, such as a minimal residual method or untruncated GMRES. The iterative method works in sparse mode; i.e., $m_j$ is stored and operated on as a sparse vector, and the Arnoldi basis in GMRES is kept in sparse format. For a few iterations, the approximate columns $m_j$ remain sparse; small elements are dropped, and several effective dropping strategies are discussed in [11]. Chow and Saad [11] also suggested using the already computed columns to precondition each linear system (self-preconditioning). Algorithm 2.1 is the minimal residual iteration with self-preconditioning.

Algorithm 2.1. Self-preconditioned minimal residual iteration [11].
 1. Start with $M = M_0$
 2. For outer iteration $= 1, 2, \ldots, n_o$, do
 3.    For each column $j = 1, 2, \ldots, n$, do
 4.       Define $s := m_j = M e_j$
 5.       For inner iteration $= 1, 2, \ldots, n_i$, do
 6.          $r := e_j - As$
 7.          $z := Mr$
 8.          $q := Az$
 9.          $\alpha := (r, q)/(q, q)$
10.          $s := s + \alpha z$
11.          Apply numerical dropping to $s$
12.       End do
13.       Update the $j$th column of $M$: $m_j := s$
14.    End do
15. End do

There are several parameters to be chosen in Algorithm 2.1 and the optimal choices for these parameters are certainly problem dependent. For difficult problems, the performance of the algorithm depends heavily on the correct choice of these parameters. However, to use the sparse approximate inverse algorithm as a subroutine, it is convenient to use a fixed set of parameters. Thus, our implementation chooses the following parameters. We use GMRES [30] with self-preconditioning as the minimal residual iteration method. The initial guess is $M_0 = \alpha I$ with $\alpha = 1/\|AA^T\|_1$. The number of outer iterations is 2 and that of inner iterations is 3. The threshold dropping tolerance is set to be the same as that used in BILUM. We rely on the parameter lfil, the only one that we may adjust in our calling routine, to control the sparsity of $M$. We remark that the use of self-preconditioning increases the cost of each iteration substantially.
However, we found that the algorithm with self-preconditioning is much more robust than the one without. Furthermore, the use of self-preconditioning reduces the numbers of outer and inner iterations. The overall costs of using or not using self-preconditioning are not much different for producing sparse approximate inverses of similar quality for moderate problems. For difficult problems, we found that self-preconditioning is sometimes necessary in order to compute a useful sparse approximate inverse. Although this remark is in agreement with the original motivation for self-preconditioning [11], we mention that self-preconditioning was not found to be very viable in [3]. Our explanation is that different test problems are used in [3], and it is not surprising that different observations may be made with a method that allows several parameters to be chosen.
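The steps of Algorithm 2.1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses dense storage instead of the sparse mode described above, it uses the plain minimal residual update rather than GMRES, and the dropping rule (keep the `lfil` largest entries of each column) is a simplified stand-in for the threshold-based strategies of [11]. The function names and the tridiagonal test matrix are hypothetical.

```python
import numpy as np

def drop_small(s, lfil):
    """Keep only the lfil largest-magnitude entries of s (simplified dropping)."""
    if np.count_nonzero(s) <= lfil:
        return s
    keep = np.argsort(np.abs(s))[-lfil:]
    out = np.zeros_like(s)
    out[keep] = s[keep]
    return out

def self_preconditioned_mr(A, n_outer=2, n_inner=3, lfil=5):
    """Dense-storage sketch of Algorithm 2.1; returns M approximating inv(A)."""
    n = A.shape[0]
    I = np.eye(n)
    M = I / np.linalg.norm(A @ A.T, 1)         # M0 = alpha * I, alpha = 1/||A A^T||_1
    for _ in range(n_outer):
        for j in range(n):
            s = M[:, j].copy()                 # step 4
            for _ in range(n_inner):
                r = I[:, j] - A @ s            # step 6: column residual
                z = M @ r                      # step 7: self-preconditioning
                q = A @ z                      # step 8
                qq = q @ q
                if qq == 0.0:                  # residual already zero
                    break
                s = s + (r @ q) / qq * z       # steps 9-10: minimal residual step
                s = drop_small(s, lfil)        # step 11
            M[:, j] = s                        # step 13
    return M

# Hypothetical test: tridiag(-1, 4, -1) of order 8.
n = 8
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M0 = np.eye(n) / np.linalg.norm(A @ A.T, 1)
M = self_preconditioned_mr(A, n_outer=2, n_inner=3, lfil=5)
res_before = np.linalg.norm(np.eye(n) - A @ M0, "fro")
res_after = np.linalg.norm(np.eye(n) - A @ M, "fro")
print(res_before, res_after)
```

Without dropping, each minimal residual step cannot increase the column residual, whatever the current $M$ is; the dropping in step 11 trades some of that reduction for sparsity.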
3. Multilevel block ILU preconditioning

Most multilevel preconditioning techniques (multilevel methods) explicitly or implicitly exploit the concept of independent set ordering [15]. A block independent set is defined as a set of groups (blocks) of unknowns such that there is no coupling between unknowns of any two different groups (blocks) [33]. The traditional point (scalar) independent set is considered as a block independent set with blocks of uniform size 1. There are various heuristic strategies that may be used to find a block independent set with different properties [29,33]. We may order the unknowns associated with the independent set first, followed by the other unknowns. We use $\alpha$ as the level reference and assume $0 \le \alpha \le L$ for a certain integer $L$. The permutation matrix $P_\alpha$, associated with such an ordering, transforms the original matrix into a matrix which has a two by two block structure

$$A_\alpha \sim P_\alpha A_\alpha P_\alpha^T = \begin{pmatrix} D_\alpha & F_\alpha \\ E_\alpha & C_\alpha \end{pmatrix}, \tag{5}$$
where $D_\alpha$ is a block diagonal matrix of dimension $m_\alpha$, and $C_\alpha$ is a square matrix of dimension $n_\alpha - m_\alpha$. Note that $A_0 = A$. For simplicity, we denote both the permuted and unpermuted matrices by $A_\alpha$. For the consideration of load balancing in parallel computations, it is desirable that the sizes of the independent blocks are uniform, but this is not a necessary requirement for the definition of the independent set or for the techniques described in this paper to work properly.

In algebraic multilevel preconditioning techniques, the reduced systems are recursively constructed as the Schur complement with respect to either $D_\alpha$ or $C_\alpha$. In the case of BILUM [29,33], such a construction amounts to performing a block LU factorization of the form

$$\begin{pmatrix} D_\alpha & F_\alpha \\ E_\alpha & C_\alpha \end{pmatrix} = \begin{pmatrix} I_\alpha & 0 \\ E_\alpha D_\alpha^{-1} & I_\alpha \end{pmatrix} \begin{pmatrix} D_\alpha & F_\alpha \\ 0 & A_{\alpha+1} \end{pmatrix}, \tag{6}$$
where $A_{\alpha+1} = C_\alpha - E_\alpha D_\alpha^{-1} F_\alpha$ is the Schur complement with respect to $C_\alpha$ and $I_\alpha$ is the generic identity matrix on level $\alpha$. Note that $n_{\alpha+1} = n_\alpha - m_\alpha$. $D_\alpha = \mathrm{diag}(D_{\alpha,1}, \ldots, D_{\alpha,l})$ is a block diagonal matrix. Some dropping strategies are used to control the amount of fill-in by discarding certain elements of small absolute values, or, in addition, limiting the total number of elements allowed in each row of the $L$ and $U$ factors [29,31,33]. The two strategies are referred to as the single dropping strategy and the double dropping strategy, respectively. The resulting multilevel block ILU factorization is then used as a preconditioner in a Krylov subspace method based iterative solver.

In the implementation of BILUM in [33], the block diagonals $D_\alpha$ consist of small size blocks. These small blocks are usually dense and an exact inverse technique is used to compute $D_\alpha^{-1}$ by inverting each small block independently (in parallel). In [32], a regularized inverse technique based on singular value decomposition is used to invert the (potentially near singular) blocks approximately. As we noted in the introduction, such direct inversion strategies usually produce dense inverse matrices even if the original blocks are large and highly sparse.

Most multilevel methods construct a sparse approximate inverse $M_\alpha$ with respect to $D_\alpha$ using some special features of $D_\alpha$. We propose an implementation that uses the SAI algorithm to compute a sparse approximate inverse $M_{\alpha,i}$ for each block $D_{\alpha,i}$, and we use $M_\alpha = \mathrm{diag}(M_{\alpha,1}, \ldots, M_{\alpha,l})$ as an SAI for $D_\alpha$. Note that

$$\|I_\alpha - D_\alpha M_\alpha\|_F^2 = \sum_{j=1}^{l} \|I_{\alpha,j} - D_{\alpha,j} M_{\alpha,j}\|_F^2.$$
This implementation is at least computationally more sound than directly computing an SAI of $D_\alpha$, in which case the exact sparsity pattern of $D_\alpha^{-1}$ may not be captured by the dynamic sparsity search scheme used in our SAI algorithm. We know that the elements of $D_\alpha^{-1}$ are restricted to the diagonal blocks of $D_\alpha$, and the individual computation of each SAI implicitly uses this a priori sparsity information. It follows that our decoupled computation does not destroy any existing coupling of $D_\alpha^{-1}$ between blocks, because there is no such coupling in the exact inverse. Furthermore, the computation of each SAI can be carried out in parallel, and this is the reason that we do not require the SAI algorithm itself to be parallel, except for the construction of the SAI for the last reduced system. This observation gives one more reason to use self-preconditioning. In addition, if $M_\alpha$ is computed on serial computers, it may be possible to use the previously computed $M_{\alpha,j}$ as the initial guess for computing $M_{\alpha,j+1}$. Hence, the factorization (6) is replaced by

$$\begin{pmatrix} D_\alpha & F_\alpha \\ E_\alpha & C_\alpha \end{pmatrix} \approx \begin{pmatrix} I_\alpha & 0 \\ E_\alpha M_\alpha & I_\alpha \end{pmatrix} \begin{pmatrix} M_\alpha^{-1} & F_\alpha \\ 0 & A_{\alpha+1} \end{pmatrix} = L_\alpha U_\alpha, \tag{7}$$
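where the approximate Schur complement is $A_{\alpha+1} = C_\alpha - E_\alpha M_\alpha F_\alpha$. The use of $M_\alpha^{-1}$ is just for notational convenience and we only need $M_\alpha$ in the preconditioning process. Usually, an additional dropping strategy may be applied to $A_{\alpha+1}$ to keep the desired sparsity of the preconditioner. Fig. 1 is an illustration of one level BILUM factorization with sparse approximate inverse.

The link between algebraic multigrid methods and BILUM has been investigated in [32]. Suppose we rewrite the factorization (7) as

$$\begin{pmatrix} I_\alpha & 0 \\ E_\alpha M_\alpha & I_\alpha \end{pmatrix} \begin{pmatrix} M_\alpha^{-1} & 0 \\ 0 & A_{\alpha+1} \end{pmatrix} \begin{pmatrix} I_\alpha & M_\alpha F_\alpha \\ 0 & I_\alpha \end{pmatrix} \tag{8}$$

with the assumption that $M_\alpha = D_\alpha^{-1}$. Corresponding to the factorization (8), we may define an interpolation operator as in an algebraic multigrid method in the form

$$I_{\alpha+1}^{\alpha} = \begin{pmatrix} -M_\alpha F_\alpha \\ I_\alpha \end{pmatrix}$$

and the restriction operator

$$I_{\alpha}^{\alpha+1} = \begin{pmatrix} -E_\alpha M_\alpha & I_\alpha \end{pmatrix}.$$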
Fig. 1. An illustration of one level BILUM factorization with sparse approximate inverse.
Then we have the following results, which are similar to those of the algebraic multigrid method.

Proposition 3.1. If the factorization (8) exists and is exact, i.e., $M_\alpha = D_\alpha^{-1}$, then the following relations are satisfied:
1. The coarse level operator satisfies

$$A_{\alpha+1} = I_{\alpha}^{\alpha+1} A_\alpha I_{\alpha+1}^{\alpha}; \tag{9}$$

2. If, in addition, the matrix $A_\alpha$ is symmetric, then

$$I_{\alpha}^{\alpha+1} = \bigl(I_{\alpha+1}^{\alpha}\bigr)^T. \tag{10}$$

Proof. Since $M_\alpha = D_\alpha^{-1}$, the proof follows by direct verification, or see [32].
The relation (9) is called the Galerkin condition and is used in finite element and algebraic multigrid methods to generate coarse grid operators [37]. Different algebraic multigrid methods differ in the way that the intergrid transfer operators are defined. In BILUM, these operators are naturally defined based on the matrix factorization; in other methods they are defined according to certain heuristic arguments.
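The "direct verification" behind Proposition 3.1 can be reproduced numerically. The sketch below is an illustration on small dense random blocks, with a diagonal $D$ (point independent set) chosen only for brevity; all sizes and matrices are assumptions. It checks the exact block factorization (6), the Galerkin condition (9), and the transpose relation (10) in the symmetric case $E = F^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, nc = 4, 3                                   # sizes of the D and C blocks
D = np.diag(rng.uniform(2.0, 3.0, size=m))     # block diagonal D (1x1 blocks here)
F = rng.standard_normal((m, nc))
E = rng.standard_normal((nc, m))
C = rng.standard_normal((nc, nc)) + 5.0 * np.eye(nc)
A = np.block([[D, F], [E, C]])

M = np.linalg.inv(D)                           # exact case M = D^{-1}
A_next = C - E @ M @ F                         # Schur complement A_{alpha+1}

# Block LU factors as in (6).
L = np.block([[np.eye(m), np.zeros((m, nc))], [E @ M, np.eye(nc)]])
U = np.block([[D, F], [np.zeros((nc, m)), A_next]])

# Interpolation and restriction operators built from the factorization.
P = np.vstack([-M @ F, np.eye(nc)])
R = np.hstack([-E @ M, np.eye(nc)])
galerkin = R @ A @ P                           # should reproduce A_next, Eq. (9)

# Symmetric case E = F^T: restriction is the transpose of interpolation, Eq. (10).
R_sym = np.hstack([-F.T @ M, np.eye(nc)])

print(np.allclose(L @ U, A), np.allclose(galerkin, A_next), np.allclose(R_sym, P.T))
```

When $M$ is only an approximate inverse, the same operators still yield a coarse matrix, but the identity with $C - EMF$ in (9) no longer holds exactly.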
4. BILUM and SAI

The multilevel block ILU preconditioner (BILUM) is based on the ILU factorization (7). On each level $\alpha$, a block ILU factorization is performed and an approximate reduced system $A_{\alpha+1}$ is formed as in (7). The whole process (finding a block independent set, permuting the matrix, performing the ILU factorization) is repeated with respect to $A_{\alpha+1}$, recursively. The recursion is stopped when the last reduced system $A_L$ is small enough that an efficient SAI can be constructed. Then the SAI algorithm is used to construct a sparse approximate inverse $M_L$ for $A_L$. However, we do not store any reduced systems on any level, including the last one. Instead, we store a sparse matrix

$$\begin{pmatrix} M_\alpha & F_\alpha \\ E_\alpha & 0 \end{pmatrix} \quad \text{for } 0 \le \alpha \le L - 1,$$
and $M_L$. In fact, the product $E_\alpha M_\alpha$ as in (7) is only computed for the approximate Schur complement and is not stored. The action of $E_\alpha M_\alpha$ in $L_\alpha$ during the preconditioning process is recovered by applying $M_\alpha$ to a vector, followed by applying $E_\alpha$. Interestingly enough, a similar storage scheme was also used in the context of fast wavelet transforms [4].

The approximate solution on the last level is obtained by applying one operation of $M_L$ of the last reduced system to the corresponding vector. This is different from the implementation of BILUM in [33], where the last reduced system is solved to a certain accuracy by a Krylov subspace method preconditioned by ILUT for the last reduced system. Thus an inner-outer iteration scheme is employed, and this necessitates the use of a flexible variant of GMRES, i.e., FGMRES [27], which allows variable preconditioners. A parallelizable approach using a multicoloring strategy is proposed in [29]. But our method seems to be fully parallelizable and does not need to use FGMRES.

Suppose the right hand side vector $b$ and the solution vector $x$ are partitioned according to the independent set ordering as in (5); we would then have, on each level,

$$x_\alpha = \begin{pmatrix} x_{\alpha,1} \\ x_{\alpha,2} \end{pmatrix}, \qquad b_\alpha = \begin{pmatrix} b_{\alpha,1} \\ b_{\alpha,2} \end{pmatrix}.$$
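The forward elimination is performed by solving for a temporary vector $y_\alpha$, i.e., for $\alpha = 0, 1, \ldots, L - 1$, by solving

$$\begin{pmatrix} I_\alpha & 0 \\ E_\alpha M_\alpha & I_\alpha \end{pmatrix} \begin{pmatrix} y_{\alpha,1} \\ y_{\alpha,2} \end{pmatrix} = \begin{pmatrix} b_{\alpha,1} \\ b_{\alpha,2} \end{pmatrix}$$

with

(F1): $y_{\alpha,1} = M_\alpha b_{\alpha,1}$,
(F2): $y_{\alpha,2} = b_{\alpha,2} - E_\alpha y_{\alpha,1}$.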
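Note that we separate the applications of $M_\alpha$ and $E_\alpha$ since we do not have the product matrix $E_\alpha M_\alpha$, as discussed above. We then compute an approximate solution for the last reduced system as $x_L = M_L b_L$. A backward substitution is performed to obtain the solution by solving, for $\alpha = L - 1, \ldots, 1, 0$,

$$\begin{pmatrix} M_\alpha^{-1} & F_\alpha \\ 0 & I_\alpha \end{pmatrix} \begin{pmatrix} x_{\alpha,1} \\ x_{\alpha,2} \end{pmatrix} = \begin{pmatrix} y_{\alpha,1} \\ y_{\alpha,2} \end{pmatrix}$$

with

(B1): $x_{\alpha,2} = y_{\alpha,2}$,
(B2): $x_{\alpha,1} = M_\alpha (y_{\alpha,1} - F_\alpha x_{\alpha,2})$.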
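The forward elimination, coarsest level solve, and backward substitution can be sketched as one short routine. This is a hedged NumPy illustration, not the author's program: the function name `apply_bilum_sai` and the list-of-triples storage are hypothetical, permutations are omitted, and dense arrays stand in for sparse ones. The sketch keeps $b_{\alpha,1}$ for the backward step and computes $x_{\alpha,1} = M_\alpha(b_{\alpha,1} - F_\alpha x_{\alpha,2})$, which is the solve implied by the block upper triangular factor in (7); the temporary `t` plays the role of the vector in (F1). In the test, exact inverses replace the SAIs so that the result can be compared with a direct solve.

```python
import numpy as np

def apply_bilum_sai(levels, M_last, b):
    """Apply a multilevel preconditioner of the form (7) to the vector b.

    levels: list of (M, E, F) triples, one per level, already permuted.
    M_last: approximate inverse of the last reduced system.
    """
    stack = []
    for M, E, F in levels:                  # forward elimination, level by level
        m = M.shape[0]
        b1, b2 = b[:m], b[m:]
        t = M @ b1                          # apply M first ...
        stack.append((M, F, b1))
        b = b2 - E @ t                      # ... then E; E M is never formed
    x = M_last @ b                          # one application of M_L
    for M, F, b1 in reversed(stack):        # backward substitution
        x1 = M @ (b1 - F @ x)
        x = np.concatenate([x1, x])
    return x

# Hypothetical two-level test with exact inverses standing in for the SAIs.
rng = np.random.default_rng(2)
n = 9
A0 = rng.standard_normal((n, n)) + n * np.eye(n)
levels, A = [], A0
for m in (4, 2):                            # sizes of the leading blocks
    D, F, E, C = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    M = np.linalg.inv(D)
    levels.append((M, E, F))
    A = C - E @ M @ F                       # reduced system for the next level
M_last = np.linalg.inv(A)
b = rng.standard_normal(n)
x = apply_bilum_sai(levels, M_last, b)
print(np.allclose(x, np.linalg.solve(A0, b)))
```

Every operation in the routine is a matrix-vector product or a vector update, so with sparse approximate inverses in place of the exact ones the cost per application is governed by the number of stored nonzeros.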
The preconditioned iteration process is reminiscent of a multigrid V-cycle algorithm [5]; see Fig. 2. A Krylov subspace iteration is performed on the finest level, acting like a smoother; the residual is then transferred level by level to the coarsest level, where one application of $M_L$ is used to yield an approximate solution.

In the above discussion of the solution procedure, we omitted the permutations and inverse permutations that must be performed before and after operations on each level. This is also the approach that we used in our current computer programs (and in BILUM [33]). On the other hand, we may permute the matrices on each level at the construction phase. In this case, only the global permutation is needed before and after the application of the preconditioner [29]. The difference between permuting matrices (prepermutation) and permuting vectors (postpermutation) is a tradeoff between the preprocessing (construction) and solution (preconditioning) costs. Permuting a matrix costs much more than permuting a vector, but the former is done in the preprocessing phase and the latter in the solution phase. However, vector permutation and inverse permutation are required in all cases at least on the finest level. For a domain decomposition based implementation of BILUM with large size blocks (large independent set), most nodes have been factored out during the first few reductions. The added cost of postpermutation for coarse level (short) vectors is minimal and the overall cost of postpermutation is insignificant.

Fig. 2. A BILUM preconditioned Krylov subspace solver.

Note that all parts of the computation are just matrix-vector operations or vector updates and are therefore fully parallelizable within each level; i.e., the method does not suffer from the sequential drawback of the traditional ILU type factorization techniques.

If we presparsify the matrix $A$ so that the number of nonzero elements in each row is not more than $p$, and if the double dropping strategy is used to control the sparsity of the reduced systems so that each row of the reduced systems has at most $p$ nonzero elements, we have an upper bound on the number of nonzero elements of the resulting preconditioner.

Proposition 4.1. Let $m_\alpha$ be the size of the independent set on level $\alpha$ and $k$ be the maximum number of nonzero elements kept in each row by the SAI algorithm. Let the number of nonzero elements of each row of all $A_\alpha$, $\alpha = 0, 1, \ldots, L$, and $M_L$ not exceed $p$. Then the number of nonzero elements of BILUM with SAI and $L$ levels of reduction is bounded by

$$(p + k)n - k m_L + p \sum_{\alpha=1}^{L} m_\alpha. \tag{11}$$
Proof. The proof is similar to that given in [34] and thus omitted.
We mention in passing that several authors have proposed to incorporate SAI techniques in two-level methods based on the Schur complement implementation [9,18,23]. In particular, Tang [35] proposed an interesting two-level method using an SAI technique. Let $A$ be permuted by a generalized red-black ordering into a two by two block form

$$A = \begin{pmatrix} D & F \\ E & C \end{pmatrix}.$$

The inverse of this reordered block matrix is

$$A^{-1} = \begin{pmatrix} D^{-1}(I + F A_1^{-1} E D^{-1}) & -D^{-1} F A_1^{-1} \\ -A_1^{-1} E D^{-1} & A_1^{-1} \end{pmatrix}, \tag{12}$$
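where $A_1 = C - E D^{-1} F$ is again the Schur complement. In [35] a local SAI, $\tilde D^{-1}$ of $D$, is computed and is used to compute an approximation $\tilde A_1$ of the Schur complement $A_1$. Then a local SAI $\tilde A_1^{-1}$ of $\tilde A_1$ is computed. The globally coupled local inverse preconditioner is then defined as

$$M = \tilde A^{-1} = \begin{pmatrix} \tilde D^{-1}(I + F \tilde A_1^{-1} E \tilde D^{-1}) & -\tilde D^{-1} F \tilde A_1^{-1} \\ -\tilde A_1^{-1} E \tilde D^{-1} & \tilde A_1^{-1} \end{pmatrix}. \tag{13}$$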
For matrices arising from standard finite difference discretization or low order finite element discretization of Laplace type operators, $A$ is usually bipartite and can be divided by two colors. In these cases, $D$ is a diagonal matrix and the computation of $\tilde A_1$ is cheap. Given a vector $x = (x_1^T, x_2^T)^T$, $x_1$ contains the elements for the fine grid nodes (black) and $x_2$ contains the elements for the coarse grid nodes (red). The preconditioning step $Mx$ can be performed as

$$Mx = M \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} y - \tilde D^{-1} F z \\ z \end{pmatrix},$$
where $y = \tilde D^{-1} x_1$ and $z = \tilde A_1^{-1}(x_2 - E y)$. In fact, this looks similar to a simple two-level ILUM [29] with the product $U^{-1} L^{-1}$ computed as $\tilde A^{-1}$ and with the coarsest (second) level solution approximated by the one from an SAI technique. To be more clear, the factored inverse of (6) is (without the level index $\alpha$)

$$\begin{pmatrix} D^{-1} & -D^{-1} F A_1^{-1} \\ 0 & A_1^{-1} \end{pmatrix} \begin{pmatrix} I & 0 \\ -E D^{-1} & I \end{pmatrix}, \tag{14}$$
and (12) is just a different version with (14) being multiplied out, which transforms a factored inverse into a standard inverse. Tang's tests demonstrated the advantage of the two-level method over the standard implementation of SAI [35]. Since a factored inverse usually contains more information than a nonfactored one with similar storage cost [11], it is not clear whether the computed product is better than the factored form.
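The relation between the standard inverse (12), the factored inverse (14), and the two-step preconditioning operation can be checked on a small example. The sketch below uses exact inverses in place of the local SAIs $\tilde D^{-1}$ and $\tilde A_1^{-1}$, and random dense blocks with a diagonal $D$; these choices are illustrative assumptions. The sign convention used is $z = A_1^{-1}(x_2 - Ey)$, which is the one that makes the result equal to the product of the block matrix in (12) with $x$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, nc = 4, 3
D = np.diag(rng.uniform(2.0, 3.0, size=m))     # diagonal D, as for two-color orderings
F = rng.standard_normal((m, nc))
E = rng.standard_normal((nc, m))
C = rng.standard_normal((nc, nc)) + 6.0 * np.eye(nc)
A = np.block([[D, F], [E, C]])

Dinv = np.linalg.inv(D)
A1inv = np.linalg.inv(C - E @ Dinv @ F)        # inverse of the Schur complement A_1

# Standard (multiplied-out) inverse, Eq. (12).
standard = np.block([
    [Dinv @ (np.eye(m) + F @ A1inv @ E @ Dinv), -Dinv @ F @ A1inv],
    [-A1inv @ E @ Dinv, A1inv],
])

# Factored inverse, Eq. (14): upper factor times lower factor.
upper = np.block([[Dinv, -Dinv @ F @ A1inv], [np.zeros((nc, m)), A1inv]])
lower = np.block([[np.eye(m), np.zeros((m, nc))], [-E @ Dinv, np.eye(nc)]])
factored = upper @ lower

# Preconditioning step applied to a vector without forming the product.
x = rng.standard_normal(m + nc)
x1, x2 = x[:m], x[m:]
y = Dinv @ x1
z = A1inv @ (x2 - E @ y)
Mx = np.concatenate([y - Dinv @ F @ z, z])

print(np.allclose(standard, np.linalg.inv(A)),
      np.allclose(factored, standard),
      np.allclose(Mx, standard @ x))
```

With exact inverses the two forms coincide; once sparse approximations and dropping are introduced they generally differ, which is the point of the closing remark above.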
5. Numerical experiments

Standard implementations of BILUM have been described in detail in [29,33]. We used the SAI algorithm of Chow and Saad [11] to replace the direct inverse subroutine of BILUM. The BILUM-SAI implementation is different from those reported in [29,33] in the sense that we do not have an inner iteration process to solve the last reduced system. Thus the preconditioner is a fixed approximate solution process. Unless otherwise indicated explicitly, we used these default parameters for our preconditioned iterative solver: GMRES with a restart value of 50 was used as the accelerator; the maximum number of reductions allowed was 10, i.e., $L = 10$; the reduction process also stopped when the size of the reduced system on a certain level was smaller than the uniform block size; the threshold dropping tolerance was $\tau = 10^{-4}$. The blocks are searched by a greedy algorithm [33]. Starting with a seed node, a given number of nearest neighboring nodes are grouped together. Only uniform blocks are formed on all levels.

All matrices were considered general sparse and nonsymmetric. The right hand side was generated by assuming that the solution is a vector of all ones, and the initial guess was a vector of some random numbers. The computations were terminated when the 2-norm of the residual was reduced by a factor of $10^7$. We also set an upper bound of 100 for the GMRES iteration. The numerical experiments were conducted on a Silicon Graphics workstation.

In all tables with numerical results, bsize is the size of the uniform blocks, iter shows the number of preconditioned GMRES iterations, prec shows the CPU time in seconds for the preprocessing (preconditioner construction) phase, solu shows the CPU time for the solution phase, total shows the sum of prec and solu, levl shows the actual number of reductions, and spar shows the sparsity ratio, which is the ratio of the number of nonzero elements of the preconditioner to that of the original matrix.
The symbol "-" indicates lack of convergence. In addition, $p$ and $\tau$ are used in the dropping strategies, and lfil controls the number of elements kept in each block of size bsize.

5.1. Convection-diffusion problem

We first consider a convection-diffusion problem

$$-u_{xx} - u_{yy} - \mathrm{Re}\,(\sin x \cos y \, u_x - \cos x \sin y \, u_y) = 0, \tag{15}$$
defined on the unit square. Here Re is the so-called Reynolds number. A Dirichlet boundary condition was assumed. For the discretization of Eq. (15), we used a 9-point fourth-order compact finite difference discretization scheme with a uniform mesh $h = 1/101$ [20]. The same test problem and discretization scheme have been used in [33] to evaluate BILUM. The resulting matrices with different values of Re have 10,000 unknowns and 88,804 nonzeros, and are referred to as 9-point matrices for convenience. The percentage of diagonal dominance of the rows and columns of the matrices becomes smaller as Re increases [39,40].

Our first test is to give an overview of the performance of incomplete LU factorization (ILUT), sparse approximate inverse (SAI), and multilevel block ILU with sparse approximate inverse (BILUM-SAI) as preconditioners for GMRES. For ILUT and BILUM-SAI, the double dropping strategy was enforced. Table 1 gives some performance comparison of the preconditioners with a fixed sparsity ratio of approximately 4.5 for solving the 9-point matrices with different Re. In this particular case, the maximum number of iterations for SAI was set to be 5000. We find that the performance of SAI deteriorated rapidly
Table 1
Comparison of preconditioners for solving the 9-point matrices with different Re

             ILUT                  SAI                     BILUM-SAI
Re         iter  prec  solu     iter   prec   solu      iter  prec  solu  levl
0           10   0.88  0.95       85   32.9   7.30       14   9.94  1.58    6
1           10   0.88  0.95       98   33.7   8.72       14   9.85  1.58    8
10          11   0.87  1.06      174   32.4   15.2       16   9.91  1.83    7
100         10   0.82  0.92      189   32.3   16.6       17   9.88  1.95    8
1000         7   0.80  0.65      169   32.4   14.7        9   9.76  0.97   10
10000       14   0.94  1.35     1710   34.9   156.2       9   9.91  0.98    9
100000      19   0.99  1.89     >5000 iterations         25   11.1  2.96    9
1000000     21   1.00  2.88     >5000 iterations         28   9.96  3.37   10
as the symmetry of the matrices became worse. It did not converge for Re ≥ 10^5 (see footnote 2). ILUT was more efficient than BILUM-SAI in 6 out of 7 cases. The difference between ILUT and BILUM-SAI is not dramatic, and it is likely to be reversed if both preconditioners are implemented on parallel computers. However, the difference between SAI and BILUM-SAI is dramatic; the slight gain in parallelism from using SAI standing alone may not offset the big loss in convergence. It seems that BILUM-SAI offers a good compromise between efficiency and parallelism. In some cases, the actual number of levels of reduction was less than the maximum of 10. In parallel implementations, it may not be a good idea to allow too many levels of reduction [24,31].

We further tested BILUM-SAI with different dropping strategies and different block sizes for solving Eq. (15) with Re = 100. For the double dropping strategy, we kept at most the 20 largest elements (in absolute value) in each of the L and U factors and in each block. We varied the block size from 20 to 90; the corresponding percentages of block fill-in ranged from 100% to 22%. The results are given in Table 2. It can be seen that the iteration count increased and the sparsity ratio decreased as the block size increased (with fixed fill-in). This is expected, and we would like to have such flexibility in order to adjust the preconditioner to the configuration of a given machine. Except for the near full fill-in case, the double and single dropping strategies yielded comparable results. This seems to indicate that the overall sparsity of BILUM-SAI is affected by the sparsity of the SAIs of the blocks. Note that for the block size of 30, BILUM-SAI with the double dropping strategy did not converge, so the block size actually used was 31. This kind of behavior is not uncommon for parallelizable iterative solvers [31]. For a reason that would be difficult to determine, one or more local SAIs generated a very bad inverse. The difficulty disappeared when we used smaller or larger block sizes.

Table 3 compares BILUM-SAI with BILUM using different block sizes for solving a 9-point matrix with Re = 1000. Both BILUM-SAI and BILUM used the single dropping strategy. In the case of BILUM-SAI, we kept only about half of the elements in the SAIs of the blocks. We see that BILUM was much more efficient and faster, but used twice as much memory space as BILUM-SAI did.
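The dropping rule described above (keep at most p entries of largest absolute value) and the quoted fill-in percentages can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the experiments. The function names `keep_largest` and `block_fill_percent` are hypothetical, and it is assumed here that the rule is applied to one row at a time and that the fill-in percentage is p divided by the block size.

```python
def keep_largest(row, p):
    """Zero all but the p entries of largest absolute value in a dense row."""
    if p >= len(row):
        return list(row)
    # Indices ordered by decreasing magnitude; sorted() is stable, so ties
    # keep the entry with the smaller index.
    order = sorted(range(len(row)), key=lambda j: -abs(row[j]))
    kept = set(order[:p])
    return [v if j in kept else 0.0 for j, v in enumerate(row)]


def block_fill_percent(p, bsize):
    """Percentage of a bsize-by-bsize block retained when each row keeps
    at most p entries (assumed definition of the block fill-in)."""
    return 100.0 * min(p, bsize) / bsize


if __name__ == "__main__":
    print(keep_largest([0.5, -3.0, 0.1, 2.0, -0.05], 2))
    for bsize in (20, 90):
        print(bsize, round(block_fill_percent(20, bsize)))
```

With p = 20, a block of size 20 is kept in full and a block of size 90 retains about 22% of its entries, which matches the range quoted for Table 2.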
2 The performance of SAI may be improved by fine-tuning the parameters.
Table 2
Comparison of BILUM-SAI with different dropping strategies for solving a 9-point matrix with Re = 100

          Single dropping strategy        Double dropping strategy
bsize     iter  prec  solu  spar          iter  prec  solu  spar
20          22  19.4  2.74  6.46            26  5.35  2.49  3.27
30          31  9.68  3.42  4.34            34  6.67  3.31  3.05 (a)
40          42  7.79  4.32  3.14            43  7.20  4.30  2.86
50          42  8.45  4.30  3.12            43  7.76  4.25  2.82
60          53  8.40  5.43  2.80            54  7.92  5.43  2.73
70          76  8.53  7.28  2.65            76  8.32  7.32  2.63
80          85  8.72  8.21  2.62            86  8.41  8.29  2.60
90          99  8.76  9.80  2.56            99  8.56  9.87  2.56

(a) The block size is 31.
Table 3
Comparison of BILUM-SAI and BILUM with different block sizes and single dropping strategy for solving a 9-point matrix with Re = 1000

          BILUM-SAI                       BILUM
bsize     iter  prec  solu  spar          iter  prec  solu  spar
5           48  13.4  4.74  3.50             3  10.5  0.85  5.56
10          46  7.99  4.18  3.00             3  8.05  0.70  5.48
15          34  8.26  2.99  3.26             3  7.69  0.67  5.79
20          35  7.36  3.11  3.07             3  5.56  0.46  5.37
30          28  7.24  2.56  3.38             3  5.16  0.45  6.16
40          24  8.64  2.26  3.67             3  4.36  0.48  6.79
50          19  10.2  1.91  4.18             3  5.01  0.55  7.99
60          17  11.2  1.79  4.40             3  4.90  0.60  8.78
70          17  13.0  1.89  4.70             3  4.69  0.64  9.63
80          10  14.8  1.85  5.09             3  4.55  0.63  10.6
In fact, when the block sizes were larger than 50, BILUM overflowed the preallocated work array and we had to adjust memory allocation. This test shows the merits and drawbacks of both methods; e.g., if memory is a constraint, BILUM-SAI is more flexible.
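The memory comparison can be checked directly from the spar columns of Table 3. The short script below is only an arithmetic check; it assumes that the sparsity ratio is a fair proxy for the memory used by each preconditioner, and the variable names are illustrative.

```python
# Sparsity ratios (spar) copied from Table 3, block sizes 5 through 80.
bsize = [5, 10, 15, 20, 30, 40, 50, 60, 70, 80]
spar_bilum_sai = [3.50, 3.00, 3.26, 3.07, 3.38, 3.67, 4.18, 4.40, 4.70, 5.09]
spar_bilum = [5.56, 5.48, 5.79, 5.37, 6.16, 6.79, 7.99, 8.78, 9.63, 10.6]

# Ratio of BILUM storage to BILUM-SAI storage for each block size.
ratios = [b / s for b, s in zip(spar_bilum, spar_bilum_sai)]

for n, r in zip(bsize, ratios):
    print(f"bsize={n:3d}  BILUM / BILUM-SAI = {r:.2f}")
print(f"min = {min(ratios):.2f}, max = {max(ratios):.2f}")
```

The ratios lie between roughly 1.6 and 2.1 and grow with the block size, so "twice as much memory" is closest to the truth for the larger blocks.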
Table 4
BILUM-SAI and BILUM with blocks of size 100 for solving the FIDAP matrices

Matrices    Unknowns   Nonzeros   iter   solu   spar
FIDAP004       1,601     32,287     27   1.05   6.95
FIDAP006       1,651     49,479     50   2.09   3.92
FIDAP020       2,203     69,579     95   4.90   3.45
FIDAP024       2,283     48,733     38   1.86   4.95
FIDAP028       2,603     77,653     34   2.50   4.69
FIDAP029       2,870     23,754      2   0.06   4.90
FIDAP032       1,159     11,343     28   0.35   6.64
FIDAP036       3,079     53,851     39   2.14   4.53
FIDAPM08       3,876    103,076     99   9.28   4.80
FIDAPM33       3,876    103,076     79   2.98   6.94

iter (BILUM), for the seven matrices on which BILUM converged: 2 3 3 3 3 4 3
5.2. FIDAP matrices

The FIDAP matrices (footnote 3) were extracted from the test problems provided in the FIDAP package [14]. They model the incompressible Navier–Stokes equations. Several of these matrices contain small or zero diagonal values [10]. The zero diagonals are due to the incompressibility condition of the Navier–Stokes equations [10]. Many of them cannot be solved by the standard BILUM preconditioner. In some cases, even the construction of BILUM fails due to the occurrence of very ill-conditioned blocks. Nevertheless, some of them may be solved by an enhanced version of BILUM that uses a regularized inverse technique based on the singular value decomposition and variable block sizes [32].

In Table 4, we used BILUM-SAI with the single dropping strategy and a fixed block size of 100 to solve some FIDAP matrices. In the last column of Table 4, we also list the number of iterations for BILUM. We see that both BILUM-SAI and BILUM can solve some of these matrices. When BILUM converged, it usually did so in fewer iterations and used more memory space. We also note that there were three matrices that were not solved by BILUM, but were solved by BILUM-SAI. The results seem to support the view that SAI may deal with some matrices that are difficult for more standard preconditioning techniques. The sparsity ratios of BILUM-SAI are smaller than those of the enhanced BILUM [32].

5.3. WIGTO966 matrix

The WIGTO966 matrix (footnote 4) has 3,864 unknowns and 238,252 nonzeros. It comes from an Euler equation model and was supplied by L. Wigton from Boeing. It is solvable by ILUT with large values of p [7]. This matrix was also used to compare BILUM with ILUT in [31], and BILUTM with ILUT in [34],
3 All FIDAP matrices and the SAYLR4 matrix are available online from the MatrixMarket (http://math.nist.gov/MatrixMarket) of the National Institute of Standards and Technology.
4 The WIGTO966 matrix is available from the author.
Table 5
Solving the WIGTO966 matrix by BILUM-SAI and ILUT

BILUM-SAI
bsize     p   fill   iter   total   solu   spar
200      80     80     90    13.8   10.1   1.88
200      40     80     95    44.5   10.1   1.69
150      45     50     99    30.6   8.86   1.30
100      34     28     98    17.9   7.16   0.98
80       40     40     87    21.1   7.44   1.32
71       25     22     90    13.9   6.25   0.88
70       50     50     36    20.6   3.47   1.63
70       60     60     46    25.5   4.91   1.89
70       21     20     99    13.2   6.56   0.82
60       30     30     96    17.0   7.59   1.10

ILUT
p       τ        iter   total   solu   spar
400     10^-4      16    72.0   5.05   9.65
400     10^-3      18    52.8   5.14   8.57
360     10^-5      18    76.8   5.21   9.48
360     10^-4      20    68.7   5.64   9.17
360     10^-3      33    61.4   8.97   8.71
340     10^-5      28    76.6   7.91   9.14
340     10^-4      44    76.2   13.2   8.92
340     10^-3      42    59.5   10.9   8.14
320     10^-4      41    71.3   11.9   8.59
320     10^-3      39    54.2   10.5   7.90
and to test point and block preconditioning techniques in [8,10]. Since ILUT requires a very large amount of fill-in to converge, the WIGTO966 matrix is ideal for testing alternative preconditioners and for showing the least memory that is required for convergence. For example, BILUM (with GMRES(10)) was shown to be 6 times faster than ILUT with only one-third of the memory required by ILUT [31]. BILUTM (with GMRES(50)) converged almost 5 times faster and used just about one-fifth of the memory required by ILUT [34].

In our current tests, we chose several values of τ and p for ILUT and recorded some of the best results. For BILUM-SAI with the double dropping strategy, we varied the block size, p, and the block fill-in value fill; some of the best results are listed in Table 5. It shows that BILUM-SAI could converge with a sparsity ratio of 0.82, which is the smallest sparsity ratio that we know of for a preconditioner to solve WIGTO966. The smallest sparsity ratio that gave us convergence for ILUT is 7.90. Under these test conditions, BILUM-SAI spent less than a quarter of the total CPU time and used less than one-ninth of the memory that was required by ILUT.

5.4. RAEFSKY1 matrix

The RAEFSKY1 matrix (footnote 5) has 3,242 unknowns and 294,276 nonzeros. It is from an incompressible flow in a pressure driven pipe at time 50 and was supplied by H. Simon from Lawrence Berkeley National Laboratory (originally created by A. Raefsky from Centric Engineering). This matrix was also used by Gould and Scott [17] to test their SAI technique. Fig. 3 shows the convergence history of BILUM-SAI with different amounts of fill-in (a) and different block sizes (b). In both cases, the double dropping strategy with p = fill was used. For Fig. 3(a), we fixed the block size as 150 and varied the amount of
5 The RAEFSKY and VENKAT matrices are available online from the University of Florida sparse matrix collection [12] at http://www.cise.ufl.edu/davis/sparse.
Fig. 3. Convergence behavior of BILUM-SAI for solving the RAEFSKY1 matrix. (a) Different amounts of fill-in. (b) Different block sizes.
fill-in (fill) of the blocks; we found that more fill-in gave a better convergence rate. Fig. 3(b) compares the performance of BILUM-SAI as the block size became large while keeping the amount of fill-in in proportion. To this end, we kept only one-third of the elements in each block, i.e., bsize/fill = 3. We see that blocks of larger sizes gave better results. The behavior of BILUM-SAI for solving RAEFSKY1 seems desirable.

5.5. VENKAT01 matrix

The VENKAT01 matrix has 62,424 unknowns and 1,717,792 nonzeros. It is from an unstructured 2D Euler solver at time step 0 and was supplied by V. Venkatakrishnan from NASA. The test conditions were set to be similar to those above for RAEFSKY1. For Fig. 4(a), we fixed the block size as 240 and varied the amount of fill-in (fill) of the blocks. Once again, we found that more fill-in gave a better convergence rate in general. The exception was that when fill = 20, BILUM-SAI did not converge. The reason was probably the same as for the 9-point matrices that we reported earlier: the lack of robustness of SAI for individual blocks. Fig. 4(b) compares the performance of BILUM-SAI when we kept only a quarter of the elements in each block, i.e., bsize/fill = 4. The results are similar to those for RAEFSKY1, with better convergence rates for larger block sizes.

5.6. SAYLR4 matrix

The SAYLR4 matrix has 3,564 unknowns and 22,316 nonzeros. It is from a 3D oil reservoir simulation and was taken from the well-known Harwell–Boeing collections [13]. This test was to show the effect of matrix scaling on the performance of BILUM-SAI and SAI. Two matrix scaling strategies were tested. The first scaling strategy was performed with respect to the columns so that the 2-norm of each column is unity. The second scaling strategy was to scale the rows first and then scale the columns. For SAI, we
Fig. 4. Convergence behavior of BILUM-SAI for solving the VENKAT01 matrix. (a) Different amounts of fill-in. (b) Different block sizes.
Fig. 5. Convergence behavior of BILUM-SAI and SAI without matrix scaling for solving the SAYLR4 matrix.
chose fill = 80 and the resulting preconditioner has a sparsity ratio of 12.7. For BILUM-SAI, we used the single dropping strategy and fill = 40 with a block size of 60. The resulting BILUM-SAI preconditioner has a sparsity ratio of 4.6 (a slightly larger value was obtained with scaling). Fig. 5 indicates that BILUM-SAI without matrix scaling converged very fast, but SAI standing alone, at a much higher memory cost, did not converge. On the other hand, Fig. 6(a) shows that the convergence rates of both BILUM-SAI and SAI were hampered by the column scaling strategy. Fig. 6(b) shows that the convergence deterioration of BILUM-SAI and SAI with both column and row scalings is striking. Neither of them fully converged.
Fig. 6. Convergence behavior of BILUM-SAI and SAI with matrix scaling for solving the SAYLR4 matrix. (a) Column scaling was applied. (b) Both row and column scalings were applied.
This test demonstrates the uncertain consequence of matrix scaling strategies applied to preconditioning techniques.
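The two scaling strategies tested on SAYLR4 can be illustrated on a small dense matrix. The sketch below is not the code used in the experiments: the function names are hypothetical, the matrix is stored as a list of rows for simplicity, and the 2-norm is assumed for the row scaling as well, since only the column norm is specified in the text.

```python
import math


def scale_columns(a):
    """Return a copy of the dense matrix a (list of rows) in which every
    nonzero column has unit 2-norm."""
    n_rows, n_cols = len(a), len(a[0])
    norms = [math.sqrt(sum(a[i][j] ** 2 for i in range(n_rows)))
             for j in range(n_cols)]
    return [[a[i][j] / norms[j] if norms[j] > 0 else a[i][j]
             for j in range(n_cols)]
            for i in range(n_rows)]


def scale_rows(a):
    """Return a copy of a in which every nonzero row has unit 2-norm
    (the choice of the 2-norm for rows is an assumption)."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in a]
    return [[v / nrm if nrm > 0 else v for v in row]
            for row, nrm in zip(a, norms)]


def scale_rows_then_columns(a):
    """Second strategy: scale the rows first, then scale the columns."""
    return scale_columns(scale_rows(a))


if __name__ == "__main__":
    a = [[3.0, 0.0], [4.0, 2.0]]
    print(scale_columns(a))
    print(scale_rows_then_columns(a))
```

After the second strategy the columns have unit norm but the rows in general do not, because the final column scaling undoes part of the row scaling.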
6. Concluding remarks

We have studied the use of sparse approximate inverses with multilevel block ILU preconditioning techniques (BILUM) for solving general sparse matrices. The implementation strategy offers flexibility in controlling the amount of fill-in during the ILU factorization when large blocks are used in a domain decomposition based implementation of the multilevel method. Our numerical experiments with several matrices show that the proposed combination indeed demonstrates the anticipated flexibility and efficiency. As a parallelizable high accuracy preconditioner, BILUM-SAI is much more efficient than the standard SAI techniques. The advantage of BILUM-SAI over the standard BILUM is the savings in memory cost. We were able to construct very economical preconditioners to solve the WIGTO966 matrix.

This parallelizable preconditioner combines the efficiency and robustness of ILU preconditioning techniques with the parallelism inherited from the multilevel structure and sparse approximate inverse techniques. In fact, it offers a good compromise between efficiency and parallelism. When SAIs are used as local components in some global preconditioner, such as in the implementation of multilevel methods, the inherent parallelism in the preprocessing phase is not utilized. In this sense, any SAI technique can be used. The best choice is the one that is cheapest to construct, provided the candidates are of similar quality. A good candidate may be a factored approximate inverse of some kind [2,41].

BILUM-SAI outperformed SAI in robustness and efficiency, but may lose a certain degree of parallelism, depending on the implementation details and the architectures in question. There is no clear
single conclusion concerning the relative superiority of BILUM-SAI, BILUM, and ILUT. The degree of inherent parallelism decreases in that order, and none of them is absolutely better than the other two as a general purpose preconditioner. This result comes as no surprise. The general consensus is that there is unlikely to be a general purpose preconditioner that is superior for all types of problems. The task for researchers is to identify the particular preconditioners that are efficient for a large group of problems. In this respect, we see that BILUM-SAI offers an alternative strength that supplements BILUM and ILUT in dealing with certain types of problems which are more difficult for the latter to handle.

References
[1] R.N. Banerjee, M.W. Benson, An approximate inverse based multigrid approach to the biharmonic problem, Internat. J. Comput. Math. 40 (1991) 201–210.
[2] M. Benzi, M. Tuma, A sparse approximate inverse preconditioner for nonsymmetric linear systems, SIAM J. Sci. Comput. 19 (3) (1998) 968–994.
[3] M. Benzi, M. Tuma, A comparative study of sparse approximate inverse preconditioners, Appl. Numer. Math. 30 (2–3) (1999) 305–340.
[4] G. Beylkin, R. Coifman, V. Rokhlin, Fast wavelet transforms and numerical algorithms I, Comm. Pure Appl. Math. 44 (1991) 141–183.
[5] W.L. Briggs, A Multigrid Tutorial, SIAM, Philadelphia, PA, 1987.
[6] T.F. Chan, W.P. Tang, W.L. Wan, Wavelet sparse approximate inverse preconditioners, BIT 37 (3) (1997) 644–660.
[7] A. Chapman, Y. Saad, L. Wigton, High-order ILU preconditioners for CFD problems, Technical Report UMSI 96/14, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 1996.
[8] E. Chow, M.A. Heroux, An object-oriented framework for block preconditioning, ACM Trans. Math. Software 24 (2) (1998) 159–183.
[9] E. Chow, Y. Saad, Approximate inverse techniques for block-partitioned matrices, SIAM J. Sci. Comput. 18 (1997) 1657–1675.
[10] E. Chow, Y. Saad, Experimental study of ILU preconditioners for indefinite matrices, J. Comput. Appl. Math. 86 (2) (1997) 387–414.
[11] E. Chow, Y. Saad, Approximate inverse preconditioners via sparse-sparse iterations, SIAM J. Sci. Comput. 19 (3) (1998) 995–1023.
[12] T. Davis, University of Florida sparse matrix collection, NA Digest 97 (23) (June 7, 1997).
[13] I.S. Duff, R.G. Grimes, J.G. Lewis, Users' guide for the Harwell–Boeing sparse matrix collections, Technical Report TR/PA/92/86, CERFACS, Toulouse, France, 1992.
[14] M. Engelman, FIDAP: Examples Manual, Revision 6.0, Technical Report, Fluid Dynamics International, Evanston, IL, 1991.
[15] J.A. George, J.W. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, Englewood Cliffs, NJ, 1981.
[16] G.H. Golub, H.A. van der Vorst, Closer to the solution: iterative linear solvers, in: I.S. Duff and G.A. Watson (Eds.), The State of the Art in Numerical Analysis, Clarendon Press, Oxford, 1997, pp. 63–92.
[17] N.I.M. Gould, J.A. Scott, Sparse approximate-inverse preconditioners using norm-minimization techniques, SIAM J. Sci. Comput. 19 (2) (1998) 605–625.
[18] M. Grote, T. Huckle, Efficient parallel preconditioning with sparse approximate inverses, in: Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, 1994, pp. 466–471.
[19] M. Grote, T. Huckle, Parallel preconditioning with sparse approximate inverses, SIAM J. Sci. Comput. 18 (1997) 838–853.
[20] M.M. Gupta, R.P. Manohar, J.W. Stephenson, A single cell high order scheme for the convection-diffusion equation with variable coefficients, Internat. J. Numer. Methods Fluids 4 (1984) 641–651.
[21] T. Huckle, Sparse approximate inverses and multigrid methods, in: Sixth SIAM Conference on Applied Linear Algebra, Snowbird, October 29–November 1, 1997.
[22] L.Y. Kolotina, A.Y. Yeremin, Factorized sparse approximate inverse preconditioning I: theory, SIAM J. Matrix Anal. Appl. 14 (1993) 45–58.
[23] L.Y. Kolotina, A.Y. Yeremin, Factorized sparse approximate inverse preconditioning II: solution of 3D FE systems on massively parallel computers, Internat. J. High Speed Comput. 7 (1995) 191–215.
[24] M. Neytcheva, A. Padiy, M. Mellaard, K. Georgiev, O. Axelsson, Scalable and optimal iterative solvers for linear and nonlinear problems, Technical Report MRI 9613, Mathematical Research Institute, University of Nijmegen, The Netherlands, 1996.
[25] Y. Notay, Using approximate inverses in algebraic multilevel methods, Numer. Math. 80 (3) (1998) 397–417.
[26] A.A. Reusken, Approximate cyclic reduction preconditioning, Technical Report RANA 97-02, Department of Mathematics and Computing Science, Eindhoven University of Technology, The Netherlands, 1997.
[27] Y. Saad, A flexible inner–outer preconditioned GMRES algorithm, SIAM J. Sci. Statist. Comput. 14 (2) (1993) 461–469.
[28] Y. Saad, ILUT: a dual threshold incomplete LU preconditioner, Numer. Linear Algebra Appl. 1 (4) (1994) 387–402.
[29] Y. Saad, ILUM: a multi-elimination ILU preconditioner for general sparse matrices, SIAM J. Sci. Comput. 17 (4) (1996) 830–847.
[30] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing, New York, NY, 1996.
[31] Y. Saad, M. Sosonkina, J. Zhang, Domain decomposition and multi-level type techniques for general sparse linear systems, in: J. Mandel, C. Farhat and X.-C. Cai (Eds.), Domain Decomposition Methods 10, Contemporary Mathematics, Vol. 218, Amer. Math. Soc., Providence, RI, 1998, pp. 174–190.
[32] Y. Saad, J. Zhang, Enhanced multi-level block ILU preconditioning strategies for general sparse linear systems, Technical Report UMSI 98/98, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 1998.
[33] Y. Saad, J. Zhang, BILUM: block versions of multielimination and multilevel ILU preconditioner for general sparse linear systems, SIAM J. Sci. Comput. 20 (6) (1999) 2103–2121.
[34] Y. Saad, J. Zhang, BILUTM: a domain-based multilevel block ILUT preconditioner for general sparse matrices, SIAM J. Matrix Anal. Appl., to appear.
[35] W.-P. Tang, Towards an effective sparse approximate inverse preconditioner, SIAM J. Matrix Anal. Appl. 20 (4) (1999) 970–986.
[36] W.-P. Tang, W.L. Wan, Sparse approximate inverse smoother for multi-grid, Technical Report CAM 98-18, Department of Mathematics, University of California at Los Angeles, Los Angeles, CA, 1998.
[37] P. Wesseling, An Introduction to Multigrid Methods, Wiley, Chichester, England, 1992.
[38] J. Zhang, Preconditioned Krylov subspace methods for solving nonsymmetric matrices from CFD applications, Comput. Methods Appl. Mech. Engrg., to appear.
[39] J. Zhang, On convergence of iterative methods with a fourth-order compact scheme, Appl. Math. Lett. 10 (2) (1997) 49–55.
[40] J. Zhang, On convergence and performance of iterative methods with fourth-order compact schemes, Numer. Methods Partial Differential Equations 14 (1998) 262–283.
[41] J. Zhang, A sparse approximate inverse for parallel preconditioning of sparse matrices, Technical Report No. 281-98, Department of Computer Science, University of Kentucky, Lexington, KY, 1998.
