
Wavelet Based Deflation of Conjugate Gradient Method

J. Kruzik¹, D. Horak¹,²

¹ IT4Innovations National Supercomputing Center, VSB-Technical University of Ostrava, Czech Republic
² Department of Applied Mathematics, VSB-Technical University of Ostrava, Czech Republic

Abstract

This paper introduces a Krylov subspace deflation technique based on discrete wavelet compression. The technique rests on the observation that the deflation coarse problem matrix is closely related to a matrix obtained by a discrete wavelet transform. Thanks to this observation, we know exactly what the deflation space should look like. Moreover, we can assemble this space directly and cheaply. We showcase both the numerical and the performance aspects of our approach on the deflated conjugate gradient method. However, our findings should also be valid for other deflated Krylov subspace methods, such as GMRES or MINRES.

Keywords: deflation, projected preconditioning, conjugate gradient, deflated conjugate gradient, DCG, CG, wavelet compression, coarse problem, Krylov subspace

1 Introduction
The Conjugate Gradient (CG) method [1] is one of the most widely used iterative methods for the solution of

$$A x = b, \qquad (1)$$

where $A$ is a symmetric positive definite matrix of dimension $n$. The convergence of the CG method depends on the spectrum of the system matrix $A$ [2]. Let us denote the eigenvalues of $A$ as

$$\lambda_{\min} = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n = \lambda_{\max}.$$

Then the convergence rate is related to the condition number

$$\kappa(A) = \frac{\lambda_{\max}}{\lambda_{\min}}. \qquad (2)$$

If we could somehow improve the spectrum of $A$ by eliminating the $k$ eigenvalues closest to zero, we would get a potentially much improved condition number of the altered system $\tilde{A}$:

$$\kappa(\tilde{A}) = \frac{\lambda_{\max}}{\lambda_{k+1}}. \qquad (3)$$
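As a quick numerical illustration of (2) and (3), consider a hypothetical spectrum (the values below are made up for the example):

```python
# A tiny illustration of (2) and (3): removing the k eigenvalues closest
# to zero can improve the condition number dramatically.
import numpy as np

lam = np.array([1e-4, 1e-3, 1e-2, 0.5, 1.0, 2.0])  # hypothetical spectrum
k = 3
print(lam.max() / lam.min())  # kappa(A)       = 2e4
print(lam.max() / lam[k])     # kappa(A-tilde) = 4, after deflating k = 3
```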
The process of doing this is called deflation and is described in Section 2. In Section 3 we introduce a new deflation space based on discrete wavelet compression. Numerical results are presented in Section 4.

2 Deflation
Deflation for CG, also known as CG with projected preconditioning (CGPP or PPCG), was introduced independently in [3, 4, 5]. Deflation tries to improve the convergence of the CG method by projecting a set of a few badly behaving eigenvectors out of the search directions. This set can be written in matrix form as

$$W = [w_1, w_2, \dots, w_k] \in \mathbb{R}^{n \times k}, \quad k < n,$$

and is called the deflation space. Assuming that $W$ has full rank and that $\mathcal{W}$ is the subspace spanned by the columns of $W$, we can define the projector

$$P = I - W \left( W^T A W \right)^{-1} W^T A$$

onto the $A$-conjugate complement of $\mathcal{W}$.
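For concreteness, here is a small NumPy sketch that assembles $P$ for a random SPD matrix $A$ and a random full-rank deflation space $W$ (both hypothetical) and checks its defining properties; forming $P$ and the inverse explicitly is for illustration only.

```python
# A minimal sketch of the deflation projector P; illustrative sizes only.
import numpy as np

n, k = 8, 2
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # a random SPD test matrix
W = rng.standard_normal((n, k))      # a (generically) full-rank deflation space

# P = I - W (W^T A W)^{-1} W^T A projects onto the A-conjugate complement of W
WtAW = W.T @ A @ W                   # the coarse problem (CP) matrix
P = np.eye(n) - W @ np.linalg.solve(WtAW, W.T @ A)

assert np.allclose(P @ P, P)         # P is a projector
assert np.allclose(P @ W, 0)         # P annihilates the deflation space
assert np.allclose(W.T @ A @ P, 0)   # range(P) is A-conjugate to W
```

In practice $P$ is never formed explicitly; only its action is applied, with the CP matrix factorized once up front.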


The Deflated Conjugate Gradient (DCG) method is shown in Algorithm 2. The first difference from standard CG (Algorithm 1) is the following modification of the initial estimate:

$$x_0 = x_{-1} + W \left( W^T A W \right)^{-1} W^T r_{-1}.$$

This ensures that the first residual $r_0$ is orthogonal to $\mathcal{W}$. Moreover, the modified initial guess $x_0$ is an exact solution in $\mathcal{W}$. Therefore, we can restrict ourselves to finding the solution in the $A$-conjugate complement of $\mathcal{W}$, because by summing these solutions we obtain the solution on the whole space. In other words, we have

$$x_i \in x_{-1} + \operatorname{span}\{W, p_0, p_1, \dots, p_{i-1}\}. \qquad (4)$$

This restriction is achieved by using $P$ to project each search direction $p_i$ onto the $A$-conjugate complement of $\mathcal{W}$ (lines 5 and 12 of Algorithm 2),

$$p_{i+1} = P r_{i+1} + \beta_{i+1} p_i = r_{i+1} + \beta_{i+1} p_i - W \left( W^T A W \right)^{-1} W^T A r_{i+1}.$$

The subtraction in the previous equation can be thought of in several ways. The restriction was already mentioned. In the domain decomposition sense, the space $\mathcal{W}$ represents a coarse space, and $\left( W^T A W \right)^{-1}$ is known as the coarse problem (CP).

Algorithm 1: CG
Input: A, x_0, b
 1: r_0 = b - A x_0
 2: p_0 = r_0
 3: for i = 0, 1, ...:
 4:   s = A p_i
 5:   α_i = (r_i^T r_i) / (s^T p_i)
 6:   x_{i+1} = x_i + α_i p_i
 7:   r_{i+1} = r_i - α_i s
 8:   β_{i+1} = (r_{i+1}^T r_{i+1}) / (r_i^T r_i)
 9:   p_{i+1} = r_{i+1} + β_{i+1} p_i
Output: x_i

Algorithm 2: DCG
Input: A, x_{-1}, b, W
 1: P = I - W (W^T A W)^{-1} W^T A
 2: r_{-1} = b - A x_{-1}
 3: x_0 = x_{-1} + W (W^T A W)^{-1} W^T r_{-1}
 4: r_0 = b - A x_0
 5: p_0 = P r_0
 6: for i = 0, 1, ...:
 7:   s = A p_i
 8:   α_i = (r_i^T r_i) / (s^T p_i)
 9:   x_{i+1} = x_i + α_i p_i
10:   r_{i+1} = r_i - α_i s
11:   β_{i+1} = (r_{i+1}^T r_{i+1}) / (r_i^T r_i)
12:   p_{i+1} = P r_{i+1} + β_{i+1} p_i
Output: x_i

Another possibility is to see it as a correction step. This is because the method generates the same Krylov subspace as the standard CG (see Equation (4)), and therefore the error caused by leaving the deflation space out of the search directions has to be corrected.
The DCG method can also be preconditioned, see e.g. [6].
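As a concrete reference, below is a minimal dense NumPy sketch of Algorithm 2. The function name dcg and its parameters (rtol, maxit) are our own choices; the helper cp applies $W (W^T A W)^{-1} W^T$, i.e. the coarse problem solve, and the stopping test is the relative residual criterion used in Section 4. A production implementation would factorize the CP matrix once and use sparse operators.

```python
# A minimal sketch of Algorithm 2 (DCG); dense algebra, for illustration.
import numpy as np

def dcg(A, b, W, x_m1=None, rtol=1e-6, maxit=10000):
    n = A.shape[0]
    x = np.zeros(n) if x_m1 is None else x_m1.copy()
    WtAW = W.T @ A @ W                      # coarse problem (CP) matrix
    cp = lambda v: W @ np.linalg.solve(WtAW, W.T @ v)
    r = b - A @ x                           # r_{-1}
    r_m1_norm = np.linalg.norm(r)
    x = x + cp(r)                           # modified initial estimate x_0
    r = b - A @ x                           # r_0, orthogonal to span(W)
    p = r - cp(A @ r)                       # p_0 = P r_0
    rr = r @ r
    for i in range(maxit):
        if np.sqrt(rr) / r_m1_norm < rtol:  # ||r_i|| / ||r_{-1}|| < rtol
            break
        s = A @ p
        alpha = rr / (s @ p)
        x += alpha * p
        r -= alpha * s
        rr_new = r @ r
        beta = rr_new / rr
        p = (r - cp(A @ r)) + beta * p      # p_{i+1} = P r_{i+1} + beta_{i+1} p_i
        rr = rr_new
    return x
```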
The open question is how to obtain the deflation space efficiently. Computing, even approximately, the eigenvectors belonging to the smallest eigenvalues is generally very costly. When solving with multiple right-hand sides, it is possible to expand the deflation space with eigenvectors found as a by-product of earlier solves [7]. The same idea can be used to preserve already discovered information across restarts of GMRES [8].

3 Discrete Wavelet Compression

As mentioned in the previous section, it is hard to obtain the eigenvectors belonging to the smallest eigenvalues. Therefore, we need to look for alternatives. We show in the following paragraphs that a deflation space can be obtained in the form of a discrete wavelet compression basis.

Let us describe the idea of discrete wavelet compression using the fast wavelet transform (FWT) with the Haar wavelet [9, 10]. Assume that the input we want to compress is a matrix $A \in \mathbb{R}^{n \times n}$, and that the number of rows $n$ is divisible by 2. First, we create an orthonormal projector onto a scaling subspace with two values of $1/\sqrt{2}$ in each row, shifted by two positions against the previous row, i.e.

$$H_{1,n} = \frac{1}{\sqrt{2}}
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 0 & \cdots & 1 & 1
\end{bmatrix}
\in \mathbb{R}^{\frac{n}{2} \times n},$$

and a projector onto a wavelet subspace with the same structure but with $1/\sqrt{2}$ and $-1/\sqrt{2}$ in each row, i.e.

$$G_{1,n} = \frac{1}{\sqrt{2}}
\begin{bmatrix}
1 & -1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 0 & \cdots & 1 & -1
\end{bmatrix}
\in \mathbb{R}^{\frac{n}{2} \times n}.$$
Note that the number of columns of each projector is the same as the number of rows of the matrix that we want to compress, while the number of rows is just a half. The first index denotes the projector level and will be explained later; the second one gives the number of rows of the input matrix. Finally, we can create the transformation matrix

$$M_{1,n} = \begin{bmatrix} H_{1,n} \\ G_{1,n} \end{bmatrix} \in \mathbb{R}^{n \times n}.$$

Applying the transformation matrix $M_{1,n}$ from the left and its transpose from the right to the input matrix, we obtain

$$M_{1,n} A M_{1,n}^T =
\begin{bmatrix}
H_{1,n} A H_{1,n}^T & H_{1,n} A G_{1,n}^T \\
G_{1,n} A H_{1,n}^T & G_{1,n} A G_{1,n}^T
\end{bmatrix}
\in \mathbb{R}^{n \times n}.$$
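The following NumPy sketch mirrors the construction above: it assembles $H_{1,n}$ and $G_{1,n}$, stacks them into $M_{1,n}$, verifies that $M_{1,n}$ is orthonormal, and applies the transform to a sample input (a 1D Laplacian, chosen here only as a convenient test matrix).

```python
# A sketch of the level-1 Haar projectors and transformation matrix.
import numpy as np

def haar_level1(n):
    assert n % 2 == 0
    H = np.zeros((n // 2, n))
    G = np.zeros((n // 2, n))
    for i in range(n // 2):
        H[i, 2*i:2*i+2] = [1, 1]     # scaling (trend) rows
        G[i, 2*i:2*i+2] = [1, -1]    # wavelet (detail) rows
    return H / np.sqrt(2), G / np.sqrt(2)

n = 8
H1, G1 = haar_level1(n)
M1 = np.vstack([H1, G1])                    # M_{1,n} = [H_{1,n}; G_{1,n}]
assert np.allclose(M1 @ M1.T, np.eye(n))    # M_{1,n} is orthonormal

A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # sample input matrix
B = M1 @ A @ M1.T                           # transformed matrix
trend = B[:n//2, :n//2]                     # "trends" block H_{1,n} A H_{1,n}^T
```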
The resulting matrix contains all the information of the input. However, the first block contains the most useful information (the so-called trends), while the other blocks contain just fine details of the input. Therefore, assuming that $n/2$ is divisible by 2, we can create another transformation matrix $M_{1,n/2}$ and apply it in the same way as before, but now just to the first block:

$$M_{1,n/2}\, H_{1,n} A H_{1,n}^T\, M_{1,n/2}^T =
\begin{bmatrix}
H_{1,n/2} H_{1,n} A H_{1,n}^T H_{1,n/2}^T & H_{1,n/2} H_{1,n} A H_{1,n}^T G_{1,n/2}^T \\
G_{1,n/2} H_{1,n} A H_{1,n}^T H_{1,n/2}^T & G_{1,n/2} H_{1,n} A H_{1,n}^T G_{1,n/2}^T
\end{bmatrix}
\in \mathbb{R}^{\frac{n}{2} \times \frac{n}{2}}.$$
Again, the first block contains the most useful information, and its dimension is now a quarter of the original. We can obtain this block without creating the matrices $G_{1,*}$ and $M_{1,*}$. Moreover, we can directly assemble the product $H_{1,n/2} H_{1,n}$ as

$$H_{2,n} = H_{1,n/2} H_{1,n} = \frac{1}{\sqrt{4}}
\begin{bmatrix}
1 & 1 & 1 & 1 & 0 & \cdots & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 0 & \cdots & 1 & 1
\end{bmatrix}
\in \mathbb{R}^{\frac{n}{4} \times n},$$


where each row contains four values of $1/\sqrt{4}$ shifted by four positions against the previous row. We call this matrix the second-level projector onto the scaling subspace.

Assuming that $n$ is divisible by $2^m$, we can create up to the level-$m$ projection matrix $H_{m,n}$ in the same way, i.e. each row contains $2^m$ values of $1/\sqrt{2^m}$, shifted by $2^m$ positions against the previous row. On the other hand, if $n$ is not divisible by $2^m$ and we would still like to use the level-$m$ projection matrix, we have to employ extensions of wavelet transforms to arbitrary input lengths, see e.g. [11]. In this paper we restrict ourselves to matrices with sizes divisible by a sufficiently high $2^m$.
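Direct assembly of $H_{m,n}$ takes only a few lines; the sketch below (the function name scaling_projector is ours, and $n$ is assumed divisible by $2^m$, as in our experiments) also checks the level-2 case against the product $H_{1,n/2} H_{1,n}$.

```python
# Direct assembly of the level-m scaling projector H_{m,n}: each row holds
# 2^m entries equal to 1/sqrt(2^m), shifted by 2^m columns per row.
import numpy as np

def scaling_projector(m, n):
    w = 2 ** m
    assert n % w == 0
    H = np.zeros((n // w, n))
    for i in range(n // w):
        H[i, i*w:(i+1)*w] = 1.0 / np.sqrt(w)
    return H

# Sanity check: H_{2,n} equals the product H_{1,n/2} H_{1,n}
n = 16
H1n = scaling_projector(1, n)
H1h = scaling_projector(1, n // 2)
assert np.allclose(scaling_projector(2, n), H1h @ H1n)
```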
We mentioned in the previous section that we can think of the deflation coarse problem matrix $W^T A W$ in terms of a coarse grid. The coarse grid contains the most important information about the solution. Similarly, our input matrix transformed by the $H_{m,n}$ projection contains most of the information. We have

$$W^T A W \approx H_{m,n} A H_{m,n}^T,$$

and therefore we can choose our deflation space as the transpose of the level-$m$ projection matrix onto the scaling subspace, i.e. $W = H_{m,n}^T$. This choice can also be viewed as a deflation space assembled by an aggregation of subdomains (each row represents a subdomain; if the $i$-th degree of freedom belongs to the subdomain, then the $i$-th value of the row is set to one), where the subdomains are chosen algebraically.
Note that we chose the Haar wavelet, but other wavelet bases could be used as well.
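Putting the pieces together, a toy end-to-end run could look as follows, reusing the dcg and scaling_projector sketches above, with a 1D Laplacian standing in for a real problem:

```python
# End-to-end toy example: deflate a 1D Laplacian with W = H_{m,n}^T.
# Assumes dcg and scaling_projector from the sketches above are in scope.
import numpy as np

n, m = 64, 3
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D Poisson matrix
b = np.ones(n)
W = scaling_projector(m, n).T    # deflation space with n / 2^m columns
x = dcg(A, b, W)                 # deflated CG solve
```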

4 Numerical Experiment
We implemented DCG as a PETSc [12, 13, 14] KSP solver. Parallel MUMPS [15, 16] with Cholesky factorization was employed for the DCG CP solution. We used PCREDUNDANT to split the MPI ranks into subcommunicators. Each subcommunicator owns a copy of the whole CP matrix, and therefore the solves are done redundantly on each subcommunicator. This keeps the direct solves efficient even for small sizes of the CP [17, 18].

Our benchmark models a 2D Poisson equation on a unit square with a homogeneous Dirichlet boundary condition. The discretization uses standard centred finite differences on an n × n grid. The right-hand side is set to one and the initial estimate to zero.
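For reference, the benchmark matrix can be assembled from Kronecker products of the 1D finite-difference Laplacian. The SciPy sketch below mirrors that setup at a toy size; the actual experiments used the PETSc implementation described above.

```python
# 5-point finite-difference Poisson matrix on an n x n grid with
# homogeneous Dirichlet boundary; toy grid size for illustration.
import numpy as np
import scipy.sparse as sp

def poisson2d(n):
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))  # 1D Laplacian
    I = sp.identity(n)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()       # n^2 x n^2 system

n = 32
A = poisson2d(n)
b = np.ones(n * n)           # right-hand side set to one
x0 = np.zeros(n * n)         # zero initial estimate
```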
We tested our deflation method on problems with n equal to 2,048 (4,194,304 degrees of freedom (DOFs)), 4,096 (16,777,216 DOFs), and 8,192 (67,108,864 DOFs).
Tests were run on the Salomon [19] supercomputer using 100 cores. Salomon compute nodes consist of two 12-core Intel Xeon E5-2680v3 (Haswell) processors and 128 GB of memory.
The stopping criterion is given by the relative residual $\|r_i\| / \|r_0\| < 10^{-6}$ for CG and $\|r_i\| / \|r_{-1}\| < 10^{-6}$ for DCG, ensuring a fair comparison.
The results are reported in Tables 1 to 3 and in the accompanying Figures 1 to 3. By level we mean the level of the transposed projection matrix $H_{m,n}^T$ used as the deflation space in DCG. Level None represents the standard CG method. Results are reported with the optimal CP redundancy (number of subcommunicators).
Level | Iterations | Time [s] | Time per iter. [s] | Redundancy | CP size
None  | 3377       | 9.205    | 0.003              | None       | None
10    | 3519       | 71.66    | 0.020              | 5          | 4,096
9     | 2794       | 97.92    | 0.035              | 5          | 8,192
8     | 1727       | 14.62    | 0.008              | 2          | 16,384
7     | 895        | 9.575    | 0.011              | 2          | 32,768
6     | 446        | 6.713    | 0.015              | 2          | 65,536
5     | 224        | 5.673*   | 0.025              | 2          | 131,072
4     | 112        | 5.974    | 0.053              | 1          | 262,144
3     | 56         | 7.784    | 0.139              | 1          | 524,288
2     | 27         | 11.20    | 0.415              | 1          | 1,048,576
1     | 13         | 20.34    | 1.565              | 1          | 2,097,152

Table 1: Results for n = 2,048 with 4,194,304 DOFs. The fastest setting is marked with an asterisk.

[Figure 1 (plot): number of iterations (left axis) and time to solution in seconds (right axis) versus deflation level, from None and 10 down to 1.]

Figure 1: Results for n = 2,048 with 4,194,304 DOFs. Dependency of the number of iterations and time to solution on the deflation level.

We can see that DCG with our wavelet-based deflation space brings large improvements in both the number of iterations and the time to solution over the standard CG method. However, the solution of the DCG CP represents a significant bottleneck. While we can reduce the size of the CP by increasing the deflation level, this reduces the effectiveness of the deflation. Also note that for n = 2,048 and level 10, we actually got more iterations than with CG. This is probably due to a loss of orthogonality of the residuals, and it suggests that a residual replacement scheme might be necessary for some problems.

Level | Iterations | Time [s] | Time per iter. [s] | Redundancy | CP size
None  | 6819       | 86.03    | 0.013              | None       | None
10    | 5660       | 132.5    | 0.023              | 2          | 16,384
9     | 3528       | 88.55    | 0.025              | 2          | 32,768
8     | 1813       | 51.74    | 0.029              | 2          | 65,536
7     | 910        | 32.49    | 0.036              | 1          | 131,072
6     | 458        | 23.56    | 0.051              | 1          | 262,144
5     | 227        | 20.17*   | 0.089              | 1          | 524,288
4     | 114        | 21.31    | 0.187              | 1          | 1,048,576
3     | 57         | 30.12    | 0.528              | 1          | 2,097,152
2     | 28         | 48.78    | 1.742              | 1          | 4,194,304
1     | 14         | 91.64    | 6.546              | 1          | 8,388,608

Table 2: Results for n = 4,096 with 16,777,216 DOFs. The fastest setting is marked with an asterisk.

[Figure 2 (plot): number of iterations (left axis) and time to solution in seconds (right axis) versus deflation level, from None and 10 down to 1.]

Figure 2: Results for n = 4,096 with 16,777,216 DOFs. Dependency of the number of iterations and time to solution on the deflation level.
Level | Iterations | Time [s] | Time per iter. [s] | Redundancy | CP size
None  | 13756      | 818.3    | 0.059              | None       | None
10    | 7188       | 646.8    | 0.090              | 2          | 65,536
9     | 3758       | 361.0    | 0.096              | 1          | 131,072
8     | 1881       | 202.4    | 0.108              | 1          | 262,144
7     | 940        | 118.8    | 0.126              | 1          | 524,288
6     | 465        | 88.02    | 0.189              | 1          | 1,048,576
5     | 233        | 76.94*   | 0.330              | 1          | 2,097,152
4     | 116        | 96.48    | 0.832              | 1          | 4,194,304
3     | 59         | 139.4    | 2.363              | 1          | 8,388,608
2     | 29         | 218.5    | 7.534              | 1          | 16,777,216
1     | 13         | 403.8    | 31.06              | 1          | 33,554,432

Table 3: Results for n = 8,192 with 67,108,864 degrees of freedom (DOFs). The fastest setting is marked with an asterisk.

[Figure 3 (plot): number of iterations (left axis) and time to solution in seconds (right axis) versus deflation level, from None and 10 down to 1.]

Figure 3: Results for n = 8,192 with 67,108,864 DOFs. Dependency of the number of iterations and time to solution on the deflation level.

5 Conclusion

We have shown that our deflation space based on wavelet compression can significantly improve both the convergence rate and the solution time. In our worst-conditioned test case, the number of iterations was reduced by a factor of 59 and the solution time by a factor of 10 for the optimal settings.

The deflation CP represents a bottleneck. We could avoid the cost of the CP entirely by A-orthonormalizing the deflation space. Another possibility we would like to investigate is using a low-level deflation for the original problem and then again for the solution of the CP; this could be done recursively until the CP is very small and easily solvable.

The extension to matrices with dimension not divisible by 2^m will have to be investigated. The optimal choice of the wavelet basis is also an important question.

Acknowledgement

The work was supported by The Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project “IT4Innovations excellence in science - LQ1602” and from the Large Infrastructures for Research, Experimental Development and Innovations project “IT4Innovations National Supercomputing Center LM2015070”, and by the internal student grant competition project SP2017/169 “PERMON toolbox development III”. The authors acknowledge the Czech Science Foundation (GACR) project no. 15-18274S.

References
[1] M.R. Hestenes, E. Stiefel, “Methods of conjugate gradients for solving linear systems”, Journal of Research of the National Bureau of Standards, 49: 409–436, 1952.

[2] L.N. Trefethen, D. Bau, Numerical Linear Algebra, SIAM, 1997, ISBN 0898713617.

[3] Z. Dostal, “Conjugate gradient method with preconditioning by projector”, International Journal of Computer Mathematics, 23(3-4): 315–323, 1988.

[4] R.A. Nicolaides, “Deflation of Conjugate Gradients with Applications to Boundary Value Problems”, SIAM Journal on Numerical Analysis, 24(2): 355–365, 1987.

[5] G.I. Marchuk, Y.A. Kuznetsov, “Theory and applications of the generalized conjugate gradient method”, Advances in Mathematics. Supplementary Studies, 10: 153–167, 1986.

[6] Y. Saad, M. Yeung, J. Erhel, F. Guyomarc’h, “A Deflated Version of the Conjugate Gradient Algorithm”, SIAM Journal on Scientific Computing, 21(5): 1909–1926, 2000.

[7] A. Stathopoulos, K. Orginos, “Computing and Deflating Eigenvalues While Solving Multiple Right-Hand Side Linear Systems with an Application to Quantum Chromodynamics”, SIAM Journal on Scientific Computing, 32(1): 439–462, 2010.

[8] J. Erhel, K. Burrage, B. Pohl, “Restarted GMRES preconditioned by deflation”, Journal of Computational and Applied Mathematics, 69(2): 303–318, 1996, ISSN 0377-0427.

[9] G. Bachmann, L. Narici, E. Beckenstein, Fourier and Wavelet Analysis, Universitext, Springer New York, 2002, ISBN 9780387988993.

[10] D. Walnut, An Introduction to Wavelet Analysis, Applied and Numerical Harmonic Analysis, Birkhäuser Boston, 2002, ISBN 9780817639624.

[11] C. Taswell, K.C. McGill, “Algorithm 735: Wavelet Transform Algorithms for Finite-duration Discrete-time Signals”, ACM Transactions on Mathematical Software, 20(3): 398–412, 1994, ISSN 0098-3500.

[12] S. Balay, S. Abhyankar, M.F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W.D. Gropp, D. Kaushik, M.G. Knepley, L.C. McInnes, K. Rupp, B.F. Smith, S. Zampini, H. Zhang, H. Zhang, “PETSc Web page”, http://www.mcs.anl.gov/petsc, 2016.

[13] S. Balay, S. Abhyankar, M.F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W.D. Gropp, D. Kaushik, M.G. Knepley, L.C. McInnes, K. Rupp, B.F. Smith, S. Zampini, H. Zhang, H. Zhang, “PETSc Users Manual”, Technical Report ANL-95/11 - Revision 3.7, Argonne National Laboratory, 2016.

[14] S. Balay, W.D. Gropp, L.C. McInnes, B.F. Smith, “Efficient Management of Parallelism in Object Oriented Numerical Software Libraries”, in E. Arge, A.M. Bruaset, H.P. Langtangen (Editors), Modern Software Tools in Scientific Computing, pages 163–202, Birkhäuser Press, 1997.

[15] P.R. Amestoy, I.S. Duff, J. Koster, J.Y. L’Excellent, “A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling”, SIAM Journal on Matrix Analysis and Applications, 23(1): 15–41, 2001.

[16] P.R. Amestoy, A. Guermouche, J.Y. L’Excellent, S. Pralet, “Hybrid scheduling for the parallel solution of linear systems”, Parallel Computing, 32(2): 136–156, 2006.

[17] V. Hapla, D. Horak, M. Merta, “Use of Direct Solvers in TFETI Massively Parallel Implementation”, pages 192–205, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, ISBN 978-3-642-36803-5.

[18] V. Hapla, D. Horak, “TFETI Coarse Space Projectors Parallelization Strategies”, in Parallel Processing and Applied Mathematics - 9th International Conference, PPAM 2011, Torun, Poland, September 11-14, 2011, Revised Selected Papers, Part I, pages 152–162, 2011.

[19] “Salomon Web page”, https://docs.it4i.cz/salomon/hardware-overview/.
