Professional Documents
Culture Documents
1 s2.0 S0010465517301200 Main
1 s2.0 S0010465517301200 Main
article info a b s t r a c t
Article history: H8 [aitch-phi] is a program package based on the Lanczos-type eigenvalue solution applicable to a
Received 10 March 2017 broad range of quantum lattice models, i.e., arbitrary quantum lattice models with two-body interactions,
Accepted 12 April 2017 including the Heisenberg model, the Kitaev model, the Hubbard model and the Kondo-lattice model.
Available online 25 April 2017
While it works well on PCs and PC-clusters, H8 also runs efficiently on massively parallel computers,
which considerably extends the tractable range of the system size. In addition, unlike most existing
Keywords:
packages, H8 supports finite-temperature calculations through the method of thermal pure quantum
Numerical linear algebra
Lattice fermion models (TPQ) states. In this paper, we explain theoretical background and user-interface of H8. We also show
Quantum spin liquids the benchmark results of H8 on supercomputers such as the K computer at RIKEN Advanced Institute for
Computational Science (AICS) and SGI ICE XA (Sekirei) at the Institute for the Solid State Physics (ISSP).
Program summary
Program Title: H8
Program Files doi: http://dx.doi.org/10.17632/vnfthtyctm.1
Licensing provisions: GNU General Public License, version 3 or later
Programming language: C
External routines/libraries: MPI, BLAS, LAPACK
Nature of problem: Physical properties (such as the magnetic moment, the specific heat, the spin suscep-
tibility) of strongly correlated electrons at zero- and finite-temperature.
Solution method: Application software based on the full diagonalization method, the exact diagonalization
using the Lanczos method, and the microcanonical thermal pure quantum state for quantum lattice model
such as the Hubbard model, the Heisenberg model and the Kondo model.
Restrictions: System size less than about 20 sites for an itinerant electronic system and 40 site for a local
spin system.
Unusual features: Finite-temperature calculation of the strongly correlated electronic system based on
the iterative scheme to construct the thermal pure quantum state. This method is efficient for highly
frustrated system which is difficult to treat with other methods such as the unbiased quantum Monte
Carlo.
© 2017 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
http://dx.doi.org/10.1016/j.cpc.2017.04.006
0010-4655/© 2017 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192 181
wants to compare the results with experiments. Therefore, exact by constructing the matrix elements of the Hamiltonian on the
methods are valuable though they are usually available only for fly. To carry out the Lanczos and TPQ calculations efficiently, we
small size problems. They are important in two ways; as a tool implement two algorithms for massively parallel computation;
for obtaining reliable answers to the original problem (when the one is the conventional parallelization based on the butterfly-
original problem is a small size problem) and as a generator of structured communication pattern [13] (the butterfly method) and
reference data with which one can compare the results of other the other is newly developed method [the continuous-memory-
numerical methods (when the original problem is beyond the access (CMA) method]. Because the numerical cost of the butterfly
applicable range of the exact methods). method is proportional to the number of terms in the second quan-
For the last few decades, the numerical diagonalization package tized Hamiltonian, it requires very long time to simulate the com-
for quantum spin Hamiltonians, TITPACK [4] has been widely used plicated Hamiltonian relevant to the real compounds. In contrast
in the condensed matter physics community. Based on TITPACK, to the conventional algorithm, the newly developed method can
other numerical diagonalization packages such as KOBEPACK [5] realize the continuous memory access and numerical cost which
and SPINPACK [6] have also been developed. There is another pro- does not depend on the number of terms during multiplication be-
gram performing such a calculation in the ALPS project (Algorithms tween the Hamiltonians and a wavefunction. In this paper, we also
and Libraries for Physics Simulations) [7]. However, the size of the explain these two algorithms and compare their computational
target system is limited for TITPACK, KOBEPACK, and ALPS because speed.
they do not support the distributed memory parallelization. Al- This paper is organized as follows: In Section 2, we introduce the
though SPINPACK supports such parallelization and an efficient basic usage of H8. How to download and build H8 are explained
algorithm with the help of symmetries, this program package in Section 2.1, and how to use H8 is explained in Section 2.2.
cannot handle the general interactions such as the Dzyaloshinskii– We also explain what types of models can be treated by using
Moriya and the Kitaev interactions that recently receive extensive H8 in Section 2.3. In Section 3, we detail algorithms implemented
attention. in H8. The representation of the Hilbert space adopted in H8 is
In addition, advances in quantum statistical mechanics [8–11] explained in Section 3.1. How to implement multiplication of the
open a new avenue for exact methods to finite-temperature calcu- Hamiltonian to a wavefunction is explained in Section 3.2. Imple-
lations without the ensemble average. This development enables mentation of full diagonalization method, the Lanczos method, and
us to calculate finite-temperature properties of quantum many- the TPQ method is explained in Section 3.3, Section 3.4, Section 3.5,
body systems with computational costs similar to calculations respectively. In Section 4, we explain the parallelization of multi-
of ground state properties. Now, it is possible to quantitatively plication of the Hamiltonian to a wavefunction. How to distribute
compare theoretical predictions for temperature dependence of, the wavefunction for each process is explained in Section 4.1.
for example, the specific heat and the magnetic susceptibility with In Section 4.2, we explain the conventional butterfly algorithm
experimental results [12]. However, the above program packages for parallelizing the Hamiltonian-wavefunction multiplication. We
have not supported this calculation yet. To perform the large- also explain another parallelization method (the CMA method),
scale calculations directly relevant to experiments by utilizing the which is suitable for treating complicated quantum lattices mod-
parallel computing architectures with small bandwidth and dis- els. In Section 5, we show the benchmark results of H8. In Sec-
tributed memory, a user-friendly, multifunctional, and parallelized tion 5.1, we examine the validity of the TPQ method by using H8.
diagonalization package is highly desirable. In Section 5.2, we show the benchmark results of parallelization
H8 [aitch-phi], a flexible diagonalization package for solving
efficiency for 18-site Hubbard model and 36-site Heisenberg model
on supercomputers. We also show the benchmark results of the
a wide range of quantum lattice models, has been developed to
CMA method in Section 5.3. Finally, Section 6 is devoted to the
overcome the problems in the previous program packages. In H8,
summary.
we implement the Lanczos method for calculating the ground
state and properties of a few excited states, thermal pure quan-
2. Basic usage of H8
tum (TPQ) states [11] for finite-temperature calculations, and full
diagonalization method for checking results of Lanczos and TPQ
2.1. How to download and build H8
methods at small clusters, with an easy-to-use and flexible user
interface. By using H8, one can analyze a wide range of quantum
The gzipped tar file, which contains the source codes, samples,
lattice models including conventional Hubbard and Heisenberg
and manuals, can be downloaded from the H8 download site [14].
models, multi-band extensions of the Hubbard model, exchange
For building H8, a C compiler and the BLAS/LAPACK library [15]
couplings that break SU(2) symmetry of quantum spins such as the
are prerequisite. To enable the parallel computations, the Message
Dzyaloshinskii–Moriya and the Kitaev interactions, and the Kondo
Passing Interface (MPI) library [16] is also required.
lattice models describing itinerant electrons coupled to quantum
For building H8, the CMake utility [17] can be used as follows:
spins. It is easy to calculate a variety of physical quantities such
as the internal energy, specific heat, magnetization, charge and $ cd $HOME/build/hphi
spin structure factors, and arbitrary one-body and two-body static $ cmake $PathTohphi
Green’s functions at zero temperature or finite temperatures. $ make
As we mention above, one of the aims of developing H8 is to
make a flexible diagonalization program package that enables us Here, one builds H8 in $HOME/build/hphi and it is assumed that
to directly compare the theoretical calculations and experimental the environment variable, $PathTohphi, is set to the path to the
data. In addition to the simple model Hamiltonians introduced source tree path of H8 (the top directory of H8). If the CMake util-
in textbooks of quantum statistical physics, more complicated ity cannot find the MPI library on the system, the H8 executable is
Hamiltonians inevitably appear to quantitatively describe elec- automatically compiled without the MPI library. In this example,
tronic properties of real compounds. In the Lanczos and the TPQ CMake will choose a C compiler automatically. Instead, one can
simulation of the quantum lattice model in the condensed matter specify the compiler explicitly as follows:
physics, the most time-consuming part is the multiplication of the $ cmake -DCONFIG=$Config $PathTohphi
Hamiltonian to a wavefunction. Excepting the case for the very $ make
small system, it is unrealistic to store all non-zero elements of the
Hamiltonian in memory. Therefore we perform the multiplication where $Config is chosen from the following configurations:
182 M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192
the Heisenberg model and the Hubbard model, one can use H8 i,j,k,l σ1 ,σ2 ,σ3 ,σ4
in Standard mode. In Standard mode, one can specify all the input where tiσ1 jσ2 is a generalized transfer integral between site i with
parameters by a single file with a few lines, from which the files spin σ1 and site j with spin σ2 and Iiσ1 jσ2 kσ3 lσ4 is the generalized
described above are generated automatically before starting the two-body interaction which annihilates a spin σ2 particle at site
calculation. It is thus much more easier to simulate and analyze j and a spin σ4 particle at site l, and creates a spin σ1 particle at
†
these models in Standard mode as long as the model and the site i and a spin σ3 particle at site k. Here, ĉiσ (ĉiσ ) is the creation
lattice are supported in Standard mode. The models supported in (annihilation) operator of an electron on site i with spin σ =↑
Standard mode will be explained in the following sections. or ↓. For the Hubbard and Kondo models, the 1/2-spin fermions
are only allowed. Note here that any system of localized spins can
2.2.1. Flow of calculations be regarded as a special case of the above Hamiltonian. Therefore,
A typical flow of calculations in H8 is shown as follows: one can use H8 for solving such quantum spin models by straight-
forward interpretation of the spin interaction in terms of t and I in
1. Make input files the above expressions. H8 also has a capability of handling spins
An example of input file is shown for Standard mode below: higher than S = 1/2.
In Standard mode, H8 can treat the following models. (The
L = 4
titles of the following items are the corresponding keys for the
W = 4
model parameter in a Standard input file. Here, the keys, "Fermion
model = "Spin"
HubbardGC", "SpinGC" and "KondoGC" are the grand-canonical
method = "Lanczos"
version of "Fermion Hubbard", "Spin" and "Kondo", respec-
lattice = "square lattice"
tively.)
J = 1.0
(i) "Fermion Hubbard"/ "Fermion HubbardGC"
2Sz = 0
The Hamiltonian is given by
Here L and W are linear extents of square lattice, in x and y † †
∑ ∑ ∑ ∑
Ĥ = −µ ĉiσ ĉiσ − tij ĉiσ ĉjσ + U n̂i↑ n̂i↓ + Vij n̂i n̂j , (4)
directions, respectively. Spin means the Heisenberg model,
iσ ij,σ i ij
and J is the exchange coupling. By using this input file,
one can obtain the ground state of the Heisenberg model where µ is the chemical potential, tij is the transfer integral be-
for 16 = 4 × 4 sites by performing the Lanczos method. tween site i and site j, U is the on-site Coulomb interaction, Vij is the
†
Details of keywords in the input files can be found in the long-range Coulomb interaction between i and j sites. n̂iσ = ĉiσ ĉiσ
manuals [14]. is the number operator at site i with spin σ and n̂i = n̂i↑ + n̂i↓ .
M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192 183
2.5. Samples To specify a state of a site in H8, we use an n-ary, which takes
one of n different values. For example, in the case of an itinerant
There are some sample input files and reference results electronic system, the local site may take four states, 0, ↑, ↓, and
in samples/ directory. For example, samples/Standard/ ↑↓, and therefore it is natural to represent it by a 4-ary. In this
184 M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192
representation, a basis |0, ↑, ↓, ↑↓⟩ is labeled by [0, 1, 2, 3]4 = 3.3. Multiplying Hamiltonian to wavefunctions (matrix–vector prod-
[0123]4 = 0 · 43 + 1 · 42 + 2 · 41 + 3 · 40 as |0, ↑, ↓, ↑↓⟩ = |[0123]4 ⟩. uct)
Here, [am−1 , . . . , a1 , a0 ]n = [am−1 · · · a1 a0 ]n (0 ≤ aj < n, j ∈
[0, m)) represents the m-digit n-ary number, i.e., Here, we explain the main operation in H8, multiplying the
m−1 Hamiltonian to the wavefunctions. Because the dimension of the
∑
[am−1 , . . . , a1 , a0 ]n = [am−1 · · · a1 a0 ]n = aj · nj . (7) wavefunction becomes exponentially large by increasing the num-
j=0
ber of sites, it is impossible to store the whole matrix in the actual
calculations. Thus, in the Lanczos (Section 3.4) and the TPQ (Sec-
To use the standard bitwise operations (e.g. logical disjunction tion 3.5) methods, we only store the wavefunctions and calculate
and exclusive or), we decompose the 4-ary representations as the the matrix elements at each interaction on the fly.
following example: A basis |0, ↑, ↓, ↑↓⟩ of the 4-site system is We explain the procedure of multiplying the Hamiltonian to the
indexed by a quartet of 2-digit binary numbers as wavefunctions by taking the 4-site Heisenberg model with S = 1/2
[[00]2 , [01]2 , [10]2 , [11]2 ]4 on the chain with periodic boundary condition as an example. The
0 ↑ ↓ ↑↓ Hamiltonian is block-diagonalized with ∑ blocks being characterized
z
by the total magnetization, ŜTotal ≡ i Ŝiz . For the block satisfying
= [ 0, 0 , 0, 1 , 1, 0 , 1, 1 ]2 z
⟨ŜTotal ⟩ = 0, the wavefunction can be represented by using the
= 0 · 27 + 0 · 26 + 0 · 25 + 1 · 24 simultaneous eigenvectors of all Ŝiz ’s as
+ 1 · 23 + 0 · 22 + 1 · 21 + 1 · 20 . |φ⟩ = a3 |↓↓↑↑⟩ + a5 |↓↑↓↑⟩ + a6 |↓↑↑↓⟩
For the Kondo system, the local sites are classified into itinerant + a9 |↑↓↓↑⟩ + a10 |↑↓↑↓⟩ + a12 |↑↑↓↓⟩. (10)
electron part and localized spin part. However, for simplicity,
we use 4-ary representations likewise for the itinerant electronic Here, the quartet of up or down arrows, which we call a ‘‘con-
system, i.e., we represent a localized spin by |↑⟩ = |[[10]2 ]4 ⟩ or figuration’’ in what follows, specifies a basis vector. In the actual
|↓⟩ = |[[01]2 ]4 ⟩. calculations, we store the list of the configurations.
For localized spin-S systems (S = 1/2, 1, 3/2 . . . ), to represent The Hamiltonian is defined as
Hilbert spaces we use (2S + 1)-ary representation. For example, in 3
a spin-1 system the wavefunction is represented as |1, 0, −1⟩ =
∑
Ĥ = Ŝi · Ŝ mod (i+1,4) = Ĥz + Ĥxy , (11)
|[2, 1, 0]3 ⟩. To treat the spin-S system, we used Bogoliubov repre- i=0
sentation shown in Appendix A.1. In H8, we can treat the more
complicated spin systems such as the spin angular momentums where Ĥz and Ĥxy are defined as follows,
have different values at each site (mixed spin systems). 3
In general, at ith site with the spin angular momentum Si , the
∑
Ĥz = Ŝiz Ŝ zmod (i+1,4) ,
Hilbert space at ith site is represented by (2Si + 1)-ary number and
i=0
the Hilbert space of the system is represented by |φNs −1 , . . . 81 , φ0 ⟩
3
where φi = {−Si , −Si + 1, . . . , Si } and Ns is the whole number ∑ 1
of sites. By using the multiplications of the n-ary number, we Ĥxy = (Ŝi+ Ŝ −mod (i+1,4) + Ŝi− Ŝ +mod (i+1,4) ). (12)
2
represent the mixed spin systems as follows: i=0
We note that, when the canonical system is selected by the (Ŝ0+ Ŝ1− + Ŝ0− Ŝ1+ )|↑↓↑↓⟩ = |↑↓↓↑⟩. (13)
input file, H8 automatically constructs the restricted Hilbert space
{φ} that has the specified particle number or total Sz . To perform In H8, if the exchange operation is possible, the operation is
the efficient reverse lookup in the restricted Hilbert space, we performed by bit operations as follows:
use the two-dimensional search method [4,19]. We also use the
algorithm quoted in [20] as ‘‘finding the next higher number after [0011]2 ˆ [1010]2 = [1001]2 , (14)
a given number that has the same number of 1-bits’’ to perform an
where ˆ represents the ‘‘bitwise exclusive or’’. Before performing
efficient two-dimensional search method.
the exchange operator, we count bits in the configuration and
3.2. Full diagonalization method judge whether the exchange operation is possible or not. If the ex-
change operation is not possible, we do not perform the operation
We generate the matrix of Ĥ by using above-mentioned basis and regard the contribution from the configuration as zero, i.e. ,
set |ψj ⟩ (j = 1 · · · dH , dH is the dimension of the Hilbert space):
Hij = ⟨ψi |Ĥ|ψj ⟩. By diagonalizing this matrix, we can obtain all (Ŝ0+ Ŝ1− + Ŝ0− Ŝ1+ )|↓↓↑↑⟩ = 0. (15)
the eigenvalues Ei and normalized eigenvectors |8i ⟩ (i = 1 · · · dH ). We note that the sign arises from the exchange relation be-
In the diagonalization, we use lapack routine such as dsyev or tween fermion operators in the itinerant electrons systems such
zheev. We also calculate and output the expectation values ⟨Ai ⟩ ≡ as Hubbard models, while the sign does not appear in the localized
⟨8i |Â|8i ⟩. These values are used for the finite-temperature calcu- spin systems. To obtain the sign, it is necessary to count the
lations. parity of the given bit sequences. We use the algorithm for parity
From ⟨Ai ⟩ ≡ ⟨8i |Â|8i ⟩, we calculate finite-temperature prop- counting that is explained in the literature [20].
erties by using the ensemble average
∑N
⟨Ai ⟩e−β Ei 3.4. Lanczos method
⟨Â⟩ = ∑N
i=1
. (9)
i=1 e−β Ei For the sake of completeness, we briefly review the principle of
In the actual calculation, the ensemble averages are performed as the Lanczos method. For more details see, for example, the TITPACK
a postprocess. manual [2] and the textbook by M. Sugihara and K. Murota [21].
M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192 185
The Lanczos method is based on the power method. In the eigenvectors by using only two vectors whose dimensions are the
power method, by successive operations of the Hamiltonian Ĥ to dimension of the total Hilbert space. In H8, to reduce the numeri-
the arbitrary vector x0 , we generate bases Ĥn x0 . The obtained linear cal cost, we store two additional vectors; a vector for accumulating
space Kn+1 (Ĥ, x0 ) = {x0 , Ĥ1 x0 , . . . , Ĥn x0 } is called the Krylov the diagonal elements of the Hamiltonian, and another vector for
subspace. The initial vector is represented by the superposition of the list of the restricted Hilbert space explained in Section 3.3. The
the eigenvectors ei (corresponding eigenvalues are Ei ) of Ĥ as dimension of these vectors is that of the Hilbert space. As detailed
∑ below, to obtain the eigenvector by the Lanczos method, one
x0 = ai ei . (16) additional vector is necessary. Within some number of iterations,
i we obtain accurate eigenvalues near the maximum and minimum
We note that all the eigenvalues are real number because Hamilto- eigenvalues. The necessary number of iterations is small enough
nian is a Hermitian operator. By operating Ĥn to the initial vector, compared to the dimensions of the total Hilbert space. We note
we obtain the relation as that it is shown that the errors of the maximum and minimum
⎡ ⎤ eigenvalues become exponentially small as a function of Lanczos
∑ ( Ei )n iterations (for details, see Ref. [21]). Details of generating the initial
n
Ĥ x0 = E0n ⎣a0 e0 + ai ei ⎦ , (17) vectors and convergence conditions are shown in Appendix A.2 and
E0
i̸ =0 Appendix A.3, respectively.
where we assume that E0 has the maximum absolute value of
the eigenvalues. This relation shows that the eigenvector of E0 3.4.1. Inverse iteration method
becomes dominant for sufficiently large n. We note that the Krylov In some cases, the accuracy of obtained eigenvectors by the
subspace does not change under the constant shift of the Hamilto- Lanczos method is not enough for calculating the correlation func-
nian [22], i.e., tions. To improve the accuracy of the eigenvectors, we implement
the inverse iteration method in H8, which is also implemented in
Kn+1 (Ĥ, x0 ) = Kn+1 (Ĥ + aI , x0 ), (18) TITPACK [4].
By using the approximate value of the eigenvalues En , by succes-
where I is the identity matrix and a is a constant. By taking constant sively operating (Ĥ − En )−1 to the initial vector y0 , we can obtain
a as the large positive value, the leading part of (Ĥ + aI)n x0 becomes the accurate eigenvector for En . The linear simultaneous equation
the maximum eigenvectors and vice versa. Thus, by constructing for such procedure is given by
the Krylov subspace, we can obtain eigenvalues around both the
maximum and the minimum eigenvalues. yk = (Ĥ − En )yk+1 . (24)
In the Lanczos method, we successively generate the nor-
By solving this equation by using the conjugate gradient (CG)
malized orthogonal basis v0 , . . . , vn−1 from the Krylov subspace
method, we can obtain the eigenvector. We note that we take y0
Kn (Ĥ, x0 ). We define the initial vector and associated components
as the eigenvector obtained by the Lanczos method and add small
as v0 = x0 /|x0 |, β0 = 0, x−1 = 0. From this initial condition, we
positive number to En for stabilizing the CG method. Although
can obtain the normalized orthogonal basis as follows:
we can obtain more accurate eigenvector by using this method,
αk = (Ĥvk , vk ), (19) additional four vectors are necessary to perform the CG method.
w = Ĥvk − βk vk−1 − αk vk , (20)
3.5. Finite-temperature calculations by TPQ method
βk+1 = |w |, (21)
w
vk+1 = . (22) The method of the TPQ states is based on the fact [10] that it is
|w | possible to calculate the finite-temperature properties from a few
From these definitions, it is obvious that αk and βk are real number. wavefunctions (in the thermodynamic limit, only one wavefunc-
In the subspace spanned by these normalized orthogonal bases, tion is necessary) without the ensemble average. In what follows
the Hamiltonian is transformed as we describe the prescription proposed in [11], which we adopt
in H8. Such wavefunctions that replace the ensemble average
Tn = Vn† ĤVn . (23) are called TPQ states. Because a TPQ state can be generated by
operating the Hamiltonian to the random initial wavefunction,
Here, Vn is the matrix whose column vectors are vi (i =
† we directly use the routine of the Lanczos method to the TPQ
0, 1, . . . , n − 1). We note that Vn Vn = In (In is an n × n identity
calculations. Here, we explain how to construct the TPQ state,
matrix). Tn is a tridiagonal matrix and its diagonal elements are αi
which offers a simple way for finite-temperature calculations.
and subdiagonal elements are βi . It is known that the eigenvalues
Let |ψ0 ⟩ be a random initial vector. How to generate |ψ0 ⟩ is
of Ĥ are well approximated by the eigenvalues of Tn for sufficiently
described in Appendix A.4. By operating (l − Ĥ/Ns )k (l is a constant,
large n. The original eigenvectors of Ĥ are obtained by ei = Vn ẽi , Ns represents the number of sites) to |ψ0 ⟩, we obtain the kth TPQ
where ẽi are the eigenvectors of Tn . From Vn , we can obtain the states as
eigenvectors of Ĥ. However, in the actual calculations, it is difficult
to keep Vn because its dimension is too large to store [dimension of (l − Ĥ/Ns )|ψk−1 ⟩
|ψk ⟩ ≡ . (25)
Vn = (dimension of the total Hilbert space) × (number of Lanczos |(l − Ĥ/Ns )|ψk−1 ⟩|
iterations)]. Thus, to obtain the eigenvectors, in the first Lanczos
From |ψk ⟩, we estimate the corresponding inverse temperature βk
calculation, we keep ẽi because its dimension is small (upper bound as
of the dimensions of ẽi is the number of Lanczos iterations). Then,
2k/Ns
we again perform the same Lanczos calculations after we obtain βk ∼ , uk = ⟨ψk |Ĥ|ψk ⟩/Ns , (26)
the eigenvalues from the Lanczos methods. From this procedure, l − uk
we obtain the eigenvectors from Vn . where uk is the internal energy. Arbitrary local physical properties
In the Lanczos method, by successive operations of the Hamil- at βk are also estimated as
tonian on the initial vector, we obtain accurate eigenvalues
around the maximum and minimum eigenvalues and associated ⟨Â⟩βk = ⟨ψk |Â|ψk ⟩/Ns . (27)
186 M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192
In a finite-size system, a finite statistical fluctuation is caused Numerical costs become smaller if all operations can be done
by the choice of the initial random vector. To estimate the average similar to Fig. 4(a) in which the memory access is limited within
value and the error of the physical properties, we perform some each process. In addition, if we can apply simultaneously the all
independent calculations by changing |ψ0 ⟩. Usually, we regard the terms in the second quantized Hamiltonian acting to the part of
standard deviations of the physical properties as the error bars. wavefunctions continuously stored in the CPU-cache memory, the
numerical cost becomes small and independent from the number
4. Parallelization of the terms. Indeed, we can partially convert the conventional
algorithm into an efficient algorithm by employing a permutation
4.1. Distribution of wavefunction of the bits. By the permutation, we can also achieve continuous
memory access, which is usually faster than random access hap-
In the calculation of the exact diagonalization based on the pening inevitably in the conventional algorithm (Section 4.2). In
this section, we explain the newly developed algorithm to realize
Lanczos method or the finite-temperature calculations based on
the continuous memory access and to enhance the computing
the TPQ state, we have to store the many-body wavefunction on
efficiency.
the random access memory (RAM). This wavefunction becomes
Any partial Hamiltonians acting on the rightmost M bits are
exponentially large with increasing the number of sites. For ex-
represented by a 2M by 2M matrix that acts on 2M successive
ample, when we compute the 40-site 1/2 spin system in the Sz
components, namely, 0th to (2M − 1)th components, of the wave-
unconserved condition, its dimension is 240 = 1,099,511,627,776
function. Then, if the W (≤M) rightmost bits are permuted in the
and its size in RAM becomes about 17.6 TB (double precision
bit representation of the basis as,
complex number). Such a large wavefunction can be treated by
distributing it to many processes by utilizing distributed memory |[σNs −1 · · · σW +1 σW σW −1 · · · σ1 σ0 ]2 ⟩
parallelization.
In H8, the ranks of the processes are labeled by using the → |[σW −1 · · · σ1 σ0 σNs −1 · · · σW +1 σW ]2 ⟩, (28)
leftmost bits in the bit representation of the wavefunction basis. where σi = 0, 1 (i ∈ [0, Ns )), the partial Hamiltonians acting on the
When the leftmost bits corresponding to a Np -site subsystem are W th bit, (W + 1) th bit, and so on, are again represented by a matrix
chosen to label the ranks of the processes, the number of processes that acts on the successive components of the wavefunction. If we
have to be (2S + 1)Np (for spin systems) or 4Np (for itinerant electron find an integer W that satisfies mod(Ns , W ) = 0 and decompose
systems). Then, each process keeps partial wavefunction of the the total Hamiltonian into the partial Hamiltonians that act on sets
complementary NLocal (≡Ns − Np )-site subsystem. Distribution of of M successive bits that overlap each other, the multiplication of
wavefunctions of an Ntotal (≡Ns ) spin-1/2 system is illustrated in the Hamiltonian is implemented with continuous memory access
Fig. 3. We note that the configuration of the Np sites is fixed in the by repeating the W -bits permutation.
same process. When the wavefunctions are distributed in many processes, the
For example, we show the distribution of the wavefunction of W -bits permutation is achieved by the following steps. First, we
the NTotal -site 1/2 spin system in the Sz unconserved condition in permute the components in the wavefunction within the rank ℓ
M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192 187
Fig. 4. Schematic picture of parallelization of Hamiltonian-wavefunctions product. We take a 4-site spin-1/2 Heisenberg model as an example.
(ℓ) (ℓ)
process, vj (j = 0, 1, . . . , 2Ns −Np − 1), with a buffer uj as Details of the continuous access will be reported elsewhere [23].
([σNs −1 ···σNs −Np ]2 )
The CMA method is implemented in Standard mode for S = 1/2
u[σW −1 ···σW −N spins without total Sz conservation in the present version of H8.1
p σW −Np −1 ···σ0 σNs −Np −1 ···σW ]2
([σNs −1 ···σNs −Np ]2 )
= v[σN −N . (29) 5. Benchmark results and performance
s p −1 ···σW σW −1 ···σW −Np σW −Np −1 ···σ0 ]2
Here, we carry out TPQ simulations for large system sizes with
changing numbers of threads and processes to examine paralleliza-
tion efficiency of H8. We choose two typical models; a half-filled
18-site Hubbard model on a square lattice and a 36-site Heisenberg
model on a kagome lattice. The speedup of the simulation for the
18-site Hubbard model with up to 3072 cores is measured on SGI
ICE XA (Sekirei) at ISSP (Table 1). The speedup of the large-scale
simulation for the 36-site Heisenberg model with up to 32,768
cores is examined by using the K computer at RIKEN AICS (Table 1).
5.1. Benchmark results of TPQ simulations In Fig. 6, we can see significant acceleration caused by the
increase of CPU cores. This acceleration is almost linear up to
To examine the validity of the TPQ calculation, we compute the 192 cores, and is weakened as we further increase the number
temperature dependence of the doublon density of the Hubbard of cores. This weakening of the acceleration comes from the load
model (U /t = 8) for an 8-site cluster with the periodic boundary imbalance in the S z conserved simulation. For example, when we
condition. The shape of the cluster is illustrated in the inset of Fig. 5. use 3 OpenMP threads and 1024 MPI processes, the number of sites
The Hubbard model is defined by setting µ = 0, Vij = 0, tij = t for associated with the ranks of processes (Np in Section 4.1 ) becomes
the nearest-neighbor pairs of sites, ⟨i, j⟩, and tij = 0 for further 5, and the number of sites in the subsystem in each process ( NLocal
neighbor pairs of sites. An example of the input file for the 8-site in Section 4.1) becomes 13. Each process has different numbers of
Hubbard model used in this calculation is shown as follows: up- and down-spin electrons. In the process which has the largest
Hilbert space, there are six up-spin electrons and six down-spin
a0W = 2 electrons, and the Hilbert space becomes (13 C6 )2 = 2,944,656.
a0L = 2 On the other hand, the process which has the smallest Hilbert
a1W = -2 space has nine up-spin electrons, nine down-spin electrons and the
a1L = 2 (13 C9 )2 = 511,225 dimensional Hilbert space. This large difference
model = "Fermion Hubbard" of the Hilbert-space dimension between each process causes a load
method = "TPQ" imbalance. This effect becomes large as we increase the number of
lattice = "square lattice" processes.
t = 1.0 When we fix the total number of cores (for examples 1536
U = 8.0 cores), we can find that the process-major computation (256 pro-
cesses with 6 threads) is faster than the thread-major computation
In Fig. 5, we show the temperature dependence of the doublon (64 processes with 24 threads). Since the numerical performance
density, ⟨n̂↑ n̂↓ ⟩, calculated by the TPQ state and by the canonical of both of them are equivalent, this behavior seems strange at
ensemble. We confirm that the TPQ calculations well reproduce the first sight. The origin of difference might be explained as follows:
results of the canonical ensemble average. In addition to the finite- When we double the number of processes, the amount of data
temperature calculations, we perform the ground state calcula- communicated by each process at once becomes a half of the
tions by the Lanczos method and the calculated doublon density original size, and the communication time decreases. On the other
at zero temperature is plotted by the dashed line. We also confirm hand the communicated data-amount and the communication
that doublon density at low temperature well converges to the time are unchanged even if we double the number of threads. In
value of the ground state. All the results show the validity of the other words, the algorithm described in Section 4.2 parallelizes
TPQ method. calculation and also communication.
M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192 189
Table 1
Specification of the single node on the supercomputer Sekirei [24] at ISSP and the K
computer [25] at RIKEN AICS.
Sekirei K computer
CPU Xeon E5-2680v3 × 2 SPARC 64TM VIIIfx
Number of cores per node 24 8
Peak performance 960 GFlops 128 GFlops
Main memory 128 GB 16 GB
Memory band width 136.4 GB/s 64 GB/s
Network topology Enhanced hypercube Six-dimensional mesh/torus
Network band width 7 GB/s × 2 5 GB/s × 2
where the first term ĤK represents the Kitaev Hamiltonian [27], the
second term ĤJ describes diagonal exchange couplings between
nearest-neighbor (n.n.) spins, and the third term Ĥoff describes
complicated off-diagonal spin–spin interactions. The fourth term
Ĥ2 and the fifth term Ĥ3 describe the second neighbor (2nd n.) in-
teractions and the third neighbor (3rd n.) interactions, respectively.
Details of the Hamiltonians are given in the previous study [12].
In the lower panel of Fig. 8, elapsed time per TPQ step for
the conventional and the CMA method is shown for a Kitaev
Hamiltonian, ĤK , a Kitaev–Heisenberg Hamiltonian, ĤK + ĤJ , a
n.n. Hamiltonian, Ĥnn = ĤK + ĤJ + Ĥoff , a Hamiltonian including
2nd n. interactions, Ĥnn + Ĥ2 , and the ab initio Hamiltonian Ĥ =
Ĥnn + Ĥ2 + Ĥ3 . Here, we use 16 nodes (24 cores per node) on Sekirei
at ISSP the TPQ simulation with 3 threads and 128 processes. Even
when the number of the coupling term increases, from the sparse
Kitaev Hamiltonian ĤK to the dense and complicated Hamiltonian Fig. 8. Lattice structure and bonds for Ĥ (upper panel) and elapsed time per TPQ
of Na2 IrO3 , Ĥ, the elapsed time per TPQ step of the CMA method step for the conventional butterfly and the CMA method (lower panel). The three
different nearest-neighbor (n.n.) bonds, ΓX , ΓY , and ΓZ , the second neighbor (2nd n.)
does not show significant increase while the elapsed time of the bonds ΓZ2nd , and the third neighbor (3rd n.) bonds Γ 3rd are illustrated in the upper
conventional scheme shows significant increase. panel. In the lower panel, the (cyan) squares denote the elapsed time per TPQ step
Although, for the sparse Kitaev Hamiltonian ĤK , the conven- of the conventional scheme and the (red) circles denote the elapsed time per TPQ
tional scheme shows slightly better performance than that of the step of the CMA method.
CMA method, the CMA method is three times faster than the
conventional one even for the n.n. Hamiltonian Ĥnn . Furthermore,
even though, from Ĥnn to Ĥnn +Ĥ2 , the number of the bonds clearly speed does not depend on the number of terms in the Hamilto-
increases, the elapsed time per TPQ step of the CMA method only nians, while the computational cost of the conventional butterfly
shows slight increase. method is proportional to the number of terms in the Hamiltoni-
ans. Thus, the CMA method is suitable for treating the complicated
6. Summary Hamiltonians for real materials. Actually, for the low-energy model
of Na2 IrO3 , we show that the speed of the CMA method is about
To summarize, we explain the basic usage of H8 in Section 2. six times faster than that of the conventional method. So far, the
By preparing one file within ten lines, for typical models in the CMA method is applicable to the limited class of Hamiltonians and
condensed matter physics, one can easily perform the exact di- lattice geometries. In addition, its efficiency largely depends on the
agonalization based on the Lanczos method or finite-temperature models. It is a remaining challenge to remove the limitation from
calculations based on the TPQ method. By using the same file, one the CMA method.
can also perform the large-scale calculations on modern supercom- Recent development of the theoretical tools for obtaining the
puters. By properly editing the intermediate input files generated low-energy models of real materials enables us to directly compare
by Standard mode execution, it is also easy to perform the similar the theoretical results and experimental results. Although the lim-
calculations for complicated Hamiltonians that describe the low- itation of the system size is severe, the exact analyses based on the
energy physics of real materials. This flexibility and user-friendly Lanczos method or the TPQ method offer a firm basis for examining
interface are the main advantage of H8 and we expect that H8 the validity of the theoretical results. The present version of H8
will be a useful tool for broad spectrum of scientists including can calculate only the static physical quantities, which is difficult
experimentalists and computer scientists. to be directly measured in experiments. Extending H8 to calculate
In Section 3, we explain the basic algorithms implemented dynamical properties such as the dynamical spin/charge structure
in H8. Although the Lanczos method and the TPQ method are factors and the optical conductivity is a promising way to make H8
detailed in the literature, to make our paper self-contained, we ex- more useful. Such an extension will be reported in the near future.
plain the essence of these methods. Furthermore, in Section 4, we
detail the parallelization methods of the multiplication of Hamil- Acknowledgments
tonian to the wavefunctions; one is the conventional butterfly
method and another one is the newly developed method. We also We would like to express our sincere gratitude to Prof. Hidetoshi
show the benchmark results of H8 on supercomputers in Sec- Nishimori and Mr. Daisuke Tahara. Implementation of the Lanczos
tion 5. The results indicate that parallelization of H8 works well algorithm in H8 written in C is based on the pioneering diagonal-
on these supercomputers and the computational time is drastically ization package TITPACK ver. 2 written in Fortran by Prof. Nishi-
reduced. mori. For developing the user interface of H8, we follow the design
We also introduce the new algorithm for parallelization, i.e., the concept of the user interface in the program for variational Monte
(CMA) method. The main advantage of this method is that the Carlo method developed by Mr. Tahara. A part of the user interface
M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192 191
in H8 is based on his original codes. We would also like to thank A.3. Convergence condition for Lanczos method
the support from ‘‘Project for advancement of software usability
in materials science’’ by The Institute for Solid State Physics, The In H8, we use dsyev (routine of lapack) for diagonalization of
University of Tokyo, for development of H8 ver.1.0. We thank the Tn . We use the energy of the first excited state of Tn as a criteria
computational resources of the K computer provided by the RIKEN of convergence. In the standard setting, after five Lanczos step,
Advanced Institute for Computational Science through the Gen- we diagonalize Tn every two Lanczos step. If the energy of the
eral Trial Use project (hp160242). This work was also supported first excited states agrees with the previous energy within the
by Grant-in-Aid for Scientific Research (15K17702, 16H06345, required accuracy, the Lanczos iteration finishes. The accuracy of
and 16K17746) and Building of Consortia for the Development the convergence can be specified by LanczosEps (ModPara file in
of Human Resources in Science and Technology from the MEXT Expert mode).
of Japan and supported by PRESTO, JST (JPMJPR15NF). We also After obtaining the eigenvalues, we again perform the Lanczos
thank numerical resources from the Supercomputer Center of the iteration to obtain the eigenvector. From the eigenvectors |8n ⟩, we
Institute for Solid State Physics, The University of Tokyo. calculate energy En = ⟨8n |Ĥ|8n ⟩ and variance ∆ = ⟨8n |Ĥ2 |8n ⟩−
(⟨8n |Ĥ|8n ⟩)2 . If En agrees with the eigenvalues obtained by the
Appendix A. Details of implementation Lanczos iteration and ∆ is smaller than the specified value, we
finish the diagonalization.
A.1. Bogoliubov representation If the accuracy of Lanczos method is not enough, we perform the
CG method to obtain the eigenvector. As an initial vector of the CG
In the spin system, the spin indices in input files of transfer, method, we use the eigenvectors obtained by the Lanczos method
InterAll, and correlation functions are specified by using the in the standard setting. This often accelerates the convergence.
Bogoliubov representation. Spin operators are written by using
creation/annihilation operators as follows: A.4. Initial vector for the TPQ method
S
For TPQ method, an initial vector is given by using a random
σ ĉi†σ ĉiσ
∑
Ŝiz = (A.1)
number generator, i.e. coefficients of all components for the initial
σ =−S
vector are given by random numbers. The seed is calculated as
S −1
†
∑ √
Ŝi+ = S(S + 1) − σ (σ + 1)ĉiσ +1 ĉiσ (A.2) 123432 + (nrun + 1) × |rs | + kThread + NThread × kProcess (A.6)
σ =−S
S −1 where rs , kThread , NThread , and kProcess are the same as those for the
† Lanczos method [Eq. (A.5)]. nrun indicates the counter of trials,
∑ √
Ŝi− = S(S + 1) − σ (σ + 1)ĉiσ ĉiσ +1 . (A.3)
which is equal to or less than the total number of trials NumAve
σ =−S
in an input file for Standard mode or a ModPara file for Expert
mode. We can select a type of initial vector from a real number
A.2. Initial vector for the Lanczos method or complex number by using InitialVecType in a ModPara
file. kThread , NThread , kProcess indicate the thread ID, the number of
In the Lanczos method, an initial vector is specified with threads, the process ID, respectively; the initial vector depends
initial_iv(≡rs ) defined in an input file for Standard mode or both on initial_iv and the number of processes.
a ModPara file for Expert mode. A type of an initial vector can
be selected from a real number or complex number by using References
InitialVecType in a ModPara file.
[1] D. Pines, P. Nozières, The Theory of Quantum Liquids, Perseus Books Publish-
• For canonical ensemble and initial_iv≥ 0. ing, 1999.
A non-zero component of a target of Hilbert space is given [2] C. Kittel, Introduction To Solid State Physics, John Wiley & Sons, Inc., 2005.
by [3] E. Dagotto, Rev. Modern Phys. 66 (1994) 763–840.
[4] http://www.stat.phys.titech.ac.jp/nishimori/titpack2_new/index-e.html.
[5] http://quattro.phys.sci.kobe-u.ac.jp/Kobe_Pack/Kobe_Pack.html.
(Ndim /2 + rs )%Ndim , (A.4) [6] http://www-e.uni-magdeburg.de/jschulen/spin/.
[7] B. Bauer, L.D. Carr, H.G. Evertz, A. Feiguin, J. Freire, S. Fuchs, L. Gamper, J.
where Ndim is a total number of the Hilbert space and Ndim /2 Gukelberger, E. Gull, S. Guertler, A. Hehn, R. Igarashi, S.V. Isakov, D. Koop, P.N.
is added to avoid selecting the special configuration for a Ma, P. Mates, H. Matsuo, O. Parcollet, G. Pawski, J.D. Picon, L. Pollet, E. Santos,
default value initial_iv = 1. When a type of an initial V.W. Scarola, U. Schollwck, C. Silva, B. Surer, S. Todo, S. Trebst, M. Troyer, M.L.
vector is selected as a real number, a coefficient value is Wall, P. Werner, S. Wessel, J. Stat. Mech. Theory Exp. 2011 (05) (2011) P05001.
http://stacks.iop.org/1742-5468/2011/i=05/a=P05001.
given by 1,√while as a complex number, the value is given [8] M. Imada, M. Takahashi, J. Phys. Soc. Jpn. 55 (10) (1986) 3354–3361.
by (1 + i)/ 2. [9] J. Jaklič, P. Prelovšek, Phys. Rev. B 49 (1994) 5065–5068. http://dx.doi.org/10.
• For grand canonical ensemble or initial_iv < 0. 1103/PhysRevB.49.5065. http://link.aps.org/doi/10.1103/PhysRevB.49.5065.
An initial vector is given by using a random generator, i.e. co- [10] A. Hams, H. De Raedt, Phys. Rev. E 62 (2000) 4365–4377. http://dx.doi.org/10.
1103/PhysRevE.62.4365. http://link.aps.org/doi/10.1103/PhysRevE.62.4365.
efficients of all components for the initial vector are given by
[11] S. Sugiura, A. Shimizu, Phys. Rev. Lett. 108 (2012) 240401.
random numbers. The seed is calculated as [12] Y. Yamaji, Y. Nomura, M. Kurita, R. Arita, M. Imada, Phys. Rev. Lett. 113 (2014)
107201.
123432 + |rs | + kThread + NThread × kProcess , (A.5) [13] P. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers,
ISBN: 9781558603394, 1997.
where rs is a number given by an input file (parameter [14] http://ma.cms-initiative.jp/en/application-list/hphi.
initial_iv), kThread is the thread ID, NThread is the number [15] E. Anderson, Z. Bai, C. Bischof, L. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A.
of threads, and kProcess is the process ID. Therefore the initial Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, LAPACK Users’ Guide,
third ed., Society for Industrial and Applied Mathematics, 1999.
vector depends both on initial_iv and the number of [16] http://www.mpi-forum.org/.
processes. Random numbers are generated by using SIMD- [17] K. Martin, B. Hoffman, IEEE Soft. (ISSN: 0740-7459) 24 (1) (2007) 46–53. http:
oriented Fast Mersenne Twister (dSFMT) [28]. //dx.doi.org/10.1109/MS.2007.5.
192 M. Kawamura et al. / Computer Physics Communications 217 (2017) 180–192