
Molecular Materials Modeling:

Linear scaling DFT &

High Performance Computing

Joost VandeVondele
Exam date and format

June 17, 2016: oral exam, 25+5 min each; use the doodle link below to register.
First come, first served.

Location: HIT G 31.1.

For those students that registered for the exam
(pick one slot only, register with your full name):
http://doodle.com/poll/w4468gsiepv4znif

Topics from the course and exercises.

Bring your solutions from the exercises as a basis for discussion.


DFT: computational aspects (I)
$$\min_{n(r)} E_{KS}[n(r)]$$

$$n(r) = \sum_{i=1}^{N_{el}} |\psi_i(r)|^2$$

$$\int \psi_i(r)\,\psi_j(r)\,dr = \delta_{ij}$$

$$\psi_i(r) = \sum_\alpha C_{\alpha i}\,\phi_\alpha(r)$$

Minimize an energy functional with respect to the electron density.


The density is obtained from the $N_{el}$ single-particle orbitals, which are orthonormal.

The single-particle orbitals are expanded in basis functions, which can be non-orthogonal.
DFT: computational aspects (II)
Naturally leads to a generalized eigenvalue equation of the form:

$$H_{KS} C_i = S C_i \epsilon_i$$

where the $N_{el}$ lowest eigenvalues and eigenvectors are needed:

$H_{KS}$ : Hamiltonian or Kohn-Sham matrix
$S$ : overlap matrix
$C_i$ : orbital coefficient vector $i$
$\epsilon_i$ : orbital eigenvalue $i$

$$C_i^T S C_j = \delta_{ij} \quad \text{(orthonormality)}$$

Yielding the Kohn-Sham energy, computed via the density matrix $P$:

$$P = \sum_i C_i C_i^T \quad \text{(the density matrix)}$$

$$n(r) = \sum_{\alpha\beta} \phi_\alpha(r)\, P_{\alpha\beta}\, \phi_\beta(r) \quad \text{(the density in the basis)}$$

$$E_{KS}[n(r)] = E_{KS}[P] \quad \text{(the Kohn-Sham energy)}$$
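To make these relations concrete, here is a minimal NumPy/SciPy sketch (a toy with random matrices, not CP2K's actual machinery; sizes and the test matrices are illustrative): it builds S-orthonormal coefficients from a generalized eigenproblem, forms P, and checks the orthonormality condition and that Tr(PH) equals the sum of the occupied eigenvalues.

```python
# Toy sketch (not CP2K code): P = sum_i C_i C_i^T in a non-orthogonal basis,
# with C_i^T S C_j = delta_ij, and the band energy as Tr(P H).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n_basis, n_el = 6, 2

A = rng.standard_normal((n_basis, n_basis))
S = A @ A.T / n_basis + np.eye(n_basis)   # SPD, overlap-like matrix
H = rng.standard_normal((n_basis, n_basis))
H = 0.5 * (H + H.T)                       # symmetric toy Hamiltonian

eps, C = eigh(H, S)                       # generalized eigenproblem, C^T S C = I
Cocc = C[:, :n_el]                        # the N_el lowest eigenvectors
P = Cocc @ Cocc.T                         # the density matrix

print(np.allclose(Cocc.T @ S @ Cocc, np.eye(n_el)))    # orthonormality
print(np.allclose(np.trace(P @ H), eps[:n_el].sum()))  # sum of occupied eigenvalues
```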
DFT: computational aspects (III)
A two-step iterative procedure (simplified; see the sketch after this list) is needed:

Compute the matrix elements of HKS


● Computational procedure depends on the choice of basis, code, etc.
● Dominant term for small systems (< 100s atoms)
● >10 years of development in the current CP2K code, efficient

SCF

Compute P from HKS


● Dominant term for large systems (>100s atoms)
● Diagonalization-like procedures standard
● 'Interesting' methods for large systems.
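A toy version of this two-step loop (the on-site mean-field term is an invented stand-in for the real Hartree/XC contributions, and an orthonormal basis, S = I, is assumed):

```python
# Toy SCF loop: (1) build H_KS from P, (2) build P from H_KS, iterate.
import numpy as np

def build_h(P, H0, g=1.0):
    # Step 1: assemble H_KS; here a made-up on-site mean-field term
    return H0 + g * np.diag(np.diag(P))

def density_from_h(H, n_occ):
    # Step 2: P from H_KS via diagonalization (the O(N^3) part)
    _, C = np.linalg.eigh(H)
    Cocc = C[:, :n_occ]
    return Cocc @ Cocc.T

n, n_occ = 10, 3
H0 = -np.eye(n, k=1) - np.eye(n, k=-1)    # toy tight-binding core Hamiltonian
P = density_from_h(H0, n_occ)             # initial guess from H0 alone
for it in range(200):
    P_new = density_from_h(build_h(P, H0), n_occ)
    if np.linalg.norm(P_new - P) < 1e-10:
        break
    P = 0.5 * P + 0.5 * P_new             # simple linear mixing
print("SCF steps:", it + 1, " residual:", np.linalg.norm(P_new - P))
```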
Diagonalization
● Well established in numerical linear algebra
● Robust for symmetric (Hermitian) matrices
● Obtaining all eigenvalues/eigenvectors of a
matrix is O(N^3), even if the matrix is sparse
● LAPACK: dsyevd / dsyevr / dsyevx
● If only extremal eigenvectors are needed: iterative methods,
e.g. Lanczos / Arnoldi
● These can be O(N)-O(N^2)
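A small SciPy illustration of the contrast (SciPy wraps LAPACK and ARPACK rather than calling dsyevd directly; the 1D Laplacian test matrix is an illustrative choice):

```python
# Dense LAPACK-style solve for all eigenvalues vs. an iterative
# (Lanczos-type) solve for only the lowest few eigenpairs.
import numpy as np
import scipy.sparse as sp
from scipy.linalg import eigh
from scipy.sparse.linalg import eigsh

n = 1000
# sparse symmetric test matrix: 1D Laplacian
T = sp.diags([-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)],
             [-1, 0, 1], format='csr')

w_lo, v_lo = eigsh(T, k=5, which='SA')        # 5 smallest eigenpairs (ARPACK/Lanczos)
w_all = eigh(T.toarray(), eigvals_only=True)  # all eigenvalues, dense O(N^3)
print(np.allclose(np.sort(w_lo), w_all[:5]))
```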
Some numbers
for traditional DFT simulations

● Matrix dimension: the number of basis functions


– Typically ~ 10-20 functions per atom (LCAO)
– 2'000 – 40'000
● Rank P: the number of electrons
– Typically ~ 4 per atom
– 200 – 5'000
– 10% - 50% of the eigenvectors of H needed
● # of 'diagonalizations' needed for 'science'
– 100s for static calculations
– 100'000s for dynamic calculations (ab initio MD)
Linear Scaling SCF
Traditional approaches to solve the self-
consistent field (SCF) equations are O(N^3),
limiting system size significantly.

New algorithms are O(N), allowing far
larger systems to be studied.

Avoid finding the 20% lowest eigenvectors of a
10'000'000 x 10'000'000 matrix:
sparse linear algebra is needed.

[Figure: the largest O(N^3) calculation with CP2K (~6'000 atoms, ~4 nm) next to the largest O(N) calculation with CP2K (~1'000'000 atoms, ~22 nm).]
Linear scaling techniques do not speed up calculations on small systems;
they enable calculations on large systems.
Diagonalization vs. Linear Scaling

Bulk liquid water, traditional diagonalization vs. linear scaling algorithms, ε = 10^-5.
The typical crossover point (vs. OT) is still a few thousand atoms.
Some properties of P
Orthonormality of the C's is equivalent to idempotency of P
$$C_i^T S C_j = \delta_{ij} \;\Leftrightarrow\; PSPS = PS$$
P is a projector on the subspace spanned by the Ci

The sparsity of P depends on the chemistry
of the system, or the spectral properties of (H, S):

Systems with an energy gap (semiconductors)
exhibit exponential decay of the density
matrix.

Systems without an energy gap (metals) show,
at zero temperature, a polynomial decay.

[Figure: band diagram indicating the states included in P and excluded from P. Fig. John R. Brews]


1D example: Hückel Theory
Metal (c = 1) vs. insulator (c = 0.5) for the 6-site ring Hamiltonian

$$H = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & c \\
1 & 0 & c & 0 & 0 & 0 \\
0 & c & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & c & 0 \\
0 & 0 & 0 & c & 0 & 1 \\
c & 0 & 0 & 0 & 1 & 0
\end{pmatrix}$$

Metal (c = 1): P decays as sin(|i-j|)/|i-j|. Insulator (c = 0.5): P decays as exp(-a|i-j|).
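A sketch of this example in NumPy (ring size, filling, and the sampled distances are illustrative choices); for c = 1 the printed elements should fall off slowly with distance, while for c = 0.5 they drop by orders of magnitude:

```python
# 1D Hueckel ring with alternating hoppings 1 and c; at half filling the
# density matrix decays slowly for c = 1 (metal) and exponentially for
# c = 0.5 (insulator).
import numpy as np

def huckel_ring(n, c):
    H = np.zeros((n, n))
    for i in range(n):
        t = 1.0 if i % 2 == 0 else c          # alternating hopping
        H[i, (i + 1) % n] = H[(i + 1) % n, i] = t
    return H

n = 202                                       # even ring, no state exactly at E = 0
for label, c in [("metal     (c=1.0)", 1.0), ("insulator (c=0.5)", 0.5)]:
    eps, C = np.linalg.eigh(huckel_ring(n, c))
    Cocc = C[:, : n // 2]                     # half filling
    P = Cocc @ Cocc.T
    print(label, ["%8.1e" % abs(P[0, d]) for d in (1, 5, 15, 45)])
```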


Some numbers for real examples

# Atoms: 13'846
# Basis functions: 133'214
Basis quality: DZVP

At a threshold of 10^-5,
the percentage of non-zero elements is:

H, S : 2%
P : 15%
Inv(S) : 20%

Typical: 20'000 non-zeros per row

[Figure: an amorphous hole-conducting material used in solar cells]
Localized electrons ?

Molecular orbitals are eigenstates of the Hamiltonian, and are typically delocalized
(i.e. Bloch-like states):

$$n(r, r') = \sum_{i=1}^{N_{el}} \psi_i(r)\, \psi_i(r')$$

Can the localized nature of n be represented in the orbitals?

Note that the idea of localized


electrons is very natural in
chemistry: Lewis structure
Orthogonal transformations

Orthogonal (unitary) transformations keep the density matrix invariant.

$$\tilde{\psi}_k(r) = \sum_{i=1}^{N_{el}} \psi_i(r)\, Q_{ik}, \qquad \sum_k Q_{ik} Q_{jk} = \delta_{ij}$$

$$Q^T Q = 1 = Q Q^T, \qquad Q = e^A, \quad A^T = -A$$

We can perform any orthogonal transformation that is convenient (see the sketch below).

Usually, the one-electron states are chosen as eigenvectors of H,
but this freedom can be used to enforce other properties.
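A quick numerical check of this invariance (toy sizes; SciPy's expm supplies Q = e^A):

```python
# P = C C^T is invariant under an orthogonal mixing C -> C Q,
# with Q = exp(A) for antisymmetric A (orthonormal basis assumed).
import numpy as np
from scipy.linalg import expm, qr

rng = np.random.default_rng(0)
n_basis, n_el = 8, 3
C, _ = qr(rng.standard_normal((n_basis, n_el)), mode='economic')  # C^T C = I

A = rng.standard_normal((n_el, n_el))
A = A - A.T                        # antisymmetric generator, A^T = -A
Q = expm(A)                        # orthogonal: Q^T Q = I

print(np.allclose(Q.T @ Q, np.eye(n_el)))          # Q is orthogonal
print(np.allclose(C @ C.T, (C @ Q) @ (C @ Q).T))   # P is invariant
```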
Maximally localized orbitals
Known as Boys orbitals or maximally localized Wannier functions.

Find an orthogonal transformation that minimizes the sum of the 'spreads'
of the MOs:

$$\Omega = \sum_{i=1}^{N_{el}} \left(\langle\psi_i|x^2|\psi_i\rangle - \langle\psi_i|x|\psi_i\rangle^2\right) + \left(\langle\psi_i|y^2|\psi_i\rangle - \langle\psi_i|y|\psi_i\rangle^2\right) + \left(\langle\psi_i|z^2|\psi_i\rangle - \langle\psi_i|z|\psi_i\rangle^2\right)$$

In the condensed phase (periodic boundary conditions),
a periodic operator, exp(-2πix/L), must be employed to define the functional.

http://www.psi-k.org/newsletters/News_57/Highlight_57.pdf
Wannier centers and orbitals

The centers provide an intuitive picture of the location of the electrons.

The orbitals (see exercises) are well localized.
A complementary view: P as f(H)
The density matrix can also be seen as a (matrix) function of H

$$PS = \frac{1}{1 + \exp\!\left((S^{-1}H - \mu I)/kT\right)} \qquad \text{(Fermi function)}$$

In the limit of small kT a step function is obtained, conveniently written as

$$PS = \frac{1}{2}\left(1 - \mathrm{sign}(S^{-1}H - \mu I)\right)$$
If we can exploit the sparsity of H, S, and P when computing the matrix
functions, we can be more efficient than diagonalization-based approaches.

Almost any way to compute a matrix function can be (and has been) tried:
Chebyshev expansion, contour integrals, recursions, minimizations, ...
The matrix sign function

For diagonalizable A, the eigenvectors of A are also eigenvectors of sign(A),
with the negative and positive eigenvalues of A mapped to -1 and +1, respectively.

[Figure: sign(x) as a step function, jumping from -1 to +1 at 0]
Sign matrix iterations
Various iterative schemes exist to compute sign(A); the simplest is the
Newton-Schulz iteration, which requires only matrix multiplications:

$$X_{n+1} = \tfrac{1}{2} X_n \left(3I - X_n^2\right), \qquad X_0 = A / \|A\|$$

In exact arithmetic, convergence is quadratic:
the number of correct digits of X_n doubles at each iteration.
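A compact NumPy version of the iteration (tolerance, iteration cap, and the random symmetric test matrix are illustrative):

```python
# Newton-Schulz iteration for sign(A): only matrix multiplications needed.
import numpy as np

def matrix_sign(A, tol=1e-12, max_iter=100):
    X = A / np.linalg.norm(A, 2)              # scale spectrum into [-1, 1]
    I = np.eye(A.shape[0])
    for _ in range(max_iter):
        X_new = 0.5 * X @ (3.0 * I - X @ X)
        if np.linalg.norm(X_new - X, 'fro') < tol:
            return X_new
        X = X_new
    return X

rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
M = 0.5 * (M + M.T)                           # symmetric test matrix
S = matrix_sign(M)
print(np.allclose(S @ S, np.eye(50), atol=1e-8))  # sign(M)^2 = I
```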
Convergence illustration
Matrix inversion from sign iterations

So the same iterations can be used to compute inv(S), sqrt(S), inv(sqrt(S)), ... (one route is sketched below).
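One route, sketched under the assumption that S is symmetric positive definite, uses the known block identity sign([[0, S], [I, 0]]) = [[0, S^(1/2)], [S^(-1/2), 0]]; the helper reuses matrix_sign from the sketch above, and the test matrix is illustrative:

```python
# sqrt(S) and inv(sqrt(S)) from the sign iteration, via the block identity
# sign([[0, S], [I, 0]]) = [[0, sqrt(S)], [inv(sqrt(S)), 0]].
# Reuses matrix_sign() from the previous sketch.
import numpy as np

def sqrt_and_invsqrt(S):
    n = S.shape[0]
    Z = np.zeros((2 * n, 2 * n))
    Z[:n, n:] = S                    # top-right block: S
    Z[n:, :n] = np.eye(n)            # bottom-left block: I
    G = matrix_sign(Z)
    return G[:n, n:], G[n:, :n]      # sqrt(S), inv(sqrt(S))

S = np.eye(4) + 0.2 * np.ones((4, 4))              # SPD, overlap-like
R, Rinv = sqrt_and_invsqrt(S)
print(np.allclose(R @ R, S))                       # R = sqrt(S)
print(np.allclose(Rinv @ Rinv, np.linalg.inv(S)))  # inv(S) = inv(sqrt(S))^2
```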


Millions of atoms
in the condensed phase

● Accurate basis sets, DFT: the electronic structure of O(10^6) atoms in < 2 hours on 46'656 cores
● Minimal basis sets (DFT, NDDO, DFTB): 9'216 cores

[Figure: time to solution vs. system size for bulk liquid water; dashed lines represent ideal linear scaling.]

VandeVondele, Borstnik, Hutter, JCTC, 8, 3565 (2012)


Filtering threshold
The cost of the calculation depends on the criterion
(threshold) used to decide if matrix elements are zero.
Roughly: 10x more accurate means 2x the cost.

Cost per SCF step for a DFTB calculation on 6'912 water molecules, as a function of the filtering threshold for the matrix multiplication. The dashed
line represents a fit using the functional form aε^(-1/3). The error in the trace and in the total energy at full SCF convergence is linear in the threshold.
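A toy illustration of this accuracy/cost trade-off (element-wise filtering on a dense array with an imposed exponential decay; DBCSR actually filters whole blocks by their norm):

```python
# Filtering: drop elements below eps, trading accuracy for sparsity (cost).
import numpy as np

def filter_small(A, eps):
    A = A.copy()
    A[np.abs(A) < eps] = 0.0
    return A

rng = np.random.default_rng(2)
n = 400
i, j = np.indices((n, n))
A = np.exp(-0.5 * np.abs(i - j)) * rng.standard_normal((n, n))
A = 0.5 * (A + A.T)                           # symmetric, exponentially decaying

exact_trace = np.trace(A @ A)
for eps in (1e-3, 1e-5, 1e-7):
    Af = filter_small(A, eps)
    fill = np.count_nonzero(Af) / n**2        # fraction of retained elements
    err = abs(np.trace(Af @ Af) - exact_trace)
    print(f"eps={eps:.0e}  fill={fill:6.1%}  |trace error|={err:.2e}")
```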
More advanced SCF methods

Available in CP2K

TRS4:
trace-resetting purification (Niklasson et al., JCP (2003));
automatically determines the chemical potential.

Curvy Steps (Shao et al., JCP (2003)):

robust minimization approach based on ...
Petascale supercomputing
1 petaflops = solve 100'000 coupled equations for 100'000 unknowns in 1 sec.
= 1'000'000'000'000'000 multiplications/additions per sec.

#1 = 34 petaflops (June 2014); Switzerland: 6 petaflops (rank 6, 1st in Europe)

The 37 fastest computers in the world have peak petaflop performance

Parallel computers have followed a path of
sustained exponential growth for 20 years.

● Serial computers... do not exist anymore;
serial programs become irrelevant.

● 1 month of compute now = 1 day in 5 years.

● Few experimental techniques show
exponential increases in throughput,
performance, or resolution.
Hardware: Supercomputers

Switzerland (CSCS) has installed
Europe's fastest supercomputer.

Current top 10 (Nov 2013): China (1st), US
(2nd), Japan (4th), CH (6th), Germany (8th)

20 minuten, Nov. 12, 2013


CP2K: algorithms & implementation
Research & co-design: Hardware vendors & scientists look
together for the best solution (both soft- and hardware)

How can we program
for 10'000 - 1'000'000 cores? For 'emerging' architectures (GPU, CPU)?

Could save 300 MWh/yr for our group.

M. Del Ben, J. Hutter, J. VandeVondele, J. Chem. Theory Comput 8, 4177-4188 (2012)


DBCSR: a sparse matrix library
Distributed Blocked Compressed Sparse Row
Distributed Blocked Cannon Sparse Recursive

Data is distributed on a 2D process grid
for (static) load balancing.

Borstnik, VandeVondele, Weber, Hutter, Parallel Computing, http://dx.doi.org/10.1016/j.parco.2014.03.012


Cannon's Algorithm
● A 2D process grid => √P communication steps

● Reduces to the known, efficient algorithms in the dense case.

● Avoids worst-case all-to-all communication.

● Communication volume scales as 1/√P. (A simulation sketch follows.)
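A single-process NumPy simulation of Cannon's algorithm on a p x p block grid (the block shifts here stand in for the MPI communication steps):

```python
# Cannon's algorithm: after an initial skew, each of the p steps multiplies
# the local blocks and shifts A left / B up by one grid position.
import numpy as np

def cannon(A_blocks, B_blocks):
    p = len(A_blocks)                                   # p x p process grid
    A = [[A_blocks[i][(i + j) % p] for j in range(p)] for i in range(p)]  # skew row i left by i
    B = [[B_blocks[(i + j) % p][j] for j in range(p)] for i in range(p)]  # skew col j up by j
    C = [[np.zeros_like(A[0][0]) for _ in range(p)] for _ in range(p)]
    for _ in range(p):                                  # p communication steps
        for i in range(p):
            for j in range(p):
                C[i][j] += A[i][j] @ B[i][j]            # local multiply
        A = [[A[i][(j + 1) % p] for j in range(p)] for i in range(p)]  # shift A left
        B = [[B[(i + 1) % p][j] for j in range(p)] for i in range(p)]  # shift B up
    return C

p, b = 3, 4
rng = np.random.default_rng(3)
Afull = rng.standard_normal((p * b, p * b))
Bfull = rng.standard_normal((p * b, p * b))
blocks = lambda M: [[M[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
C = np.block(cannon(blocks(Afull), blocks(Bfull)))
print(np.allclose(C, Afull @ Bfull))
```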


Performance: strong scaling

13'846 atoms and 39'560 electrons (cell 53.84 Å), 133'214 basis functions.

At full scale-out on the XC30, one multiplication takes less than 0.5 s on average, and one SCF step 24 s.
Towards O(1) :
constant walltime with proportional resources

Stringent test: small blocks (large overhead), very sparse matrices,
running with 200 atoms / MPI task.

● Local multiplies: constant (OK!)
● Overhead & communication: grow with √N

Needs a replacement for Cannon: work is underway to replace the Cannon
algorithm with something new, retaining the √N maximum communication
while yielding constant communication in the limit.

[Figure: total time, local multiplies, overhead, and communication vs. node count]
DBCSR software layout
Local Multiplication

[Figure: block matrices C, A, B partitioned into atomic blocks; one block row of C is updated as C = C + Σ A × B]

● A submatrix multiplication for every Cannon tick
● Overlaps with communication, in principle
● Based on 'atomic blocks', not individual elements
Cache-Oblivious Multiplication

Recursive multiplication of
matrices to reuse cache,
instead of a fixed M, K, N loop order.

[Figure: recursive splitting of C = A x B over the M (outer), K (mid), and N (inner) dimensions; CSR block storage]
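A sketch of the recursive idea in NumPy (the cutoff and the largest-dimension splitting rule are illustrative; DBCSR recurses over atomic blocks, not scalar elements):

```python
# Cache-oblivious multiplication: split the largest of M, K, N in half
# until the subproblem fits in cache, then multiply directly.
import numpy as np

def rec_mm(A, B, C, cutoff=64):
    """C += A @ B by recursive splitting."""
    m, k = A.shape
    n = B.shape[1]
    if max(m, k, n) <= cutoff:                # small enough: cache-resident
        C += A @ B
        return
    if m >= k and m >= n:                     # split M: rows of A and C
        h = m // 2
        rec_mm(A[:h], B, C[:h], cutoff)
        rec_mm(A[h:], B, C[h:], cutoff)
    elif n >= k:                              # split N: cols of B and C
        h = n // 2
        rec_mm(A, B[:, :h], C[:, :h], cutoff)
        rec_mm(A, B[:, h:], C[:, h:], cutoff)
    else:                                     # split K: cols of A, rows of B
        h = k // 2
        rec_mm(A[:, :h], B[:h], C, cutoff)
        rec_mm(A[:, h:], B[h:], C, cutoff)

rng = np.random.default_rng(4)
A, B = rng.standard_normal((300, 200)), rng.standard_normal((200, 250))
C = np.zeros((300, 250))
rec_mm(A, B, C)
print(np.allclose(C, A @ B))
```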
Two stage processing: GPUs

A two stage process:


● Index building: figure out what to calculate

● Computation: Do the calculations

GPU performs flops

CPU does bookkeeping / MPI


LibSMM: fast small MxM

Chemical block sizes: 1, 4, 5, 6, 13, 23, 26

Generate autotuned kernels,
similar to ATLAS, FFTW, ...
Unrolling, loop ordering, ...
Compiler based (no asm).
Device: LibCUSMM

Optimizing transfers from global device memory to shared mem/registers.


LibCUSMM performance

Measured performance against a roofline model based on memory transfer


Enabling async progress

Double buffering on host and device : async for MPI and device
1st CPU-GPU comparison

Performance comparison of the multi-threaded DBCSR library, based on 23x23 matrix
blocks and not using the MPI capabilities. The benchmark was run on a dual Sandy Bridge
(E5-2620, 2.0 GHz, 6 cores) machine equipped with one NVIDIA Tesla K20 card.
CPU-GPU on hybrid Daint
# nodes | CPU-only | CPU + GPU | CPU + GPU, blocked | CPU/GPU ratio
3844    | 617 s    | 459 s     | 406 s              | 1.5
1024    | 2208 s   | 1351 s    | 1054 s             | 2.1
512     | 4046 s   | 2566 s    | 1341 s             | 3.0
256     | 7124 s   | 4686 s    | OOM-GPU            | N/A
128     | 14268 s  | OOM-GPU   | OOM-GPU            | N/A

As expected, the GPU benefit decreases as communication becomes important.

The GPU memory limit (~6 GB) is triggered in these tests;
no further memory-vs-speed trading is possible (i.e. 2.5D/3D multiplication).

Blocking groups 'atoms' into 'molecules': it improves data locality but increases
total data and flops: 85 PFLOP vs. 132 PFLOP, 256 vs. 512 nodes needed.

Test case 'H2O-dft-ls-orig': 20'000 atoms.


Full system science case

80'000 atoms DFT, high accuracy settings:
aggregated nanoparticles in explicit solution,
relevant for 3rd-generation solar cells.

Matrix dims: ~772'868 x 772'868
Threshold: ~1e-6
Non-zero elements: ~4%
SCF steps: ~50
# multiplies needed: ~2'000
Dense flops needed: 1'846'613'343'679'824'128'000
Actual flops needed: 849'928'403'736'295'802
Sparsity boost: 2172x
GPU flop share: 99.4%
Time on 5'184 nodes: 6'264 s
Sustained actual flops: 0.13 PF
Sustained dense flops: 294.7 PF
Electronic properties of
TiO2 nanocrystals

Grätzel, Nature (1991,2001)

TiO2 nanoparticles are a key ingredient in various systems,
including dye-sensitized solar cells.
Models in explicit solvent

Sizes ranging from 44k to 118k basis functions.

ACN solvent treated with the Kim-Gordon DFT model: naturally suited for linear scaling.
With explicit MD-based equilibration.

A crucial aspect for linear scaling calculations is how to obtain reliable
structures; typical empirical models are often not good enough.
Yielding detailed
geometric information

Large fluctuations in smaller crystals, and compressed surfaces.
Electronic properties
DOS and band gap

Strong fluctuations of the DOS on a frame-by-frame basis.


Visualizing band 'states'
