Dominguez DualSPHysics Workshop 2015 Keynote Optimisation and SPH Tricks

Outline
1. Introduction
1.1. Why is SPH too slow?
1.2. High Performance Computing (HPC)
1.3. DualSPHysics project
2. DualSPHysics implementation
2.1. Implementation in three steps
2.2. Neighbour list approaches
3. CPU acceleration
4. GPU acceleration
4.1. Parallelization problems in SPH
4.2. GPU optimisations
5. Multi-GPU acceleration
5.1. Dynamic load balancing
5.2. Latest optimisations in Multi-GPU
5.3. Large simulations
6. Future improvements
1.1. Why is SPH too slow?

Because:
• Each particle interacts with more than 250 neighbours.
• ∆t = 10^-5 – 10^-4 s, so more than 16,000 steps are needed to simulate 1.5 s of physical time.

Drawbacks of SPH:
• SPH presents a high computational cost that increases with the number of particles.
• The time required to simulate a few seconds is very long: one second of physical time can take several days of calculation.
1.2. High Performance Computing (HPC)

Advantages: GPUs provide high computing power at very low cost and without expensive infrastructure.
Drawbacks: making efficient and full use of the capabilities of GPUs is not straightforward.
2. DualSPHysics implementation
2.1. Implementation in three steps
For the implementation of SPH, the code is organised in three main steps that are repeated each time step until the end of the simulation: Neighbour List (NL), Particle Interaction (PI) and System Update (SU). A minimal sketch of this loop is shown below.
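The following is an illustrative sketch of this three-step loop (the function names, time step and saving interval are hypothetical stand-ins, not the actual DualSPHysics code):

#include <cstdio>

static void   NeighbourList()       { /* sort particles into cells (NL) */ }
static double ParticleInteraction() { /* compute forces (PI) */ return 1e-4; } // returns dt
static void   SystemUpdate(double)  { /* integrate the system (SU) */ }
static void   SaveData(double t)    { std::printf("saved at t=%g s\n", t); }

int main() {
  const double tEnd = 1.5;                     // physical time to simulate (s)
  double t = 0.0, tNextSave = 0.0;
  while (t < tEnd) {
    NeighbourList();                           // (1) NL: build search structure
    const double dt = ParticleInteraction();   // (2) PI: >95% of the runtime
    SystemUpdate(dt);                          // (3) SU: time integration
    t += dt;
    if (t >= tNextSave) { SaveData(t); tNextSave += 0.05; } // occasional output
  }
}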
2.2. Neighbour list approaches
[Diagram: Initial Data → Neighbour List (NL) → Particle Interaction (PI) → System Update (SU) → Save Data (occasionally).]

Particle Interaction (PI) consumes more than 95% of the execution time. However, its implementation and performance depend greatly on the Neighbour List (NL). The NL step creates the neighbour list to optimise the search for neighbours during the particle interaction.
[Figure: search for the neighbours of one particle using cells of side 2h. In this example the particle has 47 real neighbours (dark gray particles) among the candidates found in the adjacent cells.]
Verlet List
• The computational domain is divided into cells of side 2h (the cut-off limit).
• Particles are stored according to the cell they belong to.
• Each particle then looks for its potential neighbours only in the adjacent cells.
• An array of real neighbours is created for each particle.
[Figure: array of real neighbours stored per particle: a1–a4 for particle a, b1–b3 for particle b, c1–c4 for particle c, d1–d2 for particle d.]
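As a sketch of how this search works (a 2-D simplification with hypothetical names; DualSPHysics itself works in 3-D), the real neighbours of one particle can be collected by checking only the adjacent cells:

#include <vector>

struct Particle { float x, y; };

// Returns the indices of the real neighbours of particle i (distance <= 2h).
std::vector<int> FindNeighbours(int i, const std::vector<Particle>& p,
                                const std::vector<std::vector<int>>& cells,
                                int ncx, int ncy, float h) {
  const float cut2 = (2*h) * (2*h);                // squared cut-off distance
  const int cx = int(p[i].x / (2*h));              // cell of particle i
  const int cy = int(p[i].y / (2*h));
  std::vector<int> neigh;
  for (int dy = -1; dy <= 1; ++dy)                 // only the adjacent cells
    for (int dx = -1; dx <= 1; ++dx) {
      const int nx = cx + dx, ny = cy + dy;
      if (nx < 0 || nx >= ncx || ny < 0 || ny >= ncy) continue;
      for (int j : cells[ny * ncx + nx]) {         // candidate neighbours
        const float rx = p[i].x - p[j].x, ry = p[i].y - p[j].y;
        if (j != i && rx*rx + ry*ry <= cut2)
          neigh.push_back(j);                      // real neighbour
      }
    }
  return neigh;
}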
For the Verlet list, the cut-off distance used to build the list is enlarged by a margin ∆h so that the list remains valid while it is kept fixed for C steps:

∆h = v · (2 · Vmax · C · dt)

where Vmax is the maximum velocity, C is the number of time steps during which the list is kept fixed, dt is the physical time of one time step, and v is a constant to remove inaccuracies in the calculations.
Testcase: dam-break flow impacting on a structure
Video link: https://youtu.be/_OFsAVuwxaA

                              Computational runtime                    Memory requirements
CLL (cell-linked list)        fast                                     minimum
VLX (improved Verlet list)    the fastest (only 6% faster than CLL)    very heavy (30 times more than CLL)
VLC (classical Verlet list)   the slowest                              the most inefficient
3. CPU acceleration
Previous ideas:
SPH is a Lagrangian model, so particles move during the simulation.
Each time step, NL sorts the particles (data arrays) to improve memory access in the PI stage, since the access pattern becomes more regular and efficient.
Another advantage is that the particles belonging to a cell can be identified by a simple index range, since the first particle of each cell is known. A sketch of this layout follows.
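A minimal sketch of this layout (hypothetical names): once the arrays are sorted by cell, the particles of cell c occupy the contiguous range [begin[c], begin[c+1]), so no search is needed to find them:

#include <vector>

// begin[c] holds the index of the first particle of cell c in the sorted
// arrays; begin[nc] holds the total number of particles.
void ProcessCell(int c, const std::vector<int>& begin,
                 const std::vector<float>& sortedData) {
  for (int j = begin[c]; j < begin[c + 1]; ++j) {
    // sequential, cache-friendly access to the particles of cell c
    float v = sortedData[j];
    (void)v; // ... interact with particle j ...
  }
}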
[Figure: percentage results (0% to 60%) vs. number of particles N (0 to 1,000,000) for cell sizes 2h, 2h/2, 2h/3 and 2h/4.]
3. CPU acceleration
• OpenMP is the best option to optimize the performance for systems of shared
memory like multi-core CPUs.
• It can be used to distribute the computation load among CPU cores to
maximize the performance and to accelerate the SPH code.
• It is portable and flexible whose implementation does not involve major
changes in the code.
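A minimal OpenMP sketch of the interaction loop (hypothetical names, not the actual DualSPHysics code); dynamic scheduling is useful here because the number of neighbours, and so the work, differs per particle:

#include <omp.h>
#include <vector>

// ace[i] accumulates the acceleration of particle i. Each thread writes only
// to its own particle (the asymmetric variant), so no synchronisation is
// needed between threads.
void ParticleInteraction(int np, std::vector<float>& ace) {
  #pragma omp parallel for schedule(dynamic, 64)
  for (int i = 0; i < np; ++i) {
    float acc = 0.f;
    // ... loop over the neighbours of particle i and accumulate forces ...
    ace[i] = acc;
  }
}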
[Figure: division of the domain into cells identified by ID (ranges 00–09, 16–19, 20–23, 24–27, 28–31, 32–35, 36–39, 40–43, 44–47, 48–49), with boundary particles and fluid particles assigned to their cells.]
[Figure: arrays with read-only access vs. arrays with read/write access during the particle interaction.]
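For context, a minimal sketch of a symmetric pair interaction (hypothetical names): each pair is evaluated once and the opposite reaction is applied to the neighbour (Newton's third law), which is what turns the neighbour data into read/write data:

#include <vector>

// The pair (i, j) is evaluated once; f is the force of j on i. This halves
// the number of interactions, but parallel versions must be careful with
// write conflicts, since several threads may update the same neighbour.
void InteractPair(int i, int j, float f, std::vector<float>& ace) {
  ace[i] += f;   // contribution of j on i
  ace[j] -= f;   // equal and opposite contribution of i on j (read/write)
}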
Speedup for different numbers of particles (N) when applying symmetry and SSE instructions. Two cell sizes (2h and 2h/2) were considered.

[Figure: speedup (1.0 to 2.5) vs. N (0 to 300,000) for SSE(2h/2), SSE(2h), Symmetry(2h/2) and Symmetry(2h).]

Using 300,000 particles, the maximum speedup was 2.3x with symmetry, SSE and cell size 2h/2. The speedup is measured against the version without optimisations.
[Figure: speedup vs. N (0 to 300,000) for the symmetric and asymmetric multi-core versions.]

The speedup is measured against the most efficient single-core version.
4. GPU acceleration
GPU implementation
DualSPHysics was implemented using the CUDA programming language to run the SPH method on Nvidia GPUs.
Important: making efficient and full use of the capabilities of GPUs is not straightforward. It is necessary to know and take into account the details of the GPU architecture and of the CUDA programming model.
Two initial GPU implementations were tested: a partial and a full GPU implementation.
Partial GPU implementation (only Particle Interaction runs on the GPU, with CPU–GPU data transfers every step):
[Diagram: Initial Data → Neighbour List (NL) → data transfer CPU–GPU → Particle Interaction (PI) on the GPU → Save Data (occasionally).]

Full GPU implementation (NL, PI and System Update all run on the GPU):
[Diagram: Initial Data → data transfer CPU–GPU → Neighbour List (NL), Particle Interaction (PI) and System Update (SU) on the GPU → data transfer GPU–CPU → Save Data (occasionally).]
4.1. Parallelization problems in SPH

Problems in the Particle Interaction step

• Code divergence:
GPU threads are grouped into sets of 32 (warps) which execute the same operation simultaneously. When the threads of one warp have to execute different operations, these operations are executed sequentially, giving rise to a significant loss of efficiency.

[Figure: execution with no divergent warps vs. execution with divergent warps.]
• Non-coalesced memory accesses:
The global memory of the GPU is accessed in blocks of 32, 64 or 128 bytes, so the number of accesses needed to satisfy a warp depends on how the data are grouped. In SPH a regular memory access pattern is not possible because the particles move at each time step.

[Figure: coalesced access: the threads read 16 consecutive positions (0–15), so only 1 memory access is required. Non-coalesced access: the positions are scattered (0–5, 45–48, 56–57, 13–16), so 4 memory accesses are required.]
• Unbalanced workload:
Warps are executed in blocks of threads. To execute a block, some resources are assigned to it, and they are not available for other blocks until the execution ends. In SPH the number of interactions is different for each particle, so one thread can still be executing, keeping the assigned resources, while the rest of the threads have already finished.
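These three problems shape the typical GPU kernel: one thread per particle, each thread writing only its own results. A minimal CUDA sketch (hypothetical names, physics omitted):

__global__ void InteractionKernel(int np, const float2* pos,
                                  const int2* cellRange, float2* ace) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread = one particle
  if (i >= np) return;
  float2 acc = make_float2(0.f, 0.f);
  // ... for each adjacent cell c: for (int j = cellRange[c].x;
  //         j < cellRange[c].y; ++j) accumulate the force of j on i ...
  ace[i] = acc;   // thread i writes only position i: coalesced write, no atomics
}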
4.2. GPU optimisations

[Table: computational runtimes (in seconds) using a GTX 480 for the different GPU implementations (partial, full and optimised) when simulating 500,000 particles.]
Speedup of the fully optimised GPU code over the GPU code without optimisations:

[Figure: speedup (up to 120%) broken down by optimisation: division of the domain into smaller cells, and adding a more specific CUDA kernel of interaction.]
[Figure: runtime (h) vs. number of particles N (0 to 12,000,000) for CPU single-core, CPU 8 cores, GTX 480, GTX 680, Tesla K20 and GTX Titan.]

Speedup                  GTX 480   GTX 680   Tesla K20   GTX Titan
vs CPU 8 cores              13        16         17          24
vs CPU single-core          82       102        105         149
The simulation of real cases implies huge domains with high resolution, which means simulating tens or hundreds of millions of particles.
The use of one GPU presents important limitations:
- The maximum number of particles depends on the memory size of the GPU.
- The execution time increases rapidly with the number of particles.

[Figure: results vs. number of particles (millions, 0 to 20) for GTX 480, GTX 680, Tesla K20, Tesla M2090 and GTX Titan.]
5. Multi-GPU implementation

The physical domain of the simulation is divided among the different MPI processes. Each process only needs to assign resources to manage a subset of the total number of particles, corresponding to its subdomain.
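A minimal MPI sketch of this decomposition (assuming, for illustration, a uniform 1-D split along x; the actual limits are adjusted dynamically, see 5.1):

#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const double x0 = 0.0, x1 = 10.0;          // limits of the full domain (x)
  const double w = (x1 - x0) / nprocs;       // width of one subdomain
  const double myMin = x0 + rank * w;        // this process manages the
  const double myMax = myMin + w;            // particles with x in [myMin, myMax)
  // ... allocate arrays only for the particles of this subdomain (plus the
  //     halo of neighbouring particles needed for the interactions) ...
  (void)myMin; (void)myMax;
  MPI_Finalize();
}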
Solutions:
- Overlap between force computation and communications: while data are transferred between processes, each process can compute the force interactions among its own particles. In the GPU case, the CPU–GPU transfers can also be overlapped with computation by using streams and pinned memory (see the sketch below).
- Load balancing: dynamic load balancing is applied to minimise the difference between the execution times of the processes.
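A minimal CUDA sketch of the first point (hypothetical kernels and names): the halo copy and the halo computation are placed on their own stream so that they overlap with the internal force computation:

__global__ void InternalForcesKernel(float2* ace) { /* PI among own particles */ }
__global__ void HaloForcesKernel(const float* halo, float2* aceHalo) { /* PI with halo */ }

// haloHost must be pinned memory (cudaMallocHost) for the copy to overlap.
// The two contributions go to separate arrays so the kernels can run
// concurrently; they are summed afterwards.
void OverlapHaloWithComputation(float* haloHost, float* haloDev, size_t haloBytes,
                                float2* ace, float2* aceHalo,
                                cudaStream_t sMain, cudaStream_t sHalo) {
  InternalForcesKernel<<<128, 256, 0, sMain>>>(ace);         // starts at once
  cudaMemcpyAsync(haloDev, haloHost, haloBytes,
                  cudaMemcpyHostToDevice, sHalo);            // overlapped copy
  HaloForcesKernel<<<128, 256, 0, sHalo>>>(haloDev, aceHalo);// queued after the copy
  cudaStreamSynchronize(sMain);                              // both done before
  cudaStreamSynchronize(sHalo);                              // the System Update
}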
5.1. Dynamic load balancing

Due to the Lagrangian nature of the SPH method, it is necessary to balance the load throughout the simulation.
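As an illustration of the idea (a sketch only, not the actual DualSPHysics algorithm), the subdomain limits can be corrected using the execution time measured for each process:

#include <vector>

// limits[i]..limits[i+1] delimit the subdomain of process i; stepTime[i] is
// the time that process i spent on the last steps. Slower processes lose
// domain (and therefore particles); faster ones gain it.
void Rebalance(std::vector<double>& limits,
               const std::vector<double>& stepTime) {
  const int n = static_cast<int>(stepTime.size());
  double mean = 0;
  for (double t : stepTime) mean += t;
  mean /= n;
  const double gain = 0.05;                      // small correction per call
  for (int i = 0; i + 1 < n; ++i) {
    const double w = limits[i + 1] - limits[i];
    limits[i + 1] -= gain * w * (stepTime[i] - mean) / mean;
  }
}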
Results using one GPU and several GPUs with dynamic load balancing:

[Figure: runtime (h, 0 to 15) for GTX 285, GTX 480 and GTX 680, comparing one GPU against several GPUs with dynamic load balancing.]
5.2. Latest optimisations in Multi-GPU

[Figure: communication between Process 0 and Process 1: the POS, VEL and RHOP arrays of the selected particles are packed, together with a HEAD block, into a single buffer to be sent, and are unpacked from the received buffer on the other side.]
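A minimal sketch of this packing (hypothetical names and layout): the POS, VEL and RHOP values of the selected particles are gathered into one contiguous buffer so that a single message is sent:

#include <cstring>
#include <vector>

struct float3s { float x, y, z; };

std::vector<char> PackParticles(const std::vector<int>& sel,
                                const float3s* pos, const float3s* vel,
                                const float* rhop) {
  const size_t n = sel.size();
  std::vector<char> buf(n * (2 * sizeof(float3s) + sizeof(float)));
  char* p = buf.data();
  for (int i : sel) {                        // gather the selected particles
    std::memcpy(p, &pos[i],  sizeof(float3s)); p += sizeof(float3s);
    std::memcpy(p, &vel[i],  sizeof(float3s)); p += sizeof(float3s);
    std::memcpy(p, &rhop[i], sizeof(float));   p += sizeof(float);
  }
  return buf;                                // one buffer -> one MPI send
}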
[Figure: data path between Process 0 and Process 1: GPU memory → CPU memory → CPU memory → GPU memory, with ×2 annotations on the copies.]
Before:
[Timeline: internal force computation; reception of halo-1 and halo-2; then halo-1 to GPU, halo-1 computation, halo-2 to GPU and halo-2 computation executed afterwards.]

Now:
[Timeline: reception of halo-1 and its transfer to the GPU overlap with the internal force computation; halo-1 and halo-2 computations follow, giving a shorter total execution time.]
Testcase for results:
• Dam-break flow.
• The physical time of the simulation is 0.6 seconds.
• The number of particles varies from 1M to 1,024M.
Results of efficiency:
The simulations were carried out at the Barcelona Supercomputing Center BSC-CNS (Spain), on a system built with 256 Tesla M2090 GPUs.
All the results presented here were obtained using single precision and with error-correcting code (ECC) memory disabled.
[Figure: percentage (0% to 12%) vs. number of GPUs (0 to 128), for 1M, 4M and 8M particles per GPU, comparing the new and the old versions.]

The latest improvements have reduced this percentage by half for different numbers of GPUs and different numbers of particles.
5.3. Large simulations (64 GPUs)

Simulation of 1 billion SPH particles: large wave interaction with an oil rig using 10^9 particles.
• dp = 6 cm, h = 9 cm
• np = 1,015,896,172 particles
• nf = 1,004,375,142 fluid particles
• physical time = 12 s
• number of steps = 237,065
• runtime = 79.1 hours
Video link: https://youtu.be/B8mP9E75D08
5.3. Large simulations (32 GPUs)

Simulation of a real case, using the 3D geometry of Itzurun beach in Zumaia, Guipúzcoa (Spain) from Google Earth.
• 32 × Tesla M2090 (BSC)
• Particles: 265 million
• Physical time: 60 seconds
• Steps: 218,211
• Runtime: 246.3 hours
Video links:
https://youtu.be/nDKlrRA_hEA
https://youtu.be/kWS6-0Z_jIo
6. Future improvements

• Splitting
• Coalescing
Video link:
https://youtu.be/OzPjy2aMuKo
Optimisation and SPH Tricks