Melander Et Al. - 2020 - Massively Parallel Nodal Discontinous Galerkin Finite Element Method Simulator For Room Acoustics

International Journal of High Performance Computing Applications
Massively Parallel Nodal Discontinous Galerkin Finite

Element Method Simulator for Room Acoustics
Journal: International Journal of High Performance Computing Applications
Manuscript ID hpc-20-0135
Manuscript Type: Research Paper
Date Submitted by the

28-Aug-2020
Author:
Complete List of Authors: Melander, Anders; Danmarks Tekniske Universitet, Applied mathematics
and computer science
Strøm, Emil; Technical University of Denmark
Fo
Pind, Finnur; Technical University of Denmark
Engsig-Karup, Allan; Danmarks Tekniske Universitet, Applied
Mathematics and Computer Science
Jeong, Cheol-Ho; Technical University of Denmark
rP
Warburton, Timothy; Virginia Polytechnic Institute and State University,

Mathematics
Chalmers, Noel; Virginia Polytechnic Institute and State University,
ee
Mathematics
Hesthaven, Jan; EPFL
High-performance computing, Multi-device acceleration, Heterogeneous

rR
Keywords: CPU-GPU computing, Room acoustic simulation, Discontinuous Galerkin

method
We present a massively parallel and scalable nodal Discontinuous

ev
Galerkin Finite Element Method (DGFEM) solver for the time-domain

linearised acoustic wave equations. The solver is implemented using the
libParanumal finite element framework with extensions to handle
curvilinear geometries and frequency dependent boundary conditions of
iew
relevance in practical room acoustics. The implementation is

benchmarked on heterogeneous multi-device many-core computing
architectures, and high performance and scalability are demonstrated for
Abstract: a problem that is considered expensive to solve in practical applications.
In a benchmark study, scaling tests show that multi-GPU support gives
the ability to simulate large rooms, over a broad frequency range, with
realistic boundary conditions, both in terms of computing time and
memory requirements. Furthermore, numerical simulations on two non-
trivial geometries are presented, a star-shaped room with a dome and
an auditorium. Overall, this shows the viability of using a multi-device
accelerated DGFEM solver to enable realistic large-scale wave-based
room acoustics simulations using massively parallel computing.
Note: The following files were submitted by the author for peer review, but cannot be converted to PDF.
You must view these files (e.g. movies) online.
SageH.bst
SageV.bst
http://mc.manuscriptcentral.com/ijhpca
Page 2 of 22 International Journal of High Performance Computing Applications
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Fo
20
21
22
23
rP
24
25
26
ee
27
28
29
rR
30
31
32
ev
33
34
35
36
iew
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60 http://mc.manuscriptcentral.com/ijhpca
International Journal of High Performance Computing Applications Page 3 of 22
1 Journal Title
2 Massively Parallel Nodal Discontinous XX(X):1–20
3 c The Author(s) 2016
4
5
Galerkin Finite Element Method Reprints and permission:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/ToBeAssigned
6
7
Simulator for Room Acoustics www.sagepub.com/
SAGE
8
1 1 2,3
9 Anders Melander , Emil Strøm , Finnur Pind , Allan P. Engsig-Karup , Cheol-Ho Jeong3 ,
1
10
Tim Warburton4 , Noel Chalmers4 and Jan S. Hesthaven5
11
12
13
14
Abstract
15
16 We present a massively parallel and scalable nodal Discontinuous Galerkin Finite Element Method (DGFEM) solver
17 for the time-domain linearised acoustic wave equations. The solver is implemented using the libParanumal finite
18 element framework with extensions to handle curvilinear geometries and frequency dependent boundary conditions
19
20 of relevance in practical room acoustics. The implementation is benchmarked on heterogeneous multi-device many-
21 core computing architectures, and high performance and scalability are demonstrated for a problem that is considered
Fo
22 expensive to solve in practical applications. In a benchmark study, scaling tests show that multi-GPU support gives
23 the ability to simulate large rooms, over a broad frequency range, with realistic boundary conditions, both in terms
24
of computing time and memory requirements. Furthermore, numerical simulations on two non-trivial geometries are
rP
25
26 presented, a star-shaped room with a dome and an auditorium. Overall, this shows the viability of using a multi-device
27 accelerated DGFEM solver to enable realistic large-scale wave-based room acoustics simulations using massively
28
ee
parallel computing.
29
30
31 Keywords
rR
32 High-performance computing, multi-device acceleration, heterogeneous CPU-GPU computing, room acoustic

33 simulation, discontinuous Galerkin method.
34
35
ev
36
37 1 Introduction This motivates the use of wave-based methods, where the
38
iew
governing partial differential equations that describe wave

39
Room acoustic simulations are widely used in, e.g., building motion are solved numerically. Given the right input data,
40
41 design, virtual reality and hearing research. The task is these methods are very accurate, since no approximation
42 challenging from a computational point of view as it to the wave propagation is made and all wave phenomena
43 involves simulating large and complex domains, over a are inherently accounted for. Several different numerical
44 techniques have been applied, such as the finite-difference
broad frequency spectrum and long times. Room sizes range
45
46 from 30 m3 to 30 000 m3 , our auditory system can hear time-domain (FDTD) method (Botteldooren 1995), the
47 frequencies from 20 Hz to 20 kHz, and sound typically pseudospectral time-domain (PSTD) method (Hornikx et
48 lasts in rooms somewhere between 0.5 s (living room) to al. 2016), the finite volume method (FVM) (Bilbao 2013),
49
3.0 s (concert hall). Historically, the prevailing approach has
50
51 been to apply geometrical acoustics methods (Savioja and
1 Scientific Computing Section, Department of Applied Mathematics
52 Svensson 2015), such as the image source method (Allen
and Computer Science, Technical University of Denmark, Kgs. Lyngby,
53 and Berkley 1979) or ray-tracing (Krokstad et al. 1968), Denmark.
54 2 Henning Larsen, Copenhagen, Denmark.
where the acoustic wave is approximated as a bundle of
55 3 Acoustic Technology Group, Department of Electrical Engineering,
56 rays that are propagated in the room using the laws of ray Technical University of Denmark, Kgs. Lyngby, Denmark.
4 Department of Mathematics, Virgina Tech, United States.
57 optics. This reduces the computational task considerably, but 5 Chair of Computational Mathematics and Simulation Science, cole
58 the approximation is only appropriate in the high-frequency Polytechnique Fdrale de Lausanne, Lausanne, Switzerland.
59
limit. At low-mid frequencies, wave phenomena such as
60 Corresponding author:
diffraction and interference dominate and other simulation Anders Melander
methods must be used. Email: anders.d.melander@gmail.com
Prepared using sagej.cls [Version: 2017/01/17 v1.20]
2 Journal Title XX(X)
1
2 the boundary element method (BEM) (Hargreaves et al. et al. is described as efficient, it cannot handle complex
3 2019), and the finite element method (FEM) (Craggs geometries or non-trivial boundary conditions (BCs).
4 1994). Recently, nodal high-order-accurate variants of The goal of this work is to introduce the details of a
5
6 the FEM, such as the spectral element method (SEM) HPC approach to simulate three-dimensional room acoustics
7 (Pind et al. 2019) and the discontinuous Galerkin finite using the DGFEM. To the best of our knowledge, this is
8 element method (DGFEM) (Wang et al. 2019), have been the first time a massively parallel GPU-accelerated wave-
9 applied to simulate room acoustics. High-order methods based DGFEM room acoustic simulator is presented in
10
11 result in better accuracy-per-computational-cost and they the literature. The simulator is written in C/C++ with
12 have the potential to significantly reduce the runtime of Message Passing Interface (MPI) as its backbone, and
13 simulations, and are a must for having good accuracy in the device-interfacing is implemented using the single
14 wave propagation problems over long integration times kernel language for parallel programming Open Concurrent
15
(Kreiss and Oliger 1972). The nodal DGFEM is particularly Compute Abstraction (OCCA) (Medina et al. 2014) for
16
17 well suited for room acoustic simulations, because it portable multi-threading code development across different
18 combines the attractive features of geometric flexibility, many-core hardware architectures. Curvilinear meshes and
19 high-order accuracy, suitability for parallel computing, and locally reacting frequency independent and frequency
20
lean memory usage. In the DGFEM schemes, the solution dependent impedance boundary conditions are supported in
21
Fo
22 is computed locally for each element, with communication the implementation.

23 between elements only at the element boundaries. This data The paper is organized as follows. Section 2 presents the
24 locality can be utilized for parallelization of the solver. governing equations and the boundary conditions. Section
rP
25
3 presents the numerical discretization of the governing
26
27 The major drawback of wave-based methods is the equations. In Section 4, the HPC implementation is described
28 and Section 5 contains several numerical experiments that
ee
associated high computational cost. Simulating large spaces

29 offer insights into the performance of the simulator. Finally,
30 (larger than, say, 1 000 m3 ) into the mid frequency range (up
some concluding remarks are offered in Section 6.
31 to and beyond 1 kHz) typically requires tens or hundreds
rR
32 of millions of degrees of freedom (DoF). Despite progress

33 in numerical methodology, wave-based methods are mostly 2 The mathematical model
34
35 applied to small rooms at very low frequencies. However,
ev
Acoustic wave propagation in a lossless medium is governed

36 with the rapid advancements in many-core computing
by the coupled first order system of partial differential
37 hardware, massively parallel high-performance computing
38 equations
iew
(HPC) – in particular, the use of graphics processing units

39
(GPUs) for scientific computing – offers a way to greatly 1
40 vt = − ∇p,
41 extend the usability of wave-based methods to large spaces ρ (1)
42 and for a broad spectrum of frequencies. There is some pt = −ρc2 ∇ · v,
43 past work on the HPC implementation of wave-based room
44 where v(x, t) [m/s] is the particle velocity at position x in
acoustic simulations. Lopez et al. (2013) and Spa et al.
45 the domain Ω at time t, p(x, t) is the acoustic pressure [Pa],
46 (2015) studied the GPU implementation of the FDTD
47 ρ is the density of air and c is the speed of sound in air. In
method and found that, when implemented efficiently, the
48 this work, we use ρ = 1.2kg/m3 and c = 343 m/s. We let
use of GPUs could drastically improve the computational
49 u, v and w be the Cartesian coordinates x, y, z parts of the
performance. The FDTD method, like the DGFEM, lends
50 velocity field v = [u, v, w]T .
51 itself naturally to parallelization but has a low arithmetic
52 complexity. Takahashi and Hamada (2009) studied the GPU
53 acceleration of the BEM when applied to the frequency 2.1 Boundary conditions
54
domain Helmholtz equation and reported a considerable To solve realistic problems, boundary conditions are needed,
55
56 improvement in performance. Lastly, Morales et al. (2015) to accurately capture the absorption properties of the
57 considered a distributed memory multi-CPU implementation boundary of the domain ∂Ω. In the following, we will
58 of an adaptive rectangular decomposition (ARD) algorithm introduce the boundary condition types that have been
59
for room acoustic simulations and found that they could implemented in the DGFEM solver. These are rigid,
60
simulate large domains over broad frequency ranges using frequency independent and frequency dependent boundary
the HPC setup. While the ARD implementation of Morales conditions.
Prepared using sagej.cls
Melander et al. 3
1
2 2.1.1 Rigid The rigid boundary condition perfectly reflects which can be rewritten using a partial fraction decomposition
3 the impinging wave back into the domain Ω. Mathematically
4 this is expressed as
Q
X Ak
5 Y (ω) =Y∞ + +
λk − jω
6 k=1
! (6)
7 n̂ · v = 0 on ∂Ω, (2) S
X Bk + jCk Bk − jCk
8 + .
αk + jβk − jω αk − jβk − jω
9 where n̂ is the outward pointing normal of the boundary k=1
10 surface ∂Ω. Here Q is the number of real poles, S is the number of

11
12 complex conjugate pole pairs of Y (ω), λk are the real poles,
13 αk ± jβk are the complex conjugate pole pairs, and Y∞ ,
14 Ak , Bk and Ck are numerical coefficients. For causality, λk
15 2.1.2 Locally-reacting frequency independent The fre-
and αk must be semi-positive. Furthermore, for each set of
16 quency independent impedance boundary condition adds
17 numerical coefficients, the passivity of the multipole model
absorption to the domain boundaries, with all frequencies
18 must be checked. If the model is non-passive, it will add
being absorbed equally. The absorption is dictated by the
19 energy to the simulation, potentially leading to a blowup
20 normal incidence surface impedance Z. The normal velocity
of the solution. The total number of poles Npoles = Q + 2S,
21 at the boundary becomes
Fo
22 should be chosen large enough to satisfy an error threshold

23 p of the multipole approximation of the boundary admittance.
n̂ · v̂ = vn = = pY, (3)
24 Z
rP
25 By applying the inverse-Fourier transform to (4), we get

where Y is the normal boundary admittance. Using high
26
27 values of Z implies a highly reflective boundary conditions, Z t
28 whereas values close to the characteristic impedance of the vn (t) = p(t0 )y(t − t0 )dt0 . (7)
ee
−∞
29 air Zair = ρc results in a highly absorptive boundary.
30 Then by applying the inverse Fourier transform to (6) and
31
rR
32 inserting the result into (7), the expression for the velocity
33 vn at the boundary becomes
34 2.1.3 Locally-reacting frequency dependent Most real
35 Q
ev
world surfaces exhibit frequency dependent absorption. X

36 vn (t) =Y∞ p(t) + Ak φk (t)+
Mathematically, this can be written as
37 k=1
(8)
S
38
iew
p̂(ω)

(1) (2)
X
39 v̂n = = p̂(ω)Y (ω), (4) 2 Bk ψk (t) + Ck ψk (t) ,
Z(ω)
40 k=1
41 where ω is the angular frequency and (v̂n , p̂) denote where φk is the accumulator for the k’th real poles, and
42 (1) (2)
the Fourier transformed normal velocity and pressure, ψk and ψk are the accumulators for the k’th complex
43
44 respectively. Z(ω) is no longer constant, as in the frequency conjugate poles pairs. These accumulators are defined by a
45 independent case, but is a complex function of the angular set of ordinary differential equations (ODEs)
46 frequency. For a particular surface, Z(ω) can be measured,
47 dφk
e.g., in an impedance tube or using pressure-velocity λk φk = p(t),
48 dt
49 sensors. Alternatively, several material models exist to (1)
dψk (1) (2)
50 estimate surface impedance of many common materials and αk ψk + βk ψk = p(t), (9)
dt
51 constructions. (2)
52 dψk (2) (1)
αk ψk − βk ψk = 0.
53 To incorporate the frequency dependent boundary dt
54 conditions accurately and efficiently into the time-domain Thus, when using the ADE method, an additional set of
55 simulation, the method of auxiliary differential equations
56 ODEs must be solved at the boundary in each time step.
57 (ADEs) can be used (Pind et al. 2019; Dragna et al. 2015). In The added computational cost due to this additional set
58 the ADE method, the surface admittance is approximated as of ODEs is relatively low, as the amount of extra ODEs
59 a rational function, defined as, needed to solve is small compared to the system that must
60
a0 + ... + aN (−jω)N be solved for the acoustic wave equations. In total, there
Y (ω) = , (5) will be Npoles times the number of boundary nodes with
1 + ... + bN (−jω)N
1
2 frequency dependent boundary condition ODEs to solve. variables vhk and pkh , takes the following form
3 This extra work generally grows significantly slower than the
4 ∂vhk k
Z Z
1
total number of nodes in a domain when the mesh resolution `n (x)dx = − ∇pkh `kn (x)dx
5 D k ∂t ρ Dk
6 increases in the domain. Z
1
n̂ pkh − p∗ `kn (x)dx,

7 +
ρ ∂Dk
8 Z k Z
9 ∂ph k
3 The numerical model ` (x)dx = − ρc2 ∇ · vhk `kn (x)dx
10 D k ∂t n Dk
11
Z
3.1 Spatial discretization + ρc2 n̂ · vhk − v∗ `kn (x)dx.

12 ∂D k
13 The spatial derivatives in (1) are discretized using DGFEM. (13)
14 DGFEM is a high-order method (Hesthaven and Warburton
15 2008) capable of solving the time-domain acoustic partial where the star-labeled terms denote numerical fluxes that
16 propagate the solution between elements. This can be
differential equations within a domain Ω, subject to an
17
18 appropriate set of boundary and initial conditions. The rewritten in a discrete formulation using matrix-based
19 spatial domain is tessellated by K non-overlapping elements operators for integration and differentiation
20 as
21 K dukh 1 1
= − Dxk pkh + Lk pkh − p∗ n̂x ,
Fo

[ (14a)
22 Ω ' Ωh = Dk . (10) dt ρ ρ
23 k=1
dvhk 1 1
= − Dyk pkh + Lk pkh − p∗ n̂y ,

24 In this work, the elements are taken to be tetrahedral dt ρ ρ
(14b)
rP
25
elements in the three-dimensional space. On each of these dwhk 1 1
26 = − Dzk pkh + Lk pkh − p∗ n̂z ,

(14c)
27 elements the solution variables v and p are expressed on dt ρ ρ
28 the k’th element in terms of polynomial basis functions and dpk
ee
= −ρc2 Dx ukh + Dy vhk + Dz whk

29 corresponding modal or nodal coefficients. This is expressed dt
30 + ρc2 Lk n̂ · vhk − v∗ .

as (14d)
31
rR
32 Np Np
The matrix operations on the right hand side are expressed
33
X X
vhk (x, t) = v̂nk (t)φn (x) = vhk (xkn , t)`kn (x),
34 in terms of the operators defined on the reference elements
n=1 n=1
35 (11)
ev
Np Np
36 Dr = Vr V −1 , Ds = Vs V −1 , Dt = Vt V −1 ,
X X
pkh (x, t) = p̂kn (t)φn (x) = pkh (xkn , t)`kn (x).
37 (15)
n=1 n=1 M = (VV T )−1 .
38
iew
39 Here φn (x) is the hierarchical modal basis function and

Here Vij ≡ φj (ri ) is the generalized Vandermonde matrix
40 `ki (x) is the corresponding nodal Lagrange polynomial basis
41 and (VX )ij ≡ ∂φj (ri )/∂X are defined in terms of reference
function on element Dk . v̂nk (t) and p̂kn (t) are the modal
42 element coordinates X = r, s, t.
43 coefficients for the modal basis functions, and vhk (xkn , t)
44 and pkh (xkn , t) are the corresponding nodal coefficients
45 for the nodal basis functions. The modal basis functions
46 These operators can, via the chain rule of the continuous
are assumed to be orthonormal as this simplifies several
47
needed computations. On each element xkn is a node variables, e.g.,
48
49 distribution of Np = ((N + 1)(N + 2)(N + 3))/6 distinct
∂f (x) ∂r ∂f (x) ∂s ∂f (x) ∂t ∂f (x)
50 nodal interpolation points, where N is the order of the basis = + + , (16)
∂x ∂x ∂r ∂x ∂s ∂x ∂t
51
functions. An α-optimized nodal distribution is used due to
52 be related to the corresponding discrete differential
53 its low Lebesque constants (Hesthaven and Teng 2000). The
operations in the physical domain for each element
54 solutions on the elements are used to approximate the global
55 solutions v and p in terms of direct sums
56 Dxk = rxk Dr + skx Ds + tkx Dt ,
57 K
M K
M Dyk = ryk Dr + sky Ds + tky Dt , (17)
58 v ≈ vh = vhk , p ≈ ph = pkh . (12)
59 k=1 k=1
Dzk = rzk Dr + skz Ds + tkz Dt .
60
The strong formulation of the governing equations, when Furthermore, we introduce the lift operator L needed for the
using the nodal basis series to represent the unknown incorporation of boundary conditions via numerical fluxes at
Melander et al. 5
1
2 the element edges We are interested in the numerical flux along the outward
3 pointing element boundary normal n̂, which leads to the
4 L = (M)
−1
E, M = (VV T )−1 , (18) creation of the following operator
5
6
7 where E is the column-wise concatenation of mass matrices N = n̂ · F = n̂x Ax + n̂y Ay + n̂z Az . (23)
8 defined in terms of the nodal distributions on the faces.
9 Applying an eigendecomposition on N yields
10
To avoid computing and storing these matrices for every
11 N = RDR−1 , (24)
12 element, the matrix operators are determined for a reference
13 element, and a geometric mapping from the arbitrary mesh
where
14 elements to the reference elements is used, e.g.,
15 
n̂x n̂

− n̂ρcx − n̂n̂xz − n̂xy
16  n̂ρcy
Mk = J k (VV T )−1 , (19) n̂
− ρcy 0

1 
17 R=
 ρc
 ,
18  n̂z
 ρc − n̂ρcz 1 0 
where J k is a diagonal matrix storing the computed Jacobian

19 1 1 0 0
20 values of the local map of the k’th element to the reference   (25)
21 element. Hence, we need to compute and store the J k ’s c 0 0 0
Fo
22
 
0 −c 0 0
for each element, and the procedure in this way treats D= .
23 0 0 0 0
24 curvilinear and straight-sided elements in a similar way. By  
0 0 0 0
rP
25 utilizing matrix-based integration via the mass matrices in

26 the numerical scheme all integrals in the weak formulation
27 Here the diagonal of D contains the eigenvalues of N.
is exact to twice the polynomial order of the polynomial
28 Since all eigenvalues are real, the system is hyperbolic, and
ee
29 basis. For curvilinear elements where the Jacobians are non-

therefore the eigenvalues represent the speeds of propagation
30 constant, this may lead to cubature errors in the operators.
of information of the characteristic variables b, defined as
31
rR
32 R−1 q = b (Eleuterio 1999).

The choice of the numerical flux stabilizes the dispersion
33
34 and dissipation properties of the DGFEM scheme. In this
35
ev
work, the upwind flux is chosen. The upwind flux combines

36 low dispersion errors over a wide range of frequencies with
37
dissipation for underresolved frequencies (Ainsworth 2004; By splitting the eigenvalue matrix D into its positive and
38
iew
39 Hu and Atkins 2002). To derive the upwinding flux, a negative parts

40 Riemann problem at each element edge must be solved. We
41 start by rewriting (1) in the quasi-linear form of a hyperbolic D = D+ + D−
42    
system c 0 0 0 0 0 0 0
43
44

0 0 0
 
0 0 −c 0
 

0 (26)
∂q ∂q ∂q = + ,
45 + ∇ · F(q) = + Aj = 0, (20) 0 0 0 0
 0 0 0 0
∂t ∂t ∂xj
46
 
0 0 0 0 0 0 0 0
47
where q(x, t) = [u, v, w, p]T is the acoustic variable vector,
48
j ∈ [x, y, z] is the Cartesian coordinate index and Aj is a D+ corresponds to the parts of b where the direction is
49
50 constant flux Jacobian matrix, given as along the normal vector n̂, meaning the information is
51 leaving the element, whereas D− corresponds to the parts
52
 δxj

0 0 0 ρ of b where the direction is opposite n̂, implying that the
53 
 0 δyj 
0 0 information is entering the element. In the numerical flux we
54 ρ 

Aj = 
 0 δzj  , (21)
55  0 0 ρ  seek to combine the information from the current element
56 ρc2 δxj ρc2 δyj ρc2 δzj 0 boundary point, given in q− h with the information from the
57 neighbouring element boundary point, given in q+ h . Since
58 where δij denotes the Kronecker delta function. The flux of n̂ · F = Nqh , the following holds
59 the system is defined as
60
n̂ · F = RDR−1 qh = RD+ R−1 qh + RD− R−1 qh .
F = [Ax q, Ay q, Az q]. (22) (27)
1
2 By the properties of D+ and D− , this results in the 2016). Thus, in this work, the time step size is given as
3 numerical upwind flux
4 1 1
∆t = CCFL min(∆xl ) , (31)
5 c N2
(n̂ · F)∗ = RD+ R−1 q− − −1 +
h + RD R qh . (28)
6
where ∆xl is the smallest edge length of the mesh elements
7
8 Boundary conditions are weakly enforced through the and CCFL is a constant of O(1).
9 numerical flux terms. For element faces that lie on the
No measures have been taken to automatically address
10 domain boundary, q+ h needs to be defined according to the potential stiffness of the ADEs. Implicit-explicit
11
the boundary conditions. For surface impedance boundary
12 time-stepping (Kennedy and Carpenter 2003) or stiffness
13 conditions, an averaging approach is used to impose
restriction, by means of a constrained multipole mapping
14 Dirichlet conditions, where the neighboring value is given by
process (Wang et al. 2020), could be used to ensure the
15 p+ − + −
h = ph and vh = −vh + 2gv , where n̂ · gv = vn , with
16 ADEs do not cause a numerical instability. However, in
vn given by either (3) or (8). Thus, at the domain boundary
17 all numerical experiments, the spatial scheme is found to
18 nodes, the flux term in (14d) becomes n̂ · (vhk − v∗ ) = n̂ ·
determine the condition on the stable time step size.
19 vh− − vn .
20
21
Fo
22 3.3 Meshing and curvilinear geometry

23 3.2 Temporal discretization and stability
24 Meshing of the computational domain Ωh is done using the
rP
25 The semi-discrete system in (14) can be expressed as a open-source meshing tool gmsh (Geuzaine and Remacle
26
general form ODE 2009). When affine (straight-sided) elements are used,
27
28 the mesh generator is used to generate a low-order
ee
dqh
29 = L(qh (t), t), (29) finite element mesh, and the simulator then adds the α-
dt
30
optimized nodes to define the high-order mesh elements, in
31 where L is the spatial operator. This ODE system is
rR
32 accordance with the polynomial basis order used. To also

integrated in time using a low-storage 4th order explicit
33 support handling complex meshes with curvilinear features,
34 Runge-Kutta time-stepping method,
curvilinear elements may be introduced. Using curvilinear
35
ev
(0)
qh = qnh , elements, the internal high-order mesh nodes are shifted to
36
37
 better fit the geometry (Pind et al. 2019). For room acoustic
k(i) = a k(i−1) + ∆tL(q(i−1) , tn + c ∆t),
38 i h i simulations, this is a particularly important feature, since
iew
i ∈ [1, . . . , 5] :
39 q(i) = q(i−1) + b k(i) ,
h h i most real-world rooms contain curved or complex shaped
40 (5)
q n+1
=q , boundary surfaces.
41 h h
42 (30) Currently, gmsh has the ability to generate a high-
43 order curvilinear mesh, however, only with equidistant node
44 where ∆t = tn+1 − tn is the time step size and the distributions. This is sub-optimal, because of the rapid
45
coefficients ai , bi , ci are given in Ref. (Carpenter and growth of the Lebesque constant associated with equidistant
46
47 Kennedy 1994). node distributions (Hesthaven 1998). Our solution here is
48 Explicit time-stepping methods impose a conditional to apply an interpolation post-processing step on the high-
49
stability criterion, ∆t ≤ C1 /(max |λe |), where λe are the order mesh generated by gmsh, where the equidistant nodes
50
51 eigenvalues of the spatial discretization and C1 is a constant are moved to the α-optimized curvilinear node positions.
52 related to the size of the stability region of the time-stepping This can be done by re-interpolating the node position by
53 method (Pind et al. 2019). There are two mechanisms at representing the coordinates as polynomial series, cf. (11).
54
play that influence the maximum allowed time step: 1) Then, let r and re be the r, s and t coordinates in the
55
56 the eigenvalue spectrum’s of the matrix operators, and 2) reference element for α-optimized and equidistant node
57 the stiffness of the ADEs. In DGFEM, the matrix operator distributions, respectively. Using a hierarchical modal basis,
58 eigenvalue spectrum scales as max |λe | ∼ C2 cN 2γ , where the coefficients r̂ of this representation can be determined
59
the C2 constant is dictated by the size of the smallest mesh from
60
element and γ is the highest order of differentiation in
e
the governing equations (γ = 1, here) (Engsig-Karup et al. r = Vr̂, re = V r r̂, (32)
Melander et al. 7
1 e
2 where (V r )ij ≡ φj (rei ), using same basis used to define V
3 but defined in terms of the different set of equidistant nodes.
4 This implies the relationship between the node distributions
5
6 e
7 r = V(V r )−1 re (33)

8
9 from which an interpolation matrix I : re 7→ r is defined as
10 Figure 1. Example of a three CPU core setup. The domain is
11 e
I = V(V r )−1 . (34) split into three chunks that are stored in separate CPU cores
12 memory.
13
Using this operator offers a way to go from the
14
15 gmsh generated curvilinear mesh with equidistant node When executing the code on GPUs, the code is structured
16 distribution to a curvilinear mesh with α-optimized node such that each CPU core is assumed to have access to one
17 distribution. GPU, e.g., only CPU0 can communicate directly with GPU0 .
18
19 This implies that the interface communications between CPU
20 cores become more costly, as compared to only using CPUs,
21 because CPU0 must copy all interface information from
Fo
22
GPU0 into RAM on CPU0 , before it can communicate this
23
24 information to CPUx through MPI, where 0 < x ≤ M − 1
rP
25 denotes an arbitrary CPU core out of the total of M CPUs.

26 CPUx has to copy the received interface information to
27
its connected GPU, GPUx . In other words, using GPUs
28
ee
29 introduces an additional communication step, and therefore

30 4 HPC implementation additional communication overhead, between the CPU and
31 GPU in each time-stepping stage. This implies that we expect
rR
32
a penalty in the scaling of computational performance when
33 The proposed HPC room acoustics simulator is based on
34 adding additional CPU-GPU pairs to run the simulation, as
35 the open source libParanumal framework (Karakus et
ev
compared to only using CPUs. However, using the GPUs

36 al. 2019; wirydowicz et al. 2019), which is a set of highly is very attractive because of the very large number of
37 optimized finite element flow solvers for heterogeneous cores available and high on-chip bandwidth, so much faster
38
iew
(GPU/CPU) systems. A Message Passing Interface (MPI) simulation is expected when using GPUs compared to CPUs.
39
is used to handle communication between CPUs. Note,
40 The communication structure can be seen in Figure 2, where
41 that when we use the term ‘CPU’, it refers to one single the red arrows are the CPU and GPU communications and
42 CPU core, while ‘GPU’ refers to the entire GPU. MPI the blue arrows are the MPI communications.
43 works by treating each CPU core as its own entity, with its
44
own separate memory. Information can be shared between
45
46 CPU cores through the messaging interface. In practice
47 this means that libParanumal splits the domain into
48 different chunks. This is illustrated in Fig. 1. Each CPU
49
core only stores the simulation information needed for the
50
51 computations in elements belonging to its own part of the
52 domain. However, as Fig. 1 illustrates, the CPUs share an
53 interface between them, where information has to be shared
54
for the acoustic wave to propagate from one CPU to the
55
56 next. This means that for some elements, the numerical
57 flux has to be computed based on neighbouring elements
58 that are located on a different CPU, requiring that MPI
59 Figure 2. Communication overview of M total heterogeneous
communication between cores must take place. This is called
60 CPU-GPU pairs. All CPUs can communicate through MPI,
a halo exchange. The same communication is used when GPUs can only communicate through their attached CPU by
GPUs are used. proxy.
1
2 The implementation of one stage of the time-stepping • GPU : 4 × NVIDIA Tesla V100 SXM2 32 GB,
3 method in (30), including all the interface communication,
4 is summarized in Algorithm 1. The GPU functions, called with the three nodes being connected by InfiniBand. The
5
kernels, must be launched in every iteration of the time compilers used for compilation are:
6
7 stepping algorithm.
8 • GCC v7.4.0,
9 Algorithm 1: One stage of the LSERK time-stepping • OpenMPI v3.1.3,
10 method • OCCA v1.0,
11 (i−1)
Input: Previous stage qh
12 • CUDA compilation tools v10.0.130.
(i)
Result: An updated stage qh
13
1 GPU to CPU memory copy of interface nodes
14 For the scalability tests, we will use the setups of 1, 2, 4, 8,
(CPU/GPU). Simultaneously initiate volume integral
15 and 12 GPUs. Note that when running tests with 4 or less
compute kernel (GPU).
16
2 Initiate the interface MPI communication (CPU). GPUs, they will all be on the same node. This means that
17
3 CPU to GPU memory copy of interface nodes
18 when going from 4 to 8 GPUs there will be an additional
(CPU/GPU).
19 cost of transfer, as internal communication is not as fast as a
4 Initiate surface integral compute kernel (GPU).
20 memory copy within a node.
5 Update based on current stage (GPU).
21
Fo
22 All simulations are performed using double-precision

23 On most modern graphics cards the memory movement floating point numbers. If the numerical accuracy of the
24 double is not required, one can change the datatype to floats.
to/from pinned host memory and device memory can be
rP
25 This change will result in a roughly 2× speed increase, and

26 done with an on-board direct memory access (DMA) engine
27 which can fully overlap with compute activity on the GPU a decrease in the memory footprint by about the same factor.
28
ee
die. This means that the device-host communication and the

29
volume integral computation of the right hand side of (14)
30 5.1 Convergence
31 can start simultaneously, as shown in step 1. Ideally, the
rR
32 communication work in steps 1, 2 and 3 is finished before 5.1.1 Cube domain As an initial validation of the
33 the volume kernel finishes. This way, communication costs simulator, a convergence test in a cube domain with
34 (x, y, z) ∈ [0, 1] and rigid boundaries is carried out. The
are completely hidden. This is possible because the volume
35
ev
36 kernel is completely asynchronous with respect to the host. simulation is initiated using a Gaussian pulse pressure initial
37 Step 4, however, cannot start until all communication of steps condition
38
iew
1-3 is finished, and the volume kernel is finished.

(x − xs )2 + (y − ys )2 + (z − zs )2

39 p(x, 0) = exp −
As can be seen from algorithm 1, the CPU-GPU pairs sxyz
40
41 are currently synchronized in each stage, meaning that (35)
42 an uneven workload distribution will cause a performance where (xs , ys , zs ) is the source position and sxyz = 0.3 m2 is
43 penalty. For this reason, it is desirable to have the overall the spatial variance, which governs the frequency content of
44 the pulse. For this case, an analytic solution exists (Sakamoto
communication work of each CPU equally balanced, as
45
46 other CPU-GPU pairs will be locked in step 3 until they 2007). The domain is meshed using structured meshes of
47 have sent/received all of the information needed. Overall, tetrahedral elements. We study both p-convergence, i.e., the
48 the amount of communication required in each time-step is error behavior as a function of the basis function order for
49 considerable for high resolution simulations, which means a fixed mesh, and h-convergence, i.e., the error behavior as
50
51 that it cannot be expected that the simulator demonstrates a function of the mesh element size. The numerical error
52 perfect scaling on multiple GPUs. is defined as e = ||pDGFEM − pTS ||∞ , where pDGFEM is the
53 simulated pressure in the entire field at t = 0.01s and pTS
54 is the true solution. The time step size is computed using a
55 5 Numerical experiments
56 large time step (close to the stability threshold), by having
All numerical experiments are carried out using the HPC
57 CCFL = 1 in (31).
58 system at the Technical University of Denmark (DTU)
Figure 3 shows the resulting p-convergence, tested for
59 Computing Center. Three computing nodes are used, where
three different meshes, having varying resolution. The
60 each node consists of:
figure reveals the expected spectral convergence, i.e., an
• CPU : 2 × Intel Xeon Gold 6142 @ 2.60 GHz, exponentially fast decay of errors, is indeed observed.
Melander et al. 9
1
2 boundaries. A convergence test is carried out, where the
3 domain is meshed either using affine mesh elements or
4 curvilinear mesh elements. The reference solution here is
5
6 computed using an extremely fine mesh of affine elements
7 (410 308 elements) and N = 8 basis function order. The
8 numerical error is then defined as e = ||pDGFEM − pDGFEM
fine ||
9 , where p DGFEM
is the simulated solution in 10 randomly
10
11 chosen receiver points at t = 0.01 and pDGFEMfine is the
12 simulated solution in the same receiver points for the fine
13 Figure 3. p-convergence for the rigid cube test case. mesh reference simulation.
14 Figure 5 shows the resulting p-convergence, tested for
15 Figure 4 shows the h-convergence for different orders of three different meshes of varying resolution, using either
16
the basis. Table 1 lists the resulting convergence rates. The
17 affine elements or curvilinear elements. When using affine
18 table also includes convergence rates for a case where the elements, the high-order convergence is destroyed, as the
19 time step size has been reduced significantly (CCFL = 0.02). convergence rate flattens out for orders higher than N ≈
20 A convergence rate in the range O(hN ) and O(hN +1 ) is 3. However, when using the curvilinear elements, spectral
21
Fo
expected (Hesthaven and Warburton 2008). The results are

22 convergence is observed.
23 mostly in good agreement with the expected rates, with the
24 exception of N = 1 and N = 2. This is explained by the
rP
25 meshes not being fine enough for these low orders to give an
26
accurate representation of their convergence. Furthermore, it
27
28 is seen that reducing the time step size has only marginal
ee
29 influence on the convergence rates, indicating that in this test

30 case, the spatial errors dominate.
31
rR
32
33
34 (a) Affine elements.
35
ev
36
37
38
iew
39
40
41
42
43 Figure 4. h-convergence for the rigid cube test case.
44
45
46 (b) Curvilinear elements.
47 N CCFL = 1 CCFL = 0.02
48 1 -0.778 -0.778 Figure 5. p-convergence for the cylinder geometry using either
affine or curvilinear meshes.
49 2 -3.239 -3.240
50
3 -3.919 -3.920 Further insights are offered in Figure 6, where h-
51
52 4 -4.618 -4.620 convergence is shown for both the affine and the curvilinear
53 5 -5.752 -5.752 case. It is seen that for the affine elements, a convergence rate
54 6 -6.188 -6.188 of around O(h2 ) is achieved consistent with the geometric
55
7 -7.754 -7.747 approximation, whereas a much higher convergence rate is
56
57 8 -8.161 -8.166 observed for the curvilinear elements. However, the error
58 Table 1. Convergence rates for the h-convergence study. plateaus at around 10−7 , which is likely due to aliasing errors
.
59
that stem from the transformation Jacobians and normals no
60
5.1.2 Cylinder domain Consider now a cylinder shaped longer being constant. This results in inexact integration of
domain, of 1 m height and 1 m radius and having rigid variable terms in the inner products used to compute the
1
2 matrix based operators. Anti-aliasing techniques could be layer backed by an air cavity (BC2). The surface admittance
3 employed to reduce the error further (Engsig-Karup et al. of the BCs are estimated using Miki’s model and a transfer
4 2016; Kirby and Karniadakis 2003). However, in practical matrix method (Miki 1990; Allard and Atalla 2009). Vector
5
6 room acoustics simulations, error levels below 10−7 are fitting is used for the multipole mapping (Gustavsen and
7 rarely needed and, furthermore, such anti-aliasing techniques Semlyen 1999). The material is mapped in the frequency
8 increases the computational cost of the scheme. range of 50 to 1500 Hz. An estimate of the relative mapping
9 error is given by
10
11
|Y (f ) − Yfit (f )|
12 = , (36)
13 Y (f )
14
15 where Y indicates the mean across frequency. Table 2
16 summarizes the properties of the considered boundary
17 conditions and the multipole mapping. Note that more poles
18
are needed in the mapping of BC1, because the admittance
19
20 response fluctuates more with frequency for this absorber.
21 (a) Affine elements.
Fo
22
23 ID dmat d0 σmat Npoles Re Im
24 BC1 0.02 0.2 50 000 8 0.0216 0.0177
rP
25
BC2 0.05 0.0 10 000 4 0.0213 0.0071
26
27 Table 2. Material properties and multipole mapping error for the
28 BCs used in the boundary validation study. dmat [m] is the
ee
porous material thickness, d0 [m] is the air cavity depth, σmat

29
[Ns/m4 ] is the flow resistivity of the porous material, Npoles is the
30 number of poles used in the multipole mapping and Re [%] and
31 Im [%] are the relative mapping errors of the real and imaginary
rR
32 parts of the surface admittance, respectively.

(b) Curvilinear elements. .
33
34 Figure 6. h convergence for the cylinder geometry using either
35
ev
affine or curvilinear meshes.

36 Figure 7 shows the resulting Fourier transformed pressure
37
38 responses of the single reflection study. An excellent match
iew
39 5.2 Boundary condition validation between the analytic solution and the simulation is found for
40 both BCs considered, both in terms of magnitude and phase.
41 A validation study of the locally reacting frequency
This confirms the high precision of the BC implementation.
42 dependent boundary conditions is carried out, using a normal
43 incidence single reflection case. For this case, an analytic
44
solution exists (Thomasson 1976). A 6 m × 10 m × 10 m
45
46 rectangular domain is meshed with an unstructured mesh of
47 tetrahedral elements. The frequency dependent impedance 5.3 Heterogeneous multi-device CPU-GPU
48 boundary is the Y Z plane at x = 0. The resolution of the performance
49
mesh is roughly 8 points per wavelength (PPW) at 1 kHz,
50 5.3.1 Strong scaling Strong scaling is characterised by
51 the highest frequency of interest. The source is placed 2
how the runtime changes while keeping the problem size
52 m from the impedance boundary (S : (2, 5, 5) m) and the
53 constant and increasing the number of processing units. Thus
receiver is placed between the source and the boundary
54 the overall workload is constant, while the workload of each
(R : (1, 5, 5) m). The simulation duration is tfinal = 0.02 s,
55 processing unit decreases, as the workload is split among an
56 which ensures that no parasitic reflections from the other
increasing number of processing units.
57 room surfaces arrive at the receiver position. A Gaussian
58 pulse initial condition, centered at the source position, with Four meshes of varying resolutions are constructed,
59
sxyz = 0.2 m2 is used. described in Table 3. These meshes are used to analyse the
60
Two boundary conditions are considered; a porous strong scaling of the simulator for two different polynomial
material mounted on a rigid backing (BC1) and a thin porous orders, N = 4 and N = 6. In all cases, rigid BCs are used.
Melander et al. 11
1
2 #GPUs Mesh 1 Mesh 2 Mesh 3 Mesh 4
3 1 1 1 − −
4
5 2 0.630 0.903 1 1
6 4 0.393 0.656 0.851 0.913
7 8 0.290 0.586 0.704 0.743
8 12 0.159 0.384 0.505 0.573
9
Table 4. Strong scaling computational efficiency ηS for the
10
meshes in Table 3 with N = 4. ”−” denotes that running the
11 (a) Magnitude.
simulation was not possible due to it requiring more memory
12 than the GPUs in the setup have.
13
14
15 As two of the meshes with N = 4 were not able to be
16 run on 1 GPU due to memory limitations, a comparison
17
where using M b = 2 GPUs as a baseline is shown in table
18
19 5. These numbers are more directly comparable between
20 meshes. As expected, good scaling occurs for high GPU
21 (b) Phase. counts, provided a high resolution mesh is used to ensure that
Fo
22
the efficiency does not suffer due to the overhead associated
23 Figure 7. Simulation of a single reflection from a locally
24 reacting frequency dependent impedance boundary compared with communication and synchronization of the cores.
rP
25 against the analytic solution.

26 #GPUs Mesh 1 Mesh 2 Mesh 3 Mesh 4
27 2 1 1 1 1
28
ee
4 0.623 0.727 0.851 0.913

29
30 # of elements DoF N = 4 DoF N = 6 8 0.460 0.649 0.704 0.743
31 Mesh 1 57k 1 995k 4 788k 12 0.252 0.425 0.505 0.574
rR
32 Table 5. Strong scaling computational efficiency ηS for the

Mesh 2 1 266k 44 310k 106 344k
33 meshes in Table 3 with N = 4 and using M b = 2 GPUs as the
34 Mesh 3 3 558k 124 530k 298 872k baseline.
35
ev
Mesh 4 6 583k 230 405k 552 972k

36 Table 3. Meshes and resulting DoF used for strong scaling
37 analysis, where k denotes kilo, i.e. ×1000.
In Figure 8, the speed-up factors with M b = 2 GPUs as a
38 baseline can be seen. Note that when going from 8 GPUs to
iew
39 12 GPUs for mesh 1 and 2, the communication time begins

40
to dominate the runtime. The GPUs do not have a high
41
42 enough workload, resulting in them finishing the volume
43 A measure of the overall strong scaling computational kernel computation of step 1 in Algorithm 1 before the CPUs
44 have finished the communications in step 3.
efficiency of the simulator is defined as
45
46
TRb M b
47 ηS = , (37)
48 TR M
49
50 where TRb denotes the runtime of a baseline case, M b denotes
51 the number of GPUs used for the baseline case, and TR
52 and M denote the same for the comparison case. Thus if
53 ηS = 1, perfect scaling occurs when going from M b to M .
54
55 In practice, perfect scaling is impossible, as there is always
56 some latency introduced when scaling up the number of
57 processors used. Figure 8. Speed-up factors, with the 2 GPU configuration as a
58 baseline.
59
An array of simulations is carried out, using the meshes
60
defined in Table 3, and the runtime is recorded. The resulting We now repeat the same tests, but using a polynomial
strong scaling computational efficiency is shown in Table 4. order N = 6 basis. This increases the number of DoFs
1
2 involved. Because of the increased size of the problem, variables, then we see that at least two million field values
3 meshes 3 and 4 require a minimum of 4 and 8 GPUs, would have to be streamed to and/or from device memory
4 respectively. Thus, only the strong scaling performance for to achieve the target 80% efficiency of the GPU memory
5
6 meshes 1 and 2 is considered. Table 6 shows the strong bus. Through the lens of this simplified model, it is apparent
7 scaling efficiency factor ηS . The scaling efficiency has that the efficiency of a memory bound kernel degrades
8 improved, as compared to the N = 4 case, due to the for small problem sizes which is precisely the regime we
9 increase in DoF. enter when performing a strong scaling study, wherein the
10
11 number of GPUs is increases but the amount of problem
#GPUs Mesh 1 Mesh 2
12 data remains fixed. In practice the kernel launch time can
13 1 1 1
be significantly higher and the achieved throughput can be
14 2 0.813 0.923
significantly lower, however the fundamental limit to strong
15 4 0.525 0.859
scaling remains.
16 8 0.390 0.700
17
18 12 0.235 0.486
19 Table 6. Strong scaling computational efficiency ηS for two of 5.3.2 Weak scaling Weak scaling is characterised by how
20 the meshes in Table 3 with N = 6. the runtime changes when fixing the problem size for each
21 processing unit. Thus the problem size is increased when
Fo
22 We combine the information about mesh 1 and 2 from increasing the number of processing units, to keep the overall
23
Table 4 and 6 in Figure 9. Here we clearly see the efficiency workload constant on each processing unit. The weak scaling
24
rP
25 increase when comparing N = 4 to N = 6 for both of the computational efficiency is given by

26 two meshes.
27 TRb
ηW = . (38)
28
ee
TR
29
30 The nature of weak scaling requires us to construct new
31 meshes, these meshes are shown in Table 7. The meshes are
rR
32
structured meshes, with around 55 000 elements per GPU.
33
34
35
ev
DoF DoF DoF

36
# of elements N =4 N =6 N =8
37
Figure 9. Strong efficiency for mesh 1 and 2 for two different 55k 1 925k 4 620k 9 075k
38 Mesh 1
iew
N = 4 and N = 6 polynomial orders.

39 Mesh 2 111k 3 885k 9 324k 18 315k
40 Mesh 3 219k 7 665k 18 396k 36 135k
41 In general, strong scaling is hampered by a number
Mesh 4 439k 15 365k 36 876k 72 435k
42 of challenges, which cause the loss in efficiency. Latency
43 Mesh 5 661k 23 135k 55 524k 109 065k
accumulates for copying kernel arguments from host to
44 Table 7. Meshes and resulting DoF used for weak scaling
device, for launching a kernel, for copying halo data between
45 analysis, where k denotes kilo, i.e. ×1000.
46 and device, and MPI message passing latency in the halo
47 exchange. Launching each kernel incurs a time overhead
48 that is fixed relative to the problem size. On a NVIDIA The weak scaling efficiency results are shown in Figure
49
V100 the latency can be as low as 5 µs but as high as 10. The scaling improves with increasing basis function
50
51 20 µs if kernels are not adequately pipelined. We apply an order. In particular, good scaling performance is achieved
52 Amdahl type scaling model adapted for a memory bound for N = 8. This behavior again shows that the scaling is
53 kernel that streams data between the GPU cores and the GPU very dependent on the overall resolution of the mesh, and
54
device memory. To achieve 80% efficiency in streaming data, that the GPUs will be bottle-necked by the communication
55
56 the kernel must stream at least B = 4R∞ T0 bytes of data, for low resolutions. Having a higher basis function order
57 where R∞ is the maximum streaming rate of the kernel and corresponds to each GPU having a higher workload, which
58 T0 is the time it takes to launch the kernel. If the kernel means that the communication time is no longer dominating
59
is to achieve 80% of peak throughput it would need to the runtime. Increasing the resolution further beyond 9
60
stream approximately B = 4 × 900 GB/s × 5µs ≈ 18 MB. million DoF per GPU would steadily move the weak scaling
Assuming that the kernel only streams double precision efficiency closer to 1.
Melander et al. 13
1
2 Mesh 1 Mesh 2
3 #GPUs Rigid BC1 BC2 Rigid BC1 BC2
4
5 1 1 1 1 1 1 1
6 2 0.868 0.844 0.842 0.926 0.924 0.890
7 4 0.569 0.554 0.551 0.880 0.872 0.846
8 8 0.419 0.413 0.413 0.714 0.708 0.687
9
Table 8. Strong scaling computational efficiency ηS for different
10
boundary conditions.
11
12 Figure 10. Weak scaling computational efficiency for the
13 meshes in Table 7.
14
15 Mesh 1 Mesh 2
16 #GPUs BC1 BC2 BC1 BC2
17 The results for both the strong and weak scaling show
1 +0.43% −0.09% +0.86% + − 2.67%
18 that the number of GPUs one should use depends heavily
19 on the problem size. Adding more compute power for 2 +3.25% +2.89% +1.26% +0.90%
20 an application can greatly decrease the overall runtime 4 +3.01% +3.01% +1.87% +0.86%
21 8 +1.44% +1.06% +1.87% +0.91%
Fo
22 until some threshold is reached, where the communication

Table 9. Relative change in runtime compared to the rigid case.
23 overhead begins dominating the runtime. Importantly, this
24 implies that if the number of GPUs used for solving a
rP
25 problem is selected in such a way that each GPU has a

26
27 high workload, an excellent scaling can be achieved to large 5.4 Dispersion and runtime as a function of
28 numbers of GPUs, thus allowing the efficient simulation of simulation frequency range
ee
29 very large problems.

30 In practice, it is desirable to simulate a large portion of the
31 audible frequency spectrum with the wave-based simulator.
rR
32
It is therefore of interest to analyze how the runtime of
33 5.3.3 Influence of BCs on performance By having non-
34 the simulator grows as the valid frequency range of the
rigid BCs, an additional computation must take place at
35
ev
simulation is increased. The valid frequency range of the

the domain boundary nodes, particularly for the case of
36 simulation depends on the spatio-temporal resolution – a
37 frequency dependent BCs, where the ODE system in (9)
high enough resolution must be chosen such that numerical
38 must be solved at the boundary. An experiment is carried
iew
39 errors are below a certain threshold. A relevant threshold

out, where the influence of the non-rigid BCs on runtime
40 in room acoustic simulations is the audibility threshold for
and on strong scaling is investigated. For this experiment,
41 dispersion errors, which is generally taken to be around 2%
42 we consider meshes 1 and 2 in Table 3 and use N = 6. These
(Saarelma et al. 2016).
43 are cube shaped domains (all bounding surfaces are equally
44 large). We set the BCs such that two surfaces are frequency Using high-order basis functions will result in improved
45 accuracy of the simulation for a given mesh resolution, which
dependent impedance boundaries, with the BCs defined in
46
Table 2, and the remaining four surfaces have frequency in turn means that fewer PPW can be used to achieve a certain
47
48 independent absorption with Z = 10 000 Ns/m4 . This is level of accuracy (Pind et al. 2019; Kreiss and Oliger 1972).
49 meant to represent a more realistic simulation scenario. A 1D dispersion analysis of a single wave mode propagation
50 is carried out to roughly assess the PPW needed for different
51 The strong scaling results are shown in Table 8 and the
basis function orders to main numerical dispersion error
52 relative change in runtime when having non-rigid BCs, as
53 below the audibility threshold. The 1D advection equation
compared to the rigid case, is shown in Table 9. The runtime
54 is
increases by an average of only 1.3%. These results show
55 ut + cux = 0, (39)
56 that having realistic boundary conditions only has a marginal
57 influence on the performance. Thus the scaling results shown which has solutions on the form u(x, t) = f (kx − ωt) =
58 in Sections 5.3.1 and 5.3.2. will also hold for cases with f ((ω/c)x − ωt), where f (s) is the initial condition. By
59
realistic BCs. The runtime is more affected for the smaller assuming a solution ansatz f (s) = ejs , the exact solution
60
mesh – this is expected since the boundaries are a larger after n time steps will have a phase shift corresponding to
proportion of this mesh. e−ωn∆t . Thus, the numerical solution at time step n, un , and
1
2 the initial condition u0 , are related by to ensure valid simulation results up to a certain frequency
3 fmax . This allows us to analyse how the runtime changes
4 u0 = uN e−ω̂N ∆t , (40) as the frequency range of the simulation is extended,
5
6 when using different basis function orders. Two rooms are
where ω̂ is the numerical frequency, which will differ from
7 considered, a 4 m × 2.7 m × 3 m ‘bedroom’ and a 9 m × 7 m
8 the exact frequency ω due to the dispersion of the numerical × 3m ‘classroom’. In both cases, frequency independent BCs
9 scheme. This non-linear equation can be solved numerically are used and a simulation time (impulse response length) of
10 for ω̂, and in this work a Levenberg-Marquardt algorithm
11 1 s is simulated. The rooms are simulated using 4 GPUs.
is used. By comparing the numerical frequency against the
12
13 exact one, the dispersion relationship can be established
14 since cd /c = ω̂/ω, where cd is the numerical wave speed.
Figure 12 shows the measured runtime for the
15 Figure 11 shows an example dispersion relationship for the
16 two rooms. The runtime is measured for fmax =
N = 2, 4, 6 basis function orders. In this dispersion analysis,
17 250, 500, 1000, 2000, 4000 Hz. The runtime increases
18 the time step is kept very small, ∆t = 10−5 s and wave speed
quickly with frequency in all cases, after a certain frequency
19 c = 1 m/s, to ensure that the spatial errors dominate. Thus,
is reached. This is due to the inherent nature of having to
20 the analysis only takes into account the spatial discretization
21 mesh the three dimensional volume of the room. However,
Fo
errors. The dispersion relation is measured after n = 2 time

22 the runtime starts to increase rapidly at a much higher
23 steps. In practice, the dispersion relation will depend on both
frequency for the higher order cases. For N = 2, it was not
24 spatial and temporal errors, and will, furthermore, vary with
possible to go beyond fmax = 1000 Hz, because the runtime
rP
25 the simulation duration, where the benefits of using high-

26 was longer than 24 hours, which we set as the maximum
order basis functions versus low-order will increase with
27 allowed runtime for the tests. A function on the form
28 increasing number of time steps (Kreiss and Oliger 1972). 4
ee
r = afmax is fitted to the measured runtimes, where r is the

29 In 3D simulations, the dispersion will also be direction
estimated runtime. This is illustrated with the dashed lines
30 dependent. Taken together, the dispersion properties are
31 in the figure. The resulting fitted a coefficients are shown
rR
very case dependent. It is thus emphasized that this is only

32 in Table 11. As can be seen, a is considerably lower when
33 intended as a rough estimate of the PPW needed to maintain
using the higher order basis functions. The R2 measure of
34 the desirable error levels, when using different basis function
the model fit quality is also presented, in all cases a fairly
35
ev
orders.
36 good fit is observed.
37
38
iew
39
40
41
42
43
44
45
46
Figure 11. 1D dispersion analysis results when using
47
∆t = 0.0001, c = 1 m/s and n = 2 time steps.
48
49
50 (a) Bedroom.
51 N =2 N =4 N =6
52 Points per wavelength 17.7 7.3 6.0
53
54 Table 10. Points-per-wavelength needed to ensure spatial
dispersion errors remain under 2%, based on the 1D dispersion
55 analysis of figure 11.
56
57
58 Table 10 lists the PPW needed to ensure that spatial (b) Classroom.
59
dispersion errors remain under 2% for the dispersion analysis
60 Figure 12. Measured runtime as a function of simulation
example case considered. Using these PPW values as frequency range fmax . The dashed lines are the fitted runtime
4
reference, we can construct meshes with the right resolution model on the form r = afmax .
Melander et al. 15
1
2 Bedroom Classroom
3 N a R2 a R2
4
5 2 1.04 · 10−8 0.92 2.74 · 10−8 0.96
6 4 2.11 · 10−10 0.86 8.36 · 10−10 0.91
7 6 1.06 · 10−10 0.85 3.28 · 10−10 0.88
8 4
Table 11. Coefficients a for the runtime model r = afmax and
9 associated R2 values for evaluating the goodness of the fit.
10
11
12 5.5 Large shoebox-shaped rooms
13
14 To give additional insights into the performance of the
15 simulator for challenging practical cases, we consider two
Figure 13. Geometric model of auditorium B341-A21 at DTU.
16 very large shoebox-shaped rooms. The first is 5 000 m3 ,
17
18 which is meant to imitate a large auditorium, and the other
is 20 000 m3 , which is meant to imitate a large concert hall. GPUs results in slightly worse performance compared to
19
20 These rooms are meshed using a structured mesh, N = 4 the case of using 8 GPUs, indicating that communication
21 costs between cores have begun to dominate. For the concert
Fo
basis function order and the spatial resolution is 6 PPW at

22 hall, the simulation cannot be run on 1 or 2 GPUs due to
1000 Hz. In practice, a common rule of thumb for second
23
memory limitations. For the higher GPU counts, the runtime
24 order schemes is to use 6 PPW at the highest frequency of
rP
25 interest. Thus, it is reasonable to assume that when using drops considerably as the number of GPUs is increased,
26 N = 4 basis order, the accuracy is likely acceptable at and likely would drop further if more than 12 GPUs were
27 used. We have thus shown that for very large rooms, we
frequencies beyond 1000 Hz. For reference, the Schroeder
28
ee
can simulate up to 30 times the Schroeder frequency within

29 frequency of these rooms is 28 Hz and 20 Hz, for the
30 auditorium and the concert hall, respectively, assuming the practical timeframes. However, one has to keep in mind that
31 auditorium has a reverberation time of 1 s and the concert this simulation was run on structured meshes in a rectangular
rR
32 room with rigid BCs. A more complex geometry, which has

hall a reverberation time of 2 s. The room and mesh details
33
to be meshed as an unstructured mesh, could lead to smaller
34 are summarized in Table 12.
35 time-steps and therefore longer runtime.
ev
36 Room Auditorium Concert hall

37 Dimensions 10m × 10m × 50m 10m × 40m × 50m
38 5.6 Complex geometry cases
iew
Volume 5 000 m3 20 000 m3

39
# of elements 1 920k 7 680k 5.6.1 Auditorum B341-A21 at DTU As an example of
40
41 Schröder Freq. 28 Hz 20 Hz using the simulator on realistic 3D geometries, consider
42 Table 12. Details regarding the rooms and the meshes used. the room in Figure 13, which shows a simplified model
43 of auditorium B341-A21 at the Technical University of
44
45 These rooms are simulated for a time of 2 s. Boundary Denmark (DTU). As can be seen, the model contains a desk
46 conditions are rigid. Table 13 lists the resulting runtimes, at the bottom and several benches on the slope. Boundary
47 when using a different number of GPUs to solve the problem. conditions have been chosen to represent a realistic scenario,
48 where the floor, table and benches are set to be rigid,
49
#GPUs Auditorium Concert hall the walls have frequency independent impedance boundary
50
51 1 6.52 − conditions with Z = 15 630 Ns/m4 , which corresponds to
52 2 3.35 − approximately 10% absorption at normal incidence, and the
53 ceiling uses frequency dependent boundary conditions. The
4 1.82 7.23
54
8 1.02 4.41 ceiling is modeled as a suspended porous material, with
55
56 12 1.09 3.23 thickness dmat = 0.02 m, and air cavity depth of d0 = 0.2
57 Table 13. Runtimes in hours for the large shoebox-shaped m and having flow resistivity σmat = 50 000 Ns/m4 . The
58 rooms. initial condition is a Gaussian pulse with width sxyz = 0.3
59
m2 , placed at S = (4.0, 5.0, 1.8) m, to mimic the position
60
For the auditorium case, the runtime improves signifi- of a lecturer. Figure 14 presents a series of snapshots of
cantly when increasing from 1 GPU to 8 GPUs. Using 12 the solution in the XZ-plane at y = 5 m, showing the
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17 Figure 15. Geometric model of dome shaped room.
18
19
20
21
Fo
22
23
24
rP
25 Figure 14. Snapshots of the acoustic wave propagating inside

26 the auditorium.
27
28
ee
29 propagation of the acoustic wave within the domain. Note

30 that the diffraction patterns from the benches and the desk
31
rR
are being accurately captured.

32
33
5.6.2 Star-shaped room with a dome As a last test case,
34
35 consider the geometry shown in Figure 15, which consists
ev
36 of three overlapping rectangles and a dome shaped roof

37 in the center. This geometry is meshed with curvilinear
38
iew
elements to accurately capture the curved shape of the

39
40 dome. The geometry is similar to what is used in a
41 work by Bilbao (Bilbao 2013), however, in that work, the
42 geometry is meshed with affine elements and simulated
43
with a second order accurate method. The floor is made
44
45 to be highly absorptive, having a normal incidence surface
46 impedance of Z = ρc = 411.6 Ns/m4 , while other surfaces
47 are perfectly reflecting. Figure 16 shows snapshots of the
48
wave propagation in the room, both in a section view and a
49
50 plan view. Note the acoustic focusing effect due to the dome.
51 The two complex geometry cases presented in this section
52 and the previous one show that the simulator is capable of
53 Figure 16. Snapshots of the acoustic wave propagating inside
using the geometrical flexibility of the DGFEM method to
54 the star-shaped room.
55 simulate realistic large-scale geometries.
56
57 and complex problems, including curvilinear geometries
58 6 Conclusions and frequency dependent boundaries, under short runtimes.
59
A massively parallel nodal DGFEM-based room acoustic This makes the simulator suited for use on problems
60
simulator has been presented and its performance analysed. of industrial size and complexity. The work presented
It has been shown that the simulator can handle large here extends the usability of wave-based room acoustic
Melander et al. 17
1
2 simulations far beyond the restricted use-case of small rooms Filter for Room Acoustics Applications, IEEE/ACM Trans.
3 and frequencies below the Schroeder frequency, which wave- Audio, Speech, Lang. Proc. vol. 23.8: pp. 1368-1380.
4 based methods are historically typically associated with. Lopez JJ, Carnicero D, Ferrando N and Escolano J (2013)
5
6 The heterogeneous multi-device scalability benchmarks Parallelization of the finite-difference time-domain method for
7 show that the simulator can handle problems of nearly room acoustics modelling based on CUDA, Math. Comput.
8 arbitrary size, as long as the number of GPUs is selected Mod. vol. 57.7: pp. 1822-1831.
9
to ensure that the workload on each GPU is sufficient, Ainsworth M (2004) Dispersive and dissipative behaviour of
10
11 to avoid costs, related to communication between CPU high order discontinuous Galerkin finite element methods, J.
12 cores starting to dominate the runtime. While the new Comput. Phys. vol. 198.1: pp. 106-130.
13 model has been designed for massive parallelism and Hu F and Atkins H (2002) Two-dimensional wave analysis of
14
scalability utilizing heterogeneous multi-device CPU-GPU the discontinuous Galerkin method with non-uniform grids
15
16 many-core hardware, there are still opportunities for further and boundary conditions, 8th AIAA/CEAS Aeroacoustics
17 improvements of the performance. It is also shown that Conference and Exhibit.
18 using mesh-adaptive high-order numerical methods is highly Eleuterio FT (1999) Riemann Solvers and Numerical Methods for
19 beneficial for room acoustic simulations, due to the improved
20 Fluid Dynamics: A Practical Introduction, Springer-Verlag, 2.
21 physical description such wave-based room acoustic model edition.
Fo
22 offer. Hornikx M, Krijnen T and van Harten L (2016) openPSTD: The

23
open source pseudospectral time-domain method for acoustic
24
rP
25 Acknowledgements propagation, Comput. Phys. Comm. vol 203: pp. 298-308.

26 Shu C-W (1998) Essentially non-oscillatory and weighted essen-
The DTU Computing Center (DCC) supported the work with access
27 tially non-oscillatory schemes for hyperbolic conservation
28 to state-of-the-art computing resources. The authors wish to thank
ee
laws, Advanced Numerical Approximation of Nonlinear

29 Mr. Hermes Sampedro Llopis and Mr. Nikolas Borrel-Jensen for
30 fruitful discussions regarding this work.
Hyperbolic Equations, Springer: pp 325-432.
31 Shu C-W (2003) High-order Finite Difference and Finite Volume
rR
32 WENO Schemes and Discontinuous Galerkin Methods for

33
References CFD, Int. J. Comp. Fluid Dynamics. vol 17.2: pp. 107-118.
34
35
ev
Sakamoto S (2007) Phase-error analysis of high-order finite

Krokstad A, Strom S and Soersdal S (1968) Calculating the
36 difference time domain scheme and its influence on calculation
acoustical room response by the use of a ray tracing technique,
37 results of impulse response in closed sound field, Acoust.
38 J. Sound Vib. vol. 8.1: pp. 118-125.
iew
39 Sci.Tech. vol 28.5: pp. 295-309.

Allen JM and Berkley DA (1979) Image method for efficiently
40 simulating small-room acoustics, J. Acoust. Soc. Am. vol. 65:
Karakus A, Chalmers N, wirydowicz K and Warburton T (2019) A
41 GPU accelerated discontinuous Galerkin incompressible flow
42 pp. 943-950.
solver, J. Comput. Phys. vol 390: pp. 280-404.
43 Savioja L (2010) Real-time 3D finite-difference time-domain
44 simulation of low-and mid-frequency room acoustics, 13th
wirydowicz K, Chalmers N, Karakus A and Warburton T (2019)
45 Acceleration of tensor-product operations for high-order finite
International Conference on Digital Audio Effects.
46 element methods, Int. J. High Perf. Comput. Appl. vol 33.4: pp.
47 Savioja L, Manocha D and Lin MC (2010) Real-time 3D finite-
725-757.
48 difference time-domain simulation of low-and mid-frequency
49 room acoustics, Proceedings of the International Symposium
Pind F, Engsig-Karup AP, Jeong C-H, Hesthaven JS, Mejling MS
50 and Strømann-Andersen J (2019) Time domain room acoustic
on Room Acoustics.
51 simulations using the spectral element method, J. Acoust. Soc.
52 Takahashi T and Hamada T (2009) GPU-accelerated boundary
Am. vol 145.6: pp. 3299-3310.
53 element method for Helmholtz’ equation in three dimensions,
54 Int. J. Num. Methods Eng. vol. 80.10: pp. 1295-1321.
Wang H, Sihar I, Pagan Munoz R and Hornikx M (2019)
55 Room acoustics modelling in the time-domain with the nodal
56 Morales N, Mehra R and Manocha D (2015) A parallel time-
discontinuous Galerkin method, J. Acoust. Soc. Am. vol 145.4:
57 domain wave simulator based on rectangular decomposition
58 pp. 2650-2663.
for distributed memory architectures, Applied Acoustics. vol.
59 97.10: pp. 104-114.
Saarelma J, Botts J, Hamilton B and Savioja L (2016) Audibility of
60 dispersion error in room acoustic finite-difference time-domain
Spa C, Rey A and Hernandez E (2015) A GPU Implementation of an
simulation as a function of simulation distance, J. Acoust. Soc.
Explicit Compact FDTD Algorithm with a Digital Impedance
1
2 Am. vol 139.4: pp. 1822-1832. Naylor GM (1993) ODEON—Another hybrid room acoustical
3 Geuzaine C and Remacle J-F (2009) Gmsh: a three-dimensional model, Appl. Acoust. vol 38.2-4: pp. 131-143.
4
finite element mesh generator with built-in pre- and post- Kennedy CA and Carpenter MH (2003) Additive Runge-Kutta
5
6 processing facilities, Int. J. Num. Methods Eng. vol 79.11: pp. schemes for convection-diffusion-reaction equations, AAppl.
7 1309-1331. Num. Math. vol 44.1: pp. 139-181.
8 Hesthaven J (1998) From Electrostatics to Almost Optimal Nodal Savioja L and Svensson UP (2015) Overview of geometrical room
9
Sets for Polynomial Interpolation in a Simplex, SIAM J. Num. acoustic modeling techniques, J. Acoust. Soc. Am. vol 138.2:
10
11 Analysis. vol 35.2: pp. 655-676. pp. 708-730.
12 Engsig-Karup AP, Eskilsson C and Bigoni D (2016) A stabilised Gustavsen B and Semlyen A (1999) Rational approximation of
13 nodal spectral element method for fully nonlinear water waves, frequency domain responses by vector fitting, IEEE Trans. Pow.
14
J. Comp. Physics. vol 318: pp. 1-21. Del. vol 14.3: pp. 1052-1061.
15
16 Kirby RM and Karniadakis GE (2003) De-aliasing on non-uniform Gustavsen B (2008) Fast Passivity Enforcement for Pole-Residue
17 grids: algorithms and applications, J. Comput. Physics. vol Models by Perturbation of Residue Matrix Eigenvalues, IEEE
18 191.1: pp. 249-264. Trans. Pow. Del. vol 23.4: pp. 2278-2285.
19
20 Okuzono T, Shimizu N and Sakagami K (2019) Predicting Carpenter MH and Kennedy C (1994) Fourth-order 2N-storage
21 absorption characteristics of single-leaf permeable membrane Runge-Kutta schemes, NASA Report TM 109112. NASA
Fo
22 absorbers using finite element method in a time domain, Appl. Langley Research Center.
23
Acoust. vol 151: pp. 172-182. Wang H, Yang J and Hornikx M (2020) Frequency-dependent
24
rP
25 Biot MA (1956) The theory of propagation of elastic waves in a transmission boundary condition in the acoustic time-domain
26 fluid-saturated porous solid. I. Low frequency range. II. Higher nodal discontinuous Galerkin model, Appl. Acoust. vol. 164:
27 frequency range, J. Acoust. Soc. Am. vol 128: pp. 168-191. pp. 107280.
28
ee
Panneton R (2007) Comments on the limp frame equivalent fluid Dumbser M, Kaser M and Toro EF (2007) An arbitrary high-
29
30 model for porous media, J. Acoust. Soc. Am. vol 122.6: pp. order Discontinuous Galerkin method for elastic waves on
31 217-222. unstructured meshes - V. Local time stepping and p-adaptivity,
rR
32 Geophys. J. Int. vol. 171.2: pp. 695-717.

Brouard B, Lafarge D and Allard JF (1995) Comments on the limp
33
34 frame equivalent fluid model for porous media, J. Sound Vib. Lafarge D, Lemarinier P, Allard JF and Tarnow V (1997)
35
ev
vol 183: pp. 129-142. Dynamic compressibility of air in porous structures at audible
36 de Bree H, Lanoye R, de Cock S and van Heck J (2005) In frequencies, J. Acoust. Soc. Am. vol. 102.4: pp. 1995-2006.
37
situ, broad band method to determine the normal and oblique Jeong C-H and Brunskog J (2013) The equivalent incidence angle
38
iew
39 reflection coefficient of acoustic materials, Proc. Noise. Vib. for porous absorbers backed by a hard surface, J. Acoust. Soc.
40 Conf. Am. vol. 134.6: pp. 4590-4598.
41 Ducourneau J, Planeau V, Chatillon J and Nejade A (2009) Jeong C-H (2012) Absorption and impedance boundary conditions
42
Measurement of sound absorption coefficients of flat surfaces for phased geometrical-acoustics methods, J. Acoust. Soc. Am.
43
44 in a workshop, Appl. Acoust. vol 70: pp. 710-721. vol. 132.4: pp. 2347-2358.
45 Tamura M (1990) Spatial Fourier transform method of measuring Ingard U (1951) On the Reflection of a Spherical Sound Wave from
46 reflection coefficients at oblique incidence. I. Theory and an Infinite Plane, J. Acoust. Soc. Am. vol. 23.3: pp. 329-335.
47
numerical examples, J. Acoust. Soc. Am. vol 88: pp. 2259- Hesthaven JS and Teng CH (2000) Stable Spectral Methods on
48
49 2264. Tetrahedral Elements, SIAM J. Sci. Comput. vol. 21.6: pp.
50 Hesthaven JS and Warburton T (2008) Nodal Discontinuous 2352-1270.
51
Galerkin Methods – Algorithms, Analysis, and Applications, Thomasson S-I (1976) Reflection of waves from a point source by an
52
53 Springer, New York. impedance boundary, J. Acoust. Soc. Am. vol. 59.4: pp. 780-
54 Bliss DB and Burke SE (1982) Experimental investigation of the 785.
55 bulk reaction boundary condition, J. Acoust. Soc. Am. vol Tomiku R, Otsuru T, Okamoto N, Okuzono T and Shibata T (2011)
56
71.3: pp. 546-551. Finite element sound field analysis in a reverberation room
57
58 Grote MJ, Kray M, Nataf F and Assous F (2017) Time-dependent using ensemble averaged surface normal impedance, Proc.
59 wave splitting and source separation, J. Comput. Phys. vol 330: 40th Int. Cong. Expo. Noise Cont. Eng.
60 pp. 981-996. Hargreaves JA, Rendell LR and Lam YW (2019) A framework
for auralization of boundary element method simulations
Melander et al. 19
1
2 including source and receiver directivity, J. Acoust. Soc. Am. bounding surfaces, J. Sound Vib. vol. 309.1: pp. 167-177.
3 vol. 145.4: pp. 2625-2637. Yousefzadeh B and Hodgson M (2012) Energy- and wave-based
4
Harari I and Hughes TJR (1992) A cost comparison of boundary beam-tracing prediction of room-acoustical parameters using
5
6 element and finite element methods for problems of time- different boundary conditions, J. Acoust. Soc. Am. vol. 132.3:
7 harmonic acoustics, Comput. Methods Appl. Mech. Eng. vol. pp. 1450-1461.
8 97.1: pp. 77-102. Wareing A and Hodgson M (2005) Beam-tracing model for
9
Botteldooren D (1995) Finite-difference time-domain simulation of predicting sound fields in rooms with multilayer bounding
10
11 low-frequency room acoustic problems, J. Acoust. Soc. Am. surfaces, J. Acoust. Soc. Am. vol. 118.4: pp. 2321-2331.
12 vol. 98.6: pp. 3302-3308. Gunnarsdóttir K, Jeong C-H and Marbjerg G (2015) Acoustic
13 Craggs A (1995) A finite element method for free vibration of air behavior of porous ceiling absorbers based on local and
14
in ducts and rooms with absorbing walls, J. Sound Vib. vol. extended reaction, J. Acoust. Soc. Am. vol. 137.1: pp. 509-512.
15
16 173.4: pp. 568-576. Jeong C-H, Marbjerg G and Brunskog J (2016) Uncertainty of
17 Bilbao S (2013) Modeling of Complex Geometries and Boundary input data for room acoustic simulations, Proc. Baltic-Nordic
18 Conditions in Finite Difference/Finite Volume Time Domain Acoust. Meet.
19
20 Room Acoustics Simulation, IEEE Trans. Audio, Speech, Lang. Wang H and Hornikx M (2020) Time-domain impedance boundary
21 Proc. vol. 21.7: pp. 1524-1533. condition modeling with the discontinuous Galerkin method for
Fo
22 Bilbao S, Hamilton B, Botts J and Savioja L (2016) Finite Volume room acoustics simulations, J. Acoust. Soc. Am. vol. 147.4: pp.
23
Time Domain Room Acoustics Simulation under General 2534-2546.
24
rP
25 Impedance Boundary Conditions, IEEE Trans. Audio, Speech, Suh JS and Nelson PA (1999) Measurement of transient response
26 Lang. Proc. vol. 24.1: pp. 161-173. of rooms and comparison with geometrical acoustic models, J.
27 Lam YW (2005) Issues for computer modelling of room acoustics Acoust. Soc. Am. vol. 105.4: pp. 2304-2317.
28
ee
in non-concert hall settings, Acoust. Sci. Tech. vol. 26.2: pp. Kirby R (2014) On the modification of Delany and Bazley fomulae,
29
30 145-155. Appl. Acoust. vol. 86: pp. 47-49.
31 Brinkmann F, Aspock L, Ackermann D, Lepa S, Vorlander M Sides DJ and Mulholland KA (1971) The variation of normal layer
rR
32 and Weinzierl S (2019) A round robin on room acoustical impedance with angle of incidence, J. Sound Vib. vol. 14.1: pp.
33
simulation and auralization, J. Acoust. Soc. Am. vol. 145.4: 139-142.
34
35
ev
pp. 2746-2760. Shaw EAG (1953) The acoustic wave guide. II. Some specific
36 Richard A, Fernandez-Grande E, Brunskog J and Jeong C-H (2017) normal acoustic impedance measurements of typical porous
37 Estimation of surface impedance at oblique incidence based on surfaces with respect to normally and obliquely incident waves,
38
iew
39 sparse array processing, J. Acoust. Soc. Am. vol. 141.6: pp. J. Acoust. Soc. Am. vol. 25.2: pp. 231-235.
40 4115-4125. Klein C and Cops A (1980) Angle dependence of the impedance of
41 Miki Y (1990) Acoustical properties of porous materials— a porous layer, Acustica vol. 44.4: pp. 258-264.
42
Modifications of Delany-Bazley models, J. Acoust. Soc. Jap. Jeon C-H (2011) Guideline for Adopting the Local Reaction
43
44 vol. 11.1: pp. 19-24. Assumption for Porous Absorbers in Terms of Random
45 Marbjerg G, Brunskog J, Jeong C-H and Nilsson E (2015) Incidence Absorption Coefficients, Acta Acust. Acust. vol.
46 Development and validation of a combined phased acoustical 97.5: pp. 779-790.
47
radiosity and image source model for predicting sound fields in Aretz M and Vorlander M (2010) Efficient modelling of absorbing
48
49 rooms, J. Acoust. Soc. Am. vol. 138.3: pp. 1457-1468. boundaries in room acoustic FE simulations, Acta Acust.
50 Marbjerg G, Brunskog J and Jeong C-H (2018) The difficulties of Acust. vol. 96.6: pp. 1042-1050.
51 simulating the acoustics of an empty rectangular room with an Vorlander M (2013) Computer simulations in room acoustics:
52
absorbing ceiling, Appl. Acoust. vol. 141: pp. 35-48.
53 Concepts and uncertainties, J. Acoust. Soc. Am. vol. 133.3:
54 Kuttruff H (2009) Room Acoustics, Taylor & Francis, 5th edition, pp. 1203-1213.
55 Ch. 2, 7. Kreiss H-O and Oliger J (1972) Comparison of accurate methods
56
Bradley JS, Reich R and Norcross SG (1999) A just noticeable for the integration of hyperbolic equations, Tellus vol. 24.3: pp.
57
58 difference in C50 for speech, Appl. Acoust. vol. 58.2: pp. 99- 199-215.
59 108. ISO 3382-1 (2009) Acoustics—Measurement of room acoustic
60 Hodgson M and Wareing A (2008) Comparisons of predicted parameters—Part 1: Performance spaces, International Orga-
steady-state levels in rooms with extended- and local-reaction nization for Standardization.
1
2 ISO 3382-3 (2009) Acoustics—Measurement of room acoustic
3 parameters—Part 3: Open plan offices, International Organi-
4 zation for Standardization.
5
6 ISO 354(2003) Acoustics—Acoustics—Measurement of sound
7 absorption in a reverberation room, International Organization
8 for Standardization.
9
Engsig-Karup AP, Glimberg S, Nielsen A and Lindbergr O (2013)
10
11 Fast hydrodynamics on heterogeneous many-core hardware,
12 Designing Scientific Applications on GPUs, CRC Press /
13 Taylor & Francis Group, Boca Raton: pp. 251-294.
14
Ren M and Jacobsen F (1993) A method of measuring the dynamic
15
16 flow resistance and reactance of porous materials, Appl.
17 Acoust. vol. 39.4: pp. 265-276.
18 Taflove A and Hagness SC (2013) Computational Electrodynamics:
19
20 The Finite-Difference Time-Domain Method, Artech House,
21 Inc., Boston, 3rd edition, ch. 9.
Fo
22 Dragna D, Pineau P and Blanc-Benon P (2015) A generalized

23
recursive convolution method for time-domain propagation in
24
rP
25 porous media, J. Acoust. Soc. Am. vol. 138.2: pp. 1030-1042.

26 Wilson DK, Ostashev VE and Collier SL (2004) Time-domain
27 equations for sound propagation in rigid-frame porous media,
28
ee
J. Acoust. Soc. Am. vol. 116.4: pp. 1889-1892.

29
30 Zhao J, Bao M, Wang X, Lee H and Sakamoto S (2018) An
31 equivalent fluid model based finite-difference time-domain
rR
32 algorithm for sound propagation in porous material with rigid

33
frame, J. Acoust. Soc. Am. vol. 143.1: pp. 130-138.
34
35
ev
Zhao J, Chen Z, Bao M, Lee H and Sakamoto S (2019)

36 Two-dimensional finite-difference time-domain analysis of
37 sound propagation in rigid-frame porous material based on
38
iew
39 equivalent fluid model, Appl. Acoust. vol. 146: pp. 204-212.

40 Paris T (1928) On the coefficient of sound-absorption measured by
41 the reverberation method, Philos. Mag. vol. 5: pp. 489-497.
42
Allard JF and Atalla N (2009) Propagation of Sound in Porous
43
44 Media: Modelling sound absorbing materials, Wiley, West
45 Sussex, United Kingdom, 3rd edition, ch 3, 5.
46 Hopkins C (2007) Propagation of Sound in Porous Media: Mod-
47
elling sound absorbing materials, Butterworth-Heinemann,
48
49 Oxford, United Kingdom, 1st edition, ch 4.
50 Merewether DE, Fisher RG and Smith FW (1980) On Implementing
51
a Numeric Huygen’s Source Scheme in a Finite Difference
52
53 Program to Illuminate Scattering Bodies, IEEE Trans. on Nucl.
54 Sci. vol. 27: pp. 1829-1833.
55 Attenborough K (1982) Acoustical characteristics of porous
56
materials, Phys. Reports. vol. 82.3: pp. 179-227.
57
58 Medina DS, St-Cyr A and Warburton T (2014) OCCA:
59 A unified approach to multi-threading languages, arXiv:
60 1403.0968[cs.DC]

Melander Et Al. - 2020 - Massively Parallel Nodal Discontinous Galerkin Finite Element Method Simulator For Room Acoustics

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Melander Et Al. - 2020 - Massively Parallel Nodal Discontinous Galerkin Finite Element Method Simulator For Room Acoustics

Uploaded by

Copyright:

Available Formats

International Journal of High Performance Computing Applications

Massively Parallel Nodal Discontinous Galerkin Finite

Journal: International Journal of High Performance Computing Applications

Manuscript Type: Research Paper

Date Submitted by the

Warburton, Timothy; Virginia Polytechnic Institute and State University,

High-performance computing, Multi-device acceleration, Heterogeneous

Keywords: CPU-GPU computing, Room acoustic simulation, Discontinuous Galerkin

We present a massively parallel and scalable nodal Discontinuous

Galerkin Finite Element Method (DGFEM) solver for the time-domain

relevance in practical room acoustics. The implementation is

32 High-performance computing, multi-device acceleration, heterogeneous CPU-GPU computing, room acoustic

governing partial differential equations that describe wave

22 is computed locally for each element, with communication the implementation.

associated high computational cost. Simulating large spaces

32 of millions of degrees of freedom (DoF). Despite progress

Acoustic wave propagation in a lossless medium is governed

(HPC) – in particular, the use of graphics processing units

10 surface ∂Ω. Here Q is the number of real poles, S is the number of

22 should be chosen large enough to satisfy an error threshold

25 By applying the inverse-Fourier transform to (4), we get

world surfaces exhibit frequency dependent absorption. X

= −ρc2 Dx ukh + Dy vhk + Dz whk

39 Here φn (x) is the hierarchical modal basis function and

25 utilizing matrix-based integration via the mass matrices in

29 basis. For curvilinear elements where the Jacobians are non-

32 R−1 q = b (Eleuterio 1999).

work, the upwind flux is chosen. The upwind flux combines

39 Hu and Atkins 2002). To derive the upwinding flux, a negative parts

22 3.3 Meshing and curvilinear geometry

32 accordance with the polynomial basis order used. To also

7 r = V(V r )−1 re (33)

25 denotes an arbitrary CPU core out of the total of M CPUs.

29 introduces an additional communication step, and therefore

compared to only using CPUs. However, using the GPUs

22 All simulations are performed using double-precision

25 This change will result in a roughly 2× speed increase, and

die. This means that the device-host communication and the

1-3 is finished, and the volume kernel is finished.

expected (Hesthaven and Warburton 2008). The results are

29 influence on the convergence rates, indicating that in this test

porous material thickness, d0 [m] is the air cavity depth, σmat

32 parts of the surface admittance, respectively.

affine or curvilinear meshes.

25 against the analytic solution.

4 0.623 0.727 0.851 0.913

32 Table 5. Strong scaling computational efficiency ηS for the

Mesh 4 6 583k 230 405k 552 972k

39 12 GPUs for mesh 1 and 2, the communication time begins

25 increase when comparing N = 4 to N = 6 for both of the computational efficiency is given by

DoF DoF DoF

N = 4 and N = 6 polynomial orders.

22 until some threshold is reached, where the communication

25 problem is selected in such a way that each GPU has a

29 very large problems.

simulation is increased. The valid frequency range of the

39 errors are below a certain threshold. A relevant threshold

errors. The dispersion relation is measured after n = 2 time

25 the simulation duration, where the benefits of using high-

r = afmax is fitted to the measured runtimes, where r is the

very case dependent. It is thus emphasized that this is only

basis function order and the spatial resolution is 6 PPW at

can simulate up to 30 times the Schroeder frequency within