
Journal of Computational Science 16 (2016) 89–97


GPU-enabled N-body simulations of the Solar System using a VOVS Adams integrator

P.W. Sharp a,∗, W.I. Newman b,c,d,∗∗

a Department of Mathematics, University of Auckland, Private Bag 92019, Auckland, New Zealand
b Department of Earth, Planetary, & Space Sciences, University of California, Los Angeles, CA 90095, United States
c Department of Physics & Astronomy, University of California, Los Angeles, CA 90095, United States
d Department of Mathematics, University of California, Los Angeles, CA 90095, United States

∗ Corresponding author. ∗∗ Corresponding author at: Department of Earth, Planetary, & Space Sciences, University of California, Los Angeles, CA 90095, United States. E-mail addresses: sharp@math.auckland.ac.nz (P.W. Sharp), win@ucla.edu (W.I. Newman).

Article history: Received 8 December 2015; Received in revised form 15 March 2016; Accepted 12 April 2016; Available online 16 April 2016.

Keywords: Collisionless N-body; Self-gravitating disc; Adams method; Variable-stepsize; GPU

Abstract

Collisionless N-body simulations over tens of millions of years are an important tool in understanding the early evolution of planetary systems. We first present a CUDA kernel for evaluating the gravitational acceleration of N bodies that is intended primarily for when N is less than several thousand. We then use the kernel with a variable-order, variable-stepsize Adams method to perform long, collisionless simulations of the Solar System near limiting precision. The varying stepsize means no special scheme is required to integrate close encounters, and the motion of bodies on eccentric orbits or close to the Sun is calculated accurately. Our method is significantly more accurate than symplectic methods and sufficiently fast.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

N-body models are used extensively to study the dynamics of the Solar System and other planetary systems. The basic N-body model has N bodies represented as point masses and each body interacting with the remaining N − 1 bodies through Newtonian gravitational forces. The cost of evaluating the acceleration of all N bodies is O(N²).

A common extension to this model is to add bodies whose mass is small compared to that of the bodies already present. The resulting two types of bodies are known as massive bodies and small bodies respectively. The number of small bodies is often far greater than the number of massive bodies. The massive bodies are used to represent the planets and the central star in the planetary system. What the small bodies represent depends in part on the way their interaction with the other bodies is modelled. At least three ways are used.

One way is to assume a small body acts upon no other body and is acted upon by the massive bodies only. If there are N bodies, M of which are massive bodies, the cost of evaluating the acceleration of all N bodies is O(MN) when M ≪ N. The second way is to assume the small body acts upon the massive bodies but not other small bodies, and is acted upon by the massive bodies. The cost of evaluating the acceleration of all N bodies is approximately twice that for the first way. The third way, which is the subject of this paper, is to have each body interact with all other bodies. This is functionally the same as the basic model described at the start of the introduction but the applications are fundamentally different.

The particular model we developed our implementation for is the Nice model, see for example Gomes et al. [1], Tsiganis et al. [2], and Morbidelli et al. [3]. This is the leading model for the early evolution of the planets Jupiter, Saturn, Uranus and Neptune. The massive bodies are the Sun and the four planets. The small bodies are planetesimals, which are asteroid-like bodies thought to have existed in the early Solar System. All bodies are assumed not to collide, necessitating the accurate integration of close encounters between bodies.

The Nice model proposes that Jupiter, Saturn, Uranus and Neptune were originally orbiting a lot closer to the Sun than they currently are and were surrounded by a thin disc of planetesimals with a total mass of 35 Earth masses. The model then proposes that over hundreds of millions of years, the gravitational interaction between the planets and planetesimals caused the planets to move outwards (migrate). As the planets migrated, their orbits became more stable and the planets settled into their current orbits.
The Nice model can explain several features of the Solar System including the formation of the outer asteroid belt and the orbits of some satellites of Uranus and Neptune.

An N-body model is defined by a set of ordinary differential equations (ODEs) for the motion of the bodies and a set of initial conditions. When N ≥ 3, as we have here, the ODEs cannot in general be solved analytically and are solved using integration methods for initial value ODEs. The resulting simulations for planetary systems are typically performed over ten million to several hundred million years, necessitating a large number of integration steps. Since the underlying mathematical problem is an initial value problem, these steps are performed sequentially. This, combined with the high cost of evaluating the acceleration of the N bodies, meant that until recently the simulations could not be performed without reducing the computational time by approximating the acceleration.

A commonly used approximation was to ignore the interactions between planetesimals, see for example Gomes et al. [1]. This reduces the evaluation cost to approximately O(MN) when M ≪ N. Levison et al. [4] in their study of the Nice model introduced a more realistic approximation. The interactions between two planetesimals were included if the distance between the planetesimals was small. This distance depends on the mass of the planetesimals and the semi-major axis of their orbit. A representative value for the simulations of Levison et al. [4] is 0.2 astronomical units (AU). This is approximately 1/150-th of the semi-major axis of Neptune's present orbit.

The large number of floating point operations required for N-body simulations makes them obvious candidates for implementation on GPUs. Considerable research has been done on using GPUs for galactic N-body simulations, see for example Hamada et al. [5], Portegies Zwart et al. [6], Berczik et al. [7], Bédorf et al. [8], Spurzem et al. [9], and Watanabe et al. [10].

Galactic simulations typically require several orders of magnitude fewer integration steps than simulations of planetary systems. Hence, N for planetary simulations is small compared with N for galactic simulations. Possibly because of this, little research has been done on using GPUs for the simulation of planetary systems when the acceleration is evaluated without approximation. Moore [11] and Grimm and Stadel [12] implemented modified versions of the integrator Mercury of Chambers [13] on a GPU. Mercury uses a variable-order, variable-step extrapolation method during close encounters, and a second-order mixed-variable symplectic method at all other times.

We adopt a different approach, that of using a variable-order, variable-stepsize (VOVS) Adams method with the local error tolerance chosen so that the integrator is operating near limiting precision. The automatic variation in the stepsize means no special scheme is required to integrate close encounters, and the orbital motion of bodies on eccentric orbits or close to the Sun is calculated accurately. For the stepsizes typically used with symplectic integrators, the phase error in the position of the bodies will be considerably smaller with the Adams method.

VOVS Adams integrators have two disadvantages compared to low-order symplectic methods. They have considerably more overhead, some of which is single threaded, and the error in the energy grows with t. The error in the energy for symplectic methods is bounded for an exponentially long time subject to some conditions being satisfied, see for example §X.4 of Hairer et al. [14].

The main motivation for our work was to overcome the difficulty highlighted by Levison et al. [4] in their study of the Nice model. They stated 'Unfortunately, it is still computationally too expensive to include the full gravitational interaction of the disc particles in a direct N-body integration'. This led them to use the approximation described above.

Our work has two aims. One is to present a new CUDA kernel for evaluating the acceleration of N bodies without approximation when N is of the order of 1000. Values of this size are representative of N used by Levison et al. [4] and in other recent long simulations of planetary systems such as those of Reyes-Ruiz et al. [15]. There are no hardware or software constraints in our implementation that restrict N to values of this order. Our kernel can be used for far larger N and we have performed short integrations with N = 65,536. Such a large value of N could not be used in long simulations because the computational time would be prohibitive on current computers. Our second aim is to establish whether long simulations can be performed at limiting precision with a VOVS Adams method in a realistic time.

The two leading VOVS Adams integrators are destep [16] and diva [17]. Sharp [18] tested these two integrators on four demanding N-body problems and both integrators were robust. diva was more efficient but we have used destep because we wanted to implement the inner loops of the integrator on a GPU and we found destep easier to understand. All our programming was in C and CUDA.

We begin in Section 2 with the equations of motion for the N bodies, and a summary of destep and our test problems. In Section 3 we give profiling results for destep run on a CPU. Then in Section 4, we describe our new CUDA C kernel. We also discuss using CUDA C kernels for the predictor and corrector loops in destep, and describe further optimizations of our implementation. We present a summary of extensive testing of our program in Section 5 and end in Section 6 with a discussion of our results.

Our testing was performed on the NeSI HPC facility at the University of Auckland. The CPU we used was a single core of an Intel E5-2680 processor. The processor has a clock speed of 2.70 GHz and 20 MB of cache. The processor has two threads, permitting limited multi-threading. We could not use this feature because it is disabled for all E5-2680 processors on the University of Auckland NeSI HPC system.

The GPU we used was a Kepler K20Xm. This has a single GK110 chip and permits up to 1024 threads per thread block. The total amount of L1 cache and shared memory on the chip is 64 KB. This 64 KB is configurable. The default on the HPC at the University of Auckland is 48 KB of shared memory and 16 KB of L1 cache.

Throughout the remainder of the paper, we refer to the version of our program that runs solely on a core of the E5-2680 processor as the CPU version, and the version that uses the GPU as the GPU version.

2. Background

Let r_i(t), i = 1, ..., N, denote the position of the ith body at time t in three-dimensional Cartesian coordinates with the origin at the centre of mass. When all N bodies are modelled as point masses interacting fully through Newtonian gravitational forces, their equations of motion can be written as

r̈_i = Σ_{j=1, j≠i}^{N} μ_j (r_j − r_i)/‖r_j − r_i‖³,    i = 1, ..., N,  t ∈ [0, t_f],        (1)

where the overdot operator denotes differentiation with respect to t, μ_j is G times the mass of the jth body and G is the gravitational constant. The energy E is conserved by the true solution. We used the relative error in E to measure the accuracy of a numerical solution.

Eq. (1) combined with initial conditions can be written as the initial value problem

ẏ = f(t, y(t)),    t ∈ [0, t_f],    y(0) = y_0.        (2)
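To make the structure of (1) and (2) concrete, the sketch below evaluates the right-hand side f by direct summation over all pairs. It is an illustration only: the packing of y as positions followed by velocities, and the names nbodyRhs and mu, are assumptions made for this example and are not the conventions of the fcn routine profiled in Section 3.

    #include <math.h>

    /* Illustrative layout: y = (x1,y1,z1, ..., xN,yN,zN, vx1, ..., vzN),
       mu[j] = G times the mass of body j.  ydot receives (velocities, accelerations). */
    void nbodyRhs(double t, int N, const double *y, const double *mu, double *ydot)
    {
        const double *r = y;            /* positions          */
        const double *v = y + 3 * N;    /* velocities         */
        double *rdot = ydot;            /* d(position)/dt     */
        double *vdot = ydot + 3 * N;    /* d(velocity)/dt     */

        for (int i = 0; i < 3 * N; i++) rdot[i] = v[i];
        for (int i = 0; i < 3 * N; i++) vdot[i] = 0.0;

        /* Direct summation of (1): every body interacts with every other body. */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                if (j == i) continue;
                double dx = r[3*j]   - r[3*i];
                double dy = r[3*j+1] - r[3*i+1];
                double dz = r[3*j+2] - r[3*i+2];
                double d2 = dx*dx + dy*dy + dz*dz;
                double d3 = d2 * sqrt(d2);            /* ||r_j - r_i||^3 */
                vdot[3*i]   += mu[j] * dx / d3;
                vdot[3*i+1] += mu[j] * dy / d3;
                vdot[3*i+2] += mu[j] * dz / d3;
            }
        }
        (void)t;  /* the force law (1) does not depend explicitly on t */
    }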
The integrator destep solves (2) and is intended for problems with smooth solutions and an expensive f. The integrator performs two f evaluations per step and the integration order k varies
between one and 12. Changes in the stepsize h are restricted to halving and doubling, and h is not increased until there have been at least k consecutive steps of constant size. Each integration begins with an opening phase during which k increases from one and h from a small value. The increases continue until h and k have reached suitable values for the problem. Thereafter, changes in h and k are typically less pronounced except during close encounters between bodies. On each attempted step destep estimates the local error. If the norm of the estimate is greater than a user-supplied local error tolerance, the step is rejected and repeated using a smaller stepsize.

The overhead of an integration step consists of all the calculations on the step except those used to evaluate the derivative. The overhead is dominated by the prediction and correction of the differences, and requires O(N) operations. There is also some overhead that is single threaded. The total overhead becomes less significant as N increases because the cost of evaluating the acceleration is O(N²). The integrator was originally written in Fortran 77 by Shampine and Gordon [16]. We used the recent C version of Burkardt [19]. This is the same as the Fortran 77 version except for minor changes. We used a local error tolerance of 10⁻¹¹. This is large compared with the unit round-off in IEEE double precision (approximately 2.2 × 10⁻¹⁶) but smaller values lead to significant round-off error, possibly because destep uses an L2 norm in its stepsize selection and we have a large number of differential equations.

Our test problems had the Sun, the giant planets Jupiter, Saturn, Uranus and Neptune, and N − 5 small bodies in a thin disc well outside Neptune's orbit. The small bodies had equal mass and a total mass of 35 Earth masses. The initial positions and velocities of the massive bodies and their masses were those of Levison et al. [4], supplied by Morbidelli [private communication]. These initial values are from an extensive computer search by Morbidelli et al. [20] for initial values that produced the required behaviour for the Nice model. The initial positions and velocities of the small bodies were generated from the pseudo-random distributions used by Levison et al. [4]. These distributions reproduce the essential features of the disc of planetesimals thought to exist in the early Solar System. These features are that the disc was axially symmetric about the Solar System axis and symmetrical about the plane of the Solar System, that the planetesimals were orbiting the Sun in orbits of low eccentricity, and that the number of planetesimals per unit area when viewed from the north pole of the Solar System decreased as the reciprocal of the distance from the Sun.

All initial positions and velocities were transformed so the centre of mass was at the origin and stationary. The units of distance and time were astronomical units and days respectively. Fig. 1 contains a 3D plot of the initial positions for one realization with N = 1024. We have used a larger scale in the z-direction than in the x- and y-directions to show the vertical extent of the disc of planetesimals. The small disc near the centre of the plot is the Sun, the four smaller discs are the planets, and the dots are the planetesimals.

Fig. 1. The initial position of the Sun, the four planets and the planetesimals for one realization of the initial conditions.

3. Profiling

Before writing our CUDA kernels, we profiled the CPU version of our program (the version that runs solely on a core of the E5-2680 processor). We compiled the program using the compiler options -p, -g and -Ofast, and analyzed the program's performance using the GNU command gprof. The derivative routine fcn and the two integrator routines step and de used almost all of the CPU time. The percentage share for N = 128, 256, 512, 1024, 2048 is listed in Table 1. We observe that even for the small value of 128 for N, the CPU time required to evaluate the acceleration dominates the total CPU time for a simulation.

Table 1
The CPU version of the program: the CPU time used by the derivative routine fcn and the integrator routines step and de as a percentage of the total CPU time, for N = 128, 256, 512, 1024 and 2048.

Routine    N = 128    N = 256    N = 512    N = 1024    N = 2048
fcn        89.1       94.3       96.8       98.6        99.1
step        9.6        5.1        3.1        1.3         0.8
de          0.9        0.3        0.1       <0.1        <0.1

4. Kernels and optimization

4.1. Kernel for the acceleration

Nyland et al. [21] described a kernel for the evaluation of the acceleration that uses one thread per body. The kernel employs tiling and a one-dimensional grid of one-dimensional thread blocks. Nyland et al. [21] tested their N-body code and found the performance improved for N ≤ 4096 if more than one thread per body was used. Doing so ensured the latencies on the GPU were covered adequately. The improvement in performance for N > 4096 was insignificant because the GPU was fully utilized.

Nyland et al. [21] did not describe how they modified their kernel to use multiple threads per body. We tested three implementations and settled on using two-dimensional thread blocks in place of one-dimensional thread blocks. Table 2 gives pseudo-code for our kernel.

Table 2
Pseudo-code for our kernel for the acceleration.

 1  i := (blockIdx.x)(blockDim.x) + threadIdx.x
 2  acont := 0
 3  nt := N/((blockDim.x)(blockDim.y))
 4  for t := 1 : nt
 5      ti := ((blockDim.y)t + threadIdx.y)blockDim.x + threadIdx.x
 6      share[(threadIdx.y)(blockDim.x) + threadIdx.x] := r_{ti+1}
 7      synchronize()
 8      for j := 0 : blockDim.x − 1
 9          r_{k+1} := share[(threadIdx.y)(blockDim.x) + j]
10          δr := r_{k+1} − r_{i+1}
11          acont := acont + μ_j δr/(‖δr‖³ + ε)
12      end
13      synchronize()
14      share[(threadIdx.y)(blockDim.x) + threadIdx.x] := acont
15      synchronize()
16      if threadIdx.y = 0 & t = nt − 1 then
17          for j := 1 : blockDim.y
18              acont := acont + share[j(blockDim.x) + threadIdx.x]
19          end
20      end
21  end
22  if threadIdx.y = 0 then
23      a_{i+1} := acont
24  end

The number of tiles is the number of thread blocks. Each thread block is blockDim.x threads by blockDim.y threads. The blocks are numbered using the index blockIdx.x and the threads within a block are numbered using the index threadIdx.x. Both indices start at zero, not one.

Line 1 of the kernel sets the index of the body whose acceleration is being calculated. Line 2 zeros acont, which is the contribution to the acceleration of the (i + 1)-st body calculated by the thread. Line 3 sets the number of tiles. Lines 4–21 are the loop over the tiles. Line 5 sets the thread index and Line 6 saves the positions of a tile in the shared memory of the thread block. The threads in a block are then synchronized on Line 7. The inner loop on Lines 8–12 calculates the contribution to the acceleration from a row of the tile. This loop is followed by a synchronization, the saving of the acceleration just calculated, and a further synchronization. The contributions saved in shared memory are then summed if the tile is the last tile and threadIdx.y is zero. After the tiles have been processed, the acceleration of the body is set to the accumulated acceleration if threadIdx.y is zero.
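The kernel of Table 2 is launched from the host with a two-dimensional thread block, blockDim.x bodies wide and blockDim.y = nT threads deep. The sketch below shows one way to set up this configuration. It is an illustration only: the function and variable names are ours, and the choice of grid size (one column of nT threads per body) is our reading of the pseudo-code rather than a detail stated in the paper. A reconstruction of the kernel itself is given with Appendix A.

    #include <cuda_runtime.h>

    __global__ void accKernel(const double4 *pos, int nB, double4 *acc);  /* see Appendix A */

    /* Launch the acceleration kernel with nT threads per body arranged as a
       two-dimensional thread block (perBlk bodies wide, nT threads deep). */
    void launch_acc(const double4 *pos_d, double4 *acc_d, int N, int nT, int threadsPerBlock)
    {
        int  perBlk = threadsPerBlock / nT;              /* bodies handled per block        */
        dim3 block(perBlk, nT);                          /* two-dimensional thread block    */
        dim3 grid((N + perBlk - 1) / perBlk);            /* a column of nT threads per body */
        size_t shmem = perBlk * nT * sizeof(double4);    /* shared-memory tile of positions */
        accKernel<<<grid, block, shmem>>>(pos_d, N, acc_d);
    }

With 512 threads per thread block and nT = 8, for example, each block would handle 64 bodies.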
The first synchronization ensures that all the required positions are saved in shared memory before the acceleration is calculated. The second synchronization ensures the positions in shared memory are not over-written prematurely by the contributions to the acceleration. The third synchronization ensures these contributions are all saved in shared memory before the acceleration of the body is calculated.

The divide by zero on Line 11 when i = j is avoided by adding ε to ‖δr‖³. We used ε = 10⁻⁶⁰. This changes the acceleration of the ith body when using double precision arithmetic only if ‖δr‖ is approximately 10⁻¹⁵ or smaller. If ‖δr‖ is this small, the ith and jth bodies are unrealistically close, the acceleration of the ith body will be very large, and the integration will stop.

4.2. Kernel for the prediction and correction

In our simulations, the stepsize used by destep changes infrequently except when bodies are undergoing a close encounter. Hence, since k is high in our simulations, typically 10–12, the overhead in destep is dominated by the prediction and correction of the differences. Let φ_{l,j}, j = 0, ..., k, l = 1, ..., n, denote the jth difference for the lth equation at the start of a step. The differences are predicted and corrected using

φ^p_{l,k} = 0,    φ^p_{l,j} = φ_{l,j} + φ^p_{l,j+1},    j = k − 1, ..., 0,  l = 1, ..., n,        (3)

φ_{l,k} = f^p_l − φ^p_{l,0},    φ_{l,j} = φ^p_{l,j} + φ_{l,k},    j = k − 1, ..., 0,  l = 1, ..., n,        (4)

where f^p_l is the lth component of the derivative evaluated using the predicted solution.

We implemented (3) and (4) as separate CUDA kernels and found for N of the order of 1000 that using the kernels was slower than performing the calculations on a CPU core. We believe this occurs because the ratio of arithmetic operations to memory loads is low and the overhead of invoking a kernel is significant.
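The prediction and correction of the differences therefore stay on the CPU core. A serial sketch of (3) and (4) is given below. It is illustrative only: the array names and the phi[j][l] storage layout are assumptions for this example and are not the work arrays used inside destep. The inner loops over l are the loops whose unrolling is described in Section 4.3.

    /* Predict and correct the differences phi[j][l], j = 0..k, l = 0..n-1,
       following (3) and (4).  fp[l] is the lth component of the derivative
       evaluated at the predicted solution; phip holds the predicted differences. */
    void predict_differences(int n, int k, double **phi, double **phip)
    {
        for (int l = 0; l < n; l++) phip[k][l] = 0.0;
        for (int j = k - 1; j >= 0; j--)
            for (int l = 0; l < n; l++)              /* candidate loop for unrolling */
                phip[j][l] = phi[j][l] + phip[j + 1][l];
    }

    void correct_differences(int n, int k, double **phi, double **phip, const double *fp)
    {
        for (int l = 0; l < n; l++) phi[k][l] = fp[l] - phip[0][l];
        for (int j = k - 1; j >= 0; j--)
            for (int l = 0; l < n; l++)              /* candidate loop for unrolling */
                phi[j][l] = phip[j][l] + phi[k][l];
    }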

4.3. Further optimizations

On the NeSI HPC facility at the University of Auckland, the default compiler and optimization level for CUDA C are g++ and O3 respectively. We re-did the compilation with the Intel compiler icpc, keeping the optimization level at O3, and found this reduced the computational time. A further reduction was obtained by using the optimization level Ofast when compiling the routines in destep (this level is not permitted when compiling CUDA routines). We experimented with the Portland Group compiler pgcc and found the compiled code was slower than that for icpc.

We next used pragmas to unroll the inner loop of the derivative kernel, and the inner loops of the prediction and correction of the differences in destep. There was an optimal amount of unrolling. Further unrolling increased the computational time, possibly because of conflicting requirements on the CPU and GPU registers. The optimal unrolling was respectively 64, 16 and 16 for the derivative kernel, the prediction of the differences, and their correction. The unrolling of the two loops for the differences gave smaller reductions in computational time than the unrolling of the inner loop in the derivative kernel. As we illustrate and discuss below, the reduction in computational time from loop unrolling decreases as nT increases.

We tried nvcc options for steering the GPU code generation. We did tests with the option fmad (fused floating-point operations) enabled and disabled, and used the option maxregcount to vary the maximum number of registers the GPU functions can use. There was no reduction in computational time. The same occurred when we used the CUDA function cudaFuncSetCacheConfig to direct the GPU to prefer L1 cache over shared memory.

5. Comparisons

We did extensive testing to establish the optimal number nT of threads per body on the GPU, and the speed-up of the GPU version relative to our CPU version. We did this testing using 512 and 1024 threads per thread block to demonstrate the gains from having more threads per thread block. We also investigated the performance of the GPU version on ten integrations of at least 20 million years each. Since we wanted to know if it was realistic to perform integrations of 300 million years, we first took a conservative approach with these ten integrations and used 512 threads per thread block. We later repeated three of the 20 million year simulations with 1024 threads per thread block.

We present a summary of the above testing in this section. The HPC system was heavily used during the period we performed the tests and it took us several months to complete them. We are unaware of any relevant changes in the hardware or software during this time. The heavy use caused a 5% or more difference in the computational time between consecutive runs with the same input.

Since we are interested in simulations with N of the order of 1000, much of our testing was with N = 1024. We have included test results for larger N when they clearly demonstrate further aspects of the performance of the GPU version of our program.

Before we present the summary, we show that the accuracy of the numerical solution is not changed significantly when the optimizations described in the previous section are used or when nT, the number of threads per body on the GPU, is chosen to minimize the computational time. If the accuracy were reduced significantly, the optimizations and having a choice for nT would be of limited merit.

5.1. Accuracy

For each nT and N, we did one set of 100 integrations over 10 years for the following four compilations: the whole program compiled with g++ and optimization level O0; the whole program compiled with g++ and O3; the integrator destep compiled with icpc and Ofast and the CUDA kernels compiled with icpc and O3; and the previous compilation with the unrolling of the inner loops added. We performed 100 integrations and not one to reduce the effect of an outlier, and used a short integration interval to ensure that all integrations for a given N required a very similar number of f evaluations. We used the relative error in the energy after 10 years as our measure of the error in a single integration, and the mean of this error over the 100 integrations in a set as the error for a set.

For a given N, the initial conditions for each set of 100 integrations were the same. The initial conditions for the first integration in a set were chosen as described in Section 2. We generated the initial conditions for the remaining 99 integrations in the set by adding a small pseudo-random perturbation to each component of the initial positions for the first integration. The perturbation was selected from the uniform distribution on the interval [−5 × 10⁻¹¹, 5 × 10⁻¹¹].
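A sketch of the perturbation of the initial positions is given below. The use of the C library generator rand is an assumption made for the illustration; the paper does not state which pseudo-random generator was used.

    #include <stdlib.h>

    /* Add to every position component a perturbation drawn from the
       uniform distribution on [-5e-11, 5e-11] (units of AU). */
    void perturb_positions(int N, double *r)
    {
        const double delta = 5.0e-11;
        for (int i = 0; i < 3 * N; i++) {
            double u = (double)rand() / RAND_MAX;    /* u in [0, 1] */
            r[i] += (2.0 * u - 1.0) * delta;
        }
    }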
We used nT = 2^i, i = 0, ..., 6, and performed the integrations for N = 1024 and 2048, giving 56 sets of 100 integrations in total, 28 for each N. The number of threads per thread block was 512. For N = 1024, all integrations required 199 f evaluations and the mean of the error across the 28 sets varied six percent from the smallest to the largest mean. For N = 2048, the variation in the number of f evaluations was small: 89.1% of the integrations required 540 evaluations, 10.3% required 542 evaluations, and the remaining 0.6% required either 532 or 534 evaluations. The mean of the error across the 28 sets of integrations varied 1.3 percent from the smallest to the largest mean.

We concluded that changing nT or how the compilation is performed has little effect on the error. We reached the same conclusion when we re-did the tests using 1024 threads per thread block.

5.2. Optimal number of threads

We did timing tests for N = 1024 and N = 2048 to establish what value of nT gave the least computational time. We performed the integrations described above five times and recorded the computational time for each set of 100 integrations. We used the mean of the five computational times for a given N, nT and compilation as our measure of the computational time. We omitted the compilations with optimization level -O0 because these integrations were noticeably slower than those for the other compilations.

Fig. 2. The normalized computational times for the three compilations. N = 1024 (top), N = 2048 (bottom).

The top half of Fig. 2 gives the graphs of the normalized means of the computational times for the seven values of nT and the three compilations when N = 1024 and there were 512 threads per thread block. The means were normalized by dividing by the smallest of the 21 means for N = 1024. We observe from Fig. 2 for N = 1024 that the smallest computational time occurs when 32 threads per body are used, although using 4 and 16 threads per body leads to only a small increase in the computational time. A little puzzlingly, using 8 threads per body uses slightly more computational time than 4 or 16 threads, showing that the dependence of the performance on nT is complicated for our values of N.

We also observe from Fig. 2 for N = 1024 that the reduction in computational time from using loop unrolling decreases as nT increases. We expected this because the number of trips around the inner loop of our kernel for the derivative is inversely proportional to nT. Increasing nT reduces the number of trips, which reduces the effectiveness of loop unrolling.

The bottom half of Fig. 2 gives the graphs of the normalized means for N = 2048. The graphs have a similar form to those for N = 1024 except the smallest computational time occurs at 16 and not 32 threads per body, because increasing N reduces the effect of latency on the GPU.

We re-did the tests using 1024 threads per thread block. For the fastest compilation, the computational time for N = 1024 and 2048 was minimized when nT was 8 and 16 respectively.

To test if the optimal value of nT changed with the interval of integration, we performed one integration of one million years for N = 1024 and nT = 2^i, i = 2, ..., 6. There was good agreement with the results for the integrations of 10 years.

5.3. Speed-up

We next measured the speed-up of the GPU version of our program relative to the CPU version for N = 1024, 2048, 3072, and 4096. To ensure a fair comparison, we used the symmetry in the interaction between bodies when implementing the acceleration for the CPU version. To a good approximation, this halved the number of floating point operations required to evaluate the acceleration. We also experimented with the compilation of the CPU version and found that using icpc with the optimization level -Ofast and a factor of 8 in the loop unrolling gave the least computational time. The gains from the loop unrolling were small.

We performed a set of 100 integrations of 1000 years using the CPU and GPU versions. The initial conditions for the 100 integrations were chosen as described earlier in the section. The mean number of f evaluations and the error in the energy for the 100 integrations for the CPU and GPU versions differed but the differences were small. Fig. 3 gives the graphs of the speedup as a function of N for 512 and 1024 threads per thread block.

Fig. 3. The speedup of the GPU version relative to the CPU version.

We observe from Fig. 3 that for a fixed number of threads per thread block, the speedup increases as N increases from 1024 to 4096. We also observe that the increases decrease monotonically with N, indicating that the GPU makes better use of the threads for larger N. Of particular interest to us were the speedups for N = 1024. These are 18.3 and 24.8 for 512 and 1024 threads per thread block respectively. Hence, for N = 1024 going from 512 to 1024 threads per thread block reduced the computational time by 26%, or, using the reciprocal, increased the speed of our implementation by 36%.
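For reference, the symmetric evaluation used for the CPU version in these comparisons can be sketched as follows. The function and array names are illustrative; the point is that each pair (i, j) is visited once, roughly halving the floating point operations of the direct double loop over all pairs.

    #include <math.h>

    /* Direct evaluation of the accelerations that uses the symmetry of the
       interaction: the contribution of the pair (i, j) is applied to both bodies. */
    void accel_symmetric(int N, const double *r, const double *mu, double *a)
    {
        for (int i = 0; i < 3 * N; i++) a[i] = 0.0;
        for (int i = 0; i < N; i++) {
            for (int j = i + 1; j < N; j++) {
                double dx = r[3*j] - r[3*i], dy = r[3*j+1] - r[3*i+1], dz = r[3*j+2] - r[3*i+2];
                double d2 = dx*dx + dy*dy + dz*dz;
                double inv3 = 1.0 / (d2 * sqrt(d2));
                a[3*i]   += mu[j] * dx * inv3;  a[3*i+1] += mu[j] * dy * inv3;  a[3*i+2] += mu[j] * dz * inv3;
                a[3*j]   -= mu[i] * dx * inv3;  a[3*j+1] -= mu[i] * dy * inv3;  a[3*j+2] -= mu[i] * dz * inv3;
            }
        }
    }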
5.4. Long integrations

We attempted a 300 million year integration of the Nice model. The integration was done in segments of 20 million years and the position and velocity of all the bodies were printed every 10,000 years. We took a conservative approach with the efficiency of our program and used 512 threads per thread block and not the maximum possible of 1024. We stopped the simulation at 32.16 million years because Saturn had migrated to 4915 AU and the Sun was 1.18 AU from the origin. Jupiter, Uranus and Neptune had also migrated but the distance was small compared to that of Saturn.

We were surprised by the large solar displacement and planetary migration because these did not occur in the 300 million year integration of Levison et al. [4]. To gain insight, we generated three more realizations of the initial conditions and performed the integrations. Each integration was required to go at least 20 million years and allowed to continue until 300 million years, or the Sun was 1 AU from the origin, or a planet had migrated 1000 AU, whichever came first. The first simulation stopped at 20 million years with the Sun 1.21 AU from the origin. The second and third simulations stopped at 35.24 and 27.81 million years respectively because the Sun was 1 AU from the origin.

Shortly after we had completed the above integrations, Reyes-Ruiz et al. [15] published the results of their simulations of the Nice model described in Levison et al. [4]. Reyes-Ruiz et al. used the second-order symplectic integrator Mercury of Chambers [13] and evaluated the acceleration without approximation. They found the system was unstable over tens of millions of years.

The integrator destep is able to integrate unstable solutions of the above type but the solutions would be of little value because they are physically unrealistic. Nevertheless, the Nice model can be used to test our program in a meaningful way if we restrict how far we integrate. There is no one correct interval of integration. The results of the four integrations described above suggest an integration interval of 20 million years and we used this. We generated six more realizations of the initial conditions and performed the integrations, to give us results for ten integrations of 20 million years.

The average computational time for the ten integrations was 4.83 days. A linear extrapolation to the 300 million years of Levison et al. [4] gives approximately two and a half months of computational time, or three months as a conservative estimate. Three months is extremely long compared to the typical computational time when solving initial value ODEs but is in line with what is accepted for N-body simulations of the Solar System. For example, two simulations in Raymond et al. [22] took 2–3 months, two took 4–5 months and one 16 months.

The average stepsize was 18.5 days. If we take Jupiter's period as the smallest orbital period in the problem, 18.5 days is 1/234-th of the smallest period. This is an order of magnitude smaller than the stepsizes typically used for symplectic integrations and shows that accurate integrations require small stepsizes.

Of the steps attempted by destep, 2.3% were repeated because the error on the first attempt of the step was too large. This small percentage of repeated steps is strong evidence of the effectiveness of the scheme used in destep to select the stepsize. The small percentage means that if destep was modified so there were very few rejected steps, a modification we considered, there would be little gain in efficiency.

The relative error in the energy at 20 million years varied from 5.50 × 10⁻⁸ to 6.85 × 10⁻⁸ across the ten integrations, and the mean was 6.19 × 10⁻⁸. A linear least squares fit of the power law βt^α to the relative error for each of the ten integrations gave ten values of α ranging from 0.75 to 0.94. Despite this near linear growth, the relative error in the energy is two or three orders of magnitude smaller than is typical for simulations with low-order symplectic methods. For example, the 200 million year simulations of Raymond et al. [22] had an error better than one part in 10³. For simulations where the solution remains acceptable, extrapolation of the power laws described above for our program would give an error better than one part in 10⁶.
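The power-law fits reported above are ordinary linear least-squares fits of the logarithm of the relative energy error against the logarithm of t. A sketch of the calculation is given below; the function name and the assumption that the error is sampled at the output times are ours.

    #include <math.h>

    /* Fit E_rel(t) ~ beta * t^alpha by linear least squares on
       log E_rel = log beta + alpha log t.  t[] and erel[] hold m samples with
       t > 0 and erel > 0 (for example, the relative energy error at each
       10,000 year output time). */
    void fit_power_law(int m, const double *t, const double *erel,
                       double *alpha, double *beta)
    {
        double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
        for (int i = 0; i < m; i++) {
            double x = log(t[i]), y = log(erel[i]);
            sx += x;  sy += y;  sxx += x * x;  sxy += x * y;
        }
        *alpha = (m * sxy - sx * sy) / (m * sxx - sx * sx);
        *beta  = exp((sy - (*alpha) * sx) / m);
    }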
We re-did three of the above 20 million year integrations using 1024 threads per thread block. The average increase in speed compared with the integrations with 512 threads per thread block was 32%, in good agreement with the 36% calculated from the speedups of Fig. 3 given the measurement uncertainty in the computational time.

6. Discussion

We presented a new CUDA C kernel for evaluating the Newtonian acceleration of N bodies without approximation. The kernel is intended mainly for N-body simulations when N is of the order of 10³ but can be employed with far larger N. We used the kernel with a leading variable-order, variable-stepsize Adams method to perform collisionless N-body simulations of the Solar System near limiting precision. We found that despite the high overhead of Adams methods, the speed-up relative to an optimized CPU version of the program for N = 1024 was 18 and 25 for 512 and 1024 threads per thread block respectively.

Our program is sufficiently fast that it is feasible to integrate different realizations of the initial conditions. The ability to do so can lead to significantly more insight about the model than a single integration, as we demonstrated for the Nice model described in Levison et al. [4].

The standard approach when writing a CUDA kernel for evaluating the acceleration of N bodies is to have one thread per body. Because of the values of N we used, we obtained noticeable gains in speed when using more than one thread per body. Nyland et al. [21] found the same. Our gains in speed for a fixed N were greater than theirs partly because our GPUs permit up to four times as many threads per block. Hence, for a given N, the latency on our GPUs will likely be more significant and more threads per body will be required to hide the latency. We found that increasing the number of threads per thread block from 512 to 1024 increased the speed of the GPU version by one-third.

Although the Nice model is not the subject of this paper, it is pertinent to make some comments about the model. We found the solution became unstable over an interval of approximately 20–40 million years, confirming the recent results of Reyes-Ruiz et al. [15]. The presence of this instability is not of concern as it is a claimed output of the model. Without it, the planets could not migrate from their initial positions in the Nice model to their current positions. What is of concern is the time scale of the instability. The Nice model requires a time scale of several hundred million years, which is an order of magnitude longer than that found by Reyes-Ruiz et al. [15] and by us. The integrator destep is a robust integrator and we are convinced that the shortened time scale of the instability is due to the sensitivity of the solution to changes in the initial conditions when the acceleration is not approximated. As Reyes-Ruiz et al. [15] concluded, further investigation of the Nice model is required.

Acknowledgements

The authors thank the two referees for their suggestions on improving the suitability of the paper. The authors acknowledge the contribution of the NeSI high-performance computing facilities and the staff at the Centre for eResearch at the University of Auckland. New Zealand's national facilities are provided by the New Zealand eScience Infrastructure (NeSI) and funded jointly by NeSI's collaborator institutions and through the Ministry of Business, Innovation and Employment's Infrastructure programme. URL: http://www.nesi.org.nz. The second author acknowledges support from the UCLA Academic Senate for this collaboration.

Appendix A.

This appendix contains a listing of our kernel. The kernel has the input arguments pos and nB (the number of bodies), and the output argument acc. The arguments pos and acc are arrays of data type double4. This is a built-in data type for CUDA C that holds four double precision floating point numbers. The ith element of pos contains the x, y, and z components of the position of the (i + 1)-st body together with its mass. The ith element of acc contains the x, y and z components of the acceleration of the (i + 1)-st body.
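The authors' listing itself was lost in the conversion of the article to text. The kernel below is a reconstruction sketched from the pseudo-code of Table 2 and the interface described above; it is not the original listing, and details such as the kernel name, the use of extern shared memory, and the storage of μ (G times the mass) in the fourth component of pos are our assumptions.

    #define EPS 1.0e-60   /* softening used on Line 11 of Table 2 */

    __global__ void accKernel(const double4 *pos, int nB, double4 *acc)
    {
        extern __shared__ double4 share[];                /* blockDim.x * blockDim.y entries */

        int i  = blockIdx.x * blockDim.x + threadIdx.x;   /* body this thread column works on */
        int s  = threadIdx.y * blockDim.x + threadIdx.x;  /* this thread's slot in shared memory */
        int nt = nB / (blockDim.x * blockDim.y);          /* number of tiles (Line 3); nB is assumed
                                                             to be a multiple of the block size */
        double4 ri = pos[i];
        double ax = 0.0, ay = 0.0, az = 0.0;              /* acont */

        for (int t = 0; t < nt; t++) {
            /* Line 6: each thread loads one position of the tile. */
            int ti = (t * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
            share[s] = pos[ti];
            __syncthreads();

            /* Lines 8-12: contribution from one row of the tile. */
            for (int j = 0; j < blockDim.x; j++) {
                double4 rk = share[threadIdx.y * blockDim.x + j];
                double dx = rk.x - ri.x, dy = rk.y - ri.y, dz = rk.z - ri.z;
                double d2 = dx*dx + dy*dy + dz*dz;
                double w  = rk.w / (d2 * sqrt(d2) + EPS); /* rk.w assumed to hold mu_j */
                ax += w * dx;  ay += w * dy;  az += w * dz;
            }
            __syncthreads();

            /* Lines 14-20: accumulate the partial sums of the block at the last tile. */
            share[s] = make_double4(ax, ay, az, 0.0);
            __syncthreads();
            if (threadIdx.y == 0 && t == nt - 1)
                for (int j = 1; j < blockDim.y; j++) {
                    double4 a = share[j * blockDim.x + threadIdx.x];
                    ax += a.x;  ay += a.y;  az += a.z;
                }
            __syncthreads();
        }

        /* Lines 22-24: one thread per body writes the accumulated acceleration. */
        if (threadIdx.y == 0)
            acc[i] = make_double4(ax, ay, az, 0.0);
    }

A host-side launch consistent with this interface is sketched at the end of Section 4.1.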
References

[1] R. Gomes, H.F. Levison, K. Tsiganis, A. Morbidelli, Origin of the cataclysmic Late Heavy Bombardment period of the terrestrial planets, Nature 435 (2005) 466–469, http://dx.doi.org/10.1038/nature03676.
[2] K. Tsiganis, R. Gomes, A. Morbidelli, H.F. Levison, Origin of the orbital architecture of the giant planets of the Solar System, Nature 435 (2005) 459–461, http://dx.doi.org/10.1038/nature03539.
[3] A. Morbidelli, H.F. Levison, K. Tsiganis, R. Gomes, Chaotic capture of Jupiter's Trojan asteroids in the early Solar System, Nature 435 (2005) 462–465, http://dx.doi.org/10.1038/nature03540.
[4] H.F. Levison, A. Morbidelli, K. Tsiganis, D. Nesvorný, R. Gomes, Late orbital instabilities in the outer planets induced by interaction with a self-gravitating planetesimal disk, Astron. J. 142 (5) (2011) 152.
[5] T. Hamada, T. Iitaka, The Chamomile Scheme: An Optimized Algorithm for N-body Simulations on Programmable Graphics Processing Units, 2007, ArXiv Astrophysics e-prints, arXiv:astro-ph/0703100.
[6] S.F. Portegies Zwart, R.G. Belleman, P.M. Geldof, High-performance direct gravitational N-body simulations on graphics processing units, New Astron. 12 (2007) 641–650, http://dx.doi.org/10.1016/j.newast.2007.05.004, arXiv:cs/0702135.
[7] P. Berczik, K. Nitadori, S. Zhong, R. Spurzem, T. Hamada, X. Wang, I. Berentzen, A. Veles, W. Ge, High performance massively parallel direct N-body simulations on large GPU clusters, in: International Conference on High Performance Computing, Kyiv, Ukraine, October 8–10, 2011, 2011, pp. 8–18.
[8] J. Bédorf, E. Gaburov, S. Portegies Zwart, A sparse octree gravitational N-body code that runs entirely on the GPU processor, J. Comput. Phys. 231 (2012) 2825–2839, http://dx.doi.org/10.1016/j.jcp.2011.12.024, arXiv:1106.1900.
[9] R. Spurzem, P. Berczik, S. Zhong, K. Nitadori, T. Hamada, I. Berentzen, A. Veles, Supermassive black hole binaries in high performance massively parallel direct N-body simulations on large GPU clusters, in: R. Capuzzo-Dolcetta, M. Limongi, A. Tornambè (Eds.), Advances in Computational Astrophysics: Methods, Tools, and Outcome, Astronomical Society of the Pacific Conference Series, vol. 453, 2012, p. 223.
[10] T. Watanabe, N. Nakasato, GPU Accelerated Hybrid Tree Algorithm for Collision-less N-body Simulations, 2014, ArXiv e-prints, arXiv:1406.6158.
[11] A. Moore, A.C. Quillen, QYMSYM: A GPU-accelerated hybrid symplectic integrator that permits close encounters, New Astron. 16 (7) (2011) 445–455, http://dx.doi.org/10.1016/j.newast.2011.03.009.
[12] S.L. Grimm, J.G. Stadel, The GENGA Code: gravitational encounters in N-body simulations with GPU acceleration, Astrophys. J. 796 (2014) 23, http://dx.doi.org/10.1088/0004-637X/796/1/23, arXiv:1404.2324.
[13] J.E. Chambers, A hybrid symplectic integrator that permits close encounters between massive bodies, MNRAS 304 (1999) 793–799, http://dx.doi.org/10.1046/j.1365-8711.1999.02379.x.
[14] E. Hairer, C. Lubich, G. Wanner, Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, Springer Series in Computational Mathematics, Springer, 2006.
[15] M. Reyes-Ruiz, H. Aceves, C.E. Chavez, Stability of the outer planets in multiresonant configurations with a self-gravitating planetesimal disk, Astrophys. J. 804 (2015) 91, http://dx.doi.org/10.1088/0004-637X/804/2/91, arXiv:1406.2341.
[16] L.F. Shampine, M.K. Gordon, Computer Solution of Ordinary Differential Equations: The Initial Value Problem, W.H. Freeman and Co., San Francisco, California, 1975.
[17] F.T. Krogh, 14.1 Variable order Adams method for ordinary differential equations (diva), Technical report, Math à la Carte, Inc., Tujunga, CA, 1997.
[18] P.W. Sharp, N-body simulations: the performance of some integrators, ACM Trans. Math. Softw. 32 (2006) 375–395, http://dx.doi.org/10.1145/1163641.1163642.
[19] J. Burkardt, ode.c, http://people.sc.fsu.edu/jburkardt/c_src/ode/ode.html, downloaded April, 2013 (February 01, 2012).
[20] A. Morbidelli, K. Tsiganis, A. Crida, H.F. Levison, R. Gomes, Dynamics of the giant planets of the solar system in the gaseous protoplanetary disk and their relationship to the current orbital architecture, Astron. J. 134 (2007) 1790–1798, http://dx.doi.org/10.1086/521705, arXiv:0706.1713.
[21] L. Nyland, M. Harris, J. Prins, Fast N-body simulation with CUDA, in: H. Nguyen (Ed.), GPU Gems 3, 1st edition, Addison-Wesley Professional, 2007, Chapter 31.
[22] S.N. Raymond, T. Quinn, J.I. Lunine, High-resolution simulations of the final assembly of Earth-like planets I. Terrestrial accretion and dynamics, Icarus 183 (2006) 265–282, http://dx.doi.org/10.1016/j.icarus.2006.03.011, arXiv:astro-ph/0510284.

Dr. Philip Sharp is a senior lecturer in the Department of Mathematics at the University of Auckland. He held previous appointments in the Department of Computer Science at the University of Toronto, and the Department of Mathematics and Statistics at Queen's University, Canada. He researches numerical methods for solving ordinary and delay differential equations, Volterra integral equations, and for performing N-body simulations of the Solar System on high performance computers.

William I. Newman (M1978) received the B.Sc. (Hon.) and M.Sc. in Physics from the University of Alberta, in Edmonton, Canada, in 1971 and 1972, respectively. He received the M.S. and Ph.D. in astronomy and space science from Cornell University in Ithaca, NY in 1975 and 1979, respectively. He was a member of the Institute for Advanced Study in Princeton, NJ from 1978 to 1980 and joined the faculty at the University of California, Los Angeles in 1980, where he is Professor of Earth, Planetary, and Space Sciences, as well as of Physics and Astronomy, and of Mathematics. He became a Fellow of the John Simon Guggenheim Foundation in 1987, visiting Cornell University and the U.S.S.R. Academy of Sciences. In 1990–1991, he held appointment as the Stanislaw Ulam Distinguished Scholar at the Center for Nonlinear Studies at Los Alamos National Laboratory. In 2000–2001, he was appointed the Belkin Visiting Professor in the Department of Computer Science and Applied Mathematics at the Weizmann Institute in Rehovot, Israel. In 2012, he was appointed Yuval Ne'eman Distinguished Lecturer in the Department of Geophysics and Planetary Physics at Tel Aviv University in Israel. He has published over 100 research papers, edited two books, and published a graduate textbook on continuum mechanics. His current research interests are primarily in planetary science, geophysics, astrophysics, numerical analysis, and nonlinear dynamics.
