

Multi-core Strategies For Particle Methods
John R. Williams, David Holmes and Peter Tilke*
Massachusetts Institute of Technology, *Schlumberger-Doll Research Laboratory

This paper discusses the implementation of particle based numerical methods
on multi-core machines. In contrast to cluster computing, where memory is distributed across machines, multi-core machines can share memory across all
cores. Here general strategies are developed for spatial management of particles
and sub-domains that optimize computation on shared memory machines. In
particular, we extend cell hashing so that cells bundle particles into orthogonal
tasks that can be safely distributed across cores avoiding the use of “memory
locks” while still protecting against race conditions. Adjusting task size provides for optimal load balancing and maximizes cache hits. Additionally, the
way in which tasks are mapped to execution threads has a significant influence
on the memory footprint and it is shown that minimizing memory usage is one
of the most important factors in achieving execution speed and performance on
multi-core. A novel algorithm called H-Dispatch is used to pipeline tasks to
processing cores. The performance is demonstrated in speed-up and efficiency
tests on a smooth particle hydrodynamics (SPH) flow simulator. An efficiency
of over 90% is achieved on a 24-core machine.

Recent trends in computational physics suggest a rapidly growing interest in particle based numerical techniques. While particle based methods allow robust handling of mass advection, this comes at the computational cost of managing the
spatial interactions of particles. For problems involving millions of particles, it is
necessary to develop efficient strategies for parallel implementation. A variety of
authors have reported parallel particle codes on clusters of machines (see for example Nelson et al [1], Morris et al [2], Sbalzarini et al [3], Walther and Sbalzarini [4], Ferrari et al [5]). The key difference between parallel computing on clusters and on multi-core is that memory can be shared across all cores on today’s
multi-core machines. However, there are well known challenges of thread safety
and limited memory bandwidth. A variety of approaches to programming on
multi-core have been proposed to date. Concurrency tools from traditional cluster

Simulations of Discontinua – Theory and Applications


computing, like MPI [6] and OpenMP [7], have been used to achieve fast take-up of the new technology. If one runs an MPI process on each core, the operating system guarantees memory isolation across processes. Communication of information across process boundaries requires MPI messages to be sent, which is relatively slow compared with direct memory sharing. In response to this, Chrysanthakopoulos and co-workers [7, 8] have implemented multi-core concurrency libraries using port based abstractions. These mimic the functionality of a message passing library like MPI, but use shared memory as the medium for data exchange, rather than exchanging serialized packets over TCP/IP as is the case for MPI. Such an approach provides extensibility in program structure, while still capitalizing on the speed advantages of shared memory. In an earlier paper, Holmes et al. [9] have shown that a programming model developed using such port based techniques provides significant performance advantages over MPI and OpenMP. In this paper, we apply the programming model proposed in [9] to the parallelization of particle based numerical methods on multi-core.

2. PARTICLE BASED NUMERICAL METHODS

Mesh based numerical methods have been the cornerstone of computational physics for decades. Examples of Eulerian mesh based methods include finite difference (FD) and the lattice Boltzmann method (LBM), while Lagrangian examples include the finite element method (FEM). For such methods, integration points are positioned according to some topological connectivity or mesh to ensure compatibility of the numerical interpolation. While powerful for a wide range of problems, mesh limitations for problems involving large deformation and complex material interfaces have led to significant developments in meshless and particle based methodologies. Here, integration points are positioned freely in space, capable of advecting with material in a Lagrangian sense, allowing such methods to represent a continuum in a generalized way. For methods like molecular dynamics (MD) and the discrete element method (DEM), such points represent literal particles (atoms and molecules for MD and discrete grains for DEM), while for methods like dissipative particle dynamics (DPD) [10], smooth particle hydrodynamics (SPH) [11], wavelets [12], and the reproducing kernel particle method (RKPM) [13], the particle analogy is largely figurative.

In Eulerian mesh based methods, such as FD and LBM, continuity is inherently provided by the static mesh, while for Lagrangian mesh based approaches like FEM, continuity is enforced through the use of element shape functions. From Li and Liu [14], “meshfree methods are the natural extension of finite element methods; they provide a perfect habitat for a more general and more appealing computational paradigm: the partition of unity.” The partition of unity imposed on mesh-free particle methods can be seen to be a generalization of shape functions for arbitrary integration point arrangements. Where, for a bounded domain in Euclidean space, a set of nonnegative compactly supported functions sums to unity, the value of some field function can be determined from its value at all other points via

f(x) = Σ_{k=0}^{n} a_k Φ(x − x_k)

The function f(x) can be related to any physical field expression (e.g. mechanical, hydrodynamic, chemical, electrical, magnetic etc.). Correspondingly, particles provide positions at which to enforce a partition of unity. By partitioning unity across the particles, continuity can be imposed without a defined mesh. The advantage of partition of unity methods is that any expression related to a field quantity can be imposed on the continuum. Such versatility is a key advantage of particle methods.

2.1. Strategies for Programming on Multi-Core

By eliminating the need for ghost regions and high volume to surface ratio sub-domains, shared memory enables a near arbitrary selection of spatial divisions by which to distribute a numerical task. Correspondingly, it becomes convenient to repurpose a spatial division strategy already in use in one form or other in most particle based numerical codes.

3. SPATIAL HASHING IN PARTICLE METHODS

Spatial hashing techniques have been used widely to achieve algorithmically efficient interaction detection in particle codes (see for example DEM [15, 16]). The central idea behind spatial hashing is to overlay some regular spatial structure over the randomly positioned particles, e.g. an array of equally sized cells. We can then perform spatial reasoning on the cells rather than on the particles themselves.
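As an illustration, the cell hashing described here can be sketched in a few lines. This is a minimal Python sketch, not the authors' C# implementation; the function names (`cell_key`, `build_hash`) and the uniform `cell_size` are assumptions, while the integer key formula and the dictionary of per-cell particle lists follow the paper.

```python
from collections import defaultdict

def cell_key(x, y, z, cell_size, nx, ny):
    """Hash particle coordinates to the unique integer key of its cell,
    key = (k*ny + j)*nx + i, where i, j, k are integer cell coordinates."""
    i = int(x / cell_size)
    j = int(y / cell_size)
    k = int(z / cell_size)
    return (k * ny + j) * nx + i

def build_hash(particles, cell_size, nx, ny):
    """Dictionary hash table: a list of particle indices stored per cell key."""
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(particles):
        cells[cell_key(x, y, z, cell_size, nx, ny)].append(idx)
    return cells
```

Binning all N particles is a single O(N) pass, and interaction detection then only needs to inspect a cell and its immediate neighbours rather than all particle pairs.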

Spatial hashing assigns particles to cells or ‘bins’ based on a hash of particle coordinates,

key = (k · ny + j) · nx + i

where nx and ny are the total number of cells in the x and y dimensions and i, j and k are the integer cell coordinates in the x, y and z dimensions. The numerical expense of such an algorithm is O(N) where N is particle number [15, 16]. A variety of programmatic implementations of hash cell methods have been used (for example linked lists). In this work a dictionary hash table is used, where a generic list of particles is stored for each cell and indexed by an integer key unique to that cell.

3.1. Managing Thread Safety

The cells also provide a means for defining task packages for the cores. By assigning a number of cells to each core we assure “task orthogonality” in that each core is operating on a different set of particles. We note that a core may “read” the memory of particles belonging to surrounding cells but may not update them. Traditional software applications for shared memory parallel architectures have utilized locks to avoid thread contention; task orthogonality removes the need for such locks while still protecting against race conditions. We execute a single loop in which global memory for both previous and current field values (vtn and vtn+1) is stored for each particle. Gradient terms can then be calculated as functions of values in previous memory, while updates are written to the current value memory. This allows the previous and updated terms to be maintained without needing to replace the former with the latter at the end of each step, thus ensuring a single synchronization per step. In parallel, minimizing the frequency of so called synchronization points has advantages for performance, and we utilize this “rolling memory algorithm”.

3.2. Counter Intuitive Design

In a typical SPH simulator, two operations must be done on the particles in each cell per step, the first to determine the particle number density of each particle, and the second to perform the field variable updates. Using a standard SPH formulation, both operations are performed in the same loop, separated by a synchronization. Interacting particles must be known for each of these two stages, and the performance of two structure variations is compared in this case study:

A. Interacting particles are determined in the first stage for all cells and stored in a global list for use in the second.

B. Interacting particles are determined as needed in each stage for each cell, i.e. twice per cell per time step. Recalculation of interacting particles means they need only be kept in a local thread list that is overwritten with each newly dispatched cell.
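The rolling memory arrangement can be sketched as follows. This is a minimal, single-threaded Python sketch under assumed names (`step`, `prev`, `curr`, `update`), not the production H-Dispatch code: each cell task reads only previous-step values of its own and neighbouring cells and writes only the current-step values of its own particles, so tasks on disjoint cells are orthogonal and need no locks.

```python
def step(prev, curr, cells, neighbours, update):
    """One pass over all cell tasks with rolling memory.
    prev/curr: per-particle field values for the previous and current step
    cells: dict cell_key -> list of particle indices
    neighbours: dict cell_key -> cell keys readable from that cell (incl. itself)
    update: f(own_prev_value, neighbour_prev_values) -> new value
    """
    for key, members in cells.items():      # each iteration is one orthogonal
        for p in members:                   # task; only curr[p] is written
            nbr_vals = [prev[q]
                        for nk in neighbours[key]
                        for q in cells.get(nk, [])
                        if q != p]
            curr[p] = update(prev[p], nbr_vals)
    # single synchronization point per step: roll the buffers
    prev, curr = curr, prev
    return prev, curr
```

Because gradients are always evaluated from `prev` while results go to `curr`, no copy-back of updated values is needed at the end of the step; swapping the two buffers is the only synchronization.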

The MIT SPH simulator (see http://geonumerics.mit.edu) has been validated for single and multi-phase flow. Figure 2 shows results for a two phase Rayleigh-Taylor instability [17].

Fig 1. (a) Memory vs. number of particles; (b) speed-up vs. number of cores.

While differences in execution memory between the two code versions are to be expected, the differences in execution time are more surprising. For low core counts (< 10 cores), the single search variant (A) solves more quickly than the double (B) due to fewer computations. After this point, however, the double search (B) is shown to provide marked improvements in speed over (A) (up to 50%). This can be attributed to better cache blocking of the second approach and the significantly smaller amount of data experiencing latency when being loaded from RAM to cache. The fact that such performance gains only manifest when more than 10 cores are used suggests that for fewer than 10 cores, RAM pipeline bandwidth is sufficient to handle a global interaction list.
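The speed-up and efficiency figures reported here follow the standard definitions; the small helper below (hypothetical, not from the paper) makes them explicit, with illustrative timings that are assumptions rather than measured values.

```python
def speedup(t1, tp):
    """Speed-up on p cores: serial run time divided by parallel run time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Parallel efficiency: speed-up divided by the number of cores used."""
    return speedup(t1, tp) / p
```

For example, a hypothetical serial run of 240 s that completes in 11 s on 24 cores gives a speed-up of about 21.8 and an efficiency of about 0.91, i.e. the over-90% efficiency regime reported for the 24-core machine.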

Fig 2. Rayleigh-Taylor instability example showing extremely complex fluid phase interface geometries. The clear liquid is water; the green liquid is oil.

REFERENCES

1. M. Nelson, W. Humphrey, A. Gursoy, A. Dalke, L. Kale, R. Skeel, K. Schulten, NAMD: a parallel, object-oriented molecular dynamics program, International Journal of High Performance Computing Applications 10 (1996) 251-268.
2. J. Morris, Y. Zhu, P. Fox, Parallel simulations of pore-scale flow through porous media, Computers and Geotechnics 25 (1999) 227-246.
3. I. Sbalzarini, J. Walther, M. Bergdorf, S. Hieber, E. Kotsalis, P. Koumoutsakos, PPM - a highly efficient parallel particle-mesh library for the simulation of continuum systems, Journal of Computational Physics 215 (2006) 566-588.
4. J. Walther, I. Sbalzarini, Large-scale parallel discrete element simulations of granular flow, Engineering Computations 26 (2009) 688-697.
5. A. Ferrari, M. Dumbser, E. Toro, A. Armanini, A new 3D parallel SPH scheme for free surface flows, Computers and Fluids 38 (2009) 1203-1217.
6. W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming With the Message-Passing Interface, MIT Press, Cambridge, 1999.
7. X. Qiu, G. Fox, H. Yuan, S. Bae, G. Chrysanthakopoulos, H. Nielsen, Parallel data mining on multicore clusters, in: 7th International Conference on Grid and Cooperative Computing GCC2008, 2008.
8. G. Chrysanthakopoulos, S. Singh, An asynchronous messaging library for C#, in: Proceedings of the Workshop on Synchronization and Concurrency in Object-Oriented Languages, OOPSLA 2005, San Diego, 2005, pp. 89-97.
9. D. Holmes, J. Williams, P. Tilke, An events based algorithm for distributing concurrent tasks on multi-core architectures, Computer Physics Communications 181 (2010) 341-354.
10. M. Liu, P. Meakin, H. Huang, Dissipative particle dynamics simulation of pore-scale multiphase fluid flow, Water Resources Research 43 (2007) W04411.
11. J. Monaghan, Smoothed particle hydrodynamics, Annual Review of Astronomy and Astrophysics 30 (1992) 543-574.
12. W. Liu, Y. Chen, Wavelet and multiple scale reproducing kernel methods, International Journal for Numerical Methods in Fluids 21 (1995) 901-931.
13. W. Liu, S. Hao, T. Belytschko, S. Li, C. Chang, Multi-scale methods, International Journal for Numerical Methods in Engineering.
14. S. Li, W. Liu, Meshfree Particle Methods, Springer-Verlag, Berlin Heidelberg, 2004.
15. A. Munjiza, K. Andrews, NBS contact detection algorithm for bodies of similar size, International Journal for Numerical Methods in Engineering 43 (1998) 131-149.
16. J. Williams, E. Perkins, B. Cook, A contact algorithm for partitioning N arbitrary sized objects, Engineering Computations 21 (2004) 235-248.
17. A. Tartakovsky, P. Meakin, A smoothed particle hydrodynamics model for miscible flow in three-dimensional fractures and the two-dimensional Rayleigh-Taylor instability, Journal of Computational Physics 207 (2005) 610-624.

DEM5-2010, 25-26 August 2010