A CUDA Implementation of the Standard Particle Swarm Optimization

Md. Maruf Hussain (Graduate School of Science, Osaka Prefecture University, mz301016@edu.osakafu-u.ac.jp)
Hiroshi Hattori (Graduate School of Engineering, Osaka Prefecture University, swb01128@edu.osakafu-u.ac.jp)
Noriyuki Fujimoto (Graduate School of Engineering, Osaka Prefecture University, fujimoto@cs.osakafu-u.ac.jp)
Abstract—The social learning process of birds and fishes inspired the development of the heuristic Particle Swarm Optimization (PSO) search algorithm. The advancement of Graphics Processing Units (GPU) and the Compute Unified Device Architecture (CUDA) platform plays a significant role in reducing the computational time of search algorithms. This paper presents an efficient implementation of the Standard Particle Swarm Optimization (SPSO) on a GPU based on the CUDA architecture, which uses coalescing memory access. The algorithm is evaluated on a suite of well-known benchmark optimization functions. The experiments are performed on an NVIDIA GeForce GTX 980 GPU and a single core of a 3.20 GHz Intel Core i5 4570 CPU, and the test results demonstrate that the GPU algorithm runs up to 46 times faster than the corresponding CPU algorithm. The proposed algorithm can therefore reduce the time required to solve optimization problems.

Index terms—Particle Swarm Optimization (PSO), GPGPU, coalescing memory access, cuRAND, atomic function.

I. INTRODUCTION

The Particle Swarm Optimization (PSO) algorithm, a population-based stochastic optimization technique, was first introduced by Eberhart and Kennedy in 1995 [1]. Since then, many successful applications of PSO have been reported. In many of those applications, the PSO algorithm has shown several advantages over other swarm-intelligence-based optimization algorithms due to its robustness, efficiency, and simplicity. Moreover, compared to other stochastic algorithms, it usually requires less computational effort and resources [2].

The PSO algorithm maintains a swarm of particles, each of which represents a potential solution. Here a swarm can be identified as the population and a particle as an individual. In a PSO system, each particle moves through a multidimensional search space and adjusts its position based on its own experience and that of neighboring particles.

On a CPU, this process is implemented as serial processing based on task scheduling, whereas on a GPU many particles can update their positions simultaneously, which improves PSO efficiency significantly. In recent years, the GPU has become a very popular platform for parallel computing, mainly due to changes in architecture and the development of the CUDA and OpenCL languages. Previously reported work has shown that PSO implementations on a GPU provide better performance than CPU-based implementations [3].

This paper presents a new parallelized implementation of the Standard Particle Swarm Optimization (SPSO), an extension of PSO, partially using coalescing memory access. The experimental results show that the proposed parallelized SPSO increases processing speed compared to a previously proposed approach on a GPU based on the CUDA architecture. We investigate a large number of iterations to reach a good optimization solution. We also investigate the impact of fine-grained parallelism on high-dimensional problems. A large swarm of particles is analyzed to achieve a desired SPSO solution with improved computational time compared to a CPU.

The remainder of this paper is organized as follows. In Section II, we briefly sketch the PSO, the SPSO, GPU computing, the CUDA architecture, coalescing memory access, and the GeForce GTX 980 GPU. After that, Section III summarizes some related work. In Section IV we describe the SPSO and its implementation on a GPU. Subsequently, in Section V, we present and analyze the obtained results and compare them to the previous implementation in terms of execution time, speedup, and fitness values. Finally, in Section VI, we give some concluding remarks and point out directions for future work.

II. PRELIMINARIES

A. The Particle Swarm Optimization

A PSO system simulates the behavior of a flock of birds. In a bird flock, some birds randomly search for food in an area. If there is only one piece of food in the area, none of the birds knows its exact location. However, with each iteration, they learn how far away the food is. For an individual bird, the most effective way to find the food source is to follow the bird nearest to the food.

PSO is an analogous algorithm: it learns from past scenarios and uses them to solve optimization problems. Compared to the bird scenario, each single solution in PSO is a bird in the search space; we call it a particle. All particles are evaluated by their fitness values, which are computed by the fitness function to be optimized, and have velocities which direct their flight. The particles fly through the problem space by following the current optimum particles.
Many applications have been accelerated based on CUDA, for example, matrix multiplication, real-time visual hull computation, and image denoising [10][11][12]. In this paper, we intend to implement SPSO in parallel on a GPU to accelerate its running speed.
Mussi et al. [17] proposed two different ways of exploiting GPU parallelism. The execution speeds of the two parallel algorithms are compared with a sequential implementation of SPSO on functions that are typically used as benchmarks for PSO.

Li and Zhang [18] proposed a CUDA-based multichannel particle swarm algorithm. Optimization experiments on four benchmark functions (Sphere, Rastrigin, Griewank, and Rosenbrock) showed that the CUDA-based parallel algorithm can greatly reduce computing time and improve computing accuracy. A comparison on a GeForce GTX 480 GPU against an Intel Core i7 860 also showed that, as the population gradually increases, the speedup increases as well.

The implementation by Calazan et al. [19] of a Cooperative Parallel Particle Swarm Optimization (CPPSO) for high-dimension problems on GPUs showed that the proposed architecture is up to 135 times, and no less than 20 times, faster in terms of optimization time when compared to direct software execution of the algorithm.

Zhou and Tan [20] compared a CPU-based sequential Multiobjective Particle Swarm Optimization (MOPSO) with a GPU-based parallel MOPSO. The GPU-based parallel MOPSO is much more efficient in terms of running time, with speedups ranging from 3.74 to 7.92 times.

Zhu et al. [21] presented a faster parallel Euclidean Particle Swarm Optimization (pEPSO). Five benchmark functions were employed to examine its performance. Experimental results showed that the average processing time of the fitness calculation was accelerated to a maximum of 16.27 times over the original algorithm (EPSO).

Silva and Filho [22] proposed the use of multiple sub-swarms. Each sub-swarm is executed in a GPU block, aiming at maximizing data alignment and avoiding instruction bifurcation. They also provided two communication mechanisms and two topologies that allow the sub-swarms to exchange information and collaborate through GPU global memory. They showed speedups of up to 100 and 5 times when compared to the serial implementation and a state-of-the-art PSO implementation for CUDA, respectively.

Bali et al. [23] illustrated a novel parallel approach to run SPSO on GPUs, applied to the TSP (GPU-PSO-A-TSP). Results showed that the running speed of GPU-PSO is four times that of CPU-PSO.

IV. IMPLEMENTING SPSO USING CUDA

In this section we show the main parts of the SPSO code developed for a GPU. The SPSO algorithm, expressed in CUDA-based pseudocode, is given in Algorithm 2. We used seven kernel functions. The first kernel allocates memory on the GPU with 1 block of 1 thread. In order to generate random numbers on the GPU, we used the cuRAND library [24] with an independent seed for each thread. The cuRAND library focuses on the simple and efficient generation of high-quality pseudorandom and quasirandom numbers.

Algorithm 2 GPU SPSO algorithm

 1  let n = number of particles
 2  let d = number of dimensions
 3  allocate memory on a GPU with 1 block of 1 thread
 4  generate random number seeds using cuRAND and initialize the velocity
    and position of each particle on a GPU with n blocks of d threads
 5  for i = 1 to iterations do
 6    calculate fitness and Pbest on a GPU with (n+32-1)/32 blocks of 32 threads
 7    calculate Lbest on a GPU with (n+32-1)/32 blocks of 32 threads
 8    update velocity and position on a GPU with n blocks of d threads
 9    calculate Gbest by atomicMin() on a GPU with (n+32-1)/32 blocks of 32 threads
10  transfer Gbest to a CPU
11  free seeds using a kernel with 1 block of 1 thread
12  free memory on a GPU
13  return the best result and corresponding position

The second kernel generates random number seeds using cuRAND and initializes the basic information of each particle, such as position and velocity, on a GPU with n blocks of d threads. For this initialization, we used coalescing memory access. The arrays on VRAM are arranged to realize coalescing memory access, in particular the array of random number seeds.

The third kernel launches (n+32-1)/32 blocks of 32 threads to compute the fitness function. This kernel performs the reduction process to obtain the fitness value. When this process is completed, the current Pbest value of each particle is compared with the previous Pbest value; if the current value is smaller, the Pbest is updated. When the Pbest is updated, the threads of this kernel also update the coordinates of the Pbest value accordingly. The CUDA pseudocode for the fitness and Pbest calculator kernel is given in Fig. 3.

    __global__ void evaluate_particles(int d, int n, float *x, float *pValue,
                                       float *pBestValue, float *pBestPos,
                                       int *lBestIdx)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i >= n) return;
        pValue[i] = fitness_function(d, &x[d * i]);

        if (pValue[i] < pBestValue[i]) {
            pBestValue[i] = pValue[i];
            // copy the particle's current position into its Pbest position
            pBestPos[d*i .. d*i + d - 1] = x[d*i .. d*i + d - 1];
        }
    }

Fig. 3. A kernel function for fitness and Pbest calculation.

The next kernel computes the local best (Lbest) value by comparing the previous Pbest of the neighboring (left and right) particles. The details of the Lbest calculator kernel are shown in Fig. 4. Here the neighbors of a particle are particles (tid+1) and (tid-1). During this process, at each end of the array, illegal (out-of-array) accesses would occur without special treatment of the boundary indices.
    __global__ void calculate_localBest(int n, float *pBestValue, int *lBestIdx)
    {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        if (tid >= n) return;
        // ring topology: wrap around at both ends of the particle array
        int right = (tid == (n - 1)) ? 0 : tid + 1;
        int left  = (tid == 0) ? (n - 1) : tid - 1;
        int lBestCandidate = tid;
        if (pBestValue[right] < pBestValue[lBestCandidate]) lBestCandidate = right;
        if (pBestValue[left]  < pBestValue[lBestCandidate]) lBestCandidate = left;
        lBestIdx[tid] = lBestCandidate;
    }

Fig. 4. A kernel function for Lbest calculation.

TABLE I
BENCHMARK TEST FUNCTIONS

Name        Equation                                                                                    Bounds
Sphere      f1 = Σ_{i=0}^{d} x_i²                                                                       (−5.12, 5.12)^d
Rosenbrock  f2 = Σ_{i=0}^{d−1} [100(x_{i+1} − x_i²)² + (x_i − 1)²]                                      (−10, 10)^d
Rastrigin   f3 = Σ_{i=0}^{d} [x_i² − 10·cos(2πx_i) + 10]                                                (−5.12, 5.12)^d
Griewank    f4 = (1/4000) Σ_{i=1}^{d} x_i² − Π_{i=1}^{d} cos(x_i/√i) + 1                                (−600, 600)^d
Ackley      f5 = −20·exp[−(1/5)·√((1/d) Σ_{i=1}^{d} x_i²)] − exp[(1/d) Σ_{i=1}^{d} cos(2πx_i)] + 20 + e (−32.768, 32.768)^d
De Jong     f6 = Σ_{i=1}^{d} |x_i|^(i+1)                                                                (−1, 1)^d
TABLE II
GPU SPSO AND CPU SPSO ON f1 (NUMBER OF DIMENSIONS d = 50)

n     CPU Time(s)  GPU Time(s)  CPU loop Time(s)  GPU loop Time(s)  CPU Gbest Value  GPU Gbest Value  Speedup
2000  5.7992       0.4638       0.0029            0.0001            8.91E-12         8.65E-12         12.5

TABLE III
GPU SPSO AND CPU SPSO ON f4 (NUMBER OF DIMENSIONS d = 50)

n     CPU Time(s)  GPU Time(s)  CPU loop Time(s)  GPU loop Time(s)  CPU Gbest Value  GPU Gbest Value  Speedup
2000  10.2989      0.5097       0.0051            0.0001            0.00E+00         0.00E+00         20.2

Fig. 7. Gbest and generation for benchmark test functions.

TABLE V
GPU SPSO AND CPU SPSO ON f1 (NUMBER OF PARTICLES n = 2000)

d    CPU Time(s)  GPU Time(s)  CPU loop Time(s)  GPU loop Time(s)  CPU Gbest Value  GPU Gbest Value  Speedup
50   5.7992       0.4638       0.0028            0.0001            8.91E-12         8.65E-12         12.5

TABLE VI
GPU SPSO AND CPU SPSO ON f4 (NUMBER OF PARTICLES n = 2000)

d    CPU Time(s)  GPU Time(s)  CPU loop Time(s)  GPU loop Time(s)  CPU Gbest Value  GPU Gbest Value  Speedup
50   10.2989      0.5097       0.0051            0.0001            0.00E+00         0.00E+00         20.2

Fig. 9. Overlap of computation time.
Fig. 10. Overlap of loop time.

Due to the difference between the random number sequence on a CPU and that on a GPU, the Gbest values may be slightly different. Moreover, execution time depends on the function type and on how many operators are used inside a function. We focused on execution time and speedup for a fixed number of iterations.

The other implementation [14] of SPSO is slower than our implementation under the same dimension and the same population size (see Table VIII). The key point of our implementation is good parallelization. In the other implementation of SPSO, the random numbers were generated on a CPU, whereas in our implementation all the random numbers are generated on the GPU. The implementation of the rand() function on a CPU is very simple and not time-consuming. However, the transfer of the generated random numbers from the CPU to the GPU is time-consuming, and this increases the overall processing time of the other implementation. In our work, the random numbers are generated by the cuRAND function on the GPU, which reduces the processing time significantly. Moreover, we used the atomicMin() function to compute Gbest directly on the GPU.

TABLE VII
GPU SPSO AND CPU SPSO ON f7 (NUMBER OF PARTICLES n = 2000)

d    CPU Time(s)  GPU Time(s)  CPU loop Time(s)  GPU loop Time(s)  CPU Gbest Value  GPU Gbest Value  Speedup
50   23.9281      0.5176       0.0119            0.0001            0.00E+00         0.00E+00         46.2
100  49.3385      1.8221       0.0246            0.0003            0.00E+00         0.00E+00         27.0
150  72.9061      3.4420       0.0364            0.0004            0.00E+00         0.00E+00         21.1
200  96.5463      5.0681       0.0482            0.0005            0.00E+00         0.00E+00         19.0

TABLE VIII
A COMPARISON BETWEEN [14] AND OURS ON f4 (NUMBER OF DIMENSIONS d = 50, NUMBER OF ITERATIONS 10000)

VI. CONCLUSION

This paper has presented an implementation of the SPSO on the CUDA architecture. The proposed GPU SPSO significantly reduces execution time compared to previous work. We achieved a good fitness value with short execution time and kernel loop time simultaneously. Moreover, the implementation shows a significant speedup, up to 46 times, when compared to the CPU serial implementation.

In this paper we focused on realizing a good implementation of the original SPSO. In future work, we want to investigate parallel implementations of the extensions [15][19][20] of SPSO to improve the efficiency of other implementations. We also want to improve SPSO performance further by implementing all kernel functions with coalescing memory access.
R EFERENCES
[1] J. Kennedy and R. C. Eberhart: Particle Swarm Optimization, IEEE In-
ternational Conference on Neural Networks, Vol.4, pp.1942–1948 (1995)
[2] A. P. Engelbrecht: Fundamentals of Computational Swarm Intelligence,
Wiley (2005)
[3] J. Kołodziejczyk: Survey on Particle Swarm Optimization accelerated on GPGPU, International Journal of Scientific Engineering and Research, Vol.5, No.12, pp.2229–5518 (2014)
[4] X. Hu: Particle Swarm Optimization, www.swarmintelligence.org (2004)
[5] D. Bratton and J. Kennedy: Defining a Standard for Particle Swarm
Optimization, IEEE Swarm Intelligence Symposium, pp.120-127 (2007)
[6] K. E. Hoff III, T. Culver, J. Keyser, M. Lin, and D. Manocha: Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware, 26th Annual Conference on Computer Graphics and Interactive Techniques, pp.277–286 (1999)
[7] Z. W. Luo, H. Liu, and X. Wu: Artificial Neural Network Computation on Graphic Process Unit, IEEE International Joint Conference on Neural Networks, Vol.1, pp.622–626 (2005)
[8] NVIDIA: CUDA Programming Guide 7.5, http://www.nvidia.com/object/cuda_develop.html (2015)
[9] A. Munshi: The OpenCL Extension Specification 1.2, http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf (2012)
[10] W. Liu and B. Vinter: An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data, 28th IEEE International Parallel and Distributed Processing Symposium, pp.370–381 (2014)
[11] G. Dafeng and W. Xiaojun: Real-time Visual Hull Computation Based on GPU, IEEE International Conference on Robotics and Biomimetics (ROBIO), pp.1792–1797 (2015)
[12] A. P. Yazdanpanah, A. K. Mandava, E. E. Regentova, V. Muthukumar, and G. Bebis: A CUDA Based Implementation of Locally- and Feature-Adaptive Diffusion Based Image Denoising Algorithm, 11th International Conference on Information Technology: New Generations (ITNG), pp.388–393 (2014)
[13] NVIDIA: GeForce GTX 980 for Desktop
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980
[14] Y. Zhou and Y. Tan: GPU-based Parallel Particle Swarm Optimization,
11th IEEE Congress on Evolutionary Computation, pp.1493–1500 (2009)
[15] R. M. Calazan, N. Nedjah, and L. D. M. Mourelle: Parallel GPU-based Implementation of High Dimension Particle Swarm Optimizations, IEEE Fourth Latin American Symposium on Circuits and Systems (LASCAS), pp.1–4 (2013)
[16] V. K. Reddy and L. S. S. Reddy: Performance Evaluation of Particle Swarm Optimization Algorithms on GPU Using CUDA, International Journal of Computer Science and Information Technologies, Vol.5, No.1, pp.65–81 (2012)
[17] L. Mussi, F. Daolio, and S. Cagnoni: Evaluation of Parallel Particle Swarm Optimization Algorithms within the CUDA Architecture, Information Sciences, Vol.181, pp.4642–4657 (2011)
[18] W. Li and Z. Zhang: A CUDA-based Multichannel Particle Swarm
Algorithm, International Conference on Control, Automation and Systems
Engineering (CASE), pp.1–4 (2011)
[19] R. M. Calazan, N. Nedjah, and L. D. M. Mourelle: A Cooperative Parallel Particle Swarm Optimization for High-Dimension Problems on GPUs, BRICS Congress on Computational Intelligence and 11th Brazilian Congress on Computational Intelligence, pp.356–361 (2013)
[20] Y. Zhou and Y. Tan: GPU-based Parallel Multiobjective Particle Swarm Optimization, International Journal of Artificial Intelligence, Vol.7, No.A11 (2011)
[21] H. Zhu, Y. Guo, J. Wu, and J. Gu: Paralleling Euclidean Particle Swarm Optimization in CUDA, 4th International Conference on Intelligent Networks and Intelligent Systems (ICINIS), pp.93–96 (2011)
[22] E. H. M. Silva and C. J. A. B. Filho: PSO Efficient Implementation on
GPUs Using Low Latency Memory, IEEE Latin America Transactions,
Vol.13, No.5, pp.1619–1624 (2015)
[23] O. Bali, W. Elloumi, P. Krömer, and A. M. Alimi: GPU Particle Swarm Optimization Applied to Travelling Salesman Problem, IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp.112–119 (2015)
[24] NVIDIA: cuRAND Library 7.5, http://docs.nvidia.com/cuda/pdf/CURAND_Library.pdf (2015)