
18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing

A CUDA Implementation of the Standard Particle Swarm Optimization

Md. Maruf Hussain, Graduate School of Science, Osaka Prefecture University, Email: mz301016@edu.osakafu-u.ac.jp
Hiroshi Hattori, Graduate School of Engineering, Osaka Prefecture University, Email: swb01128@edu.osakafu-u.ac.jp
Noriyuki Fujimoto, Graduate School of Engineering, Osaka Prefecture University, Email: fujimoto@cs.osakafu-u.ac.jp

Abstract: The social learning process of birds and fishes inspired the development of the heuristic Particle Swarm Optimization (PSO) search algorithm. The advancement of Graphics Processing Units (GPUs) and the Compute Unified Device Architecture (CUDA) platform plays a significant role in reducing the computational time of search algorithms. This paper presents an implementation of the Standard Particle Swarm Optimization (SPSO) on a GPU based on the CUDA architecture, which uses coalescing memory access. The algorithm is evaluated on a suite of well-known benchmark optimization functions. The experiments are performed on an NVIDIA GeForce GTX 980 GPU and a single core of a 3.20 GHz Intel Core i5 4570 CPU, and the test results demonstrate that the GPU algorithm runs up to about 46 times faster than the corresponding CPU algorithm. Therefore, the proposed algorithm can be used to reduce the time required to solve optimization problems.

Index terms: Particle Swarm Optimization (PSO), GPGPU, coalescing memory access, cuRAND, atomic function.

I. INTRODUCTION

The Particle Swarm Optimization (PSO), a population-based stochastic optimization technique, was first introduced by Eberhart and Kennedy in 1995 [1]. Since then, many successful applications of PSO have been reported. In many of those applications, the PSO algorithm has shown several advantages over other swarm intelligence based optimization algorithms due to its robustness, efficiency and simplicity. Moreover, compared to other stochastic algorithms, it usually requires less computational effort and fewer resources [2].

The PSO algorithm maintains a swarm of particles, each of which represents a potential solution. Here a swarm can be identified as the population and a particle as an individual. In a PSO system, each particle flows through a multidimensional search space and adjusts its position based on its own experience and that of neighboring particles.

On a CPU, this process is implemented by scheduling the tasks for serial processing, whereas on a GPU, many particles can update their positions simultaneously, which improves the PSO efficiency significantly. In recent years, the GPU has become a very popular platform for parallel computing, mainly due to changes in architecture and the development of the CUDA and OpenCL languages. Previously reported works have shown that a PSO implementation on a GPU provides better performance than CPU-based implementations [3].

This paper presents a new parallelized implementation of the Standard Particle Swarm Optimization (SPSO), an extension of PSO, partially using coalescing memory access. The experimental results show that the proposed parallelized SPSO increases the processing speed compared to a previously proposed approach on a GPU based on the CUDA architecture. We investigate a large number of iterations to reach a good optimization solution. We also investigate the impact of fine-grained parallelism in high-dimensionality problems. A large swarm of particles is analyzed to achieve a desired SPSO solution with improved computational time compared to a CPU.

The remainder of this paper is organized as follows. In Section II, we briefly sketch the PSO, the SPSO, GPU computing, the CUDA architecture, coalescing memory access and the GeForce GTX 980 GPU. After that, Section III summarizes some related works. In Section IV we describe the SPSO and its implementation on a GPU. Subsequently, in Section V, we present and analyze the obtained results and compare them to the previous implementation in terms of execution time, speedup and fitness values. Finally, in Section VI, we give some concluding remarks and point out directions for future work.

II. PRELIMINARIES

A. The Particle Swarm Optimization

A PSO system simulates the behavior of a bird flock. In a bird flock, some birds are randomly searching for food in an area. Suppose there is only one piece of food in the area and none of the birds knows its exact location. However, with each iteration, they know how far away the food is. For an individual bird, the most effective way to find the food source is to follow the bird which is nearest to the food.

PSO is a similar algorithm: it learns from this scenario and uses it to solve optimization problems. In terms of the bird scenario, each single solution in a PSO is a bird in the search space. We call it a particle. All particles have fitness values, which are evaluated by the fitness function to be optimized, and velocities, which direct the flying of the particles. The particles fly through the problem space by following the current optimum particles.

2470-881X/16 $31.00 © 2016 IEEE
DOI 10.1109/SYNASC.2016.37
Initially, a PSO starts with a group of random particles (solutions). Each particle then searches for optima by updating its state over successive generations. In every iteration, each particle is updated by following two best values. The first is the personal best, the best fitness value the particle has achieved so far, called Pbest. The second is the best value obtained so far by any particle in the population. This is the global best, called Gbest.
Alternatively, a particle together with its topological neighbors can be considered as one part of the population. In that case, the best value within that part is a local best, called Lbest [4]. In a PSO, the position and velocity of a particle within the domain of the fitness function are computed by the following two equations.

Velocity update (Lbest version):

$$v_{id}(t+1) = w\,v_{id}(t) + c_1 \cdot rand_1() \cdot (PbestPos_{id} - x_{id}(t)) + c_2 \cdot rand_2() \cdot (LbestPos_{id} - x_{id}(t))$$

Position update:

$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1)$$

where $i$ is the index of a particle and $t$ is the time in generations. $v_{id}(t)$ is the $d$th component of the current velocity vector of the $i$th particle and $v_{id}(t+1)$ is the $d$th component of the modified velocity of the $i$th particle. $PbestPos_{id}$ is the $d$th component of the personal best position of the $i$th particle. $LbestPos_{id}$ is the $d$th component of the local best position of the $i$th particle. $rand_1()$ and $rand_2()$ are random numbers in [0,1]. $x_{id}(t)$ is the $d$th component of the position vector of the $i$th particle and $x_{id}(t+1)$ is the $d$th component of the updated position of the $i$th particle. $c_1$ and $c_2$ are constants and $w$ is a constant called the inertia weight. $rand_1()$ and $rand_2()$ are independent of each other. Moreover, $rand_1()$ and $rand_2()$ are each independent among particles. Therefore, many independent random sequences are required.

B. The Standard Particle Swarm Optimization

In 2007, Bratton and Kennedy [5] defined the Standard Particle Swarm Optimization (SPSO), a straightforward extension of the original algorithm that takes into account more recent developments that can be expected to improve performance on standard measures. This standard algorithm is intended for use both as a baseline for performance testing of improvements to the technique and as a representation of PSO to the wider optimization community.

In SPSO, every particle uses only a local best particle for velocity updating, which is chosen from its left neighbor, its right neighbor and itself. We call this a ring topology, as shown in Fig. 1.

Fig. 1. Ring topology

Similar to the inertia weight $w$, SPSO introduces a parameter $\chi$ known as the constriction factor, which is derived from the existing constants in the velocity update equation:

$$\chi = \frac{2}{\left| 2 - \phi - \sqrt{\phi^2 - 4\phi} \right|}, \qquad \phi = c_1 + c_2, \quad \phi > 4$$

Using the constant $\phi = 4.1$ to ensure convergence, the values $\chi = 0.72984$ and $c_1 = c_2 = 2.05$ are obtained. There are, however, other possible choices for the constriction coefficients.

This constriction factor is applied to the entire velocity update equation:

$$v_{id}(t+1) = \chi \left( v_{id}(t) + c_1 \cdot rand_1() \cdot (PbestPos_{id} - x_{id}(t)) + c_2 \cdot rand_2() \cdot (LbestPos_{id} - x_{id}(t)) \right)$$

The CPU SPSO algorithm is described in Algorithm 1.

Algorithm 1 CPU SPSO algorithm

    for i = 1 to n do
        initialize the position & velocity of particle i randomly
        initialize the Pbest & Lbest of particle i to infinity
    for j = 1 to iterations do
        for i = 1 to n do
            compute fitness[i]
            if fitness[i] < Pbest[i] then
                Pbest[i] := fitness[i]
        for i = 1 to n do
            l := 1 + (n + i - 2) % n      (left neighbor of i on the ring)
            r := 1 + i % n                (right neighbor of i on the ring)
            LbestIndex[i] := i
            if Pbest[r] < Pbest[LbestIndex[i]] then
                LbestIndex[i] := r
            if Pbest[l] < Pbest[LbestIndex[i]] then
                LbestIndex[i] := l
        for i = 1 to n do
            update velocity & position of particle i
    Gbest := infinity
    for i = 1 to n do
        if Pbest[i] < Gbest then
            Gbest := Pbest[i]

In the CPU SPSO algorithm, the personal best position and value of each particle are first updated from its current fitness. After that, the local best is updated by comparing the particle's own Pbest with those of its right (Pbest[r]) and left (Pbest[l]) neighbors. Finally, the velocity and position of each particle are updated (see Algorithm 1).

In a parallel processing system, the individual particles can find their best positions simultaneously, which increases the SPSO efficiency significantly. In our SPSO on CUDA, the fitness function, Pbest, Lbest, Gbest, and the updated positions and velocities of the particles in the swarm are computed on a GPU, and the initialization is also done on a GPU.

C. GPU Computing

Over the years, GPUs have evolved into highly parallel, multi-threaded, multi-core processor architectures due to the insatiable demand for real-time high-definition graphics. Due to the architectural difference, most recent GPUs greatly outperform CPUs in certain applications where a large amount of data is processed simultaneously to provide real-time output. Moreover, GPUs can be integrated into high performance computing with relative ease and at lower cost compared to CPUs. Due to those advantages, in recent years GPUs have entered mainstream applications and have been used successfully in many areas such as computer vision, Voronoi diagram computation and neural network computation [6], [7].

Over the years, many platforms and programming models have been proposed for GPU computing, of which the most important are CUDA [8] and OpenCL [9]. Both platforms are based on the C and C++ languages and share very similar platform, execution, memory and programming models.

D. An Overview of CUDA Architecture

This section illustrates the CUDA architecture. CUDA is a parallel computing platform and programming model introduced by NVIDIA. The CUDA programming model is a multithreaded programming model which utilizes the multi-core parallel processing power of a GPU to solve complex computational problems without requiring the programmer to map them onto a graphics API. In a CUDA program, the threads are organized in a two-level hierarchy: a grid and thread blocks. A thread block is a set of threads and has dimensionality 1, 2, or 3. A grid is a set of thread blocks with the same size and dimensionality. A kernel function call generates threads as a grid with given dimensionality and size. The threads in a thread block can share data efficiently via shared memory. However, the maximum number of threads per block is limited to 1024. So, if more than 1024 threads are required, we have to partition the threads into several thread blocks of the same size. Some applications have already been developed based on CUDA, for example, matrix multiplication, real-time visual hull computation, image denoising, and so on [10] [11] [12]. In this paper, we implement SPSO on a GPU in parallel to accelerate its running speed.

E. Coalescing memory accesses

GPUs provide high-bandwidth memory. The GPU loads and stores data in device global memory. It is important to follow the right memory access pattern to get the maximum memory bandwidth. The most efficient way to access global memory on a GPU is the coalesced access pattern.

Global memory resides in Video RAM (VRAM). VRAM bandwidth is used most efficiently when the simultaneous memory accesses by the threads in a warp (during the execution of a single read or write instruction) fall in a single memory segment of 32, 64, or 128 bytes. This is called coalesced memory access (see Fig. 2). On the other hand, VRAM bandwidth is used inefficiently, and the access takes more time, when the simultaneous memory accesses by the threads in a warp are non-coalesced. In this case, at least two memory segments are accessed by the threads. If the threads access many different segments of the VRAM, the bandwidth efficiency decreases dramatically.

Fig. 2. Memory access on a GPU

F. GeForce GTX 980

This section describes the GeForce GTX 980 GPU. Detailed information about this GPU can be found in [13]. Compared to the older GeForce GTX 780Ti, the 980 offers similar performance levels at reduced average power consumption. The GeForce GTX 980 ships with a total of 2048 CUDA cores. It features four 64-bit memory controllers (256-bit total), and 512KB of L2 cache is tied to each memory controller.

III. RELATED WORKS

Y. Zhou and Y. Tan [14] presented a parallel approach to run SPSO on a GPU. Experiments were conducted by running SPSO both on a GPU and on a CPU. The running time of the SPSO on a GPU is greatly shortened compared to that of the SPSO on a CPU. The GPU-SPSO can run more than 11 times as fast as the CPU-SPSO.

Calazan et al. [15] proposed a GPU based Parallel Dimension Particle Swarm Optimization (PDPSO). For optimization problems with low computational complexity, i.e., low dimensions, the CPU based PDPSO gives better performance than the GPU based PDPSO. A GPU has a positive impact on large optimization problems. A fine-grained model is used, i.e., one dimension is distributed to one thread. GPU-PDPSO is 85 times faster than CPU-PDPSO.

V. K. Reddy and S. Reddy [16] implemented a parallel asynchronous version and a synchronous version of PSO on a GPU and compared their performance in terms of execution time and speedup with the sequential version on a CPU.

Mussia et al. [17] proposed two different ways of exploiting GPU parallelism. The execution speeds of the two parallel algorithms are compared with a sequential implementation of SPSO on functions which are typically used as benchmarks for PSO.

Li and Zhang [18] proposed a CUDA based multichannel particle swarm algorithm. Optimization experiments on four benchmark functions (Sphere, Rastrigin, Griewank and Rosenbrock) showed that the CUDA-based parallel algorithm can greatly save computing time and improve computing accuracy. A comparison of results on a GeForce GTX 480 GPU with an Intel Core i7 860 also showed that the speedup increases as the population gradually increases.

Calazan et al. [19] implemented a Cooperative Parallel Particle Swarm Optimization (CPPSO) for high-dimension problems on GPUs. Their results showed that the proposed architecture is up to 135 times and not less than 20 times faster in terms of optimization time when compared to the direct software execution of the algorithm.

Zhou and Tan [20] compared a CPU based sequential Multiobjective Particle Swarm Optimization (MOPSO) with a GPU based parallel MOPSO. The GPU based parallel MOPSO is much more efficient in terms of running time, with speedups ranging from 3.74 to 7.92 times.

Zhu et al. [21] presented a faster parallel Euclidean Particle Swarm Optimization (pEPSO). Five benchmark functions were employed to examine the performance of the pEPSO. Experimental results showed that the average processing time of calculating fitness was accelerated by up to 16.27 times relative to the original algorithm (EPSO).

Silva and Filho [22] proposed to use multiple sub-swarms. Each sub-swarm is executed in a GPU block, aiming at maximizing data alignment and avoiding instruction bifurcations. They also provided two communication mechanisms and two topologies in order to allow the sub-swarms to exchange information and collaborate by using the GPU global memory. They showed speedups of up to 100 times compared to the serial implementation and up to 5 times compared to a state-of-the-art PSO implementation for CUDA.

Bali et al. [23] illustrated a novel parallel approach to run SPSO on GPUs and applied it to the TSP (GPU-PSO-A-TSP). Results showed that the running speed of GPU-PSO is four times as fast as that of CPU-PSO.

IV. IMPLEMENTING SPSO USING CUDA

In this section we show the main parts of the SPSO code developed for a GPU. The SPSO algorithm, expressed in a CUDA-based pseudocode, is given in Algorithm 2.

Algorithm 2 GPU SPSO algorithm

    let n = number of particles
    let d = number of dimensions
    allocate memory on a GPU with 1 block of 1 thread
    generate random number seeds using cuRAND & initialize velocity & position
        of each particle on a GPU with n blocks of d threads
    for i = 1 to iterations do
        calculate fitness & Pbest on a GPU with (n + 32 - 1) / 32 blocks of 32 threads
        calculate Lbest on a GPU with (n + 32 - 1) / 32 blocks of 32 threads
        update velocity & position on a GPU with n blocks of d threads
    calculate Gbest by atomicMin() on a GPU with (n + 32 - 1) / 32 blocks of 32 threads
    transfer Gbest to a CPU
    free seeds using 1 block of 1 thread kernel function
    free memory on a GPU
    return the best result & corresponding position

We used seven kernel functions. The first kernel allocates memory on a GPU with 1 block of 1 thread. In order to generate random numbers on a GPU, we used the cuRAND library [24], with an independent seed for each thread. The cuRAND library focuses on simple and efficient generation of high-quality pseudorandom and quasirandom numbers.

The second kernel generates random number seeds using cuRAND and initializes the basic information of each particle, such as position and velocity, on a GPU with n blocks of d threads. For this initialization, we used coalescing memory access. The arrays in VRAM, in particular the array of random number seeds, are arranged to realize coalescing memory access.

The third kernel generates (n+32-1)/32 blocks of 32 threads to compute the fitness function. This kernel performs the reduction process to get the fitness value. When this process is completed, the current fitness value of each particle is compared with its previous Pbest value. If the current value is smaller than the previous Pbest value, then Pbest is updated. When the Pbest is updated, the threads of this kernel also update the coordinates of the Pbest position accordingly. The CUDA pseudo-code for the fitness and Pbest kernel is given in Fig. 3.

    __global__ void evaluate_particles(int d, int n, float *x, float *pValue,
                                       float *pBestValue, float *pBestPos, int *lBestIdx)
    {
        // One thread per particle.
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i >= n) return;
        // fitness_function is the device function for the benchmark being optimized.
        pValue[i] = fitness_function(d, &x[d * i]);

        if (pValue[i] < pBestValue[i]) {
            pBestValue[i] = pValue[i];
            // Copy the d coordinates of particle i into its personal best position.
            for (int k = 0; k < d; ++k)
                pBestPos[d * i + k] = x[d * i + k];
        }
    }

Fig. 3. A kernel function for fitness and Pbest calculation.

The next kernel computes the local best (Lbest) value by comparing the Pbest values of the neighboring (left and right) particles. The details of the Lbest kernel are shown in Fig. 4. Here the neighbors of a particle are particles (tid+1) and (tid-1). During this process, at each end of the array, it will cause illegal access (out of array) if no special

consideration is given. Therefore, we set exception handling for particles number "0" and "n-1". For number "0" the left neighbor was set as "n-1" and for number "n-1" the right neighbor was set as "0". This handling enables us to implement the ring topology. The local best position of the particle is calculated based on this definition of the right and left neighbors.

    __global__ void calculate_localBest(int n, float *pBestValue, int *lBestIdx)
    {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        if (tid >= n) return;
        // Wrap around at both ends of the array to form a ring.
        int right = (tid == (n - 1)) ? 0 : tid + 1;
        int left = (tid == 0) ? (n - 1) : tid - 1;
        int lBestCandidate = tid;
        if (pBestValue[right] < pBestValue[lBestCandidate]) lBestCandidate = right;
        if (pBestValue[left] < pBestValue[lBestCandidate]) lBestCandidate = left;
        lBestIdx[tid] = lBestCandidate;
    }

Fig. 4. A kernel function for Lbest calculation.

TABLE I
BENCHMARK TEST FUNCTIONS

| Name | Equation | Bounds |
|---|---|---|
| Sphere | $f_1 = \sum_{i=0}^{d} x_i^2$ | $(-5.12, 5.12)^d$ |
| Rosenbrock | $f_2 = \sum_{i=0}^{d-1} \left( 100 (x_{i+1} - x_i^2)^2 + (x_i - 1)^2 \right)$ | $(-10, 10)^d$ |
| Rastrigin | $f_3 = \sum_{i=0}^{d} \left[ x_i^2 - 10 \cos(2\pi x_i) + 10 \right]$ | $(-5.12, 5.12)^d$ |
| Griewank | $f_4 = \frac{1}{4000} \sum_{i=1}^{d} x_i^2 - \prod_{i=1}^{d} \cos\left( \frac{x_i}{\sqrt{i}} \right) + 1$ | $(-600, 600)^d$ |
| Ackley | $f_5 = -20 \exp\left[ -\frac{1}{5} \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2} \right] - \exp\left[ \frac{1}{d} \sum_{i=1}^{d} \cos(2\pi x_i) \right] + 20 + e$ | $(-32.768, 32.768)^d$ |
| De Jong | $f_6 = \sum_{i=1}^{d} \lvert x_i \rvert^{(i+1)}$ | $(-1, 1)^d$ |
| Easom | $f_7 = -(-1)^d \left( \prod_{i=1}^{d} \cos^2(x_i) \right) \exp\left[ -\sum_{i=1}^{d} (x_i - \pi)^2 \right]$ | $(-2\pi, 2\pi)^d$ |

The fifth kernel generates n blocks of d threads, which compute velocities and positions for the next iteration. In this function, we also used coalescing memory access. Our experiment showed that coalesced access is faster than non-coalesced access. Coalescing memory access is therefore an important reason why our implementation is faster than the related work [14].

We also used an atomic function [8], which performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory. Atomic operations are an advanced feature provided by the GPU vendor NVIDIA. The operation is atomic in that no other thread can access this address until the operation is complete. The atomic operations guarantee the correct calculation result.

Among the atomic functions, the atomicMin function computes the minimum of given values. atomicMin(addr, val) reads the 32-bit or 64-bit word old located at the address addr in global or shared memory, computes the minimum of old and val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old. atomicMin functions are provided only for the types int, unsigned, and unsigned long long. However, in our proposed method, the Pbest values are of float type. Therefore, we need an atomicMin function for the float type. Fortunately, we can simply implement it using the atomicMin function for the unsigned type, as shown in Fig. 5.

    __device__ inline float atomicMin(float *addr, float val)
    {
        // Compare the raw bit patterns as unsigned integers.
        unsigned old = atomicMin((unsigned *)addr, *((unsigned *)&val));
        // Reinterpret the returned bits as a float (no numeric conversion).
        return *((float *)&old);
    }

Fig. 5. A simple and efficient implementation of atomicMin function for non-negative float values.

Our implementation merely reinterprets the bits of a float value as an unsigned value. Why this simple implementation works correctly can be explained as follows. In CUDA C, both the unsigned type and the float type are 32 bits wide (shown in Fig. 6). Notice that the sign bit, exponent, and mantissa of a float value are allocated from the most significant bit downwards, in this order. Hence, for any given float values x and y, the ordering of x and y is the same as the ordering of their bit patterns read as unsigned integers, provided that x and y are non-negative.

The most important kernel computes the global best value over the Pbest values of all particles in the swarm. In this kernel, we used the atomicMin() function to obtain the best fitness value. Finally, our last kernel is used to free the array of seeds. In our experiments, better performance can be achieved if all the kernel functions use coalescing memory access.
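The memory-segment rule of Section II-E can be made concrete by counting how many segments one warp touches for a given access stride. The sketch below is an illustration under stated assumptions rather than a measurement of the authors' kernels: it assumes 4-byte floats, 128-byte segments, an array that starts on a segment boundary, and the particle-major layout `x[d * i + k]` suggested by `&x[d * i]` in Fig. 3. The function name `warp_segments` is hypothetical.

```c
#include <stddef.h>

#define WARP_SIZE 32
#define SEGMENT_FLOATS 32   /* one 128-byte segment holds 32 four-byte floats */

/* Count the distinct 128-byte segments touched when thread t of a warp
   (t = 0..31) reads the float at element offset base + t * stride. */
size_t warp_segments(size_t base, size_t stride)
{
    size_t seen[WARP_SIZE];
    size_t count = 0;

    for (size_t t = 0; t < WARP_SIZE; ++t) {
        size_t seg = (base + t * stride) / SEGMENT_FLOATS;
        int is_new = 1;
        for (size_t j = 0; j < count; ++j) {
            if (seen[j] == seg) {
                is_new = 0;
                break;
            }
        }
        if (is_new)
            seen[count++] = seg;
    }
    return count;
}
```

Under these assumptions, a kernel launched as n blocks of d threads, where thread k of block i touches `x[d * i + k]`, reads with stride 1, and `warp_segments(0, 1)` gives a single segment. A kernel with one thread per particle that walks the same layout reads with stride d, and for d = 50 `warp_segments(0, 50)` gives 32 segments, one per thread. This is one way to read the paper's remark that coalescing is used only partially.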

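The ordering argument behind the float version of atomicMin can be tried on the host. The following C sketch is not the CUDA code of Fig. 5; it only demonstrates the underlying property, assuming IEEE 754 single-precision floats. The helper names `float_bits`, `bits_float` and `min_via_bits` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the 32 bits of a float as an unsigned integer.
   memcpy copies the bit pattern; no numeric conversion takes place. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Inverse of float_bits: reinterpret 32 bits as a float. */
float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Minimum of two non-negative floats, computed only on their bit patterns.
   This mirrors what an unsigned minimum does when applied to float storage. */
float min_via_bits(float a, float b)
{
    uint32_t ua = float_bits(a);
    uint32_t ub = float_bits(b);
    return bits_float(ua < ub ? ua : ub);
}
```

For non-negative inputs the result agrees with the ordinary floating-point minimum. For negative inputs the sign bit makes the bit pattern larger than that of any non-negative value, so the trick is only valid when fitness values cannot be negative.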
Fig. 6. Difference between integer type and float type.

V. EXPERIMENTS AND ANALYSIS

In this section, we present our experimental results, which were obtained using the CPU and GPU platforms described below. The results are compared with the previous implementation [14] in terms of execution time, loop time per iteration and fitness values. In our experiments, seven classical benchmark functions (shown in Table I) were used to evaluate the performance of the implementations.

TABLE II
GPU SPSO AND CPU SPSO ON $f_1$ (NUMBER OF DIMENSIONS d = 50)

| n | CPU Time(s) | GPU Time(s) | CPU loop Time(s) | GPU loop Time(s) | CPU Gbest Value | GPU Gbest Value | Speedup |
|---|---|---|---|---|---|---|---|
| 2000 | 5.7992 | 0.4638 | 0.0029 | 0.0001 | 8.91E-12 | 8.65E-12 | 12.5 |
| 3000 | 8.7463 | 0.9190 | 0.0043 | 0.0002 | 8.41E-12 | 8.20E-12 | 9.5 |
| 4000 | 11.5932 | 1.7760 | 0.0057 | 0.0002 | 7.70E-12 | 7.65E-12 | 6.5 |
| 5000 | 14.4893 | 2.6564 | 0.0072 | 0.0003 | 7.59E-12 | 7.46E-12 | 5.4 |

TABLE III
GPU SPSO AND CPU SPSO ON $f_4$ (NUMBER OF DIMENSIONS d = 50)

| n | CPU Time(s) | GPU Time(s) | CPU loop Time(s) | GPU loop Time(s) | CPU Gbest Value | GPU Gbest Value | Speedup |
|---|---|---|---|---|---|---|---|
| 2000 | 10.2989 | 0.5097 | 0.0051 | 0.0001 | 0.00E+00 | 0.00E+00 | 20.2 |
| 3000 | 15.4255 | 0.9529 | 0.0077 | 0.0002 | 0.00E+00 | 0.00E+00 | 16.1 |
| 4000 | 20.5698 | 1.8107 | 0.0102 | 0.0002 | 0.00E+00 | 0.00E+00 | 11.3 |
| 5000 | 25.7197 | 2.6899 | 0.0128 | 0.0003 | 0.00E+00 | 0.00E+00 | 9.5 |

Fig. 7. Gbest and generation for benchmark test functions

Fig. 8. Gbest and generation for benchmark test functions

One of the most common measures used by the parallel computing community to compare test results is speedup. Speedup is defined as the ratio of the execution time $Time_{CPU}$ of the sequential implementation to the execution time $Time_{GPU}$ of the parallel implementation:

$$Speedup = \frac{Time_{CPU}}{Time_{GPU}}$$

We ran the CPU SPSO and the GPU SPSO using the same configuration of the parameters n and d, which are the number of particles and the number of dimensions respectively.

Our tests were conducted using an NVIDIA GeForce GTX 980 GPU and an Intel(R) Core(TM) i5 4570 @ 3.20GHz with 8 GB RAM. The operating system was Windows 7 Professional SP1. For compilation, we used Microsoft Visual Studio 2012 Professional Edition and the CUDA 7.5 SDK.

Across the experiments, the number of dimensions was set from 50 to 200 and the number of particles from 400 to 20000. Each experiment was run until the maximum number of iterations, which was set at 2,000, had been reached. We carried out another set of simulations to evaluate convergence speed with respect to the number of iterations. In all cases we obtained good solutions for all seven functions (see Fig. 7, 8).

The average results for some complex functions are shown in TABLE II through IV. In these cases, the number of particles ranges from 2000 to 5000, the number of dimensions is 50 and the number of iterations is 2000. The GPU SPSO and the CPU SPSO were run on the functions $f_1$ to $f_7$ 50 times independently with different seeds.

Analyzing the data of the tables, we can observe in TABLE IV that the GPU SPSO reaches a maximum speedup of greater than 46 times when the swarm population size is 2000 and the dimension size is 50, running on the complex function $f_7$. For more complex functions the speedup may be even greater. However, when optimized by the GPU SPSO, the time needed is almost the same among the seven functions under the same dimension and population. Therefore, the curves for $f_1$ to $f_7$ by the GPU SPSO in Fig. 9, 10 overlap each other.

In the next test, the swarm population was set at 2000 and the dimension was changed from 50 to 200. The results shown in Table V to VII demonstrate that running PSO on a CPU to optimize high dimensional problems is slow, but the speed can be greatly accelerated if we run it on a GPU (see Fig. 9, 10).

We also observed that the speedup decreases and the execution time on a GPU increases when the swarm population size is more than 2000 and the dimension size is more than 50. The speedup decreases due to the overhead of memory access by too many threads, which is a common phenomenon observed in many parallel programs.

TABLE IV
GPU SPSO AND CPU SPSO ON $f_7$ (NUMBER OF DIMENSIONS d = 50)

| n | CPU Time(s) | GPU Time(s) | CPU loop Time(s) | GPU loop Time(s) | CPU Gbest Value | GPU Gbest Value | Speedup |
|---|---|---|---|---|---|---|---|
| 2000 | 23.9281 | 0.5176 | 0.0119 | 0.0001 | 0.00E+00 | 0.00E+00 | 46.2 |
| 3000 | 35.7973 | 0.9646 | 0.0178 | 0.0002 | 0.00E+00 | 0.00E+00 | 37.1 |
| 4000 | 47.8512 | 1.8228 | 0.0239 | 0.0002 | 0.00E+00 | 0.00E+00 | 26.2 |
| 5000 | 59.5922 | 2.6941 | 0.0297 | 0.0003 | 0.00E+00 | 0.00E+00 | 22.1 |
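The Speedup column of the tables follows directly from the two timing columns. As a small worked check, the C sketch below applies the definition of speedup to a pair of execution times. The function name `speedup` is hypothetical, and returning 0 for a non-positive GPU time is an illustrative convention, not something the paper specifies.

```c
/* Speedup = Time_CPU / Time_GPU, both in seconds.
   A non-positive GPU time has no meaningful speedup; 0 is returned
   in that case as an illustrative convention. */
double speedup(double time_cpu, double time_gpu)
{
    if (time_gpu <= 0.0)
        return 0.0;
    return time_cpu / time_gpu;
}
```

For the first row of TABLE IV, `speedup(23.9281, 0.5176)` evaluates to roughly 46.2, which agrees with the reported value.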

TABLE V
GPU SPSO AND CPU SPSO ON $f_1$ (NUMBER OF PARTICLES n = 2000)

| d | CPU Time(s) | GPU Time(s) | CPU loop Time(s) | GPU loop Time(s) | CPU Gbest Value | GPU Gbest Value | Speedup |
|---|---|---|---|---|---|---|---|
| 50 | 5.7992 | 0.4638 | 0.0028 | 0.0001 | 8.91E-12 | 8.65E-12 | 12.5 |
| 100 | 11.5499 | 1.7306 | 0.0057 | 0.0002 | 1.36E-04 | 1.39E-04 | 6.6 |
| 150 | 17.6825 | 3.3030 | 0.0088 | 0.0003 | 6.91E-02 | 6.52E-02 | 5.3 |
| 200 | 27.1321 | 4.8979 | 0.0135 | 0.0005 | 2.30E+00 | 2.30E+00 | 5.5 |

TABLE VI
GPU SPSO AND CPU SPSO ON $f_4$ (NUMBER OF PARTICLES n = 2000)

| d | CPU Time(s) | GPU Time(s) | CPU loop Time(s) | GPU loop Time(s) | CPU Gbest Value | GPU Gbest Value | Speedup |
|---|---|---|---|---|---|---|---|
| 50 | 10.2989 | 0.5097 | 0.0051 | 0.0001 | 0.00E+00 | 0.00E+00 | 20.2 |
| 100 | 20.5673 | 1.8231 | 0.0102 | 0.0002 | 3.23E-02 | 3.18E-02 | 11.2 |
| 150 | 33.8859 | 3.4463 | 0.0169 | 0.0004 | 1.23E+00 | 1.23E+00 | 9.8 |
| 200 | 49.7410 | 5.0790 | 0.0248 | 0.0005 | 8.97E+00 | 8.98E+00 | 9.7 |

Fig. 9. Overlap of computation time

VI. C ONCLUSION
This paper has presented an implementation of the SPSO on
the CUDA architecture. The proposed GPU SPSO significantly
reduces execution time compared to previous development.
We have achieved a good fitness value with short execution
time and kernel loop time simultaneously. Moreover, the
Fig. 10. Overlap of loop time implementation has a significant speedup when compared to
the CPU serial implementation. The proposed implementation
is 46 times faster.
In this paper we focused on realizing good implementation
Due to difference between random number sequence on a of the original SPSO. In our future work, we want to investi-
CPU and that on a GPU, the Gbest values may be slightly gate parallel implementation of the extensions [15][19][20] of
different. Moreover, execution time depends on function types SPSO to improve the efficiency of other implementations. We
and how many operators that have been used inside a function. also want to improve the SPSO performance by implementing
We focused on execution time and speedup for fixed number all kernel functions using coalescing memory access which
of iterations. should improve SPSO performance significantly.
The other implementation [14] of SPSO is slower than
our implementation under the same dimension and the same
population size (see Table VIII). The key point of our
implementation is good parallelization.
In the other implementation of SPSO, the random numbers
were generated on the CPU, whereas in our implementation
all the random numbers were generated on the GPU. The
rand() function on a CPU is very simple
and not time-consuming. However, transferring the generated
random numbers from the CPU to the GPU is time-consuming,
which increases the overall processing time of the other
implementation. In our work, the random numbers were generated
by cuRAND functions on the GPU, which reduces the processing
time significantly. Moreover, some of our kernels use
coalescing memory access, which increases the processing speed.
The "atomicMin" function was used to find good solutions
in SPSO on the GPU. In the previous approach, on the
other hand, solutions were calculated using a complex algorithm.

TABLE VII
GPU SPSO AND CPU SPSO ON f7 (NUMBER OF PARTICLES n = 2000)

d     CPU       GPU       CPU loop   GPU loop   CPU Gbest   GPU Gbest   Speedup
      Time(s)   Time(s)   Time(s)    Time(s)    Value       Value
50    23.9281   0.5176    0.0119     0.0001     0.00E+00    0.00E+00    46.2
100   49.3385   1.8221    0.0246     0.0003     0.00E+00    0.00E+00    27.0
150   72.9061   3.4420    0.0364     0.0004     0.00E+00    0.00E+00    21.1
200   96.5463   5.0681    0.0482     0.0005     0.00E+00    0.00E+00    19.0

TABLE VIII
A COMPARISON BETWEEN [14] AND OURS ON f4 (NUMBER OF DIMENSIONS d = 50,
NUMBER OF ITERATIONS 10000)

        [14]                               ours
n       CPU         GPU        Speedup     CPU        GPU       Speedup
        Time(s)     Time(s)                Time(s)    Time(s)
10000   1269.7554   113.1295   11.2        400.7887   13.0416   30.7
20000   2537.7515   221.9755   11.4        801.7956   28.4701   28.1
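
As a quick check on Tables VII and VIII, the Speedup columns appear to be the CPU time divided by the GPU time, truncated (not rounded) to one decimal place. This reading is inferred from the published numbers and is not stated in the text; the helper name `speedup` is hypothetical. A minimal Python sketch under that assumption:

```python
import math


def speedup(cpu_time_s: float, gpu_time_s: float) -> float:
    """CPU/GPU time ratio, truncated to one decimal (assumed table convention)."""
    return math.floor(cpu_time_s / gpu_time_s * 10) / 10


# (CPU time, GPU time) pairs copied from Table VII (f7, n = 2000, d = 50..200).
table_vii = [(23.9281, 0.5176), (49.3385, 1.8221),
             (72.9061, 3.4420), (96.5463, 5.0681)]

# (CPU time, GPU time) pairs copied from Table VIII (f4, d = 50):
# first the implementation of [14], then the proposed one, for n = 10000, 20000.
table_viii_ref14 = [(1269.7554, 113.1295), (2537.7515, 221.9755)]
table_viii_ours = [(400.7887, 13.0416), (801.7956, 28.4701)]

computed_vii = [speedup(c, g) for c, g in table_vii]
computed_ref14 = [speedup(c, g) for c, g in table_viii_ref14]
computed_ours = [speedup(c, g) for c, g in table_viii_ours]

print(computed_vii)
print(computed_ref14)
print(computed_ours)
```

Under this reading, the 46-fold figure quoted in the conclusion corresponds to the d = 50 row of Table VII, and the speedup on f7 decreases as the dimension grows.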

REFERENCES
[1] J. Kennedy and R. C. Eberhart: Particle Swarm Optimization, IEEE
International Conference on Neural Networks, Vol.4, pp.1942–1948 (1995)
[2] A. P. Engelbrecht: Fundamentals of Computational Swarm Intelligence,
Wiley (2005)
[3] J. Kołodziejczyk: Survey on Particle Swarm Optimization accelerated on
GPGPU, International Journal of Scientific Engineering and Research,
Vol.5, No.12, pp.2229–5518 (2014)
[4] X. Hu: Particle Swarm Optimization, www.swarmintelligence.org (2004)
[5] D. Bratton and J. Kennedy: Defining a Standard for Particle Swarm
Optimization, IEEE Swarm Intelligence Symposium, pp.120-127 (2007)
[6] K. E. Hoff III, T. Culver, J. Keyser, M. Lin, and D. Manocha: Fast
Computation of Generalized Voronoi Diagrams Using Graphics Hardware, 26th
Annual Conference on Computer Graphics and Interactive Techniques,
pp.277–286 (1999)
[7] Z. W. Luo, H. Liu, and X. Wu: Artificial Neural Network Computation
on Graphic Process Unit, IEEE International Joint Conference on Neural
Networks, Vol.1, pp.622–626 (2005)
[8] NVIDIA: CUDA Programming Guide 7.5
http://www.nvidia.com/object/cuda_develop.html (2015)
[9] A. Munshi: The OpenCL Extension Specification 1.2
http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf (2012)
[10] W. Liu and B. Vinter: An Efficient GPU General Sparse Matrix-Matrix
Multiplication for Irregular Data, 28th IEEE International Parallel and
Distributed Processing Symposium, pp.370–381 (2014)
[11] G. Dafeng and W. Xiaojun: Real-time Visual Hull Computation Based
on GPU, IEEE International Conference on Robotics and Biomimetics
(ROBIO), pp.1792–1797 (2015)
[12] A. P. Yazdanpanah, A. K. Mandava, E. E. Regentova, V. Muthukumar,
and G. Bebis: A CUDA Based Implementation of Locally-and Feature-Adaptive
Diffusion Based Image Denoising Algorithm, 11th International
Conference on Information Technology: New Generations (ITNG),
pp.388–393 (2014)
[13] NVIDIA: GeForce GTX 980 for Desktop
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980
[14] Y. Zhou and Y. Tan: GPU-based Parallel Particle Swarm Optimization,
11th IEEE Congress on Evolutionary Computation, pp.1493–1500 (2009)
[15] R. M. Calazan, N. Nedjah, and L. D. M. Mourelle: Parallel GPU-based
Implementation of High Dimension Particle Swarm Optimizations,
IEEE Fourth Latin American Symposium on Circuits and Systems
(LASCAS), pp.1–4 (2013)
[16] V. K. Reddy and L. S. S. Reddy: Performance Evaluation of Particle
Swarm Optimization Algorithms on GPU Using CUDA, International
Journal of Computer Science and Information Technologies, Vol.5, No.1,
pp.65–81 (2012)
[17] L. Mussi, F. Daolio, and S. Cagnoni: Evaluation of Parallel Particle
Swarm Optimization Algorithms within the CUDA Architecture,
Information Sciences on Interpretable Fuzzy Systems, Vol.181, pp.4642–4657
(2011)
[18] W. Li and Z. Zhang: A CUDA-based Multichannel Particle Swarm
Algorithm, International Conference on Control, Automation and Systems
Engineering (CASE), pp.1–4 (2011)
[19] R. M. Calazan, N. Nedjah, and L. D. M. Mourelle: A Cooperative
Parallel Particle Swarm Optimization for High-Dimension Problems on
GPUs, BRICS Congress on Computational Intelligence and 11th Brazilian
Congress on Computational Intelligence, pp.356–361 (2013)
[20] Y. Zhou and Y. Tan: GPU-based Parallel Multiobjective Particle
Swarm Optimization, International Journal of Artificial Intelligence, Vol.7,
No.A11 (2011)
[21] H. Zhu, Y. Guo, J. Wu, and J. Gu: Paralleling Euclidean Particle
Swarm Optimization in CUDA, 4th International Conference on Intelligent
Networks and Intelligent Systems (ICINIS), pp.93–96 (2011)
[22] E. H. M. Silva and C. J. A. B. Filho: PSO Efficient Implementation on
GPUs Using Low Latency Memory, IEEE Latin America Transactions,
Vol.13, No.5, pp.1619–1624 (2015)
[23] O. Bali, W. Elloumi, P. Krömer, and A. M. Alimi: GPU Particle
Swarm Optimization Applied to Travelling Salesman Problem, IEEE 9th
International Symposium on Embedded Multicore/Many-core
Systems-on-Chip (MCSoC), pp.112–119 (2015)
[24] NVIDIA: CURAND Library 7.5
http://docs.nvidia.com/cuda/pdf/CURAND_Library.pdf (2015)
