
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CYBERNETICS 1

Generation-Level Parallelism for Evolutionary Computation: A Pipeline-Based Parallel Particle Swarm Optimization
Jian-Yu Li , Student Member, IEEE, Zhi-Hui Zhan , Senior Member, IEEE,
Run-Dong Liu , Student Member, IEEE, Chuan Wang, Sam Kwong , Fellow, IEEE,
and Jun Zhang , Fellow, IEEE

Abstract—Due to the population-based and iterative-based characteristics of evolutionary computation (EC) algorithms, parallel techniques have been widely used to speed up the EC algorithms. However, the parallelism usually performs in the population level, where multiple populations (or subpopulations) run in parallel, or in the individual level, where the individuals are distributed to multiple resources. That is, different populations or different individuals can be executed simultaneously to reduce running time. However, the research into generation-level parallelism for EC algorithms has seldom been reported. In this article, we propose a new paradigm of the parallel EC algorithm by making the first attempt to parallelize the algorithm in the generation level. This idea is inspired by the industrial pipeline technique. Specifically, a kind of EC algorithm called local version particle swarm optimization (PSO) is adopted to implement a pipeline-based parallel PSO (PPPSO, i.e., P3SO). Due to the generation-level parallelism in P3SO, when some particles still perform their evolutionary operations in the current generation, some other particles can simultaneously go to the next generation to carry out the new evolutionary operations, or even go to further next generation(s). The experimental results show that the problem-solving ability of P3SO is not affected while the evolutionary speed is substantially accelerated. Therefore, generation-level parallelism is possible in EC algorithms and may have significant potential applications in time-consuming optimization problems.

Index Terms—Evolutionary computation (EC), parallel, particle swarm optimization (PSO), pipeline technique.

I. INTRODUCTION

GLOBAL optimization algorithms play an important role in the optimization domain, as they provide a great chance of finding the global or an approximate global optimum [1]–[3]. Among the global optimization algorithms, there are deterministic algorithms and evolutionary computation (EC) algorithms. In recent years, ECs have become more popular in handling global optimization problems [4]–[6]. EC algorithms are population-based and iterative-based algorithms, which adopt a population of individuals to carry out biological-evolution-inspired operations, such as crossover, mutation, and selection, generation by generation [7]. As a member of EC algorithms, particle swarm optimization (PSO) was first proposed in 1995 to simulate the swarm intelligence behaviors of birds [8]–[10]. PSO uses a set of particles to search for the global optimum. Each particle has its velocity and position information. During the evolutionary progress, the velocities and positions of particles are updated, and then the particles fly toward the global optimum or an approximate global optimum. In recent years, many improved PSOs have been developed [11]–[14] and many real-world optimization problems have now been addressed efficiently by the enhanced EC/PSO algorithms [15]–[18].

However, there are also new issues. Although the computer has become more powerful than before, many time-consuming optimization problems (e.g., the fitness-evaluation-expensive problems) still need long running time [19], [20]. Moreover, the population-based and iterative-based characteristics of EC/PSO also make the algorithms run slowly. To address the long-running-time problem, parallel and distributed techniques have been adopted [21]–[23]. The population-based PSO has natural parallelism. Usually, the parallelization of PSO can be implemented on the population level and/or

Manuscript received March 31, 2020; revised July 14, 2020; accepted September 17, 2020. This work was supported in part by the National Key Research and Development Program of China under Grant 2019YFB2102102; in part by the Outstanding Youth Science Foundation under Grant 61822602; in part by the National Natural Science Foundations of China under Grant 61772207 and Grant 61873097; in part by the Key-Area Research and Development of Guangdong Province under Grant 2020B010166002; in part by the Guangdong Natural Science Foundation Research Team under Grant 2018B030312003; and in part by the Hong Kong GRF-RGC General Research Fund 9042489 under Grant CityU 11206317. This article was recommended by Associate Editor H. Takagi. (Corresponding authors: Zhi-Hui Zhan; Jun Zhang.)

Jian-Yu Li, Zhi-Hui Zhan, and Run-Dong Liu are with the School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China, and also with the Guangdong Provincial Key Laboratory of Computational Intelligence and Cyberspace Information, South China University of Technology, Guangzhou 510006, China (e-mail: zhanapollo@163.com).

Chuan Wang is with the College of Software, Henan Normal University, Xinxiang 453007, China.

Sam Kwong is with the Department of Computer Science, City University of Hong Kong, Hong Kong.

Jun Zhang is with the University of Victoria, Melbourne, VIC 3011, Australia.

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2020.3028070
2168-2267 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.


individual level [24], [25]. In the population-level parallelism, multiple populations are used or the population can be divided into several subpopulations. All these populations or subpopulations can run in parallel on multiple machines or machines with multiple cores [26]. In the individual-level parallelism, the evolutionary operations or the fitness evaluation of different individuals can be parallelized and distributed on multiple machines or machines with multiple cores. This can save much time if the calculation of the fitness value is very expensive [27], [28]. That is, different populations or different individuals can be executed simultaneously to save time. In fact, the second kind of parallelism can be combined with the first kind of parallelism. For example, different populations run in parallel on different machines (i.e., the population-level parallelism), and then the individuals in each population can be further distributed to different machines to calculate the fitness value in parallel (i.e., the individual-level parallelism), like the two-layer Cloudde [29]. There is growing literature on parallel PSO algorithms based on the above-mentioned two parallelism levels, such as multipopulation-based PSO [26]; GPU-PSO algorithms [22], [30], [31]; multicore PSO algorithms [32]; and heterogeneous platform-based PSO [33]. With the development of the parallel technique, parallel PSOs are also adopted in many applications, such as bus voltage optimization [34], trajectory optimization of aircraft [35], parameters optimization [36], 1-D heat conduction equation solving [37], and parameters identification [38].

Throughout these parallel PSO algorithms, the parallelism of them has some common features. Regardless of whether the parallelization of PSO is implemented in the population level or individual level, the iterations of these parallel PSO algorithms are serial (i.e., one by one) in the generation level. That means only after all particles have completed the current generation are they allowed to proceed to the next generation. Therefore, the existing parallel PSOs/ECs do not consider the parallelization in the generation level. In this article, we propose a new parallel PSO and, to the best of our knowledge, this is the first attempt to parallelize the algorithm at the generation level. That is, when some particles have finished their evolutionary operations in the current generation, they can go to the next generation to carry out the new evolutionary operations, even though the current generation has not finished (i.e., some other particles are still performing their evolutionary operations in the current generation). This idea of generation-level parallelism seems to be challenging, but we have implemented it by the inspiration of the industrial pipeline technique.

The pipeline technique is a generic idea for parallel working to speed up the runtime of multiple tasks. It has been widely used as the parallel paradigm in the industrial field, where the production process can be divided into different sequential stages. For example, automobile manufacturing adopts the pipeline technique to process the stages in parallel to improve production efficiency [39]. In the computer field, it is well known that the central processing unit (CPU) also adopts the pipeline technique to speed up the instruction execution. Herein, as computer scientists who are familiar with CPU, we use the CPU as an example to describe the pipeline idea and to show how the generation parallelism can be implemented in PSO, like the instruction parallelization implemented in CPU. Since the Intel i486 CPU, there have been several different units in a CPU, and one instruction is divided into several stages, with different stages being executed under the coordination of the different units. After one unit completes its operation for one instruction, it can be used to execute the next instruction. When all units work together in this way, instructions can be efficiently executed in parallel. Therefore, the pipeline technique makes the CPU achieve instruction-level parallelization, which will be described in detail later in Section II-A as the background to illustrate the idea.

Similar to the pipeline technique in CPU, the parallelization among generations is implemented for PSO. In this article, local version PSO (LPSO) [8] is adopted to implement a pipeline-based parallel PSO (PPPSO, i.e., P3SO). The P3SO is based on LPSO that uses a ring topology. That is, each particle forms the neighborhood by itself and its left and right particles (the left and right are indicated by index). This way, the particle is guided by the neighborhood's best information among these three particles (including itself). In the parallelization, P3SO can be implemented by clusters, multiprocessors, multicores, or multithreads. Without loss of generality, we describe the parallel implementation by multithreads in this article. Particles in each generation serially execute their evolutionary operations (e.g., updating their velocities and positions, calculating the fitness values, and updating their historically best positions). When all the neighbors of a particle (including itself) have completed their evolutionary operations, this particle can obtain enough guidance information and can move to the next generation in advance. A new thread will be used for the new generation to control the evolution of these ready particles in a new generation, while the nonready particles are still in the old generation. However, as the number of generations may be large, some threads responsible for the early generations can be reused for the late generations when they have finished all the particles in the early generations. Herein, we propose a thread manage technique to set a reasonable number of threads according to the population size. A series of experimental studies shows that the new P3SO algorithm is viable and can speed up the evolutionary process at the generation level.

The remainder of this article is organized as follows. The parallel theory adopted in this article and LPSO are introduced in Section II. In Section III, the P3SO will be described in detail. A series of experiments is exhibited in Section IV. Finally, conclusions and future works are given in Section V.

II. PARALLEL THEORY AND LPSO

A. Pipeline Technique

The pipeline technique is a kind of parallel working mechanism for speeding up the running time of multiple tasks. It can make good use of components to achieve high working efficiency. In the industrial field, automobile manufacturing adopts the pipeline technique to improve production efficiency [39].

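As a quick numerical check of the pipeline idea, the clock-cycle counts derived in this subsection (K × N cycles in sequence versus N + (K − 1) cycles when pipelined, for N tasks split into K one-cycle stages) can be captured in two helpers. The code and names below are our own illustrative sketch, not part of the paper:

```cpp
// Clock cycles to run N tasks, each split into K one-cycle stages.
#include <cassert>

// Strictly sequential execution: every task occupies all K stages alone.
long serial_cycles(long N, long K) { return K * N; }

// Ideal pipeline: after the K - 1 fill cycles, one task completes per cycle.
long pipelined_cycles(long N, long K) { return N + (K - 1); }
```

For the five-stage CPU pipeline with five instructions, as in Fig. 1, this gives 25 sequential cycles versus only 9 pipelined cycles.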

Fig. 1. Space-time diagram of pipeline in CPU.

In the computer field, it is well known that the CPU has adopted the pipeline technique for a long time, since the i486. With the help of the pipeline technique, CPUs can execute more instructions at the same time within the same clock rate. Therefore, CPUs can achieve instruction-level parallelization [40]. This technique will be introduced by the case of the five-stage pipeline of the CPU. As our proposed method and the pipeline of CPU only share similarities in their processes but not the hardware implementation, we just introduce the operation progress of the pipeline in this article, and the details of hardware will not be discussed. The five stages are: 1) instruction fetch (IF); 2) instruction decode (ID); 3) execute (EX); 4) memory access (MEM); and 5) write back (WB) [40], [41]. An instruction's execution is completed through the five stages.

1) IF: In this stage, the instruction will be fetched from the cache. This step is performed by a special component in the CPU in one clock cycle.
2) ID: In this stage, the instruction fetched in the IF stage will be decoded to obtain the type of instruction and the address of operands. This stage is also performed in one clock cycle.
3) EX: For this stage, the CPU executes the instruction according to the decoding results in the ID stage. This stage is performed in one clock cycle.
4) MEM: When the CPU executes the instruction, it may need to access the memory to obtain the operands. In this stage, memory access will be performed for the EX stage. This stage is performed in one clock cycle.
5) WB: In this stage, the execution results will be written to the register. The results may be used for the next instruction. This stage is performed in one clock cycle. When this stage is done, the execution of one instruction is done. The CPU will repeat these steps to execute the next instruction.

As shown in Fig. 1, the horizontal axis represents the clock cycle, and the vertical axis represents the order of instructions. For one instruction, it needs five clock cycles to complete the executions of the five stages. Therefore, if the instructions are executed in sequence, then five instructions will need 25 clock cycles for completion. However, under the pipeline technique, these stages of one instruction can be performed by different units of the CPU (i.e., the units responding to IF, ID, EX, MEM, and WB, respectively). Therefore, ideally, when an instruction's IF stage is done, its ID stage can be performed in the ID unit of the CPU, and the IF stage of the next instruction can be performed in the IF unit at the same time. Therefore, different stages can be performed in one clock cycle. Starting from the fifth cycle, an instruction will just need one extra clock cycle to complete the execution. From Fig. 1, we can see that only nine clock cycles are needed for executing all five instructions, which is much less than the 25 clock cycles when executed in sequence. Ideally, assume that an instruction is divided into K stages and each stage needs one clock cycle. The clock cycles needed in the pipeline technique for executing all N instructions are N + (K − 1). Therefore, compared with a CPU that does not adopt the pipeline technique (whose clock cycles are K × N), the efficiency has been greatly improved.

B. LPSO

PSO is a kind of stochastic algorithm, which is a member of ECs. The population of PSO is formed of particles. Each particle consists of two vectors. One is the particle's position vector Xi = [xi1, xi2, . . . , xiD], where D is the dimension of the problem and i means the ith particle. The other one is the particle's velocity vector Vi = [vi1, vi2, . . . , viD]. Moreover, each particle will keep its historically best values in vector Pi = [pBesti1, pBesti2, . . . , pBestiD]. Besides, each particle preserves the best particle's value among its neighbors in vector Pn = [nBestn1, nBestn2, . . . , nBestnD]. During the evolutionary process, the particles' velocities and positions will be updated by using

vij = ω · vij + c1 · r1j · (pBestij − xij) + c2 · r2j · (nBestij − xij)  (1)

xij = xij + vij  (2)

where vij and xij represent the ith particle's jth variable on velocity and position, respectively; ω represents the inertia weight; and c1 and c2 represent the acceleration factors [9]. r1j and r2j are random values that drop in [0.0, 1.0], which are resampled for each dimension j. Pn is selected according to the topology of particles. If Pn is the best particle among the population, then nBest can be called the globally best gBest. If Pn is the best particle among the ith particle's neighbors (i.e., Xi−1, Xi, and Xi+1), then nBest can be called the locally best lBest. This is a local version of PSO with a ring topology called LPSO. In this article, we adopt the pipeline technique for the LPSO. It is rather remarkable that parameter ω can be a positive constant value or a positive linear or nonlinear function of time [42]. Herein, we adopt (3) to change parameter ω over time as

ω = ωmax − (ωmax − ωmin) · g/G  (3)

where ωmin and ωmax represent the minimum and maximum values of ω, g is the number of the current generation, and G is the maximum generation used in the algorithm. Moreover, the parameters c1 and c2 are both set to 2.0.

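For concreteness, the LPSO building blocks in (1)–(3) can be sketched in C++ as follows. This is an illustrative sketch under our own naming (the Particle struct, rand01-style sampling, and helper names are ours), not the authors' implementation:

```cpp
// Serial LPSO building blocks: time-varying inertia weight (3),
// ring-topology neighbor selection, and the per-particle update (1)-(2).
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

struct Particle {
    std::vector<double> x, v, pBest;  // position, velocity, best position
    double pBestFit;                  // fitness of pBest (minimization)
};

// Inertia weight decreasing from wmax to wmin over G generations, per (3).
double inertia(int g, int G, double wmin, double wmax) {
    return wmax - (wmax - wmin) * (double)g / G;
}

// Index of the best pBest among particle i and its left/right ring neighbors.
int ring_best(const std::vector<Particle>& P, int i) {
    int NP = (int)P.size();
    int cand[3] = {(i + NP - 1) % NP, i, (i + 1) % NP};
    int b = cand[0];
    for (int c : cand)
        if (P[c].pBestFit < P[b].pBestFit) b = c;
    return b;
}

// One velocity/position update of particle i, per (1) and (2).
void update_particle(std::vector<Particle>& P, int i,
                     double w, double c1, double c2) {
    const std::vector<double>& nBest = P[ring_best(P, i)].pBest;
    for (std::size_t j = 0; j < P[i].x.size(); ++j) {
        double r1 = std::rand() / (RAND_MAX + 1.0);  // r1j in [0, 1)
        double r2 = std::rand() / (RAND_MAX + 1.0);  // r2j in [0, 1)
        P[i].v[j] = w * P[i].v[j]
                  + c1 * r1 * (P[i].pBest[j] - P[i].x[j])
                  + c2 * r2 * (nBest[j] - P[i].x[j]);  // (1)
        P[i].x[j] += P[i].v[j];                        // (2)
    }
}
```

Calling update_particle for each particle, followed by fitness evaluation and pBest bookkeeping, gives one serial LPSO generation; P3SO reorganizes exactly these steps across threads.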

Algorithm 1: Procedures of P3SO
1  Begin
2    Set up the number of threads with (4);
3    Initialize the population; gen = 0; firstReady = 1;
4    /* Generate p threads */
5    #pragma omp parallel num_threads(p)
6    While (Not stop)
7      Particles_updating(); /* See Algorithm 2 */
8      #pragma omp atomic
9      gen++;
10   End of While
11 End
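A runnable OpenMP skeleton of Algorithm 1's control flow might look as follows. The particle updating of Algorithm 2 is replaced by a stub, and the atomic-capture loop is our simplification of the thread coordination, so this is a sketch of the structure rather than the authors' code:

```cpp
// Skeleton of Algorithm 1: a fixed team of p threads repeatedly claims
// and processes generations while a shared counter advances atomically.
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

void particles_updating(int /*generation*/) {
    // Placeholder for Algorithm 2 (velocity/position/pBest/lBest updates).
}

// Runs G generations with p threads; returns how many were processed.
int run(int p, int G) {
    int gen = 0;        // shared generation counter (cf. Algorithm 1)
    int processed = 0;  // shared tally of completed generations
    #pragma omp parallel num_threads(p)
    {
        while (true) {
            int g;
            #pragma omp atomic capture
            g = gen++;              // claim the next generation index
            if (g >= G) break;      // the "Not stop" condition fails
            particles_updating(g);
            #pragma omp atomic
            processed++;
        }
    }
    return processed;
}
```

Compiled with -fopenmp, run(2, 100) processes 100 generations with a team of two threads; without OpenMP, the pragmas are ignored and the loop runs serially with the same result.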

Fig. 2. Execution order in serial LPSO with five particles.

Algorithm 2: Particles Updating
1  Begin
2    Try to get the mutex lock;
3    i = firstReady; /* firstReady is a global variable */
4    cnt = 1; /* the number of updated particles */
5    While (cnt < NP)
6      If (cnt == 1)
7        Set up current_gen of this thread;
8        Set up parameter ω of LPSO by using (3);
9      End of If
10     /* Wait until the particle is ready for its update */
11     While (gen_value of particle i ≠ current_gen)
12       Perform dummy operation;
13     End of While
14     Update velocity, position, and pBest of particle i;
15     If (cnt >= 3)
16       Choose appropriate particle to update its lBest;
17       Increase its property gen_value by 1;
18       /* Make sure these steps are only done once in one generation */
19       If (cnt == 3)
20         Set up firstReady as the index of this particle;
21         /* Wake up a new thread */
22         Release the mutex lock;
23       End of If
24     End of If
25     cnt++;
26     (i == (NP − 1)) ? (i = NP) : (i = (i + 1) % NP);
27   End of While
28 End

Algorithm 3: Procedure of Modified Benchmark Function
Input: delay time useconds
Output: the fitness value
1  Begin
2    Basic benchmark functions' procedure;
3    // delayed by useconds microseconds (μs)
4    usleep(useconds);
5  End

III. PARALLEL METHOD OF P3SO

A. Motivations of P3SO

In serial LPSO with a ring topology, the update of particles is one by one, as shown in Fig. 2 with five particles, where the arrows indicate the order of particles' evolutionary progress. As the lBest of each particle depends on the pBest values of its neighbors, the serial LPSO will go through all particles twice: once for updating their pBest values and once for lBest. For example, first, particles 1–5 update their pBest one by one, according to the order shown in Fig. 2. When all five particles have finished updating their pBest, particle 1 will start to update its lBest according to the newest pBest of its neighbors, that is, particles 2 and 5. Then, similar to particle 1, particles 2–5 update their lBest one by one. After updating the lBest of all particles, the serial LPSO will enter into the next generation and, again, begin to update the pBest of particles. All these procedures are performed sequentially in serial LPSO.

In fact, when particle 3 has finished updating its pBest, this means particle 4 can start to update its pBest, and particle 2 can start to update its lBest at the same time and then enter into the next generation, because all the neighbors of particle 2 (i.e., particles 1, 2, and 3) have updated their pBest. In other words, when particle 4 is updating its pBest at generation g, particle 2 can update its lBest and enter into generation g + 1. Next, when particle 5 updates its pBest value, particle 3 can update its lBest value at the same time and go into the next generation g + 1. It is noteworthy that the particles handled at the same time belong to different generations. It is similar to the execution of instructions in the pipeline-based CPU, where different instructions' execution stages can be executed at the same clock cycle. Herein, the population in one generation can be regarded as one instruction, each particle of the population can be regarded as an execution stage of the instruction, and the populations in different generations can be regarded as different instructions. Multithreads can be adopted to operate particles that belong to different generations at the same time. Thus, the pipeline technique is implemented in LPSO and results in P3SO. Besides, the important point is that the parallelization of this P3SO is implemented on the generation level.

The space-time diagram of P3SO is illustrated in Fig. 3. The horizontal axis is the time axis scaled by the clock cycle, similar to that in CPU. The vertical axis is the order of generations. Populations in different generations can be projected into different instructions in CPU. Combining Figs. 1 and 3, the mechanism of the P3SO can be easily understood.

B. Threads Manage Technique

To adopt the pipeline technique, the number of threads should be appropriately set. In fact, it is not necessary to use an independent thread for each generation. Instead, a few threads can be cyclically reused for different generations during the evolutionary progress.

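The generation overlap motivated above can be checked with a small time-step simulation. In this toy model (our own construction, not the paper's code), every generation keeps a cursor over its particle-processing order, and in each time step a generation may process one more particle, provided the previous generation has already finished that particle's three ring neighbors:

```cpp
// Counts time steps until all G generations have processed all NP
// particles under the neighbor-ready rule of the generation pipeline.
#include <algorithm>
#include <cassert>
#include <vector>

int pipelined_steps(int NP, int G) {
    std::vector<int> done(G, 0);  // particles processed per generation
    int steps = 0;
    while (done[G - 1] < NP) {
        // Scan from the last generation down so a particle finished in
        // this step is not consumed by a later generation until the next.
        for (int g = G - 1; g >= 0; --g) {
            if (done[g] == NP) continue;
            int k = done[g];  // position in this generation's order
            // The k-th particle of generation g becomes ready once the
            // previous generation has finished k + 3 particles, i.e.,
            // this particle's ring neighborhood (cf. Fig. 3).
            if (g == 0 || done[g - 1] >= std::min(NP, k + 3)) ++done[g];
        }
        ++steps;
    }
    return steps;
}
```

For NP = 5 and G = 3, the simulation finishes in 11 time steps, matching the space-time diagram of Fig. 3, whereas the fully serial schedule needs NP × G = 15 steps.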

Fig. 3. Space-time diagram of P3SO.

TABLE I
THREAD QUANTITIES WITH DIFFERENT POPULATION SIZES

By considering the ring topology of LPSO and observing Fig. 3, we can see that after every three particles have finished in the first generation, another thread can and should be created to control the evolution of particles of the next generation. However, if the new generation starts after the first generation has finished, then the thread responsible for the first generation can be cyclically adopted for this new generation. As shown in Fig. 3, when the third generation starts (i.e., at time cycle 7), the first generation has already finished (i.e., at time cycle 5). In this case, no new thread needs to be created because the first thread used for the first generation can now be reused for the third generation. Similarly, the second thread can be reused for the fourth generation. In fact, as mentioned above, a new thread is needed after every three particles finish their operations; therefore, the number of threads corresponds to the population size, as shown in Table I. If the population size increases, then the number of needed threads should be increased. Besides, the number of threads remains unchanged during the evolutionary progress. Herein, according to Table I, we propose an equation to calculate the number of threads as

p = ⌈NP/3⌉  (4)

where p represents the number of threads, NP means the population size, and the operation ⌈x⌉ is the ceiling function that returns the minimum integer not smaller than x.

C. Theoretical Analysis of P3SO

Now, we consider the speedup of P3SO. Assume that the population size is NP and each particle needs one cycle time for updating all its values (e.g., velocity, position, fitness, and pBest); then the total computational time needed for running G generations in serial LPSO is NP × G. However, when the number of available threads is sufficient (i.e., p ≥ ⌈NP/3⌉), we can obtain from Fig. 3 that the total time needed in the P3SO is only 3 × G + (NP − 3). Therefore, the theoretical speedup Sp of P3SO compared with LPSO is

Sp = (NP × G) / (3 × G + (NP − 3))  (5)

which can be transformed as

Sp = NP / (3 + (NP − 3)/G).  (6)

Therefore, as NP is always larger than 3, the theoretical speedup Sp satisfies

Sp < NP/3 ≤ ⌈NP/3⌉ ≤ p.  (7)

That is, if the number of available threads is sufficient (i.e., p ≥ ⌈NP/3⌉), the maximum theoretical speedup of P3SO approaches NP/3; otherwise, the maximum theoretical speedup of P3SO is p.

D. Implementation of P3SO

In this article, we develop two kinds of P3SO algorithms with the OpenMP [43] library and the MPI [44] library, respectively. The OpenMP version is appropriate for shared-memory systems. As elaborated in the threads manage technique, when the population size becomes too large, the number of threads may become large too. However, if the number of threads is too large, the system cannot run these threads efficiently. Therefore, the MPI version is also developed. Different from OpenMP, MPI can make the P3SO algorithm run on different machines. Therefore, if the population size becomes too large, the MPI version can be adopted. One distinction is that in the OpenMP version, threads are created in a single machine, while in the MPI version, processes are created in multiple machines of a cluster. However, as these two versions of the P3SO algorithm have the same core concept, we only use the OpenMP version as an example to elaborate on the procedures of the P3SO algorithm.

As previously mentioned, the number of threads should be set first according to (4). Then, particles are initialized in the same way as that in serial LPSO. Besides, each particle possesses a property called gen_value to indicate which generation it currently belongs to. This property will be initialized with value 1. It is noteworthy that each thread processes one generation. According to the number of threads set before, a series of threads will be used to process different generations,


and these threads will be cyclically used in order to process all generations. In the beginning, only one thread is working. The others will be suspended. When a particle is ready for the next generation, its gen_value will be increased by 1. Then, a suspended thread will be woken up to process evolutions of particles for this generation, and a generation flag current_gen is also set to this thread as gen_value. Only the particles whose gen_value is the same as current_gen will be processed by this thread. More important, the wake-up operation will be performed only one time during one generation, and the index of the first prepared particle of this generation will be recorded as firstReady. For example, in Fig. 3, the firstReady of the second and third generations is particle 2 and 3, respectively. The woken-up thread will start working from the particle whose index is the same as firstReady. If a particle's gen_value is equal to the thread's current_gen, then the thread will begin on this particle to execute its evolutions, along with the other following particles. Otherwise, the thread will keep waiting until a particle's gen_value is equal to current_gen. When all particles complete their evolutions of one generation, the corresponding thread will be suspended. As the evolution progresses, threads are woken up and suspended alternately and work cooperatively until all generations are done. The pseudocode of P3SO is shown in Algorithm 1.

In order to more clearly describe the procedure of P3SO, an example is introduced in the following. Assume that the population size is 5, and the first three generations' evolutionary-progress diagram of the algorithm is the same as that in Fig. 3. With the help of (4), we can set the number of threads as 2. That is, two threads are used. Conveniently, indexes (i.e., 1, 2) are assigned to these two threads. At the beginning, each particle's gen_value property is set as 1. Assume that thread 1 is working first to process generation 1. As shown in Fig. 3, in generation 1, only thread 1 processes the first three particles (i.e., updating velocities, positions, and pBest values) while the gen_value of these particles remains unchanged. After the process of particle 3 is done, the lBest of particle 2 can be updated. Therefore, the property gen_value of particle 2 will then be increased by 1 and the value of firstReady will be set as 2, which is the index of particle 2. This means that particle 2 is ready to step into the next generation (i.e., generation 2). Therefore, thread 2 is woken up to process generation 2. As the indication of firstReady is 2, this thread will start working from particle 2. Besides, the value of current_gen of thread 2 will be set as the gen_value of particle 2 to note that thread 2 is processing generation gen_value. This procedure is performed only one time during each generation to ensure that each generation will be processed by one and only one thread. Therefore, when thread 2 processes particles in generation 2 (with gen_value = 2), thread 1 keeps processing the remaining particles, such as particles 4 and 5 in generation 1. After thread 2 completes the evolution of particle 2, thread 1 also finishes updating particle 4 (including its pBest). Therefore, particle 3 can be ready to be processed by thread 2 in generation 2. After thread 1 completes the evolutions of all particles, it will be suspended. Then, while thread 2 is processing particle 4, thread 1 will still be suspended because no particles are ready for generation 3. After thread 2 completes the evolution of particle 4, the gen_value of particle 3 has been updated as 3 and thread 1 will consequently be woken up to process evolutions in generation 3, with particle 3 as the firstReady particle. It is easy to know that when applying the pipeline technique in LPSO, the threads obey a special order to process particles (i.e., 1 → 2 → 3 → 4 → 5 for thread 1 with current_gen = 1, 2 → 3 → 4 → 5 → 1 for thread 2 with current_gen = 2, 3 → 4 → 5 → 1 → 2 for thread 1 with current_gen = 3, . . .). When all generations finish, the entire algorithm is done and all threads will stop. The pseudocode of these steps for particle updating is shown in Algorithm 2.

IV. EXPERIMENTAL VERIFICATION

In this article, we make a set of experiments to evaluate the performance of P3SO, including the speedup, the ability of solving problems, and a comparison of different versions of P3SO (i.e., the OpenMP version and the MPI version). All these experiments are implemented in the Ubuntu operating system. Besides, the program is written in C++ and the pipeline parallelization is implemented by using the OpenMP library or the MPI library [45]. All experiments are independently run 30 times to obtain the average results.

A. Benchmark Functions and Their Modifications

Generally, when adopting parallel techniques in an algorithm, the running time may be reduced, but the problem-solving ability should stay unchanged. Therefore, it is necessary to judge whether the pipeline technique will influence the performance in solving problems. To this aim, a set of ten benchmark functions is selected from [10] as shown in Table II, where f1–f4 are unimodal functions and f5–f10 are multimodal functions.

Moreover, to conveniently and evidently evaluate the speedup brought by the pipeline technique, a delay procedure in Algorithm 3 is used to modify the basic functions to make them somewhat like the fitness-evaluation-expensive problems. The duration of the delay time can be adjusted to simulate different kinds of evaluation-expensive problems.

B. Performance of P3SO in the OpenMP Version

1) Problem-Solving Ability Measuring: In this experiment, the ten original functions in Table II are used. Besides, the original LPSO is used to make a comparison with P3SO. The algorithm configurations are shown in Table III. The computer's configurations are Intel Core i7-7700 CPU 3.60 GHz, 8 GB of the
whether the gen_value of particle 3 is equal to current_gen. memory.
If not, thread 2 will keep waiting. In fact, thread 2 will not The mean and standard deviation of 30 independent runs are
keep waiting for a long time because when thread 2 processes shown in Table IV. Besides, we have conducted Wilcoxon’s
particle 2, thread 1 is processing particle 4 at the same time. rank-sum tests [46] with a significant level α = 0.05 to make
As the time costs needed for updating different particles are the comparison more rigorous. The symbols “+,” “≈,” and
almost the same, when thread 2 finishes updating particle 2, “−” mean that the P3 SO performs significantly better than,

Authorized licensed use limited to: Carleton University. Downloaded on June 05,2021 at 14:55:45 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

LI et al.: GENERATION-LEVEL PARALLELISM FOR EC: P3 SO 7

TABLE II
B ENCHMARK F UNCTIONS U SED IN E XPERIMENTS

TABLE III TABLE V


PARAMETER S ETTINGS IN THE E XPERIMENT C ONFIGURATIONS FOR M EASURING P3 SO’ S S PEEDUP

TABLE IV
C OMPARISON ON P ROBLEM -S OLVING A BILITY B ETWEEN P3 SO IN THE 2) Speedup Measuring: In order to obtain the speedup
O PEN MP V ERSION AND O RIGINAL LPSO of P3 SO, a modified f 1 is used. The delay time useconds
in Algorithm 3 is set as 5000 microsecond (μs). We make
a set of 18 cases experiments with different population sizes
to measure the speedup of P3 SO. Due to the change of the
population size, the number of threads used by the P3 SO
is also changed. The detailed configurations of parameters
are shown in Table V. The maximum generation G is 500.
The computer’s configurations are Intel Xeon Phi CPU 7250
3.60 GHz, 110 GB of the memory. This computer is a mul-
ticores machine and has enough processors to process a large
number of threads.
The time consumption and speedup are presented in
Table VI and Fig. 4, respectively. The results show that the
similar to, and significantly worse than the original LPSO, time consumption of P3 SO and the original LPSO has signifi-
respectively. As shown in Table IV, the performance of P3 SO cant differences. In all the 18 cases, P3 SO has a faster running
has no significant differences compared with the original speed. When the population size becomes larger and larger,
LPSO. This means that the pipeline technique has no influ- the speedup of P3 SO will become larger and larger relatively.
ence on the problem-solving ability of LPSO. It just provides More significantly, as the population size becomes larger and
the acceleration for LPSO. Besides, the results also show that larger, the time consumption of LPSO increases dramatically
the parallelization on the generation level is a possible method. and seems to become impossible for practical use. However,


TABLE VI
COMPARISONS ON TIME CONSUMPTION BETWEEN P3SO IN THE OPENMP VERSION AND ORIGINAL LPSO

TABLE VII
COMPARISON ON PROBLEM-SOLVING ABILITY BETWEEN P3SO IN THE MPI VERSION AND ORIGINAL LPSO

TABLE VIII
COMPARISONS ON TIME CONSUMPTION BETWEEN P3SO IN THE MPI VERSION AND ORIGINAL LPSO

Fig. 4. Speedup of P3SO in the OpenMP version with different population sizes.

the time consumption of P3SO increases slowly even though the population size becomes much larger.
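This slow growth comes from the generation-level handshake described in Section III. As a rough illustration, the following is our own simplified sketch of that handshake using std::thread rather than the paper's OpenMP code: the names gen_value and current_gen follow the paper, but the release rule "particle i may enter generation g + 1 once its right neighbor has been evolved in generation g" is a simplification of the ring-topology analysis, and the firstReady bookkeeping is omitted.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Simplified model of the P3SO handshake (not the authors' code):
// gen_value[i] is the next generation particle i may be evolved in, and
// each worker thread owns one generation at a time (its current_gen).
struct Pipeline {
    int np, generations;
    std::vector<int> gen_value;     // per-particle readiness, starts at 1
    std::vector<int> updates_done;  // how many times each particle was evolved
    std::mutex m;
    std::condition_variable cv;

    Pipeline(int np_, int gens)
        : np(np_), generations(gens), gen_value(np_, 1), updates_done(np_, 0) {}

    void worker(int first_gen, int stride) {
        for (int g = first_gen; g <= generations; g += stride) {  // current_gen
            for (int i = 0; i < np; ++i) {
                std::unique_lock<std::mutex> lk(m);
                // wait until particle i is ready for generation g
                cv.wait(lk, [&] { return gen_value[i] >= g; });
                ++updates_done[i];  // velocity/position/pBest update would go here
                // evolving particle i releases its left neighbor into g + 1
                if (i > 0) gen_value[i - 1] = g + 1;
                if (i == np - 1) gen_value[i] = g + 1;  // last particle frees itself
                cv.notify_all();
            }
        }
    }
};

// Run the pipeline with p threads; thread t handles generations t+1, t+1+p, ...
int run_pipeline(int np, int generations, int p) {
    Pipeline pl(np, generations);
    std::vector<std::thread> ts;
    for (int t = 0; t < p; ++t)
        ts.emplace_back(&Pipeline::worker, &pl, t + 1, p);
    for (auto& th : ts) th.join();
    int total = 0;
    for (int u : pl.updates_done) total += u;
    return total;  // should equal np * generations
}
```

Unlike the firstReady mechanism in the paper, each thread here simply scans from the first particle; the waiting predicate makes the resulting schedule equivalent, only slightly less efficient.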

C. Performance of P3SO in the MPI Version

P3SO in the MPI version can use many processors on different machines, not necessarily within a single machine. Therefore, a computer cluster is used to carry out the following experiments. Each computer in the cluster has an Intel Core i7-7700 CPU at 3.60 GHz and 8 GB of memory. In addition, the number of processes is given manually. More details about MPI can be found in [44].
Fig. 5. Speedup of P3SO in the MPI version with different population sizes.
1) Problem-Solving Ability Measuring: All ten original functions in Table II are also adopted in this experiment. Note that the algorithm parameter configurations are the same as those in the OpenMP version, as shown in Table III.

The performance results are shown in Table VII, showing that P3SO in the MPI version and the original LPSO have no significant difference in optimization performance. This also shows that P3SO can be implemented across multiple machines.

2) Speedup Measuring: The function and algorithm configurations are the same as those in the OpenMP version, which are shown in Table V. The results for P3SO in the MPI version are shown in Table VIII, and the speedups of all the cases are plotted in Fig. 5. As shown in Table VIII and Fig. 5, the speedup of P3SO increases when the population size NP increases. Besides, while the population size increases sharply, the time consumption of P3SO increases only smoothly. This is similar to P3SO in the OpenMP version.
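The smooth scaling observed for both versions can be made plausible with a back-of-envelope cost model. This model is our own assumption, not a formula from the paper: if one evaluation costs d, serial LPSO has NP·G evaluations on its critical path, whereas the pipeline can start a new generation after every three evaluations of the previous one, giving a critical path of roughly 3(G − 1) + NP evaluations.

```cpp
#include <cassert>

// Critical-path lengths in units of one evaluation's cost d (assumed model,
// not taken from the paper's analysis).
long serial_cost(int np, int g)   { return static_cast<long>(np) * g; }
long pipeline_cost(int np, int g) { return 3L * (g - 1) + np; }

// Predicted speedup of the pipeline over the serial run.
double model_speedup(int np, int g) {
    return static_cast<double>(serial_cost(np, g)) / pipeline_cost(np, g);
}
```

With NP = 20 and G = 500 this predicts a speedup of about 6.6, close to the measured value of over 6.5, and the prediction grows roughly linearly in NP, matching the trends in Figs. 4 and 5.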


TABLE IX
CONFIGURATIONS OF DIFFERENT DELAY TIMES AND PARAMETERS

TABLE X
SPEEDUPS OF TWO VERSIONS OF P3SO WITH DIFFERENT DELAY TIMES

Fig. 6. Speedup of P3SO with different delay times.

D. Comparisons Between the OpenMP Version and the MPI Version

Both the OpenMP version and the MPI version of P3SO show distinguished running efficiency. In this part, we compare them to figure out their differences.
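For reference, the delay-based modification that these comparisons rely on is sketched below. The paper's Algorithm 3 is not reproduced in this excerpt, so this is only an assumed shape: compute the true benchmark value (f1, the Sphere function, is used as the example), then stall for useconds microseconds so the evaluation behaves like an expensive one.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Assumed shape of the delay procedure (Algorithm 3): evaluate the base
// function, then sleep for `useconds` microseconds to emulate an
// evaluation-expensive problem.
double delayed_sphere(const std::vector<double>& x, long useconds) {
    double f = 0.0;
    for (double xi : x) f += xi * xi;  // f1(x) = sum of squares
    std::this_thread::sleep_for(std::chrono::microseconds(useconds));
    return f;
}
```

Because the delay is added after the true value is computed, it changes only the evaluation time, not the returned fitness, which is why the problem-solving results in Tables IV and VII are unaffected.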
1) Performances and Speedup With Different Delay Times: In this experiment, the function f1 is selected and modified with different delay times to simulate different time-consuming situations. That is, the delay time useconds in Algorithm 3 is set to different values in microseconds (μs), as in the 12 cases shown in Table IX. Moreover, the population size is fixed at 20, and therefore the number of needed threads p is 7. For each run, the maximum generation G is set to 500. In addition, the computer configuration is an Intel Core i7-7700 CPU at 3.60 GHz with 8 GB of memory. All algorithms are run 30 times to obtain the average results for the experimental studies.

The results are shown in Table X and Fig. 6, which show that both versions of P3SO with seven threads achieve a speedup of over 6.5. As the theoretical speedup is 7 (i.e., ⌈20/3⌉) according to the analyses in Section III-C for these cases, the speedup efficiency η of P3SO with both OpenMP and MPI is over 93%. Besides, as the delay time increases, the speedups of P3SO remain unchanged. This means that the delay time has no significant influence on the algorithm speedup.

2) Speedup Comparisons on Different Population Sizes: Moreover, the above results on the running time of the P3SO variants with OpenMP and MPI in Sections IV-B and IV-C, respectively, are compared in this part. Figs. 4 and 5 are combined as Fig. 7. As shown in Fig. 7, when the population size increases, the speedups of the two versions of P3SO increase accordingly. With a small population size, there is little difference in speedup between the two versions. However, as the population size becomes larger, the difference between the two versions becomes more and more obvious.

Fig. 7. Speedup of P3SO with different population sizes.

With a larger population size, the MPI version has better efficiency. This may be because processes are scheduled more efficiently than threads in this setting. Therefore, when the number of processes and threads grows, the difference in efficiency between multiprocessing (used by MPI) and multithreading (used by OpenMP) becomes more obvious. Hence, in applications, we only need to choose an appropriate version according to the platform: the OpenMP version can be used when the computer has a large number of processors, while the MPI version can be adopted if there are enough computer clusters. Besides, the P3SO algorithm may also be suitable for implementation on hardware such as a field-programmable gate array.

E. Performance of P3SO on Large-Scale Problems

To evaluate the effectiveness of P3SO on high-dimensional and large-scale optimization problems, the ten problems provided in Table II are extended to 1000-D problems for the experiments. The delay time is set to 100 ms (i.e., 1 × 10^5 μs) for all the problems. In the experiments, the population size NP of both P3SO and the original LPSO is 20 and the number of threads p for P3SO is 7. The maximum generation G is set to 1000 and all algorithms run 30 times to obtain the average results. In addition, the computer configuration is an Intel Core i7-7700 CPU at 3.60 GHz with 8 GB of memory.
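The thread counts and efficiency figures quoted in these experiments can be reproduced arithmetically, assuming (per the Section III-C analysis, which is not part of this excerpt) that with the ring topology a new generation may start once three particles of the previous one have finished, so that p = ⌈NP/3⌉ threads suffice and the theoretical speedup equals p.

```cpp
#include <cassert>

// Number of pipeline threads for population size np: ceil(np / 3)
// (our reading of the Section III-C analysis, not reproduced here).
int pipeline_threads(int np) { return (np + 2) / 3; }

// Speedup efficiency eta = measured speedup / theoretical speedup.
double speedup_efficiency(double measured, int np) {
    return measured / pipeline_threads(np);
}
```

This reproduces both thread counts used in the experiments: pipeline_threads(5) is 2, as in the Section III example, and pipeline_threads(20) is 7, as here; a measured speedup just above 6.5 with 7 threads then gives η ≈ 0.93, the "over 93%" figure quoted for Table X.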


TABLE XI
COMPARISON OF OPTIMIZATION RESULTS ON TEN LARGE-SCALE PROBLEMS AMONG ORIGINAL LPSO AND P3SO WITH OPENMP AND MPI

TABLE XII
SPEEDUPS OF TWO VERSIONS OF P3SO ON LARGE-SCALE PROBLEMS

Tables XI and XII provide the comparisons between P3SO and the original LPSO on the optimization results and speedups, respectively. The results in Table XI show that P3SO and the original LPSO have no significant differences in terms of the optimization results on large-scale optimization problems. However, as shown in Table XII, both versions of P3SO are much faster than the original LPSO and obtain speedups of around 6.5, which are similar to the situations on the 30-D problems (i.e., Table X). These results verify the efficiency and extendability of P3SO on large-scale optimization problems.

V. CONCLUSION

Many parallel versions of PSO have been proposed in the literature because of its natural parallelism. However, the parallelism of PSO is usually performed at the population level or the individual level. Inspired by the pipeline technique, we make the first attempt to introduce the pipeline idea into PSO in this article and propose the P3SO algorithm, in which parallelism is implemented at the generation level. Two different versions of P3SO (i.e., an OpenMP version and an MPI version) have been developed, and their efficiency has been evaluated by a set of experiments. As P3SO does not degrade the problem-solving ability and substantially accelerates the evolutionary speed, it may have significant potential applications in time-consuming optimization problems. Therefore, a promising future work is to apply the generation-level parallel P3SO algorithm to solve complex problems in real-world applications.

Moreover, although in this article we only use the ring-topology LPSO as an example to implement the generation-level parallel EC algorithm, the pipeline technique can also be applied to other EC algorithms in future work. In fact, it can be adopted in any EC algorithm in which the evolution of the next generation does not need the entire information of the current generation. For example, in the differential evolution (DE) algorithm with the ring topology [47], when an individual carries out the evolutionary operators, it only requires the information of some individuals in its neighborhood, which is suitable for the pipeline technique. More generally, for generic EC algorithms with multiple populations, where the evolution of each population does not need the information from all populations, the pipeline technique can also be adopted.

In addition, the proposed generation-level parallelism method is also worth studying in combination with parallelism at other levels. For example, when multiple populations run in parallel on different computational resources (population-level parallelism), each population can be further parallelized by the pipeline technique to execute different generations in parallel and reduce the running time (generation-level parallelism). Therefore, the proposed pipeline technique has great potential to be further studied to obtain more computationally efficient EC algorithms.

REFERENCES

[1] G. Venter, "Review of optimization techniques," in Encyclopedia of Aerospace Engineering. Chichester, U.K.: Wiley, 2010, pp. 1–12.
[2] Z. Z. Liu and Y. Wang, "Handling constrained multiobjective optimization problems with constraints in both the decision and objective spaces," IEEE Trans. Evol. Comput., vol. 23, no. 5, pp. 870–884, Jan. 2019.
[3] T. Hendtlass, "The EZopt optimisation framework," in Proc. IEEE Congr. Evol. Comput., 2019, pp. 3110–3117.
[4] Z. Z. Liu, Y. Wang, and B. Wang, "Indicator-based constrained multiobjective evolutionary algorithms," IEEE Trans. Syst., Man, Cybern., Syst., early access, Dec. 5, 2019, doi: 10.1109/TSMC.2019.2954491.
[5] Z. Z. Liu, Y. Wang, S. Yang, and K. Tang, "An adaptive framework to tune the coordinate systems in nature-inspired optimization algorithms," IEEE Trans. Cybern., vol. 49, no. 4, pp. 1403–1416, Mar. 2019.
[6] S. Chen, A. Bolufé-Röhler, J. Montgomery, and T. Hendtlass, "An analysis on the effect of selection on exploration in particle swarm optimization and differential evolution," in Proc. IEEE Congr. Evol. Comput., 2019, pp. 3037–3044.
[7] J. Zhang et al., "Evolutionary computation meets machine learning: A survey," IEEE Comput. Intell. Mag., vol. 6, no. 4, pp. 68–75, Nov. 2011.
[8] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proc. 6th Int. Symp. Micromach. Human Sci., 1995, pp. 39–43.
[9] Z. H. Zhan, J. Zhang, Y. Li, and H. S.-H. Chung, "Adaptive particle swarm optimization," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 6, pp. 1362–1381, Apr. 2009.
[10] Z. H. Zhan, J. Zhang, Y. Li, and Y. H. Shi, "Orthogonal learning particle swarm optimization," IEEE Trans. Evol. Comput., vol. 15, no. 6, pp. 832–847, Sep. 2011.
[11] X. Xia et al., "An expanded particle swarm optimization based on multi-exemplar and forgetting ability," Inf. Sci., vol. 508, pp. 105–120, Jan. 2020.
[12] G. Xu et al., "Particle swarm optimization based on dimensional learning strategy," Swarm Evol. Comput., vol. 45, pp. 33–51, Mar. 2019.
[13] X. Xia et al., "Triple archives particle swarm optimization," IEEE Trans. Cybern., early access, Oct. 11, 2019, doi: 10.1109/TCYB.2019.2943928.
[14] X. F. Liu, Z. H. Zhan, Y. Gao, J. Zhang, S. Kwong, and J. Zhang, "Coevolutionary particle swarm optimization with bottleneck objective learning strategy for many-objective optimization," IEEE Trans. Evol. Comput., vol. 23, no. 4, pp. 587–602, Oct. 2019.
[15] Z. J. Wang et al., "Dynamic group learning distributed particle swarm optimization for large-scale optimization and its application in cloud workflow scheduling," IEEE Trans. Cybern., vol. 50, no. 6, pp. 2715–2729, Jun. 2020.


[16] Y. Lin, Y. Jiang, Y. J. Gong, Z. H. Zhan, and J. Zhang, "A discrete multiobjective particle swarm optimizer for automated assembly of parallel cognitive diagnosis tests," IEEE Trans. Cybern., vol. 49, no. 7, pp. 2792–2805, Jul. 2019.
[17] X. Zeng, W. Wang, C. Chen, and G. G. Yen, "A consensus community-based particle swarm optimization for dynamic community detection," IEEE Trans. Cybern., vol. 50, no. 6, pp. 2502–2513, Jun. 2020.
[18] X. Zhang, K. J. Du, Z. H. Zhan, S. Kwong, T. Gu, and J. Zhang, "Cooperative coevolutionary bare-bones particle swarm optimization with function independent decomposition for large-scale supply chain network design with uncertainties," IEEE Trans. Cybern., vol. 50, no. 10, pp. 4454–4468, Oct. 2020, doi: 10.1109/TCYB.2019.2937565.
[19] J.-Y. Li, Z.-H. Zhan, C. Wang, H. Jin, and J. Zhang, "Boosting data-driven evolutionary algorithm with localized data generation," IEEE Trans. Evol. Comput., vol. 24, no. 5, pp. 923–937, Oct. 2020, doi: 10.1109/TEVC.2020.2979740.
[20] J. Y. Li, Z. H. Zhan, H. Wang, and J. Zhang, "Data-driven evolutionary algorithm with perturbation-based ensemble surrogates," IEEE Trans. Cybern., early access, Aug. 10, 2020, doi: 10.1109/TCYB.2020.3008280.
[21] Z. J. Wang et al., "Adaptive granularity learning distributed particle swarm optimization for large-scale optimization," IEEE Trans. Cybern., early access, doi: 10.1109/TCYB.2020.2977959.
[22] Y. Guo, J. Y. Li, and Z. H. Zhan, "Efficient hyperparameter optimization for convolution neural networks in deep learning: A distributed particle swarm optimization approach," Cybern. Syst., to be published, doi: 10.1080/01969722.2020.1827797.
[23] Z. H. Zhan, Z. J. Wang, H. Jin, and J. Zhang, "Adaptive distributed differential evolution," IEEE Trans. Cybern., early access, Oct. 21, 2019, doi: 10.1109/TCYB.2019.2944873.
[24] M. P. Wachowiak, M. C. Timson, and D. J. DuVal, "Adaptive particle swarm optimization with heterogeneous multicore parallelism and GPU acceleration," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 10, pp. 2784–2793, Mar. 2017.
[25] R. B. Schnabel, "A view of the limitations, opportunities, and challenges in parallel nonlinear optimization," Parallel Comput., vol. 21, no. 6, pp. 875–905, 1995.
[26] G. W. Zhang et al., "Parallel particle swarm optimization using message passing interface," in Proc. 18th Asia–Pac. Symp. Intell. Evol. Syst., 2014, pp. 55–64.
[27] X. F. Liu, Z. H. Zhan, J. H. Lin, and J. Zhang, "Parallel differential evolution based on distributed cloud computing resources for power electronic circuit optimization," in Proc. ACM Genet. Evol. Comput. Conf., 2016, pp. 117–118.
[28] N. Ma, X. F. Liu, Z. H. Zhan, J. H. Zhong, and J. Zhang, "Load balance aware distributed differential evolution for computationally expensive optimization problems," in Proc. Genet. Evol. Comput. Conf., 2017, pp. 209–210.
[29] Z. H. Zhan et al., "Cloudde: A heterogeneous differential evolution algorithm and its distributed cloud version," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 3, pp. 704–716, Aug. 2017.
[30] S. H. S. Ziyabari and A. Shahbahrami, "High performance implementation of APSO algorithm using GPU platform," in Proc. Int. Symp. Artif. Intell. Signal Process., 2015, pp. 196–200.
[31] S. Jam, A. Shahbahrami, and S. S. Ziyabari, "Parallel implementation of particle swarm optimization variants using graphics processing unit platform," Int. J. Eng., vol. 30, no. 1, pp. 48–56, 2017.
[32] N. Nedjah, R. de Moraes Calazan, and L. de Macedo Mourelle, "A fine-grained parallel particle swarm optimization on many-core and multicore architectures," in Proc. Int. Conf. Parallel Comput. Technol., 2017, pp. 215–224.
[33] J. Li, W. Wang, and X. Hu, "Parallel particle swarm optimization algorithm based on CUDA in the AWS cloud," in Proc. 9th Int. Conf. Front. Comput. Sci. Technol., 2015, pp. 8–12.
[34] O. Ivanov, B. C. Neagu, and M. Gavrilas, "A parallel PSO approach for optimal capacitor placement in electricity distribution networks," in Proc. Int. Conf. Mod. Power Syst., 2017, pp. 1–5.
[35] Q. Wu, F. Xiong, F. Wang, and Y. Xiong, "Parallel particle swarm optimization on a graphics processing unit with application to trajectory optimization," Eng. Optim., vol. 48, no. 10, pp. 1679–1692, 2016.
[36] T. O. Ting, J. Ma, K. S. Kim, and K. Huang, "Multicores and GPU utilization in parallel swarm algorithm for parameter estimation of photovoltaic cell model," Appl. Soft Comput., vol. 40, pp. 58–63, Mar. 2016.
[37] A. Ouyang, Z. Tang, X. Zhou, Y. Xu, G. Pan, and K. Li, "Parallel hybrid PSO with CUDA for 1D heat conduction equation," Comput. Fluids, vol. 110, pp. 198–210, Mar. 2015.
[38] Z. H. Liu, X. H. Li, L. H. Wu, S. W. Zhou, and K. Liu, "GPU-accelerated parallel coevolutionary algorithm for parameters identification and temperature monitoring in permanent magnet synchronous machines," IEEE Trans. Ind. Informat., vol. 11, no. 5, pp. 1220–1230, Apr. 2015.
[39] S. Bhaskaran, "Simulation analysis of a manufacturing supply chain," Decis. Sci., vol. 29, no. 3, pp. 633–657, 1998.
[40] S. Novack and A. Nicolau, "A hierarchical approach to instruction-level parallelization," Int. J. Parallel Program., vol. 23, no. 1, pp. 35–62, 1995.
[41] J. Crawford, "The execution pipeline of the Intel i486 CPU," in Proc. 35th IEEE Comput. Soc. Int. Conf. Intell. Leverage, 1990, pp. 254–258.
[42] Y. Shi and R. Eberhart, "A modified particle swarm optimizer," in Proc. IEEE World Congr. Comput. Intell., 1998, pp. 69–73.
[43] OpenMP Homepage. Accessed: Mar. 2, 2020. [Online]. Available: https://www.openmp.org/
[44] MPICH Homepage. Accessed: Mar. 2020. [Online]. Available: https://www.mpich.org/
[45] L. Shi, Z. H. Zhan, Z. J. Wang, J. Zhang, and J. Zhang, "Experimental study of distributed differential evolution based on different platforms," in Proc. Int. Conf. Bio Inspired Comput. Theor. Appl., 2017, pp. 476–486.
[46] W. Haynes, "Wilcoxon rank sum test," in Encyclopedia of Systems Biology. New York, NY, USA: Springer, 2013, pp. 2354–2355.
[47] M. G. H. Omran, A. P. Engelbrecht, and A. Salman, "Using the ring neighborhood topology with self-adaptive differential evolution," in Proc. Int. Conf. Nat. Comput., 2006, pp. 976–979.

Jian-Yu Li (Student Member, IEEE) received the B.S. degree in computer science and technology from the South China University of Technology, Guangzhou, China, in 2018, where he is currently pursuing the Ph.D. degree in computer science and technology with the School of Computer Science and Engineering.
His research interests mainly include computational intelligence, data-driven optimization, machine learning including deep learning, and their applications in real-world problems and in environments of distributed computing and big data.
Mr. Li has been invited as a Reviewer of the IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION and Neurocomputing.

Zhi-Hui Zhan (Senior Member, IEEE) received the bachelor's and Ph.D. degrees in computer science from Sun Yat-sen University, Guangzhou, China, in 2007 and 2013, respectively.
He is currently the Changjiang Scholar Young Professor with the School of Computer Science and Engineering, South China University of Technology, Guangzhou. His current research interests include evolutionary computation algorithms, swarm intelligence algorithms, and their applications in real-world problems and in environments of cloud computing and big data.
Dr. Zhan's doctoral dissertation was awarded the IEEE Computational Intelligence Society Outstanding Ph.D. Dissertation and the China Computer Federation Outstanding Ph.D. Dissertation. He was a recipient of the Outstanding Youth Science Foundation from the National Natural Science Foundation of China in 2018 and the Wu Wen-Jun Artificial Intelligence Excellent Youth from the Chinese Association for Artificial Intelligence in 2017. He is listed as one of the Most Cited Chinese Researchers in Computer Science. He is currently an Associate Editor of the IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, Neurocomputing, and the International Journal of Swarm Intelligence Research.

Run-Dong Liu (Student Member, IEEE) received the B.S. degree in electronic and information engineering from Xinjiang University, Ürümqi, China, in 2016, and the master's degree in computer science from the South China University of Technology, Guangzhou, China, in 2020.
His research interests are artificial intelligence, evolutionary computation algorithms, swarm intelligence, and their applications in real-world problems.


Chuan Wang received the bachelor's degree in computer science and the master's degree in education from Henan Normal University, Xinxiang, China, in 1999 and 2009, respectively.
He is currently an Associate Professor with the College of Software, Henan Normal University. His current research includes computational intelligence and its applications on intelligent information processing and big data.

Sam Kwong (Fellow, IEEE) received the B.S. degree in electrical engineering from the State University of New York at Buffalo, Buffalo, NY, USA, in 1983, the M.S. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1985, and the Ph.D. degree from the University of Hagen, Hagen, Germany, in 1996.
From 1985 to 1987, he was a Diagnostic Engineer with Control Data Canada, where he designed the diagnostic software to detect the manufactured faults of the VLSI chips in the Cyber 430 machine. He later joined Bell Northern Research, Ottawa, ON, Canada, as a Member of Scientific Staff. In 1990, he joined the Department of Electronics Engineering, City University of Hong Kong, Hong Kong, as a Lecturer. He is currently a Chair Professor with the Department of Computer Science. His research interests include pattern recognition, evolutionary computations, and video analytics.
Prof. Kwong was elevated to an IEEE Fellow for his contributions to optimization techniques for cybernetics and video coding in 2014. He is the Vice President of IEEE Systems, Man, and Cybernetics Magazine. He was also appointed as an IEEE Distinguished Lecturer of the IEEE SMC Society in March 2017. He is currently an Associate Editor of the IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION.

Jun Zhang (Fellow, IEEE) received the Ph.D. degree in electrical engineering from the City University of Hong Kong, Hong Kong, in 2002.
He is currently a Visiting Scholar with Victoria University, Melbourne, VIC, Australia. His current research interests include computational intelligence, cloud computing, high performance computing, operations research, and power electronic circuits.
Dr. Zhang was a recipient of the Changjiang Chair Professor from the Ministry of Education, China, in 2013, the China National Funds for Distinguished Young Scientists from the National Natural Science Foundation of China in 2011, and the First-Grade Award in Natural Science Research from the Ministry of Education, China, in 2009. He is currently an Associate Editor of the IEEE TRANSACTIONS ON CYBERNETICS, the IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, and the IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS.
