
2010 Sixth International Conference on Natural Computation (ICNC 2010)

An Improved Projection Pursuit Clustering Model and its Application Based on Quantum-behaved PSO

Qun Zhang 1, Xiujuan Lei 1*, Xu Huang 1, Aidong Zhang 2

1 College of Computer Science, Shaanxi Normal University, Xi'an, Shaanxi Province, China, 710062
2 Department of Computer Science and Engineering, State University of New York at Buffalo, New York, USA, NY 14260-2000

Abstract—Extracting information with biological significance from gene expression data is an important research direction, and clustering algorithms have been increasingly widely applied in this area. According to the characteristics of gene expression data, an improved projection pursuit clustering model is introduced, and Quantum-behaved Particle Swarm Optimization (QPSO) is put forward to find the optimal projection direction. The simulation results show that the improved strategy is feasible and effective. This method not only offers a new way to cluster massive high-dimensional data, but also provides a new approach for the cluster analysis of gene expression data.

Keywords—QPSO; projection pursuit; gene expression data; clustering

I. INTRODUCTION

Biology and life science, together with information science, are among the fastest developing fields. The integration of biology and information science, as the interdiscipline of these two subjects, has played an important role in genome research. Another new technology born of these two subjects, the gene chip, has already become a powerful means of extracting and exploring biological molecule information, and it will play an important role in the coming genome research [1]. The large amount of complicated data produced by DNA microarray laboratories brings severe challenges, which require accurate and reliable analysis tools. Clustering is one of the most widely used technologies in genome research; it aims at grouping genes that show similar expression patterns into the same set. Many studies have proved that clustering is effective for detecting co-regulated genes [2]. However, due to the extremely high dimensionality of microarray samples produced when gene tables are converted into data, the dimension disaster is unavoidable.

In the late 1960s and early 1970s, the projection pursuit (PP) model was proposed as a multivariate data processing method that finds the best projection to describe a data structure; numerical optimization is used to project high-dimensional data to a lower dimension [3,4]. The PP model can be used to solve the dimension disaster problem in gene data clustering, but the value of the density window directly affects the clustering results, and this value is set by experience or trials, which lacks theoretical support and carries a heavy computing load. This article modifies the basic PP model, using the standard deviation and the distances between projection values to build a projection indication function with only one weight factor. QPSO is adopted to optimize the projection indication function: the algorithm makes use of the global search ability of QPSO to find the best projection direction, along which the data are projected from high to low dimension and then clustered. Compared with the methods in the literature, our simulation results show that the QPSO-improved projection pursuit clustering algorithm is feasible: it not only solves the general data clustering problem, but also outperforms the traditional methods on gene expression spectral data.

II. BASIC THEORIES

A. Standard Particle Swarm Optimization Algorithm

Particle Swarm Optimization (PSO) [5] was proposed by Kennedy and Eberhart in 1995. Its idea is: PSO initializes a group of random particles and finds the best solution by iteration. In every generation, the particles renew themselves by tracking the pbest and gbest values. The update formulas are:

v_id^{l+1} = w * v_id^l + c_1 * rand() * (pbest_id^l - x_id^l) + c_2 * rand() * (gbest_id^l - x_id^l),   (1)

x_id^{l+1} = x_id^l + v_id^{l+1}.   (2)

Here, v_id is the dth dimension of the velocity of particle i, and x_id is the dth dimension of the position of particle i. c_1 and c_2 are constants, called learning factors, adjusting the maximum step length toward the global best and individual best particles; rand() is a random number between 0 and 1, and w is the inertia weight.

B. QPSO Algorithm

The QPSO algorithm [6] is a new PSO algorithm proposed from the angle of quantum mechanics. It has fewer parameters and is simpler and more effective than the PSO algorithm. The procedure is basically the same as PSO, differing only in the update formulas:

p = a * pbest(i) + (1 - a) * gbest,
mbest = (1/M) * Sum_{i=1}^{M} pbest(i),
b = 1 - (G/T) * 0.5,                                   (3)
x_id = p_id ± b * |mbest_id - x_id| * ln(1/u).
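As an illustration, the swarm update in (3) can be sketched in NumPy. This is a minimal sketch, not the authors' code: the function name qpso_step is ours, and the random factors a and u are drawn per dimension here.

```python
import numpy as np

def qpso_step(X, pbest, gbest, G, T, rng):
    """One QPSO generation for the whole swarm, following equation (3).

    X      : (M, d) current particle positions
    pbest  : (M, d) personal best positions
    gbest  : (d,)   global best position
    G, T   : current and maximum generation number
    """
    M, d = X.shape
    mbest = pbest.mean(axis=0)             # center of the personal bests
    b = 1.0 - (G / T) * 0.5                # contraction-expansion coefficient
    a = rng.random((M, d))                 # random attractor weights
    u = 1.0 - rng.random((M, d))           # in (0, 1], keeps ln(1/u) finite
    p = a * pbest + (1.0 - a) * gbest      # local attractor per particle
    sign = np.where(rng.random((M, d)) < 0.5, -1.0, 1.0)  # the ± in (3)
    # position update: no velocity term, unlike (1)-(2)
    return p + sign * b * np.abs(mbest - X) * np.log(1.0 / u)

rng = np.random.default_rng(0)
X = rng.random((20, 4))
X_new = qpso_step(X, X.copy(), X[0], G=1, T=100, rng=rng)
```

Each generation the whole swarm moves around the attractor p with no velocity memory, which is the key difference from the standard PSO update.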

978-1-4244-5961-2/10/$26.00 ©2010 IEEE 2581


In (3), a and u are two random numbers between 0 and 1, mbest is the center of the pbest positions, and b is a contraction-expansion coefficient that decreases linearly as QPSO converges. G is the current evolutionary generation and T is the maximum evolutionary generation. The positions of the particles are renewed by this formula, without a velocity term.

In the basic PSO, every particle searches from its current position toward the position decided by the individual best position and the global best position. This process is a mutual learning system of self-study and social teaching; it unavoidably searches for new knowledge only in the neighborhood of the direction decided by pbest and gbest, which limits the search space. This prevents particles from searching regions far away from the population, where the fitness might be better than the current best solution. QPSO is a complicated non-linear learning system: every particle can be distributed to any position with a certain probability, which might be better than the current global best position gbest, while mbest ensures that the population evolves stably. In this pattern, the convergence of the particles is related to the whole population, which ensures convergence and global search without increasing the time or space complexity. As a result, QPSO has better searching ability than PSO.

C. Clustering Criteria

If the sample set is X = {X_i, i = 1, 2, ..., n}, where X_i is a d-dimensional vector, clustering is to find a partition C = {C_1, C_2, ..., C_m} which meets:

X = Union_{i=1}^{m} C_i,
C_i != emptyset (i = 1, 2, ..., m),                          (4)
C_i intersect C_j = emptyset (i, j = 1, 2, ..., m; i != j),

and makes the clustering criterion function (5) reach its minimum value:

J_c = Sum_{k=1}^{m} Sum_{X_i in C_k} d(X_i, Z_k).            (5)

Z_k is the center of the kth class and d(X_i, Z_k) is the distance from a sample to its class center, so the clustering criterion function J_c is the summation of the distances from all samples to their centers. Here d(X_i, Z_k) is the Euclidean distance, d(X_i, Z_k) = ||X_i - Z_k||.

III. IMPROVEMENT OF THE PP CLUSTERING MODEL

The PP clustering method clusters with projection values. It is actually a dimension reduction technology, which turns a multi-dimensional problem into a one-dimensional one. The main idea is: project the high-dimensional data to a lower subspace by a certain kind of combination; use a projection indication function to evaluate the projection results and find the best projection direction, i.e. the one that makes the function value maximum or minimum; then analyze the data structure and cluster with the projection values. Building and optimizing the indication function is the key to the success of the PP method.

The standard deviation and spectral density are used to build the indication function in the basic PP model. The required spreading character of the projection values is: the projected points should be as dense as possible, ideally condensing to a few points, and these points should be as far from each other as possible. This method is effective for small-scale data sets, but not for those with high similarity, where it takes a long time and gives bad results. The method proposed in this article aims at large-scale and highly similar data. In order to handle different types of data, we combine the improved method with the basic method, which solves the problem of the basic method while keeping its original advantages; the application range is enlarged, and the computing efficiency and accuracy are theoretically increased.

A. Improved PP clustering model

Step 1: normalize the sample evaluation indications.

Suppose the sample indication values are {x*(i, j) | i = 1, 2, ..., n; j = 1, 2, ..., p}, where x*(i, j) is the jth indication value of the ith sample, n is the number of samples, and p is the number of indications. The normalization is as follows.

If the objective is the larger the better:

x(i, j) = (x*(i, j) - x_min(j)) / (x_max(j) - x_min(j)).     (6)

If the objective is the smaller the better:

x(i, j) = (x_max(j) - x*(i, j)) / (x_max(j) - x_min(j)).     (7)

x_max(j) and x_min(j) are the maximum and minimum of the jth indication value, and x(i, j) is the normalized sequence of indication values.

Step 2: build the projection indication function Q(a).

The PP method converts the p-dimensional data {x(i, j) | j = 1, 2, ..., p} into a one-dimensional value z(i) along the projection direction a = {a(1), a(2), a(3), ..., a(p)}:

z(i) = Sum_{j=1}^{p} a(j) * x(i, j),  i = 1, 2, ..., n.      (8)

K-means is then used to cluster according to {z(i) | i = 1, 2, ..., n}. The required spreading character of z(i) is: the projected points should be as dense as possible, ideally condensing to a few points, and these points should be as far from each other as possible. So the projection indication function is:

Q(a) = S_z * D_z,                                            (9)

S_z = sqrt( Sum_{i=1}^{n} (z(i) - E(z))^2 / (n - 1) ),       (10)
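Steps 1 and 2 can be sketched as follows. This is an illustrative NumPy sketch under our own naming, not the authors' code: the guard against constant columns in normalize is our addition, and basic_index implements the basic model of (9)-(11), with R the density window width and the unit step realized as a boolean mask.

```python
import numpy as np

def normalize(x, larger_is_better=True):
    """Min-max normalization of an (n, p) sample matrix, eqs. (6)/(7)."""
    xmin, xmax = x.min(axis=0), x.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)  # guard constant columns
    return (x - xmin) / span if larger_is_better else (xmax - x) / span

def project(x, a):
    """One-dimensional projection z(i) = sum_j a(j) x(i, j), eq. (8)."""
    return x @ a

def basic_index(z, R):
    """Basic indication function Q(a) = S_z * D_z, eqs. (9)-(11)."""
    n = len(z)
    Sz = np.sqrt(((z - z.mean()) ** 2).sum() / (n - 1))   # eq. (10)
    r = np.abs(z[:, None] - z[None, :])                   # r(i, j) = |z(i)-z(j)|
    Dz = ((R - r) * (r < R)).sum()                        # eq. (11), unit step
    return Sz * Dz
```

For example, normalizing a small (3, 2) matrix and projecting along a unit direction yields one value per sample, which basic_index then scores.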

D_z = Sum_{i=1}^{n} Sum_{j=1}^{n} (R - r(i, j)) * u(R - r(i, j)).     (11)

S_z is the standard deviation of z(i), and D_z is the local density of the projection values, where R is the density window width, r(i, j) = |z(i) - z(j)|, u(.) is the unit step function, and E(z) is the mean of the sequence {z(i) | i = 1, 2, ..., n}.

Formula (9) is the basic PP model. Simulation experiments show that this model is effective for small-scale data with distributed characteristics, but not for highly similar, large-scale data, so a new indication function is proposed. It requires the projection values z(i) to spread so that the points are as far from each other as possible while remaining as dense as possible inside each group. The new projection indication function is:

Q(a) = 1/S_z + alpha * D_z,
D_z = Sum_{i=1}^{n} Sum_{j=1}^{i-1} r(i, j).     (12)

In Formula (12), the indication function Q(a) is composed of two parts, the standard deviation and the summed distances of the projection values; alpha is a weight coefficient which, according to the experiments, has a balancing effect, keeping D_z from growing so large that it overwhelms the standard deviation term. These two parts are antithetic. The standard deviation S_z of the projection values stands for the overall degree of dispersion: the larger the standard deviation, the more dispersed the values; its reciprocal works in the opposite direction and requires the projected points to be dense. The summed distances of the projection values, on the other hand, require the projected points to be dispersed. This method mainly addresses large-scale, highly similar data and requires the projected data to be somewhat dispersed, which avoids points being too dense to classify.

In order to get the best result, we combine both indication functions according to the data characteristics: in the optimizing process, both the basic indication function and the new one are used, and the spreading characteristics of the data set decide which one to use first. If the data are dense, we use the improved method and then the basic one; otherwise, the reverse order. This combination makes good use of both functions and ensures the effect of the algorithm.

Step 3: optimize the projection indication function.

Once the sample indication values are fixed, Q(a) changes only with the projection direction a. Different projection directions reveal different structure characteristics, and the best projection direction is the one that shows as many structure characteristics of the high-dimensional data as possible. We find the best direction by optimizing the projection indication function, that is, maximizing the objective function

Max: Q(a) = 1/S_z + alpha * D_z     (13)

with the constraint

s.t. Sum_{j=1}^{p} a(j)^2 = 1.     (14)

This is a non-linear optimization with {a(j) | j = 1, 2, ..., p} as its optimization variables, which is not easy to handle with traditional optimization methods. Based on the QPSO algorithm, a real-number coding mode is used to solve this high-dimensional optimization problem, that is, to get the best direction when the objective function reaches its extreme value by optimizing the projection direction.

Step 4: clustering.

Put the best direction a from Step 3 into Formula (8) to get the projection value of every sample, {z(i) | i = 1, 2, ..., n}. Comparing any two values, the closer they are, the more likely they belong to one class. The K-means clustering method is used to cluster the projection values; if the values fall in one class, the original data samples belong to one class.

B. QPSO based PP clustering algorithm

This article uses a real-number coding mode in which one code stands for one projection direction: the position of every particle is a p-dimensional projection direction, and the structure of a particle code is x_1 x_2 ... x_p.

I) The fitness f(x) is decided by the projection objective function:

f(x) = Q(x).     (15)

II) The constraint is handled as:

x_i = x_i / sqrt( Sum_{i=1}^{p} x_i^2 ).     (16)

Here, x is the searched projection direction. Some illegal particles not satisfying Formula (14) might be generated in the process; they are repaired with Formula (16).

III) The flow of the QPSO based PP clustering algorithm is outlined below:

(a) Normalize the data by Formulas (6) and (7).
(b) Randomly generate a population satisfying Formula (14) and calculate the fitness of every particle according to Formula (15).
(c) Compare the fitness of every particle with its best position pbest_id; if a better position is found, replace pbest_id.
(d) Compare the fitness of every particle with the best position the whole group has ever found, gbest_id; if it is better, replace gbest_id.
(e) Renew the positions of the particles according to Formula (3) to generate a new population.
(f) Repair the illegal particles according to Formula (16).
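The improved index (12) and the constraint repair (16) can be sketched as follows; this is again an illustrative NumPy sketch with our own function names, not the authors' implementation.

```python
import numpy as np

def improved_index(z, alpha):
    """Improved indication function Q(a) = 1/S_z + alpha * D_z, eq. (12)."""
    n = len(z)
    Sz = np.sqrt(((z - z.mean()) ** 2).sum() / (n - 1))
    i, j = np.triu_indices(n, k=1)          # every pair i < j once
    Dz = np.abs(z[i] - z[j]).sum()          # sum of pairwise distances r(i, j)
    return 1.0 / Sz + alpha * Dz

def repair(a):
    """Project an illegal direction back onto the unit sphere, eq. (16)."""
    return a / np.sqrt((a ** 2).sum())
```

After repair, a direction always satisfies the constraint (14), so any particle drifting off the unit sphere during the QPSO update can be mapped back before its fitness is evaluated.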

(g) When the end condition is reached (a good enough solution or the maximum iteration number), stop; otherwise go back to (b).
(h) Calculate the projection values according to Formula (8), using the best projection direction found.
(i) Use the K-means clustering method to cluster the projected lower-dimensional data, i.e., the projection values.

In the process of K-means clustering there might be empty classes. If so, select a sample from one of the non-empty classes and put it into the empty class; this sample is the one farthest from its own class center. Repeat this action until there is no empty class.

IV. EXPERIMENTS AND RESULTS ANALYSIS

To compare with the basic PP clustering model, we combine the improved model with GA, PSO and QPSO [7,8,9] and run the simulations in MATLAB 7.0. The test data come from the Iris, Wine, Cancer and Iyer data sets, with 150, 178, 683 and 484 samples respectively. The experiment environment is an Intel Core 2 CPU, 1.86 GHz, 1 GB memory, MATLAB 7.0. The relevant parameters are initialized as follows: number of runs C = 30, maximum iteration number T = 100, population size N = 20, inertia weight w = 0.8, learning factors c1 = c2 = 2, prco = 0.8 and pmut = 0.08.

The comparison results of the three algorithms and two models are as follows:

TABLE I. BASIC MODEL TEST RESULT

Data   | Algorithm | Max | Min | Mean   | Success rate | Time(s)
Iris   | GA        | 144 | 135 | 143.70 | 89.23%       | 55.19
       | PSO       | 144 | 136 | 143.50 | 90.50%       | 27.80
       | QPSO      | 144 | 144 | 144.00 | 96.00%       | 29.27
Wine   | GA        | 143 | 110 | 119.67 | 67.23%       | 86.14
       | PSO       | 145 | 111 | 121.41 | 68.21%       | 41.92
       | QPSO      | 144 | 120 | 132.54 | 74.46%       | 41.17
Cancer | GA        | 651 | 578 | 605.81 | 88.70%       | 6550.3
       | PSO       | 655 | 582 | 611.35 | 89.51%       | 3619.3
       | QPSO      | 653 | 611 | 625.94 | 91.65%       | 3334.1
Iyer   | GA        | 473 | 356 | 408.81 | 84.46%       | 2035.5
       | PSO       | 429 | 394 | 415.92 | 85.93%       | 1024.1
       | QPSO      | 432 | 392 | 419.88 | 86.75%       | 1141.3

TABLE II. IMPROVED MODEL TEST RESULT

Data   | Algorithm | Max | Min | Mean   | Success rate | Time(s)
Iris   | GA        | 144 | 144 | 144.00 | 96.00%       | 25.41
       | PSO       | 145 | 144 | 144.21 | 96.14%       | 13.12
       | QPSO      | 145 | 144 | 144.52 | 96.35%       | 13.28
Wine   | GA        | 140 | 118 | 122.13 | 68.61%       | 44.88
       | PSO       | 146 | 123 | 138.15 | 77.61%       | 26.51
       | QPSO      | 146 | 140 | 143.38 | 80.55%       | 25.56
Cancer | GA        | 665 | 657 | 660.43 | 96.70%       | 1701.6
       | PSO       | 663 | 659 | 661.57 | 96.86%       | 866.61
       | QPSO      | 663 | 663 | 663.00 | 97.07%       | 853.51
Iyer   | GA        | 463 | 360 | 410.11 | 84.73%       | 827.42
       | PSO       | 462 | 390 | 418.53 | 86.47%       | 435.88
       | QPSO      | 469 | 392 | 429.70 | 88.78%       | 442.04

The statistical results of Tables I and II show: (1) for the same optimizer, the improved PP clustering model improves the accuracy, and the most apparent improvement is the reduced runtime; (2) for the same clustering model, QPSO is the best, which proves that QPSO has better global searching ability while costing about the same runtime as PSO, whereas GA is worse than both and consumes about twice the time. These results show that the algorithm combining QPSO with the improved PP clustering model is the best in both result quality and runtime.

The clustering effects of the QPSO based improved PP algorithm on the four data sets are shown in Figures 1-4:

Figure 1. Iris clustering effect
Figure 2. Wine clustering effect
Figure 3. Cancer clustering effect
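The empty-class repair described for the K-means stage in step (i) can be sketched as follows; this is an illustrative sketch on one-dimensional projection values with our own function name, assuming at least k samples.

```python
import numpy as np

def fix_empty_classes(z, labels, k):
    """Repair empty K-means classes on 1-D projection values z.

    An empty class receives the sample that lies farthest from the
    center of its own (non-empty) class; repeat until no class is empty.
    """
    labels = labels.copy()
    while True:
        empty = [c for c in range(k) if not np.any(labels == c)]
        if not empty:
            return labels
        # center of every class; empty classes get inf so they are never chosen
        centers = np.array([z[labels == c].mean() if np.any(labels == c)
                            else np.inf for c in range(k)])
        dist = np.abs(z - centers[labels])   # distance to own class center
        labels[np.argmax(dist)] = empty[0]   # move the farthest sample
```

For example, with projection values [0, 0.1, 5, 5.1, 10] labeled {0, 0, 1, 1, 1} and k = 3, the outlying value 10 is moved into the empty third class.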

Figure 4. Iyer clustering effect

Figures 1-4 plot the projection values of the respective data sets. The abscissa indicates the sample sequence; the ordinate indicates the projection value of every sample. It can be seen clearly from the images that the samples are clustered step by step into several classes according to their different projection values. The results are close to the real classes, with only a few samples failing to project near the centers, and they agree exactly with the statistical data in the tables.

The results prove that the proposed algorithm is effective for large-scale cancer cell and gene expression spectral data. Moreover, it can obtain a better result while reducing the calculation amount and time, which provides a new method for the clustering of gene expression data.

V. CONCLUSION

Biological data demand different kinds of clustering algorithms, including the ability to deal with high-dimensional data and different data types, to find all kinds of data shapes, etc. Accordingly, the improved PP clustering model is proposed and shows its effectiveness and feasibility on different data types. In the future, we will study more appropriate clustering algorithms according to the data type, the clustering objects and the specific application needs.

ACKNOWLEDGMENT

This work was supported by the Fundamental Research Funds for the Central Universities, Shaanxi Normal University (GK200902016) and the Shaanxi Provincial Natural Science and Technology Program: The Research of New Clustering Algorithms of Protein-Protein Interaction Network Based on Particle Swarm Optimization.

REFERENCES

[1] Xiao Sun, Ye Wang, Nongyue He. "Application of Bioinformatics to Gene Chip". Acta Biophysica Sinica, 2001, 17(1): 27-34.
[2] Fugang Wang, Xiannong Chan. "Clustering in DNA chip data analysis". Foreign Medical Sciences (Biomedical Engineering Fascicle), 2004, 27(2): 98-101.
[3] Friedman J H, Tukey J W. "A projection pursuit algorithm for exploratory data analysis". IEEE Transactions on Computers, 1974, 23(9): 881-890.
[4] Qiang Fu, Xiaoyong Zhao. Theory and Application of Projection Pursuit Model. Beijing: Science Press, 2006.
[5] Kennedy J, Eberhart R C. "Particle swarm optimization". Proc. of IEEE Int'l Conf. on Neural Networks, IV. Piscataway, NJ: IEEE Press, 1995: 1942-1948.
[6] Sun Jun, Feng Bin, Xu Wenbo. "A global search strategy of quantum-behaved particle swarm optimization". IEEE Conference on Cybernetics and Intelligent Systems. Piscataway, NJ: IEEE Press, 2004: 111-116.
[7] Van der Merwe D W, Engelbrecht A P. "Data clustering using particle swarm optimization". Proceedings of the IEEE International Joint Conference, 2003: 215-220.
[8] Mengshu Wu, Xizhi Wu. "Projection Pursuit Clustering Method Based on Genetic Algorithm". Statistics & Information Forum, 2008, 23(3): 19-22.
[9] Guangzhou Chan, Jiaquan Wang, Huaming Jie. "Application of Particle Swarm Optimization for Solving Optimization Problem of Projection Pursuit Modeling". Computer Simulation, 2008, 25(8).

Qun Zhang: born in June 1986; a graduate student at Shaanxi Normal University. Her main research fields involve intelligent optimization and pattern recognition.

Xiujuan Lei: born in May 1975; an associate professor at Shaanxi Normal University. She has published about 30 papers. Her main research fields involve intelligent optimization, bioinformatics, etc.

Xu Huang: born in July 1985; a graduate student at Shaanxi Normal University. His main research fields involve intelligent optimization and clustering.

Aidong Zhang: a professor at the State University of New York at Buffalo. She has published more than 200 papers and 3 books. Her main research fields involve bioinformatics, data mining, database systems, etc.

