You are on page 1of 5

DEPSO MODEL FOR EFFICIENT CLUSTERING USING DRIFTING CONCEPTS.

Dr. J.S.Kanchana1, Dr.N.Rajathi2


1
Department of Information Technology, K.L.N. College of Engineering, Pottapalayam, Tamil Nadu, India
2
Department of Information Technology, Kumaraguru College of Technology Coimbatore, Tamil Nadu, India

Abstract

Data mining is the procedure of mining automated, prospective tools typical of


knowledge from data. The information decision support systems. Data mining
or knowledge extracted so can be used tools can answer business questions that
for Market Analysis, Customer traditionally were too time consuming to
Retention. As a data mining function, resolve. The techniques can be
cluster analysis serves as a tool to gain implemented rapidly on existing software
insight into the distribution of data to and hardware platforms to enhance the
observe characteristics of each cluster. value of existing information-resources,
Clustering is the process of making a and can be integrated with new products
group of abstract objects into classes of and systems as they are brought on-line.
similar objects. An Existing system uses The objective of clustering is to
iterative optimization algorithm for grouping similar data items into clusters so
clustering the data objects with drifting that items in the same cluster have high
concepts using some cluster validity similarity but are dissimilar with item in
function to evaluate the effectiveness of other clusters. The process of clustering
the clustering model while each new the entire dataset without performing
input data subset is flowing. The drifting concepts are in considered. Such
Proposed system uses Differential process is not only decreasing the quality
Evolutionary Particle Swarm of the cluster but also disregards the
Optimization (DEPSO) model for expectations of user that they need only
effectively clustering the several real the recent clustering result. The volume of
data sets with drifting concepts. data stream is huge so storing and taking
the entire data stream is expensive.
Keywords: Cluster analysis, Differential Therefore, well defined technique is used
evolutionary, Particle Swarm Optimization, for effectively and efficiently clustering
Drifting Concepts
dataset with drifting concepts. In this paper
we use differential evolutionary particle
1 INTRODUCTION swarm optimization algorithm for
clustering the dataset with drifting
Today’s data mining, it is helpful for concepts.
the extraction of hidden predictive
information from large databases, it is an
powerful new technology with great 2. RELATED WORK:
potential to help companies focus on the
most important in their data warehouses. Liang Bai [3] in “An optimization
Data mining tools predict future trends and model for clustering categorical data
behaviors, allowing business to make streams with drifting concepts” had
proactive. knowledge based decisions. The observed that there is always a lack of a
2
cluster validity function and optimization without having to look at every point that
strategy to find out clusters and catch the has been clustered so far.
evolution trend of cluster structures on a
categorical data stream. Therefore, this Fuyuan Cao in “A new initialization
paper presents an optimization model for method for categorical data clustering” has
clustering categorical data streams. An revealed that a novel initialization method
iterative optimization algorithm is for categorical data is proposed, in which
proposed to solve an optimal solution of the distance between objects and the
the objective function with some density of the object is considered. We also
constraints. apply the proposed initialization method to
Hung-Leng Chen in “Catching the trend: k-modes algorithm and fuzzy k-modes
A framework for clustering concept- algorithm.
drifting categorical data” has revealed that
sampling has been recognized as an
important technique to improve the
efficiency of clustering. Mechanism 3. PROPOSED WORK:
named Maximal Resemblance Data
Labeling (abbreviated as MARDL) is 3.1. BLOCK DIAGRAM
proposed to allocate each unlabeled data
point into the corresponding appropriate The block diagram of the proposed
cluster based on the novel categorical system is presented in figure 1.
clustering representative, namely, N-Node
set Importance
Representative(abbreviated as NNIR),
which represents clusters by the
importance of the combinations of
attribute values.

Fuyuan Caoin in “A framework for


clustering categorical time-evolving data”
had observed that, the problem of
clustering categorical time-evolving data
remains as a challenging issue. In this
paper, we propose a generalized clustering
framework for categorical time-evolving
data. The time-complexity analysis
indicates that these proposed algorithms
are effective for large datasets.

Daniel Barbara in “COOLCAT: An


entropy-based algorithm for categorical
“had investigated COOLCAT is well
equipped to deal with clustering of data
streams (continuously arriving streams of
data point) since it is an incremental Fig 1. Working of the Proposed System
algorithm capable of clustering new points
7
3.2. CLUSTERING FRAMEWORK in the construction of the new clustering
model. In the case, we need to re-cluster
In order to cluster categorical data the new input data subset. we consider the
streams and detect the drifting concepts, following two factors to find out the
we apply a clustering framework based on drifting concepts such as distribution
the sliding window technique. The sliding variation and certainty variation.
window technique conveniently eliminates The distribution variation may reduce
the outdated records and only saves the or enhance the uncertainty of the
clustering models, which is utilized in clustering model for cluster
several previous works on clustering time- representatives. If the uncertainty is
evolving data. Therefore, based on the reduced, it is thought that the concepts do
technique, we can cluster the latest data not drift, although the distribution
objects in the current window and catch variation is large.
the evolution trend of cluster structures on
the data stream. The different partition
results maybe lead to the different 3.5. CLUSTERING ACCURACY
detection results of drifting concepts. EVALUATION
Therefore, we need the optimization model
to find out the optimal partition result for To evaluate the performance of
the new input data subset. clustering algorithms in the experiment,
we consider the three validity measures
3.3. DEPSO ALGORITHM accuracy, precision and recall and the
formulas are as follows
DE-PSO starts like the usual DE 1) accuracy(AC)
algorithm up to the point where the trial
vector is generated. If the trial vector
2) precision (PE)
satisfies the conditions, then it is included
in the population otherwise the algorithm
enters the PSO phase and generates a new
3) recall (RE)
candidate solution. The method is repeated
iteratively till the optimum value is RE=
reached. The inclusion of PSO phase
creates a perturbation in the population,
which in turn helps in maintaining 4. EXPERIMENTAL SETUP
diversity of the population and producing a
good optimal solution. Here we can use KDD-CUP’99 data
stream for our experiment. Table 1 shows
3.4. THE DRIFTING CONCEPT the computational times against the
DETECTION numbers of clusters. According to the
After clustering new input data table, the DEPSO algorithm requires more
subsets, we need to analyze the change computational times than iterative
situation between the new and last algorithms.
clustering models, in order to determine
whether the drifting concepts occur. While
the concepts have drifted, the last
clustering model is not used to participate
4
effectively clustering with drifting
No. of Iterative DEPSO concepts was presented. Finally, the
Iteration Algorithm Algorithm performance of the proposed algorithm is
tested and the experimental results have
5 169.12 172.7 shown that the proposed algorithm is
effective in clustering the data streams and
10 256.88 262.34
the detection results based on the proposed
15 371.69 385.52 method are reliable. This work has a very
vast scope in future and it can be
20 458.69 467.21 implemented on other new heuristic
algorithms in future. It can be updated in
Table 1. Computational Times (Seconds) near future as and when requirement for
of Clustering Algorithms for Different the same arises, as it is very flexible in
Numbers of Clusters terms of efficiency.

The figure 2 shows the computational


time of iterative algorithm and DEPSO REFERENCES
algorithm for different number of clusters.
The DEPSO algorithm shows the better [1] Liang Bai, Xueqi Cheng, Member,
computational time when compared with IEEE, Jiye Liang, and Huawei Shen., “An
iterative algorithm. Optimization Model for Clustering
Categorical Data Streams with Drifting
Concepts”, IEEE Transactions on
Knowledge and Data Engineering, Vol.
28 , Issue. 11, Nov. 2016.
[2] L. Bai, J. Liang, C. Dang, and F.
Cao, “The impact of cluster
representatives on the convergence of the
K-modes type clustering,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 35, no. 6,
pp. 1509– 1522, Jun. 2013.
[3] L. Bai, J. Liang, and C. Dang, “An
initialization method to simultaneously
find initial cluster centers and the number
of clusters for clustering categorical data,”
Knowl.-Based Syst., vol. 24, no. 6, pp.
Fig.2. Computational Times (Seconds) of 785–795, 2011.
Clustering Algorithms for Different [4] S. Ho and H. Wechsler, “A
Numbers of Clusters martingale framework for detecting
changes in data streams by testing
exchangeability,” IEEE Trans. Pattern
5. CONCLUSION Anal. Mach. Intell., vol. 32, no. 20, pp.
2113–2127, Dec 2010.
In this paper, Differential evolutionary [5] F. Cao, J. Liang, and L. Bai, “A
particle swarm optimization algorithm for framework for clustering categorical time-
7
evolving data,” IEEE Trans. Fuzzy Syst.,
vol. 18, no. 5, pp. 872–885, Oct. 2010.
[6] K. Chen and L. Liu, “HE-tree: A
framework for detecting changes in
clustering structure for categorical data
streams,” Int. J. Very Large Data Bases,
vol. 18, no. 5, pp. 1241–1260, 2009.
[7] H. Chen, M. Chen, and S. Lin,
“Catching the trend: A framework for
clustering concept-drifting categorical
data,” IEEE Trans. Knowl. Data Eng., vol.
21, no. 5, pp. 652–665, May 2009.

You might also like