(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010
A Robust Knowledge-Guided Fusion of Clustering Ensembles
Anandhi R J
Research Scholar, Dept of CSE, Dr MGR University, Chennai, India
rjanandhi@hotmail.com

Dr Natarajan Subramaniyan
Professor, Dept of ISE, PES Institute of Technology, Bangalore, India
snatarajan44@gmail.com
Abstract— Discovering interesting, implicit knowledge and general relationships in geographic information databases is very important for understanding and using spatial data. Spatial clustering has been recognized as a primary data mining method for knowledge discovery in spatial databases. In this paper, we show that a guided approach to combining the outputs of various clusterers reduces the intensive computations involved and yields more robust clusters. We describe our proposed layered cluster merging technique for spatial datasets and use it in our three-phase clustering combination technique. At the first level, m heterogeneous ensembles are run against the same spatial data set to generate results B1, ..., Bm. The major challenge in fusion of ensembles is the generation of a voting matrix or proximity matrix, which is of order n^2, where n is the number of data points; this is very expensive in both time and space for spatial datasets. Instead, our method computes a symmetric clusterer compatibility matrix of order m x m, where m is the number of clusterers and m << n, using the cumulative similarity between the clusters of the clusterers. This matrix is used to identify which two clusterers, if considered for fusion first, will provide the most information gain. As we travel down the layered merge, we calculate for every layer a factor called the Degree of Agreement (DOA), based on the agreeing clusterers. Using the DOA updated at every layer, the movement of unresolved, unsettled data elements is handled at a much reduced computational cost. In addition, we prune the datasets after every (m-1)/2 layers using the knowledge gained in the previous layers, which leads to faster convergence than existing cluster aggregation techniques. The correctness and efficiency of the proposed cluster ensemble algorithm are demonstrated on real-world datasets available in the UCI data repository.
 Keywords- Clustering ensembles, Spatial Data mining, Degree of Agreement, Cluster Compatibility matrix.
I. INTRODUCTION
With a variety of applications, large amounts of spatial and related non-spatial data are collected and stored in geographic information databases. Spatial data mining [1], i.e., discovering interesting, implicit knowledge and general relationships in large spatial databases, is an important task for understanding and using these spatial data. With the rapid growth in the size and number of available databases in commercial, industrial, administrative and other applications, it is necessary and interesting to examine how to extract knowledge automatically from huge amounts of data. Very large data sets present a challenge for both humans and machine learning algorithms. Machine learning algorithms can be inundated by the flood of data and become very slow in knowledge extraction. Moreover, along with the large amount of data available, there is also a compelling need for producing results accurately and fast.

Efficiency and scalability are, indeed, the key issues when designing data mining systems for very large data sets. Through the extraction of knowledge from databases, large databases can serve as a rich, reliable source for knowledge generation and verification, and the discovered knowledge can be applied to information management, query processing, decision making, process control and many other applications. Therefore, data mining has been considered one of the most important topics in databases by many database researchers.

Spatial data describes information related to the space occupied by objects. It consists of 2D or 3D points, polygons, etc., or points in some d-dimensional feature space, and it can be either discrete or continuous. Discrete spatial data might be a single point in multi-dimensional space, while continuous spatial data spans a region of space. This data might consist of medical images or map regions, and it can be managed through spatial databases [8].

Clustering [17] groups analogous elements in a data set according to their similarity, such that elements within each cluster are similar while elements from different clusters are dissimilar. It does not require class label information about the data set because it is inherently a data-driven approach. Thus, the most interesting and well-developed method of manipulating and cleaning spatial data in order to prepare it for spatial data mining analysis is clustering, which has been recognized as a primary data mining method for knowledge discovery in spatial databases [4-7].

Clustering fusion is the integration of results from various clustering algorithms using a consensus function to yield stable results. Clustering fusion approaches are receiving increasing attention for their capability of improving clustering performance. At present, the usual operational mechanism for
clustering fusion is the combining of clusterer outputs. One tool for such combining or consolidation of results from a portfolio of individual clustering results is a cluster ensemble [13]. It has been shown to be useful in a variety of contexts such as "Quality and Robustness" [3], "Knowledge Reuse" [13, 14], and "Distributed Computing" [9].

The rest of the paper is organized as follows. Related work is presented in Section 2. The proposed knowledge guided fusion ensemble technique is described in Section 3. In Section 4, we present the experimental test platform and results with discussion. Finally, we conclude with a summary and our planned future work in this area of research.

II. RELATED WORK
A. Literature on Clustering Algorithms
Many clustering algorithms have been developed, and they can be roughly classified into hierarchical and non-hierarchical approaches. Non-hierarchical approaches can further be divided into four categories: partitioning methods, density-based methods, grid-based methods, and model-based methods. Hierarchical algorithms can be further divided into agglomerative and divisive algorithms, corresponding to bottom-up and top-down strategies for building a hierarchical clustering tree.

Spatial data mining, or knowledge discovery in spatial databases, refers to the extraction from spatial databases of implicit knowledge, spatial relations, or other patterns that are not explicitly stored [8, 10]. The large size and high dimensionality of spatial data make the complex patterns that lurk in the data hard to find. It is expected that the coming years will witness a very large number of objects that are location enabled to varying degrees. Spatial clustering [8] has been used as an important process in areas such as geographic analysis, exploring data from sensor networks, traffic control, and environmental studies. Spatial data clustering has been identified as an important technique for many applications, and several techniques have been proposed over the past decade based on density-based strategies, random walks, grid-based strategies, and brute force exhaustive searching methods [5]. This paper deals with fusion of spatial cluster ensembles using a guided approach to reduce the space complexity of such fusion algorithms.

Spatial data is about instances located in a physical space. Spatial clustering aims to group similar objects into the same group considering the spatial attributes of the objects. The existing spatial clustering algorithms in the literature focus exclusively either on the spatial distances or on minimizing the distances between object attribute pairs; i.e., the locations are treated as just another attribute, or the non-spatial attribute distances are ignored. Much activity in spatial clustering focuses on clustering objects based on their location nearness to each other [5]. Finding clusters in spatial data is an active research area, and current non-spatial clustering algorithms are being applied to the spatial domain, with recent applications and results reported on the effectiveness and scalability of the algorithms [8, 16]. Partitioning algorithms are best suited to problems where minimization of a distance function is required, and a common measure used in such algorithms is the Euclidean distance.

Recently, a new set of spatial clustering algorithms has been proposed that offers faster methods to find clusters with overlapping densities. DBSCAN, GDBSCAN and DBRS are density-based spatial clustering algorithms, but each performs best only on particular types of datasets [17]. However, these algorithms also ignore non-spatial attribute participation and require user-defined parameters. For large-scale spatial databases, the current density-based clustering algorithms can be expensive, as they require a large volume of memory to support their operations over the entire database. Another disadvantage is that the input parameters required by these algorithms are based on experimental evaluations. There is large interest in automating the general purpose clustering approach without user intervention. However, it is difficult to expect accurate results from these algorithms, as each one has its own shortfalls.
B. Literature on Clustering Ensembles
A clustering ensemble is a method to combine several runs of different clustering algorithms to get an optimal partition of the original dataset. Given a dataset X = {x1, x2, ..., xn}, a cluster ensemble is a set of clustering solutions, represented as P = {P1, P2, ..., Pr}, where r is the ensemble size, i.e., the number of clusterings in the ensemble. The clustering-ensemble approach first obtains the results of m clusterers, then sets up a consensus function to fuse the label vectors and produce the final labeled vector. The goal of a cluster ensemble is to combine the clustering results of multiple clustering algorithms to obtain better quality and more robust clustering results. Even though many clustering algorithms have been developed, not much work has been done on cluster ensembles in the data mining and machine learning communities.

Strehl and Ghosh [13, 14] proposed a hypergraph-partitioning approach to combine different clustering results by treating each cluster in an individual clustering as a hyperedge. All three of their proposed algorithms approach the problem by first transforming the set of clusterings into a hypergraph representation. The Cluster-based Similarity Partitioning Algorithm (CSPA) uses the relationship between objects in the same cluster to establish a measure of pairwise similarity. In the HyperGraph Partitioning Algorithm (HGPA), the maximum mutual information objective is approximated with a constrained minimum cut objective. In their Meta-CLustering Algorithm (MCLA), the objective of integration is viewed as a cluster correspondence problem.

Kai Kang, Hua-Xiang Zhang and Ying Fan [6] formulated the process of cooperation between component clusterers and proposed a novel cluster ensemble learning technique based on dynamic cooperation (DCEA). The approach is mainly concerned with how the component clusterers fully cooperate during the training of the component clusterers.

Fred and Jain [2] used a co-association matrix to form the final partition; they applied hierarchical (single-link) clustering to the co-association matrix. Zeng, Tang, Garcia-Frias and Gao [18] proposed an adaptive meta-clustering approach for
combining different clustering results by using a distance matrix.
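For contrast with the m x m approach proposed in this paper, the following is a minimal sketch of the co-association idea attributed above to Fred and Jain [2]: every pair of objects accumulates a vote whenever a base clustering places them together, and single-link clustering is applied to the resulting n x n matrix. The function name and the SciPy-based single-link step are illustrative assumptions, not the original implementation; note the O(n^2) memory cost that motivates the compatibility-matrix alternative.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def co_association_fusion(labelings, n_final_clusters):
    # labelings: list of 1-D integer label vectors, one per base clusterer.
    labelings = [np.asarray(l) for l in labelings]
    n = labelings[0].size
    co = np.zeros((n, n))
    for labels in labelings:
        # Vote for every pair of objects this clusterer groups together.
        co += (labels[:, None] == labels[None, :]).astype(float)
    co /= len(labelings)                      # fraction of clusterers in agreement
    dist = 1.0 - co                           # strong agreement -> small distance
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="single")
    return fcluster(tree, t=n_final_clusters, criterion="maxclust")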
C. Fusion Framework
We begin our discussion of the guided ensemble fusion framework by presenting our notation. Let us consider a set of n data objects, D = {v1, ..., vn}. A clustering C of dataset D is a partition of D into k disjoint sets C1, ..., Ck. In the sequel we consider m clusterings; we write Bi to denote the i-th clustering, and ki for the number of clusters of Bi. In the clustering fusion problem, the task is to find a clustering that maximizes the agreement of related items with a number of already-existing clusterings [4].
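To make this notation concrete, here is a small sketch, under our own assumption that each clustering Bi is stored as a label vector over the n objects; the helper name is illustrative.

import numpy as np

def as_partition(labels):
    # View a label vector B_i as its partition {C_1, ..., C_ki} of object indices.
    labels = np.asarray(labels)
    return {c: np.flatnonzero(labels == c) for c in np.unique(labels)}

# Two clusterings B_1, B_2 of the same n = 6 objects, stored as label vectors.
B = [np.array([0, 0, 1, 1, 2, 2]),   # B_1 with k_1 = 3 clusters
     np.array([0, 0, 0, 1, 1, 1])]   # B_2 with k_2 = 2 clusters
partitions = [as_partition(b) for b in B]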
D. Definitions
 
Fusion Joint set, FJij: the set of matching cluster pairs between the i-th clusterer and the j-th clusterer. For instance, FJ12 refers to the probable fusion spots for the first clusterer's clusters with the second clusterer's clusters. It is used for deciding where fusion is most likely to yield optimal preciseness of clusters.

Clusterer Compatibility Matrix, CCM (m x m): an m x m symmetric matrix, where m is the total number of clusterers considered for fusion.

CCM[i][j]: an integer value d representing the maximum information gained, computed as the summation of the cardinalities of the intersections of the matching cluster pairs found in the Fusion Joint set FJ[i][j] (a computational sketch follows these definitions).

Degree of Agreement factor, DOA: the ratio of the index of the merging level to the total number of clusterers. This DOA value accumulates until it reaches the threshold level DOATh, a user-assigned value indicating the majority required for decision making. Under a normal scenario, DOATh is set to 50% of the number of clusterers.

Degree of Shadow factor, DOS: the maximized value of the intersection of the two minimum bounding circles of the k clusters with the i-th cluster from a different clustering.
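The sketch below illustrates one way the Fusion Joint sets and the clusterer compatibility matrix could be computed from label vectors. It reflects our reading of the definitions above rather than the authors' code; in particular, matching each cluster of clusterer i to the cluster of clusterer j with which it shares the most elements is an assumed pairing rule.

import numpy as np

def fusion_joint_set(labels_i, labels_j):
    # FJ[i][j]: matching cluster pairs between clusterer i and clusterer j,
    # each scored by the cardinality of its intersection (assumed pairing rule).
    pairs = []
    for a in np.unique(labels_i):
        members_a = np.flatnonzero(labels_i == a)
        overlaps = {b: np.intersect1d(members_a, np.flatnonzero(labels_j == b)).size
                    for b in np.unique(labels_j)}
        b_best = max(overlaps, key=overlaps.get)   # best-overlap partner for cluster a
        pairs.append((a, b_best, overlaps[b_best]))
    return pairs

def clusterer_compatibility_matrix(labelings):
    # CCM[i][j]: summation of intersection cardinalities over the pairs in FJ[i][j].
    m = len(labelings)
    ccm = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            gain = sum(size for _, _, size in fusion_joint_set(labelings[i], labelings[j]))
            ccm[i, j] = ccm[j, i] = gain           # symmetric, m x m rather than n x n
    return ccm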
 
III. KNOWLEDGE GUIDED ENSEMBLE FUSION
In this section we discuss our proposed layered cluster ensemble fusion, guided by the knowledge gained during the merging process. The first phase of the algorithm is the preparation of the heterogeneous ensembles B1, ..., Bm. This is done by executing different clustering algorithms against the same spatial data set to generate partitioning results. For our experimental purposes, we have also generated homogeneous ensembles by partitioning the spatial data horizontally/vertically into n subgroups and used them as input to our ensemble algorithm. Either way, the individual partitions in each ensemble are generated sequentially.
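As an illustration of this first phase, the sketch below builds a heterogeneous ensemble by running several base clusterers on the same data set. The specific algorithms and parameter values (KMeans, AgglomerativeClustering and DBSCAN from scikit-learn) are assumptions made for the example; the paper does not fix which base clusterers are used.

import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

def heterogeneous_ensemble(X, k=3, eps=0.5, seed=0):
    # Phase one: run different clustering algorithms on the same spatial data set
    # and collect their partitioning results B_1, ..., B_m as label vectors.
    clusterers = [
        KMeans(n_clusters=k, n_init=10, random_state=seed),
        AgglomerativeClustering(n_clusters=k),
        DBSCAN(eps=eps, min_samples=5),
    ]
    return [np.asarray(c.fit_predict(X)) for c in clusterers]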
A. Selection of clusterings for prime fusion
Any layered approach has the drawback of being dependent on which clusterer is considered for the initial fusion. This sensitivity is a major bottleneck in deciding the accuracy of the outputs. In our approach, however, we compute an m x m symmetric clusterer compatibility matrix, where CCM[i][j] indicates the information gain obtained when the i-th clusterer and the j-th clusterer are merged. In this way we use heuristics to direct the fusion in the right direction.
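A minimal sketch of this selection step, reusing the clusterer_compatibility_matrix helper sketched earlier: the pair of clusterers with the largest off-diagonal CCM entry is taken as the prime fusion pair. Reading "more information gain" as the matrix argmax is our interpretation.

import numpy as np

def select_prime_fusion_pair(ccm):
    # Choose the two clusterers whose fusion promises the most information gain,
    # i.e. the largest entry above the diagonal of the symmetric CCM.
    ccm = np.asarray(ccm, dtype=float)
    upper = np.triu(ccm, k=1)
    i, j = np.unravel_index(np.argmax(upper), upper.shape)
    return int(i), int(j), ccm[i, j]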
B. Resolution for Label Correspondence Problem
The other issues in the fusion of cluster ensembles are the label correspondence problem and the merging technique used for fusion. In the second phase, we address the label correspondence problem. The clustering results are combined in layered pairs, called fusion joint sets, FJmk. The criterion for merging can be any one of the Fusion Joint Identification Techniques, i.e., overshadowing or the use of the highest cardinality in the intersection set, along with the add-on knowledge gathered from such association.

The first approach uses the degree of shadow that one cluster has on another. This is computed using the smallest circle, or minimum covering circle, approach: the mathematical problem of computing the smallest circle that contains all of a given set of points in the Euclidean plane. For each cluster of the clusterers in the two layers, we first compute the minimum bounding circle and its diameter, from which the Degree of Shadow (DOS) is computed. The aim is to find the clusters in different layers whose shadow overlap is maximized and then assign them to the matching pair set. This method finds the most appropriate clusters, belonging to two clusterings, for the forthcoming fusion phase.

The second approach uses a heuristic greedy computation of mutual information to decide the degree of compatibility. Mutual information is used when we need to decide which among the candidate variables is closest to a particular variable: the higher the mutual information, the 'closer' the two variables are. It is the amount of information 'contained' in Y about X. Let X and Y be the random variables described by the cluster labelings λ(a) and λ(b), with k(a) and k(b) groups respectively. Let I(X; Y) denote the mutual information between X and Y, and H(X) denote the entropy of X, i.e., a measure of the uncertainty associated with X. The chain rule for entropy states that

H(X1:n) = H(X1) + H(X2|X1) + ... + H(Xn|X1:n-1)    (1)

When X1:n are independent, identically distributed (i.i.d.), then H(X1:n) = nH(X1). From Eqn. (1), we have

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
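For reference, here is a minimal sketch of estimating the mutual information I(X; Y) between two cluster labelings λ(a) and λ(b) from their joint label frequencies, using the standard definition the discussion above relies on; the function name and the use of natural-log units are our assumptions. A larger value indicates that the two clusterings are 'closer' in the sense described.

import numpy as np

def mutual_information(labels_a, labels_b):
    # I(X; Y) between two cluster labelings, estimated from joint label frequencies.
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = labels_a.size
    mi = 0.0
    for a in np.unique(labels_a):
        p_a = np.sum(labels_a == a) / n
        for b in np.unique(labels_b):
            p_b = np.sum(labels_b == b) / n
            p_ab = np.sum((labels_a == a) & (labels_b == b)) / n   # joint probability
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))            # natural-log units
    return mi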