Constraint-free Optimal Meta Similarity Clusters Using Dynamic Minimum Spanning Tree
S. John Peter
Assistant Professor
Department of Computer Science and Research Center
St. Xavier's College, Palayamkottai
Tamil Nadu, India.
jaypeeyes@rediffmail.com

S.P. Victor
Associate Professor
Department of Computer Science and Research Center
St. Xavier's College, Palayamkottai
Tamil Nadu, India.
victorsp@rediffmail.com
ABSTRACT— Clustering is a process of discovering groups of objects such that objects in the same group are similar and objects belonging to different groups are dissimilar. A number of clustering algorithms exist that can solve the problem of clustering, but most of them are very sensitive to their input parameters. Therefore it is very important to evaluate their results. The minimum spanning tree clustering algorithm is capable of detecting clusters with irregular boundaries. In this paper we propose a constraint-free minimum spanning tree based clustering algorithm. The algorithm constructs the hierarchy from top to bottom. At each hierarchical level, it optimizes the number of clusters, from which the proper hierarchical structure of the underlying dataset can be found. The algorithm uses a new cluster validation criterion based on the geometric property of the data partition of the data set in order to find the proper number of clusters at each level. The algorithm works in two phases. The first phase creates clusters with guaranteed intra-cluster similarity, whereas the second phase creates a dendrogram using the clusters as objects, with guaranteed inter-cluster similarity. The first phase uses a divisive approach, whereas the second phase uses an agglomerative approach. In this paper we use both approaches in the algorithm to find Optimal Meta similarity clusters.
Keywords: Euclidean minimum spanning tree, Subtree, Clustering, Eccentricity, Center, Hierarchical clustering, Dendrogram, Cluster validity, Cluster Separation
I. INTRODUCTION

The problem of determining the correct number of clusters in a data set is perhaps the most difficult and ambiguous part of cluster analysis. The "true" number of clusters depends on the "level" at which one is viewing the data. Another problem is that some methods may yield the "correct" number of clusters for a "bad" classification [10]. Furthermore, it has been emphasized that mechanical methods for determining the optimal number of clusters should not ignore the fact that the overall clustering process has an unsupervised nature and that its fundamental objective is to uncover the unknown structure of a data set, not to impose one. For these reasons, one should be well aware of the explicit and implicit assumptions underlying the actual clustering procedure before the number of clusters can be reliably estimated; otherwise, the initial objective of the process may be lost. As a solution, Hardy [10] recommends that the optimal number of clusters be determined by using several different clustering methods that together produce more information about the data. By forcing a structure onto a data set, important and surprising facts about the data will likely remain uncovered.

In some applications the number of clusters is not a problem, because it is predetermined by the context [11]. In that case the goal is to obtain a mechanical partition of a particular data set using a fixed number of clusters. Such a process is not intended for inspecting new and unexpected facts
arising from the data. Hence, splitting up a homogeneous data set in a "fair" way is a much more straightforward problem than analyzing hidden structures in a heterogeneous data set. The clustering algorithms of [15, 21] partition the data set into clusters without knowing the homogeneity of the groups; the principal goal of such clustering problems is not to uncover novel or interesting facts about the data.

Numerical methods can usually provide only guidance about the true number of clusters, and the final decision is often an ad hoc one based on prior assumptions and domain knowledge. Therefore, the choice between different numbers of clusters is often made by comparing several alternatives, and the final decision is a subjective problem that can, in practice, be solved only by humans. Nevertheless, a number of methods for objective assessment of cluster validity have been developed and proposed. Because the recognition of cluster structures is difficult, especially in high-dimensional spaces, various visualization techniques can also be of valuable help to the cluster analyst.
Consider a connected, undirected graph G = (V, E), where V is the set of nodes, E is the set of edges between pairs of nodes, and a weight w(u, v) specifies the weight of each edge (u, v). A spanning tree is an acyclic subgraph of a graph G that contains all vertices of G. The Minimum Spanning Tree (MST) of a weighted graph is the minimum-weight spanning tree of that graph. Several well-established MST algorithms exist to solve the minimum spanning tree problem [24, 19, 20]. The cost of constructing a minimum spanning tree is O(m log n), where m is the number of edges in the graph and n is the number of vertices. More efficient algorithms for constructing MSTs have also been extensively researched [18, 5, 13]; these algorithms promise close to linear time complexity under different assumptions. A Euclidean minimum spanning tree (EMST) is a spanning tree of a set of n points in a metric space (E^n), where the length of an edge is the Euclidean distance between a pair of points in the point set.
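As an illustrative sketch only (not the paper's implementation; the helper name and the random test points are our own), an EMST can be built by running a standard MST routine over the complete graph of pairwise Euclidean distances:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def build_emst(points):
    """Euclidean minimum spanning tree of a point set.

    Returns a sparse matrix whose nonzero entry (i, j) holds the
    weight of the EMST edge between points i and j.
    """
    # Pairwise Euclidean distances define the complete weighted graph.
    dists = squareform(pdist(points, metric="euclidean"))
    # Standard MST over that graph (SciPy uses Kruskal's algorithm).
    return minimum_spanning_tree(dists)

points = np.random.rand(20, 2)   # 20 points in the plane (toy data)
emst = build_emst(points)        # a tree on 20 points has 19 edges
```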
Hierarchical clustering approaches are related to graph-theoretic clustering. Clustering algorithms using a minimal spanning tree take advantage of the MST: the MST ignores many possible connections between the data patterns, so the cost of clustering can be decreased. The MST-based clustering algorithm is known to be capable of detecting clusters of various shapes and sizes [34]. Unlike traditional clustering algorithms, the MST clustering algorithm does not assume a spherical structure for the underlying data. The EMST clustering algorithm [23, 34] uses the Euclidean minimum spanning tree of a graph to produce the structure of point clusters in the n-dimensional Euclidean space. Clusters are detected so as to achieve some measure of optimality, such as minimum intra-cluster distance or maximum inter-cluster distance [2]. The EMST algorithm has been widely used in practice.

Clustering by minimal spanning tree can be viewed as a hierarchical clustering algorithm that follows a divisive approach. In this method, an MST is first constructed for the given input; different methods then exist to produce groups of clusters. If the number of clusters k is given in advance, the simplest way to obtain k clusters is to sort the edges of the minimum spanning tree in descending order of their weights and remove the edges with the k-1 heaviest weights [2, 33].
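To make this edge-removal step concrete, here is a minimal sketch (our own illustration, reusing the hypothetical `build_emst` output from the previous example): delete the k-1 heaviest EMST edges and read clusters off the remaining connected components.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def cut_heaviest_edges(emst, k):
    """Partition points into k clusters by removing the k-1
    heaviest edges of a (sparse) minimum spanning tree."""
    tree = emst.tocoo()
    # Edge indices sorted by weight, heaviest first.
    order = np.argsort(tree.data)[::-1]
    keep = order[k - 1:]                     # drop the k-1 heaviest edges
    pruned = coo_matrix(
        (tree.data[keep], (tree.row[keep], tree.col[keep])),
        shape=tree.shape)
    # Each surviving connected component is one cluster.
    _, labels = connected_components(pruned, directed=False)
    return labels

labels = cut_heaviest_edges(emst, k=3)       # e.g. three clusters
```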
All existing clustering algorithms require a number of parameters as their inputs, and these parameters can significantly affect the cluster quality. Our algorithm does not require a predefined cluster number. In this paper we want to avoid experimental methods and advocate the idea of need-specific as opposed to care-specific, because users always know the needs of their applications. We believe it is a good idea to allow users to define their desired similarity within a cluster and to give them some flexibility to adjust the similarity if needed. Our algorithm produces clusters of n-dimensional points with a naturally approximate intra-cluster distance.
Geometric notions of centrality are closely linked to the facility location problem. The distance matrix D can be computed rather efficiently using Dijkstra's algorithm with time complexity O(|V|² ln |V|) [29].

The eccentricity of a vertex x in G and the radius ρ(G), respectively, are defined as

e(x) = max {d(x, y) : y ∈ V}   and   ρ(G) = min {e(x) : x ∈ V}
The center of G is the set

C(G) = {x ∈ V | e(x) = ρ(G)}

C(G) is the center of the "emergency facility location problem", and it is always contained in a single block of G. The length of the longest path in the graph is called the diameter of the graph G. We can define the diameter D(G) as

D(G) = max {e(x) : x ∈ V}

The diameter set of G is

Dia(G) = {x ∈ V | e(x) = D(G)}
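As a sketch of these definitions in code (our own illustration; the function name is hypothetical), the eccentricities, radius, center, diameter, and diameter set can all be read off the all-pairs shortest-path matrix D:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def centrality_summary(adjacency):
    """Eccentricity, radius, center, diameter and diameter set of a
    connected weighted graph given as an adjacency matrix."""
    # D[x, y] = d(x, y), computed with Dijkstra's algorithm.
    D = shortest_path(adjacency, method="D", directed=False)
    ecc = D.max(axis=1)                        # e(x) = max_y d(x, y)
    radius = ecc.min()                         # rho(G) = min_x e(x)
    diameter = ecc.max()                       # D(G)  = max_x e(x)
    center = np.flatnonzero(ecc == radius)     # C(G)
    dia_set = np.flatnonzero(ecc == diameter)  # Dia(G)
    return ecc, radius, center, diameter, dia_set
```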
An important objective of hierarchical cluster analysis is to provide a picture of the data that can easily be interpreted: a picture of a hierarchical clustering is much easier for a human being to comprehend than a list of abstract symbols. A dendrogram is a special type of tree structure that provides a convenient way to represent hierarchical clusterings; it consists of layers of nodes, each representing a cluster. Hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. An agglomerative algorithm for hierarchical clustering starts with a disjoint clustering, which places each of the n objects in an individual cluster [1]. The hierarchical clustering algorithm being employed dictates how the proximity matrix or proximity graph should be interpreted to merge two or more of these trivial clusters, thus nesting the trivial clusters into a second partition. The process is repeated to form a sequence of nested clusterings in which the number of clusters decreases as the sequence progresses, until a single cluster containing all n objects, called the conjoint clustering, remains [1].
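For illustration only (this is generic SciPy usage, not the paper's second phase, which builds its dendrogram over clusters rather than raw points), a minimal agglomerative clustering with a dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

objects = np.random.rand(8, 2)          # toy objects to be clustered
# Single-linkage agglomeration; notably, single linkage is the
# hierarchical counterpart of MST-based clustering.
Z = linkage(objects, method="single")
dendrogram(Z)                           # layers of nodes, one per merge
plt.show()
```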
Nearly all hierarchical clustering techniques that include the tree structure have two shortcomings: (1) they do not properly represent hierarchical relationships, and (2) once data are assigned improperly to a given cluster, they cannot later be reevaluated and placed in another cluster.

In this paper, we propose a new clustering algorithm, the Dynamically Growing Minimum Spanning Tree (DGMST), which can overcome these shortcomings. The algorithm optimizes the number of clusters at each hierarchical level with the cluster validation criterion during the minimum spanning tree construction process. The hierarchy constructed by the algorithm can then properly represent the hierarchical structure of the underlying dataset, which improves the accuracy of the final clustering result. Our DGMST clustering algorithm addresses the issues of undesired clustering structure and an unnecessarily large number of clusters, and it does not require a predefined cluster number. The algorithm constructs an EMST of a point set and removes the inconsistent edges that satisfy the inconsistency measure. The process is repeated to create a hierarchy of clusters until the optimal number of clusters (regions) is obtained; hence the algorithm's name.
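The paper's own inconsistency measure is not given in this preview; the sketch below (our illustration) substitutes Zahn's classic criterion instead, flagging an edge as inconsistent when its weight exceeds the mean weight of nearby edges by more than a chosen number of standard deviations:

```python
import numpy as np

def inconsistent_edges(emst, c=2.0):
    """Indices of MST edges that are 'inconsistent' in Zahn's sense:
    heavier than the mean of the neighbouring edge weights by more
    than c standard deviations. (Stand-in for the paper's measure.)"""
    tree = emst.tocoo()
    flagged = []
    for idx, (u, v, w) in enumerate(zip(tree.row, tree.col, tree.data)):
        # Weights of the other edges touching either endpoint.
        mask = ((tree.row == u) | (tree.col == u) |
                (tree.row == v) | (tree.col == v))
        mask[idx] = False
        nearby = tree.data[mask]
        if nearby.size and w > nearby.mean() + c * nearby.std():
            flagged.append(idx)
    return flagged
```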
In Section 2 we review some of the existing work on cluster validity and graph-based clustering algorithms. In Section 3 we propose the DGMST algorithm, which produces an optimal number of clusters together with a dendrogram for the cluster of clusters; hence we name these new clusters Optimal Meta similarity clusters. Finally, in the conclusion, we summarize the strengths of our method and possible improvements.
II. RELATED WORK

Determining the true number of clusters, also known as the cluster validation problem, is a fundamental problem in cluster analysis, and many approaches to it have been proposed [25, 32, 10]. Two kinds of indexes have been used to validate a clustering [6, 7]: one based on relative criteria, and the other based on external and internal criteria. The first approach is to choose the best result from a set of clustering results according to a prespecified criterion. Although the computational cost of this approach is light, human intervention is required to find the best number of clusters; since the DGMST algorithm tries to find the proper number of clusters automatically, the first approach is unsuitable for cluster validation in the DGMST algorithm. The second approach is based on statistical tests and involves computation of both inter-cluster and intra-cluster quality to determine the proper number of clusters. The evaluation of the criteria can be completed automatically, but the computational cost of this type of cluster validation is very high, so this kind of approach is also not suitable for the DGMST algorithm when it is used to cluster a large dataset. A successful and practical cluster validation criterion used in the DGMST algorithm