
Published by ijcsis

Clustering is a process of grouping similar objects together and placing the object in a cluster which is most similar to it. In this paper we provide a new measure for calculation of similarity between 2 clusters for categorical attributes and the approach used is agglomerative hierarchical clustering.


A New Approach for Clustering Categorical Attributes

Parul Agarwal¹, M. Afshar Alam²
Department of Computer Science, Jamia Hamdard (Hamdard University), New Delhi 110062, India
parul.pragna4@gmail.com, aalam@jamiahamdard.ac.in

Ranjit Biswas³
Manav Rachna International University, Green Fields Colony, Faridabad, Haryana 121001
ranjitbiswas@yahoo.com

Abstract: Clustering is a process of grouping similar objects together and placing each object in the cluster which is most similar to it. In this paper we provide a new measure for calculating the similarity between 2 clusters for categorical attributes, and the approach used is agglomerative hierarchical clustering.

Keywords: Agglomerative hierarchical clustering, Categorical Attributes, Number of Matches.

I. INTRODUCTION

Data mining is a process of extracting useful information. Clustering is one of the problems addressed in data mining. Clustering discovers interesting patterns in the underlying data. It groups similar objects together in a cluster (or clusters) and dissimilar objects in other clusters. This grouping depends on the approach used by the algorithm and on the similarity measure, which identifies the similarity between an object and a cluster. The approach is determined by the clustering method chosen. Clustering methods are broadly divided into hierarchical and partitional.

Hierarchical clustering performs partitioning sequentially. It works either bottom-up or top-down. The bottom-up approach, known as agglomerative, starts with each object in a separate cluster and keeps combining the 2 most similar clusters, based on the similarity measure, until they are combined into one big cluster which consists of all objects. The top-down approach, known as divisive, treats all objects as one big cluster and divides the large cluster into smaller clusters until each cluster consists of just a single object. The general approach of hierarchical clustering is to use an appropriate metric, which measures the distance between 2 tuples, and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. The linkage criterion can be of 3 types [28]: single linkage, average linkage and complete linkage.

In single linkage (also known as nearest neighbour), the distance between 2 clusters is computed as

$$D(C_i, C_j) = \min \{\, d(a, b) : a \in C_i,\ b \in C_j \,\}$$

Thus the distance between clusters is defined as the distance between the closest pair of objects, where only one object from each cluster is considered, i.e. the distance between two clusters is given by the value of the shortest link between the clusters.

In the complete linkage method (or farthest neighbour), the distance between clusters is defined as the distance between the most distant pair of objects, one from each cluster:

$$D(C_i, C_j) = \max \{\, d(a, b) : a \in C_i,\ b \in C_j \,\}$$

That is, the distance between two clusters is given by the value of the longest link between the clusters.

In average linkage, the distance is the mean of all pairwise distances:

$$D(C_i, C_j) = \frac{1}{l_1 \, l_2} \sum_{a \in C_i} \sum_{b \in C_j} d(a, b)$$

where $l_1$ is the cardinality of cluster $C_i$, $l_2$ is the cardinality of cluster $C_j$, and $d(a, b)$ is the chosen distance.

Partitional clustering, on the other hand, breaks the data into disjoint clusters. In Section II we discuss the related work. Section III presents our algorithm, Section IV contains the experimental results, Section V contains the conclusion and Section VI discusses future work.
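The three linkage criteria described in this section can be illustrated with a short Python sketch. The function names and the one-dimensional example, which uses the absolute difference as the distance d, are assumptions made for illustration and do not come from the paper.

```python
def single_linkage(ci, cj, d):
    """Distance of the closest pair, one object taken from each cluster."""
    return min(d(a, b) for a in ci for b in cj)


def complete_linkage(ci, cj, d):
    """Distance of the most distant pair, one object taken from each cluster."""
    return max(d(a, b) for a in ci for b in cj)


def average_linkage(ci, cj, d):
    """Sum of all pairwise distances divided by l1 * l2."""
    return sum(d(a, b) for a in ci for b in cj) / (len(ci) * len(cj))


def abs_diff(a, b):
    """Hypothetical distance for one-dimensional numeric objects."""
    return abs(a - b)


ci, cj = [1, 2], [5, 9]
print(single_linkage(ci, cj, abs_diff))    # 3
print(complete_linkage(ci, cj, abs_diff))  # 8
print(average_linkage(ci, cj, abs_diff))   # 5.5
```

The pairwise distances here are 4, 8, 3 and 7, so the shortest link is 3, the longest is 8 and the average is 22 / 4 = 5.5.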

II. RELATED WORK

Hierarchical clustering has its basis in older algorithms such as the Lance-Williams formula (based on the Williams dissimilarity update formula, which calculates the dissimilarities between a newly formed cluster and the existing points from the dissimilarities found prior to forming the new cluster), conceptual clustering, SLINK [1] and COBWEB [2], as well as newer algorithms like CURE [3] and CHAMELEON [4]. The SLINK algorithm performs single-link (nearest-neighbour) clustering on arbitrary dissimilarity coefficients and constructs a representation of the dendrogram which can be

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011, http://sites.google.com/site/ijcsis/, ISSN 1947-5500

converted into a tree representation. COBWEB constructs a dendrogram representation known as a classification tree that characterizes each cluster with a probabilistic distribution. CURE (Clustering Using Representatives) is an algorithm that handles large databases and employs a combination of random sampling and partitioning. A random sample is drawn from the data set and then partitioned, and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. CURE has the advantage of effectively handling outliers. CHAMELEON combines graph partitioning and dynamic modeling into agglomerative hierarchical clustering and can perform clustering on all types of data. For two clusters to be merged, the interconnectivity between them should be high compared with the internal connectivity of the objects within each cluster.

In the partitioning method, by contrast, a partitioning algorithm arranges all the objects into various groups or partitions, where the total number of partitions (k) is less than the number of objects (n), i.e. a database of n objects can be arranged into k partitions, where k < n. Each partition thus obtained by applying some similarity function is a cluster. The partitioning methods are subdivided into probabilistic clustering [5] (EM, AUTOCLASS), algorithms that use the k-medoids method (like PAM [6], CLARA [6], CLARANS [7]), and k-means methods (which differ in parameters like initialization, optimization and extensions). EM (the expectation-maximization algorithm) calculates the maximum likelihood estimate by using the marginal likelihood of the observed data for a given statistical model which depends on unobserved latent data or missing values. However, this algorithm depends on the order of input. The AUTOCLASS algorithm works for both continuous and categorical data. AUTOCLASS is a powerful unsupervised Bayesian classification system which mainly has application in biological sciences and is able to handle missing values.
PAM (Partitioning Around Medoids) builds k representative objects, called medoids, chosen randomly from a given dataset consisting of n objects. A medoid is an object of a given cluster whose average dissimilarity to all the objects in the cluster is the least. Each object in the dataset is then assigned to the nearest medoid. The purpose of the algorithm is to minimize the objective function, which is the sum of the dissimilarities of all the objects to their nearest medoid. CLARA (Clustering Large Applications) deals with large data sets. It combines sampling and the PAM algorithm to generate an optimal set of medoids for the sample. It also tries to find k representative objects that are centrally located in the cluster. It considers data subsets of fixed size, so that the overall computation time and storage requirements become linear in the total number of objects. CLARANS (Clustering Large Applications based on RANdomized Search) views the process of finding k medoids as searching in a graph [12]. CLARANS performs a serial randomized search instead of exhaustively searching the data. It identifies spatial structures present in the data.

Partitioning algorithms can also be density based, i.e. they try to discover dense connected components of data, which are flexible in terms of their shape. Several such algorithms, like DBSCAN [8] and OPTICS, have been proposed. The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm identifies clusters on the basis of the density of the points. Regions with a high density of points depict the existence of clusters, whereas regions with a low density of points indicate noise or outliers.
Its main features include the ability to handle large datasets with noise and to identify clusters of different sizes and shapes. OPTICS (Ordering Points To Identify the Clustering Structure), though similar to DBSCAN in being density based and working over spatial data, differs by addressing DBSCAN's problem of detecting meaningful clusters in data of varying density.

Another category is grid based methods like BANG [9], in addition to evolutionary methods such as Simulated Annealing (a probabilistic method of calculating the global minimum of a cost function having many local minima) and Genetic Algorithms [10]. Several scalability algorithms, e.g. BIRCH [11] and DIGNET [12], have been suggested in the recent past to address the issues associated with large databases. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an incremental and agglomerative hierarchical clustering algorithm for databases which are too large to fit in main memory. This algorithm performs only a single scan of the database and deals effectively with data containing noise.

Another category of algorithms deals with high dimensional data and works on subspace clustering, projection techniques and co-clustering techniques. Subspace clustering finds clusters in various subspaces within a dataset. High dimensional data may consist of thousands of dimensions and may thus pose difficulty in enumeration, owing to the multiple values that the dimensions may take, and in visualization, owing to the fact that many of the dimensions may often be irrelevant. The problem with subspace clustering is that with d dimensions there exist $2^d$ subspaces. Projected clustering [14] assigns each point to a unique cluster, but clusters may exist in different subspaces. Co-clustering or bi-clustering [15] is the simultaneous clustering of rows and columns of a matrix, i.e. of tuples and attributes.

The techniques of grouping objects are different for numerical and categorical data owing to their separate nature. Real world databases contain both numerical and categorical data. Thus, we need separate similarity measures for both types. Numerical data is generally grouped on the basis of inherent geometric properties like the distances between objects (the most common being Euclidean, Manhattan etc.). For categorical data, in contrast, the number of values an attribute can take is small, and it is difficult to measure similarity on the basis of distance as we can for real numbers. There exist two approaches for handling mixed types of attributes. The first is to group all variables of the same type in a particular cluster and apply a separate dissimilarity computing method for each variable-type cluster. The second is to group all the variables of different types into a single cluster using a dissimilarity matrix and make a set of common scale variables. Then, using the dissimilarity formula for such cases, we perform the clustering.
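The difference between numerical and categorical measures can be made concrete with a small Python sketch. The function names and sample tuples are hypothetical. The mismatch count shown for categorical tuples is one common choice (simple matching) and is not claimed to be the measure of any particular cited paper.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two numerical tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan (city-block) distance between two numerical tuples."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mismatches(x, y):
    """Number of attributes on which two categorical tuples disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)


print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
# Categorical values have no geometric distance; only agreement can be tested.
print(mismatches(("red", "small", "round"), ("red", "large", "round")))  # 1
```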


There exist several clustering algorithms for numerical datasets, the most common being k-means, BIRCH, CURE and CHAMELEON. The k-means algorithm takes as input the number of clusters desired. From the given database, it randomly selects k tuples as centres and then assigns the objects in the database to these clusters on the basis of distance. It then recomputes the k centres and continues the process until the centres do not move. K-means was further extended to fuzzy k-means and also to categorical attributes. The original work has been explored by several authors for extension, and several algorithms for the same have been proposed in the recent past. In [16], Ralambondrainy proposed an extended version of the k-means algorithm which converts categorical attributes into binary ones. In that paper the author represents every attribute in the form of binary values, which results in increased time and space in case the number of categorical attributes is large.

A few algorithms have been proposed in the last few years which cluster categorical data. A few of them are listed in [17-19]. Recently, work has been done to define a good distance (dissimilarity) measure between categorical data objects [20-22, 25]. For mixed data types a few algorithms [23-24] have been written. In [22] the author presents the k-modes algorithm, an extension of the k-means algorithm in which the number of mismatches between categorical attributes is considered as the measure for performing clustering. In k-prototypes, the distance measure for numerical data is a weighted sum of Euclidean distances, and for categorical data a measure has been proposed in the paper. K-Representative is a frequency based algorithm which considers the frequency of an attribute value in a cluster and divides it by the length of the cluster.
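The k-means procedure described above can be sketched in a few lines of Python. This is a minimal illustration only. The function names, the optional `centres` argument (added here so that a run can be reproduced) and the sample points are assumptions, not part of any cited work.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two numerical tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, centres=None, seed=0):
    """Assign points to k centres and recompute centres until they stop moving."""
    if centres is None:
        # Randomly select k tuples from the data as the initial centres.
        centres = random.Random(seed).sample(points, k)
    while True:
        # Assign every object to its nearest centre.
        labels = [min(range(k), key=lambda c: sq_dist(p, centres[c]))
                  for p in points]
        new_centres = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centres.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centres.append(centres[c])  # keep the centre of an empty cluster
        if new_centres == centres:  # the centres did not move
            return labels, centres
        centres = new_centres


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centres = kmeans(points, 2, centres=[(0, 0), (1, 1)])
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1]
print(centres)  # [(0.5, 0.5), (10.5, 10.5)]
```

The result of k-means depends on the initial centres, which is why the example fixes them explicitly rather than relying on the random choice.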

III. THE PROPOSED ALGORITHM

The proposed algorithm, namely MHC (Matches based Hierarchical Clustering, where M stands for the number of matches), works for categorical datasets and constructs clusters hierarchically.

Consider a database D with domains $D_1, \ldots, D_m$ defined by attributes $A_1, \ldots, A_m$. Each tuple X in D is represented as

$$X = (x_1, x_2, \ldots, x_m) \in (D_1 \times D_2 \times \cdots \times D_m) \qquad (1)$$

Let there be n objects in the database, so that $D = (X_1, X_2, \ldots, X_n)$, where object $X_i$ is represented as

$$X_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \qquad (2)$$

where m is the total number of attributes. Then, we define the similarity between any two clusters as

$$\mathrm{Sim}(C_i, C_j) = \frac{\mathrm{matches}(C_i, C_j)}{m \cdot (l_i \cdot l_j)} \qquad (3)$$

where
$C_i, C_j$ denote the clusters for which similarity is being calculated;
$\mathrm{matches}(C_i, C_j)$ denotes the number of matches between the tuples of the 2 clusters over corresponding attributes;
$m$ is the total number of attributes in the database;
$l_i$ is the length of cluster $C_i$;
$l_j$ is the length of cluster $C_j$.
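A minimal Python sketch of the similarity measure of equation (3) follows. It assumes that the matches between two clusters are counted over every pair of tuples, one taken from each cluster, and that the count is divided by the number of attributes times the two cluster lengths. The function names and the sample tuples are hypothetical.

```python
def matches(x, y):
    """Number of corresponding attributes on which two tuples agree."""
    return sum(1 for a, b in zip(x, y) if a == b)


def similarity(ci, cj):
    """Sim(Ci, Cj) of equation (3) for two clusters given as lists of tuples."""
    num_attributes = len(ci[0])
    # Assumption: matches are summed over all cross-cluster pairs of tuples.
    total = sum(matches(a, b) for a in ci for b in cj)
    return total / (num_attributes * len(ci) * len(cj))


ci = [("red", "small", "round"), ("red", "large", "round")]
cj = [("red", "small", "square")]
print(similarity(ci, cj))  # 0.5
```

The first tuple of `ci` agrees with the tuple of `cj` on 2 attributes and the second on 1, so the similarity is 3 / (3 * 2 * 1) = 0.5. Under this reading the value always lies between 0 and 1.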

A. The Algorithm:

Input: number of clusters (k), data to be clustered (D).
Output: k clusters.
Step 1. Begin with n clusters, each having just one tuple.
Step 2. Repeat Step 3 n-k times.
Step 3. Find the most similar clusters $C_i$ and $C_j$ using the similarity measure $\mathrm{Sim}(C_i, C_j)$ of equation (3) and merge them into a single cluster.

B. Implementation Details:

1. This algorithm has been implemented in Matlab [26], and the main advantage is that we do not have to reconstruct the similarity matrix once it has been built.
2. It is simple to implement.
3. Given n tuples, construct an n*n similarity matrix with all i=j entries initially set to 8000 (a special value) and the rest set to 0.
4. During the 1st iteration, calculate the similarity of each cluster with every other cluster, for all i, j such that i ≠ j. Compute the similarity between 2 tuples (clusters) of the database by identifying the number of matches over attributes, then use equation (3) to calculate the value for this step and update the matrix accordingly.
5. Since only the upper triangular matrix is used, identify the highest value in the matrix and merge the corresponding i and j. The changes in the matrix include:
a) Set (j,j) = -9000 to identify that this cluster has been merged with some other cluster.
b) Set (i,j) = 8000, which denotes that, for the corresponding row i, all j's with 8000 as value have been merged with i.
c) During the next iteration, do not consider the similarity between those clusters which have been merged. For example, if database D contains 4 (n) tuples with 5 (m) attributes, and 1, 2 have been merged, then the following similarities have to be calculated:
sim(1,3) = sim(1,3) + sim(2,3), where $l_i = 2$, $l_j = 1$
sim(3,4), where $l_i = 1$, $l_j = 1$
sim(1,4) = sim(1,4) + sim(2,4), where $l_i = 2$, $l_j = 1$
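The steps of the algorithm can be sketched in Python as follows. This is an illustrative sketch rather than the Matlab implementation described above. It recomputes similarities on every pass instead of maintaining the matrix with the special values 8000 and -9000. It also assumes that the matches between two clusters are summed over all pairs of tuples, one from each cluster, as the worked example suggests. All names and the sample data are hypothetical.

```python
from itertools import combinations


def matches(x, y):
    """Number of corresponding attributes on which two tuples agree."""
    return sum(1 for a, b in zip(x, y) if a == b)


def sim(ci, cj, data):
    """Equation (3) for clusters given as lists of tuple indices into data."""
    total = sum(matches(data[a], data[b]) for a in ci for b in cj)
    return total / (len(data[0]) * len(ci) * len(cj))


def mhc(data, k):
    """Merge the most similar pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(data))]  # Step 1: one tuple per cluster
    while len(clusters) > k:                    # Step 2: n - k merges in total
        # Step 3: find the most similar pair (the first pair wins on ties).
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]], data))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


data = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "r"), ("b", "y", "s")]
print(mhc(data, 2))  # [[0, 1], [2, 3]]
```

Recomputing every similarity keeps the sketch short, at the cost of repeated work that the matrix bookkeeping in the implementation details is meant to avoid.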

IV. EXPERIMENTAL RESULTS

We have implemented this algorithm on a small synthetic database and the results have been good. But as the size increases, the algorithm has the drawback of producing mixed clusters. Thus, we consider a real life dataset which is small in size for experiments.

Real life dataset:
