Several clustering algorithms exist for numerical datasets, the most common being K-means, BIRCH, CURE, and CHAMELEON. The k-means algorithm takes as input the desired number of clusters, k. It randomly selects k tuples from the given database as centres and assigns each object in the database to a cluster on the basis of its distance to these centres. It then recomputes the k centres and repeats the process until the centres no longer move. K-means was later extended to fuzzy k-means and to categorical attributes. Several authors have explored extensions of the original work, and a number of algorithms have been proposed in the recent past. In [16], Ralambondrainy proposed an extended version of the k-means algorithm that converts categorical attributes into binary ones. In that paper, the author represents every attribute as binary values, which increases time and space requirements when the number of categorical attributes is large.

A few algorithms that cluster categorical data have been proposed in the last few years; some of them are listed in [17-19]. Recent work has sought to define a good distance (dissimilarity) measure between categorical data objects [20-22,25]. For mixed data types, a few algorithms [23-24] have been written. In [22], the author presents the k-modes algorithm, an extension of the k-means algorithm in which the number of mismatches between categorical attributes is used as the measure for clustering. In k-prototypes, the distance measure for numerical data is a weighted sum of Euclidean distances, and a separate measure is proposed in that paper for categorical data. K-Representative is a frequency-based algorithm that considers the frequency of an attribute in a cluster and divides it by the length of the cluster.

III.
The Proposed Algorithm (namely MHC: Matches-based Hierarchical Clustering, where M stands for the number of matches)

This algorithm works for categorical datasets and constructs clusters hierarchically. Let D be a database with domains $D_1, \ldots, D_m$ defined by attributes $A_1, \ldots, A_m$. Each tuple X in D is then represented as

$$X = (x_1, x_2, \ldots, x_m) \in (D_1 \times D_2 \times \cdots \times D_m). \tag{1}$$

Let there be n objects in the database. Then $D = (X_1, X_2, \ldots, X_n)$, where object $X_i$ is represented as

$$X_i = (x_{i1}, x_{i2}, \ldots, x_{im}), \tag{2}$$

where m is the total number of attributes. We then define the similarity between any two clusters as

$$\mathrm{Sim}(C_i, C_j) = \frac{\mathrm{matches}(C_i, C_j)}{n \cdot (l_i \cdot l_j)}, \tag{3}$$

where
$C_i, C_j$: the clusters for which the similarity is being calculated;
$\mathrm{matches}(C_i, C_j)$: the number of matches between 2 tuples over corresponding attributes;
n: the total number of attributes in the database;
$l_i$: the length of cluster $C_i$;
$l_j$: the length of cluster $C_j$.
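The similarity measure in (3) can be illustrated with a short Python sketch. This is not the authors' Matlab code, and the names `tuple_matches`, `cluster_matches`, and `sim` are hypothetical. The sketch assumes that the matches between two clusters are obtained by summing the attribute-wise matches over every pair of tuples drawn one from each cluster, and that the n in (3) is the number of attributes, as stated in the definitions above. The two clusters are made-up toy data.

```python
from itertools import product

def tuple_matches(x, y):
    """Count attribute positions where two tuples hold the same categorical value."""
    return sum(1 for a, b in zip(x, y) if a == b)

def cluster_matches(ci, cj):
    """Assumed reading of matches(C_i, C_j): sum over all cross-cluster tuple pairs."""
    return sum(tuple_matches(x, y) for x, y in product(ci, cj))

def sim(ci, cj, num_attributes):
    """Equation (3): matches(C_i, C_j) / (n * l_i * l_j)."""
    return cluster_matches(ci, cj) / (num_attributes * len(ci) * len(cj))

# Toy clusters over 3 categorical attributes (illustrative values only).
c1 = [("red", "small", "round"), ("red", "large", "round")]
c2 = [("red", "small", "square")]

# Cross-cluster matches: 2 (first pair) + 1 (second pair) = 3, so Sim = 3 / (3 * 2 * 1).
value = sim(c1, c2, num_attributes=3)
print(value)
```

Dividing by the product of the cluster lengths keeps the measure comparable between small and large clusters, since the raw match count grows with the number of tuple pairs.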
A. The Algorithm:
Input: Number of clusters (k), data to be clustered (D)
Output: k clusters.
Step 1. Begin with n clusters, each having just one tuple.
Step 2. Repeat Step 3 n - k times.
Step 3. Find the most similar clusters $C_i$ and $C_j$ using the similarity measure $\mathrm{Sim}(C_i, C_j)$ given by (3), and merge them into a single cluster.

B. Implementation Details:
1. This algorithm has been implemented in Matlab [26]. Its main advantage is that the similarity matrix does not have to be reconstructed once it has been built.
2. It is simple to implement.
3. Given n tuples, construct an n × n similarity matrix with all entries where i = j initially set to 8000 (a special value) and the rest set to 0.
4. During the first iteration, calculate the similarity of each cluster with every other cluster, for all i, j such that i ≠ j. Compute the similarity between two tuples (clusters) of the database by identifying the number of matches over attributes, then use equation (3) to calculate the value for this step and update the matrix accordingly.
5. Since only the upper triangular matrix is used, identify the highest value in the matrix and merge the corresponding clusters i and j. The changes in the matrix include:
a) Set (j,j) = -9000 to indicate that this cluster has been merged with some other cluster.
b) Set (i,j) = 8000, which denotes that, for the corresponding row i, all j's with the value 8000 have been merged with i.
c) During the next iteration, do not consider the similarity between clusters that have been merged. For example, if database D contains 4 (n) tuples with 5 (m) attributes, and tuples 1 and 2 have been merged, then the following similarities have to be calculated:
sim(1,3) = sim(1,3) + sim(2,3), where $l_i = 2$, $l_j = 1$
sim(3,4), where $l_i = 1$, $l_j = 1$
sim(1,4) = sim(1,4) + sim(2,4), where $l_i = 2$, $l_j = 1$
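The merge procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function names `tuple_matches`, `sim`, and `mhc` are hypothetical, cluster matches are assumed to be summed over all cross-cluster tuple pairs (consistent with the update sim(1,3) = sim(1,3) + sim(2,3)), and ties are broken by taking the first best pair. For brevity the sketch recomputes similarities on every pass instead of maintaining the similarity matrix with the special values 8000 and -9000. The input rows are made-up toy data.

```python
from itertools import combinations, product

def tuple_matches(x, y):
    """Count attribute positions where two tuples hold the same categorical value."""
    return sum(1 for a, b in zip(x, y) if a == b)

def sim(ci, cj, num_attributes):
    """Equation (3), with matches summed over all cross-cluster tuple pairs (assumption)."""
    matches = sum(tuple_matches(x, y) for x, y in product(ci, cj))
    return matches / (num_attributes * len(ci) * len(cj))

def mhc(data, k):
    """Merge the most similar pair of clusters n - k times and return k clusters."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of tuples")
    num_attributes = len(data[0])
    clusters = [[row] for row in data]       # Step 1: one tuple per cluster
    for _ in range(len(data) - k):           # Step 2: repeat n - k times
        # Step 3: find the pair with the highest similarity (first pair wins ties).
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: sim(clusters[p[0]], clusters[p[1]], num_attributes),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters

# Toy database: 4 tuples, 3 categorical attributes (illustrative values only).
data = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "r"), ("b", "y", "r")]
result = mhc(data, 2)
print(result)
```

Recomputing every pairwise similarity on each pass is simpler to read but does more work than the matrix-based bookkeeping in the implementation details, which only updates the entries affected by a merge.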
IV. EXPERIMENTAL RESULTS
We implemented this algorithm on a small synthetic database, and the results were good. As the database size increases, however, the algorithm has the drawback of producing mixed clusters. We therefore use a small real-life dataset for the experiments.

Real-life dataset:
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011, p. 41, http://sites.google.com/site/ijcsis/, ISSN 1947-5500