(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011, ISSN 1947-5500, http://sites.google.com/site/ijcsis/

A New Approach for Clustering Categorical Attributes

Parul Agarwal 1, M. Afshar Alam 2
Department of Computer Science, Jamia Hamdard (Hamdard University), New Delhi-110062, India
parul.pragna4@gmail.com, aalam@jamiahamdard.ac.in

Ranjit Biswas 3
Manav Rachna International University, Green Fields Colony, Faridabad, Haryana 121001, India
ranjitbiswas@yahoo.com

Abstract— Clustering is the process of grouping similar objects together, placing each object in the cluster to which it is most similar. In this paper we provide a new measure for calculating the similarity between two clusters of categorical attributes; the approach used is agglomerative hierarchical clustering.

Keywords- Agglomerative hierarchical clustering, Categorical Attributes, Number of Matches.

I. INTRODUCTION

Data mining is a process of extracting useful information. Clustering is one of the problems solved in data mining: it discovers interesting patterns in the underlying data by grouping similar objects together in a cluster (or clusters) and dissimilar objects in other clusters. This grouping is based on the approach used by the algorithm and on the similarity measure, which identifies the similarity between an object and a cluster. Clustering methods are broadly divided into hierarchical and partitional. Hierarchical clustering performs partitioning sequentially, working either bottom-up or top-down. The bottom-up approach, known as agglomerative, starts with each object in a separate cluster and keeps combining clusters on the basis of the similarity measure until all objects are combined into one big cluster. The top-down approach, known as divisive, starts with all objects in one big cluster and divides it into smaller clusters until each cluster consists of a single object.

The general approach of hierarchical clustering is to use an appropriate metric, which measures the distance between two tuples, together with a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. The linkage criterion can be of three types [28]: single linkage, complete linkage and average linkage. In single linkage (also known as nearest neighbour), the distance between two clusters is computed as

D(Ci,Cj) = min { d(a,b) : a ∈ Ci, b ∈ Cj }

i.e. the distance between two clusters is given by the value of the shortest link between the clusters, where only one object from each cluster is considered. In complete linkage (also known as farthest neighbour), the distance between two clusters is defined as the distance between the most distant pair of objects, one from each cluster:

D(Ci,Cj) = max { d(a,b) : a ∈ Ci, b ∈ Cj }

i.e. the distance between two clusters is given by the value of the longest link between the clusters. In average linkage,

D(Ci,Cj) = Σ { d(a,b) : a ∈ Ci, b ∈ Cj } / (l1 * l2)

where l1 is the cardinality of cluster Ci, l2 is the cardinality of cluster Cj, and d(a,b) is the distance defined. Partitional clustering, on the other hand, breaks the data into disjoint clusters.

In Section II we discuss the related work. In Section III we describe our algorithm, followed by Section IV containing the experimental results, Section V which contains the conclusion, and Section VI which discusses future work.

II. RELATED WORK

Hierarchical clustering has its basis in older algorithms such as the Lance-Williams formula (a dissimilarity update formula that computes the dissimilarity between a newly formed cluster and the existing points from the dissimilarities found prior to the merge), conceptual clustering, SLINK [1] and COBWEB [2], as well as in newer algorithms like CURE [3] and CHAMELEON [4]. The SLINK algorithm performs single-link (nearest-neighbour) clustering on arbitrary dissimilarity coefficients and constructs a representation of the dendrogram which can be
converted into a tree representation. COBWEB constructs a dendrogram representation, known as a classification tree, that characterizes each cluster with a probabilistic distribution. CURE (Clustering Using REpresentatives) is an algorithm that handles large databases and employs a combination of random sampling and partitioning: a random sample is drawn from the data set and partitioned, each partition is partially clustered, and the partial clusters are then clustered in a second pass to yield the desired clusters. CURE has the advantage of effectively handling outliers. CHAMELEON combines graph partitioning and dynamic modeling into agglomerative hierarchical clustering and can perform clustering on all types of data; the interconnectivity between two clusters should be high compared with the intra-connectivity between objects within a given cluster.

In the partitioning method, on the other hand, a partitioning algorithm arranges all the objects into various groups or partitions, where the total number of partitions (k) is less than the number of objects (n), i.e. a database of n objects can be arranged into k partitions, where k < n. Each of the partitions obtained by applying some similarity function is a cluster. The partitioning methods are subdivided into probabilistic clustering [5] (EM, AUTOCLASS), algorithms that use the k-medoids method (like PAM [6], CLARA [6], CLARANS [7]), and k-means methods (which differ on parameters like initialization, optimization and extensions). EM (the expectation-maximization algorithm) calculates the maximum likelihood estimate by using the marginal likelihood of the observed data for a given statistical model which depends on unobserved latent data or missing values; this algorithm, however, depends on the order of input. AUTOCLASS works for both continuous and categorical data. It is a powerful unsupervised Bayesian classification system which mainly has applications in the biological sciences and is able to handle missing values.

PAM (Partitioning Around Medoids) builds k representative objects, called medoids, chosen randomly from a given dataset of n objects. A medoid is an object of a given cluster whose average dissimilarity to all the objects in the cluster is the least. Each object in the dataset is then assigned to the nearest medoid. The purpose of the algorithm is to minimize the objective function, which is the sum of the dissimilarities of all the objects to their nearest medoid. CLARA (Clustering LARge Applications) deals with large data sets. It combines sampling with the PAM algorithm to generate an optimal set of medoids for the sample, and tries to find k representative objects that are centrally located in the cluster. It considers data subsets of fixed size, so that the overall computation time and storage requirements become linear in the total number of objects. CLARANS (Clustering Large Applications based on RANdomized Search) views the process of finding k medoids as searching in a graph [12]. CLARANS performs serial randomized search instead of exhaustively searching the data, and identifies spatial structures present in the data.

Partitioning algorithms may also be density based, i.e. they try to discover dense connected components of data, which are flexible in terms of their shape. Several such algorithms, like DBSCAN [8] and OPTICS, have been proposed. The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm identifies clusters on the basis of the density of points: regions with a high density of points depict the existence of clusters, whereas regions with a low density of points indicate noise or outliers. Its main features include the ability to handle large datasets with noise and to identify clusters of different sizes and shapes. OPTICS (Ordering Points To Identify the Clustering Structure), though similar to DBSCAN in being density based and working over spatial data, differs by addressing the problem DBSCAN has in detecting meaningful clusters in data of varying density. Another category is grid-based methods like BANG [9], in addition to evolutionary methods such as Simulated Annealing (a probabilistic method of finding the global minimum of a cost function having many local minima) and Genetic Algorithms [10]. Several scalability algorithms, e.g. BIRCH [11] and DIGNET [12], have been suggested in the recent past to address the issues associated with large databases. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an incremental and agglomerative hierarchical clustering algorithm for databases too large to fit in main memory. This algorithm performs only a single scan of the database and effectively deals with data containing noise.

Another category of algorithms deals with high-dimensional data and works on subspace clustering, projection techniques and co-clustering techniques. Subspace clustering finds clusters in various subspaces within a dataset. High-dimensional data may consist of thousands of dimensions and may thus pose difficulty in enumeration, owing to the multiple values the dimensions may take, and in visualization, owing to the fact that many of the dimensions may often be irrelevant. The problem with subspace clustering is that with d dimensions there exist 2^d subspaces. Projected clustering [14] assigns each point to a unique cluster, but clusters may exist in different subspaces. Co-clustering or bi-clustering [15] is the simultaneous clustering of the rows and columns of a matrix, i.e. of tuples and attributes.

The techniques for grouping objects are different for numerical and categorical data owing to their separate nature. Real-world databases contain both numerical and categorical data, so we need separate similarity measures for both types. Numerical data is generally grouped on the basis of inherent geometric properties, such as the distances between points (most commonly Euclidean, Manhattan, etc.). For categorical data, on the other hand, the set of values an attribute can take is small, and it is difficult to measure similarity on the basis of distance as we can for real numbers. There exist two approaches for handling mixed types of attributes. The first is to group all variables of the same type in a particular cluster and perform a separate dissimilarity computation for each variable-type cluster. The second approach is to group all the variables of different types into a single cluster using a dissimilarity matrix, making a set of common-scale variables; then, using the dissimilarity formula for such cases, we perform the clustering.
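The second approach above can be sketched as a common-scale dissimilarity. The function below is illustrative only: the paper does not fix a formula for the common scale, so a Gower-style combination (mismatch for categorical attributes, range-normalised difference for numeric ones) is assumed, and all names are hypothetical.

```python
def mixed_dissimilarity(x, y, is_categorical, ranges):
    """Illustrative common-scale dissimilarity for mixed-type tuples
    (a Gower-style average; assumed, not taken from the paper).
    is_categorical[i] marks attribute i as categorical; ranges[i]
    is the observed range of numeric attribute i, used to
    normalise it onto a 0-1 scale."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if is_categorical[i]:
            total += 0.0 if a == b else 1.0   # simple mismatch
        else:
            total += abs(a - b) / ranges[i]   # range-normalised difference
    return total / len(x)

# One categorical attribute and one numeric attribute with range 0..100:
print(mixed_dissimilarity(("red", 20), ("blue", 70),
                          [True, False], [None, 100]))  # (1 + 0.5)/2 = 0.75
```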

There exist several clustering algorithms for numerical datasets, the most common being k-means, BIRCH, CURE and CHAMELEON. The k-means algorithm takes as input the number of clusters desired. From the given database it randomly selects k tuples as centres, then assigns the objects in the database to these clusters on the basis of distance. It then recomputes the k centres and continues the process until the centres no longer move. K-means has also been proposed as fuzzy k-means and has been extended to categorical attributes; the original work has been explored by several authors and several extension algorithms have been proposed in the recent past. In [16], Ralambondrainy proposed an extended version of the k-means algorithm which converts categorical attributes into binary ones. In that paper the author represents every attribute in the form of binary values, which results in increased time and space when the number of categorical attributes is large. A few algorithms have been proposed in the last few years which cluster categorical data; some of them are listed in [17-19]. Recently, work has been done to define a good distance (dissimilarity) measure between categorical data objects [20-22, 25]. For mixed data types a few algorithms [23-24] have been written. In [22] the author presents the k-modes algorithm, an extension of the k-means algorithm in which the number of mismatches between categorical attributes is taken as the measure for performing clustering. In k-prototypes, the distance measure for numerical data is a weighted sum of Euclidean distances, and for categorical data a measure is proposed in the paper. K-representative is a frequency-based algorithm which considers the frequency of an attribute value in a cluster divided by the length of the cluster.

III. THE PROPOSED ALGORITHM (MHC: Matches-based Hierarchical Clustering, where M stands for the number of matches)
This algorithm works for categorical datasets and constructs clusters hierarchically. If D is a database with domains D1, ..., Dm defined by attributes A1, ..., Am, then each tuple X in D is represented as

X = (x1, x2, ..., xm) ∈ (D1 × D2 × ... × Dm).   (1)

Let there be n objects in the database; then D = (X1, X2, ..., Xn), where object Xi is represented as

Xi = (xi1, xi2, ..., xim)   (2)

and m is the total number of attributes. We then define the similarity between any two clusters as

Sim(Ci,Cj) = matches(Ci,Cj) / (n * li * lj)   (3)

where Ci, Cj denote the clusters whose similarity is being calculated, matches(Ci,Cj) denotes the number of matches between two tuples over corresponding attributes, and n is the total number of attributes in the database.

li : length of cluster Ci
lj : length of cluster Cj
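Equation (3) can be sketched in a few lines. This is an illustrative reading, not the authors' Matlab code: the cluster-level match count is taken as the sum of per-pair attribute matches over all cross-cluster tuple pairs (consistent with the update sim(1,3) = sim(1,3) + sim(2,3) in the implementation details), and n is the number of attributes, as stated for (3).

```python
def matches(xi, xj):
    """Number of corresponding attributes on which two tuples agree."""
    return sum(1 for a, b in zip(xi, xj) if a == b)

def sim(ci, cj, n):
    """Eq. (3): Sim(Ci,Cj) = matches(Ci,Cj) / (n * li * lj), with the
    cluster-level match count taken as the sum over all cross-cluster
    tuple pairs (an assumed reading) and n the number of attributes."""
    total = sum(matches(x, y) for x in ci for y in cj)
    return total / (n * len(ci) * len(cj))

# Two singleton clusters over n = 3 categorical attributes:
c1 = [("red", "round", "small")]
c2 = [("red", "square", "small")]
print(sim(c1, c2, 3))  # 2 matches out of 3 attributes -> 0.666...
```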

A. The Algorithm:

Input: number of clusters (k), data to be clustered (D)
Output: k clusters.

Step 1. Begin with n clusters, each containing just one tuple.
Step 2. Repeat Step 3 for n-k times.
Step 3. Find the most similar clusters Ci and Cj using the similarity measure Sim(Ci,Cj) of "(3)" and merge them into a single cluster.

B. Implementation Details:

1. This algorithm has been implemented in Matlab [26]; the main advantage is that we do not have to reconstruct the similarity matrix once it has been built.
2. It is simple to implement.
3. Given n tuples, construct an n×n similarity matrix with every entry (i,i) initially set to 8000 (a special value) and the rest set to 0.
4. During the first iteration, calculate the similarity of each cluster with every other cluster: for all i, j such that i ≠ j, compute the similarity between the two tuples (clusters) by counting the number of matches over attributes, use equation (3) to calculate the value, and update the matrix accordingly.
5. Since only the upper triangular part of the matrix is used, identify the highest value in the matrix and merge the corresponding clusters i and j. The changes to the matrix are:
a) Set (j,j) = -9000 to record that this cluster has been merged into another cluster.
b) Set (i,j) = 8000, which denotes that for row i, all columns j with value 8000 have been merged into i.
c) During the next iteration, do not consider the similarity between clusters that have already been merged. For example, if database D contains 4 (n) tuples with 5 (m) attributes, and clusters 1 and 2 have been merged, then the following similarities have to be calculated:
sim(1,3) = sim(1,3) + sim(2,3), where li = 2, lj = 1
sim(3,4), where li = 1, lj = 1
sim(1,4) = sim(1,4) + sim(2,4), where li = 2, lj = 1
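The whole procedure can be sketched compactly, assuming the reading of Sim given earlier. This recomputes similarities each pass instead of maintaining the paper's 8000/-9000 similarity-matrix bookkeeping, so it is a simplification for clarity, not the Matlab implementation.

```python
def mhc(data, k, n_attrs):
    """Simplified sketch of MHC: start with one singleton cluster per
    tuple and merge the most similar pair n-k times using Eq. (3).
    Similarities are recomputed each pass instead of keeping the
    paper's 8000/-9000 similarity-matrix bookkeeping."""
    def matches(x, y):
        return sum(1 for a, b in zip(x, y) if a == b)

    def sim(ci, cj):
        total = sum(matches(x, y) for x in ci for y in cj)
        return total / (n_attrs * len(ci) * len(cj))

    clusters = [[t] for t in data]
    while len(clusters) > k:
        # Step 3: find the most similar pair of clusters and merge them.
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)    # merge Cj into Ci
    return clusters

# Four tuples with m = 2 categorical attributes, clustered into k = 2:
data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "z")]
print(mhc(data, 2, 2))   # groups the two ("a","x") tuples together
```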

IV. EXPERIMENTAL RESULTS

We have implemented this algorithm on a small synthetic database, and the results have been good. But as the size increases, the algorithm has the drawback of producing mixed clusters. Thus, for our experiments we consider a real-life dataset which is small in size.

Real-life dataset:

This dataset has been taken from the UCI machine learning repository [27]. It is the soybean small dataset, a small subset of the original soybean large database. The soybean large dataset has 307 instances and 35 attributes, along with some missing values, and its data has been classified into 19 classes. The soybean small dataset, with no missing values, consists of 47 tuples with 35 attributes, classified into 4 classes. Both datasets are used for soybean disease diagnosis. A few of the attributes are germination (in %), area damaged, plant growth (norm, abnorm), leaves (norm, abnorm), etc.

Table 1

Classes   Expected No. of Clusters   Resultant No. of Clusters
1         10                         10
2         10                         10
3         10                         10
4         17                         17

Table 2 (MHC)

      C1   C2   C3   C4   P     R     F
c1    10   0    0    0    1     1     1
c2    0    10   0    0    1     1     1
c3    0    0    10   0    1     1     1
c4    0    0    0    17   1     1     1

A. Validation Methods:
1. Precision (P): Precision in simplest terms can be formulated as the number of objects identified correctly as belonging to the class divided by the number of objects identified in that class.
2. Recall (R): Recall can be formulated as the number of objects correctly identified in that class divided by the total number of objects the class actually has.
3. F-measure (F): the harmonic mean of precision and recall, i.e.

F-Measure = (2*P*R)/(P+R).   (4)

The following tables contain the values of the three validation measures discussed above for the algorithms ROCK and k-Modes compared with our algorithm. We denote the four classes obtained in the results as c1, c2, c3, c4 and the actual classes as C1, C2, C3, C4.

Table 3 (ROCK)

      C1   C2   C3   C4   P     R     F
c1    7    0    0    8    0.47  0.70  0.56
c2    1    7    0    0    0.87  0.70  0.78
c3    1    3    4    0    0.50  0.40  0.44
c4    1    0    6    9    0.56  0.52  0.55

Table 4 (k-Modes)

      C1   C2   C3   C4   P     R     F
c1    2    0    0    7    0.22  0.20  0.21
c2    0    8    1    0    0.89  0.80  0.84
c3    6    0    7    1    0.50  0.70  0.58
c4    2    2    2    9    0.60  0.52  0.56

Using Table 4, we shall show how to calculate Precision, Recall and F-Measure for a particular class, say c1, for k-Modes.

In cluster c1, there are 2 tuples that actually belong to class C1 and 7 tuples that belong to class C4. So, Precision (P) = 2/(2+7) = 0.22. Also, a total of 10 tuples should belong to this class, against the 2 obtained by the k-Modes algorithm, so Recall (R) = 2/10 = 0.20. The F-Measure calculated using equation (4) for class c1 is (2*0.22*0.20)/(0.22+0.20) = 0.21. Thus the experimental results clearly indicate that MHC has generated accurate clusters, achieving 100% accuracy, in contrast to the other algorithms.
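The arithmetic of this worked example can be checked mechanically. The helper below is hypothetical, not part of the paper's implementation; it rounds P and R to two decimals before applying equation (4), matching the figures reported in Table 4.

```python
def precision_recall_f(correct, identified, class_total):
    """Hypothetical helper for the Table 4 arithmetic.
    correct: objects in the cluster that truly belong to the class;
    identified: total objects placed in the cluster;
    class_total: objects the true class actually has."""
    p = round(correct / identified, 2)
    r = round(correct / class_total, 2)
    f = round(2 * p * r / (p + r), 2)      # Eq. (4) on the rounded P, R
    return p, r, f

# Cluster c1 under k-Modes: 2 objects of C1 among 9 placed; C1 has 10.
print(precision_recall_f(2, 9, 10))  # (0.22, 0.2, 0.21)
```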

V. CONCLUSION

This algorithm produces good results for small databases. Its advantages are that it is extremely simple to implement, its memory requirement is low, and its accuracy rate is high compared with other algorithms.

VI. FUTURE WORK

We would like to analyse the results for large databases as well.

REFERENCES

[1] R. Sibson, "SLINK: An optimally efficient algorithm for the single-link cluster method", The Computer Journal, 16(1):30-34, 1973.
[2] D. H. Fisher, "Knowledge acquisition via incremental conceptual clustering", Machine Learning, 2:139-172, 1987.
[3] S. Guha, R. Rastogi, K. Shim, "CURE: A clustering algorithm for large databases", Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, 1998.
[4] G. Karypis, E.-H. Han, V. Kumar, "CHAMELEON: Hierarchical clustering using dynamic modeling", IEEE Computer, Vol. 32, No. 8, pp. 68-75, 1999.
[5] T. Mitchell, Machine Learning, McGraw Hill, 1997.
[6] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, New York: John Wiley & Sons, 1990.
[7] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining", Proc. of the 20th Int'l Conf. on Very Large Data Bases, Santiago, Chile, pp. 144-155, Morgan Kaufmann, 1994.
[8] M. Ester, H. Kriegel, J. Sander, X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", Second Int'l Conf. on Knowledge Discovery and Data Mining, pp. 226-231, Portland, OR: AAAI Press, 1996.
[9] E. Schikuta and M. Erhart, "The BANG-clustering system: Grid-based data analysis", in X. Liu, P. Cohen, M. Berthold (eds.), Lecture Notes in Computer Science, vol. 1280, pp. 513-524, Berlin, Heidelberg: Springer-Verlag, 1997.
[10] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Reading, MA: Addison-Wesley, 1989.
[11] T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: An efficient data clustering method for very large databases", Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 103-114, 1996.
[12] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining", in VLDB-94, 1994.
[13] H.-P. Kriegel, P. Kröger, M. Renz, S. Wurst, "A generic framework for efficient subspace clustering of high-dimensional data", Proc. of the Fifth IEEE Int'l Conf. on Data Mining, pp. 205-25, Washington, DC: IEEE Computer Society, 2005.
[14] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, J. S. Park, "Fast algorithms for projected clustering", ACM SIGMOD Record, 28(2):61-72, 1999.
[15] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey", IEEE/ACM Trans. Computat. Biol. Bioinformatics, vol. 1, no. 1, pp. 24-45, Jan. 2004.
[16] H. Ralambondrainy, "A conceptual version of the k-means algorithm",
[17] Pattern Recogn. Lett., Vol. 15, No. 11, pp. 1147-1157, 1995.
[18] S. Guha, R. Rastogi, K. Shim, "ROCK: A robust clustering algorithm for categorical attributes", Proc. 1999 Int'l Conf. on Data Engineering, pp. 512-521, Sydney, Australia, Mar. 1999.
[19] Y. Zhang, A. W. Fu, C. H. Cai, P.-A. Heng, "Clustering categorical data", Proc. 2000 IEEE Int'l Conf. on Data Engineering, San Diego, USA, March 2000.
[20] D. Gibson, J. Kleinberg, P. Raghavan, "Clustering categorical data: An approach based on dynamical systems", Proc. 1998 Int'l Conf. on Very Large Databases, pp. 311-323, New York, August 1998.
[21] C. R. Palmer, C. Faloutsos, "Electricity based external similarity of categorical attributes", Proc. of the 7th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining (PAKDD'03), pp. 486-500, 2003.
[22] Z. Huang, "Extensions to the k-means algorithm for clustering large datasets with categorical values", Data Mining Knowl. Discov., Vol. 2, No. 2, 1998.
[23] C. Li, G. Biswas, "Unsupervised learning with mixed numeric and nominal data", IEEE Transactions on Knowledge and Data Engineering, 14(4):673-690, 2002.
[24] S.-G. Lee, D.-K. Yun, "Clustering categorical and numerical data: A new procedure using multidimensional scaling", International Journal of Information Technology and Decision Making, 2(1):135-160, 2003.
[25] Ohn Mar San, Van-Nam Huynh, Yoshiteru Nakamori, "An alternative extension of the k-means algorithm for clustering categorical data", Int. J. Appl. Math. Comput. Sci., Vol. 14, No. 2, pp. 241-247, 2004.
[26] MATLAB User's Guide, The MathWorks, Inc., Natick, MA, 1994-2001. http://www.mathworks.com/access/helpdesk/help/techdoc/matlab.shtml
[27] P. M. Murphy and D. W. Aha, UCI repository of machine learning databases, 1992. www.ics.uci.edu/_mlearn/MLRepository.html
[28] www.resample.com/xlminer/help/HClst/HClst_intro.htm
