
IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 7, NO. 4, AUGUST 1999

**A Fuzzy k-Modes Algorithm for Clustering Categorical Data**

Zhexue Huang and Michael K. Ng

Abstract— This correspondence describes extensions to the fuzzy k-means algorithm for clustering categorical data. By using a simple matching dissimilarity measure for categorical objects and modes instead of means for clusters, a new approach is developed which allows the use of the k-means paradigm to efficiently cluster large categorical data sets. A fuzzy k-modes algorithm is presented, and the effectiveness of the algorithm is demonstrated with experimental results.

Index Terms— Categorical data, clustering, data mining, fuzzy partitioning, k-means algorithm.

I. INTRODUCTION

The k-means algorithm [1], [2], [8], [11] is well known for its efficiency in clustering large data sets. Fuzzy versions of the k-means algorithm have been reported by Ruspini [15] and Bezdek [3], where each pattern is allowed to have membership functions to all clusters rather than a distinct membership to exactly one cluster. However, working only on numeric data limits the use of these k-means-type algorithms in areas such as data mining, where large categorical data sets are frequently encountered.

Ralambondrainy [13] presented an approach to using the k-means algorithm to cluster categorical data. His approach converts multiple categorical attributes into binary attributes, using one for presence of a category and zero for absence of it, and then treats the binary attributes as numeric ones in the k-means algorithm. This approach needs to handle a large number of binary attributes when data sets have attributes with many categories, which inevitably increases both the computational cost and the memory storage of the k-means algorithm. The other drawback is that the cluster means, given by real values between zero and one, do not indicate the characteristics of the clusters.

Other algorithms for clustering categorical data include the hierarchical clustering methods using Gower's similarity coefficient [6] or other dissimilarity measures [5], the PAM algorithm [9], the fuzzy-statistical algorithms [18], and the conceptual clustering methods [12]. All of these methods suffer from a common efficiency problem when applied to massive, categorical-only data sets. For instance, the computational complexity of most hierarchical clustering methods is O(n^2) [1], and the PAM algorithm has a complexity of O(k(n - k)^2) per iteration [14], where n is the size of the data set and k is the number of clusters.
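The binary-encoding step used by Ralambondrainy's approach can be sketched as follows. This is a minimal illustration, not his implementation; the attribute domains and records are invented.

```python
def one_hot_encode(records, domains):
    """Expand categorical records into binary (0/1) vectors:
    one binary attribute per category of each categorical attribute."""
    encoded = []
    for rec in records:
        vec = []
        for value, domain in zip(rec, domains):
            # 1 for presence of the category, 0 for absence
            vec.extend(1 if value == cat else 0 for cat in domain)
        encoded.append(vec)
    return encoded

domains = [["red", "green", "blue"], ["small", "large"]]
records = [("red", "large"), ("blue", "small")]
print(one_hot_encode(records, domains))
# -> [[1, 0, 0, 0, 1], [0, 0, 1, 1, 0]]
```

Note how a 2-attribute record already becomes a 5-dimensional binary vector; with many-category attributes the dimensionality, and hence the cost of k-means on the encoded data, grows quickly.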

Manuscript received September 2, 1997; revised October 28, 1998. This work was supported in part by HKU CRCG under Grants 10201824 and 10201939.

Z. Huang is with Management Information Principles, Ltd., Melbourne, Australia. M. K. Ng is with the Department of Mathematics, The University of Hong Kong, Hong Kong.

Publisher Item Identifier S 1063-6706(99)06705-3.


To tackle the problem of clustering large categorical data sets in data mining, the k-modes algorithm has recently been proposed in [7]. The k-modes algorithm extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process to minimize the clustering cost function. These extensions have removed the numeric-only limitation of the k-means algorithm and enable it to be used to efficiently cluster large categorical data sets from real-world databases.

In this paper, we introduce a fuzzy k-modes algorithm which generalizes our previous work in [7]. This is achieved by the development of a new procedure to generate the fuzzy partition matrix from categorical data within the framework of the fuzzy k-means algorithm [3]. The main result of this paper is a method to find the fuzzy cluster modes when the simple matching dissimilarity measure is used for categorical objects. The fuzzy version improves on the k-modes algorithm by assigning a confidence to objects in different clusters. These confidence values can be used to decide the core and boundary objects of clusters, thereby providing more useful information for dealing with boundary objects.

II. NOTATION

We assume the set of objects to be clustered is stored in a database table T defined by a set of attributes A_1, ..., A_m. Each attribute A_j describes a domain of values, denoted by DOM(A_j), associated with a defined semantic and a data type. In this letter, we only consider two general data types, numeric and categorical, and assume that other types used in database systems can be mapped to one of these two types. The domains of attributes associated with these two types are called numeric and categorical, respectively. A numeric domain consists of real numbers. A domain DOM(A_j) is defined as categorical if it is finite and unordered, e.g., for any a, b in DOM(A_j), either a = b or a != b; see for instance [5].
An object X in T can be logically represented as a conjunction of attribute-value pairs [A_1 = x_1] and ... and [A_m = x_m], where x_j in DOM(A_j) for 1 <= j <= m. Without ambiguity, we represent X as a vector [x_1, ..., x_m]. X is called a categorical object if it has only categorical values. We consider every object to have exactly m attribute values; if the value of an attribute A_j is missing, then we denote the attribute value of A_j by epsilon.

Let X = {X_1, X_2, ..., X_n} be a set of n objects. Object X_i is represented as [x_{i,1}, ..., x_{i,m}]. We write X_i = X_l if x_{i,j} = x_{l,j} for 1 <= j <= m. The relation X_i = X_l does not mean that X_i and X_l are the same object in the real-world database, but rather that the two objects have equal values for the attributes A_1, ..., A_m.

1063–6706/99$10.00 © 1999 IEEE

III. HARD AND FUZZY k-MEANS ALGORITHMS

Let X = {X_1, X_2, ..., X_n} be a set of n objects described by m numeric attributes. The hard and fuzzy k-means clustering algorithms to cluster X into k clusters can be stated as the algorithms [3], [11] which attempt to minimize the cost function

    F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}^{\alpha} d(Z_l, X_i)    (1)

subject to

    w_{li} \in \{0, 1\} (hard) or w_{li} \in [0, 1] (fuzzy), for 1 <= l <= k, 1 <= i <= n    (2)

    \sum_{l=1}^{k} w_{li} = 1, for 1 <= i <= n    (3)

    0 < \sum_{i=1}^{n} w_{li} < n, for 1 <= l <= k    (4)

where k (<= n) is a known number of clusters, W = [w_{li}] is a k-by-n real matrix, Z = [Z_1, Z_2, ..., Z_k] is the set of cluster centers, \alpha \in [1, \infty) is a weighting exponent, and d(Z_l, X_i) is some dissimilarity measure between Z_l and X_i. In the literature, the Euclidean norm is often used in the k-means algorithm.

Minimization of F in (1) with the constraints in (2)-(4) forms a class of constrained nonlinear optimization problems whose solutions are unknown. The usual method toward optimization of F in (1) is to use partial optimization for Z and W [3]. In this method, we first fix Z and find the necessary conditions on W to minimize F; then we fix W and minimize F with respect to Z. This process is formalized in the k-means algorithm as follows.

Algorithm 1—The k-Means Algorithm:
1) Choose an initial point Z^{(1)}. Determine W^{(1)} such that F(W, Z^{(1)}) is minimized. Set t = 1.
2) Determine Z^{(t+1)} such that F(W^{(t)}, Z^{(t+1)}) is minimized. If F(W^{(t)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t)}), then stop; otherwise go to step 3).
3) Determine W^{(t+1)} such that F(W^{(t+1)}, Z^{(t+1)}) is minimized. If F(W^{(t+1)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t+1)}), then stop; otherwise set t = t + 1 and go to step 2).

The matrices W and Z are calculated according to the following two theorems.

Theorem 1: Let Z be fixed and consider Problem (P1): minimize F(W, Z) subject to (2)-(4). For \alpha > 1, the minimizer W of Problem (P1) is given by

    w_{li} = 1, if X_i = Z_l;
    w_{li} = 0, if X_i = Z_h for some h != l;
    w_{li} = 1 / \sum_{h=1}^{k} [ d(Z_l, X_i) / d(Z_h, X_i) ]^{1/(\alpha - 1)}, otherwise.    (5)

For \alpha = 1, w_{li} = 1 if d(Z_l, X_i) is minimal over l, and w_{li} = 0 otherwise. We remark that for the case of \alpha = 1 the minimum solution is not unique, so X_i may arbitrarily be assigned to the first minimizing index l, and the remaining entries of this column of W are put to zero. The proof of Theorem 1 can be found in [3].

Theorem 2: Let W be fixed and consider Problem (P2): minimize F(W, Z). Then the minimizer Z of Problem (P2) is given by

    z_{lj} = \sum_{i=1}^{n} w_{li}^{\alpha} x_{ij} / \sum_{i=1}^{n} w_{li}^{\alpha}, for 1 <= l <= k and 1 <= j <= m.

Most k-means-type algorithms have been proved convergent and often terminate at a local minimum (see for instance [3], [4], [11], [16], [17]). The computational complexity of the algorithm is O(Tkn) operations, where T is the number of iterations, k is the number of clusters, and n is the number of objects. As for storage, we need space to hold the set of objects, the cluster centers, and the partition matrix, which is much less than that required by the hierarchical clustering algorithms. When k is small relative to n, the k-means algorithm is best suited for dealing with large data sets; it is faster than the hierarchical clustering algorithms, whose computational complexity is generally O(n^2) [1]. However, working only on numeric values limits its use in applications such as data mining, in which categorical values are frequently encountered. This limitation is removed in the hard and fuzzy k-modes algorithms to be discussed in the next section.

IV. HARD AND FUZZY k-MODES ALGORITHMS

The hard k-modes algorithm, first introduced in [7], has made the following modifications to the k-means algorithm: 1) using a simple matching dissimilarity measure for categorical objects; 2) replacing the means of clusters with the modes; and 3) using a frequency-based method to find the modes to solve Problem (P2). These modifications have removed the numeric-only limitation of the k-means algorithm but maintain its efficiency in clustering large categorical data sets [7].
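The alternating updates of Theorems 1 and 2 can be sketched as follows. This is a minimal illustration in our own notation, assuming the dissimilarity d is the squared Euclidean distance; the function names are ours, not the paper's.

```python
def dissim(z, x):
    """Squared Euclidean distance, used here as the dissimilarity d(Z_l, X_i)."""
    return sum((a - b) ** 2 for a, b in zip(z, x))

def update_memberships(points, centers, alpha):
    """Fuzzy memberships w_{li} per Theorem 1 (alpha > 1)."""
    k = len(centers)
    w = [[0.0] * len(points) for _ in range(k)]
    for i, x in enumerate(points):
        d = [dissim(z, x) for z in centers]
        if 0.0 in d:
            # Object coincides with a center: full membership there.
            w[d.index(0.0)][i] = 1.0
        else:
            for l in range(k):
                w[l][i] = 1.0 / sum((d[l] / d[h]) ** (1.0 / (alpha - 1.0))
                                    for h in range(k))
    return w

def update_centers(points, w, alpha):
    """Cluster centers per Theorem 2: means weighted by w**alpha."""
    m = len(points[0])
    centers = []
    for wl in w:
        wts = [wli ** alpha for wli in wl]
        total = sum(wts)
        centers.append([sum(wi * x[j] for wi, x in zip(wts, points)) / total
                        for j in range(m)])
    return centers

points = [[0.0, 0.0], [1.0, 0.0], [9.0, 10.0]]
centers = [[0.0, 0.0], [10.0, 10.0]]
w = update_memberships(points, centers, alpha=2.0)
centers = update_centers(points, w, alpha=2.0)
```

Iterating these two updates until the cost no longer decreases is exactly the alternating scheme of Algorithm 1.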

Let X and Y be two categorical objects represented by [x_1, ..., x_m] and [y_1, ..., y_m], respectively. The simple matching dissimilarity measure between X and Y is defined as follows:

    d_c(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j)    (6)

where \delta(x_j, y_j) = 0 if x_j = y_j, and \delta(x_j, y_j) = 1 otherwise. It is easy to verify that the function d_c defines a metric space on the set of categorical objects. We note that d_c is also a kind of generalized Hamming distance [10]. Traditionally, the simple matching approach is often used on binary variables which are converted from categorical variables [9, pp. 28-29].

The k-modes algorithm uses the k-means paradigm to cluster categorical data. The objective of clustering a set X of n categorical objects into k clusters is to find W and Z that minimize

    F_c(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}^{\alpha} d_c(Z_l, X_i)    (7)

with the other conditions the same as in (1). Here, Z represents a set of modes for the k clusters.^1 We can still use Algorithm 1 to minimize F_c. However, the way to update Z at each iteration is different from the method given in Theorem 2. Huang [7] has presented a frequency-based method to update Z for the hard k-partition case, i.e., \alpha = 1. This method can be described as follows.

Theorem 3—The Hard k-Modes Update Method: Let X be a set of n categorical objects described by categorical attributes A_1, ..., A_m, and let DOM(A_j) = {a_j^{(1)}, a_j^{(2)}, ..., a_j^{(n_j)}}, where n_j is the number of categories of attribute A_j for 1 <= j <= m. Let the cluster modes Z_l be represented by [z_{l1}, ..., z_{lm}] for 1 <= l <= k. Then the quantity \sum_{i=1}^{n} w_{li} d_c(Z_l, X_i) is minimized iff z_{lj} = a_j^{(r)}, where

    |{ i : x_{ij} = a_j^{(r)} and w_{li} = 1 }| >= |{ i : x_{ij} = a_j^{(t)} and w_{li} = 1 }|, for 1 <= t <= n_j.    (8)

According to (8), the category of attribute A_j of the cluster mode Z_l is determined by the mode of the categories of attribute A_j in the set of objects belonging to cluster l.

The main problem addressed in the present paper is to find the fuzzy cluster modes Z when the dissimilarity measure defined in (6) is used. The following theorem provides the update method.

Theorem 4—The Fuzzy k-Modes Update Method: Let X be a set of n categorical objects described by categorical attributes A_1, ..., A_m, and let DOM(A_j) = {a_j^{(1)}, a_j^{(2)}, ..., a_j^{(n_j)}}, where n_j is the number of categories of attribute A_j for 1 <= j <= m. Let the cluster modes Z_l be represented by [z_{l1}, ..., z_{lm}] for 1 <= l <= k. Then the quantity \sum_{i=1}^{n} w_{li}^{\alpha} d_c(Z_l, X_i) is minimized iff z_{lj} = a_j^{(r)}, where

    \sum_{i : x_{ij} = a_j^{(r)}} w_{li}^{\alpha} >= \sum_{i : x_{ij} = a_j^{(t)}} w_{li}^{\alpha}, for 1 <= t <= n_j.    (9)

Proof: For a given W, all the inner sums of the quantity \sum_{i=1}^{n} w_{li}^{\alpha} d_c(Z_l, X_i) = \sum_{j=1}^{m} \sum_{i=1}^{n} w_{li}^{\alpha} \delta(x_{ij}, z_{lj}) are nonnegative and independent, so minimizing the quantity is equivalent to minimizing each inner sum. We write the jth inner sum as

    \sum_{i=1}^{n} w_{li}^{\alpha} \delta(x_{ij}, z_{lj}) = \sum_{t=1}^{n_j} \Big( \sum_{i : x_{ij} = a_j^{(t)}} w_{li}^{\alpha} \Big) \delta(a_j^{(t)}, z_{lj}).

Since each w_{li}^{\alpha} is fixed and nonnegative, the inner sum is minimized iff the term \sum_{i : x_{ij} = z_{lj}} w_{li}^{\alpha}, the only term dropped by \delta, is maximal. Thus z_{lj} must be the category a_j^{(r)} achieving the maximum in (9), and the result follows.

If the minimum is not unique, then z_{lj} may arbitrarily be assigned to the first minimizing category in (9). According to Theorem 4, the category of attribute A_j of the cluster mode Z_l is given by the category that achieves the maximum of the summation of w_{li}^{\alpha} over all categories. Combining Theorems 1 and 4 with Algorithm 1 forms the fuzzy k-modes algorithm, in which the modes of the clusters in each iteration are updated according to Theorem 4 and the fuzzy partition matrix is computed according to Theorem 1.

^1 The mode for a set of categorical objects {X_1, ..., X_n} is defined as an object Z that minimizes \sum_{i=1}^{n} d_c(X_i, Z) [7].
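As an illustration, the simple matching measure (6) and the mode update of Theorem 4 can be sketched as follows; the function names and the toy data are our own.

```python
def matching_dissimilarity(x, y):
    """d_c of (6): the number of attributes on which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def fuzzy_mode(objects, memberships, alpha):
    """One cluster's mode per Theorem 4: for each attribute, pick the
    category with the largest sum of w**alpha over the objects carrying
    it (ties broken by the first category encountered)."""
    mode = []
    for j in range(len(objects[0])):
        score = {}
        for x, w in zip(objects, memberships):
            score[x[j]] = score.get(x[j], 0.0) + w ** alpha
        mode.append(max(score, key=score.get))
    return mode

objs = [("a", "p"), ("a", "q"), ("b", "q")]
w = [0.9, 0.8, 0.1]                      # memberships of one cluster
print(matching_dissimilarity(objs[0], objs[2]))  # -> 2
print(fuzzy_mode(objs, w, alpha=1.1))            # -> ['a', 'p']
```

Note that the second attribute of the mode is "p" even though "q" is more frequent: the membership-weighted frequency of "p" (0.9^1.1) exceeds that of "q" (0.8^1.1 + 0.1^1.1), which is exactly the weighting effect Theorem 4 introduces over the hard update of Theorem 3.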

Theorem 5: Let \alpha > 1. The fuzzy k-modes algorithm converges in a finite number of iterations.

Proof: We first note that there are only a finite number of possible cluster centers (modes). We then show that each possible center appears at most once in the sequence generated by the fuzzy k-modes algorithm. Assume that Z^{(t_1)} = Z^{(t_2)} for t_1 != t_2. According to the fuzzy k-modes algorithm, we can compute the minimizers W^{(t_1)} and W^{(t_2)} of Problem (P1) for Z^{(t_1)} and Z^{(t_2)}, respectively, so that F(W^{(t_1)}, Z^{(t_1)}) = F(W^{(t_2)}, Z^{(t_2)}). However, the sequence F(., .) generated by the hard and fuzzy k-modes algorithm is strictly decreasing, which is a contradiction. Hence the result follows. We remark that a similar proof concerning convergence in a finite number of iterations can be found in [16].

We now consider the cost of the fuzzy k-modes algorithm. The computational costs of each step of the fuzzy k-modes algorithm and of the conceptual version of the k-means algorithm [13] are given in Table I, according to Algorithm 1 and Theorems 1, 2, and 4. Here, k is the number of clusters, m is the number of attributes, n is the number of objects, and M = \sum_{j=1}^{m} n_j (>= m) is the total number of categories of all attributes. We remark that the conceptual version of the k-means algorithm requires the multiple categorical attributes to be transformed into binary attributes treated as numeric values. Therefore, when M is large, the cost of the fuzzy k-modes algorithm is significantly less than that of the conceptual version of the k-means algorithm. Similar to the fuzzy k-means-type algorithm, our method requires storage space to hold the set of objects, the cluster centers, and the partition matrix. The hard k-modes algorithm [7] is a special case where \alpha = 1.

TABLE I: The number of operations required in (a) the fuzzy k-modes algorithm and (b) the conceptual version of the k-means algorithm.

V. EXPERIMENTAL RESULTS

To evaluate the performance and efficiency of the fuzzy k-modes algorithm and to compare it with the conceptual k-means algorithm [13] and the hard k-modes algorithm, we carried out several tests of these algorithms on both real and artificial data. The test results are discussed below.

A. Clustering Performance

The first data set used was the soybean disease data set [12], which has 47 records, each described by 35 attributes. Each record is labeled as one of four diseases: Diaporthe Stem Canker, Charcoal Rot, Rhizoctonia Root Rot, and Phytophthora Rot. Except for Phytophthora Rot, which has 17 records, all other diseases have ten records each. We chose this data set to test these algorithms because all attributes of the data can be treated as categorical. Of the 35 attributes, we only selected 21 because the other 14 have only one category.

For the conceptual k-means algorithm, we first converted the multiple categorical attributes into binary attributes, using zero for absence of a category and one for presence of it. The binary values of the attributes were then treated as numeric values in the k-means algorithm.

We used the three clustering algorithms to cluster this data set into four clusters; in our numerical tests, k is equal to four. Each algorithm was run 100 times, and the initial means and modes were randomly selected distinct records from the data set. For the fuzzy k-modes algorithm, we tried several values of \alpha and specified the value that provided the least value of the cost function F_c. Unlike the other two algorithms, the fuzzy k-modes algorithm produces a fuzzy partition matrix W; a record was assigned to the lth cluster if its membership w_{li} was the maximum over all clusters, and if the maximum was not unique, the record was assigned to the first cluster achieving the maximum.

A clustering result was measured by the clustering accuracy r defined as

    r = \sum_{l=1}^{k} a_l / n

where a_l is the number of instances occurring in both cluster l and its corresponding class, and n is the number of instances in the data set.

Table II gives the average accuracy (i.e., the average percentage of correctly classified records over 100 runs) of clustering by each algorithm, together with the average central processing unit (CPU) time used. Fig. 1 shows the distributions of the number of runs with respect to the number of records correctly classified by each algorithm. The overall clustering performance of both the hard and fuzzy k-modes algorithms was better than that of the conceptual k-means algorithm; the number of runs with correct classification of more than 40 records was much larger for the hard and fuzzy k-modes algorithms than for the conceptual k-means algorithm. The fuzzy k-modes algorithm slightly outperformed the hard k-modes algorithm in overall performance, and the average CPU time used by the k-modes-type algorithms was much smaller than that used by the conceptual k-means algorithm.
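The accuracy measure r can be computed as sketched below. As a simplifying assumption of ours, a cluster's corresponding class is taken to be its majority class, which the text does not state explicitly.

```python
from collections import Counter

def clustering_accuracy(cluster_labels, true_labels):
    """r = (sum over clusters l of a_l) / n, where a_l is the count of
    instances in both cluster l and its (majority) corresponding class."""
    n = len(true_labels)
    clusters = {}
    for c, t in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(t)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / n

# Cluster 0 matches class "A" on 2 of its members, cluster 1 matches
# class "B" on 1 of its members: r = (2 + 1) / 4.
print(clustering_accuracy([0, 0, 1, 1], ["A", "A", "B", "A"]))  # -> 0.75
```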

Fig. 1. Distributions of the number of runs with respect to the number of correctly classified records in each run: (a) the conceptual version of the k-means algorithm; (b) the hard k-modes algorithm; (c) the fuzzy k-modes algorithm.

TABLE II: The average clustering accuracy and average CPU time in seconds for the different clustering methods.

TABLE III: The modes of the four clusters produced by (a) the hard k-modes and (b) the fuzzy k-modes algorithms.

To investigate the differences between the hard and fuzzy k-modes algorithms, we compared two clustering results produced by them from the same initial modes. By looking at the accuracies of the two clustering results, we found that the number of records correctly classified by the hard k-modes algorithm was 43, while the number correctly classified by the fuzzy k-modes algorithm was 45. In this case, there was a 4.2% increase in accuracy by the fuzzy k-modes algorithm. We found that such an increase occurred in most cases; however, in a few cases, the clustering results produced by the hard k-modes algorithm were better than those produced by the fuzzy k-modes algorithm (see Fig. 1).

Table III gives the modes of the four clusters produced by the two algorithms. The modes obtained with the two algorithms are not identical, which indicates that the hard and fuzzy k-modes algorithms indeed produce different clusters.

The partition matrix produced by the fuzzy k-modes algorithm provides useful information for identification of the boundary objects which scatter in the cluster boundaries. This can be shown by the following example. Five records are listed in Table IV, together with their dissimilarity values to their corresponding modes, the relevant parts of the partition matrices, their assigned cluster memberships, and their true classes. In Table IV, * denotes the misclassified records. In the clustering result of the hard k-modes algorithm [Table IV(a)], four records were misclassified. The misclassification of two of these records was due to their having the same dissimilarities to the modes of the clusters involved.
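The tie situation just described can be illustrated numerically: an object equidistant from two modes is assigned arbitrarily by the hard algorithm, while the memberships of Theorem 1 (with the matching dissimilarity) expose it as a boundary object. The modes and objects below are invented for illustration.

```python
def matching_dissimilarity(x, y):
    """Number of attributes on which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def fuzzy_weights(x, modes, alpha):
    """Memberships of one object per Theorem 1, using d_c as dissimilarity."""
    d = [matching_dissimilarity(z, x) for z in modes]
    if 0 in d:
        return [1.0 if dl == 0 else 0.0 for dl in d]
    return [1.0 / sum((d[l] / d[h]) ** (1.0 / (alpha - 1.0))
                      for h in range(len(modes)))
            for l in range(len(modes))]

modes = [("a", "a", "a", "a"), ("b", "b", "b", "b")]
boundary = ("a", "a", "b", "b")   # dissimilarity 2 to both modes
core = ("a", "a", "a", "b")       # dissimilarity 1 and 3

print(fuzzy_weights(boundary, modes, alpha=2.0))  # -> [0.5, 0.5]
print([round(v, 2) for v in fuzzy_weights(core, modes, alpha=2.0)])  # -> [0.75, 0.25]
```

The hard algorithm would give both objects an all-or-nothing assignment; the fuzzy memberships distinguish the confident core object from the undecided boundary one.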

TABLE IV: The dissimilarity measures between the misclassified records and the cluster modes, and the corresponding partition matrices, produced by (a) the hard k-modes and (b) the fuzzy k-modes algorithms. The misclassified objects are denoted by *.

Because the dissimilarities of these objects to the modes of the clusters involved were equal, the hard algorithm arbitrarily assigned them to the first cluster. Such records are called boundary objects, and they often cause problems in classification. Some of these misclassifications can be corrected by the fuzzy k-modes algorithm: two objects which were misclassified by the hard k-modes algorithm were correctly classified by the fuzzy one, because they had different dissimilarities to the modes of the clusters. However, the classification of one object still had a problem, and another object which was correctly classified by the hard k-modes algorithm was misclassified by the fuzzy one. From this example we can see that the objects misclassified by the fuzzy k-modes algorithm were boundary objects.

Another advantage of the fuzzy k-modes algorithm is that it not only partitions objects into clusters but also shows how confident an object's assignment to a cluster is. The confidence is determined by the dissimilarity measures of an object to all cluster modes. In many cases, the dissimilarities of objects to the mode of the assigned cluster may be the same, but the confidence values of the objects assigned to that cluster can be quite different, because some objects may also be close to other cluster modes while other objects are close to only one of them. The former objects have less confidence and can also be considered boundary objects. In many applications, it is reasonable to consider cluster boundaries as zonal areas. The hard k-modes algorithm provides no information for identifying these boundary objects.

B. Efficiency

The purpose of the second experiment was to test the efficiency of the fuzzy k-modes algorithm when clustering large categorical data sets. In this experiment, we used an artificial data set having two clusters with 5000 objects each. The objects were described by five categorical attributes, and each attribute had five categories. Each attribute value was generated by rounding toward the nearest integer of a uniform distribution between one and six. Although we used integers to represent the categories of the categorical attributes, the integers had no order.

In creating this data set, we randomly generated two categorical objects with a simple matching dissimilarity of five between them as the inherent modes of the two clusters. This means the maximum dissimilarity between any two objects was five. We purposely divided the objects in each inherent cluster into three groups by their dissimilarity d_c to the inherent mode: 1) less than or equal to one; 2) equal to two; and 3) equal to three. Then we specified the distribution of objects in each group as: 1) 3000; 2) 1500; and 3) 500. Thus, 9000 objects had dissimilarity at most two to the mode of their cluster, and the maximum dissimilarity between each object and its mode was at most three. Since the dissimilarity between the two inherent modes was five, the generated data set had two inherent clusters.

Table V gives the average clustering accuracy and the average CPU time used by the fuzzy k-modes algorithm and the conceptual version of the k-means algorithm on a POWER2 RISC processor of an IBM SP2. The CPU time required for clustering increased linearly as both the number of clusters and the number of records increased. From Table V, we can see that the clustering accuracy of the fuzzy k-modes algorithm was better than that of the conceptual version of the k-means algorithm, and the CPU time used by the fuzzy k-modes algorithm was five times less than that used by the conceptual version of the k-means algorithm.

For the hard k-modes algorithm, Huang [7] has reported some preliminary results in clustering a large real data set consisting of 500 000 records, each described by 34 categorical attributes. These results have shown a good scalability of the k-modes algorithm against the number of clusters for a given number of records and against the number of records for a given number of clusters. Furthermore, as a comparison, we randomly selected 1000 objects from this large data set and tested the subset with a hierarchical clustering algorithm. We found that the clustering accuracy of the hierarchical algorithm was almost the same as that of the fuzzy k-modes algorithm, but the time used by the hierarchical clustering algorithm (4.33 s) was significantly larger than that used by the fuzzy k-modes algorithm (0.384 s). Thus, when the number of objects is large, the hierarchical clustering algorithm will suffer from both storage and efficiency problems. This demonstrates the advantages of the k-modes-type algorithms in clustering large categorical data sets.
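One way the artificial data could have been generated is sketched below. This is our reconstruction under stated assumptions, not the authors' generator: each object in the first group is perturbed on exactly one attribute, although the text allows dissimilarity at most one, and the category labels are our own.

```python
import random

def make_object(mode, n_changes, categories):
    """Copy the mode, then change n_changes distinct attributes to a
    different category, giving an object at dissimilarity n_changes."""
    obj = list(mode)
    for j in random.sample(range(len(mode)), n_changes):
        obj[j] = random.choice([c for c in categories if c != obj[j]])
    return tuple(obj)

def make_cluster(mode, categories):
    """One inherent cluster: 3000 objects at dissimilarity 1 from the
    mode, 1500 at dissimilarity 2, and 500 at dissimilarity 3."""
    split = [(1, 3000), (2, 1500), (3, 500)]
    data = []
    for changes, count in split:
        data.extend(make_object(mode, changes, categories)
                    for _ in range(count))
    return data

random.seed(0)
categories = ["c1", "c2", "c3", "c4", "c5"]
mode_a = ("c1",) * 5                      # the second mode would differ
cluster = make_cluster(mode_a, categories)  # on all five attributes
dissim = lambda x, y: sum(a != b for a, b in zip(x, y))
print(len(cluster), max(dissim(x, mode_a) for x in cluster))  # -> 5000 3
```

Generating a second cluster around a mode that disagrees with `mode_a` on every attribute reproduces the two-cluster structure described above: within-cluster dissimilarities stay at most three while the two modes are at dissimilarity five.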

TABLE V: The average clustering accuracy and average CPU time required in seconds for the different clustering methods on 10 000 objects.

VI. CONCLUSIONS

Categorical data are ubiquitous in real-world databases. However, few efficient algorithms are available for clustering massive categorical data. The development of the k-modes-type algorithms was motivated by this problem. We have introduced the fuzzy k-modes algorithm for clustering categorical objects based on extensions to the fuzzy k-means algorithm. The most important result of this work is the consequence of Theorem 4, which allows the k-means paradigm to be used in generating the fuzzy partition matrix from categorical data. This procedure removes the numeric-only limitation of the fuzzy k-means algorithm. The other important result is the proof of convergence, which demonstrates a nice property of the fuzzy k-modes algorithm.

The experimental results have shown that the k-modes-type algorithms are effective in recovering the inherent clustering structures from categorical data if such structures exist. Moreover, the fuzzy partition matrix provides more information to help the user determine the final clustering and to identify the boundary objects. Such information is extremely useful in applications such as data mining, in which the uncertain boundary objects are sometimes more interesting than the objects which can be clustered with certainty.

REFERENCES

[1] M. R. Anderberg, Cluster Analysis for Applications. New York: Academic, 1973.
[2] G. H. Ball and D. J. Hall, "A clustering technique for summarizing multivariate data," Behavioral Sci., vol. 12, pp. 153–155, 1967.
[3] J. C. Bezdek, "A convergence theorem for the fuzzy ISODATA clustering algorithms," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-2, no. 1, pp. 1–8, Jan. 1980.
[4] R. J. Hathaway and J. C. Bezdek, "Local convergence of the fuzzy c-means algorithms," Pattern Recognition, vol. 19, no. 6, pp. 477–480, 1986.
[5] K. C. Gowda and E. Diday, "Symbolic clustering using a new dissimilarity measure," Pattern Recognition, vol. 24, no. 6, pp. 567–578, 1991.
[6] J. C. Gower, "A general coefficient of similarity and some of its properties," BioMetrics, vol. 27, pp. 857–874, 1971.
[7] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining Knowledge Discovery, vol. 2, no. 3, pp. 283–304, 1998.
[8] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[9] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data—An Introduction to Cluster Analysis. New York: Wiley, 1990.
[10] T. Kohonen, Content-Addressable Memories. Berlin, Germany: Springer-Verlag, 1980.
[11] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, Berkeley, CA, 1967, pp. 281–297.
[12] R. S. Michalski and R. E. Stepp, "Automated construction of classifications: Conceptual clustering versus numerical taxonomy," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, no. 4, pp. 396–410, July 1983.
[13] H. Ralambondrainy, "A conceptual version of the k-means algorithm," Pattern Recognition Lett., vol. 16, pp. 1147–1157, 1995.
[14] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proc. 20th Very Large Databases Conf., Santiago, Chile, 1994, pp. 144–155.
[15] E. R. Ruspini, "A new approach to clustering," Inform. Contr., vol. 15, pp. 22–32, 1969.
[16] S. Z. Selim and M. A. Ismail, "k-means-type algorithms: A generalized convergence theorem and characterization of local optimality," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-6, no. 1, pp. 81–87, Jan. 1984.
[17] M. A. Ismail and S. Z. Selim, "Fuzzy c-means: Optimality of solutions and effective termination of the problem," Pattern Recognition, vol. 19, no. 6, pp. 481–485, 1986.
[18] M. A. Woodbury and J. A. Clive, "Clinical pure types as a fuzzy partition," J. Cybern., vol. 4-3, pp. 111–121, 1974.
