
INFORMATION PAPER
International Journal of Recent Trends in Engineering, Vol. 1, No. 2, May 2009

A Novel Fuzzy Clustering Method for Outlier Detection in Data Mining


Binu Thomas¹ and Raju G²

¹ Research Scholar, Mahatma Gandhi University, Kerala, India. binumarian@rediffmail.com
² SCMS School of Technology & Management, Cochin, Kerala, India. kurupgraju@rediffmail.com

Abstract: In data mining, conventional clustering algorithms have difficulties in handling the challenges posed by collections of natural data, which are often vague and uncertain. Fuzzy clustering methods have the potential to manage such situations efficiently. This paper introduces the limitations of conventional clustering methods through the k-means and fuzzy c-means algorithms and demonstrates their drawbacks in handling outlier points. We propose a new fuzzy clustering method which handles outlier points more efficiently than the conventional fuzzy c-means algorithm. The new method excludes outlier points by giving them extremely small membership values in the existing clusters, while the fuzzy c-means algorithm tends to give them outsized membership values. The new algorithm also incorporates the positive aspects of the k-means algorithm, calculating the new cluster centers more efficiently than the c-means method.

Index Terms: fuzzy clustering, outlier points, knowledge discovery, c-means algorithm

I. INTRODUCTION

The process of finding useful patterns and information in raw data is often known as knowledge discovery in databases, or KDD. Data mining is a particular step in this process, involving the application of specific algorithms for extracting patterns (models) from data [5]. Cluster analysis is a technique for breaking data down into related components in such a way that patterns and order become visible. It aims at sifting through large volumes of data to reveal useful information in the form of new relationships, patterns, or clusters, for decision-making by a user. Clusters are natural groupings of data items based on similarity metrics or probability density models. Clustering algorithms map a new data item into one of several known clusters. In fact, cluster analysis has the virtue of strengthening the exposure of patterns and behavior as more and more data becomes available [7]. A cluster has a center of gravity, which is basically the weighted average of the cluster. Membership of a data item in a cluster can be determined by measuring the distance from each cluster center to the data point [6]; the data item is added to the cluster for which this distance is a minimum.

This paper provides an overview of the crisp clustering technique, the advantages and limitations of fuzzy c-means clustering, and a new fuzzy clustering method which is simple and superior to c-means clustering in handling outlier points. Section 2 describes the basic notions of clustering and also introduces the k-means clustering algorithm. In Section 3 we explain the concepts of vagueness and uncertainty in natural data. Section 4 introduces the fuzzy clustering method and describes how it can handle vagueness and uncertainty through overlapping clusters with partial membership functions; the same section also introduces the most common fuzzy clustering algorithm, the c-means algorithm, and ends with its limitations. Section 5 proposes the new fuzzy clustering method. Section 6 demonstrates the concepts presented in the paper. Finally, Section 7 concludes the paper.

II. CRISP CLUSTERING TECHNIQUES

Traditional clustering techniques attempt to segment data by grouping related attributes into uniquely defined clusters; each data point in the sample space is assigned to only one cluster. The k-means algorithm and its different variations are the most well-known and commonly used partitioning methods. The value k stands for the number of cluster seeds initially provided to the algorithm, which takes the input parameter k and partitions a set of m objects into k clusters [7]. The technique works by computing the distance between a data point and the cluster centers and adding the item to the cluster for which intra-cluster similarity is high and inter-cluster similarity is low. A common way to find the distance is to calculate the sum of the squared differences as follows; this is known as the Euclidean distance [10] (exp. 1):

$$d_k = \sum_{j=1}^{n} \left(x_{kj} - C_{ij}\right)^2 \qquad (1)$$

where d_k is the distance of the kth data point from the cluster center C_j. With this definition of the distance of a data point from the cluster centers, the k-means algorithm is fairly simple: the cluster centers are randomly initialized, and each data point x_i is assigned to the cluster to which it has minimum distance. When all the data points have been assigned to clusters, new cluster centers are calculated by finding the weighted average of all the data points in each cluster. The cluster center calculation moves the previous centroid location towards the center of the cluster set. This continues until there is no change in the cluster centers.
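For concreteness, a minimal sketch of this procedure in Python follows. This is our own illustration, not the paper's code; the function name, the use of NumPy, and the convergence test are our assumptions, with X an (n, d) array of data points.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Randomly initialize k cluster centers from the data points.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared Euclidean distance (exp. 1) of every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Crisp assignment: each point joins the cluster at minimum distance.
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # stop when centers stop moving
            break
        centers = new_centers
    return centers, labels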

A. Limitations of the k-means algorithm

The main limitation of the algorithm comes from its crisp nature in assigning cluster membership to data points. Depending on the minimum distance, a data point always becomes a member of exactly one of the clusters. This works well with highly structured data, but real-world data is almost never arranged in clear-cut groups. Instead, clusters have ill-defined boundaries that smear into the data space, often overlapping the perimeters of surrounding clusters [4]. In most cases, real-world data also contains extraneous data points that clearly do not belong to any of the clusters; these are called outlier points. The k-means algorithm cannot deal with overlapping clusters and outlier points, since it has to include every data point in one of the existing clusters. Because of this, even extreme outlier points are included in a cluster on the basis of minimum distance.

III. FUZZY LOGIC

The modeling of imprecise and qualitative knowledge, as well as the handling of uncertainty at various stages, is possible through the use of fuzzy sets. Fuzzy logic is capable of supporting, to a reasonable extent, human-type reasoning in natural form by allowing partial membership for data items in fuzzy subsets [2]. Integration of fuzzy logic with data mining techniques has become one of the key constituents of soft computing in handling the challenges posed by the massive collection of natural data [1]. Fuzzy logic is the logic of fuzzy sets. A fuzzy set has, potentially, an infinite range of truth values between one and zero [3]. Propositions in fuzzy logic have a degree of truth, and membership in fuzzy sets can be fully inclusive, fully exclusive, or some degree in between [13]. What distinguishes a fuzzy set from a crisp set is that it allows its elements to have a degree of membership. The core of a fuzzy set is its membership function: a function which defines the relationship between a value in the set's domain and its degree of membership in the fuzzy set (exp. 2). The relationship is functional because it returns a single degree of membership for any value in the domain [11].

$$\mu = f(s, x) \qquad (2)$$

where
  μ : is the fuzzy membership value for the element
  s : is the fuzzy set
  x : is the value from the underlying domain.

Fuzzy sets provide a means of defining a series of overlapping concepts for a model variable, since they represent degrees of membership. Values from the complete universe of discourse for a variable can have memberships in more than one fuzzy set.

IV. FUZZY CLUSTERING METHODS

The central idea in fuzzy clustering is the non-unique partitioning of the data into a collection of clusters.

The data points are assigned membership values for each of the clusters, and the fuzzy clustering algorithms allow the clusters to grow into their natural shapes [15]. In some cases the membership value may be zero, indicating that the data point is not a member of the cluster under consideration. Many crisp clustering techniques have difficulties in handling extreme outliers, but fuzzy clustering algorithms tend to give them very small membership degrees in the surrounding clusters [14]. The non-zero membership values, with a maximum of one, show the degree to which the data point represents a cluster. Thus fuzzy clustering provides a flexible and robust method for handling natural data with vagueness and uncertainty. In fuzzy clustering, each data point has an associated degree of membership for each cluster; the membership value is in the range zero to one and indicates the strength of the point's association with that cluster.

A. C-means fuzzy clustering algorithm [10]

Fuzzy c-means clustering involves two processes: the calculation of cluster centers and the assignment of points to these centers using a form of Euclidean distance. The process is repeated until the cluster centers stabilize. The algorithm is similar to k-means clustering in many ways, but it assigns each data item a membership value in the range 0 to 1 for each cluster. It thus incorporates the fuzzy-set concept of partial membership and forms overlapping clusters to support it. The algorithm needs a fuzzification parameter m in the range [1, n] which determines the degree of fuzziness in the clusters. When m reaches the value of 1 the algorithm works like a crisp partitioning algorithm, and for larger values of m the clusters tend to overlap more. The algorithm calculates the membership values with the formula

$$\mu_j(x_i) = \frac{\left(1/d_{ji}\right)^{\frac{1}{m-1}}}{\sum_{k=1}^{p}\left(1/d_{ki}\right)^{\frac{1}{m-1}}} \qquad (3)$$

where
  μ_j(x_i) : is the membership of x_i in the jth cluster
  d_ji : is the distance of x_i from cluster c_j
  m : is the fuzzification parameter
  p : is the number of specified clusters
  d_ki : is the distance of x_i from cluster c_k
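The formula translates directly into code. Below is a small sketch of our own (the function name is an assumption) that computes the memberships of one point x_i given its precomputed distances to all p cluster centers. Note that it fails when some d_ji is zero, a limitation of the c-means approach discussed below.

import numpy as np

def fcm_memberships(dists, m=2.0):
    # dists: the p distances d_1i .. d_pi of one point to every center.
    # exp. (3): mu_j = (1/d_ji)^(1/(m-1)) / sum_k (1/d_ki)^(1/(m-1)).
    inv = (1.0 / np.asarray(dists, dtype=float)) ** (1.0 / (m - 1.0))
    return inv / inv.sum()  # a point's memberships across clusters sum to one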


The new cluster centers are calculated from these membership values using exp. (4):

$$c_j = \frac{\sum_{i=1}^{n}\left[\mu_j(x_i)\right]^m x_i}{\sum_{i=1}^{n}\left[\mu_j(x_i)\right]^m} \qquad (4)$$

where
  c_j : is the center of the jth cluster
  x_i : is the ith data point
  μ_j : is the function which returns the membership of x_i in the jth cluster
  m : is the fuzzification parameter

This is a special form of weighted average: we modify the degree of fuzziness in x_i's current membership and multiply this by x_i, and the product obtained is divided by the sum of the fuzzified memberships. The first loop of the algorithm calculates membership values for the data points in the clusters, and the second loop recalculates the cluster centers using these membership values. When the cluster centers stabilize (when there is no change) the algorithm ends.

TABLE 1. FUZZY C-MEANS ALGORITHM

initialize p = number of clusters
initialize m = fuzzification parameter
initialize Cj (cluster centers)
Repeat
    For i = 1 to n: update μj(xi) applying (3)
    For j = 1 to p: update Cj with (4) using the current μj(xi)
Until the Cj estimates stabilize

B. Limitations of the algorithm

The fuzzy c-means approach to clustering suffers from several constraints that affect its performance [10]. The main drawback comes from the restriction that the sum of the membership values of a data point x_i over all the clusters must be one, as in exp. (5), and this tends to give high membership values to the outlier points; the algorithm therefore has difficulty in handling them. Secondly, the membership of a data point in a cluster depends directly on its membership values for the other cluster centers, and this sometimes produces undesirable results.

$$\sum_{j=1}^{p} \mu_j(x_i) = 1 \qquad (5)$$

In the fuzzy c-means method, a point has partial membership in all the clusters, and exp. (4) for calculating the new cluster centers is a special form of weighted average of all the data points. The third limitation of the algorithm is that, due to the influence (partial membership) of all the data members, the cluster centers tend to move towards the center of all the data points [10]. The fourth constraint of the algorithm is its inability to calculate the membership value when the distance of a data point is zero (exp. 3).

V. THE NEW FUZZY CLUSTERING METHOD

The new fuzzy clustering method we propose removes the restriction given in exp. (5); because of this constraint, the c-means algorithm tends to give outsized membership values to the outlier points. In the c-means algorithm, the membership of a point in a cluster is calculated from its membership in the other clusters, and many limitations of the algorithm arise from this. In the new method, the membership of a point in a cluster depends only on its distance from that cluster's center. For calculating the membership values, we use a new simple expression given in exp. (6):

$$\mu_j(x_i) = \frac{Max(d_j) - d_{ji}}{Max(d_j)} \qquad (6)$$

where
  μ_j(x_i) : is the membership of x_i in the jth cluster
  d_ji : is the distance of x_i from cluster c_j
  Max(d_j) : is the maximum distance of any point in cluster c_j

Since Max(d_j)/Max(d_j) = 1, the membership function (exp. 6) generates values close to one for small distances d_ji and a membership value of zero for the maximum distance. If the distance of a data point is zero, the function returns a membership value of one, and it thus overcomes the fourth constraint of the c-means algorithm. The membership values are calculated only from the distance of a data member within the cluster, so the method does not suffer from the first and second constraints of the c-means algorithm. To overcome the third limitation of the c-means algorithm in calculating new cluster centers, the new method inherits the features of the k-means algorithm: a data point is completely assigned to the cluster in which it has maximum membership, and the point is used only for the calculation of the new cluster center of that cluster. In this way the influence of a single data point on the calculation of all the cluster centers is avoided. The point is used in the calculations only if its membership value falls above a threshold value α. This ensures that outlier points are not considered in new cluster center calculations. Unlike crisp clustering algorithms such as k-means, most fuzzy clustering algorithms are sensitive to the selection of the initial centroids; the effectiveness of the algorithms depends on the initial cluster seeds [17]. For better results we suggest using the k-means algorithm to obtain the initial cluster seeds; this helps ensure that the algorithm converges to the best possible results. In the new method, since the membership values vary strictly from 0 to 1, we find that initializing the threshold value α to 0.5 produces better results in the cluster center calculation.
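Before the full pseudocode in Table 2, a minimal sketch of the membership computation of exp. (6), in our own Python with hypothetical names; Max(d_j) is assumed to be taken over the points of the current iteration. A point at the cluster center gets membership 1 and the farthest point gets 0, so extreme outliers receive memberships near zero.

def new_membership(d_ji, d_j_max):
    # exp. (6): membership decays linearly from 1 (at distance 0)
    # to 0 (at the maximum distance observed in cluster j).
    return (d_j_max - d_ji) / d_j_max

# e.g. new_membership(0.0, 5.0) == 1.0 and new_membership(5.0, 5.0) == 0.0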


TABLE 2. THE NEW FUZZY CLUSTERING ALGORITHM

initialize p = number of clusters
initialize Cj (cluster centers)
initialize α (threshold value)
Repeat
    For i = 1 to n: update μj(xi) applying (6)
    For k = 1 to p:
        Sum = 0
        Count = 0
        For i = 1 to n:
            If μk(xi) is the maximum membership of xi then
                If μk(xi) >= α then
                    Sum = Sum + xi
                    Count = Count + 1
        Ck = Sum / Count
Until the Cj estimates stabilize

The fuzzy membership values found with exp. (6) can be used in the new fuzzy clustering algorithm given in Table 2. The algorithm stops when the cluster centers stabilize. It is more efficient in handling data with outlier points and in overcoming the other constraints imposed by the c-means algorithm. The new method we propose is also far superior in the calculation of the new cluster centers, which we demonstrate in the next section with two numerical examples.

VI. ILLUSTRATIONS

To test the effectiveness of the new algorithm, we applied it to a small synthetic data set to demonstrate the functioning of the algorithm in detail. The algorithm was also tested with real data collected for Bhutan's Gross National Happiness (GNH) program.

A. With synthetic data

Consider the sales of an item from different shops at different periods of the year, as in figure 1. There is an outlier point at (12,400). If we start data exploration with two random cluster centers at C1(.5,150) and C2(8,150), the algorithm ends with the cluster centers at C1(1.52,166.20) and C2(5.49,172.11). We also applied the c-means algorithm to the same data set and found that it converges to the cluster centers C1(2.79,47.65) and C2(3.29,204.4).

TABLE 3. C-MEANS AND THE NEW METHOD

Method       Cluster   Final cluster center   Membership of outlier point (12,400)
C-means      C1        (2.79, 47.65)          .37
C-means      C2        (3.29, 204.4)          .63
New method   C1        (1.52, 166.20)         0
New method   C2        (5.49, 172.11)         0
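For reference, a runnable sketch of the full procedure of Table 2 follows. This is our own Python, not the paper's implementation: the Euclidean distance, the variable names, and the k-means seeding suggested in Section V are our assumptions, and alpha is the threshold (0.5, as suggested above).

import numpy as np

def new_fuzzy_clustering(X, init_centers, alpha=0.5, max_iter=100):
    centers = np.asarray(init_centers, dtype=float)
    p = len(centers)
    for _ in range(max_iter):
        # Distance of every point to every center.
        d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
        # exp. (6): per-cluster membership from the cluster's maximum distance.
        mu = (d.max(axis=0) - d) / d.max(axis=0)
        # Each point contributes only to the cluster of maximum membership,
        # and only if that membership clears the threshold alpha.
        best = mu.argmax(axis=1)
        new_centers = centers.copy()
        for k in range(p):
            keep = (best == k) & (mu[np.arange(len(X)), k] >= alpha)
            if keep.any():  # points below alpha (e.g. outliers) are excluded
                new_centers[k] = X[keep].mean(axis=0)  # k-means style update
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, mu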

Figure 1. The data points and final overlapping clusters

We also found that the algorithm is far superior to c-means in handling outlier points (see Table 3): the new algorithm gives very low membership values to the outlier points. If we start with C1 at (.5,150), the algorithm takes four steps to converge the first cluster center to (1.52,166.23). Similarly, the second cluster center C2 converges to (5.49,172.11) from (8,150) in four steps. The algorithm finally ends with the cluster centers at C1(1.5,166.2) and C2(5.4,172.11) and with two overlapping clusters, as shown in figure 1. The outlier point is given zero membership in both clusters. From Table 3 it is clear that the new fuzzy clustering method we propose is better than the conventional c-means algorithm in handling outlier points and in the calculation of new cluster centers. Due to the constraint given in exp. (5), the c-means algorithm gives membership values .37 and .63 to the outlier point (12,400) so that the sum of the memberships of this point is one; the new algorithm gives this point zero membership.

B. With natural data

Happiness and satisfaction are directly related to a community's ability to meet its basic needs, and these are important factors in safeguarding its physical health. The unique concept of Bhutan's Gross National Happiness (GNH) depends on nine factors such as health, ecosystem and emotional well-being [16]. The GNH regional chapter at Sherubtse College conducted a survey among 1311 villagers, and the responses were converted into numeric values. For the analysis of the new method we took the attributes income and health index, as shown in figure 2.


Figure 2. The data points


TABLE 4. THE FINAL CLUSTER CENTERS

Center   C-means (X, Y)        New method (X, Y)
C1       (24243.11, 6.53)      (24376.32, 8.135)
C2       (69749.6, 5.08)       (68907, 5.62)
C3       (115979.3, 2.83)      (113264.6, 1.815)

As we can see from the data set, in Bhutan the low-income group maintains better health than the high-income group, since they are self-sufficient in many ways. Like any other natural data, this data set also contains many outlier points which do not belong to any of the groups; if we apply the c-means algorithm, these points tend to get high membership values due to exp. (5). To start the data analysis, we first applied the k-means algorithm to find the initial three cluster centers; it ended with the cluster centers at C1(24243,6.7), C2(69794,5.1) and C3(11979.29,2.72). We applied these initial values in both the c-means algorithm and the new method, and the algorithms ended with the centroids given in Table 4. From figure 2 and Table 4 it can be seen that the final centroids of the c-means method do not represent the actual centers of the clusters, due to the influence of the outlier points: the memberships of all the points are considered in the calculation of the cluster centers, so the centers tend to move towards the center of all the points. The new method identifies the cluster centers in a better way. The efficiency of the new algorithm lies in its ability to give very low membership values to outlier points and to consider a point in the calculation of only one cluster center.

VII. CONCLUSION

A good clustering algorithm produces high-quality clusters with low inter-cluster similarity and high intra-cluster similarity. Many conventional clustering algorithms, such as k-means and fuzzy c-means, achieve this on crisp and highly structured data, but they have difficulties in handling unstructured natural data, which often contains outlier points. The proposed fuzzy clustering algorithm combines the positive aspects of both crisp and fuzzy clustering algorithms. It is more efficient than both k-means and fuzzy c-means in handling natural data with outlier points, which it achieves by assigning very low membership values to the outlier points. However, the algorithm has limitations in exploring highly structured crisp data that is free from outlier points. The efficiency of the algorithm has to be further tested on comparatively larger data sets.

REFERENCES

[1] Sankar K. Pal, P. Mitra, "Data Mining in Soft Computing Framework: A Survey", IEEE Transactions on Neural Networks, vol. 13, no. 1, January 2002.
[2] R. Kruse, C. Borgelt, "Fuzzy Data Analysis: Challenges and Perspectives", http://citeseer.ist.psu.edu/kruse99fuzzy.html
[3] G. Raju, Th. Shanta Kumar, Binu Thomas, "Integration of Fuzzy Logic in Data Mining: A Comparative Case Study", Proc. of International Conf. on Mathematics and Computer Science, Loyola College, Chennai, pp. 128-136, 2008.
[4] Maria Halkidi, "Quality Assessment and Uncertainty Handling in Data Mining Process", http://citeseer.ist.psu.edu/halkidi00quality.html
[5] W. H. Inmon, "The data warehouse and data mining", Commun. ACM, vol. 39, pp. 49-50, 1996.
[6] U. Fayyad and R. Uthurusamy, "Data mining and knowledge discovery in databases", Commun. ACM, vol. 39, pp. 24-27, 1996.
[7] Pavel Berkhin, "Survey of Clustering Data Mining Techniques", http://citeseer.ist.psu.edu/berkhin02survey.html
[8] M. Chau, R. Cheng, B. Kao, "Uncertain Data Mining: A New Research Direction", www.business.hku.hk/~mchau/papers/UncertainDataMining_WSA.pdf
[9] Keith C.C. Chan, Wai-Ho Au, B. Choi, "Mining Fuzzy Rules in a Donor Database for Direct Marketing by a Charitable Organization", Proc. of First IEEE International Conference on Cognitive Informatics, pp. 239-246, 2002.
[10] E. Cox, Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration, Elsevier, 2005.
[11] G. J. Klir, T. A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice Hall, 1988.
[12] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2003.
[13] J. C. Bezdek, "Fuzzy Mathematics in Pattern Classification", Ph.D. thesis, Center for Applied Mathematics, Cornell University, Ithaca, N.Y., 1973.
[14] Carl G. Looney, "A Fuzzy Clustering and Fuzzy Merging Algorithm", http://citeseer.ist.psu.edu/399498.html
[15] Frank Klawonn, Anette Keller, "Fuzzy Clustering Based on Modified Distance Measures", http://citeseer.ist.psu.edu/klawonn99fuzzy.html
[16] Sullen Donnelly, "How Bhutan Can Develop and Measure GNH", www.bhutanstudies.org.bt/seminar/0402gnh/GNH-papers-1st_18-20.pdf
[17] Lei Jiang and Wenhui Yang, "A Modified Fuzzy C-Means Algorithm for Segmentation of Magnetic Resonance Images", Proc. VIIth Digital Image Computing: Techniques and Applications, pp. 225-231, 10-12 Dec. 2003, Sydney.
