
Reaction Paper on the BFR Clustering Algorithm

Gautam Kumar
Computer Science Department, Stony Brook University, New York

This document presents a reaction paper on the clustering algorithm proposed by Bradley, Fayyad, and Reina (BFR) in 1998.

Clustering is an important process by which a data set can be partitioned into groups. There are two categories of clustering algorithms [2]: a) hierarchical clustering and b) point-assignment clustering. The proposed BFR algorithm is a point-assignment algorithm in which the authors present a variation of the K-means algorithm for large databases. The authors assume that the data set follows a Gaussian distribution and that the clusters present in the data set are aligned with the axes of the space. Under these assumptions, they claim that their algorithm is valid for large as well as distributed databases and that it outperforms the naive K-means algorithm. The idea behind the algorithm is to keep a summary of each cluster in main memory while the points that fall into the clusters remain on disk. The authors describe three sets. Discard Set (DS): points that fall into some cluster are stored on disk while their summary is kept in main memory.

Compressed Set (CS): miniclusters (points close to each other but not close to any cluster) that do not fit into a big cluster; they are also kept on disk, with their summary present in main memory. Retained Set (RS): points that fall neither into a big cluster nor into a minicluster are kept in main memory [1]. For each cluster the algorithm also keeps in main memory the number of points (N), the sum of the components of all the points in each dimension (SUM), and the sum of the squares of the components of all the points in each dimension (SUMSQ) [2]. The BFR clustering algorithm was discussed in class; I have referred to the paper written by the authors of the BFR algorithm, in which the algorithm is described.
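As a minimal sketch (not the authors' implementation), the per-cluster summary described above can be written as follows. Note that the centroid and per-dimension variance are derived from N, SUM, and SUMSQ alone, and that two summaries merge by simple addition, without revisiting the raw points on disk:

```python
class ClusterSummary:
    """BFR-style cluster summary: count N, per-dimension SUM and SUMSQ."""

    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add_point(self, point):
        # Absorb one point into the summary; the raw point itself
        # can then be written back to disk.
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Per-dimension variance: E[x^2] - (E[x])^2.
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

    def merge(self, other):
        # Summaries are additive: two clusters (or partial summaries
        # computed on different nodes) combine without rereading points.
        self.n += other.n
        for i in range(len(self.sum)):
            self.sum[i] += other.sum[i]
            self.sumsq[i] += other.sumsq[i]
```

The additivity of the merge step is what makes the summary attractive for the distributed setting the authors mention.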

Prior to BFR, clustering large databases was really difficult, but BFR provides a method to scale clustering to large databases. It also avoids multiple scans of the database [1]; in fact, the proposed algorithm performs only one scan, and it can be used even when the data set resides on disk. The algorithm represents each cluster in an optimized and compressed way by storing just three values, N, SUM, and SUMSQ, in memory, while the remaining members of each cluster are stored back on disk. The algorithm also performs very well on high-dimensional data sets, where all points appear to be equidistant and almost every pair of vectors is orthogonal.

For the feasibility of this algorithm, the authors made very strong assumptions about the data set and the shape of the clusters. They assume that the data are drawn from a Gaussian distribution and are normally distributed around each centroid, and that the clusters are aligned with the axes of the space [2]. These assumptions restrict the applicable data sets to a certain range: most real-world data sets do not obey the Gaussian distribution. Moreover, the shape of a cluster may or may not be aligned with the axes of the space; there are many data sets in the real world where clusters are spiral or interleave with each other, e.g. geographical or astronomical data sets. The proposed method is sound for the restricted data sets that obey the two assumptions made by the authors. Their data sets were purposefully chosen so as to be the best-case scenario for the algorithm [1]. They used data sets that follow the Gaussian distribution; in fact, the authors themselves mention that the data sets they use are not representative of the real world, and there is a high probability that most real-world data sets will be worst-case inputs for the algorithm [1]. The experiments were conducted on data sets satisfying the above assumptions, and the authors show that the algorithm outperforms the naive K-means algorithm on high-dimensional data sets that obey them. The authors did not take noise and outliers into consideration; outliers and noise can easily deform the shape of the clusters. The paper also assumes, while computing the DS, CS, and RS, that the mean will not move outside the computed interval, which may not hold in extreme cases: if the data points arrive in a monotonically increasing sequence, there is a high probability that the mean will not remain in the same interval, and the DS, CS, and RS may no longer be valid. In the proposed algorithm, the summaries of the DS are kept in main memory; in the extreme case where the number of clusters is very large, it becomes really difficult to keep all the DS summaries in main memory. This applies to the CS and RS as well: if many miniclusters form among the points, it is difficult to keep them in main memory. The RS in particular can hog the maximum amount of memory, since any point that belongs neither to a big cluster nor to a minicluster has to be retained in main memory.
So a major portion of memory is occupied by the summaries of the DS and CS as well as by the RS, which means not all of the memory is available to process the incoming data [2]. Another clear observation is that there is no comment about retrieving the data points that are part of a cluster. Since for the DS we keep track only of the summary (N, SUM, SUMSQ), if we ever wanted to retrieve the points that belong to a particular cluster, there is no intuitive way to perform this task, and moreover no method is mentioned by the authors in the paper to achieve it.
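To make the assignment step behind the DS/CS/RS split concrete, it can be sketched as below. The threshold value and the exact distance form here are my own illustrative choices, not the authors' precise formulation: a point joins the nearest cluster's DS only if its per-dimension normalized (Mahalanobis-like) distance to that centroid is small enough; otherwise it falls through to the CS/RS machinery and keeps consuming memory.

```python
import math

def normalized_distance(point, centroid, variance):
    """Distance from a point to a centroid, scaled per dimension.

    Each coordinate difference is divided by that dimension's standard
    deviation, which is only meaningful under BFR's axis-aligned
    Gaussian assumption.
    """
    d = 0.0
    for x, c, v in zip(point, centroid, variance):
        sd = math.sqrt(v) if v > 0 else 1e-12  # guard degenerate dims
        d += ((x - c) / sd) ** 2
    return math.sqrt(d)

def assign(point, clusters, threshold=3.0):
    """Return the index of the cluster whose DS absorbs the point.

    `clusters` is a list of (centroid, variance) pairs; None means the
    point is not close enough to any cluster and must go to the CS/RS.
    """
    best, best_dist = None, float("inf")
    for i, (centroid, variance) in enumerate(clusters):
        dist = normalized_distance(point, centroid, variance)
        if dist < best_dist:
            best, best_dist = i, dist
    return best if best_dist <= threshold else None
```

Note that only the summary (centroid and variance, both derivable from N, SUM, SUMSQ) is consulted; the raw points of the cluster are never needed, which is exactly why they cannot be retrieved later.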

The first assumption mentioned by the authors, that the data set should follow a Gaussian distribution, is reasonable, since the authors keep only the summary of the DS using N, SUM, and SUMSQ. The second assumption, that the shape of each cluster should be aligned with the axes of the space, can however be avoided by transforming the data points into an axis-aligned frame. This can be done with one pass over the data set, so the total cost of the transformation is linear, and any given data set can thereby be made compatible with the algorithm. The space problem caused by holding the summaries of the DS and CS, as well as the points of the RS, can be reduced by applying a divide-and-conquer technique across multiple nodes of a system and finally combining the results. Another option is to use parallelism to speed up the computation and to mitigate the memory issues raised by the CS, DS, and RS. A good summary of the DS can also be obtained using a representative-point method [3]. The further research questions are how we can relax the assumptions made in this paper while not affecting the time and space complexity too much, i.e. keeping the trade-off fair. The axis-alignment assumption can be relaxed by transforming the points; the space complexity can be further reduced with the divide-and-conquer approach; and a parallel-computing approach can be used to further reduce the time complexity. Applications whose data sets follow a Gaussian distribution will benefit from this algorithm. Although the authors explicitly created synthetic data sets for their experiments, we can always find data sets in the chemical industry, the biomedical industry, and in electronic signals (which smooth toward a Gaussian distribution) that follow a Gaussian distribution, and these data sets will benefit from this algorithm. So the main question to ponder is how to remove these constraints and make the algorithm work for general data sets.
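The axis-alignment transformation suggested above can be sketched with a PCA-style rotation; this is my own illustrative reading of the idea, and strictly speaking the covariance matrix itself must be accumulated before rotating, so "linear cost" should be read as linear in the number of points, with more than one logical pass:

```python
import numpy as np

def axis_align(points):
    """Rotate a data set so its principal axes align with the coordinate axes.

    points: (n_points, dim) array-like. The eigenvectors of the covariance
    matrix form the new orthonormal basis, so an elongated but tilted
    Gaussian cloud becomes axis-aligned, matching BFR's second assumption.
    """
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = np.cov(centered, rowvar=False)   # dim x dim covariance matrix
    _, eigvecs = np.linalg.eigh(cov)       # orthonormal eigenbasis
    return centered @ eigvecs              # project onto that basis
```

In the rotated frame the covariance matrix is diagonal, so the per-dimension N/SUM/SUMSQ summary again captures the cluster shape.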

References and Notes

1. P. S. Bradley, U. M. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases," Proc. Knowledge Discovery and Data Mining, pp. 9-15, 1998.
2. A. Rajaraman, J. Leskovec, and J. D. Ullman, Mining of Massive Datasets, Chapter 7: Clustering.
3. S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 73-84, 1998.