
Gautam Kumar (gakumar@cs.stonybrook.edu)
Computer Science Department, Stony Brook University, New York

This document presents a reaction paper on the clustering algorithm proposed by Bradley, Fayyad, and Reina (BFR) in 1998.

Introduction:

Clustering is an important process by which a data set can be partitioned into groups. There are two categories of clustering algorithms [2]: a) hierarchical clustering and b) point-assignment clustering. The proposed BFR algorithm is a point-assignment algorithm in which the authors propose a variation of the k-means algorithm for large databases. The authors assume that the data set follows a Gaussian distribution and that the clusters in the data set are aligned with the axes of the space. Under these assumptions, the authors claim that their algorithm is valid for large as well as distributed databases and outperforms the naive k-means algorithm. The idea behind the algorithm is to keep a summary of the clusters in main memory while the points belonging to those clusters stay on disk. The authors describe three sets. The Discard Set (DS) holds points that fall into a certain cluster; these points are stored on disk while their summary is kept in main memory.

The Compressed Set (CS) holds miniclusters (points close to each other but not close to any cluster) that do not fit into a big cluster; they are also kept on disk with their summary in main memory. The Retained Set (RS) holds points that fall neither into a big cluster nor into a minicluster; these are kept in main memory [1]. For each cluster the algorithm also keeps in main memory the number of points (N), the sum of the components of all the points in each dimension (SUM), and the sum of the squares of the components in each dimension (SUMSQ) [2]. The BFR clustering algorithm was discussed in class; I have also referred to the original paper by the authors of the BFR algorithm, in which the algorithm is described.
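As a concrete illustration, the per-cluster summary described above can be sketched as follows. This is a minimal Python sketch, not the authors' implementation; the class name ClusterSummary is my own. The point of keeping only N, SUM, and SUMSQ is that the centroid and the per-dimension variance can be derived from them at any time:

```python
import numpy as np

class ClusterSummary:
    """Summary of one DS/CS cluster: only N, SUM, and SUMSQ live in memory."""
    def __init__(self, dim):
        self.n = 0                       # N: number of points in the cluster
        self.sum = np.zeros(dim)         # SUM: per-dimension sum of components
        self.sumsq = np.zeros(dim)       # SUMSQ: per-dimension sum of squares

    def add(self, point):
        # Absorb a point into the summary; the point itself can go back to disk.
        self.n += 1
        self.sum += point
        self.sumsq += np.square(point)

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # Per-dimension variance: E[x^2] - (E[x])^2
        return self.sumsq / self.n - np.square(self.sum / self.n)

s = ClusterSummary(2)
for p in [np.array([1.0, 2.0]), np.array([3.0, 4.0])]:
    s.add(p)
# centroid -> [2.0, 3.0]; per-dimension variance -> [1.0, 1.0]
```

Note that two summaries can also be merged by adding their N, SUM, and SUMSQ component-wise, which is what makes combining miniclusters cheap.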

Critique

Prior to BFR, clustering large databases was really difficult, but BFR provides a method to scale clustering to large databases. It also avoids multiple scans of the database [1]; in fact, the proposed algorithm performs only a single scan. The algorithm can be used even when the data set resides on disk. It represents each cluster in an optimized, compressed way by storing just three values, N, SUM, and SUMSQ, in memory, while the remaining members of the clusters are stored back on disk. The algorithm also performs very well on high-dimensional data sets, where all points appear to be equidistant and almost every pair of vectors is orthogonal.

For the feasibility of this algorithm, however, the authors make very strong assumptions about the data set and the shape of the clusters. They assume the data are drawn from a Gaussian distribution and are normally distributed around the centroid, and that the clusters are aligned with the axes of the space [2]. These assumptions restrict the applicable data sets to a certain range: most real-world data sets do not obey a Gaussian distribution. Moreover, the shape of a cluster may or may not be aligned with the axes of the space; there are many data sets in the real world where clusters are spiral or interleave with each other, e.g., geographical or astronomical data sets. The proposed method is sound for the restricted data sets that obey the two assumptions made by the authors. Their data sets are purposefully chosen so that they become the best-case scenario for the algorithm [1]. They use data sets that follow a Gaussian distribution; in fact, the authors themselves mention that the data sets they use are not representative of the real world, and there is a high probability that most real-world data sets will be worst-case inputs for the algorithm [1]. The authors conduct their experiments on data sets satisfying the above assumptions and show that the algorithm outperforms the naive k-means algorithm on high-dimensional data sets that obey them. The authors do not take noise and outliers into consideration, although outliers and noise can easily deform the shape of the clusters. The paper also assumes, while computing the DS, CS, and RS, that the mean will not move outside the computed interval. This may not hold in extreme cases: when data points arrive in a monotonically increasing sequence, there is a high probability that the mean will not remain in the same interval, and the DS, CS, and RS may no longer be valid. In the proposed algorithm, the authors keep the summary of the DS in main memory; in the extreme case where the number of clusters is very large, it becomes really difficult to keep the summaries of all DSs in main memory. This applies to the CS and RS as well: if many miniclusters form among the points, it is difficult to keep them all in main memory. The RS can hog the largest amount of memory, because any point that belongs neither to a big cluster nor to a minicluster must be retained in main memory.
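Under the axis-aligned Gaussian assumption discussed above, membership in the DS is typically decided with a Mahalanobis-like distance computed against a diagonal covariance (per-dimension variances from the summary). The following is an illustrative sketch; the threshold of 3 standard deviations is my own example value, not one prescribed by the paper:

```python
import numpy as np

def normalized_distance(point, centroid, variance):
    """Mahalanobis-like distance assuming a diagonal (axis-aligned) covariance."""
    std = np.sqrt(variance)
    return np.sqrt(np.sum(((point - centroid) / std) ** 2))

centroid = np.array([0.0, 0.0])
variance = np.array([1.0, 4.0])   # derived from SUMSQ/N - (SUM/N)^2

# A point two standard deviations away in each dimension:
d = normalized_distance(np.array([2.0, 4.0]), centroid, variance)
# d = sqrt(2^2 + 2^2) ~= 2.83; assign to the DS if d is below a chosen
# threshold, e.g. 3 standard deviations (an illustrative choice).
```

This also makes the critique concrete: a monotonically drifting stream shifts the centroid, so a threshold computed against the old centroid and variance can silently misassign points.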
So a major portion of memory is occupied by the summaries of the DS and CS, as well as by the RS itself, which means that not all of the memory is available to process the incoming data [2]. There is also a clear observation that the paper makes no comment about retrieving the data points that are part of a cluster. Since for the DS we keep only its summary (N, SUM, SUMSQ), if we ever wanted to retrieve the points that belong to a particular cluster, there is no intuitive way to perform this task, and moreover the authors mention no method in the paper to achieve it.

Brainstorming

The first assumption mentioned by the authors, that the data set should follow a Gaussian distribution, is reasonable, since the authors keep only a summary of the DS using N, SUM, and SUMSQ. The second assumption, that the shape of each cluster should be aligned with the axes of the space, can however be avoided by transforming the data points so that the clusters become axis-aligned. This can be done in one pass over the data set, so the total cost of the transformation is linear, and any given data set can thus be made compatible with this algorithm. The space problem caused by holding the summaries of the DS and CS, as well as the points of the RS, can be reduced by applying a divide-and-conquer technique across multiple nodes of a system and finally combining the results. Another option is to use parallelism to speed up the computation and relieve the memory pressure caused by the CS, DS, and RS. A good summary of the DS can also be obtained using a representative-based method [3]. A further research question is how to relax the assumptions made in this paper without affecting the time and space complexity too much, i.e., so that the trade-off remains fair. The axis-aligned assumption can be relaxed by the point transformation; the space complexity can be further reduced by the divide-and-conquer approach; and a parallel-computing approach can be used to further reduce the time complexity. Applications whose data sets follow a Gaussian distribution will benefit from this algorithm. Although the authors explicitly created synthetic data sets for their experiments, we can readily find data sets in the chemical industry, the biomedical industry, and electronic signals (which smooth toward a Gaussian) that follow a Gaussian distribution, and these data sets will benefit from this algorithm. So the main question to ponder is how to remove the constraints and make the algorithm work for general data sets.
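The axis-alignment transformation suggested above could be realized with a PCA-style rotation. This is my own illustrative sketch, not something from the paper: for brevity the covariance is computed here with np.cov, though the same matrix can be accumulated in a single scan from the point count, per-dimension sums, and pairwise cross-product sums:

```python
import numpy as np

# Generate correlated (non-axis-aligned) synthetic data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0], [0.0, 1.0]])

# Covariance of the data; in a streaming setting this would be accumulated
# during the single pass rather than computed from the full matrix.
cov = np.cov(X, rowvar=False)

# Rotate onto the eigenvectors of the covariance: the transformed
# dimensions are uncorrelated, i.e. the clusters become axis-aligned.
_, eigvecs = np.linalg.eigh(cov)
X_aligned = X @ eigvecs

# The off-diagonal covariance of X_aligned is (numerically) zero.
```

Applying the fixed rotation to each incoming point is O(d^2) per point, so the overall cost stays linear in the number of points, as claimed above.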

1. P. S. Bradley, U. M. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases," Proc. Knowledge Discovery and Data Mining (KDD), pp. 9-15, 1998.
2. A. Rajaraman, J. Leskovec, and J. D. Ullman, Mining of Massive Datasets, Chapter 7: Clustering.
3. S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 73-84, 1998.
