
ICS 635 - MACHINE LEARNING - DR. SUSANNA STILL

Iterative Mesh Based Clustering with Threshold Subdivision

Robert Ross Puckett

Department of Information & Computer Sciences

University of Hawaii, Manoa Campus

Honolulu, HI 96822, USA

Abstract—Traditional clustering approaches suffer from serious drawbacks to their general utility. For example, the k-means clustering algorithm usually involves the use of a metric such as Euclidean distance, which is scale variant [1]. Clustering methods based upon similarity matrices, although less susceptible to this problem, add greater space complexity for the storage of the matrices. As proposed by Choudhari et al. [2], a mesh based clustering approach without a stopping criterion can provide a clustering solution with O(n) time and space requirements. As mentioned in that paper, stopping criteria based on connections or distance may produce suboptimal clustering by not exhausting the solution space or through scale variance. However, Choudhari notes that the selection of an appropriate grid size is a major limitation of the algorithm. As such, this paper describes the development of an iterative mesh based clustering method in which the grid size is reduced incrementally until a threshold proportion of the cells belongs to the expected number of clusters. Thus, the appropriate grid size is self-determined.

Index Terms—mesh clustering, grid clustering, partitioning, pattern classification.

I. INTRODUCTION

Clustering is a useful tool for segmentation, pattern classification, and data mining. However, traditional approaches suffer from serious drawbacks to their utility. One of the more popular methods, k-means clustering, depends heavily on the choice of distance metric used. Such an algorithm is likely subject to scale variance, which can result in sub-optimal clustering. Many other methods involve the use of a similarity matrix. However, the creation and maintenance of such a matrix leads to O(n²) space complexity [2]. Thus, in large high-dimensional data sets, the curse of dimensionality makes such clustering methods inefficient and ineffective [3].

Choudhari et al. [2] proposed a mesh based clustering algorithm lacking a stopping criterion. That is, the space is divided into a mesh, and clustering is performed based upon the adjacent cells with data inside them. The algorithm does not include a stopping criterion such as stopping upon a certain degree of connectedness or a distance threshold.

However, as admitted in the paper, the algorithm's major problem is finding an appropriate grid size. Thus, the implemented algorithm described herein incrementally decreases the grid size until a threshold of cluster membership is reached.

II. IMPLEMENTATION

Initially, the algorithm was implemented, for the most part, as outlined in the Choudhari paper. This provided the capability of mesh clustering for a given M by N grid size. In the algorithm, the data set is first normalized. Next, each point is assigned to a box number. After all the points are assigned to boxes, the boxes are clustered together: the algorithm examines the neighbors of each cell to see which neighbor boxes contain points, and all neighbor boxes containing points are assumed to be inside the same cluster as the box being considered. At this point, a graph is displayed with the current grid and boxes, using colors to represent the different clusters. Next, the M and N values are increased and the process repeats; thus the grid is made finer.

After each clustering attempt and graph display, the clusters are examined against a threshold. If the mean cluster size is within a proportion of the expected cluster size, then the subdivision process is halted. For the purposes of graphing, an additional stopping criterion is added to prevent subdivisions that would make the graph illegible.

III. EXPERIMENTS

The first experiments were verification tests to ensure that the clustering algorithm and grapher were operating correctly. Using artificial data and hand calculations, several values were tested against the results of the algorithm. Different mesh sizes were used to determine useful gradations in size for subdivisions. Some grid sizes result in obviously erroneous graphs, which are likely the result of round-off error in converting between double values and integer pixel positions. This problem is being tracked down. However, certain grid sizes seemed immune to this problem, so stable grid sizes were used for the further experiments.

The remaining experiments used the ELENA artificial data set database [4]. This database includes intersecting Gaussians, rings, and additional forms. Although k-means clustering provides hyperspheres for its clusters, mesh based clustering allows for greater flexibility in cluster shape through non-spherical clusters.

A. ELENA Clouds

The ELENA clouds database consists of 5000 two-dimensional data points divided into three overlapping clusters of different variance, mean, and position.
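The normalization, box assignment, neighbor merging, and mean-cluster-size threshold described in the implementation discussion above can be sketched roughly as follows. This is a reconstruction under assumptions, not the author's code: the 8-neighbor merge rule, the uniform square grid, and the `tolerance` and grid-step parameters are guesses where the text leaves details open.

```python
# A minimal sketch of the mesh clustering pass and the iterative
# threshold subdivision. Parameter defaults are assumptions.
from collections import defaultdict, deque

def normalize(points):
    """Scale each dimension of the data set into [0, 1]."""
    xs, ys = zip(*points)
    def scale(v, lo, hi):
        return 0.0 if hi == lo else (v - lo) / (hi - lo)
    return [(scale(x, min(xs), max(xs)), scale(y, min(ys), max(ys)))
            for x, y in points]

def mesh_cluster(points, m, n):
    """Assign points to boxes, then merge occupied neighboring boxes."""
    boxes = defaultdict(list)
    for x, y in normalize(points):
        i = min(int(x * m), m - 1)   # clamp x == 1.0 into the last column
        j = min(int(y * n), n - 1)
        boxes[(i, j)].append((x, y))
    # Connected components over the 8 neighbors of each occupied box:
    # every occupied neighbor is assumed to be in the same cluster.
    clusters, seen = [], set()
    for start in boxes:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            i, j = queue.popleft()
            comp.append((i, j))
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    nb = (i + di, j + dj)
                    if nb in boxes and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(comp)
    return clusters, boxes

def iterative_mesh_cluster(points, expected_clusters, tolerance=0.5,
                           start=10, step=10, max_size=200):
    """Refine the grid until the mean cluster size is within a
    proportion (tolerance) of len(points) / expected_clusters."""
    expected_size = len(points) / expected_clusters
    for m in range(start, max_size + 1, step):
        clusters, boxes = mesh_cluster(points, m, m)
        sizes = [sum(len(boxes[b]) for b in c) for c in clusters]
        mean_size = sum(sizes) / len(sizes)
        if abs(mean_size - expected_size) <= tolerance * expected_size:
            return m, clusters
    return max_size, clusters
```

On the ELENA clouds set, one would call `iterative_mesh_cluster(points, expected_clusters=3)` and read off the grid size at which the threshold was met.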


Fig. 1. 100 Square Grid Results
Fig. 2. 150 Square Grid Results
Fig. 3. 200 Square Grid Results

Two of the distributions are circular, while the third is oblong. As shown in Figure 1, the clustering attempt with 100x100 cells resulted in one large cluster. The cell size is too large, and enough neighbor cells contain points that there is a connection between all of the clusters. Figure 2 shows that with 150x150 cells, the upper two clusters are now separated from the lower cluster. The overlap of the lower cluster is not as severe as that of the upper clusters; thus, this cell size was able to separate the top clusters from the bottom cluster. Finally, with 200x200 cells, Figure 3 shows the three clusters fully separated. The graphs are filtered to not show clusters of exceedingly small size.

B. ELENA Concentric

The ELENA concentric database consists of 2500 two-dimensional data points divided into two non-overlapping clusters. One cluster is a ring shape, while the other cluster is a circular shape embedded within the first cluster's ring. There is no appreciable gap between the two clusters. The grid algorithm is not limited to clustering circles or ellipses […] a cluster. Thus, it was hoped that the algorithm would be able to identify the ring shaped cluster and the circle shaped cluster as two separate clusters.

Unfortunately, the mesh clustering algorithm performed abysmally for this data set. For no grid size were the two clusters separable. For large grid sizes, all data points were clustered into a single cluster. For grid sizes smaller than the average distance between two points, the graph is divided into dozens if not hundreds of meaningless clusters. The problem is that the algorithm depends on the proximity of neighbors to define membership in clusters. However, if the gap between the two distributions were larger, then a grid size could be found that would accurately separate the distributions.

IV. CONCLUSION

Mesh based clustering is a useful tool that requires additional research. Since identifying the appropriate grid size is a major limiting factor of the Choudhari algorithm, the implemented algorithm performs incremental subdivisions until a threshold of cluster membership is reached. Unfortunately, there are still problems needing resolution with this algorithm and with mesh clustering in general.

Overlapping and sparse distributions create major problems for this form of grid clustering. For overlapping data sets, it is desirable to have a small cell size to prevent the clusters being grouped together. For sparse data sets, it is desirable to have a large cell size to capture more of the neighbors and join more points together into a common cluster. Unfortunately, as the ELENA clouds experiment above shows, it is possible to have both overlapping and sparse distributions occur together.

Additionally, the lack of a stopping criterion allows for a great variety of possible cluster shapes and sizes. Unfortunately, if the cell size is not optimal, then far more cells may be joined together into a cluster than should be. All it takes is one small chain of boxes to connect two clusters and join them, no matter how far apart they are. Furthermore, the practice of adding adjacent neighbors to the same cluster is impractical in


high-dimensional spaces, where the number of neighboring cells grows exponentially [3].

For simple distributions, identifying the center of a cluster could be as simple as averaging the positions of the member boxes of the cluster. However, for more complex-shaped distributions, this process, and the very concept of a center, may not be useful.
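The simple averaging suggested above can be made concrete; treating a cluster as a list of (i, j) member-box indices is an assumed representation, not one fixed by the text:

```python
# Estimate a cluster center by averaging the grid indices of its member
# boxes. Meaningful for compact clusters; misleading for ring shapes,
# whose averaged center can fall outside the cluster entirely.
def cluster_center(cluster):
    n = len(cluster)
    return (sum(i for i, _ in cluster) / n,
            sum(j for _, j in cluster) / n)
```

For the ELENA concentric ring, for example, this averaged center would fall inside the embedded circular cluster, which illustrates why the concept of a center breaks down for complex shapes.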

V. FUTURE WORK

Although it is possible to simply rerun the algorithm with larger or smaller grid sizes, it seems unnecessary to include all of the cells in such future operations. Although the algorithm originally has O(n) complexity, repeated runs would result in longer run-times.

As such, it may be possible to optimize future iterations by limiting the number of cells being subdivided. Instead of dividing all of the cells, which is roughly equivalent to increasing the grid size, we can instead divide only the cells that are important. That is, we first discard cells with no data points within them. Next, we should not need to subdivide cells that have a high concentration of data points compared to free space.

Thus, with normalized data, we can start the algorithm with the largest possible grid size and allow it to continue subdividing the grid where subdivision is suspected to result in improved clustering. This process will likely result in clusters composed of heterogeneously sized cell pieces. Cluster centers will likely be larger cells surrounded by smaller cells that further define the cluster boundaries.
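One way to realize this proposal is a quadtree-style refinement. The sketch below is speculative: the `min_density` cutoff, the minimum cell size, and the halving scheme are assumptions rather than details given above.

```python
# A speculative sketch of the proposed selective subdivision: recursively
# split only cells that are occupied but not yet dense, where density is
# a hypothetical points-per-area ratio on normalized [0, 1) data.
def subdivide(cell, points, min_density=50.0, min_size=1.0 / 64):
    """cell = (x0, y0, size); returns leaf cells (x0, y0, size, count)."""
    x0, y0, size = cell
    inside = [(x, y) for x, y in points
              if x0 <= x < x0 + size and y0 <= y < y0 + size]
    if not inside:
        return []                               # discard empty cells outright
    density = len(inside) / (size * size)
    if density >= min_density or size <= min_size:
        return [(x0, y0, size, len(inside))]    # dense enough: keep whole
    half = size / 2
    leaves = []
    for dx in (0.0, half):                      # split into four sub-cells
        for dy in (0.0, half):
            leaves += subdivide((x0 + dx, y0 + dy, half), inside,
                                min_density, min_size)
    return leaves
```

Empty cells are pruned immediately and dense cells are kept whole, so only sparse or boundary regions keep subdividing; this matches the expectation above that cluster centers end up as larger cells ringed by smaller ones.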

REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley, 2001.
[2] A. Choudhari, M. Hanmandlu et al., "Mesh based clustering without stopping criterion," in INDICON, 2005 Annual IEEE, 2005.
[3] A. Hinneburg and D. A. Keim, "Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering," in Proceedings of the 25th VLDB Conference, 1999, pp. 506-517. [Online]. Available: http://fusion.cs.uni-magdeburg.de/pubs/optigrid.pdf
[4] "ELENA database," April 2005. [Online]. Available: http://www.dice.ucl.ac.be/mlg/DataBases/ELENA/ARTIFICIAL/
