209 views

Uploaded by Nikin Kumar Jain

Is a Data Mining Assignment Submitted by me on October 6th

Is a Data Mining Assignment Submitted by me on October 6th

Attribution Non-Commercial (BY-NC)

- Top-10 Data Mining Case Studies
- Data_Ware_House_Assignment
- Data Mining
- r Package
- Predicting Sentiment of Comments to News on Reddit
- k-Anonymity model
- Cluster Analysis
- A survey of the State-of-the-arts on Neutrosophic Sets in Biomedical Diagnoses
- MDMW-Iron09
- IJAIEM-2014-05-26-073
- 4. Ijcseitr - Improved Algorithm for Dynamic Memory Allocation
- Breaking Out of the Black-Box: Research Challenges in Data Mining
- 13ba451 - Seminar on Managing Customers
- An Energy-Based Similarity Measure for Time Series
- Introduction
- A Brief Study of Data Mining Techniques to Study Patterns in Medical Sciences
- KKR_LS
- 12Outlier
- Outliers
- Variance Rover System

You are on page 1of 7

Tech IT

nikinjain_vit@yahoo.com

School of Computing Sciences

Vellore Institute of Technology, Vellore, TamilNadu

Technical Report on Effective and Efficient Clustering Methods for Spatial Data Mining by

Raymond T. Ng and Jiawei Han

Abstract

A spatial database is a database that is optimized to store and query data related to objects in

space, including points, lines and polygons. Spatial Database is the discovery of interesting

relationships and characteristics that may exist implicitly in spatial databases. While typical

databases can understand various numeric and character types of data, additional functionality

needs to be added for databases to process spatial data types. To make up for this, the author has

described a new clustering method which is called CLARANS algorithm. Various comparisons

with other algorithms show that CLARANS has got more efficiency than others.

The given paper aims at describing the usage of CLARANS Algorithm which is based on

clustering method. Because of the huge amounts of data of spatial data that may be obtained

from satellite images, medical equipment, video cameras etc. ., it is costly and often unrealistic

for users to examine spatial data in detail. To this very end, the presented algorithm comes to

rescue which is very well based upon clustering method.

Objectives of any spatial algorithm are as follows:-

1. Extracting interesting spatial patterns and features.

2. Capturing interesting relationship between spatial and non-spatial data.

3. Presenting data regularity and at higher conceptual levels.

4. Helping to reorganize spatial databases, to accommodate data semantics, as well as

achieve higher performance.

Introduction

Data mining in general is the search for hidden patterns that may exist in large databases.

Spatial data mining in particular is the discovery of interesting relationships and characteristics

that may exist implicitly in spatial databases. Both algorithms have got following problems

Firstly, the user or an expert must provide with spatial concept hierarchies, which may not be

available in many applications.

Second, both algorithms conduct their spatial exploration by merging regions at certain level

of hierarchy to a larger region at higher level. Thus quality of the result produced by both

algorithms relies quite crucially on the appropriateness of the hierarchy to the given data. To deal

with these problems, we explore whether cluster analysis techniques are applicable. Cluster

Analysis is a branch of statistics that in the past three decades has been intensely studied and

successfully applied to many applications. The Complaints are that cluster analysis algorithms

are ineffective and inefficient. Indeed, for cluster analysis algorithms to work effectively there

need to be a natural notion of similarities among the “objects” to be clustered.

As for efficiency concern, the author develops his own cluster analysis algorithm, called

CLARANS, which is designed for large data sets.

1. The development of CLARANS, which is based on randomized search and is

partly motivated by two existing algorithms well known in cluster analysis called

PAM and CLARA; and

2. The development of two spatial mining algorithms SD (CLARANS) and NSD

(CLARANS).

Post-Related Work

PAM

Cluster analysis has been widely applied to many areas such as medicine, chemistry, social

studies and so on. Its main goal is to identify structures or clusters present in the data. While,

there is no general definition of a cluster, algorithms have been developed to find several kinds

of shapes like spherical, linear drawings etc.., the author has chose k- medoid algorithm because

it is very robust and clusters formed do not depend on the shape and size of the objects. K-

medoid is invariant with respect to translation and orthogonal points; also they can handle very

large data sets.

Compared to the well - known kmeans algorithm, .PAM has the following features:

1. It operates on the dissimilarity matrix of the given data set or when it is presented with

an n ´ p data matrix, the algorithm first computes a dissimilarity matrix.

2. It is more robust, because it minimizes a sum of dissimilarities instead of a sum of

squared Euclidean distances.

3. It provides a novel graphical display, the silhouette plot, which allows the user to select

the optimal number of clusters.

initial medoids

2. SWAP-step: If the objective function can be reduced by interchanging (swapping) a

selected object with an unselected object, then the swap is carried out. This is continued

till the objective function can no longer be decreased.

PAM (Partitioning around Mediods) was developed by Kauffman and Rousseeaw. To find k-

clusters PAM’s approach is to determine a representative object for each cluster. This

representative object called medoid is meant to be most centrally located object for each cluster.

Once, the medoids have been selected, each non-selected object is grouped with the medoid to

which it is most similar.

Algorithm PAM

1. Select k representative objects arbitrarily.

2. Compute TCih for all pairs of object Oi, Oh where Oi is currently selected and Oh is

not.

3. Select the pair Oi, Oh which corresponds to min Oi,Oh TCih. If the minimum TCih

is negative, replace Oi with Oh and go back to Step (2).

4. Otherwise, for each non-selected object, find the most similar representative object.

Halt.

More precisely, if Oj is a non-selected object and Oi is a medoid, we say that Oj belongs to

the cluster represented by Oi, if d (Oj, Oi) = min Oe d (Oj, Oe) where the notation min Oe

denotes the minimum over all medoids Oe, and the notation d (Oa, Ob) denotes the dissimilarity

or distance between objects Oa and Ob. All the dissimilarity values are given as inputs to PAM.

Finally the quality of a clustering is measured by the average dissimilarity between an object and

the medoid of its cluster.

CLARA

Designed by Kaufmann and Rousseeuw to handle large data sets, CLARA (Clustering Large

Applications) relies on sampling. Instead of finding representative objects for the entire data set,

CLARA draws a sample of the data set, applies PAM on the sample, and finds the medoids of

the sample. The point is that if the sample is drawn in a sufficiently random way, the medoids of

the sample would approximate the medoids of the sample would approximate the medoids of the

entire data set. To come up with better approximations, CLARA draws multiple samples and

gives the best clustering as the output. Here for accuracy, the quality of a clustering is measured

based on the average dissimilarity of all objects in the entire data set, and not only of those

objects in the samples. Experiments reported in indicate that 5 samples of size 40+2k give

satisfactory results.

Algorithm CLARA

1. For i=1 to 5, repeat the following steps:

2. Draw a sample of 40+2k objects randomly from the entire data set, and call

Algorithm PAM to find k medoids of the sample.

3. For each object Oj in the entire data set, determine which of the k medoids is the

most similar to Oj.

4. Calculate the average dissimilarity of the clustering obtained in the previous step.

If this value is less than the current minimum, use this value as the current

minimum, and retain the k medoids found as the best set of medoids obtained so

far.

CLARANS (Clustering large applications based on Randomized Search). Here at first a graph

abstraction of it is given. Later experimental results showing how CLARANS outperforms

CLARA and PAM in terms of both efficiency and effectiveness.

Motivation of CLARANS: a Graph Abstraction

Given n objects, the process described above of finding k medoids can be viewed abstractly

as searching through a certain graph. In this graph denoted by Gn, k, a node is represented by a set

of k objects. Two nodes are neighbors if their sets differ by only one object. More formally two

nodes S1 = {Om1, Omk} and other node S2 are neighbors if and only if the cardinality of

intersection of S1 and S2 is k-1. It is easy to see that each node has k (n-k) neighbors. Since a

node represents a collection of k medoids, each node corresponds to a clustering.

Spatial data mining is the discovery of interesting relationships and characteristics that may

exist implicitly in spatial databases. To this end, this paper has three main contributions. First,

we propose a new clustering method called CLARANS, whose aim is to identify spatial

structures that may be present in the data. Experimental results indicate that, when compared

with existing clustering methods, CLARANS is very efficient and effective. Second, we

investigate how CLARANS can handle not only point’s objects, but also polygon objects

efficiently. One of the methods considered, called the IR-approximation, is very efficient in

clustering convex and no convex polygon objects. Third, building on top of CLARANS, we

develop two spatial data mining algorithms that aim to discover relationships between spatial and

no spatial attributes. Both algorithms can discover knowledge that is difficult to find with

existing spatial data mining algorithms.

CLARANS is also known as Randomized CLARA, Its attributes are:-

1. CLARANS (A Clustering Algorithm based on Randomized Search by Ng and Han in

1994).

3. The Clustering process can be presented as searching a graph where every node is a

potential solution, that is, a set of k medoids.

4. If the local optimum is found, CLARANS start with new randomly selected node in

search for new local optimum.

5. It is more efficient and scalable than both PAM and CLARA.

6. Focusing techniques and spatial access structures may further improve its performance.

Algorithm CLARANS

1. Input parameters numlocal and maxneighbor Initialize i to 1, and mincost to a large number.

2. Set current to an arbitrary node in Gn,k.

3. Set j to 1.

4. Consider a random neighbor S of current, and based on Equation (5), caluculate the cost

differential to the two nodes.

5. If S has a lower cost, set current to S, and go to Step (3).

6. Otherwise, increment j by 1. If j<= maxneighbor, go to Step (4).

7. Otherwise, when j>maxneighbor, compare the cost of current with mincost. If the former is

less than mincost to the cost of current and set best node to current.

8. Increment i by 1. If i > numlocal, output best node and halt. Otherwise go to Step (2).

Steps (3) to (6) above search for nodes with progressively lower costs. But the current node

has already been compared with the maximum number of the neighbors of the node.

Spatial Databases are characterized to manage high number of objects and to fulfill the need

of selective spatial access. Clustering techniques are attractive for KDD, because they can find

hidden structures in data without using any additional domain knowledge.

The above image illustrates that how efficient it could be to use CLARANS algorithm to

PAM algorithm. CLARANS has got better run time efficiency and hence is very useful in spatial

data mining.

In this paper the author has also described about Non-Spatial Dominant Approach NSD

(CLARANS) and later has given an Evaluation of SD (CLARANS) and NSD (CLARANS).

Here an example of Real Estate data set is taken to describe the effectiveness in the working of

algorithm.

Conclusion

We use the clustering algorithm CLARANS for class identification in SDBS. Here the author

has presented two spatial data mining algorithms SD and NSD CLARANS. Experimental results

and analysis indicate that both algorithms are effective and have lead to discoveries that are

difficult to obtain with existing spatial data mining algorithms. Finally experimental results

prove that CLARANS itself is more efficient than existing clustering methods.

References

1. Technical Report by R. Ng and J. Han (1994 Effective and Efficient Clustering Methods for

Spatial Data Mining). http://www.docstoc.com/docs/DownloadDoc.aspx?doc_id=4198672

- Top-10 Data Mining Case StudiesUploaded bynewkid2202
- Data_Ware_House_AssignmentUploaded byNikin Kumar Jain
- Data MiningUploaded byvdsrihari
- r PackageUploaded byPhani Thota
- Predicting Sentiment of Comments to News on RedditUploaded byBruno Jakic
- k-Anonymity modelUploaded byanubhav129
- Cluster AnalysisUploaded byalam1971
- A survey of the State-of-the-arts on Neutrosophic Sets in Biomedical DiagnosesUploaded byMia Amalia
- MDMW-Iron09Uploaded byminingnova
- 4. Ijcseitr - Improved Algorithm for Dynamic Memory AllocationUploaded byTJPRC Publications
- Breaking Out of the Black-Box: Research Challenges in Data MiningUploaded byanupkrishna20
- IntroductionUploaded byAshwni Kumar
- IJAIEM-2014-05-26-073Uploaded byAnonymous vQrJlEN
- 13ba451 - Seminar on Managing CustomersUploaded byJeyasri R Vignesh
- An Energy-Based Similarity Measure for Time SeriesUploaded byIvan Borodavka
- A Brief Study of Data Mining Techniques to Study Patterns in Medical SciencesUploaded byIRJET Journal
- KKR_LSUploaded byengineeringwatch
- 12OutlierUploaded byAnurekha Prasad
- OutliersUploaded byRann Ehok
- Variance Rover SystemUploaded byesatjournals
- K Means HandoutUploaded byEdgar Coló
- Ceng514 Spr2012 Clustering (1)Uploaded byArda Taşcı
- IOSR JournalsUploaded byInternational Organization of Scientific Research (IOSR)
- bahanUploaded byAbinaya Prasaja
- Cluster Cat VarsUploaded byfab101
- Ensemble Clustering for Biological DatasetUploaded byEdian De Los Santos
- 04531898Uploaded byMelokuhle Keep Walking
- o Cluster Algorithm 129983Uploaded byLien Pham
- Nur Syazwani (212360)Uploaded bySyazwani Nizam
- Elevating Forensic Investigation System for File ClusteringUploaded byInternational Journal of Research in Engineering and Technology

- AVL treeUploaded byVishnu Prasad Goranthala
- chap7.pdfUploaded byVinícius Machado
- IndexUploaded bySiligam Pranitha
- Topics of Master ThesisUploaded byRoberto Hernández
- 2 LooksUploaded byTin Bernabe
- 2007 02 01b Janecek PerceptronUploaded byRaden
- Homework 1 2017Uploaded byHemant Rathod
- Ec1304- Control SystemUploaded bys.reegan
- 62426_02Uploaded byLincs lincoln
- Co,Po,Pso Mapping LAB-C316Uploaded byMahi Mahesh
- Tutorial4_376-381Uploaded byEgidius Putrando
- A Novel Approach to Genetic Algorithm Based CryptographyUploaded byWhite Globe Publications (IJORCS)
- K-Nearest Neighbor Query ProcessingUploaded byIJIRST
- 4 a Study of Encryption AlgorithmsUploaded byVivek
- Differentiable Optimization and Equation SolvingUploaded byAnonymous FGY7go
- Rabin AISummitGDC2015 JPSPlusWithGoalBoundingUploaded byMeng Zhang
- NSE BA Sample Paper With SolutionUploaded bySanjay Singh
- ArticleUploaded bymariumamna
- 90892-205704-1-PBUploaded byTayyeb Mahmood
- AVL+B-Tree+TreapUploaded byb12817
- Homework 2Uploaded byRavi Tej
- Lesson 2Uploaded byVaibhav Nandwani
- Flowchart NsgaUploaded bydrajiit
- markovchain.pdfUploaded byDidier De Los Santos Merino
- New Text DocumentUploaded byUsman Younas
- Chaos and Systems TheoryUploaded byBrian Futterman
- T19.pdfUploaded byArun G
- Design of Low-Pass Digital Differentiators Based on B-splinesUploaded byAI Coordinator - CSC Journals
- Placmnt AdviceUploaded bySri Badma
- M5L30Uploaded byabimana