
Data Mining: Clustering “Approaches & Techniques”

based on Real-life Data

24/01/2014

White Paper

BSNL CDR Project

Hrishav Bakul Barua

&

Anupam Roy

Telecom

hrishav.barua@tcs.com,

anupam1.r@tcs.com


Confidentiality Statement

Confidentiality and Non-Disclosure Notice

The information contained in this document is confidential and proprietary to TATA

Consultancy Services. This information may not be disclosed, duplicated or used for any

other purposes. The information contained in this document may not be released in

whole or in part outside TCS for any purpose without the express written permission of

TATA Consultancy Services.

Tata Code of Conduct

We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the Tata

Code of Conduct. We request your support in helping us adhere to the Code in letter and

spirit. We request that any violation or potential violation of the Code by any person be

promptly brought to the notice of the Local Ethics Counselor or the Principal Ethics

Counselor or the CEO of TCS. All communication received in this regard will be treated

and kept as confidential.


Table of Contents

Abstract............................................................................................................................................................................. 4

About the Authors ............................................................................................................................................................ 4

1. Data Mining............................................................................................................................................................... 5

1.1 Cluster Analysis ................................................................................................................................................. 5

1.1.1 What Does a Good Clustering Technique/Algorithm Demand? ..................................................................... 6

1.1.2 A Categorization of Major Clustering Approaches.................................................................................... 7

1.1.3 Hierarchical Method ................................................................................................................................. 7

1.1.4 Partitioning Method.................................................................................................................................. 8

1.1.5 Density‐Based Method.............................................................................................................................. 9

1.1.6 Grid‐Based Methods ................................................................................................................................. 9

1.1.7 Constraint‐Based Clustering.................................................................................................................... 10

1.1.8 Clustering Over Multi‐Density Data Space.............................................................................................. 11

1.1.9 Clustering Over Variable‐Density Space.................................................................................................. 11

1.1.10 Clustering Higher Dimensional Data ....................................................................................................... 11

1.1.11 Massive Data Clustering Using Distributed and Parallel Approach ........................................................ 12

1.1.12 How Are Clustering Algorithms Compared?............................................................................................ 12

1.1.13 Cluster Validation.................................................................................................................................... 12

2. Conclusion............................................................................................................................................................... 19

3. Acknowledgements................................................................................................................................................. 19

4. References .............................................................................................................................................................. 20


Abstract

Finding meaningful patterns and useful trends in large datasets has attracted considerable interest recently. One of

the most widely studied problems in this area is the identification and formation of clusters or densely populated

regions in a dataset. Cluster analysis divides data into meaningful or useful groups called clusters. The objective of

this paper is to present a clear analysis and survey of the various existing clustering approaches and techniques, along with some of the famous and pioneering algorithms applied under these approaches. Hence, this paper brings to light the best of these techniques and shows why they stand out among the alternatives.

In this paper, the technique of data clustering has been examined, which is a particular kind of data mining problem.

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A

cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the

objects in other clusters [1]. Given a large set of data points (that is, data objects), the data space is usually not

uniformly occupied. Data clustering identifies the sparse and the crowded places and hence, discovers the overall

distribution patterns of the data set. Besides, the derived clusters can be visualised more efficiently and effectively

than the original dataset. Mining knowledge from large amounts of spatial data is known as spatial data mining. It

becomes a highly demanding field because huge amounts of spatial data have been collected in various applications

ranging from geo-spatial and industrial data to bio-medical knowledge. The amount of spatial data being collected is increasing exponentially and has far exceeded humans' ability to analyse it. Recently, clustering has been recognised as a primary data mining method for knowledge discovery in spatial databases. The development of clustering algorithms has received a lot of attention in the last few years, and new clustering algorithms are continually being proposed. A variety of algorithms have recently emerged that meet the requirements of data mining using cluster analysis and have been successfully applied to real-life data mining problems.

About the Authors

Hrishav Bakul Barua joined TCS on September 10, 2012. A student of Sikkim Manipal University (SMU), he has published his research work on “Data Mining: Clustering Techniques” in the International Journal of Computer Applications (FCS), New York, USA.

http://www.ijcaonline.org/archives/volume58/number2/9252-3418

Anupam Roy has a total of four years of project experience in TCS. He is currently working in the BSNL CDR Project and pursuing an ME in Software Engineering from Jadavpur University, Kolkata. He has worked on ‘Attacks on Distributed Databases’ and ‘Intrusion Detection/Prevention Systems’.


1. Data Mining

Data mining refers to extracting or “mining” knowledge from large volumes of data. Many other terms carry a similar or slightly different meaning, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD.

1.1 Cluster Analysis

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A

cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to

the objects in other clusters [1].

Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine

learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of

very large datasets with many attributes of different types. This imposes unique computational requirements on

relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and

were successfully applied to real-life data mining problems. They are the subject of this survey. From a machine

learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the

resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in

data mining applications such as scientific data exploration, information retrieval and text mining, spatial database

applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, business management,

archaeology, insurance, libraries and many others. In recent years, due to the rapid increase of online documents, text clustering has become important.

Distance (similarity or dissimilarity) function for clustering quality:

Inter-cluster distance ⇒ maximised

Intra-cluster distance ⇒ minimised

Figure 1: Formation of Clusters
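These two criteria can be made concrete with a small sketch; the point coordinates, the Euclidean metric, and the centroid-based definitions below are illustrative assumptions, not something taken from the paper:

```python
from math import dist  # Python 3.8+: Euclidean distance between two points

# Two illustrative clusters of 2-D points.
cluster_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
cluster_b = [(9.0, 9.0), (10.0, 9.0), (9.0, 10.0)]

def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def intra_distance(points):
    """Average distance of each point to its own cluster centroid."""
    c = centroid(points)
    return sum(dist(p, c) for p in points) / len(points)

# Inter-cluster distance: distance between the two centroids.
inter = dist(centroid(cluster_a), centroid(cluster_b))
intra = max(intra_distance(cluster_a), intra_distance(cluster_b))

# A good clustering maximises inter- and minimises intra-cluster distance.
assert inter > intra
```

For a well-separated clustering such as this one, the centroid separation dwarfs the within-cluster spread, which is exactly the quality criterion stated above.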


1.1.1 What Does a Good Clustering Technique/Algorithm Demand?

A good clustering technique/algorithm demands the following:

- Scalability: Many clustering algorithms work well on small data sets containing fewer than several

hundred data objects; however, a large database may contain millions of objects. Clustering on a sample

of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

- Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based

(numerical) data. However, applications may require clustering other types of data, such as binary,

categorical (nominal), and ordinal data, or mixture of these data types.

- Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on

Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find

spherical clusters with similar size and density. However, a cluster could be of any shape. It is important

to develop algorithms that can detect clusters of arbitrary shape.

- Minimal requirements for domain knowledge to determine input parameters: Many clustering

algorithms require users to input certain parameters in cluster analysis (such as the number of desired

clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult

to determine, especially for data sets containing high-dimensional objects. This not only burdens users,

but it also makes the quality of clustering difficult to control.

- Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or

erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor

quality.

- Incremental clustering and insensitivity to the order of input records: Some clustering algorithms

cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and

instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the

order of input data. That is, given a set of data objects, such an algorithm may return dramatically different

clustering depending on the order of presentation of the input objects. It is important to develop

incremental clustering algorithms and algorithms that are insensitive to the order of input.

- High dimensionality: A database or a data warehouse can contain several dimensions or attributes.

Many clustering algorithms are good at handling low-dimensional data, involving only two to three

dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding

clusters of data objects in high dimensional space is challenging, especially considering that such data

can be sparse and highly skewed.

- Constraint-based clustering: Real-world applications may need to perform clustering under various

kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic

banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering

constraints such as the city’s rivers and highway networks, and the type and number of customers per

cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified

constraints.

- Interpretability and usability: Users expect clustering results to be interpretable, comprehensible and

usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is

important to study how an application goal may influence the selection of clustering features and methods.


- Time Complexity: The time required for a particular clustering algorithm to run/execute and produce the

output.

- Labeling or assignment: Hard or strict (each data object is in one and only one cluster) vs. soft or fuzzy (each data object has a probability of being in each cluster).

1.1.2 A Categorization of Major Clustering Approaches

- Hierarchical Method
- Partitioning Method
- Density-Based Methods
- Grid-Based Methods
- Methods Based on Co-Occurrence of Categorical Data
- Constraint-Based Clustering
- Clustering Algorithms Used in Machine Learning
- Scalable Clustering Algorithms
- Model-Based Methods
- Algorithms for High-Dimensional Data

1.1.3 Hierarchical Method

Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram as

represented in the following figure:

Figure 2: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.

Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such

an approach allows exploring data on different levels of granularity. Hierarchical clustering methods are categorized

into agglomerative (bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point (singleton)

clusters and recursively merges two or more most appropriate clusters. A divisive clustering starts with one cluster of

all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion

(frequently, the requested number k of clusters) is achieved.
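The agglomerative (bottom-up) process described above can be sketched in plain Python; single linkage and the stopping criterion k are illustrative choices, not something prescribed by the paper:

```python
from math import dist

def agglomerative(points, k):
    """Single-linkage agglomerative clustering: start with singleton
    clusters and repeatedly merge the two closest clusters until
    only k clusters remain (the stopping criterion)."""
    clusters = [[p] for p in points]          # one-point (singleton) clusters
    while len(clusters) > k:
        # Find the pair with the smallest single-linkage distance
        # (closest pair of points across the two clusters).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge cluster j into cluster i
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
result = agglomerative(pts, 2)
assert sorted(len(c) for c in result) == [2, 3]
```

Recording each merge, instead of discarding it, would yield exactly the dendrogram of Figure 2; running the loop down to k = 1 traverses the full hierarchy.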


Figure 3: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.

Advantages of hierarchical clustering include:

- Embedded flexibility regarding the level of granularity
- Ease of handling any form of similarity or distance
- Consequently, applicability to any attribute type

Disadvantages of hierarchical clustering are related to:

- Vagueness of termination criteria
- The fact that most hierarchical algorithms do not revisit once-constructed (intermediate) clusters with the purpose of improving them

Hierarchical clustering based on linkage metrics results in clusters of proper (convex) shapes. Active contemporary

efforts to build cluster systems that incorporate our intuitive concept of clusters as connected components of arbitrary

shape, including the algorithms CURE and CHAMELEON [13], are surveyed in the sub-section Hierarchical Clusters

of Arbitrary Shapes. Divisive techniques based on binary taxonomies are presented in the sub-section Binary Divisive

Partitioning. The sub-section Other Developments contains information related to incremental learning, model-based

clustering and cluster refinement.

One of the most striking developments in hierarchical clustering is the algorithm BIRCH [8]. The data squashing used by BIRCH to achieve scalability has independent importance. Hierarchical clustering of large datasets can be very sub-optimal, even if the data fits in memory. Compressing the data may improve the performance of hierarchical algorithms.

1.1.4 Partitioning Method

In this section we survey data partitioning algorithms, which divide data into several subsets. Because checking all

possible subset systems is computationally infeasible, certain greedy heuristics are used in the form of iterative

optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k

clusters. Unlike traditional hierarchical methods, in which clusters are not revisited after being constructed, relocation

algorithms gradually improve clusters. With appropriate data, this results in high quality clusters. One approach to data

partitioning is to take a conceptual point of view that identifies the cluster with a certain model whose unknown

parameters have to be found. More specifically, probabilistic models assume that the data comes from a mixture of

several populations whose distributions and priors we want to find. Corresponding algorithms are described in the

sub-section Probabilistic Clustering. One clear advantage of probabilistic methods is the interpretability of the

constructed clusters. Having a concise cluster representation also allows inexpensive computation of intra-cluster measures of fit that give rise to a global objective function.


Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.
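The k-means heuristic just described (iterative relocation between assignment and mean-update steps) can be sketched in plain Python; the toy points, k = 2, the seeding scheme, and the fixed iteration count are illustrative choices:

```python
import random
from math import dist

def k_means(points, k, iters=20, seed=0):
    """Lloyd's k-means: each cluster is represented by the mean of
    its members, and points are iteratively relocated (reassigned)
    to the nearest mean."""
    rng = random.Random(seed)
    means = rng.sample(points, k)             # seed means from the data
    for _ in range(iters):
        # Assignment step: each point goes to its nearest mean.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist(p, means[i]))].append(p)
        # Update step: recompute each mean (keep old mean if group is empty).
        means = [tuple(sum(c) / len(g) for c in zip(*g)) if g else means[i]
                 for i, g in enumerate(groups)]
    return means, groups

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
means, groups = k_means(pts, 2)
assert sorted(len(g) for g in groups) == [3, 3]
```

Swapping the update step to pick the member object closest to the group mean, rather than the mean itself, turns this sketch into a crude k-medoids variant.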

1.1.5 Density-Based Method

Most partitioning methods cluster objects based on the distance between objects. Such methods can find only

spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering

methods have been developed based on the notion of density. Their general idea is to continue growing the given

cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold; that

is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum

number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. The

density-based approach is famous for its capability of discovering arbitrary shaped clusters of good quality even in

noisy datasets [2]. Figure 4 illustrates some cluster shapes that present a problem for partitioning relocation clustering (e.g., k-means) but are handled properly by density-based algorithms. They also have good scalability.

Figure 4: Irregular shapes difficult for k-means

There are two major approaches to density-based methods. The first approach pins density to a training data point and is reviewed in the sub-section Density-Based Connectivity. Representative algorithms include DBSCAN, GDBSCAN, OPTICS, and DBCLASD. The second approach pins density to a point in the attribute space and is explained in the sub-section Density Functions. It includes the algorithm DENCLUE.

DBSCAN [2] and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.

1.1.6 Grid-Based Methods
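The neighborhood-growing idea behind density-based connectivity can be illustrated with a toy DBSCAN-style sketch; the eps and min_pts values and the simple O(n²) neighbor search are illustrative simplifications, not the published algorithm:

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Toy DBSCAN: grow a cluster from each unvisited core point
    (a point with at least min_pts neighbors within radius eps).
    Points reachable from no core point are labeled noise (-1)."""
    labels = {p: None for p in points}
    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        neighbors = [q for q in points if dist(p, q) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1                    # provisionally noise
            continue
        labels[p] = cluster_id                # p is a core point: grow cluster
        frontier = list(neighbors)
        while frontier:
            q = frontier.pop()
            if labels[q] in (None, -1):
                labels[q] = cluster_id
                q_neighbors = [r for r in points if dist(q, r) <= eps]
                if len(q_neighbors) >= min_pts:   # q is a core point too
                    frontier.extend(q_neighbors)
        cluster_id += 1
    return labels

pts = [(0, 0), (0.5, 0), (1, 0), (5, 5), (5.5, 5), (6, 5), (20, 20)]
labels = dbscan(pts, eps=1.0, min_pts=2)
assert labels[(20, 20)] == -1                 # isolated point is noise
assert labels[(0, 0)] == labels[(1, 0)]       # chained into one cluster
```

Because clusters grow by chaining neighborhoods rather than by distance to a center, elongated or irregular shapes like those in Figure 4 come out as single clusters, and the isolated point is filtered out as noise.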

The grid-based clustering approach uses a multi-resolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed [3]. The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.


There is a high probability that all data points falling into the same grid cell belong to the same cluster. Therefore, all data points belonging to the same cell can be aggregated and treated as one object. It is due to this nature that grid-based clustering algorithms are computationally efficient: their cost depends on the number of cells in each dimension of the quantized space. The approach has further advantages, such as that the total number of grid cells is independent of the number of data points and is insensitive to the order of input data points.
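The quantization-and-aggregation step can be sketched as follows; the 2-D points and the cell size are illustrative:

```python
from collections import defaultdict

def grid_cells(points, cell_size):
    """Quantize the data space: map each 2-D point to its grid cell
    and aggregate points per cell, so that later clustering work
    depends on the number of occupied cells, not the number of points."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return cells

pts = [(0.1, 0.2), (0.4, 0.9), (5.1, 5.2), (5.3, 5.8)]
cells = grid_cells(pts, cell_size=1.0)
assert len(cells) == 2                        # only two occupied cells
assert len(cells[(0, 0)]) == 2                # two points aggregated in one cell
```

Four points collapse to two occupied cells here; on massive datasets this is the source of the speed advantage described above, since subsequent passes touch cells rather than raw points.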

Some of the popular grid-based clustering techniques are STING [4], WaveCluster [5], CLIQUE [6], pMAFIA [7], and so on. CLIQUE [6] is a hybrid clustering method that combines the ideas of both density-based and grid-based approaches. pMAFIA [7] is an optimized and improved version of CLIQUE. It uses the concept of adaptive grids for detecting clusters. It scales exponentially with the dimension of the highest-dimensional cluster in the data set.

The algorithm STING (STatistical INformation Grid-based method) [4] works with numerical attributes (spatial data) and is designed to facilitate “region oriented” queries. In doing so, STING constructs data summaries in a way similar to BIRCH [8]. It, however, assembles statistics in a hierarchical tree of nodes that are grid-cells. Figure 5 presents the proliferation of cells in 2-dimensional space and the construction of the corresponding tree. Each cell has four (default) children and stores a point count and attribute-dependent measures: mean, standard deviation, minimum, maximum, and distribution type. Measures are accumulated starting from bottom-level cells and are further propagated to higher-level cells (e.g., a cell’s minimum is the minimum among its children’s minimums). Only the distribution type presents a problem: a χ²-test is used after bottom-cell distribution types are handpicked. When the cell-tree is constructed (in O(N) time), certain cells are identified and connected in clusters similar to DBSCAN. If the number of leaves is K, the cluster construction phase depends on K and not on N. This algorithm has a simple structure suitable for parallelization and allows for multi-resolution, though defining appropriate granularity is not straightforward. STING has been further enhanced to the algorithm STING+ [9], which targets dynamically evolving spatial databases and uses a similar hierarchical cell organization to its predecessor. In addition, STING+ enables active data mining.

Figure 5: Cell generation and tree construction in STING

To do so, it supports user-defined trigger conditions (e.g., there is a region where at least 10 cellular phones are in use per square mile over a total area of at least 10 square miles, or usage drops by 20% in a described region). The related measures, sub-triggers, are stored and updated over the hierarchical cell tree. They are suspended until the trigger fires with a user-defined action. Four types of conditions are supported: absolute and relative conditions on regions (a set of adjacent cells), and absolute and relative conditions on certain attributes.

1.1.7 Constraint-Based Clustering

In real-world applications customers are rarely interested in unconstrained solutions. Clusters are frequently subjected to some problem-specific limitations that make them suitable for particular business actions. Building such conditioned cluster partitions is the subject of active research; see, for example, the survey [10].


The framework for constraint-based clustering is introduced in [11]. The taxonomy of clustering constraints includes constraints on individual objects (for example, a customer who recently made a purchase) and parameter constraints (like the number of clusters) that can be addressed through preprocessing or external cluster parameters. The taxonomy also includes constraints on individual clusters that can be described in terms of bounds on aggregate functions (min, avg, and so on) over each cluster. Another approach to building balanced clusters is to convert the task into a graph partitioning problem [12].

An important constraint-based clustering application is to cluster 2D spatial data in the presence of obstacles. Instead of the regular Euclidean distance, the length of the shortest path between two points can be used as an obstacle distance. The Clustering with Obstructed Distance (COD) algorithm [11] deals with this problem. It is best illustrated by Figure 6, which shows the difference in constructing three clusters in the absence of an obstacle (left) and in the presence of a river with a bridge (right).

Figure 6: Obstacle (river with the bridge) makes a difference
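The obstructed-distance idea can be illustrated on a toy grid, where the distance between two points is the length of the shortest path avoiding obstacle cells. The grid size, the “river” layout, and the breadth-first search below are illustrative choices, not the actual COD algorithm:

```python
from collections import deque

def obstructed_distance(start, goal, obstacles, size=10):
    """Shortest 4-neighbor path length on a size x size grid that
    avoids obstacle cells, i.e. a toy 'obstacle distance'.
    Returns None if the goal is unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (x, y), d = queue.popleft()
        if (x, y) == goal:
            return d
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < size and 0 <= ny < size
                    and (nx, ny) not in obstacles and (nx, ny) not in seen):
                seen.add((nx, ny))
                queue.append(((nx, ny), d + 1))
    return None

# A vertical "river" at x == 5 with a single "bridge" cell at (5, 4).
river = {(5, y) for y in range(10) if y != 4}
d_bridge = obstructed_distance((2, 4), (8, 4), river)
assert d_bridge == 6                          # straight across the bridge
assert obstructed_distance((2, 4), (8, 4), river | {(5, 4)}) is None
```

Points on opposite banks that are close in Euclidean terms become far apart in obstructed distance unless a bridge connects them, which is exactly why the clusterings on the two sides of Figure 6 differ.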

1.1.8 Clustering Over Multi-Density Data Space

One of the main applications of clustering spatial databases is to find clusters of spatial objects which are close to

each other. Most traditional clustering algorithms try to discover clusters of arbitrary densities, shapes and sizes. Very

few clustering algorithms show preferable efficiency when clustering multi-density datasets. This is partly because small clusters with few points in a local area are likely to be missed by a global density threshold. TDCT [16] is a triangle-density clustering technique for large multi-density as well as embedded clusters.

1.1.9 Clustering Over Variable-Density Space

Most real-life datasets have a skewed distribution and may also contain nested cluster structures, the discovery

of which is very difficult. Therefore, we discuss two density based approaches, OPTICS [14] and EnDBSCAN [15],

which attempt to handle the datasets with variable density successfully. OPTICS can identify embedded clusters over

varying density space. However, its execution time performance degrades in case of large datasets with variable

density space and it cannot detect nested cluster structures successfully over massive datasets. In EnDBSCAN [15],

an attempt is made to detect embedded or nested clusters using an integrated approach. Based on our experimental

analysis in light of very large synthetic datasets, it has been observed that EnDBSCAN can detect embedded clusters;

however, with the increase in the volume of data, the performance of it also degrades. EnDBSCAN is highly sensitive

to the parameters MinPts and ε. In addition to the above-mentioned parameters, OPTICS requires an additional parameter, ε′.

1.1.10 Clustering Higher Dimensional Data

Most of the clustering methods stated in section 1.1 are implemented on 2D spatial datasets, but clustering of 3D spatial datasets is in high demand. For space research, geo-spatial data, and 3D object detection, an efficient clustering algorithm is required. CLIQUE is a dimension-growth subspace clustering method [12]: the process starts in single-dimensional subspaces and extends to higher-dimensional ones. CLIQUE is a combination of density- and grid-based clustering methods. In it, the data space is partitioned into non-overlapping rectangular units, and the dense units among them are identified. 3D-CATD [17] is a clustering technique for massive numeric three-dimensional (3D) datasets. The algorithm is based on the density approach and can detect global as well as embedded clusters. Experimental results are reported to establish the superiority of the algorithm on several synthetic data sets. We have only considered three-dimensional objects, but many real-life problems deal with dimensionalities higher than 2D/3D.

1.1.11 Massive Data Clustering Using Distributed and Parallel Approach

Parallel and distributed computing is expected to relieve current clustering methods of the sequential bottleneck, providing the ability to scale to massive datasets and improving response time. Such algorithms divide the data into partitions, which are processed in parallel. The results from the partitions are then merged. In [18], a Density-Based Distributed Clustering (DBDC) [21] algorithm was presented where the data are first clustered locally at different sites, independently of each other. The aggregated information about locally created clusters is extracted and transmitted to a central site. On the central site, a global clustering is performed based on the local representatives, and the result is sent back to the local sites. The local sites update their clustering based on the global model, that is, they merge two local clusters into one or assign local noise to global clusters. For both the local and the global clustering, density-based algorithms are used. This approach is scalable to large datasets and gives clusters of good quality. GDCT [19], [20] is a distributed algorithm for intrinsic cluster detection over large spatial data.
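The local-then-global scheme can be sketched as follows. Grid-style local summaries and a greedy centroid merge stand in for the density-based local and global clusterings of DBDC, so this is only an illustrative simplification:

```python
from collections import defaultdict
from math import dist

def local_representatives(partition, cell=2.0):
    """Each 'site' summarizes its own data as (centroid, count) pairs;
    simple grid aggregation stands in for a local clustering step."""
    cells = defaultdict(list)
    for x, y in partition:
        cells[(x // cell, y // cell)].append((x, y))
    return [(tuple(sum(c) / len(g) for c in zip(*g)), len(g))
            for g in cells.values()]

def global_merge(reps, radius=3.0):
    """Central site: greedily fold each local representative into the
    nearest existing global cluster within radius, else start a new one."""
    merged = []                               # entries are [centroid, count]
    for (cx, cy), n in reps:
        for m in merged:
            (mx, my), mn = m
            if dist((cx, cy), (mx, my)) <= radius:
                total = mn + n                # count-weighted centroid update
                m[0] = ((mx * mn + cx * n) / total, (my * mn + cy * n) / total)
                m[1] = total
                break
        else:
            merged.append([(cx, cy), n])
    return merged

site1 = [(0, 0), (1, 1), (10, 10)]
site2 = [(0.5, 0.5), (10.5, 10.5), (11, 10)]
reps = local_representatives(site1) + local_representatives(site2)
global_clusters = global_merge(reps)
assert len(global_clusters) == 2              # two global clusters emerge
```

Only the small (centroid, count) summaries cross the network, not the raw points, which is the communication saving the distributed scheme relies on.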

1.1.12 How Are Clustering Algorithms Compared?

There are many factors on the basis of which clustering algorithms are compared. A few of them are listed as follows:

- The size of datasets
- Number of clusters
- Type of datasets
- Type of software used for implementation
- Time complexity of execution
- Number of user parameters
- Noise-handling accuracy

1.1.13 Cluster Validation

A large number of clustering algorithms have been developed to deal with specific applications. Several questions arise, such as:

- Which clustering algorithm is best suited to the application at hand?
- How many clusters are there in the studied data?
- Is there a better clustering scheme?

These questions are related to evaluating the quality of clustering results, that is, cluster validation. Cluster validation is a procedure for assessing the quality of clustering results and finding a fitting clustering strategy for a specific application. It aims at finding the optimal cluster scheme and interpreting the cluster patterns.

Cluster validation is an indispensable part of cluster analysis, because no clustering algorithm can guarantee the discovery of genuine clusters from real datasets, and because different clustering algorithms often impose different cluster structures on a data set even if no cluster structure is present in it. Cluster validation is needed in data mining to solve the following problems:


- To assess the quality of a partition of a real data set generated by a clustering algorithm
- To identify the genuine clusters in the partition
- To interpret the clusters

Generally speaking, cluster validation approaches are classified into the following three categories:

- Internal approaches
- Relative approaches
- External approaches

These cluster validation methods are discussed as follows:

1.1.13.1 Internal Approaches

Internal cluster validation evaluates the quality of clusters using statistics devised to capture that quality from the available data objects only. In other words, internal cluster validation excludes any information beyond the clustering data itself, and focuses on assessing cluster quality based on the clustered data alone.

Statistical methods of quality assessment are employed as internal criteria; for example, the root-mean-square standard deviation (RMSSTD) is used for the compactness of clusters, R-squared (RS) for the dissimilarity between clusters, and S_Dbw for a compound evaluation of compactness and dissimilarity [1]. The formulas of RMSSTD, RS and S_Dbw are shown below.

Formula 1

where x_j is the expected value in the j-th dimension; n_ij is the number of elements in the i-th cluster, j-th dimension; n_j is the number of elements in the j-th dimension in the whole data set; and n_c is the number of clusters.

Formula 2

Where,


The formula of S_Dbw is given as:

S_Dbw = Scat(c) + Dens_bw(c)    (Formula 3)

where Scat(c) is the average scattering within the c clusters. Scat(c) is defined as:

Formula 4

The value of Scat(c) measures the degree to which the data points are scattered within clusters; it reflects the compactness of the clusters. One term in the formula is the variance of the whole data set, and the other is the variance of cluster c_i. Dens_bw(c) indicates the average number of points between the c clusters (that is, an indication of inter-cluster density) in relation to the density within clusters. The formula of Dens_bw is given as:

Formula 5

where u_ij is the middle point of the line segment between the centres of clusters v_i and v_j. The density function of a point is defined as the number of points around that point within a given radius.
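Since the formula images above do not reproduce well in text, the following Python sketch shows one common formulation of RMSSTD and RS; the exact expressions should be checked against [1].

```python
import numpy as np

def rmsstd(data, labels):
    """Root-mean-square standard deviation: the pooled within-cluster
    standard deviation over all clusters and dimensions
    (lower = more compact clusters)."""
    ss_within, dof = 0.0, 0
    for c in np.unique(labels):
        pts = data[labels == c]
        ss_within += ((pts - pts.mean(axis=0)) ** 2).sum()
        dof += (len(pts) - 1) * data.shape[1]
    return float(np.sqrt(ss_within / dof))

def r_squared(data, labels):
    """R-squared: the share of the total sum of squares explained by
    the partition (closer to 1 = better-separated clusters)."""
    ss_total = ((data - data.mean(axis=0)) ** 2).sum()
    ss_within = sum(
        ((data[labels == c] - data[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels))
    return float((ss_total - ss_within) / ss_total)

# Two tight, well-separated clusters: small RMSSTD and RS close to 1.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.2, (40, 2)), rng.normal(10, 0.2, (40, 2))])
labels = np.array([0] * 40 + [1] * 40)
print(round(rmsstd(data, labels), 2))    # small pooled spread
print(round(r_squared(data, labels), 3)) # close to 1 for this data
```

Note that both indices use only the data and the partition, which is exactly what makes them internal criteria.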

1.1.13.2 Relative Approaches

Relative assessment compares two structures and measures their relative merit. The idea is to run the clustering algorithm for a range of parameter values (for example, for each possible number of clusters) and identify the clustering scheme that best fits the dataset; that is, the clustering results are assessed by applying an algorithm with different parameters to a data set and finding the optimal solution. In practice, relative criteria methods also use RMSSTD, RS and S_Dbw to find the best cluster scheme, in terms of compactness and dissimilarity, among all the clustering results. Relative cluster validity is also called cluster stability, and recent work on relative cluster validity has been presented in the literature.
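As an illustration of the relative approach, the sketch below runs a toy k-means for several candidate values of k and keeps the value that maximizes the mean silhouette width, used here as a stand-in for indices such as RMSSTD, RS or S_Dbw:

```python
import numpy as np

def kmeans(data, k, iters=25):
    """Tiny deterministic k-means (farthest-first init) used only to
    generate a candidate partition for each k."""
    cents = [data[0]]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(data[:, None] - np.array(cents)[None],
                                  axis=2), axis=1)
        cents.append(data[np.argmax(d)])
    cents = np.array(cents, dtype=float)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(data[:, None] - cents[None],
                                          axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                cents[j] = data[labels == j].mean(axis=0)
    return labels

def silhouette(data, labels):
    """Mean silhouette width: one of many relative indices trading
    compactness (a) against separation (b)."""
    dists = np.linalg.norm(data[:, None] - data[None], axis=2)
    n, scores = len(data), []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = dists[i, own].mean() if own.any() else 0.0
        b = min(dists[i, labels == c].mean()
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Three well-separated blobs: the index should peak at k = 3.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 5, 10)])
scores = {k: silhouette(data, kmeans(data, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # → 3
```

The point is not this particular index but the loop over parameter settings: the same algorithm is re-run and the validity index arbitrates among the resulting schemes.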


1.1.13.3 External Approaches

The results of a clustering algorithm are evaluated against a pre-specified structure, which reflects the user's intuition about the clustering structure of the data set. As a necessary post-processing step, external cluster validation is a hypothesis-testing procedure: given a set of class labels produced by a cluster scheme, it is compared with the clustering results obtained by applying the same cluster scheme to other partitions of a database, as shown in Figure 7.

Figure 7: External criteria based validation

External cluster validation is based on the assumption that the output of a clustering algorithm can be understood by finding a resemblance between the clusters and existing classes. Statistical methods for quality assessment are employed in external cluster validation, such as the Rand statistic, the Jaccard coefficient, the Fowlkes-Mallows index, Hubert's statistic and its normalized form, and the Monte Carlo method, to measure the similarity between the a priori modelled partitions and the clustering results on a dataset.
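The pair-counting indices mentioned above are straightforward to compute. The sketch below implements the Rand statistic and the Jaccard coefficient from a reference labeling and a clustering result:

```python
from itertools import combinations

def pair_counts(truth, pred):
    """Classify every pair of points by whether the reference classes
    (truth) and the clustering (pred) place them together."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t, same_p = truth[i] == truth[j], pred[i] == pred[j]
        if same_t and same_p:
            ss += 1          # together in both
        elif same_t:
            sd += 1          # together in truth only
        elif same_p:
            ds += 1          # together in clustering only
        else:
            dd += 1          # apart in both
    return ss, sd, ds, dd

def rand_index(truth, pred):
    ss, sd, ds, dd = pair_counts(truth, pred)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_coefficient(truth, pred):
    ss, sd, ds, _ = pair_counts(truth, pred)
    return ss / (ss + sd + ds)

truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]    # one point placed in the wrong cluster
print(round(rand_index(truth, pred), 3))          # → 0.667
print(round(jaccard_coefficient(truth, pred), 3)) # → 0.444
```

Both indices reward pairs the two partitions treat the same way; the Jaccard coefficient simply ignores the pairs that are apart in both, which makes it stricter than the Rand statistic.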

Based on our selected survey and experimental analysis, it has been observed that:

- The density based approach is most suitable for quality cluster detection over massive datasets in 2D, 3D or higher dimensions.
- The grid based approach is suitable for fast processing of large datasets in 2D, 3D or higher dimensions.
- Almost all clustering algorithms require input parameters whose determination is very difficult, especially for real-world data sets containing high-dimensional objects. Moreover, the algorithms are highly sensitive to those parameters.
- The distributions of most real-life datasets are skewed in nature, so handling such datasets for qualitative cluster detection based on a single global input parameter is impractical.
- Only some of the techniques falling under the density/density-grid hybrid approaches (TDCT, GDCT, DGCL, etc.) are capable of qualitatively handling multi-density datasets as well as multiple intrinsic or nested clusters over massive datasets.
- Only a few of the techniques (especially those under the grid based approach) can handle higher-dimensional datasets.
- Algorithms under the density based and grid based approaches employ fewer user-defined parameters.
- The density and grid based approaches handle the single-linkage problem well and can detect multi-density as well as embedded clusters.


A tabular comparison of various pioneering clustering algorithms under various approaches is represented as

follows:

Table 1: Clustering Algorithms

| Approach | Sl. No. | Algorithm | Parameters | Optimized for | Structure | Multi-Density Clusters | Embedded Clusters | Complexity | Noise Handling |
|---|---|---|---|---|---|---|---|---|---|
| Partitioning | 1 | K-means | No. of clusters | Separated clusters | Spherical | No | No | O(t·k·N) | No |
| Partitioning | 2 | K-medoids | No. of clusters | Separated clusters, large-valued objects | Spherical | No | No | O(k(N-k)²) | No |
| Partitioning | 3 | K-modes | No. of clusters | Separated clusters, large datasets | Spherical | No | No | O(t·k(N-k)²) | No |
| Partitioning | 4 | FCM (Fuzzy C-means Clustering) | No. of clusters | Separated clusters | Non-convex shapes | No | No | O(N) | No |
| Partitioning | 5 | PAM (Partitioning Around Medoids) | No. of clusters | Separated clusters, large datasets | Spherical | No | No | O(t·k(N-k)²) | No |
| Partitioning | 6 | CLARA (Clustering LARge Applications) | No. of clusters | Relatively large datasets | Spherical | No | No | O(k·s² + k(N-k)) | No |
| Partitioning | 7 | CLARANS (A CLustering Algorithm based on RANdomized Search) | No. of clusters, max. no. of neighbours | Better than PAM & CLARA | Spherical | No | No | O(k·N²) | No |
| Hierarchical | 1 | BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) | Branching factor, diameter, threshold | Large data | Spherical | No | No | O(N) | Yes |
| Hierarchical | 2 | CURE (Clustering Using REpresentatives) | No. of clusters, no. of representatives | Any-shaped large data | Arbitrary | No | No | O(N² log N) | Yes |
| Hierarchical | 3 | ROCK (RObust Clustering using links) | No. of clusters | Small noisy data | Arbitrary | No | No | O(N² + N·m_m·m_a + N² log N) | Yes |
| Hierarchical | 4 | CHAMELEON | 3 (k-nearest neighbours, MIN-SIZE, α_c) | Small datasets | Arbitrary | Yes | No | O(N²) | Yes |
| Density based | 1 | DBSCAN (Density Based Spatial Clustering of Applications with Noise) | 2 (MinPts, ε) | Large datasets | Arbitrary | No | No | O(N log N) using R*-tree | Yes |
| Density based | 2 | OPTICS (Ordering Points To Identify the Clustering Structure) | 3 (MinPts, ε, ε′) | Large datasets | Arbitrary | Yes | Yes | O(N log N) using R*-tree | Yes |
| Density based | 3 | DENCLUE | 2 (MinPts, ε) | Large datasets | Arbitrary | No | No | O(N log N) using R*-tree | Yes |
| Density based | 4 | TDCT (Triangle-Density Clustering Technique) | 2 (ε, β) | Large spatial datasets | Arbitrary | Yes | Yes | O(n_c × 2·m·N) | Yes |
| Density based | 5 | 3D-CATD (3-Dimensional Clustering Algorithm using Tetrahedron Density) | 2 (ε, β) | Large datasets, 3D datasets | Arbitrary | Yes | Yes | O(n_c × m·N) | Yes |
| Grid based | 1 | WaveCluster | No. of cells per dimension, no. of applications of the transform | Any shape, large data | Any | Yes | No | O(N) | Yes |
| Grid based | 2 | STING | No. of cells at the lowest level, no. of objects per cell | Large spatial datasets | Vertical and horizontal boundaries | No | No | O(N) | Yes |
| Grid based | 3 | CLIQUE | Size of the grid, min. no. of points per grid cell | High-dimensional, large datasets | Arbitrary | No | No | O(N) | Yes |
| Grid based | 4 | MAFIA | Size of the grid, min. no. of points per grid cell | High-dimensional, large datasets | Arbitrary | No | No | O(c^k) | Yes |
| Grid-density hybrid | 1 | GDCT (Grid-Density Clustering Technique) | 2 (n, β) | Large datasets, 2D datasets | Arbitrary | Yes | Yes | O(N/k + t) | Yes |
| Grid-density hybrid | 2 | GDCT using distributed computing | 2 (n, β) | Large datasets, 2D datasets | Arbitrary | Yes | Yes | O(N) | Yes |
| Grid-density hybrid | 3 | DisClus (Distributed Clustering) | 2 (n, α) | High-resolution multi-spectral satellite datasets | Arbitrary | Yes | Yes | O(N) | Yes |
| Graph based | 1 | AUTOCLUST | NIL | Massive data | Arbitrary | No | No | O(N log N) | Yes |


2. Conclusion

Clustering lies at the heart of data analysis and data mining applications. The ability to discover highly correlated regions of objects when their number becomes very large is highly desirable, as data sets grow and their properties and interrelationships change. Every research paper that presents a new clustering technique shows its superiority over other techniques, and it is hard to judge how well a technique will actually work. In this paper we described the process of clustering from the data mining point of view. We gave the properties of a "good" clustering technique and the methods used to find meaningful partitionings. We have also carried out a selected survey of various clustering approaches and the pioneering algorithms of those approaches. From the survey we can conclude that density based and grid based clustering approaches can produce optimal solutions in clustering. Density based clustering techniques can find clusters of any shape and size in large datasets, with good noise handling and fewer parameters. Grid based techniques can find clusters with very low time complexity, as they can process datasets very quickly. Hence, the density-grid hybrid clustering approach can be one of the best solutions for many kinds of clustering problems. GDCT, DGCL and DisClus, which fall into this category, are some of the best algorithms in the arena.

The clusters obtained from the techniques discussed can be further refined by smoothing the cluster boundaries. This can be performed by employing membership functions and fuzzy logic on the boundary data points to find the probability and membership of these points with respect to the clusters, and hence to predict the exact cluster to which each point belongs.
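The boundary-smoothing idea described above can be illustrated with FCM-style memberships, where a point's membership in each cluster falls off with its distance from that cluster's centre. The centres and the fuzzifier m below are illustrative, not values from the paper:

```python
import numpy as np

def fcm_memberships(point, centers, m=2.0):
    """FCM-style membership of a point in each cluster: inversely
    related to its distance from each centre; memberships sum to 1."""
    d = np.linalg.norm(centers - point, axis=1)
    if np.any(d == 0):                        # the point sits on a centre
        return (d == 0).astype(float) / (d == 0).sum()
    inv = (1.0 / d) ** (2.0 / (m - 1.0))
    return inv / inv.sum()

# Hypothetical cluster centres; a boundary point nearer the first one.
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
boundary_pt = np.array([4.0, 0.0])
u = fcm_memberships(boundary_pt, centers)
print(u.argmax(), round(float(u[0]), 2))      # → 0 0.69
```

A hard assignment then takes the cluster with the highest membership, while the membership vector itself quantifies how ambiguous the boundary point is.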

3. Acknowledgements

We would sincerely like to thank Mr. Sarbeswar Das, Project Manager, BSNL CDR Project, for his encouragement to write this paper and for sharing his experience and expertise.

Hrishav & Anupam


4. References

[1] J. Han and M. Kamber, (2004), Data Mining: Concepts and Techniques. India: Morgan Kaufmann Publishers.

[2] M. Ester, H. P. Kriegel, J. Sander and X. Xu, (1996), "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", in International Conference on Knowledge Discovery in Databases and Data Mining (KDD-96), Portland, Oregon, pp. 226-231.

[3] C. Hsu and M. Chen, (2004), "Subspace Clustering of High Dimensional Spatial Data with Noises", PAKDD, pp. 31-40.

[4] W. Wang, J. Yang and R. R. Muntz, (1997), "STING: A Statistical Information Grid Approach to Spatial Data Mining", in Proc. 23rd International Conference on Very Large Databases (VLDB), Athens, Greece, Morgan Kaufmann Publishers, pp. 186-195.

[5] G. Sheikholeslami, S. Chatterjee and A. Zhang, (1998), "WaveCluster: A Multiresolution Clustering Approach for Very Large Spatial Databases", in SIGMOD'98, Seattle.

[6] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, (1998), "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications", in SIGMOD Record, ACM Special Interest Group on Management of Data, pp. 94-105.

[7] H. S. Nagesh, S. Goil and A. N. Choudhary, (2000), "A Scalable Parallel Subspace Clustering Algorithm for Massive Data Sets", in Proc. International Conference on Parallel Processing, p. 477.

[8] T. Zhang, R. Ramakrishnan and M. Livny, (1996), "BIRCH: An Efficient Data Clustering Method for Very Large Databases", in Proc. 1996 ACM SIGMOD International Conference on Management of Data, pp. 103-114, ACM, New York, NY, USA.

[9] W. Wang, J. Yang and R. R. Muntz, (1999), "STING+: An Approach to Active Spatial Data Mining", in Proc. 15th ICDE, pp. 116-125, Sydney, Australia.

[10] J. Han, M. Kamber and A. K. H. Tung, (2001), "Spatial Clustering Methods in Data Mining: A Survey", in H. Miller and J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis.

[11] A. K. H. Tung, R. T. Ng, L. V. S. Lakshmanan and J. Han, (2001), "Constraint-Based Clustering in Large Databases", in Proc. 8th ICDT, London, UK.

[12] A. Strehl and J. Ghosh, (2000), "A Scalable Approach to Balanced, High-Dimensional Clustering of Market Baskets", in Proc. 17th International Conference on High Performance Computing, Springer LNCS, pp. 525-536, Bangalore, India.

[13] L. Ertoz, M. Steinbach and V. Kumar, (2003), "Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data", in SIAM International Conference on Data Mining (SDM'03).

[14] M. Ankerst, M. M. Breunig, H. P. Kriegel and J. Sander, (1999), "OPTICS: Ordering Points To Identify the Clustering Structure", in ACM SIGMOD, pp. 49-60.

[15] S. Roy and D. K. Bhattacharyya, (2005), "An Approach to Find Embedded Clusters Using Density Based Techniques", in Proc. ICDCIT, LNCS 3816, pp. 523-535.

[16] H. B. Barua, D. K. Das and S. Sarmah, (2012), "A Density Based Clustering Technique for Large Spatial Data Using Polygon Approach (TDCT)", IOSR Journal of Computer Engineering (IOSRJCE), ISSN 2278-0661, Vol. 3, Issue 6 (July-Aug. 2012), pp. 01-10.

[17] H. B. Barua and S. Sarmah, (2012), "An Extended Density Based Clustering Algorithm for Large Spatial 3D Data Using Polyhedron Approach (3D-CATD)", International Journal of Computer Applications, 58(2), pp. 4-15, November 2012. Published by Foundation of Computer Science, New York, USA (ISBN 973-93-80871-32-3, ISSN 0975-8887).

[18] E. Januzaj, H. P. Kriegel and M. Pfeifle, (2003), "Towards Effective and Efficient Distributed Clustering", Workshop on Clustering Large Data Sets, ICDM'03, Melbourne, Florida.

[19] S. Sarmah, R. Das and D. K. Bhattacharyya, (2007), "Intrinsic Cluster Detection Using Adaptive Grids", in Proc. ADCOM'07, Guwahati.

[20] S. Sarmah, R. Das and D. K. Bhattacharyya, (2008), "A Distributed Algorithm for Intrinsic Cluster Detection over Large Spatial Data: A Grid-Density Based Clustering Technique (GDCT)", World Academy of Science, Engineering and Technology, 45, pp. 856-866.

[21] E. Januzaj et al., (2003), "Towards Effective and Efficient Distributed Clustering", in Proc. ICDM 2003.

Thank You

Contact

For more information, contact

hrishav.barua@tcs.com,

anupam1.r@tcs.com

About Tata Consultancy Services (TCS)

Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India's largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India.

For more information, visit us at www.tcs.com.

IT Services

Business Solutions

Consulting

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content /

information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced,

republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS.

Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws,

and could result in criminal or civil penalties. Copyright © 2011 Tata Consultancy Services Limited
