
A Survey of Clustering Algorithms

S. Janani and R. Tamilselvi


Cluster analysis is an important human activity, whether in biology for deriving plant
and animal taxonomies, in image pattern recognition, or in earth observation
databases. It helps us organize a large number of customers with similar
characteristics into groups. Partitioning large data sets into groups by clustering is
also called data segmentation. Clustering can also be used for outlier detection,
which supports applications such as the detection of credit card fraud and the
monitoring of criminal activities in electronic commerce. Typical requirements for
clustering in data mining include scalability, the ability to deal with different types
of attributes, discovery of clusters with arbitrary shape, minimal requirements for
domain knowledge to determine input parameters, the ability to deal with noise and
outliers, insensitivity to the order of input records, and support for high
dimensionality.
Clustering methods can be classified as follows. In the partitioning method, each
object belongs to exactly one group; given the number of partitions, the algorithm
creates an initial partitioning and then improves it by moving objects from one
group to another. Partition-based algorithms include k-means clustering, PAM,
CLARA, CLARANS, clustering with genetic algorithms, and clustering with neural
networks. The hierarchical method constructs a hierarchical relationship among the
data in order to cluster it: it either starts with one large cluster and splits it into
smaller clusters, or merges similar clusters into larger ones. Algorithms of this kind
include BIRCH, CURE, ROCK, and Chameleon. The density-based method works on
the principle that data lying in a high-density region of the data space are
considered to belong to the same cluster; algorithms include DBSCAN, OPTICS,
and mean-shift. In the grid-based method, the object space is divided into a grid;
algorithms include STING and CLIQUE. The model-based method finds good
approximations of the model parameters that best fit the data; model-based
clustering algorithms can be either partitional or hierarchical, depending on the
structure. Finally, constraint-based clustering determines clusters that satisfy
preferences or constraints specified by the user.
Some of the most commonly used clustering algorithms are:
K-Means Algorithm: it uses the centroid of a cluster to represent the cluster. The
algorithm starts by choosing the number of clusters and generating random points
as the cluster centers. It then constructs the clusters by assigning each point to its
nearest cluster center and recomputes the new cluster centers, repeating these
steps until some convergence criterion is met.
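
As a minimal sketch in Python (our own illustrative code, not taken from the surveyed papers; it assumes numeric data in a NumPy array):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # convergence criterion
            break
        centers = new_centers
    return labels, centers
```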
PAM (Partitioning Around Medoids), on the other hand, uses medoids to represent
the clusters; a medoid is an object centrally located within its cluster. The objective
is to minimize the average dissimilarity of objects to their closest selected object by
exchanging selected objects with unselected objects. PAM is relatively more costly
than k-means, but at the same time it is less sensitive to outliers.
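
A hedged sketch of the medoid idea follows (illustrative only; PAM's actual build and swap phases are more involved than this greedy loop):

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances from each object to its nearest medoid.
    D is a precomputed (n, n) pairwise dissimilarity matrix."""
    return D[:, medoids].min(axis=1).sum()

def naive_pam(D, k, max_iter=50):
    """Greedy swap-based medoid search, a simplification of PAM's swap phase."""
    n = len(D)
    medoids = list(range(k))  # naive initialization
    for _ in range(max_iter):
        best = (total_cost(D, medoids), None)
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                cand = medoids[:i] + [h] + medoids[i + 1:]
                c = total_cost(D, cand)
                if c < best[0]:
                    best = (c, cand)
        if best[1] is None:  # no improving swap found
            break
        medoids = best[1]
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```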
CLARA, an extension of PAM, deals with data containing a large number of objects.
In order to reduce the complexity, it does not find representative objects for the
entire data set; instead it draws a sample of the data set and applies PAM to it.
CLARA draws multiple samples and gives the best clustering as the output.
CLARANS presents a trade-off between the cost and the effectiveness of using
samples to obtain a clustering: unlike CLARA, it does not restrict its search to a
particular subgraph. Clustering with a Genetic Algorithm is a heuristic search
technique inspired by Darwin's theory of evolution ("survival of the fittest") that
performs a multi-directional search by maintaining a population of potential
solutions and encourages information formation and exchange between these
directions. For clustering with a Neural Network (NN), a Self-Organizing Map (SOM)
is used: a grid of neurons that adapts to the topological shape of a dataset, which
allows us to visualize large datasets and identify potential clusters. Such a network
can recognize or characterize inputs it has never encountered before; this ability is
known as generalization.
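
A minimal SOM training sketch (our own illustrative code, assuming a 2-D grid of neurons and NumPy input; real SOM implementations also decay the learning rate and neighborhood width over time):

```python
import numpy as np

def train_som(X, grid=(10, 10), epochs=20, lr=0.5, sigma=2.0, seed=0):
    """Minimal Self-Organizing Map: grid neurons adapt toward the data."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # One weight vector per grid neuron, initialized randomly.
    weights = rng.random((rows, cols, X.shape[1]))
    # Precompute the (row, col) coordinate of every neuron.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for _ in range(epochs):
        for x in rng.permutation(X):
            # Best-matching unit: the neuron whose weights are closest to x.
            d = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(d.argmin(), d.shape)
            # Gaussian neighborhood around the BMU on the grid.
            grid_d = np.linalg.norm(coords - np.array(bmu), axis=2)
            h = np.exp(-(grid_d ** 2) / (2 * sigma ** 2))
            # Pull each neuron toward the input, weighted by neighborhood.
            weights += lr * h[..., None] * (x - weights)
    return weights
```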
Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH) is designed
for clustering a large amount of numeric data by combining hierarchical clustering
with other clustering methods. This helps it overcome difficulties present in
agglomerative clustering methods, such as poor scalability and the inability to undo
what was done in a previous step. It uses the concept of a clustering feature to
summarize a cluster and a clustering feature tree (CF-tree) to represent a cluster
hierarchy. Using a clustering feature, we can easily derive many useful statistics of
a cluster, such as its centroid, radius, and diameter. The algorithm consists of two
phases: in the first phase, BIRCH scans the database to build the CF-tree, and in the
second phase it applies a clustering algorithm to cluster the leaf nodes of the
CF-tree.
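
The clustering feature is commonly written as the triple CF = (N, LS, SS): the number of points, their linear sum, and their sum of squares. A small sketch of the statistics it yields (our own illustrative code):

```python
import numpy as np

def clustering_feature(points):
    """CF triple for a set of points: (count, linear sum, sum of squares)."""
    n = len(points)
    ls = points.sum(axis=0)   # linear sum, a d-dimensional vector
    ss = (points ** 2).sum()  # scalar sum of squared norms
    return n, ls, ss

def cf_stats(n, ls, ss):
    """Centroid and radius derived from the CF triple alone."""
    centroid = ls / n
    # radius^2 = average squared distance of points from the centroid.
    radius = np.sqrt(max(ss / n - centroid @ centroid, 0.0))
    return centroid, radius

def cf_merge(cf1, cf2):
    """CFs are additive, which is what makes the CF-tree cheap to maintain."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]
```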
Clustering Using Representatives (CURE) is similar to an agglomerative algorithm:
it begins with every single data object as a cluster, and the object itself is the sole
representative of the corresponding cluster. At any given stage of the algorithm, we
have a set of representative points per cluster. The distance between two
subclusters is the smallest pairwise distance between their representative points,
and hierarchical clustering proceeds by merging the closest pair of subclusters.
Once two clusters are merged, a new set of representative points is computed for
the merged cluster. The merging process continues until the pre-specified number
of clusters is obtained.
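
A sketch of the inter-cluster distance CURE uses (illustrative; the full algorithm also shrinks the representatives toward the cluster centroid, which is not shown here):

```python
import numpy as np

def cure_distance(reps_a, reps_b):
    """Smallest pairwise distance between two clusters' representative points.
    reps_a and reps_b are (m, d) and (p, d) arrays of representatives."""
    diffs = reps_a[:, None, :] - reps_b[None, :, :]
    return np.linalg.norm(diffs, axis=2).min()

def closest_pair(clusters):
    """Index pair of the two clusters to merge next."""
    best, pair = np.inf, (0, 1)
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = cure_distance(clusters[i], clusters[j])
            if d < best:
                best, pair = d, (i, j)
    return pair
```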
Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph
in which each vertex represents a data object, and an edge exists between two
vertices if one object is among the k most similar objects of the other. The edges
are weighted to reflect the similarity between objects. Chameleon first applies a
graph partitioning algorithm that minimizes the edge cut, and then uses an
agglomerative hierarchical clustering algorithm that repeatedly merges subclusters
based on their similarity, which is measured by their relative interconnectivity and
relative closeness. Chameleon is capable of producing arbitrarily shaped clusters,
but its time complexity is polynomial (quadratic in the number of objects).
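
A small sketch of the kind of k-nearest-neighbor graph Chameleon starts from (our own code; the Gaussian similarity weighting is an arbitrary choice, and the partitioning and merging phases are not shown):

```python
import numpy as np

def knn_graph(X, k):
    """Weighted adjacency matrix of a k-nearest-neighbor graph.
    An edge (i, j) exists if j is among i's k nearest neighbors (or
    vice versa); weights reflect similarity via exp(-distance)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = np.zeros((n, n))
    for i in range(n):
        # Nearest neighbors of i, skipping i itself at position 0.
        nbrs = np.argsort(dist[i])[1:k + 1]
        adj[i, nbrs] = np.exp(-dist[i, nbrs])
    # Symmetrize: keep an edge if either endpoint selected the other.
    return np.maximum(adj, adj.T)
```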
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is designed
to find non-spherical clusters. It is based on the idea that a cluster in a data space
is a contiguous region of high point density, separated from other such clusters by
contiguous regions of low point density. It requires two input parameters, Eps and
MinPts, based on which core objects are selected; a core object together with all
points density-reachable from it forms a cluster, while outliers are the points not
connected to any core point. With a spatial index it runs in about O(n log n) time,
degrading to quadratic time in the worst case.
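
A compact, brute-force DBSCAN sketch (our own code; production implementations use spatial indexes for the neighborhood queries):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Compact DBSCAN sketch. Returns labels; -1 marks noise/outliers.
    Uses a brute-force neighborhood query for clarity (O(n^2))."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)  # -1 = noise until proven otherwise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue         # already assigned, or not a core point
        # Grow a new cluster from core point i via density reachability.
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:  # j is also a core point
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```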
OPTICS (Ordering Points To Identify the Clustering Structure) produces a novel
cluster ordering of the database points with respect to their density-based
clustering structure, containing the information about every clustering level of the
data set up to a generating distance Eps. It adds two more terms to the concepts of
DBSCAN, namely core distance and reachability distance, and the technique does
not explicitly segment the data into clusters. Instead, it produces a visualization of
the reachability distances and uses this visualization to cluster the data. Similar to
DBSCAN, OPTICS processes each point once and performs one ε-neighborhood
query during this processing, giving a time complexity of about O(n log n) at best
(when a spatial index is available).
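
For illustration, scikit-learn ships an OPTICS implementation that exposes the reachability plot directly (a usage sketch; the data and parameter values here are arbitrary):

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# Two blobs of different density plus uniform background noise.
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(4, 0.8, (100, 2)),
               rng.uniform(-2, 6, (30, 2))])

opt = OPTICS(min_samples=10).fit(X)
# Reachability values in cluster order form the reachability plot;
# valleys in this curve correspond to clusters.
reachability = opt.reachability_[opt.ordering_]
print(opt.labels_[:10], reachability[:10])
```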
STING (Statistical Information Grid) is an approach in which the spatial area is
divided into rectangular cells. There are several different levels of such rectangular
cells corresponding to different resolutions, and these cells form a hierarchical
structure: the whole input dataset serves as the root node of the hierarchy, and
each cell at a higher level is partitioned to form a number of cells at the next lower
level, where the size of the leaf-level cells depends on the density of objects. STING
goes through the database once to compute the statistical parameters of the cells,
giving a linear time complexity for generating the clusters.
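
A sketch of the kind of per-cell statistics STING precomputes at the leaf level (illustrative, for 2-D points on a fixed grid; the actual method stores further parameters such as min, max, and distribution type):

```python
import numpy as np

def grid_statistics(X, bins=8):
    """Per-cell count, mean, and standard deviation over a regular 2-D grid.
    Higher levels of STING's hierarchy aggregate these leaf statistics."""
    # Map each 2-D point to its cell index on a bins x bins grid.
    lo, hi = X.min(axis=0), X.max(axis=0)
    cell = np.minimum(((X - lo) / (hi - lo) * bins).astype(int), bins - 1)
    stats = {}
    for (i, j) in {tuple(c) for c in cell}:
        pts = X[(cell[:, 0] == i) & (cell[:, 1] == j)]
        stats[(i, j)] = (len(pts), pts.mean(axis=0), pts.std(axis=0))
    return stats
```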
CLIQUE (Clustering In QUEst) is a bottom-up subspace clustering algorithm that
can find valid clusters defined by only a subset of the dimensions. It takes two
parameters, a density threshold and the number of grids. It partitions the data
space into non-overlapping rectangular units (grids) and then finds the dense
regions; using an Apriori-style approach, clusters can then be generated from all
dense subspaces.
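
A sketch of the first, one-dimensional step (finding dense units per dimension; the Apriori-style combination into higher-dimensional subspaces is not shown):

```python
import numpy as np

def dense_units_1d(X, n_grids=10, density_threshold=0.05):
    """For each dimension, the grid intervals holding more than a threshold
    fraction of the points. These 1-D dense units seed CLIQUE's bottom-up,
    Apriori-style search for dense higher-dimensional subspaces."""
    n, d = X.shape
    dense = {}
    for dim in range(d):
        counts, edges = np.histogram(X[:, dim], bins=n_grids)
        hits = np.flatnonzero(counts / n > density_threshold)
        dense[dim] = [(edges[i], edges[i + 1]) for i in hits]
    return dense
```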
The Expectation-Maximization (EM) algorithm is an extension of the k-means
algorithm that finds a maximum likelihood solution. Instead of assigning examples
to clusters to maximize the differences in means, the EM algorithm computes
probabilities of cluster membership based on one or more probability distributions.
The goal is to maximize the overall probability, or likelihood, of the data given the
clusters, which makes it robust to noisy data.
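
A compact EM sketch for a one-dimensional Gaussian mixture (our own illustrative code; library implementations such as scikit-learn's GaussianMixture are more robust):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=100, seed=0):
    """EM for a 1-D mixture of k Gaussians: soft responsibilities replace
    k-means' hard assignments."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k, replace=False)  # initial means
    var = np.full(k, x.var())             # initial variances
    pi = np.full(k, 1.0 / k)              # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = (pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
                / np.sqrt(2 * np.pi * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters to maximize expected likelihood.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return pi, mu, var
```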
