Survey of clustering algorithms

Cluster analysis, or clustering, is the art of separating data points into dissimilar groups such that the objects within each group (cluster) are similar to one another. It is an unsupervised learning technique used in many fields, such as machine learning, data mining, and bioinformatics. The technique is implemented in a variety of ways; the major families are partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, graph-based methods, and ensembles of clustering algorithms. This article gives a brief description of the methods under each category, organized by data type and method type; it is not intended to be a comprehensive treatment of all clustering algorithms.
Route map for the paper: partitioning algorithms are described in Section 2, and hierarchical algorithms are covered in Section 3. Section 4 describes the grid-based methods, and Section 5 explains the density-based algorithms. Model-based algorithms are covered in Section 6, while recent advances in clustering techniques, such as ensembles of clustering algorithms, are described in Section 7. The final section concludes this work.

2. Partitioning Methods
Partitioning algorithms divide the dataset into a given number of groups and iteratively refine the assignment of objects to groups. The methods in this approach include K-Means, the Farthest First Traversal k-center (FFT) algorithm, K-Medoids, CLARA, CLARANS, Fuzzy K-Means, K-Modes, Fuzzy K-Modes, Squeezer, K-Prototypes, COOLCAT, etc. A few of these methods are explained below.

Numerical data
The K-Means algorithm operates on numerical and binary data. It builds spherical clusters with the mean as the cluster center. Its drawbacks are that the user has to specify the number of clusters to be formed as well as the stopping criterion, and that a different ordering of the same data may produce different output clusters. The algorithm cannot handle missing data or outliers (data items dissimilar from the rest of the data). The worst-case complexity of the algorithm is O(tkN), where t is the number of iterations, k is the number of clusters, and N is the number of objects.
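As an illustration of the assign-and-update loop described above, the following is a minimal K-Means sketch in Python using NumPy. The function name, the random initialization of centers, and the fixed iteration cap are illustrative choices for this sketch, not part of any reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means: assign each object to the nearest mean, then recompute means."""
    rng = np.random.default_rng(seed)
    # Initialize the k cluster centers with k randomly chosen objects.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iters):                      # t iterations -> O(t*k*N) distance computations
        # Assignment step: distance from every object to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # simple stopping criterion: assignments unchanged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its assigned objects.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Example usage on a small synthetic dataset.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    labels, centers = kmeans(X, k=2)
    print(centers)
```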

K-Medoids, also known as Partitioning Around Medoids (PAM), first asks the user for the required number of clusters and then chooses arbitrary medoids (centers) for each cluster, which it refines iteratively. Like K-Means it builds compact clusters around its centers, but it handles outliers more effectively, at a complexity of O(N^2).

Farthest First Traversal k-center (FFT) algorithm: this method works with the variance of the data points and requires only the value of k as input.

Fuzzy K-Means is a soft-computing algorithm that allows each object to hold a membership in more than one cluster, using the mean as the evaluation criterion. It builds spherical clusters with means as centers, and its complexity is O(tkN).

CLARA (Clustering Large Applications) improves the scalability of K-Medoids by working on samples drawn from the dataset rather than on the entire dataset. Its complexity is O(Ks^2 + K(N-K)), where s is the sample size. The method still suffers from the drawbacks of K-Means.

CLARANS (Clustering Large Applications based upon RANdomized Search) draws its samples dynamically during each iteration instead of using a fixed sample. It is capable of handling outliers but is order sensitive.

Discrete data
K-Modes is similar to K-Means but replaces the mean with the mode; the methodology otherwise remains the same, so it shares the merits and demerits of K-Means. It builds spherical clusters with modes as centers, and Hamming distance may also be used with this method (a minimal sketch is given at the end of this section). The complexity of this method is O(tkN).

Fuzzy K-Modes is an extension of Fuzzy K-Means to discrete data. It suffers from the drawbacks of K-Modes, is unsuitable for noisy datasets, and cannot handle outliers effectively.

Squeezer scans the dataset exactly once and is an iterative version of K-Modes. It is efficient, with a complexity of O(kN), and is appropriate for high-dimensional data but not for noisy datasets.

COOLCAT addresses the drawback of K-Modes, its sensitivity to the starting clusters (the initial cluster modes), by using entropy as the measure of dissimilarity. COOLCAT works with an efficiency of O(N^2).

Mixed discrete and numerical data
K-Prototypes is the integration of K-Means and K-Modes and therefore handles mixed data. The complexity of this approach is O(tkN).
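To make the mode-based update of K-Modes concrete, here is a minimal sketch in Python using a simple matching (Hamming-style) dissimilarity on integer-coded categorical data. The function names and the random initialization are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def matching_dissimilarity(a, B):
    """Number of attributes on which object a disagrees with each row of B (Hamming-style)."""
    return (B != a).sum(axis=1)

def kmodes(X, k, n_iters=100, seed=0):
    """Minimal K-Modes: like K-Means, but centers are per-attribute modes of categorical data."""
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iters):
        # Assignment step: nearest mode under matching dissimilarity.
        dists = np.stack([matching_dissimilarity(m, X) for m in modes], axis=1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each mode becomes the most frequent value per attribute in its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                modes[j] = [np.bincount(col).argmax() for col in members.T]
    return labels, modes

# Example: categorical data encoded as small non-negative integers.
X = np.array([[0, 1, 1], [0, 1, 0], [2, 0, 1], [2, 0, 0], [2, 2, 1]])
labels, modes = kmodes(X, k=2)
print(labels, modes, sep="\n")
```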

HIERARCHICAL CLUSTERING
Hierarchical clustering algorithms arrange the data into groups in a tree fashion, where each node in the tree represents a cluster. The method comprises two major divisions, called the bottom-up or agglomerative approach and the top-down or divisive approach: the agglomerative approach repeatedly merges the closest clusters, while the divisive approach is top-down and works by splitting clusters at each stage. The edges of the tree are links (distances between two clusters) computed using a linkage criterion: single linkage, average linkage, or complete linkage. Single linkage is the minimum distance between two clusters and can result in a chaining problem. Complete linkage works by calculating the maximum distance between two clusters. Average linkage is the mean over all pairs of points between the two clusters and is computationally more expensive than the other two methods. Hierarchical methods are often slow, and decisions to merge and/or split clusters cannot be adapted once they have been applied; intra-cluster relations may also be lost when clusters are merged at various levels.

Hierarchical methods are popular in bioinformatics since clusters can be navigated at various levels of granularity [15, 119]. They are useful for representing protein sequence family relationships [67, 68, 118]. Eisen et al. used average linkage for clustering genes by expression pattern similarity, as assessed by Euclidean distance: for N genes, an N*N matrix containing gene-pair similarities is computed and then scanned to find the most similar genes. Regulatory and network clustering methods often involve finding cliques [87, 120, 121], and the visualization of hierarchical clusters of cliques is called 'power graphs' [96]. Next, we discuss hierarchical clustering of numerical and then discrete data.
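For illustration, the sketch below (assuming SciPy is available) performs bottom-up agglomerative clustering under each of the three linkage criteria discussed above; the synthetic data and the choice of cutting the tree into two clusters are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose groups of points in the plane.
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

# Agglomerative clustering under each linkage criterion.
for method in ("single", "average", "complete"):
    Z = linkage(X, method=method, metric="euclidean")   # (N-1) x 4 merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])               # cluster sizes per linkage
```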

Numerical
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an apt method for large databases. The algorithm works on the principle of a clustering feature tree (CF-tree), whose summary values form the basis at the start and are built up iteratively. It produces spherical clusters and scales as O(N).

CURE (Clustering Using REpresentatives) is an improvement over BIRCH that can project arbitrarily shaped clusters. It is also a single-scan algorithm using a user-specified shrinking factor, and it relies on the input number of clusters to be formed and on the shrinking factor value. It handles outliers more effectively than BIRCH, while the complexities of BIRCH and CURE remain the same.

Chameleon is applicable to mixed/all types of data, if a similarity metric is specified [93], and addresses drawbacks of CURE and ROCK. It considers the internal interconnectivity and closeness of the objects both between and within the two clusters to be merged. Its complexity is O(N^2). Spectral clustering originated in graph partitioning [122, 123]. Clustering a data matrix horizontally and vertically simultaneously is known as biclustering [88].

Discrete
Bottom-up methodology (agglomerative algorithms): a similarity matrix (adjacency matrix) is built, and the distance between two clusters is calculated from it.
ROCK assumes a similarity measure between objects and defines a 'link' between two objects whose similarity exceeds a threshold (illustrated in the sketch at the end of this subsection). Initially, each object is assigned to a separate cluster. Then, clusters are merged repeatedly according to their closeness, defined as the sum of the number of 'links' between all pairs of objects in the two clusters. ROCK has cubic complexity in N and is unsuitable for large datasets [4, 148].
LIMBO improves on the scalability of other hierarchical clustering algorithms. It builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering [94]. Its complexity is O(N log N).
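To illustrate ROCK's notion of 'links', the toy sketch below (an illustration of the concept, not ROCK itself) computes Jaccard similarities between binary objects, declares two objects neighbours when their similarity exceeds a threshold, and counts links (common neighbours) between all pairs; ROCK's cluster closeness sums such links across two clusters. The threshold value and the toy data are arbitrary.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between two binary attribute vectors."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def link_counts(X, theta=0.5):
    """neighbour[i, j] = 1 if sim(i, j) >= theta; links[i, j] = number of common neighbours."""
    n = len(X)
    sim = np.array([[jaccard(X[i], X[j]) for j in range(n)] for i in range(n)])
    neighbour = (sim >= theta).astype(int)
    np.fill_diagonal(neighbour, 0)
    links = neighbour @ neighbour        # common-neighbour counts between all pairs of objects
    return links

# Toy market-basket-style binary data.
X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
print(link_counts(X, theta=0.5))
```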

GRID-BASED CLUSTERING
Grid-based clustering algorithms first quantize the clustering space into a finite number of cells and then perform the required operations on the quantized space. Some of the grid-based clustering algorithms are the STatistical INformation Grid-based method (STING) [36], WaveCluster [33], and CLustering In QUEst (CLIQUE) [1].

Numerical
STING is an integration of grid-based and hierarchical clustering, with O(N) complexity, that partitions a dataset into rectangular cells arranged in a hierarchical manner; this arrangement facilitates top-down traversal. Each level stores statistical information such as the count, mean, and standard deviation of its cells.

WaveCluster [33, 105] uses a wavelet transformation to transform the original feature space, resulting in a transformed space in which the natural clusters in the data become distinguishable. A wavelet transform is useful because it can suppress weaker information, which yields two benefits: with less information the process is sped up, and it is effective for noise removal. WaveCluster does not require users to give the number of clusters, is applicable to low-dimensional spaces, and can detect clusters at varying levels of accuracy. Its input parameters are the number of grid cells for each dimension, the wavelet transform, and the number of times the wavelet transform is applied. The complexity of this method is O(N).

CLIQUE [1] is another grid-based clustering algorithm. It starts by finding all the dense areas in the one-dimensional spaces corresponding to each attribute and then generates the set of two-dimensional cells that might possibly be dense by looking at dense one-dimensional cells; in general, CLIQUE generates the possible set of k-dimensional cells that might be dense by looking at dense (k - 1)-dimensional cells. Unlike other clustering methods, CLIQUE produces identical results irrespective of the order in which the input records are presented. Moreover, it generates cluster descriptions in the form of DNF expressions [1] for ease of comprehension. Empirical evaluation shows that CLIQUE scales linearly with the number of instances and has good scalability as the number of attributes is increased. The complexity of this method is O(N).
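As a toy illustration of the quantization step shared by these grid-based methods (and of the one-dimensional dense units used by CLIQUE), the sketch below bins objects into a fixed grid and flags cells whose object fraction exceeds a density threshold. The grid resolution and the threshold are arbitrary choices made for this example.

```python
import numpy as np

def dense_cells(X, bins=5, density_threshold=0.1):
    """Quantize the space into a regular grid and return cells holding more than
    `density_threshold` of all objects (CLIQUE-style dense units)."""
    n, d = X.shape
    # Map each coordinate to a cell index in [0, bins - 1].
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cell_idx = np.floor((X - mins) / (maxs - mins + 1e-12) * bins).astype(int)
    cell_idx = np.clip(cell_idx, 0, bins - 1)
    # Count objects per occupied cell and keep the dense ones.
    cells, counts = np.unique(cell_idx, axis=0, return_counts=True)
    keep = counts / n > density_threshold
    return cells[keep], counts[keep]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(2, 0.3, (40, 2))])
cells, counts = dense_cells(X, bins=4, density_threshold=0.1)
print(np.c_[cells, counts])   # dense cell coordinates and their object counts
```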

DENSITY-BASED CLUSTERING
Density-based approaches use a local density criterion: generally, clusters are subspaces in which the objects are dense, separated by subspaces of low density. Advantages of many density-based algorithms include time efficiency and the ability to find clusters of arbitrary shapes, although some cannot identify clusters of varying densities. Some density-based algorithms take user-specified input parameters, but not the number of clusters k that changes the clustering, and some density-based approaches are also grid-based, since a histogram is constructed by partitioning the dataset into a number of non-overlapping regions. Density-based methods are useful in bioinformatics for finding the densest subspaces within networks, typically involving cliques [19, 65].

Numerical
DBSCAN
DBSCAN regards clusters as dense regions of objects in space that are separated by regions of low density. For each object of a cluster, the neighborhood of a given radius has to contain at least a minimum number of objects (MinPts), where the radius and MinPts are input parameters. Every object not contained in any cluster is considered noise [98]. Its main advantage is that it can discover clusters of arbitrary shapes; it is resistant to noise and provides a means of filtering for noise if desired (a minimal sketch is given at the end of this subsection). Its main drawback is the user-specified parameter values, and it is not suitable for high-dimensional data. Its complexity is O(N log N) if a spatial index is used; otherwise, it is O(N^2).

OPTICS
Both DBSCAN and OPTICS require parameters to be specified by the user that will affect the result. DBSCAN uses one global setting, whereas different clusters could require different values; this observation leads to OPTICS, which finds an ordering of the data that is consistent with DBSCAN [99]. Its complexity is O(N^2). DBSCAN and OPTICS have difficulty identifying clusters within clusters [4, 130]. For sequence clustering, OPTICS was extended into SEQOPTICS to support users in choosing parameters [10].

DENCLUE
DENCLUE differs from other density-based approaches in that it pins density to a point in the attribute space instead of an object [100]. The influence of each object within its neighborhood is modeled using an influence function, and the density function is the sum of the influence functions of all objects. Clusters are determined by identifying local maxima of the overall density function. The main advantage of DENCLUE is its ability to find arbitrarily shaped clusters; the drawback is that it has a large number of input parameters. The complexity of this method is O(N log N).
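A minimal DBSCAN-style sketch follows, as referenced above; eps and min_pts correspond to the radius and MinPts parameters, and unassigned objects are reported as noise (label -1). It is a didactic implementation, not an optimized one: it uses no spatial index and therefore runs in O(N^2).

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=5):
    """Label each object with a cluster id, or -1 for noise."""
    n = len(X)
    # Pairwise Euclidean distances (no spatial index, hence O(N^2)).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbours[i]) < min_pts:
            continue                      # not a core point; stays noise unless reached later
        # Grow a new cluster from core point i by expanding through core neighbours.
        labels[i] = cluster_id
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id    # border or core point joins the current cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbours[j]) >= min_pts:
                    queue.extend(neighbours[j])   # core points keep expanding the cluster
        cluster_id += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (40, 2)), rng.normal(2, 0.2, (40, 2)), rng.uniform(-1, 3, (5, 2))])
print(dbscan(X, eps=0.3, min_pts=4))
```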

Discrete
HIERDENC: Hierarchical Density-based Clustering
A challenge involved in applying density-based clustering to discrete datasets is that the 'cube' of attribute values has no ordering defined. The HIERDENC algorithm for hierarchical density-based clustering of discrete data offers a probabilistic basis for designing faster density-based discrete clustering algorithms. The characteristics of HIERDENC include insensitivity to the order of object input, not requiring a cut-off to get the clusters as in hierarchical clustering, not losing interesting local cluster structure, and the ability to handle outliers [97].

MULIC: Multiple Layer Incremental Clustering
MULIC is a faster simplification of HIERDENC that focuses on the multi-layered structure of special datasets. MULIC requires no user-specified parameters and does not merge clusters during clustering; any cluster mergers that may be desirable are done after the clustering of objects has finished. MULIC produces layered clusters, which has several differences from traditional hierarchical clustering, and MULIC clusters have a clear separation. MULIC results in layered (or nested) clusters of biomolecules, with each cluster corresponding to a collapsed edge.

Projected (subspace) clustering
Projected clustering is motivated by high-dimensional datasets, where clusters exist only in specific attribute subsets [133]. Clusters are subspaces of high-dimensional datasets, determined by the subset of attributes most relevant to each cluster. The values at the relevant attributes are distributed around some specific values in the cluster, while objects of other clusters are less likely to have such values. The drawback is that clustering depends on user parameters for determining the relevant attributes of each cluster; such parameters are the number of clusters or the average number of dimensions for each cluster.

CLIQUE
CLIQUE partitions the space into non-overlapping rectangular units and identifies the dense units; a unit is dense if the fraction of total objects contained in it exceeds the user-specified value [107]. User-specified parameters are the grid size and a global density threshold for clusters. CLIQUE improves on the scalability of other methods, as its complexity is O(N). It is useful for clustering high-dimensional data and is insensitive to object ordering, but it considers only hyper-rectangular clusters and projections parallel to the axes [4, 130].
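As a rough illustration of the idea, described above, that each cluster is characterized by a subset of relevant attributes, the sketch below (a simplification, not any of the algorithms above) marks an attribute as relevant to a cluster when its within-cluster spread is much smaller than its global spread. The ratio threshold and the synthetic data are arbitrary.

```python
import numpy as np

def relevant_attributes(X, labels, ratio=0.5):
    """For each cluster, flag attributes whose within-cluster std is < ratio * global std."""
    global_std = X.std(axis=0) + 1e-12
    relevance = {}
    for c in np.unique(labels):
        members = X[labels == c]
        relevance[c] = np.flatnonzero(members.std(axis=0) / global_std < ratio)
    return relevance

rng = np.random.default_rng(0)
# Cluster 0 is concentrated in attribute 0 only; cluster 1 in attribute 1 only.
X = np.vstack([
    np.c_[rng.normal(5, 0.1, 50), rng.uniform(0, 10, 50), rng.uniform(0, 10, 50)],
    np.c_[rng.uniform(0, 10, 50), rng.normal(2, 0.1, 50), rng.uniform(0, 10, 50)],
])
labels = np.array([0] * 50 + [1] * 50)
print(relevant_attributes(X, labels))   # attribute 0 is relevant to cluster 0, attribute 1 to cluster 1
```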

Projected clustering may distinguish the center of a cluster based on higher density or on the relevant attributes [107].

STIRR
STIRR looks for relationships between all attribute values in a cluster [102]. Two sets of attribute values, one with positive and another with negative weights, define two clusters. STIRR is sensitive to object ordering and lacks a definite convergence; the notion of weights is non-intuitive, and several parameters are user-specified.

CACTUS
CACTUS uses a minimum size for the relevant attribute sets and assumes that a cluster is identified by a unique set of attribute values that seldom occur in other clusters [101]. This assumption may be unnatural for clustering many real-world datasets. CACTUS is a scalable algorithm, but it may return too many clusters [103] and has difficulty finding clusters within clusters [4].

CLOPE
CLOPE uses a heuristic of increasing the height-to-width ratio of the cluster histogram [104]. It is fast and scalable to high-dimensional datasets, although the accuracy of its results may suffer. The complexity of the algorithm is O(kdN).

CLICK
CLICK creates a graph representation in which vertices are discrete values and an edge is a co-occurrence of values in an object. A cluster is a k-partite maximal clique such that most pairs of vertices are connected by an edge. CLICK is scalable to the dataset size, but it may return too many clusters or too many outliers, and the final detected clusters are often incomplete [103].

MODEL-BASED CLUSTERING
Model-based clustering assumes that objects match a model, which is often a statistical distribution; the process then aims to cluster objects such that they match the distribution. The model may be user-specified as a parameter, and the model may change during the process. Building models is an oversimplification; user assumptions may be false, and then the results will be inaccurate. Another disadvantage of model-based clustering (especially neural networks) is slow processing time on large datasets. In bioinformatics, model-based clustering methods integrate background knowledge into the clustering of gene expression data, protein structures, and sequences [134, 135].
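As a minimal illustration of fitting objects to a statistical model, the following sketch uses a Gaussian mixture, one common choice of model, and assigns each object to its most probable component. The use of scikit-learn and the two-component setup are assumptions of this example, not something prescribed by the methods discussed here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Data drawn from two Gaussians; the "model" is a 2-component Gaussian mixture.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)     # each object goes to its most probable mixture component
print(gmm.means_)               # estimated component means (the fitted model)
print(np.bincount(labels))      # cluster sizes
```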

Numerical
Self-Organizing Maps
Self-Organizing Maps (SOMs) involve neural networks (NNs), which resemble processing that occurs in the brain. NNs and SOM clustering involve several layers of units that pass information between one another in a weighted manner. Several units compete for each object, and the unit that is closest to the object becomes the winning unit; SOMs assume that the winning units will eventually learn the correct cluster structure (a minimal sketch of this competitive update is given at the end of this section). NNs can handle binary data, can model nonlinear relationships between attributes, and can handle dependent attributes, but discrete attributes are hard to handle. NNs have a number of drawbacks: they do not present an easily understandable model, being more of a 'black box' that delivers results without explaining how the results were derived, and classifying a result into multiple clusters is done by setting arbitrary value thresholds for discriminating clusters. The complexity of the method is O(Nd^2).

Discrete
COBWEB
COBWEB is a conceptual clustering method. It creates a hierarchical clustering in the form of a classification tree and integrates observations incrementally into the existing tree by classifying each observation along a path of best-matching nodes. In a COBWEB classification tree each node refers to a concept and contains the probability of the concept and the probabilities of the attribute-value pairs, which apply to the objects classified under that node. Sibling nodes at a classification tree level form a partition [108]. A classification tree differs from a decision tree: decision trees label branches rather than nodes and use logical rather than probabilistic descriptions. A benefit of COBWEB is that it can adjust the number of clusters in a partition without the user specifying this input parameter. A drawback is that it may assume correlated attributes to be independent. The complexity of the method is O(N^2).

Mixed Discrete and Numerical
BILCOM (Empirical Bayesian)
Model-based methods for gene expression clustering often adopt an empirical Bayesian approach, such as Bi-level clustering of Mixed Discrete and Numerical Biomedical Data (BILCOM) [109], achieving scalability and accuracy [134]. Model-based clustering can find arbitrarily shaped gene expression clusters [16, 18, 47, 136-139, 149]. For protein sequence clustering, Brown et al. used mixture densities to estimate amino-acid preferences within known subfamily clusters.

AutoClass
AutoClass is a clustering algorithm for mixed data types, which uses a Bayesian method for determining the optimal classes based on prior distributions [110]. AutoClass investigates different numbers of clusters, which are not user specified, given a prior distribution for each attribute symbolizing the prior beliefs of the user. It changes the classifications of objects in clusters and changes the means and variances of the attributes' distributions in each cluster. AutoClass finds the most likely classifications of objects in clusters, and the output is a mixture of several likely answers. Drawbacks include that users have to specify the model spaces to be searched, that wrong models may produce wrong results, and that AutoClass can be slow. The complexity of the method is O(kd^2Nt).
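Returning to the Self-Organizing Map update described under 'Numerical' above, the following is a minimal competitive-learning sketch: each object is presented in turn, the closest unit wins, and the winning unit's weight vector is nudged toward the object. A real SOM would also update neighbouring units on a map grid and decay the learning rate; those details, and the chosen unit count and learning rate, are simplifying assumptions of this sketch.

```python
import numpy as np

def som_toy(X, n_units=4, epochs=20, lr=0.3, seed=0):
    """Toy self-organizing map: competitive learning without a neighbourhood function."""
    rng = np.random.default_rng(seed)
    weights = X[rng.choice(len(X), size=n_units, replace=False)].astype(float)
    for _ in range(epochs):
        for x in rng.permutation(X):
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))  # closest unit wins
            weights[winner] += lr * (x - weights[winner])            # move winner toward the object
    # Each object is then labelled by its nearest unit.
    labels = np.argmin(np.linalg.norm(X[:, None, :] - weights[None, :, :], axis=2), axis=1)
    return labels, weights

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
labels, weights = som_toy(X, n_units=2)
print(weights)
```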

SVM Clustering
SVMs provide a method for supervised learning, but they have also been adapted for clustering; SVM-based clustering does not use prior knowledge of object classifications [111]. Initially, every object in the dataset is randomly labelled and a binary SVM classifier is trained. Then the lowest-confidence classifications, those objects with confidence factor values beyond some threshold, repeatedly have their labels switched to the other class label, and the SVM is re-trained after each re-labeling on the lowest-confidence objects, until the labels stabilize. The repetition of this process improves the classification accuracy and limits local-minima traps (a sketch of this loop is given at the end of this section). The complexity of the method is O(N^1.8).

Super Paramagnetic Clustering (SPC)
SPC is similar to entropy-based COOLCAT: when the temperature or entropy increases, the system becomes less stable and the clusters become smaller [70]. SPC is robust to edge removal but sensitive to edge addition, and it is weak at complex prediction [65]. The complexity of the method is O(N^2).

These methods are often sensitive to user-specified parameter values and often slow; exceptions are SEQOPTICS and MULIC.

GRAPH-BASED CLUSTERING
Graph-based clustering methods have been applied for complex prediction and to sequence networks.

Molecular Complex Detection (MCODE)
MCODE is designed to detect subnetworks with many edges [19]. MCODE complex prediction had high correctness, but few complexes were retrieved. MCODE is sensitive to network alterations, that is, edge removal and addition, and to its threshold parameters. The complexity of the method is O(Nd^3).
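The sketch below illustrates the unsupervised SVM loop described above: random initial labels, training a binary classifier, then repeatedly flipping the least-confident labels and re-training until no object falls below the confidence threshold. The use of scikit-learn's SVC, the RBF kernel, the confidence threshold, and the iteration cap are assumptions of this sketch; the process is not guaranteed to converge, so the loop is capped.

```python
import numpy as np
from sklearn.svm import SVC

def svm_cluster(X, conf_threshold=0.2, max_rounds=30, seed=0):
    """Two-cluster SVM-based clustering sketch: random labels, then iterative re-labeling."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=len(X))           # random initial labelling
    for _ in range(max_rounds):
        if len(np.unique(labels)) < 2:                 # SVC needs both classes present
            break
        clf = SVC(kernel="rbf", gamma="scale").fit(X, labels)
        confidence = np.abs(clf.decision_function(X))  # distance from the separating surface
        low = confidence < conf_threshold              # least-confident objects this round
        if not low.any():                              # nothing left to re-label: stabilized
            break
        labels[low] = 1 - labels[low]                  # switch them to the other class label
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(2, 0.3, (40, 2))])
print(svm_cluster(X))
```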

Restricted Neighborhood Search Clustering (RNSC)
RNSC is similar to ROCK and Chameleon, considering the number of edges within and between clusters [112]. RNSC is relatively robust to parameter values and edge addition, but sensitive to edge removal, and it gives many mini-clusters of small sizes [65]. The complexity of the method is O(N^2).

Markov Clustering (MCL)
MCL is similar to projected clustering, which often improves results on high-dimensional datasets. It simulates a flow on the network, finding clusters as high-flow regions separated by no-flow boundaries [113], and satisfies the point proportion admissibility requirement [12] (a toy sketch appears at the end of this section). MCL is robust to network alterations, both edge removal and addition, and an overall comparison showed MCL's superiority for finding complexes [65]. The complexity of the method is O(N^3).

Other sequence clustering
Tribe MCL clusters sequences into families using BLAST similarity searches [113]. SPC was also applied to sequences, improving sequence clustering over Tribe MCL [70]. CD-HIT removes redundant sequences. The BAG algorithm uses graph-theoretic properties to guide cluster splitting and reduce errors [142].
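To illustrate MCL's flow simulation, the sketch below alternates expansion (matrix powering, which spreads flow along paths) and inflation (element-wise powering plus column normalization, which strengthens high-flow regions) on a toy adjacency matrix. The inflation parameter, the convergence tolerance, and the way clusters are read off the converged matrix are assumptions of this minimal sketch.

```python
import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, max_iters=100, tol=1e-6):
    """Minimal Markov Clustering on a symmetric adjacency matrix."""
    A = adjacency.astype(float) + np.eye(len(adjacency))   # self-loops stabilize the iteration
    M = A / A.sum(axis=0)                                   # column-stochastic transition matrix
    for _ in range(max_iters):
        M_prev = M
        M = np.linalg.matrix_power(M, expansion)            # expansion: flow spreads along paths
        M = M ** inflation                                   # inflation: boost strong flows
        M = M / M.sum(axis=0)                                # re-normalize columns
        if np.abs(M - M_prev).max() < tol:
            break
    # Attractor rows with non-zero mass define the clusters (nodes flowing into that attractor).
    clusters = [np.flatnonzero(row > 1e-6) for row in M if row.max() > 1e-6]
    unique = {tuple(c) for c in clusters}                    # deduplicate identical clusters
    return [list(c) for c in unique]

# Toy graph: two triangles joined by a single edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(mcl(A))    # typically two clusters: {0, 1, 2} and {3, 4, 5}
```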