
Cluster analysis

From Wikipedia, the free encyclopedia

(Redirected from Spectral clustering)

Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics. Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis.

The result of a cluster analysis shown as the coloring of the squares into three clusters.

Contents

1 Types of clustering
2 Distance measure
3 Hierarchical clustering
3.1 Agglomerative hierarchical clustering
3.2 Concept clustering
4 Partitional clustering
4.1 K-means and derivatives
4.1.1 k-means clustering
4.1.2 Fuzzy clustering
4.1.3 QT clustering algorithm
4.2 Locality-sensitive hashing
4.3 Graph-theoretic methods
5 Spectral clustering
6 Applications
6.1 Biology
6.2 Medicine, Psychology and Neuroscience
6.3 Market research
6.4 Educational research
6.4.1 Example of cluster analysis in educational research
6.4.2 Common cluster techniques in educational research
6.4.3 Advantages of cluster analysis
6.4.4 Disadvantages of cluster analysis
6.4.5 Solution to problems of cluster analysis in educational research
6.5 Other applications
7 Evaluation of clustering
7.1 Internal criterion of quality
7.2 External criterion of quality
7.3 Relative criterion of quality
8 Algorithms
9 See also
10 References
11 Further reading
12 External links


Types of clustering

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms are usually either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters; divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in hierarchical clustering.

Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.

Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data; these methods can thus ignore irrelevant attributes. The general problem is also known as correlation clustering, while the special case of axis-parallel subspaces is also known as two-way clustering, co-clustering or biclustering: in these methods not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. Such methods usually do not work with arbitrary feature combinations as general subspace methods do, but this special case deserves attention due to its applications in bioinformatics.

Many clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.

Distance measure

An important step in most clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (x = 1, y = 0) and the origin (x = 0, y = 0) is always 1 according to the usual norms, but the distance between the point (x = 1, y = 1) and the origin can be 2, √2 or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.
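As a concrete check of the example above, the three norms can be evaluated directly. This is a minimal illustration in plain Python (the helper name is our own):

    # Distances from the origin under the 1-norm, 2-norm and infinity-norm,
    # showing how the choice of norm changes "how far apart" points are.
    def norm_distance(p, q, norm):
        diffs = [abs(a - b) for a, b in zip(p, q)]
        if norm == 1:                    # Manhattan / taxicab distance
            return sum(diffs)
        if norm == 2:                    # Euclidean distance
            return sum(d * d for d in diffs) ** 0.5
        if norm == float("inf"):         # maximum norm
            return max(diffs)
        raise ValueError("unsupported norm")

    origin = (0, 0)
    for point in [(1, 0), (1, 1)]:
        print(point, [norm_distance(point, origin, n) for n in (1, 2, float("inf"))])
    # (1, 0) -> [1, 1.0, 1]
    # (1, 1) -> [2, 1.4142135623730951, 1]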

Common distance functions:

- The Euclidean distance (also called distance as the crow flies or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.
- The Manhattan distance (aka taxicab norm or 1-norm).
- The maximum norm (aka infinity norm).
- The Mahalanobis distance corrects data for different scales and correlations in the variables.
- The angle between two vectors can be used as a distance measure when clustering high-dimensional data; see Inner product space.
- The Hamming distance measures the minimum number of substitutions required to change one member into another.

Another important distinction is whether the clustering uses symmetric or asymmetric distances. Many of the distance functions listed above have the property that distances are symmetric (the distance from object A to B is the same as the distance from B to A); a true metric gives symmetric measures of distance. In other applications (e.g., sequence-alignment methods; see Prinzie & Van den Poel (2006)), this is not the case.

Hierarchical clustering

Main article: Hierarchical clustering

Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations.

Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together, or divisive, in which one starts at the root and recursively splits the clusters.

Any non-negative-valued function may be used as a measure of similarity between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations.

Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
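This cut-at-a-height operation can be sketched with SciPy's hierarchical clustering routines. A minimal illustration, in which the six 2-D points and the two cut heights are invented for the example:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Six invented 2-D observations standing in for {a, b, c, d, e, f}.
    X = np.array([[0.0, 0.0], [1.0, 0.1], [1.1, 0.0],
                  [4.0, 4.0], [4.1, 4.1], [9.0, 0.0]])

    # Build the dendrogram bottom-up with single linkage and Euclidean distance.
    Z = linkage(X, method="single", metric="euclidean")

    # Cutting the tree low gives many small clusters; cutting higher gives
    # fewer, coarser clusters.
    print(fcluster(Z, t=0.5, criterion="distance"))  # fine-grained clustering
    print(fcluster(Z, t=3.0, criterion="distance"))  # coarser clustering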

Agglomerative hierarchical clustering

For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric.

Raw data

The hierarchical clustering dendrogram would be as such:

Traditional representation

This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.

Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).

Suppose we have merged the two closest elements b and c. We now have the clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore need to define the distance between two clusters.
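The distance-matrix bookkeeping described above can be sketched directly. This is a toy illustration (the helper is our own, not the article's code); the matrix-update rule shown keeps the minimum of the merged rows, i.e. single linkage:

    # One merge step of naive agglomerative clustering over a cached,
    # symmetric distance matrix (list of lists). Merging clusters i and j
    # collapses their rows/columns, keeping the minimum distance to every
    # other cluster (single linkage).
    def merge_closest(clusters, dist):
        # Find the closest pair of clusters.
        pairs = [(dist[i][j], i, j) for i in range(len(dist))
                 for j in range(i + 1, len(dist))]
        _, i, j = min(pairs)
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]
        # Update the cached distances: row i becomes the element-wise minimum.
        for k in range(len(dist)):
            dist[i][k] = dist[k][i] = min(dist[i][k], dist[j][k])
        dist[i][i] = 0.0
        # Drop row and column j.
        del dist[j]
        for row in dist:
            del row[j]
        return clusters, dist

    # Usage: clusters = [{"a"}, {"b"}, ...] with dist holding the pairwise
    # distances; call merge_closest repeatedly until one cluster remains.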

Usually the distance between two clusters A and B is one of the following:

- The maximum distance between elements of each cluster (also called complete-linkage clustering): max { d(x, y) : x in A, y in B }
- The minimum distance between elements of each cluster (also called single-linkage clustering): min { d(x, y) : x in A, y in B }
- The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): the sum of d(x, y) over all x in A and y in B, divided by |A| |B|
- The sum of all intra-cluster variance.
- The increase in variance for the cluster being merged (Ward's criterion).
- The probability that candidate clusters spawn from the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).

Concept clustering

Another variation of the agglomerative clustering approach is conceptual clustering.

Partitional clustering

K-means and derivatives

k-means clustering

Main article: k-means clustering

The k-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster, that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.

Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where zi = (xi + yi) / 2 for each dimension i.
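The three pairwise linkage criteria listed above, and the centroid of the example just given, can be written down in a few lines. A minimal sketch (the function names are our own):

    from itertools import product

    def euclidean(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # Linkage criteria: distance between clusters A and B (collections of points).
    def complete_linkage(A, B):           # maximum pairwise distance
        return max(euclidean(x, y) for x, y in product(A, B))

    def single_linkage(A, B):             # minimum pairwise distance
        return min(euclidean(x, y) for x, y in product(A, B))

    def average_linkage(A, B):            # mean pairwise distance (UPGMA)
        return sum(euclidean(x, y) for x, y in product(A, B)) / (len(A) * len(B))

    # Centroid: arithmetic mean in each dimension, as in the example above.
    def centroid(cluster):
        return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))

    print(centroid([(1.0, 2.0, 3.0), (3.0, 4.0, 5.0)]))  # (2.0, 3.0, 4.0)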

The algorithm steps are[1]:

- Choose the number of clusters, k.
- Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
- Assign each point to the nearest cluster center, where "nearest" is defined with respect to one of the distance measures discussed above.
- Recompute the new cluster centers.
- Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).

The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments (the k-means++ algorithm addresses this problem by seeking to choose better starting clusters). It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. Another disadvantage is the requirement for the concept of a mean to be definable, which is not always the case; for such datasets the k-medoids variants are appropriate. An alternative, using a different criterion for which points are best assigned to which centre, is k-medians clustering.

Fuzzy clustering

Main article: Fuzzy clustering

In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of the cluster. An overview and comparison of different fuzzy clustering algorithms is available.[2]

Any point x has a set of coefficients wk(x) giving the degree of being in the k-th cluster. With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster. The degree of belonging, wk(x), is related inversely to the distance from x to the cluster center as calculated on the previous pass. It also depends on a parameter m that controls how much weight is given to the closest center.

The fuzzy c-means algorithm is very similar to the k-means algorithm:[3]

- Choose a number of clusters.
- Assign randomly to each point coefficients for being in the clusters.
- Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
  - Compute the centroid for each cluster, using the formula above.
  - For each point, compute its coefficients of being in the clusters, using the formula above.

The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights. The expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas (partial membership in classes), and it is generally preferred to fuzzy c-means. Fuzzy c-means has been a very important tool for image processing in clustering objects in an image. In the 1970s, mathematicians introduced the spatial term into the FCM algorithm to improve the accuracy of clustering under noise.[4]
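Both loops just described are short enough to sketch in full. The following is a minimal illustration rather than a reference implementation: the data, the seeding, and the membership update (the standard inverse-distance rule with fuzzifier m) are our own choices, and no empty-cluster handling is attempted.

    import numpy as np

    def k_means(X, k, iters=100, seed=0):
        """Minimal k-means: assign to nearest center, recompute means.
        Toy sketch; Euclidean distance assumed, no empty-cluster handling."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assignment step: each point goes to its nearest center.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Update step: each center becomes the mean of its points.
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break  # assignments have stabilized
            centers = new_centers
        return labels, centers

    def fuzzy_c_means(X, k, m=2.0, iters=100, eps=1e-5, seed=0):
        """Minimal fuzzy c-means: memberships fall off with distance,
        controlled by the fuzzifier m; centroids are weighted means."""
        rng = np.random.default_rng(seed)
        w = rng.random((len(X), k))
        w /= w.sum(axis=1, keepdims=True)          # random initial coefficients
        for _ in range(iters):
            wm = w ** m
            centers = (wm.T @ X) / wm.sum(axis=0)[:, None]
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            new_w = d ** (-2.0 / (m - 1.0))        # inversely related to distance
            new_w /= new_w.sum(axis=1, keepdims=True)
            delta = np.abs(new_w - w).max()
            w = new_w
            if delta < eps:                        # change no more than threshold
                break
        return w, centers

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    print(k_means(X, 2)[0])        # hard labels, e.g. [0 0 1 1]
    print(fuzzy_c_means(X, 2)[0])  # soft memberships; each row sums to 1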

QT clustering algorithm

QT (quality threshold) clustering (Heyer, Kruglyak, Yooseph, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times.

The algorithm is:

- The user chooses a maximum diameter for clusters.
- Build a candidate cluster for each point by iteratively including the point that is closest to the group, until the diameter of the cluster surpasses the threshold.
- Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration. (It remains to be clarified what happens if more than one cluster has the maximum number of points.)
- Recurse with the reduced set of points.

The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section above, which describes various distance metrics between clusters).

Locality-sensitive hashing

Locality-sensitive hashing can be used for clustering. Feature space vectors are sets, and the metric used is the Jaccard distance. The feature space can be considered high-dimensional. The MinHash min-wise independent permutations LSH scheme is then used to put similar items into buckets. With just one set of hashing methods, there are only clusters of very similar elements. By seeding the hash functions several times (e.g. 20), it is possible to get bigger clusters.[5]
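A minimal sketch of the MinHash bucketing idea just described; the helper names are our own, the hash seeding is simplified, and real systems typically band the signature rather than require full agreement:

    import random
    from collections import defaultdict

    def minhash_signature(item_set, num_hashes=20, seed=0):
        # One min-hash per seeded hash function; sets with low Jaccard
        # distance are likely to agree on many signature positions.
        rng = random.Random(seed)
        masks = [rng.getrandbits(64) for _ in range(num_hashes)]
        return tuple(min(hash(x) ^ mask for x in item_set) for mask in masks)

    def lsh_buckets(named_sets, num_hashes=20):
        # Group items whose full signatures collide into the same bucket.
        # Requiring agreement on all hashes keeps only very similar elements
        # together; using fewer hashes per bucket key yields bigger clusters.
        buckets = defaultdict(list)
        for name, s in named_sets.items():
            buckets[minhash_signature(s, num_hashes)].append(name)
        return buckets

    docs = {"d1": {"a", "b", "c"}, "d2": {"a", "b", "c"}, "d3": {"x", "y"}}
    print(list(lsh_buckets(docs).values()))  # [['d1', 'd2'], ['d3']]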

Graph-theoretic methods

Formal concept analysis is a technique for generating clusters (called formal concepts) of objects and attributes, given a bipartite graph representing the relation between the objects and attributes. This technique was introduced by Rudolf Wille in 1984. Other methods for generating overlapping clusters (a cover rather than a partition) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970).

Spectral clustering

See also: Kernel principal component analysis

Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions. Given a set of data points A, the similarity matrix may be defined as a matrix S where S_ij represents a measure of the similarity between points i and j.

One such technique is the Normalized Cuts algorithm by Shi-Malik, commonly used for image segmentation. It partitions points into two sets (S1, S2) based on the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix

L = I − D^{−1/2} S D^{−1/2}

of S, where D is the diagonal matrix with D_ii = Σ_j S_ij.

This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S1, and the rest in S2. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.

A related algorithm is the Meila-Shi algorithm, which takes the eigenvectors corresponding to the k largest eigenvalues of the matrix P = S D^{−1} for some k, and then invokes another algorithm (e.g. k-means) to cluster points by their respective k components in these eigenvectors.
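The median-split step can be sketched in a few lines of linear algebra, assuming a symmetric similarity matrix S is already given; this illustrates the idea rather than reproducing the Shi-Malik implementation:

    import numpy as np

    def spectral_bipartition(S):
        # Split points into two sets using the eigenvector of the
        # second-smallest eigenvalue of L = I - D^{-1/2} S D^{-1/2}.
        d = S.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
        eigvals, eigvecs = np.linalg.eigh(L)   # eigh: L is symmetric
        v = eigvecs[:, 1]                      # second-smallest eigenvalue's vector
        m = np.median(v)
        return v > m                           # True -> S1, False -> S2

    # Two obvious blocks: points {0,1} are similar, points {2,3} are similar.
    S = np.array([[1.0, 0.9, 0.1, 0.1],
                  [0.9, 1.0, 0.1, 0.1],
                  [0.1, 0.1, 1.0, 0.9],
                  [0.1, 0.1, 0.9, 1.0]])
    print(spectral_bipartition(S))  # e.g. [False False  True  True]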

Applications

Biology

In biology clustering has many applications:

In imaging, data clustering may take different form based on the data dimensionality. For example, the SOCR EM Mixture model segmentation activity and applet (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture) shows how to obtain point, region or volume classification using the online SOCR computational libraries.

In the fields of plant and animal ecology, clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.

In computational biology and bioinformatics:

- In transcriptomics, clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes). Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High-throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.
- In sequence analysis, clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and in evolutionary biology in general; see evolution by gene duplication.
- In high-throughput genotyping platforms, clustering algorithms are used to automatically assign genotypes.

In QSAR and molecular modeling studies, as also in chemoinformatics.

Medicine, Psychology and Neuroscience

In medical imaging, such as PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three-dimensional image. In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate at which a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today.

Market research

Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers. Applications include:

- Segmenting the market and determining target markets
- Product positioning
- New product development
- Selecting test markets (see: experimental techniques)

Educational research

In educational research analysis, the data for clustering can be students, parents, sex or test score. Clustering is an important method for understanding the utility of grouping or streaming[6] in educational research.[7] It aims at discovering any meaningful clusters of units based on measures on a set of response variables.[8] Cluster analysis in educational research can be used for data exploration,[7] cluster confirmation[8] and hypothesis testing.[8] Data exploration is used when there is little information about which schools or students will be grouped together. Cluster confirmation is used for confirming previously reported cluster results, and hypothesis testing is used for examining an assumed cluster structure.[8]

Example of cluster analysis in educational research

In 2002, Hattie used cluster analysis in the project 'Schools Like Mine'[9] to compare students' achievement in literacy and numeracy by the type of school they attended. 2,707 majority and minority students in New Zealand were classified into different clusters according to school size, region, decile, student ethnicity, size of civil jurisdiction and socioeconomic status for comparison. The clusters in this research were calculated across five dimensions, including minority and rurality. By clustering schools, Hattie suggested that school types had no significant relation with the performance of schools; the results showed that solely using socioeconomic status to describe schools is inadequate. All schools were placed into one of twenty clusters that are used in the asTTle software as a basis of student achievement comparison.

Common cluster techniques in educational research

All cluster techniques have two basic concerns: firstly, the measurement of similarity between individual profiles, and secondly, the use of that measure to form the groups or clusters. A number of formulae are available for measuring this similarity. Wishart[10] found the error sum of squares, a measure of dissimilarity computed from the cumulative frequencies of all variables measured, to be one of the most successful coefficients for continuous data. Other similarity coefficients are available, and the one chosen will depend upon the type of data gained.

Brennan[10] described Iterative Relocation as the most important cluster technique in behavioral and educational research. It has been adopted at Lancaster to create typologies of pupils based on personality and behavioral items to identify types of students,[11] and to isolate the skills considered to be important for certain grades of technologist in industry. The number of groups needs to be decided first; this is often an arbitrary decision, and the initial groupings are random. The analysis then proceeds by computing the group profile (or group centroid) of each group, and each individual is compared with each of the group centroids.

This sequence of comparison and relocation continues until all individuals are in the group whose central profile is most similar to their own. When the relocations that alter the composition of groups, and the recalculation of the group centroids, are completed, a new iteration cycle commences. The solution is then said to be stable, at which point the analysis at that level is complete. The analysis is then continued by reducing the number of groups by one (N−1). This is achieved by a fusion process whereby a measure of dissimilarity (error sum of squares) between all pairs of group centroids is calculated again, and the two most similar groups are combined to reduce the number of groups by one. The group centroids are recalculated, and the comparison and relocation steps are repeated until the solution is stable for the N−1 group level. This process can continue until the two-group level is reached.

In k-means clustering methods,[9] given a number of analysis units, each of which is described by a set of characteristics and attributes, cluster analysis suggests how groups of units can be determined such that units within groups are similar in some respect and unlike those from other groups. To conduct a k-means analysis,[12] the method operates through an iterative partitioning k-means algorithm, where k denotes the number of clusters.

Advantages of cluster analysis

Frisvad of BioCentrum-DTU said that cluster analysis is a good way for a quick review of data, especially if the objects are classified into many groups.[22] Cluster analysis provides a simple profile of individuals. It is easy for users to assign or nominate themselves into a cluster they would most like to compare with in a school cluster database,[9] because each cluster is clearly named with understandable terms. For example, in 'Schools Like Mine',[9] 23 clusters of schools with different properties were clearly clustered.

Disadvantages of cluster analysis

An object can be assigned to one cluster only: because once a school is assigned to a cluster, it cannot be assigned to another one, this limits the exploratory power of cluster analysis. Some schools may have more than one significant property or fall on the edge of two clusters. In the 'Schools Like Mine' example, schools are automatically assigned into the first twenty-two clusters; however, if schools want to compare themselves with integrated schools, they will have to manually assign themselves into cluster twenty-three.

Cluster analysis has to be used very carefully in classifying schools into groups, because results are heavily influenced by partial sampling, the choice of clustering criteria and compositional variables (for example school size, region, student ethnicity, size of civil jurisdiction and social economic status in this example), as well as cluster labeling. Data-driven clustering may not represent reality: schools may be described as ineffective when in fact many are doing well in some unique aspects that are not sufficiently illustrated by the clusters formed. Clustering may therefore have detrimental effects on teachers who work in low-decile schools, students who are educated in them, and parents who support them, by telling them the schools are classified as ineffective.[9]

In k-means clustering, the number of clusters needs to be specified at the start, and it often requires several analyses before the number of clusters can be determined.[13] The method can also be very sensitive to the choice of initial cluster centres.[13]

Solution to problems of cluster analysis in educational research

Hattie stated that although cluster analysis provides an easy way to make comparisons between schools, no particular variable should be taken as the "short cut" for judging school quality.[9] In order to overcome the unit reassignment issue, some researchers suggest a nonhierarchical cluster method which allows for the reassignment of units from one cluster to another.[8]

Nevertheless, like assigning schools into different bands, clustering may bring about unnecessary comparisons and inappropriate discriminations among schools, thereby adversely affecting students.

Other applications

Social network analysis: In the study of social networks, clustering may be used to recognize communities within large groups of people.

Image segmentation: Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.

Data mining: Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as World Wide Web pages, into genres.

Search result grouping: In the process of intelligent grouping of files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. There are currently a number of web-based clustering tools such as Clusty.

Slippy map optimization: Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This makes the map both faster and reduces the amount of visual clutter.

Software evolution: Clustering is useful in software evolution as it helps to reduce legacy properties in code by reforming functionality that has become dispersed. It is a form of restructuring and hence a way of direct preventative maintenance.

Grouping of shopping items: Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products (eBay doesn't have the concept of a SKU).

Recommender systems: Recommender systems are designed to recommend new items based on a user's tastes. They sometimes use clustering algorithms to predict a user's preferences based on the preferences of other users in the user's cluster.[14]

Climatology: To find weather regimes or preferred sea level pressure atmospheric patterns.[15]

Petroleum geology: Cluster analysis is used to reconstruct missing bottom-hole core data or missing log curves in order to evaluate reservoir properties.

Physical geography: The clustering of chemical properties in different sample locations.

IMRT segmentation: Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based radiation therapy.

Crime analysis: Cluster analysis can be used to identify areas where there are greater incidences of particular types of crime. By identifying these distinct areas or "hot spots" where a similar crime has happened over a period of time, it is possible to manage law enforcement resources more effectively.

Mathematical chemistry: To find structural similarity, etc.; for example, 3000 chemical compounds were clustered in the space of 90 topological indices.[16? see references]

Evolutionary algorithms: Clustering may be used to identify different niches within the population of an evolutionary algorithm, so that reproductive opportunity can be distributed more evenly amongst the evolving species or subspecies.

Evaluation of clustering

Evaluation of clustering is sometimes referred to as cluster validation. There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. These measures are usually tied to the type of criterion being considered in assessing the quality of a clustering method.[16]

Internal criterion of quality

Clustering evaluation methods that adhere to an internal criterion assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters. One drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications.[16]

The following methods can be used to assess the quality of clustering algorithms based on an internal criterion:

Davies–Bouldin index

The Davies–Bouldin index can be calculated by the following formula:

DB = (1/n) Σ_{i=1..n} max_{j≠i} ( (σ_i + σ_j) / d(c_i, c_j) )

where n is the number of clusters, c_x is the centroid of cluster x, σ_x is the average distance of all elements in cluster x to centroid c_x, and d(c_i, c_j) is the distance between centroids c_i and c_j. Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces the collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm based on this criterion.

Dunn index

The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance and the maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:[17]

D = min_{1≤i<j≤n} d(i, j) / max_{1≤k≤n} d'(k)

where d(i, j) represents the distance between clusters i and j, and d'(k) measures the intra-cluster distance of cluster k. The inter-cluster distance d(i, j) between two clusters may be any number of distance measures, such as the distance between the centroids of the clusters. Similarly, the intra-cluster distance d'(k) may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster k. Since an internal criterion seeks clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with a high Dunn index are more desirable.
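Both indices are direct to compute from a labeled clustering. A minimal sketch under the definitions above, using centroid distances as the inter-cluster measure for both indices (other choices are possible, as the text notes):

    import numpy as np

    def davies_bouldin(clusters):
        # clusters: list of (n_i, dim) arrays; smaller index is better.
        cents = [c.mean(axis=0) for c in clusters]
        sigma = [np.linalg.norm(c - m, axis=1).mean() for c, m in zip(clusters, cents)]
        n = len(clusters)
        ratios = [max((sigma[i] + sigma[j]) / np.linalg.norm(cents[i] - cents[j])
                      for j in range(n) if j != i) for i in range(n)]
        return sum(ratios) / n

    def dunn(clusters):
        # Minimal centroid distance over maximal cluster diameter;
        # larger index is better.
        cents = [c.mean(axis=0) for c in clusters]
        n = len(clusters)
        inter = min(np.linalg.norm(cents[i] - cents[j])
                    for i in range(n) for j in range(i + 1, n))
        intra = max(np.linalg.norm(a - b) for c in clusters for a in c for b in c)
        return inter / intra

    clusters = [np.array([[0., 0.], [0., 1.]]), np.array([[10., 0.], [10., 1.]])]
    print(davies_bouldin(clusters), dunn(clusters))  # 0.1 10.0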

External criterion of quality

Clustering evaluation methods that adhere to an external criterion compare the results of the clustering algorithm against some external benchmark. Such benchmarks consist of a set of pre-classified items, and these sets are often created by human experts; thus, the benchmark sets can be thought of as a gold standard for evaluation. These types of evaluation methods measure how close the clustering is to the predetermined benchmark classes. However, it has recently been discussed whether this is adequate for real data, or only for synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters, or the classes may contain anomalies.[18]

Some of the measures of quality of a cluster algorithm using an external criterion include:

Rand measure

The Rand index computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. One can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm. It can be computed using the following formula:

RI = (TP + TN) / (TP + FP + FN + TN)

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. One issue with the Rand index is that false positives and false negatives are equally weighted. This may be an undesirable characteristic for some clustering applications; the F-measure addresses this concern.

F-measure

The F-measure can be used to balance the contribution of false negatives by weighting recall through a parameter β ≥ 0. Let precision and recall be defined as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)

where P is the precision rate and R is the recall rate. We can calculate the F-measure by using the following formula:[16]

F_β = ((β² + 1) P R) / (β² P + R)

Notice that when β = 0, F_0 = P; in other words, recall has no impact on the F-measure when β = 0, and increasing β allocates an increasing amount of weight to recall in the final F-measure.

Jaccard index

The Jaccard index is used to quantify the similarity between two datasets. The Jaccard index takes on a value between 0 and 1; an index of 1 means that the two datasets are identical, and an index of 0 indicates that the datasets have no common elements. The Jaccard index is defined by the following formula:

J(A, B) = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN)

This is simply the number of unique elements common to both sets divided by the total number of unique elements in both sets.
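In clustering evaluation, the TP/FP/FN/TN counts are usually taken over pairs of items: a pair is "positive" when both the clustering and the benchmark put the two items together. A minimal sketch of the three measures under that pair-counting convention:

    from itertools import combinations

    def pair_counts(labels, truth):
        # TP = together in both, FP = together only in the clustering,
        # FN = together only in the benchmark, TN = together in neither.
        tp = fp = fn = tn = 0
        for i, j in combinations(range(len(labels)), 2):
            same_cluster = labels[i] == labels[j]
            same_class = truth[i] == truth[j]
            if same_cluster and same_class:
                tp += 1
            elif same_cluster:
                fp += 1
            elif same_class:
                fn += 1
            else:
                tn += 1
        return tp, fp, fn, tn

    def rand_index(labels, truth):
        tp, fp, fn, tn = pair_counts(labels, truth)
        return (tp + tn) / (tp + fp + fn + tn)

    def jaccard_index(labels, truth):
        tp, fp, fn, _ = pair_counts(labels, truth)
        return tp / (tp + fp + fn)

    def f_measure(labels, truth, beta=1.0):
        tp, fp, fn, _ = pair_counts(labels, truth)
        p, r = tp / (tp + fp), tp / (tp + fn)
        return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

    labels = [0, 0, 1, 1]; truth = [0, 0, 0, 1]
    print(rand_index(labels, truth))     # 0.5
    print(jaccard_index(labels, truth))  # 0.25
    print(f_measure(labels, truth))      # 0.4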

Fowlkes–Mallows index

Confusion matrix

A confusion matrix can be used to quickly visualize the results of a classification (or clustering) algorithm. It shows how different a cluster is from the gold-standard cluster.

Mutual information

Several different clustering systems based on mutual information have been proposed. One is Marina Meila's 'Variation of Information' metric; another provides hierarchical clustering.[20] Using genetic algorithms, a wide range of different fit-functions can be optimized, including mutual information.[21] The adjusted-for-chance version of the mutual information is the Adjusted Mutual Information (AMI), which corrects mutual information for agreement solely due to chance between clusterings (Vinh et al. 2009).

Relative criterion of quality

Cluster evaluation methods that incorporate a relative criterion directly evaluate the clustering algorithm in regard to the user's needs. For example, a clustering algorithm may perform admirably based on various internal criteria, but the algorithm may be unnecessarily slow; if the user seeks a quick clustering response, a faster algorithm that performs slightly poorer on the internal criteria may be more desirable. This valuation method is more direct, but requires carefully defining what the "user's needs" are. This approach can also be more expensive, in particular if large user studies are required to fully understand user preference.[16]

Algorithms

In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998). Among the most popular are CLARANS (Ng and Han, 1994), DBSCAN[19] and BIRCH (Zhang et al., 1996).

See also

Adjusted Mutual Information
Artificial neural network (ANN)
Cluster-weighted modeling
Clustering high-dimensional data
Consensus clustering
Constrained clustering
Curse of dimensionality
Data clustering algorithms
Data mining
Data stream clustering
Determining the number of clusters in a data set
Dimension reduction
Human genetic clustering
Latent Class Analysis
Mixture modelling
Multidimensional scaling
Nearest neighbor search
Neighbourhood components analysis
Paired difference test
Principal Component Analysis
Silhouette
Structured data analysis (statistics)
Parallel coordinates

References

1. MacQueen, J. B. (1967). "Some Methods for Classification and Analysis of Multivariate Observations". Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1:281–297.
2. Nock, R. and Nielsen, F. (2006). "On Weighting Clustering" (http://www1.univ-ag.fr/~rnock/Articles/Drafts/tpami06-nn.pdf). IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (8), 1–13.

3. Bezdek, James C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. ISBN 0306406713.
4. Ahmed, Mohamed N.; Yamany, Sameh M.; Mohamed, Nevin; Farag, Aly A.; Moriarty, Thomas (2002). "A Modified Fuzzy C-Means Algorithm for Bias Field Estimation and Segmentation of MRI Data" (http://www.cvip.uofl.edu/wwwcvip/research/publications/Pub_Pdf/2002/3.pdf). IEEE Transactions on Medical Imaging 21 (3): 193–199.
5. "Google News personalization: scalable online collaborative filtering" (http://www2007.org/program/paper.php?id=570).
6. Aldenderfer, M. S. and Blashfield, R. K. (1984). Cluster Analysis. Newbury Park (CA): Sage.
7. Bennett, S. N. (1975). "Cluster analysis in educational research: A non-statistical introduction". Research Intelligence, 1, 64–70.
8. Huberty, C. J., Jordan, E. M., and Brandt, W. C. (2005). "Cluster analysis in higher education research". In J. C. Smart (ed.), Higher Education: Handbook of Theory and Research (Vol. 20, pp. 437–457). Great Britain: Springer.
9. Hattie, J. (2002). "Schools Like Mine: Cluster Analysis of New Zealand Schools". Technical Report 14, Project asTTle. University of Auckland.
10. Cornish, R. (2007). "Cluster Analysis". Mathematics Learning Support, Chapter 3.1.
11. Entwistle, N. J. and Brennan, T. (1971). "Types of successful students". British Journal of Educational Psychology, 41, 268–276.
12. Kumar. "Cluster Analysis: Basic Concepts and Algorithms" (http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf). Prepublication chapter from a book.
13. Finch, H. (2005). "Comparison of distance measures in cluster analysis with dichotomous data". Journal of Data Science, 3, 85–100.
14. Basak, S. C., Magnuson, V. R., Niemi, C. J., and Regal, R. R. (1988). "Determining Structural Similarity of Chemicals Using Graph-Theoretic Indices". Discrete Applied Mathematics, 19, 17–44.
15. Huth, R., et al. (2008). "Classifications of Atmospheric Circulation Patterns: Recent Advances and Applications". Annals of the New York Academy of Sciences, 1146, 105–152.
16. Manning, Christopher D., Raghavan, Prabhakar, and Schutze, Hinrich. Introduction to Information Retrieval. Cambridge University Press. ISBN 978-0-521-86571-5.
17. Dunn, J. (1974). "Well separated clusters and optimal fuzzy partitions" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.170.869). Journal of Cybernetics, 4, 95–104.
18. Färber, Ines; Günnemann, Stephan; Kriegel, Hans-Peter; Kröger, Peer; Müller, Emmanuel; Schubert, Erich; Seidl, Thomas; Zimek, Arthur (2010). "On Using Class-Labels in Evaluation of Clusterings" (http://eecs.oregonstate.edu/research/multiclust/Evaluation-4.pdf). In Xiaoli Z. Fern, Ian Davidson, and Jennifer Dy (eds.), MultiClust: Discovering, Summarizing, and Using Multiple Clusterings. ACM SIGKDD.
19. Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.1980). In Evangelos Simoudis, Jiawei Han, and Usama M. Fayyad (eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, pp. 226–231. ISBN 1-57735-004-9.
20. Kraskov, Alexander; Stögbauer, Harald; Andrzejak, Ralph G.; Grassberger, Peter (2003). "Hierarchical Clustering Based on Mutual Information". ArXiv q-bio/0311039 (http://arxiv.org/abs/q-bio/0311039).
21. Auffarth, B. (2010). "Clustering by a Genetic Algorithm with Biased Mutation Operator". WCCI CEC, IEEE, July 18–23, 2010.
22. Frisvad, J. C. "Cluster Analysis" (http://www2.imm.dtu.dk/courses/27411/doc/Lection10/Cluster%20analysis.pdf). Lecture notes, BioCentrum-DTU. Based on H. C. Romesburg, Cluster Analysis for Researchers, Lifetime Learning Publications, Belmont, CA, 1984, and P. Sneath and R. Sokal, Numerical Taxonomy, Freeman, 1973.

Further reading

Romesburg, H. C. (2004). Cluster Analysis for Researchers. ISBN 1-4116-0617-5. 340 pp. Reprint of the 1990 edition published by Krieger Pub. Co. A Japanese language translation is available from Uchida Rokakuho Publishing Co., Ltd., Tokyo, Japan.

External links

PAL Clustering suite (https://pal.sri.com/Plone/framework/Components/learning-methods/copy_of_classification-suite-jw), written in Java.
