multidimensional scaling, factor analysis, principal component analysis, and cluster analysis, to name a few. A useful overview is given in [3]. In pattern recognition, data analysis is concerned with predictive modeling: given some training data, we want to predict the behavior of the unseen test data. This task is also referred to as learning. Often, a clear distinction is made between learning problems that are (i) supervised (classification) or (ii) unsupervised (clustering): the first involves only labeled data (training patterns with known category labels), while the latter involves only unlabeled data [4]. Clustering is a more difficult and challenging problem than classification. There is a growing interest in a hybrid setting, called semi-supervised learning [5]; in semi-supervised classification, the labels of only a small portion of the training data set are available. The unlabeled data, instead of being discarded, are also used in the learning process. In semi-supervised clustering, instead of specifying the class labels, pair-wise constraints are specified, which is a weaker way of encoding prior knowledge. A pair-wise must-link constraint corresponds to the requirement that two objects should be assigned the same cluster label, whereas the cluster labels of two objects participating in a cannot-link constraint should be different. Constraints can be particularly beneficial in data clustering [6], where precise definitions of underlying clusters are absent. In the search for good models, one would like to include all the available information, no matter whether it is unlabeled data, data with constraints, or labeled data. Figure 1 illustrates this spectrum of different types of learning problems of interest in pattern recognition and machine learning.
machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics. Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology, and typological analysis. Clustering is typically used for one of the following purposes:
a. Underlying structure: to gain insight into data, generate hypotheses, detect anomalies, and identify salient features. b. Natural classification: to identify the degree of similarity among forms or organisms (phylogenetic relationship). c. Compression: as a method for organizing the data and summarizing it through cluster prototypes.
clustering and image segmentation. Utilizing the implemented k-means method, the image models are grouped into coherent clusters.
Chapter 2 describes related work on clustering and on image segmentation using clustering. Chapter 3 defines different clustering algorithms, and in Chapter 4 the k-means clustering algorithm is analyzed. Chapter 5 is about clustering validation, where we try to propose a method for cluster validation. Chapter 6 is about color images, image segmentation, and its implementation. In Chapter 7 we present our results.
One of the main differences between retrieval systems that are based on image clustering lies in the way the clustering is performed. The clustering process may be a supervised process using human intervention, or an unsupervised process. Another basic difference lies in the image representation itself, whether it entails global image information or contains local information as well.
based feature sets are extracted from each subsegment using Haar and Daubechies wavelet transforms. Using the multi-resolution property of wavelets, the features are extracted at different levels. In comparing images to cluster feature vectors, only coarse-scale features are used; whereas in retrieving images, all the features are compared. An image can be assigned to one or more clusters. During the retrieval process a query icon is introduced to the system. The distance between the query icon and the cluster icons is first computed. Clusters for which this distance is less than a threshold are searched to retrieve relevant images. The distance between a database image and an icon is considered as the minimum of the distance between its subsegments and the icon. This clustering approach was demonstrated on a database of air photo images where texture is an important feature. Experiments showed that for a good feature set, retrieval from the clustered database was as effective as retrieval from the whole database and was more efficient (required fewer comparisons between the query icon and database images). Carson et al. [19] used a naive Bayes algorithm to learn image categories in a supervised learning scheme. The images in this work are represented by a set of homogeneous regions in a feature space of color and texture, based on the Blobworld image representation (Carson et al. [20]). The suggested framework entails learning blob-rules per category on a labelled training set. Thus, one may argue that there is a shift to a high-level image description (image labelling). Each query image is compared with the extracted category models, and associated with the closest matching category. A probabilistic and continuous framework for supervised image category modeling and matching was recently introduced by Greenspan et al. [12]. A generalized GMM-KL framework is described in which each image or image-set (category) is represented as a Gaussian mixture distribution. Images (categories) are compared and matched via a probabilistic measure of similarity between distributions known as the Kullback-Leibler (KL) distance (Cover and Thomas [7]). The KL-distance is applied in an image retrieval task and results indicate better performance when compared to histograms. The main drawback of supervised clustering is that it requires human intervention. In
order to extract the cluster representation, the various methods mentioned above require a-priori knowledge regarding the database content (e.g., which clusters exist in the database or a labelled training set). This approach is therefore not appropriate for large unlabelled databases.
method which dramatically reduces both the memory and computation required for the agglomerative algorithm. A hierarchical browsing environment called a similarity pyramid is constructed based on the results of the agglomerative clustering algorithm. The similarity pyramid groups similar images together while allowing the users to view the database at various levels of resolution. Results for search experiments are presented and the tradeoff between search accuracy and speed is determined. A common characteristic of the histogram representation is the discretization of the feature space. The binning of the space involves loss of information. A binning that is too coarse will not have sufficient discriminative power, while one that is too fine will place similar features in different bins, which will never be matched. A second major characteristic of the global histogram is that it captures only global feature distributions of the images, and lacks information about spatial relationships of the image features. A shift to a more localized representation, which reflects spatial information from the image plane, may be desired.

2.2.2 Unsupervised clustering using region-based image representations
The following set of works incorporates region-based approaches for the image representation. Since image regions are the basic building blocks in forming the visual content of an image, they have great potential in representing the image content as well as the category content. Abdel-Mottaleb et al. [1, 16] presented a technique for image retrieval from large databases based on local color features. Images in the database are divided into rectangular regions and represented by a set of normalized histograms corresponding to these rectangular regions. The size of the regions is an important parameter. Regions should be small enough to emphasize the local color and large enough to offer statistically valid histograms. The similarity measure between two images is expressed as the sum of similarities between histograms of the corresponding rectangular regions. The influence of various similarity measures on the proposed framework was tested in [24]. The unsupervised image-clustering scheme introduced in [16] included an agglomerative clustering algorithm based on the image representation and the optimal similarity measure found in [24]. The cluster centroid is calculated as the average of the histograms representing a selected set of images. The agglomerative algorithm is associated with an optimization step in which
each image re-evaluates its location and moves to a new cluster if required. Similar images are first grouped into clusters and the cluster centers are calculated. During the retrieval process the query image is initially compared with all the cluster centers. Then a subset of clusters that have the largest similarity to the query image is chosen and all the images in these clusters are compared with the query image. The experimental evaluation shows that using clustering for retrieval offers a high retrieval accuracy with a considerable reduction in the number of similarity comparisons required. In [17] the tree structure was used for the implementation of a browsing environment. A hierarchical model that imposes a coarse-to-fine structure on the image collection is suggested by Barnard et al. [2, 3]. The statistical model presented integrates semantic information provided by associated text and visual information provided by image features. By integrating the two kinds of information during model construction, the system learns links between the image features and semantics, which can be exploited for better browsing and search. Each node in the tree has some probability of generating each word, and similarly, each node has some probability of generating an image segment with given features. A node is uniquely specified by cluster and level. The documents belonging to a given cluster are modeled as being generated by the nodes along the path from the leaf corresponding to the cluster up to the root node. Taking all clusters into consideration, a document is modeled by a sum over the clusters, weighted by the probability that the document is in the cluster. The probability for generating a word is simply tabulated, being determined by the appropriate word counts during training. For image segments, Gaussian distributions over a number of features (like color, texture, shape, position, and size) are used in a way similar to the Blobworld representation (Carson et al. [5]). A document is represented as a sequence of words and a sequence of segments. Training the model is done on the entire image collection using the Expectation-Maximization algorithm. The clustering process introduced is claimed to be remarkably successful, since the text associated with the images typically emphasizes properties that are very hard to determine with computer vision techniques, but omits the visually obvious, and so the text and the images are complementary.
Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning).
3.2 Requirements:
The main requirements that a clustering algorithm should satisfy are:
Scalability;
Dealing with different types of attributes;
Discovering clusters with arbitrary shape;
Minimal requirements for domain knowledge to determine input parameters;
Ability to deal with noise and outliers;
Insensitivity to the order of input records;
High dimensionality;
Interpretability and usability.
Current clustering techniques do not address all the requirements adequately (and concurrently); dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity; the effectiveness of the method depends on the definition of distance (for distance-based clustering); if an obvious distance measure doesn't exist, we must define it, which is not always easy, especially in multi-dimensional spaces; and the result of the clustering algorithm (which in many cases can be arbitrary itself) can be interpreted in different ways.
To study clustering algorithms, we should first determine our requirements, then identify the existing problems of current clustering algorithms, and finally classify the different clustering algorithms. The following figure describes the basic steps in clustering.
In these methods not only the objects are clustered but also the features of the objects; i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. They usually do not, however, work with arbitrary feature combinations as general subspace methods do. But this special case deserves attention due to its applications in bioinformatics.
eigenvectors of the RBF similarity matrix computed from the data in Figure 3.4(a), they become well separated, making it trivial for K-means to cluster the data [25].
Figure 3.4 Importance of a good representation. (a) Two-rings dataset where K-means fails to find the two natural clusters; the dashed line shows the linear cluster separation boundary obtained by running K-means with K = 2. (b) A new representation of the data in (a) based on the top 2 eigenvectors of the graph Laplacian of the data, computed using an RBF kernel; K-means can now easily detect the two clusters.
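The representation change illustrated in Figure 3.4 can be sketched in a few lines of Python. The following is a minimal, illustrative sketch (assuming NumPy and scikit-learn are available; the RBF width gamma and the use of the symmetric normalized Laplacian are our own choices, not details taken from [25]):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def spectral_embedding_kmeans(X, n_clusters=2, gamma=1.0):
    # RBF (Gaussian) similarity matrix between all pairs of points
    W = rbf_kernel(X, gamma=gamma)
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized graph Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Eigenvectors of L with the smallest eigenvalues give the new representation
    _, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, :n_clusters]
    # In this embedded space, K-means can separate the two rings
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
```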
3.6.1 Common distance functions:
The Euclidean distance: In Cartesian coordinates, if $p = (p_1, p_2, \dots, p_n)$ and $q = (q_1, q_2, \dots, q_n)$ are two points in Euclidean n-space, then the distance from p to q, or from q to p, is given by:

$d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$

The Manhattan distance: The taxicab distance, $d_1$, between two vectors p, q in an n-dimensional real vector space with a fixed Cartesian coordinate system, is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes. More formally,

$d_1(p, q) = \lVert p - q \rVert_1 = \sum_{i=1}^{n} |p_i - q_i|$
The maximum norm: In mathematical analysis, the uniform norm assigns to real- or complex-valued bounded functions f defined on a set S the nonnegative number

$\lVert f \rVert_\infty = \sup\{\, |f(s)| : s \in S \,\}$

This norm is also called the supremum norm, the Chebyshev norm, or the infinity norm. The Mahalanobis distance corrects data for different scales and correlations in the variables: formally, the Mahalanobis distance of a multivariate vector $x = (x_1, x_2, \dots, x_n)^T$ from a group of values with mean $\mu = (\mu_1, \mu_2, \dots, \mu_n)^T$ and covariance matrix S is defined as:

$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$
The angle between two vectors can be used as a distance measure when clustering high dimensional data. The Hamming distance measures the minimum number of substitutions required to change one member into another.
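For concreteness, these distance functions can be written directly in Python. A minimal NumPy sketch (the function names are our own):

```python
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((np.asarray(p, float) - np.asarray(q, float)) ** 2))

def manhattan(p, q):
    return np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float)))

def chebyshev(p, q):  # maximum (infinity) norm
    return np.max(np.abs(np.asarray(p, float) - np.asarray(q, float)))

def mahalanobis(x, mean, cov):
    d = np.asarray(x, float) - np.asarray(mean, float)
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

def vector_angle(p, q):  # angle between two vectors, usable as a distance in high dimensions
    p, q = np.asarray(p, float), np.asarray(q, float)
    cos = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def hamming(a, b):  # number of positions at which the two sequences differ
    return sum(x != y for x, y in zip(a, b))
```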
Figure 3.7: Different weights on features result in different partitionings of the data. Sixteen animals are represented based on 13 Boolean features related to appearance and activity. (a) A partitioning with large weights assigned to the appearance-based features; (b) a partitioning with large weights assigned to the activity features. The figures in (a) and (b) are excerpted from [26], and are known as heat maps, where the colors represent the density of samples at a location; the warmer the color, the larger the density.
Figure 3.8 Automatic selection of number of clusters, K. (a) Input data generated from a mixture of six Gaussian distributions; (b)-(d) Gaussian mixture model (GMM) fit to the data with 2, 5 and 6 components, respectively; (e) true labels of the data.
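A small sketch of the kind of experiment shown in Figure 3.8 can be run with scikit-learn's GaussianMixture. The synthetic data below are made up for illustration (six Gaussian components); the printed BIC values are one of the information criteria discussed later for picking the number of components:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic stand-in for the data in Figure 3.8: a mixture of six Gaussians
centers = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2), (6, 2)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])

for k in (2, 5, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # a lower BIC suggests a better trade-off between fit and model complexity
    print(k, gmm.bic(X))
```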
3.9 Cluster validity
Clustering algorithms tend to find clusters in the data irrespective of whether or not any clusters are present. Figure 3.9(a) shows a dataset with no natural clustering; the points here were generated uniformly in a unit square. However, when the K-means algorithm is run on this data with K = 3, three clusters are identified as shown in Figure 3.9(b)! Cluster validity refers to formal procedures that evaluate the results of cluster analysis in a quantitative and objective fashion [31]. In fact, even before a clustering algorithm is applied to the data, the user should determine if the data even has a clustering tendency [32].
Figure 3.9 Cluster validity. (a) A dataset with no natural clustering; (b) K-means partition with K = 3.
Cluster validity indices can be defined based on three different criteria: internal, relative, and external [29]. Indices based on internal criteria assess the fit between the structure imposed by the clustering algorithm (clustering) and the data using the data alone. Indices based on relative criteria compare multiple structures (generated by different algorithms, for example) and decide which of them is better in some sense. External indices measure the performance by matching cluster structure to the a priori information, namely the true class labels (often referred to as ground truth). Typically, clustering results are evaluated using the external criterion, but if the true labels are available, why even bother with clustering? The notion of cluster stability [33] is appealing as an internal stability measure. Cluster stability is measured as the amount of variation in the clustering solution over different sub-samples drawn from the input data. Different measures of variation can be used to obtain different stability measures. In [26], a supervised classifier is trained on one of the subsamples of the data, by using the cluster labels obtained by clustering the subsample, as the true labels. The performance of this classifier on the testing subset(s) indicates the stability of the clustering algorithm. In model based algorithms (e.g., centroid based representation of clusters in K-means, or Gaussian Mixture Models), the distance between the models found for different subsamples can be used to measure the stability [34]. Shamir and Tishby [22] define stability as the generalization ability of a clustering algorithm (in PAC-Bayesian sense). They argue that since many algorithms can be shown to be asymptotically stable, the rate at which the asymptotic stability is reached with respect to the number of samples is a more useful measure of cluster stability. Cross-validation is a widely used evaluation method in supervised learning. It
has been adapted to unsupervised learning by replacing the notion of prediction accuracy with a different validity measure. For example, given the mixture models obtained from the data in one fold, the likelihood of the data in the other folds serves as an indication of the algorithm's performance, and can be used to determine the number of clusters K. Among the various types of clustering algorithms, after determining our problem space and requirements, we decided to study the K-means clustering algorithm.
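A rough sketch of this cross-validated likelihood idea, assuming scikit-learn (the choice of a Gaussian mixture and of five folds is ours):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def heldout_loglik(X, k_values, n_splits=5):
    """For each K, fit a Gaussian mixture on the training folds and score the held-out
    fold; the K with the highest average held-out log-likelihood is preferred."""
    scores = {}
    for k in k_values:
        fold_scores = []
        for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
            gmm = GaussianMixture(n_components=k, random_state=0).fit(X[train_idx])
            fold_scores.append(gmm.score(X[test_idx]))  # mean log-likelihood per sample
        scores[k] = float(np.mean(fold_scores))
    return scores
```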
4.1 Overview:
4.1.1 What is K-means?
In clustering, the k-means method tries to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data, as well as in the iterative refinement approach employed by both algorithms. Let $X = \{x_i\}, i = 1, \dots, n$ be the set of n d-dimensional points to be clustered into a set of K clusters, $C = \{c_k\}, k = 1, \dots, K$. The K-means algorithm finds a partition such that the squared error between the empirical mean of a cluster and the points in the cluster is minimized. Let $\mu_k$ be the mean of cluster $c_k$. The squared error between $\mu_k$ and the points in cluster $c_k$ is defined as

$J(c_k) = \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2$

The goal of K-means is to minimize the sum of the squared error over all K clusters,

$J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2$
Minimizing this objective function is known to be an NP-hard problem (even for K = 2) [35]. Thus K-means, which is a greedy algorithm, can only converge to a local minimum, even though a recent study has shown that with a large probability K-means can converge to the global optimum when clusters are well separated [36]. K-means starts with an initial partition with K clusters and assigns patterns to clusters so as to reduce the squared error. Since the squared error always decreases with an increase in the number of clusters K (with J(C) = 0 when K = n), it can be minimized only for a fixed number of clusters.

4.1.2 Basic Steps
The main steps of the K-means algorithm are as follows:
I. Choose K initial cluster centers $z_1(1), z_2(1), \dots, z_K(1)$.
II. At the k-th iterative step, distribute the samples {x} among the K clusters using the relation: $x \in S_j(k)$ if $\lVert x - z_j(k) \rVert < \lVert x - z_i(k) \rVert$ for all $i = 1, 2, \dots, K$, $i \neq j$, where $S_j(k)$ denotes the set of samples whose cluster center is $z_j(k)$.
III. Compute the new cluster centers $z_j(k+1)$, j = 1, 2, ..., K, such that the sum of the squared distances from all points in $S_j(k)$ to the new cluster center is minimized. The measure which minimizes this is simply the sample mean of $S_j(k)$. Therefore, the new cluster centre is given by
$z_j(k+1) = \frac{1}{N_j} \sum_{x \in S_j(k)} x, \quad j = 1, 2, \dots, K$
where $N_j$ is the number of samples in $S_j(k)$.
IV. If $z_j(k+1) = z_j(k)$ for j = 1, 2, ..., K, then the algorithm has converged and the procedure is terminated. Otherwise go to Step II.
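Steps I-IV translate almost directly into code. The following is a minimal NumPy sketch (initializing the centers by sampling K data points, and the convergence test, are our own choices):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Basic K-means following steps I-IV above."""
    rng = np.random.default_rng(seed)
    # I. choose K initial cluster centers (here: K distinct data points)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # II. assign every sample to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # III. recompute each center as the sample mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        # IV. stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```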
1) k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
It is clear from this description that the final clustering will depend on the initial cluster centers chosen and on the value of K. This requires some prior knowledge of the number of clusters present in the data; in practical cases this is not always possible or feasible, so there must be some process for determining the number of clusters.
As a motivational example, consider a workload with a single feature. Suppose that the feature values range from 1.0 to 5.0 and that there are seven samples at 1.0, one sample at 3.0, and one at 5.0. Figures 2 and 3 show the results of running the clustering algorithm on the data set when trying to find two clusters. Figure 2 shows the two clusters that are found when the initial starting points are 1.7 and 5.1. Eight data points are placed in the first cluster, with a centroid value of 1.25. The second cluster consists of one data point, with a centroid value of 5.0. The error function value is E1 = 3.5. Figure 3 shows the two clusters that are found when the initial starting points are 1.5 and 3.5. With these starting points, one cluster consists of the seven data points with a centroid value of 1.0 and the other cluster consists of two data points with a centroid value of 4.0. This clustering leads to an error function value of E2 = 2.0. This simple example illustrates that the selection of starting point values is important when trying to determine the best representation of a workload. In comparing Figures 2 and 3 to determine which of the two clusterings is better, a case can be made that a tight cluster is centered on 1.0, with the other two data points being outliers. This is captured in the second (i.e., Figure 3) clustering and also results in a lower E2 value. Thus, lower values of Ek are taken as better clustering representations. A second motivational example is illustrated in Figures 4-9, which show the clusters resulting from clustering on a workload with two features and 11 sample data points. When finding a single cluster, the initial starting point value does not matter because all data points will be assigned to the same cluster. Figure 4 shows the result of clustering when determining one cluster. The error function value is E1 = 15.2. There are an infinite number of alternatives for initial starting values when trying to find more than one cluster. Figure 5 shows the best (i.e., lowest E2 value) result for this example when determining two clusters. Nine data points are in one cluster and two data points in the other cluster. The resulting
Figure 4.4. Starting Points Example 2: One Cluster. Figure 4.5. Starting Points Example 2: Two Clusters. (Both plots show Feature 2 against Feature 1.)
Next, when finding three clusters, two starting point alternatives are evaluated. The results from these two alternatives are shown in Figures 6 and 7. In Figure 6, the starting points are: (1.8, 1.0), (1.0, 2.7), and (2.0, 2.7). The figure shows that the algorithm places seven data points in one cluster and two data points in each of the other clusters. The error function value is E3,alt1 = 4.0. In Figure 7, the starting points are: (1.8, 1.0), (1.4, 2.0), and (1.4, 4.0). The algorithm again places seven data points in one cluster and two data points in each of the other clusters, but the error function value is E3,alt2 = 1.0, indicating an improved clustering. Figure 8 shows the clustering result when finding four clusters, using (1.8, 1.0), (0.8, 2.0), (2.3, 2.0), and (1.3, 4.0) as the starting values. In this case, the error function value is E4 = 0.5. Figure 9 shows the plot of the error function value Ek as a function of the number of clusters, k. As shown, the two starting point alternatives give different error function values. Intuitively, as more clusters are added, the error function value should decrease. Alternative 1 shows anomalous behavior, where E2 < E3,alt1, indicating that adding an extra cluster actually hurts the error measure. Identifying the knee of the curve for Alternative 1 in Figure 9, one could conclude that two clusters (i.e., Figure 5) best represent this data set. Better, more reasonable methods for determining the initial starting points should not exhibit such anomalous behavior.
(Figures: Starting Points Example 2 with three and four clusters, plotted as Feature 2 against Feature 1; and a plot of the Sum of Distances to Centroids (Ek) against the number of clusters for Alternative 1 and Alternative 2.)
Fig: Explained Variance. The red circle indicates the elbow. The number of clusters chosen should therefore be 4.
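A curve of this kind can be produced with a short script. The sketch below (assuming scikit-learn and matplotlib) plots the total within-cluster sum of squares against k, which plays the same role as the explained-variance curve; the knee (elbow) of the curve suggests the number of clusters:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10):
    """Plot the within-cluster sum of squares (inertia) for k = 1..k_max."""
    ks = range(1, k_max + 1)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("within-cluster sum of squares")
    plt.show()
    return inertias
```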
c. Information criterion approach: Another set of methods for determining the number of clusters is the information criteria, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), or the Deviance information criterion (DIC), if it is possible to construct a likelihood function for the clustering model. For example, the k-means model is "almost" a Gaussian mixture model, and one can construct a likelihood for the Gaussian mixture model and thus also determine information criterion values [43]. d. Information theory approach: Rate distortion theory has been applied to choosing k in an approach called the "jump" method, which determines the number of clusters that maximizes efficiency while minimizing error by information-theoretic standards. The strategy of the algorithm is to generate a distortion curve for the input data by running a standard clustering algorithm such as k-means for all values of k between 1 and n, and computing the distortion (described below) of the resulting clustering. The distortion curve is then transformed by a negative power chosen based on the dimensionality of the data. Jumps in the resulting values then signify reasonable choices for k, with the largest jump
representing the best choice. The distortion of a clustering of some input data is formally defined as follows: let the data set be modeled as a p-dimensional random variable, X, consisting of a mixture distribution of G components with common covariance $\Gamma$. If $c_1, \dots, c_K$ is a set of K cluster centers, with $c_X$ the closest center to a given sample of X, then the minimum average distortion per dimension when fitting the K centers to the data is

$d_K = \frac{1}{p} \min_{c_1, \dots, c_K} E\left[(X - c_X)^T \Gamma^{-1} (X - c_X)\right]$

This is also the average Mahalanobis distance per dimension between X and the set of cluster centers C. Because the minimization over all possible sets of cluster centers is prohibitively complex, the distortion is computed in practice by generating a set of cluster centers using a standard clustering algorithm and computing the distortion using the result. The choice of the transform power $Y = p/2$ is motivated by asymptotic reasoning using results from rate distortion theory. Let the data X have a single, arbitrary p-dimensional Gaussian distribution, and let $K = \lfloor \alpha^p \rfloor$ be fixed, for some $\alpha$ greater than zero. Then the distortion of a clustering of K clusters in the limit as p goes to infinity is $\alpha^{-2}$. It can be seen that asymptotically, the distortion of a clustering raised to the power $-p/2$ is proportional to $\alpha^p$, which by definition is approximately the number of clusters K. In other words, for a single Gaussian distribution, increasing K beyond the true number of clusters, which should be one, causes a linear growth in distortion. This behavior is important in the general case of a mixture of multiple distribution components. Let X be a mixture of G p-dimensional Gaussian distributions with common covariance. Then for any fixed K less than G, the distortion of a clustering as p goes to infinity is infinite. Intuitively, this means that a clustering of fewer than the correct number of clusters is unable to describe asymptotically high-dimensional data, causing the distortion to increase without limit. If, as described above, K is made an increasing function of p, namely $K = \lfloor \alpha^p \rfloor$, the same result as above is achieved, with the value of the distortion in the limit as p goes to infinity being equal to $\alpha^{-2}$. Correspondingly, there is the same proportional relationship between the transformed distortion and the number of clusters, K. Putting the results above together, it can be seen that for sufficiently high values of p, the transformed distortion $d_K^{-p/2}$ is approximately zero for K < G, then jumps suddenly and begins
increasing linearly for K ≥ G. The jump algorithm for choosing K makes use of these behaviors to identify the most likely value for the true number of clusters. Although the mathematical support for the method is given in terms of asymptotic results, the algorithm has been empirically verified to work well in a variety of data sets with reasonable dimensionality. In addition to the localized jump method described above, there exists a second algorithm for choosing K using the same transformed distortion values, known as the broken line method. The broken line method identifies the jump point in the graph of the transformed distortion by doing a simple least squares error line fit of two line segments, which in theory will fall along the x-axis for K < G, and along the linearly increasing phase of the transformed distortion plot for K ≥ G. The broken line method is more robust than the jump method in that its decision is global rather than local, but it also relies on the assumption of Gaussian mixture components, whereas the jump method is fully non-parametric and has been shown to be viable for general mixture distributions. e. Choosing k using the silhouette: The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a datum is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighboring cluster, i.e. the cluster whose average distance from the datum is lowest [44]. A silhouette close to 1 implies the datum is in an appropriate cluster, whilst a silhouette close to -1 implies the datum is in the wrong cluster. Optimization techniques such as genetic algorithms are useful in determining the number of clusters that gives rise to the largest silhouette [45].
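A minimal sketch of silhouette-based selection of K, assuming scikit-learn (using K-means as the underlying clustering algorithm is our own choice here):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values):
    """Return the k (from k_values, each >= 2) whose K-means partition has the
    largest average silhouette, together with all the scores."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```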
These all have a common goal: to find the clustering which results in compact clusters that are well separated. Here we tried to deduce a validation method.
3. For k = 2 to k = n:
a. From running the algorithm for k-1, determine the most populous cluster.
b. Within this most populous cluster, find the two data samples that are the farthest away from each other.
c. Using these two points, together with the k-2 other data points that formed the clusters that were not most populous, the set of k starting points is determined.
This method is reasonable since it seeks to break up the most populous cluster into two smaller clusters. It is hoped that a decreased error value Ek results. The third method for selecting actual sample data points is feature value sums; a sketch of it is given below. The algorithm for feature value sums follows:
1. Each data sample's feature vector is summed to give a single value for each job.
2. The entire workload is sorted based on these feature vector sums.
3. For each k, the workload is divided into k equal-sized regions. The set of k median sample data points, one from each region, is used as the k initial starting points.
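A sketch of the feature value sums initialization (NumPy; taking the middle element of each sorted region as its "median sample" is our own reading of the description):

```python
import numpy as np

def feature_sum_starting_points(X, k):
    """Sort samples by the sum of their feature values, split the sorted workload
    into k equal-sized regions, and return the median sample of each region."""
    order = np.argsort(X.sum(axis=1))
    regions = np.array_split(order, k)
    return X[[region[len(region) // 2] for region in regions]]
```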
The intra-cluster distance is measured as the average squared distance of all the points from their cluster centres:

$\text{intra} = \frac{1}{N} \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - z_i \rVert^2 \qquad (5.1)$

where N is the number of pixels in the image, K is the number of clusters, and $z_i$ is the cluster centre of cluster $C_i$. We obviously want to minimize this measure. We can also measure the inter-cluster distance, or the distance between clusters, which we want to be as big as possible. We calculate this as the distance between cluster centres, and take the minimum of this value, defined as

$\text{inter} = \min\left( \lVert z_i - z_j \rVert^2 \right), \quad i = 1, 2, \dots, K-1, \; j = i+1, \dots, K \qquad (5.2)$

We take only the minimum of this value, as we want the smallest of these distances to be maximized, and the other larger values will automatically be bigger than this value. Since we want both of these measures to help us determine whether we have a good clustering, we must combine them in some way. The obvious way is to take the ratio, defined as

$\text{validity} = \frac{\text{intra}}{\text{inter}} \qquad (5.3)$

Since we want to minimize the intra-cluster distance and this measure is in the numerator, we consequently want to minimize the validity measure. We also want to maximize the inter-cluster distance measure, and since this is in the denominator, we again want to minimize the validity measure. Therefore, the clustering which gives a minimum value for the validity measure will tell us what the ideal value of K is in the k-means procedure.
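Equations (5.1)-(5.3) can be computed directly from a clustering result. A minimal NumPy sketch (function and variable names are our own):

```python
import numpy as np

def validity(X, labels, centers):
    """validity = intra / inter, as in equations (5.1)-(5.3); smaller is better.
    X: (N, d) data, labels: (N,) cluster index per point, centers: (K, d), K >= 2."""
    N = len(X)
    intra = np.sum((X - centers[labels]) ** 2) / N           # equation (5.1)
    K = len(centers)
    inter = min(np.sum((centers[i] - centers[j]) ** 2)       # equation (5.2)
                for i in range(K - 1) for j in range(i + 1, K))
    return intra / inter                                     # equation (5.3)
```

Running k-means for several values of K and keeping the K with the smallest validity value implements the selection rule described above.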
Medical imaging:
o Locate tumors and other pathologies
o Measure tissue volumes
o Computer-guided surgery
o Diagnosis
o Treatment planning
o Study of anatomical structure
Other applications include locating objects in satellite images (roads, forests, etc.), face recognition, fingerprint recognition, traffic control systems, brake light detection, machine vision, and agricultural imaging (crop disease detection).
Several general-purpose algorithms and techniques have been developed for image segmentation. Since there is no general solution to the image segmentation problem, these techniques often have to be combined with domain knowledge in order to effectively solve an image segmentation problem for a problem domain.
(1) RGB (Red, Green and Blue): this color space is used for display. (2) YIQ: the color system of the TV signal. (3) XYZ: the C.I.E. X-Y-Z color system. (4) $I_1 I_2 I_3$: a set of color features derived from RGB. (5) HSI (Hue, Saturation and Intensity): corresponds to human perception. The relationship of the above color systems with RGB is as follows; the transformation matrices are not the standard ones [46].

(1) YIQ:
$Y = 0.299R + 0.587G + 0.114B$
$I = 0.5R - 0.23G - 0.27B$
$Q = 0.202R - 0.5G + 0.298B$

(2) XYZ:
$X = 0.618R + 0.177G + 0.205B$
$Y = 0.299R + 0.587G + 0.114B$
$Z = 0.056G + 0.944B$

(3) $I_1 I_2 I_3$:
$I_1 = (R + G + B)/3$
$I_2 = (R - B)/2$
$I_3 = (2G - R - B)/4$

(4) HSI:
$I = (R + G + B)/3$
$S = 1 - \frac{3 \min(R, G, B)}{R + G + B}$

With these equations the RGB color space can be converted into the other color spaces.
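As an illustration of how such a conversion is applied per pixel, the following sketch computes the intensity and saturation components of the HSI equations above for an RGB image stored as a NumPy array (the small epsilon guarding against division by zero is our own addition):

```python
import numpy as np

def rgb_to_saturation_intensity(image):
    """image: (H, W, 3) float array with R, G, B in [0, 1]."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    total = r + g + b
    intensity = total / 3.0
    saturation = 1.0 - 3.0 * np.minimum(np.minimum(r, g), b) / (total + 1e-12)
    return saturation, intensity
```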
Matplotlib can be used in Python scripts, the Python and IPython shell (à la MATLAB or Mathematica), web application servers, and six graphical user interface toolkits. Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatterplots, etc., with just a few lines of code.
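For example, a scatter plot of clustered data takes only a few lines (an illustrative snippet; the synthetic data are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in [(0, 0), (2, 2), (0, 2)]])
labels = np.repeat([0, 1, 2], 50)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```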
The following Excel sheet represents the results found after applying k-means to the generated data.
From the above figure, we can see that for k = 16, intra is minimum, so the number of clusters must be 16. We plotted k against the variance and found the following result:
(Figure: plot of the variance against the number of clusters k, for k = 1 to 18.)
From the figure we can see that the variance becomes nearly constant as k increases. The variance is minimum at k = 16, where the curve is approximately a straight line parallel to the x-axis. From this we can conclude that the correct number of clusters may be 15, 16, or 17.
Image comparison:
original image
Summary
This paper presents an overview of the K-means clustering algorithm. K-means clustering is a common way to define classes of jobs within a workload. Simple examples show that determining the appropriate number of clusters has a significant effect on the results of the algorithm, both in the number of clusters found and in their centroids. A case study is described where different methods of choosing the initial starting points are evaluated. We implemented the K-means clustering algorithm and used it for image segmentation. For smaller numbers of sampled pixels, the algorithm gives poor results. For larger numbers of sampled pixels, the segmentation is very rich; many clusters appear in the images at discrete places. This is because the Euclidean distance is not a very good metric for segmentation processes. Better results are found using a perceptual-based color distance.
Reference:
[1] GANTZ, JOHN F. 2008 (March). The diverse and exploding
digital universe. Available online at: http://www.emc.com/collateral/analyst-reports/diverseexploding-digital-universe.pdf
[4] DUDA, R., HART, P., & STORK, D. 2001. Pattern classification.
2nd edn. New York: John Wiley & Sons.
[5] CHAPELLE, O., SCHÖLKOPF, B., & ZIEN, A. (eds). 2006. Semi-supervised learning. Cambridge, MA: MIT Press.
[6] LANGE, T., LAW, M. H., JAIN, A. K., & BUHMANN, J. 2005
(June). Learning with constrained and unlabelled data. Pages 730-737 of: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1.
[7] 2009 (February). Google Scholar. http://scholar.google.com.
[8] Three Dimensional Object Recognition Systems. Elsevier Science Inc., New York, NY.
[9] IWAYAMA, M., & TOKUNAGA, T. 1995. Cluster-based text categorization: a comparison of category search strategies. Pages 273-281 of: Proceedings of the 18th ACM International Conference on Research and Development in Information Retrieval.
[10]
improve information access. Ph.D. thesis, Computer Science Department, Stanford University. [Scholkopf et al., 1998] SCHÖLKOPF, BERNHARD, SMOLA, ALEXANDER, &
MÜLLER, KLAUS-ROBERT. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319.
[11]
information retrieval. IEEE Transactions on Systems, Man and Cybernetics, 28(B), 427 436.
[12]
marketing research. Oxford: Blackwell. Chap. Cluster Analysis in Marketing Research, pages 160189.
[13]
for automated generation of service engagement staffing plans. IBM J. Res. Dev., 51(3), 281 293.
[14]
information retrieval. IEEE Transactions on Systems, Man and Cybernetics, 28(B), 427 436.
[15]
hierarchical image classification scheme. In ACM Conference on multimedia, pages 219228, England, September 1998.
[16] [17]
browsing using hierarchical clustering. In Proceedings of the fourth IEEE Symposium on Computers and Communications, Egypt, July 1999.
[18]
large visual databases using wavelet transform. In Proc. of SPIE conference on visual data exploration and analysis IV, volume 3017, San Jose, California, 1997.
[19]
Region-based image querying. In Proc. of the IEEE Workshop on Content-based Access of Image and Video libraries (CVPR97), pages 4249, 1997.
[20]
Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):10261038, 2002.
[21]
browsing and search of large image databases. IEEE transactions on Image Processing, 9(3):442455, March 2000.
[22]
regions for image indexing. Machine Vision and Applications, 10(2):6673, 1997.
[23]
Image indexing using color correlograms. In Proc. of the IEEE Comp. Vis. And Patt. Rec., pages 762768, 1997.
[24]
Performance evaluation of clustering algorithms for scalable image retrieval. In IEEE Computer Society Workshop on Empirical Evaluation Techniques in Computer Vision, 1998.
[25]
2001. On spectral clustering: Analysis and an algorithm. Pages 849856 of: Advances in Neural Information Processing Systems 14. MIT Press.
[26]
2003. On the evaluation of perceptual similarity measures for music. Pages 712 of: Proceedings of the Sixth International Conference on Digital Audio Effects (DAFx-03). London, UK: Queen Mary University of London.
[27]
competitive clustering algorithm with applications in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 450465.
[28]
[29]
HANSEN, MARK H., & YU, BIN. 2001. Model selection and
the principle of minimum description length. Journal of the American Statistical Association, 96(454), 746774.
[30]
Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, B, 411423.
[31]
[32] SMITH, S. P., & JAIN, A. K. 1984. Testing for uniformity in multidimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1), 73-81.
[33]
BUHMANN, JOACHIM M. 2004. Stability-based validation of clustering solutions. Neural Computation, 16(6), 12991323.
[34] VON LUXBURG, U., & BEN-DAVID, S. 2005. Towards a statistical theory of clustering. In: PASCAL workshop on statistics and optimization of clustering.
[35]
VINAY, V. 1999. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3), 9-33.
[36]
optimum for k-means. Pages 625632 of: Proceedings of the 23rd International Conference on Machine Learning.
[37]
MACQUEEN, J. 1967. "Some Methods for Classification and Analysis of Multivariate Observations". 1. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press. pp. 281-297. MR0214227. Zbl 0214.46201.
[38] Kanti Mardia et al. (1979). Multivariate Analysis. Academic Press.
[39] David J. Ketchen, Jr & Christopher L. Shook (1996). "The application of cluster analysis in Strategic Management Research: An analysis and critique". Strategic Management Journal 17 (6): 441-458. doi:10.1002/(SICI)1097-0266(199606)17:6<441::AID-SMJ819>3.0.CO;2-G
[40] Cyril Goutte, Lars Kai Hansen, Matthew G. Liptrot & Egill Rostrup (2001). "Feature-Space Clustering for fMRI Meta-Analysis". Human Brain Mapping 13 (3): 165-183. doi:10.1002/hbm.1031. See especially Figure 14 and appendix.
[41] Peter J. Rousseeuw (1987). "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics 20: 53-65. doi:10.1016/0377-0427(87)90125-7
[42] R. Lleti, M.C. Ortiz, L.A. Sarabia, M.S. Sánchez (2004). "Selecting Variables for k-Means Cluster Analysis by Using a Genetic Algorithm that Optimises the Silhouettes". Analytica Chimica Acta 515: 87-100. doi:10.1016/j.aca.2003.12.020
[43] K.S. Fu and J.K. Mui, "A survey on image segmentation", Pattern Recognition, Vol. 13, pp. 3-16, 1981.
[44] Yu-Ichi Ohta, Takeo Kanade, and Toshiyuki Sakai, "Color Information for Region Segmentation", Computer Graphics and Image Processing, Vol. 13, pp. 222-241, 1980.
[45] Michael B. Dillencourt, Hannan Samet, and Markku Tamminen (1992). "A general approach to connected-component labeling for arbitrary image representations".