
Chapter 1 Introduction

1.1 High Dimensionality of Data:


Advances in sensing and storage technology and dramatic growth in applications such as Internet search, digital imaging, and video surveillance have created many high-volume, high-dimensional data sets. It is estimated that the digital universe consumed approximately 281 exabytes in 2007, and it is projected to be 10 times that size by 2011 (one exabyte is ~10^18 bytes, or 1,000,000 terabytes) [1]. Most of the data is stored digitally in electronic media, thus providing huge potential for the development of automatic data analysis, classification, and retrieval techniques. In addition to the growth in the amount of data, the variety of available data (text, image, and video) has also increased. Inexpensive digital and video cameras have made available huge archives of images and videos. The prevalence of RFID tags or transponders due to their low cost and small size has resulted in the deployment of millions of sensors that transmit data regularly. Emails, blogs, transaction data, and billions of Web pages create terabytes of new data every day. Many of these data streams are unstructured, adding to the difficulty in analyzing them. The increase in both the volume and the variety of data requires advances in methodology to automatically understand, process, and summarize the data. Data analysis techniques can be broadly classified into two major types [2]: (i) exploratory or descriptive, meaning that the investigator does not have prespecified models or hypotheses but wants to understand the general characteristics or structure of the high-dimensional data, and (ii) confirmatory or inferential, meaning that the investigator wants to confirm the validity of a hypothesis/model or a set of assumptions given the available data. Many statistical techniques have been proposed to analyze the data, such as analysis of variance, linear regression, discriminant analysis, canonical correlation analysis,

multidimensional scaling, factor analysis, principal component analysis, and cluster analysis, to name a few. A useful overview is given in [3]. In pattern recognition, data analysis is concerned with predictive modeling: given some training data, we want to predict the behavior of unseen test data. This task is also referred to as learning. Often, a clear distinction is made between learning problems that are (i) supervised (classification) or (ii) unsupervised (clustering), the former involving only labeled data (training patterns with known category labels) while the latter involves only unlabeled data [4]. Clustering is a more difficult and challenging problem than classification. There is a growing interest in a hybrid setting, called semi-supervised learning [5]; in semi-supervised classification, the labels of only a small portion of the training data set are available. The unlabeled data, instead of being discarded, are also used in the learning process. In semi-supervised clustering, instead of specifying the class labels, pair-wise constraints are specified, which is a weaker way of encoding the prior knowledge. A pair-wise must-link constraint corresponds to the requirement that two objects should be assigned the same cluster label, whereas the cluster labels of two objects participating in a cannot-link constraint should be different. Constraints can be particularly beneficial in data clustering [6], where precise definitions of underlying clusters are absent. In the search for good models, one would like to include all the available information, no matter whether it is unlabeled data, data with constraints, or labeled data. Figure 1 illustrates this spectrum of different types of learning problems of interest in pattern recognition and machine learning.

1.2 What is clustering?


Cluster analysis or clustering is the process of assigning similar data to the same group or subset (called a cluster) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including

machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics. Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology, and typological analysis. Examples of clusters:

Fig 1.1: Examples of clusters: clustered nodes (unsupervised) and clustered data (supervised)

1.3 Why clustering?


Cluster analysis is prevalent in any discipline that involves analysis of multivariate data. A search via Google Scholar [7] found 1,660 entries with the words data clustering that appeared in 2007 alone. This vast literature speaks to the importance of clustering in data analysis. It is difficult to exhaustively list the numerous scientific fields and applications that have utilized clustering techniques as well as the thousands of published algorithms. Image segmentation, an important problem in computer vision, can be formulated as a clustering problem [8]. Documents can be clustered [9] to generate topical hierarchies for efficient information access [10] or retrieval [11]. Clustering is also used to group customers into different types for efficient marketing [12], to group services delivery engagements for workforce management and planning [13] as well as to study genome data [14] in biology. Data clustering has been used for the following three main purposes.

a. Underlying structure: to gain insight into data, generate hypotheses, detect anomalies, and identify salient features.
b. Natural classification: to identify the degree of similarity among forms or organisms (phylogenetic relationship).
c. Compression: as a method for organizing the data and summarizing it through cluster prototypes.

1.4 Determining the number of clusters


In clustering, determining the appropriate number of clusters is a difficult problem. Determining the number of clusters in a data set, a quantity often labeled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain class of clustering algorithms (in particular k-means, k-medoids, and the expectation-maximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect. Other algorithms such as DBSCAN and OPTICS do not require the specification of this parameter; hierarchical clustering avoids the problem altogether. The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision. An important application field of clustering is image processing. Clustering is used to process images; for example, image segmentation based on clustering is useful for determining the regions, boundaries, and contours of an image.

1.5 What is Image segmentation?


In computer vision, segmentation refers to the process of partitioning a digital image into multiple segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain visual characteristics. The result of image segmentation is a set of segments that collectively cover the entire image, or a set of contours extracted from the image (see edge detection). Each of the pixels in a region is similar with respect to some characteristic or computed property, such as color, intensity, or texture. Adjacent regions are significantly different with respect to the same characteristic(s). When applied to a stack of images, typical in medical imaging, the resulting contours after image segmentation can be used to create 3D reconstructions with the help of interpolation algorithms such as marching cubes. There are many methods for segmentation; one of the most widely used is clustering.

1.6 The scope of the thesis


In this thesis we focus on different clustering algorithms, determining the number of clusters, cluster validation, and segmenting images into clusters. We aim to understand the fields related to clustering, understand existing clustering algorithms and identify their taxonomy, study the existing methods for determining the number of clusters and deduce a method for clustering validation, and study the field of image segmentation to establish a relationship between clustering and image segmentation. Utilizing the implemented k-means method, the image models are grouped into coherent clusters.

The thesis outline is presented in Figure 1.2.

Fig 1.2: Thesis outline: understanding clustering (Ch. 3), K-means clustering (Ch. 4), determining the number of clusters (Ch. 5), color image representation and image segmentation (Ch. 6)

Chapter 2 describes related work on clustering and on image segmentation using clustering. Chapter 3 defines different clustering algorithms, and in Chapter 4 the k-means clustering algorithm is analyzed. Chapter 5 is about clustering validation, where we propose a method for clustering validation. Chapter 6 is about color images, image segmentation, and its implementation. In Chapter 7 we present our results.

Chapter 2 Related work

One of the main differences between retrieval systems that are based on image clustering lies in the way the clustering is performed. The clustering process may be a supervised process using human intervention, or an unsupervised process. Another basic difference lies in the image representation itself, whether it entails global image information or contains local information as well.

2.1 Supervised clustering


In the following systems, clustering is performed in a supervised manner and the cluster representation is used for the retrieval task. Huang et al. [15] proposed a method for hierarchical classification of images via supervised learning, assuming a training set of images with known class labels is available. Using banded color correlograms for the training images, the feature data is modeled using Singular Value Decomposition (SVD) (Golub and Van Loan [16]) to reduce noise and dimensionality. Using the SVD representation, each of the images in the training data is classified using the nearest-neighbor algorithm (Duda et al. [17]). Based on the performance of this classification, the set of classes is partitioned into two sub-classes such that the intra-subclass association is maximized while the inter-subclass association is minimized. The subclasses and those training images that were correctly classified with respect to the subclasses are used recursively to obtain a hierarchical classification tree. This tree can be used to categorize new images. The method was tested on some representative classes from the COREL database and the tree built was consistent with the color content of the classes. Sheikholeslami et al. [18] investigated a feature-based approach to image clustering and retrieval. Images in a database are categorized in clusters on the basis of their similarity to a set of iconic images termed cluster icons. The cluster icons are application dependent and are determined by the application expert; therefore prior knowledge of the database and human intervention are required. The clustering is then performed in an unsupervised manner based on these icons. To cluster database images, all images are first decomposed into subsegments.

Four different texture-based feature sets are extracted from each subsegment using Haar and Daubechies wavelet transforms. Using the multi-resolution property of wavelets, the features are extracted at different levels. In comparing images to cluster feature vectors, only coarse-scale features are used, whereas in retrieving images, all the features are compared. An image can be assigned to one or more clusters. During the retrieval process a query icon is introduced to the system. The distance between the query icon and the cluster icons is first computed. Clusters for which this distance is less than a threshold are searched to retrieve relevant images. The distance between a database image and an icon is considered as the minimum of the distances between its subsegments and the icon. This clustering approach was demonstrated on a database of air photo images where texture is an important feature. Experiments showed that, for a good feature set, retrieval from the clustered database was as effective as retrieval from the whole database and was more efficient (it required fewer comparisons between the query icon and database images). Carson et al. [19] used a naive Bayes algorithm to learn image categories in a supervised learning scheme. The images in this work are represented by a set of homogeneous regions in a feature space of color and texture, based on the Blobworld image representation (Carson et al. [20]). The suggested framework entails learning blob-rules per category on a labelled training set. Thus, one may argue that there is a shift to a high-level image description (image labelling). Each query image is compared with the extracted category models, and associated with the closest matching category. A probabilistic and continuous framework for supervised image category modeling and matching was recently introduced by Greenspan et al. [12]. A generalized GMM-KL framework is described in which each image or image set (category) is represented as a Gaussian mixture distribution. Images (categories) are compared and matched via a probabilistic measure of similarity between distributions known as the Kullback-Leibler (KL) distance (Cover and Thomas [7]). The KL-distance is applied in an image retrieval task and results indicate better performance when compared to histograms. The main drawback of supervised clustering is that it requires human intervention. In

order to extract the cluster representation, the various methods mentioned above require a priori knowledge regarding the database content (e.g., which clusters exist in the database, or a labelled training set). This approach is therefore not appropriate for large unlabelled databases.

2.2 Unsupervised clustering


A different set of works is based on unsupervised clustering, where the clustering process is fully automated. A distinction is made between works using global image representations and works using local image representations. Local image representations contain additional information related to the image spatial layout.

2.2.1 Unsupervised clustering using global image representations

The system presented by Chen et al. [21] is based on a global image representation including global color, texture and edge histograms. Histograms are the classical means of representing image content. A histogram is a discrete representation of the continuous feature space. The feature space partition is determined by the features chosen (e.g. the color space representation), by the quantization scheme chosen (such as uniform or vector quantization), as well as by computational and storage considerations. The advantages and disadvantages of color histograms are well studied and many variations exist (Pass and Zabih [20], Stricker and Dimai [22], Huang et al. [23]). Chen et al. used the global image representation and the L1 norm as a distance measure between image representations for the retrieval scheme introduced in their work. The work focuses on the use of hierarchical tree structures to both speed up search-by-query and organize databases for effective browsing. Both top-down and bottom-up unsupervised clustering algorithms are presented. While the top-down method is well suited for the fast search problem, the bottom-up method yields better results for the browsing application. The top-down algorithm works by successively splitting nodes of the tree into K children, using the K-means algorithm, working from the root to the leaves of the tree. The bottom-up (agglomerative) clustering algorithm works by first forming a complete matrix of distances between the images and then using this matrix to sequentially group together elements. The work presents a fast-sparse clustering

method which dramatically reduces both the memory and computation required for the agglomerative algorithm. A hierarchical browsing environment called a similarity pyramid is constructed based on the results of the agglomerative clustering algorithm. The similarity pyramid groups similar images together while allowing the users to view the database at various levels of resolution. Results for search experiments are presented and the tradeoff between search accuracy and speed is determined. A common characteristic of the histogram representation is the discretization of the feature space. The binning of the space involves loss of information. A binning that is too coarse will not have sufficient discriminative power, while one that is too fine will place similar features in different bins, which will never be matched. A second major characteristic of the global histogram is that it captures only global feature distributions of the images, and lacks information about spatial relationships of the image features. A shift to a more localized representation, which reflects spatial information from the image plane, may be desired.

2.2.2 Unsupervised clustering using region-based image representations

The following set of works incorporates region-based approaches for the image representation. Since image regions are the basic building blocks in forming the visual content of an image, they have great potential in representing the image content as well as the category content. Abdel-Mottaleb et al. [1, 16] presented a technique for image retrieval from large databases based on local color features. Images in the database are divided into rectangular regions and represented by a set of normalized histograms corresponding to these rectangular regions. The size of the regions is an important parameter. Regions should be small enough to emphasize the local color and large enough to offer statistically valid histograms. The similarity measure between two images is expressed as the sum of similarities between histograms of the corresponding rectangular regions. The influence of various similarity measures on the proposed framework was tested in [24]. The unsupervised image-clustering scheme introduced in [16] included an agglomerative clustering algorithm based on the image representation and the optimal similarity measure found in [24]. The cluster centroid is calculated as the average of the histograms representing a selected set of images. The agglomerative algorithm is associated with an optimization step in which

each image re-evaluates its location and moves to a new cluster if required. Similar images are first grouped into clusters and the cluster centers are calculated. During the retrieval process the query image is initially compared with all the cluster centers. Then a subset of clusters that have the largest similarity to the query image is chosen and all the images in these clusters are compared with the query image. The experimental evaluation shows that using clustering for retrieval offers a high retrieval accuracy with a considerable reduction in the number of similarity comparisons required. In [17] the tree structure was used for the implementation of a browsing environment. A hierarchical model that imposes coarse-to-fine structure on the image collection is suggested by Bernard et al. [2, 3]. The statistical model presented integrates semantic information provided by associated text and visual information provided by image features. By integrating the two kinds of information during model construction, the system learns links between the image features and semantics, which can be exploited for better browsing and search. Each node in the tree has some probability of generating each word, and similarly, each node has some probability of generating an image segment with given features. A node is uniquely specified by cluster and level. The documents belonging to a given cluster are modeled as being generated by the nodes along the path from the leaf corresponding to the cluster up to the root node. Taking all clusters into consideration, a document is modeled by a sum over the clusters, weighted by the probability that the document is in the cluster. The probability of generating a word is simply tabulated, being determined by the appropriate word counts during training. For image segments, Gaussian distributions over a number of features (such as color, texture, shape, position and size) are used in a way similar to the Blobworld representation (Carson et al. [5]). A document is represented as a sequence of words and a sequence of segments. Training the model is done on the entire image collection using the Expectation-Maximization algorithm. The clustering process introduced is claimed to be remarkably successful, since the text associated with the images typically emphasizes properties that are very hard to determine with computer vision techniques, but omits the visually obvious, and so the text and the images are complementary.


Chapter 3 Clustering algorithms


3.1 Aim of clustering
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. There are many methods to cluster the data, and the main problem is to determine which algorithm is best for a specific case. It can be shown that there is no absolute best criterion that would be independent of the final aim of the clustering. Consequently, the optimality of different clustering algorithms depends on the problem space. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding natural clusters and describing their unknown properties (natural data types), in finding useful and suitable groupings (useful data classes), or in finding unusual data objects (outlier detection).

Fig 3.1: Data clustering

Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data


clustering (unsupervised learning) from classification or discriminant analysis (supervised learning).

3.2 Requirements:
The main requirements that a clustering algorithm should satisfy are:

Scalability;
Dealing with different types of attributes;
Discovering clusters with arbitrary shape;
Minimal requirements for domain knowledge to determine input parameters;
Ability to deal with noise and outliers;
Insensitivity to the order of input records;
High dimensionality;
Interpretability and usability.

3.3 Existing Problems


There are a number of problems with clustering. Among them:

Current clustering techniques do not address all the requirements adequately (and concurrently);
Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;
The effectiveness of the method depends on the definition of distance (for distance-based clustering);
If an obvious distance measure doesn't exist, we must define it, which is not always easy, especially in multi-dimensional spaces;
The result of the clustering algorithm (which in many cases can be arbitrary itself) can be interpreted in different ways.

To study clustering algorithms, we should first determine our requirements, then identify the existing problems of current clustering algorithms, and finally classify the different clustering algorithms. The following figure describes the basic steps in clustering.

3.4 Classification of clustering algorithms:


Clustering algorithms can be classified according to the following characteristics:

In some clustering algorithms, clustering is performed according to a predefined number of clusters. Barring knowledge of the proper value beforehand, the appropriate value must be determined; the main problem is that some method must be developed to determine the appropriate number of clusters. Example: the K-means clustering algorithm.

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms are usually either agglomerative ("bottom-up") or divisive ("top-down"). Example: hierarchical clustering.

Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in hierarchical clustering.

Density-based clustering algorithms are devised to discover arbitrarily shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. Examples: DBSCAN, OPTICS.

Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods can thus ignore irrelevant attributes. The general problem is also known as correlation clustering, while the special case of axis-parallel subspaces is also known as two-way clustering, co-clustering, or bi-clustering: in these methods not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. They usually do not, however, work with arbitrary feature combinations as general subspace methods do, but this special case deserves attention due to its applications in bioinformatics.

The following figure shows a taxonomy of clustering algorithms:

Fig 3.3: Taxonomy of clustering algorithms

3.5 Data representation


Data representation is one of the most important factors that influence the performance of a clustering algorithm. If the representation (choice of features) is good, the clusters are likely to be compact and isolated and even a simple clustering algorithm such as K-means will find them. Unfortunately, there is no universally good representation; the choice of representation must be guided by domain knowledge. Figure 3.4(a) shows a dataset where K-means fails to partition it into the two natural clusters. The dashed line in Figure 3.4(a) shows the partition obtained by K-means. However, when the same data points are represented using the top two eigenvectors of the RBF similarity matrix computed from the data, as in Figure 3.4(b), they become well separated, making it trivial for K-means to cluster the data [25].


Figure 3.4: Importance of a good representation. (a) "Two rings" dataset where K-means fails to find the two natural clusters; the dashed line shows the linear cluster separation boundary obtained by running K-means with K = 2. (b) A new representation of the data in (a) based on the top 2 eigenvectors of the graph Laplacian of the data, computed using an RBF kernel; K-means can now easily detect the two clusters.

3.6 Similarity measurement:


A clustering algorithm identifies and groups similar data into a single cluster, so similarity measurement is central to any clustering algorithm. Similarity can take different forms. If we describe the problem space in a 2D or 3D coordinate system, then distance can be the similarity criterion, and its definition determines how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (x = 1, y = 0) and the origin (x = 0, y = 0) is always 1 according to the usual norms, but the distance between the point (x = 1, y = 1) and the origin can be 2, \sqrt{2}, or 1 if you take, respectively, the 1-norm, 2-norm, or infinity-norm distance.


3.6.1 Common distance functions:

The Euclidean distance: In Cartesian coordinates, if p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) are two points in Euclidean n-space, then the distance from p to q, or from q to p, is given by

d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}

The Manhattan distance: The taxicab distance, d_1, between two vectors p, q in an n-dimensional real vector space with a fixed Cartesian coordinate system is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes. More formally,

d_1(p, q) = \sum_{k=1}^{n} |p_k - q_k|

Fig 3.5: Manhattan distance


The maximum norm: In mathematical analysis, the uniform norm assigns to real- or complex-valued bounded functions f defined on a set S the nonnegative number

||f||_\infty = \sup \{ |f(x)| : x \in S \}

This norm is also called the supremum norm, the Chebyshev norm, or the infinity norm.

The Mahalanobis distance corrects the data for different scales and correlations in the variables. Formally, the Mahalanobis distance of a multivariate vector x = (x_1, x_2, ..., x_N)^T from a group of values with mean \mu = (\mu_1, \mu_2, ..., \mu_N)^T and covariance matrix S is defined as

D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}

The angle between two vectors can be used as a distance measure when clustering high-dimensional data. The Hamming distance measures the minimum number of substitutions required to change one member into another.

3.6.2 General equation of distance measures:

The common distance functions above are special cases of the Minkowski distance,

d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{ik} - x_{jk}|^p \right)^{1/p}

For p = 2 this gives the Euclidean distance, and for p = 1 the Manhattan distance.
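To make these formulas concrete, the following is a small illustrative sketch (not part of the thesis implementation) that evaluates the Minkowski, maximum-norm, and Mahalanobis distances in NumPy; the example point (1, 1) reproduces the 1-norm/2-norm/infinity-norm comparison mentioned in Section 3.6.

```python
import numpy as np

def minkowski(x, y, p):
    """General distance of Section 3.6.2: p=1 gives Manhattan, p=2 Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chebyshev(x, y):
    """Maximum (infinity) norm distance."""
    return np.max(np.abs(x - y))

def mahalanobis(x, mu, S):
    """Mahalanobis distance of x from a group with mean mu and covariance S."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

x, origin = np.array([1.0, 1.0]), np.zeros(2)
print(minkowski(x, origin, 1))            # 2.0        (Manhattan)
print(minkowski(x, origin, 2))            # 1.4142...  (Euclidean, sqrt(2))
print(chebyshev(x, origin))               # 1.0        (infinity norm)
print(mahalanobis(x, origin, np.eye(2)))  # identity covariance: equals the Euclidean distance
```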


3.7 Purpose of grouping


The representation of the data is closely tied to the purpose of grouping. The representation must go hand in hand with the end goal of the user. An example dataset of 16 animals represented using 13 Boolean features was used in [26] to demonstrate how the representation affects the grouping. The animals are represented by 13 Boolean features related to their appearance and activity. When a large weight is placed on the appearance features compared to the activity features, the animals were clustered into mammals vs. birds. On the other hand, a large weight on the activity features clustered the dataset into predators vs. non-predators. Both of these partitionings, shown in Figure 3.7, are equally valid, and uncover meaningful structures in the data. It is up to the user to carefully choose the representation to obtain a desired clustering.

Figure 3.7: Different weights on features result in different partitionings of the data. Sixteen animals are represented based on 13 Boolean features related to appearance and activity. (a) Partitioning with large weights assigned to the appearance-based features; (b) a partitioning with large weights assigned to the activity features. The figures in (a) and (b) are excerpted from [26] and are known as heat maps, where the colors represent the density of samples at a location; the warmer the color, the larger the density.


3.8 Number of Clusters


Automatically determining the number of clusters has been one of the most difficult problems in data clustering. Most methods for automatically determining the number of clusters cast it as a problem of model selection. Usually, clustering algorithms are run with different values of K, and the best value of K is then chosen based on a predefined criterion. Figueiredo and Jain [27] used the minimum message length (MML) criterion [28] in conjunction with the Gaussian mixture model (GMM) to estimate K. Their approach starts with a large number of clusters, and gradually merges the clusters if this leads to a decrease in the MML criterion. A related approach, but using the principle of Minimum Description Length (MDL), was used in [29] for selecting the number of clusters. Other criteria for selecting the number of clusters are the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC). The gap statistic [30] is another commonly used approach for deciding the number of clusters. The key assumption is that when dividing data into the optimal number of clusters, the resulting partition is most resilient to random perturbations. The Dirichlet Process (DP) [30] introduces a non-parametric prior on the number of clusters. It is often used in probabilistic models to derive a posterior distribution over the number of clusters, from which the most likely number of clusters can be computed. In spite of these objective criteria, it is not easy to decide which value of K leads to more meaningful clusters. Figure 3.8(a) shows a two-dimensional synthetic dataset generated from a mixture of six Gaussian components. The true labels of the points are shown in Figure 3.8(e). When a mixture of Gaussians is fit to the data with 2, 5, and 6 components, shown in Figures 3.8(b)-(d), respectively, each one of them seems to be a reasonable fit.
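As an illustration of this model-selection view, the sketch below fits Gaussian mixture models with different numbers of components and keeps the one with the lowest BIC. The toy dataset, the candidate range of K, and the use of scikit-learn are assumptions made only for this illustration, not the setup of the thesis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: a mixture of three well-separated Gaussians in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2))
               for c in [(0, 0), (4, 0), (2, 4)]])

# Fit GMMs for a range of K and pick the K with the lowest BIC.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 8)}
best_k = min(bic, key=bic.get)   # lowest BIC wins
print(bic)
print("selected K:", best_k)
```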


Figure 3.8 Automatic selection of number of clusters, K. (a) Input data generated from a mixture of six Gaussian distributions; (b)-(d) Gaussian mixture model (GMM) fit to the data with 2, 5 and 6 components, respectively; (e) true labels of the data.

3.9 Cluster validity

Clustering algorithms tend to find clusters in the data irrespective of whether or not any clusters are present. Figure 3.9(a) shows a dataset with no natural clustering; the points here were generated uniformly in a unit square. However, when the K-means algorithm is run on this data with K = 3, three clusters are identified, as shown in Figure 3.9(b)! Cluster validity refers to formal procedures that evaluate the results of cluster analysis in a quantitative and objective fashion [31]. In fact, even before a clustering algorithm is applied to the data, the user should determine whether the data even has a clustering tendency [32].


Figure 3.9 Cluster validity. (a) A dataset with no natural clustering; (b) K-means partition with K = 3.

Cluster validity indices can be defined based on three different criteria: internal, relative, and external [29]. Indices based on internal criteria assess the fit between the structure imposed by the clustering algorithm (clustering) and the data using the data alone. Indices based on relative criteria compare multiple structures (generated by different algorithms, for example) and decide which of them is better in some sense. External indices measure the performance by matching the cluster structure to a priori information, namely the true class labels (often referred to as ground truth). Typically, clustering results are evaluated using the external criterion, but if the true labels are available, why even bother with clustering? The notion of cluster stability [33] is appealing as an internal stability measure. Cluster stability is measured as the amount of variation in the clustering solution over different sub-samples drawn from the input data. Different measures of variation can be used to obtain different stability measures. In [26], a supervised classifier is trained on one of the subsamples of the data, using the cluster labels obtained by clustering the subsample as the true labels. The performance of this classifier on the testing subset(s) indicates the stability of the clustering algorithm. In model-based algorithms (e.g., centroid-based representation of clusters in K-means, or Gaussian Mixture Models), the distance between the models found for different subsamples can be used to measure the stability [34]. Shamir and Tishby [22] define stability as the generalization ability of a clustering algorithm (in the PAC-Bayesian sense). They argue that since many algorithms can be shown to be asymptotically stable, the rate at which the asymptotic stability is reached with respect to the number of samples is a more useful measure of cluster stability. Cross-validation is a widely used evaluation method in supervised learning. It


has been adapted to unsupervised learning by replacing the notion of prediction accuracy with a different validity measure. For example, given the mixture models obtained from the data in one fold, the likelihood of the data in the other folds serves as an indication of the algorithm's performance, and can be used to determine the number of clusters K. Among the various types of clustering algorithms, after determining our problem space and requirements, we decided to study the K-means clustering algorithm.


Chapter 4 K-means algorithm

4.1 Overview:
4.1.1 What is K-means?

The k-means clustering method tries to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data, and both employ an iterative refinement approach. Let X = {x_i}, i = 1, 2, ..., n, be the set of n d-dimensional points to be clustered into a set of K clusters, C = {c_k}, k = 1, 2, ..., K. The K-means algorithm finds a partition such that the squared error between the empirical mean of a cluster and the points in the cluster is minimized. Let \mu_k be the mean of cluster c_k. The squared error between \mu_k and the points in cluster c_k is defined as

J(c_k) = \sum_{x_i \in c_k} || x_i - \mu_k ||^2

The goal of K-means is to minimize the sum of the squared error over all K clusters,

J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} || x_i - \mu_k ||^2

Minimizing this objective function is known to be an NP-hard problem (even for K = 2) [35]. Thus K-means, which is a greedy algorithm, can only converge to a local minimum, even though a recent study has shown that, with large probability, K-means can converge to the global optimum when the clusters are well separated [36]. K-means starts with an initial partition with K clusters and assigns patterns to clusters so as to reduce the squared error. Since the squared error always decreases with an


increase in the number of clusters K (with J(C) = 0 when K = n), it can be minimized only for a fixed number of clusters.

4.1.2 Basic Steps

The main steps of the K-means algorithm are as follows:

I. Choose K initial cluster centers z_1(1), z_2(1), ..., z_K(1).

II. At the k-th iterative step, distribute the samples {x} among the K clusters using the relation

x \in S_j(k)  if  || x - z_j(k) || < || x - z_i(k) ||

for all i = 1, 2, ..., K, i \neq j, where S_j(k) denotes the set of samples whose cluster center is z_j(k).

III. Compute the new cluster centers z_j(k+1), j = 1, 2, ..., K, such that the sum of the squared distances from all points in S_j(k) to the new cluster center is minimized. The measure which minimizes this is simply the sample mean of S_j(k). Therefore, the new cluster center is given by

z_j(k+1) = \frac{1}{N_j} \sum_{x \in S_j(k)} x,   j = 1, 2, ..., K,

where N_j is the number of samples in S_j(k).

IV. If z_j(k+1) = z_j(k) for j = 1, 2, ..., K, then the algorithm has converged and the procedure is terminated. Otherwise go to Step II.
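A minimal NumPy sketch of steps I-IV is given below. It is a generic illustration of the algorithm as described above, not the exact implementation used later in the thesis; the random initialization in step I is one common choice among several.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Basic K-means: X is an (n, d) array; returns labels and cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]           # Step I
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step II: assign each sample to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step III: recompute each center as the mean of its assigned samples.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        # Step IV: stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```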


4.1.3 Demonstration of the standard algorithm:

1) k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).

2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.

3) The centroid of each of the k clusters becomes the new mean.

4) Steps 2 and 3 are repeated until convergence has been reached.

Fig 4.1: Demonstration of the K-means algorithm

It is clear from this description that the final clustering will depend on the initial cluster centers chosen and on the value of K. This requires some prior knowledge of the number of clusters present in the data; in practical cases this is not always possible or feasible, so there must be some process to determine the number of clusters.

4.2 Starting point selection:


In this section, two motivational examples illustrate the fact that choosing different starting point values leads to different clusters with different error values.
Figure 4.2: Starting Points Example 1a
Figure 4.3: Starting Points Example 1b

First, consider a workload with a single feature. Suppose that the feature values range from 1.0 to 5.0 and that there are seven samples at 1.0, one sample at 3.0, and one at 5.0. Figures 4.2 and 4.3 show the results of running the clustering algorithm on the data set when trying to find two clusters. Figure 4.2 shows the two clusters that are found when the initial starting points are 1.7 and 5.1. Eight data points are placed in the first cluster, with a centroid value of 1.25. The second cluster consists of one data point, with a centroid value of 5.0. The error function value is E1 = 3.5. Figure 4.3 shows the two clusters that are found when the initial starting points are 1.5 and 3.5. With these starting points, one cluster consists of the seven data points with a centroid value of 1.0 and the other cluster consists of two data points with a centroid value of 4.0. This clustering leads to an error function value of E2 = 2.0. This simple example illustrates that the selection of starting point values is important when trying to determine the best representation of a workload. In comparing Figures 4.2 and 4.3 to determine which of the two clusterings is better, a case can be made that a tight cluster is centered on 1.0, with the other two data points being outliers. This is captured in the second (i.e., Figure 4.3) clustering and also results in a lower E2 value. Thus, lower values of Ek are taken as better clustering representations. A second motivational example is illustrated in Figures 4.4-4.9, which show the clusters resulting from clustering on a workload with two features and 11 sample data points. When finding a single cluster, the initial starting point value does not matter because all data points will be assigned to the same cluster. Figure 4.4 shows the result of clustering when determining one cluster. The error function value is E1 = 15.2. There are an infinite number of alternatives for initial starting values when trying to find more than one cluster. Figure 4.5 shows the best (i.e., lowest E2 value) result for this example when determining two clusters. Nine data points are in one cluster and two data points in the other cluster.


Figure 4.4: Starting Points Example 2 (One Cluster)
Figure 4.5: Starting Points Example 2 (Two Clusters)

The resulting error function value is E2 = 2.6. When finding three clusters, two starting point alternatives are evaluated. The results from these two alternatives are shown in Figures 4.6 and 4.7. In Figure 4.6, the starting points are (1.8, 1.0), (1.0, 2.7), and (2.0, 2.7). The figure shows that the algorithm places seven data points in one cluster and two data points in each of the other clusters. The error function value is E3,alt1 = 4.0. In Figure 4.7, the starting points are (1.8, 1.0), (1.4, 2.0), and (1.4, 4.0). The algorithm again places seven data points in one cluster and two data points in each of the other clusters, but the error function value is E3,alt2 = 1.0, indicating an improved clustering. Figure 4.8 shows the clustering result when finding four clusters, using (1.8, 1.0), (0.8, 2.0), (2.3, 2.0), and (1.3, 4.0) as the starting values. In this case, the error function value is E4 = 0.5. Figure 4.9 shows the plot of the error function value Ek as a function of the number of clusters, k. As shown, the two starting point alternatives give different error function values. Intuitively, as more clusters are added, the error function value should decrease. Alternative 1 shows anomalous behavior, where E2 < E3,alt1, indicating that adding an extra cluster actually hurts the error measure. Identifying the knee of the curve for Alternative 1 in Figure 4.9, one could conclude that two clusters (i.e., Figure 4.5) best represent this data set. Better, more reasonable methods for determining the initial starting points should not exhibit such anomalous behavior.
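The single-feature example can be verified with a tiny sketch; here the error is taken as the sum of absolute distances of the points to their centroids, which reproduces the values quoted in the text (3.5 and 2.0). This is a minimal illustration, not the thesis implementation.

```python
import numpy as np

def kmeans_1d(points, centers, iters=20):
    """1-D k-means with fixed initial centers; returns centers and the
    error Ek = sum of absolute distances of points to their centroids."""
    points, centers = np.asarray(points, float), np.asarray(centers, float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        labels = np.abs(points[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([points[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(len(centers))])
    return centers, np.sum(np.abs(points - centers[labels]))

data = [1.0] * 7 + [3.0, 5.0]
print(kmeans_1d(data, [1.7, 5.1]))   # centroids ~[1.25, 5.0], error 3.5
print(kmeans_1d(data, [1.5, 3.5]))   # centroids ~[1.0, 4.0],  error 2.0
```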


Figure 4.6: Starting Points Example 2 (Three Clusters, Alternative 1)
Figure 4.7: Starting Points Example 2 (Three Clusters, Alternative 2)
Figure 4.8: Starting Points Example 2 (Four Clusters)
Figure 4.9: Starting Points Example 2 (Ek analysis: sum of distances to centroids, Ek, versus number of clusters k, for Alternatives 1 and 2)

4.3 Experimental Case Study


For this case study, we generated over 4000 data points drawn from a Gaussian mixture, generated using normal distributions. Five features were identified as being descriptive of each job (i.e., data point): number of records processed, number of processors used, response time, processor time consumed, and job resource type.
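The exact generation parameters of the case-study workload are not reproduced here; the sketch below merely illustrates how a five-feature Gaussian-mixture dataset of this size could be generated, with entirely hypothetical component means and scales.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical component means for the five job features
# (records, processors, response time, CPU time, resource type) -- illustrative only.
means = np.array([
    [1e4,  2,   5.0,   3.0, 1],
    [5e5, 16,  60.0,  45.0, 2],
    [2e6, 64, 300.0, 250.0, 3],
])
samples_per_component = 1400

# Draw roughly equal numbers of jobs around each component mean.
X = np.vstack([rng.normal(loc=m, scale=0.1 * np.abs(m) + 1e-6,
                          size=(samples_per_component, means.shape[1]))
               for m in means])
print(X.shape)   # about 4200 data points with 5 features
```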


Chapter 5 Determining the number of clusters

5.1 Necessity of determining the number of clusters K


In K-means clustering we have to specify the number of clusters into which to cluster the data set. The correct choice of k is often ambiguous, and it is usually not known before clustering the dataset. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster. Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow.

5.2 Existing Methods


There are several categories of methods for making this decision, known as clustering validity methods. Here we study the existing methods and try to propose a method for clustering validation.

a. Rule of thumb: One simple rule of thumb sets the number of clusters to k \approx \sqrt{n/2}, where n is the number of data points in the dataset [41].

b. Elbow method: Choose a number of clusters so that adding another cluster does not give much better modeling of the data. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified [42].


Fig: Explained Variance. The red circle indicates the elbow. The number of clusters chosen should therefore be 4.

c. Information criterion approach: Another set of methods for determining the number of clusters are information criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or the Deviance information criterion (DIC), if it is possible to construct a likelihood function for the clustering model. For example, the k-means model is "almost" a Gaussian mixture model, and one can construct a likelihood for the Gaussian mixture model and thus also determine information criterion values [43].

d. Information theory approach: Rate distortion theory has been applied to choosing k in the so-called "jump" method, which determines the number of clusters that maximizes efficiency while minimizing error by information-theoretic standards. The strategy of the algorithm is to generate a distortion curve for the input data by running a standard clustering algorithm such as k-means for all values of k between 1 and n, and computing the distortion (described below) of the resulting clustering. The distortion curve is then transformed by a negative power chosen based on the dimensionality of the data. Jumps in the resulting values then signify reasonable choices for k, with the largest jump representing the best choice. The distortion of a clustering of some input data is formally defined as follows: let the data set be modeled as a p-dimensional random variable X, consisting of a mixture distribution of G components with common covariance \Gamma. If c_1, ..., c_K is a set of K cluster centers, with c_X the closest center to a given sample of X, then the minimum average distortion per dimension when fitting the K centers to the data is

d_K = \frac{1}{p} \min_{c_1, ..., c_K} E[ (X - c_X)^T \Gamma^{-1} (X - c_X) ]

This is also the average Mahalanobis distance per dimension between X and the set of cluster centers C. Because the minimization over all possible sets of cluster centers is prohibitively complex, the distortion is computed in practice by generating a set of cluster centers using a standard clustering algorithm and computing the distortion using the result. The choice of the transform power -(p/2) is motivated by asymptotic reasoning using results from rate distortion theory. Let the data X have a single, arbitrary p-dimensional Gaussian distribution, and let K = floor(\alpha^p) for some \alpha greater than zero. Then the distortion of a clustering of K clusters in the limit as p goes to infinity is \alpha^{-2}. It can be seen that asymptotically, the distortion of a clustering raised to the power -(p/2) is proportional to \alpha^p, which by definition is approximately the number of clusters K. In other words, for a single Gaussian distribution, increasing K beyond the true number of clusters, which should be one, causes a linear growth in the transformed distortion. This behavior is important in the general case of a mixture of multiple distribution components. Let X be a mixture of G p-dimensional Gaussian distributions with common covariance. Then for any fixed K less than G, the distortion of a clustering as p goes to infinity is infinite. Intuitively, this means that a clustering of fewer than the correct number of clusters is unable to describe asymptotically high-dimensional data, causing the distortion to increase without limit. If, as described above, K is made an increasing function of p, namely K = floor(\alpha^p), the same result as above is achieved, with the value of the distortion in the limit as p goes to infinity being equal to \alpha^{-2}. Correspondingly, there is the same proportional relationship between the transformed distortion and the number of clusters, K. Putting the results above together, it can be seen that for sufficiently high values of p, the transformed distortion d_K^{-p/2} is approximately zero for K < G, then jumps suddenly and begins increasing linearly for K >= G. The jump algorithm for choosing K makes use of these behaviors to identify the most likely value for the true number of clusters. Although the mathematical support for the method is given in terms of asymptotic results, the algorithm has been empirically verified to work well in a variety of data sets with reasonable dimensionality. In addition to the localized jump method described above, there exists a second algorithm for choosing K using the same transformed distortion values, known as the broken line method. The broken line method identifies the jump point in the graph of the transformed distortion by doing a simple least-squares line fit of two line segments, which in theory will fall along the x-axis for K < G and along the linearly increasing phase of the transformed distortion plot for K >= G. The broken line method is more robust than the jump method in that its decision is global rather than local, but it also relies on the assumption of Gaussian mixture components, whereas the jump method is fully non-parametric and has been shown to be viable for general mixture distributions.

e. Choosing k using the silhouette: The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a datum is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighboring cluster, i.e. the cluster whose average distance from the datum is lowest [44]. A silhouette close to 1 implies the datum is in an appropriate cluster, whilst a silhouette close to -1 implies the datum is in the wrong cluster. Optimization techniques such as genetic algorithms are useful in determining the number of clusters that gives rise to the largest silhouette [45]. (A short sketch of the elbow and silhouette computations is given after this list.)
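As referenced at the end of the list, here is a brief sketch of two of these criteria, the elbow curve (within-cluster squared error versus k) and the average silhouette, computed with scikit-learn; the toy dataset and the range of k are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative data: four Gaussian blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(150, 2))
               for c in [(0, 0), (5, 0), (0, 5), (5, 5)]])

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared errors (used for the elbow curve);
    # the average silhouette tends to peak at the most natural k.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```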


All of these methods share a common goal: to find the clustering which results in compact clusters that are well separated. Here we try to deduce such a validation method.

5.3 Implementation techniques of existing methods:


The random method is a popular way of choosing the initial starting point values. It randomly chooses k of the sample data points and uses these as initial starting values. There are variations in the way that the random sample data points are chosen. The algorithm for the random method follows:

1. Determine the maximum number of clusters, n, to be found by the k-means clustering algorithm. A value of n = 10 is used here.
2. Choose n random sample data points.
3. For k = 1 to k = n, run the k-means clustering algorithm, using k of the n sample data points chosen in step 2 as initial starting values. Each time k is increased, another one of the n random sample starting points is added as an initial starting point.

To prevent the anomaly described earlier, the set of data points used when running the algorithm to find k clusters is a subset of the data points used when running the algorithm to find k+1 clusters. For example, if k = 2, using the same sample data point that is used when k = 1 (and simply adding another one of the pre-selected random sample data points as the other initial starting value) prevents anomalous behavior. Also, since different combinations of the jobs in the workload lead to different error values, the k-means clustering algorithm is run multiple times, selecting different random jobs as starting points. The lowest error values are used in the Ek curve. For the results reported below, the clustering algorithm was run 25 times.

The second method for selecting actual sample data points as starting points is breakup. The algorithm for breakup follows:

1. Determine the maximum number of clusters, n, to be found by the k-means clustering algorithm.
2. Run the k-means clustering algorithm for k = 1. (When finding a single cluster the initial starting point does not matter.)
3. For k = 2 to k = n:
   a. From running the algorithm for k-1, determine the most populous cluster.
   b. Within this most populous cluster, find the two data samples that are the farthest away from each other.
   c. Using these two points, together with the k-2 other data points that formed the clusters that were not most populous, the set of k starting points is determined.

This method is reasonable since it seeks to break up the most populous cluster into two smaller clusters; it is hoped that a decreased error value Ek results (a sketch of this splitting step is given at the end of this section). The third method for selecting actual sample data points is feature value sums. The algorithm for feature value sums follows:

1. Each data sample's feature vector entries are added together to give a single value for each job.
2. The entire workload is sorted based on these feature vector sums.
3. For each k, the workload is divided into k equal-sized regions. The set of k median sample data points, one from each region, is used as the k initial starting points.
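As referenced above, a simplified sketch of the breakup step (splitting the most populous cluster at its two mutually farthest samples) is shown below; it is an illustration under the stated assumptions, not the thesis code, and assumes the most populous cluster contains at least two samples.

```python
import numpy as np

def breakup_starting_points(X, labels, centers):
    """Given a (k-1)-cluster result, build k starting points by splitting the
    most populous cluster at its two mutually farthest samples."""
    counts = np.bincount(labels, minlength=len(centers))
    big = counts.argmax()                              # most populous cluster
    members = X[labels == big]
    # Pairwise distances inside the chosen cluster.
    d = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
    i, j = np.unravel_index(d.argmax(), d.shape)
    # Keep the other k-2 centers and add the two far-apart samples.
    others = [c for idx, c in enumerate(centers) if idx != big]
    return np.array(others + [members[i], members[j]])
```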

5.4 Proposed Method


K means algorithm tries to minimize the distance between the center of cluster and elements corresponds to the cluster. After a iterative process the the data set is clustered. We can therefore use the distances of the points from their cluster Centre to determine whether the clusters are compact. For this purpose, we use the intracluster distance measure, which is simply the distance between a point and its cluster Centre and we take the average of all of these distances, defined as:
1 intra =
!

|| ! ||! (5.1)
! !! !!

where N is the number of pixels in the image, K is the number of clusters, and ! is the cluster Centre of cluster ! . We obviously want to minimize this measure. We can also


We can also measure the inter-cluster distance, i.e. the distance between clusters, which we want to be as large as possible. We calculate it as the minimum squared distance between cluster centres:

$$\mathrm{inter} = \min_{i,j}\lVert z_i - z_j\rVert^{2}, \qquad i = 1, 2, \ldots, K-1,\; j = i+1, \ldots, K \qquad (5.2)$$

We take only the minimum of these values because we want the smallest such distance to be maximized; the larger values are automatically bigger than this one. Since we want both of these measures to help us decide whether we have a good clustering, we must combine them in some way. The obvious way is to take their ratio:

$$\mathrm{validity} = \frac{\mathrm{intra}}{\mathrm{inter}} \qquad (5.3)$$

Since we want to minimize the intra-cluster distance, and this measure is in the numerator, we want to minimize the validity measure. We also want to maximize the inter-cluster distance, and since it is in the denominator, we again want to minimize the validity measure. Therefore, the clustering that gives the minimum value of the validity measure tells us the ideal value of K for the k-means procedure.
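A minimal NumPy sketch of equations (5.1)–(5.3), assuming data is an N×d array, labels gives each point's cluster index, and centers is the K×d array of cluster centres (K ≥ 2):

```python
import numpy as np

def intra_distance(data, labels, centers):
    # Mean squared distance of every point to its own cluster centre (Eq. 5.1).
    return np.mean(np.sum((data - centers[labels]) ** 2, axis=1))

def inter_distance(centers):
    # Minimum squared distance between any pair of cluster centres (Eq. 5.2).
    k = len(centers)
    dists = [np.sum((centers[i] - centers[j]) ** 2)
             for i in range(k - 1) for j in range(i + 1, k)]
    return min(dists)

def validity(data, labels, centers):
    # Ratio to be minimized over K (Eq. 5.3).
    return intra_distance(data, labels, centers) / inter_distance(centers)
```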


Chapter 6 Image segmentation


6.1 Overview of Image segmentation
Images are considered one of the most important media for conveying information. Understanding images and extracting information from them so that it can be used for other tasks is an important aspect of machine learning [41]. One example is the use of images for robot navigation; other applications, such as extracting malignant tissue from body scans, form an integral part of medical diagnosis. One of the first steps towards understanding an image is to segment it and find the different objects in it. To do this, features such as histogram plots and frequency-domain transforms can be used. Clustering is a feature-space method of color image segmentation: it classifies pixels into different groups in a predefined color space. K-means clustering is popular because of its simplicity of implementation [16], and it is commonly applied for grouping pixels in images. The choice of color space may have a significant influence on the result of image segmentation. When image segmentation is viewed as a clustering process, several techniques exist; among them, finite mixture distributions and k-means clustering are used in a wide variety of important practical situations. We implemented k-means clustering to segment images. In k-means clustering for image segmentation, the number of segments to be produced corresponds to the number of clusters k in the feature space. Usually k has to be specified in advance: if k is selected correctly, good segmentation can be achieved; otherwise the data points cannot be grouped into appropriate clusters and the image cannot be segmented appropriately. We have already tried to determine the number of clusters in k-means clustering, and we then used the resulting method to implement image segmentation.


6.2 Application of image segmentation


Some of the practical applications of image segmentation are:

- Medical imaging
  o Locate tumors and other pathologies
  o Measure tissue volumes
  o Computer-guided surgery
  o Diagnosis
  o Treatment planning
  o Study of anatomical structure
- Locate objects in satellite images (roads, forests, etc.)
- Face recognition
- Fingerprint recognition
- Traffic control systems
- Brake light detection
- Machine vision
- Agricultural imaging (crop disease detection)

Several general-purpose algorithms and techniques have been developed for image segmentation. Since there is no general solution to the image segmentation problem, these techniques often have to be combined with domain knowledge in order to solve the segmentation problem effectively in a particular domain.

6.3 Image representation


The raw pixel representation of an input image is shifted to a mid-level representation in which the image is represented as a set of coherent regions in feature space. In this work we focus on the color feature; in particular, we model each image as a mixture of Gaussians in the color feature space. It should be noted that the representation model is a general one and can incorporate any desired feature space (such as texture, shape, etc.) or a combination thereof. A color image is represented by three components, for example RGB, XYZ, YIQ, or HSI [44]. There is no theoretical guide for selecting a color space. Here we concentrate on the following color spaces.


(1) RGB (Red, Green and Blue): used for display. (2) YIQ: the color system of TV signals. (3) XYZ: the C.I.E. X-Y-Z color system. (4) x*y*z*: the normalized C.I.E. coordinates. (5) HSI (Hue, Saturation and Intensity): based on human perception. The relationships of the above color systems with RGB are as follows; the transformation matrices are not the standard ones [46].

(1) YIQ:
Y = 0.299R + 0.587G + 0.114B
I = 0.5R − 0.23G − 0.27B
Q = 0.202R − 0.5G + 0.298B

(2) XYZ:
X = 0.618R + 0.177G + 0.205B
Y = 0.299R + 0.587G + 0.114B
Z = 0.056G + 0.944B

(3) x*y*z*:
x* = X / (X + Y + Z)
y* = Y / (X + Y + Z)
z* = Z / (X + Y + Z)

(4) HSI:
I = (R + G + B) / 3
S = 1 − 3 min(R, G, B) / (R + G + B)
H = arctan[ √3 (G − B) / (2R − G − B) ]

With these equations the RGB values can be converted into the other color spaces.
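The linear conversions above can be applied to a whole image with one matrix multiplication per pixel. A short sketch, assuming an H×W×3 RGB image stored as a NumPy array; the coefficient matrices simply restate the YIQ and XYZ equations listed above (the non-linear HSI conversion is omitted):

```python
import numpy as np

# Rows give the coefficients of R, G, B from the equations above.
RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.500, -0.230, -0.270],
                       [0.202, -0.500,  0.298]])

RGB_TO_XYZ = np.array([[0.618, 0.177, 0.205],
                       [0.299, 0.587, 0.114],
                       [0.000, 0.056, 0.944]])

def convert(image, matrix):
    """image: H x W x 3 RGB array; returns the converted H x W x 3 array."""
    flat = image.reshape(-1, 3).astype(float)
    return (flat @ matrix.T).reshape(image.shape)

# Example usage: yiq = convert(rgb_image, RGB_TO_YIQ)
```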


Fig: (a) Original synthetic image; (b) samples of image (a) in RGB color space; (c) samples in XYZ color space.


6.4 Modified K-means Clustering Algorithm for image clustering:


We optimized the clustering algorithm, which assumes that the data features form a 3D vector space, and tries to find natural clusters in them. The steps of the algorithm are as follows (a code sketch is given after this list):

1. Pick K cluster centers, either randomly or based on some heuristic.
2. Assign each pixel in the image to the cluster that minimizes the distance between the pixel and the cluster center.
3. Re-compute the cluster centers by averaging all of the pixels in each cluster.
4. Repeat steps 2 and 3 until convergence is attained.

The distance is typically based on pixel color, intensity, texture, and location, or a weighted combination of these factors; here we use the color information contained in the pixels. The process uses k-means to cluster the pixels into clusters (one per color), with either the Euclidean distance metric or a perceptual-based color distance measuring the distance between pixel values.
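A compact Python sketch of these four steps, treating every pixel as a 3-D colour vector and using plain Euclidean distance with random initialization; a perceptual colour distance would simply replace the distance computation in step 2. The function name and parameters are illustrative, not taken from the thesis code:

```python
import numpy as np

def kmeans_segment(image, k, iters=50, seed=0):
    """Cluster the pixels of an H x W x 3 image into k colour clusters."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 3).astype(float)
    # Step 1: pick k pixels at random as the initial cluster centres.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each pixel to the nearest centre (Euclidean distance).
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # Step 3: recompute each centre as the mean of its assigned pixels.
        new_centers = np.array([pixels[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: stop when the centres no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Replace every pixel by its cluster centre to obtain the segmented image.
    segmented = centers[labels].reshape(image.shape).astype(image.dtype)
    return segmented, labels
```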


Chapter 7 Implementation and Result Analysis


7.1 Implementation Approaches

Our aim was to segment color images using k-means clustering so that the information needed for further image processing can be extracted from the raw image. To fulfil this goal we first generated a 2D sample problem space: a Python program generates 2D points drawn from normal distributions. We then implemented the k-means clustering algorithm to cluster the generated sample data. For k-means we first took k points as initial centers, then updated the centers and calculated the average error, a floating-point number that reflects the average displacement of the points from their centers. We iterated the process until the error became acceptable; there is also a stopping condition on the number of iterations.

Our next goal was to determine the number of clusters, for which we implemented our proposed validation method as an iterative process. Initially we assumed the whole data set forms a single cluster (k = 1) and determined the validity using equation 5.3. Since our aim is to minimize the validity (that is, to minimize the intra-cluster distance relative to the inter-cluster distance), we incremented k and calculated the validity again, terminating when validity(k−1) > validity(k) < validity(k+1), i.e. when validity(k) is a local minimum (a sketch of this loop is given below).

Our final goal was to segment images using the proposed method. We treat an image as a collection of pixels, which can be regarded as a collection of 3D points. One issue was that our k-means code was written for 2D points, so some modification was needed to apply it to image segmentation. In this case we used both the Euclidean distance and a perceptual color distance and compared their results. We implemented all of our work in Python (2.7.2).
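A sketch of the k-selection loop described above, assuming kmeans(data, k) returns (centers, labels) and validity(data, labels, centers) implements equation 5.3 (for example as sketched in Section 5.4); since the inter-cluster distance needs at least two centres, this version starts from k = 2:

```python
def best_k(data, kmeans, validity, k_max=20):
    """Increase k until validity(k) is a local minimum and return that k."""
    scores = {}
    for k in range(2, k_max + 1):
        centers, labels = kmeans(data, k)
        scores[k] = validity(data, labels, centers)
        # Stop when validity(k-2) > validity(k-1) < validity(k),
        # i.e. the previous k was a local minimum of the validity curve.
        if k >= 4 and scores[k - 2] > scores[k - 1] < scores[k]:
            return k - 1, scores
    # Fall back to the k with the smallest validity seen so far.
    return min(scores, key=scores.get), scores
```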


7.2 Why Python?


Python is an interpreted, general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python aims to combine "remarkable power with very clear syntax", and its standard library is large and comprehensive. Its use of indentation for block delimiters is unusual among popular programming languages. Python supports multiple programming paradigms, primarily object-oriented and imperative styles and, to a lesser extent, functional programming. It features a fully dynamic type system and automatic memory management, similar to Scheme, Ruby, Perl, and Tcl. Like other dynamic languages, Python is often used as a scripting language, but it is also used in a wide range of non-scripting contexts. The reference implementation of Python (CPython) is free and open-source software with a community-based development model, as are all or nearly all of its alternative implementations; the non-profit Python Software Foundation manages CPython. Python interpreters are available for many operating systems, and Python programs can be packaged into stand-alone executables for many systems using various tools. Python also has large libraries (SciPy, NumPy, Matplotlib) for scientific work and for large, complex mathematical calculations, much like MATLAB, and the resulting data can be plotted using Matplotlib.

7.3 What are SciPy, NumPy and Matplot?


SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering; it is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world's leading scientists and engineers. Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells (à la MATLAB or Mathematica), web application servers, and six graphical user interface toolkits. Matplotlib tries to make easy things easy and hard things possible: you can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code.

7.4 Experimental Results


We implemented our proposed clustering-validity method and then applied k-means to image segmentation. The numerical and statistical results are plotted as curves, and the relevant images are shown.

7.4.1 Implementation of K-means

We generated 4000 random data points using normal distributions of the form N(μ_i, σ_i), i.e. a mixture of Gaussians. We plotted the points using Matplotlib and obtained the following figure.

Fig: 7.1: generated problem space with 4000 points
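A sketch of how such a problem space can be generated and plotted; the number of mixture components, their means, and their spreads used for Fig. 7.1 are not stated in the text, so the values below are purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Illustrative mixture of Gaussians: 4 components, 1000 points each.
means = [(0, 0), (8, 8), (0, 8), (8, 0)]
points = np.vstack([rng.normal(loc=m, scale=1.0, size=(1000, 2)) for m in means])

plt.scatter(points[:, 0], points[:, 1], s=2)
plt.title("Generated problem space with 4000 points")
plt.show()
```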


The following spreadsheet summarizes the results obtained after applying k-means to the generated data. In the next figure, the intra-cluster distance (Intra) is plotted against k:


Fig: Intra-cluster distance (Intra) versus k, for k = 1 to 18.


From the above figure we can see that Intra is smallest at k = 16, so the number of clusters should be 16. We also plotted the variance against k and obtained the following result:
Fig: Variance versus k, for k = 1 to 18.

From the figure we can see that the variance becomes nearly constant as k increases; it is smallest around k = 16, where the curve is almost a straight line parallel to the x-axis. From this we conclude that the correct number of clusters is probably 15, 16, or 17.

7.4.2 Image segmentation:


We used the k-means method to segment a 200×120 image. The number of clusters equals the number of colors: since we used 32 colors, k is 32. The following images show the original image and the results. We sampled between 2000 and 10000 pixels to observe the change in the segmentation; the execution time also increases with the number of sampled pixels. For 2000 sampled pixels, the execution time was only 5.51 s.


For each number of sampled pixels, a surface plot with transparency and 3D scatter plots of the three distributions were produced.

For 5000 sampled pixels, execution time: 14.13 s.

For 10000 sampled pixels, execution time: 28.23 s.

Image comparison: the original image, followed by the segmentations obtained when sampling 2000, 5000, and 10000 pixels.

Original truecolor image; k-means with 5000 sample points; clustering using the perceptual-based color distance.
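The text does not specify which perceptual-based colour distance produced the clustering above; one common low-cost approximation is the weighted "redmean" RGB distance, sketched here purely as a stand-in that could replace the Euclidean metric in the pixel-assignment step of Section 6.4:

```python
import numpy as np

def redmean_distance(pixels, centers):
    """Weighted RGB distance (the common 'redmean' approximation to
    perceptual colour difference); pixels is N x 3, centers is K x 3,
    both with 8-bit channel values. Returns an N x K distance matrix."""
    rmean = (pixels[:, None, 0] + centers[None, :, 0]) / 2.0
    diff = pixels[:, None, :] - centers[None, :, :]
    weights = np.stack([2 + rmean / 256.0,
                        np.full_like(rmean, 4.0),
                        2 + (255 - rmean) / 256.0], axis=-1)
    return np.sqrt(np.sum(weights * diff ** 2, axis=-1))
```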


Summary
This work presents an overview of the k-means clustering algorithm. K-means clustering is a common way to define classes of jobs within a workload. Simple examples show that determining the appropriate number of clusters has a significant effect on the results of the algorithm, both in the number of clusters found and in their centroids. A case study is described in which different methods of choosing the initial starting points are evaluated. We implemented the k-means clustering algorithm and used it for image segmentation. For smaller numbers of sampled pixels the algorithm gives poor results; for larger numbers of sampled pixels the segmentation becomes very detailed, with many clusters appearing at scattered places in the image, because the Euclidean distance is not a very good metric for segmentation. Better results are obtained using the perceptual-based color distance.


References:

[1] GANTZ, JOHN F. 2008 (March). The diverse and exploding digital universe. Available online at: http://www.emc.com/collateral/analyst-reports/diverseexploding-digital-universe.pdf

[2] TUKEY, JOHN WILDER. 1977. Exploratory data analysis. Addison-Wesley.

[3] TABACHNICK, B. G., & FIDELL, L. S. 2007. Using multivariate statistics. 5 edn. Boston: Allyn and Bacon.

[4] DUDA, R., HART, P., & STORK, D. 2001. Pattern classification. 2 edn. New York: John Wiley & Sons.

[5] CHAPELLE, O., SCHOLKOPF, B., & ZIEN, A. (eds). 2006. Semi-supervised learning. Cambridge, MA: MIT Press.

[6] LANGE, T., LAW, M. H., JAIN, A. K., & BUHMANN, J. 2005 (June). Learning with constrained and unlabelled data. Pages 730-737 of: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1.

[7] 2009 (February). Google Scholar. http://scholar.google.com.

[8] Three Dimensional Object Recognition Systems. Elsevier Science Inc., New York, NY.

[9] IWAYAMA, M., & TOKUNAGA, T. 1995. Cluster-based text categorization: a comparison of category search strategies. Pages 273-281 of: Proceedings of the 18th ACM International Conference on Research and Development in Information Retrieval.

[10] SAHAMI, MEHRAN. 1998. Using machine learning to improve information access. Ph.D. thesis, Computer Science Department, Stanford University.

[Scholkopf et al., 1998] SCHOLKOPF, BERNHARD, SMOLA, ALEXANDER, & MULLER, KLAUS-ROBERT. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319.

[11] BHATIA, S., & DEOGUN, J. 1998. Conceptual clustering in information retrieval. IEEE Transactions on Systems, Man and Cybernetics, 28(B), 427-436.

[12] ARABIE, P., & HUBERT, L. 1994. Advanced methods in marketing research. Oxford: Blackwell. Chap. Cluster Analysis in Marketing Research, pages 160-189.

[13] HU, J., RAY, B. K., & SINGH, M. 2007. Statistical methods for automated generation of service engagement staffing plans. IBM J. Res. Dev., 51(3), 281-293.

[14] BHATIA, S., & DEOGUN, J. 1998. Conceptual clustering in information retrieval. IEEE Transactions on Systems, Man and Cybernetics, 28(B), 427-436.

[15] J. Huang, S. R. Kumar, and R. Zabih. An automatic hierarchical image classification scheme. In ACM Conference on Multimedia, pages 219-228, England, September 1998.

[16] G. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1989.

[17] S. Krishnamachari and M. Abdel-Mottaleb. Image browsing using hierarchical clustering. In Proceedings of the Fourth IEEE Symposium on Computers and Communications, Egypt, July 1999.

[18] G. Sheikholeslami and A. Zhang. Approach to clustering large visual databases using wavelet transform. In Proc. of SPIE Conference on Visual Data Exploration and Analysis IV, volume 3017, San Jose, California, 1997.

[19] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Region-based image querying. In Proc. of the IEEE Workshop on Content-Based Access of Image and Video Libraries (CVPR '97), pages 42-49, 1997.

[20] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1026-1038, 2002.

[21] J. Chen, C. A. Bouman, and J. C. Dalton. Hierarchical browsing and search of large image databases. IEEE Transactions on Image Processing, 9(3):442-455, March 2000.

[22] M. Stricker and A. Dimai. Spectral covariance and fuzzy regions for image indexing. Machine Vision and Applications, 10(2):66-73, 1997.

[23] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih. Image indexing using color correlograms. In Proc. of the IEEE Comp. Vis. and Patt. Rec., pages 762-768, 1997.

[24] M. Abdel-Mottaleb, S. Krishnamachari, and N. Mankovich. Performance evaluation of clustering algorithms for scalable image retrieval. In IEEE Computer Society Workshop on Empirical Evaluation Techniques in Computer Vision, 1998.

[25] NG, ANDREW Y., JORDAN, MICHAEL I., & WEISS, YAIR. 2001. On spectral clustering: Analysis and an algorithm. Pages 849-856 of: Advances in Neural Information Processing Systems 14. MIT Press.

[26] PAMPALK, ELIAS, DIXON, SIMON, & WIDMER, GERHARD. 2003. On the evaluation of perceptual similarity measures for music. Pages 7-12 of: Proceedings of the Sixth International Conference on Digital Audio Effects (DAFx-03). London, UK: Queen Mary University of London.

[27] FRIGUI, H., & KRISHNAPURAM, R. 1999. A robust competitive clustering algorithm with applications in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 450-465.

[28] WALLACE, C. S., & FREEMAN, P. R. 1987. Estimation and inference by compact coding (with discussions). JRSSB, 49, 240-251.

[29] HANSEN, MARK H., & YU, BIN. 2001. Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96(454), 746-774.

[30] TIBSHIRANI, R., WALTHER, G., & HASTIE, T. 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, B, 411-423.

[31] JAIN, ANIL K., & DUBES, RICHARD C. 1988. Algorithms for clustering data. Prentice Hall.

[32] SMITH, STEPHEN P., & JAIN, ANIL K. 1984. Testing for uniformity in multidimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1), 73-81.

[33] LANGE, TILMAN, ROTH, VOLKER, BRAUN, MIKIO L., & BUHMANN, JOACHIM M. 2004. Stability-based validation of clustering solutions. Neural Computation, 16(6), 1299-1323.

[34] VON LUXBURG, U., & BEN-DAVID, S. 2005. Towards a statistical theory of clustering. In: PASCAL Workshop on Statistics and Optimization of Clustering.

[35] DRINEAS, P., FRIEZE, A., KANNAN, R., VEMPALA, S., & VINAY, V. 1999. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3), 9-33.

[36] MEILA, MARINA. 2006. The uniqueness of a good optimum for k-means. Pages 625-632 of: Proceedings of the 23rd International Conference on Machine Learning.

[37] MACQUEEN, J. B. 1967. Some methods for classification and analysis of multivariate observations. Pages 281-297 of: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press. MR0214227. Zbl 0214.46201.

[38] Kanti Mardia et al. 1979. Multivariate Analysis. Academic Press.

[39] David J. Ketchen, Jr. & Christopher L. Shook. 1996. The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal, 17(6), 441-458. doi:10.1002/(SICI)1097-0266(199606)17:6<441::AID-SMJ819>3.0.CO;2-G

[40] Cyril Goutte, Lars Kai Hansen, Matthew G. Liptrot, & Egill Rostrup. 2001. Feature-space clustering for fMRI meta-analysis. Human Brain Mapping, 13(3), 165-183. doi:10.1002/hbm.1031. See especially Figure 14 and the appendix.

[41] Peter J. Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65. doi:10.1016/0377-0427(87)90125-7

[42] R. Lleti, M. C. Ortiz, L. A. Sarabia, & M. S. Sanchez. 2004. Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes. Analytica Chimica Acta, 515, 87-100. doi:10.1016/j.aca.2003.12.020

[43] K. S. Fu and J. K. Mui. 1981. A survey on image segmentation. Pattern Recognition, 13, 3-16.

[44] Yu-Ichi Ohta, Takeo Kanade, and Toshiyuki Sakai. 1980. Color information for region segmentation. Computer Graphics and Image Processing, 13, 222-241.

[45] Michael B. Dillencourt, Hannan Samet, and Markku Tamminen. 1992. A general approach to connected-component labeling for arbitrary image representations.
