# Cluster Analysis

Cluster Analysis deals with procedures for classifying objects, on the basis of their observation vectors, into homogeneous groups referred to as clusters. Procedures for the formation of clusters were originally developed in Taxonomy for the classification of Operational Taxonomic Units (OTU) of insects. A separate branch, Numerical Taxonomy, was also developed for this purpose through the efforts of Sneath and Sokal (1973).

In cluster analysis, the objects are classified on the basis of "similarities" between them. The similarity is measured in the form of inter-object distances. There are several measures for computing the inter-object distances. These measures satisfy the properties of a distance function.

Distance Function: If there are two objects (P, Q) with their observations X and Y, then d(P, Q) is a distance function if it has the following properties:
I. Symmetry: d(P, Q) = d(Q, P)
II. Non-Negativity: d(P, Q) ≥ 0
III. Definiteness: d(P, Q) = 0 iff P = Q
IV. Triangle Inequality: d(P, Q) ≤ d(P, R) + d(R, Q)

Some of the well-known Distance Measures are:

1. Euclidean Distance:
$$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$
where $X_i = (x_{i1}, \ldots, x_{ip})$ and $X_j = (x_{j1}, \ldots, x_{jp})$ are the observation vectors of the i-th and j-th objects on p variables.

2. Minkowski Metric:
$$d_{ij} = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{m} \right)^{1/m}$$
When m = 2, the Minkowski distance becomes the Euclidean Distance. In general, varying m changes the weight given to larger and smaller differences.

3. Karl Pearson's Distance, commonly written as:
$$d_{ij}^{2} = \sum_{k=1}^{p} \frac{(x_{ik} - x_{jk})^2}{s_k^2}$$
where $s_k^2$ is the variance of the k-th variable.

4. Mahalanobis Standardized Distance:
$$D_{ij}^{2} = (X_i - X_j)' S^{-1} (X_i - X_j)$$
where S is the covariance matrix of the variables.

The Mahalanobis Standardized Distance $D^2$ is the most commonly preferred distance because it deals efficiently with observation vectors whose variables have different ranges of magnitude.

Cluster Analysis procedures are applied to two data situations, according to how the observations are recorded for the "N" objects:

1. Single Sample Situation: A single observation is recorded for each object, so there are N observations (vectors) to be classified into g (< N) distinct clusters.
2. Multi-Sample Situation: Instead of a single observation, a random sample of Nj observations is recorded for the j-th object (j = 1, …, N). Thus there are N samples of observation vectors, i.e., a multi-sample data situation. This situation is the most common in experimental data, where Nj represents the number of replicated observations recorded on the object.

Clustering procedures were basically developed for Single Sample situations. However, they can be extended to Multi-Sample situations by imposing certain assumptions.

Approaches to Clustering: The procedures of clustering can be categorized into the following approaches:
I. Optimization Methods (Ordination Procedures)
II. Clumping Techniques

III. Density Search Techniques
IV. Hierarchical Methods

Among these, the procedures developed under the hierarchical approach are commonly preferred due to the element of least subjectivity in the formation of clusters.

Hierarchical Methods: These methods are further classified into Agglomerative and Divisive Methods. In hierarchical methods, the objects to be grouped are classified on the basis of relative distances between them. The objects that are relatively closest form the initial cluster. To this, a new object is added if it satisfies a certain criterion, otherwise it is classified into another independent cluster. The classification process continues till all the objects are clustered. Thus the classification is a stepwise procedure, on the basis of the inter-object and inter-cluster distances that are computed during every step of classification.

The inter-object Distance Matrix is the base of such classifications. At every stage of classification, the distance matrix is revised in terms of the distance of the new cluster with the other objects. The distance matrix at every step provides "links" of the objects with the objects that are already clustered. The outcome of the classification is a tree-like diagram, known as the Dendrogram, which describes the links of the objects that are clustered. The Dendrogram helps in identifying the clusters. The procedure involved is: identify the clusters whose inter-cluster distances are less than a certain threshold distance, d0.

The methods under the hierarchical approach are:
1. Single Linkage Method (Nearest Neighbor Method)
2. Average Linkage Method
3. Complete Linkage Method (Farthest Neighbor Method) and
4. Ward's Minimum Variance Method.

The procedure (algorithm) for these methods is the same; however, the methods differ in the computation of the distance matrix at every step, i.e., the distance of the clusters that are already formed with the other (remaining) objects.

General Algorithm (Procedure):

Step-1: Suppose there are N objects to be clustered on the basis of the observation vectors recorded corresponding to each object. Suppose these N observations are: X1, …, XN. From these observations, the inter-object distances are computed and these form the initial distance matrix D(0).

Step-2: The relatively closest pair of objects in terms of the distance is then identified from this matrix, say (1, 2), which is the initial cluster C1.

Step-3: The distance matrix D(0) is then revised to D(1), which is of order (N-1), i.e., D(1) contains the distances of the remaining (N - 2) objects j (j ≠ 1, 2) with C1.

Step-4: Again, the search for the closest pair is continued from D(1). There are two possibilities for this, i.e., the closest pair may be the distance of C1 with a new j-th object, so that the new cluster C2 contains 3 objects, i.e., (1, 2) of C1 and the new j-th object; or alternatively, the closest pair may be the distance of a new pair of (i, j) objects, so that the new cluster C2 may be independent and consists of two objects.

Step-5: The procedure of identifying the closest pair and revising the distance matrix is repeated till all the objects are classified into a single cluster.

Step-6: A tree diagram, known as the "Dendrogram", is then plotted out of the distances computed during each step of the search for fusion. The diagram is developed by taking the objects, in the order of screening, on the X-Axis and the corresponding distances on the Y-Axis.

Step-7: Cluster formation can be viewed from this diagram by defining a threshold distance "d0", such that the inter-cluster distances of all the clusters are ≥ d0. The choice of d0 is arbitrary and depends on how close the clusters are to be formed.

All the above clustering methods differ only with regard to the computation of the inter-cluster distances with the remaining objects during every step of the merger. If C1 is the initial cluster with objects (1, 2), then the distance of C1 with the remaining objects "j" (j ≠ 1, 2; j = 3, …, N) is computed as follows:

Single Linkage: Min (d1j, d2j)
Average Linkage: Average (d1j, d2j)
Complete Linkage: Max (d1j, d2j)

In Ward's Minimum Variance Method, the merger during each step is based on the criterion of relatively minimum variance of the distances among the objects.

Properties:
∗ Single Linkage, Average Linkage and Complete Linkage methods are invariant to transformations
∗ Once an object is allocated to a cluster, it cannot be re-allocated to other clusters
∗ All the hierarchical methods satisfy the property of chaining, i.e., selecting string like clusters
∗ Complete Linkage is not suitable when there are ties in the distances
∗ The number of clusters "g" to be formed is a crucial task! The rule of thumb suggested for this purpose is:

Applications:
∗ Genetic diversity among the Genotypes (Tocher's Method)
∗ Regional Classifications on the basis of Rainfall, Cropping Patterns, Soil Characteristics and Socio-Economic Characters
∗ Identifying Rainfall Patterns in a District / Region

Illustration: Rainfall Patterns in Anantapur District
Random Vector: Monthly Rainfall of South-West Monsoon (including October); 5 Variables, June to October
Data: Monthly Rainfall data of 50 Years, 1955 to 2004
District: Anantapur
Method: Average Linkage Method
Number of Clusters: 7
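An analysis of this kind (agglomerative clustering of five monthly rainfall variables) can be sketched in a few lines of code. The sketch below is only an illustration under stated assumptions: the function names `euclidean`, `agglomerate` and `cut`, the threshold value, and above all the five rainfall vectors are hypothetical and are not the Anantapur data. The update rule follows the linkage rules given in this section (minimum, simple average, or maximum of d1j and d2j); some software instead weights the average by cluster size, so results may differ from a packaged "average linkage" routine.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two observation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Rules for the distance of a merged cluster (1, 2) to another cluster j,
# written in terms of d1j and d2j as in the text.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda d1j, d2j: (d1j + d2j) / 2.0,
}

def agglomerate(vectors, method="average"):
    """Return the merges as a list of (members_a, members_b, distance)."""
    combine = LINKAGE[method]
    clusters = {i: (i,) for i in range(len(vectors))}
    # Step-1: initial distance matrix D(0), keyed by (smaller id, larger id).
    dist = {(i, j): euclidean(vectors[i], vectors[j])
            for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:  # Step-5: repeat until one cluster remains
        # Step-2 / Step-4: find the closest pair in the current matrix.
        (a, b), d = min(dist.items(), key=lambda kv: kv[1])
        merges.append((clusters[a], clusters[b], d))
        # Step-3: revise the matrix with distances to the new cluster.
        for c in clusters:
            if c not in (a, b):
                d_ac = dist[(min(a, c), max(a, c))]
                d_bc = dist[(min(b, c), max(b, c))]
                dist[(c, next_id)] = combine(d_ac, d_bc)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

def cut(merges, n_objects, d0):
    """Step-7: keep only the fusions made at a distance below d0."""
    groups = {(i,) for i in range(n_objects)}
    for left, right, d in merges:
        if d >= d0:
            break
        groups -= {left, right}
        groups.add(left + right)
    return sorted(sorted(g) for g in groups)

# Hypothetical June-October rainfall (mm) for five made-up years.
rainfall = [
    (50, 60, 70, 130, 110),
    (53, 64, 70, 130, 110),
    (50, 60, 76, 138, 110),
    (50, 260, 70, 130, 110),   # unusually wet July
    (50, 60, 70, 130, 230),    # unusually wet October
]

merges = agglomerate(rainfall, method="average")
patterns = cut(merges, len(rainfall), d0=50.0)
print(patterns)  # [[0, 1, 2], [3], [4]]
```

With this toy input the three similar years fuse at small distances, while the two years with one extreme month remain as single-year clusters below the threshold. The list of merge distances returned by `agglomerate` is what a Dendrogram would plot on its Y-Axis.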

Clusters: Rainfall Patterns of Anantapur, 1955-2004 (Clustering Method: Average Linkage)

Columns: Cluster; No. of Years; June; July; August; September; October

Cluster    No. of Years
1          36
2          6
3          4
4          1976
5          1988
6          1989
7          1991
Mean       50
SD
CV

Monthly rainfall entries as extracted (row and column positions not recoverable): 39 189.00 197.00 39.83 64.00 48.00 19.00 265.33 173.91 133.00 112.64 57.00 19.00 49.38 57.90 65.00 24.92 49.00 25.67 63.40 29.00 158.00 234.00 25.00 280.00 162.75 39.00 51.00 76.67 59. 47.72 58.23 95.00 203.00 28.00 80.00 13.92 205.42 51.00 131.50 121.25 50.00 63.00 133.43 74.09 .28 55.00 71.33 35.32 68.28 47.

Characteristics of Rainfall Patterns:

Pattern-1: Normal. Relatively Most Frequent. Probability 0.72
Pattern-2: Abnormal. Exceptionally High Rainfall in October (205 mm)
Pattern-3: Abnormal. Low Rainfall in July (35 mm) and September (49 mm)
Pattern-4: Abnormal. Low Rainfall in July, September and October
Pattern-5: Abnormal. Exceptionally High Rainfall in August (234 mm) and September (265 mm)
Pattern-6: Abnormal. Exceptionally High Rainfall in July (280 mm)
Pattern-7: Abnormal. Exceptionally High Rainfall in June (131 mm) and Very Low Rainfall in July to September months
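The probability of 0.72 quoted for the normal pattern is consistent with a relative frequency: 36 of the 50 years fall in the first cluster. The short sketch below repeats that arithmetic for every pattern. It assumes that the patterns are numbered in the same order as the clusters and that the cluster sizes are those read from the "No. of Years" column (36, 6 and 4 years, then four single-year clusters); the variable names are illustrative only.

```python
# Years per pattern, assumed to follow the cluster order of the table.
years_per_pattern = {1: 36, 2: 6, 3: 4, 4: 1, 5: 1, 6: 1, 7: 1}

total_years = sum(years_per_pattern.values())  # 50 years, 1955 to 2004

# Relative frequency of each pattern, used as an empirical probability.
probability = {p: n / total_years for p, n in years_per_pattern.items()}

for pattern, prob in probability.items():
    print(f"Pattern-{pattern}: {prob:.2f}")
```

Under these assumptions the abnormal patterns together account for the remaining 14 of the 50 years, i.e., a combined relative frequency of 0.28.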