Cluster Analysis
Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous
subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both
minimize within-group variation and maximize between-group variation.
Hierarchical clustering allows users to select a definition of distance, then select a linkage method for forming clusters, then determine how many clusters best suit the data. In k-means clustering the researcher specifies the number of clusters in advance, and the algorithm then works out how to assign cases to those k clusters. K-means clustering is much less computer-intensive and is therefore sometimes preferred when datasets are very large (e.g., more than 1,000 cases). Finally, two-step clustering creates pre-clusters, then it clusters the pre-clusters.
A Famous Example of Cluster Analysis
The discovery of white dwarfs and red giants is a famous example of cluster analysis. The astronomers Hertzsprung and Russell plotted stars on two features, log luminosity and log temperature. Three clusters emerged: white dwarfs, red giants, and the main sequence between them.
Cluster          Temperature    Luminosity
White Dwarfs     medium         low
Main Sequence    wide range     low when T is low, high when T is high
Red Giants       medium-low     high

Key Concepts and Terms
Cluster formation is the selection of the procedure for determining how clusters are created, and how the
calculations are done. In agglomerative hierarchical clustering every case is initially considered a cluster,
then the two cases with the lowest distance (or highest similarity) are combined into a cluster. The case
with the lowest distance to either of the first two is considered next. If that third case is closer to a fourth
case than it is to either of the first two, the third and fourth cases become the second two-case cluster; if
not, the third case is added to the first cluster. The process is repeated, adding cases to existing clusters,
creating new clusters, or combining clusters to get to the desired final number of clusters. (There is also
divisive clustering, which works in the opposite direction, starting with all cases in one large cluster.
Hierarchical cluster analysis can use either agglomerative or divisive clustering strategies.)
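
As an illustration of the agglomerative procedure, here is a minimal Python/scipy sketch (not the SAS task itself), run on the seven store/brand loyalty observations that appear in the worked example later in this handout:

import numpy as np
from scipy.cluster.hierarchy import linkage

# Seven cases (rows) measured on two clustering variables (columns); these are
# the store/brand loyalty values from the worked example later in this handout.
X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]], dtype=float)

# Agglomerative clustering: every case starts as its own cluster; at each stage
# the two closest clusters (here, by average linkage) are merged, until only
# one cluster remains.
Z = linkage(X, method="average", metric="euclidean")

# Each row of Z records one merge: the two cluster ids joined, the distance at
# which they were joined, and the size of the resulting cluster.
for stage, (i, j, dist, size) in enumerate(Z, start=1):
    print(f"stage {stage}: join {int(i)} and {int(j)} at distance {dist:.3f} (size {int(size)})")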
Distance. The first step in cluster analysis is establishment of the similarity or distance matrix. This matrix
is a table in which both the rows and columns are the units of analysis and the cell entries are a measure of
similarity or distance for any pair of cases.
Euclidean distance is the most common distance measure. Picture a given pair of cases plotted on two variables, which form the x and y axes. The Euclidean distance is the square root of the sum of the squared x difference and the squared y difference. Sometimes the squared Euclidean distance is used instead.
When two or more variables are used to define distance, the one with the larger magnitude will dominate,
so to avoid this it is common to first standardize all variables.
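
To make this concrete, here is a small Python sketch (an illustration, not the SAS calculation) of plain and squared Euclidean distance, the full distance matrix, and the effect of standardizing first:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import zscore

# Two cases measured on two variables (x and y).
a = np.array([3.0, 2.0])
b = np.array([7.0, 7.0])

# Euclidean distance: square root of the squared x difference plus the squared y difference.
d = np.sqrt(np.sum((a - b) ** 2))   # about 6.403
d_squared = np.sum((a - b) ** 2)    # 41.0

# Distance matrix for a whole data set: rows and columns are the cases,
# cells are the pairwise distances.
X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]], dtype=float)
D = squareform(pdist(X, metric="euclidean"))

# Standardizing first keeps a variable with a larger scale from dominating the distance.
D_std = squareform(pdist(zscore(X, ddof=1), metric="euclidean"))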
Similarity. Distance measures how far apart two observations are. Cases which are alike share a low
distance. Similarity measures how alike two cases are. Cases which are alike share a high similarity.
Cluster Method
There are different methods to compute the distance between clusters.
Average linkage uses the mean distance between all possible inter- or intra-cluster pairs: the pair of clusters merged is the one for which the average distance between all pairs in the resulting cluster is as small as possible. This method is therefore appropriate when the research purpose is homogeneity within clusters.
Ward's method calculates the sum of squared Euclidean distances from each case in a cluster to the cluster's mean on all variables. The pair of clusters merged is the one that increases this sum the least. This is an ANOVA-type approach and is preferred by some researchers for this reason.
Centroid method. The pair of clusters merged is the one with the smallest Euclidean distance between the cluster means (centroids) computed over all variables.
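
In scipy (used here purely as an illustration; the SAS task exposes the same choice through its Cluster Method option), the linkage rule is selected by name:

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]], dtype=float)

# Average linkage: distance between clusters is the mean pairwise distance.
Z_avg = linkage(X, method="average")

# Ward's method: merge the pair of clusters giving the smallest increase in the
# within-cluster sum of squared distances to the cluster means.
Z_ward = linkage(X, method="ward")

# Centroid method: distance between clusters is the distance between their centroids.
Z_cen = linkage(X, method="centroid")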
K-means algorithm. The algorithm uses Euclidean distance and requires the user to specify the required number of clusters in advance.
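
A minimal k-means sketch (scikit-learn on made-up toy data, not the SAS procedure) showing that k must be supplied up front:

import numpy as np
from sklearn.cluster import KMeans

# Toy data: six cases on two variables (illustrative values only).
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)

# The researcher specifies the number of clusters (k); distances are Euclidean.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # which cluster each case was assigned to
print(km.cluster_centers_)   # the final cluster means (seeds)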
Agglomeration Schedule
SAS displays the agglomeration schedule in the Cluster History table. In this table, the rows are stages of clustering, numbered from 1 to (n - 1); the (n - 1)th stage collects all the cases into one cluster. The algorithm uses the Euclidean distance to combine clusters. The 0th row (not shown) has all the observations as one-point clusters. At Stage 1, the two clusters with the least distance are combined into one cluster. The process continues until all the cases are collected into one cluster. The metric Norm RMS Dist increases as the number of clusters decreases, and a jump in this metric suggests the number of clusters: if there is a significant jump from Stage i to Stage (i + 1), then we stop the clustering process after completing Stage i and identify the clusters at that point.
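
The same jump heuristic can be sketched outside SAS. This rough Python/scipy illustration normalizes the merge distances by the root-mean-square distance between observations; this is only analogous in spirit to SAS's Norm RMS Dist column and may not reproduce its values exactly:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]], dtype=float)

# Each row of the linkage matrix is one stage: the clusters joined, the distance
# at which they were joined, and the size of the new cluster, i.e. the same kind
# of information as the Cluster History table.
Z = linkage(X, method="average")

# Normalize the merge distances by the RMS distance between observations.
rms_dist = np.sqrt(np.mean(pdist(X) ** 2))
norm_dist = Z[:, 2] / rms_dist

# Heuristic: stop before the largest jump between successive stages.
jumps = np.diff(norm_dist)
stop_stage = int(np.argmax(jumps)) + 1        # stages are numbered from 1
n_clusters = X.shape[0] - stop_stage          # clusters remaining after that stage
print(norm_dist)
print(stop_stage, n_clusters)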
Dendrogram
Dendrograms (cluster analysis tree charts) show the relative size of the average cluster distances at which clusters were combined. The bigger the distance, the more the clustering involved combining unlike entities, which may be undesirable. Clusters with low distance/high similarity are close together. Cases showing low distance are linked by a line a short distance from the left of the dendrogram, indicating that they are agglomerated into a cluster at a low distance coefficient and are therefore alike. When, on the other hand, the linking line is toward the right of the dendrogram, the linkage occurs at a high distance coefficient, indicating that the cases or clusters were agglomerated even though they are much less alike.
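
A dendrogram of this kind can be drawn with scipy and matplotlib (an illustration, not the SAS tree chart):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]], dtype=float)
Z = linkage(X, method="average")

# Horizontal tree: cases that are joined near the left (low distance) are alike;
# links far to the right mark merges of relatively unlike clusters.
dendrogram(Z, orientation="right", labels=list("ABCDEFG"))
plt.xlabel("Distance at which clusters are joined")
plt.show()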
Cluster centers are the average value on all clustering variables of each cluster's members.
Profiling the Cluster
We profile each cluster by interpreting its centroid (the mean of the cluster). These values are available when SAS is executed with the K-means method. We interpret each cluster by its high or low values on each variable.
We may also profile the clusters on variables that were not used in the clustering process, either by comparing the means of those variables across clusters or by running a discriminant analysis on them.
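
A rough profiling sketch in Python/pandas (the "age" column here is made up purely to stand in for a variable that was not used in the clustering):

import pandas as pd
from sklearn.cluster import KMeans

# Clustering variables plus one hypothetical external variable ("age").
df = pd.DataFrame({
    "store_loyalty": [3, 4, 4, 2, 6, 7, 6],
    "brand_loyalty": [2, 5, 7, 7, 6, 7, 4],
    "age":           [34, 25, 27, 29, 41, 45, 38],   # made-up values for illustration
})

km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(df[["store_loyalty", "brand_loyalty"]])

# Profile on the clustering variables: the centroid (mean) of each cluster.
print(df.groupby("cluster")[["store_loyalty", "brand_loyalty"]].mean())

# Profile on a variable that was not used in the clustering.
print(df.groupby("cluster")["age"].mean())

The discriminant-analysis option mentioned above could be sketched similarly, for example with scikit-learn's LinearDiscriminantAnalysis using the cluster labels as the grouping variable.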
Validating the Clusters
Because of the non-statistical aspects of cluster analysis, we need to conduct a validation exercise to ensure
that the result is generalizable to future observations. The following are some ways to do this:
1. Cluster using different distance measures
2. Cluster using different methods
3. Split the data into two halves, run the clustering on each half, profile each cluster, and compare the
profiles (a rough sketch of this check follows the list).
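
A sketch of the split-half check (item 3), using synthetic data and scikit-learn k-means purely for illustration:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data: three loose groups of cases on two variables.
centers = np.array([[3, 2], [6, 6], [3, 7]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.8, size=(50, 2)) for c in centers])

# Split the cases at random into two halves and cluster each half with the same settings.
idx = rng.permutation(len(X))
half1, half2 = X[idx[:len(X) // 2]], X[idx[len(X) // 2:]]

km1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(half1)
km2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(half2)

# If the solution generalizes, the two sets of cluster means (profiles) should be
# similar, apart from the arbitrary ordering of the cluster labels.
print(km1.cluster_centers_)
print(km2.cluster_centers_)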

Example

Cluster Analysis
Average Linkage Method
SAS Instructions
1. Analyze → Multivariate → Cluster Analysis
2. Data: Drop variables into Analyze Variable bucket
3. Cluster Method: Average Linkage
4. Plots: Tree Design
5. Results: Display Output
6. Run

Eigenvalues of the Covariance Matrix
Eigenvalue Difference Proportion Cumulative
1 4.02480073 1.14483955 0.5829 0.5829
2 2.87996118 0.4171 1.0000

Root-Mean-Square Total-Sample Standard Deviation 1.858058

Root-Mean-Square Distance Between Observations 3.716117

Cluster History
NCL Clusters Joined FREQ Norm RMS Dist Tie
6 OB5 OB6 2 0.3806
5 OB2 OB3 2 0.5382 T
4 CL5 OB4 3 0.6592
3 CL6 OB7 3 0.712
2 CL4 CL3 6 0.9702
1 OB1 CL2 7 1.3045


Ward's Minimum Variance Method
SAS Instructions
1. Analyze → Multivariate → Cluster Analysis
2. Data: Drop variables into Analyze Variable bucket
3. Cluster Method: Ward's Min. Var. Method
4. Plots: Tree Design
5. Results: Display Output
6. Run

Eigenvalues of the Covariance Matrix
Eigenvalue Difference Proportion Cumulative
1 4.02480073 1.14483955 0.5829 0.5829
2 2.87996118 0.4171 1.0000

Root-Mean-Square Total-Sample Standard Deviation 1.858058

Root-Mean-Square Distance Between Observations 3.716117

Cluster History
NCL Clusters Joined FREQ SPRSQ RSQ Tie
6 OB5 OB6 2 0.0241 .976
5 OB2 OB3 2 0.0483 .928 T
4 CL5 OB4 3 0.0805 .847
3 CL6 OB7 3 0.1046 .743
2 OB1 CL4 4 0.3420 .401 T
1 CL2 CL3 7 0.4006 .000


K-Means
SAS Instructions
1. Analyze → Multivariate → Cluster Analysis
2. Data: Drop variables into Analyze Variable bucket
3. Cluster Method: K-Means; Max # of clusters: 3 (Why?);
4. Run
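
For comparison outside SAS, here is a rough scikit-learn sketch on the seven observations listed in the Output Data table below. It does not mimic SAS's seed selection, but with these data it should recover the same three groups, so its cluster means and distances can be checked against the tables that follow:

import numpy as np
from sklearn.cluster import KMeans

# The seven respondents (A-G) from the Output Data table:
# columns are Store Loyalty and Brand Loyalty.
X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]], dtype=float)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_)                         # cluster membership of each respondent
print(km.cluster_centers_)                # compare with the Cluster Means table
print(np.min(km.transform(X), axis=1))    # each case's distance to its own cluster center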

Initial Seeds
Cluster Store Loyalty Brand Loyalty
1 3.000000000 2.000000000
2 7.000000000 7.000000000
3 2.000000000 7.000000000

Minimum Distance Between Initial Seeds = 5

Iteration History: iteration number, criterion, and relative change in cluster seeds for clusters 1-3 (values not shown)

Criterion Based on Final Seeds = 0.8729

Cluster Summary: Cluster, Frequency, RMS Std Deviation, Max. Distance from Seed to Obs., Radius Exceeded, Nearest Cluster, Distance Between Cluster Centroids (values not shown)

Statistics for Variables: Variable, Total STD, Within STD, R-Square, RSQ/(1-RSQ) (values not shown)

Pseudo F Statistic = 5.77
Approximate Expected Over-All R-Squared = .
Cubic Clustering Criterion = .

Cluster Means
Cluster Store Loyalty Brand Loyalty
1 3.000000000 2.000000000
2 6.333333333 5.666666667
3 3.333333333 6.333333333

Cluster Standard Deviations: Cluster, Store Loyalty, Brand Loyalty (values not shown)

Output Data
Resp Store Brand Cluster Distance
A 3 2 1 0
B 4 5 3 1.49071198
C 4 7 3 0.94280904
D 2 7 3 1.49071198
E 6 6 2 0.47140452
F 7 7 2 1.49071198
G 6 4 2 1.69967317
