
Cluster Analysis

Agglomerative Hierarchical
Clustering
CDS504 Module 10
Copyright © Dr. Aihua Yan
Friday, November 11, 2022
Examples of Clustering Applications
Clustering Applications - Marketing (1)
o Marketing - market segmentation: customers are segmented based on transaction history, demographic data, behavioral data, and psychological data, and a marketing strategy is tailored to each segment.

[Figure: market segments for fitness centers, e.g., Taking Shape, Peak Performers, Losing Weight, Making Friends, Health Requirements, Sports Focus.]
Clustering Applications - Finance (2)
o Finance - balanced portfolios: given data on a variety of investment opportunities (e.g., stocks), one may find clusters based on financial performance variables such as return (daily, weekly, or monthly), volatility, and beta, and on other characteristics such as industry and market capitalization. Selecting securities from different clusters can help create a balanced portfolio.
Clustering Applications (3)
o Spatial Data Analysis
– Detect spatial clusters and explain them in spatial data mining, e.g., sales distribution by zip code.
o Image Processing
– Disease diagnosis in healthcare.
o Web Analysis
– Document classification.
– Cluster web-log data to discover groups of similar access patterns.
Introduction: What is Cluster Analysis?
What is Cluster Analysis?
Purpose: identify groups of individuals or objects that are similar to each other but different from individuals/objects in other groups.

o Cluster Analysis (CA):
– A convenient method for identifying homogeneous groups of objects.
– One of the most fundamental marketing activities, used in market segmentation (a.k.a. customer segmentation).

o Cluster: a group of objects (or cases, observations) which share many characteristics, but are very dissimilar to objects not belonging to that cluster.
Objectives in Cluster Analysis
[Figure: objects and cluster centroids illustrating the two objectives – within-cluster variation = minimum, between-cluster variation = maximum.]
Within-groups vs. Between-groups
pWithin-groups property: Each group is homogenous with respect to
certain characteristics, i.e. observations in each group are similar to each other
p Between-groups property: Each group should be different from other
groups with respect to the same characteristics, i.e. observations of one group
should be different from the observations of other groups.

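These two properties can be stated formally. A common formulation (an addition for clarity, not from the slides) measures within-cluster variation W and between-cluster variation B as sums of squared distances:

$$W = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \bar{x}_j \rVert^2, \qquad B = \sum_{j=1}^{K} n_j \lVert \bar{x}_j - \bar{x} \rVert^2$$

where $\bar{x}_j$ is the centroid of cluster $C_j$, $n_j$ is its size, and $\bar{x}$ is the overall mean; a good clustering makes W small and B large.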
How many clusters?
Types of Clustering
o A clustering is a set of clusters.
o Important distinction between hierarchical and partitional sets of clusters:
– Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree.
• Algorithm: e.g., agglomerative hierarchical clustering.
– Partitional clustering
• A division of data objects into non-overlapping subsets (clusters).
• Algorithm: e.g., K-means.
Hierarchical Clustering
[Figure: traditional hierarchical clustering of points p1–p4, alongside the corresponding traditional dendrogram.]

Partitional Clustering
[Figure: original points, and a partitional clustering of the same points.]


Agglomerative Hierarchical Clustering

Hierarchical Clustering
o Hierarchical clustering proceeds successively either by merging smaller clusters into larger ones (agglomerative) or by splitting larger clusters (divisive). The result of the algorithm is a tree of clusters, called a dendrogram.
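To make the procedure concrete, here is a minimal sketch using SciPy on a small synthetic data set (the points and parameter choices are illustrative, not from the slides):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six illustrative points in two dimensions, forming two visible groups.
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [5.0, 5.0], [5.3, 4.8], [4.8, 5.2]])

# Agglomerative clustering: start with every point as its own cluster,
# then repeatedly merge the two closest clusters (average linkage here).
Z = linkage(X, method="average", metric="euclidean")

# The dendrogram records every merge and the distance at which it occurred.
dendrogram(Z, labels=[f"p{i+1}" for i in range(len(X))])
plt.ylabel("merge distance")
plt.show()

# Cut the tree to obtain two flat clusters.
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g., [1 1 1 2 2 2]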
Ties are not a critical issue in clustering.
Stopping rule: we are seeking a solution where an additional combination of clusters or objects would occur at a greatly increased distance (i.e., maximum information loss).
[Figure: worked example with successive merge distances 1.414, 2, 2.236, and 3.162.]
Steps of Hierarchical Clustering
Step 1: Objectives of cluster analysis
Step 2: Choice of variables
Step 3: Similarity measures (distance measures)
Step 4: Decide on the clustering algorithm
Step 5: Decide on the number of clusters
Step 6: Validate and interpret the clustering solution
[Figure: example data matrix – observations in rows, measures in columns.]
Step 1: Objectives of CA
o Taxonomy description
– Taxonomy: an empirically based classification of objects.
– Although cluster analysis is viewed principally as an exploratory technique, it can also be used for confirmatory purposes. In such cases, a proposed typology (a theoretically based classification) can be compared to the one derived from the cluster analysis.
o Data Simplification
– Instead of viewing all of the observations as unique, they can be viewed as members of clusters and profiled by their general characteristics.
Step 2: Choice of Variables
o The selection of clustering variables should be based on an explicit theory, past research, or supposition.
o The researcher must also realize the importance of including only those variables that:
– characterize the objects being clustered, and
– relate specifically to the objectives of the cluster analysis. See the next slide for the variables for need-based market segmentation.
o Practical considerations:
– Cluster analysis can be affected dramatically by the inclusion of only one or two inappropriate or undifferentiated variables.
– The analyst is always encouraged to examine the results and to eliminate the variables that are not distinctive (i.e., that do not differ significantly) across the derived clusters.
o Warning:
– Avoid including variables “just because you have them”.
– Results are dramatically affected by the inclusion of even one or two inappropriate or undifferentiated variables.
[Figure: clustering variables for need-based market segmentation.]
Step 3: Choice of Similarity Measure:
Distance Measures
o Distance (or dissimilarity) measures – for measuring the distance between two observations/objects:
– Euclidean Distance
– Minkowski Metric
– Euclidean Distance for Standardized Data
Distance Measures (2)
Minkowski metric between cases i and j:

$$d_{ij} = \left( \sum_{k=1}^{p} \lvert x_{ik} - x_{jk} \rvert^{s} \right)^{1/s}$$

where $x_{ik}$ = measurement of the ith case on the kth variable, and p = number of variables.

o s = 2: Euclidean distance. Example: Euclidean distance $= \sqrt{0.04 + 0.09 + 0.25} = 0.616$.
o s = 1: city-block (Manhattan) distance. Exercise: what is the city-block distance between Employee 1 and Employee 2?
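As a sketch, the Minkowski metric is available in SciPy. Note that SciPy names the exponent p, while the slides use s for the exponent and p for the number of variables. The two employee vectors below are hypothetical, chosen so that the squared differences reproduce the 0.616 example:

from scipy.spatial.distance import minkowski

# Hypothetical measurements for two employees on three variables;
# the pairwise differences are 0.2, 0.3, and 0.5.
emp1 = [0.7, 1.0, 1.5]
emp2 = [0.5, 0.7, 1.0]

# Exponent 2: Euclidean distance = sqrt(0.04 + 0.09 + 0.25) ≈ 0.616
print(minkowski(emp1, emp2, p=2))  # 0.6164...

# Exponent 1: city-block (Manhattan) distance = 0.2 + 0.3 + 0.5 = 1.0
print(minkowski(emp1, emp2, p=1))  # 1.0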
Standardization of Variables
o Note: Euclidean distance depends on the scale of the variables! Variables with large values will contribute more to the distance measure than variables with small values.
o Standardization of variables is commonly preferred to avoid problems due to different scales.
o Clustering variables should be standardized whenever possible.
– Most commonly done using Z-scores:

$$Z = \frac{X - \mu}{\sigma}$$

o Standardization by observations: if groups are to be formed based on respondents’ response styles, then within-case or row-centering standardization can be considered.
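A quick sketch of Z-score standardization in plain NumPy (the raw values are illustrative; scikit-learn's StandardScaler performs the same per-column computation):

import numpy as np

# Illustrative raw data: income (large scale) vs. age (small scale).
X = np.array([[52000.0, 34.0],
              [61000.0, 45.0],
              [48000.0, 29.0],
              [75000.0, 52.0]])

# Z = (X - mu) / sigma, column by column, so that every variable
# contributes comparably to the Euclidean distance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.round(2))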
Step 4: Decide on the Clustering Algorithm
Measuring the distance between two clusters:
o Centroid method: the distance between the two cluster centroids. The centroid of a merged cluster is a weighted combination of the centroids of the two individual clusters, where the weights are proportional to the sizes of the clusters.
o MIN or single linkage: the distance between the pair of observations that are closest.
o MAX or complete linkage: the distance between the pair of observations that are farthest apart.
o Average linkage (average distance): the average of all possible distances between records in one cluster and records in the other cluster.
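In SciPy, these between-cluster distance definitions correspond to the method argument of linkage; a small comparison sketch on illustrative random points:

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(8, 2))  # illustrative points

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)
    # The last row of Z is the final merge; column 2 holds its distance.
    print(f"{method:>8}: final merge distance = {Z[-1, 2]:.3f}")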
Ward’s Method
o Ward’s method considers the “loss of information” that occurs when observations are clustered together.
o Ward’s method chooses the configuration that results in the smallest incremental loss of information.
o Ward’s method tends to produce clusters of similar shape and size.
o To measure loss of information, Ward’s method employs the “error sum of squares” (ESS), which measures the difference between individual observations and their group mean.
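The slides do not give the formula; a common way to write ESS (assuming squared Euclidean distances to each cluster mean) is:

$$\mathrm{ESS} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \bar{x}_j \rVert^2$$

At each step, Ward’s method merges the pair of clusters whose union increases ESS the least (in SciPy: linkage(X, method="ward")).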
Step 5: Choose the Number of Clusters
Stopping rules of hierarchical clustering:
o Unfortunately, hierarchical methods provide only very limited guidance for making this decision.
o Distance-based rules:
– Rule 1: one potential stopping rule is the elbow rule.
– Rule 2: another alternative is looking at the dendrogram.
– Distance-based rules do not work well in all cases; it is often difficult to identify where the break actually occurs.
[Figure: dendrogram example. Based on the figure, what is your final number of clusters?]
Elbow Rule
o One should choose a number of clusters so that adding another cluster doesn't give much better modelling of the data.
o More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.
o The number of clusters is chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified.
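A sketch of the elbow plot, computing the total within-cluster sum of squares for hierarchical solutions with k = 1 to 8 clusters (the data are synthetic; any monotone variance-explained measure would serve equally well):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Synthetic data with three loose groups.
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (0, 3, 6)])

Z = linkage(X, method="ward")

def within_ss(X, labels):
    # Total squared distance of each point to its own cluster mean.
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

ks = range(1, 9)
wss = [within_ss(X, fcluster(Z, t=k, criterion="maxclust")) for k in ks]

plt.plot(list(ks), wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()  # look for the bend (the "elbow"), here around k = 3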
Step 6: Interpretation of the Clusters
o The interpretation stage involves examining each cluster in terms of the cluster variate to name or assign a label that accurately describes the nature of the clusters.
o When starting the interpretation process, one measure frequently used is the cluster’s centroid.
o The profiling and interpretation of the clusters, however, achieve more than just description: they are essential elements in selecting between cluster solutions when the stopping rules indicate more than one appropriate cluster solution.
[Figure: profiling variables, e.g., income, age, education.]
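A minimal profiling sketch with pandas: compute each cluster's centroid as the per-cluster mean of the variables (the column names and values are hypothetical):

import pandas as pd

# Hypothetical data; the cluster labels come from a prior cluster analysis.
df = pd.DataFrame({
    "income":  [52, 61, 48, 75, 80, 77],
    "age":     [34, 45, 29, 52, 58, 49],
    "cluster": [1, 1, 1, 2, 2, 2],
})

# Cluster centroids: the mean of each variable within each cluster,
# a common starting point for naming and labeling the clusters.
print(df.groupby("cluster").mean())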
Hierarchical Clustering: Problems and Limitations
o Once a decision is made to combine two clusters, it cannot be undone.
o No global objective function is directly minimized.
o Different schemes have problems with one or more of the following:
– Sensitivity to noise
– Difficulty handling clusters of different sizes and non-globular shapes
– Breaking large clusters
Additional Takeaways (1)
o Cluster analysis (CA) is descriptive, atheoretical, and noninferential. Many scholars contend that it is only an exploratory technique.
– CA doesn’t have a statistical basis upon which to draw inferences from a sample to a population.
• For instance, if you have customers in Hong Kong and Shenzhen, the clustering results for Hong Kong may be different from those for Shenzhen. You can’t just apply the CA results for HK directly to Shenzhen marketing.
– If possible, CA should be applied in a confirmatory mode, using it to identify groups that already have an established conceptual foundation for their existence.
Additional Takeaways (2)
o Cluster analysis will always create clusters, regardless of the actual existence of any structure in the data.
– Only those clusters having a strong conceptual foundation and validation are potentially meaningful and relevant.
o The cluster solution is not generalizable, because it is totally dependent upon the variables used as the basis for the similarity measure.
– Researchers/analysts need to ensure that all the measures used in the CA have strong conceptual support.
o Cluster analysis usually is not a final destination for your data analysis.
– The results of CA may be used as inputs for other analytical methods.
Key Takeaways for Week 9
o Cluster analysis is an exploratory technique.
o Start a cluster analysis with clear objective(s).
o Cluster analysis is usually not a final destination of your data analysis.
o Determining the number of clusters is hard; it needs both mathematical and practical considerations.
o K-means analysis can help refine hierarchical cluster analysis results.
