
Cluster Analysis

Cluster analysis

• It is a class of techniques used to classify cases into groups that are
• relatively homogeneous within themselves and
• heterogeneous between each other
• Homogeneity (similarity) and heterogeneity
(dissimilarity) are measured on the basis of a
defined set of variables
• These groups are called clusters

Market segmentation
• Cluster analysis is especially useful for market
segmentation
• Segmenting a market means dividing its potential
consumers into separate sub-sets where
• Consumers in the same group are similar with respect to
a given set of characteristics
• Consumers belonging to different groups are dissimilar
with respect to the same set of characteristics
• This allows one to calibrate the marketing mix
differently according to the target consumer group

Other uses of cluster analysis
• Product characteristics and the identification of new
product opportunities
• Clustering similar brands or products according to their
characteristics allows one to identify competitors, potential
market opportunities and available niches
• Data reduction
• Factor analysis and principal component analysis allow one to reduce
the number of variables
• Cluster analysis allows one to reduce the number of observations, by
grouping them into homogeneous clusters
• Maps that simultaneously profile consumers and products,
market opportunities and preferences, as in preference or
perceptual mapping

Steps to conduct a cluster analysis

• Select a distance measure
• Select a clustering algorithm
• Define the distance between two clusters
• Determine the number of clusters
• Validate the analysis

Distance measures for individual
observations
• To measure similarity between two observations a distance
measure is needed
• With a single variable, similarity is straightforward
• Example: income – two individuals are similar if their income level
is similar and the level of dissimilarity increases as the income gap
increases
• Multiple variables require an aggregate distance measure
• With many characteristics (e.g. income, age, consumption habits, family
composition, car ownership, education level, job…), it becomes more
difficult to define similarity with a single value
• The best-known distance measure is the Euclidean
distance, the same concept we use in everyday life for
spatial coordinates.
Examples of distances
Euclidean distance:
$$D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$$

City-block (Manhattan) distance:
$$D_{ij} = \sum_{k=1}^{n} |x_{ki} - x_{kj}|$$

where $D_{ij}$ is the distance between cases i and j, and $x_{kj}$ is the value of variable $x_k$ for case j
Problems
• Variables measured on different scales implicitly receive different weights
• Correlation between variables (double counting)
Solution: standardization, rescaling, or principal component analysis
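
To make the two distance measures and the standardization step concrete, here is a minimal sketch in Python (NumPy assumed, not part of the original material), using a small hypothetical data matrix:

```python
import numpy as np

# Hypothetical data: rows = cases, columns = variables (e.g. income, age, household size)
X = np.array([[32000.0, 45, 2],
              [45000.0, 31, 1],
              [28000.0, 52, 4]])

# Standardize each variable (z-scores: zero mean, unit variance) so that no
# variable dominates the distance simply because of its measurement scale
Z = (X - X.mean(axis=0)) / X.std(axis=0)

i, j = 0, 1
euclidean = np.sqrt(np.sum((Z[i] - Z[j]) ** 2))   # D_ij, Euclidean distance
manhattan = np.sum(np.abs(Z[i] - Z[j]))           # D_ij, city-block distance
print(euclidean, manhattan)
```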
Clustering procedures

• Hierarchical procedures
• Agglomerative (start from n clusters to get to 1
cluster)
• Divisive (start from 1 cluster to get to n
clusters)
• Non-hierarchical procedures
• K-means clustering

Hierarchical clustering
• Agglomerative:
• Each of the n observations constitutes a separate cluster
• The two clusters that are most similar according to some distance rule are
merged, so that after step 1 there are n-1 clusters
• In the second step another cluster is formed (n-2 clusters), by nesting the
two clusters that are most similar, and so on
• There is a merging in each step until all observations end up in a single
cluster in the final step.
• Divisive
• All observations are initially assumed to belong to a single cluster
• The most dissimilar observation is extracted to form a separate cluster
• In step 1 there will be 2 clusters, in the second step three clusters and so
on, until the final step will produce as many clusters as the number of
observations.
• The number of clusters determines the stopping rule for the
algorithms
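
A minimal sketch of the agglomerative approach, using SciPy (an assumption, not part of the original slides): the linkage matrix records the full merging history from n singleton clusters down to one cluster, and the tree can then be cut at a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))             # 20 hypothetical cases, 3 variables

Z = linkage(X, method="average", metric="euclidean")  # full merge history (n-1 steps)
labels = fcluster(Z, t=4, criterion="maxclust")       # stopping rule: 4 clusters
# dendrogram(Z)  # optional plot of the whole merging sequence (needs matplotlib)
print(labels)
```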

Non-hierarchical clustering
• These algorithms do not follow a hierarchy and produce a
single partition
• Knowledge of the number of clusters (c) is required
• In the first step, initial cluster centres (the seeds) are
determined for each of the c clusters, either by the
researcher or by the software (usually the first c
observations, or c observations chosen at random)
• Each iteration allocates observations to each of the c
clusters, based on their distance from the cluster centres
• Cluster centres are computed again and observations may
be reallocated to the nearest cluster in the next iteration
• When no observations can be reallocated or a stopping rule
is met, the process stops

Distance between clusters
• Algorithms vary according to the way the
distance between two clusters is
defined.
• The most common algorithms for
hierarchical methods include
• single linkage method
• complete linkage method
• average linkage method
• Ward algorithm (described below)
• centroid method (described below)

Linkage methods
• Single linkage method (nearest neighbour):
distance between two clusters is the minimum
distance among all possible distances between
observations belonging to the two clusters.
• Complete linkage method (furthest neighbour):
nests two clusters using as a basis the maximum
distance between observations belonging to
separate clusters.
• Average linkage method: the distance between two
clusters is the average of all distances between
observations in the two clusters
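
The three linkage rules differ only in how they summarize the pairwise distances between members of the two clusters. A small sketch (NumPy/SciPy assumed, hypothetical cluster members):

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])   # members of cluster A
cluster_b = np.array([[4.0, 0.0], [6.0, 1.0]])   # members of cluster B

d = cdist(cluster_a, cluster_b)      # all pairwise member-to-member distances
single   = d.min()                   # single linkage: nearest neighbour
complete = d.max()                   # complete linkage: furthest neighbour
average  = d.mean()                  # average linkage: mean of all distances
print(single, complete, average)
```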

Ward algorithm
1. The sum of squared distances is computed within
each of the clusters, considering all distances
between observations within the same cluster
2. The algorithm proceeds by choosing the
aggregation between two clusters which
generates the smallest increase in the total sum
of squared distances.
• It is a computationally intensive method, because
at each step all the sums of squared distances
need to be computed, together with all potential
increases in the total sum of squared distances for
each possible aggregation of clusters.
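
A minimal sketch of the Ward criterion (NumPy assumed, hypothetical data): the "cost" of merging two clusters is the increase in the within-cluster sum of squares, and the merge with the smallest increase is chosen.

```python
import numpy as np

def within_ss(points):
    """Sum of squared distances of the points from their own centroid."""
    centroid = points.mean(axis=0)
    return ((points - centroid) ** 2).sum()

a = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A
b = np.array([[5.0, 0.0], [6.0, 0.0]])   # cluster B

# Increase in total within-cluster sum of squares if A and B are merged
increase = within_ss(np.vstack([a, b])) - (within_ss(a) + within_ss(b))
print(increase)   # candidate merges with a smaller increase are preferred
```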
Centroid method
• The distance between two clusters is the distance
between the two centroids
• Centroids are the cluster averages for each of the
variables
• each cluster is defined by a single set of coordinates, the
averages of the coordinates of all individual observations
belonging to that cluster
• Difference between the centroid and the average
linkage method
• Centroid: computes the average of the co-ordinates of
the observations belonging to an individual cluster
• Average linkage: computes the average of the distances
between two separate clusters.
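
To make the difference concrete, a small sketch (NumPy/SciPy assumed, hypothetical clusters) showing that the centroid distance and the average linkage distance generally take different values:

```python
import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[0.0, 0.0], [0.0, 2.0]])   # cluster A
b = np.array([[3.0, 0.0], [3.0, 2.0]])   # cluster B

centroid_dist = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))  # distance between cluster means: 3.0
average_link  = cdist(a, b).mean()                               # mean of pairwise distances: about 3.3
print(centroid_dist, average_link)
```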

Non-hierarchical clustering:
K-means method
1. The number k of clusters is fixed
2. An initial set of k “seeds” (aggregation centres) is
provided
• First k elements
• Other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned
to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Units can be reassigned in successive steps (optimising
partitioning)
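
A minimal sketch of steps 1-5 above in Python (NumPy assumed; the threshold variant and empty clusters are not handled here): seeds are chosen, units are assigned to the nearest seed, seeds are recomputed, and the loop stops when the seeds no longer change.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]       # step 2: initial seeds
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                 # step 3: assign to nearest seed
        new_centres = np.array([X[labels == c].mean(axis=0) for c in range(k)])  # step 4: new seeds
        if np.allclose(new_centres, centres):                     # step 5: no reclassification needed
            break
        centres = new_centres
    return labels, centres

X = np.random.default_rng(1).normal(size=(50, 2))                 # hypothetical data
labels, centres = kmeans(X, k=3)
print(np.bincount(labels))                                        # cluster sizes
```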

Non-hierarchical threshold methods

• Sequential threshold methods
• a prior threshold is fixed and units within that distance
are allocated to the first seed
• a second seed is selected and the remaining units are
allocated, etc.
• Parallel threshold methods
• more than one seed is considered simultaneously
• When reallocation is possible after each stage, the
methods are termed optimizing procedures.

Hierarchical vs. non-hierarchical
methods
Hierarchical methods
 No decision about the number of clusters
 Problems when data contain a high level of error
 Can be very slow; preferable with small data-sets
 Initial decisions are more influential (one-step only)
 At each step they require computation of the full proximity matrix

Non-hierarchical methods
 Faster, more reliable, work with large data sets
 Need to specify the number of clusters
 Need to set the initial seeds
 Only cluster distances to seeds need to be computed in each iteration

The number of clusters c
• Two alternatives
• Determined by the analysis
• Fixed by the researchers
• In segmentation studies, c represents the number of
potential separate segments.
• Preferable approach: “let the data speak”
• Hierarchical approach and optimal partition identified through
statistical tests
• However, the detection of the optimal number of clusters is subject
to a high degree of uncertainty
• If the research objectives allow the number of clusters to
be chosen rather than estimated, non-hierarchical
methods are the way to go.

Example: fixed number of clusters
• A retailer wants to identify several shopping
profiles in order to activate new and targeted
retail outlets
• The budget only allows him to open three types of
outlets
• A partition into three clusters follows naturally,
although it is not necessarily the optimal one.
• Fixed number of clusters and non-hierarchical (k-means)
approach

Example: c determined from the data
• Clustering of shopping profiles is expected to detect a new
market niche.
• For market segmentation purposes, it is less advisable to
constrain the analysis to a fixed number of clusters
• A hierarchical procedure allows one to explore all potentially valid numbers of
clusters
• For each of them there are some statistical diagnostics to pinpoint the best
partition.
• What is needed is a stopping rule for the hierarchical algorithm, which
determines the number of clusters at which the algorithm should stop.
• Statistical tests are not always unambiguous, leaving some room
for the researcher's experience and judgement
• Statistical criteria should be balanced against the interpretability
of, and the knowledge gained from, the final classification.

Determining the optimal number of
clusters from hierarchical methods
• Graphical
• dendrogram
• scree diagram
• Statistical
• ANOVA
• Discriminant Analysis

Dendrogram

[Dendrogram: the individual cases (e.g. 231, 275, 145, 181, …) are listed on the vertical axis, and the rescaled distance at which clusters are combined (0-25) runs along the horizontal axis. Cases 231 and 275 are merged first, at a relatively small distance; as the algorithm proceeds, the merging distances become larger. A dotted line indicates the distance between clusters.]

Scree diagram

[Scree diagram: merging distance on the y-axis, number of clusters (11 down to 1) on the x-axis. When one moves from 7 to 6 clusters, the merging distance increases noticeably.]
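
Such a scree diagram can be built outside SPSS from the merge distances of a hierarchical run. A minimal sketch (SciPy and Matplotlib assumed, hypothetical data): the last column of the SciPy linkage matrix holds the merging distances, plotted against the number of clusters remaining after each merge.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(100, 4))   # hypothetical data
Z = linkage(X, method="ward")

last_merges = Z[-10:, 2]                  # distances of the last 10 merges
n_clusters = np.arange(10, 0, -1)         # 10, 9, ..., 1 remaining clusters
plt.plot(n_clusters, last_merges, marker="o")
plt.gca().invert_xaxis()                  # read from many clusters down to one
plt.xlabel("Number of clusters")
plt.ylabel("Merging distance")
plt.show()
```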

Statistical tests

• The rationale is that in an optimal partition,
variability within clusters should be as small as
possible, while variability between clusters should
be maximized
• This principle is similar to the ANOVA F-test
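
The slides do not name a specific statistic; one commonly used implementation of this between/within rationale is the Calinski-Harabasz pseudo-F available in scikit-learn (an assumption, not part of the original material). A minimal sketch comparing candidate partitions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

X = np.random.default_rng(0).normal(size=(200, 3))   # hypothetical data
Z = linkage(X, method="ward")

for k in range(2, 8):                                 # candidate numbers of clusters
    labels = fcluster(Z, t=k, criterion="maxclust")
    # higher score = larger between-cluster relative to within-cluster variability
    print(k, calinski_harabasz_score(X, labels))
```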

Suggested approach:
a two-step procedure
1. First perform a hierarchical method to define
the number of clusters
2. Then use the k-means procedure to actually
form the clusters
The reallocation problem
• Rigidity of hierarchical methods: once a unit is classified into a
cluster, it cannot be moved to other clusters in subsequent steps
• The k-means method allows a reclassification of all units in each
iteration.
• If some uncertainty about the number of clusters remains after
running the hierarchical method, one may also run several k-
means clustering procedures and apply the previously discussed
statistical tests to choose the best partition.
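
A minimal sketch of this suggested approach (SciPy and scikit-learn assumed, hypothetical data): a Ward hierarchical run suggests the number of clusters and supplies the cluster means, which are then used as initial seeds for k-means so that units can still be reallocated.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 4))   # hypothetical data

# Step 1: hierarchical clustering; k chosen e.g. from the scree diagram
Z = linkage(X, method="ward")
k = 4
hier_labels = fcluster(Z, t=k, criterion="maxclust")
seeds = np.array([X[hier_labels == c].mean(axis=0) for c in range(1, k + 1)])

# Step 2: k-means started from the hierarchical cluster means
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
print(np.bincount(km.labels_))                        # final cluster sizes
```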

The SPSS two-step procedure
• The observations are preliminarily aggregated into clusters
using a hybrid hierarchical procedure based on a cluster
feature (CF) tree.
• This first step produces a number of pre-clusters, which is
higher than the final number of clusters, but much smaller
than the number of observations.
• In the second step, a hierarchical method is used to classify
the pre-clusters, obtaining the final classification.
• During this second clustering step, it is possible to
determine the number of clusters.
• The user can either fix the number of clusters or let the
algorithm search for the best one according to information
criteria, which are also based on goodness-of-fit measures.
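
The SPSS TwoStep algorithm itself is proprietary to SPSS. As a loose open-source analogue (an assumption, not the same procedure), scikit-learn's Birch also builds a cluster-feature tree of pre-clusters and then groups them with a further clustering step:

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(500, 4))   # hypothetical data
# threshold controls the size of the CF-tree pre-clusters; 4 final clusters requested
model = Birch(threshold=0.5, n_clusters=4).fit(X)
print(np.bincount(model.labels_))
```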

Evaluation and validation
• Goodness-of-fit of a cluster analysis
• ratio between the sum of squared errors and the total sum of
squared errors (similar to R²)
• root-mean-square standard deviation (RMSSTD) within clusters
• Validation: if the identified cluster structure
(number of clusters and cluster characteristics) is
real, it should not be an artefact of the specific sample or method used
• Validation approaches
• use of different samples to check whether the final output is
similar
• Split the sample into two groups when no other samples are
available
• Check for the impact of the initial seeds (non-hierarchical approach) and
the order of cases (hierarchical approach) on the final partition
• Check for the impact of the selected clustering method
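
A minimal sketch of split-sample validation (scikit-learn assumed, hypothetical data): cluster one half of the data, use its centres to classify the other half, recluster that half independently, and compare the two labelings with the adjusted Rand index; values close to 1 suggest a stable cluster structure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.random.default_rng(0).normal(size=(400, 4))    # hypothetical data
idx = np.random.default_rng(1).permutation(len(X))
half1, half2 = X[idx[:200]], X[idx[200:]]

km1 = KMeans(n_clusters=4, n_init=10, random_state=0).fit(half1)
transferred = km1.predict(half2)                       # half 2 classified by half-1 centres
km2 = KMeans(n_clusters=4, n_init=10, random_state=0).fit(half2)

print(adjusted_rand_score(transferred, km2.labels_))   # agreement between the two solutions
```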

Cluster analysis in SPSS

Three types of cluster analysis are available in SPSS: hierarchical,
k-means and two-step clustering, described in the following slides.

Hierarchical cluster analysis

Annotations on the SPSS dialog:
• Variables selected for the analysis
• Create a new variable with cluster membership for each case
• Clustering method
• Statistics required in the analysis
• Graphs (dendrogram) and options (advice: no plots)

Statistics

• The agglomeration schedule is a table which shows the steps of the
clustering procedure, indicating which cases (clusters) are merged and
the merging distance
• The proximity matrix contains all distances between cases (it may be huge)
• Cluster membership shows the cluster membership of individual cases,
only for a sub-set of solutions

Plots

• The dendrogram shows the clustering process, indicating which cases are
aggregated and the merging distance; with many cases, the dendrogram is
hardly readable
• The icicle plot (which can be restricted to cover a small range of
clusters) shows at what stage cases are clustered; the plot is cumbersome
and slows down the analysis (advice: no icicle)

Method

• Choose a hierarchical algorithm
• Choose the type of data (interval, counts, binary) and the appropriate
distance measure
• Specify whether the variables (values) should be standardized before the
analysis; z-scores return variables with zero mean and unit variance, and
other standardizations are possible. Distance measures can also be
transformed
Cluster memberships
If the number of clusters has been decided (or at least a
range of solutions), it is possible to save the cluster
membership for each case into new variables

The example: agglomeration schedule
Last 10 stages of the process (10 to 1 clusters)

Stage   No. of clusters   Cluster 1   Cluster 2   Distance   Diff. dist.
490          10               8          12        544.4         -
491           9               8          11        559.3        14.9
492           8               3           7        575.0        15.7
493           7               3         366        591.6        16.6
494           6               3           6        610.6        19.0
495           5               3          37        636.6        26.0
496           4              13          23        663.7        27.1
497           3               3          13        700.8        37.1
498           2               1           8        754.1        53.3
499           1               1           3        864.2       110.2

As the algorithm proceeds towards the end, the merging distance increases.

Scree diagram

[Scree diagram: distance (590-840) on the y-axis, number of clusters (7 down to 1) on the x-axis, with a possible elbow at 4 clusters.]

The scree diagram (not provided by SPSS but created from the
agglomeration schedule) shows a larger distance increase when the
cluster number goes below 4.

Hierarchical solution
with 4 clusters (Ward method)
Ward method (cluster means)

                                           Cluster 1   Cluster 2   Cluster 3   Cluster 4    Total
Case number (N%)                             26.6%       20.2%       23.8%       29.4%     100.0%
Household size                                1.4         3.2         1.9         3.1        2.4
Gross current income of household           238.0      1158.9       333.8       680.3      576.9
Age of Household Reference Person              72          44          40          48         52
EFS: Total Food & non-alcoholic beverage     28.8        64.4        29.2        60.6       45.4
EFS: Total Clothing and Footwear              8.8        64.3         9.2        19.0       23.1
EFS: Total Housing, Water, Electricity       25.1        77.7        33.5        39.1       41.8
EFS: Total Transport costs                   17.7       147.8        24.6        57.1       57.2
EFS: Total Recreation                        29.6       146.2        39.4        63.0       65.3

K-means solution (4 clusters)

Annotations on the SPSS dialog:
• Variables
• Number of clusters (fixed)
• Ask for one iteration (classify only) or more iterations before stopping
the algorithm
• It is possible to read a file with initial seeds or write final seeds
to a file

K-means options

• Creates a new variable with cluster membership for each case
• Improve the algorithm by allowing for more iterations and running means
(seeds are recomputed at each stage)
• More options, including an ANOVA table with statistics

Results from k-means
(initial seeds chosen by SPSS)

Final cluster centres
                                           Cluster 1   Cluster 2   Cluster 3   Cluster 4
Household size                                2.0         2.0         2.8         3.2
Gross current income of household           264.5       241.1       791.2      1698.1
Age of Household Reference Person              56          75          46          45
EFS: Total Food & non-alcoholic beverage     37.3        22.2        54.1        66.2
EFS: Total Clothing and Footwear             14.0        28.0        31.7        48.4
EFS: Total Housing, Water, Electricity       34.7       100.3        47.3        64.5
EFS: Total Transport costs                   28.4        10.4        78.3       156.8
EFS: Total Recreation                        39.6      3013.1        74.4       125.9

Number of cases in each cluster: cluster 1: 292, cluster 2: 1, cluster 3: 155,
cluster 4: 52 (valid 500, missing 0).

The k-means algorithm is sensitive to outliers, and SPSS chose an improbable
amount for recreation expenditure as an initial seed for cluster 2 (probably
an outlier due to misrecording or an exceptional expenditure).

Results from k-means: initial seeds from hierarchical clustering

Cluster means
                                           Cluster 1   Cluster 2   Cluster 3   Cluster 4    Total
Case number (N%)                             32.6%       10.2%       33.6%       23.6%     100.0%
Household size                                1.7         3.1         2.5         2.9        2.4
Gross current income of household           163.5      1707.3       431.8       865.9      576.9
Age of Household Reference Person              60          45          50          46         52
EFS: Total Food & non-alcoholic beverage     31.3        65.5        45.1        56.8       45.4
EFS: Total Clothing and Footwear             12.3        48.4        19.1        32.7       23.1
EFS: Total Housing, Water, Electricity       29.8        65.3        41.9        48.1       41.8
EFS: Total Transport costs                   24.6       156.8        37.4        87.5       57.2
EFS: Total Recreation                        30.3       126.8        67.9        83.4       65.3

The first cluster is now larger, but it still represents older and poorer households. The
other clusters are not very different from the ones obtained with the Ward algorithm,
indicating a certain robustness of the results.

2-step clustering

Annotations on the SPSS dialog:
• It is possible to make a distinction between categorical and continuous
variables
• An information criterion is used to choose the optimal partition
• The search for the optimal number of clusters may be constrained
• One may also ask for plots and descriptive statistics

Options

• It is advisable to control for outliers, because the analysis is usually
sensitive to them
• It is possible to choose which variables should be standardized prior to
running the analysis
• More advanced options are available for better control of the procedure

Output
• Results are not satisfactory
• With no prior decision on the number of clusters, two
clusters are found, one with a single observation and the
other with the remaining 499 observations
• Allowing for outlier treatment does not improve results
• Setting the number of clusters to four produces the following
cluster distribution:

              N     % of combined   % of total
Cluster 1       2        0.4%           0.4%
Cluster 2       5        1.0%           1.0%
Cluster 3     490       98.2%          98.2%
Cluster 4       2        0.4%           0.4%
Combined      499      100.0%         100.0%
Total         499                     100.0%

It seems that the two-step clustering is biased towards finding a
macro-cluster. This might be due to the fact that the number of
observations is relatively small, but the combination of the Ward
algorithm with the k-means algorithm is more effective.

SAS cluster analysis

• Compared to SPSS, SAS provides more
diagnostics and the option of non-parametric
clustering through three SAS/STAT
procedures
• the CLUSTER and VARCLUS procedures (for
hierarchical and k-th nearest-neighbour methods)
• the FASTCLUS procedure (for non-hierarchical
methods)
• the MODECLUS procedure (for non-parametric
methods)
Discussion
• It might seem that cluster analysis is too sensitive to the
researcher's choices
• This is partly due to the relatively small data-set and
possibly to correlation between variables
• However, all outputs point to a segment of older and
poorer households and another of younger and larger
households with high expenditures
• By intensifying the search and adjusting some of the
settings, cluster analysis does help identify
homogeneous groups
• "Moral": cluster analysis needs to be adequately validated;
it may be risky to run a single cluster analysis and take
the results as truly informative, especially in the presence of
outliers
