
Welcome to PowerPoint slides for Chapter 6

Cluster Analysis for Market Segmentation
Slide 1

1. A cluster, by definition, is a group of similar objects.

2. There could be clusters of people, brands or other objects.

3. If clusters are formed of customers similar to one
another, then cluster analysis can help marketers
identify segments (clusters).

4. If clusters of brands are formed, this can be used to
gain insights into brands that are perceived as similar to
each other on a set of attributes.

5. This chapter explains the use of cluster analysis for
customer segmentation.

6. Cluster analysis is best performed when the variables
are interval or ratio-scaled.
Slide 2

1. There are two major classes of cluster analysis
techniques: hierarchical and non-hierarchical.

2. In hierarchical clustering, some measure of
distance is used to identify distances between all pairs
of objects to be clustered. One of the popular distance
measures used is Euclidean Distance. Another is the
Squared Euclidean Distance.

3. We begin with all objects in separate clusters. Say
we have ten objects in separate clusters. The two closest
objects are joined to form a cluster. The remaining 8
objects would remain separate. This is stage 1 of
hierarchical clustering.
Slide 2 contd...

4. In stage 2, again the two closest objects form another
cluster. Now, we have two clusters, and 6 unclustered
objects. This means a total of eight clusters, two with
two objects each, and six with one object each.

5. As this process continues, points join existing
clusters (because they are closest to an existing cluster),
and clusters join other clusters, based on the shortest
distance criterion.

6. In this way, a range of possible solutions is formed,
from a 10-cluster solution in the beginning, to a single
cluster solution at the end.

7. We have to decide how many clusters the data seems
to have, using either the agglomeration schedule or the
dendrogram. Both of these are computer outputs that
describe, in numbers or visually, the sequence of cluster
formation. This decision is somewhat subjective, but
there are some guidelines one can follow, as illustrated
in the worked example.
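
The merging sequence described in points 3 to 7 can be sketched in a few lines of Python. This is only an illustrative sketch: the five two-dimensional points are made up, the function names are hypothetical, and it assumes squared Euclidean distance with average linkage (the linkage named later for the dendrogram). It records one row per stage, much like an agglomeration schedule.

```python
from itertools import combinations

def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_linkage(points):
    """Return a schedule of (stage, members_a, members_b, coefficient)."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    schedule = []
    stage = 0
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average distance.
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(sq_euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        stage += 1
        schedule.append((stage, clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return schedule

# Hypothetical toy data: five objects measured on two variables.
points = [(1, 1), (1, 2), (5, 5), (6, 5), (9, 1)]
schedule = average_linkage(points)
for row in schedule:
    print(row)
```

With five objects there are four stages; the coefficient rises from stage to stage as less similar clusters are forced together, which is the pattern the agglomeration schedule is read for.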
Slide 3

1. In non-hierarchical clustering methods (of which
k-means clustering is the most widely used), we need to
specify the number of clusters we want the objects to
be clustered into.

2. This can be done if we have a hypothesis that the
objects will group into a certain number of clusters.
Alternatively, we can first do a hierarchical clustering
on the data, find the approximate number of clusters,
and then perform a k-means clustering.

3. In our illustration, we have used both hierarchical
and non-hierarchical methods in combination with one
another.

4. Let us move on to our worked example
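
Before the worked example, the k-means idea from points 1 and 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the data are made up, the function names are hypothetical, and the first k points are used as initial centres purely to keep the run deterministic (statistical packages choose initial centres differently).

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100):
    """Assign each point to the nearest of k centres, then update the centres."""
    centres = [list(p) for p in points[:k]]  # assumption: first k points
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_euclidean(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:  # memberships stable: stop
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                # New centre is the mean of each variable over the members.
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centres

# Hypothetical toy data: six objects, two variables, k specified in advance.
points = [(1, 1), (5, 5), (1, 2), (6, 5), (2, 1), (5, 6)]
assignment, centres = k_means(points, k=2)
print(assignment)
print(centres)
```

The returned centres play the same role as the "final cluster centres" output discussed in the worked example: one mean per variable per cluster.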


Slide 4
Worked Out Example

Problem: A major Indian FMCG company wants to map
the profile of its target market in terms of lifestyle,
attitudes and perceptions. The company's managers
prepare, with the help of their marketing research team, a
set of 15 statements, which they feel measure many of the
variables of interest. These 15 statements are given below.
The respondent had to agree or disagree (1 = Strongly
Agree, 2 = Agree, 3 = Neither Agree nor Disagree, 4 =
Disagree, 5 = Strongly Disagree) with each statement.

1. I prefer to use e-mail rather than write a letter.
2. I feel that quality products are always priced high.
3. I think twice before I buy anything.
4. Television is a major source of entertainment.
5. A car is a necessity rather than a luxury.
6. I prefer fast food and ready to use products.
7. People are more health conscious today.
8. Entry of foreign companies has increased the efficiency
of Indian companies.
9. Women are active participants in purchase decisions.
10. I believe politicians can play a positive role.
11. I enjoy watching movies.
12. If I get a chance, I would like to settle abroad.
13. I always buy branded products.
14. I frequently go out on weekends.
15. I prefer to pay by credit card rather than in cash.
Slide 5
For the purpose of this illustration, we will assume
that 20 respondents answered the questionnaire above
(In a real life situation, the sample size would be
higher). The input data matrix of 20 respondents x 15
variables is shown in fig 1.

Fig. 1
Case VAR00001 VAR00002 VAR00003 VAR00004 VAR00005 VAR00006 VAR00007 VAR00008
1. 1.00 3.00 5.00 4.00 3.00 5.00 3.00 2.00
2. 2.00 3.00 2.00 3.00 4.00 4.00 3.00 2.00
3. 3.00 2.00 3.00 4.00 3.00 5.00 3.00 3.00
4. 3.00 2.00 4.00 2.00 2.00 4.00 3.00 4.00
5. 2.00 2.00 4.00 2.00 2.00 5.00 3.00 3.00
6. 2.00 4.00 3.00 3.00 5.00 4.00 4.00 2.00
7. 1.00 1.00 2.00 4.00 4.00 1.00 2.00 4.00
8. 4.00 5.00 1.00 4.00 5.00 4.00 5.00 1.00
9. 2.00 1.00 5.00 3.00 4.00 4.00 2.00 1.00
10. 5.00 2.00 4.00 3.00 2.00 5.00 1.00 5.00
11. 4.00 3.00 3.00 2.00 1.00 2.00 1.00 5.00
12. 3.00 4.00 4.00 4.00 3.00 2.00 5.00 1.00
13. 4.00 3.00 2.00 2.00 3.00 3.00 4.00 2.00
14. 1.00 2.00 2.00 4.00 2.00 5.00 1.00 3.00
15. 2.00 3.00 4.00 1.00 5.00 4.00 2.00 4.00
16. 3.00 2.00 1.00 3.00 4.00 3.00 2.00 3.00
17. 5.00 1.00 1.00 5.00 1.00 2.00 4.00 2.00
18. 3.00 5.00 5.00 3.00 5.00 5.00 5.00 1.00
19. 3.00 2.00 4.00 2.00 4.00 4.00 1.00 4.00
20. 1.00 3.00 3.00 2.00 2.00 5.00 2.00 5.00
Slide 5 contd...

Fig 1 contd...
Case VAR00009 VAR00010 VAR00011 VAR00012 VAR00013 VAR00014 VAR00015
1. 3.00 2.00 4.00 1.00 1.00 1.00 5.00
2. 2.00 2.00 4.00 2.00 2.00 2.00 4.00
3. 4.00 2.00 4.00 3.00 4.00 4.00 3.00
4. 5.00 4.00 5.00 4.00 5.00 5.00 5.00
5. 4.00 4.00 5.00 5.00 3.00 3.00 4.00
6. 3.00 4.00 5.00 4.00 3.00 3.00 3.00
7. 2.00 5.00 4.00 3.00 3.00 3.00 1.00
8. 1.00 5.00 3.00 3.00 5.00 5.00 2.00
9. 2.00 1.00 2.00 2.00 4.00 4.00 3.00
10. 3.00 2.00 5.00 1.00 2.00 2.00 1.00
11. 2.00 2.00 4.00 5.00 1.00 1.00 2.00
12. 5.00 3.00 2.00 4.00 4.00 4.00 3.00
13. 2.00 3.00 4.00 3.00 5.00 5.00 4.00
14. 5.00 4.00 3.00 2.00 2.00 2.00 5.00
15. 4.00 5.00 2.00 1.00 1.00 1.00 4.00
16. 2.00 5.00 1.00 2.00 5.00 5.00 3.00
17. 2.00 4.00 4.00 3.00 3.00 3.00 2.00
18. 2.00 3.00 4.00 4.00 2.00 2.00 1.00
19. 1.00 3.00 4.00 5.00 3.00 3.00 2.00
20. 1.00 3.00 4.00 4.00 3.00 3.00 3.00
Slide 6

The computer output is obtained by first doing a
hierarchical cluster analysis to find the number of
clusters that exist in the data. These outputs are in
figs 2 to 4 (Agglomeration Schedule, vertical Icicle
Plot and Dendrogram using Average Linkage,
respectively).

The second stage is a K-means (quick cluster)
output, for which the number of clusters has to be
specified in advance. In this case, the output is for 4
clusters. We will look at both stage 1 and stage 2
outputs to understand the interpretation of both
stages.
Slide 7

Fig. 2 : Agglomeration Schedule


Columns: Stage, Clusters Combined (Cluster 1, Cluster 2),
Coefficient, Stage Cluster 1st Appears (Cluster 1, Cluster 2),
Next Stage

1 4 5 14.000000 0 0 5
2 19 20 16.000000 0 0 7
3 6 18 17.000000 0 0 9
4 1 2 17.000000 0 0 8
5 3 4 20.000000 0 1 11
6 13 16 25.000000 0 0 13
7 11 19 28.000000 0 2 11
8 1 14 28.500000 0 0 10
9 6 8 32.500000 0 0 12
10 1 15 34.666668 0 0 14
11 3 11 36.444443 0 7 15
12 6 12 36.666668 0 0 19
13 7 13 39.500000 0 6 17
14 1 9 41.000000 10 0 16
15 3 10 41.666668 11 0 16
16 1 3 46.342857 14 15 18
17 7 17 47.000000 13 0 18
18 1 7 51.791668 16 17 19
19 1 6 58.156250 18 12 0
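
As a quick check on stage 1 of fig 2, the coefficient for cases 4 and 5 can be recomputed by hand. The sketch below assumes the coefficient at this stage is the squared Euclidean distance over all 15 variables, and uses the two rows exactly as printed in fig 1; the function name is illustrative.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length response vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Responses of cases 4 and 5 on variables 1 to 15, as printed in fig 1.
case_4 = [3, 2, 4, 2, 2, 4, 3, 4, 5, 4, 5, 4, 5, 5, 5]
case_5 = [2, 2, 4, 2, 2, 5, 3, 3, 4, 4, 5, 5, 3, 3, 4]

distance_4_5 = sq_euclidean(case_4, case_5)
print(distance_4_5)
```

With the printed values this gives 14, which agrees with the stage 1 coefficient of 14.000000 in fig 2.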
Slide 8

1. A look at fig 2, the agglomeration schedule,
can help us to identify large differences in the
coefficient (4th column). The agglomeration
schedule from top to bottom (stage 1 to 19)
indicates the sequence in which cases get
combined with others (or one cluster combines
with another), until all 20 cases are combined
together in one cluster at the last stage (stage
19).

2. Therefore, stage 19 represents a 1 cluster
solution, stage 18 represents a 2 cluster solution,
stage 17 represents a 3 cluster solution, and so
on, going up from the last row to the first row.
We have to identify how many clusters are in
the data. We use the difference between rows in
a measure called the coefficient (also known as
the fusion coefficient) in column 4 to identify the
number of clusters in the data.
Slide 8 Contd….
3. We will look at this figure from the last row upwards,
because we would like to have the lowest possible number of
clusters, for reasons of economy and ease of interpretation.
We see that there is a difference of (58.156 - 51.792) in the
coefficients between the 1 cluster solution (stage 19) and the 2
cluster solution (stage 18). This is a difference of 6.36. The
next difference is (51.79 - 47.00), which is equal to 4.79
(between stage 18, the 2 cluster solution, and stage 17, the 3
cluster solution). The next one after that is (47.00 - 46.34), only
0.66, between stage 17 and stage 16. After this, there is again
a large difference between the 4 cluster and 5 cluster
solutions (stages 16 and 15), of (46.343 - 41.667) or 4.68.
Thereafter, the differences are smaller between subsequent
rows of coefficients.

4. A large difference in the coefficient values between any
two rows indicates a solution pertaining to the number of
clusters which the lower row represents. Ignoring the first
difference of 6.36, which would indicate only 1 cluster in the
data, we look at the next largest differences. 4.79 is the
difference between row 2 from the bottom and row 3 from the
bottom, indicating a 2 cluster solution. But the difference
between stage 16 and stage 15 is almost the same, indicating
a 4 cluster solution. At this point, it is the judgement of the
researcher which should decide whether to go for a 2 cluster
or a 4 cluster solution. Just for illustration, we will choose
the 4 cluster solution.
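
The differences discussed above can be reproduced directly from the coefficient column of fig 2. The short sketch below uses the last five coefficients as printed; the variable names are illustrative, and the mapping from stage to number of clusters (20 cases minus the stage number) follows point 2 on Slide 8.

```python
# Coefficients from fig 2 for stages 15 to 19 (stage -> coefficient).
coefficients = {
    15: 41.666668,
    16: 46.342857,
    17: 47.000000,
    18: 51.791668,
    19: 58.156250,
}
n_cases = 20

jumps = []
for stage in sorted(coefficients, reverse=True):
    if stage - 1 in coefficients:
        n_clusters = n_cases - stage  # stage 19 -> 1 cluster, 18 -> 2, ...
        diff = coefficients[stage] - coefficients[stage - 1]
        jumps.append((n_clusters, round(diff, 2)))

for n_clusters, diff in jumps:
    print(f"{n_clusters} cluster solution: jump of {diff}")
```

The jumps for the 2 cluster and 4 cluster solutions come out nearly equal, which is why the choice between them is left to the researcher's judgement.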
Slide 9

Now, in stage 2, a k-means clustering is run with a 4
cluster solution requested (as identified from the
hierarchical clustering done above). In the given
problem, figs 5, 6, 7 and 8 indicate the outputs of K-means
clustering for a 4 cluster solution. These
outputs give us the initial cluster centres, the case
listing of cluster membership (i.e., which case
belongs to which of the clusters), the final cluster
centres (the solution) and an ANOVA table.

Fig. 7 : Final Cluster Centers

Cluster VAR00001 VAR00002 VAR00003 VAR00004


1 2.2000 2.2000 3.8000 3.2000
2 3.5000 3.6667 2.6667 3.5000
3 1.7500 2.0000 2.2500 3.0000
4 3.0000 2.4000 3.6000 2.2000

Cluster VAR00005 VAR00006 VAR00007 VAR00008


1 3.2000 4.4000 2.8000 2.4000
2 3.6667 3.3333 4.5000 1.5000
3 3.7500 3.2500 1.7500 3.5000
4 2.2000 4.2000 1.6000 4.4000
Slide 9 Contd….

Fig 7 contd...

Cluster VAR00009 VAR00010 VAR00011 VAR00012


1 3.2000 2.2000 3.8000 2.4000
2 2.5000 3.6667 3.6667 3.5000
3 3.2500 4.7500 2.5000 2.0000
4 2.2000 2.8000 4.4000 4.0000

Cluster VAR00013 VAR00014 VAR00015


1 2.4000 3.2000 4.0000
2 4.1667 3.6667 2.5000
3 1.2500 2.7500 3.2500
4 3.0000 2.4000 2.4000
Slide 10

1. The final cluster centres (above) describe the mean value
of each variable for each of the 4 clusters. For example,
cluster 1 is described by the mean values of variable 1 = 2.2,
variable 2 = 2.2, variable 3 = 3.8, variable 4 = 3.2 and so on.
Similarly, cluster 3 is described by variable 1 = 1.75,
variable 2 = 2.0, variable 3 = 2.25, variable 4 = 3.0, and
so on.

2. We now go back to the original variables (in this case the
15 statements in our questionnaire), and interpret the clusters
in terms of the 15 variables. For example, cluster 3 consists
of people who prefer e-mail to writing
conventional letters (variable 1 value = 1.75, which is
equivalent to "agree" on the scale of 1 to 5). Similarly, they
are also people who tend to think twice before buying
anything (variable 3 value = 2.25), in other words, careful
spenders. They also agree (variable 2 value = 2.00) that
quality products are always priced high; that is, they see a
positive correlation between a product's quality
and its price.

3. On these same variables, cluster 2 shows people who
prefer conventional mail to e-mail (variable 1 value = 3.5 or
close to "disagree"), people who do not necessarily associate
high price with good quality (variable 2 value = 3.67), and
who tend to be neutral about care in spending (variable 3
value = 2.67). In this way, when we compare final cluster centre
values on each of the 15 variables, for 1 cluster at a time, a
complete picture of the clusters emerges.
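
The reading of cluster centres against the 1 to 5 scale can be sketched as a small lookup. This is only an illustration: it takes the fig 7 centres of clusters 2 and 3 on variables 1 to 3 and maps each mean to the nearest scale point, rounding halves upwards (an assumption made here, not a rule stated in the chapter).

```python
SCALE = {
    1: "Strongly Agree",
    2: "Agree",
    3: "Neither Agree nor Disagree",
    4: "Disagree",
    5: "Strongly Disagree",
}

# Final cluster centres from fig 7 for variables 1, 2 and 3.
centres = {
    2: [3.5000, 3.6667, 2.6667],
    3: [1.7500, 2.0000, 2.2500],
}

def nearest_label(mean_value):
    """Map a cluster mean to the nearest scale point (halves round up)."""
    return SCALE[int(mean_value + 0.5)]

labels = {cluster: [nearest_label(m) for m in means]
          for cluster, means in centres.items()}

for cluster, names in labels.items():
    print(cluster, names)
```

Cluster 3 comes out as "Agree" on all three statements, while cluster 2 leans towards "Disagree" on the first two and is neutral on the third, matching the interpretation given above.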
Slide 11

In this case, we will briefly describe each of the 4 clusters
as follows:

Cluster 1

E-mail users, feel quality comes at a price, not careful
spenders, do not like television much, do not think a car is
a necessity, do not like fast food and ready to use products,
are not sure whether people are more health-conscious
today, think foreign companies have somewhat increased
the efficiency of Indian companies, disagree that women
are active purchasing decision makers, feel that politicians
can play a positive role, do not enjoy watching movies,
might consider settling abroad, tend to buy branded
products, do not go out much on weekends and like to pay
cash, rather than charging to their credit cards (if they have
one).

It is thus a cluster exhibiting many traditional values,
except that they have adapted to e-mail use. They are also
beginning to loosen their purse strings, and are probably in
transition on some other factors, like acceptance of women
as decision makers and the advent of credit cards.
Slide 11 contd...

Cluster 2

Regular mail writers, bargain hunters or aggressive buyers,
not too particular about thinking before spending, not great
valuers of TV, believe the car is a luxury, not too fond of fast
food and convenience products, do not think people are very
health conscious, feel foreign companies have done us good,
think women are active purchasing decision makers, do not
believe in politicians, do not like movies, do not want to
settle abroad, do not insist on branded products, do not go
out on weekends, but do prefer credit cards for payments.

It is a group which likes to use credit, spends more freely,
believes in woman power, believes in economics rather than
politics, and feels quality products can be cheap. Also, they
seem to have a patriotic streak, as they do not want to settle
abroad.
Slide 12
Cluster 3

E-mail users, quality measured by price, think twice before
buying, indifferent to TV, car is a luxury to them, not too
fond of fast food, agree that people are health conscious, do
not think foreign companies have made us efficient, do not
believe in woman power, detest politicians, enjoy watching
movies, willing to settle abroad, always buy branded
products, go out on weekends, slightly prefer cash to credit
cards.

This group is not a free spending one, but health conscious,
more patriarchal, more loyal to branded products,
but outgoing compared to other groups, even willing to go
abroad to settle.
Slide 12 contd...

Cluster 4

Not too particular about e-mail, measure quality by
price, free spending, enjoy watching TV, think a car is
necessary, not fond of fast food, think people are health
conscious, do not think foreign companies have made
us efficient, believe in woman power, somewhat
positive about politicians, not movie watchers, do not
want to settle abroad, indifferent to branding,
moderately outgoing and moderately in favour of credit
cards rather than cash.

This group is optimistic, free spending and a good
target for TV advertising, particularly consumer
durables and entertainment. But they are not
necessarily influenced by brands. They may want value
for money, but if they see value, they may spend a lot.

In summary, the cluster analysis of this sample of
respondents tells us a lot about the possible segments
which exist in the target population.
Slide 13

ANOVA:
Fig. 8 : Analysis of Variance
Variable Cluster MS DF Error MS DF F Prob
VAR00001 3.0500 3 1.315 16.0 2.3183 .114
VAR00002 3.0722 3 1.083 16.0 2.8359 .071
VAR00003 2.5722 3 1.630 16.0 1.5778 .234
VAR00004 1.6333 3 .943 16.0 1.7307 .201
VAR00005 2.5056 3 1.605 16.0 1.5609 .238
VAR00006 1.7056 3 1.505 16.0 1.1331 .365
VAR00007 9.6500 3 .390 16.0 24.7040 .000
VAR00008 8.5500 3 .681 16.0 12.5505 .000
VAR00009 1.3000 3 1.865 16.0 .6968 .567
VAR00010 5.5056 3 .730 16.0 7.5397 .002
VAR00011 2.7389 3 1.020 16.0 2.6830 .082
VAR00012 4.0833 3 1.293 16.0 3.1562 .054
VAR00013 7.2556 3 .799 16.0 9.0813 .001
VAR00014 1.6222 3 1.880 16.0 .8628 .480
VAR00015 2.8500 3 1.465 16.0 1.9446 .163
Slide 13 contd...

The ANOVA table (fig. 8) tells us which of the 15
variables are significantly different across the 4
clusters. The last column indicates that variables 2, 7,
8, 10, 11, 12 and 13 are significant at the 0.10 level
(equivalent to a 90% confidence level), as they have
prob. values less than 0.10. The other variables are
not statistically significant, as they all have prob.
values greater than 0.10. But there is divided opinion
about the utility of statistical testing for cluster
analysis. Most established writers seem to feel that
these tests (ANOVA or other tests) are not valid,
since the clusters were themselves formed so as to
differ on these variables.
Therefore, it is left to the researcher's judgement
whether to use these tests in determining
which variables are significant. If the tests are used,
then the interpretation of clusters and differences
across clusters should be only on the basis of those
variables which are (statistically) significantly
different across clusters at 0.10 or 0.05 or some other
level.
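
The screening described above can be sketched with the figures from fig 8. The snippet is illustrative only: it copies the printed prob. values, picks out the variables below a chosen significance level, and recomputes one F value as Cluster MS divided by Error MS to show where that column comes from. The variable names are made up, and small differences from the printed F arise because the printed Error MS is rounded.

```python
# Prob. values from the last column of fig 8 (variable number -> prob.).
prob = {
    1: 0.114, 2: 0.071, 3: 0.234, 4: 0.201, 5: 0.238,
    6: 0.365, 7: 0.000, 8: 0.000, 9: 0.567, 10: 0.002,
    11: 0.082, 12: 0.054, 13: 0.001, 14: 0.480, 15: 0.163,
}

alpha = 0.10  # significance level chosen by the researcher
significant = [var for var, p in sorted(prob.items()) if p < alpha]
print(significant)

# F is the ratio of the between-cluster to the within-cluster mean square.
cluster_ms_var7 = 9.6500
error_ms_var7 = 0.390
f_var7 = cluster_ms_var7 / error_ms_var7
print(round(f_var7, 2))  # close to the printed 24.7040
```

Changing `alpha` to 0.05 would drop variables 2, 11 and 12 from the list, which shows how sensitive the interpretation is to the chosen level.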
Slide 14
Additional Comments on Cluster Analysis

Objects

We have looked at an example of classifying people,
with interval-scaled data. It is also possible to classify
objects such as brands, products, cities, etc. with cluster
analysis. For example, we may ask which brands are clustered
together in terms of consumer perceptions for a
positioning exercise, or which cities are clustered
together in terms of the income, education and age profile
of their residents.

Number of Clusters

One of the main decisions of a researcher is to decide
how many clusters are present in the data. In certain
cases, if for example we have a prior hypothesis about
how many clusters ought to be present, this decision
may already be made. But otherwise, it tends to be a
subjective decision. One of the criteria that can be used
in addition to the ones we have described in the chapter is
that every cluster must have a reasonable or minimum
number of objects. This means that if a cluster comes out
with only one or two objects in it, we should look for another
solution.

It may be useful to experiment with two or three
possible solutions before deciding on the number of
clusters.
Slide 15

Variables

Once the reader is aware of the basics of cluster
analysis, it can be used creatively. For example,
a cluster analysis can be done on some of the measured
variables, and then other variables can be checked to
see if they also exhibit differences across clusters. In
the worked out example discussed earlier, only
psychographic or behavioural variables were used to
get the 4 clusters. We could then see if the members of the
clusters belonged to different places, had different education
levels, or whether one gender figured predominantly in any one
of the clusters.

Scale

Cluster analysis is ideally suited to interval scaled
variables, because Euclidean distance is a commonly
used distance measure in the clustering process.
But nominal and ordinal level data can be used after
standardisation if appropriate. This may also
necessitate the use of other measures of distance, more
appropriate to the scales of the variables being used. But
this should be done with care. In general, it is a good
idea to standardise the variables before clustering, if
the units of measurement are radically different.
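
The point about radically different units can be illustrated with a short standardisation sketch. The income and age figures are made up, the function name is hypothetical, and z-scores based on the population standard deviation are one common choice (an assumption here; packages may offer others).

```python
from statistics import mean, pstdev

def standardise(column):
    """Convert a column of values to z-scores (mean 0, standard deviation 1)."""
    m = mean(column)
    s = pstdev(column)
    return [(x - m) / s for x in column]

# Hypothetical data: income in rupees and age in years for three respondents.
income = [20000, 40000, 60000]
age = [25, 35, 45]

z_income = standardise(income)
z_age = standardise(age)
print(z_income)
print(z_age)
```

Before standardisation, income differences in the thousands would swamp age differences in any Euclidean distance; afterwards both variables are on the same footing.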
Slide 15 Contd...

Statistical Tests

As mentioned briefly earlier, some statistical tests
for cluster analysis are available. But since their validity
is questionable, caution is recommended in
using either ANOVA or any other tests.

A general caution about cluster analysis itself is
that it tends to produce different results with
different methods, and some methods are quite
vulnerable to errors in data. So, the stability of the
clusters can be checked by splitting the
sample and repeating the cluster analysis.
