
Cluster Analysis for Market Segmentation

Application Areas
1. A cluster, by definition, is a group of similar objects.
2. There could be clusters of people, brands or other objects.
3. If clusters are formed of customers similar to one another, then cluster analysis can help marketers
identify segments (clusters).
4. If clusters of brands are formed, this can be used to gain insights into brands that are perceived as similar
to each other on a set of attributes.
5. Cluster analysis is best performed when the variables are interval or ratio-scaled.

Worked Out Example

Problem:
A major Indian FMCG company wants to map the profile of its target market in terms of lifestyle,
attitudes and perceptions. The company's managers prepare, with the help of their marketing research
team, a set of 15 statements, which they feel measure many of the variables of interest. These 15
statements are given below. The respondent had to agree or disagree (1 = Strongly Agree, 2 = Agree, 3
= Neither Agree nor Disagree, 4 = Disagree, 5 = Strongly Disagree) with each statement.

1. I prefer to use e-mail rather than write a letter.

2. I feel that quality products are always priced high.

3. I think twice before I buy anything.

4. Television is a major source of entertainment.

5. A car is a necessity rather than a luxury.

6. I prefer fast food and ready to use products.

7. People are more health conscious today.

8. Entry of foreign companies has increased the efficiency of Indian companies.

9. Women are active participants in purchase decisions.

10. I believe politicians can play a positive role.

11. I enjoy watching movies.

12. If I get a chance, I would like to settle abroad.

13. I always buy branded products.

14. I frequently go out on weekends.

15. I prefer to pay by credit card rather than in cash.

For the purpose of this illustration, we will assume that 20 respondents answered the questionnaire above (in
a real-life situation, the sample size would be higher). The input data matrix of 20 respondents x 15 variables
is shown in Fig. 1.

Fig. 1

Respondent var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 var13 var14 var15
1. 1.00 3.00 5.00 4.00 3.00 5.00 3.00 2.00 3.00 2.00 4.00 1.00 1.00 1.00 5.00
2. 2.00 3.00 2.00 3.00 4.00 4.00 3.00 2.00 2.00 2.00 4.00 2.00 2.00 2.00 4.00
3. 3.00 2.00 3.00 4.00 3.00 5.00 3.00 3.00 4.00 2.00 4.00 3.00 4.00 4.00 3.00
4. 3.00 2.00 4.00 2.00 2.00 4.00 3.00 4.00 5.00 4.00 5.00 4.00 5.00 5.00 5.00
5. 2.00 2.00 4.00 2.00 2.00 5.00 3.00 3.00 4.00 4.00 5.00 5.00 3.00 3.00 4.00
6. 2.00 4.00 3.00 3.00 5.00 4.00 4.00 2.00 3.00 4.00 5.00 4.00 3.00 3.00 3.00
7. 1.00 1.00 2.00 4.00 4.00 1.00 2.00 4.00 2.00 5.00 4.00 3.00 3.00 3.00 1.00
8. 4.00 5.00 1.00 4.00 5.00 4.00 5.00 1.00 1.00 5.00 3.00 3.00 5.00 5.00 2.00
9. 2.00 1.00 5.00 3.00 4.00 4.00 2.00 1.00 2.00 1.00 2.00 2.00 4.00 4.00 3.00
10. 5.00 2.00 4.00 3.00 2.00 5.00 1.00 5.00 3.00 2.00 5.00 1.00 2.00 2.00 1.00
11. 4.00 3.00 3.00 2.00 1.00 2.00 1.00 5.00 2.00 2.00 4.00 5.00 1.00 1.00 2.00
12. 3.00 4.00 4.00 4.00 3.00 2.00 5.00 1.00 5.00 3.00 2.00 4.00 4.00 4.00 3.00
13. 4.00 3.00 2.00 2.00 3.00 3.00 4.00 2.00 2.00 3.00 4.00 3.00 5.00 5.00 4.00
14. 1.00 2.00 2.00 4.00 2.00 5.00 1.00 3.00 5.00 4.00 3.00 2.00 2.00 2.00 5.00
15. 2.00 3.00 4.00 1.00 5.00 4.00 2.00 4.00 4.00 5.00 2.00 1.00 1.00 1.00 4.00
16. 3.00 2.00 1.00 3.00 4.00 3.00 2.00 3.00 2.00 5.00 1.00 2.00 5.00 5.00 3.00
17. 5.00 1.00 1.00 5.00 1.00 2.00 4.00 2.00 2.00 4.00 4.00 3.00 3.00 3.00 2.00
18. 3.00 5.00 5.00 3.00 5.00 5.00 5.00 1.00 2.00 3.00 4.00 4.00 2.00 2.00 1.00
19. 3.00 2.00 4.00 2.00 4.00 4.00 1.00 4.00 1.00 3.00 4.00 5.00 3.00 3.00 2.00
20. 1.00 3.00 3.00 2.00 2.00 5.00 2.00 5.00 1.00 3.00 4.00 4.00 3.00 3.00 3.00
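
Clustering procedures work on distances between the rows of this matrix. As a minimal sketch, assuming squared Euclidean distance (the source does not name the distance measure), the distance between respondents 4 and 5 can be computed as follows. The function and variable names are illustrative only.

```python
# Responses of respondents 4 and 5, copied from Fig. 1 (var1..var15).
respondent_4 = [3, 2, 4, 2, 2, 4, 3, 4, 5, 4, 5, 4, 5, 5, 5]
respondent_5 = [2, 2, 4, 2, 2, 5, 3, 3, 4, 4, 5, 5, 3, 3, 4]


def squared_euclidean(a, b):
    """Sum of squared differences across all variables (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


distance_4_5 = squared_euclidean(respondent_4, respondent_5)
print(distance_4_5)  # 14
```

Under this assumption, the result equals the coefficient at which cases 4 and 5 are combined in stage 1 of Fig. 2.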

The computer output is obtained by first doing a hierarchical cluster analysis to find the number of clusters
that exist in the data. These outputs are Agglomeration schedule, vertical Icicle Plot and Dendrogram using
Average Linkage, respectively.
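
As a rough sketch of how an agglomeration schedule comes about, the function below repeatedly merges the two clusters with the smallest average pairwise distance and records the coefficient of each merge. It assumes squared Euclidean distance and between-groups average linkage. The four one-dimensional points are hypothetical toy data, not the survey responses, and the case indices are 0-based.

```python
from itertools import combinations


def squared_euclidean(a, b):
    """Sum of squared differences across all variables (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_linkage_schedule(points):
    """Return a list of (members_a, members_b, coefficient), one per merge."""
    clusters = [[i] for i in range(len(points))]  # every case starts alone
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average distance over all pairs with one case in each cluster.
            total = sum(squared_euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            average = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or average < best[0]:
                best = (average, a, b)
        average, a, b = best
        schedule.append((clusters[a], clusters[b], average))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return schedule


# Hypothetical toy data: four cases measured on a single variable.
toy_points = [(0,), (1,), (5,), (7,)]
schedule = average_linkage_schedule(toy_points)
for stage, (left, right, coefficient) in enumerate(schedule, start=1):
    print(stage, left, right, coefficient)
```

With n cases there are always n - 1 merges, which is why 20 respondents give a schedule with 19 stages.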

The second stage is a K-means (quick cluster) output with a pre-determined number of clusters to be
specified. In this case, the output is for 4 clusters. We will look at both stage 1 and stage 2 outputs to
understand the interpretation of both stages.

Fig. 2 : Agglomeration Schedule

Stage  Clusters Combined      Coefficient  Stage Cluster 1st Appears  Next Stage
       Cluster 1  Cluster 2                Cluster 1  Cluster 2
1 4 5 14.000000 0 0 5
2 19 20 16.000000 0 0 7
3 6 18 17.000000 0 0 9
4 1 2 17.000000 0 0 8
5 3 4 20.000000 0 1 11
6 13 16 25.000000 0 0 13
7 11 19 28.000000 0 2 11
8 1 14 28.500000 0 0 10
9 6 8 32.500000 0 0 12
10 1 15 34.666668 0 0 14
11 3 11 36.444443 0 7 15
12 6 12 36.666668 0 0 19
13 7 13 39.500000 0 6 17
14 1 9 41.000000 10 0 16
15 3 10 41.666668 11 0 16
16 1 3 46.342857 14 15 18
17 7 17 47.000000 13 0 18
18 1 7 51.791668 16 17 19
19 1 6 58.156250 18 12 0

1. A look at Fig. 2, the agglomeration schedule, can help us to identify large differences in the coefficient (4th
column). The agglomeration schedule from top to bottom (stage 1 to 19) indicates the sequence in which
cases get combined with others (or one cluster combines with another), until all 20 cases are combined
together in one cluster at the last stage (stage 19).

2. Therefore, stage 19 represents a 1 cluster solution, stage 18 represents a 2 cluster solution, stage 17
represents a 3 cluster solution, and so on, going up from the last row to the first row. We have to identify
how many clusters are in the data. We use the difference between rows in a measure called coefficient (also
known as fusion coefficient) in column 4 to identify the number of clusters in the data.

3. We will look at this figure from the last row upwards, because we would like to have the lowest possible
number of clusters, for reasons of economy and ease of interpretation. We see that there is a difference of
(58.15 – 51.79) in the coefficients between the 1 cluster solution (stage 19) and the 2 cluster solution (stage
18). This is a difference of 6.36. The next difference is (51.79 – 47.00), which is equal to 4.79 (between
stage 18, the 2 cluster solution, and stage 17, the 3 cluster solution). The next one after that is (47.00 – 46.34),
only 0.66, between stage 17 and stage 16. After this, there is again a large difference between the 4 cluster
and 5 cluster solutions (stages 16 and 15), of (46.34 – 41.66) or 4.68. Thereafter, the differences are smaller
between subsequent rows of coefficients.

4. A large difference in the coefficient values between any two rows indicates a solution with the number of
clusters which the lower row represents. Ignoring the first difference of 6.36, which would indicate only 1
cluster in the data, we look at the next largest differences. The difference between row 2 from the bottom
and row 3 from the bottom is 4.79, indicating a 2 cluster solution. But the difference between stages 16 and
15 is almost the same (4.68), indicating a 4 cluster solution. At this point, it is the judgement of the
researcher that decides whether to go for a 2 cluster or a 4 cluster solution. Just for illustration, we
will choose the 4 cluster solution.
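
The differences discussed in points 3 and 4 can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses the coefficients of the last five stages of Fig. 2 and follows the text's convention of attributing each jump to the number of clusters represented by the lower row. The function name is illustrative.

```python
# Coefficients for the last five stages of Fig. 2 (stage number -> coefficient).
coefficients = {15: 41.666668, 16: 46.342857, 17: 47.000000,
                18: 51.791668, 19: 58.156250}
n_cases = 20


def coefficient_jumps(coefs, n_cases):
    """Map each cluster count to the jump between its stage and the stage above it."""
    jumps = {}
    for stage in sorted(coefs):
        if stage - 1 in coefs:
            clusters = n_cases - stage  # stage 19 -> 1 cluster, stage 18 -> 2, ...
            jumps[clusters] = round(coefs[stage] - coefs[stage - 1], 2)
    return jumps


jumps = coefficient_jumps(coefficients, n_cases)
for clusters in sorted(jumps):
    print(clusters, jumps[clusters])
```

The jumps for 2 clusters (4.79) and 4 clusters (4.68) stand out against the 0.66 for 3 clusters, which is the pattern the text relies on.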

Now, in stage 2, a K-means clustering is run with a 4 cluster solution requested (as identified from the
hierarchical clustering done above). In the given problem, figs 7 and 8 indicate the outputs of K-means
clustering for a 4 cluster solution. These outputs give us the initial cluster centres, the case listing of cluster
membership (i.e., which case belongs to which of the clusters), the final cluster centres (the solution) and an
ANOVA table.

Fig. 7 : Final Cluster Centers

Cluster VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8
1 2.2000 2.2000 3.8000 3.2000 3.2000 4.4000 2.8000 2.4000
2 3.5000 3.6667 2.6667 3.5000 3.6667 3.3333 4.5000 1.5000
3 1.7500 2.0000 2.2500 3.0000 3.7500 3.2500 1.7500 3.5000
4 3.0000 2.4000 3.6000 2.2000 2.2000 4.2000 1.6000 4.4000

Cluster VAR9 VAR10 VAR11 VAR12 VAR13 VAR14 VAR15
1 3.2000 2.2000 3.8000 2.4000 2.4000 3.2000 4.0000
2 2.5000 3.6667 3.6667 3.5000 4.1667 3.6667 2.5000
3 3.2500 4.7500 2.5000 2.0000 1.2500 2.7500 3.2500
4 2.2000 2.8000 4.4000 4.0000 3.0000 2.4000 2.4000
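
K-means places each case in the cluster whose centre is nearest to it. The sketch below illustrates that assignment step using the centres printed in Fig. 7 and the responses of respondent 8 from Fig. 1. It assumes squared Euclidean distance, and it is not a reproduction of the program's cluster membership listing, which is not shown here.

```python
# Final cluster centres from Fig. 7 (VAR1..VAR15 for clusters 1 to 4).
centres = {
    1: [2.2, 2.2, 3.8, 3.2, 3.2, 4.4, 2.8, 2.4,
        3.2, 2.2, 3.8, 2.4, 2.4, 3.2, 4.0],
    2: [3.5, 3.6667, 2.6667, 3.5, 3.6667, 3.3333, 4.5, 1.5,
        2.5, 3.6667, 3.6667, 3.5, 4.1667, 3.6667, 2.5],
    3: [1.75, 2.0, 2.25, 3.0, 3.75, 3.25, 1.75, 3.5,
        3.25, 4.75, 2.5, 2.0, 1.25, 2.75, 3.25],
    4: [3.0, 2.4, 3.6, 2.2, 2.2, 4.2, 1.6, 4.4,
        2.2, 2.8, 4.4, 4.0, 3.0, 2.4, 2.4],
}

# Responses of respondent 8, copied from Fig. 1.
respondent_8 = [4, 5, 1, 4, 5, 4, 5, 1, 1, 5, 3, 3, 5, 5, 2]


def squared_euclidean(a, b):
    """Sum of squared differences across all variables (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centre(case, centres):
    """Return the label of the centre with the smallest distance to the case."""
    return min(centres, key=lambda label: squared_euclidean(case, centres[label]))


assigned = nearest_centre(respondent_8, centres)
print(assigned)  # 2
```

Respondent 8 strongly disagrees that people are more health conscious and strongly agrees that foreign companies have increased efficiency, which is why the case lies closest to the centre of cluster 2.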

1. The final cluster centres (above) describe the mean value of each variable for each of the 4 clusters. For
example, cluster 1 is described by the mean values of variable 1 = 2.2, variable 2 = 2.2, variable 3 = 3.8,
variable 4 = 3.2 and so on. Similarly, cluster 3 is described by variable 1 = 1.75, variable 2 = 2.0, variable 3
= 2.25, and variable 4 = 3.0, and so on.

2. We now go back to the original variables (in this case the 15 statements in our questionnaire), and interpret
the clusters in terms of the 15 variables. For example, cluster 3 consists of people who prefer e-mail
to writing conventional letters (variable 1 value = 1.75, which is equivalent to “agree” on the scale
of 1 to 5). Similarly, they are also people who tend to think twice before buying anything (variable 3 value
= 2.25), in other words, careful spenders. They also agree (variable 2 value = 2.00) that quality products are
always priced high – that is, they associate a product’s quality with its price.

3. On these same variables, cluster 2 shows people who prefer conventional mail to e-mail (variable 1 value
= 3.5 or close to “disagree”), people who do not necessarily associate high price with good quality (variable
2 value = 3.67), and who tend to be neutral about care in spending (variable 3 value = 2.67). In this way, when
we compare final cluster centre values on each of the 15 variables, for 1 cluster at a time, a complete picture
of the clusters emerges.
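
The readings above amount to matching each cluster mean with the closest point on the 1 to 5 response scale. A minimal sketch of that step, assuming means are rounded half-up to the nearest scale point (the helper name `nearest_label` is illustrative):

```python
# Response scale used in the questionnaire.
SCALE = {1: "Strongly Agree", 2: "Agree", 3: "Neither Agree nor Disagree",
         4: "Disagree", 5: "Strongly Disagree"}


def nearest_label(mean):
    """Round a cluster mean half-up to the nearest scale point and return its label."""
    point = min(5, max(1, int(mean + 0.5)))
    return SCALE[point]


# Final cluster centre values for variables 1 to 3, taken from Fig. 7.
cluster_3 = {"var1": 1.75, "var2": 2.00, "var3": 2.25}
cluster_2 = {"var1": 3.50, "var2": 3.67, "var3": 2.67}

for name, centre in (("cluster 3", cluster_3), ("cluster 2", cluster_2)):
    for variable, mean in centre.items():
        print(name, variable, mean, nearest_label(mean))
```

Rounding is only a reading aid: a mean of 2.67 or 3.5 sits between two scale points, so such values are better described as leaning one way than as a firm position.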

In this case, we will briefly describe each of the 4 clusters as follows:

Cluster 1
E-mail users, feel quality comes at a price, not careful spenders, do not like television much, do not think a
car is a necessity, do not like fast food and ready to use products, are not sure whether people are more
health-conscious today, think foreign companies have increased somewhat the efficiency of Indian
companies, disagree that women are active purchasing decision makers, feel that politicians can play an
active role, do not enjoy watching movies, might consider settling abroad, tend to buy branded products, do
not go out much on weekends and like to pay cash, rather than charging to their credit cards (if they have
one).

It is thus a cluster exhibiting many traditional values, except that they have adapted to email use. They are
also beginning to loosen their purse strings, and are probably in transition in some other factors like
acceptance of women as decision makers and the advent of credit cards.

Cluster 2
Regular mail writers, bargain hunters or aggressive buyers, not too particular about thinking before
spending, not great valuers of TV, believe the car is a luxury, not too fond of fast food and convenience
products, do not think people are very health conscious, feel foreign companies have done us good, think
women are active purchasing decision makers, do not believe in politicians, do not like movies, do not want
to settle abroad, do not insist on branded products, do not go out on weekends, but do prefer credit cards for
payments.

It is a group which likes to use credit, spends more freely, believes in woman power, believes in economics
rather than politics, and feels quality products can be cheap. Also, they seem to have a patriotic streak, as they
do not want to settle abroad.

Cluster 3
E-mail users, quality measured by price, think twice before buying, indifferent to TV, car is a luxury to
them, not too fond of fast food, agree that people are health conscious, do not think foreign companies have
made us efficient, do not believe in woman power, detest politicians, enjoy watching movies, willing to
settle abroad, always buy branded products, go out on weekends, slightly prefer cash to credit cards.

This group is not a free spending one, but is health conscious, more patriarchal, more loyal to branded
products, and outgoing compared to other groups, even willing to go abroad to settle.

Cluster 4
Not too particular about e-mail, measure quality by price, free spending, enjoy watching TV, think a car is
necessary, not fond of fast food, think people are health conscious, do not think foreign companies have
made us efficient, believe in woman power, somewhat positive about politicians, not movie watchers, do not
want to settle abroad, indifferent to branding, moderately outgoing and moderately in favour of credit cards
rather than cash.

This group is optimistic, free spending and a good target for TV advertising, particularly consumer durables
and entertainment. But they are not necessarily influenced by brands. They may want value for money, but if
they see value, they may spend a lot.
