Cluster Analysis

23 Cluster Analysis
Often the marketer needs to categorize objects into groups (or clusters) so that
the objects in each group are similar, and the objects in each group are
substantially different from the objects in the other groups. Here are some examples:
■ When Procter & Gamble test markets a new cosmetic, it may want to group
U.S. cities into groups that are similar on demographic attributes such as
percentage of Asians, percentage of Blacks, percentage of Hispanics, median
age, unemployment rate, and median income level.
■ An MBA chairperson naturally wants to know the segment of the MBA market
in which her program belongs. Therefore, she might want to cluster MBA
programs based on program size, percentage of international students, GMAT
scores, and post-graduation salaries.
■ A marketing analyst at Coca-Cola wants to segment the soft drink market
based on consumer preferences for price sensitivity, preference of diet versus
regular soda, and preference of Coke versus Pepsi.
■ Microsoft might cluster its corporate customers based on the price a given
customer is willing to pay for a product. For example, there might be a cluster
of construction companies that are willing to pay a lot for Microsoft Project
but not so much for PowerPoint.
■ Eli Lilly might cluster doctors based on the number of prescriptions for each
Lilly drug they write annually. Then the sales force could be organized around
these clusters of physicians: a GP cluster, a mental health cluster, and so on.
This chapter uses the first and third examples to show how the Evolutionary
version of the Excel Solver makes it easy to perform a cluster analysis. For example,
in the U.S. city illustration, you can find that every U.S. city is similar to Memphis,
Omaha, Los Angeles, or San Francisco. You can also find, for example, that the cities
in the Memphis cluster are dissimilar to the cities in the other clusters.
each cluster. You assign each city to the “nearest” cluster center. Your target cell
is then to minimize the sum of the squared distances from each city to the closest
cluster anchor.
Standardizing the Attributes

In the example, if you cluster using the attribute levels referred to in Figure
23-1, the percentage of Blacks and Hispanics in each city will drive the clusters
because these values are more spread out than the other demographic attributes.
To remedy this problem you can standardize each demographic attribute by
subtracting the attribute’s mean and dividing by the attribute’s standard
deviation. For example, the average city has 24.34 percent Blacks with a standard
deviation of 18.11 percent. This implies that after standardizing the percentage
of Blacks, Atlanta has 2.35 standard deviations more Blacks (on a percentage
basis) than a typical city. Working with standardized values for each attribute
ensures that your analysis is unit-free and each attribute has the same effect on
your cluster selection. Of course you may give a larger weight to any attribute.
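One way a larger weight could enter the analysis is sketched below. This is an assumption rather than a method stated in the chapter: each squared difference in z-scores is multiplied by a weight before the differences are summed. The function name, the weights, and the z-scores are all hypothetical.

```python
def weighted_squared_distance(a, b, weights):
    """Sum of weight * (a_i - b_i)^2 over the attributes (assumed weighting scheme)."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))

# Hypothetical z-scores for one city and one anchor on three attributes.
city = [1.0, -0.5, 2.0]
anchor = [0.0, 0.5, 0.0]

# Equal weights reproduce the ordinary squared distance: 1 + 1 + 4.
equal = weighted_squared_distance(city, anchor, [1, 1, 1])
# Doubling the weight on the third attribute doubles its contribution: 1 + 1 + 8.
emphasis = weighted_squared_distance(city, anchor, [1, 1, 2])
print(equal, emphasis)
```

With equal weights every attribute has the same influence, which is the default behavior the standardization is meant to give you.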
Choosing Your Clusters

You can use the Solver to identify a given number of clusters. The key in doing so
is to ensure that the cities in each cluster are demographically similar and cities
in different clusters are demographically different. Using few clusters enables the
marketing analyst to reduce the 49 U.S. cities into a few (in this case four) easily
interpreted market segments. To determine the four clusters, as shown in Figure
23-2, begin by computing the mean and standard deviation of each attribute
in C1:G2.
1. Compute the Black mean percentage in C1 with the formula =AVERAGE(C10:C58).
2. In C2 compute the standard deviation of the Black percentages with the
formula =STDEV(C10:C58).
3. Copy these formulas to D1:G2 to compute the mean and standard deviation
for each attribute.
4. In cell I10 (see Figure 23-3) compute the standardized percentage
of Blacks in Albuquerque (often called a z-score) with the formula
=STANDARDIZE(C10,C$1,C$2). This formula is equivalent, of course,
to =(C10-C$1)/C$2. The reader can verify (see Exercise 6) that for each
demographic attribute the z-scores have a mean of 0 and a standard
deviation of 1.
Figure 23-2: Means and standard deviations for U.S. cities
Figure 23-3: Standardized demographic attributes
5. Copy this formula from I10 to N58 to compute z-scores for all cities and
attributes.
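For readers who want to check the standardization outside Excel, the steps above can be sketched in Python. The attribute names and values below are hypothetical, not the book's city data. `statistics.stdev` is used because, like Excel's STDEV, it computes the sample (n-1) standard deviation.

```python
from statistics import mean, stdev

# Hypothetical attribute values for five cities (not the book's data).
# Each list plays the role of one attribute column such as C10:C58.
attributes = {
    "pct_black": [3.0, 67.0, 12.0, 25.0, 43.0],
    "median_age": [32.0, 31.0, 29.0, 36.0, 33.0],
}

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)      # analog of =AVERAGE(range)
    sigma = stdev(values)  # analog of =STDEV(range), the sample version
    return [(x - mu) / sigma for x in values]

z_scores = {name: standardize(col) for name, col in attributes.items()}

for name, zs in z_scores.items():
    # Each standardized column should have mean 0 and standard deviation 1.
    print(name, [round(v, 2) for v in zs], round(mean(zs), 6), round(stdev(zs), 6))
```

The final two values printed on each line are the mean and standard deviation of the z-scores, which is the property the text asks the reader to verify.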
How Solver Finds the Optimal Clusters

To determine n clusters (in this case n = 4) you define a changing cell for each cluster
to be a city that “anchors” the cluster. For example, if Memphis is a cluster anchor,
each city in the Memphis cluster should be similar to Memphis demographically,
and all cities not in the Memphis cluster should be different demographically from
Memphis. You can arbitrarily pick four cluster anchors, and for each city in the
data set, you can determine the squared distance (using z-scores) of each city from
each of the four cluster anchors. Then you assign each city the squared distance to
the closest anchor and have your Solver target cell equal the sum of these squared
distances.
To illustrate how this approach can find optimal clusters, suppose you ask a set
of moviegoers who have seen both Fight Club and Seabiscuit to rate these movies
on a 0–5 scale. The ratings of 40 people for these movies are shown in Figure 23-4
(see file Clustermotivation.xlsx).
Figure 23-4: Movie ratings
Looking at the chart it is clear that the preference of each moviegoer falls into
one of four categories:
■ Group 1: People who dislike Fight Club and Seabiscuit (lower-left corner)
■ Group 2: People who like both movies (upper-right corner)
■ Group 3: People who like Fight Club and dislike Seabiscuit (aka people with
no taste in the lower-right corner)
■ Group 4: People who like Seabiscuit and hate Fight Club (aka smart people
in the upper-left corner)
Suppose you take this data and set up four changing cells, with each changing cell
or anchor allowed to represent the ratings of any person (refer to Figure 23-4). Let each
point’s contribution to the target cell be the squared distance to the closest anchor. Minimizing
the target cell then places one anchor in each group, which ensures each point
is “close” to an anchor. If, for example, Solver considers two anchors from Group 1, one
from Group 3, and one from Group 4, this cannot be optimal because swapping out one
Group 1 anchor for a Group 2 anchor would lessen the target cell contribution from the
10 Group 2 points, while hardly changing the target cell contribution from the Group 1
points. Therefore an optimal solution must have exactly one anchor in each group. You can now implement
this approach for the Cities example.
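The argument above can be checked numerically with a minimal sketch. The ratings below are hypothetical (three people per group, not the 40 ratings in Clustermotivation.xlsx), and the function names are illustrative only.

```python
# Hypothetical (x, y) ratings of the two movies on a 0-5 scale.
groups = {
    1: [(0.5, 0.5), (1.0, 0.5), (0.5, 1.0)],  # dislike both (lower left)
    2: [(4.5, 4.5), (4.0, 4.5), (4.5, 4.0)],  # like both (upper right)
    3: [(4.5, 0.5), (4.0, 1.0), (4.5, 1.0)],  # lower right
    4: [(0.5, 4.5), (1.0, 4.0), (0.5, 4.0)],  # upper left
}
points = [p for members in groups.values() for p in members]

def sq_dist(p, q):
    """Squared distance between two rating pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def target_cell(anchors):
    """Sum over all points of the squared distance to the closest anchor."""
    return sum(min(sq_dist(p, a) for a in anchors) for p in points)

# One anchor per group versus two anchors from Group 1 and none from Group 2.
one_per_group = [groups[g][0] for g in (1, 2, 3, 4)]
two_from_group_1 = [groups[1][0], groups[1][1], groups[3][0], groups[4][0]]

print(target_cell(one_per_group), target_cell(two_from_group_1))
```

In this sketch the Group 2 points have no nearby anchor in the second configuration, so their squared distances dominate the target cell, which is why that configuration cannot be optimal.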
Setting Up the Solver Model for Cluster Analysis

For the Solver to determine four suitable anchors you must pick a trial set of anchors
and figure out the squared distance of each city from the closest anchor. Then Solver
can pick the set of four anchors that minimizes the sum of the squared distances
of each city from its closest anchor.
To begin, set up a way to “look up” the z-scores for candidate cluster centers:
1. In H5:H8 enter “trial values” for cluster anchors. Each of these values can be
any integer between 1 and 49. For simplicity you can let the four trial anchors
be cities 1–4.
2. Name A9:N58 as the range Lookup. Then, in G5, look up the name of the first
cluster anchor with the formula =VLOOKUP(H5,Lookup,2).
3. Copy this formula to G6:G8 to identify the name of each cluster center
candidate.
4. In I5:N8 identify the z-scores for each cluster anchor candidate by copying
from I5 to I5:N8 the formula =VLOOKUP($H5,Lookup,I$3).
Figure 23-5: Look up z-scores for cluster anchors
You can now compute the squared distance from each city to each cluster
candidate (see Figure 23-6).
1. To compute the squared distance from city 1 (Albuquerque) to cluster candidate
anchor 1, enter in O10 the formula =SUMXMY2($I$5:$N$5,$I10:$N10). This
cool Excel function computes the following:
(I5-I10)^2+(J5-J10)^2+(K5-K10)^2+(L5-L10)^2+(M5-M10)^2+(N5-N10)^2
2. To compute the squared distance of Albuquerque from the second cluster
anchor, copy the formula from O10 to P10 and change each 5 to a 6. Similarly,
in Q10 change each 5 to a 7. Finally, in R10 change each 5 to an 8.
3. Copy from O10:R10 to O11:R58 to compute the squared distance of each city
from each cluster anchor.
Figure 23-6: Computing squared distances from cluster anchors
4. In S10:S58 compute the squared distance from each city to the “closest” cluster
anchor by entering the formula =MIN(O10:R10) in cell S10 and copying it to the cell
range S11:S58.
5. In S8 compute the sum of squared distances of all cities from their cluster
anchor with the formula =SUM(S10:S58).
6. In T10:T58 compute the cluster to which each city is assigned by entering
in T10 the formula =MATCH(S10,O10:R10,0) and copying this formula to
T11:T58. This formula identifies which element in columns O:R gives the
smallest squared distance to the city.
7. Use the Solver window, as shown in Figure 23-7, to find the optimal cluster
anchors for the four clusters.
Figure 23-7: Solver window for cluster anchors
NOTE Cell S8 (sum of squared distances) is minimized in the example. The
cluster anchors (H5:H8) are the changing cells. They must be integers between
1 and 49.
8. Choose the Evolutionary Solver. Select Options from the Solver window, navigate
to the Evolutionary tab, and increase the Mutation rate to 0.5. This setting
of the Mutation rate usually improves the performance of the Evolutionary
Solver.
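The spreadsheet model built in the steps above can be mirrored in a short Python sketch. Two things here are assumptions, not the book's method: the z-scores are hypothetical (six cities, two attributes), and an exhaustive search over all anchor sets stands in for the Evolutionary Solver. The search differs from Solver in one respect: it never tries the same city twice as an anchor. Exhaustive search is practical at this scale, since even choosing 4 anchors from 49 cities gives only 211,876 combinations.

```python
from itertools import combinations

# Hypothetical z-scores (two attributes) for six cities; not the book's data.
z = {
    1: (-1.0, -0.9), 2: (-0.8, -1.1), 3: (1.0, 1.1),
    4: (0.9, 0.8), 5: (-1.0, 1.0), 6: (-0.9, 1.2),
}

def sumxmy2(a, b):
    """Analog of Excel's SUMXMY2: sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def evaluate(anchors):
    """Return (target cell, cluster assignment) for a tuple of anchor city numbers."""
    total, assignment = 0.0, {}
    for city, scores in z.items():
        dists = [sumxmy2(z[a], scores) for a in anchors]  # like columns O:R
        closest = min(dists)                              # like column S (MIN)
        assignment[city] = dists.index(closest) + 1       # like column T (MATCH)
        total += closest                                  # accumulates cell S8
    return total, assignment

# Assumption: try every anchor set instead of running the Evolutionary Solver.
n_clusters = 3
best = min(combinations(z, n_clusters), key=lambda a: evaluate(a)[0])
best_total, best_assignment = evaluate(best)
print(best, round(best_total, 3), best_assignment)
```

As in the worksheet, the assignment records the position of the closest anchor (1, 2, 3, ...) rather than the anchor's city number.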
The Evolutionary Solver finds that the cluster anchors are Los Angeles, Omaha,
Memphis, and San Francisco. Figure 23-8 shows the members of each cluster.
Interpretation of Clusters
The z-scores of the anchors represent a typical member of a cluster. Therefore,
examining the z-scores for each anchor enables you to easily interpret your clusters.
Figure 23-8: Assignment of cities to clusters
You can find that the San Francisco cluster consists of rich, older, and highly Asian
cities. The Memphis cluster consists of highly Black cities with high unemployment
rates. The Omaha cluster consists of average income cities with few minorities. The
Los Angeles cluster consists of highly Hispanic cities with high unemployment rates.
From your clustering of U.S. cities a company like Procter & Gamble that often
engages in test marketing of a new product could now predict with confidence
that if a new product were successfully marketed in the San Francisco, Memphis,
Los Angeles, and Omaha areas, the product would succeed in all 49 cities. This
