
Chapter 18 CLUSTER ANALYSIS AND FACTOR ANALYSIS

Subhash Sharma, University of South Carolina
Ajith Kumar, Arizona State University

Introduction
Consider the following situations: The marketing manager of a large financial institution is interested in developing portfolios of product offerings that would appeal to various market segments. The manager of a large telecommunication company is interested in developing and managing a portfolio of customers that would generate substantial profits over their lifetime. The store manager is interested in identifying the optimal mix of products demanded by its customers. In each of the above scenarios, the manager's objective is to group stimuli (e.g., product offerings, customers, and mutual funds) into groups such that stimuli in each group are similar to each other and different from stimuli in other groups. Cluster analysis is one of the techniques in the manager's toolkit that can be used to achieve this objective. The purpose of this chapter is to present a non-technical discussion of cluster analysis.

What Is Cluster Analysis and Its Objectives?

Suppose a major soft drink company is interested in introducing a new soft drink and has conducted numerous marketing research studies. In one study, consumers were asked to taste the soft drink and respond to a number of questions. In order to use geometry for illustrating the concepts of cluster analysis, assume that only the following two questions were asked: the degree to which consumers liked the taste of the new soft drink on a 0-100 scale, and purchase likelihood on an eleven-point scale. Hypothetical data for eleven respondents are presented in Table 18.1 and Figure 18.1 (data in the last three columns will be discussed later in the chapter). In Figure 18.1, the horizontal axis represents likelihood of purchase, the vertical axis represents taste reaction, and each consumer is represented as a point in the two-dimensional space defined by the two axes. Examination of Table 18.1 and Figure 18.1 reveals that the consumers can be divided into four groups or clusters: (1) high taste reaction, low purchase likelihood; (2) high taste reaction, high purchase likelihood; (3) low taste reaction, low purchase likelihood; and (4) low taste reaction, high purchase likelihood. The first three groups are logically consistent. However, the fourth group or cluster appears to be counter-intuitive and could very well consist of consumers providing logically inconsistent or bad responses. On the other hand, one could argue that the responses of this group are valid and logically consistent, as these consumers might be making purchases for others. The marketing manager would be very much interested in the second group, as they like the taste of the new soft drink and have high purchase likelihood; however, she might want to determine why consumers in the first group are not interested in purchasing the soft drink despite a positive taste reaction. [Insert Table 18.1 and Figure 18.1] It is obvious that in the above example, the objective of cluster analysis is to group consumers into groups such that each group is as homogeneous as possible with respect to

consumers' characteristics, which are taste reaction and purchase likelihood.1 To form the four clusters discussed above, we visually examined the two-dimensional space defined by taste reaction and purchase likelihood, and determined which consumers are in close proximity to each other with respect to these two characteristics. This was most probably done by employing some measure of similarity to assess the proximity of consumers, the most likely one being the straight-line distance between pairs of consumers. In real-life applications, the number of consumers could be in the thousands and the number of variables or characteristics could be in the hundreds. Visual clustering is not feasible when there is a large number of consumers and/or a large number of variables; in order to obtain clusters we have to resort to analytical methods. Hierarchical and non-hierarchical clustering are two of the most popular methods. Each is discussed below using hypothetical data.

Hierarchical Clustering
Assume that squared Euclidean distance (the square of the straight-line distance between two points) will be used to measure the similarity between any two consumers in a given dimensional space defined by the variables of interest. The squared Euclidean distance (henceforth referred to as squared distance) between any two points in a p-dimensional space is given by:

$$D_{ij}^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2$$

where $x_{ik}$ is the coordinate of the ith consumer for the kth variable and $x_{jk}$ is the coordinate of the jth consumer for the kth variable. Using the above formula, the squared distance between C1 and C2 would be

$(80-85)^2 + (9-8)^2 = 26$. Squared distances between all pairs of consumers are given in Table 18.2. The matrix of squared distances given in Table 18.2 is known as the dissimilarity matrix, as small values indicate similarity and large values indicate dissimilarity. Note that in Table 18.2, there appear to be four groups of consumers such that distances among consumers within each group are small compared to distances among consumers across groups. This suggests that there might be four clusters or groups present in the data. One could also argue that there are two clusters, as squared distances among consumers within each cluster (cluster 1 consisting of C1, C2, C3, C4, C5, and C6, and cluster 2 consisting of C7, C8, C9, C10, and C11) are small compared to squared distances among consumers across the two groups. In such cases one has to use judgment as to whether a two- or a four-cluster solution is meaningful, based on such criteria as cluster size (i.e., size of the market segment), cluster interpretability, the ability to develop a marketing strategy for each cluster, and potential revenue and profits from each cluster. [Insert Table 18.2] From Table 18.2 it is clear that C7 and C9 are the most similar, with a dissimilarity measure of 16. Therefore, it is reasonable to form a cluster consisting of these two consumers. The total number of clusters resulting from the merger of these two consumers is 10, with one cluster consisting of consumers C7 and C9 and the other nine clusters each consisting of a single consumer. Table 18.3 gives the ten clusters, the labels assigned to them, and the membership of each cluster. But what are the taste reaction and purchase likelihood for the cluster formed by consumers C7 and C9? In other words, how should the distance between two clusters be determined when a cluster consists of more than one consumer? Different procedures or algorithms for computing the distance between two clusters have been suggested. The various

hierarchical clustering methods (except Ward's method) differ with respect to how these distances are computed. The most popular hierarchical methods are: (1) Centroid; (2) Single linkage or Nearest neighbor; (3) Complete linkage or Farthest neighbor; (4) Average linkage; and (5) Ward's. Each of these is discussed below. [Insert Table 18.3]
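The dissimilarity matrix itself is straightforward to compute. The following is a minimal sketch in Python using SciPy (not the SAS/SPSS procedures referenced later in the chapter); only the data values quoted in the text (C1, C2, C7, C9, C10) are used, so the array should be read as illustrative rather than as a reproduction of Table 18.1.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Taste reaction (0-100) and purchase likelihood (1-11) for a few of the
# hypothetical consumers; only the values quoted in the text are shown.
X = np.array([
    [80, 9],   # C1
    [85, 8],   # C2
    [10, 7],   # C7
    [10, 3],   # C9
    [20, 2],   # C10
])

# Squared Euclidean distances between all pairs of consumers (the dissimilarity matrix).
D = squareform(pdist(X, metric="sqeuclidean"))
print(D)   # e.g., D[0, 1] == 26 for the C1-C2 pair, as computed above
```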

Centroid Method
In the centroid method, the concept of an average consumer is used to represent a cluster. The characteristics of a cluster are the centroid, or average values, of the characteristics of the consumers comprising the cluster. The cluster is then assumed to consist of one average consumer whose characteristics are the centroid. For example, the characteristics of the cluster formed by consumers C7 and C9 are assumed to be those of an average consumer whose taste reaction is 10 [i.e., (10+10)/2] and purchase likelihood is 5 [i.e., (7+3)/2]. Denote this average consumer as CLUS10, the label reflecting the resulting total number of clusters (which is 10).2 Figure 18.2 presents a graphical depiction of the distance between two clusters for the centroid method. Table 18.3 gives data for the ten clusters or consumers. [Insert Figure 18.2] A new dissimilarity matrix among the ten consumers is created and the two most similar consumers are joined to form a nine-cluster solution. We do not report the dissimilarities among the ten consumers, but C2 and C6 are the most similar, with a squared distance of 25, and should be merged to form another cluster, which is named CLUS9. Once again a new dissimilarity matrix among the nine consumers is computed and the two closest consumers are merged to form the next cluster. This process is continued until only one cluster, consisting of all the observations, is left. As the

name implies, for n consumers, hierarchical clustering proceeds by merging two consumers to get n-1 clusters, then the two most similar consumers or clusters are merged to get n-2 clusters, and so on until only one cluster is left. Table 18.4 gives the clustering history or summary. For example, the cluster formed at Step 7 is obtained by merging clusters CLUS7 (i.e., the cluster formed at Step 4) and CLUS5 (i.e., the cluster formed at Step 6), and comprises consumers C1, C2, C3, C4, C5 and C6. [Insert Table 18.4] The clustering history can also be represented by a dendrogram or tree, which is a graphical depiction of the clustering history. We do not show the dendrogram because, for a large number of observations, it is quite messy and difficult to interpret. The obvious question is: how many clusters should be formed? A number of different measures have been proposed to determine the optimum number of clusters. These will be discussed after we have presented the remaining hierarchical clustering methods.
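In practice the centroid method would not be carried out by hand. The sketch below shows one way it might be run in Python with SciPy, assuming the full Table 18.1 data are available in an 11 x 2 array named `X` (an assumption, since the table is not reproduced here). Note that SciPy reports unsquared Euclidean centroid distances in its merge heights, so the numbers differ from the squared distances in the text even though the merge sequence should be the same.

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Clustering history for the centroid method (an analogue of Table 18.4).
# Each row records the two clusters merged, the centroid distance at which
# they were merged, and the size of the newly formed cluster.
Z = linkage(X, method="centroid", metric="euclidean")
print(Z)

# Cut the tree into, say, two clusters and inspect cluster membership.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```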

Single Linkage or the Nearest Neighbor Method


The first step is the same as in the centroid method: C7 and C9 are merged to form the first cluster. In order to determine the similarity or distance between two clusters, the distances between all possible pairs formed by the consumers in the two clusters are obtained. For example, the distance between CLUS10, consisting of consumers C7 and C9, and the cluster consisting of the single consumer C1 will be the minimum of the distances between all possible pairs formed by the consumers in the two clusters (i.e., C1 and C7, and C1 and C9). From Table 18.2, these distances are, respectively, 4904 and 4936. Therefore, the similarity between CLUS10 and C1 is 4904. Figure 18.2 graphically depicts the minimum distance between two clusters.

Complete Linkage or Farthest Neighbor Method


In the complete linkage method the procedure is the same as for single linkage except that, as depicted in Figure 18.2, the similarity between two clusters is given by the maximum distance between all possible pairs formed by the consumers in the two clusters. For example, the distance between CLUS10 and C1 would be given by 4936 and not 4904.

Average Linkage Method


In the average linkage method, as the name implies, the similarity between two clusters is the average of the distances between all possible pairs formed by the consumers in the two clusters. For example, the distance between CLUS10 and C1 is given by 4920, which is the average of 4904 and 4936, the distances for the two possible pairs.
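In SciPy these three definitions of between-cluster distance correspond to the `method` argument of `linkage`; the sketch below assumes the same `X` data matrix as before. SciPy's average linkage averages over all between-cluster pairs, consistent with the definition given above.

```python
from scipy.cluster.hierarchy import linkage

# Same data, different definitions of the distance between two clusters.
Z_single   = linkage(X, method="single")     # nearest neighbor (minimum pairwise distance)
Z_complete = linkage(X, method="complete")   # farthest neighbor (maximum pairwise distance)
Z_average  = linkage(X, method="average")    # mean over all between-cluster pairs
```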

Wards Method
Ward's method uses complete enumeration and maximizes within-cluster homogeneity. At each step, all possible cluster solutions of a given size are examined. For a ten-cluster solution there will be a total of 55 ten-cluster solutions.3 The homogeneity or similarity of each cluster is measured by the within-cluster sum of squares, which is also known as the error sum of squares (ESS). ESS measures the extent to which each consumer in a given cluster differs from the centroid of the respective cluster, i.e., it is the sum of the squared distances between each consumer in the cluster and the centroid. For example, consider a ten-cluster solution in which one cluster consists of consumers C7 and C9 and the other nine clusters consist of a single consumer. Using the data presented in Table 18.1, the averages (i.e., the centroid) for taste reaction and likelihood of purchase are, respectively, 10 and 5. The ESS for consumer C7 is equal to 4 [i.e., $(10-10)^2 + (7-5)^2$] and for C9 it is also equal to 4 [i.e., $(10-10)^2 + (3-5)^2$]. The ESS for the cluster comprised of consumers C7 and C9 would be 8 (i.e., 4+4). The ESS for clusters with only a single

consumer would obviously be equal to zero. The ESS for the entire ten-cluster solution is the sum of the ESS of each cluster. The best solution is the one that has the smallest ESS for the entire solution, i.e., the one for which the total of the squared distances between consumers and their assigned cluster centroids is minimized. Although we do not give the ESS for all 55 ten-cluster solutions, the solution in which C7 and C9 form one cluster and the remaining nine clusters each consist of a single consumer is the best, with an ESS of 8. The next step in Ward's method is to evaluate all possible nine-cluster solutions and select the one which has the smallest ESS. There will be forty-five such solutions. The best nine-cluster solution is the one in which C7 and C9 form one cluster, C2 and C6 form a second cluster, and the remaining seven clusters consist of a single consumer. The procedure is repeated for the remaining steps.
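The ESS bookkeeping that drives Ward's method is easy to express in code. Below is a minimal sketch (the helper name `cluster_ess` is ours, not from the chapter) that reproduces the ESS of 8 for the cluster formed by C7 and C9.

```python
import numpy as np

def cluster_ess(points):
    """Error sum of squares: total squared distance of members from their cluster centroid."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

# Cluster formed by C7 (10, 7) and C9 (10, 3): centroid is (10, 5), so ESS = 4 + 4 = 8.
print(cluster_ess([[10, 7], [10, 3]]))   # -> 8.0

# The ESS of a full candidate partition is the sum of cluster_ess over its clusters;
# at each step, Ward's method chooses the merge that keeps this total smallest.
```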

Which Hierarchical Method Is the Best?


As discussed above, hierarchical clustering algorithms (except Ward's method) differ with respect to how the distance between clusters is computed. The performance of each method depends on the data and its orientation in a given dimensional space (which we do not know), and on data contamination due to the presence of outliers. A number of simulation studies have been conducted to assess the performance of the various methods. Following is a summary of the findings. Hierarchical clustering methods are sometimes susceptible to chaining effects, a phenomenon in which one cluster grows as most of the other observations join it in successive steps, eventually forming one large cluster. Single linkage is particularly susceptible to chaining, more so than the other methods. However, depending upon the existence of natural

clusters in the data, chaining may be desirable. For example, chaining is desirable for identifying the two clusters depicted in Panel I of Figure 18.3, and single linkage would be the most appropriate method for identifying such clusters. However, in marketing applications the clusters depicted in Panel I of Figure 18.3 are highly unlikely, and therefore chaining is not a desirable property as it invariably results in a single cluster or segment. [Insert Figure 18.3] Panel II of Figure 18.3 shows outliers in the data. Single linkage will tend to merge the two clusters because of the presence of the outliers; complete linkage, on the other hand, will not. The complete linkage method will identify compact, homogeneous clusters. Ward's method identifies clusters of nearly equal shape and size. It is clear from the above summary that it is difficult to determine the best hierarchical clustering method. It all depends on the orientation of the data in a given dimensional space and the presence of outliers. Therefore, it is suggested that one use all the methods and compare the resulting solutions for interpretability of the clusters.

How Many Clusters?


Hierarchical clustering methods report the number and composition of n-1 clusters, n-2 clusters, n-3 clusters, and so on. What, then, is the optimum number of clusters? In general, a good cluster solution is one in which each cluster is very different from the other clusters (between-cluster heterogeneity) and consumers in each cluster are as similar as possible (within-cluster homogeneity). Various measures have been proposed to assess the homogeneity and/or heterogeneity of a clustering solution. These measures are provided by most of the clustering

procedures in popular software packages such as SAS and SPSS. We provide only a brief non-technical discussion of each of the heuristics. For a detailed discussion the interested reader is referred to Sharma (1996) and Aldenderfer and Blashfield (1984).

Root-Mean-Square Standard Deviation


The root-mean-square standard deviation (RMSSTD) measures the homogeneity of the cluster formed at any given step. It essentially measures the compactness of a cluster: clusters in which consumers are very close to the centroid are compact clusters. The smaller the RMSSTD, the more homogeneous or compact is the cluster formed at a given step. A large value of RMSSTD suggests that the cluster obtained at a given step is not homogeneous, and was probably formed by merging two very heterogeneous clusters. In Panel I of Figure 18.4, the first cluster would have a low RMSSTD whereas the second cluster would have a high RMSSTD. Notice that the cluster with a low RMSSTD is relatively more homogeneous than the cluster with a high RMSSTD. In general, a cluster solution with a low RMSSTD is preferred as it implies that the resulting clusters are homogeneous. [Insert Figure 18.4]

Semi-Partial R-Squared
The semi-partial R-squared (SPR) measures the loss of homogeneity due to merging two clusters to form a new cluster at a given step. If the value is small, it suggests that the cluster obtained at a given step was formed by merging two very similar clusters. On the other hand, a large value of SPR suggests that two heterogeneous clusters have been merged to form the new cluster. In Panel II of Figure 18.4, the first set of two clusters, if merged, would have a low SPR as opposed to merging the second set of two clusters. In general, a cluster


solution with a low SPR is preferred, as a high value of SPR implies that two heterogeneous clusters are being merged.

Centroid Distance
The centroid distance (CD) measures the heterogeneity of the clusters merged at any given step to form a new cluster, and is given by the distance between the centroids of the clusters merged at that step. If two similar clusters are merged the value will be small, and if two very heterogeneous clusters are merged the value will be large. In Panel II of Figure 18.4, the first set of two clusters, if merged, would have a smaller CD than the merging of the second set of two clusters. A cluster solution having a low CD is preferred, as a high value of CD also implies that two heterogeneous clusters are being merged.

R-Square
R-Square (RS) measures the heterogeneity of the cluster solution formed at a given step. A large value indicates that the clusters obtained at a given step are quite different (i.e., heterogeneous) from each other, whereas a small value signifies that the clusters formed at a given step are not very different from each other. In Panel III of Figure 18.4, the first four-cluster solution will have a lower RS than the second four-cluster solution. It is obvious from Panel III of Figure 18.4 that the clusters with a low RS are quite close to each other (i.e., less heterogeneous) compared to the clusters with a high RS. Consequently, one would prefer a cluster solution with a high RS. An obvious question is: which of the above measures should one use? It is suggested that all the measures be used, as they relate to different properties of the clusters. RMSSTD, SPR, and CD represent homogeneity of the cluster solution and RS represents heterogeneity of the


cluster solution. Confidence in a given cluster solution is obviously high when all of these measures suggest the same number of clusters. On the other hand, if there is no consensus among the measures regarding the number of clusters, then it is prudent to examine all the suggested solutions and determine how many clusters are appropriate using other criteria such as interpretability and usefulness of the cluster solutions. Figure 18.5 gives the above measures for hierarchical clustering using the centroid method. There are no absolute cut-off values for determining how high is high, or how low is low. Rather, one should look for a big jump or change in the measures across the number of clusters. As is evident from Figure 18.5, for all the measures the big jump or change occurs in going from a two-cluster to a one-cluster solution, suggesting that two clusters might be the optimum number. From Table 18.4, one can determine that one cluster consists of consumers C1, C2, C3, C4, C5 and C6, whereas the other cluster consists of consumers C7, C8, C9, C10, and C11. The average taste reaction and purchase likelihood for the first cluster are, respectively, 85 and 5.333, and for the second cluster, 17 and 4.400. The average values suggest that the first cluster has a high taste reaction and a purchase likelihood of slightly greater than 5, and the second cluster has a low taste reaction and a purchase likelihood of slightly less than 5. That is, the clusters differ mostly with respect to taste reaction and only slightly on purchase likelihood. Is it possible that there are more than two clusters? If so, should we examine a three- or a four-cluster solution? In other words, interpretability and usability of the identified clusters is also an important criterion for determining the optimum number of clusters. [Insert Figure 18.5]
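For readers who want to reproduce these diagnostics outside SAS or SPSS, the sketch below gives one rough way RS and RMSSTD could be computed from a vector of cluster labels. The function names are ours, and the exact degrees-of-freedom conventions used by commercial packages may differ slightly.

```python
import numpy as np

def r_square(X, labels):
    """RS: proportion of total variation accounted for by between-cluster differences."""
    X = np.asarray(X, dtype=float)
    ss_total = ((X - X.mean(axis=0)) ** 2).sum()
    ss_within = sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
                    for g in np.unique(labels))
    return 1.0 - ss_within / ss_total

def rmsstd(cluster_points):
    """RMSSTD of a single cluster: pooled standard deviation across all variables."""
    cluster_points = np.asarray(cluster_points, dtype=float)
    n, p = cluster_points.shape
    ss = ((cluster_points - cluster_points.mean(axis=0)) ** 2).sum()
    return float(np.sqrt(ss / (p * (n - 1))))
```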


Note that in Figure 18.1 we had visually identified four clusters. Why did the clustering algorithm not identify the four clusters? The answer, which probably was evident earlier, lies in the scales used to measure taste reaction and purchase likelihood and the effect they have on computing the distances.

Effect of Measurement Scale on Cluster Solution


Taste reaction is measured on a 0-100 scale and therefore has a much larger range than the 1-11 scale used for purchase likelihood. The range of a scale acts as an implicit weight in computing the distances, such that the weight is directly proportional to the range of the scale. For example, consider consumers C1 and C10. The squared distance between these two consumers would be

$$D_{1,10}^2 = (80-20)^2 + (9-2)^2 = 3600 + 49 = 3649.$$

It is evident from the above computation that, out of a total squared distance of 3649, 3600 (almost 99%) is due to taste reaction. Why should taste reaction have such a large influence on the squared distance? If there is no reason for this, then we should rescale the data to equate the scales. One way this could be done is to convert the taste reaction scale to an eleven-point scale by dividing it by 10 and adding one. Column 4 of Table 18.1 gives the rescaled values for taste reaction. The resulting squared distance would be:

$$D_{1,10}^2 = (9-3)^2 + (9-2)^2 = 36 + 49 = 85.$$

Notice that now the contribution of purchase likelihood is greater than the contribution of taste reaction.


One could also standardize the data to obtain z-scores and use these to compute the distances. Standardization equates the scales with respect to standard deviation (or variance), so that the standard deviation (variance) of each characteristic is 1.0. This is done when there is no rationale for why the variation in a given characteristic should affect the formation of the Euclidean distances, and consequently the clustering of the data. However, one might want a variable or characteristic to have more influence in the formation of the clusters, as greater variation implies that consumers differ considerably with respect to that characteristic and hence can be segmented based on it. Typically, clustering is done on both standardized and non-standardized data. The resulting solutions are compared and, in cases where different solutions are obtained, the one which makes the most sense is used. The last two columns of Table 18.1 give the standardized values for taste reaction and likelihood of purchase. The squared distance between C1 and C10 would be

$$D_{1,10}^2 = (.719-(-.946))^2 + (1.186-(-.844))^2 = 2.772 + 4.121 = 6.893.$$

Notice that the contribution of purchase likelihood is again higher than the contribution of taste reaction. That is, each adjustment of the data could result in different squared distances, which in turn could affect the clustering solution. Recall that in Figure 18.1 the horizontal axis represented likelihood of purchase and the vertical axis represented taste reaction. Geometrically, rescaling of variables is tantamount to stretching and compressing the axes. The orientation of the points will change if the axes are differentially stretched or compressed. Taste reaction was rescaled to range between 1 and 11 by dividing it by 10 and adding 1. This algebraic transformation results in compressing the vertical


axis of Figure 18.1 by a factor of 10, resulting in Figure 18.6, which presents a plot of the data for purchase likelihood and rescaled taste reaction. The resulting effect is that the orientation of the consumers in the two-dimensional space has changed, and the four clusters are now further apart and more distinct. The important point to keep in mind is that changing the scale of the variables can have an impact on the results. [Insert Figure 18.6] In order to further illustrate the effect of scaling, the centroid method was used to form clusters using the rescaled data. Figure 18.7 presents the plot of the measures used to determine the number of clusters. There is a big change in the values of all the measures from four to three clusters, suggesting a four-cluster solution. Table 18.5 gives the membership of each cluster and the cluster centroids. The first cluster has high taste reaction and low purchase likelihood; the second cluster has high taste reaction and high purchase likelihood; the third cluster has low taste reaction and low purchase likelihood; and the fourth cluster has low taste reaction and high purchase likelihood. Notice that the number of clusters and their interpretation is now similar to that identified visually in Figure 18.6 and as hypothesized at the beginning of the chapter. [Insert Table 18.5 and Figure 18.7]
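The effect of rescaling on the distance computations can be seen directly in a few lines of code. This is a minimal sketch for the C1-C10 pair only; the z-score values quoted in the text (.719, -.946, etc.) come from standardizing the full Table 18.1 columns, which are not reproduced here.

```python
import numpy as np

taste = np.array([80.0, 20.0])       # C1 and C10, taste reaction on the 0-100 scale
likelihood = np.array([9.0, 2.0])    # purchase likelihood on the 1-11 scale

# Raw scales: taste reaction dominates the squared distance (3600 of 3649).
d_raw = (taste[0] - taste[1]) ** 2 + (likelihood[0] - likelihood[1]) ** 2

# Rescale taste to 1-11 (divide by 10, add 1): the two contributions become comparable.
taste_11 = taste / 10 + 1
d_rescaled = (taste_11[0] - taste_11[1]) ** 2 + (likelihood[0] - likelihood[1]) ** 2

print(d_raw, d_rescaled)   # 3649.0 85.0

# Standardization would instead replace each full column by its z-scores,
# e.g. scipy.stats.zscore(X, ddof=1), before the distances are formed.
```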

Non-Hierarchical Clustering
The second most popular approach for forming clusters is the non-hierarchical clustering technique. For non-hierarchical clustering, the number of clusters has to be fixed in order to obtain a solution for that number of clusters. Figure 18.8 presents one recommended process for conducting non-hierarchical clustering. We will discuss the procedure by applying it to the rescaled data.


Step 1: The first step consists of specifying the number of clusters. Assume that, based on managerial discussions, four clusters are desired.

Step 2: The procedure starts with selecting centroids (seeds) for the k clusters. The seeds can be selected randomly, selected using a heuristic that places each seed as far as possible from the other seeds, or supplied by the user (e.g., from the four-cluster hierarchical solution). Using hierarchical solutions as starting points or seeds is a common practice. Suppose the seeds selected for the hypothetical data are C1, C2, C3 and C4; these are referred to as the initial seeds. Table 18.6 gives the initial seeds for the rescaled data. The various types of non-hierarchical clustering algorithms differ with respect to how the seeds are selected. Once the initial seeds have been identified, each consumer is assigned to one of the clusters by computing the distance of the consumer from the centroid of each cluster and assigning it to the cluster to which it is closest. Sometimes the centroids of the clusters are changed dynamically; that is, the cluster centroids are updated as consumers are allocated to the cluster. Table 18.6 also gives the squared distance of each consumer from the centroids and the initial assignment. [Insert Table 18.6 and Figure 18.8]

Step 3: After all the consumers have been assigned, the next step is to recompute the centroids. Table 18.6 also gives the centroids of the clusters after the initial assignment.

Step 4: Consumers are reassigned to clusters by computing the distance of each consumer from the cluster centroids and assigning it to the cluster to which it is closest. The cluster centroids are then recomputed. Table 18.6 also gives the reassignment at Iteration 1 and the cluster seeds after the reassignment.


Step 5: The most widely used criterion for determining whether the optimum assignment has been achieved is to specify an acceptable minimum change in the centroids. The change in a cluster centroid is measured by the squared distance between the old and new centroids. For example, as noted in Table 18.6, the change in the centroid of cluster 1 after Iteration 1 is $(4.5-2.25)^2 + (8.333-8.000)^2 = 5.173$. If the change is less than the specified minimum, then it is assumed that convergence has been achieved and the cluster solution is optimum. If not, another iteration of reassignment is done. A second criterion that is used is to specify the maximum number of iterations. The reassignment continues until there is no change in the cluster centroids or the specified number of reassignments has been reached. As can be seen, there is a change in the cluster centroids between the initial assignment and the first reassignment. Since this is greater than our criterion of zero change (which we will assume is our criterion), another reassignment (i.e., Iteration 2) is made, after which there is no change in the cluster centroids. It is important to note that no change in cluster centroids implies that there is no change in cluster membership.

It should be noted that the resulting four-cluster solution is not consistent with what was expected or with that obtained from the hierarchical method. The problem is that the selection of the initial seeds affects the particular solution obtained; other initial seeds might produce different cluster solutions. The first four consumers were selected as the seeds intentionally, to illustrate the effect of the selection of seeds on the clustering solution. A better choice would be to use some heuristic such that the selected seeds are as far apart as possible. Most non-hierarchical clustering procedures in the statistical packages provide the option of using heuristics to ensure that the selected seeds are as far apart as possible. When such an algorithm

was used in our example the resulting solution was the same as identified by the hierarchical method.
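A rough equivalent of this illustration in Python uses scikit-learn's KMeans, which differs in minor details from the procedure described above (it reassigns all consumers in each pass rather than updating centroids dynamically). `X_rescaled` is assumed to hold the rescaled Table 18.1 data.

```python
from sklearn.cluster import KMeans

# Deliberately poor seeds: the first four consumers, as in the illustration above.
seeds = X_rescaled[:4]
km_fixed = KMeans(n_clusters=4, init=seeds, n_init=1).fit(X_rescaled)
print(km_fixed.labels_, km_fixed.cluster_centers_)

# Letting the algorithm choose well-separated seeds (the k-means++ heuristic),
# comparable in spirit to the far-apart-seeds options the chapter describes.
km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X_rescaled)
print(km_pp.labels_)
```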

Which Clustering Method Is the Best?


A question facing users is the choice between hierarchical and non-hierarchical clustering methods. It is obvious that with the hierarchical methods no a priori knowledge of the number of clusters is needed. For large data sets, hierarchical methods require extensive computational resources as large dissimilarity matrices have to be computed and stored. However, this is not a deterrent to their use, as one can draw a random sample and subject it to hierarchical clustering. A commonly used procedure is to use a hierarchical method in conjunction with a non-hierarchical clustering method. For example, a hierarchical method could be used initially to determine a number of cluster solutions (say from 1 to 5). The number of cluster solutions examined will depend on prior knowledge of the market and consumers, and on the interpretability of the cluster solutions. Cluster centroids are obtained for each of the cluster solutions. These centroids are then used as initial seeds in the non-hierarchical clustering method to refine each of the solutions. The final cluster solution is the one that makes the most sense to managers and can be used to develop appropriate marketing strategies.
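A sketch of this two-stage practice is shown below: cut a hierarchical tree at the desired number of clusters, compute the cluster centroids, and use them as seeds for the non-hierarchical refinement. The helper name is ours; Ward's method is used here only as an example of the first-stage algorithm.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def refine_with_kmeans(X, n_clusters):
    """Seed a non-hierarchical (k-means) solution with hierarchical cluster centroids."""
    Z = linkage(X, method="ward")                          # stage 1: hierarchical clustering
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    seeds = np.vstack([X[labels == g].mean(axis=0) for g in np.unique(labels)])
    return KMeans(n_clusters=n_clusters, init=seeds, n_init=1).fit(X)   # stage 2: refinement

# Compare, say, the 2- through 5-cluster refined solutions for interpretability:
# solutions = {k: refine_with_kmeans(X_rescaled, k) for k in range(2, 6)}
```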

Reliability and Validity


The reliability and validity of the cluster solution obtained also need to be examined. Reliability is typically examined using a cross-validation procedure. The data are randomly split into two samples. Cluster analysis is performed on the first sample (referred to as the analysis sample) and the cluster centroids are computed. Consumers in the second sample (referred to as the holdout sample) are assigned to one of the clusters identified in the analysis sample by


computing the squared distance of each consumer from the cluster centroids and assigning the consumer to the cluster to which it is closest. The degree of agreement between these assignments and a separate cluster analysis of the holdout sample is an indicator of reliability. The procedure is then repeated in the other direction: consumers in the analysis sample are assigned to the clusters identified by a cluster analysis of the holdout sample, and the assignments are compared with the membership identified by the cluster solution of the analysis sample. Validity of the clustering solution pertains to the interpretability of the cluster solution and to comparing the results with external variables. If the solution is not interpretable or does not make sense to the manager, then the cluster solution is not valid or useful. In addition to being meaningful, the clusters should also relate to external factors or variables. For example, in the case of the soft drink example we could compare the average likelihood of purchase of each cluster or segment with the actual purchases of that cluster or segment. Obviously, the average purchases of consumers in the high likelihood-of-purchase cluster should be higher than those in the cluster with lower likelihood of purchase. This can be done only after the soft drink has been launched and actual purchase data are available. This type of validity is often referred to as external validity.
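The split-sample reliability check can be sketched as follows. The agreement statistic used here (the adjusted Rand index) is our choice, not the chapter's; it is convenient because it compares two label vectors without requiring the cluster numbers to match.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

def split_half_reliability(X, n_clusters, random_state=0):
    """Cluster the analysis sample, assign the holdout sample to the nearest centroids,
    and compare that assignment with an independent clustering of the holdout sample."""
    analysis, holdout = train_test_split(X, test_size=0.5, random_state=random_state)

    km_analysis = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(analysis)
    assigned = km_analysis.predict(holdout)          # nearest analysis-sample centroid

    km_holdout = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(holdout)
    return adjusted_rand_score(assigned, km_holdout.labels_)   # agreement, 1.0 = perfect
```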

Similarity Measures
As indicated earlier, Euclidean distance is the most widely used measure of similarity. The Euclidean distance is a special case of the more general Minkowski metric, which is given by

$$D_{ij} = \left[ \sum_{k=1}^{p} |x_{ik} - x_{jk}|^n \right]^{1/n}$$


where $D_{ij}$ is the distance between i and j, p is the number of variables, and n = 1, 2, 3, .... For n = 2 the above formula gives the Euclidean distance. For n = 1 the resulting distance is the city-block distance, which, as the name implies, is the distance one would travel in a city to get from one place to another when one has to follow city blocks to get there. Besides the Minkowski distance, other measures of similarity such as correlation coefficients and perceptual distances could also be used. These and additional measures are discussed in detail by Aldenderfer and Blashfield (1984) and Sneath and Sokal (1973). The most commonly used measure of similarity/dissimilarity, however, is the Euclidean distance.
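SciPy exposes these metrics directly; a brief sketch follows (note that SciPy's keyword for the Minkowski exponent is `p`, whereas the chapter uses n for the exponent and p for the number of variables).

```python
from scipy.spatial.distance import pdist

# Pairwise distances under different members of the Minkowski family.
d_city      = pdist(X, metric="cityblock")          # exponent 1: city-block distance
d_euclidean = pdist(X, metric="euclidean")          # exponent 2: Euclidean distance
d_mink3     = pdist(X, metric="minkowski", p=3)     # exponent 3
```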

Other Clustering Methods


The clustering methods presented in this chapter assign a given consumer to one and only one cluster. Other clustering algorithms exist in which consumers can belong to more than one cluster. Furthermore, the clustering methods we illustrated require metric data (i.e., ratio- and interval-scaled data). Situations do exist where the characteristics or variables consist of a mix of metric and non-metric data (nominal- and ordinal-scaled data). Newer clustering methods have been proposed in which the clustering variables can be metric (e.g., taste reaction), ordinal (e.g., income in categories) or categorical (e.g., gender). These newer techniques are discussed in Chapter 25 on segmentation.

Some Recommendations for Conducting Cluster Analysis


The recommended steps for conducting cluster analysis obviously cannot be exhaustive, as they will vary from application to application. However, the following suggested steps do provide the reader with general guidelines as to what should or should not be done.


1. Identify Clustering Variables. Management must identify the characteristics or variables for forming segments. These are typically obtained based on past experience or market research studies, or through initial qualitative market research techniques such as focus group interviews and depth interviews. For example, in the case of demographic segmentation, management might use consumer characteristics such as age, income, education, number of children living at home, and marital status as clustering variables.

2. Data Cleaning. The data must be cleaned for excessive missing values, bad or logically inconsistent responses, and outliers. Exploratory data analysis techniques (see Tukey, 1977) such as stem-and-leaf plots and box-plots can be employed. These and other procedures are implemented in commercially available packages such as SPSS and SAS.

3. Run Hierarchical Cluster Analysis. It is recommended that the data be subjected to hierarchical cluster analysis to gain insights into the number of potential clusters and initial cluster seeds. One should examine solutions from the various hierarchical methods discussed above for interpretability and use the most plausible cluster solutions. Obviously, confidence in the results is enhanced if all the algorithms result in the same solution. In situations where the number of consumers (i.e., observations) is large, one can draw a random sample. Multiple random samples can be drawn to increase confidence in the resulting cluster solution.

4. Refine the solution. It is strongly recommended that one use a non-hierarchical method for refining the hierarchical solution. The number of clusters and the centroids identified by the hierarchical methods can be used as inputs to the non-hierarchical clustering technique.


5. Examine the reliability and validity. Before using the resulting segments or cluster solution, its reliability and validity must be examined. As mentioned previously, reliability pertains to the degree to which similar solutions are obtained when the data are split into samples. Obviously, this can be done only when there is a large number of observations. Validity pertains more to whether the resulting segments are meaningful and interpretable, and the extent to which they relate to external variables or factors.

Factor Analysis
Consider the following scenario. A marketing researcher is hired by a company that manufactures and markets various brands of toothpaste to conduct a study whose objective is to determine the drivers of consumers' brand preferences. In the absence of any prior information the researcher is likely to conduct the study in two phases. In the first phase, the researcher might conduct qualitative interviews with a small sample of consumers drawn from the target population. Through these interviews the researcher determines that the key drivers of brand preferences are perceptions of the extent to which a brand provides various benefits that consumers typically seek when using toothpaste, such as preventing tooth decay, freshening the breath, keeping the gums healthy, keeping the mouth clean, etc. In the second phase of the study, the researcher might create a questionnaire containing rating scales that capture (numerical) evaluations of the extent to which a brand of toothpaste provides the various benefits (attributes) uncovered in the first phase, as well as ratings of brand preferences. This questionnaire is administered to a larger sample of consumers using some survey method to obtain quantitative ratings. To predict brand preferences, the researcher might then regress a brand's preference rating on ratings of that brand's attributes to determine the relative importance of each attribute in determining preference. At this point, the researcher is likely to encounter


certain problems, especially if the regression contains a large number of attributes as predictors of preference. Some of the regression coefficients may have the wrong sign: for example, the regression results might suggest that the lower the rating on the attribute "helps prevent tooth decay," the higher the preference for that brand. Other regression coefficients, while they have the proper sign, fail to be statistically significant, even though the attributes concerned are each individually strongly correlated with brand preference. Both these problems are usually manifestations of multicollinearity, which arises primarily from the predictor (or explanatory) variables in a regression being highly inter-correlated. When these problems occur, the researcher can try to solve them in a couple of ways. One approach would be to regress brand preference on a smaller number of attributes. However, this approach in turn might pose other problems: different subsets of attributes might provide comparable, interpretable results; attributes that are important from the perspective of the client's business might get omitted from the analysis. Another approach might be to reduce the number of predictors in the regression by creating a few indices (i.e., weighted linear combinations of subsets of attributes) and to use these indices as predictors in the regression instead of the much larger number of attributes. Factor analysis can be viewed as a set of statistical techniques for performing this type of data reduction (i.e., as a set of techniques for creating, from a set of highly correlated variables, a smaller number of indices that capture the statistical information contained in the original set of variables). Although problems arising in multiple regression were used to provide a context in which data reduction techniques might find practical application, factor analytic techniques can be used in any context where some data reduction is necessary and the set of variables to be reduced is highly inter-correlated.


An alternative, academic (i.e., psychometric) perspective of factor analysis (Bagozzi, 1980), one that is increasingly being adopted by many marketing research practitioners, can also be illustrated in the context of the scenario discussed above. Analysis of the qualitative interviews in the first phase of the study might suggest that consumers evaluate brands on what may be described as perceptual dimensions. For example, one perceptual dimension might be related to dental health/hygiene. The key characteristic of the concept of dental hygiene that is relevant to our discussion is that it encompasses several attributes. Good dental health would imply, among other things, absence or low incidence of cavities, gum disease and plaque. In this perspective, one of the key benefits sought by consumers in using toothpaste might be the more abstract dimension of good dental health; the extent to which a toothpaste provides good dental health would be inferred or assessed indirectly through evaluations of more concrete attributes, such as the extent to which a brand reduces incidence of cavities, gum disease, etc. In factor analytic terminology, dental health would be viewed as a latent construct. Specific evaluations of aspects such as prevention of cavities, plaque, etc., would be described as manifest indicators of dental health. Going beyond the illustrative scenario, many concepts in Marketing (e.g., customer loyalty), Psychology (e.g., attitude towards a brand) and Sociology (e.g., social class) that are used both by academics and marketing research practitioners are defined such that they can be measured properly only by using multiple items as indicators of the corresponding latent constructs. This chapter eschews technical details that can readily be found in standard multivariate texts and focuses instead on various decision issues that confront marketing researchers when they use factor analytic techniques. How should the relationship between a latent construct and its manifest indicators be modeled? Given the availability of various techniques of model


parameter estimation, what criteria can one use to choose among them? What is the optimal number of indices (or latent factors) for a given set of variables? What are the problems that may arise in using factor analytic model results? How should the validity of the results (i.e., the extent to which a model fits the data) be assessed? The remainder of the chapter is organized as follows. The discussion of exploratory factor analysis is covered under the following topics: (1) specification of the common factor model in the context of exploratory factor analysis; (2) the statistical model (i.e., the covariance structure model) that emerges from the model that links the latent factors to the observed variables; (3) the issue of identification or indeterminacy of factor model solutions; (4) techniques of estimation and criteria for choosing among them; (5) selecting the appropriate number of common factors; (6) rotation of factor solutions; (7) interpreting and evaluating factor solutions; and (8) estimation of factor scores. An illustrative example is then presented, followed by a brief discussion of the key differences between exploratory factor analysis and principal components analysis. The chapter concludes by noting some recent developments and making some recommendations.

Specification of the Common Factor Model


Let x1, x2, ..., xp-1, xp denote the p observed variables. Suppose the researcher hypothesizes that there are q common, unobserved factors underlying these observed variables. Then the model relating the observed variables to the unobserved common factors consists of the following set of p equations:

x1 = l11 f1 + l12 f2 + ... + l1(q-1) fq-1 + l1q fq + u1
x2 = l21 f1 + l22 f2 + ... + l2(q-1) fq-1 + l2q fq + u2
...


xp-1 = l(p-1)1 f1 + l(p-1)2 f2 + ... + l(p-1)(q-1) fq-1 + l(p-1)q fq + up-1
xp = lp1 f1 + lp2 f2 + ... + lp(q-1) fq-1 + lpq fq + up

where f1, f2, ..., fq-1 and fq denote the q common factors; l11, l12, ..., lp(q-1) and lpq are coefficients associated with the p regressions of the observed variables on the common factors; and u1, u2, ..., up-1 and up denote the residual components of x1, x2, ..., xp-1 and xp, respectively, that are not explained by the q common factors. In the terminology of the factor analysis literature, the regression coefficients (i.e., the l coefficients) are termed factor loadings; for example, l32, which is the regression coefficient associated with f2 in the regression of x3 on the common factors, is alternatively described as the loading of x3 on factor f2. Similarly, the terms u1, u2, ..., up-1 and up, which are usually described as error terms in the multiple regression literature, are described in the factor analytic literature as uniquenesses or unique components of x1, x2, ..., xp-1 and xp, respectively, because each u is hypothesized (implicitly) to be unique to a particular x, in contrast to the f-variables, which are common to all the x-variables. Note that, in contrast to typical multiple regression models, none of the p equations contains an intercept term. This is because all variables (the observed x-variables, the unobserved common factors and the unique components) are assumed to be mean-centered (i.e., they all have zero expectations). Strictly speaking, this assumption is not necessary. However, it is a convenient one to make, given that the focus of factor analysis is on analyzing/explaining correlations/covariances. How does the data-generating mechanism (i.e., the p equations relating the observed variables to the common factors and unique components) give rise to a statistical model? The specifications of the p equations imply a (testable) statistical hypothesis and a postulate that is untestable in the context of exploratory factor analysis. The hypothesis specifies the number of


common factors (q in our example). The postulate specifies that the unique components u1, u2, ..., up-1 and up are uncorrelated. In turn, this implies that the correlation between any two observed variables is due solely to the common factors. Because the assumption of uncorrelated uniquenesses is always made in exploratory factor analysis (i.e., it is embedded in all models), it cannot be treated as a testable hypothesis. In addition, just as in regressions of observed variables on observed predictors, it is also assumed that each unique component is uncorrelated with all the predictor variables (i.e., the common factors). When rotating the initial factor solution to facilitate interpretation of the results, one needs to decide whether the factors should be allowed to correlate. The linear relationships specified in the p equations impose, in turn, restrictions on the variances of (and covariances between) the observed variables. Each equation expresses an x-variable as a linear combination of the q common factors and a unique component. Thus the covariances between (and the variances of) the observed variables become functions of three sets of parameters: the factor loadings (i.e., the l coefficients), the variances of (and covariances between) the common factors, and the variances of the unique terms. In other words, the statistical hypothesis that follows from the factor model is that each element of the (population) covariance matrix (i.e., the array consisting of the variances and covariances of the p observed variables) can be expressed, using covariance algebra, as a function of these parameters. For a given data set, the task of statistical estimation becomes one of estimating values for the ls, the variances and covariances of the q common factors, and the variances of the p unique components so as to optimize some statistical criterion. An important consequence is that these parameters can be estimated without having to estimate the corresponding latent individual


scores (i.e., the unobserved values of f1, f2, ..., fq, u1, u2, ..., up for each respondent in the sample).
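In matrix notation (not used elsewhere in this chapter, but a standard way to summarize the implied covariance structure), the model states that

$$\Sigma = \Lambda \Phi \Lambda' + \Theta$$

where $\Sigma$ is the p x p covariance matrix of the observed variables, $\Lambda$ the p x q matrix of factor loadings, $\Phi$ the q x q covariance matrix of the common factors, and $\Theta$ the diagonal matrix of uniqueness variances.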

Model Identification/Indeterminacy Issues


Before turning to the problem of parameter estimation, it is necessary to ensure that the statistical model is identified. In general, the requirement of identification can be translated to one of requiring that alternative sets of parameter values should ultimately yield distinct probability distributions. If two different sets of parameter values were to yield the same probability distribution, there would be no empirical basis for distinguishing or choosing between them. For the common factor model the requirement for the model to be identified is that two different sets of parameter values (values of factor loadings, factor variances and covariances, and variances of uniquenesses) should not yield the same estimated covariance matrix for the p observed variables. As it stands, this requirement is violated in two ways. The first type of violation arises because the common factors are unobserved and therefore their scaling, which, in turn, determines their variances, is arbitrary. Suppose the model postulating q common factors for the p observed variables is correct. Then the model remains the same (with respect to generating the same values for the x-variables) even if the scaling of the factors is changed by multiplying each factor fj by a positive constant tj that is not equal to one and, simultaneously, dividing each of the p loadings on that factor (i.e., l1j, l2j, ..., l(p-1)j, lpj) by the same constant. For example, the equation

x2 = l21 f1 + l22 f2 + ... + l2(q-1) fq-1 + l2q fq + u2

is not empirically distinguishable from

x2 = (l21/t1)(t1 f1) + (l22/t2)(t2 f2) + ... + (l2(q-1)/tq-1)(tq-1 fq-1) + (l2q/tq)(tq fq) + u2


even though corresponding factor loading values in the two equations are different (e.g., l21 and l21/t1 will have different numerical values) and the variances of the corresponding factors will also differ (e.g., the variance of f2 and the variance of t2 f2), because substitution of the appropriate numerical values on the right side of either equation would yield the same value for x2 and, consequently, both equations would yield identical variances for x2 and covariances of x2 with the other p-1 observed variables. (The same argument applies to the other p-1 observed variables.) However, this type of identification problem, which arises because the unobserved factors can be arbitrarily rescaled, can be solved easily by fixing the variances of the factors to some (positive) numerical values (usually one) instead of allowing them to be estimated. If a factor's variance is required to be one, then its scaling ceases to be arbitrary and multiplying the factor by a positive constant is no longer permissible because that would simultaneously alter its variance. For factor models with more than one common factor, fixing the factor variances still leaves another type of identification problem, which can be understood more easily in geometric terms. For expository ease, let's assume that there are only two uncorrelated common factors f1 and f2, both of whose variances are fixed to one for scaling. Then, for example, the model equation for observed variable x2 becomes

x2 = l21 f1 + l22 f2 + u2

For the same variable, consider the alternative model

x2 = t21 g1 + t22 g2 + u2

where u2 is the same in both models and

g1 = .8f1 + .6f2
g2 = -.6f1 + .8f2


t21 = .8l21 + .6l22
t22 = -.6l21 + .8l22

Then

t21 g1 + t22 g2 = (.8l21 + .6l22)(.8f1 + .6f2) + (-.6l21 + .8l22)(-.6f1 + .8f2)
= .64 l21 f1 + .48 l21 f2 + .48 l22 f1 + .36 l22 f2 + .36 l21 f1 - .48 l21 f2 - .48 l22 f1 + .64 l22 f2
= l21 f1 + l22 f2

Thus the two models, although they have different common factors and different factor loadings, would generate the same values for x2 and therefore also the same covariance structure when applied to all p observed variables. Using covariance algebra, one can also show that the variances of the alternative factors g1 and g2 will each be equal to one and, further, that g1 and g2 will also be uncorrelated. There are also other pairs of numerical values (i.e., other than .8 and .6) that can generate alternative but statistically equivalent models. In geometric terms, if f1 and f2 are viewed as coordinate axes in a two-dimensional plane, then g1 and g2 represent new coordinate axes for the same plane that are derived as rotations of the f1 and f2 axes. Thus this source of indeterminacy in the common factor model is usually discussed as a rotation problem, and it occurs in all common factor models with two or more factors. The identification or indeterminacy problem can be summarized as follows. One source of indeterminacy arises because the scale of each common factor is arbitrary. However, this can be resolved by fixing the values of the factor variances. The second source of indeterminacy arises because any factor solution can be linearly transformed or rotated (i.e., through specific linear combinations of factor loadings and linear combinations of the unobserved factors) to yield another solution that is equivalent to the first in terms of generating the same covariance


matrix for the observed variables. The second type of indeterminacy is usually addressed by initially requiring the common factors to be uncorrelated and by imposing further restrictions on the factor loadings. The nature of the restrictions imposed on the factor loadings depends on the specific estimation procedure used to obtain the initial factor solution (e.g., principal factoring versus maximum likelihood), and the restrictions are unlikely to be meaningful in any substantive context. The initial factor solution can then be rotated in a variety of ways to obtain other solutions that may be easier to interpret and use.
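The rotation indeterminacy is easy to verify numerically. The sketch below uses made-up loadings for four variables on two uncorrelated factors and the same .8/.6 rotation as in the text; the rotated loadings imply exactly the same common-factor contribution to the covariance matrix.

```python
import numpy as np

# Illustrative loadings of p = 4 observed variables on q = 2 uncorrelated common factors.
L = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.2, 0.9],
              [0.1, 0.6]])

# Orthogonal rotation of the factor axes (the .8/.6 rotation used in the text).
T = np.array([[0.8, -0.6],
              [0.6,  0.8]])

L_rot = L @ T   # loadings on the rotated factors g1 and g2

# Both loading matrices imply the same covariance structure for the observed variables.
print(np.allclose(L @ L.T, L_rot @ L_rot.T))   # True
```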

Estimation of the Initial Solution


There are several methods of estimation available in commonly-used software such as SAS and SPSS. Instead of describing the technical details of the various estimation procedures, we focus on a key decision issue that the marketing research analyst has to confront. This concerns the choice of input data, which can be either the covariance matrix (of the observed variables) or the correlation matrix. Some methods of estimation (e.g., maximum likelihood or generalized least squares) are scale-free, which means that the solution based on one type of input data can be re-scaled, using the standard deviations of the observed variables, to derive the solution that would be obtained using the other type of input, in much the same way as standardized regression coefficients can be calculated in multiple regression from the unstandardized coefficients and the standard deviations of the dependent and independent variables. With estimation methods that are not scale-free (e.g., principal factoring), the solutions based on the covariance matrix input and the correlation matrix input can differ dramatically. For example, the number of common factors required may vary depending on the type of data input. Further, it is not possible to derive one solution from the other by re-scaling using the appropriate standard deviations. In marketing research applications, we think it is desirable to use scale-free estimation methods in most cases


for the following reasons. Most of the rating scales used in marketing research studies lack meaningful units, which implies that standardized measures (e.g., correlations) can be interpreted more easily than unstandardized ones (e.g., covariances). Furthermore, variables on different scales must sometimes be factor analyzed together. Therefore, the use of standardized scores (and of correlations instead of covariances) is by far the more common practice in market research applications. In many studies, the researcher may have to report or use both standardized and unstandardized results at different stages in the presentation of results to clients. In these situations, use of a scale-free method of estimation becomes imperative to maintain consistency in the presentation of results.

Determining the Number of Common Factors


When estimating factor models, the researcher has to determine the appropriate number of common factors. In most software, one can either specify the number of factors directly or use a statistical or heuristic criterion to determine it. If the researcher chooses to specify the number of factors directly, the estimation should be repeated while varying that number, generating alternative solutions that can then be compared to select the one that best describes the data. When the method of maximum likelihood is used to estimate the factor solution, it generates a likelihood ratio chi-square test statistic that can be used to assess the statistical fit of the model. The null hypothesis in this case is that the population covariance matrix is generated by the number of factors in the solution. Large values of the chi-square statistic lead to rejection of the null hypothesis; typically, this would lead the researcher to consider solutions containing a larger number of factors. The likelihood ratio test statistic is, however, quite sensitive to sample size; in large samples, the hypothesis that a particular model fits the data may be rejected


even though the discrepancy between the model and the data may be, in a practical sense, quite small. This problem has led researchers to develop a large number of alternative overall goodness-of-fit indices that are, to varying degrees, independent of sample size. However, most of these indices are not yet provided as part of the diagnostic output in commonly-used software such as SPSS.

One heuristic used when correlations are being analyzed is to extract and retain only those factors whose associated eigenvalues exceed one. Eigenvalues are the variances of the components extracted. The original justification (Guttman, 1954) for this criterion was that it provides a lower bound for the number of factors present in the data. A more intuitive justification is that, for a factor to be considered practically significant, it should explain more than the variance of any single input variable.

Another heuristic approach to selecting the number of factors is the scree plot (Cattell, 1966), which is output by most software. The scree plot is a graphical representation of the eigenvalues associated with the sample correlation matrix, ordered by size. Usually, the plot (i.e., the polygonal line joining points representing consecutive eigenvalues) shows steep declines initially and then levels off, meaning that the larger eigenvalues tend to have quite different numerical values while the smaller ones tend to have approximately equal values. The number of factors to be retained is determined (through visual inspection) by the point at which the leveling-off occurs. Simulation studies have shown that the two eigenvalue-based approaches (i.e., the eigenvalue-greater-than-one heuristic and the scree plot) may not always identify the correct number of factors. This suggests that, even when these criteria are used, it may be fruitful to examine other solutions containing different numbers of factors, especially because, in some


cases, the factor solution obtained using one of these criteria may not be as interpretable as another with more (or fewer) common factors. When correlation data are being analyzed, we have found that the eigenvalue-greater-than-one criterion provides a useful starting point for generating the initial solution, especially when the number of observed variables is large. In such instances, generating a sequence of factor solutions starting with the one-factor solution is inefficient, because a large number of alternative solutions would have to be evaluated. However, one should look at alternative models (i.e., those with slightly fewer or slightly more factors than the number extracted using the criterion) as well.
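A minimal sketch of both heuristics in Python follows. The DataFrame ratings is a stand-in filled with random numbers purely so the snippet runs end to end; in practice it would hold the actual item ratings.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data standing in for the survey ratings (eight items, 300 respondents)
rng = np.random.default_rng(0)
ratings = pd.DataFrame(rng.normal(size=(300, 8)),
                       columns=["breath", "clean", "decay", "cavities",
                                "whitens", "gums", "enamel", "plaque"])

corr = ratings.corr().to_numpy()                       # sample correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # eigenvalues, largest first

n_factors = int(np.sum(eigenvalues > 1))               # eigenvalue-greater-than-one rule
print(f"The eigenvalue-greater-than-one rule suggests {n_factors} factor(s)")

# Scree plot: look for the point at which the curve levels off
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.show()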

Rotation of the Initial Factor Solution


Recall that the initial factor solution is obtained by requiring all factor variances to be one, all factors to be orthogonal (i.e., pairwise uncorrelated), and by imposing estimation-procedure-specific constraints on the factor loadings to counter the problem of rotational indeterminacy. In general, the initial factor solution tends to be less interpretable than some rotated solutions; often the first factor in the initial solution is a general factor (i.e., one on which most or many of the observed variables load highly), while some of the others tend to be contrast factors, on which some variables have positive loadings and others have negative loadings. Note that any factor solution can be rotated by creating new factors that are linear combinations of the factors obtained previously, with new factor loadings that are linear combinations of the old loadings. When rotating the initial solution, in which the factors are orthogonal, the coefficients of the linear combination (e.g., .6 and .8 with the appropriate signs in the section on model identification/indeterminacy issues) can be chosen such that the new factors either remain mutually uncorrelated (orthogonal rotation) or


become correlated (oblique rotation). With each type of rotation (orthogonal versus oblique), many different solutions are possible, corresponding to the variations that are possible in the values of the coefficients in the linear combinations. In addition, factor rotations may be classified as subjective or analytical. As the name suggests, subjective rotations involve subjective judgments on the part of the researcher and usually require recourse to graphical procedures. Analytical methods, in contrast, require the rotations to satisfy certain criteria. Because analytical rotation methods have largely superseded subjective methods for reasons of efficiency, we restrict our discussion to the former.

Among the methods available for orthogonal rotation, the varimax method tries to ensure that only one or a few observed variables have large loadings on any given factor. In contrast, the quartimax method attempts to ensure that each variable has large loadings only on one or a few factors. The equamax method of orthogonal rotation essentially strikes a balance between the objectives of the varimax method and those of the quartimax method. Among the oblique rotation methods, the direct oblimin method is perhaps the most well-known. This method allows the user to control the extent of the correlations among the factors through the specification of a numerical value for a parameter usually designated by delta. Positive values of delta increase the correlations among the rotated factors (relative to a zero value for delta, which is usually the default setting), while negative values tend to decrease inter-factor correlations. While it is easy enough to employ each rotation method in turn and compare the alternative solutions, the choice of rotation method will also be guided to some degree by the researcher's objectives and by substantive aspects of the study context. If the ultimate purpose of factor analysis is to compute factor scores that are as uncorrelated as possible (e.g., for use as composite predictors in a multiple regression), then an orthogonal rotation method would be


preferred. On the other hand, if interpretational clarity of the factor solution is paramount, then an oblique solution is often necessary. It is also reasonable to expect factors to be correlated if the observed variables loading on those factors are intercorrelated.
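As an illustration of how rotated solutions can be obtained programmatically, the sketch below uses the third-party Python package factor_analyzer (an assumption about tooling; the same rotations are available in SPSS and SAS) to fit two-factor maximum likelihood solutions to the placeholder ratings DataFrame from the earlier sketch and to request varimax and oblimin rotations:

from factor_analyzer import FactorAnalyzer   # third-party package, assumed installed

# 'ratings' is the DataFrame of item ratings constructed in the earlier sketch
fa_varimax = FactorAnalyzer(n_factors=2, method="ml", rotation="varimax")
fa_varimax.fit(ratings)
print(fa_varimax.loadings_)    # orthogonally (varimax) rotated loadings

fa_oblimin = FactorAnalyzer(n_factors=2, method="ml", rotation="oblimin")
fa_oblimin.fit(ratings)
print(fa_oblimin.loadings_)    # obliquely (oblimin) rotated loadings

Quartimax and equamax rotations can be requested in the same way by changing the rotation argument.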

Interpreting/Evaluating Factor Solutions


Because the common factors are unobserved, their interpretation depends crucially on the pattern of factor loadings and factor inter-correlations in the solution that is being evaluated. Meaning is assigned to a factor through the subset of observed variables that have high loadings on that factor. While it is tempting to prescribe an arbitrary numerical cut-off for determining what loadings are to be considered high, pragmatic considerations dictate the need to be flexible across different research studies. Among both marketing academics and practitioners there appears to be a preference for working with observed measures that are unidimensional or load highly only on one factor (Anderson & Gerbing, 1982). This, presumably, is because factor analysis is often used to create indices that are then used as input variables in structural equation models or path analytic models. In these settings, use of indices that have common subsets of constituent variables can be extremely problematic when interpreting model results. But it is wise to bear in mind that, conceptually, there is no inherent reason why a variable must load on one and only one factor. For example, a rating of the extent to which a product or service provides good value could potentially load on both a price factor and a quality factor, since value can be a function of both quality and price. Some observed variables may not have high loadings on any factor, implying either that these have low intercorrelations with the other variables included in the analysis or that additional factors might be required to account for what these variables have in common with the others. When this occurs, it is advisable to re-estimate the factor model after deleting the


variables with low communalities; however, before this is done, one should consider models with additional factors to check for the possibility that the added factors capture what the variable might have in common with the others. The communality of an observed variable is the proportion of its variance that is accounted for by the common factors. Communalities should be neither too high nor too low. At one extreme, a communality of zero suggests either that all of that observed variable's variance is due to the unique component (i.e., that it has nothing in common with the other observed variables) or that more factors need to be extracted. A communality close to zero does not necessarily mean that the variable cannot be useful in other types of statistical analysis (e.g., multiple regression). At the other extreme, a communality of one suggests that the variable and the factor are identical.
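Because the factors in an orthogonally rotated solution are uncorrelated, a variable's communality is simply the sum of its squared loadings across the factors. The short sketch below computes the communalities from the varimax-rotated loadings reported later in Table 18.9 and reproduces the range of approximately .45 to .74 cited in the illustrative example:

import numpy as np

# Varimax-rotated loadings from Table 18.9 (rows: breath, clean, decay, cavities,
# whitens, gums, enamel, plaque; columns: factor 1, factor 2)
loadings = np.array([[.297, .602],
                     [.229, .780],
                     [.716, .247],
                     [.782, .243],
                     [.258, .821],
                     [.696, .234],
                     [.792, .267],
                     [.574, .391]])

# With uncorrelated factors, the communality of each variable is the sum of its squared loadings
communalities = (loadings ** 2).sum(axis=1)
print(np.round(communalities, 2))   # approximately .45 to .74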

Factor Scores
In the psychometric perspective on factor analysis, factors represent latent constructs (McDonald, 1985). The constructs are considered latent partly because the scores of individuals (or other units of analysis) on these constructs cannot be directly observed; in addition, any response purporting to measure a construct provides only an imperfect assessment of the unobserved score (imperfect in the sense that it is not perfectly correlated with the unobserved score) for that individual. Nonetheless, from a practical standpoint, researchers often need to estimate scores on a latent construct (i.e., factor scores) to use instead of the set of items that load on that factor. As a starting point for constructing a factor score, one could reasonably consider a simple sum or average of the scores on the items loading on that factor. From a statistical perspective, however, this procedure could be refined by using the information contained in the factor solution; the simple average utilizes only the information that the set of items load on a given


factor. For example, if the items have vastly different loadings on the factor, those with relatively high loadings are better measures of the underlying factor (i.e., more highly correlated with it) than those with relatively low loadings. This implies that a weighted combination of the item scores, in which items with relatively high loadings are given greater weight, should be a better estimate of the factor score than a simple (i.e., equally weighted) average. In addition, a statistical procedure for estimating the weights should take into account other available, relevant statistical information (e.g., the factor variances and covariances, the covariances among the observed variables). The weights used to combine the scores on the observed items are usually obtained through some form of least squares regression, which is why they are referred to as factor score regression coefficients (sometimes shortened to factor score coefficients). It is important to bear in mind that the factor scores thus obtained are only estimates of their unobserved counterparts. The accuracy of these estimates could potentially be gauged from the R-squared associated with the regression used to estimate the weights; unfortunately, this is not provided as part of the output in some statistical packages.
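A minimal sketch of one common approach, the regression method for orthogonal factors, follows: the factor score coefficient matrix is obtained by premultiplying the loadings by the inverse of the correlation matrix of the items, and the estimated scores are weighted combinations of the standardized item scores. The function name and inputs are hypothetical.

import numpy as np

def regression_factor_scores(Z, loadings):
    """Estimate factor scores with the regression method, assuming an
    orthogonal (e.g., varimax-rotated) factor solution.

    Z        : n x p matrix of standardized item scores
    loadings : p x k matrix of factor loadings
    """
    R = np.corrcoef(Z, rowvar=False)    # correlation matrix of the items
    B = np.linalg.solve(R, loadings)    # factor score coefficients (R^-1 times the loadings)
    return Z @ B                        # n x k matrix of estimated factor scores

For correlated (obliquely rotated) factors the coefficients also involve the factor correlation matrix; statistical packages handle this automatically.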

An Illustrative Example
We return to the toothpaste-related scenario introduced earlier in this chapter. A data set containing eight ratings of a brand of toothpaste was factor analyzed. The eight ratings required consumers to evaluate the following characteristics of the brand: (1) freshens the breath (breath); (2) makes the mouth feel clean (clean); (3) helps prevent tooth decay (decay); (4) helps prevent cavities (cavities); (5) whitens the teeth (whitens); (6) helps prevent gum


disease (gums); (7) strengthens tooth enamel (enamel); and (8) prevents plaque formation (plaque). The (scale-free) maximum likelihood estimation procedure was used to estimate the model. First, use of the eigenvalue-greater-than-one criterion resulted in the extraction of two common factors. The scree plot (Figure 18.9) also suggested retaining two factors. The communalities of all eight variables were sufficiently high to warrant retaining all of them in the analysis; their values ranged from .45 to .74. In addition to the two-factor solution, one- and three-factor models were also estimated. Because of the large sample size, the likelihood ratio chi-square test statistic was not used to assess the overall goodness of fit of each model. [Insert Figure 18.9] Consistent with what is generally observed, the initial solution was uninterpretable, consisting of a general factor and a contrast factor (Table 18.7). Compared to the two-factor solution, the communalities of three variables (breath, clean, and whitens) showed substantial declines in the one-factor solution, indicating that the two-factor solution was clearly superior. On the other hand, the communalities remained approximately the same across the two- and three-factor solutions for all variables except cavities. In addition, the third factor in the three-factor solution had no variables with high loadings: the loading for cavities was -.31, while all other variables had loadings below .13 (Table 18.8). This means that the third factor, if retained, cannot strictly be interpreted as a common factor (in the sense of being common to two or more observed variables). It would instead be more appropriately interpreted as a factor unique to the cavities variable, and even that interpretation would be a stretch, given that variable's relatively low loading. Therefore the two-factor solution was chosen.
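The model-comparison step described above can be sketched as follows. The toothpaste ratings themselves are not reproduced in this chapter, so the DataFrame name ratings and the use of the third-party factor_analyzer package are assumptions made purely for illustration.

from factor_analyzer import FactorAnalyzer   # third-party package, assumed installed

# Fit unrotated maximum likelihood solutions with 1, 2, and 3 factors and compare
# the communalities of the eight items across the competing solutions
for k in (1, 2, 3):
    fa = FactorAnalyzer(n_factors=k, method="ml", rotation=None)
    fa.fit(ratings)    # 'ratings' is assumed to hold the eight toothpaste items
    print(f"{k}-factor solution, communalities:")
    print(fa.get_communalities().round(2))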


[Insert Table 18.7, 18.8] With respect to orthogonal rotations (i.e., those transformations that keep the factors uncorrelated) of the initial factor solution, the varimax-rotated solution (Table 18.9) provides a clearer separation of the observed variables between the factors than the quartimax-rotated solution (Table 18.10). The first factor in the varimax-rotated solution, on which the variables with relatively large loadings are decay, cavities, gums, and enamel, can be interpreted as a dental health/hygiene-related factor. The second factor, with large loadings for breath, clean, and whitens, may be described as a hedonic dimension that encompasses the positive feelings associated with use of the product. Interestingly, the variable plaque has moderately high loadings on both factors, suggesting that respondents associate it with both the health and the hedonic dimensions (i.e., plaque is considered both unhealthy and as detracting from one's appearance). However, this interpretation is contingent on the assumption that the two factors should be modeled as orthogonal. If they are instead modeled as correlated, as in the oblimin-rotated solution (Table 18.11), then plaque is more unambiguously associated with the health/hygiene dimension. The correlation between the two factors is about -.62. Note, however, that the three observed variables that define the hedonic dimension (i.e., breath, clean, and whitens) have negative loadings on the second factor. For interpretive purposes, the signs of all the loadings on the second factor and the sign of its correlation with the first factor can be reversed simultaneously. The two factors are then both scaled the same way, with high scores on the observed items implying favorable evaluations of the brand on both (latent) dimensions. From a statistical standpoint (i.e., with respect to accounting for the correlations in the input data), the original solution with the negative loadings and correlation is equivalent to the solution with the signs reversed.


[Insert Table 18.9, 18.10, 18.11] Given that all the two-factor solutions discussed above are equivalent with respect to generating the same estimated covariance matrix, which solution should one choose as the most appropriate for the data? Between the two orthogonally-rotated solutions, the criterion of interpretability would favor the varimax-rotated solution, especially given the preference among marketing academics and practitioners for unidimensional items (i.e., observed variables that load only on one factor). However, the choice between the varimax-rotated (orthogonal) solution and the oblimin-rotated (oblique) solution has to be guided by considerations that go beyond the single criterion of interpretability. The key question is whether there are any context-specific reasons for the factors to be correlated. In the toothpaste context, there are at least two reasons why the health/hygiene and the hedonic dimensions may be correlated. The correlation may reflect respondent beliefs that (essentially) the same set of toothpaste ingredients drives performance on both dimensions (e.g., an ingredient that prevents plaque may also whiten the teeth). It may also reflect the communication/branding strategy used by the toothpaste manufacturer (viz., one of emphasizing performance on both dimensions in order to appeal to a larger customer base). In this instance, a correlated factor solution is also justified by the relatively high intercorrelations between the two sets of variables.

The Distinction Between the Common Factor Model and Principal Components Analysis
In popular statistical software such as SPSS the technique of principal components analysis is usually presented as one of the methods for estimating factor models. However, it is important to note the conceptual distinctions between factor models and principal components analysis, even though they may frequently provide similarly interpretable results when applied to


data sets. For a set of p observed variables, the first principal component is defined as the weighted linear combination of the variables in which the weights are chosen such that the sum of squares of the weights equals one and the variance of the resulting linear composite is the maximum possible. Letting y1 denote the first principal component,

y1 = w11 x1 + w12 x2 + ... + w1(p-1) x(p-1) + w1p xp

where x1, x2, ..., x(p-1), and xp denote the observed variables, and w11, w12, ..., w1(p-1), and w1p denote the weights whose squares sum to one. The reader conversant with covariance algebra will readily recognize the need to require the squared weights to sum to one: in the absence of such a constraint, the variance of the linear composite can be made arbitrarily large by choosing large (positive or negative) values for the weights. The second principal component is defined as the weighted linear combination in which the weights are again chosen such that their squares sum to one and the variance of the composite is the maximum possible, subject to the additional requirement that it be uncorrelated with the first principal component. In general, the objective of principal components analysis is to generate a sequence of weighted linear composites of the observed variables such that, for each composite, the squares of the weights sum to one and the composite has the maximum possible variance subject to being uncorrelated with all the components preceding it in the sequence.

In the common factor model, the observed variables are dependent variables in regressions in which the common factors are the (unobserved) independent variables. Further, a specific number of common factors is hypothesized to account for the covariances among the observed variables. In contrast, the definition of principal components does not imply a testable hypothesis. Unlike the common factor model, there is no rotational indeterminacy associated


with the initial solution. However, just as with the common factor model, the initial solution may need to be rotated to facilitate interpretation. Unlike the common factor model which can be estimated using different estimation techniques, the estimation approach for principal components is embedded in their very definition. The weights used to form the first principal component, for example, will always correspond to the elements of the eigenvector associated with the largest eigenvalue of the sample covariance matrix of the observed variables. Unlike some factor solutions (e.g., those estimated by maximum likelihood), principal components estimates are not scale-free. Hence the choice between the covariance matrix and the correlation matrix for use as input data needs to be considered carefully.
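A short numerical sketch of this definition (with simulated data): the weights of the first principal component are the elements of the eigenvector associated with the largest eigenvalue of the sample covariance matrix, their squares sum to one, and the variance of the resulting composite equals that eigenvalue.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 1] += X[:, 0]                      # induce some correlation among the variables

S = np.cov(X, rowvar=False)             # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order

w1 = eigvecs[:, -1]                     # weights of the first principal component
print(np.sum(w1 ** 2))                  # 1.0: the squared weights sum to one

y1 = (X - X.mean(axis=0)) @ w1          # scores on the first principal component
print(np.var(y1, ddof=1), eigvals[-1])  # the component's variance equals the largest eigenvalue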

Conclusion and Recent Developments


We have provided a non-technical description of exploratory factor analysis and an example illustrating its usage. The essential differences between the common factor model and principal components analysis have been highlighted. Although recent developments in statistics have made possible the simultaneous analysis of several samples (i.e., multi-group analysis), both for exploratory factor analysis and principal components analysis, algorithms for performing these analyses are yet to be made available in software such as SPSS. The technique of confirmatory factor analysis, implemented in software such as LISREL, EQS and AMOS, provides an efficient alternative to the tedium of subjective rotations using graphical methods. In essence, confirmatory factor analysis allows the user to impose a variety of competing factor structures, usually through the specification of zero and non-zero factor loadings. In addition to solving the factor indeterminacy problem, confirmatory factor analysis allows the researcher to develop and test a greater variety of competing models and solutions.


We conclude by offering a few recommendations that may be especially relevant in marketing research practice. Given the large sample sizes usually associated with marketing research studies, strict reliance on the chi-square test for determining the correct number of factors is not appropriate. While the eigenvalue-greater-than-one rule and/or the scree plot are useful diagnostics for an initial determination of the number of factors needed, researchers should also examine solutions with more and with fewer factors than specified by these criteria, and use the additional criterion of interpretability to make a final selection among the alternative solutions. Given the preference among both marketing academics and practitioners for measures that are reflective of only a single construct/factor, we recommend the use of varimax rotation for orthogonal solutions and oblimin rotation for correlated factor solutions to facilitate interpretation.


Table 18.1 Soft Drink Hypothetical Data


Consumer   Taste      Purchase     Taste Reaction   Taste Reaction   Purchase Likelihood
           Reaction   Likelihood   Rescaled         Standardized     Standardized
C1         80         9            9                 0.719            1.186
C2         85         8            9.5               0.858            0.896
C3         90         9            10                0.996            1.186
C4         90         1            10                0.996           -1.134
C5         80         2            9                 0.719           -0.844
C6         85         3            9.5               0.858           -0.554
C7         10         7            2                -1.223            0.606
C8         15         9            2.5              -1.085            1.186
C9         10         3            2                -1.223           -0.554
C10        20         2            3                -0.946           -0.844
C11        30         1            4                -0.668           -1.134

Table 18.2 Dissimilarity Matrix for Raw Data


Consumer    C1      C2      C3      C4      C5      C6      C7      C8      C9      C10     C11
C1             0      26     100     164      49      61    4904    4225    4936    3649    2564
C2            26       0      26      74      61      25    5626    4901    5650    4261    3074
C3           100      26       0      64     149      61    6404    5625    6436    4949    3664
C4           164      74      64       0     101      29    6436    5689    6404    4901    3600
C5            49      61     149     101       0      26    4925    4274    4901    3600    2501
C6            61      25      61      29      26       0    5641    4936    5625    4226    3029
C7          4904    5626    6404    6436    4925    5641       0      29      16     125     436
C8          4225    4901    5625    5689    4274    4936      29       0      61      74     289
C9          4936    5650    6436    6404    4901    5625      16      61       0     101     404
C10         3649    4261    4949    4901    3600    4226     125      74     101       0     101
C11         2564    3074    3664    3600    2501    3029     436     289     404     101       0


Table 18.3 Ten-Cluster Solution


Cluster Number   Cluster Label   Cluster Membership   Taste Reaction   Purchase Likelihood
1                C1              C1                   80               9
2                C2              C2                   85               8
3                C3              C3                   90               9
4                C4              C4                   90               1
5                C5              C5                   80               2
6                C6              C6                   85               3
7                C8              C8                   15               9
8                C10             C10                  20               2
9                C11             C11                  30               1
10               CLUS10          C7, C9               10               5

Table 18.4 Clustering Summary for Hierarchical Clustering

Step   Number of   Clusters Merged   Name of       Size of       Subjects in the new cluster                    Dissimilarity
       Clusters                      new cluster   new cluster
1      10          C7, C9            CLUS10        2             C7, C9                                         16.000
2      9           C2, C6            CLUS9         2             C2, C6                                         25.000
3      8           C1, CLUS9         CLUS8         3             C1, C2, C6                                     37.250
4      7           CLUS8, C5         CLUS7         4             C1, C2, C6, C5                                 32.889
5      6           CLUS10, C8        CLUS6         3             C7, C9, C8                                     41.000
6      5           C3, C4            CLUS5         2             C3, C4                                         64.000
7      4           CLUS7, CLUS5      CLUS4         6             C1, C2, C3, C4, C5, C6                         56.500
8      3           CLUS6, C10        CLUS3         4             C7, C9, C8, C10                                88.222
9      2           CLUS3, C11        CLUS2         5             C7, C9, C8, C10, C11                           282.125
10     1           CLUS4, CLUS2                    11            C3, C4, C1, C2, C6, C5, C7, C9, C8, C10, C11   4624.871


Table 18.5 Four-Cluster Solution

Cluster              Rescaled Taste Reaction   Likelihood of Purchase
1 (C4, C5, C6)       9.5000                    2.0000
2 (C1, C2, C3)       9.5000                    8.6667
3 (C7, C8)           3.0000                    2.0000
4 (C9, C10, C11)     2.2500                    8.0000


Table 18.6 Initial and Reassignment of Consumers to Clusters

Distance from Cluster for Initial Assignment

Consumer   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Initial Cluster Assignment   Cluster Reassignment in Iteration 1
C1            0.000       1.250       1.000      65.000    1                            3
C2            1.250       0.000       1.250      49.250    2                            2
C3            1.000       1.250       0.000      64.000    3                            3
C4           65.000      49.250      64.000       0.000    4                            4
C5           49.000      36.250      50.000       2.000    4                            4
C6           36.250      25.000      36.250       4.250    4                            4
C7           53.000      57.250      68.000     100.000    1                            1
C8           42.250      50.000      56.250     120.250    1                            1
C9           85.000      81.250     100.000      68.000    4                            4
C10          85.000      78.250      98.000      50.000    4                            4
C11          89.000      79.250     100.000      36.000    4                            4

Cluster Centroids and Change in the Centroids

Consumer Characteristics                                        Cluster 1   Cluster 2   Cluster 3   Cluster 4
Initial seeds             Taste Reaction                          9.000       9.500      10.000      10.000
(selected cluster seeds)  Purchase Likelihood                     9.000      10.000       9.000       1.000
Initial Assignment        Taste Reaction                          4.500       9.500      10.000       6.250
                          Purchase Likelihood                     8.333       8.000       9.000       2.000
Iteration 1               Taste Reaction                          2.250       9.500       9.500       6.250
                          Purchase Likelihood                     8.000       8.000       9.000       2.000
Change in cluster centroids between Initial Assignment
and Iteration 1                                                   5.173*      0.000       0.000       0.000
Iteration 2               Taste Reaction                          2.250       9.500       9.500       6.250
                          Purchase Likelihood                     8.000       8.000       9.000       2.000
Change in cluster centroids between Iterations 1 and 2            0.000       0.000       0.000       0.000

*The change in the cluster 1 centroid between the Initial Assignment and Iteration 1 equals 5.173 [i.e., (4.5 - 2.25)² + (8.333 - 8.000)²].


Table 18.7 Factor Loadings Initial Solution (2-Factor solution)

            Factor 1    Factor 2
BREATH        .616       -.267
CLEAN         .679       -.447
DECAY         .706        .274
CAVITIES      .753        .321
WHITENS       .728       -.459
GUMS          .682        .272
ENAMEL        .777        .309
PLAQUE        .691        .073


Table 18.8 Factor Loadings Oblimin Rotated Solution (3-Factor solution)

            Factor 1    Factor 2    Factor 3
BREATH        .129        .576        .107
CLEAN        -.044        .851       -.060
DECAY         .728        .036       -.073
CAVITIES      .859        .003       -.310
WHITENS      -.014        .868       -.021
GUMS          .745       -.021        .121
ENAMEL        .838       -.011        .092
PLAQUE        .537        .215        .129


Table 18.9 Factor Loadings Varimax Rotated Solution (2-Factor Solution)

            Factor 1    Factor 2
BREATH        .297        .602
CLEAN         .229        .780
DECAY         .716        .247
CAVITIES      .782        .243
WHITENS       .258        .821
GUMS          .696        .234
ENAMEL        .792        .267
PLAQUE        .574        .391


Table 18.10 Factor Loadings Quartimax Rotated Solution (2-Factor Solution)

            Factor 1    Factor 2
BREATH        .447        .501
CLEAN         .429        .691
DECAY         .756        .047
CAVITIES      .818        .025
WHITENS       .468        .722
GUMS          .733        .039
ENAMEL        .835        .046
PLAQUE        .658        .223


Table 18.11 Factor Loadings Oblimin Rotated Solution (2-Factor Solution)

            Factor 1    Factor 2
BREATH        .113       -.596
CLEAN        -.043       -.839
DECAY         .760        .005
CAVITIES      .842        .039
WHITENS      -.024       -.875
GUMS          .742        .013
ENAMEL        .844        .014
PLAQUE        .532       -.226


Figure 18.1 Plot of Soft Drink Data

[Scatter plot of the eleven consumers (C1-C11), with likelihood of purchase (0-10) on the horizontal axis and taste reaction (0-100) on the vertical axis.]


Figure 18.2 Distance between Two Clusters for Centroid, Single Linkage and Complete Linkage Methods

[Schematic of two clusters and their centroids (average customers), with annotations showing the inter-cluster distance used by the complete linkage, centroid, and single linkage methods.]


Figure 18.3 Hypothetical Configurations Panel I: Chaining

Panel II: Effect of Outliers in Single Linkage Clustering


Figure 18.4 Assessing Cluster Solution
Panel I: RMSSTD [two hypothetical cluster configurations, one with low RMSSTD and one with high RMSSTD]
Panel II: SPR and CD [two pairs of candidate clusters: merging one pair would yield low SPR and CD, merging the other would yield higher SPR and CD]


Figure 18.4 (continued)
Panel III: RS [two cluster solutions, one with a lower RS and one with a higher RS]


Figure 18.5 Plot to Assess the Number of Clusters


[Plots of RSQ, SPR, CD (centroid distance), and RMSSTD against the number of clusters (10 down to 1).]


Figure 18.6 Plot of Rescaled Data

[Scatter plot of the eleven consumers (C1-C11), with likelihood of purchase on the horizontal axis and rescaled taste reaction on the vertical axis.]


Figure 18.7 Plot to Assess the Cluster Solution for Rescaled Data

[Plots of centroid distance, RSQ, SPR, and RMSSTD against the number of clusters (10 down to 2) for the rescaled data.]


Figure 18.8 Non-Hierarchical Clustering

[The figure presents the non-hierarchical clustering procedure as a flowchart:]
Step 1: Specify the number of clusters, k.
Step 2: Select initial centroids for the k clusters and assign consumers to the clusters.
Step 3: Reassign observations to the k clusters by recomputing the distance of each consumer from each cluster; recompute the cluster centroids.
Step 4: Compare the change in the cluster centroids to the convergence criterion.
Step 5: If the change is less than the criterion, or the number of iterations exceeds the specified maximum, stop; otherwise return to Step 3.


Figure 18.9

Scree Plot
[Plot of the eigenvalues (vertical axis) against the factor number, 1 through 8 (horizontal axis).]


ENDNOTES

1. It should be noted that, in general, the objective of cluster analysis is to group stimuli (which could be consumers, products, product attributes, etc.) into groups (or clusters, or segments). In order to simplify the discussion, we will use the term consumers throughout this chapter.

2. We have adopted an arbitrary labeling scheme. Clusters consisting of more than one consumer are labeled CLUSX, where X is the number of clusters formed at a given step (e.g., CLUS10), and clusters consisting of a single consumer are named CY, where Y is the consumer number (e.g., C7).

3. The number of possible solutions is given by n(n-1)/2, where n is the number of clusters or observations.

REFERENCES


Aldenderfer, M. S., & Blashfield, R. K. (1984). Cluster analysis. Thousand Oaks, CA: Sage.

Anderson, J. C., & Gerbing, D. W. (1982). Some methods for respecifying measurement models to obtain unidimensional construct measurement. Journal of Marketing Research, 19, 453-460.

Bagozzi, R. P. (1980). Causal models in marketing. New York: Wiley.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.

Guttman, L. (1954). Some necessary conditions for common factor analysis. Psychometrika, 19, 149-161.

McDonald, R. P. (1985). Factor analysis and related methods. Mahwah, NJ: Erlbaum.

Sharma, S. (1996). Applied multivariate techniques. New York: Wiley.

Sneath, P., & Sokal, R. (1973). Numerical taxonomy. San Francisco: W. H. Freeman.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

