
# Multivariate Analysis: Business Research Methods


Multiple Regression
Q-1 What is Multiple Regression? Ans: Multiple regression is used to account for (predict) the variance in an interval dependent variable, based on linear combinations of interval, dichotomous, or dummy independent variables. Multiple regression can establish that a set of independent variables explains a proportion of the variance in a dependent variable at a significant level (through a significance test of R2), and can establish the relative predictive importance of the independent variables (by comparing beta weights). Power terms can be added as independent variables to explore curvilinear effects. Cross-product terms can be added as independent variables to explore interaction effects. One can test the significance of the difference of two R2's to determine if adding an independent variable to the model helps significantly. Using hierarchical regression, one can see how much variance in the dependent can be explained by one or a set of new independent variables, over and above that explained by an earlier set. Of course, the estimates (b coefficients and constant) can be used to construct a prediction equation and generate predicted scores on a variable for further analysis.

The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression coefficients, each representing the amount the dependent variable y changes when the corresponding independent variable changes 1 unit. The c is the constant, where the regression line intercepts the y axis, representing the value the dependent y takes when all the independent variables are 0. The standardized versions of the b coefficients are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the independent variables. Associated with multiple regression is R2, multiple correlation, which is the percent of variance in the dependent variable explained collectively by all of the independent variables.
Multiple regression shares all the assumptions of correlation: linearity of relationships, the same level of relationship throughout the range of the independent variable ("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose range is not truncated. In addition, it is important that the model being tested is correctly specified. The exclusion of important causal variables or the inclusion of extraneous variables can change markedly the beta weights and hence the interpretation of the importance of the independent variables.


Tolerance. Tolerance is 1 - R2 for the regression of that independent variable on all the other independents, ignoring the dependent. There will be as many tolerance coefficients as there are independents. The higher the intercorrelation of the independents, the more the tolerance will approach zero. As a rule of thumb, if tolerance is less than .20, a problem with multicollinearity is indicated. When tolerance is close to 0 there is high multicollinearity of that variable with other independents, and the b and beta coefficients will be unstable. The more the multicollinearity, the lower the tolerance and the greater the standard error of the regression coefficients, with the effect that small data changes or arithmetic errors may translate into very large changes or errors in the regression analysis. Tolerance is part of the denominator in the formula for calculating the confidence limits on the b (partial regression) coefficient.

Variance-inflation factor (VIF). VIF is simply the reciprocal of tolerance; when VIF is high there is high multicollinearity and instability of the b and beta coefficients. VIF and tolerance are found in the SPSS and SAS output section on collinearity statistics. As a rule of thumb, tolerance > .20 or VIF < 4 suggest no multicollinearity.

Condition indices and variance proportions. Condition indices are used to flag excessive collinearity in the data. Computationally, a "singular value" is the square root of an eigenvalue, and "condition indices" are the ratio of the largest singular value to each other singular value. A condition index over 30 suggests serious collinearity problems, and an index over 15 indicates possible collinearity problems. If a factor (component) has a high condition index, one looks in the variance proportions column. Criteria for "sizable proportion" vary among researchers, but the most common criterion is this: if two or more variables have a variance partition of .50 or higher on a factor with a high condition index, these variables have high linear dependence and multicollinearity is a problem. Note that it is possible for the condition-index rule of thumb (no index over 30) to indicate multicollinearity even when the rules of thumb for tolerance > .20 or VIF < 4 suggest none. In SPSS or SAS, select Analyze, Regression, Linear, click Statistics, and check Collinearity diagnostics to get condition indices.

While simple correlations tell something about multicollinearity, inspection of the correlation matrix reveals only bivariate multicollinearity, the typical criterion being bivariate correlations > .90. To assess multivariate multicollinearity, one uses tolerance or VIF, which build in the regressing of each independent on all the others; the preferred method of assessing multicollinearity is therefore to regress each independent on all the other independent variables in the equation. Even when multicollinearity is present, note that estimates of the importance of other variables in the equation (variables which are not collinear with others) are not affected.

Types of multicollinearity. The type of multicollinearity matters a great deal; some types are necessary to the research purpose.

Q-5 What is homoscedasticity? Ans: Homoscedasticity means the residuals are dispersed randomly throughout the range of the estimated dependent; put another way, the variance of residual error should be constant for all values of the independent(s). The researcher should test to assure that this assumption holds. If not, separate models may be required for the different ranges. Also, when the homoscedasticity assumption is violated, "conventionally computed confidence intervals and conventional t-tests for OLS estimators can no longer be justified." However, moderate violations of homoscedasticity have only minor impact on regression estimates.

Lack of homoscedasticity may mean (1) there is an interaction effect between a measured independent variable and an unmeasured independent variable not in the model, or (2) that some independent variables are skewed while others are not. Nonconstant error variance can indicate the need to respecify the model to include omitted independent variables. It can be observed by requesting a simple residual plot (a plot of residuals on the Y axis against predicted values on the X axis). A homoscedastic model will display a cloud of dots, whereas lack of homoscedasticity will be characterized by a pattern such as a funnel shape, indicating greater error as the dependent increases.

One method of dealing with heteroscedasticity is to select the weighted least squares regression option. This causes cases with smaller residuals to be weighted more in calculating the b coefficients. Square root, log, and reciprocal transformations of the dependent may also reduce or eliminate lack of homoscedasticity.

Suggested Readings and Links:
http://www2.chass.ncsu.edu/garson/pa765/regress.htm
www.cs.uu.nl/docs/vakken/arm/SPSS/spss4.pdf
Kahane, Leo H. (2001). Regression basics. Thousand Oaks, CA: Sage Publications. Introductory text built around model-building.
Menard, Scott (1995). Applied logistic regression analysis. Series: Quantitative Applications in the Social Sciences, No. 106. Thousand Oaks, CA: Sage Publications.
Miles, Jeremy and Mark Shevlin (2001). Applying regression and correlation. Thousand Oaks, CA: Sage Publications.
Schroeder, Larry D., David L. Sjoquist, and Paula E. Stephan (1986). Understanding regression analysis: An introductory guide. Series: Quantitative Applications in the Social Sciences, No. 57. Thousand Oaks, CA: Sage Publications.

Discriminant Analysis

Q-1 What is Discriminant Analysis? Ans: Discriminant function analysis, a.k.a. discriminant analysis or DA, is used to classify cases into the values of a categorical dependent, usually a dichotomy, using as predictors a number of interval or dummy independent variables. If discriminant function analysis is effective for a set of data, the classification table of correct and incorrect estimates will yield a high percentage correct. Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a cousin of multiple analysis of variance (MANOVA), sharing many of the same assumptions and tests. MDA is used to classify a categorical dependent which has more than two categories. MDA is sometimes also called discriminant factor analysis or canonical discriminant analysis. Discriminant function analysis is found in SPSS/SAS under Analyze, Classify, Discriminant. One gets DA or MDA from this same menu selection, depending on whether the specified grouping variable has two or more categories.

There are several purposes for DA and/or MDA:
• To classify cases into groups using a discriminant prediction equation.
• To test theory by observing whether cases are classified as predicted.
• To investigate differences between or among groups.
• To determine the most parsimonious way to distinguish among groups.
• To assess the relative importance of the independent variables in classifying the dependent variable, based on discriminant loadings.
• To infer the meaning of MDA dimensions which distinguish groups.

Discriminant analysis has two steps: (1) an F test (Wilks' lambda) is used to test if the discriminant model as a whole is significant, and (2) if the F test shows significance, then the individual independent variables are assessed to see which differ significantly in mean by group, and these are used to classify the dependent variable.

Discriminant analysis shares all the usual assumptions of correlation, requiring linear and homoscedastic relationships and untruncated interval or near-interval data. Like multiple regression, it also assumes proper model specification (inclusion of all important independents and exclusion of extraneous variables). DA also assumes the dependent variable is a true dichotomy, since data which are forced into dichotomous coding are truncated, attenuating correlation.
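A minimal two-group DA sketch is shown below using scikit-learn's LinearDiscriminantAnalysis rather than SPSS; its estimation details differ from the SPSS procedure described here, and the data are simulated, so treat it as an illustration of the workflow, not a reproduction of SPSS output:

```python
# Two-group discriminant analysis sketch on simulated data
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
# Two groups with shifted means on two interval predictors
g0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
g1 = rng.normal(loc=[2, 1], scale=1.0, size=(100, 2))
X = np.vstack([g0, g1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_, lda.intercept_)   # discriminant coefficients and constant
print(lda.score(X, y))             # proportion of cases correctly classified
scores = lda.transform(X)          # discriminant scores (one function here)
```

With two groups there is a single discriminant function, so `transform` returns one score per case, matching the "one function for 2-group DA" point made below.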

DA is an earlier alternative to logistic regression, which is now frequently used in place of DA as it usually involves fewer violations of assumptions (independent variables needn't be normally distributed, linearly related, or have equal within-group variances), is robust, handles categorical as well as continuous variables, and has coefficients which many find easier to interpret. Logistic regression is preferred when data are not normal in distribution or group sizes are very unequal. See also the separate topic on multiple discriminant function analysis (MDA) for dependents with more than two categories.

Few Definitions and Concepts

Discriminating variables: These are the independent variables, also called predictors.

Criterion variable: The criterion variable, also called the grouping variable in SPSS, is the dependent variable. It is the object of classification efforts.

Discriminant function: A discriminant function, also called a canonical root, is a latent variable which is created as a linear combination of discriminating (independent) variables, such that L = b1x1 + b2x2 + ... + bnxn + c, where the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant. This is analogous to multiple regression, but the b's are discriminant coefficients which maximize the distance between the means of the criterion (dependent) variable. Note that the foregoing assumes the discriminant function is estimated using ordinary least-squares, the traditional method, but there is also a version involving maximum likelihood estimation.

Number of discriminant functions: There is one discriminant function for 2-group discriminant analysis, but for higher-order DA the number of functions (each with its own cut-off value) is the lesser of (g - 1), where g is the number of categories in the grouping variable, or p, the number of discriminating (independent) variables. Each discriminant function is orthogonal to the others. A dimension is simply one of the discriminant functions when there are more than one, as in multiple discriminant analysis.

Eigenvalues: The eigenvalue, also called the characteristic root of each discriminant function, reflects the ratio of importance of the dimensions which classify cases of the dependent variable. There is one eigenvalue for each discriminant function. For two-group DA, there is one discriminant function and one eigenvalue, which accounts for 100% of the explained variance. If there is more than one discriminant function, the first will be the largest and most important, the second next most important in explanatory power, and so on. The eigenvalues assess relative importance because they reflect the percents of variance explained in the dependent variable, cumulating to 100% for all functions.

Relative percentage: The relative percentage of a discriminant function equals that function's eigenvalue divided by the sum of all eigenvalues of all discriminant functions in the model. Thus it is the percent of discriminating power for the model associated with a given discriminant function. Relative % is used to tell how many functions are important; one may find that only the first two or so eigenvalues are of importance. That is, the ratio of the eigenvalues indicates the relative discriminating power of the discriminant functions. If the ratio of two eigenvalues is 1.4, for instance, then the first discriminant function accounts for 40% more between-group variance in the dependent categories than does the second discriminant function. Eigenvalues are part of the default output in SPSS (Analyze, Classify, Discriminant).

Canonical correlation: The canonical correlation, R, is a measure of the association between the groups formed by the dependent and the given discriminant function. When R is zero, there is no relation between the groups and the function; when the canonical correlation is large, there is a high correlation between the discriminant functions and the groups. R is used to tell how much each function is useful in determining group differences; an R of 1.0 indicates that all of the variability in the discriminant scores can be accounted for by that dimension. Note that for two-group DA, the canonical correlation is equivalent to the Pearsonian correlation of the discriminant scores with the grouping variable. Note also that relative % and R do not have to be correlated.

Discriminant score: The discriminant score, also called the DA score, is the value resulting from applying a discriminant function formula to the data for a given case. The Z score is the discriminant score for standardized data. To get discriminant scores in SPSS, select Analyze, Classify, Discriminant, click the Save button, and check "Discriminant scores". One can also view the discriminant scores by clicking the Classify button and checking "Casewise results."

Discriminant coefficients: Unstandardized discriminant coefficients are used in the formula for making the classifications in DA, much as b coefficients are used in regression in making predictions. That is, discriminant coefficients are the regression-like b coefficients in the discriminant function, in the form L = b1x1 + b2x2 + ... + bnxn + c, where L is the latent variable formed by the discriminant function, the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant. The constant plus the sum of products of the unstandardized coefficients with the observations yields the discriminant scores. The discriminant function coefficients are partial coefficients, reflecting the unique contribution of each variable to the classification of the criterion variable. The standardized discriminant coefficients, like beta weights in regression, are used to assess the relative classifying importance of the independent variables.

Cutoff: If the discriminant score of the function is less than or equal to the cutoff, the case is classed as 0; if above, it is classed as 1. When group sizes are equal, the cutoff is the mean of the two centroids (for two-group DA). If the groups are unequal, the cutoff is the weighted mean.
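The score-and-cutoff logic above can be sketched numerically. The coefficients, constant, and cases below are entirely made up for illustration; the point is only the mechanics of L = b1x1 + b2x2 + c and the centroid-based cutoff:

```python
# Toy sketch: apply a discriminant function to cases, then classify
# against the cutoff (mean of the two group centroids, equal group sizes).
import numpy as np

b = np.array([0.8, -0.5])   # hypothetical unstandardized coefficients
c = 0.2                     # hypothetical constant

cases = np.array([[1.0, 2.0],    # group 0 cases
                  [0.5, 1.5],
                  [3.0, 0.0],    # group 1 cases
                  [2.5, -0.5]])
groups = np.array([0, 0, 1, 1])

L = cases @ b + c                        # discriminant scores
centroids = [L[groups == g].mean() for g in (0, 1)]
cutoff = float(np.mean(centroids))       # equal group sizes: simple mean
predicted = (L > cutoff).astype(int)     # above cutoff -> classed as 1
print(L, cutoff, predicted)
```

With unequal groups the cutoff would instead be the weighted mean of the centroids, as noted above.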

Q-2 What is Wilks' Lambda? Ans: Wilks' lambda is used to test the significance of the discriminant function as a whole. It is an F test, where a "Sig." p value < .05 means the model differentiates discriminant scores between the groups significantly better than chance (than a model with just the constant). In SPSS, the "Wilks' Lambda" table will have a column labeled "Test of Function(s)" and a row labeled "1 through n" (where n is the number of discriminant functions); the "Sig." level for this row is the significance level of the discriminant function as a whole. A significant lambda means one can reject the null hypothesis that the two groups have the same mean discriminant function scores. Wilks's lambda is sometimes called the U statistic. This use of Wilks' lambda appears in the "Wilks' lambda" table of the output section on "Summary of Canonical Discriminant Functions," and Wilks's lambda is part of the default output in SPSS (Analyze, Classify, Discriminant).

Wilks' lambda also can be used to test which independents contribute significantly to the discriminant function. Lambda varies from 0 to 1, with 0 meaning group means differ (thus the more the variable differentiates the groups) and 1 meaning all group means are the same. The F test of Wilks's lambda shows which variables' contributions are significant; the smaller the lambda for an independent variable, the more that variable contributes to the discriminant function. In SPSS, this use of Wilks' lambda is in the "Tests of equality of group means" table in DA output.

Standardized discriminant coefficients, also termed the standardized canonical discriminant function coefficients, are used to compare the relative importance of the independent variables, much as beta weights are used in regression. Note that importance is assessed relative to the model being analyzed: addition or deletion of variables in the model can change discriminant coefficients markedly. As with regression, since these are partial coefficients, only the unique explanation of each independent is being compared, not considering any shared explanation. Also, if there are more than two groups of the dependent, the standardized discriminant coefficients do not tell the researcher between which groups the variable is most or least discriminating; for this purpose, group centroids and factor structure are examined.

The ANOVA table for discriminant scores is another overall test of the DA model. It is obtained in SPSS by asking for Analyze, Compare Means, One-Way ANOVA, using the discriminant scores from DA (which SPSS will label Dis1_1 or similar) as the dependent.
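Wilks' lambda has a simple matrix form that makes the 0-to-1 behavior above concrete: it is the determinant of the within-groups scatter matrix divided by the determinant of the total scatter matrix. The sketch below computes it on simulated two-group data (this is the statistic only, not the F approximation SPSS reports):

```python
# Minimal Wilks' lambda sketch: lambda = det(W) / det(T), where W is
# within-group scatter and T is total scatter; near 0 means the group
# means differ, near 1 means the groups are indistinguishable.
import numpy as np

rng = np.random.default_rng(4)
g0 = rng.normal(loc=[0, 0], size=(60, 2))
g1 = rng.normal(loc=[2, 2], size=(60, 2))
X = np.vstack([g0, g1])

def sscp(A):
    """Sums-of-squares-and-cross-products matrix about the column means."""
    d = A - A.mean(axis=0)
    return d.T @ d

W = sscp(g0) + sscp(g1)   # within-groups scatter
T = sscp(X)               # total scatter
wilks = np.linalg.det(W) / np.linalg.det(T)
print(f"Wilks' lambda = {wilks:.3f}")
```

Because the simulated groups are well separated, the computed lambda falls well below 1, which is exactly the situation a significant "Sig." value flags in the SPSS table.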

Q-3 What is a Confusion or Classification Matrix? Ans: The classification table, also called a classification matrix, or a confusion, assignment, or prediction matrix or table, is used to assess the performance of DA. This is simply a table in which the rows are the observed categories of the dependent and the columns are the predicted categories of the dependent. When prediction is perfect, all cases will lie on the diagonal. The percentage of cases on the diagonal is the percentage of correct classifications. This percentage is called the hit ratio.

Expected hit ratio. Note that the hit ratio must be compared not to zero but to the percent that would have been correctly classified by chance alone. For two-group discriminant analysis with a 50-50 split in the dependent variable, the expected percent is 50%. For unequally split 2-way groups of different sizes, the expected percent is computed in the "Prior Probabilities for Groups" table in SPSS, by multiplying the prior probabilities times the group size, summing for all groups, and dividing the sum by N.

Adapted from the link: http://faculty.chass.ncsu.edu/garson/PA765/discrim2.htm

Suggested Readings:
Huberty, Carl J. (1994). Applied discriminant analysis. NY: Wiley-Interscience (Wiley Series in Probability and Statistics).
Klecka, William R. (1980). Discriminant analysis. Quantitative Applications in the Social Sciences Series, No. 19. Thousand Oaks, CA: Sage Publications.
Lachenbruch, P. A. (1975). Discriminant analysis. NY: Hafner.
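The classification-table mechanics described above can be sketched in a few lines. The observed and predicted labels below are invented toy data for a 50-50 split:

```python
# Classification (confusion) table and hit ratio for a two-group,
# equally split example; chance expectation here is 0.50.
import numpy as np

observed  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
predicted = np.array([0, 0, 0, 1, 1, 1, 0, 1])

table = np.zeros((2, 2), dtype=int)   # rows: observed, cols: predicted
for o, p in zip(observed, predicted):
    table[o, p] += 1
print(table)

hit_ratio = np.trace(table) / table.sum()   # share of cases on the diagonal
print(f"hit ratio = {hit_ratio:.2f} vs. 0.50 expected by chance")
```

For unequal group sizes, the comparison value would instead be the prior-probability calculation described above, not 0.50.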

Cluster Analysis

Q-1 What is Cluster Analysis? Ans: Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation. Other techniques, such as latent class analysis and Q-mode factor analysis, also perform clustering and are discussed separately.

Key Concepts and Terms

Cluster formation is the selection of the procedure for determining how clusters are created, and how the calculations are done. SPSS offers three general approaches to cluster analysis. Hierarchical clustering allows users to select a definition of distance, then select a linking method of forming clusters, then determine how many clusters best suit the data. In k-means clustering the researcher specifies the number of clusters in advance, then calculates how to assign cases to the K clusters. K-means clustering is much less computer-intensive and is therefore sometimes preferred when datasets are very large (ex., > 1,000). Finally, two-step clustering creates pre-clusters, then clusters the pre-clusters.

Hierarchical cluster analysis, discussed below, can use either agglomerative or divisive clustering strategies. In agglomerative hierarchical clustering every case is initially considered a cluster, then the two cases with the lowest distance (or highest similarity) are combined into a cluster. The case with the lowest distance to either of the first two is considered next. If that third case is closer to a fourth case than it is to either of the first two, the third and fourth cases become the second two-case cluster; if not, the third case is added to the first cluster. The process is repeated, adding cases to existing clusters, creating new clusters, or combining clusters, to get to the desired final number of clusters. There is also divisive clustering, which works in the opposite direction, starting with all cases in one large cluster.

Similarity and Distance
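The agglomerative procedure just described can be run with SciPy rather than SPSS; this is a minimal sketch on simulated data, and the choice of three clusters is invented for the example:

```python
# Agglomerative hierarchical clustering sketch with SciPy
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# Three loose clouds of points in the plane
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

Z = linkage(X, method='average', metric='euclidean')  # UPGMA-style linkage
labels = fcluster(Z, t=3, criterion='maxclust')       # cut the tree at 3 clusters
print(np.unique(labels))
```

The `linkage` matrix Z records every merge in order, playing the role of the agglomeration schedule discussed later in this section, and `fcluster` stops the merging at the requested number of clusters.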

Distance. Distance measures how far apart two observations are; cases which are alike share a low distance. Euclidean distance is the most common distance measure. A given pair of cases is plotted on two variables, which form the x and y axes; the Euclidean distance is the square root of the sum of the square of the x difference plus the square of the y difference. (Recall high school geometry: this is the formula for the length of the third side of a right triangle.) It is common to use the square of Euclidean distance, as squaring removes the sign. When two or more variables are used to define distance, the one with the larger magnitude will dominate, so to avoid this it is common to first standardize all variables.

Similarity. Similarity measures how alike two cases are; cases which are alike share a high similarity. SPSS supports a large number of similarity measures for interval data (Pearson correlation or cosine) and for binary data (Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1 through 5, Kulczynski 1 and 2, Yule's Y, Yule's Q, Ochiai, phi 4-point correlation, Lambda, Anderberg's D, Hamann, or dispersion).

Absolute values. Since for Pearson correlation high negative as well as high positive values indicate similarity, the researcher normally selects absolute values. This can be done by checking the absolute value checkbox in the Transform Measures area of the Methods subdialog (invoked by pressing the Methods button) of the main Cluster dialog.

The first step in cluster analysis is establishment of the similarity or distance matrix. This matrix is a table in which both the rows and columns are the units of analysis and the cell entries are a measure of similarity or distance for any pair of cases. The proximity matrix table in the output shows the actual distances or similarities computed for any pair of cases. In SPSS, proximity matrices are selected under Analyze, Classify, Hierarchical clustering; click the Statistics button and check proximity matrix. Similarity/distance measures are selected in the Measure area of the Method subdialog, obtained by pressing the Method button in the Classify dialog. There are three measure pulldown menus, for interval, binary, and count data respectively. For interval data, SPSS supports these distance measures: Euclidean distance, squared Euclidean distance, block, Minkowski, Chebychev, or customized. For binary data, it supports Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. For count data, it supports chi-square or phi-square.

Method. There are a variety of different measures of inter-observation distances and inter-cluster distances to use as criteria when merging nearest clusters into broader groups or when considering the relation of a point to a cluster. Under the Method button in the SPSS Classify dialog, the pull-down Method selection determines how cases or clusters are combined at each step. Different methods will result in different cluster patterns.
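Two of the points above, that squaring removes the sign and that an unstandardized large-magnitude variable dominates the distance, can be checked directly. The values below (an age and an income column) are invented for illustration:

```python
# Euclidean vs squared Euclidean distance, and the effect of standardizing
import numpy as np
from scipy.spatial.distance import euclidean, sqeuclidean

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(euclidean(a, b))     # sqrt((4-1)^2 + (6-2)^2)
print(sqeuclidean(a, b))   # the squared version; sign information is gone

# Income (large scale) swamps age (small scale) unless standardized
data = np.array([[25, 30_000.0],
                 [55, 31_000.0],
                 [26, 90_000.0]])
z = (data - data.mean(axis=0)) / data.std(axis=0)
print(euclidean(data[0], data[1]), euclidean(data[0], data[2]))  # raw
print(euclidean(z[0], z[1]), euclidean(z[0], z[2]))              # standardized
```

On the raw data the income differences dominate completely; after z-scoring, the 30-year age gap and the large income gap contribute on comparable scales.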

SPSS offers these method choices:

Nearest neighbor. In this single linkage method, the distance between two clusters is the distance between their closest neighboring points.

Furthest neighbor. In this complete linkage method, the distance between two clusters is the distance between their two furthest member points.

UPGMA (unweighted pair-group method using averages). The distance between two clusters is the average distance between all inter-cluster pairs. UPGMA is generally preferred over the nearest or furthest neighbor methods since it is based on information about all inter-cluster pairs, not just the nearest or furthest ones. It is the default method in SPSS, which labels it "between-groups linkage."

Average linkage within groups. This is the mean distance between all possible inter- or intra-cluster pairs. The average distance between all pairs in the resulting cluster is made to be as small as possible, so this method is appropriate when the research purpose is homogeneity within clusters. SPSS labels this "within-groups linkage."

Ward's method. Ward's method calculates the sum of squared Euclidean distances from each case in a cluster to the mean of all variables. The cluster to be merged is the one which will increase the sum the least. This is an ANOVA-type approach and is preferred by some researchers for this reason.

Centroid method. The cluster to be merged is the one with the smallest sum of Euclidean distances between cluster means for all variables.

Median method. This method also uses Euclidean distance as the proximity measure. Clusters are weighted equally regardless of group size when computing the centroids of the two clusters being combined.

Correlation of items can be used as a similarity measure. One transposes the normal data table, in which columns are variables and rows are cases. By using columns as cases and rows as variables instead, the correlation is between cases, and these correlations may constitute the cells of the similarity matrix.

Binary matching is another type of similarity measure, where 1 indicates a match and 0 indicates no match between any pair of cases. There are multiple matched attributes, and the similarity score is the number of matches divided by the number of attributes being matched. Note that it is usual in binary matching to have several attributes, because there is a risk that when the number of attributes is small they may be orthogonal to (uncorrelated with) one another, and clustering will be indeterminate.

Means and variances. Summary measures assess how the clusters differ from one another. A table of means and variances of the clusters with respect to the original variables shows how the clusters differ on the original variables. SPSS does not make this available in the Cluster dialog, but one can click the Save button, which will save the cluster number for each case (or numbers, if multiple solutions are requested). Then, in Analyze, Compare Means, the researcher can use the cluster number as the grouping variable to compare differences of means on any other continuous variable in the dataset.

Agglomeration schedule. The agglomeration schedule is a choice under the Statistics button for Hierarchical Cluster in the SPSS Cluster dialog. In this table the rows are stages of clustering, numbered from 1 to (n - 1), where cases are initially numbered 1 to n. There are two "Cluster Combined" columns, giving the case or cluster numbers combined at each stage. In agglomerative clustering using a distance measure like Euclidean distance, stage 1 combines the two cases which have the lowest proximity (distance) score. For instance, at Stage 1 cases 3 and 18 might be combined, resulting in a cluster labeled 3; later, cluster 3 and case 2 might be combined, resulting in a cluster labeled 2. The cluster number goes by the lower of the cases or clusters combined. The (n - 1)th stage includes all the cases in one cluster. From this table the researcher can see how agglomeration proceeded.

The researcher looks at the "Coefficients" column of the agglomeration schedule and notes when the proximity coefficient jumps up and is not a small increment from the one before (or when the coefficient reaches some theoretically important level). Note that for distance measures low is good, meaning the cases are alike, while for similarity measures high coefficients mean cases are alike. After the stopping stage is determined in this manner, the researcher can work backward to determine how many clusters there are and which cases belong to which clusters (but it is easier just to get this information from the cluster membership table). Note, though, that SPSS will not stop on this basis but instead will compute the range of solutions (ex., 2 to 4 clusters) requested by the researcher in the Cluster Membership group of the Statistics button in the Hierarchical Clustering dialog.

Cluster membership table. Linkage tables show the relation of the cases to the clusters. The cluster membership table shows cases as rows, where the columns are alternative numbers of clusters in the solution (as specified in the "Range of Solutions" option in the Cluster Membership group in SPSS, under the Statistics button). Cell entries show the number of the cluster to which the case belongs. Reading across the columns, one can see which cases are in which groups, depending on the number of clusters in the solution.

Icicle plots. Linkage plots show similar information in graphic form. When there are relatively few cases, icicle plots or dendrograms provide the same linkage information in an easier format. Icicle plots are usually horizontal, with cases as rows; if there are few cases, vertical icicle plots may be plotted, with cases as columns. Row 1 (vertical icicle plots) or column 1 (horizontal icicle plots) will show all cases in a single cluster, the next-to-last column/row will show the solution with two cases combined into one cluster, and the last/bottom column/row will show all the cases as separate one-case clusters. Subsequent columns/rows show further clustering steps. Reading from the last column right to left (horizontal icicle plots) or the last row bottom to top (vertical icicle plots), the researcher can see which cases and clusters are combined at each stage.
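The "look for a jump in the Coefficients column" heuristic above has a direct analogue in SciPy, whose linkage matrix serves as the agglomeration schedule (third column = proximity coefficient). This sketch uses simulated two-cloud data, so the jump heuristic should land on two clusters; it is a rough rule of thumb, not a formal test:

```python
# Reading an agglomeration schedule from SciPy's linkage matrix:
# rows are merge stages, column 2 holds the proximity coefficient,
# and the biggest jump between successive coefficients suggests a stop.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=0.0, scale=0.2, size=(10, 2)),
               rng.normal(loc=5.0, scale=0.2, size=(10, 2))])

Z = linkage(X, method='average')
coefs = Z[:, 2]                    # the "Coefficients" column
jumps = np.diff(coefs)
stop = int(np.argmax(jumps)) + 1   # stage after which the coefficient leaps
n_clusters = len(X) - stop         # clusters remaining at the stopping stage
print(n_clusters)
```

As the text notes for distance measures, low coefficients mean the merged cases are alike; the merge that joins the two clouds is the one whose coefficient leaps, so stopping just before it leaves the two-cluster solution.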

What is Hierarchical Cluster Analysis ? Hierarchical clustering is appropriate for smaller samples (typically < 250). The merging of clusters is visualized using a tree format, and the clusters are nested rather than mutually exclusive: larger clusters created at later stages may contain smaller clusters created at earlier stages of agglomeration.

Forward clustering, also called agglomerative clustering, forms small clusters by using a high similarity index cut-off (ex., .9). Then this cut-off is relaxed to establish broader and broader clusters in stages, until all cases are in a single cluster at some low similarity index cut-off. Backward clustering, also called divisive clustering, is the same idea, but starting with a low cut-off and working toward a high cut-off. Forward and backward methods need not generate the same results.

The optimum number of clusters depends on the research purpose: identifying "typical" types may call for few clusters, while identifying "exceptional" types may call for many clusters. Hierarchical clustering generates all possible clusters of sizes 1 through K, but is used only for relatively small samples. One may wish to use the hierarchical cluster procedure on a sample of cases (ex., 200) to inspect results for different numbers of clusters. After using hierarchical clustering to determine the desired number of clusters, the researcher may then wish to analyze the entire dataset with k-means clustering (aka the Quick Cluster procedure: Analyze, Classify, K-Means Cluster Analysis), specifying that number of clusters.

To accomplish hierarchical clustering, the researcher must specify how similarity or distance is defined, how clusters are aggregated (or divided), and how many clusters are needed. In SPSS, select Analyze, Classify, Hierarchical Cluster; then click the Plots button and check the Dendrogram checkbox.

Dendrograms, also called tree diagrams, show the relative size of the proximity coefficients at which cases were combined. This is a visual way of representing information on the agglomeration schedule (icicle plots convey similar linkage information, but without the proximity coefficient information). Trees are usually depicted horizontally, not vertically, with each row representing a case on the Y axis, while the X axis is a rescaled version of the proximity coefficients. Cases with low distance/high similarity are close together, with a line linking them a short distance from the left of the dendrogram, indicating that they are agglomerated into a cluster at a low distance coefficient. When, as is the usual case, the linking line is toward the right of the dendrogram, the linkage occurs at a high distance coefficient, indicating the cases/clusters were agglomerated even though much less alike. The bigger the distance coefficient, or the smaller the similarity coefficient, the more the clustering involved combining unlike entities, which may be undesirable. If a similarity measure is used rather than a distance measure, the rescaling of the X axis still produces a diagram with linkages involving high alikeness to the left and low alikeness to the right.
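The stopping heuristic described above, i.e. choosing the solution just before the proximity coefficient jumps, can be sketched as a small helper function. This is an illustrative pure-Python aid, not an SPSS feature; it assumes a distance-based schedule, so the coefficients increase across the n - 1 stages.

```python
# Sketch of the "stop before the coefficient jumps" heuristic.
# Assumption: `coefficients` is the agglomeration schedule's coefficient
# column for n cases clustered with a distance measure (ascending values).

def clusters_by_largest_jump(coefficients):
    """Suggest a cluster count from the schedule's coefficient column."""
    n = len(coefficients) + 1            # n cases produce n - 1 stages
    jumps = [coefficients[i + 1] - coefficients[i]
             for i in range(len(coefficients) - 1)]
    stage_before_jump = jumps.index(max(jumps)) + 1  # 1-based stage number
    # Stopping after that stage leaves n - stage_before_jump clusters.
    return n - stage_before_jump

# Six cases: coefficients rise gently, then leap at stage 4.
print(clusters_by_largest_jump([0.2, 0.3, 0.4, 3.1, 3.4]))  # → 3
```

Remember that SPSS will not stop on this basis itself; the researcher applies such a rule by eye and then requests that number of clusters (or a range of solutions).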

In the Hierarchical Cluster dialog, under the Statistics button, one may also select Proximity Matrix and, in the Cluster Membership group, select Range of Solutions. To cluster variables rather than cases, the researcher may select Variables rather than the usual Cases in the Cluster group. Note that SPSS calls hierarchical clustering the "Cluster procedure," though it is found under Analyze, Classify.

What is K-means Cluster Analysis ? K-means cluster analysis, in spite of its title, does not require prior computation of a proximity matrix of the distance/similarity of every case with every other case. For this reason, large datasets are possible with K-means clustering. K-means cluster analysis uses Euclidean distance. Unlike hierarchical clustering, the researcher must specify the desired number of clusters (typically 3 to 6) in advance, and there is no option for a "Range of solutions"; instead, you must re-run K-means clustering, asking for a different number of clusters each time.

Initial cluster centers are chosen in a first pass of the data; then each additional iteration groups observations based on nearest Euclidean distance to the mean of the cluster. Cluster centers are the average values, on all clustering variables, of each cluster's members, and they change at each pass. The process continues until cluster means do not shift more than a given cut-off value or the iteration limit is reached. When the change drops below the specified cutoff, the iterative process stops and cases are assigned to clusters according to which cluster center they are nearest.

The default method is "Iterate and classify," under which an iterative process is used to update the cluster centers, then cases are classified based on the updated centers. SPSS also supports a "Classify only" method, under which cases are immediately classified based on the initial cluster centers, which are not updated. Normally in K-means clustering, a given case may be assigned to a cluster, then reassigned to a different cluster as the algorithm unfolds; in agglomerative K-means clustering, however, the solution is constrained to force a given case to remain in its initial cluster.

In SPSS: select Analyze, Classify, K-Means Cluster Analysis; enter the variables in the Variables: area; optionally, enter a variable in the "Label cases by:" area; enter the "Number of clusters:"; choose Method: Iterate and classify; Continue; OK.

The "Initial cluster centers" table gives the average value of each variable for each cluster, for the k well-spaced cases which SPSS selects for initialization purposes when no initial file is supplied. The "Iteration history" table shows the change in cluster centers when the usual iterative approach is taken. The "Final cluster centers" table in SPSS output gives the same thing for the last iteration step.
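The iterate-and-classify procedure just described can be sketched in pure Python. This is a simplified illustration, not SPSS's implementation: it works on one-dimensional data and seeds the centers with the first k cases, whereas SPSS chooses k well-spaced cases.

```python
# Sketch of K-means "iterate and classify" on 1-D data.
# Assumptions: first k cases as initial centers (SPSS uses k well-spaced
# cases); the default of 10 iterations mirrors the SPSS default.

def kmeans(data, k, max_iter=10, tol=0.02):
    centers = data[:k]                       # initial cluster centers
    for _ in range(max_iter):
        # Classify: assign each case to the nearest center (Euclidean).
        members = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            members[nearest].append(x)
        # Iterate: recompute each center as the mean of its members.
        new_centers = [sum(m) / len(m) if m else centers[j]
                       for j, m in enumerate(members)]
        # Stop once no center moves more than the convergence criterion.
        moved = max(abs(c - n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    # Final classification against the final cluster centers.
    labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in data]
    return centers, labels

centers, labels = kmeans([1.0, 1.2, 0.8, 5.0, 5.4, 5.2], k=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

The "Classify only" method corresponds to skipping the loop entirely and labeling the cases against the initial centers.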

Iterate button: Optionally, you may press the Iterate button and set the maximum number of iterations and the convergence criterion. The default maximum number of iterations in SPSS is 10. By default, iterations terminate if the largest change in any cluster center is less than 2% of the minimum distance between initial centers (or if the maximum number of iterations has been reached). To override this default, enter a positive number less than or equal to 1 in the convergence box. There is also a "Use running means" checkbox which, if checked, will cause the cluster centers to be updated after each case is classified, rather than the default, which is to update only after the entire set of cases is classified.

Save button: Optionally, you may press the Save button to save the final cluster number of each case as an added column in your dataset (labeled QCL_1), and/or you may save the Euclidean distance between each case and its cluster center (labeled QCL_2) by checking "Distance from cluster center."

Options button: Optionally, you may press the Options button to select statistics or missing-values options. There are three statistics options: "Initial cluster centers" (gives the initial variable means for each cluster); "ANOVA table" (ANOVA F-tests for each variable, which also gives the Euclidean distance between final cluster centers; non-significant variables might be dropped as not contributing to the differentiation of clusters, but as the F tests are only descriptive, the resulting probabilities are for exploratory purposes only); and "Cluster information for each case" (gives each case's final cluster assignment and the Euclidean distance between the case and the cluster center).

Getting different clusters. Sometimes the researcher wishes to experiment to get different clusters, as when the "Number of cases in each cluster" table shows highly imbalanced clusters and/or clusters with very few members. Different results may occur by setting different initial cluster centers from file (see above), by changing the number of clusters requested, or even by presenting the data file in a different case order.

What is Two-Step Cluster Analysis ? Two-step cluster analysis groups cases into pre-clusters which are treated as single cases; standard hierarchical clustering is then applied to the pre-clusters in the second step. It is recommended for very large datasets, since it requires neither a proximity table, as in hierarchical classification, nor an iterative process, as in K-means clustering, but rather is a one-pass-through-the-dataset method. This is also the method used when one or more of the variables are categorical (not interval or dichotomous).

Cluster feature tree. The pre-clustering stage employs a CF tree, with nodes leading to leaf nodes. Cases start at the root node and are channeled toward the nodes, and eventually the leaf nodes, which match them most closely. If there is no adequate match, the case is used to start its own leaf node. It can happen that the CF tree fills up and cannot accept new leaf entries in a node, in which case the node is split, using the most-distant pair in the node as seeds. If this recursive process grows the CF tree beyond maximum size, the threshold distance is increased and the tree is rebuilt, allowing new cases to be input. The process continues

until all the data are read.

Number of clusters. By default, SPSS determines the number of clusters using the change in BIC (the Schwarz Bayesian Criterion): when the BIC change is small, it stops and selects as many clusters as have been created thus far. It is also possible to have this done based on changes in AIC (the Akaike Information Criterion), to ask for a range of solutions (such as 3-5 clusters), or simply to tell SPSS how many clusters are wanted. The "Autoclustering statistics" table in SPSS output gives, for example, BIC and BIC change for all solutions.

Proximity. If the variables are all continuous, Euclidean distance is used, with cases categorized under the cluster which is associated with the smallest Euclidean distance. When one or more of the variables are categorical, log-likelihood is the distance measure used, with cases categorized under the cluster which is associated with the largest log-likelihood.

SPSS: Choose Analyze, Classify, Two-Step Cluster; select your categorical and continuous variables. Click Output and select the statistics wanted (descriptive statistics, cluster frequencies, AIC or BIC); if desired, click Plots and select the plots wanted. Click the Advanced button in the Options button dialog to set threshold distances, maximum levels, and maximum branches per leaf node manually. Continue.

Adapted from http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm and www.cs.uu.nl/docs/vakken/arm/SPSS/spss8.pdf

Suggested Readings:
Leonard Kaufman, Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, 2004.
Anil K. Jain, Richard C. Dubes, Algorithms for Clustering Data, 2005.
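The pre-clustering idea behind the two-step method can be caricatured with a single-pass "leader" sketch in pure Python. This is only loosely analogous to SPSS's CF tree (there is no tree, no node splitting, and no threshold rebuilding): it simply shows how one pass over the data can reduce many cases to a few pre-cluster centers, which the second, hierarchical step would then cluster.

```python
# Single-pass "leader" sketch of pre-clustering (a stand-in for the CF
# tree; the real algorithm maintains and rebuilds a tree structure).
# Assumptions: 1-D continuous data and a fixed threshold distance.

def precluster(data, threshold):
    """One pass: join the nearest pre-cluster within threshold, else start one."""
    centers, counts = [], []
    for x in data:
        if centers:
            j = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            if abs(x - centers[j]) <= threshold:
                # Update the pre-cluster's running mean and size.
                centers[j] = (centers[j] * counts[j] + x) / (counts[j] + 1)
                counts[j] += 1
                continue
        centers.append(x)   # no adequate match: start a new pre-cluster
        counts.append(1)
    return centers, counts

# Six cases collapse to three pre-clusters of sizes 3, 2, and 1.
centers, counts = precluster([1.0, 1.2, 5.0, 0.9, 5.3, 9.8], threshold=1.0)
print(counts)  # → [3, 2, 1]
```

A pre-cluster of very few members here plays the same diagnostic role as a tiny cluster in the "Number of cases in each cluster" table: it flags a case that matched nothing else.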