BRM Multivariate Notes | Cluster Analysis | Multicollinearity

Multivariate Analysis Business Research Methods


Multiple Regression
Q-1 What is Multiple Regression? Ans: Multiple regression is used to account for (predict) the variance in an interval dependent variable, based on linear combinations of interval, dichotomous, or dummy independent variables. Multiple regression can establish that a set of independent variables explains a proportion of the variance in a dependent variable at a significant level (through a significance test of R2), and can establish the relative predictive importance of the independent variables (by comparing beta weights). Power terms can be added as independent variables to explore curvilinear effects, and cross-product terms can be added to explore interaction effects. One can test the significance of the difference of two R2's to determine if adding an independent variable to the model helps significantly. Using hierarchical regression, one can see how much variance in the dependent can be explained by one or a set of new independent variables, over and above that explained by an earlier set. Of course, the estimates (b coefficients and constant) can be used to construct a prediction equation and generate predicted scores on a variable for further analysis. The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression coefficients, each representing the amount the dependent variable y changes when the corresponding independent variable changes 1 unit. The c is the constant, where the regression line intercepts the y axis, representing the amount the dependent y will be when all the independent variables are 0. The standardized versions of the b coefficients are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the independent variables. Associated with multiple regression is R2, multiple correlation, which is the percent of variance in the dependent variable explained collectively by all of the independent variables.
Multiple regression shares all the assumptions of correlation: linearity of relationships, the same level of relationship throughout the range of the independent variable ("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose range is not truncated. In addition, it is important that the model being tested is correctly specified. The exclusion of important causal variables or the inclusion of extraneous variables can change markedly the beta weights and hence the interpretation of the importance of the independent variables.
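The prediction equation above can be illustrated with a small numpy sketch. The data, coefficient values, and variable names here are invented for illustration; ordinary least squares recovers the b's and the constant c:

```python
import numpy as np

# Hypothetical data: y depends on two independents x1 and x2.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + 5.0 + rng.normal(scale=0.1, size=n)

# Design matrix with a column of ones for the constant c.
X = np.column_stack([x1, x2, np.ones(n)])
b1, b2, c = np.linalg.lstsq(X, y, rcond=None)[0]
# b1 and b2 estimate the change in y per unit change in each independent;
# c estimates y when all independents are 0.
```

With this construction the estimates land close to the true values 2.0, -1.0, and 5.0, which is exactly the "b coefficients and constant" reading described above.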


Q-2 What is R-square? Ans: R2, also called multiple correlation or the coefficient of multiple determination, is the percent of the variance in the dependent explained uniquely or jointly by the independents. R-squared can also be interpreted as the proportionate reduction in error in estimating the dependent when knowing the independents: R2 reflects the number of errors made when using the regression model to guess the value of the dependent, in ratio to the total errors made when using only the dependent's mean as the basis for estimating all cases. Mathematically, R2 = 1 - (SSE/SST), where SSE = error sum of squares = SUM((Yi - EstYi)squared), with Yi the actual value of Y for the ith case and EstYi the regression prediction for the ith case; and where SST = total sum of squares = SUM((Yi - MeanY)squared). The "residual sum of squares" in SPSS/SAS output is SSE and reflects regression error. Put another way, regression sum of squares / total sum of squares = R-square, where the regression sum of squares = total sum of squares - residual sum of squares. Thus R-square is 1 minus regression error as a percent of total error, and will be 0 when regression error is as large as it would be if you simply guessed the mean for all cases of Y.

Q-3 What is Adjusted R-square and how is it calculated? Ans: Adjusted R-square is an adjustment for the fact that when one has a large number of independents, it is possible that R2 will become artificially high simply because some independents' chance variations "explain" small parts of the variance of the dependent. At the extreme, when there are as many independents as cases in the sample, R2 will always be 1.0. The adjustment to the formula lowers R2 as p, the number of independents, increases. Some authors conceive of adjusted R2 as the percent of variance "explained in a replication, after subtracting out the contribution of chance." Adjusted R2 = 1 - ((1 - R2)(N - 1) / (N - k - 1)), where N is sample size and k is the number of terms in the model not counting the constant (i.e., the number of independents). For the case of a few independents, R2 and adjusted R2 will be close; when there are a great many independents, adjusted R2 may be noticeably lower. The greater the number of independents, the more the researcher is expected to report the adjusted coefficient. Always use adjusted R2 when comparing models with different numbers of independents.

Q-4 What is Multicollinearity and how is it measured? Ans: Multicollinearity is the intercorrelation of independent variables. R2's near 1 when regressing one independent on the others violate the assumption of no perfect collinearity, while high R2's increase the standard error of the beta coefficients and make assessment of the unique role of each independent difficult or impossible.
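The R2 and adjusted R2 formulas from Q-2 and Q-3 translate directly into code. This is a minimal sketch with invented numbers:

```python
import numpy as np

def r_squared(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)        # error (residual) sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1.0 - sse / sst

def adjusted_r_squared(r2, n, k):
    # k = number of terms in the model not counting the constant
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.0, 4.1, 4.9])   # hypothetical predictions
r2 = r_squared(y, y_hat)                      # about 0.996
adj = adjusted_r_squared(r2, n=5, k=2)        # about 0.992, always <= r2
```

As the notes say, the adjustment matters most when k is large relative to n; here, with only five cases and two predictors, the two values stay close.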

The more the multicollinearity, the greater the standard errors of the regression coefficients, with the effect that small data changes or arithmetic errors may translate into very large changes or errors in the regression analysis. Even when multicollinearity is present, note that estimates of the importance of other variables in the equation (variables which are not collinear with others) are not affected.

Types of multicollinearity. The type of multicollinearity matters a great deal: some types are necessary to the research purpose. Inspection of the correlation matrix reveals only bivariate multicollinearity, with the typical criterion being bivariate correlations > .90. While simple correlations tell something about multicollinearity, the preferred method of assessing it is to regress each independent on all the other independent variables in the equation.

Tolerance. Tolerance is 1 - R2 for the regression of that independent variable on all the other independents, ignoring the dependent. There will be as many tolerance coefficients as there are independents. The higher the intercorrelation of the independents, the more the tolerance will approach zero. As a rule of thumb, if tolerance is less than .20, a problem with multicollinearity is indicated. Tolerance is part of the denominator in the formula for calculating the confidence limits on the b (partial regression) coefficient; when tolerance is close to 0 there is high multicollinearity of that variable with other independents, and the b and beta coefficients will be unstable.

Variance-inflation factor, VIF. VIF is simply the reciprocal of tolerance, so when VIF is high there is high multicollinearity and instability of the b and beta coefficients. VIF and tolerance are found in the SPSS and SAS output section on collinearity statistics. To assess multivariate (not merely bivariate) multicollinearity, one uses tolerance or VIF, which build in the regressing of each independent on all the others.

Condition indices and variance proportions. Computationally, a "singular value" is the square root of an eigenvalue, and "condition indices" are the ratios of the largest singular value to each other singular value. Condition indices are used to flag excessive collinearity in the data: a condition index over 30 suggests serious collinearity problems, and an index over 15 indicates possible collinearity problems. If a factor (component) has a high condition index, one looks in the variance proportions column. Criteria for a "sizable proportion" vary among researchers, but the most common criterion is two or more variables having a variance partition of .50 or higher on a factor with a high condition index. If this is the case, these variables have high linear dependence and multicollinearity is a problem. Note that it is possible for the condition-index rule of thumb (no index over 30) to indicate multicollinearity even when the rules of thumb for tolerance > .20 or VIF < 4 suggest no multicollinearity. In SPSS or SAS, select Analyze, Regression, Linear, and check Collinearity diagnostics to get condition indices.
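Tolerance and VIF as defined above can be computed by hand by regressing each independent on the others. This numpy sketch fabricates data in which x3 is nearly a linear combination of x1 and x2, so all three columns are flagged:

```python
import numpy as np

def tolerance_and_vif(X):
    """For each column, regress it on all other columns (plus a constant):
    tolerance = 1 - R2 of that regression, and VIF = 1 / tolerance."""
    n, p = X.shape
    tol = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        y_hat = others @ np.linalg.lstsq(others, y, rcond=None)[0]
        r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
        tol[j] = 1.0 - r2
    return tol, 1.0 / tol

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.05, size=500)  # near-linear dependence
tol, vif = tolerance_and_vif(np.column_stack([x1, x2, x3]))
# Tolerances fall well below the .20 rule of thumb and VIFs exceed 4,
# flagging the collinearity that was built into the data.
```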

Q-5 What is homoscedasticity? Ans: Homoscedasticity means that the variance of residual error is constant for all values of the independent(s). The researcher should test to assure that the residuals are dispersed randomly throughout the range of the estimated dependent. Lack of homoscedasticity may mean (1) there is an interaction effect between a measured independent variable and an unmeasured independent variable not in the model, or (2) that some independent variables are skewed while others are not. Nonconstant error variance can be observed by requesting a simple residual plot (a plot of residuals on the Y axis against predicted values on the X axis). A homoscedastic model will display a cloud of dots, whereas lack of homoscedasticity will be characterized by a pattern such as a funnel shape, indicating greater error as the dependent increases. Nonconstant error variance can indicate the need to respecify the model to include omitted independent variables. When the homoscedasticity assumption is violated, "conventionally computed confidence intervals and conventional t-tests for OLS estimators can no longer be justified"; however, moderate violations of homoscedasticity have only minor impact on regression estimates. One method of dealing with heteroscedasticity is to select the weighted least squares regression option; this causes cases with smaller residuals to be weighted more in calculating the b coefficients. Square root, log, and reciprocal transformations of the dependent may also reduce or eliminate lack of homoscedasticity. If not, separate models may be required for the different ranges.

Suggested Readings and Links:
http://www2.chass.ncsu.edu/garson/pa765/regress.htm
Kahane, Leo H. (2001). Regression basics. Thousand Oaks, CA: Sage Publications. Introductory text built around model-building.
Menard, Scott (1995). Applied logistic regression analysis. Series: Quantitative Applications in the Social Sciences, No. 106. Thousand Oaks, CA: Sage Publications.
Miles, Jeremy and Mark Shevlin (2001). Applying regression and correlation. Thousand Oaks, CA: Sage Publications.
Schroeder, Larry D., David L. Sjoquist, and Paula E. Stephan (1986). Understanding regression analysis: An introductory guide. Series: Quantitative Applications in the Social Sciences, No. 57. Thousand Oaks, CA: Sage Publications.
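As a footnote to Q-5, the funnel pattern and the weighted least squares remedy can be illustrated with simulated data. Everything here (the variable names, the error variance growing with x, the 1/x-squared weighting) is an assumption of the example, not an SPSS default:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(1.0, 10.0, size=n)
y = 3.0 * x + rng.normal(scale=x)   # error spread grows with x: a funnel

X = np.column_stack([x, np.ones(n)])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Crude funnel check: under heteroscedasticity, |residuals| grow with the
# predictor, so their correlation with x is clearly positive.
funnel = np.corrcoef(x, np.abs(resid))[0, 1]

# Weighted least squares: each row is scaled by 1/x, the square root of the
# weight 1/x**2, so cases with smaller error variance count more.
w = 1.0 / x
b_wls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
```

A real analysis would inspect the residual plot itself; the correlation of x with absolute residuals is just a quick numeric stand-in for eyeballing the funnel.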

Discriminant Analysis

Q-1 What is Discriminant Analysis? Ans: Discriminant function analysis, a.k.a. discriminant analysis or DA, is used to classify cases into the values of a categorical dependent, usually a dichotomy, using as predictors a number of interval or dummy independent variables. If discriminant function analysis is effective for a set of data, the classification table of correct and incorrect estimates will yield a high percentage correct. Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a cousin of multiple analysis of variance (MANOVA), sharing many of the same assumptions and tests. MDA is used to classify a categorical dependent which has more than two categories. MDA is sometimes also called discriminant factor analysis or canonical discriminant analysis. Discriminant function analysis is found in SPSS/SAS under Analyze, Classify, Discriminant; one gets DA or MDA from this same menu selection, depending on whether the specified grouping variable has two or more categories.

There are several purposes for DA and/or MDA:
• To classify cases into groups using a discriminant prediction equation.
• To test theory by observing whether cases are classified as predicted.
• To investigate differences between or among groups.
• To determine the most parsimonious way to distinguish among groups.
• To assess the relative importance of the independent variables in classifying the dependent variable.
• To infer the meaning of MDA dimensions which distinguish groups, based on discriminant loadings.

Discriminant analysis has two steps: (1) an F test (Wilks' lambda) is used to test if the discriminant model as a whole is significant, and (2) if the F test shows significance, then the individual independent variables are assessed to see which differ significantly in mean by group, and these are used to classify the dependent variable.

Discriminant analysis shares all the usual assumptions of correlation, requiring linear and homoscedastic relationships and untruncated interval or near-interval data. Like multiple regression, it also assumes proper model specification (inclusion of all important independents and exclusion of extraneous variables). DA also assumes the dependent variable is a true dichotomy, since data which are forced into dichotomous coding are truncated, attenuating correlation.
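A minimal two-group discriminant analysis of the sort described in Q-1 can be run with scikit-learn's LinearDiscriminantAnalysis as a stand-in for the SPSS/SAS procedure. The two groups and their separation are fabricated for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
# Two groups separated on two interval predictors.
X = np.vstack([rng.normal(loc=0.0, size=(60, 2)),
               rng.normal(loc=2.5, size=(60, 2))])
y = np.repeat([0, 1], 60)            # the grouping (criterion) variable

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.transform(X)            # discriminant scores: one function
hit_ratio = lda.score(X, y)          # proportion correctly classified
```

With a dichotomous grouping variable there is a single discriminant function, so the scores come back as one column per case, matching the "one function for 2-group DA" point made later in these notes.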

DA is an earlier alternative to logistic regression, which is now frequently used in place of DA as it usually involves fewer violations of assumptions (independent variables needn't be normally distributed, linearly related, or have equal within-group variances), is robust, handles categorical as well as continuous variables, and has coefficients which many find easier to interpret. Logistic regression is preferred when data are not normal in distribution or group sizes are very unequal. See also the separate topic on multiple discriminant function analysis (MDA) for dependents with more than two categories.

Few Definitions and Concepts

Discriminating variables: These are the independent variables, also called predictors.

Criterion variable: The criterion variable, also called the grouping variable in SPSS, is the dependent variable. It is the object of classification efforts.

Discriminant function: A discriminant function, also called a canonical root, is a latent variable which is created as a linear combination of discriminating (independent) variables, such that L = b1x1 + b2x2 + ... + bnxn + c, where the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant. This is analogous to multiple regression, but the b's are discriminant coefficients which maximize the distance between the means of the criterion (dependent) variable. Note that the foregoing assumes the discriminant function is estimated using ordinary least-squares, the traditional method, but there is also a version involving maximum likelihood estimation.

Number of discriminant functions: There is one discriminant function for 2-group discriminant analysis, but for higher-order DA the number of functions (each with its own cut-off value) is the lesser of (g - 1), where g is the number of categories in the grouping variable, and p, the number of discriminating (independent) variables. Each discriminant function is orthogonal to the others. A dimension is simply one of the discriminant functions when there are more than one.

The eigenvalue: The eigenvalue, also called the characteristic root of each discriminant function, reflects the ratio of importance of the dimensions which classify cases of the dependent variable. There is one eigenvalue for each discriminant function. For two-group DA there is one discriminant function and one eigenvalue, which accounts for 100% of the explained variance. If there is more than one discriminant function, the first will be the largest and most important, the second next most important in explanatory power, and so on. The eigenvalues assess relative importance because they reflect the percents of variance explained in the dependent variable, cumulating to 100% for all functions.
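The function-counting rule and the eigenvalue shares just described are simple enough to sketch directly (the eigenvalues below are made-up numbers):

```python
def n_discriminant_functions(g, p):
    # The lesser of (g - 1) and p, the number of discriminating variables.
    return min(g - 1, p)

def relative_percents(eigenvalues):
    # Each function's eigenvalue as a share of the total explained variance.
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

two_group = n_discriminant_functions(2, 6)    # 1: always one function
four_group = n_discriminant_functions(4, 6)   # 3 functions
shares = relative_percents([2.0, 1.5, 0.5])   # [0.5, 0.375, 0.125]
```

The shares sum to 1, mirroring the statement that the eigenvalues cumulate to 100% of the explained variance across all functions.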

Relative %: The relative percentage of a discriminant function equals that function's eigenvalue divided by the sum of all eigenvalues of all discriminant functions in the model. Thus it is the percent of discriminating power for the model associated with a given discriminant function. Relative % is used to tell how many functions are important; one may find that only the first two or so eigenvalues are of importance. That is, the ratio of the eigenvalues indicates the relative discriminating power of the discriminant functions. If the ratio of two eigenvalues is 1.4, for instance, then the first discriminant function accounts for 40% more between-group variance in the dependent categories than does the second discriminant function. Eigenvalues are part of the default output in SPSS (Analyze, Classify, Discriminant).

The canonical correlation, R: The canonical correlation is a measure of the association between the groups formed by the dependent and the given discriminant function. When R is zero, there is no relation between the groups and the function; when the canonical correlation is large, there is a high correlation between the discriminant functions and the groups. An R of 1.0 indicates that all of the variability in the discriminant scores can be accounted for by that dimension. R is used to tell how much each function is useful in determining group differences. Note that for two-group DA, the canonical correlation is equivalent to the Pearsonian correlation of the discriminant scores with the grouping variable. Note also that relative % and R do not have to be correlated.

Discriminant score: The discriminant score, also called the DA score, is the value resulting from applying a discriminant function formula to the data for a given case. The Z score is the discriminant score for standardized data. To get discriminant scores in SPSS, select Analyze, Classify, Discriminant, click the Save button, and check "Discriminant scores". One can also view the discriminant scores by clicking the Classify button and checking "Casewise results."

Discriminant coefficients: Discriminant coefficients are the regression-like b coefficients in the discriminant function, in the form L = b1x1 + b2x2 + ... + bnxn + c, where L is the latent variable formed by the discriminant function, the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant. The discriminant function coefficients are partial coefficients, reflecting the unique contribution of each variable to the classification of the criterion variable. Unstandardized discriminant coefficients are used in the formula for making the classifications in DA, much as b coefficients are used in regression in making predictions: the constant plus the sum of products of the unstandardized coefficients with the observations yields the discriminant scores. The standardized discriminant coefficients, like beta weights in regression, are used to assess the relative classifying power of the independent variables.

Cutoff: If the discriminant score of the function is less than or equal to the cutoff, the case is classed as 0; if above, it is classed as 1. When group sizes are equal, the cutoff is the mean of the two centroids (for two-group DA). If the groups are unequal, the cutoff is the weighted mean.
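The two-group cutoff rule can be sketched in a few lines. Equal group sizes are assumed here; the weighted-mean case for unequal groups follows the same pattern. The centroid values are hypothetical:

```python
def cutoff_two_group(centroid0, centroid1):
    # With equal group sizes, the cutoff is the mean of the two group
    # centroids on the discriminant function.
    return (centroid0 + centroid1) / 2.0

def classify(score, cutoff):
    # Scores at or below the cutoff are classed 0; scores above it, 1.
    return 0 if score <= cutoff else 1

cut = cutoff_two_group(-1.2, 1.8)   # midpoint, 0.3
group_a = classify(-0.5, cut)       # classed 0
group_b = classify(0.9, cut)        # classed 1
```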

Since standardized discriminant coefficients are partial coefficients, only the unique explanation of each independent is being compared, not considering any shared explanation. Note that importance is assessed relative to the model being analyzed: addition or deletion of variables in the model can change discriminant coefficients markedly. Also, the standardized discriminant coefficients do not tell the researcher between which groups the variable is most or least discriminating; for this purpose, group centroids and factor structure are examined.

Q-2 What is Wilks' Lambda? Ans: Wilks' lambda is used to test the significance of the discriminant function as a whole. It is an F test, where a "Sig." p value < .05 means the model differentiates discriminant scores between the groups significantly better than chance (than a model with just the constant). In SPSS, the "Wilks' Lambda" table will have a column labeled "Test of Function(s)" and a row labeled "1 through n" (where n is the number of discriminant functions); the "Sig." level for this row is the significance level of the discriminant function as a whole. This use of Wilks' lambda is in the "Wilks' lambda" table of the output section on "Summary of Canonical Discriminant Functions" and is part of the default output in SPSS (Analyze, Classify, Discriminant). Wilks's lambda is sometimes called the U statistic.

Wilks' lambda also can be used to test which independents contribute significantly to the discriminant function. Lambda varies from 0 to 1, with 0 meaning group means differ (thus the more the variable differentiates the groups) and 1 meaning all group means are the same. The smaller the lambda for an independent variable, the more that variable contributes to the discriminant function. The F test of Wilks's lambda shows which variables' contributions are significant; a significant lambda means one can reject the null hypothesis that the two groups have the same mean discriminant function scores. In SPSS, this use of Wilks' lambda is in the "Tests of equality of group means" table in DA output.

The ANOVA table for discriminant scores is another overall test of the DA model. It is obtained in SPSS by asking for Analyze, Compare Means, One-Way ANOVA, using discriminant scores from DA (which SPSS will label Dis1_1 or similar) as the dependent.

Q-3 What is a Confusion or Classification Matrix?

Ans: The classification table, also called a classification matrix, or a confusion, assignment, or prediction matrix or table, is used to assess the performance of DA. This is simply a table in which the rows are the observed categories of the dependent and the columns are the predicted categories of the dependent. When prediction is perfect, all cases will lie on the diagonal. The percentage of cases on the diagonal is the percentage of correct classifications, called the hit ratio. Note that the hit ratio must be compared not to zero but to the percent that would have been correctly classified by chance alone. For two-group discriminant analysis with a 50-50 split in the dependent variable, the expected percent is 50%. For unequally split 2-way groups of different sizes, the expected percent is computed in the "Prior Probabilities for Groups" table in SPSS, by multiplying the prior probabilities times the group sizes, summing for all groups, and dividing the sum by N.

Adapted from the link: http://faculty.chass.ncsu.edu/garson/PA765/discrim2.htm

Suggested Readings:
Huberty, Carl J. (1994). Applied discriminant analysis. NY: Wiley-Interscience (Wiley Series in Probability and Statistics).
Klecka, William R. (1980). Discriminant analysis. Series: Quantitative Applications in the Social Sciences, No. 19. Thousand Oaks, CA: Sage Publications.
Lachenbruch, P. A. (1975). Discriminant analysis. NY: Hafner.
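As a footnote to Q-3, the hit ratio and its chance benchmark can be computed in a few lines. The case labels and group sizes are hypothetical:

```python
import numpy as np

def hit_ratio(actual, predicted):
    # Share of cases on the diagonal of the classification table.
    return float(np.mean(np.asarray(actual) == np.asarray(predicted)))

def chance_hit_ratio(group_sizes):
    # Sum over groups of (prior probability x group size), divided by N,
    # i.e. the sum of squared group proportions under proportional priors.
    n = sum(group_sizes)
    return sum((size / n) * size for size in group_sizes) / n

observed = hit_ratio([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 1, 0])   # 5 of 6
even_split = chance_hit_ratio([50, 50])      # 0.50 for a 50-50 split
uneven_split = chance_hit_ratio([80, 20])    # 0.68 for an 80-20 split
```

The 80-20 example shows why the benchmark matters: a model that always predicted the larger group would already be right 80% of the time, so a hit ratio must clear the chance figure to be interesting.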

Cluster Analysis

Q-1 What is Cluster Analysis? Ans: Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation.

SPSS offers three general approaches to cluster analysis. In k-means clustering the researcher specifies the number of clusters in advance, then calculates how to assign cases to the K clusters. K-means clustering is much less computer-intensive and is therefore sometimes preferred when datasets are very large (ex., > 1,000). Hierarchical clustering allows users to select a definition of distance, then select a linking method of forming clusters, then determine how many clusters best suit the data. Finally, two-step clustering creates pre-clusters, then clusters the pre-clusters.

Hierarchical cluster analysis, discussed below, can use either agglomerative or divisive clustering strategies. In agglomerative hierarchical clustering every case is initially considered a cluster, then the two cases with the lowest distance (or highest similarity) are combined into a cluster. The case with the lowest distance to either of the first two is considered next; if that third case is closer to a fourth case than it is to either of the first two, the third and fourth cases become the second two-case cluster, and if not, the third case is added to the first cluster. The process is repeated, adding cases to existing clusters, creating new clusters, or combining clusters, to get to the desired final number of clusters. There is also divisive clustering, which works in the opposite direction, starting with all cases in one large cluster. Other techniques, such as latent class analysis and Q-mode factor analysis, also perform clustering and are discussed separately.

Key Concepts and Terms

Cluster formation is the selection of the procedure for determining how clusters are created and how the calculations are done.
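The k-means variant described above, with the number of clusters fixed in advance, can be sketched with scikit-learn standing in for SPSS. The two segments in the data are fabricated and well separated so the recovery is clean:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=0.0, size=(40, 2)),    # segment one
               rng.normal(loc=8.0, size=(40, 2))])   # segment two

# The researcher specifies K = 2 in advance, as the text describes.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
# Each of the two built-in segments ends up in its own cluster.
```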

Similarity and Distance

Distance: Distance measures how far apart two observations are. Cases which are alike share a low distance. Euclidean distance is the most common distance measure: a given pair of cases is plotted on two variables, which form the x and y axes, and the Euclidean distance is the square root of the sum of the square of the x difference plus the square of the y difference. (Recall high school geometry: this is the formula for the length of the third side of a right triangle.) It is common to use the square of Euclidean distance instead, as squaring removes the sign. When two or more variables are used to define distance, the one with the larger magnitude will dominate, so to avoid this it is common to first standardize all variables.

Similarity: Similarity measures how alike two cases are. Cases which are alike share a high similarity. SPSS supports a large number of similarity measures for interval data (Pearson correlation or cosine) and for binary data (Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1 through 5, Kulczynski 1 and 2, Yule's Y, Yule's Q, Hamann, Ochiai, Anderberg's D, phi 4-point correlation, lambda, or dispersion). Since for Pearson correlation high negative as well as high positive values indicate similarity, the researcher normally selects absolute values; this can be done by checking the absolute value checkbox in the Transform Measures area of the Methods subdialog (invoked by pressing the Methods button) of the main Cluster dialog.

The first step in cluster analysis is establishment of the similarity or distance matrix. This matrix is a table in which both the rows and columns are the units of analysis and the cell entries are a measure of similarity or distance for any pair of cases. The proximity matrix table in the output shows the actual distances or similarities computed for any pair of cases. In SPSS, proximity matrices are selected under Analyze, Cluster, Hierarchical clustering, Statistics button; check proximity matrix. Similarity/distance measures themselves are selected in the Measure area of the Method subdialog obtained by pressing the Method button in the Classify dialog. There are three Measure pulldown menus, for interval, count, and binary data respectively. For interval data, SPSS supports Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized measures; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams, in addition to the similarity measures above.

Method: Under the Method button in the SPSS Classify dialog, the pull-down Method selection determines how cases or clusters are combined at each step. There are a variety of different measures of inter-observation distances and inter-cluster distances to use as criteria when merging nearest clusters into broader groups or when considering the relation of a point to a cluster. Different methods will result in different cluster patterns.
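The Euclidean distance arithmetic and the standardization step described above can be sketched in numpy (the example points form the familiar 3-4-5 right triangle):

```python
import numpy as np

def euclidean(a, b):
    # Square root of the summed squared coordinate differences.
    return float(np.sqrt(np.sum((a - b) ** 2)))

def squared_euclidean(a, b):
    # Squaring removes the sign; no square root is taken.
    return float(np.sum((a - b) ** 2))

def standardize(X):
    # z-scores column by column, so no variable dominates the distance.
    return (X - X.mean(axis=0)) / X.std(axis=0)

a, b = np.array([0.0, 3.0]), np.array([4.0, 0.0])
d = euclidean(a, b)            # 5.0
d2 = squared_euclidean(a, b)   # 25.0
```

After `standardize`, every column has mean 0 and standard deviation 1, which is the usual remedy when one variable's magnitude would otherwise dominate the distances.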

SPSS offers these method choices:

Nearest neighbor (single linkage): the distance between two clusters is the distance between their closest neighboring points.

Furthest neighbor (complete linkage): the distance between two clusters is the distance between their two furthest member points.

UPGMA (unweighted pair-group method using averages): the distance between two clusters is the average distance between all inter-cluster pairs. UPGMA is generally preferred over the nearest or furthest neighbor methods since it is based on information about all inter-cluster pairs, not just the nearest or furthest ones. SPSS labels this "between-groups linkage," and it is the default method in SPSS.

Average linkage within groups: the mean distance between all possible inter- or intra-cluster pairs. The average distance between all pairs in the resulting cluster is made to be as small as possible, so this method is appropriate when the research purpose is homogeneity within clusters. SPSS labels this "within-groups linkage."

Ward's method: calculates the sum of squared Euclidean distances from each case in a cluster to the mean of all variables. The cluster to be merged is the one which will increase the sum the least. This is an ANOVA-type approach and is preferred by some researchers for this reason.

Centroid method: the cluster to be merged is the one with the smallest sum of Euclidean distances between cluster means for all variables.

Median method: also uses Euclidean distance as the proximity measure, but clusters are weighted equally regardless of group size when computing the centroids of two clusters being combined.

Correlation of items can be used as a similarity measure. One transposes the normal data table, in which columns are variables and rows are cases, by using columns as cases and rows as variables instead; the correlation is then between cases, and these correlations may constitute the cells of the similarity matrix.

Binary matching is another type of similarity measure, where 1 indicates a match and 0 indicates no match between any pair of cases. There are multiple matched attributes, and the similarity score is the number of matches divided by the number of attributes being matched. Note that it is usual in binary matching to have several attributes, because there is a risk that when the number of attributes is small, they may be orthogonal to (uncorrelated with) one another, and clustering will be indeterminate.

Summary measures assess how the clusters differ from one another.

Means and variances: A table of means and variances of the clusters with respect to the original variables shows how the clusters differ on the original variables. SPSS does not make this available in the Cluster dialog, but one can click the Save button, which will save the cluster number for each case (or numbers, if multiple solutions are requested). Then, in Analyze, Compare Means, the researcher can use the cluster number as the grouping variable to compare differences of means on any other continuous variable in the data set.
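The linkage choices above map onto scipy's hierarchical clustering: "single" is nearest neighbor, "complete" is furthest neighbor, "average" is UPGMA, and "ward", "centroid", and "median" match their namesakes. A sketch on fabricated two-group data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
               rng.normal(5.0, 0.5, size=(10, 2))])

# Ward's method merges the pair of clusters that least increases the
# within-cluster sum of squared Euclidean distances.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree at 2 clusters
```

Swapping `method="ward"` for "single", "complete", or "average" reproduces the other linkage rules; as the notes say, different methods can yield different cluster patterns on less separated data.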

Agglomeration schedule. The agglomeration schedule is a choice under the Statistics button for Hierarchical Cluster in the SPSS Cluster dialog. In this table, the rows are stages of clustering, numbered from 1 to (n - 1). In agglomerative clustering using a distance measure like Euclidean distance, stage 1 combines the two cases which have the lowest proximity (distance) score. Note that for distance measures, low is good, meaning the cases are alike; for similarity measures, high coefficients mean cases are alike. There are two "Cluster Combined" columns, giving the case or cluster numbers for combination at each stage, where cases are initially numbered 1 to n. The cluster number goes by the lower of the cases or clusters combined: for instance, at stage 1, cases 3 and 18 might be combined, resulting in a cluster labeled 3; later, cluster 3 and case 2 might be combined, resulting in a cluster labeled 2. From this table, the researcher can see how agglomeration proceeded. To choose a stopping stage, the researcher looks at the "Coefficients" column of the agglomeration schedule and notes when the proximity coefficient jumps up and is not a small increment from the one before (or when the coefficient reaches some theoretically important level). Note, though, that SPSS will not stop on this basis but instead will compute the range of solutions (ex., 2 to 4 clusters) requested by the researcher in the Cluster Membership group of the Statistics button in the Hierarchical Clustering dialog.

Cluster membership table. This shows cases as rows, where columns are alternative numbers of clusters in the solution (as specified in the "Range of Solutions" option in the Cluster Membership group in SPSS, under the Statistics button). Cell entries show the number of the cluster to which the case belongs, depending on the number of clusters in the solution. From this table, one can see which cases are in which groups. After the stopping stage is determined from the agglomeration schedule, the researcher can use the cluster number for each case (or numbers, if multiple solutions are requested) as the grouping variable (for example, in Analyze, Compare Means) to compare differences of means on any other continuous variable.

Icicle plots. Linkage tables show the relation of the cases to the clusters; linkage plots show similar information in graphic form. When there are relatively few cases, icicle plots or dendrograms provide the same linkage information in an easier format. Icicle plots are usually horizontal, with cases as columns; if there are few cases, vertical icicle plots may be plotted, showing cases as rows and number of clusters in the solution as columns. Reading from the last column right to left (horizontal icicle plots) or last row bottom to top (vertical icicle plots), the last column/row shows all the cases in separate one-case clusters; the next-to-last shows the (n - 1) solution, with two cases combined into one cluster; the one before that shows the (n - 2) solution; and subsequent columns/rows show further clustering steps, until row 1 (vertical icicle plots) or column 1 (horizontal icicle plots) shows all cases in a single cluster. By reading in this way, the researcher can work backward to determine how many clusters there are and which cases belong to which clusters (but it is easier just to get this information from the cluster membership table).
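The agglomeration schedule has a direct analogue in SciPy's `linkage` output, where each row records one merge stage and its distance coefficient. A sketch with hypothetical data; `method='average'` corresponds to between-groups (average) linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# five hypothetical cases measured on two clustering variables
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])

# rows of Z are stages: [cluster A, cluster B, distance coefficient, cluster size]
Z = linkage(X, method='average', metric='euclidean')
for stage, (a, b, coeff, size) in enumerate(Z, start=1):
    print(f"Stage {stage}: merge {int(a)} and {int(b)} at {coeff:.3f} ({int(size)} cases)")
```

A sudden jump in the coefficient column, just as in the SPSS "Coefficients" column, suggests a stopping stage.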

What is Hierarchical Cluster Analysis?
Hierarchical clustering is appropriate for smaller samples (typically < 250). In hierarchical clustering, the clusters are nested rather than mutually exclusive; that is, larger clusters created at later stages may contain smaller clusters created at earlier stages of agglomeration. The merging of clusters is visualized using a tree format. Hierarchical clustering generates all possible clusters of sizes 1 to K, but is used only for relatively small samples. The optimum number of clusters depends on the research purpose: identifying "typical" types may call for few clusters, while identifying "exceptional" types may call for many clusters.

Forward clustering, also called agglomerative clustering: small clusters are formed by using a high similarity index cut-off (ex., .9). Then this cut-off is relaxed to establish broader and broader clusters in stages, until all cases are in a single cluster at some low similarity index cut-off. Backward clustering, also called divisive clustering, is the same idea, but starting with a low cut-off and working toward a high cut-off. Forward and backward methods need not generate the same results.

To accomplish hierarchical clustering, the researcher must specify how similarity or distance is defined and how clusters are aggregated (or divided). In SPSS, select Analyze, Classify, Hierarchical Cluster. One may wish to use the hierarchical cluster procedure on a sample of cases (ex., 200) to inspect results for different numbers of clusters. After using hierarchical clustering to determine the desired number of clusters, the researcher may then wish to analyze the entire dataset with k-means clustering (aka the Quick Cluster procedure: Analyze, Classify, K-Means Cluster Analysis), specifying that number of clusters.

Dendrograms, also called tree diagrams, show the relative size of the proximity coefficients at which cases were combined; this is a visual way of representing the information in the agglomeration schedule. The bigger the distance coefficient, or the smaller the similarity coefficient, the more the clustering involved combining unlike entities, which may be undesirable. Trees are usually depicted horizontally, not vertically, with each row representing a case on the Y axis, while the X axis is a rescaled version of the proximity coefficients. Cases with low distance/high similarity are close together, with a line linking them a short distance from the left of the dendrogram, indicating that they are agglomerated into a cluster at a low distance coefficient, meaning alikeness. When, as is the usual case, the linking line is well to the right of the dendrogram, the linkage occurs at a high distance coefficient, indicating the cases/clusters were agglomerated even though much less alike. If a similarity measure is used rather than a distance measure, the rescaling of the X axis still produces a diagram with linkages involving high alikeness to the left and low alikeness to the right. In SPSS, click the Plots button in the Hierarchical Cluster dialog and check the Dendrogram checkbox.
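The tree-cutting logic behind these displays can be sketched with SciPy's `fcluster`, which extracts memberships for several candidate numbers of clusters from one linkage, much like the SPSS "Range of Solutions" option. The data are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# five hypothetical cases on two clustering variables
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])
Z = linkage(X, method='average')

# membership for the 2- to 4-cluster solutions from the same tree
for k in range(2, 5):
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(f"{k}-cluster solution:", labels)
```

Because the solutions are nested, each smaller solution merges whole clusters of the larger one.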

In the Hierarchical Cluster dialog (Analyze, Classify, Hierarchical Cluster), enter variables in the Variables: area and, optionally, enter a variable in the "Label cases by:" area. Select Cases in the Cluster group (or, in order to cluster variables, the researcher may select Variables rather than the usual Cases). Click Statistics, select Proximity Matrix and, in the Cluster Membership group, select Range of Solutions; then Continue, OK. SPSS calls hierarchical clustering the "Cluster procedure."

What is K-Means Cluster Analysis?
K-means cluster analysis uses Euclidean distance. The researcher must specify in advance the desired number of clusters, K (typically 3 to 6). Initial cluster centers are chosen in a first pass of the data; then each additional iteration groups observations based on nearest Euclidean distance to the mean of the cluster, and cluster centers change at each pass. Cluster centers are the average value on all clustering variables of each cluster's members. The process continues until cluster means do not shift more than a given cut-off value or the iteration limit is reached; when the change drops below the specified cutoff, the iterative process stops and cases are assigned to clusters according to which cluster center they are nearest. Thus, unlike agglomerative hierarchical clustering, where the solution is constrained to force a given case to remain in its initial cluster, in K-means clustering a given case may be assigned to a cluster and then reassigned to a different cluster as the algorithm unfolds. Large datasets are possible with K-means clustering, because K-means clustering does not require prior computation of a proximity matrix of the distance/similarity of every case with every other case. However, unlike hierarchical clustering, there is no option for a "Range of solutions"; instead you must re-run K-means clustering, asking for a different number of clusters.

In SPSS: Analyze, Classify (or just Classify), K-Means Cluster Analysis (in spite of that title, this is the "Quick Cluster" procedure). Select variables, enter the "Number of clusters:", and choose Method: Iterate and classify. The default method is "Iterate and classify," under which an iterative process is used to update cluster centers and cases are then classified based on the updated centers. SPSS also supports a "Classify only" method, under which cases are immediately classified based on the initial cluster centers, which are not updated.

The "Initial cluster centers" table gives the average value of each variable for each cluster for the k well-spaced cases which SPSS selects for initialization purposes when no initial file is supplied. The "Iteration history" table shows the change in cluster centers when the usual iterative approach is taken. The "Final cluster centers" table in SPSS output gives the same thing for the last iteration step.
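The iterate-and-classify logic maps onto scikit-learn's `KMeans`, which likewise requires k in advance and reports final cluster centers; this is a sketch on hypothetical, well-separated data, not SPSS's exact implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 90 hypothetical cases in three well-separated groups on two variables
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 4.0, 8.0)])

# k must be specified in advance, as in the SPSS K-Means Cluster dialog
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)       # analogue of the "Final cluster centers" table
print(np.bincount(km.labels_))   # "Number of cases in each cluster"
```

Re-running with a different `n_clusters` is the analogue of asking SPSS for a different number of clusters.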

Getting different clusters. Sometimes the researcher wishes to experiment to get different clusters, as when the "Number of cases in each cluster" table shows highly imbalanced clusters and/or clusters with very few members. Different results may occur by setting different initial cluster centers from file, by changing the number of clusters requested, or even by presenting the data file in different case order.

Iterate button. Optionally, you may press the Iterate button and set the number of iterations and the convergence criterion. The default maximum number of iterations in SPSS is 10. For the convergence criterion, by default, iterations terminate if the largest change in any cluster center is less than 2% of the minimum distance between initial centers (or if the maximum number of iterations has been reached). To override this default, enter a positive number less than or equal to 1 in the convergence box. There is also a "Use running means" checkbox which, if checked, will cause the cluster centers to be updated after each case is classified, rather than the default, which is after the entire set of cases is classified.

Save button. Optionally, you may press the Save button to save the final cluster number of each case as an added column in your dataset (labeled QCL_1), and/or you may save the Euclidean distance between each case and its cluster center (labeled QCL_2) by checking "Distance from cluster center."

Options button. Optionally, you may press the Options button to select statistics or missing values options. There are three statistics options: "Initial cluster centers" (gives the initial variable means for each cluster); "ANOVA table" (ANOVA F-tests for each variable; as the F tests are only descriptive, the resulting probabilities are for exploratory purposes only, but nonetheless non-significant variables might be dropped as not contributing to the differentiation of clusters); and "Cluster information for each case" (gives each case's final cluster assignment and the Euclidean distance between the case and the cluster center; also gives the Euclidean distance between final cluster centers).

What is Two-Step Cluster Analysis?
Two-step cluster analysis groups cases into pre-clusters which are treated as single cases; standard hierarchical clustering is then applied to the pre-clusters in the second step. It is recommended for very large datasets, since it is a method requiring neither a proximity table like hierarchical classification nor an iterative process like K-means clustering, but rather is a one-pass-through-the-dataset method. This is also the method used when one or more of the variables are categorical (not interval or dichotomous).

Cluster feature tree. The preclustering stage employs a CF tree with nodes leading to leaf nodes. Cases start at the root node and are channeled toward nodes and eventually leaf nodes which match them most closely. If there is no adequate match, the case is used to start its own leaf node. It can happen that the CF tree fills up and cannot accept new leaf entries in a node, in which case the node is split using the most-distant pair in the node as seeds. If this recursive process grows the CF tree beyond maximum size, the threshold distance is increased and the tree is rebuilt, allowing new cases to be input. The process continues until all the data are read.
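SPSS's two-step procedure itself is proprietary, but scikit-learn's `Birch` is built on the same CF-tree preclustering idea: a `threshold` plays the role of the threshold distance, and a final global clustering is applied to the subclusters. A sketch on hypothetical data:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(1)
# 100 hypothetical cases in two well-separated groups
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0)])

# step 1: build the CF tree of pre-clusters; step 2: cluster the pre-clusters
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)
print(np.bincount(model.labels_))   # cases per final cluster
```

Like two-step clustering, this makes a single pass over the cases, so it scales to datasets far too large for a full proximity matrix.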

In SPSS, choose Analyze, Classify, Two-Step Cluster; select your categorical and continuous variables; click Output and select the statistics wanted (descriptive statistics, AIC or BIC); and, if desired, click Plots and select the plots wanted. Click the Advanced button in the Options button dialog to set threshold distances, maximum levels, and maximum branches per leaf node manually; then Continue.

Proximity. If variables are all continuous, Euclidean distance is used, with cases categorized under the cluster which is associated with the smallest Euclidean distance. When one or more of the variables are categorical, log-likelihood is the distance measure used, with cases categorized under the cluster which is associated with the largest log-likelihood.

Number of clusters. By default SPSS determines the number of clusters using the change in BIC (the Schwarz Bayesian Criterion): when the BIC change is small, it stops and selects as many clusters as thus far created. It is also possible to have this done based on changes in AIC (the Akaike Information Criterion). The researcher can also ask for a range of solutions, such as 3-5 clusters, or simply tell SPSS how many clusters are wanted. The "Autoclustering statistics" table in SPSS output gives BIC and BIC change for all solutions.

Adapted from http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm and www.cs.uu.nl

Suggested Readings:
Leonard Kaufman, Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, 2004
Anil K. Jain, Richard C. Dubes, Algorithms for Clustering Data, 2005

Factor Analysis

Q-1 What is Factor Analysis?
Ans: Factor analysis is a correlational technique to determine meaningful clusters of shared variance. Factor analysis begins with a large number of variables and then tries to reduce the interrelationships amongst the variables to a small number of clusters or factors. It finds relationships or natural connections where variables are maximally correlated with one another and minimally correlated with other variables, and then groups the variables accordingly. After this process has been done many times, a pattern of relationships or factors that captures the essence of all the data emerges. Factor analysis should be driven by a researcher who has a deep and genuine interest in relevant theory, in order to get optimal value from choosing the right type of factor analysis and interpreting the factor loadings.

Summary: Factor analysis refers to a collection of statistical methods for reducing correlational data into a smaller number of dimensions or factors.

Key Concepts and Terms

Exploratory factor analysis (EFA) seeks to uncover the underlying structure of a relatively large set of variables. There is no prior theory, and one uses factor loadings to intuit the factor structure of the data. The researcher's a priori assumption is that any indicator may be associated with any factor. This is the most common form of factor analysis.

Confirmatory factor analysis (CFA) seeks to determine if the number of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis of pre-established theory. Indicator variables are selected on the basis of prior theory, and factor analysis is used to see if they load as predicted on the expected number of factors. A minimum requirement of confirmatory factor analysis is that one hypothesize beforehand the number of factors in the model, but usually the researcher will also posit expectations about which variables will load on which factors (Kim and Mueller, 1978b: 55). The researcher's a priori assumption is that each factor (the number and labels of which may be specified a priori) is associated with a specified subset of indicator variables. The researcher seeks to determine, for instance, if measures created to represent a latent variable really belong together.

Factor loadings: The factor loadings, also called component loadings in PCA, are the correlation coefficients between the variables (rows) and factors (columns). Analogous to Pearson's r, the squared factor loading is the percent of variance in that variable explained by the factor. To get the percent of variance in all the variables accounted for by each factor, add the sum of the squared factor loadings for that factor (column) and divide by the number of variables.
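Loadings can be illustrated with scikit-learn's `FactorAnalysis` on simulated data (two hypothetical latent factors, six indicators). Because the variables are standardized, each row's sum of squared loadings estimates the share of that variable's variance explained by the factors jointly:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 500
f = rng.normal(size=(n, 2))                    # two latent factors
noise = 0.3 * rng.normal(size=(n, 6))
# indicators 1-3 built from factor 1, indicators 4-6 from factor 2
X = np.column_stack([f[:, 0]] * 3 + [f[:, 1]] * 3) + noise
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize the indicators

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
loadings = fa.components_.T                    # rows = variables, cols = factors
explained = (loadings ** 2).sum(axis=1)        # variance explained per variable
print(np.round(explained, 2))
```

With this simulation, each indicator's shared variance is high (around .9), since the noise standard deviation is small relative to the factor.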

Communality, h2, is the squared multiple correlation for the variable as dependent using the factors as predictors. The communality measures the percent of variance in a given variable explained by all the factors jointly, and may be interpreted as the reliability of the indicator. When an indicator variable has a low communality, the factor model is not working well for that indicator and possibly it should be removed from the model. However, communalities must be interpreted in relation to the interpretability of the factors. A communality of .75 seems high but is meaningless unless the factor on which the variable is loaded is interpretable, though it usually will be. A communality of .25 seems low but may be meaningful if the item is contributing to a well-defined factor. That is, what is critical is not the communality coefficient per se, but rather the extent to which the item plays a role in the interpretation of the factor, though often this role is greater when communality is high.

Eigenvalues: Also called characteristic roots, eigenvalues measure the amount of variation in the total sample accounted for by each factor. A factor's eigenvalue may be computed as the sum of its squared factor loadings for all the variables. The eigenvalue for a given factor thus measures the variance in all the variables which is accounted for by that factor. The ratio of eigenvalues is the ratio of explanatory importance of the factors with respect to the variables. If a factor has a low eigenvalue, then it is contributing little to the explanation of variances in the variables and may be ignored as redundant with more important factors. Note that the eigenvalue is not the percent of variance explained but rather a measure of amount of variance in relation to total variance: since variables are standardized to have means of 0 and variances of 1, total variance is equal to the number of variables (the number of variables equals the sum of their variances, as the variance of a standardized variable is 1). SPSS will output a corresponding column titled '% of variance', which is the same as dividing the factor's eigenvalue by the number of variables.

Q-2 What are the criteria for determining the number of factors?
Ans: The criteria below are listed roughly in the order of frequency of use in social science (see Dunteman, 1989: 22-3).

Kaiser criterion: A common rule of thumb for dropping the least important factors from the analysis. The Kaiser rule is to drop all components with eigenvalues under 1.0. The Kaiser criterion is the default in SPSS and most computer programs. This rule is sometimes criticised for being amenable to researcher-controlled "fudging"; that is, the researcher may be tempted to set the cut-off at the number of factors desired by his or her research agenda.

Scree plot: The Cattell scree test plots the components as the X axis and the corresponding eigenvalues as the Y axis. As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward less steep decline, Cattell's scree test says to drop all further components after the one starting the elbow. Picking the "elbow" can be subjective, because the curve has multiple elbows or is a smooth curve.
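The eigenvalue bookkeeping described above (eigenvalues sum to the number of variables; % of variance = eigenvalue / number of variables) is easy to verify from a correlation matrix with NumPy. The data are simulated: four indicators of one hypothetical factor plus two pure-noise variables:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
f = rng.normal(size=n)
X = np.column_stack([f + 0.5 * rng.normal(size=n) for _ in range(4)]
                    + [rng.normal(size=n) for _ in range(2)])

R = np.corrcoef(X, rowvar=False)              # 6x6 correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

print(np.round(eigvals, 2))
print("sum of eigenvalues:", round(eigvals.sum(), 2))   # equals the 6 variables
print("% of variance, factor 1:", round(100 * eigvals[0] / len(eigvals), 1))
```

One dominant eigenvalue emerges for the common factor, with the rest trailing off, the pattern a scree plot would show as a sharp elbow.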

Even when "fudging" is not a consideration, the scree criterion tends to result in more factors than the Kaiser criterion.

Variance explained criteria: Some researchers simply use the rule of keeping enough factors to account for 90% (sometimes 80%) of the variation. Where the researcher's goal emphasizes parsimony (explaining variance with as few factors as possible), the criterion could be as low as 50%. The amount explained is reflected in the sum of the eigenvalues of all factors.

Q-3 What are the different rotation methods used in factor analysis?
Ans: No rotation is the default, but it is a good idea to select a rotation method, usually varimax. The original, unrotated principal components solution maximizes the sum of squared factor loadings, efficiently creating a set of factors which explain as much of the variance in the original variables as possible. However, unrotated solutions are hard to interpret because variables tend to load on multiple factors.

Varimax rotation is an orthogonal rotation of the factor axes to maximize the variance of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix, which has the effect of differentiating the original variables by extracted factor. Each factor will tend to have either large or small loadings of any particular variable, so a varimax solution yields results which make it as easy as possible to identify each variable with a single factor. This is the most common rotation option.

Quartimax rotation is an orthogonal alternative which minimizes the number of factors needed to explain each variable. This type of rotation often generates a general factor on which most variables are loaded to a high or medium degree. Such a factor structure is usually not helpful to the research purpose.

Q-4 How many cases are required to do factor analysis?
Ans: There is no scientific answer to this question, and methodologists differ. Alternative arbitrary "rules of thumb," in descending order of popularity, include those below. These are not mutually exclusive: Bryant and Yarnold, for instance, endorse both STV and the Rule of 200.

Rule of 10. There should be at least 10 cases for each item in the instrument being used.
STV ratio. The subjects-to-variables ratio should be no lower than 5 (Bryant and Yarnold, 1995).
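The varimax rotation described above can be sketched directly in NumPy using the standard SVD-based algorithm. Here a clean two-factor "simple structure" loading matrix (hypothetical values) is deliberately mixed by a 30-degree rotation, and varimax recovers the interpretable pattern:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-6):
    """Orthogonal rotation maximizing the variance of squared loadings."""
    p, k = L.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        Lam = L @ R
        # standard varimax update via SVD of the gradient-like matrix
        u, s, vt = np.linalg.svd(
            L.T @ (Lam ** 3 - Lam * (Lam ** 2).sum(axis=0) / p))
        R = u @ vt
        if s.sum() < crit * (1 + tol):
            break
        crit = s.sum()
    return L @ R

simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.8], [0.0, 0.7]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
L = simple @ mix              # every variable now loads on both factors
print(np.round(varimax(L), 2))
```

After rotation, each variable again has one large and one near-zero loading, which is exactly the "identify each variable with a single factor" property varimax aims for.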

uk/Users/andyf/factor.check KMO and Bartlett's test of sphericity and also check Anti-image . The diagonal elements on the Anti-image correlation matrix are the KMO individual statistics for each variable.ncsu. Q-5 What is "sampling adequacy" and what is it used for? Measured by the Kaiser-Meyer-Olkin (KMO) statistics. Adapted from: http://faculty. The KMO output is KMO overall. however.pdf www.pdf Suggested Readings Bruce Thompson.Descriptives .Data Reduction . There is a KMO statistic for each individual variable. Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications.300 cases.Rule of 100: The number of subjects should be the larger of 5 times the number of variables. KMO is found under Analyze .edu/garson/PA765/factspss. sampling adequacy predicts if data are likely to factor well. If it is not.OK.60 or higher to proceed with factor (input variables) .cs.Correlation Matrix .60. 2004 22 . of course). In the old days of manual factor analysis. more toward the 150 end when there are a few highly correlated variables. to assess which variables to drop from the model because they are too multicollinear.Statistics . Even more subjects are needed when communalities are low and/or few variables load on each factor. based on correlation and partial correlation. this was extremely useful. The concept is that the partial correlations should not be very large if one is to expect distinct factors to emerge from factor analysis. The denominator is this same sum plus the sum of squared partial correlations of each variable i with each variable j. or 100.Continue . 1994) Rule of 150: Hutcheson and Sofroniou (1999) recommends at least 150 .uu. until KMO overall rises above .Factor . and their sum is the KMO overall statistic. controlling for others in the analysis. the numerator is the sum of squared correlations of all variables in the analysis (except the 1. 
drop the indicator variables with the lowest individual KMO statistic values.chass.htm www.0 self-correlations of variables with themselves.0 and KMO overall should be . To compute KMO overall. (Hatcher. as would be the case when collapsing highly multicollinear variables.sussex. In KMO varies from 0 to 1. KMO can still be used.
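The KMO formula just described can be sketched in NumPy: partial correlations come from the inverse of the correlation matrix, and squared correlations and squared partial correlations are pooled as in the numerator/denominator definition above. The one-factor data are simulated and hypothetical:

```python
import numpy as np

def kmo(R):
    """Overall and per-variable KMO from a correlation matrix R."""
    inv_R = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(inv_R), np.diag(inv_R)))
    partial = -inv_R / d                     # partial correlations
    np.fill_diagonal(partial, 0.0)
    r2 = R ** 2 - np.eye(len(R))             # squared correlations, no diagonal
    p2 = partial ** 2
    per_var = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))
    overall = r2.sum() / (r2.sum() + p2.sum())
    return overall, per_var

rng = np.random.default_rng(4)
n = 400
f = rng.normal(size=n)
# five hypothetical indicators sharing one factor
X = np.column_stack([f + 0.6 * rng.normal(size=n) for _ in range(5)])
overall, per_var = kmo(np.corrcoef(X, rowvar=False))
print(round(overall, 2), np.round(per_var, 2))
```

Because the five indicators share one strong factor, the correlations dominate the partial correlations and KMO overall comfortably exceeds the .60 threshold.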
