Multivariate Analysis Business Research Methods

Compiled by Dr. Sunil Bhardwaj
(From various online and published resources)


Multiple Regression
Q-1 What is Multiple Regression?
Ans: Multiple regression is used to account for (predict) the variance in an interval dependent, based on linear combinations of interval, dichotomous, or dummy independent variables. Multiple regression can establish that a set of independent variables explains a proportion of the variance in a dependent variable at a significant level (through a significance test of R2), and can establish the relative predictive importance of the independent variables (by comparing beta weights). Power terms can be added as independent variables to explore curvilinear effects. Cross-product terms can be added as independent variables to explore interaction effects. One can test the significance of the difference of two R2's to determine if adding an independent variable to the model helps significantly. Using hierarchical regression, one can see how much variance in the dependent can be explained by one or a set of new independent variables, over and above that explained by an earlier set. Of course, the estimates (b coefficients and constant) can be used to construct a prediction equation and generate predicted scores on a variable for further analysis.

The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression coefficients, representing the amount the dependent variable y changes when the corresponding independent changes 1 unit and the other independents are held constant. The c is the constant, where the regression line intercepts the y axis, representing the amount the dependent y will be when all the independent variables are 0. The standardized versions of the b coefficients are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the independent variables. Associated with multiple regression is R2, the squared multiple correlation, which is the percent of variance in the dependent variable explained collectively by all of the independent variables.
Multiple regression shares all the assumptions of correlation: linearity of relationships, the same level of relationship throughout the range of the independent variable ("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose range is not truncated. In addition, it is important that the model being tested is correctly specified. The exclusion of important causal variables or the inclusion of extraneous variables can markedly change the beta weights and hence the interpretation of the importance of the independent variables.
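
As a minimal sketch of the prediction equation y = b1x1 + b2x2 + ... + bnxn + c, the following plain-Python example estimates the b coefficients and the constant by ordinary least squares and then converts the b's to beta weights. The function names (solve, fit_ols, beta_weights) and the toy data are illustrative assumptions; they are not SPSS or SAS procedures.

```python
from statistics import stdev


def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(xs, y):
    """Return (b, c): the b coefficients and the constant of the least-squares fit."""
    design = [list(row) + [1.0] for row in xs]  # last column carries the constant
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    coefs = solve(xtx, xty)  # normal equations: (X'X) coefs = X'y
    return coefs[:-1], coefs[-1]


def beta_weights(xs, y, b):
    """Standardize each b: beta_j = b_j * sd(x_j) / sd(y)."""
    sd_y = stdev(y)
    return [bj * stdev([row[j] for row in xs]) / sd_y for j, bj in enumerate(b)]


# Toy data generated from y = 2*x1 + 3*x2 + 1 (hypothetical values).
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in xs]
b, c = fit_ols(xs, y)
print(b, c)                    # close to [2.0, 3.0] and 1.0
print(beta_weights(xs, y, b))  # standardized coefficients
```

The normal equations are used here only because they keep the sketch short; statistical packages typically use more numerically stable decompositions.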

Q-2 What is R-square?
Ans: R2, also called the squared multiple correlation or the coefficient of multiple determination, is the percent of the variance in the dependent explained uniquely or jointly by the independents. R-squared can also be interpreted as the proportionate reduction in error in estimating the dependent when knowing the independents. That is, R2 compares the errors made when using the regression model to guess the value of the dependent with the total errors made when using only the dependent's mean as the basis for estimating all cases.

Mathematically, R2 = 1 - (SSE/SST), where SSE = error sum of squares = SUM((Yi - EstYi)squared), where Yi is the actual value of Y for the ith case and EstYi is the regression prediction for the ith case, and where SST = total sum of squares = SUM((Yi - MeanY)squared). Put another way, the regression sum of squares/total sum of squares = R-square, where the regression sum of squares = total sum of squares - residual sum of squares. The "residual sum of squares" in SPSS/SAS output is SSE and reflects regression error. Thus R-square is 1 minus regression error as a proportion of total error and will be 0 when regression error is as large as it would be if you simply guessed the mean for all cases of Y.

Q-3 What is Adjusted R-square and how is it calculated?
Ans: Adjusted R-square is an adjustment for the fact that when one has a large number of independents, it is possible that R2 will become artificially high simply because some independents' chance variations "explain" small parts of the variance of the dependent. At the extreme, when there are as many independents as cases in the sample, R2 will always be 1.0. The adjustment to the formula lowers R2 as k, the number of independents, increases. Some authors conceive of adjusted R2 as the percent of variance "explained in a replication, after subtracting out the contribution of chance."

Adjusted R2 = 1 - ((1 - R2)(n - 1) / (n - k - 1)), where n is sample size and k is the number of terms in the model not counting the constant (i.e., the number of independents).

When used for the case of a few independents, R2 and adjusted R2 will be close. When there are a great many independents, adjusted R2 may be noticeably lower. The greater the number of independents, the more the researcher is expected to report the adjusted coefficient. Always use adjusted R2 when comparing models with different numbers of independents.

Q-4 What is Multicollinearity and how is it measured?

Ans: Multicollinearity is the intercorrelation of independent variables. R2's near 1 (for the regression of one independent on the others) violate the assumption of no perfect collinearity, while high R2's increase the standard error of the beta coefficients and make assessment of the unique role of each independent difficult or impossible. Even when multicollinearity is present, note that estimates of the importance of other variables in the equation (variables which are not collinear with others) are not affected.

Types of multicollinearity. The type of multicollinearity matters a great deal. Some types are necessary to the research purpose.

Correlation matrix. While simple correlations tell something about multicollinearity, the preferred method of assessing multicollinearity is to regress each independent on all the other independent variables in the equation. Inspection of the correlation matrix reveals only bivariate multicollinearity, with the typical criterion being bivariate correlations > .90. To assess multivariate multicollinearity, one uses tolerance or VIF, which build in the regressing of each independent on all the others.

Tolerance. Tolerance is 1 - R2 for the regression of that independent variable on all the other independents, ignoring the dependent. There will be as many tolerance coefficients as there are independents. The higher the intercorrelation of the independents, the more the tolerance will approach zero. As a rule of thumb, if tolerance is less than .20, a problem with multicollinearity is indicated. When tolerance is close to 0 there is high multicollinearity of that variable with other independents and the b and beta coefficients will be unstable. The more the multicollinearity, the lower the tolerance and the larger the standard error of the regression coefficients. Tolerance is part of the denominator in the formula for calculating the confidence limits on the b (partial regression) coefficient.

Variance-inflation factor, VIF. VIF is the variance inflation factor, which is simply the reciprocal of tolerance. Therefore, when VIF is high there is high multicollinearity and instability of the b and beta coefficients. VIF and tolerance are found in the SPSS and SAS output section on collinearity statistics.

Condition indices and variance proportions. Condition indices are used to flag excessive collinearity in the data. A condition index over 30 suggests serious collinearity problems and an index over 15 indicates possible collinearity problems. If a factor (component) has a high condition index, one looks in the variance proportions column for variables with a sizable proportion of their variance on that factor. Criteria for "sizable proportion" vary among researchers but the most common criterion is if two or more variables have a variance proportion of .50 or higher on a factor with a high condition index. If this is the case, these variables have high linear dependence and multicollinearity is a problem, with the effect that small data changes or arithmetic errors may translate into very large changes or errors in the regression analysis. Note that it is possible for the rule of thumb for condition indices (no index over 30) to indicate multicollinearity, even when the rules of thumb for tolerance > .20 or VIF < 4 suggest no multicollinearity (these two cutoffs are not exact reciprocals: a tolerance of .20 corresponds to a VIF of 5). Computationally, a "singular value" is the square root of an eigenvalue, and "condition indices" are the ratio of the largest singular value to each other singular value.
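
The formulas in Q-2 through Q-4 can be checked numerically. The sketch below computes R2, adjusted R2, tolerance, and VIF in plain Python. It assumes exactly two independents, in which case the auxiliary regression of one independent on the other reduces to a bivariate one and its R2 equals the squared Pearson correlation; with more independents, each would have to be regressed on all the others. All function names and numbers are hypothetical.

```python
from math import sqrt


def r_squared(y, y_hat):
    """R2 = 1 - SSE/SST."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error sum of squares
    sst = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
    return 1 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def pearson_r(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(ss_a * ss_b)


def tolerance_two_independents(x1, x2):
    """Tolerance = 1 - R2 of one independent regressed on the other (two-independent case only)."""
    return 1 - pearson_r(x1, x2) ** 2


def vif(tolerance):
    """VIF is the reciprocal of tolerance."""
    return 1 / tolerance


y = [1, 2, 3, 4, 5]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]           # hypothetical regression predictions
r2 = r_squared(y, y_hat)
print(r2, adjusted_r_squared(r2, n=5, k=2))  # about 0.99 and 0.98

tol = tolerance_two_independents([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])
print(tol, vif(tol))                         # about 0.324 and 3.08
```

With only five cases and two independents the adjustment is already visible: R2 of about .99 drops to about .98.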

In SPSS or SAS, select Analyze, Regression, Linear, and check Collinearity diagnostics to get condition indices.

Q-5 What is homoscedasticity?
Ans: Homoscedasticity: The researcher should test to assure that the residuals are dispersed randomly throughout the range of the estimated dependent. Put another way, the variance of residual error should be constant for all values of the independent(s). If not, separate models may be required for the different ranges. Also, when the homoscedasticity assumption is violated "conventionally computed confidence intervals and conventional t-tests for OLS estimators can no longer be justified". However, moderate violations of homoscedasticity have only minor impact on regression estimates.

Nonconstant error variance can be observed by requesting a simple residual plot (a plot of residuals on the Y axis against predicted values on the X axis). A homoscedastic model will display a cloud of dots, whereas lack of homoscedasticity will be characterized by a pattern such as a funnel shape, indicating greater error as the dependent increases. Nonconstant error variance can indicate the need to respecify the model to include omitted independent variables. Lack of homoscedasticity may mean (1) there is an interaction effect between a measured independent variable and an unmeasured independent variable not in the model, or (2) that some independent variables are skewed while others are not.

One method of dealing with heteroscedasticity is to select weighted least squares regression. This causes cases with smaller residuals to be weighted more in calculating the b coefficients. Square root, log, and reciprocal transformations of the dependent may also reduce or eliminate lack of homoscedasticity.
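
The funnel pattern described above can also be given a rough numeric form. The sketch below is an informal stand-in for eyeballing the residual plot, not a formal test: it sorts cases by predicted value, splits them into a lower and an upper half, and compares the mean squared residual in each half. The helper name residual_spread_ratio and the data are hypothetical.

```python
def residual_spread_ratio(predicted, residuals):
    """Mean squared residual in the upper half of predicted values divided by
    that in the lower half. Values far above 1 suggest a funnel that widens
    as the estimated dependent increases."""
    pairs = sorted(zip(predicted, residuals))  # order cases by predicted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]

    def mean_square(values):
        return sum(v * v for v in values) / len(values)

    return mean_square(high) / mean_square(low)


predicted = [1, 2, 3, 4, 5, 6]
even_residuals = [1, -1, 1, -1, 1, -1]            # same spread everywhere
funnel_residuals = [0.1, -0.1, 0.1, -2, 2, -2]    # spread grows with the prediction

print(residual_spread_ratio(predicted, even_residuals))    # 1.0
print(residual_spread_ratio(predicted, funnel_residuals))  # much larger than 1
```

A ratio near 1 is what a homoscedastic cloud of dots would give; how large a ratio counts as a problem is a judgment call that this sketch does not settle.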

Discriminant Analysis
Q-1 What is Discriminant Analysis?
Ans: Discriminant function analysis, a.k.a. discriminant analysis or DA, is used to classify cases into the values of a categorical dependent, usually a dichotomy. If discriminant function analysis is effective for a set of data, the classification table of correct and incorrect estimates will yield a high percentage correct. Discriminant function analysis is found in SPSS/SAS under Analyze, Classify, Discriminant. One gets DA or MDA from this same menu selection, depending on whether the specified grouping variable has two categories or more than two.

Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a cousin of multiple analysis of variance (MANOVA), sharing many of the same assumptions and tests. MDA is used to classify a categorical dependent which has more than two categories, using as predictors a number of interval or dummy independent variables. MDA is sometimes also called discriminant factor analysis or canonical discriminant analysis.

There are several purposes for DA and/or MDA:
• To classify cases into groups using a discriminant prediction equation.
• To test theory by observing whether cases are classified as predicted.
• To investigate differences between or among groups.
• To determine the most parsimonious way to distinguish among groups.
• To assess the relative importance of the independent variables in classifying the dependent variable.
• To infer the meaning of MDA dimensions which distinguish groups, based on discriminant loadings.

Discriminant analysis has two steps: (1) an F test (Wilks' lambda) is used to test if the discriminant model as a whole is significant, and (2) if the F test shows significance, then the individual independent variables are assessed to see which differ significantly in mean by group and these are used to classify the dependent variable.

Discriminant analysis shares all the usual assumptions of correlation, requiring linear and homoscedastic relationships, and untruncated interval or near interval data. Like multiple regression, it also assumes proper model specification (inclusion of all important independents and exclusion of extraneous variables). DA also assumes the dependent variable is a true dichotomy since data which are forced into dichotomous coding are truncated, attenuating correlation.

DA is an earlier alternative to logistic regression, which is now frequently used in place of DA as it usually involves fewer violations of assumptions (independent variables needn't be normally distributed, linearly related, or have equal within-group variances).

Logistic regression is also robust, handles categorical as well as continuous variables, and has coefficients which many find easier to interpret. Logistic regression is preferred when data are not normal in distribution or group sizes are very unequal. See also the separate topic on multiple discriminant function analysis (MDA) for dependents with more than two categories.

Few Definitions and Concepts

Discriminating variables: These are the independent variables, also called predictors.

The criterion variable: This is the dependent variable, also called the grouping variable in SPSS. It is the object of classification efforts.

Discriminant function: A discriminant function, also called a canonical root, is a latent variable which is created as a linear combination of discriminating (independent) variables, such that L = b1x1 + b2x2 + ... + bnxn + c, where the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant. This is analogous to multiple regression, but the b's are discriminant coefficients which maximize the distance between the group means on the criterion (dependent) variable. Note that the foregoing assumes the discriminant function is estimated using ordinary least-squares, the traditional method, but there is also a version involving maximum likelihood estimation.

Number of discriminant functions. There is one discriminant function for 2-group discriminant analysis, but for higher order DA, the number of functions (each with its own cut-off value) is the lesser of (g - 1), where g is the number of categories in the grouping variable, or p, the number of discriminating (independent) variables. Each discriminant function is orthogonal to the others. A dimension is simply one of the discriminant functions when there are more than one.

The eigenvalue, also called the characteristic root of each discriminant function, reflects the ratio of importance of the dimensions which classify cases of the dependent variable. There is one eigenvalue for each discriminant function. For two-group DA, there is one discriminant function and one eigenvalue, which accounts for 100% of the explained variance. If there is more than one discriminant function, the first will be the largest and most important, the second next most important in explanatory power, and so on. The eigenvalues assess relative importance because they reflect the percents of variance explained in the dependent variable, cumulating to 100% for all functions. That is, in multiple discriminant analysis, the ratio of the eigenvalues indicates the relative discriminating power of the discriminant functions. If the ratio of two eigenvalues is 1.4, for instance, then the first discriminant function accounts for 40% more between-group variance in the dependent categories than does the second discriminant function.
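
For a concrete sense of the arithmetic, the short sketch below computes the number of discriminant functions as the lesser of (g - 1) and p, and expresses each eigenvalue as a percent of the summed eigenvalues. The function names and the eigenvalues 1.4 and 1.0 are made up for illustration.

```python
def number_of_functions(g, p):
    """Lesser of (g - 1) and p, where g = number of groups, p = number of discriminating variables."""
    return min(g - 1, p)


def eigenvalue_shares(eigenvalues):
    """Each eigenvalue as a percent of the sum of all eigenvalues (cumulating to 100%)."""
    total = sum(eigenvalues)
    return [100 * value / total for value in eigenvalues]


print(number_of_functions(g=2, p=5))  # 1: two-group DA has a single function
print(number_of_functions(g=4, p=2))  # 2: limited by the number of variables

shares = eigenvalue_shares([1.4, 1.0])  # hypothetical eigenvalues with ratio 1.4
print(shares)                           # roughly 58.3% and 41.7%
```

The ratio of the two shares is still 1.4, which is the sense in which the first function carries 40% more discriminating power than the second.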

The relative percentage of a discriminant function equals a function's eigenvalue divided by the sum of all eigenvalues of all discriminant functions in the model. Thus it is the percent of discriminating power for the model associated with a given discriminant function. Relative % is used to tell how many functions are important. One may find that only the first two or so eigenvalues are of importance. Eigenvalues are part of the default output in SPSS (Analyze, Classify, Discriminant).

The canonical correlation, R*, is a measure of the association between the groups formed by the dependent and the given discriminant function. When R* is zero, there is no relation between the groups and the function. When the canonical correlation is large, there is a high correlation between the discriminant functions and the groups. An R* of 1.0 indicates that all of the variability in the discriminant scores can be accounted for by that dimension. Note that relative % and R* do not have to be correlated. R* is used to tell how much each function is useful in determining group differences. Note that for two-group DA, the canonical correlation is equivalent to the Pearsonian correlation of the discriminant scores with the grouping variable.

Discriminant coefficients. Unstandardized discriminant coefficients are used in the formula for making the classifications in DA, much as b coefficients are used in regression in making predictions. That is, discriminant coefficients are the regression-like b coefficients in the discriminant function, in the form L = b1x1 + b2x2 + ... + bnxn + c, where L is the latent variable formed by the discriminant function, the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant. The constant plus the sum of products of the unstandardized coefficients with the observations yields the discriminant scores. The discriminant function coefficients are partial coefficients, reflecting the unique contribution of each variable to the classification of the criterion variable. The standardized discriminant coefficients, like beta weights in regression, are used to assess the relative classifying importance of the independent variables.

The discriminant score, also called the DA score, is the value resulting from applying a discriminant function formula to the data for a given case. The Z score is the discriminant score for standardized data. To get discriminant scores in SPSS, select Analyze, Classify, Discriminant; click the Save button; check "Discriminant scores". One can also view the discriminant scores by clicking the Classify button and checking "Casewise results."

Cutoff: If the discriminant score of the function is less than or equal to the cutoff, the case is classed as 0, or if above it is classed as 1. When group sizes are equal, the cutoff is the mean of the two centroids (for two-group DA). If the groups are unequal, the cutoff is the weighted mean.
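
The scoring and cutoff rules can be traced with a small example. The sketch below takes the unstandardized coefficients as given (it does not estimate them), computes discriminant scores L = b1x1 + ... + bnxn + c, finds the two group centroids, and classifies a case against the equal-group-size cutoff. The coefficients, the data, and the function names are hypothetical, and group 1 is assumed to have the higher centroid. The unequal-size (weighted) cutoff is deliberately left out.

```python
def discriminant_score(case, b, c):
    """Constant plus the sum of products of the coefficients with the observations."""
    return sum(bj * xj for bj, xj in zip(b, case)) + c


def centroid(group, b, c):
    """Mean discriminant score of the cases in one group."""
    scores = [discriminant_score(case, b, c) for case in group]
    return sum(scores) / len(scores)


def equal_size_cutoff(group0, group1, b, c):
    """Mean of the two centroids; only valid when the two groups are the same size."""
    if len(group0) != len(group1):
        raise ValueError("this sketch covers equal group sizes only")
    return (centroid(group0, b, c) + centroid(group1, b, c)) / 2


def classify(case, b, c, cutoff):
    """Class 0 if the score is at or below the cutoff, otherwise class 1."""
    return 0 if discriminant_score(case, b, c) <= cutoff else 1


b, c = [1.0, 0.5], -3.0            # hypothetical unstandardized coefficients and constant
group0 = [(1, 2), (2, 2)]          # centroid -0.5
group1 = [(4, 2), (5, 4)]          # centroid 3.0
cutoff = equal_size_cutoff(group0, group1, b, c)

print(cutoff)                          # 1.25
print(classify((3, 1), b, c, cutoff))  # 0
print(classify((4, 4), b, c, cutoff))  # 1
```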

Standardized discriminant coefficients, also termed the standardized canonical discriminant function coefficients, are used to compare the relative importance of the independent variables, much as beta weights are used in regression. Note that importance is assessed relative to the model being analyzed. Addition or deletion of variables in the model can change discriminant coefficients markedly. As with regression, since these are partial coefficients, only the unique explanation of each independent is being compared, not considering any shared explanation. Also, if there are more than two groups of the dependent, the standardized discriminant coefficients do not tell the researcher between which groups the variable is most or least discriminating. For this purpose, group centroids and factor structure are examined.

Q-2 What is Wilks' Lambda?
Ans: Wilks' lambda is used to test the significance of the discriminant function as a whole. In SPSS, the "Wilks' Lambda" table will have a column labeled "Test of Function(s)" and a row labeled "1 through n" (where n is the number of discriminant functions). The "Sig." level for this row is the significance level of the discriminant function as a whole. A significant lambda means one can reject the null hypothesis that the two groups have the same mean discriminant function scores. Wilks' lambda is part of the default output in SPSS (Analyze, Classify, Discriminant). In SPSS, this use of Wilks' lambda is in the "Wilks' lambda" table of the output section on "Summary of Canonical Discriminant Functions."

The ANOVA table for discriminant scores is another overall test of the DA model. It is an F test, where a "Sig." p value < .05 means the model differentiates discriminant scores between the groups significantly better than chance (than a model with just the constant). It is obtained in SPSS by asking for Analyze, Compare Means, One-Way ANOVA, using discriminant scores from DA (which SPSS will label Dis1_1 or similar) as dependent.

Wilks' lambda also can be used to test which independents contribute significantly to the discriminant function. The smaller the lambda for an independent variable, the more that variable contributes to the discriminant function. Lambda varies from 0 to 1, with 0 meaning group means differ (thus the more the variable differentiates the groups), and 1 meaning all group means are the same. The F test of Wilks' lambda shows which variables' contributions are significant. Wilks' lambda is sometimes called the U statistic. In SPSS, this use of Wilks' lambda is in the "Tests of equality of group means" table in DA output.

Q-3 What is a Confusion or Classification Matrix?
Ans:

The classification table, also called a classification matrix, or a confusion, assignment, or prediction matrix or table, is used to assess the performance of DA. This is simply a table in which the rows are the observed categories of the dependent and the columns are the predicted categories of the dependent. When prediction is perfect, all cases will lie on the diagonal. The percentage of cases on the diagonal is the percentage of correct classifications. This percentage is called the hit ratio.

Expected hit ratio. Note that the hit ratio must be compared not to zero but to the percent that would have been correctly classified by chance alone. For two-group discriminant analysis with a 50-50 split in the dependent variable, the expected percent is 50%. For unequally split 2-way groups of different sizes, the expected percent is computed in the "Prior Probabilities for Groups" table in SPSS, by multiplying the prior probabilities times the group size, summing for all groups, and dividing the sum by N.
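
A minimal sketch of the classification table and both hit ratios follows. It builds the table from observed and predicted categories, takes the diagonal share as the hit ratio, and computes the chance-expected ratio exactly as described above (prior probability times group size, summed, divided by N). The function names and the eight example cases are hypothetical.

```python
def classification_table(observed, predicted, categories):
    """Rows are observed categories, columns are predicted categories."""
    table = {o: {p: 0 for p in categories} for o in categories}
    for o, p in zip(observed, predicted):
        table[o][p] += 1
    return table


def hit_ratio(table):
    """Share of cases on the diagonal (correct classifications)."""
    n = sum(sum(row.values()) for row in table.values())
    return sum(table[cat][cat] for cat in table) / n


def expected_hit_ratio(priors, group_sizes):
    """Sum of prior probability times group size over all groups, divided by N."""
    n = sum(group_sizes)
    return sum(p * size for p, size in zip(priors, group_sizes)) / n


observed = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 1, 1, 0, 1, 0]
table = classification_table(observed, predicted, categories=[0, 1])

print(table)                                      # {0: {0: 3, 1: 1}, 1: {0: 1, 1: 3}}
print(hit_ratio(table))                           # 0.75
print(expected_hit_ratio([0.5, 0.5], [50, 50]))   # 0.5 for a 50-50 split
print(expected_hit_ratio([0.7, 0.3], [70, 30]))   # about 0.58 for a 70-30 split
```

In the last line the priors are set equal to the group proportions, which is an assumption of this example; SPSS lets the researcher choose the priors.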

Cluster Analysis
Q-1 What is Cluster Analysis?
Ans: Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation. Other techniques, such as latent class analysis and Q-mode factor analysis, also perform clustering and are discussed separately.

SPSS offers three general approaches to cluster analysis. Hierarchical clustering allows users to select a definition of distance, then select a linking method of forming clusters, then determine how many clusters best suit the data. In k-means clustering the researcher specifies the number of clusters in advance, then calculates how to assign cases to the K clusters. K-means clustering is much less computer-intensive and is therefore sometimes preferred when datasets are very large (ex., > 1,000). Finally, two-step clustering creates pre-clusters, then it clusters the pre-clusters.

Key Concepts and Terms

Cluster formation is the selection of the procedure for determining how clusters are created, and how the calculations are done. In agglomerative hierarchical clustering every case is initially considered a cluster, then the two cases with the lowest distance (or highest similarity) are combined into a cluster. The case with the lowest distance to either of the first two is considered next. If that third case is closer to a fourth case than it is to either of the first two, the third and fourth cases become the second two-case cluster; if not, the third case is added to the first cluster. The process is repeated, adding cases to existing clusters, creating new clusters, or combining clusters to get to the desired final number of clusters. There is also divisive clustering, which works in the opposite direction, starting with all cases in one large cluster. Hierarchical cluster analysis, discussed below, can use either agglomerative or divisive clustering strategies.

Similarity and Distance

Distance. The first step in cluster analysis is establishment of the similarity or distance matrix. This matrix is a table in which both the rows and columns are the units of analysis and the cell entries are a measure of similarity or distance for any pair of cases. Euclidean distance is the most common distance measure. A given pair of cases is plotted on two variables, which form the x and y axes. The Euclidean distance is the square root of the sum of the square of the x difference plus the square of the y difference.
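
The distance matrix described above can be sketched in a few lines. The example computes Euclidean and squared Euclidean distance for pairs of cases and arranges them in a table whose rows and columns are both the cases. The function names and the three cases are made up for illustration.

```python
from math import sqrt


def euclidean(a, b):
    """Square root of the summed squared differences across variables."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def squared_euclidean(a, b):
    """Same sum of squared differences, without taking the square root."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def proximity_matrix(cases, measure=euclidean):
    """Rows and columns are both the cases; each cell is the distance for that pair."""
    return [[measure(a, b) for b in cases] for a in cases]


cases = [(0, 0), (3, 4), (6, 8)]      # hypothetical cases measured on x and y
for row in proximity_matrix(cases):
    print(row)
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]

print(squared_euclidean((0, 0), (3, 4)))  # 25
```

The matrix is symmetric with zeros on the diagonal, since each case is at distance 0 from itself.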

Euclidean distance is the familiar formula from high school geometry for the length of the third side (the hypotenuse) of a right triangle. It is common to use the square of Euclidean distance as squaring removes the sign. When two or more variables are used to define distance, the one with the larger magnitude will dominate, so to avoid this it is common to first standardize all variables.

There are a variety of different measures of inter-observation distances and inter-cluster distances to use as criteria when merging nearest clusters into broader groups or when considering the relation of a point to a cluster. Distance measures how far apart two observations are. Cases which are alike share a low distance. Similarity measures how alike two cases are. Cases which are alike share a high similarity.

In SPSS, similarity/distance measures are selected in the Measure area of the Method subdialog obtained by pressing the Method button in the Classify dialog. There are three measure pulldown menus, for interval, binary, and count data respectively. SPSS supports these interval distance measures: Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized. For count data, it supports chi-square or phi-square. For binary data, it supports Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams.

Similarity. SPSS supports a large number of similarity measures for interval data (Pearson correlation or cosine) and for binary data (Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion).

Absolute values. Since for Pearson correlation, high negative as well as high positive values indicate similarity, the researcher normally selects absolute values. This can be done by checking the absolute value checkbox in the Transform Measures area of the Methods subdialog (invoked by pressing the Methods button) of the main Cluster dialog.

Summary. In SPSS, proximity matrices are selected under Analyze, Classify, Hierarchical Cluster; Statistics button; check Proximity matrix. The proximity matrix table in the output shows the actual distances or similarities computed for any pair of cases.

Method. Under the Method button in the SPSS Classify dialog, the pull-down Method selection determines how cases or clusters are combined at each step. Different methods will result in different cluster patterns. SPSS offers these method choices:

Nearest neighbor. In this single linkage method, the distance between two clusters is the distance between their closest neighboring points.

Furthest neighbor. In this complete linkage method, the distance between two clusters is the distance between their two furthest member points.

UPGMA (unweighted pair-group method using averages). The distance between two clusters is the average distance between all inter-cluster pairs. UPGMA is generally preferred over nearest or furthest neighbor methods since it is based on information about all inter-cluster pairs, not just the nearest or furthest ones. SPSS labels this "between-groups linkage," and it is the default method in SPSS.

Average linkage within groups is the mean distance between all possible inter- or intra-cluster pairs. The average distance between all pairs in the resulting cluster is made to be as small as possible. This method is therefore appropriate when the research purpose is homogeneity within clusters. SPSS labels this "within-groups linkage."

Ward's method calculates the sum of squared Euclidean distances from each case in a cluster to the cluster mean on all variables. The cluster to be merged is the one which will increase the sum the least. This is an ANOVA-type approach and preferred by some researchers for this reason.

Centroid method. The cluster to be merged is the one with the smallest sum of Euclidean distances between cluster means for all variables.

Median method. Clusters are weighted equally regardless of group size when computing centroids of two clusters being combined. This method also uses Euclidean distance as the proximity measure.

Correlation of items can be used as a similarity measure. One transposes the normal data table in which columns are variables and rows are cases. By using columns as cases and rows as variables instead, the correlation is between cases and these correlations may constitute the cells of the similarity matrix.

Binary matching is another type of similarity measure, where 1 indicates a match and 0 indicates no match between any pair of cases. There are multiple matched attributes and the similarity score is the number of matches divided by the number of attributes being matched. Note that it is usual in binary matching to have several attributes because there is a risk that when the number of attributes is small, they may be orthogonal to (uncorrelated with) one another, and clustering will be indeterminate.

Summary measures assess how the clusters differ from one another.

Means and variances. A table of means and variances of the clusters with respect to the original variables shows how the clusters differ on the original variables. SPSS does not make this available in the Cluster dialog, but one can click the Save button, which will save the cluster number for each case (or numbers if multiple solutions are requested). Then in Analyze, Compare Means, Means the researcher can use the cluster number as the grouping variable to compare differences of means on any other continuous variable in the dataset.

Linkage tables show the relation of the cases to the clusters.

Cluster membership table. This shows cases as rows, where columns are alternative numbers of clusters in the solution (as specified in the "Range of Solution" option in the Cluster membership group in SPSS, under the Statistics button). Cell entries show the number of the cluster to which the case belongs. From this table, one can see which cases are in which groups, depending on the number of clusters in the solution.

Agglomeration Schedule. Agglomeration schedule is a choice under the Statistics button for Hierarchical Cluster in the SPSS Cluster dialog. In this table, the rows are stages of clustering, numbered from 1 to (n - 1). The (n - 1)th stage includes all the cases in one cluster. There are two "Cluster Combined" columns, giving the case or cluster numbers for combination at each stage. In agglomerative clustering using a distance measure like Euclidean distance, stage 1 combines the two cases which have lowest proximity (distance) score. The cluster number goes by the lower of the cases or clusters combined, where cases are initially numbered 1 to n. For instance, at Stage 1, cases 3 and 18 might be combined, resulting in a cluster labeled 3. Later cluster 3 and case 2 might be combined, resulting in a cluster labeled 2.

The researcher looks at the "Coefficients" column of the agglomeration schedule and notes when the proximity coefficient jumps up and is not a small increment from the one before (or when the coefficient reaches some theoretically important level). Note that for distance measures, low is good, meaning the cases are alike; for similarity measures, high coefficients mean cases are alike. The bigger the distance coefficient or the smaller the similarity coefficient, the more clustering involved combining unlike entities, which may be undesirable. After the stopping stage is determined in this manner, the researcher can work backward to determine how many clusters there are and which cases belong to which clusters (but it is easier just to get this information from the cluster membership table). Note, though, that SPSS will not stop on this basis but instead will compute the range of solutions (ex., 2 to 4 clusters) requested by the researcher in the Cluster Membership group of the Statistics button in the Hierarchical Clustering dialog.

Linkage plots show similar information in graphic form. When there are relatively few cases, icicle plots or dendrograms provide the same linkage information in an easier format.

Icicle plots are usually horizontal, showing cases as rows and number of clusters in the solution as columns. If there are few cases, vertical icicle plots may be plotted, with cases as columns. This is a visual way of representing information on the agglomeration schedule, but without the proximity coefficient information. Reading from the last column right to left (horizontal icicle plots) or last row bottom to top (vertical icicle plots), the researcher can see how agglomeration proceeded. The last/bottom row will show all the cases in separate one-case clusters. This is the n-cluster solution. The next-to-last/bottom column/row will show the (n - 1)-cluster solution, with two cases combined into one cluster. Subsequent columns/rows show further clustering steps. Row 1 (vertical icicle plots) or column 1 (horizontal icicle plots) will show all cases in a single cluster.

Dendrograms, also called tree diagrams, show the relative size of the proximity coefficients at which cases were combined. Trees are usually depicted horizontally, not vertically, with each row representing a case on the Y axis, while the X axis is a rescaled version of the proximity coefficients.

the researcher may wish then to analyze the entire dataset with k-means clustering (aka. Cases with low distance/high similarity are close together. with a line linking them a short distance from the left of the dendogram.. as is the usual case. Classify. click the Plots button. indicating the cases/clusters were agglomerated even though much less alike. Backward clustering.. how clusters are aggregated (or divided). but starting with a low cut-off and working toward a high cut-off. 15 .. the Quick Cluster procedure: Analyze.That is. specifying that number of clusters. . After using hierarchical clustering to determine the desired number of clusters. Cluster. Continue. and how many clusters are needed. the clusters are nested rather than being mutually exclusive.9). larger clusters created at later stages may contain smaller clusters created at earlier stages of agglomeration. select variables. In hierarchical clustering. in hierarchical clustering. One may wish to use the hierarchical cluster procedure on a sample of cases (ex. To accomplish hierarchical clustering. also called agglomerative clustering: Small clusters are formed by using a high similarity index cut-off (ex. In the Hierarchical Cluster dialog. also called divisive clustering. The merging of clusters is visualized using a tree format. Hierarchical Cluster. select Analyze. is the same idea. the linking line is to the right of the dendogram the linkage occurs a high distance coefficient. Classify.proximity coefficients. Hierarchical clustering generates all possible clusters of sizes 1. select Proximity Matrix. The optimum number of clusters depends on the research purpose.K. K-Means Cluster Analysis). Clustering variables." In SPSS. check the Dendogram checkbox. select Range of Solutions in the Cluster Membership group. Forward and backward methods need not generate the same results. If a similarity measure is used rather than a distance measure. Cases showing low distance are close. 
on the other hand. 200) to inspect results for different numbers of clusters. select Cases in the Cluster group click Statistics. in the Cluster group. OK.. but is used only for relatively small samples. Forward clustering. indicating that they are agglomerated into a cluster at a low distance coefficient.. the rescaling of the X axis still produces a diagram with linkages involving high alikeness to the left and low alikeness to the right. select Analyze. indicating alikeness. Then this cut-off is relaxed to establish broader and broader clusters in stages until all cases are in a single cluster at some low similarity index cut-off. In SPSS. Hierarchical Cluster. SPSS calls hierarchical clustering the "Cluster procedure. specify the number of clusters (typically 3 to 6). Identifying "typical" types may call for few clusters and identifying "exceptional" types may call for many clusters. When. in order to cluster variables. the researcher must specify how similarity or distance is defined. the researcher may selected Variable rather than the usual Cases. What is Hierarchical Cluster Analysis ? Hierarchical clustering is appropriate for smaller samples (typically < 250).
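The agglomeration schedule described in this section can be illustrated with a small Python sketch. It is not SPSS's implementation: the function name agglomeration_schedule, the five example cases, and the choice of average linkage between groups with Euclidean distance are all assumptions made for illustration. The sketch follows the labeling convention from the text, in which a merged cluster keeps the lower of the two numbers combined.

```python
import math


def agglomeration_schedule(points):
    """Return a list of (stage, cluster_a, cluster_b, coefficient) rows.

    Cases are numbered from 1. A merged cluster keeps the lower of the two
    labels combined. The coefficient is the average Euclidean distance
    between members of the two clusters (average linkage between groups,
    an assumption made for this sketch).
    """
    # Every case starts as its own one-case cluster.
    clusters = {i + 1: [p] for i, p in enumerate(points)}
    schedule = []
    stage = 0
    while len(clusters) > 1:
        best = None
        labels = sorted(clusters)
        # Find the pair of clusters with the lowest proximity coefficient.
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                d = sum(math.dist(p, q) for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        stage += 1
        # Merge b into a; since a < b, the lower label survives.
        clusters[a].extend(clusters.pop(b))
        schedule.append((stage, a, b, round(d, 3)))
    return schedule


# Hypothetical data: five cases measured on two variables.
cases = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)]
schedule = agglomeration_schedule(cases)
for row in schedule:
    print(row)
```

With n = 5 cases the schedule has n - 1 = 4 stages. In this example the coefficient jumps sharply after stage 2, which is the kind of jump the researcher looks for in the "Coefficients" column when deciding where to stop.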

What is K-means Cluster Analysis ? K-means cluster analysis uses Euclidean distance. The researcher must specify in advance the desired number of clusters, K. Initial cluster centers are chosen in a first pass of the data, then each additional iteration groups observations based on nearest Euclidean distance to the mean of the cluster. Cluster centers change at each pass. The process continues until cluster means do not shift more than a given cut-off value or the iteration limit is reached.

Cluster centers are the average value on all clustering variables of each cluster's members. The "Initial cluster centers" table, in spite of its title, gives the average value of each variable for each cluster for the k well-spaced cases which SPSS selects for initialization purposes when no initial file is supplied. The "Final cluster centers" table in SPSS output gives the same thing for the last iteration step. The "Iteration history" table shows the change in cluster centers when the usual iterative approach is taken. When the change drops below a specified cutoff, the iterative process stops and cases are assigned to clusters according to which cluster center they are nearest.

Large datasets are possible with K-means clustering, unlike hierarchical clustering, because K-means clustering does not require prior computation of a proximity matrix of the distance/similarity of every case with every other case.

Method. The default method is "Iterate and classify," under which an iterative process is used to update cluster centers, then cases are classified based on the updated centers. However, SPSS supports a "Classify only" method, under which cases are immediately classified based on the initial cluster centers, which are not updated.

Agglomerative K-means clustering. Normally in K-means clustering, a given case may be assigned to a cluster, then reassigned to a different cluster as the algorithm unfolds. However, in agglomerative K-means clustering, the solution is constrained to force a given case to remain in its initial cluster.

SPSS: Analyze, Classify, K-Means Cluster Analysis; enter variables in the Variables: area; optionally, enter a variable in the "Label cases by:" area; enter "Number of clusters:"; choose Method: Iterate and classify, or just Classify. Unlike hierarchical clustering, there is no option for "Range of solutions"; instead you must re-run K-means clustering, asking for a different number of clusters.

Iterate button. Optionally, you may press the Iterate button and set the number of iterations and the convergence criterion. The default maximum number of iterations in SPSS is 10. For the convergence criterion, by default, iterations terminate if the largest change in any cluster center is less than 2% of the minimum distance between initial centers (or if the maximum number of iterations has been reached). To override this default, enter a positive number less than or equal to 1 in the convergence box. There is also a "Use running means" checkbox which, if checked, will cause the cluster centers to be updated after each case is classified, not the default, which is after the entire set of cases is classified.

Save button: Optionally, you may press the Save button to save the final cluster number of each case as an added column in your dataset (labeled QCL_1), and/or you may save the Euclidean distance between each case and its cluster center (labeled QCL_2) by checking "Distance from cluster center."

Options button: Optionally, you may press the Options button to select statistics or missing values options. There are three statistics options: "Initial cluster centers" (gives the initial variable means for each cluster); ANOVA table (ANOVA F-tests for each variable, but as the F tests are only descriptive, the resulting probabilities are for exploratory purposes only; nonetheless, non-significant variables might be dropped as not contributing to the differentiation of clusters); and "Cluster information for each case" (gives each case's final cluster assignment and the Euclidean distance between the case and the cluster center; also gives the Euclidean distance between final cluster centers).

Getting different clusters. Sometimes the researcher wishes to experiment to get different clusters, as when the "Number of cases in each cluster" table shows highly imbalanced clusters and/or clusters with very few members. Different results may occur by setting different initial cluster centers from file (see above), by changing the number of clusters requested, or even by presenting the data file in different case order.

What is Two-Step Cluster Analysis ? Two-step cluster analysis groups cases into pre-clusters which are treated as single cases. Standard hierarchical clustering is then applied to the pre-clusters in the second step. This is the method used when one or more of the variables are categorical (not interval or dichotomous). Also, since it is a method requiring neither a proximity table like hierarchical classification nor an iterative process like K-means clustering, but rather is a one-pass-through-the-dataset method, it is recommended for very large datasets.

Cluster feature tree. The preclustering stage employs a CFtree with nodes leading to leaf nodes. Cases start at the root node and are channeled toward nodes and eventually leaf nodes which match them most closely. If there is no adequate match, the case is used to start its own leaf node. It can happen that the CFtree fills up and cannot accept new leaf entries in a node, in which case it is split using the most-distant pair in the node as seeds. If this recursive process grows the CFtree beyond maximum size, the threshold distance is increased and the tree is rebuilt, allowing new cases to be input. The process continues until all the data are read. Click the Advanced button in the Options button dialog to set threshold distances, maximum levels, and maximum branches per leaf node manually.

Proximity. When one or more of the variables are categorical, log-likelihood is the distance measure used, with cases categorized under the cluster which is associated with the largest log-likelihood. If variables are all continuous, Euclidean distance is used, with cases categorized under the cluster which is associated with the smallest Euclidean distance.

Number of clusters. By default SPSS determines the number of clusters using the change in BIC (the Schwarz Bayesian Criterion): when BIC change is small, it stops and selects as many clusters as thus far created. It is also possible to have this done based on changes in AIC (the Akaike Information Criterion), or simply to tell SPSS how many clusters are wanted. The researcher can also ask for a range of solutions, such as 3-5 clusters. The "Autoclustering statistics" table in SPSS output gives, for example, BIC and BIC change for all solutions.

SPSS. Choose Analyze, Classify, Two-Step Cluster; select your categorical and continuous variables; if desired, click Plots and select the plots wanted; click Output and select the statistics wanted (descriptive statistics, cluster frequencies, AIC or BIC); Continue.
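The "Iterate and classify" logic of K-means clustering described earlier in this section can be sketched in a few lines of Python. This is an illustrative sketch, not the SPSS Quick Cluster code: the function name k_means, the six example cases, and the hand-picked initial centers are assumptions. The defaults of 10 iterations and a convergence value of 0.02 (2% of the minimum distance between initial centers) are taken from the text above.

```python
import math


def k_means(cases, initial_centers, max_iter=10, convergence=0.02):
    """Iterate and classify: update centers, then assign cases to clusters.

    Requires at least two initial centers. Returns the final centers, the
    1-based cluster number of each case, and the iterations actually used.
    """
    centers = [tuple(c) for c in initial_centers]
    # Stopping rule: largest center shift below a fraction of the minimum
    # distance between the initial centers.
    min_initial = min(math.dist(a, b)
                      for i, a in enumerate(centers)
                      for b in centers[i + 1:])
    cutoff = convergence * min_initial

    def nearest(case):
        return min(range(len(centers)),
                   key=lambda k: math.dist(case, centers[k]))

    for iteration in range(1, max_iter + 1):
        # Group every case with its nearest current center.
        members = [[] for _ in centers]
        for case in cases:
            members[nearest(case)].append(case)
        # New center = mean of each variable over the cluster's members
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(col) / len(group) for col in zip(*group)) if group else old
            for group, old in zip(members, centers)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < cutoff:
            break
    # Final classification against the converged centers.
    assignment = [nearest(case) + 1 for case in cases]
    return centers, assignment, iteration


# Hypothetical data: six cases on two variables, K = 2.
cases = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment, iterations = k_means(cases, [(1, 1), (8, 8)])
print(centers, assignment, iterations)
```

Because the number of clusters is fixed by the initial centers, trying a different K means calling the function again with a different set of centers, which mirrors the need to re-run K-means clustering in SPSS.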

Factor Analysis

Q-1 What is Factor Analysis?
• Factor analysis is a correlational technique to determine meaningful clusters of shared variance.
• Factor analysis begins with a large number of variables and then tries to reduce the interrelationships amongst the variables to a small number of clusters or factors.
• Factor analysis finds relationships or natural connections where variables are maximally correlated with one another and minimally correlated with other variables, and then groups the variables accordingly.
• After this process has been done many times, a pattern of relationships or factors emerges that captures the essence of all of the data.
• Factor analysis should be driven by a researcher who has a deep and genuine interest in relevant theory in order to get optimal value from choosing the right type of factor analysis and interpreting the factor loadings.
• Summary: Factor analysis refers to a collection of statistical methods for reducing correlational data into a smaller number of dimensions or factors.

Key Concepts and Terms

Exploratory factor analysis (EFA) seeks to uncover the underlying structure of a relatively large set of variables. The researcher's à priori assumption is that any indicator may be associated with any factor. This is the most common form of factor analysis. There is no prior theory and one uses factor loadings to intuit the factor structure of the data.

Confirmatory factor analysis (CFA) seeks to determine if the number of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis of pre-established theory. Indicator variables are selected on the basis of prior theory and factor analysis is used to see if they load as predicted on the expected number of factors. The researcher's à priori assumption is that each factor (the number and labels of which may be specified à priori) is associated with a specified subset of indicator variables. A minimum requirement of confirmatory factor analysis is that one hypothesize beforehand the number of factors in the model, but usually also the researcher will posit expectations about which variables will load on which factors (Kim and Mueller, 1978b: 55). The researcher seeks to determine, for instance, if measures created to represent a latent variable really belong together.

Factor loadings: The factor loadings, also called component loadings in PCA, are the correlation coefficients between the variables (rows) and factors (columns). Analogous to Pearson's r, the squared factor loading is the percent of variance in that variable explained by the factor. To get the percent of variance in all the variables accounted for by each factor, add the sum of the squared factor loadings for that factor (column) and divide by the number of variables. (Note the number of variables equals the sum of their variances as the variance of a standardized variable is 1.) This is the same as dividing the factor's eigenvalue by the number of variables.

Communality, h2, is the squared multiple correlation for the variable as dependent using the factors as predictors. The communality measures the percent of variance in a given variable explained by all the factors jointly and may be interpreted as the reliability of the indicator. When an indicator variable has a low communality, the factor model is not working well for that indicator and possibly it should be removed from the model. However, communalities must be interpreted in relation to the interpretability of the factors. A communality of .75 seems high but is meaningless unless the factor on which the variable is loaded is interpretable, though it usually will be. A communality of .25 seems low but may be meaningful if the item is contributing to a well-defined factor. That is, what is critical is not the communality coefficient per se, but rather the extent to which the item plays a role in the interpretation of the factor, though often this role is greater when communality is high.

Eigenvalues: Also called characteristic roots. The eigenvalue for a given factor measures the variance in all the variables which is accounted for by that factor. The ratio of eigenvalues is the ratio of explanatory importance of the factors with respect to the variables. If a factor has a low eigenvalue, then it is contributing little to the explanation of variances in the variables and may be ignored as redundant with more important factors. Thus, eigenvalues measure the amount of variation in the total sample accounted for by each factor. Note that the eigenvalue is not the percent of variance explained but rather a measure of amount of variance in relation to total variance (since variables are standardized to have means of 0 and variances of 1, total variance is equal to the number of variables). SPSS will output a corresponding column titled '% of variance'. A factor's eigenvalue may be computed as the sum of its squared factor loadings for all the variables.

Q-2 What are the criteria for determining the number of factors, roughly in the order of frequency of use in social science (see Dunteman, 1989: 22-3)?

Kaiser criterion: A common rule of thumb for dropping the least important factors from the analysis. The Kaiser rule is to drop all components with eigenvalues under 1.0. Kaiser criterion is the default in SPSS and most computer programs.

Scree plot: The Cattell scree test plots the components as the X axis and the corresponding eigenvalues as the Y axis. As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward less steep decline, Cattell's scree test says to drop all further components after the one starting the elbow. This rule is sometimes criticised for being amenable to researcher-controlled "fudging." That is, as picking the "elbow" can be subjective because the curve has multiple elbows or is a smooth curve, the researcher may be tempted to set the cut-off at the number of factors desired by his or her research agenda. Even when "fudging" is not a consideration, the scree criterion tends to result in more factors than the Kaiser criterion.

Variance explained criteria: Some researchers simply use the rule of keeping enough factors to account for 90% (sometimes 80%) of the variation. Where the researcher's goal emphasizes parsimony (explaining variance with as few factors as possible), the criterion could be as low as 50%.

Q-3 What are the different rotation methods used in factor analysis? Ans: No rotation is the default, but it is a good idea to select a rotation method, usually varimax. The unrotated principal components solution maximizes the sum of squared factor loadings, efficiently creating a set of factors which explain as much of the variance in the original variables as possible. The amount explained is reflected in the sum of the eigenvalues of all factors. However, unrotated solutions are hard to interpret because variables tend to load on multiple factors.

Varimax rotation is an orthogonal rotation of the factor axes to maximize the variance of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix, which has the effect of differentiating the original variables by extracted factor. Each factor will tend to have either large or small loadings of any particular variable. A varimax solution yields results which make it as easy as possible to identify each variable with a single factor. This is the most common rotation option.

Quartimax rotation is an orthogonal alternative which minimizes the number of factors needed to explain each variable. This type of rotation often generates a general factor on which most variables are loaded to a high or medium degree. Such a factor structure is usually not helpful to the research purpose.

Q-4 How many cases are required to do factor analysis? There is no scientific answer to this question, and methodologists differ. Alternative arbitrary "rules of thumb," in descending order of popularity, include those below. These are not mutually exclusive: Bryant and Yarnold, for instance, endorse both STV and the Rule of 200.

Rule of 10. There should be at least 10 cases for each item in the instrument being used.

STV ratio. The subjects-to-variables ratio should be no lower than 5 (Bryant and Yarnold, 1995).

Rule of 100: The number of subjects should be the larger of 5 times the number of variables, or 100. Even more subjects are needed when communalities are low and/or few variables load on each factor (Hatcher, 1994).

Rule of 150: Hutcheson and Sofroniou (1999) recommend at least 150 - 300 cases, more toward the 150 end when there are a few highly correlated variables, as would be the case when collapsing highly multicollinear variables.

Q-5 What is "sampling adequacy" and what is it used for? Measured by the Kaiser-Meyer-Olkin (KMO) statistics, sampling adequacy predicts if data are likely to factor well, based on correlation and partial correlation. In the old days of manual factor analysis, this was extremely useful. KMO can still be used, however, to assess which variables to drop from the model because they are too multicollinear.

There is a KMO statistic for each individual variable, and an overall KMO statistic computed across all the variables. KMO varies from 0 to 1.0 and KMO overall should be .60 or higher to proceed with factor analysis. If it is not, drop the indicator variables with the lowest individual KMO statistic values, until KMO overall rises above .60.

To compute KMO overall, the numerator is the sum of squared correlations of all variables in the analysis (except the 1.0 self-correlations of variables with themselves, of course). The denominator is this same sum plus the sum of squared partial correlations of each variable i with each variable j, controlling for others in the analysis. The concept is that the partial correlations should not be very large if one is to expect distinct factors to emerge from factor analysis.

In SPSS, KMO is found under Analyze - Statistics - Data Reduction - Factor - Variables (input variables) - Descriptives - Correlation Matrix - check KMO and Bartlett's test of sphericity and also check Anti-image - Continue - OK. The KMO output is KMO overall. The diagonal elements on the Anti-image correlation matrix are the KMO individual statistics for each variable.
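The KMO computation described above can be sketched in plain Python. This is an illustrative sketch rather than SPSS's routine: the function names invert and kmo and the 3 x 3 example correlation matrix are assumptions. The partial correlations are obtained from the inverse of the correlation matrix, a standard identity, and the sketch assumes the matrix is invertible.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix on the right.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kmo(corr):
    """Return (overall KMO, list of individual KMO values per variable)."""
    n = len(corr)
    inv = invert(corr)
    # Squared correlations and squared partial correlations, off-diagonal
    # only. The partial correlation of i and j controlling for all other
    # variables is -inv[i][j] / sqrt(inv[i][i] * inv[j][j]).
    r2 = [[corr[i][j] ** 2 if i != j else 0.0 for j in range(n)]
          for i in range(n)]
    p2 = [[(inv[i][j] ** 2) / (inv[i][i] * inv[j][j]) if i != j else 0.0
           for j in range(n)] for i in range(n)]
    sum_r2 = sum(map(sum, r2))
    sum_p2 = sum(map(sum, p2))
    overall = sum_r2 / (sum_r2 + sum_p2)
    # Individual KMO uses only the row of the variable in question.
    individual = [sum(r2[i]) / (sum(r2[i]) + sum(p2[i])) for i in range(n)]
    return overall, individual


# Hypothetical correlation matrix for three indicator variables.
corr = [[1.0, 0.6, 0.5],
        [0.6, 1.0, 0.4],
        [0.5, 0.4, 1.0]]
overall, individual = kmo(corr)
print(round(overall, 3), [round(v, 3) for v in individual])
```

Following the rule in the text, one would compare the overall value with .60 and, if it falls short, drop the variable with the lowest individual value and recompute.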
