A Guide to Multivariate Techniques

Topics:
Preparation for Statistical Analysis
Review: ANOVA
Review: ANCOVA
MANOVA
MANCOVA
Repeated Measure Analysis
Factor Analysis
Discriminant Analysis
Cluster Analysis
Guide-1
Correlation: 1 IV, 1 DV; relationship
Regression: 1+ IVs, 1 DV; relation/prediction
T test: 1 IV (cat.), 1 DV; group diff.
One-way ANOVA: 1 IV (2+ cat.), 1 DV; group diff.
One-way ANCOVA: 1 IV (2+ cat.), 1 DV, 1+ covariates; group diff.
One-way MANOVA: 1 IV (2+ cat.), 2+ DVs; group diff.
Guide-2
One-way MANCOVA: 1 IV (2+ cat.), 2+ DVs, 1+ covariates; group diff.
Factorial MANOVA: 2+ IVs (2+ cat.), 2+ DVs; group diff.
Factorial MANCOVA: 2+ IVs (2+ cat.), 2+ DVs, 1+ covariates; group diff.
Discriminant Analysis: 2+ IVs, 1 DV (cat.); group prediction
Factor Analysis: explore the underlying structure
There are outliers in rincome_2; let's change those outliers to the acceptable minimum or maximum value
Transform>Recode>Into Different Variables
Put rincome_2 into the Input Variable box and type rincome_3 as the new name; replace all values <= 3 with 4, leaving all other values the same
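Outside SPSS, the same recode is one line of pandas. A minimal sketch; the file name gss.csv is an assumption standing in for an export of the data:

```python
import pandas as pd

df = pd.read_csv("gss.csv")  # hypothetical export of the SPSS data file

# Keep values greater than 3 as-is; pull the low outliers (<= 3) up to 4.
df["rincome_3"] = df["rincome_2"].where(df["rincome_2"] > 3, 4)
```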
Review: ANOVA -1
One-way ANOVA tests the equality of group means
Assumptions: independent observations; normality; homogeneity of variance
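As a quick illustration of the test itself, a one-way ANOVA in Python with SciPy; the group data below are made up:

```python
from scipy import stats

# Three illustrative groups of DV scores (made-up numbers).
g1 = [23, 25, 21, 22, 24]
g2 = [27, 29, 26, 30, 28]
g3 = [20, 19, 22, 21, 23]

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # reject equal means if p < .05
```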
Review: ANCOVA -1
Idea: differences in a DV often do not depend on just one or two IVs; they may also depend on other measured variables. ANCOVA takes such dependency into account.
i.e. it removes the effect of one or more covariates
Review: ANCOVA -2
Homogeneity of slopes = homogeneity of regression = there is no interaction between the IVs and the covariate
If the interaction between the covariate and the IVs is significant, ANCOVA should not be conducted
Example: determine whether hours worked per week (hrs2) differs by gender (sex) and by job satisfaction (satjob2, satisfied vs. dissatisfied), after adjusting for income (i.e., with income equalized)
Review: ANCOVA -3
Analyze>GLM>Univariate
Move hrs2 into the DV box; move sex and satjob2 into the Fixed Factors box; move rincome_2 into the Covariates box
Click Model>Custom
Highlight all variables and move them into the Model box; make sure the Interaction option is selected
Click Options
Move sex and satjob2 into the Display Means box; check Descriptive statistics, Estimates of effect size, and Homogeneity tests
Review: ANCOVA -4
If no interaction is found in the previous step, repeat it, but select Model>Full factorial instead of Model>Custom
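The same two-step check can be sketched in Python with statsmodels; the file and column names are assumptions matching the slides:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("gss.csv")  # hypothetical export of the SPSS data file

# Step 1: custom model with covariate-by-factor interactions
# (the homogeneity-of-regression check).
check = ols("hrs2 ~ C(sex) * C(satjob2) + rincome_2"
            " + rincome_2:C(sex) + rincome_2:C(satjob2)", data=df).fit()
print(sm.stats.anova_lm(check, typ=3))

# Step 2: if the covariate interactions are not significant,
# drop them and fit the ANCOVA proper.
ancova = ols("hrs2 ~ C(sex) * C(satjob2) + rincome_2", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=3))
```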
Review: ANOVA -2
A significant interaction means the two IVs in combination have a significant effect on the DV; in that case it does not make sense to interpret the main effects
Assumptions: the same as one-way ANOVA
Example: the impact of gender (sex) and age (agecat4) on income (rincome_2)
Explore (omitted)
Analyze>GLM>Univariate
Click Model>Full factorial>Continue
Click Options>check Descriptive statistics, Estimates of effect size, Homogeneity tests
Click Post Hoc>check LSD, Bonferroni, Scheffe>Continue
Click Plots>put one IV into the Horizontal Axis box and the other into Separate Lines
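A corresponding two-way ANOVA sketch in Python with statsmodels; the gss.csv file name is an assumption:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("gss.csv")  # hypothetical export of the SPSS data file

# Factorial ANOVA: income by gender, age category, and their interaction.
model = ols("rincome_2 ~ C(sex) * C(agecat4)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # look at the interaction row first
```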
MANOVA-1
Characteristics
Similar to ANOVA, but with multiple DVs
The DVs are correlated, and a linear combination of them makes sense
It tests whether mean differences among k groups on a combination of DVs are likely to have occurred by chance
The idea of MANOVA is to find the linear combination that separates the groups optimally, then perform an ANOVA on that linear combination
MANOVA-2
Advantages
The chance of discovering what actually changed as a result of the different treatments increases
May reveal differences not shown in separate ANOVAs, without inflating the Type I error
Running multiple ANOVAs ignores some very important information (the fact that the DVs are correlated)
MANOVA-3
Disadvantages
More complicated
ANOVA is often more powerful
Assumptions:
Independent random samples
Multivariate normal distribution in each group
Homogeneity of the covariance matrices
Linear relationships among the DVs
MANOVA-4
Steps in carrying out MANOVA
Check the assumptions
If MANOVA is not significant, stop
If MANOVA is significant, carry out univariate ANOVAs
If a univariate ANOVA is significant, do post hoc tests
If the homogeneity-of-covariance assumption holds, use Wilks' Lambda; if not, use Pillai's Trace. In general, all four statistics should be similar.
MANOVA-5
Example: An experiment looking at the memory effects of different instructions. Three groups of human subjects learned nonsense syllables as they were presented and were administered two memory tests: recall and recognition. The first group was instructed to like or dislike each syllable as it was presented (to generate affect). The second group was told they would be tested (to induce anxiety?). The third group was told to count the syllables as they were presented (interference). The objective is to assess group differences in memory.
MANOVA-6
How to do it?
File>Open Data
Open the file As9.por in Instruct>Zhang Multivariate Short Course folder
Analyze>GLM>Multivariate
Move recall and recog into the Dependent Variables box; move group into the Fixed Factors box
Click Options; move group into the Display Means box (this displays the marginal means predicted by the model; these may differ from the observed means if there are covariates or the model is not factorial)
The Compare main effects box tests every pair of estimated marginal means for the selected factors
Check Estimates of effect size and Homogeneity tests
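For comparison, a minimal MANOVA sketch in Python with statsmodels; a hypothetical CSV export of As9.por is assumed. mv_test() reports all four multivariate statistics:

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.read_csv("as9.csv")  # hypothetical export; columns recall, recog, group

mv = MANOVA.from_formula("recall + recog ~ C(group)", data=df)
print(mv.mv_test())  # Wilks' lambda, Pillai's trace, Hotelling-Lawley, Roy
```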
MANOVA-7
Push buttons:
Plots: create a profile plot for each DV displaying group means
Post Hoc: post hoc tests for marginal means
Save: save predicted values, etc.
Contrasts: perform planned comparisons
Model: specify the model
Options:
Display Means for: display the estimated means predicted by the model
Compare main effects: test for a significant difference between every pair of estimated marginal means for each main effect
MANOVA-8
Observed power: produce a statistical power analysis for your study
Parameter estimates: check this when you need a predictive model
Spread vs. level plot: visual display of homogeneity of variance
MANOVA-9
Example 2: Check for the impact of job satisfaction (satjob) and gender (sex) on income (rincome_2) and education (educ) (in gssft.sav)
Screen data: transform educ to educ2 to eliminate cases with 6 or fewer years
Check the assumptions: Explore
Run MANOVA
MANCOVA-1
Objective: test for mean differences among groups on a linear combination of DVs, after adjusting for the covariate. Example: test whether productivity (measured by income and hours worked) differs across age groups after adjusting for education level.
MANCOVA-2
Assumptions: similar to ANCOVA
SPSS how-to:
Analyze>GLM>Multivariate
Move rincome_2 and educ2 into the DV box; move sex and satjob into the Fixed Factors box; move age into the Covariates box
Check for homogeneity of regression:
Click Model>Custom; highlight all variables and move them into the Model box
If the covariate-by-IV interactions are not significant, repeat the process but select Full factorial under Model
Repeated Measure Analysis

Analyze>GLM>Repeated Measures
Click Define: here we link the repeated-measures factor levels to variable names, and define between-subjects factors and covariates
Move drug1 through drug4 into the Within-Subjects Variables box; you can reorder a selected variable with the up and down buttons
2. If the sphericity test is significant, use the multivariate results (tests on the contrasts); they test whether all of the contrast variables are zero in the population. If the sphericity test is not significant, use the Sphericity Assumed results.
3. Look at the tests of within-subjects contrasts: they test the linear trend, the quadratic trend, and so on.
These trend tests may not make sense in some applications, as in this example (though they would make sense if the conditions represented time or dosage).
Having concluded there are memory differences due to drug condition, we want to know which conditions differ from which others.
Suppose we want to test each adjacent pair of means: drug1 vs. drug2, drug2 vs. drug3, drug3 vs. drug4:
Repeated Measures>Define>Contrasts>select Repeated
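A minimal repeated-measures sketch in Python with statsmodels, assuming the drug data have been reshaped to long format (one row per subject-by-condition; the file and column names are assumptions). AnovaRM requires balanced data and reports the sphericity-assumed univariate F:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: columns subject, drug (1-4), score.
df = pd.read_csv("drug_long.csv")

res = AnovaRM(data=df, depvar="score", subject="subject",
              within=["drug"]).fit()
print(res)  # sphericity-assumed F test for the drug condition
```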
Factor Analysis-1
The main goal of factor analysis is data reduction. A typical use is in survey research, where a researcher wishes to represent a number of questions with a smaller number of factors. Two questions in factor analysis:
How many factors are there, and what do they represent (interpretation)?
Factor Analysis-2
Two types of factor analysis:
Exploratory: introduced here
Confirmatory: SPSS AMOS
Theoretical basis:
Correlations among variables are explained by underlying factors
An example of a mathematical one-factor model for two variables:
V1 = L1*F1 + E1
V2 = L2*F1 + E2
Factor Analysis-3
Each variable is composed of a common factor (F1) multiplied by a loading coefficient (L1, L2: the lambdas or factor loadings), plus a random component
V1 and V2 correlate because of the common factor, so their correlation should relate to the factor loadings; thus the factor loadings can be estimated from the correlations
A given set of correlations can be produced by different sets of factor loadings (i.e., the solutions are not unique)
One should pick the simplest solution
Factor Analysis-4
A factor solution needs to be confirmed:
By a different factoring method
By a different sample
More on terminology
Factor loading: interpreted as the Pearson correlation between the variable and the factor
Communality: the proportion of variability for a given variable that is explained by the factors
Extraction: the process by which the factors are determined from a large set of variables
Factor Analysis-5
Principal components: one of the extraction methods
A principal component is a linear combination of observed variables that is independent (orthogonal) of the other components
The first component accounts for the largest amount of variance in the input data; the second component accounts for the largest amount of the remaining variance
Saying components are orthogonal means they are uncorrelated
Factor Analysis-6
Possible applications of principal components:
E.g., in survey research it is common to have many questions addressing one issue (e.g., customer service). These questions are likely to be highly correlated, which is problematic for some statistical procedures (e.g., regression). Instead, one can use factor scores computed from the factor loadings on each orthogonal component.
Factor Analysis-7
Principal components vs. other extraction methods:
Principal components focus on accounting for the maximum amount of variance (the diagonal of the correlation matrix)
Other extraction methods (e.g., principal axis factoring) focus more on accounting for the correlations between variables (the off-diagonal elements)
A principal component can be defined as a unique combination of variables, but factors from the other methods cannot
Principal components are used for data reduction but are more difficult to interpret
Factor Analysis-8
Number of factors:
Eigenvalues are often used to determine how many factors to keep
Take as many factors as there are eigenvalues greater than 1
An eigenvalue represents the amount of standardized variance in the variables accounted for by a factor
The amount of standardized variance in each variable is 1, so the eigenvalues sum to the number of variables
The sum of the retained eigenvalues divided by the number of variables gives the proportion of variance accounted for
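The eigenvalue-greater-than-1 rule is easy to check directly. A sketch in Python; a hypothetical CSV export of the decathlon scores is assumed:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("olymp88.csv")  # hypothetical export of the decathlon events

corr = df.corr().to_numpy()                # correlation matrix of the events
eigvals = np.linalg.eigvalsh(corr)[::-1]   # eigenvalues, largest first
n_factors = int((eigvals > 1).sum())       # eigenvalue-greater-than-1 rule
print(eigvals.round(2), n_factors)
print(eigvals[:n_factors].sum() / len(corr))  # proportion of variance kept
```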
Factor Analysis-9
Rotation
Objective: to facilitate interpretation
Orthogonal rotation: done when data reduction is the objective and the factors need to be orthogonal
Varimax: attempts to simplify interpretation by maximizing the variance of the loadings on each factor
Quartimax: simplifies the solution by finding a rotation that produces clearly high and low loadings across factors for each variable
Oblique rotation: used when there is reason to allow the factors to be correlated
Oblimin and Promax (Promax runs fast)
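A varimax-rotated factor solution can be sketched with scikit-learn (version 0.24 or later for the rotation option; the same hypothetical decathlon export as above):

```python
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("olymp88.csv")  # hypothetical export of the decathlon events
X = StandardScaler().fit_transform(df)

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(fa.components_.T.round(2))  # loadings: one row per event, one per factor
scores = fa.transform(X)          # factor scores for each athlete
```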
Factor Analysis-10
Factor scores: if you are satisfied with a factor solution,
You can request a new set of variables representing each observation's scores on the factors (difficult to interpret)
Or you can use the lambda coefficients to judge which variables are highly related to a factor, then compute the sum or the mean of those variables for further analysis (easier to interpret)
Factor Analysis-11
Sample size: should be about 10 to 15 times the number of variables (as for other multivariate procedures)
Factoring methods: SPSS offers several, including principal components:
Principal axis: accounts for the correlations between the variables
Unweighted least squares: minimizes the residuals between the observed and reproduced correlation matrices
Factor Analysis-12
Generalized least squares: similar to unweighted least squares, but gives more weight to the variables with stronger correlations
Maximum likelihood: generates the solution most likely to have produced the observed correlation matrix
Alpha factoring: treats the variables as a sample from a larger set of possible variables
Image factoring: decomposes each variable into a common part and a unique part, then works with the common part
Factor Analysis-13
Recommendations:
Principal components and principal axis are the most commonly used methods
When there is multicollinearity, use principal components
Rotations are often done; try Varimax first
Factor Analysis-14
Example 1: determine whether a small number of athletic skills accounts for performance in the ten separate decathlon events
File>Open>Data; select Olymp88.por
Look at the correlations:
Analyze>Correlate>Bivariate
Factor Analysis-15
Analyze>Data Reduction>Factor; move the ten event variables into the Variables box
Click the Rotation button; check Varimax and Loading plots; click Continue
Click the Options button; check Sorted by size; check the Suppress absolute values box and change .10 to .30; click Continue
Click Descriptives; check Univariate descriptives and KMO and Bartlett's test of sphericity (KMO measures how well the sample data are suited for factor analysis: .9 is great and below .5 is not acceptable; Bartlett's test tests the sphericity of the correlation matrix); click Continue
Click OK
Factor Analysis-16
Try to validate the first factor solution using a different method
Analyze>Data Reduction>Factor Analysis
Click Extraction; select Principal axis factoring; click Continue
Click Rotation; select Direct Oblimin (leave delta at 0, the most oblique value possible); type 50 in the Maximum Iterations box; click Continue
Click the Scores button; check Save as variables (this involves solving a system of equations for the factors; regression is one of the methods for solving it); click Continue
Click OK
Factor Analysis-17
Note: the Pattern matrix gives the standardized linear weights, and the Structure matrix gives the correlations between the variables and the factors (in principal component analysis, the component matrix gives both the factor loadings and the correlations)
Discriminant Analysis-1
Discriminant analysis characterizes the relationship between a set of IVs and a categorical DV with relatively few categories
It creates a linear combination of the IVs that best characterizes the differences among the groups
Predictive discriminant analysis focuses on creating a rule to predict group membership
Descriptive discriminant analysis studies the relationship between the DV and the IVs
Discriminant Analysis-2
Possible applications:
Should a bank offer a loan to a new customer?
Which customers are likely to buy?
Identify patients who may be at high risk for problems after surgery
Discriminant Analysis-3
How does it work?
Assume the population of interest is composed of distinct subpopulations
Assume the IVs follow a multivariate normal distribution
DA seeks the linear combination of the IVs that best separates the populations
If we have k groups, we need k-1 discriminant functions
A discriminant score is computed for each function
These scores are used to classify cases into one of the categories
Discriminant Analysis-4
There are three methods to classify group memberships:
Maximum likelihood method: assign a case to group k if its probability of membership is greater in group k than in any other group
Fisher (linear) classification functions: assign a case to group k if its score on the function for group k is greater than its scores on the other functions
Distance function: assign a case to group k if its distance to the centroid of that group is smallest
Note: SPSS uses the maximum likelihood method
Discriminant Analysis-5
Basic steps in DA:
Identify the variables
Screen the data: look for outliers, variables that may not be good predictors, etc.
Run DA
Check the correct-prediction rate
Check the importance of individual predictors
Validate the model
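These steps map onto a few lines of scikit-learn. A sketch using the VCR example below; the csm.csv file name is an assumed export and the column names follow the slides:

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

df = pd.read_csv("csm.csv")  # hypothetical export of the CSM.por survey data
predictors = ["age", "complain", "educ", "fail", "pinnovat",
              "preliabl", "puse", "qual", "use", "value"]
X, y = df[predictors], df["buyyes"]

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y))     # resubstitution correct-classification rate
scores = lda.transform(X)  # discriminant scores (k-1 functions; here 1)
```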
Discriminant Analysis-6
Assumptions:
IVs are either dichotomous or measurement variables
Normality
Homogeneity of variances
Discriminant Analysis-7
Example 1: VCR buyers filled out a survey; we want to determine which set of demographic and attitude variables best predicts which customers may buy another VCR
File>Open Data>CSM.por
Explore the data
Analyze>Classify>Discriminant
Move age, complain, educ, fail, pinnovat, preliabl, puse, qual, use, and value into the Independents box
Move buyyes into the Grouping Variable box; click Define Range; type 1 for Minimum and 2 for Maximum; click Continue
Discriminant Analysis-8
Click Statistics; check Box's M and Fisher's; Continue
Click the Classify button; check Summary table and Separate-groups; Continue
Click the Save button; check Discriminant scores; Continue
Click OK
Discriminant Analysis-9
Since Box's M test was significant, one can ask SPSS to rerun DA using the separate-groups covariance option (under Classify) and compare the results
The first analysis showed that age was not important; one can redo the analysis without age and compare the results
Discriminant Analysis-10
Validate the model: leave-one-out classification
Repeat the analysis; under Classify, check Leave-one-out classification; click Continue
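The equivalent leave-one-out check in scikit-learn, continuing the hypothetical setup from the discriminant sketch above:

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

df = pd.read_csv("csm.csv")  # hypothetical export of the CSM.por survey data
predictors = ["age", "complain", "educ", "fail", "pinnovat",
              "preliabl", "puse", "qual", "use", "value"]
X, y = df[predictors], df["buyyes"]

# Each case is classified by a model fitted to all the other cases.
acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print(acc.mean())  # leave-one-out correct-classification rate
```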
Cluster Analysis-1
Cluster analysis is an exploratory data analysis technique designed to reveal groups
How?
By distance: observations that are close together should be in the same group, and observations in different groups should be far apart
Applications:
Grouping plants and animals into ecological groups
Grouping companies by product usage
Cluster Analysis-2
Two types of methods
Hierarchical: requires observations to remain together once they have joined a cluster; linkage options include:
Complete linkage
Between-groups average linkage
Ward's method
K-means (non-hierarchical): observations can be reassigned as the cluster centers are updated
Cluster Analysis-3
Recommendations:
For relatively small samples (less than a few hundred), use hierarchical methods
For large samples, use K-means
Cluster Analysis-4
Analyze>Classify>Hierarchical Cluster
Move cost, calories, sodium, and alcohol into the Variables box
Move beer into the Label Cases by box
Click Plots; check Dendrogram; click None in the Icicle area; Continue
Click Method; select Z scores from the Standardize drop-down list; Continue
Click Save; click Range of solutions; enter the range 2 through 5 clusters; Continue
OK
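A matching hierarchical clustering sketch with SciPy; beer.csv is a hypothetical export, and the variables are z-scored first, as in the Method step:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.stats import zscore

df = pd.read_csv("beer.csv")  # hypothetical; beer, cost, calories, sodium, alcohol
X = zscore(df[["cost", "calories", "sodium", "alcohol"]].to_numpy())

Z = linkage(X, method="ward")  # Ward's method; "complete"/"average" also work
labels = fcluster(Z, t=4, criterion="maxclust")  # e.g., a 4-cluster solution

dendrogram(Z, labels=df["beer"].tolist())
plt.show()
```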
Cluster Analysis-5
Additional analysis
Look at the last four columns of the data (clu5_1 to clu2_1); they contain the cluster memberships for each solution from 5 clusters down to 2
Analyze>Descriptive Statistics>Frequencies
Move clu2_1 through clu5_1 into the Variables box; OK
Path Analysis-1
Path analysis is a technique based on regression for establishing causal relationships
Start with a diagram of the causal flow
Direct causal effects model (regression):
The direct causal effect of an IV on a DV is its coefficient (the number of units the DV changes for a one-unit change in the IV)
Path Analysis-2
An example of a causal model
Structural equation:
Z4 = p41*Z1 + p42*Z2 + p43*Z3 + e4
p: path coefficient
e: disturbance
Z4: endogenous variable
Z1: exogenous variable
Path diagram
The indirect effect is the product of the path coefficients along the path
Path Analysis-3
Steps in path analysis:
Create a path diagram
Use regression to estimate the structural equation coefficients
Assess the model:
Compare the observed and reproduced correlations (the reproduced correlations are computed by hand)
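Since each structural equation is just a regression on standardized variables, the path coefficients can be sketched with statsmodels; the file name and the column names z1-z4 are assumptions standing in for the standardized variables:

```python
import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv("nations.csv")  # hypothetical data set, variables z-scored

# Structural equation for the endogenous variable:
# Z4 = p41*Z1 + p42*Z2 + p43*Z3 + e4
eq4 = ols("z4 ~ z1 + z2 + z3", data=df).fit()
eq3 = ols("z3 ~ z1 + z2", data=df).fit()  # equation for a mediator, Z3
print(eq4.params, eq3.params)

# Indirect effect of Z1 on Z4 through Z3: the product of the path coefficients.
indirect = eq3.params["z1"] * eq4.params["z3"]
print(indirect)
```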
Path Analysis-4
Research questions:
Is our model, which describes the causal effects among region of the world, status as a developing nation, number of doctors, and male life expectancy, consistent with the observed correlations among these variables? If the model is consistent, what are the estimated direct, indirect, and total causal effects among the variables?
Path Analysis-5
Legal paths:
No path may pass through the same variable more than once
No path may go backward on an arrow after going forward on another arrow
No path may include more than one double-headed curved arrow
Path Analysis-6
Component labels:
D: direct effect (just one straight arrow)
I: indirect effect (more than one straight arrow)
S: spurious effect (the path includes a backward arrow)
U: uncertain effect (the path starts with a double-headed curved arrow)
Path Analysis-7
If the model is in question (some reproduced correlations differ from the observed correlations by more than .05):
Test all missing paths (run additional regressions and check the significance of the coefficients)
Remove existing paths whose coefficients are not significant
Motivation (cont.)
Fit an S curve to the data
[Figure: logistic S curve, probability (0.0 to 1.0) on the vertical axis vs. income on the horizontal axis]
Objectives:
Run a logistic regression
Apply stepwise logistic regression
Use the ROC (receiver operating characteristic) curve to assess the model
Note on assumptions
No need for normality of errors
No need for equal variances
Example
Objective: to predict low birth weight babies Variables:
Low: 1 if <= 2500 grams, 0 if > 2500 grams
LWT: mother's weight at the last menstrual period
Age
Smoke
PTL: # of premature deliveries
HT: history of hypertension
UI: uterine irritability
FTV: # of physician visits during the first trimester
Race: 1 = white, 2 = black, 3 = other
Example
File > Open > Data > Select SPSS Portable type > select Birthwt (in Regression) Analyze > Regression > Binary Logistic
Move low into the Dependent box
Move age, ftv, ht, ptl, race, smoke, and ui into the Covariates box
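The same model in Python with statsmodels; birthwt.csv is a hypothetical export, and lwt is added here as an assumption, since the prediction example later uses the mother's weight:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("birthwt.csv")  # hypothetical export of the SPSS file

# Binary logistic regression; race is treated as categorical.
model = smf.logit(
    "low ~ age + lwt + ftv + ht + ptl + C(race) + smoke + ui", data=df
).fit()
print(model.summary())  # coefficients, Wald tests, log-likelihood
```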
Example (cont.)
Click the Categorical button
Place race in the Categorical Covariates box
Example (cont.)
Click the Options button; check the Classification plots and Hosmer-Lemeshow goodness-of-fit boxes
Click Continue, then OK
Logistic outputs
Initial summary output: information on the dependent and categorical variables
Block 0: based on a model that includes only a constant; provides baseline information
Block 1 (Method = Enter): includes the full model information
The chi-square statistic tests whether all the coefficients are 0 (similar to the F test in regression)
Making prediction
Suppose a mother:
Age 20
Weighs 130 pounds
Smokes
No hypertension or premature labor
Has uterine irritability
White
Two visits to her doctor
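Continuing the statsmodels sketch above, the fitted model can score this profile directly; the 1/0 coding of the indicator variables is an assumption:

```python
import pandas as pd

# The mother's profile; 1/0 coding for smoke, ht, and ui is assumed.
new_mother = pd.DataFrame({
    "age": [20], "lwt": [130], "smoke": [1], "ht": [0],
    "ptl": [0], "ui": [1], "race": [1], "ftv": [2],
})
print(model.predict(new_mother))  # predicted P(low birth weight)
```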
Checking classification
We need to study the characteristics of the mispredicted cases
Transform>Compute: Pred_err = 1 if ...
Analyze>Compare Means (LWT vs. Pred_err)
The mean LWT for mispredicted cases is much lower than for correctly predicted cases
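The objectives slide also mentions the ROC curve; a sketch of that assessment with scikit-learn, using the fitted probabilities from the statsmodels sketch above:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_prob = model.predict(df)               # fitted probabilities for every case
print(roc_auc_score(df["low"], y_prob))  # area under the ROC curve
fpr, tpr, thresholds = roc_curve(df["low"], y_prob)  # points for plotting
```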
Residual Analysis
Analyze>Regression>Logistic>click Save>check Cook's, Leverage values, Unstandardized, Logit, and Standardized
Examining the data:
Cook's distance and leverage should be small (if a case has no influence on the regression result, the values would be 0)
res_1 is the residual in terms of probability (e.g., the 1st case has predicted probability .29804 and an actual low value of 0, so res_1 = 0 - .29804 = -.29804)
zre_1 is the standardized residual of the probabilities
lre_1 is the residual in terms of the logit
Multicollinearity
Use OLS regression to check (?)
Example
Objective: predict credit risk (3 categories) based on financial and demographic variables
Variables:
Age
Income
Gender
Marital (single, married, divsepwid)
Numkids: # of dependent children
Example Cont.
Numcards: # of credit cards
Howpaid: how often paid (weekly, monthly)
Mortgage: has a mortgage (y, n)
Storecar: # of store credit cards
Loans: # of other loans
Risk: 1 = bad risk (loss), 2 = bad risk (profit), 3 = good risk
Example Cont.
File>Open>Data>select Risk>Open
Move risk into the Dependent box
Move marital and mortgage into the Factor(s) box
Move income and numkids into the Covariate(s) box
Click the Model button
Click the Cancel button
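A multinomial sketch in Python with statsmodels; risk.csv is a hypothetical export with columns named as on the slides. Note that statsmodels takes the lowest category as the reference group, whereas SPSS NOMREG defaults to the highest (good risk):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("risk.csv")  # hypothetical export; columns per the slides

# Multinomial logistic regression of risk (3 categories) on two factors
# and two covariates.
model = smf.mnlogit(
    "risk ~ C(marital) + C(mortgage) + income + numkids", data=df
).fit()
print(model.summary())  # one block of coefficients per non-reference category
```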
Example (Cont.)
Click Statistics button
Check the Classification table check box; click Continue
Click Save
The Multinomial Logistic Regression procedure in SPSS version 10 will only save model information in XML (Extensible Markup Language) format
Click Cancel
Click OK
Multinomial output
Model Fit, Pseudo R-Square, and the Likelihood Ratio Tests are similar to those in binary logistic regression; the Parameter Estimates table is different
There are two sets of parameters:
One for the probability ratio (bad risk, loss) / (good risk)
Another for the probability ratio (bad risk, profit) / (good risk)
Interpretation of coefficients
Income in the bad-loss section:
It is significant
Exp(B) = .962: the expected probability ratio decreases slightly (by a factor of .962) for each one-unit increase in income
How to predict?
F(1): the probability of being in the bad-loss group
F(2): the probability of being in the bad-profit group
F(3): the probability of being in the good-risk group
F(j) = g(j) / sum of g(i) over all groups i, where g(j) = exp(model_j)
How to predict?
F(1) = .218 / (.218 + .767 + 1) = .110
F(2) = .386
F(3) = .504
The individual is classified as good risk
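The classification arithmetic on this slide is easy to verify:

```python
# g(j) = exp(model_j); the values below are taken from the slide.
g = [0.218, 0.767, 1.0]  # bad loss, bad profit, good risk (reference = 1.0)
f = [gj / sum(g) for gj in g]
print([round(fj, 3) for fj in f])  # [0.11, 0.386, 0.504] -> good risk wins
```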