
Multivariate Data Analysis

Using SPSS

John Zhang
ARL, IUP
Topics
A Guide to Multivariate Techniques
Preparation for Statistical Analysis
Review: ANOVA
Review: ANCOVA
MANOVA
MANCOVA
Repeated Measure Analysis
Factor Analysis
Discriminant Analysis
Cluster Analysis
Guide-1
Correlation: 1 IV – 1 DV; relationship
Regression: 1+ IV – 1 DV; relation/prediction
T test: 1 IV (2 cat.) – 1 DV; group diff.
One-way ANOVA: 1 IV (2+ cat.) – 1 DV;
group diff.
One-way ANCOVA: 1 IV (2+ cat.) – 1 DV –
1+ covariates; group diff.
One-way MANOVA: 1 IV (2+ cat.) – 2+ DVs;
group diff.
Guide-2
One-way MANCOVA: 1 IV (2+cat.) – 2+ DVs –
1+ covariate; group diff.
Factorial MANOVA: 2+ IVs (2+cat.) – 2+ DVs;
group diff.
Factorial MANCOVA: 2+ IVs (2+cat.) – 2+ DVs –
1+ covariate; group diff.
Discriminant Analysis: 2+ IVs – 1 DV (cat.);
group prediction
Factor Analysis: explore the underlying structure
Preparation for Stat. Analysis-1
Screen data
– SPSS Utility procedures
– Frequency procedure
Missing data analysis (missing data
should be random)
– Check if patterns exist
– Drop data case-wise
– Drop data variable-wise
– Impute missing data
Preparation for Stat. Analysis-2
Outliers (generally, statistical procedures
are sensitive to outliers)
– Univariate case: boxplot
– Multivariate case: Mahalanobis distance (a
chi-square statistic); a point is an outlier
when its p-value is < .001.
– Treatment:
Drop the case
Report two analyses (one with the outliers, one without)
Preparation for Stat. Analysis-3
Normality
– Testing univariate normal:
Q-Q plot
Skewness and kurtosis: both should be 0 when
normal; conclude non-normality when the p-value is < .01 or .001
Kolmogorov-Smirnov statistic: a significant result
means not normal.
– Testing multivariate normal:
Scatterplots should be elliptical
Each variable must be normal
Preparation for Stat. Analysis-4
Linearity
– A linear combination of the variables makes sense
– Two variables (or combinations of variables) are
linearly related
– Check for linearity
Residual plot in regression
Scatterplots
Preparation for Stat. Analysis-5
Homoscedasticity: the covariance
matrices are equal across groups
– Box’s M test: tests the equality of the
covariance matrices across groups
Sensitive to normality
– Levene’s test: tests the equality of variances
across groups.
Not sensitive to normality
Preparation for Stat. Analysis-
Example-1
Steps in preparation for stat. analysis:
– Check for variable coding, recode if necessary
– Examine missing data
– Check for univariate outliers, normality, homogeneity of
variances (Explore)
– Test for homogeneity of variances (ANOVA)
– Check for multivariate outliers (Regression>Save>
Mahalanobis)
– Check for linearity (scatterplots; residual plots in
regression)
Preparation for Stat. Analysis-
Example-2
Use dataset gssft.sav
Objective: we are interested in
investigating group differences (satjob2) in
income (income91), age (age_2) and
education (educ)
Check for coding: need to recode
rincome91 into rincome_2 (recode 22, 98, and
99 to system-missing)
– Transform>Recode>Into Different Variable
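A sketch of the equivalent SPSS syntax, assuming the variable names above:
  * Recode 22, 98, and 99 to system-missing; copy all other values.
  RECODE rincome91 (22=SYSMIS) (98=SYSMIS) (99=SYSMIS) (ELSE=COPY) INTO rincome_2.
  EXECUTE.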
Preparation for Stat. Analysis-
Example-3
Check for missing values
– Use Frequencies for categorical variables
– Use Descriptive Stat. for measurement variables
– For categorical variables:
If missing values are < 5%, use the list-wise option
If >= 5%, define the missing value as a new category
– For measurement variables:
If missing values are < 5%, use the list-wise option
If between 5% and 15%, use Transform>Replace Missing
Values. Replacing less than 15% of the data has little effect on
the outcome
If greater than 15%, consider dropping the variable or subject
Preparation for Stat. Analysis-
Example-4
– Check missing values for satjob2
Analyze>Descriptive Statistics>Frequencies
– Check missing values for rincome_2
Analyze>Descriptive Statistics>Descriptives
– Replace the missing values in rincome_2
Transform>Replace Missing Values (syntax sketch below)
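A syntax sketch of these three steps; RMV is the Replace Missing Values command, and the new variable name rincome_2a is an assumption (the dialog suggests its own default name):
  FREQUENCIES VARIABLES=satjob2.
  DESCRIPTIVES VARIABLES=rincome_2.
  * Replace missing values with the series mean.
  RMV /rincome_2a=SMEAN(rincome_2).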
Preparation for Stat. Analysis-
Example-5
Check for univariate outliers, normality, and
homogeneity of variances
– Analyze>Descriptive Statistics>Explore
Put rincome_2, age_2, and educ into the
Dependent List box; satjob2 into the Factor List box
– There are outliers in rincome_2; let’s change
those outliers to the acceptable min or max
value
Transform>Recode>Into Different Variable
– Put rincome_2 into the variable box, type rincome_3
as the new name
– Replace all values <= 3 by 4; all other values remain the
same (syntax sketch below)
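A syntax sketch of this recode, using the values and names from the steps above:
  * Pull the low outliers up to the acceptable minimum of 4.
  RECODE rincome_2 (LOWEST THRU 3=4) (ELSE=COPY) INTO rincome_3.
  EXECUTE.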
Preparation for Stat. Analysis-
Example-6
Explore rincome_3 again: not normal
– Transform rincome_3 into rincome_4 by a log or
square-root transformation
Explore rincome_4
Check for multivariate outliers
– Analyze>Regression>Linear
Put id (a dummy variable) into the Dependent box; put
rincome_4, age_2, and educ into the Independent box
Click Save, then the Mahalanobis box
Compare the Mahalanobis distances with the chi-square
critical value at p=.001 and df=number of independent
variables (syntax sketch below)
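A syntax sketch of the transformation and the Mahalanobis check; the saved distance is named MAH_1 by default, and df=3 because there are 3 independent variables here:
  COMPUTE rincome_4=SQRT(rincome_3).
  REGRESSION
    /DEPENDENT id
    /METHOD=ENTER rincome_4 age_2 educ
    /SAVE MAHAL.
  * p-value of each case's distance; flag cases with p < .001 as outliers.
  COMPUTE p_mah=1-CDF.CHISQ(mah_1, 3).
  EXECUTE.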
Preparation for Stat. Analysis-
Example-7
Check for multivariate normal:
– Each variable must be univariate normal
– Construct a scatterplot matrix; each
scatterplot should have an elliptical shape
Check for homoscedasticity
– Univariate (ANOVA, Levene’s test)
– Multivariate (MANOVA, Box’s M test; use the .01
significance level)
Review: ANOVA -1
One-way ANOVA tests the equality of group
means
– Assumptions: independent observations; normality;
homogeneity of variance
Two-way ANOVA tests three hypotheses
simultaneously:
– Test the interaction of the levels of the two
independent variables
Interaction occurs when the effect of one factor depends on
the level of the second factor
– Test the two independent variables separately (main effects)
Review: ANCOVA -1
Idea: the difference on a DV often does not just
depend on one or two IVs; it may also depend on
other measurement variables. ANCOVA takes
such dependency into account.
– i.e. it removes the effect of one or more covariates
Assumptions: in addition to the regular ANOVA
assumptions, we need:
– Linear relationship between DV and covariates
– The slope for the regression line is the same for each
group
– The covariates are reliable and are measured
without error
Review: ANCOVA -2
– Homogeneity of slopes = homogeneity of
regression = there is no interaction between the IVs
and the covariate
If the interaction between the covariate and the IVs is
significant, ANCOVA should not be conducted
Example: determine if hours worked per
week (hrs2) differ by gender (sex)
and between those satisfied or dissatisfied with
their job (satjob2), after adjusting for
income (i.e. with income equalized)
Review: ANCOVA -3
– Analyze>GLM>Univariate
Move hrs2 into DV box; move sex and satjob2 into
Fixed Factor box; move rincome_2 into Covariate
box
Click at Model>Custom
– Highlight all variables and move them to the Model box
– Make sure the Interaction option is selected
Click at Options
– Move sex and satjob2 into Display Means box
– Click Descriptive Stat.; Estimates of effect size; and
Homogeneity tests
This tests the homogeneity of regression slopes (syntax sketch below)
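A sketch of one way to write this custom model in syntax; the factor-by-covariate terms on the DESIGN subcommand are what test homogeneity of slopes:
  GLM hrs2 BY sex satjob2 WITH rincome_2
    /PRINT=DESCRIPTIVE ETASQ HOMOGENEITY
    /DESIGN=sex satjob2 sex*satjob2 rincome_2 sex*rincome_2 satjob2*rincome_2.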
Review: ANCOVA -4
– If no interaction is found in the previous
step, then repeat the previous step except
click Model>Full Factorial instead of
Model>Custom
Review: ANOVA -2
– A significant interaction means the two IVs in
combination have a significant effect on the DV;
thus, it does not make sense to interpret the main
effects alone.
– Assumptions: the same as One-way ANOVA
– Example: the impact of gender (sex) and age
(agecat4) on income (rincome_2)
Explore (omitted)
Analyze>GLM>Univariate
– Click Model>click Full Factorial>Continue
– Click Options>click Descriptive Stat.; Estimates of effect size;
Homogeneity tests
– Click Post Hoc>click LSD; Bonferroni; Scheffe; Continue
– Click Plots>put one IV into Horizontal Axis and the other into
Separate Lines (syntax sketch below)
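A syntax sketch of this two-way ANOVA; post hoc tests are requested only for agecat4, since sex has just two levels:
  GLM rincome_2 BY sex agecat4
    /PRINT=DESCRIPTIVE ETASQ HOMOGENEITY
    /POSTHOC=agecat4(LSD BONFERRONI SCHEFFE)
    /PLOT=PROFILE(agecat4*sex)
    /DESIGN=sex agecat4 sex*agecat4.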
MANOVA-1
Characteristics
– Similar to ANOVA
– Multiple DVs
– The DVs are correlated and a linear combination of
them makes sense
– It tests whether mean differences among k groups on
a combination of DVs are likely to have occurred by
chance
– The idea of MANOVA is to find a linear combination that
separates the groups ‘optimally’, and to perform ANOVA
on that linear combination
MANOVA-2
Advantages
– The chance of discovering what actually
changed as a result of the different
treatments increases
– May reveal differences not shown in separate
ANOVAs
– Without inflating the Type I error rate
– The use of multiple ANOVAs ignores some
very important info (the fact that the DVs are
correlated)
MANOVA-3
Disadvantages
– More complicated
– ANOVA is often more powerful
Assumptions:
– Independent random samples
– Multivariate normal distribution in each group
– Homogeneity of covariance matrix
– Linear relationship among DVs
MANOVA-4
Steps in carrying out MANOVA
– Check for assumptions
– If MANOVA is not significant, stop
– If MANOVA is significant, carry out univariate
ANOVA
– If univariate ANOVA is significant, do Post
Hoc
If homoscedasticity holds, use Wilks’ Lambda; if
not, use Pillai’s Trace. In general, all 4
test statistics should be similar.
MANOVA-5
Example: An experiment looking at the memory
effects of different instructions: 3 groups of
human subjects learned nonsense syllables as
they were presented and were administered two
memory tests: recall and recognition. The first
group of subjects was instructed to like or dislike
the syllables as they were presented (to
generate affect). A second group was instructed
that they would be tested (to induce anxiety?). The 3rd
group was told to count the syllables as they were
presented (interference). The objective is to
assess group differences in memory
MANOVA-6
How to do it?
– File>Open Data
Open the file As9.por in Instruct>Zhang Multivariate Short
Course folder
– Analyze>GLM>Multivariate
Move recall and recog into Dependent Variable box; move
group into Fixed Factors box
Click at Options; move group into the Display means box (this
will display the marginal means predicted by the model;
these means may be different from the observed means if
there are covariates or the model is not factorial); the Compare
main effects box tests every pair of the estimated
marginal means for the selected factors.
Click at Estimates of effect size and Homogeneity of variance
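A syntax sketch of this one-way MANOVA:
  GLM recall recog BY group
    /EMMEANS=TABLES(group)
    /PRINT=DESCRIPTIVE ETASQ HOMOGENEITY
    /DESIGN=group.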
MANOVA-7
Push buttons:
– Plots: create a profile plot for each DV displaying
group means
– Post Hoc: Post Hoc tests for marginal means
– Save: save predicted values, etc.
– Contrast: perform planned comparisons
– Model: specify the model
– Options:
Display Means for: display the estimated means predicted by
the model
– Compare main effects: test for significant difference between
every pair of estimated marginal means for each of the main
effects
MANOVA-8
– Observed power: produce a statistical power
analysis for your study
– Parameter estimate: check this when you
need a predictive model
– Spread vs. level plot: visual display of
homogeneity of variance
MANOVA-9
Example 2: Check for the impact of job
satisfaction (satjob) and gender (sex) on
income (rincome_2) and education (educ)
(in gssft.sav)
– Screen data: transform educ to educ2 to
eliminate cases with ‘6 or less’
– Check for assumptions: explore
– MANOVA
MANCOVA-1
Objective: test for mean differences
among groups on a linear combination of
DVs after adjusting for the covariates.
Example: to test if there are differences in
productivity (measured by income and
hours worked) for individuals in different
age groups after adjusting for
education level
MANCOVA-2
Assumptions: similar to ANCOVA
SPSS how to:
– Analyze>GLM>Multivariate
Move rincome_2 and educ2 to the DV box; move sex
and satjob into the Fixed Factor box; move age to the Covariate box
Check for homogeneity of regression
– Click at Model>Custom; highlight all variables and move
them to the Model box
If the covariate-by-IV interactions are not significant,
repeat the process but select Full Factorial under Model
(syntax sketch below)
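A syntax sketch; the first DESIGN is one way to test homogeneity of regression, and the second is the full-factorial MANCOVA to run if the interactions are not significant:
  GLM rincome_2 educ2 BY sex satjob WITH age
    /PRINT=DESCRIPTIVE ETASQ HOMOGENEITY
    /DESIGN=sex satjob sex*satjob age sex*age satjob*age.
  GLM rincome_2 educ2 BY sex satjob WITH age
    /PRINT=DESCRIPTIVE ETASQ HOMOGENEITY
    /DESIGN=sex satjob sex*satjob age.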
Repeated Measure Analysis-1
Objective: test for significant differences in
means when the same subject is measured at
multiple levels of a factor
Examples of repeated measure studies:
– Marketing – compare customers’ ratings on 4 different
brands
– Medicine – compare test results before, immediately
after, and six months after a procedure
– Education – compare performance test scores before
and after an intervention program
Repeated Measure Analysis-2
The logic of repeated measure: SPSS
performs repeated measure ANOVA by
computing contrasts (differences) across
the repeated measures factor’s levels for
each subject, then testing if the means of
the contrasts are significantly different
from 0; any between subject tests are
based on the means of the subjects.
Repeated Measure Analysis-3
Assumptions:
– Independent observations
– Normality
– Homogeneity of variances
– Sphericity: if two or more contrasts are to be pooled
(the test of a main effect is based on this pooling), then
the contrasts should be equally weighted and
uncorrelated (equal variances and uncorrelated
contrasts); this assumption is equivalent to the
covariance matrix being diagonal with equal
diagonal elements
Repeated Measure Analysis-4
Example 1: A study in which 5 subjects were
tested in each of 4 drug conditions
Open data file:
– File>Open…Data; select Repmeas1.por
SPSS repeated measure procedure:
– Analyze>GLM>Repeated Measures
Within-Subject Factor Name (the name of the repeated
measure factor): a repeated measure factor is expressed as
a set of variables
– Replace factor1 with Drug
Number of levels: the number of repeated measurements
– Type 4
Repeated Measure Analysis-5
– The Measure pushbutton serves two functions
For multiple dependent measures (e.g. we
recorded 4 measures of physiological stress under
each of the drug conditions)
To label the factor levels
– Click Measure; type memory in Measure name box; click
add
Click Define: here we link the repeated measure
factor level to variable names; define between
subject factors and covariates
– Move drug1–drug4 to the Within-Subjects Variables box
You can reorder a selected variable with the up and
down buttons (syntax sketch below)
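A syntax sketch of this definition; the four variables form the levels of the within-subjects factor drug:
  GLM drug1 drug2 drug3 drug4
    /WSFACTOR=drug 4 Polynomial
    /MEASURE=memory
    /PRINT=DESCRIPTIVE
    /WSDESIGN=drug.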
Repeated Measure Analysis-6
Model button: by default a complete model
Contrast button: specify particular contrasts
Plot button: create profile plots that graph factor
level estimated marginal means for up to 3 factors
at a time
Post Hoc: provide Post Hoc tests for between
subject factors
Save button: allow you to save predicted values,
residuals, etc.
Options: similar to MANOVA
– Click Descriptive; click at Transformation Matrix (it
provides the contrasts)
Repeated Measure Analysis-7
Interpret the results
1. Look at the descriptive statistics
2. Look at the test for sphericity (Mauchly’s test)
If the sphericity test is significant (sphericity violated), use the
multivariate results (tests on the contrasts). These test whether all
of the contrast variables are zero in the population
If the sphericity test is not significant, use the Sphericity Assumed
results
3. Look at the tests for within-subjects contrasts: they test
the linear trend, the quadratic trend, etc.
– These may not make sense in some applications, as in this
example (but they do make sense for factors such as time or dosage)
Repeated Measure Analysis-8
The transformation matrix provides info on what the
linear contrast is, etc.
– The first table is for the average across the repeated
measure factor (here the coefficients are all .5, meaning each
variable is weighted equally; normalization requires that
the sum of the squared coefficients equals 1)
– The second table defines the contrasts for the corresponding
repeated measure factor
Linear – increases by a constant, etc.
The linear and quadratic contrasts are orthogonal, etc.
– Having concluded there are memory
differences due to drug condition, we want to
know which conditions differ from which others
Repeated Measure Analysis-9
Repeat the analysis, except under the Options button,
move ‘drug’ into Display Means, click at Compare
main effects and select the Bonferroni adjustment
– Transformation Coefficients (M Matrix): shows how the
variables are created for comparison. Here, we compare
the drug conditions, so the M matrix is an identity matrix
Suppose we want to test each adjacent pair of
means: drug1 vs. drug2; drug2 vs. drug3; drug3
vs. drug4:
– Repeated Measures>Define>Contrast>Select Repeated
(syntax sketch below)
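A syntax sketch combining the Bonferroni-adjusted pairwise comparisons with the Repeated (adjacent-pair) contrast:
  GLM drug1 drug2 drug3 drug4
    /WSFACTOR=drug 4 Repeated
    /MEASURE=memory
    /EMMEANS=TABLES(drug) COMPARE(drug) ADJ(BONFERRONI)
    /WSDESIGN=drug.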
Repeated Measure Analysis-10
Example 2: A marketing experiment was devised
to evaluate whether viewing a commercial
produces improved ratings for a specific brand.
Ratings on 3 brands were obtained from subjects
before and after viewing the commercial. Since
the hope was that the commercial would
improve ratings of only one brand (A),
researchers expected a significant brand by pre-
post commercial interaction. There are two
between-subjects factors: sex and brand used
by the subject
Repeated Measure Analysis-11
SPSS how to:
– Analyze>GLM>Repeated Measures
Replace factor1 with prepost in the Within-Subject
Factor Name box; type 2 in the Number of levels box; click
Add
Type brand in the Within-Subject Factor Name box; type
3 in the Number of levels box; click Add
Click Measure; type measure in the Measure Name
box; click Add
Note: the design also has 2 between-subjects factors
(sex and user), defined next
Repeated Measure Analysis-12
Click the Define button; move the appropriate variables
into place; move sex and user into the Between-
Subjects Factors box
Click the Options button; move sex, user, prepost and
brand into the Display Means box
Click the Homogeneity tests and Descriptive boxes
Click Plots; move user into the Horizontal Axis box and
brand into the Separate Lines box
Click Continue; OK (syntax sketch below)
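A syntax sketch of this two-within, two-between design; the six rating variable names are hypothetical placeholders (with two within-subjects factors, the first factor named varies slowest across the variable list):
  GLM pre_a pre_b pre_c post_a post_b post_c BY sex user
    /WSFACTOR=prepost 2 Polynomial brand 3 Polynomial
    /MEASURE=measure
    /PRINT=DESCRIPTIVE HOMOGENEITY
    /PLOT=PROFILE(user*brand)
    /EMMEANS=TABLES(sex)
    /EMMEANS=TABLES(user)
    /EMMEANS=TABLES(prepost)
    /EMMEANS=TABLES(brand).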
Factor Analysis-1
The main goal of factor analysis is data
reduction. A typical use of factor analysis is in
survey research, where a researcher wishes to
represent a number of questions with a smaller
number of factors
Two questions in factor analysis:
– How many factors are there?
– What do they represent (interpretation)?
Two technical aids:
– Eigenvalues
– Percentage of variance accounted for
Factor Analysis-2
Two types of factor analysis:
– Exploratory: introduce here
– Confirmatory: SPSS AMOS
Theoretical basis:
– Correlations among variables are explained by
underlying factors
– An example of a mathematical one-factor model for two
variables:
V1=L1*F1+E1
V2=L2*F1+E2
Factor Analysis-3
Each variable is composed of a common factor (F1)
multiplied by a loading coefficient (L1, L2 – the
lambdas or factor loadings) plus a random
component
V1 and V2 correlate because of the common factor;
their correlation relates to the factor loadings, so the
factor loadings can be estimated from the
correlations
A given set of correlations can be reproduced by different
factor loadings (i.e. the solutions are not unique)
One should pick the simplest solution
Factor Analysis-4
A factor solution needs to be confirmed:
– By a different factor method
– By a different sample

More on terminology
– Factor loading: interpreted as the Pearson
correlation between the variable and the
factor
– Communality: the proportion of variability for a
given variable that is explained by the factor
– Extraction: the process by which the factors
are determined from a large set of variables
Factor Analysis-5
Principal component: one of the extraction
methods
– A principal component is a linear combination of
observed variables that is independent (orthogonal) of
other components
– The first component accounts for the largest amount
of variance in the input data; the second component
accounts for the largest amount of the remaining
variance…
– Components being orthogonal means they are
uncorrelated
Factor Analysis-6
Possible application of principal
components:
– E.g. in survey research, it is common to
have many questions to address one issue
(e.g. customer service). It is likely that these
questions are highly correlated. It is
problematic to use these variables in some
statistical procedures (e.g. regression). One
can use factor scores, computed from factor
loadings on each orthogonal component
Factor Analysis-7
Principal components vs. other extraction methods:
– Principal components focus on accounting for the
maximum amount of variance (the diagonal of a
correlation matrix)
– Other extraction methods (e.g. principal axis factoring)
focus more on accounting for the correlations
between variables (the off-diagonal correlations)
– A principal component can be defined as a unique
combination of variables but the other factor methods
cannot
– Principal components are used for data reduction but
are more difficult to interpret
Factor Analysis-8
Number of factors:
– Eigenvalues are often used to determine how
many factors to take
Take as many factors as there are eigenvalues
greater than 1
– An eigenvalue represents the amount of standardized
variance in the variables accounted for by a factor
– The amount of standardized variance in a variable is 1
– The sum of the retained eigenvalues, divided by the
number of variables, is the proportion of variance
accounted for
Factor Analysis-9
Rotation
– Objective: to facilitate interpretation
– Orthogonal rotation: done when data reduction is the
objective and factors need to be orthogonal
Varimax: attempts to simplify interpretation by maximizing the
variances of the variable loadings on each factor
Quartimax: simplifies the solution by finding a rotation that
produces high and low loadings across factors for each
variable
– Oblique rotation: use when there is reason to allow
factors to be correlated
Oblimin and Promax (Promax runs faster)
Factor Analysis-10
Factor scores: if you are satisfied with a
factor solution
– You can request that a new set of variables
be created that represents the scores of each
observation on the factors (difficult to interpret)
– You can use the lambda coefficients to judge
which variables are highly related to a
factor, then compute the sum or the mean of
those variables for further analysis (easier to
interpret)
Factor Analysis-11
Sample size: the sample size should be about
10 to 15 times the number of variables (as with
other multivariate procedures)
Number of methods: there are 8 factoring
methods, including principal components
– Principal axis: accounts for the correlations between the
variables
– Unweighted least-squares: minimizes the residuals
between the observed and the reproduced correlation
matrix
Factor Analysis-12
– Generalized least-squares: similar to unweighted least-
squares but gives more weight to the variables with
stronger correlations
– Maximum likelihood: generates the solution that is the
most likely to have produced the correlation matrix
– Alpha factoring: considers the variables as a sample
from a universe of variables; does not use factor loadings
– Image factoring: decomposes the variables into a
common part and a unique part, then works with the
common part
Factor Analysis-13
Recommendations:
– Principal components and principal axis are
the most commonly used methods
– When there is multicollinearity, use principal
components
– Rotations are often done; try Varimax first
Factor Analysis-14
Example 1: whether a small number of athletic
skills account for performance in the ten
separate decathlon events
– File>Open>Data…; select Olymp88.por
– Looking at correlations:
Analyze>Correlate>Bivariate
– Principal components with orthogonal rotation
Analyze>Data Reduction>Factor
– Select all variables except score
– Click the Extraction button>click Scree Plot
– Check off Unrotated factor solution
– Click Continue
Factor Analysis-15
Click Rotation button>click Varimax; Loading plots;
click continue
Click the Options button>click Sorted by size; click the
Suppress absolute values box; change .10 to .30;
click Continue
Click Descriptive>Univariate descriptive; KMO and
Bartlett’s test of sphericity (KMO measures how
well the sample data are suited for factor analysis:
.9 is great and less than .5 is not acceptable;
Bartlett’s test tests the sphericity of the correlation
matrix); click continue
Click OK
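A syntax sketch of this principal components run; the ten event variable names are hypothetical placeholders for those in Olymp88.por:
  FACTOR
    /VARIABLES run100 longjump shotput highjump run400 hurdles discus polevault javelin run1500
    /PRINT=UNIVARIATE INITIAL KMO EXTRACTION ROTATION
    /FORMAT=SORT BLANK(.30)
    /PLOT=EIGEN ROTATION
    /CRITERIA=MINEIGEN(1)
    /EXTRACTION=PC
    /ROTATION=VARIMAX.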
Factor Analysis-16
Try to validate the first factor solution
using a different method
– Analyze>Data Reduction>Factor Analysis
Click Extraction>select Principal axis factoring;
click Continue
Click Rotation>select Direct Oblimin (leave the delta
value at 0, the most oblique solution allowed); type 50 in
the Maximum Iterations box; click Continue
Click the Scores button>click Save as variables (this
involves solving a system of equations for the factors;
regression is one of the methods for solving the
equations); click Continue
Click OK
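A syntax sketch of this validation run, using the same hypothetical variable names as above:
  FACTOR
    /VARIABLES run100 longjump shotput highjump run400 hurdles discus polevault javelin run1500
    /CRITERIA=MINEIGEN(1) ITERATE(50)
    /EXTRACTION=PAF
    /ROTATION=OBLIMIN(0)
    /SAVE=REG(ALL).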
Factor Analysis-17
Note: the Pattern matrix gives the
standardized linear weights and the
Structure matrix gives the correlations
between variables and factors (in principal
component analysis, the component
matrix gives both the factor loadings and the
correlations)
Discriminant Analysis-1
Discriminant analysis characterizes the
relationship between a set of IVs and a
categorical DV with relatively few
categories
– It creates a linear combination of the IVs that
best characterizes the differences among the
groups
– Predictive discriminant analysis focuses on
creating a rule to predict group membership
– Descriptive DA studies the relationship
between the DV and the IVs.
Discriminant Analysis-2
Possible applications:
– Whether a bank should offer a loan to a new
customer?
– Which customer is likely to buy?
– Identify patients who may be at high risk for
problems after surgery
Discriminant Analysis-3
How does it work?
– Assume the population of interest is composed of
distinct populations
– Assume the IVs follow a multivariate normal
distribution
– DA seeks a linear combination of the IVs that best
separates the populations
– If we have k groups, we need k-1 discriminant
functions
– A discriminant score is computed for each function
– This score is used to classify cases into one of the
categories
Discriminant Analysis-4
– There are three methods to classify group
memberships:
Maximum likelihood method: assign a case to group
k if the probability of membership is greater in
group k than in any other group
Fisher (linear) classification functions: assign
membership to group k if the score on the function
for group k is greater than any other function
score
Distance function: assign membership to group k if
the distance to the centroid of that group is minimum
Note: SPSS uses the maximum likelihood method
Discriminant Analysis-5
Basic steps in DA:
– Identify the variables
– Screen data: look for outliers, variables that may
not be good predictors, etc.
– Run DA
– Check for the correct prediction rate
– Check for the importance of individual
predictors
– Validate the model
Discriminant Analysis-6
Assumptions:
– IVs are either dichotomous or measurement
– Normality
– Homogeneity of variances
Discriminant Analysis-7
Example 1: VCR buyers filled out a survey; we
want to determine which set of demographic
and attitude variables best predicts which
customers may buy another VCR
– File>Open Data…>CSM.por
– Explore the data
– Analyze>Classify>Discriminant
Move age, complain, educ, fail, pinnovat, preliabl, puse, qual,
use, and value into Independent box
Move buyyes into Grouping box
Click Define Range; type 1 for Min and 2 for Max
Click continue
Discriminant Analysis-8
Click Statistics>click Box’s M and Fisher’s;
continue
Click Classify button>click Summary table;
Separate groups; Continue
Click Save button>click on Discriminant Scores;
continue
Click OK
– How are the original variables related to the
discriminant scores?
Graphs>Scatter>Click Define
– Move pinnovat into X and dis1_1 into Y; move buyyes
into Set Markers by box
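A syntax sketch of this analysis; COEFF requests Fisher’s classification functions, and adding CROSSVALID to STATISTICS would give the leave-one-out validation discussed later:
  DISCRIMINANT
    /GROUPS=buyyes(1,2)
    /VARIABLES=age complain educ fail pinnovat preliabl puse qual use value
    /STATISTICS=BOXM COEFF TABLE
    /PLOT=SEPARATE
    /SAVE=SCORES.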
Discriminant Analysis-9
Since Box’s M test was significant, one
can ask SPSS to run DA using the ‘separate
covariances’ option (under Classify) and
compare the results
From the 1st analysis, we see that ‘age’
was not important; one can redo the
analysis without ‘age’ and compare the
results
Discriminant Analysis-10
Validate the model: leave-one-out classification
– Repeat the analysis, click on Classify>click leave-one-
out classification; Click continue
Example 2: predict smoking and drinking habits
– Analyze>Classify>Discriminant
Move smkdrnk into Grouping Variable box; move age,
attend, black, class, educ, sex and white into IV list
Click Statistics>Select Fisher’s and Box M; Continue
Click Classify>Summary table; Combined-groups; Territorial
map; Continue
Click OK
Cluster Analysis-1
Cluster analysis is an exploratory data
analysis technique designed to reveal groups
How?
– By distance: observations close together
should be in the same group, and
observations in different groups should be far apart
Applications:
– Plants and animals into ecological groups
– Companies for product usage
Cluster Analysis-2
Two types of methods
– Hierarchical: requires observations to remain
together once they have joined a cluster
Complete linkage
Between-groups average linkage
Ward’s method
– Nonhierarchical: no such requirement
The researcher must pick the number of clusters (K-
means algorithm)
Cluster Analysis-3
Recommendations:
– For relatively small samples (less than a few
hundred), use hierarchical methods
– For large samples, use K-means
Example 1: evaluating 20 types of beer
– File>Open>Data; select beer.por
– Analyze>Descriptive Stat>Descriptive
Move cost, calories, sodium, and alcohol into
variable list
Click at Save standardized values; OK
Cluster Analysis-4
Analyze>Classify>Hierarchical Cluster
– Move cost, calories, sodium, and alcohol into Variable
list box
– Move beer into the Label Cases By box
– Click Plots>click Dendrogram; click none in Icicle
area; continue
– Click Method>select Z-score from the standardize
drop-down list; Continue
– Click Save>Click range of solutions; range 2-5
clusters; continue
– OK
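A syntax sketch of this hierarchical run; DESCRIPTIVES /SAVE creates the z-score versions of the clustering variables:
  DESCRIPTIVES VARIABLES=cost calories sodium alcohol /SAVE.
  CLUSTER zcost zcalories zsodium zalcohol
    /METHOD=BAVERAGE
    /ID=beer
    /PRINT=SCHEDULE
    /PLOT=DENDROGRAM
    /SAVE=CLUSTER(2,5).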
Cluster Analysis-5
Additional analysis
– Look at the last 4 columns of the data (clu5_1 to
clu2_1); they contain cluster memberships for each
solution from 5 to 2 clusters
– Analyze>Descriptive Statistics>Frequencies
Move clu2_1 to clu5_1 into the Variable box
OK
– Obtain the mean profile for the clusters
Graph>Line>Summaries of separate variables
– Click Define>move zcost, zcalories, zsodium, and zalcohol to
the Lines Represent box
– Click clu4_1 and move it to the Category Axis box
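For large samples, a K-means sketch of the same idea (the 4-cluster count here is an assumption, matching the clu4_1 solution):
  QUICK CLUSTER zcost zcalories zsodium zalcohol
    /CRITERIA=CLUSTER(4)
    /SAVE CLUSTER.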
Path Analysis-1
Path analysis is a technique based on
regression to establish causal relationships
– Start with a diagram of the causal flow
– Direct causal effects model (regression)
The direct causal effect of an IV on a DV is its coefficient
(the number of units of change in the DV for a 1-unit change in the IV)
– Build on the DCEM
Two forms of a causal model:
– Diagram
– Equation (structural equation)
Path Analysis-2
An example of a causal model
– Structural equation:
Z4 = p41*Z1 + p42*Z2 + p43*Z3 + e4
– p: path coefficient
– e: disturbance
– Z4: endogenous variable
– Z1: exogenous variable
– Path diagram (not reproduced here)
An indirect effect is the product of the path
coefficients along the path
Path Analysis-3
Steps in path analysis:
– Create a path diagram
– Use regression to estimate structural equation
coefficients
– Assess the model:
Compare the observed and reproduced
correlations (reproduced correlations will be
computed by hand)
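A sketch of the kind of regression used to estimate the path coefficients for an endogenous variable (hypothetical variable names; with standardized variables, the Betas are the path coefficients):
  REGRESSION
    /DEPENDENT lifeexp_m
    /METHOD=ENTER region develop doctors
    /STATISTICS=COEFF R.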
Path Analysis-4
Research questions:
– Is our model (which describes the causal
effects among the variables ‘region of the
world’, ‘status as a developing nation’,
‘number of doctors’, and ‘male life
expectancy’) consistent with our observed
correlations among these variables?
– If our model is consistent, what are the
estimated direct, indirect, and total causal
effects among the variables?
Path Analysis-5
Legal path:
– No path may pass through the same variable
more than once
– No path may go backward on an arrow after
going forward on another arrow
– No path may include more than one double-
headed curved arrow
Path Analysis-6
Component labels:
– D: direct effect (just one straight arrow)
– I: indirect effect (more than one straight
arrow)
– S: spurious effect (there is a backward arrow)
– U: effect is uncertain (starts with a double-
headed curved arrow)
Path Analysis-7
If the model is in question (some of the
reproduced correlations differ from the
observed correlations by more than .05)
– Test all missing paths (run additional
regressions and check the significance of the
coefficients)
– Drop existing paths whose coefficients
are not significant
Logistic regression - Motivations
When the dependent variable is
dichotomous, regular regression is not
appropriate
– We want to predict probability
– OLS regression predictions could be any
numbers, not just numbers between 0 and 1
– When dealing with proportions, the variance
depends on the mean, so the equal-variance
assumption of OLS is violated
Motivations-Continue
Fit an S curve to the data

[Figure: S-shaped curve of the probability of owning a home (0.0 to 1.0) plotted against income (0 to 10)]
What is Logistic Regression?
Regressions of the form
ln(Odds) = B0 + B1X1 + … + BkXk
ln(Odds) is called a logit
Odds = Prob/(1 − Prob)
Prob = e^(B0 + B1X1 + … + BkXk) / (1 + e^(B0 + B1X1 + … + BkXk))
Application of Logistic
Regression
When to use it?
– When the dependent variable is
dichotomous
Objectives:
– Run a logistic regression
– Apply a stepwise logistic regression
– Use the ROC (receiver operating
characteristic) curve to assess the model
Assumptions of logistic
regression
The independent variables are interval or
dichotomous
All relevant predictors are included, no
irrelevant predictors are included, and the
form of the relationship is linear
The expected value of the error term is
zero
There is no autocorrelation
Assumptions of logistic
regression – Cont.
There is no correlation between the error
and the independent variables
There is an absence of perfect
multicollinearity between the independent
variables
Need to have a large sample (rule of
thumb: n should be > 30 times the
number of parameters)
Note on assumptions
No need for normality of errors
No need for equal variance
Example
Objective: to predict low birth weight babies
Variables:
– Low: 1: <=2500 grams, 0: >2500 grams
– LWT: mother’s weight at last menstrual period
– Age
– Smoke
– PTL: # of premature deliveries
– HT: History of Hypertension
– UI: uterine irritability
– FTV: # of physician visits during first trimester
– Race: 1=white, 2=black, 3=other
Example
File > Open > Data > Select SPSS
Portable type > select Birthwt (in
Regression)
Analyze > Regression > Binary Logistic
– Move ‘low’ to the Dependent list box
– Move ‘age’, ‘ftv’, ‘ht’, ‘ptl’, ‘race’, ‘smoke’, and
‘ui’ into the Covariate list box
Example (cont.)
Click the Categorical button
– Place ‘race’ in the Categorical Covariates box
Click Continue, click Save
– Click the Probability and Group Membership
check boxes
Click Continue and then the Options button
Example (cont.)
Click on the Classification plots and
Hosmer-Lemeshow goodness of fit
checkboxes
Click Continue, then OK
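A syntax sketch of this logistic regression; PRINT=GOODFIT requests the Hosmer-Lemeshow test and CLASSPLOT the classification plot:
  LOGISTIC REGRESSION VARIABLES low
    /METHOD=ENTER age ftv ht ptl race smoke ui
    /CONTRAST(race)=INDICATOR
    /SAVE=PRED PGROUP
    /CLASSPLOT
    /PRINT=GOODFIT.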
Logistic outputs
Initial summary output: info on dependent
and categorical variables
Block 0: based on a model that includes just a
constant – provides baseline info
Block 1: Method Enter – include the model
info
– Chi-square tests if all the coeffs are 0 (similar
to ‘F’ in regression)
Logistic outputs (cont.)
The Model chi-square value is the
difference between the initial and final –2LL
(a small value of –2LL indicates a good fit;
–2LL = 0 indicates a perfect fit)
The Step and Block entries display the results
of the last step and block (they are the same
here because we are not using stepwise
regression)
Logistic outputs (cont.)
The goodness of fit statistics –2LL is
203.554
Cox & Snell R square – similar to R-
square in OLS
Nagelkerke R square (preferred because it
can reach 1)
Hosmer and Lemeshow test: tests “there
is no difference between expected and
observed counts”, i.e. we prefer a non-
significant result
Logistic outputs (cont.)
Classification table: can our model
predict accurately?
– Overall accuracy is 73%
– We do much better on higher birth weights
– It does a poor job on lower birth weights
– A significant model doesn’t guarantee high
predictive accuracy
Interpretation of the coefficients
E.g. HT (hypertension)
– B=1.736 – hypertension in the mother
increases the log odds by 1.736
– Exp(B)=5.831 – hypertension in the mother
increases the odds of having a low birth weight
baby by a factor of 5.831
– What is the prob. change?
If the original odds are 1:100 (p=.0099), they change
to 5.831:100 (p=.0551); if the original odds are 1:1
(p=.5), they change to 5.831:1 (p=.85)
Interpretation of the coefficients
(cont.)
Categorical variable Race:
– First an overall effect
– Race(1) – white: the effect of being white is
significant, acting to decrease the odds
compared to ‘other’ by a factor of .4
– The effect of being black is not significant
compared with ‘other’
Making prediction
Suppose a mother:
– Age 20
– Weighs 130 pounds
– Smokes
– Has no hypertension or premature labor
– Has uterine irritability
– Is white
– Made two visits to her doctor
Making prediction (cont.)
P(event) = 1/(1 + exp(−(a + b1X1 + … + bkXk)))
P = .397
Predicted not to have a low birth weight baby
because the probability is less than .5
Checking classification
Need to study the characteristics of
mispredicted cases
– Transform>Compute> Pred_err=1 if…
– Analyze>Compare Means (LWT vs Pred_err)
The mean LWT for the mispredicted cases is much
lower than for the correctly predicted cases
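A sketch of one way to do this, assuming the predicted group membership was saved under its default name pgr_1:
  * pred_err = 1 when the predicted group differs from the actual outcome.
  COMPUTE pred_err=(pgr_1 NE low).
  EXECUTE.
  MEANS TABLES=lwt BY pred_err.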
Residual Analysis
Analyze>Regression>Logistic>Click Save >Click
Cook’s, Leverage, Unstandardized, Logit, and
Standardized
Examining data
– Cook’s and Leverage should be small (if a case has
no influence on the regression result, the values
would be 0)
– Res_1 is the residual of the probability (e.g. the 1st case has
predicted prob. .29804 and an actual ‘low’ value of 0,
so res_1 = 0 − .29804 = −.29804)
– Zre_1 is the standardized residual of the probabilities
– Lre_1 is the residual in terms of the logit
ROC curve (Receiver Operating
Characteristic)
Sensitivity: true positive rate
Specificity: true negative rate
Changing the cut-off point (default .5) changes both the
sensitivity and the specificity
ROC can help us select an ‘optimal’ cut-off
point
Graph>ROC Curve>move pre_1 to ‘Test
Variable’, low to ‘State Variable’, type ‘1’ in the
‘Value of State Variable’, click ‘with diagonal
reference line’ and ‘Coordinate points of the
ROC Curve’
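A syntax sketch of the same ROC request:
  ROC pre_1 BY low (1)
    /PLOT=CURVE(REFERENCE)
    /PRINT=COORDINATES.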
ROC curve interpretation
Vertical axis: sensitivity (true positive rate)
Horizontal axis: false positive rate (1 − specificity)
Diagonal: reference line
Gives the trade-off between sensitivity and the
false positive rate
Pay attention to the area where the curve
rises rapidly
The 1st column of ‘coordinates of the curve’
gives the cut-off probabilities
Residual Analysis – Cont.
Examine the distribution of zre_1
– Graph>Interactive>Histogram>drag zre_1 to
X axis, click Histogram, click Normal Curve
Note: this plot need not show normality
Finding influential cases
– Graph>Scatterplot>Define>Move id to X axis,
coo_1 to Y axis
Multicollinearity
– Use OLS regression to check (?)
Multinomial Logistic Regression
The dependent variable is categorical with
two or more categories
It is an extension of logistic regression
The assumptions are those of
logistic regression plus ‘the dependent
variable has a multinomial distribution’
Example
Objective: predict credit risk (3
categories) based on financial and
demographic variables
Variables:
– Age
– Income
– Gender
– Marital (single, married, divsepwid)
– Numkids: # of dependent children
Example Cont.
– Numcards: #of credit cards
– Howpaid: how often paid (weekly, monthly)
– Mortgage: have a mortgage (y, n)
– Storecar: # of store credit cards
– Loans: # of other loans
– Risk: 1=bad risk-lost, 2=bad risk-profit, 3=good
risk
How does it work?
Let f(j) be the probability of being in
outcome category j
– f(1)=P(bad risk-lost)
– f(2)=P(bad risk-profit)
– f(3)=P(good risk)
– g(1)=f(1)/f(3)
– g(2)=f(2)/f(3)
– g(3)=f(3)/f(3)=1
How does it work? – Cont.
Fit the model:
– ln(g(1))= A1+B11X1+…+B1kXk
– ln(g(2))= A2+B21X1+…+B2kXk
– ln(g(3))= ln(1)=0=A3+B31X1+…+B3kXk
f(j) = g(j) / (g(1) + g(2) + g(3))
How does it work? – Cont.
f(1) = g(1)/(g(1)+g(2)+1) = e^(A1+B11X1+…+B1kXk) / (e^(A1+B11X1+…+B1kXk) + e^(A2+B21X1+…+B2kXk) + 1)
f(2) = g(2)/(g(1)+g(2)+1) = e^(A2+B21X1+…+B2kXk) / (e^(A1+B11X1+…+B1kXk) + e^(A2+B21X1+…+B2kXk) + 1)
f(3) = g(3)/(g(1)+g(2)+1) = 1 / (e^(A1+B11X1+…+B1kXk) + e^(A2+B21X1+…+B2kXk) + 1)
Example – Cont.
File > Open > Data > Select Risk > Open
Move risk into dependent list box
Move marital and mortgage into the
Factor(s) list box
Move income and numkids into the
Covariate(s) list box
Click model button
– Click cancel button
Example (Cont.)
Click Statistics button
– Check the Classification table check box
– Click Continue
Click Save
– The Multinomial Logistic regression in SPSS
version 10 will only save model info in an XML
(Extensible Markup Language) format
– Click cancel
Click OK
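A syntax sketch of this multinomial model; the last category (good risk) is the reference:
  NOMREG risk (BASE=LAST) BY marital mortgage WITH income numkids
    /PRINT=PARAMETER SUMMARY LRT CLASSTABLE.
  * For the interaction model discussed later, add a subcommand such as
  * /MODEL=marital mortgage income numkids marital*mortgage.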
Multinomial output
Model Fit, Pseudo R-square, and
Likelihood ratio tests are similar to logistic
regression
The Parameter estimates table is different
– There are two sets of parameters
One for the probability ratio of
(bad risk-lost)/(good risk)
Another set for the prob. ratio of
(bad risk-profit)/(good risk)
Interpretation of coefficients
Income in the ‘bad lost’ section
– It is significant
– Exp(B)=.962: the expected probability ratio
decreases slightly (by a factor of .962) for a one-
unit increase in income
How to predict?
F(1) – the chance in ‘bad loss’ group
F(2) – the chance in ‘bad profit’ group
F(3) – the chance in ‘good risk’ group
F(j)=g(j)/sum(g(i))
g(j)=exp(modelj)
How to predict? (cont.)
Suppose an individual
– Single, has a mortgage
– No children
– Income of 35,000 pounds
g(1)=.218
g(2)=.767
g(3)=1
How to predict?
F(1)=.218/(.218+.767+1)=.110
F(2)=.386
F(3)=.504
The individual is classified as good risk
Multinomial Logistic Reg. With
Interaction
Analyze>Regression>Multinomial
Logistic>Click Model, select
Custom>specify your model (all main
effects plus the interaction between marital
and mortgage)
Interpret the results as usual
Interaction effects in logistic
Regression
It is similar to OLS regression:
– Add interaction terms to the model as
crossproducts
– In SPSS, highlighting two variables (holding
down the Ctrl key) and moving them into the
variables box creates the interaction term
