Data Screening
Handling Missing Data
• Missing less than 10% from a variable or respondent
is typically not problematic (unless you lose specific
items, or one tail of the distribution)
• Method for handling missing data:
– >10% - Just don't use that variable/respondent unless you
go below acceptable n
– <10% - Impute if not categorical
– Warning: If you remove too many respondents, you will
introduce response bias
• If the DV is missing, then there is little you can do with
that record
• One alternative is to impute and run models with and
without missing data to see how sensitive the result is
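The 10% rule of thumb above can be checked mechanically before deciding what to drop or impute. The following is a minimal sketch in plain Python, assuming missing answers are stored as `None`; the variable names (`q1`, `q2`, `q3`) and the data are invented for illustration.

```python
THRESHOLD = 0.10  # the 10% rule of thumb from the slide


def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def screen_missing(rows, variables, threshold=THRESHOLD):
    """Return (variables, respondent indexes) whose missing share exceeds the threshold."""
    variables_over = [
        var for var in variables
        if missing_share([row[var] for row in rows]) > threshold
    ]
    respondents_over = [
        idx for idx, row in enumerate(rows)
        if missing_share([row[var] for var in variables]) > threshold
    ]
    return variables_over, respondents_over


# Made-up example: 10 respondents, 3 variables.
variables = ["q1", "q2", "q3"]
rows = [{"q1": 3, "q2": 4, "q3": 5} for _ in range(10)]
rows[0]["q2"] = None
rows[1]["q2"] = None  # q2 is now 20% missing
rows[2]["q3"] = None  # q3 is exactly 10% missing, so not over the threshold

variables_over, respondents_over = screen_missing(rows, variables)
print(variables_over, respondents_over)
```

With only three variables, a single missing answer already puts a respondent over 10%, which is why the acceptable-n caveat matters before dropping anyone.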
Imputation Methods (Hair, table 2-2)
• Use only valid data
– No imputation, just use valid cases or variables
– In SPSS: Exclude Pairwise (variable), Listwise (case)
• Use known replacement values
– Match missing value with similar case’s value
• Use calculated replacement values
– Use variable mean, median, or mode
– Regression based on known relationships
• Model based methods
– Iterative two step estimation of value and descriptives
to find most appropriate replacement value
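As an illustration of the "calculated replacement values" option, here is a small sketch of mean and median imputation in plain Python. It assumes missing values are stored as `None`; the function name `impute` and the scores are hypothetical, and this is not a reproduction of the SPSS procedure.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no valid data to impute from")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


scores = [1, 2, None, 2, 10, None, 5]  # invented data with two gaps
imputed_mean = impute(scores)              # gaps filled with the mean of observed values
imputed_median = impute(scores, "median")  # gaps filled with the median instead
print(imputed_mean)
print(imputed_median)
```

Running the model on both the imputed and the unimputed data, as suggested earlier, shows how sensitive the result is to the choice.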
Mean Imputation in SPSS
2. Include each variable that has values that need imputing
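[Figure: box plot with callouts. The box covers 50% of the values and the mean should fall within the box; 99% of values should fall within the marked range; points outside that range are outliers.]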
Handling Univariate Outliers
• Univariate outliers should be examined on a case by
case basis.
• If the outlier is truly abnormal, and not representative
of your population, then it is okay to remove. But this
requires careful examination of the data points
– e.g., you are studying dogs, but somehow a cat got ahold
of your survey
– e.g., someone answered “1” for all 75 questions on the
survey
• However, just because a datapoint doesn’t fit
comfortably with the distribution, that alone does not
make it a candidate for removal
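The "answered 1 for all 75 questions" case is one of the few abnormal patterns that can be flagged automatically. A minimal sketch, assuming each respondent's answers are held in a list; the respondent IDs and data are invented, and a flag is only a prompt for case-by-case inspection.

```python
def is_straight_liner(answers):
    """True when every answered item has the same value."""
    answered = [a for a in answers if a is not None]
    return len(answered) > 1 and len(set(answered)) == 1


responses = {
    "r1": [1] * 75,              # same answer on all 75 questions
    "r2": [3, 4, 2, 5, 3] * 15,  # varied answers
}
flagged = [rid for rid, answers in responses.items() if is_straight_liner(answers)]
print(flagged)
```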
Detecting Multivariate Outliers
• Multivariate outliers are sets of data points
(tuples) that do not fit the pattern of correlations
exhibited by the other data points in the dataset with
regard to your causal model.
• For example, if all but one person in the dataset
report that diet has a positive effect on weight loss,
but this one guy reports that he gains weight when he
diets, then his record would be considered an outlier.
• To detect these influential multivariate outliers, you
need to calculate the Mahalanobis d-squared. (Easy in
AMOS)
[Figure callouts: "These are row numbers from SPSS." "Anything less than .05 in the p1 column is abnormal, and is a candidate for inspection."]
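To make the Mahalanobis d-squared less of a black box, here is a sketch for the two-variable case in plain Python. It is not a reproduction of the AMOS output: the tail probability below is a chi-square approximation used as a stand-in for the p1 column, and the diet and weight-loss numbers are invented.

```python
import math
from statistics import mean


def mahalanobis_d2_two_vars(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample covariance matrix entries (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    d2 = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # (dx, dy) * inverse(S) * (dx, dy)^T, written out for the 2x2 case.
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2


def chi2_tail_df2(d2):
    """Chi-square upper-tail probability; exp(-x/2) is exact for 2 degrees of freedom."""
    return math.exp(-d2 / 2)


# Invented data: ten people lose more weight the more they diet, one does not.
diet = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9]
loss = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1]

d2 = mahalanobis_d2_two_vars(diet, loss)
p = [chi2_tail_df2(v) for v in d2]
flagged = [i for i, pv in enumerate(p) if pv < 0.05]  # rows to inspect
print(flagged)
```

Only the last record, the one that goes against the pattern of the others, falls below .05 here. With more than two variables the same idea needs a general matrix inverse and the matching degrees of freedom.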
Handling Multivariate Outliers
• Create a new variable in SPSS called “Outlier”
– Code 0 for Mahalanobis p1 > .05
– Code 1 for Mahalanobis p1 < .05
• I have a tool for this if you want…
• Then in AMOS, when selecting data files, use
“Outlier” as a grouping variable, with the
grouping value set to 0
– This then runs your model with only non-outliers
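The coding rule above is simple enough to sketch in a few lines. The p1 values here are invented, and treating a value of exactly .05 as a non-outlier is an assumption, since the slide only covers the strictly greater and strictly smaller cases.

```python
def code_outlier(p1, alpha=0.05):
    """Return 1 (outlier) when p1 is below alpha, else 0 (keep)."""
    # Assumption: p1 exactly equal to alpha is coded 0.
    return 1 if p1 < alpha else 0


p1_values = [0.62, 0.03, 0.41, 0.008, 0.05]  # made-up p1 column
outlier_flags = [code_outlier(p1) for p1 in p1_values]
kept_rows = [i for i, flag in enumerate(outlier_flags) if flag == 0]
print(outlier_flags, kept_rows)
```

Filtering on a flag of 0 mirrors using "Outlier" as the grouping variable with the grouping value set to 0.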
Before and after removing outliers
[Figure: model results BEFORE (N=340) and AFTER (N=295) removing outliers.]
Even after you remove outliers, the Mahalanobis will come up with a whole new set of outliers, so these
should be checked on a case by case basis, using the Mahalanobis as a guide for inspection.
“Best Practice” for outliers
• In general, it is a bad idea to remove outliers,
unless they are truly “abnormal” and do not
represent accurate observations from the
population. The logic of removal needs to be
based on semantics of the data
• Removing outliers (especially en masse, as
demonstrated with the Mahalanobis values) is
risky because it decreases your ability to
generalize: you do not know the cause of this
type of variance, and it may be more than just noise.
Statistical Assumptions
Part of data screening is ensuring you meet the four main
statistical assumptions for multivariate data analysis:
1. Normality
2. Homoscedasticity
3. Linearity
4. Multicollinearity (absence of)
Shape
Skewness
Kurtosis
Tests for Skewness and Kurtosis
• Relaxed rule:
– Skewness > 1 = positive (right) skewed
– Skewness < -1 = negative (left) skewed
– Skewness between -1 and 1 is fine
• Strict rule:
– Abs(Skewness) > 3*Std. error = Skewed
– Same for Kurtosis
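Both rules can be applied by hand once skewness, kurtosis, and their standard errors are known. The sketch below uses the bias-adjusted sample formulas and the usual standard-error formulas; they are believed to match what statistical packages such as SPSS report, but treat that as an assumption. The function name and data are invented.

```python
import math


def shape_report(values):
    """Skewness and kurtosis with standard errors (needs at least 4 values)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Bias-adjusted sample skewness and excess kurtosis.
    skew = math.sqrt(n * (n - 1)) / (n - 2) * (m3 / m2 ** 1.5)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * (m4 / m2 ** 2 - 3) + 6)
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return {
        "skewness": skew,
        "se_skewness": se_skew,
        "kurtosis": kurt,
        "se_kurtosis": se_kurt,
        "relaxed_skewed": abs(skew) > 1,            # relaxed rule
        "strict_skewed": abs(skew) > 3 * se_skew,   # strict rule
        "strict_kurtotic": abs(kurt) > 3 * se_kurt, # same rule for kurtosis
    }


symmetric = shape_report([1, 2, 2, 3, 3, 3, 4, 4, 5])
right_tail = shape_report([1, 1, 1, 1, 2, 2, 2, 3, 3, 15])
print(symmetric["skewness"], right_tail["skewness"])
```

The second sample has one large value pulling the right tail, so it trips both the relaxed and the strict rule, while the symmetric sample has zero skewness.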
Tests for Normality
SPSS
1. Analyze
2. Explore
3. Plots
4. Normality
*Neither of these variables would be considered normally
distributed according to the Kolmogorov-Smirnov (KS) or
Shapiro-Wilk (SW) tests, but a visual inspection shows that
role conflict (left) is roughly normal and participation
(right) is positively skewed.
So, ALWAYS conduct visual inspections!
Fixing Normality Issues
• Fix flat distribution with:
– Inverse: 1/X
• Fix negatively skewed distribution with:
– Squared: X*X
– Cubed: X*X*X
• Fix positively skewed distribution with:
– Square root: SQRT(X)
– Logarithm: LG10(X)
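The transformations listed above map directly onto one-line functions. This is a sketch in plain Python rather than SPSS syntax; the dictionary name `TRANSFORMS` and the sample values are invented. Note that the square root needs non-negative values, the logarithm needs strictly positive values, and the inverse fails on zero.

```python
import math

TRANSFORMS = {
    "inverse": lambda x: 1 / x,    # flat distribution (x must not be 0)
    "squared": lambda x: x * x,    # negative skew
    "cubed": lambda x: x * x * x,  # negative skew
    "sqrt": math.sqrt,             # positive skew (x >= 0)
    "log10": math.log10,           # positive skew (x > 0)
}


def transform(values, name):
    """Apply the named transformation to every value."""
    return [TRANSFORMS[name](v) for v in values]


raw = [1, 100, 10000]  # invented, strongly right-skewed values
roots = transform(raw, "sqrt")
logs = transform(raw, "log10")
cubes = transform([1, 2, 3], "cubed")
print(roots, logs, cubes)
```

The log pulls in the long right tail much harder than the square root does, which is why it is usually the second thing to try when the square root is not enough.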
Before and After Transformation
[Figure: histograms of a negatively skewed variable (before) and the same variable cubed (after).]
Homoscedasticity
• Homoscedasticity is a nasty word that helps impress
your listeners!
• If a variable has this property it means that the DV
exhibits consistent variance across different levels of
the IV.
• A simple way to determine if a relationship is
homoscedastic, is to do a scatter plot with the IV on
the x-axis and the DV on the y-axis.
• If the points spread evenly around the trend line across
the whole range of the IV, we have homoscedasticity!
• If the spread fans out or narrows as the IV changes,
then the relationship is heteroscedastic.
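Consistent variance of the DV across levels of the IV can also be checked roughly in numbers, as a complement to the scatter plot. The sketch below fits a straight line, then compares the size of the residuals in the upper half of the IV with the lower half. This is an informal comparison under invented data, not a formal test, and the function name is hypothetical.

```python
from statistics import mean


def residual_spread_ratio(xs, ys):
    """Mean squared residual in the upper half of the IV divided by the lower half."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    pairs = sorted(zip(xs, residuals))  # order cases by the IV
    half = len(pairs) // 2
    low = [r * r for _, r in pairs[:half]]
    high = [r * r for _, r in pairs[-half:]]
    return mean(high) / mean(low)


iv = list(range(1, 11))
signs = [-1 if i % 2 == 0 else 1 for i in range(10)]
even_dv = [x + s for x, s in zip(iv, signs)]           # same scatter everywhere
fan_dv = [x + 0.3 * x * s for x, s in zip(iv, signs)]  # scatter grows with the IV

ratio_even = residual_spread_ratio(iv, even_dv)
ratio_fan = residual_spread_ratio(iv, fan_dv)
print(ratio_even, ratio_fan)
```

A ratio near 1 suggests similar spread at both ends; a ratio far from 1 suggests the fan shape of a heteroscedastic relationship and is a cue to look at the plot more closely.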
Scatterplot approach
Linearity
• Linearity refers to the consistent slope of
change that represents the relationship
between an IV and a DV.
• If the relationship between the IV and the DV
is radically inconsistent, then it will throw off
your SEM analyses, because your data are not linear.
• Sometimes you can achieve linearity with
transformations (e.g., log-linear).
[Figure: example scatterplots labeled "Good" and "Bad".]
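To illustrate how a transformation can restore linearity, the sketch below compares the Pearson correlation between an IV and a DV before and after taking the log of the DV. The data are invented (the DV doubles with each step of the IV), and the correlation is used here only as a rough indicator of how well a straight line fits.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


iv = list(range(1, 9))
dv = [2 ** x for x in iv]  # curved: doubles at every step

r_raw = pearson_r(iv, dv)
r_log = pearson_r(iv, [math.log10(y) for y in dv])
print(round(r_raw, 3), round(r_log, 3))
```

The raw relationship is strong but curved; after the log transformation the points fall on a straight line, so the correlation is essentially 1.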
Multicollinearity
• Multicollinearity is not desirable in regressions
(but desirable in factor analysis!).
• It means that independent variables are too
highly correlated with each other and share
too much variance
• Influences the accuracy of estimates for DV
and inflates error terms for DV (Hair).
• How much unique variance does the black
circle actually account for?
Detecting Multicollinearity
• An easy way to check this is to calculate a Variance
Inflation Factor (VIF) for each independent variable:
run a multiple regression using one of the IVs as the
dependent variable, regressing it on all the remaining
IVs. Then swap out the IVs one at a time.
• The rules of thumb for the VIF are as follows:
– VIF < 3; no problem
– VIF > 3; potential problem
– VIF > 5; very likely problem
– VIF > 10; definitely problem
Handling Multicollinearity
[Figure callouts: "Loyalty 2 and Loyalty 3 seem to be too similar in both of these tests." "Dropping Loyalty 2 fixed the problem."]
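The swap-one-IV-at-a-time procedure and the Loyalty 2 fix can be sketched in plain Python. Each IV is regressed on the others and VIF is computed as 1 / (1 - R²). The loyalty scores are invented so that two of the items nearly duplicate each other; the helper names are hypothetical, and a statistics package would normally do this for you.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(target, predictors):
    """R-squared from regressing target on predictors, with an intercept."""
    n = len(target)
    cols = [[1.0] * n] + predictors
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * t for p, t in zip(ci, target)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, cols)) for i in range(n)]
    mean_t = sum(target) / n
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1 - ss_res / ss_tot


def vif(ivs):
    """Map each IV name to its VIF, regressing it on all the other IVs."""
    result = {}
    for name, values in ivs.items():
        others = [v for k, v in ivs.items() if k != name]
        result[name] = 1 / (1 - r_squared(values, others))
    return result


# Invented scores: loyalty2 and loyalty3 differ for only one respondent.
ivs = {
    "loyalty1": [1, 5, 2, 4, 3, 5, 1, 4],
    "loyalty2": [2, 3, 5, 1, 4, 2, 5, 3],
    "loyalty3": [2, 3, 5, 1, 4, 2, 4, 3],
}
vifs_all = vif(ivs)
vifs_reduced = vif({k: v for k, v in ivs.items() if k != "loyalty2"})
print(vifs_all)
print(vifs_reduced)
```

In this made-up data the two near-duplicate items land above the VIF > 10 threshold, and dropping one of them brings the remaining VIFs under 3.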