Data screening
LESSON: Data Screening
VIDEO TUTORIAL: Data Screening
conduct further statistical analyses. Data must be screened in order to ensure the data is useable, reliable, and valid for testing causal

Contents
1 Missing Data
2 Outliers
2.1 Univariate
2.2 Multivariate
3 Normality
4 Linearity
5 Homoscedasticity
6 Multicollinearity

Missing Data
If you are missing much of your data, this can cause several problems. The most apparent problem is that there simply won't be
enough data points to run your analyses. The EFA, CFA, and path models require a certain number of data points in order to compute
estimates. This number increases with the complexity of your model. If you are missing several values in your data, the analysis just
won't run.
then you will have male-biased data. Perhaps only 50% of the females reported their gender, but 95% of the males reported gender. If
you use gender in your causal models, then you will be heavily biased toward males, because you will not end up using the
responses with unreported gender.
To find out how many missing values each variable has, in SPSS go to Analyze, then Descriptive Statistics, then Frequencies. Enter
the variables in the variables list. Then click OK. The table in the output will show the number of missing values for each variable.
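Outside SPSS, the same count can be sketched in Python with pandas. This is a minimal illustration, not the SPSS procedure itself: the column names and values are hypothetical, and the 10% cutoff is the flexible guideline this section describes.

```python
import pandas as pd

# Hypothetical survey data; None marks a missing response.
df = pd.DataFrame({
    "age":    [25, 31, None, 47, 52, 38, 29, 44, 36, 41],
    "gender": ["F", "M", "M", None, None, "F", "M", "M", None, "M"],
    "sat1":   [4, 5, 3, 4, 2, 5, 4, 3, 4, 5],
})

missing_count = df.isna().sum()    # number of missing values per variable
missing_share = df.isna().mean()   # fraction of missing values per variable
report = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(report)

# Variables whose missing share exceeds the (flexible) 10% guideline.
flagged = report.index[report["share"] > 0.10].tolist()
print(flagged)
```

The same idea works per respondent: `df.isna().mean(axis=1)` gives the share of missing answers in each row.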
The threshold for missing data is flexible, but generally, if you are missing more than 10% of the responses on a particular variable,
or from a particular respondent, that variable or respondent may be problematic. There are several ways to deal with problematic
variables.
Just don't use that variable.
If it makes sense, impute the missing values. This should only be done for continuous or interval data (like age or Likert scale
responses), not for categorical data (like gender).
however, if the number of missing responses is greater than 10%.
To impute values in SPSS, go to Transform, then Replace Missing Values, then select the variables that need imputing, and click OK. See
the screenshots below. In this screenshot, I use the Mean replacement method, but there are other options, including Median
replacement. Typically with Likert-type data, you want to use median replacement, because means are less meaningful in these
scenarios. For more information on when to use which type of imputation, refer to Lynch (2003).
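As a rough illustration of median replacement, here is a minimal pandas sketch. It is not a reproduction of the SPSS dialog; the item name `sat1` and its values are hypothetical, and the whole-column median is used as the replacement value.

```python
import pandas as pd

# Hypothetical 5-point Likert item with two missing responses.
df = pd.DataFrame({"sat1": [4, 5, None, 3, 4, None, 2, 5]})

# median() skips missing values by default.
median_value = df["sat1"].median()

# Keep the original column and store the imputed version separately,
# so the imputation can be audited or undone later.
df["sat1_imputed"] = df["sat1"].fillna(median_value)
print(df)
```

Keeping the untouched column next to the imputed one makes it easy to report how many values were replaced.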

simply do not have the data for that person. My recommendation is to first determine which variables will actually be used in your
model (often we collect data on more variables than we actually end up using in our model), then determine if the respondent is
problematic. If so, then remove that respondent from the analysis.

Outliers
Outliers can influence your results, pulling the mean away from the median. Two types of outliers exist: outliers for individual
variables, and outliers for the model.

Univariate
To detect outliers on each variable, just produce a boxplot in SPSS (as demonstrated in the video). Outliers will appear at the
extremes, and will be labeled, as in the figure below. If you have a really high sample size, then you may want to remove the outliers.
extreme (1 or 5) is not really representative outlier behavior.
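A boxplot conventionally labels points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that convention with NumPy to hypothetical values; quartile definitions differ slightly between packages, so the fences may not match the SPSS boxplot exactly.

```python
import numpy as np

# Hypothetical values for one continuous variable.
values = np.array([23, 25, 27, 28, 30, 31, 33, 35, 36, 95], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Conventional boxplot fences: 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers)
```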

Another type of outlier is an unengaged respondent. Sometimes respondents will enter '3,3,3,3,...' for every single survey item. This
participant was clearly not engaged, and their responses will throw off your results. Other patterns indicative of unengaged
respondents are '1,2,3,4,5,1,2,...' or '1,1,1,1,5,5,5,5,1,1,...'. There are multiple ways to identify and eliminate these
unengaged respondents:
third and two thirds of the way through my surveys. I am always astounded at how many I catch this way...
responded strongly agree to both of these items, then they were not paying attention: "I am very hungry", "I don't have much
appetite right now".
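One simple screen for the '3,3,3,3,...' pattern is the spread of each respondent's answers: a standard deviation of zero across the items means every answer was identical. The sketch below assumes hypothetical respondent IDs and responses. It only catches straight-lining, not patterns such as '1,2,3,4,5,1,2,...', which still need visual inspection or attention-check items.

```python
from statistics import pstdev

# Hypothetical Likert responses, one list per respondent.
responses = {
    "r1": [4, 5, 3, 4, 2, 5],
    "r2": [3, 3, 3, 3, 3, 3],   # same answer for every item
    "r3": [1, 2, 3, 4, 5, 1],   # patterned, but not zero variance
}

# Standard deviation of each respondent's answers across all items.
spread = {rid: pstdev(items) for rid, items in responses.items()}

# Zero spread means the respondent gave one identical answer throughout.
straight_liners = [rid for rid, sd in spread.items() if sd == 0]
print(straight_liners)
```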

Multivariate
Multivariate outliers refer to records that do not fit the standard sets of correlations exhibited by the other records in the dataset, with
regard to your causal model. So, if all but one person in the dataset reports that diet has a positive effect on weight loss, but this one
guy reports that he gains weight when he diets, then his record would be considered a multivariate outlier. To detect these influential
multivariate outliers, you need to calculate the Mahalanobis d squared. This is a simple matter in AMOS. See the video tutorial for
much will show up. It is a slippery slope.
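For readers without AMOS, the Mahalanobis d squared itself can be sketched with NumPy. The data below are hypothetical (diet effort and weight loss), with one record that runs against the correlation shown by the others. The p-values that AMOS reports alongside d squared are not computed here.

```python
import numpy as np

# Hypothetical records: [diet effort, weight loss]. The last record
# contradicts the positive relationship seen in the other nine.
X = np.array([
    [1.0, 1.2], [2.0, 1.9], [3.0, 3.1], [4.0, 4.2], [5.0, 4.8],
    [6.0, 6.1], [7.0, 6.9], [8.0, 8.2], [9.0, 9.1], [8.0, 1.0],
])

centered = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))  # inverse sample covariance

# d^2 for each record: (x - mean)' S^-1 (x - mean)
d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

most_extreme = int(np.argmax(d2))
print(np.round(d2, 2), most_extreme)
```

Note that the last record is not extreme on either variable alone; it stands out only because the pair of values breaks the pattern, which is what makes it a multivariate outlier.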
A more conservative approach that I would recommend is to examine the influential cases indicated by the Cook's distance. Here is a
video explaining what this is and how to do it. This video also discusses multicollinearity.
VIDEO TUTORIAL: Multivariate Assumptions

Normality
Normality refers to the distribution of the data for a particular variable. We usually assume that the data is normally distributed, even
though it usually is not! Normality is assessed in many different ways: shape, skewness, and kurtosis (flat/peaked).
Shape: To discover the shape of the distribution in SPSS, build a histogram (as shown in the video tutorial) and plot the normal
curve. If the histogram does not match the normal curve, then you likely have normality issues. You can also look at the
boxplot to determine normality.
Skewness: Skewness means that the responses did not fall into a normal distribution, but were heavily weighted toward one
end of the scale. Income is an example of a commonly right-skewed variable: most people make between 20 and 70 thousand
dollars in the USA, but there is a smaller group that makes between 70 and 100, and an even smaller group that makes between
100 and 150, and a much smaller group that makes between 150 and 250, etc. all the way up to Bill Gates and Mark
There are two rules on Skewness:
(1) If your skewness value is greater than 1, then you are positive (right) skewed; if it is less than -1, you are negative (left)
skewed; if it is in between, then you are fine. Some published thresholds are a bit more liberal and allow for up to +/- 2.2,
(2) If the absolute value of the skewness is less than three times the standard error, then you are fine; otherwise you are skewed.
Using these rules, we can see from the table below, that all three variables are fine using the first rule, but using the second rule, they
are all negative (left) skewed.
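Both rules can be checked by hand. The sketch below uses the bias-adjusted sample skewness and its standard error, which are the formulas statistical packages commonly report; treat an exact match to SPSS output as an assumption. The income figures are hypothetical.

```python
import math
from statistics import mean, stdev

def check_skewness(values):
    """Return the sample skewness, its standard error, and both rule checks."""
    n = len(values)
    m, s = mean(values), stdev(values)
    # Bias-adjusted sample skewness.
    skew = n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)
    # Standard error of skewness depends only on the sample size.
    se = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return {
        "skewness": skew,
        "std_error": se,
        "rule1_ok": abs(skew) <= 1,       # rule (1): within +/- 1
        "rule2_ok": abs(skew) < 3 * se,   # rule (2): under 3 standard errors
    }

# Hypothetical incomes in thousands of dollars, with one very high earner.
incomes = [20, 25, 30, 35, 40, 45, 50, 60, 70, 250]
result = check_skewness(incomes)
print(result)
```

Because the standard error shrinks as the sample grows, rule (2) becomes stricter in large samples, which is why the two rules can disagree.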

Skewness looks like this:

Kurtosis:
Kurtosis reflects how heavy the tails of a distribution are, that is, how prone the data are to outliers. Data that have outliers have
large kurtosis. Data without outliers have low kurtosis. The kurtosis (excess kurtosis) of the normal distribution is 0. The rule for
evaluating whether or not your kurtosis is problematic is the same as rule two above:
If the absolute value of the kurtosis is less than three times the standard error, then the kurtosis is not significantly different
from that of the normal distribution; otherwise you have kurtosis issues. A looser rule is an overall kurtosis score of
2.200 or less (rather than 1.00) (Sposito et al., 1983).
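The kurtosis rule can be sketched the same way. The function below uses the bias-adjusted sample excess kurtosis and its standard error, as commonly reported by statistical packages; an exact match to SPSS output is an assumption, and the data are hypothetical.

```python
import math
from statistics import mean, stdev

def check_kurtosis(values):
    """Return the sample excess kurtosis, its standard error, and the rule check."""
    n = len(values)
    m, s = mean(values), stdev(values)
    z4 = sum(((x - m) / s) ** 4 for x in values)
    # Bias-adjusted sample excess kurtosis (0 for a normal distribution).
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    # Standard error of kurtosis depends only on the sample size.
    se = math.sqrt(24 * n * (n - 1) ** 2
                   / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    return {"kurtosis": kurt, "std_error": se, "ok": abs(kurt) < 3 * se}

# Hypothetical data: nine similar values and one extreme value.
heavy_tailed = [20, 25, 30, 35, 40, 45, 50, 60, 70, 250]
result = check_kurtosis(heavy_tailed)
print(result)
```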
Kurtosis looks like this:
Bimodal:
One other issue you may run into with the distribution of your data is a bimodal distribution. This means that the data has two
peaks, rather than a single peak at the mean. This may indicate there are moderating variables affecting this data. A bimodal
distribution looks like this:

Transformations:
VIDEO TUTORIAL: Transformations
When you have extremely non-normal data, it will influence your regressions in SPSS and AMOS. In such cases, if you have non-
Likert scale variables (so, variables like age, income, revenue, etc.), you can transform them prior to including them in your model.
He also references his article in the video.

Linearity
Perhaps the most elegant approach (easy and clear-cut, yet rigorous) is the deviation from linearity test available in the ANOVA test in SPSS.
In SPSS go to Analyze, Compare Means, Means. Put the composite IVs and DVs into the lists, then click Options, and select "Test
for Linearity". Then in the ANOVA table in the output window, if the Sig value for Deviation from Linearity is less than 0.05, the
relationship between IV and DV is not linear, and thus is problematic (see the screenshots below). Issues of linearity can sometimes
be fixed by removing outliers (if the significance is borderline), or through transforming the data. In the screenshot below, we can see
that the first relationship is linear (Sig=.268), but the second relationship is nonlinear (Sig=.003).
If this test turns up odd results, then simply perform an OLS linear regression for each IV-DV pair. If the sig value of the
regression is less than 0.05, then the relationship can be considered "sufficiently" linear. While this approach is somewhat less
rigorous, it has the benefit of working every time! You can also do a curvilinear regression ("curve estimation") to see if the
relationship is more linear than nonlinear.
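The curve-estimation idea can be approximated by fitting a straight line and a quadratic curve to the same pair and comparing how much variance each explains. This is a sketch with hypothetical data, using NumPy polynomial fits in place of the SPSS procedure. It produces no significance test; it only shows how much the curve adds over the line.

```python
import numpy as np

def r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

x = np.arange(1, 11, dtype=float)

# Hypothetical DVs: one roughly linear in x, one clearly U-shaped.
y_linear = 2.0 * x + np.array([0.3, -0.2, 0.1, -0.4, 0.2,
                               -0.1, 0.4, -0.3, 0.1, -0.1])
y_curved = (x - 5.5) ** 2

# How much R^2 improves when a quadratic term is allowed.
gain_linear = r_squared(x, y_linear, 2) - r_squared(x, y_linear, 1)
gain_curved = r_squared(x, y_curved, 2) - r_squared(x, y_curved, 1)
print(round(gain_linear, 4), round(gain_curved, 4))
```

A gain near zero suggests the straight line already captures the relationship; a large gain suggests the relationship is better described as nonlinear.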

Homoscedasticity
VIDEO TUTORIAL: Plotting Homoscedasticity
Homoscedasticity is a nasty word that means that the variable's residual (error) exhibits consistent variance across different levels of
the variable. There are good reasons for desiring this. For more information, see Hair et al. 2010 chapter 2. :) A simple way to
determine if a relationship is homoscedastic is to do a simple scatterplot with the variable on the y axis and the variable's residual on
the x axis. To see a step by step guide on how to do this, watch the video tutorial. If the plot comes up with a consistent pattern as in
the figure below, then we are good: we have homoscedasticity! If there is not a consistent pattern, then the relationship is considered
heteroskedastic. This can be fixed by transforming the data or by splitting the data by subgroups (such as two groups for gender). You

Schools of thought on homoscedasticity are still out. Some suggest that evidence of heteroskedasticity is not a problem (and is
this test unless specifically requested to by a reviewer.

Multicollinearity
Multicollinearity is not desirable. It means that the variance our independent variables explain in our dependent variable is
overlapping, so the independent variables are not each explaining unique variance in the dependent variable. The way to check this is to
calculate a Variance Inflation Factor (VIF) for each independent variable after running a multiple regression. The rules of thumb
for the VIF are as follows:
VIF < 3: not a problem
VIF > 3: potential problem
VIF > 5: very likely problem
VIF > 10: definitely problem
The tolerance value in SPSS is the reciprocal of the VIF (tolerance = 1/VIF), and values less than 0.10 are strong indications of multicollinearity issues.
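The VIF for an independent variable is 1 / (1 - R squared), where R squared comes from regressing that variable on all the other independent variables. The sketch below computes it with NumPy least squares on hypothetical predictors, two of which are nearly identical; the function name `vif` is illustrative only.

```python
import numpy as np

def vif(X):
    """VIF for each column of X, regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        factors.append(1 / (1 - r2))
    return np.array(factors)

# Hypothetical predictors: x2 is almost a copy of x1, x3 is unrelated.
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = x1 + np.array([0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2])
x3 = np.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=float)

vifs = vif(np.column_stack([x1, x2, x3]))
tolerance = 1 / vifs
print(np.round(vifs, 2), np.round(tolerance, 3))
```

Here the two near-duplicate predictors land far above the VIF > 10 threshold, while the unrelated predictor stays below 3.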
For particulars on how to calculate the VIF in SPSS, watch the step by step video tutorial. The easiest method for fixing
much unique explanation of variance anyway.
O'Brien, R.M. 2007. A Caution Regarding Rules of Thumb for Variance Inflation Factors. Quality & Quantity, 41, 673

