
D/RS 1013

Data Screening/Cleaning/Preparation for Analyses

Data entered in computer

assuming reasonable care was taken
scanner probably most "error free"
checking physical forms against file
verifying any recoding or score calculations
"list cases" (Mac) or "case summaries" (Windows)

Data screening

descriptives: look for out-of-range values
check values against original forms
correct data in file
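
The range check can be sketched in a few lines of Python. This is only an illustration: the records, variable names, and legal ranges below are hypothetical, and in practice the ranges would come from the codebook for the survey.

```python
# Hypothetical survey records; None marks a missing answer.
records = [
    {"id": 1, "age": 34, "item1": 4},
    {"id": 2, "age": 29, "item1": 7},    # item1 is assumed to be a 1-5 scale
    {"id": 3, "age": 290, "item1": 2},   # possibly a typing slip
    {"id": 4, "age": 41, "item1": None},
]

# Assumed legal range for each variable (hypothetical codebook values).
valid_ranges = {"age": (18, 99), "item1": (1, 5)}


def find_out_of_range(rows, ranges):
    """Return (id, variable, value) for each non-missing value outside its range."""
    problems = []
    for row in rows:
        for var, (low, high) in ranges.items():
            value = row[var]
            if value is not None and not (low <= value <= high):
                problems.append((row["id"], var, value))
    return problems


flagged = find_out_of_range(records, valid_ranges)
for case_id, var, value in flagged:
    print(f"case {case_id}: {var} = {value} is out of range; check the original form")
```

Each flagged case id points back to a physical form that can be checked before the file is corrected.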

Missing data

respondents will not answer all questions on a survey
what to do about items where data is missing?
several options to consider/ways to address them

Missing data (cont.)

single variable - is systematic bias present in the kinds of people who fail to answer an item?
if the amount of missing data is small, don't really need to worry
use pairwise deletion
pairwise can cause problems: each statistic may be based on a different subset of cases
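
One way to see why pairwise deletion needs care is to compare it with dropping incomplete cases entirely. The sketch below uses a small hypothetical data set and a hand-written Pearson correlation; the variable names are made up for illustration.

```python
from math import sqrt

# Hypothetical data set; None marks a missing answer.
data = {
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, None, 8, 10, 12],
    "z": [1, None, 3, 5, 4, None],
}


def pearson(pairs):
    """Pearson correlation for a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    sab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    saa = sum((a - mean_a) ** 2 for a, _ in pairs)
    sbb = sum((b - mean_b) ** 2 for _, b in pairs)
    return sab / sqrt(saa * sbb)


def complete_pairs(v1, v2, rows=None):
    """Value pairs where both variables are present, optionally limited to given rows."""
    indices = range(len(data[v1])) if rows is None else rows
    return [(data[v1][i], data[v2][i]) for i in indices
            if data[v1][i] is not None and data[v2][i] is not None]


# Cases with no missing values at all (what remains after dropping incomplete cases).
n_cases = len(data["x"])
complete_rows = [i for i in range(n_cases)
                 if all(data[v][i] is not None for v in data)]

pairwise_n = {}
for v1, v2 in [("x", "y"), ("x", "z"), ("y", "z")]:
    pw = complete_pairs(v1, v2)                  # pairwise: use every available pair
    lw = complete_pairs(v1, v2, complete_rows)   # complete cases only
    pairwise_n[(v1, v2)] = len(pw)
    print(f"{v1}-{v2}: pairwise n={len(pw)}, r={pearson(pw):.2f}; "
          f"complete-case n={len(lw)}, r={pearson(lw):.2f}")
```

With pairwise deletion each correlation rests on a different number of cases, so the entries of one correlation matrix do not describe the same group of respondents.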

Missing data (cont.)

drop subject's data completely (listwise deletion)
if missing data is on an unimportant variable, don't analyze that variable
if a reasonable guess can be made based on other available variables, do it
numerical variable - use average
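
Substituting the average for a numerical variable might look like the sketch below. The scores are hypothetical, and the variance comparison is included only to show a known side effect of this approach.

```python
from statistics import mean, variance

# Hypothetical scores; None marks a missing answer.
scores = [12.0, 15.0, None, 9.0, None, 14.0]

observed = [s for s in scores if s is not None]
fill_value = mean(observed)

# Replace each missing score with the mean of the observed scores.
imputed = [fill_value if s is None else s for s in scores]

print("fill value:", fill_value)
print("variance before:", variance(observed))
print("variance after: ", variance(imputed))
```

The mean is unchanged, but the variance shrinks because every filled-in value sits exactly at the center of the distribution.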

Missing data (cont.)

correlation between answered and unanswered questions
regression equation to predict values on one variable based on others for which we have data

new variable that flags whether they answered the question or not
analyze for possible differences on some other variable
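
Both ideas on this slide can be sketched together: predicting missing values from another variable with a simple regression, and building a flag variable to compare those who answered with those who did not. The variables (education, income) and their values are hypothetical, and a single-predictor regression is assumed for brevity.

```python
from statistics import mean

# Hypothetical data: education is complete, income has gaps (None).
education = [10, 12, 12, 14, 16, 16, 18, 20]
income = [22.0, 26.0, None, 30.0, 34.0, None, 38.0, 42.0]

# Fit income = intercept + slope * education on the cases with data.
observed = [(x, y) for x, y in zip(education, income) if y is not None]
mean_x = mean(x for x, _ in observed)
mean_y = mean(y for _, y in observed)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in observed)
         / sum((x - mean_x) ** 2 for x, _ in observed))
intercept = mean_y - slope * mean_x

# Fill each gap with the value the regression equation predicts.
imputed_income = [intercept + slope * x if y is None else y
                  for x, y in zip(education, income)]

# Flag variable: 1 = answered the income question, 0 = did not.
answered = [0 if y is None else 1 for y in income]

# Compare the two groups on another variable (here, education).
edu_answered = mean(x for x, a in zip(education, answered) if a == 1)
edu_skipped = mean(x for x, a in zip(education, answered) if a == 0)
print("mean education, answered:", edu_answered)
print("mean education, skipped: ", edu_skipped)
```

A sizable gap between the two group means would suggest that the missing answers are not random with respect to that variable.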

Outliers

exert influence on the mean
inflate variance of the sample
identify - look at a graph or run Explore, requesting outliers
rule out some kind of data problem
can drop and not use
compromise is to move the outlier closer to the rest of the distribution
residual analysis and detecting multivariate outliers come up when we move on to multiple regression (e.g. Mahalanobis distance)
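
A minimal sketch of spotting a univariate outlier and then moving it, using hypothetical values. Two choices here are assumptions rather than anything the slides specify: the boxplot rule (1.5 times the interquartile range beyond the quartiles) for flagging, and pulling the outlier in to the nearest non-outlying value as the "move".

```python
from statistics import mean, quantiles

# Hypothetical scores with one extreme value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 58]

# Boxplot rule (an assumed criterion): flag values more than
# 1.5 * IQR below the first quartile or above the third quartile.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < low_fence or v > high_fence]
inliers = [v for v in values if low_fence <= v <= high_fence]

# One possible compromise: pull each outlier in to the nearest
# non-outlying value instead of dropping the case.
adjusted = [min(max(v, min(inliers)), max(inliers)) for v in values]

print("outliers:", outliers)
print("mean before:", mean(values), "after:", mean(adjusted))
```

Comparing the two means shows how much a single extreme case can pull the average.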

Normality

assessing univariate normality
look at graph
skew and kurtosis values
can test significance
divide the skew or kurtosis value by its standard error
result is a z score
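
The division by the standard error can be sketched as follows. The formulas are the sample-adjusted skewness and excess kurtosis estimators and their standard errors as commonly reported by statistics packages; that these are the versions the course software uses is an assumption, and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def skewness(values):
    """Sample-adjusted skewness."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)


def excess_kurtosis(values):
    """Sample-adjusted excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((v - m) / s) ** 4 for v in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


def se_skewness(n):
    """Standard error of skewness for a sample of size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of kurtosis for a sample of size n."""
    return 2 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))


# Hypothetical scores with a long right tail.
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 14]
n = len(scores)

z_skew = skewness(scores) / se_skewness(n)
z_kurt = excess_kurtosis(scores) / se_kurtosis(n)
print(f"z for skew = {z_skew:.2f}, z for kurtosis = {z_kurt:.2f}")
```

Each z value can then be compared with a critical value such as 1.96 for a two-tailed test at the .05 level.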

Normality (cont.)

tells us whether skew/kurtosis is significantly different from 0
does not necessarily mean it is a problem
Kline's (1998) recommendations: skewness values > 3 and kurtosis > 10 signal a problem
if seriously violated, transforming is an option
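
As an illustration of transforming, the sketch below applies a natural log to a hypothetical right-skewed variable and compares skewness before and after. The log is only one of several possible transformations and is chosen here as an assumption; the skewness formula is the sample-adjusted version.

```python
from math import log
from statistics import mean, stdev


def skewness(values):
    """Sample-adjusted skewness."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)


# Hypothetical positively skewed variable (all values > 0, so log is defined).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
logged = [log(v) for v in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(f"skew before: {skew_raw:.2f}, after log transform: {skew_logged:.2f}")
```

The transformed variable is still skewed, but less so; whether that is enough depends on the cutoff being applied.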

Linearity of relationship

relationship between variables reasonably summarized by straight line
check scatterplot
may be curvilinear
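
A small constructed example shows why the scatterplot matters. Here y is perfectly determined by x, but the relationship is curved, so the straight-line correlation misses it entirely. The data are made up for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    saa = sum((p - mean_a) ** 2 for p in a)
    sbb = sum((q - mean_b) ** 2 for q in b)
    return sab / sqrt(saa * sbb)


# A perfect U-shaped (curvilinear) relationship: y = x squared.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson(x, y)
print(f"r = {r:.2f}")
```

A correlation near zero here says only that a straight line does not summarize the data, not that the variables are unrelated.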

Homoscedasticity

assumption that variation in one variable is constant across range of another variable
check scatterplot
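
Alongside the scatterplot, a rough numerical check is to split the cases at the middle of one variable and compare the variance of the other variable in each half. This is a crude sketch with hypothetical data, not a formal test.

```python
from statistics import variance

# Hypothetical data in which the spread of y grows as x increases.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [10, 11, 9, 10, 11, 5, 15, 2, 18, 10]

# Sort cases by x and split them into a lower and an upper half.
pairs = sorted(zip(x, y))
half = len(pairs) // 2
low_y = [b for _, b in pairs[:half]]
high_y = [b for _, b in pairs[half:]]

# Under homoscedasticity the two variances should be similar.
ratio = variance(high_y) / variance(low_y)
print(f"variance of y, lower half of x: {variance(low_y):.2f}")
print(f"variance of y, upper half of x: {variance(high_y):.2f}")
print(f"ratio: {ratio:.1f}")
```

A ratio far from 1 corresponds to the fan shape that a scatterplot of such data would show.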
