
Screening Data

Yue Jiao

Screening data
Deal with:
(1) Accuracy
(2) Missing data
(3) Fit between the data set and the assumptions
(4) Transformations of variables
(5) Outliers
(6) Perfect or near-perfect correlations

Accuracy
Proofreading: For large data sets, screening for accuracy involves examining descriptive statistics and graphic representations of the variables.
Honest correlations: It is important that the correlations, whether between two continuous variables or between a dichotomous and a continuous variable, be as accurate as possible.
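The proofreading step can be sketched in plain Python. The 1-5 Likert-type item and the stray value 44 below are hypothetical, chosen only to show how descriptive statistics expose an entry error.

```python
# Sketch of proofreading via descriptive statistics; the 1-5 Likert-type
# item and the stray value 44 are hypothetical.
import statistics

responses = [3, 4, 2, 5, 1, 44, 3, 2, 4, 3]   # 44 is a likely entry error

def screen_range(values, lo, hi):
    """Report basic descriptives and indices of out-of-range values."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "out_of_range": [i for i, v in enumerate(values)
                         if not lo <= v <= hi],
    }

report = screen_range(responses, lo=1, hi=5)
print(report["max"], report["out_of_range"])  # 44 at index 5
```

Note how the implausible mean (7.1 on a 1-5 scale) would itself signal the problem even before the out-of-range check.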

Accuracy
Inflated correlation: When composite variables are constructed by pooling responses to several individual items, correlations among the composites are inflated if some items are reused.
Deflated correlation: A falsely small correlation between two continuous variables is obtained if the range of responses to one or both variables is restricted in the sample.
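A small synthetic illustration of range restriction (all numbers are made up): correlating the same two variables over a narrow slice of one of them deflates r, even though the underlying relationship is unchanged.

```python
# Synthetic illustration of a deflated correlation: restricting the
# range of x shrinks r even though the underlying relation is unchanged.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = list(range(20))
y = [v + ((-1) ** v) * 2 for v in x]    # linear trend plus fixed "noise"

r_full = pearson_r(x, y)
# Keep only cases with x in a narrow band, as a range-restricted sample:
pairs = [(a, b) for a, b in zip(x, y) if 8 <= a <= 12]
r_restricted = pearson_r([a for a, _ in pairs], [b for _, b in pairs])
print(round(r_full, 3), round(r_restricted, 3))  # r drops in the restricted sample
```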

Missing Data
Missing data are characterized as:
MCAR (missing completely at random),
MAR (missing at random, also called ignorable nonresponse), or
MNAR (missing not at random, or nonignorable).

The distribution of missing data is unpredictable in MCAR.
The pattern of missing data is predictable from other variables in the data set when data are MAR.
In MNAR, the missingness is related to the variable itself and therefore cannot be ignored.
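One way to picture the three mechanisms is to generate them on synthetic data. The age/income variables and all cutoffs below are hypothetical; the point is only where the missingness comes from in each case.

```python
# Synthetic illustration of the three missingness mechanisms.
# The age/income variables and all cutoffs are hypothetical.
import random

random.seed(0)
age = [random.randint(20, 70) for _ in range(1000)]          # fully observed
income = [20 + 0.5 * a + random.gauss(0, 5) for a in age]    # may go missing

# MCAR: every value has the same chance of being missing.
mcar = [None if random.random() < 0.2 else v for v in income]

# MAR: missingness depends only on the observed variable (age).
mar = [None if (a > 50 and random.random() < 0.5) else v
       for a, v in zip(age, income)]

# MNAR: missingness depends on the missing variable itself (high income).
mnar = [None if (v > 45 and random.random() < 0.5) else v for v in income]

def observed_mean(values):
    obs = [v for v in values if v is not None]
    return sum(obs) / len(obs)

full_mean = sum(income) / len(income)
# MNAR biases the observed mean downward; MCAR leaves it nearly unchanged.
print(full_mean, observed_mean(mcar), observed_mean(mnar))
```

This also shows why MNAR cannot be ignored: deleting the MNAR cases systematically understates the mean of income.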

Missing Data: options for handling
(1) Delete cases, if only a few have missing data and they appear to be a random subsample of the whole sample.
(2) Estimate missing data, using prior knowledge; mean substitution; regression; expectation-maximization (EM); or multiple imputation.
(3) With randomly missing data, analyze a missing-data correlation matrix.
(4) Treat missing data as data.
(5) Repeat analyses with and without missing data.
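Two of the estimation options in (2) can be sketched minimally. The variables x and y below are illustrative; mean substitution replaces each missing value with the observed mean, while regression imputation predicts it from a fully observed variable.

```python
# Minimal sketches of two estimation options: mean substitution and
# regression imputation. The variables x and y are illustrative.
def mean_impute(values):
    """Replace missing entries with the mean of the observed entries."""
    obs = [v for v in values if v is not None]
    m = sum(obs) / len(obs)
    return [m if v is None else v for v in values]

def regression_impute(x, y):
    """Fill missing y from a simple regression of y on fully observed x."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    slope = (sum((a - mx) * (b - my) for a, b in pairs)
             / sum((a - mx) ** 2 for a, _ in pairs))
    intercept = my - slope * mx
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]

x = [1, 2, 3, 4, 5]
y = [2.0, None, 6.0, 8.0, None]
mi = mean_impute(y)
ri = regression_impute(x, y)
print(mi)
print(ri)
```

Mean substitution shrinks variance by inserting the same central value everywhere; regression imputation respects the trend but still understates residual variability, which is one motivation for EM and multiple imputation.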

Outlier
An outlier is a case with such an extreme value on one variable (a univariate outlier) or such a strange combination of scores on two or more variables (a multivariate outlier) that it distorts statistics.
Reasons:
(1) Incorrect data entry.
(2) Failure to specify missing-value codes in computer syntax, so that missing-value indicators are read as real data.
(3) The outlier is not a member of the population from which you intended to sample.
(4) The case is from the intended population, but the distribution for the variable in the population has more extreme values than a normal distribution.
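A common way to detect univariate outliers is through standardized scores; the |z| > 3.29 cutoff (p < .001, two-tailed) is a frequently used convention, and the data below are made up.

```python
# Sketch of univariate outlier screening with standardized scores; the
# |z| > 3.29 cutoff (p < .001, two-tailed) is a common convention.
import statistics

def z_outliers(values, cutoff=3.29):
    """Return indices of cases whose |z| exceeds the cutoff."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) / s > cutoff]

scores = [12, 14, 13, 15, 14, 13, 12, 14, 15, 13] * 5 + [90]
print(z_outliers(scores))  # flags only the extreme final case
```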

Dealing with outliers:
Variable deletion - eliminate the variable responsible for most of the outliers.
Variable transformation - undertaken to change the shape of the distribution to be more nearly normal.
Score alteration - change the score on the variable for the outlying cases so that they remain deviant, but not as deviant as they were.
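Score alteration can be sketched as follows; the rule used here (move the extreme score to one raw-score unit beyond the next most extreme score) is one common convention, and the scores are invented.

```python
# Sketch of score alteration: pull the most extreme score in to one unit
# beyond the next most extreme, keeping it deviant but less so.
def pull_in_outlier(values):
    ordered = sorted(values)
    new_max = ordered[-2] + 1   # one unit above the next most extreme score
    return [min(v, new_max) for v in values]

scores = [12, 14, 13, 15, 90]
print(pull_in_outlier(scores))  # 90 becomes 16
```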

Normality
Two components of normality:
(1) Skewness has to do with the symmetry
of the distribution; a skewed variable is a
variable whose mean is not in the center
of the distribution.
(2) Kurtosis has to do with the peakedness of a distribution; a distribution is either too peaked (with long, heavy tails) or too flat (with short, light tails).
When a distribution is normal, the values of skewness and kurtosis are zero.
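Both indices can be computed directly from central moments, as in this sketch (the moment-based formulas are one standard definition; statistics packages sometimes apply small-sample corrections).

```python
# Sketch computing sample skewness and excess kurtosis from central
# moments; both are near zero when a distribution is normal.
def skew_kurtosis(values):
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3     # excess kurtosis: normal distribution -> 0
    return skew, kurt

skew, kurt = skew_kurtosis([1, 2, 3, 4, 5])
print(skew, kurt)  # symmetric data: skew 0; flat data: negative kurtosis
```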

Linearity
The assumption of linearity is that there
is a straight-line relationship between
two variables (where one or both of the
variables can be combinations of several
variables).
Linearity is important in a practical sense because Pearson's r captures only linear relationships among variables; if there are substantial nonlinear relationships among variables, they are ignored.
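A small constructed example makes the point: y is perfectly determined by x through a quadratic, yet over a symmetric range Pearson's r is exactly zero.

```python
# Sketch of why linearity matters: y depends perfectly on x through a
# quadratic, yet Pearson's r over this symmetric range is exactly zero.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
print(pearson_r(x, y))  # 0.0: the nonlinear relationship is ignored by r
```

This is why bivariate scatterplots, not correlations alone, are recommended for screening.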

Homoscedasticity
The assumption of homoscedasticity is that the variability in scores for one continuous variable is roughly the same at all values of another continuous variable.
Homoscedasticity is related to the
assumption of normality because when
the assumption of multivariate normality
is met, the relationships between
variables are homoscedastic.
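A rough check is to compare the spread of one variable within low and high groups of the other. The synthetic data and the 0.5 noise factor below are made up; real screening would usually rely on residual plots.

```python
# Rough homoscedasticity check on synthetic data: compare the spread of
# y within low and high halves of x. The 0.5 noise factor is made up.
import statistics

x = list(range(1, 21))
# Heteroscedastic by construction: deviations around the trend grow with x.
y = [v + ((-1) ** i) * 0.5 * v for i, v in enumerate(x)]

low  = [b for a, b in zip(x, y) if a <= 10]
high = [b for a, b in zip(x, y) if a > 10]
ratio = statistics.variance(high) / statistics.variance(low)
print(ratio)  # a ratio well above 1 signals heteroscedasticity
```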

Common Data
Transformations
Data transformations are recommended
as a remedy for outliers and for failures
of normality, linearity, and
homoscedasticity.
Transformed variables are sometimes
harder to interpret.
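As one example, a logarithmic transformation pulls in the long right tail of a positively skewed variable (the data below are invented; variables containing zeros need a constant added before logging).

```python
# Sketch of a common transformation: a base-10 log pulls in the long
# right tail of a positively skewed variable. The data are made up.
import math

def skewness(values):
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]      # strong positive skew
logged = [math.log10(v) for v in raw]       # requires strictly positive data

print(skewness(raw), skewness(logged))      # skew drops after the transform
```

The interpretability cost is visible here too: the transformed variable is now in log units rather than the original metric.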

Multicollinearity and
Singularity
Multicollinearity and singularity are
problems with a correlation matrix that
occur when variables are too highly
correlated.
With multicollinearity, the variables are very highly correlated (say, .90 and above); with singularity, the variables are redundant: one of the variables is a combination of two or more of the other variables.
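A bivariate screen for the .90 threshold can be sketched as below; all data are illustrative. Note that it catches the near-duplicate pair but not the singular composite "total", whose bivariate correlations stay under .90 even though it is an exact sum of two other variables - detecting singularity requires squared multiple correlations or the determinant of the correlation matrix.

```python
# Sketch of screening for multicollinearity: flag pairs of variables
# whose bivariate correlation reaches .90. All data are illustrative.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

variables = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],   # near-duplicate of x1
    "x3": [2, 5, 1, 6, 3, 4],               # unrelated
}
# Singularity: an exact combination of other variables in the set.
variables["total"] = [a + b for a, b in zip(variables["x1"], variables["x3"])]

names = list(variables)
flagged = [(names[i], names[j])
           for i in range(len(names))
           for j in range(i + 1, len(names))
           if abs(pearson_r(variables[names[i]], variables[names[j]])) >= 0.90]
print(flagged)  # only the near-duplicate pair is caught
```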

Checklist for Screening Data


1. Inspect univariate descriptive statistics for accuracy of input
a. Out-of-range values
b. Plausible means and standard deviations
c. Univariate outliers

2. Evaluate amount and distribution of missing data; deal with the problem
3. Check pairwise plots for nonlinearity and heteroscedasticity
4. Identify and deal with nonnormal variables and univariate
outliers
a. Check skewness and kurtosis, probability plots
b. Transform variables (if desirable)
c. Check results of transformation

5. Identify and deal with multivariate outliers
a. Variables causing multivariate outliers
b. Description of multivariate outliers

6. Evaluate variables for multicollinearity and singularity
