Yue Jiao
Screening Data
Issues to deal with:
(1) Accuracy
(2) Missing data
(3) Fit between the data set and the assumptions of the analysis
(4) Transformations of variables
(5) Outliers
(6) Perfect or near-perfect correlations
Accuracy
Proofreading: For large data sets, screening for accuracy involves examining descriptive statistics and graphic representations of the variables.
Honest correlations: It is important that the correlations, whether between two continuous variables or between a dichotomous and a continuous variable, be as accurate as possible.
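One way to proofread a large variable through its descriptive statistics is to scan the minimum, maximum, and mean for impossible values. A minimal pure-Python sketch; the variable name and valid range (ages 18-99) are hypothetical:

```python
# Minimal sketch: screen a variable for out-of-range values using
# descriptive statistics. Variable name and valid range are hypothetical.
def describe(values):
    """Return min, max, and mean -- enough to spot impossible codes."""
    return {"min": min(values), "max": max(values),
            "mean": sum(values) / len(values)}

age = [23, 31, 45, 217, 38, 29]          # 217 is a likely entry error
stats = describe(age)                    # the max of 217 flags the problem
out_of_range = [v for v in age if not 18 <= v <= 99]
```

A boxplot or histogram of the same variable would flag the same case graphically.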
Accuracy
Inflated correlation: When composite variables are constructed by pooling responses to several individual items, correlations between composites are inflated if some items are reused across composites.
Deflated correlation: A falsely small correlation between two continuous variables is obtained if the range of responses to one or both of the variables is restricted in the sample.
Missing Data
Missing data are characterized as:
MCAR (missing completely at random),
MAR (missing at random, also called ignorable nonresponse),
MNAR (missing not at random or nonignorable).
Missing Data:
(1) Delete missing data, if only a few cases have
missing data and they seem to be a random
subsample of the whole sample.
(2) Estimate missing data, using prior knowledge; inserting mean values; using regression; expectation maximization (EM); or multiple imputation.
(3) Another option with randomly missing data involves analysis of a missing-data correlation matrix.
(4) Treat missing data as data.
(5) Repeat analyses with and without missing data.
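Option (2), inserting mean values, can be sketched in a few lines of pure Python; the scores are invented, with None marking a missing response:

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

scores = [4, None, 6, 8, None, 2]   # hypothetical item with two gaps
filled = impute_mean(scores)        # both gaps become 5.0, the observed mean
```

Note that mean imputation shrinks the variance of the variable, which is one reason EM and multiple imputation are usually preferred.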
Outliers
An outlier is a case with such an extreme value on one variable (a
univariate outlier) or such a strange combination of scores on two
or more variables (multivariate outlier) that it distorts statistics.
Reasons for outliers:
(1) Incorrect data entry.
(2) Failure to specify missing-value codes in computer syntax, so that missing-value indicators are read as real data.
(3) The outlier is not a member of the population from which you intended to sample.
(4) The case is from the intended population, but the distribution for the variable in the population has more extreme values than a normal distribution.
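A univariate outlier is commonly flagged by its standardized score; a conventional rule of thumb is |z| > 3.29 (p < .001, two-tailed). A minimal sketch with invented data:

```python
def zscores(values):
    """Standardized scores using the sample standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - m) / sd for v in values]

# Nineteen unremarkable scores plus one extreme value (illustrative data).
data = [9, 10, 11] * 6 + [10, 100]
outliers = [v for v, z in zip(data, zscores(data)) if abs(z) > 3.29]
```

Multivariate outliers need a distance that accounts for the correlations among variables (e.g. Mahalanobis distance), since a case can be extreme as a combination without being extreme on any single variable.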
Normality
Two components of normality:
(1) Skewness has to do with the symmetry
of the distribution; a skewed variable is a
variable whose mean is not in the center
of the distribution.
(2) Kurtosis has to do with the peakedness of a distribution; a distribution is either too peaked (leptokurtic, with long, heavy tails) or too flat (platykurtic, with short, light tails).
When a distribution is normal, the values of skewness and (excess) kurtosis are zero.
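Both components can be computed directly from the moments of the sample. A sketch using the population-moment formulas, with excess kurtosis so that a normal distribution gives 0 (data invented):

```python
def skew_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

sym_skew, _ = skew_and_kurtosis([1, 2, 3, 4, 5])    # symmetric: skewness 0
pos_skew, _ = skew_and_kurtosis([1, 1, 1, 2, 10])   # long right tail: positive
```

Statistical packages report slightly different (bias-corrected) formulas, but the sign and rough magnitude are what screening relies on.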
Linearity
The assumption of linearity is that there
is a straight-line relationship between
two variables (where one or both of the
variables can be combinations of several
variables).
Linearity is important in a practical sense because Pearson's r captures only the linear relationships among variables; if there are substantial nonlinear relationships among variables, they are ignored.
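The practical point is easy to demonstrate: a perfect but curvilinear (quadratic) relationship can produce a Pearson r of exactly zero. Sketch with made-up symmetric data:

```python
def pearson_r(x, y):
    """Pearson product-moment correlation, pure Python."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]   # y is perfectly determined by x, but nonlinearly
r = pearson_r(x, y)         # ~0: the linear index misses the relationship
```

This is why bivariate scatterplots, not just correlation matrices, belong in screening.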
Homoscedasticity
The assumption of homoscedasticity is that the variability in scores for one continuous variable is roughly the same at all values of another continuous variable.
Homoscedasticity is related to the
assumption of normality because when
the assumption of multivariate normality
is met, the relationships between
variables are homoscedastic.
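A quick informal check is to compare the spread of one variable within low and high bands of the other; heteroscedasticity shows up as markedly unequal spreads. The data and grouping below are invented for illustration:

```python
def spread(values):
    """Standard deviation around the group mean (population form)."""
    n = len(values)
    m = sum(values) / n
    return (sum((v - m) ** 2 for v in values) / n) ** 0.5

# y-scores observed at low vs. high values of x (illustrative data):
y_at_low_x = [9, 10, 11]       # tight band
y_at_high_x = [5, 20, 35]      # fanned out -- a heteroscedastic pattern
ratio = spread(y_at_high_x) / spread(y_at_low_x)
```

In a scatterplot this pattern looks like a funnel opening toward high values of x.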
Common Data
Transformations
Data transformations are recommended
as a remedy for outliers and for failures
of normality, linearity, and
homoscedasticity.
Transformed variables are sometimes
harder to interpret.
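For example, a logarithmic transformation typically pulls in a long right tail and reduces positive skewness. Skewness here is computed from sample moments; the data are invented:

```python
import math

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

raw = [1, 1, 2, 2, 3, 100]               # strong positive skew
logged = [math.log(v) for v in raw]      # log transform compresses the tail
skew_raw, skew_log = skewness(raw), skewness(logged)
```

The cost is interpretability: the transformed scores are no longer in the original units.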
Multicollinearity and
Singularity
Multicollinearity and singularity are
problems with a correlation matrix that
occur when variables are too highly
correlated.
With multicollinearity, the variables are very highly correlated (say, .90 and above); with singularity, the variables are redundant: one of the variables is a combination of two or more of the others.
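A screening sketch: compute all pairwise correlations and flag pairs at or above .90. A composite built as the exact sum of other variables illustrates singularity; variable names and data are hypothetical:

```python
def pearson_r(x, y):
    """Pearson product-moment correlation, pure Python."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

x1 = [1, 2, 3, 4, 5]
x2 = [1.1, 2.0, 3.2, 3.9, 5.1]           # near-duplicate of x1: multicollinear
x3 = [a + b for a, b in zip(x1, x2)]      # exact combination of x1, x2: singular
variables = {"x1": x1, "x2": x2, "x3": x3}

names = list(variables)
flagged = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
           if abs(pearson_r(variables[a], variables[b])) >= 0.90]
```

Pairwise r happens to catch every pair here, but singularity in general is an exact linear dependence that pairwise correlations can miss; it shows up instead as a zero determinant of the correlation matrix.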