
Missing data & how to handle it
Arooj Arshad
PhD Scholar

Goals
Discuss ways to evaluate and
understand missing data
Discuss common missing data
methods
Know the advantages and
disadvantages of common methods
Treatment of the missing data

Missing data can occur for many reasons:
participants can fail to respond to
questions (legitimately or illegitimately;
more on that later),
equipment and data collecting or recording
mechanisms can malfunction,
subjects can withdraw from studies before
they are completed,
data entry errors can occur.

Difference between missing and legitimate missing data

Methods for analyzing missing data require
assumptions about the nature of the data and about
the reasons for the missing observations that are
often not acknowledged.
When researchers use missing data methods without
carefully considering the assumptions required of that
method, they run the risk of obtaining biased and
misleading results. Reviewing the stages of data
collection, data preparation, data analysis, and
interpretation of results will highlight the issues that
researchers must consider in making a decision about
how to handle missing data in their work.

Point to remember
All researchers should examine their
data for missingness, and
researchers wanting the best (i.e.,
the most replicable and
generalizable) results from their
research need to be prepared to deal
with missing data in the most
appropriate and desirable way
possible.

Missing Data Mechanisms


Missing Completely at Random (MCAR)
Probability of missing data on Y is unrelated to Y
and X.
Example: respondents fail to report income for reasons unrelated
to their income or to any other measured variable.
Checked with the help of Little's MCAR test.

Missing at Random (MAR)
Probability of missing data on Y is related to X, but not to Y
itself once X is taken into account.
Example: for really sick patients, clinicians may not draw
blood for routine labs.

Not Missing at Random (NMAR)
Probability of missing data on Y depends on the value
of Y itself.
Example: respondents with high income are less likely to report
income.

Missing Data Consequences


Bias
Estimate systematically deviates from the quantity of interest.
No bias if the data are MCAR, but bias can occur when the data
are not MCAR.

Variance
Missing data can sometimes lead to wrong standard errors.
Wrong study conclusions about the relationship of variables to
outcomes.
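The three mechanisms and the bias they can produce are easier to see in a small simulation. The sketch below is illustrative only: the data are simulated, and the variable names, coefficients, and missingness probabilities are assumptions chosen for the example, not values from any study. It compares the mean of the observed Y values with the mean of the full data under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X is always observed; Y (think "income") depends on X and may go missing.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each mask is True where Y is missing (all probabilities are hypothetical).
miss_mcar = rng.random(n) < 0.3                  # unrelated to X and Y
miss_mar = rng.random(n) < sigmoid(2 * x - 1)    # depends on X only
miss_nmar = rng.random(n) < sigmoid(2 * y - 1)   # depends on Y itself

results = {
    "full data": y.mean(),
    "MCAR": y[~miss_mcar].mean(),
    "MAR": y[~miss_mar].mean(),
    "NMAR": y[~miss_nmar].mean(),
}
for name, value in results.items():
    print(f"{name:>9}: mean of observed Y = {value:+.3f}")
```

Under MCAR the observed mean should sit close to the full-data mean, while under MAR and NMAR the simple observed mean drifts away from it. Under MAR the drift can in principle be corrected by methods that use X; under NMAR it cannot be corrected from the observed data alone.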

Commonly-Used Missing Data Handling Methods
Deletion Methods
Listwise/complete case deletion,
pairwise deletion

Single Imputation Methods
Mean/mode substitution, dummy
variable method, single regression

Model-Based Methods
Maximum Likelihood, Multiple
imputation

Deletion Method

Listwise Deletion (Complete Case Analysis)
Only analyze cases with complete data, dropping every case
that has a missing value on any variable in the analysis.
When a researcher is estimating a model, such as a linear
regression, most statistical packages use listwise deletion
by default.

Listwise Deletion (Complete Case Analysis)
Advantages
Ease of implementation.
Comparability across analyses.

Disadvantages
Reduces statistical power (because it lowers n, a researcher
cannot anticipate whether an adequate amount of data will remain
for the analysis).
Doesn't use all information.
Estimates may be biased if data aren't MCAR (complete case
analysis assumes that the observed complete cases are a
random sample of the originally targeted sample or, in
Rubin's (1976) terminology, that the missing data are MCAR).
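As a minimal sketch of listwise deletion, assuming the data sit in a pandas DataFrame: the column names and values below are made up for illustration, and `dropna()` with its default arguments removes every row that has at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value.
df = pd.DataFrame({
    "age":    [23, 35, np.nan, 41, 29, 52],
    "income": [30, np.nan, 45, 60, np.nan, 80],
    "score":  [3.1, 4.0, 2.7, np.nan, 3.5, 4.4],
})

# Listwise deletion: keep only rows with no missing value in any column.
complete = df.dropna()

print(f"cases before: {len(df)}, cases after: {len(complete)}")
print(complete)
```

Here each variable is missing in only one or two cases, yet only two of the six cases survive, which illustrates how quickly n can shrink.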

Pairwise deletion (Available Case Analysis)
Analysis with all cases in which
the variables of interest are
present.
Advantages:
Keeps as many cases as
possible for each analysis.
Uses all information
possible with each analysis.
Disadvantage:
Can't compare analyses
because the sample is different
each time.
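A short sketch of pairwise deletion, again on a made-up pandas DataFrame: `DataFrame.corr()` excludes missing values pair by pair, so each correlation is computed on whichever cases have both variables. The table of pairwise counts makes the changing sample size visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value.
df = pd.DataFrame({
    "age":    [23, 35, np.nan, 41, 29, 52],
    "income": [30, np.nan, 45, 60, np.nan, 80],
    "score":  [3.1, 4.0, 2.7, np.nan, 3.5, 4.4],
})

# Correlations computed pairwise: each pair uses all cases where both are present.
pairwise_corr = df.corr()

# Number of cases available for each pair of variables.
observed = df.notna().astype(int)
pair_n = observed.T @ observed

print(pairwise_corr.round(2))
print(pair_n)
```

In this toy example the age/income correlation rests on three cases and the age/score correlation on four, so the two coefficients describe different subsamples.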

Single Imputation Methods

Mean/Mode substitution
Dummy variable control
Conditional mean substitution

Mean/Mode Substitution
Replace missing value with sample mean
or mode
Run analyses as if the data were complete
Advantages
Can use complete case analysis methods

Disadvantages
Reduces variability
Weakens covariance and correlation estimates
in the data (because it ignores the relationship
between variables)
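The two disadvantages can be checked on simulated data. This is a sketch under assumed settings (a 40% MCAR missingness rate and a hypothetical linear relationship between X and Y), not a result from real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

x = rng.normal(size=n)
y = 0.7 * x + rng.normal(scale=0.7, size=n)

# Delete 40% of Y completely at random (hypothetical rate).
miss = rng.random(n) < 0.4
y_obs = np.where(miss, np.nan, y)

# Mean substitution: every missing Y gets the mean of the observed Y values.
y_filled = np.where(miss, np.nanmean(y_obs), y_obs)

var_true = y.var(ddof=1)
var_filled = y_filled.var(ddof=1)
corr_true = np.corrcoef(x, y)[0, 1]
corr_filled = np.corrcoef(x, y_filled)[0, 1]

print(f"variance:    full data {var_true:.3f}, mean-filled {var_filled:.3f}")
print(f"corr(X, Y):  full data {corr_true:.3f}, mean-filled {corr_filled:.3f}")
```

Because every imputed value sits exactly at the mean, the filled-in variable has less spread and a weaker correlation with X than the full data.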

Dummy Variable Adjustment


Create an indicator for missing value (1=value is
missing for observation; 0=value is observed for
observation)
Replace missing values with a constant (such as the
mean)
Advantage
Uses all available information about missing
observation

Disadvantage
Results in biased estimates
Not theoretically driven
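The two steps of the dummy variable adjustment can be sketched as follows. The data and column names are hypothetical, and the regression at the end is only there to show that both the filled-in variable and the indicator enter the model together.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; income is missing for two cases.
df = pd.DataFrame({
    "income": [30.0, np.nan, 45.0, 60.0, np.nan, 80.0],
    "score":  [3.1, 4.0, 2.7, 3.9, 3.5, 4.4],
})

# Step 1: indicator (1 = value is missing, 0 = value is observed).
df["income_missing"] = df["income"].isna().astype(int)

# Step 2: set the missing values to a constant, here the observed mean.
df["income_filled"] = df["income"].fillna(df["income"].mean())

# Both the filled variable and the indicator are used as predictors.
X = np.column_stack([np.ones(len(df)), df["income_filled"], df["income_missing"]])
coef, *_ = np.linalg.lstsq(X, df["score"].to_numpy(), rcond=None)

print(df)
print(dict(zip(["intercept", "income_filled", "income_missing"], coef.round(3))))
```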

Regression Imputation
Replaces missing values with
predicted score from a regression
equation.
Advantage:
Uses information from observed
data
Disadvantages:
Overestimates model fit and
correlation estimates
Understates variance
Model Based Methods

Maximum Likelihood Using EM
algorithm
Multiple imputation
These methods share two assumptions:
that the joint distribution of the data is
multivariate normal, and that the
missing data mechanism is ignorable.

Maximum Likelihood
Identifies the set of parameter values that produces the
highest log-likelihood.
ML estimate: the value that is most likely to have resulted in
the observed data
Conceptually, the process is the same with or without missing
data
Advantages:
Uses full information (both complete cases and
incomplete cases) to calculate the log-likelihood
Unbiased parameter estimates with MCAR/MAR data
Disadvantages:
SEs biased downward; can be adjusted by using the observed
information matrix

We can base estimation on the likelihood of the observed data.
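To illustrate maximum likelihood via the EM algorithm, here is a hedged sketch for the simplest case: two jointly normal variables, with X fully observed and Y missing at random given X. The function name `em_bivariate_normal`, the simulated data, and the missingness model are all assumptions made for this example. The E-step replaces the missing parts of the sufficient statistics with their expected values given X, and the M-step recomputes the means and covariances from them.

```python
import numpy as np

def em_bivariate_normal(x, y, n_iter=500, tol=1e-10):
    """EM estimates of the mean and covariance of (X, Y); missing Y are NaN."""
    miss = np.isnan(y)
    obs = ~miss
    # Start from the complete-case estimates.
    mu = np.array([x[obs].mean(), y[obs].mean()])
    cov = np.cov(x[obs], y[obs], bias=True)
    for _ in range(n_iter):
        # E-step: expected Y and Y^2 for the missing cases, given X.
        beta = cov[0, 1] / cov[0, 0]
        resid_var = cov[1, 1] - beta * cov[0, 1]
        ey = np.where(miss, mu[1] + beta * (x - mu[0]), y)
        ey2 = np.where(miss, ey ** 2 + resid_var, y ** 2)
        # M-step: update parameters from the expected sufficient statistics.
        new_mu = np.array([x.mean(), ey.mean()])
        sxx = np.mean(x ** 2) - new_mu[0] ** 2
        sxy = np.mean(x * ey) - new_mu[0] * new_mu[1]
        syy = np.mean(ey2) - new_mu[1] ** 2
        new_cov = np.array([[sxx, sxy], [sxy, syy]])
        change = max(np.abs(new_mu - mu).max(), np.abs(new_cov - cov).max())
        mu, cov = new_mu, new_cov
        if change < tol:
            break
    return mu, cov

# Simulated MAR data: the chance that Y is missing depends only on X.
rng = np.random.default_rng(3)
n = 20_000
x = rng.normal(size=n)
y_full = 0.8 * x + rng.normal(scale=0.6, size=n)
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-(2 * x - 1)))
y = np.where(miss, np.nan, y_full)

mu_hat, cov_hat = em_bivariate_normal(x, y)
cc_mean = np.nanmean(y)  # complete-case mean of Y, for comparison

print(f"full-data mean of Y:     {y_full.mean():+.3f}")
print(f"complete-case mean of Y: {cc_mean:+.3f}")
print(f"EM estimate of mean(Y):  {mu_hat[1]:+.3f}")
```

In this setup the complete-case mean is pulled away from the full-data mean, while the EM estimate, which uses the incomplete cases through X, should land close to it.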

Multiple Imputation
1. Impute: Data are filled in with imputed values using a
specified regression model.
This step is repeated m times, resulting in a separate dataset
each time.
2. Analyze: Analyses are performed within each dataset.
3. Pool: Results are pooled into one estimate.
Advantages:
Variability more accurate with multiple imputations for each
missing value
Considers variability due to sampling AND variability due to
imputation
Disadvantages:
Cumbersome coding
Room for error when specifying models
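The impute, analyze, and pool steps can be sketched in a few lines. This is a simplified illustration, not a full implementation: the function name `multiply_impute_mean` and the simulated data are hypothetical, the quantity analyzed is simply the mean of Y, and uncertainty in the imputation model is approximated by refitting the regression on a bootstrap resample of the complete cases for each imputation. Pooling follows Rubin's rules, where the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import numpy as np

def multiply_impute_mean(x, y, m=20, seed=0):
    """Estimate mean(Y) by multiple imputation; missing Y are NaN."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(y)
    obs_idx = np.flatnonzero(~miss)
    n = len(y)
    estimates, variances = [], []
    for _ in range(m):
        # 1. Impute: refit the regression on a bootstrap resample of the
        #    complete cases, then add random residual noise to predictions.
        boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
        slope, intercept = np.polyfit(x[boot], y[boot], deg=1)
        resid = y[boot] - (intercept + slope * x[boot])
        sigma = resid.std(ddof=2)
        y_imp = y.copy()
        y_imp[miss] = (intercept + slope * x[miss]
                       + rng.normal(scale=sigma, size=miss.sum()))
        # 2. Analyze: here the analysis is just the sample mean of Y.
        estimates.append(y_imp.mean())
        variances.append(y_imp.var(ddof=1) / n)
    # 3. Pool with Rubin's rules.
    estimates = np.array(estimates)
    variances = np.array(variances)
    pooled = estimates.mean()
    within = variances.mean()
    between = estimates.var(ddof=1)
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total), within, between

# Simulated MAR data: the chance that Y is missing depends only on X.
rng = np.random.default_rng(4)
n = 5_000
x = rng.normal(size=n)
y_full = 0.8 * x + rng.normal(scale=0.6, size=n)
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-(2 * x - 1)))
y = np.where(miss, np.nan, y_full)

pooled, pooled_se, within, between = multiply_impute_mean(x, y)
print(f"full-data mean of Y: {y_full.mean():+.3f}")
print(f"pooled MI estimate:  {pooled:+.3f} (SE {pooled_se:.3f})")
print(f"within variance {within:.6f}, between variance {between:.6f}")
```

The pooled standard error is larger than the standard error from any single completed dataset, because the between-imputation variance carries the extra uncertainty due to imputation.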

References
Allison, Paul D. 2001. Missing Data. Sage University
Papers Series on Quantitative Applications in the
Social Sciences. Thousand Oaks: Sage.
Enders, Craig. 2010. Applied Missing Data Analysis.
New York: Guilford Press.
Little, Roderick J., and Donald Rubin. 2002. Statistical
Analysis with Missing Data. Hoboken: John Wiley & Sons, Inc.
Schafer, Joseph L., and John W. Graham. 2002. Missing
Data: Our View of the State of the Art.
Psychological Methods.
