Missing Data & How To Handle It

Arooj Arshad
PhD Scholar
Goals
Discuss ways to evaluate and
understand missing data
Discuss common missing data
methods
Know the advantages and
disadvantages of common methods
Treatment of the missing data
Missing data can occur for many reasons:
participants can fail to respond to
questions (legitimately or illegitimately;
more on that later),
equipment and data collecting or recording
mechanisms can malfunction,
subjects can withdraw from studies before
they are completed,
data entry errors can occur.
Difference between missing and legitimate missing data
Methods for analyzing missing data require
assumptions about the nature of the data and about
the reasons for the missing observations that are
often not acknowledged.
When researchers use missing data methods without
carefully considering the assumptions required of that
method, they run the risk of obtaining biased and
misleading results. Reviewing the stages of data
collection, data preparation, data analysis, and
interpretation of results will highlight the issues that
researchers must consider in making a decision about
how to handle missing data in their work.
Point to remember
All researchers should examine their
data for missingness, and
researchers wanting the best (i.e.,
the most Replicable and
Generalizable) results from their
research need to be prepared to deal
with missing data in the most
appropriate and desirable way
possible.
Missing Data Mechanisms

Missing Completely at Random (MCAR)
Probability of missing data on Y is unrelated to Y
and to X
Example: respondents skip the income question for reasons
unrelated to their income or to any other variable.
Checked with the help of Little's MCAR test.
Missing at Random (MAR)
Probability of missing data on Y is related to X (the observed
variables), but not to Y itself once X is taken into account.
Example: for really sick patients, clinicians may not draw
blood for routine labs.
Not Missing at Random (NMAR)
Probability of missing data on Y is dependent on the value
of Y
Example: Respondents with high income less likely to report
income
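The three mechanisms can be made concrete with a small simulation. This is only a sketch under stated assumptions: the variables x and y, the coefficients, and the logistic missingness probabilities are all hypothetical choices made for illustration. It compares the mean of the observed Y values under each mechanism with the true mean of 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: x is always observed, y (think "income") may go missing.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)  # true mean of y is 0


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


# Probability that y is missing under each mechanism (illustrative values).
p_mcar = np.full(n, 0.3)          # unrelated to x and y
p_mar = logistic(2.0 * x - 1.0)   # depends only on the observed x
p_nmar = logistic(2.0 * y - 1.0)  # depends on the value of y itself


def observed_mean(p_missing):
    """Mean of y over the cases that remain observed."""
    missing = rng.random(n) < p_missing
    return y[~missing].mean()


mean_mcar = observed_mean(p_mcar)
mean_mar = observed_mean(p_mar)
mean_nmar = observed_mean(p_nmar)

print(f"MCAR observed mean: {mean_mcar:.3f}")
print(f"MAR  observed mean: {mean_mar:.3f}")
print(f"NMAR observed mean: {mean_nmar:.3f}")
```

In this setup the unadjusted observed mean stays close to the true value only under MCAR. Under MAR the bias can in principle be corrected by a method that conditions on x; under NMAR it cannot be corrected from the observed data alone.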
Missing Data Consequences

Bias
Estimate
systematically
deviates from the
quantity of
interest.
No bias if the data
are MCAR, but bias
can occur when the data
are not MCAR.
Variance
Missing data can
sometimes lead to
wrong standard
errors.
Wrong study
conclusions about
relationship of
variables to
outcomes.
Commonly-Used Missing Data Handling Methods
Deletion Methods
Listwise/complete case deletion,
pairwise deletion
Single Imputation Methods

Mean/mode substitution, dummy
variable method, single regression
Model-Based Methods
Maximum Likelihood, Multiple
imputation
Deletion Method
Listwise Deletion (Complete Case Analysis)
Only analyze cases
with complete data,
dropping every case
that has a missing value
on any analysis variable.
When a researcher is
estimating a model,
such as a linear
regression, most
statistical packages
use listwise deletion
by default.
Listwise Deletion (Complete Case Analysis)
Advantages
Ease of implementation.
Comparability across analyses
Disadvantages
Reduces statistical power (because it lowers n; a researcher
cannot anticipate whether an adequate amount of data will
remain for the analysis).
Doesn't use all information
Estimates may be biased if the data are not MCAR (complete case
analysis assumes that the observed complete cases are a
random sample of the originally targeted sample, or in
Rubin's (1976) terminology, that the missing data are MCAR)
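As a minimal sketch of listwise deletion, assuming the data sit in a pandas DataFrame with NaN marking missing values (the column names and numbers below are made up for illustration), dropna() keeps only the complete cases:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data set; NaN marks a missing value.
df = pd.DataFrame({
    "age":    [23, 35, np.nan, 41, 29],
    "income": [30.0, np.nan, 52.0, 61.0, 38.0],
    "score":  [7.0, 6.0, 8.0, np.nan, 5.0],
})

# Listwise deletion: drop every row that has at least one missing value.
complete = df.dropna()

n_total = len(df)
n_complete = len(complete)
print(f"{n_complete} of {n_total} cases remain after listwise deletion")
print(complete)
```

Each incomplete row here is missing only one value, yet the whole row is discarded, which is how the loss of n and of information arises.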
Pairwise Deletion (Available Case Analysis)
Analysis with all cases in which
the variables of interest are
present.
Advantage:
Keeps as many cases as
possible for each analysis.
Uses all information
possible with each analysis.
Disadvantage:
Can't compare analyses
because the sample differs
each time.
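A short sketch of pairwise deletion, again with a made-up pandas DataFrame: DataFrame.corr() excludes missing values pair by pair, so each correlation is computed on a different subset of cases. The pair_n table counts how many cases contribute to each pair.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data set; NaN marks a missing value.
df = pd.DataFrame({
    "age":    [23, 35, np.nan, 41, 29, 50],
    "income": [30.0, np.nan, 52.0, 61.0, 38.0, np.nan],
    "score":  [7.0, 6.0, 8.0, np.nan, 5.0, 9.0],
})

# Pairwise deletion: each correlation uses all cases observed on that pair.
pairwise_corr = df.corr()

# Number of cases available for each pair of variables.
observed = df.notna().astype(int)
pair_n = observed.T @ observed

print(pairwise_corr.round(2))
print(pair_n)
```

The differing counts in pair_n show why results from different analyses are hard to compare: no single sample underlies the whole correlation matrix.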

Single Imputation Methods
Mean/Mode substitution
Dummy variable control
Conditional mean substitution
Mean/Mode Substitution
Replace each missing value with the sample mean
or mode
Run analyses as if all cases were complete
Advantages
Can use complete case analysis methods
Disadvantages
Reduces variability
Weakens covariance and correlation estimates
in the data (because it ignores the relationship
between variables)
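The two disadvantages can be seen in a small simulation. This is a sketch with hypothetical simulated data (the coefficients and the 40% missing rate are arbitrary): values of y are deleted completely at random and replaced by the observed mean, then the variance and the correlation with x are compared with the full data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical correlated data.
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(scale=0.7, size=n)

# Delete 40% of y completely at random, then substitute the observed mean.
missing = rng.random(n) < 0.4
y_obs = np.where(missing, np.nan, y)
y_imputed = np.where(missing, np.nanmean(y_obs), y_obs)

var_full = y.var(ddof=1)
var_imputed = y_imputed.var(ddof=1)
corr_full = np.corrcoef(x, y)[0, 1]
corr_imputed = np.corrcoef(x, y_imputed)[0, 1]

print(f"variance:    full {var_full:.3f}  mean-imputed {var_imputed:.3f}")
print(f"correlation: full {corr_full:.3f}  mean-imputed {corr_imputed:.3f}")
```

Every imputed case sits exactly at the mean, so it adds nothing to the spread of y and carries no information about x.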
Dummy Variable Adjustment

Create an indicator for missing value (1=value is
missing for observation; 0=value is observed for
observation)
Impute missing values to a constant (such as the
mean)
Advantage
Uses all available information about missing
observation
Disadvantage
Results in biased estimates
Not theoretically driven
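The two steps of the dummy variable adjustment can be sketched as follows. The data are made up, and ordinary least squares via numpy stands in for whatever regression routine is actually used.

```python
import numpy as np

# Hypothetical predictor with missing values and a fully observed outcome.
x = np.array([2.0, np.nan, 3.5, 1.0, np.nan, 4.5])
y = np.array([5.0, 6.0, 8.0, 3.0, 7.0, 10.0])

# Step 1: indicator, 1 = value is missing, 0 = value is observed.
missing_flag = np.isnan(x).astype(int)

# Step 2: impute the missing values to a constant (here the observed mean).
x_filled = np.where(np.isnan(x), np.nanmean(x), x)

# Regress y on an intercept, the filled-in predictor, and the indicator.
X = np.column_stack([np.ones_like(x_filled), x_filled, missing_flag])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print("missing indicator:", missing_flag)
print("filled predictor: ", x_filled)
print("coefficients (intercept, x, missing flag):", beta.round(3))
```

All six cases enter the regression, which is the appeal of the method; the bias noted above arises because the coefficient on x is still estimated with constant-filled values.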
Regression Imputation
Replaces missing values with
predicted score from a regression
equation.
Advantage:
Uses information from observed
data
Disadvantages:
Overestimates model fit and
correlation estimates
Underestimates variance
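A minimal sketch of regression imputation on hypothetical simulated data: a line is fitted to the complete cases with np.polyfit, and each missing y is replaced by its predicted value. The comparison at the end illustrates the inflated correlation and the reduced variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical data: y depends linearly on x plus noise.
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
missing = rng.random(n) < 0.4  # y deleted completely at random

# Fit the regression on complete cases only.
slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)

# Replace each missing y with its predicted score.
y_imputed = y.copy()
y_imputed[missing] = intercept + slope * x[missing]

corr_complete = np.corrcoef(x[~missing], y[~missing])[0, 1]
corr_imputed = np.corrcoef(x, y_imputed)[0, 1]
var_complete = y[~missing].var(ddof=1)
var_imputed = y_imputed.var(ddof=1)

print(f"correlation: complete cases {corr_complete:.3f}  imputed {corr_imputed:.3f}")
print(f"variance:    complete cases {var_complete:.3f}  imputed {var_imputed:.3f}")
```

The imputed points fall exactly on the regression line, with no residual scatter, which is why the relationship looks stronger than it is.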
Model Based Methods
Maximum Likelihood Using EM
algorithm
Multiple imputation
These methods share two assumptions:
that the joint distribution of the data is
multivariate normal, and that the
missing data mechanism is ignorable.
Maximum Likelihood
Identifies the set of parameter values that produces the
highest log-likelihood.
ML estimate: value that is most likely to have resulted in
the observed data
Conceptually, the process is the same with or without missing
data
Advantages:
Uses full information (both complete cases and
incomplete cases) to calculate log likelihood
Unbiased parameter estimates with MCAR/MAR data
Disadvantages
SEs biased downward; can be adjusted by using the observed
information matrix
We can base estimation on the
likelihood of the observed data.
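To illustrate maximum likelihood with the EM algorithm, here is a minimal sketch for a bivariate normal model in which x is fully observed and y is missing at random depending on x. The simulated data, the missingness model, and the fixed number of iterations are all assumptions made for illustration, not a general-purpose implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical data: true mean of y is 0 and true variance of y is 1.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)

# MAR: y is more likely to be missing when the observed x is large.
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-(2.0 * x - 1.0)))
obs = ~missing

# x is fully observed, so its ML estimates are available directly.
mu_x, s_xx = x.mean(), x.var()

# Starting values for the y parameters from the complete cases.
mu_y, s_yy = y[obs].mean(), y[obs].var()
s_xy = np.cov(x[obs], y[obs], bias=True)[0, 1]

for _ in range(200):
    # E-step: expected y and y^2 for missing cases, given x and the
    # current parameters (observed cases keep their actual values).
    beta = s_xy / s_xx
    resid_var = s_yy - beta * s_xy
    ey = np.where(obs, y, mu_y + beta * (x - mu_x))
    ey2 = np.where(obs, y**2, ey**2 + resid_var)

    # M-step: update the parameters from the expected sufficient statistics.
    mu_y = ey.mean()
    s_xy = (x * ey).mean() - mu_x * mu_y
    s_yy = ey2.mean() - mu_y**2

cc_mean = y[obs].mean()
print(f"complete-case mean of y: {cc_mean:.3f}")
print(f"EM estimate of mean of y: {mu_y:.3f}")
print(f"EM estimate of var of y:  {s_yy:.3f}")
```

The complete-case mean is pulled away from 0 because high-x cases are under-represented, while the EM estimate uses the incomplete cases' x values to recover it.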
Multiple Imputation
1. Impute: Missing values are filled in with imputed values using
a specified regression model
This step is repeated m times, resulting in a separate dataset
each time.
2. Analyze: Analyses are performed within each dataset
3. Pool: Results are pooled into one estimate
Advantages:
Variability more accurate with multiple imputations for each
missing value
Considers variability due to sampling AND variability due to
imputation
Disadvantages:
Cumbersome coding
Room for error when specifying models
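The impute, analyze, pool cycle can be sketched as below, using Rubin's rules for the pooling step. This is a simplified illustration on hypothetical simulated data: the imputation model adds random residual noise to regression predictions but does not redraw the regression coefficients for each imputation, as a fully proper multiple imputation would. The quantity analyzed is simply the mean of y.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2_000, 20  # sample size and number of imputations (illustrative)

# Hypothetical data: true mean of y is 1.
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
missing = rng.random(n) < 0.3
obs = ~missing

# Imputation model fitted on the complete cases.
slope, intercept = np.polyfit(x[obs], y[obs], deg=1)
resid_sd = np.std(y[obs] - (intercept + slope * x[obs]), ddof=2)

estimates, variances = [], []
for _ in range(m):
    # 1. Impute: prediction plus a random residual draw.
    #    Simplification: coefficients are held fixed across imputations.
    y_imp = y.copy()
    noise = rng.normal(scale=resid_sd, size=missing.sum())
    y_imp[missing] = intercept + slope * x[missing] + noise

    # 2. Analyze: estimate the mean of y and its squared standard error.
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)

# 3. Pool with Rubin's rules.
q_bar = np.mean(estimates)            # pooled estimate
within = np.mean(variances)           # variability due to sampling
between = np.var(estimates, ddof=1)   # variability due to imputation
total_var = within + (1 + 1 / m) * between

print(f"pooled mean of y: {q_bar:.3f}")
print(f"pooled standard error: {np.sqrt(total_var):.4f}")
```

The total variance combines the within-imputation and between-imputation components, which is how the pooled standard error reflects both sampling and imputation uncertainty.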
References
Allison, Paul D. 2001. Missing Data. Sage University
Papers Series on Quantitative Applications in the
Social Sciences. Thousand Oaks: Sage.
Enders, Craig. 2010. Applied Missing Data Analysis.
Guilford Press: New York.
Little, Roderick J., Donald Rubin. 2002. Statistical
Analysis with Missing Data. John Wiley & Sons, Inc:
Hoboken.
Schafer, Joseph L., John W. Graham. 2002. Missing
Data: Our View of the State of the Art.
Psychological Methods.
