
Analysis of Complex Sample Survey Data:

SURVMETH 614

Lecture Notes: Module 14

Missing Data in Complex Sample Surveys:
Multiple Imputation Analysis

Instructor: Brady T. West
Email: bwest@umich.edu

July 28, 2021


Patterns of Missing Data
• General Pattern
[Figure: data matrix with cases as rows and variables as columns]
• Special Patterns
[Figure panels: monotone, univariate, file matching]

2
“Matrix Sampling”

3
Missing-data mechanisms
• Pattern: Which values are missing?
• Mechanism: Why? Reasons related to the
study variables?
Y = data matrix that would be observed if no data were missing
M = missing-data indicator matrix; its (i,j)th element indicates
whether the (i,j)th element of Y is missing (1) or observed (0)
– Pattern concerns distribution of M
– Mechanism concerns distribution of M given Y

4
More on mechanisms
• Data are:
– missing completely at random (MCAR) if
missingness independent of Y:
p(M | Y) = p(M) for all Y
– missing at random (MAR) if missingness only
depends on observed components Yobs of Y:
p(M | Y) = p(M | Yobs) for all Y
– not missing at random (NMAR) if missingness
depends on missing (as well as perhaps on
observed) components of Y
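The three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings (two normal variables, logistic missingness probabilities, and the hypothetical function name `simulate_missingness`); none of it comes from the course materials. Y1 is always observed, Y2 is subject to missingness, and the mean of the observed Y2 values is compared with the full-data mean.

```python
import math
import random
import statistics

def simulate_missingness(n=20000, seed=614):
    """Mean of the observed y2 under MCAR, MAR and NMAR, plus the full-data mean."""
    rng = random.Random(seed)
    # y1 is always observed; y2 is correlated with y1 and may be missing.
    y1 = [rng.gauss(0, 1) for _ in range(n)]
    y2 = [a + rng.gauss(0, 1) for a in y1]

    def logistic(x):
        return 1 / (1 + math.exp(-x))

    # Each function returns P(M = 1), the probability that y2 is missing.
    mechanisms = {
        "MCAR": lambda a, b: 0.5,          # p(M | Y) = p(M)
        "MAR": lambda a, b: logistic(a),   # depends only on the observed y1
        "NMAR": lambda a, b: logistic(b),  # depends on the possibly missing y2
    }
    result = {"true": statistics.fmean(y2)}
    for name, prob in mechanisms.items():
        observed = [b for a, b in zip(y1, y2) if rng.random() >= prob(a, b)]
        result[name] = statistics.fmean(observed)
    return result

means = simulate_missingness()
for name, value in means.items():
    print(f"{name:5s} {value: .3f}")
```

In this setup the observed mean stays close to the full-data mean only under MCAR. Under MAR the bias could be corrected using Y1, whereas under NMAR it cannot be corrected from the observed data alone.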
5
Alternatives for Handling Missing
Data in Surveys
1. Conduct complete case analysis, ignoring cases with missing data (the
list-wise deletion default in SAS);
2. Employ weighting of complete cases to compensate for missing data;
3. Analyze the incomplete data using the EM algorithm (Little and Rubin
2002; Allison 2001);
4. Employ full information maximum likelihood methods to analyze the
data;
5. Perform single imputation of missing values using deterministic or
stochastic approaches such as mean/mode imputation, regression
imputation, predictive mean matching, the nearest neighbor method, or the
hot deck (see Kalton and Kasprzyk 1986 and Little and Rubin 2002 for
comprehensive reviews);
6. Develop multiple imputations of the item missing data and employ MI
estimation and inference in analysis.
6
General Strategies
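[Figure: a data file made up of complete cases and incomplete cases (missing values shown as "?"), handled by three general strategies]
• Imputation: fill in the missing values of the incomplete cases with
imputations
• Complete-Case Analysis: discard the incomplete cases, optionally
applying weights (w1, w2, w3, ...) to the complete cases
• Analyze Incomplete: analyze the incomplete data directly, e.g. by
maximum likelihood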


7
Conclusions
• Some methods apply to particular patterns,
others to any pattern
• Properties of methods vary depending on
mechanism
• All methods have limitations -- better to
avoid missing values, or try to minimize the
problem
• We now examine the methods in detail!
8
Complete-Case Analysis

[Figure: complete cases are retained; incomplete cases (missing values shown as "?") are discarded]
• Default analysis in statistical packages (“listwise deletion”)
• Simple but limited
9
Unweighted CC Analysis
• Easy (but missing values must be flagged!)
• Does not invent data
• Simple and may be good enough with small
amounts of missing data
– but defining “small” is problematic; depends on
• fraction of incomplete cases
• recorded information in these cases
• parameter being estimated

10
Limitations of CC Analysis
• Loss of information in incomplete cases has
two aspects:
– Increased variance of estimates
– Bias when complete cases differ systematically
from incomplete cases
• restriction to complete cases involves a major
assumption -- that the complete cases are
representative of all the cases (MCAR)
• this assumption is often questionable!

11
Weighted CC Analysis
[Figure: complete cases carry weights w1, w2, w3, ...; incomplete cases (missing values shown as "?") are discarded]
• Weight respondents differentially to reduce
nonresponse bias
• Common for unit nonresponse in surveys, but
problematic for item nonresponse (why?)

12
Weighting Methods
• Weighting is a relatively simple device for
reducing bias from complete-case analysis
• Same weight for all variables -- simple, but better
methods tune adjustment according to outcome
• No built in control of variance
– ad-hoc trimming is common in surveys
• Less useful when:
– Covariate information is extensive
– Pattern of missing-data is non-monotone
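A weighting-class adjustment is one simple form of this idea. The sketch below is a hedged illustration, not the procedure of any particular package: the function name `weighting_class_mean`, the two adjustment cells and the response rates are assumptions made for the example. Each respondent is weighted by the inverse of the response rate in its cell.

```python
import random
import statistics
from collections import defaultdict

def weighting_class_mean(cells, y, responded):
    """Respondent mean of y, weighted by the inverse response rate of each cell."""
    n_cell = defaultdict(int)
    r_cell = defaultdict(int)
    for c, r in zip(cells, responded):
        n_cell[c] += 1
        r_cell[c] += r
    # Adjustment weight = cell size / number of respondents in the cell
    # (assumes every cell contains at least one respondent).
    weight = {c: n_cell[c] / r_cell[c] for c in n_cell}
    num = sum(weight[c] * v for c, v, r in zip(cells, y, responded) if r)
    den = sum(weight[c] for c, r in zip(cells, responded) if r)
    return num / den

rng = random.Random(614)
cells = [rng.choice(["A", "B"]) for _ in range(10000)]
y = [rng.gauss(10 if c == "A" else 20, 2) for c in cells]
# Cell B has both higher values of y and a lower response rate.
responded = [rng.random() < (0.8 if c == "A" else 0.4) for c in cells]

full_mean = statistics.fmean(y)
cc_mean = statistics.fmean([v for v, r in zip(y, responded) if r])
adj_mean = weighting_class_mean(cells, y, responded)
print(round(full_mean, 2), round(cc_mean, 2), round(adj_mean, 2))
```

Note that the same weight applies to every variable of a respondent, and that a cell with few respondents produces a large weight, which is the variance problem that trimming is meant to address.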

13
Features of Imputation
[Figure: complete cases, plus incomplete cases whose missing values are filled in with imputations]

Good:
• Rectangular file
• Retains observed data
• Handles missing data once
• Exploits incomplete cases

Bad:
• Naïve methods can be bad
• Invents data, so uncertainty is understated
14
Imputing Means
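[Figure: Y1 is observed for all cases 1, ..., n; Y2 is observed for cases 1, ..., m and missing for cases m+1, ..., n]

• Unconditional: missing data are replaced by the overall mean of the
observed values of Y2:
$\hat{y}_{i2} = \hat{E}(Y_2) = \bar{y}_2$
• Conditional on observed variables: missing data are replaced by means
conditional on the observed data:
$\hat{y}_{i2} = \hat{E}(y_{i2} \mid y_{i1}) = \hat{\beta}_{20\cdot 1} + \hat{\beta}_{21\cdot 1}\, y_{i1}$,
giving imputed values $\hat{y}_{m+1,2}, \hat{y}_{m+2,2}, \hat{y}_{m+3,2}, \ldots$
15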
Properties of Mean Imputation
• Marginal Distributions, Associations are
distorted
• Standard errors of estimates from filled-in
data are too small, since
– Standard deviations are underestimated
– “Sample size” is overstated
• Conditional better than unconditional mean,
which can be worse than complete cases
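The understatement of spread can be seen in a small simulation. This is a sketch under assumed settings (a linear relationship between Y1 and Y2, with Y2 missing for the last n − m cases); the helper `fit_line` and all numeric settings are illustrative choices, not part of the notes.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

rng = random.Random(614)
n, m = 2000, 1200  # y2 observed for cases 1..m, missing for m+1..n
y1 = [rng.gauss(0, 1) for _ in range(n)]
y2_full = [1 + 2 * a + rng.gauss(0, 1) for a in y1]  # unknown in practice
y2_obs = y2_full[:m]

# Unconditional: every missing y2 gets the observed mean of y2.
uncond = y2_obs + [statistics.fmean(y2_obs)] * (n - m)

# Conditional: every missing y2 gets its regression prediction from y1.
b0, b1 = fit_line(y1[:m], y2_obs)
cond = y2_obs + [b0 + b1 * a for a in y1[m:]]

sd_full = statistics.stdev(y2_full)
sd_uncond = statistics.stdev(uncond)
sd_cond = statistics.stdev(cond)
print(round(sd_full, 2), round(sd_uncond, 2), round(sd_cond, 2))
```

Both filled-in files have a smaller standard deviation than the full data; the conditional version loses only the residual variation, the unconditional version loses all variation in the imputed cases.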

16
Imputing Draws
• Imputations can be random draws
from a predictive distribution for
the missing values

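[Figure: Y1 fully observed; missing Y2 values imputed as $\hat{y}_{m+1,2}, \hat{y}_{m+2,2}, \hat{y}_{m+3,2}, \ldots$]
$\hat{y}_{i2} = \hat{E}(y_{i2} \mid y_{i1}) + r_i$ (conditional mean plus a random residual), where either
– $r_i \sim N(0, s_{22\cdot 1}^2)$, with $s_{22\cdot 1}^2$ the residual variance, or
– $r_i$ = residual from a randomly chosen complete case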
17
Properties of Imputed Draws
• Adds noise, less efficient than imputing means, but:
– No (or reduced) bias for estimating distributions
• Conditional draws better than unconditional:
– Improved efficiency
– Preserves associations with conditioned variables
• Standard errors from filled-in data are improved,
but still wrong:
– Standard deviation is ok
– “Sample size” overstated; multiple imputation fixes this
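A minimal sketch of imputing conditional draws follows, assuming a simple linear regression of Y2 on Y1 and simulated data. The function name `impute_draws` and its `normal` switch are hypothetical; the switch selects between a normal residual and the residual of a randomly chosen complete case.

```python
import random
import statistics

def impute_draws(x_obs, y_obs, x_mis, rng, normal=True):
    """Impute y as the conditional mean given x plus a random residual."""
    mx, my = statistics.fmean(x_obs), statistics.fmean(y_obs)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x_obs, y_obs))
             / sum((a - mx) ** 2 for a in x_obs))
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x_obs, y_obs)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = (sum(r * r for r in resid) / (len(resid) - 2)) ** 0.5
    draws = []
    for a in x_mis:
        r = rng.gauss(0, s) if normal else rng.choice(resid)
        draws.append(intercept + slope * a + r)
    return draws

rng = random.Random(614)
n, m = 2000, 1200  # y2 observed for cases 1..m, missing for m+1..n
y1 = [rng.gauss(0, 1) for _ in range(n)]
y2_full = [1 + 2 * a + rng.gauss(0, 1) for a in y1]  # unknown in practice

completed = y2_full[:m] + impute_draws(y1[:m], y2_full[:m], y1[m:], rng)
resid_based = impute_draws(y1[:m], y2_full[:m], y1[m:], rng, normal=False)

sd_full = statistics.stdev(y2_full)
sd_completed = statistics.stdev(completed)
print(round(sd_full, 2), round(sd_completed, 2))
```

The standard deviation of the completed variable is close to that of the full data, but a standard error computed from this single filled-in file would still treat all n values as observed.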

18
Creating the predictive distribution
All imputation methods assume a model for
the predictive distribution of the missing
values
– Explicit: predictive distribution based on a
formal statistical model (e.g. multivariate
normal); assumptions are explicit
– Implicit: focus is on an algorithm, but the
algorithm implies an underlying model;
assumptions are implicit
19
Implicit modeling procedures
• Hot deck imputation
– classify respondents, nonrespondents into
adjustment cells with similar observed values
– impute values from random respondent in same
cell
– implicit model: regression of missing variables
on variables forming cells, including all
interactions
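The cell-based hot deck can be sketched in a few lines. The function `hot_deck`, the record layout and the variable names below are hypothetical, chosen only to illustrate the classify-then-draw logic; a production hot deck would also collapse cells when no donor is available.

```python
import random
from collections import defaultdict

def hot_deck(records, cell_vars, target, rng):
    """Fill missing `target` values (None) with a random donor from the same cell."""
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[tuple(rec[v] for v in cell_vars)].append(rec[target])
    completed = []
    for rec in records:
        rec = dict(rec)  # copy, so the input records are left untouched
        if rec[target] is None:
            cell = tuple(rec[v] for v in cell_vars)
            if donors[cell]:  # with no donor in the cell, the value stays missing
                rec[target] = rng.choice(donors[cell])
        completed.append(rec)
    return completed

records = [
    {"sex": "F", "age": "18-34", "earnings": 41000},
    {"sex": "F", "age": "18-34", "earnings": 38000},
    {"sex": "F", "age": "18-34", "earnings": None},
    {"sex": "M", "age": "35-64", "earnings": 52000},
    {"sex": "M", "age": "35-64", "earnings": None},
]
completed = hot_deck(records, ["sex", "age"], "earnings", random.Random(614))
for rec in completed:
    print(rec)
```

Every imputed value is a real value observed for a respondent in the same cell, which is why the implied model includes all interactions among the cell-forming variables.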

20
Hot deck imputation example

21
Current Population Survey Hot Deck
• Missing (Y): Earnings Variables
• Observed (X):
– Age, Race, Sex, Family Relationship, Children, Marital
Status, Occupation, Schooling, Full/Part time, Type of
Residence, Income Recipiency Pattern
• Flexible matching:
– Joint Classification by X yields giant matrix. If
a match is not found, table is coarsened or
collapsed in stages until a match is found
22
CPS Hot Deck (continued)
Good Features
• Imputes real values
• Multivariate: associations preserved
• Conditions on X’s
• Assessments suggest method works quite well
Bad Features
• Does not exploit previous earnings models
• Includes high order interactions at expense of main effects of
omitted X’s
• Imputation uncertainty not included in standard errors
For comparison of CPS Hot Deck with stochastic
regression imputation, see David et al. (1986)
23
Summary
• Imputations should:
– Condition on observed variables
– be multivariate to preserve associations
between missing variables
– generally be draws rather than means

• Key problem: single imputations do not account for imputation
uncertainty in standard errors.

24
Assumptions Made by Simple Methods

CCA: Complete Case Analysis;
ACA: Available Case Analysis;
LOCF: Last Observation Carried Forward
25
Accounting for Imputation
Uncertainty
• Imputation “makes up” the missing data
– treats imputed values as the truth
• For statistical inference (standard errors, p-values,
confidence intervals), need methods that account for
imputation error
– Bootstrap imputations (Rubin and Schenker 1986)
– Multiple imputation (Rubin 1987)
– Fractional imputation (Kim and Fuller 2004)

26
Multiple Imputation (MI)

27
Advantages of the MI Approach
• MI is model-based, transparent
• MI is stochastic, draws of model parameters
• MI is multivariate, preserves distributions
• MI employs repetitions to estimate uncertainty
• MI is robust against minor departures from
assumptions
• MI is usable for analysts who can access current
software systems

28
Relationship of the Imputation and Analysis Models

Imputation Model
$Y = (Y_1, Y_2, Y_3)$
$f(Y \mid \theta)$

Analysis Model
$Y^* = (Y_1, Y_2)$
$f(Y_1 \mid \theta^*, Y_2)$

29
Recommendations for Selecting an
Imputation Model
1) Include all key analysis variables (dependent: Y1
and independent: Y2)

2) Include other variables that are correlated or
associated with the analysis variables (Y3)

3) Include variables that predict item missing data on
the analytic variables (Z)
30
Incorporating Complex Sample Design
Features in The Imputation Model
• See Berglund, P. and Heeringa, S., Multiple Imputation of
Missing Data Using SAS®. (2014) Cary, NC: SAS Institute,
Inc.
• Don Rubin (1996) offered the following guidance on MI for
complex samples: “Minimally, major clustering and
stratification indicators and sample design weights (or
estimated propensity scores of being in the sample) should be
included in the imputation models. The possible lost precision
when including unimportant predictors is usually a small price
to pay for the general validity of the resultant multiply
imputed data base.”
• See also Reiter et al. (2006) and Hanzhi Zhou’s 2016 articles
on “uncomplexing” the design for MI (Canvas)
31
Multiple Imputation Inference
• Create M completed data sets (e.g. M = 5)
• Analyze each completed data set with complete-data methods
• Combine the M sets of results with simple rules to produce the
multiple imputation inference
• Particularly useful for public use datasets
– the data provider creates the imputations once for multiple
users, who can then analyze the data with complete-data
methods
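The three steps can be sketched end to end. The example below is an illustration under strong simplifying assumptions: simulated data from a simple random sample, one incomplete variable, and a bootstrap of the complete cases as a stand-in for drawing the model parameters. The names `fit_line` and `multiply_impute` are hypothetical. With real complex sample data, the imputation model and the variance estimates would also need to reflect the design.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept, slope and residuals of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    resid = [b - intercept - slope * a for a, b in zip(x, y)]
    return intercept, slope, resid

def multiply_impute(y1, y2, n_imputations, rng):
    """Return M completed copies of y2 (None = missing), imputed from y1."""
    complete = [(a, b) for a, b in zip(y1, y2) if b is not None]
    completed_sets = []
    for _ in range(n_imputations):
        # Bootstrap the complete cases so that each imputation is based on
        # different regression parameters (parameter uncertainty).
        boot = [rng.choice(complete) for _ in complete]
        b0, b1, resid = fit_line([a for a, _ in boot], [b for _, b in boot])
        completed_sets.append([
            b if b is not None else b0 + b1 * a + rng.choice(resid)
            for a, b in zip(y1, y2)
        ])
    return completed_sets

rng = random.Random(614)
n, M = 500, 5
y1 = [rng.gauss(0, 1) for _ in range(n)]
y2 = [1 + 2 * a + rng.gauss(0, 1) if rng.random() < 0.7 else None for a in y1]

# Step 1: create M completed data sets.
completed = multiply_impute(y1, y2, M, rng)
# Step 2: analyze each one (mean of y2 and its simple-random-sample variance).
estimates = [statistics.fmean(c) for c in completed]
variances = [statistics.variance(c) / n for c in completed]
# Step 3: combine the results.
theta_bar = statistics.fmean(estimates)
u_bar = statistics.fmean(variances)
b = statistics.variance(estimates)
total_var = u_bar + (1 + 1 / M) * b
print(round(theta_bar, 3), round(total_var ** 0.5, 3))
```

The observed values are identical across the M data sets; only the imputed values differ, and their variation across data sets is what feeds the between-imputation component of the variance.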
32
MI Inference for a Scalar Estimand
$\theta$ = estimand
$\hat{\theta}_l$ = estimate from the $l$th completed dataset ($l = 1, \ldots, M$)
$U_l$ = estimate of the variance of $\hat{\theta}_l$ from the $l$th analysis
Then the MI estimate of $\theta$ is:
$\bar{\theta} = \frac{1}{M} \sum_{l=1}^{M} \hat{\theta}_l$
The MI estimate of variance is $V = \bar{U} + \frac{M+1}{M} B$, where
$\bar{U} = \frac{1}{M} \sum_{l=1}^{M} U_l$ = Within-Imputation Variance
$B = \frac{1}{M-1} \sum_{l=1}^{M} (\hat{\theta}_l - \bar{\theta})^2$ = Between-Imputation Variance
33
Example of Multiple Imputation
• Five imputed datasets are created for variables Y1 to Y5. Each fills
the missing cells with a different set of imputed values, and the two
estimands, μ1 and β53·1234, are re-estimated from each completed
dataset. Entries are Estimate (se²).

Dataset (l)   Imputed values      μ1             β53·1234
1             2.1, 4.5, 24, 1     12.6 (3.6²)    4.32 (1.95²)
2             2.7, 5.1, 31, 1     12.6 (3.6²)    4.15 (2.64²)
3             1.9, 5.8, 32, 2     12.6 (3.6²)    4.86 (2.09²)
4             2.5, 3.9, 18, 1     12.6 (3.6²)    3.98 (2.14²)
5             2.3, 4.2, 25, 2     12.6 (3.6²)    4.50 (2.47²)
Mean                              12.6 (3.6²)    4.36 (2.27²)
Var                               0              0.115 (= 0.339²)

38
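Summary of MI Inferences
With M = 5, $V = \bar{U} + \frac{6}{5} B$ and $\hat{\gamma}_{MI} = 1.2B / (1.2B + \bar{U})$:

Estimand     θ̄       Ū        B        V        γ̂_MI
μ1           12.6    3.6²     0        3.6²     0
β53·1234     4.36    2.27²    0.115    2.30²    0.026

Here B = 0.115 = 0.339² is the variance of the five estimates of
β53·1234 (their standard deviation is 0.339), so
V = 2.27² + 1.2(0.115) = 5.29 = 2.30².

$\hat{\gamma}_{MI} = \frac{(1 + 1/M) B}{(1 + 1/M) B + \bar{U}}$ = estimated fraction of missing information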
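The combining rules can be checked numerically on the five estimates of the second estimand from the example (4.32, 4.15, 4.86, 3.98, 4.50, with standard errors 1.95, 2.64, 2.09, 2.14, 2.47). The function name `combine_mi` is a hypothetical choice for this sketch. Computed this way, the between-imputation variance is about 0.115, whose square root is 0.339; entering 0.339 itself as B would instead give a standard error of about 2.36 and a fraction of missing information of about 0.073.

```python
import statistics

def combine_mi(estimates, variances):
    """Combine M completed-data estimates and variances for a scalar estimand."""
    m = len(estimates)
    theta_bar = statistics.fmean(estimates)   # MI estimate
    u_bar = statistics.fmean(variances)       # within-imputation variance
    b = statistics.variance(estimates)        # between-imputation variance (divisor M - 1)
    total = u_bar + (1 + 1 / m) * b           # MI estimate of variance
    gamma = (1 + 1 / m) * b / total           # estimated fraction of missing information
    return {"estimate": theta_bar, "U": u_bar, "B": b,
            "V": total, "se": total ** 0.5, "gamma": gamma}

estimates = [4.32, 4.15, 4.86, 3.98, 4.50]
std_errors = [1.95, 2.64, 2.09, 2.14, 2.47]
result = combine_mi(estimates, [se ** 2 for se in std_errors])
for name, value in result.items():
    print(f"{name:8s} {value:.4f}")
```

For the first estimand all five estimates equal 12.6, so B = 0, V equals the within-imputation variance 3.6², and the estimated fraction of missing information is 0.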

39
Creating Multiple Imputations
• Multiple Imputations created within a single model take
into account within-model uncertainty
• Multiple Imputations can also be created under
alternative models, to account for imputation model
uncertainty
• Imputations can be based on implicit or explicit models,
as for single imputation
• Joint modeling (one multivariate distribution) and fully
conditional specification (a series of univariate conditional
distributions, imputing one variable at a time) can be used as the
imputation engine
40
Summary of Multiple Imputation
• Retains advantages of single imputation
– Consistent analyses
– Data collector’s knowledge
– Rectangular data sets
• Corrects disadvantages of single imputation
– Reflects uncertainty in imputed values
– Corrects inefficiency from imputing draws
• estimates have high efficiency for modest M, e.g. 5

41
Multiple Imputation Software Options
• Stata
– Stata includes several easy-to-use commands from
the -mi- suite of commands (e.g., mi impute chained)
• R: The -mi- package
• IVEware - %IMPUTE
• SAS PROC MI [see ASDA Chapter 12 Analysis
Examples Replication on the ASDA website]
• Other packages (SOLAS 3.0, Mplus, …)

42
MI Estimation and Inference
• Stata – mi estimate: prefix command, with svy
capabilities
• IVEware - %DESCRIBE, %REGRESS
• SAS – PROC MIANALYZE (see ASDA
Chapter 12 analysis examples replication on
ASDA website)

43
Fractional Hot Deck Imputation
for Missing Survey Data
• Kim and Fuller (2004)
• Method (implemented in PROC SURVEYIMPUTE):
1. Split only those cases with item missing data into M > 1 replicate
cases (M = # of imputations)
2. Impute the missing data using the joint (multivariate) hot deck
method for each replicate case (other methods possible)
3. Assign fractional weights to the replicate cases based on the
relative weights of the donor cases (which sum to the original
weight for the case), and combine the cases to form a data file with
n* > n records
4. Estimate variances through replicate weights that independently
repeat the above process (Steps 1 through 3) for replicates
(jackknife, half-sample, etc.) of the full complex sample
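Steps 1 through 3 can be sketched as follows. This is a simplified illustration and not the PROC SURVEYIMPUTE algorithm: the function `fractional_hot_deck`, the single imputation cell variable, the random selection of donors and the record layout are all assumptions made for the example, and the replicate-weight variance estimation of Step 4 is omitted.

```python
import random
from collections import defaultdict

def fractional_hot_deck(records, cell_var, target, n_donors, rng):
    """Split each incomplete case into replicate cases with fractional weights."""
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[rec[cell_var]].append(rec)
    out = []
    for rec in records:
        pool = donors[rec[cell_var]]
        if rec[target] is not None or not pool:
            out.append(dict(rec))  # complete case, or no donor available
            continue
        chosen = rng.sample(pool, min(n_donors, len(pool)))
        donor_total = sum(d["weight"] for d in chosen)
        for d in chosen:
            replicate = dict(rec)
            replicate[target] = d[target]
            # Fractional weight proportional to the donor's weight; the
            # fractions sum to the recipient's original weight.
            replicate["weight"] = rec["weight"] * d["weight"] / donor_total
            out.append(replicate)
    return out

records = [
    {"id": 1, "cell": "A", "weight": 10.0, "y": 3.0},
    {"id": 2, "cell": "A", "weight": 30.0, "y": 5.0},
    {"id": 3, "cell": "A", "weight": 20.0, "y": None},
    {"id": 4, "cell": "B", "weight": 15.0, "y": 7.0},
]
expanded = fractional_hot_deck(records, "cell", "y", 2, random.Random(614))
for rec in expanded:
    print(rec)
```

In this toy file, case 3 is split into two replicate cases with weights 5 and 15, so the expanded file has n* = 5 records while the total weight is unchanged.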

44
Selected References
• Berglund, P. and Heeringa, S., Multiple Imputation of Missing Data
Using SAS®. (2014) Cary, NC: SAS Institute, Inc.
• Fay, R.E., Alternative Paradigms for the Analysis of Imputed Survey
Data, Journal of the American Statistical Association, Vol. 91, No.
434. (1996), pp. 490-498
• Kalton, G., and Kasprzyk, D. (1986), "The Treatment of Missing
Survey Data," Survey Methodology, 12, 1–16.
• Kim, J.K., Parametric fractional imputation for missing data analysis,
Biometrika, 98 (2011), pp. 119–132.
• Kim, J. K., & Fuller, W. (2004). Fractional hot deck
imputation. Biometrika, 91(3), 559-578.
• Little, R.J.A. and Rubin, D.B., Statistical Analysis with Missing Data
(2nd ed.), John Wiley & Sons, New York, 2002.
• Rao, J.N.K. and Shao, J., Jackknife variance estimation with survey
data under hot deck imputation, Biometrika, 79, 811-822, 1992.
45
Selected References
• Raghunathan, T.E., Lepkowski, J.M., Van Hoewyk, J., and Solenberger,
P., A multivariate technique for multiply imputing missing values using a
sequence of regression models, Survey Methodology, 27(1), 85-95, 2001.
• Reiter, Jerome P., Trivellore E. Raghunathan, and Satkartar K. Kinney.
"The importance of modeling the sampling design in multiple imputation
for missing data." Survey Methodology 32(2), 2006: 143.
• Rubin, D.B., Multiple Imputation for Nonresponse in Surveys, John Wiley
& Sons, New York, 1987.
• Schafer, J.L., Analysis of Incomplete Multivariate Data, Chapman & Hall,
London, 1997.
• van Buuren S (2012). Flexible Imputation of Missing Data. Chapman &
Hall/CRC Press, Boca Raton, FL.

46
