
Missing Values Imputation

A New Approach to Missing Values Processing with Bayesian Networks

Stefan Conrady, stefan.conrady@conradyscience.com


Dr. Lionel Jouffe, jouffe@bayesia.com
December 19, 2011

Conrady Applied Science, LLC - Bayesia's North American Partner for Sales and Consulting

Table of Contents

Introduction
    Motivation & Objective
    Overview
    Notation
Theory of Missing Values Processing
    Types of Missingness
    Methods for the Analysis of Incomplete Data
        Ad Hoc Methods
            Listwise Deletion
            Pairwise Deletion
            Imputation
            Single Mean Imputation
            Conditional Mean Imputation
            Multiple Imputation (MI)
            Ad Hoc Methods Summary
        Maximum Likelihood Estimation (ML)
    Bayesian Networks
Missing Values Processing in Practice
    Linear Example
        Data Generating Process (DGP)
        Missing Data Mechanism (MDM)
        Standard Methods
            Listwise/Casewise Deletion
            Pairwise Deletion
            Single Mean Imputation
            Multiple Imputation (MI)
        Missing Values Processing with Bayesian Networks
            Static Completion
            Dynamic Completion
    Nonlinear Example
        Data Generating Process (DGP)
        Bayesian Network Representation
        Missing Data Mechanism
        Bayesian Network Learning & Missing Values Imputation
            Unsupervised Learning
            Imputation
        Summary Nonlinear Example
Summary & Outlook
Appendix
    About the Authors
        Stefan Conrady
        Lionel Jouffe
    References
    Contact Information
        Conrady Applied Science, LLC
        Bayesia S.A.S.
    Copyright

Missing Values Imputation with Bayesian Networks

Introduction
Motivation & Objective
With the abundance of big data in the field of analytics, and all the challenges today's immense data volumes are causing, it may not be particularly fashionable or pressing to discuss missing values. After all, who cares about missing data points when there are petabytes of further observations out there?
As the objective of any data gathering process is to gain knowledge about a domain, missing values are obviously undesirable. A missing datum undoubtedly reduces our knowledge about any individual observation, but the implications for our understanding of the whole domain may not be so obvious, especially when there seems to be an endless supply of data.
Missing values are encountered in virtually all real-world data collection processes. They can be the result of non-response in surveys, poor record-keeping, server outages, attrition in longitudinal surveys, the faulty sensors of a measuring device, etc. What's often overlooked is that improperly handled missing observations can lead to misleading interpretations or create a false sense of confidence in one's findings, regardless of how many complete observations might be available.
Despite the intuitive nature of this problem, and the fact that almost all quantitative studies are affected by it, applied researchers have given it remarkably little attention in practice. Burton and Altman (2004) state this predicament very forcefully in the context of cancer research: "We are concerned that very few authors have considered the impact of missing covariate data; it seems that missing data is generally either not recognized as an issue or considered a nuisance that it is best hidden."
As missing values processing (beyond the naïve ad hoc approaches) can be a demanding task, both methodologically and computationally, the principal objective of this paper is to propose a new and hopefully easier approach by employing Bayesian networks. It is not our intention to open the proverbial new can of worms, and thus distract researchers from their principal study focus; rather, we want to demonstrate that Bayesian networks can reliably, efficiently and intuitively integrate missing values processing into the main research task.


Overview
1. We will first provide a brief introduction to missing values and highlight a selection of methods that
have been traditionally used to deal with this problem.
2. We will then use a linear and a nonlinear example to illustrate the different statistical methods and introduce missing values imputation with Bayesian networks.

Notation
To clearly distinguish between natural language, software-specific functions and example-specific variable names, the following notation is used:
- Bayesian network and BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.
- Names of attributes, variables and nodes are italicized.


Theory of Missing Values Processing


Missing values are deceiving, as they all appear the same, namely as null values in one's dataset. Metaphorically speaking, they are like a colorless and odorless gas, invisible to the naked eye and undetectable by olfaction, and thus seemingly immaterial. Yet we know from physics and chemistry that dangerous substances can be cloaked in an ethereal appearance.
While we want to focus in this paper on how to practically deal with missing values, using a minimum of mathematical notation and statistical jargon, we need to introduce a few basic concepts that are essential for understanding the problem of missing values. We will keep this theory to a minimum and rather refer the interested reader to the specialist literature for further study.

Types of Missingness
There are several types of missing data that are typically encountered in studies:
First, data can be Missing Completely at Random (MCAR), which requires that the missingness¹ is completely unrelated to the data. More formally, this can be stated as:

$$p(R \mid Y_{obs}, Y_{mis}, X, \phi) = p(R \mid \phi)$$


where $R$ is the missing data indicator and $\phi$ is a parameter that characterizes the relationship between $R$ and the data. This implies that $R$ depends neither on $Y_{obs}$ and $Y_{mis}$, i.e. the observed and the missing parts of the data, nor on any covariates $X$.
When data are MCAR, missing cases are no different than non-missing cases in terms of the analysis being performed. These cases can thus be thought of as randomly missing from the data, and the only real penalty in failing to account for them is a loss of power in any parameter estimation. Unfortunately, MCAR is a very strong assumption and generally not testable. As a result, analysts can rarely rely on this fairly benign condition.

¹ It is important to note the difference between the missingness of a variable and a missing value: missingness refers to the probability of not being observed, whereas missing refers to the state of not being observed.
Secondly, data can be Missing at Random (MAR), which is a weaker assumption. For MAR data, the missingness depends on known values and is thus fully described by the variables observed in the dataset. This also means that $R$, which denotes the missingness of $Y$, is unrelated to the values of $Y_{mis}$:

$$p(R \mid Y_{obs}, Y_{mis}, X, \phi) = p(R \mid Y_{obs}, X, \phi)$$


For instance, if MAR holds, one can regress $Y_{obs}$ on $X$ for the respondents and then use this relationship to obtain imputed values of $Y_{mis}$ for the non-respondents (Schafer and Olsen, 1998).
For illustration we use an example from missingdata.org.uk: suppose we seek to collect data on income and property tax band. Typically, those with higher incomes may be less willing to reveal them. Thus, a simple average of incomes from respondents will be downwardly biased. However, now suppose we have everyone's property tax band, and that, given property tax band, non-response to the income question is random. Then the income data is missing at random: the reason, or mechanism, for it being missing depends on property tax band, and, given property band, missingness does not depend on income itself.
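To make this mechanism concrete, the following minimal Python simulation (with entirely hypothetical numbers for incomes, bands and response rates) reproduces the logic of the example: the raw mean of the respondents' incomes is biased downward, while conditioning on the fully observed property tax band removes the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: income increases with property tax band (1-5).
band = rng.integers(1, 6, size=n)
income = 20_000 * band + rng.normal(0, 10_000, size=n)

# MAR mechanism: non-response depends only on the observed band.
p_missing = 0.10 + 0.15 * (band - 1)      # higher bands respond less often
observed = rng.random(n) > p_missing

print(income.mean())                       # true mean
print(income[observed].mean())             # downwardly biased

# Conditioning on band removes the bias: fill each non-respondent with
# the mean income of respondents in the same band.
imputed = income.copy()
for b in range(1, 6):
    fill = income[observed & (band == b)].mean()
    imputed[~observed & (band == b)] = fill
print(imputed.mean())                      # close to the true mean again
```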
Finally, data can be missing in an unmeasured fashion, termed nonignorable, also called Missing Not at Random (MNAR) or Not Missing at Random (NMAR). This is the most general case, which includes all possible associations between missingness and data; the conditional probability of missingness cannot be simplified:

$$p(R \mid Y_{obs}, Y_{mis}, X, \phi)$$
Since the missingness depends on events or items which the researcher has not measured, such as the missing values $Y_{mis}$ themselves, this creates a challenge.
A hypothetical example for this situation would be a behavioral survey. Individuals who engage in dangerous behaviors may be less likely to admit the high-risk nature of their conduct and thus decline to respond. In comparison, low-risk respondents may quite willingly make statements with regard to their safe behavior.
Filtered or Censored Values
At this point we should briefly mention a fourth type of missingness, which is less often mentioned in the literature. We refer to it as filtered values. These are values that cannot exist, yet in datasets they often appear no different than other missing values, i.e. values that do exist but have not been observed.
For instance, in an auto buyer survey, a question about rear-seat legroom cannot be answered by the owner of a two-seat roadster. Conceptually, these filtered values are similar to MAR values: they depend on other variables in the dataset, e.g. given that vehicle type = roadster, rear seat legroom = FV. As opposed to MAR values, and as this example suggests, filtered values are determined by logic.
It is particularly important that these missing values are not imputed with any estimated values, as a bias would be highly likely. Rather, a new state, e.g. a code like FV, has to be added as an additional state to the affected variables.² The filtered value declaration can typically be done in the context of the ETL³ process, prior to the start of any modeling work. Upon definition of these filtered values, the normal missing values processing can begin.

² BayesiaLab provides a framework for dealing with such filtered values, so that these filtered values are specifically not taken into account while machine-learning a model of the data generating process. Furthermore, BayesiaLab excludes filtered values when computing correlations, etc.
³ ETL stands for extract, transform and load, referring to the data preprocessing prior to starting a statistical analysis.


Methods for the Analysis of Incomplete Data


We now want to provide a very brief and admittedly incomplete (no pun intended) overview of some of the common methods for dealing with incomplete data, noting that some of them must be used with great caution or cannot be recommended at all.

Ad Hoc Methods
Listwise Deletion
Given that many statistical estimations require a complete set of observations, any records with missing values are often excluded by default from such computations. The first consequence is an immediate loss of samples, which may have been very costly to acquire. This (often rather involuntarily applied) method is known more formally as listwise deletion or casewise deletion. In the best case, i.e. with MCAR data, an estimation on a reduced set of observations may only increase the standard error; under different circumstances, e.g. with MAR data, listwise deletion can bias the parameter estimates. In situations where no case is completely observed, i.e. there is at least one missing value in every observation record, listwise deletion is obviously not applicable at all.
Pairwise Deletion
To address the loss of data that is inherent in listwise deletion, pairwise deletion uses a different subset of the data for the estimation of each parameter. For instance, pairwise deletion is at work when a different subset of complete data pairs is used to compute each cell in a correlation matrix. One concern is that the separate computation of variances on different subsets can lead to correlation values exceeding plus or minus 1. Another issue is the absence of a consistent sample base, which leads to problems in calculating the standard errors of the estimated parameters.
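As a sketch of the practical difference between the two deletion schemes, the following Python fragment (using pandas, with an arbitrary 40% missingness rate per column) shows how few rows survive listwise deletion, while pandas computes each correlation pairwise on its own sample base:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame(
    rng.multivariate_normal([0, 0, 0],
                            [[1, 0.5, 0.5], [0.5, 1, 0.5], [0.5, 0.5, 1]],
                            size=n),
    columns=["x", "z", "u"])

# Punch random holes into each column.
for col in df.columns:
    df.loc[rng.random(n) < 0.4, col] = np.nan

# Listwise deletion: only fully observed rows survive.
print(len(df.dropna()))   # roughly 0.6**3 * 1000, i.e. ~216 rows

# Pairwise deletion: each cell of the correlation matrix is computed
# on a different subset of rows.
print(df.corr())
```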
Imputation
As opposed to deletion-type methods, we now want to consider the opposite approach, i.e. filling in the blanks with imputed values. Here, imputing means replacing the non-observed values with reasonable estimates, in order to facilitate the analysis of the whole domain. Upon imputation, complete-case statistical methods can be used to estimate parameters, etc. Although it may sound like statistical alchemy to create values for non-existent datums, it can be appropriate to do so.
Single Mean Imputation
Let's start with what not to do in terms of imputation: the oldest of these techniques is presumably mean imputation (also known as unconditional mean imputation), which substitutes the missing values with the mean of the marginal distribution of the observed values. It is well known that this approach can cause biased estimates and understated standard errors, but it remains a commonly used tool and can be found as an option in virtually all statistical software packages (see screenshot from STATISTICA). Although there are examples where mean imputation will do no harm, the risk of bias should deter practitioners from using it as a regular method.
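For completeness, here is what single mean imputation amounts to in code; the sketch also makes the core problem visible, namely that the imputed series has a smaller standard deviation than the observed values alone:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Every missing value is replaced by the mean of the observed values.
imputed = s.fillna(s.mean())
print(imputed.tolist())        # [1.0, 3.0, 3.0, 3.0, 5.0]

# The variance shrinks: std of the observed values vs. the imputed series.
print(s.std(), imputed.std())  # 2.0 vs. ~1.41
```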


Conditional Mean Imputation


Conditional mean imputation (sometimes referred to as regression imputation) is a somewhat more sophisticated approach. If the MAR conditions are met, conditional mean imputation consists of modeling the missing values as a function of the observed values, for instance with a regression. In effect, this method predicts the missing values from the observed values. However, in the case of a dataset with many variables that contain missing values, the specification of such functions can be a challenging endeavor in its own right. Schafer (1997) states, "When the pattern of missingness is complex, devising an ad hoc imputation scheme that preserves important aspects of the joint distribution can be a daunting task."
Assuming that this can be done successfully, one key issue remains for imputation methods that generate single values. Whenever a missing data point is imputed with a fixed value, the uncertainty about the unknown variable is removed, which inevitably results in underestimated errors. Even if the relationships can be adequately modeled, Schafer warns, "it may be a serious mistake to treat the imputed data as if they were real. Standard errors, p-values and other measures of uncertainty calculated by standard complete-data methods could be misleading because they fail to reflect any uncertainty due to missing data." Intuitively, missing values should always increase standard errors, due to the additional noise introduced by the missingness, and not reduce them.
While at the population level this lack of uncertainty may generate underestimated errors, for individual records it does provide an optimal prediction of the missing values. However, as most studies examine population-level effects, the predicted state of individual records is generally of less relevance.
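A minimal sketch of conditional mean (regression) imputation, with missingness kept random for simplicity, makes the determinism visible: every imputed point sits exactly on the fitted line.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
y = x + rng.normal(scale=0.5, size=n)
y[rng.random(n) < 0.3] = np.nan         # delete 30% of y at random

obs = ~np.isnan(y)
# Fit y ~ a + b*x by OLS on the observed rows.
b, a = np.polyfit(x[obs], y[obs], deg=1)

# Deterministic fill-in: every missing y lies exactly on the fitted line,
# which is why downstream standard errors come out too small.
y_imputed = y.copy()
y_imputed[~obs] = a + b * x[~obs]
```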
Multiple Imputation (MI)
To address the problem of determinism in conditional mean imputation, multiple imputation reintroduces the uncertainty, as the name implies, by repeating the imputation multiple times. Multiple imputation draws from a plausible distribution and thus generates the desired variance in the imputed values and in the subsequently computed parameter estimates. "Plausible distribution" also implies that the chosen imputation model is compatible with the functional form of the overall model of interest. For instance, if the overall relationship of interest is to be modeled with a linear regression, it would be suitable to also use a linear regression for the imputation model. Schafer and Olsen (1998) provide a comprehensive overview of multiple imputation from a data analyst's perspective.
The principles of MI were first proposed by Rubin in the 1970s, and even today MI remains a state-of-the-art method for missing values processing. Despite the undisputed benefits of this technique, it is not as widely used as one might expect (van Buuren, 2007).
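The following sketch shows the core mechanics of MI in a deliberately crude form: it repeats a stochastic regression imputation m times, so that the imputed values carry residual noise. A proper MI procedure would additionally propagate parameter uncertainty (e.g. by drawing the regression coefficients from their posterior before each imputation) and pool the per-dataset estimates with Rubin's rules; both steps are omitted here.

```python
import numpy as np

def multiple_impute(x, y, m=5, rng=None):
    """Crude stochastic-regression MI: fit y ~ x on the observed rows,
    then draw each missing y from the fitted line plus residual noise."""
    rng = rng or np.random.default_rng()
    obs = ~np.isnan(y)
    b, a = np.polyfit(x[obs], y[obs], deg=1)
    resid_sd = np.std(y[obs] - (a + b * x[obs]))
    completed = []
    for _ in range(m):
        yc = y.copy()
        yc[~obs] = a + b * x[~obs] + rng.normal(0, resid_sd, (~obs).sum())
        completed.append(yc)
    # Analyze each completed dataset separately, then pool the estimates.
    return completed
```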

Ad Hoc Methods Summary


All of the above imputation methods have in common that they create a complete dataset (or multiple complete sets, in the case of multiple imputation), so that the analyst can proceed with traditional estimation methods. Despite the proven, practical performance of these methods, there are perhaps two conceptual challenges that always add a degree of complexity to the research workflow:
1. The missing values imputation model is distinct from the statistical model that is meant to describe the domain's overall characteristics, which should normally be at the core of the analyst's efforts. This requires, beyond the main model, a second set of assumptions regarding the functional form and the distributions.
2. For any imputation, special steps must be taken to reestablish the very uncertainty that is inherent in the
absence of values.

Maximum Likelihood Estimation (ML)


The previous methods all remove unknown values, either by means of deletion or imputation, prior to estimating the unknown parameters of the main model. This means that we treat unknown values as a separate (and often less important) problem versus the unknown parameters of a model. Schafer (1998) phrased the challenge as follows: "If we knew the missing values, then estimating the model parameters would be straightforward. Similarly, if we knew the parameters of the data model, then it would be possible to obtain unbiased predictions for the missing values." The idea of Maximum Likelihood (ML) estimation is to merge both unknowns into a single process and facilitate the estimation of the entire parametric model.
We will not go into further details of the implementation of this approach, but we will later pick up the
general idea of jointly estimating unknown values and unknown parameters with Dynamic Completion
(Expectation-Maximization). A very comprehensive explanation of maximum likelihood estimation is provided in Enders (2010).

Bayesian Networks
It goes beyond the scope of this paper to formally introduce Bayesian networks. Even a very superficial introduction would amount to a substantial portion of the paper, perhaps distracting from the focus on the missing values methods. For a very short and general overview we suggest our white paper, Introduction to Bayesian Networks (Conrady and Jouffe, 2011); for a much more comprehensive introduction, Bayesian Reasoning and Machine Learning (Barber, 2011) is highly recommended.
For dealing with incomplete datasets, Bayesian networks provide advantages that specifically relate to the two points stated in the summary of the ad hoc methods:
1. Bayesian networks offer a unified framework for representing the joint distribution of the overall domain and simultaneously encoding the dependencies with the missing values (Heckerman, 2008). This implicitly addresses the requirement that Schafer and Olsen stipulate for MI, namely that "any association that may prove important in subsequent analysis should be present in the imputation model. A rich imputation model that preserves a large number of associations is desirable because it may be used for a variety of post-imputation analyses." Also, by using a Bayesian network, the functional forms for the missing values imputation and for representing the overall model are automatically identical and thus compatible.⁴
2. The inherently probabilistic nature of Bayesian networks allows us to deal with missing values and their imputation non-deterministically. This means that the (needed) variance in the imputed data does not have to be generated artificially, but is inherently available.

⁴ In our case, the presented Bayesian network approach is nonparametric, so the term "functional form" is used loosely.


Using the terminology from the ad hoc methods, one could say that Bayesian networks can perform a kind of stochastic conditional imputation. Like all the other previously mentioned imputation methods, the imputation approach with Bayesian networks also makes the MAR assumption.
An important caveat regarding the interpretation of Bayesian networks must be stated up front: in this paper, Bayesian networks are employed entirely non-parametrically. While this provides many advantages, as we will see in the subsequent examples, formal statistical properties cannot be established. Rather, the imputation performance of Bayesian networks and the BayesiaLab software can only be established through empirical tests and simulation.


Missing Values Processing in Practice


We will now illustrate several traditional methods for missing values processing, as well as the relatively new approach of using Bayesian networks for that purpose. To facilitate a comparison of methods, we start with a linear example and then explore a nonlinear case, which provides fewer options to the researcher.

Linear Example
Data Generating Process (DGP)
In order to establish a benchmark for the performance of the different methods for missing values processing, we first synthetically generate a complete reference dataset. This dataset consists of 10,000 observations of a row vector, A = [X, Z, U], which follows a multivariate normal distribution:

$$A \sim N_3(\mu, \Sigma), \qquad \mu = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix}$$

In words, X, Z and U are drawn from a multivariate normal distribution with a correlation of 0.5 between each pair of variables.
We subsequently generate a variable Y, which is defined as:

$$y_i = x_i + z_i + u_i, \quad i \in [1 \ldots n], \; n = 10{,}000$$
Our complete set of observations is thus contained in this matrix:

$$\begin{pmatrix} x_1 + z_1 + u_1 & x_1 & z_1 & u_1 \\ \vdots & \vdots & \vdots & \vdots \\ x_n + z_n + u_n & x_n & z_n & u_n \end{pmatrix}$$

With our knowledge of the underlying data-generating mechanism, we can describe the functional form as:

$$y = \beta_x x + \beta_z z + \beta_u u,$$


with the known parameters $\beta_x = \beta_z = \beta_u = 1$.

In our subsequent analysis we will treat these parameters as unknown and compute estimates for them from the manipulated dataset, i.e. once it has been subjected to the Missing Data Mechanism.
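Under the stated assumptions, the reference dataset can be generated in a few lines; this sketch is our reading of the DGP, not the authors' original script:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

mu = np.zeros(3)
sigma = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])

# A = [X, Z, U] ~ N3(mu, sigma)
x, z, u = rng.multivariate_normal(mu, sigma, size=n).T

# y_i = x_i + z_i + u_i, i.e. beta_x = beta_z = beta_u = 1
y = x + z + u
```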

Missing Data Mechanism (MDM)


In order to simulate a real-world situation, we now cause portions of X, Z and U from the original reference dataset to be missing. More specifically, we apply the following arbitrary rules:

$$P(x_i = \text{missing}) = 1 - \frac{1}{1+e^{y_i}}$$

$$P(z_i = \text{missing}) = 1 - \frac{1}{1+e^{-y_i}}$$

$$P(u_i = \text{missing}) = \begin{cases} 0 & \text{if } x_i = \text{missing or } z_i = \text{missing} \\ 1 & \text{otherwise} \end{cases}$$

In words, we apply a logistic function each for X and Z to generate the probability of missingness as a function of the values of Y. This also means that the missingness of X (or Z) does not depend on the values of X (or Z) itself, which is a key condition for making the MAR assumption.
The choice of the logistic function (see graph) is arbitrary, but one could think of a lifestyle survey with Body Mass Index (BMI) as the target variable (Y). It is plausible that those with a good BMI are more likely to report details of their healthy lifestyle than those with a poor BMI and a presumably unhealthy lifestyle, hence the increasing probability of missing data points on the poor side of the scale.
[Plot: P(missing) as a logistic function of Y, increasing for X (red curve) and decreasing for Z (blue curve).]

More specifically, this means that with increasing values of Y (the independent variable), the probability of X (a dependent variable) being missing increases (red curve), while the opposite is true for Z (blue curve).


Furthermore, the (also arbitrary) rule for U implies that U is always missing when both X and Z are observed. For instance, the variable U could represent a follow-up question in a survey that is only asked when X and/or Z are item non-responses. Whatever the reason, with this rule characterizing the missing data mechanism, we will never have a full set of observations for Y, X, Z and U.
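Continuing the DGP sketch from above, the MDM as we have reconstructed it can be applied as follows; the printed missingness rates should come out near the 50%/50%/13% reported below:

```python
import numpy as np

# Assumes x, z, u, y from the DGP sketch above.
n = y.size
p_x_missing = 1 - 1 / (1 + np.exp(y))    # increases with y
p_z_missing = 1 - 1 / (1 + np.exp(-y))   # decreases with y

rng = np.random.default_rng(7)
x_miss = rng.random(n) < p_x_missing
z_miss = rng.random(n) < p_z_missing
u_miss = ~x_miss & ~z_miss               # u missing iff both x and z observed

x_obs = np.where(x_miss, np.nan, x)
z_obs = np.where(z_miss, np.nan, z)
u_obs = np.where(u_miss, np.nan, u)

print(x_miss.mean(), z_miss.mean(), u_miss.mean())  # ~0.5, ~0.5, ~0.13
```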
The following table shows the first 20 observations once the MDM is applied.
[Table: the first 20 observations of Y, X, Z and U after applying the MDM; about half of the X and Z values, and those U values for which both X and Z are observed, are blank.]
In total, this amounts to 50% of the X and Z data points missing and approximately 13% of U.

Standard Methods
First we approach this dataset with the traditional missing values processing methods. Estimating the regression parameters and comparing them to the known true parameters will allow us to assess the performance of these methods.
Listwise/Casewise Deletion
With no complete set of observations available, listwise deletion is obviously not an option here: after deleting all cases with missing values, not a single case would be left in our dataset. With a large number of variables in a study, this situation is not at all unrealistic.
Pairwise Deletion
With listwise deletion not being an option, pairwise deletion is examined next. For the Y-X pair we have 5,047 valid observation pairs; for Y-Z and Y-U we have 5,010 and 8,630 respectively. We estimate the parameters with OLS and obtain the following coefficients:

$$\hat\beta_x = 1.529, \quad \hat\beta_z = 1.585, \quad \hat\beta_u = 1.419, \quad \text{Constant} = -0.020$$


Clearly, the parameter estimates are quite different from the known values. More interestingly, and as suggested in the introduction, the computed value of $R^2 = 1.29$ is not meaningful. As such, we do not gain much insight from this approach.
Single Mean Imputation
As an alternative to the deletion methods, most statistics programs offer mean imputation as the next option, so we will try this approach here. As a result, each observed variable's mean value is imputed to replace its missing values. Given that the MDM is not symmetric for X and Z, the means of these observed variables are no longer zero:

$$\bar{x} = -0.524, \quad \bar{z} = 0.519, \quad \bar{u} = 0.001$$
When performing the regression on the basis of the mean-imputed values, we obtain the following parameters (standard errors are reported in parentheses):

$$\hat\beta_x = 1.187\ (0.018), \quad \hat\beta_z = 1.201\ (0.019), \quad \hat\beta_u = 1.784\ (0.012), \quad \text{Constant} = 0.001\ (0.017)$$
Knowing the true parameters, i.e. $\beta_x = \beta_z = \beta_u = 1$, we can conclude that these parameter estimates from the mean imputation are biased. So, mean imputation does not successfully address the problem of missing values here, as we had already suggested in the introduction.
Multiple Imputation (MI)
Today, multiple imputation is becoming widely available in statistical software packages, even though references to this method remain relatively rare in applied research.
In any case, multiple imputation has very appealing theoretical and practical properties. Without spelling
out the details of the computation, we will simply show the results from the MI function implemented in
SPSS 20.


The pooled parameter estimates turn out to be very close to the true parameter values, far better than what was estimated with pairwise deletion and mean imputation:

$$\hat\beta_x = 0.958\ (0.007), \quad \hat\beta_z = 0.957\ (0.013), \quad \hat\beta_u = 1.015\ (0.009), \quad \text{Constant} = -0.028\ (0.013)$$
SPSS automatically computed a small value for the intercept, which was not included in our DGP function specification, but the overall performance is still very good and far ahead of the mean imputation.


Missing Values Processing with Bayesian Networks


In comparison to the very brief overview of some of the standard methods in the previous chapter, we will now provide a greater amount of detail as we discuss missing values processing with Bayesian networks and, more specifically, BayesiaLab. As readers will presumably be less familiar with this approach, we present the workflow tutorial-style and illustrate it with screenshots where appropriate.
The starting point for this exercise is importing the incomplete dataset into BayesiaLab.

In the following dialog box we can see the large number of missing data points highlighted in the information panel (28.28% of the dataset is missing).

The next step is of critical importance. We need to select how to deal with the missing values and BayesiaLab gives us a number of options.


The first option, Filter, allows us to perform listwise/casewise deletion, which, as before, is not feasible, as no data points would remain available for analysis. The second option, Replace By, refers to mean imputation⁵, which, with all its drawbacks, can be applied here within the Bayesian network framework. We mention mean imputation for the sake of completeness rather than as a recommendation. Also, Filter and/or Replace By can be selectively applied on a variable-by-variable basis. Once one of these static missing values processing options is applied to the selected variables, BayesiaLab considers these variables as completely observed. In other words, this approach would purely be a pre-treatment generating fixed values.
For our example we shall instead choose from among the BayesiaLab-specific options, i.e. Static Completion, Dynamic Completion and Structural EM, which we will subsequently explain:
Static Completion
Static Completion resembles mean imputation, but differs in one important aspect: while mean imputation is deterministic, Static Completion performs random draws from the marginal distributions of the observed data points and saves these randomly drawn values as placeholder values at the completion of the import process. Our artificially created missing values are thus filled in instantly with estimated values, which would then permit a parameter estimation as in the case of a complete dataset. BayesiaLab highlights the variables that contain missing values with a small question mark icon.

⁵ Mean imputation applies to continuous and discrete numerical variables. For categorical variables the modal values are imputed.

The tables below compare the original data with missing points (left) with the values imputed by Static Completion (right).
[Table: the first 20 observations before Static Completion (left, with blanks) and after (right), where all previously missing X, Z and U values have been filled in with draws from the marginal distributions.]

When Static Completion is selected, one can also subsequently perform imputation on demand by using the Learning > Probabilities Learning function. BayesiaLab then uses the current Bayesian network to infer, for each missing value, the posterior probability distribution of the corresponding variable given the observed values. Once the imputation is completed, one can relearn a network with the new dataset and once again perform the imputation using this most recent network. This process can be repeated until no structural change is observed in the learned Bayesian network.
Dynamic Completion
The workflow described above for Static Completion is fully automated with Dynamic Completion (and Structural Expectation-Maximization) and automatically applied after each modification of the network during learning, i.e. after every single arc addition, suppression and inversion.
In practice, this works as follows:
1. BayesiaLab starts with the fully unconnected network and uses the available data to estimate the marginal probability distributions of all the variables and then fills its conditional probability tables.
2. BayesiaLab uses the newly learned network and parameters to impute the missing values by drawing from the posterior probability distributions of the variables, given the values of the observed (not-missing or already imputed) variables. This is the Expectation step, i.e. a static imputation step.


3. BayesiaLab uses this new dataset, which no longer contains missing values, to learn the structure and
estimate the corresponding parameters. This is the Maximization step.
4. The process alternates the Expectation and Maximization steps until convergence is achieved.
In this process the Bayesian network grows from an initially unconnected network and evolves in its structure until its final state is reached.
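To convey the flavor of this alternation, here is a toy Expectation-Maximization loop for a plain multivariate-normal model. It is an analogy only: BayesiaLab's E-step performs inference in the current discretized network, and its M-step relearns both structure and conditional probability tables, whereas this sketch keeps the "structure" fixed to a mean vector and covariance matrix.

```python
import numpy as np

def em_style_completion(data, n_iter=20, rng=None):
    """Toy analogue of Dynamic Completion: alternate drawing missing cells
    from the current conditional distributions (E-step) and re-estimating
    the model, here just mu and Sigma (M-step)."""
    rng = rng or np.random.default_rng()
    data = data.copy()
    miss = np.isnan(data)
    # Initialize missing cells with the marginal (column) means.
    col_means = np.nanmean(data, axis=0)
    data[miss] = np.take(col_means, np.nonzero(miss)[1])
    for _ in range(n_iter):
        mu = data.mean(axis=0)                     # M-step: refit the model
        cov = np.cov(data, rowvar=False)
        for i in np.nonzero(miss.any(axis=1))[0]:  # E-step: re-impute rows
            m = miss[i]
            beta = cov[np.ix_(m, ~m)] @ np.linalg.inv(cov[np.ix_(~m, ~m)])
            cond_mu = mu[m] + beta @ (data[i, ~m] - mu[~m])
            cond_cov = cov[np.ix_(m, m)] - beta @ cov[np.ix_(m, ~m)].T
            data[i, m] = rng.multivariate_normal(cond_mu, cond_cov)
    return data  # the completed dataset after the final iteration
```

A fixed iteration count stands in for the convergence test here; in BayesiaLab the loop stops when the learned structure no longer changes.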

Depending on the number of variables, the chosen learning algorithm and the network complexity⁶, hundreds or thousands of iterations may be calculated. The final imputed dataset then remains available for subsequent analysis or export.
In the context of this iterative learning process, and beyond merely estimating the missing values, we have also established an interpretable structure of the underlying domain in the form of a Bayesian network. From a qualitative perspective, this network does indeed reflect all the relationships that we originally created with the DGP formula and the covariance structure from the original sampling process.
Having this fully estimated network available, we can also retrieve the quantitative knowledge contained therein and obtain the "parameters" of the structure. We use "parameters" loosely here (hence the quotation marks), as the Bayesian network structures in BayesiaLab are always entirely nonparametric.
To obtain these quantitative characteristics from the network, we can immediately perform the Target Mean Analysis (Direct Effects), which produces a plot of the interactions of U, Z and X with Y.⁷ Even though there is no assumption of linearity in BayesiaLab, the shape of the curves confirms that the learned network indeed approximates the linear function from the DGP formula.

⁶ In BayesiaLab, the network complexity can be managed with the Structural Coefficient.

⁷ Our white paper, Direct Effects and Causal Inference, describes details of the Direct Effects computation.


Computing the Direct Effect on Target provides us with the partial derivative of each of the above curves, $\frac{\partial y}{\partial x}, \frac{\partial y}{\partial z}, \frac{\partial y}{\partial u}$, at the mean value of each variable:
Direct Effects on Target Y

Node    Standardized Direct Effect    Direct Effect
U       0.4283                        1.0311
Z       0.3733                        1.0238
X       0.3514                        0.9733

The derivatives represent the slopes of each curve (at the mean of each of the independent variables), and, now assuming linearity, we can directly interpret the Direct Effects as the parameter estimates of our DGP function:⁸

$$\hat\beta_x = 1.031, \quad \hat\beta_z = 1.024, \quad \hat\beta_u = 0.973$$
These values are indeed very close to the true parameters. Even though we are giving these results a linear
interpretation, it is important to point out that a functional form was neither assumed in the missing data
estimation process, nor in the Bayesian network structural and parametric learning.

⁸ We can do this here as we have no intercept in the DGP function.


Alternatively, we can also save the estimated missing values for subsequent external analysis with classical
methods (Data>Imputation).

This will prompt a choice of imputation method, which is very important.

We can maintain the uncertainty of the missing data by selecting the option Choose the Values According to the Law. This means that each to-be-saved value will be drawn from the conditional probability distribution of the corresponding variable. This option is clearly not the optimal choice at the individual level (one would have to choose Maximum Probability for that purpose), but it allows retaining the variance, i.e. the uncertainty, of the original data.
With this fully imputed dataset, and assuming linearity for the DGP, we can once again use traditional
methods, e.g. OLS, to compute the parameter estimates. In this case, we would obtain the following:

$$\hat\beta_x = 0.954\ (0.013), \quad \hat\beta_z = 0.898\ (0.013), \quad \hat\beta_u = 1.163\ (0.012), \quad \text{Constant} = 0.026$$
Given that no functional form was utilized in the missing values imputation process, and given that the variables were discretized, these results are extremely useful.


Nonlinear Example
As the previous (linear) example turned out to be manageable with a number of different approaches, we
will now look at a more challenging situation. It is challenging because we introduce nonlinear relationships
between variables.

Data Generating Process (DGP)


As before, the synthetic dataset consists of 10,000 observations of a row vector, A=[X,Z,U], which follows
a multivariate normal distribution:

$$A \sim N_3(\mu, \Sigma), \qquad \mu = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix}$$

So, we use the same DGP as for the linear example, but now create a new function for Y that combines exponential, quadratic and linear terms:

$$y_i = e^{x_i} + z_i^2 + u_i, \quad i \in [1 \ldots n], \; n = 10{,}000$$


Our complete set of observations is thus contained in this matrix:

$$\begin{pmatrix} e^{x_1} + z_1^2 + u_1 & x_1 & z_1 & u_1 \\ \vdots & \vdots & \vdots & \vdots \\ e^{x_n} + z_n^2 + u_n & x_n & z_n & u_n \end{pmatrix}$$

With our knowledge of the underlying DGP, we can describe the functional form as:

$$y = \beta_x e^x + \beta_z z^2 + \beta_u u, \quad \text{with the known parameters } \beta_x = \beta_z = \beta_u = 1.$$


In our subsequent analysis we will treat these parameters as unknown and compute estimates for them from
the manipulated dataset, i.e. once it is subjected to the Missing Data Mechanism.
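The sampling step is identical to the linear example; under our assumptions, only the formula for Y changes:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
sigma = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])
x, z, u = rng.multivariate_normal(np.zeros(3), sigma, size=n).T

# y_i = exp(x_i) + z_i^2 + u_i
y = np.exp(x) + z**2 + u
```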
The following plot illustrates the shape and the distribution of the individual terms, i.e. Y as a function of
X, Z and U.


Bayesian Network Representation


As shown in the linear case, Bayesian networks can provide an intuitive representation of a domain with missing values and quasi-automatically provide missing value estimates. While this is certainly convenient, it is not necessarily a unique selling point for Bayesian networks. The situation is quite different when moving to nonlinear domains. A number of statistical software packages do allow performing advanced methods, such as multiple imputation, even in nonlinear domains. However, the process of specifying an imputation model in the nonlinear case is a scientific modeling effort in its own right, and few non-statisticians dare venture into this specialized field (van Buuren, 2007).
We believe that the integral missing values processing in BayesiaLab provides a critical advantage to the
applied researcher, who might otherwise not even attempt to formally deal with missing values in nonlinear
models.
However, before we move on to missing values processing, we will first establish that the original variables' nonlinear dynamics can indeed be captured in a Bayesian network learned with BayesiaLab. While BayesiaLab cannot discover the functional form of the DGP formula per se, it can learn the relationships on a nonparametric basis.
We will omit the data import steps, as they are analogous to the process shown before, and rather fast-forward to the presentation of the final network:


We can see that the overall network structure reflects the relationships that were stipulated by both the covariance matrix and the DGP formula. More specifically, we can check the Pearson correlation coefficients of the linear relationships between X, Z and U by displaying the values on the relevant arcs (Analysis > Graphic > Pearson's Correlation).


The computed values are indeed close to the 0.5 stated in the covariance matrix.⁹ It is important to point out here again that BayesiaLab does not compute the correlation on the assumption of a linear function, as BayesiaLab is entirely nonparametric.
We now turn to the nonlinear relationships specified in the DGP formula. To see how the principal dynamics with Y are reflected in the Bayesian network, we illustrate the dependencies in the following plot of the Target Mean Analysis by Direct Effect:

We can visually identify the exponential, quadratic and linear shapes of the curves and thus judge that the complete dataset is adequately represented.

Missing Data Mechanism


Now that we have a framework for the complete dataset (with all its mutual dependencies) in place, we can start deleting observations. The MDM is very similar to the one applied earlier, with the only change being in the parameters of the logistic function, which now generate a softer application of the value deletion (see graph).

⁹ Given that the standard deviation was set to 1, the covariance equals the correlation coefficient.


$$P(x_i = \text{missing}) = 1 - \frac{1}{1+e^{y_i/3}}$$

$$P(z_i = \text{missing}) = 1 - \frac{1}{1+e^{-y_i/3}}$$

$$P(u_i = \text{missing}) = \begin{cases} 0 & \text{if } x_i = \text{missing or } z_i = \text{missing} \\ 1 & \text{otherwise} \end{cases}$$

[Plot: P(missing) as a softer logistic function of Y, increasing for X and decreasing for Z.]

Upon application of the MDM, we once again have the issue that not a single observation is complete.
With that, we have a threefold challenge:
1. Data missing at random
2. No complete set of observations
3. Unknown dependencies between all variables, both for the DGP and the MDM
To further illustrate this challenge, we plot the remaining data points post-MDM. Given that we know the original DGP, we can just about make out the functional form, although the exponential and the linear terms seem to blur together.


Another way to represent the MDM is by plotting Y vs. X for the complete observations (red) and for the observations with missing values (black), shown below:

From a traditional statistical perspective, it would indeed be a formidable challenge to identify the functional form and its parameters, and then to impute the missing values.

Bayesian Network Learning & Missing Values Imputation


Within the Bayesian network framework this is much simpler, as all three problems can be dealt with in a single process, i.e. Unsupervised Learning with Dynamic Completion of missing values.
Unsupervised Learning
As described in the linear case, BayesiaLab iteratively alternates learning and imputation until convergence
is achieved. The graphs below represent a selection of steps from the learning process.


As there is no functional form recorded in BayesiaLab, we need to look at a plot of the probabilistic dependencies to see whether we have successfully recovered the underlying dynamics. More specifically, we need to use Target Mean Analysis by Direct Effects to decompose the overall function (now represented by the Bayesian network) into the individual components that reflect the DGP terms.

A first visual inspection suggests that BayesiaLab has indeed captured the salient points of the DGP (keeping in mind that the variables have been discretized). As BayesiaLab is entirely nonparametric, we cannot retrieve parameter estimates; rather, we can utilize the learned functions as they are encoded in the structure of the Bayesian network and the associated conditional probability tables.
However, computing the Direct Effect on Target provides us with the partial derivative of each of the above Direct Effect curves, $\frac{\partial y}{\partial x}, \frac{\partial y}{\partial z}, \frac{\partial y}{\partial u}$, at the estimated mean value of each variable.


This facilitates a comparison with the true partial derivatives of the known DGP function at the mean values, $\mu_X = \mu_Z = \mu_U = 0$:

Variable    Partial Derivative of DGP    Direct Effect Computed with Missing Data
X           1                            1.150
Z           0                            0.296
U           1                            0.902

Given the large amount of missing data, these Direct Effect values turn out to be useful approximations.
Now that we have the complete DGP function represented in a Bayesian network, we can proceed with our analysis within the BayesiaLab framework. A wide range of BayesiaLab's analysis and optimization functions are described in our other white papers.
Imputation
For the sake of completeness, we will once again make the completed dataset available for external analysis with standard statistical tools. As shown before in the linear case, we can export the imputed values by selecting Data > Imputation.

This will once again prompt a choice of imputation methods.

We can maintain the uncertainty of the missing data by selecting the option Choose the Values According to the Law. This means that each to-be-saved value will be drawn from the posterior probability distribution of the corresponding variable, which retains the variance, i.e. the uncertainty, of the original data.


The complete dataset is now saved as a CSV file and can be accessed by any statistical program for further analysis. Assuming that we know the functional form of the DGP (or are able to guess it from the plot of the Target Mean Analysis seen earlier), we can then use the filled-in dataset to estimate the parameters. For instance, we can use OLS to compute the parameter estimates for the functional form of the DGP:

$$y = \beta_x e^x + \beta_z z^2 + \beta_u u$$
In this case, we would obtain the following coefficients, which are once again very close to the true values of the parameters:

$$\hat\beta_x = 1.025\ (0.013), \quad \hat\beta_z = 0.998\ (0.013), \quad \hat\beta_u = 1.153\ (0.020)$$
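As a sketch of that final estimation step, assuming the exported columns have been loaded as arrays x_imp, z_imp, u_imp (hypothetical names) alongside the fully observed y, the known functional form reduces to ordinary least squares on transformed regressors:

```python
import numpy as np

# x_imp, z_imp, u_imp: filled-in columns exported from BayesiaLab
# (hypothetical variable names); y is fully observed.
X = np.column_stack([np.exp(x_imp), z_imp**2, u_imp])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates of (beta_x, beta_z, beta_u), ideally near 1
```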

Summary Nonlinear Example


Despite the large number of missing values, BayesiaLab's learning and imputation algorithms allowed us to discover the DGP and simultaneously estimate the missing values. As a result, we now have a fully integrated process that no longer requires a separate treatment of the DGP and the MDM, thus facilitating an accelerated modeling workflow for researchers and analysts.

Summary & Outlook


The proposed integrated approach to missing values processing and imputation is certainly a practical benefit to the applied researcher, especially given Burton and Altman's serious concerns stated in the introduction. While missing data has been the proverbial elephant in the room, there is now even less of an excuse for not squarely addressing the problem.
Beyond merely addressing a flaw in data, our new approach may actually lend itself to broader applications. With this tool in hand, we may see an opportunity to rephrase other kinds of problems and treat them as missing values problems. One could think of state space models, multiple sensor fusion or mixed-frequency models in this context. However, these extensions go beyond the scope of this paper and will need to be covered in future work.


Appendix
About the Authors
Stefan Conrady

Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia S.A.S. for North America.
Stefan Conrady studied Electrical Engineering in Germany and has extensive management experience in the fields of product planning, marketing and analytics, working at Daimler and BMW Group in Europe, North America and Asia. Prior to establishing his own firm, he headed the Analytics & Forecasting group at Nissan North America.
Lionel Jouffe

Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia S.A.S. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999, and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia's strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.


References
Allison, P.D. "Multiple imputation for missing data: A cautionary tale." Sociological Methods and Research 28, no. 3 (2000): 301-309.
Allison, Paul D. Missing Data. 1st ed. Sage Publications, 2001.
Barber, David. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2011.
Burton, A., and D.G. Altman. "Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines." British Journal of Cancer 91 (June 8, 2004): 4-8.
van Buuren, S. "Multiple imputation of discrete and continuous data by fully conditional specification." Statistical Methods in Medical Research 16, no. 3 (2007): 219.
Conrady, Stefan, and Lionel Jouffe. Introduction to Bayesian Networks - Practical and Technical Perspectives. Conrady Applied Science, LLC, February 15, 2011. http://www.conradyscience.com/index.php/introduction-to-bayesian-networks.
Copas, John B., and Guobing Lu. "Missing at random, likelihood ignorability and model completeness." The Annals of Statistics 32 (April 2004): 754-765.
"Data Imputation for Missing Values." Statnotes, North Carolina State University, Public Administration Program, n.d. http://faculty.chass.ncsu.edu/garson/PA765/missing.htm.
Dempster, A.P., N.M. Laird, and D.B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm." Journal of the Royal Statistical Society, Series B (Methodological) 39, no. 1 (January 1, 1977): 1-38.
Enders, Craig K. Applied Missing Data Analysis. 1st ed. The Guilford Press, 2010.
Heckerman, D. "A tutorial on learning with Bayesian networks." Innovations in Bayesian Networks (2008): 33-82.
Heitjan, Daniel F., and Srabashi Basu. "Distinguishing 'Missing at Random' and 'Missing Completely at Random'." The American Statistician 50 (August 1996): 207.
"Historical Stock Data," n.d. http://pages.swcp.com/stocks/.
Horton, N.J., and S.R. Lipsitz. "Multiple Imputation in Practice." The American Statistician 55, no. 3 (2001): 244-254.
Howell, David C. "Treatment of Missing Data," n.d. http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html.
Lin, J.H., and P.J. Haug. "Exploiting missing clinical data in Bayesian network modeling for predicting medical problems." Journal of Biomedical Informatics 41, no. 1 (2008): 1-14.
Little, Roderick J.A., and Donald B. Rubin. Statistical Analysis with Missing Data. 2nd ed. Wiley-Interscience, 2002.
"Missing completely at random." Wikipedia, n.d. http://en.wikipedia.org/wiki/Missing_completely_at_random.
de Morais, S.R., and A. Aussem. "Exploiting Data Missingness in Bayesian Network Modeling." In Advances in Intelligent Data Analysis VIII: 8th International Symposium on Intelligent Data Analysis, IDA 2009, Lyon, France, August 31-September 2, 2009, Proceedings, 5772:35, 2009.
"NCES Statistical Standards, Appendix B: Evaluating the Impact of Imputations for Item Non-Response," n.d. http://nces.ed.gov/statprog/2002/appendixb3.asp.
Riphahn, Regina T., and Oliver Serfling. "Item non-response on income and wealth questions." Empirical Economics 30 (September 2005): 521-538.
Rubin, Donald B. "Inference and missing data." Biometrika 63, no. 3 (December 1, 1976): 581-592.
Rubin, Donald B. Multiple Imputation for Nonresponse in Surveys. Wiley-Interscience, 2004.
Schafer, J.L. Analysis of Incomplete Multivariate Data. 1st ed. Chapman and Hall/CRC, 1997.
Schafer, J.L., and M.K. Olsen. "Multiple imputation for multivariate missing-data problems: A data analyst's perspective." Multivariate Behavioral Research 33, no. 4 (1998): 545-571.
"SOLAS for Missing Data Analysis," n.d. http://www.solasmissingdata.com/.
SPSS Missing Values 17.0 User's Guide. SPSS Inc., n.d.
Wayman, J.C. "Multiple imputation for missing data: What is it and how can I use it?" In Annual Meeting of the American Educational Research Association, 1-16, 2003.
www.missingdata.org.uk, n.d. http://missingdata.lshtm.ac.uk/.


Contact Information
Conrady Applied Science, LLC
312 Hamlets End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com
Bayesia S.A.S.
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com

Copyright
© 2011 Conrady Applied Science, LLC and Bayesia S.A.S. All rights reserved.
Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the
following:
- You may print or download this document for your personal and noncommercial use only.
- You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady Applied Science, LLC and Bayesia S.A.S. as the source of the material.
- You may not, except with our express written permission, distribute or commercially exploit the content. Nor may you transmit it or store it in any other website or other form of electronic retrieval system.
