Conrady Applied Science, LLC - Bayesia's North American Partner for Sales and Consulting
Table of Contents

Introduction
  Motivation & Objective
  Overview
  Notation
Ad Hoc Methods
  Listwise Deletion
  Pairwise Deletion
  Imputation
Standard Methods
  Listwise/Casewise Deletion
  Pairwise Deletion
  Static Completion
  Dynamic Completion
Nonlinear Example
  Unsupervised Learning
  Imputation
Appendix
  About the Authors
    Stefan Conrady
    Lionel Jouffe
  References
  Contact Information
    Bayesia S.A.S.
  Copyright

www.conradyscience.com | www.bayesia.com
Introduction
Motivation & Objective
With the abundance of big data in the field of analytics, and all the challenges today's immense data volumes are causing, it may not be particularly fashionable or pressing to discuss missing values. After all, who cares about missing data points when there are petabytes of more observations out there?
As the objective of any data gathering process is to gain knowledge about a domain, missing values are obviously undesirable. A missing datum without a doubt reduces our knowledge about any individual observation, but the implications for our understanding of the whole domain may not be so obvious, especially when there seems to be an endless supply of data.
Missing values are encountered in virtually all real-world data collection processes. Missing values could be the result of non-responses in surveys, poor record-keeping, server outages, attrition in longitudinal surveys or the faulty sensors of a measuring device, etc. What's often overlooked is that not properly handling missing observations can lead to misleading interpretations or create a false sense of confidence in one's findings, regardless of how many more complete observations might be available.
Despite the intuitive nature of this problem, and the fact that almost all quantitative studies are affected by
it, applied researchers have given it remarkably little attention in practice. Burton and Altman (2004) state this predicament very forcefully in the context of cancer research: "We are concerned that very few authors have considered the impact of missing covariate data; it seems that missing data is generally either not recognized as an issue or considered a nuisance that is best hidden."
As missing values processing (beyond the naïve ad hoc approaches) can be a demanding task, both methodologically and computationally, the principal objective of this paper is to propose a new and hopefully easier approach by employing Bayesian networks. It is not our intention to open the proverbial new can of worms, and thus distract researchers from their principal study focus, but rather we want to demonstrate that Bayesian networks can reliably, efficiently and intuitively integrate missing values processing into the
main research task.
Overview
1. We will rst provide a brief introduction to missing values and highlight a selection of methods that
have been traditionally used to deal with this problem.
2. We will then use a linear and a nonlinear example to illustrate the different statistical methods and introduce missing values imputation with Bayesian networks.
Notation
To clearly distinguish between natural language, software-specific functions and example-specific variable names, the following notation is used:
Bayesian network and BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.
Names of attributes, variables, nodes, etc., are italicized.
Types of Missingness
There are several types of missing data that are typically encountered in studies:
First, data can be Missing Completely at Random (MCAR), which requires that the missingness¹ is completely unrelated to the data. More formally, this can be stated as:

p(R | Yobs, Ymis, X, φ) = p(R | φ)

¹ It is important to note the difference between missingness of a variable and a missing value. Missingness refers to the probability of not being observed, whereas missing refers to the state of not being observed.
Second, data can be Missing at Random (MAR), which means that the missingness may depend on observed data, but, given the observed data, not on the missing values themselves:

p(R | Yobs, Ymis, X, φ) = p(R | Yobs, X, φ)

Suppose, for example, that the probability of income being reported depends on the respondent's observed property band and that, within each property band, the missingness is random. Then, the income data is missing at random; the reason, or mechanism, for it being missing depends on property band. Given property band, missingness does not depend on income itself.
Third, data can be missing in an unmeasured fashion, termed nonignorable, also called Missing Not at Random (MNAR) or Not Missing at Random (NMAR). This is the most general case, which includes all possible associations between missingness and data:

p(R | Yobs, Ymis, X, φ)

Since the missing data depends on events or items which the researcher has not measured, such as the missing values Ymis themselves, this creates a challenge.
A hypothetical example of this situation would be a behavioral survey. Individuals who are engaging in dangerous behaviors may be less likely to admit the high-risk nature of their conduct and thus decline to respond. In comparison, low-risk respondents may quite willingly make statements with regard to their safe behavior.
Filtered or Censored Values
At this point we should briefly mention a fourth type of missingness, which is less often mentioned in the literature. We refer to it as filtered values. These are values that cannot exist, yet often appear in datasets looking no different from other missing values that do exist but have not been observed.
For instance, in an auto buyer survey, a question about rear-seat legroom cannot be answered by the owner of a two-seat roadster. Conceptually, these filtered values are similar to MAR values: they depend on other variables in the dataset, e.g. given that vehicle type=roadster, rear seat legroom=FV. As opposed to MAR values, and as this example suggests, filtered values are determined by logic.
It is particularly important that these missing values are not imputed with any estimated values, as a bias is highly likely. Rather, a new state, e.g. a code like FV, has to be added as an additional state to the affected variables.² The filtered value declaration can typically be done in the context of the ETL³ process, prior to the start of any modeling work. Upon definition of these filtered values, the normal missing values processing can begin.
² BayesiaLab provides a framework for dealing with such filtered values, so these filtered values are specifically not taken into account while machine-learning a model of the data generating process. Furthermore, BayesiaLab excludes filtered values when computing correlations, etc.
³ ETL stands for extract, transform and load, referring to the data preprocessing prior to starting a statistical analysis.
Ad Hoc Methods
Listwise Deletion
Given that many statistical estimations require a complete set of observations, any records with missing values are often excluded by default from such computations. The first consequence is an immediate loss of samples, which may have been very costly to acquire. This (often rather involuntarily applied) method is known more formally as listwise deletion or casewise deletion. In the best case, an estimation on a reduced set of observations may only increase the standard error, which is the case with MCAR data; under different circumstances, listwise deletion can bias the parameter estimates, which can be the case with MAR data. In situations where no case is completely observed, i.e. there is at least one missing value in every observation record, listwise deletion is obviously not applicable at all.
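For illustration, listwise deletion is what a plain "drop incomplete rows" operation does, and even moderate per-variable missingness compounds quickly across columns. A sketch with simulated data (the missingness rates and seed are our own choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x", "z", "u"])

# Knock out ~30% of each column independently (illustrative MCAR missingness).
for col in df.columns:
    df.loc[rng.random(len(df)) < 0.3, col] = np.nan

complete = df.dropna()         # listwise/casewise deletion
print(len(df), len(complete))  # with 3 columns at 30% each, only ~0.7^3 = 34% of rows survive
```

With many variables the surviving fraction shrinks geometrically, which is exactly the situation in which listwise deletion becomes inapplicable.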
Pairwise Deletion
To address the loss of data that is inherent in listwise deletion, pairwise deletion uses different data sets for the estimation of each parameter. For instance, pairwise deletion is used when a different subset of complete data is used to compute each cell in a correlation matrix. One concern is that the separate computation of variances on subsets can lead to correlation values exceeding plus or minus 1. Another issue is the absence of a consistent sample base, which leads to problems in calculating standard errors of estimated parameters.
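As a small sketch of the mechanics, pandas computes correlations from pairwise-complete observations by default, so each cell of a correlation matrix can rest on a different sample base (data and seed are our own):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
chol = np.linalg.cholesky(np.array([[1.0, 0.5], [0.5, 1.0]]))
df = pd.DataFrame(rng.normal(size=(500, 2)) @ chol.T, columns=["x", "z"])
df.loc[rng.random(500) < 0.4, "x"] = np.nan
df.loc[rng.random(500) < 0.4, "z"] = np.nan

# DataFrame.corr() drops NaNs pair by pair: this is pairwise deletion.
r = df.corr()
n_xz = df.dropna().shape[0]   # the sample base behind the x-z cell
print(r.loc["x", "z"], n_xz)  # correlation near 0.5, but from far fewer than 500 pairs
```

The individual coefficient here is well-behaved; the problems described above arise when a matrix assembled from such differing subsets is used as input to a joint estimation.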
Imputation
As opposed to deletion-type methods, we now want to consider the opposite approach, i.e. filling in the blanks with imputed values. Here, imputing means replacing the non-observed values with reasonable estimates, in order to facilitate the analysis of the whole domain. Upon imputation, complete-case statistical methods can be used to estimate parameters, etc. Although it may sound like statistical alchemy to create values for non-existing data points, it can be appropriate to do so.
Single Mean Imputation
Let's start with what not to do in terms of imputation: the oldest of these techniques is presumably mean imputation (also known as unconditional mean imputation), which substitutes the missing values with the mean of the marginal distribution of the observed values. It is well known that this approach can cause biased estimates and understated standard errors, but it remains a commonly used tool and can be found as an option in virtually all statistical software packages (see screenshot from STATISTICA). Although there are examples where mean imputation will do no harm, the risk of bias should deter practitioners from using it as a regular method.
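The understated standard errors are easy to demonstrate: every imputed point sits exactly on the mean and so contributes zero spread. A minimal sketch (data and seed are our own):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=2000)
x_obs = x.copy()
x_obs[rng.random(2000) < 0.4] = np.nan   # 40% MCAR, for illustration

# Unconditional mean imputation: replace every NaN with the observed mean.
imputed = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)

# The mean survives, but the variance shrinks by roughly the observed fraction.
print(np.nanvar(x_obs), imputed.var())   # imputed variance ~ 0.6 x observed variance
```

Any downstream standard error computed from the imputed column inherits this artificially reduced variance.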
1. Imputation requires, beyond the main model, a second set of assumptions regarding the functional form and the distributions.
2. For any imputation, special steps must be taken to reestablish the very uncertainty that is inherent in the absence of values.
Bayesian Networks
It goes beyond the scope of this paper to formally introduce Bayesian networks. Even a very supercial introduction would amount to a substantial portion of the paper, perhaps distracting from the focus on the
missing values methods. For a very short and general overview we suggest our white paper, Introduction to
Bayesian Networks (Conrady and Jouffe, 2011), for a much more comprehensive introduction, Bayesian
Reasoning and Machine Learning (Barber, 2011) is highly recommended.
For dealing with incomplete datasets, Bayesian networks provide advantages that specifically relate to the two points stated in the summary of the ad hoc methods:
1. Bayesian networks offer a unified framework for representing the joint distribution of the overall domain and simultaneously encoding the dependencies with the missing values (Heckerman, 2008). This implicitly addresses the requirement that Schafer and Olsen stipulate for MI, namely that any association that may prove important in subsequent analysis should be present in the imputation model. A rich imputation model that preserves a large number of associations is desirable because it may be used for a variety of post-imputation analyses. Also, by using a Bayesian network, the functional form for missing values imputation and for representing the overall model are automatically identical and thus compatible.⁴
2. The inherently probabilistic nature of Bayesian networks allows us to deal with missing values and their imputation non-deterministically. That means that the (needed) variance in the imputed data does not need to be generated artificially, but is inherently available.

⁴ In our case, the presented Bayesian network approach is nonparametric, so the term "functional form" is used loosely.
Using the terminology from the ad hoc methods, one could say that Bayesian networks can perform a kind
of stochastic conditional imputation. Like all other previously mentioned imputation methods, the imputation approach with Bayesian networks also makes the MAR assumption.
An important caveat regarding the interpretation of Bayesian networks must be stated up front: In this paper, Bayesian networks are employed entirely non-parametrically. While this will provide many advantages
that we will see in the subsequent examples, formal statistical properties cannot be established. Rather, the
imputation performance of Bayesian networks and the BayesiaLab software can only be established through
empirical tests and simulation.
Linear Example
Data Generating Process (DGP)
In order to establish a benchmark for the performance of different methods for missing values processing, we will first synthetically generate a complete reference dataset. This dataset consists of 10,000 observations of a row vector, A = [X, Z, U], which follows a multivariate normal distribution:
A ~ N₃(μ, Σ)

μ = (0, 0, 0)ᵀ

    ⎡ 1    0.5  0.5 ⎤
Σ = ⎢ 0.5  1    0.5 ⎥
    ⎣ 0.5  0.5  1   ⎦
In words, X, Z and U are drawn from a multivariate normal distribution with a correlation of 0.5 between each pair of variables.
We subsequently generate a variable Y, which is defined as:

yᵢ = xᵢ + zᵢ + uᵢ,  i ∈ {1, …, n}, n = 10,000
Our complete set of observations is thus contained in this matrix:

⎡ x₁+z₁+u₁   x₁   z₁   u₁ ⎤
⎢     ⋮       ⋮    ⋮    ⋮ ⎥
⎣ xₙ+zₙ+uₙ   xₙ   zₙ   uₙ ⎦
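The sampling step can be sketched as follows. This is a minimal reconstruction with our own seed and library choices; the paper does not specify the software used for sampling:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
mu = np.zeros(3)
sigma = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])

xzu = rng.multivariate_normal(mu, sigma, size=n)  # A = [X, Z, U]
x, z, u = xzu.T
y = x + z + u                                     # y_i = x_i + z_i + u_i

# Sanity checks against the specification.
print(np.corrcoef(x, z)[0, 1])  # close to 0.5
print(y.mean())                 # close to 0
```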
With our knowledge of the underlying data-generating mechanism, we can describe the functional form as:

y = βx x + βz z + βu u,  with βx = βz = βu = 1
In our subsequent analysis we will treat these parameters as unknown and compute estimates for them from the manipulated dataset, i.e. after it has been subjected to the Missing Data Mechanism (MDM):
P(xᵢ = missing) = 1 − 1/(1 + e^(yᵢ))

P(zᵢ = missing) = 1 − 1/(1 + e^(−yᵢ))

P(uᵢ = missing) = 0 if xᵢ = missing or zᵢ = missing, 1 otherwise
In words, we apply a logistic function each for X and Z to generate the probability of missingness as a function of the values of Y. This also means that the missingness of X (or Z) does not depend on the values of X (or Z) themselves. This is a key condition for making the MAR assumption.
The choice of the logistic function (see graph) is arbitrary, but one could think of a lifestyle survey with Body Mass Index (BMI) as a target variable (Y). It is plausible that those with a good BMI are more likely to report details of their healthy lifestyle than those with a poor BMI and a presumably unhealthy lifestyle, hence the increasing probability of missing data points on the poor side of the scale.
[Plot: P(missing) as a function of Y over the range −10 to 10, showing the two opposing logistic curves.]
More specifically, this means that with increasing values of Y (the independent variable), the probability of X (a dependent variable) being missing increases (red curve), while the opposite is true for Z (blue curve).
Furthermore, the (also arbitrary) rule for U implies that U is always missing when both X and Z are observed. For instance, the variable U could represent a follow-up question in a survey that is only asked when X and/or Z are item non-responses. Whatever the reason, with this rule characterizing the missing data mechanism, we will never have a full set of observations for Y, X, Z and U.
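The DGP and this MDM can be simulated end to end in a short sketch (seed and library choices are ours), which also reproduces the missingness rates reported below:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
sigma = np.array([[1, .5, .5], [.5, 1, .5], [.5, .5, 1]])
x, z, u = rng.multivariate_normal(np.zeros(3), sigma, size=n).T
y = x + z + u

# MAR missingness: the probability of x (or z) being missing depends on y only.
p_x_missing = 1 - 1 / (1 + np.exp(y))    # increases with y
p_z_missing = 1 - 1 / (1 + np.exp(-y))   # decreases with y
x_miss = rng.random(n) < p_x_missing
z_miss = rng.random(n) < p_z_missing
u_miss = ~(x_miss | z_miss)              # u missing exactly when x and z are both observed

x_obs, z_obs, u_obs = x.copy(), z.copy(), u.copy()
x_obs[x_miss] = np.nan
z_obs[z_miss] = np.nan
u_obs[u_miss] = np.nan

print(x_miss.mean(), z_miss.mean(), u_miss.mean())  # roughly 0.5, 0.5, 0.13
```

Note that, by construction, no record is complete: whenever x and z are both present, u is removed.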
The following table shows the first 20 observations once the MDM is applied.

[Table: Y, X, Z and U values for the first 20 observations after application of the MDM, with missing values shown as blank cells.]
In total, this amounts to 50% of the X and Z data points missing and approximately 13% of U.
Standard Methods
First we will approach this dataset with traditional missing values processing methods. Estimating the regression parameters and comparing them to the known true parameters will allow us to assess the performance of these methods.
Listwise/Casewise Deletion
With no complete set of observations available, listwise deletion is obviously not an option here. After deleting all cases with missing values, not a single case would be left in our dataset. With a large number of variables in a study, this situation is not at all unrealistic.
Pairwise Deletion
With listwise deletion not being an option, pairwise deletion will be examined next. For the Y-X pair, we have 5,047 observations; for Y-Z and Y-U, we have 5,010 and 8,630 valid observation pairs respectively.
We estimate the parameters with OLS and obtain the following coefficients:

βx = 1.529
βz = 1.585
βu = 1.419
Constant = −0.020
Clearly, the parameter estimates are quite different from the known values. More interestingly, and as suggested in the introduction, the computed value of R² = 1.29 is not meaningful. As such, we do not gain much insight from this approach.
Single Mean Imputation
As an alternative to deletion methods, most statistics programs offer mean imputation as the next option, so we will try this approach here. As a result, each observed variable's mean value is imputed to replace the missing values. Given that the MDM is not symmetric for X and Z, the means of these observed variables are no longer zero:
x̄ = −0.524
z̄ = 0.519
ū = 0.001
When performing the regression on the basis of the mean-imputed values, we obtain the following parameters (standard errors are reported in parentheses):

βx = 1.187 (0.018)
βz = 1.201 (0.019)
βu = 1.784 (0.012)
Constant = 0.001 (0.017)
Knowing the true parameters, i.e. βx = βz = βu = 1, we can conclude that these parameter estimates from the mean imputation are biased. So, mean imputation does not successfully address the problem of missing values here, as we had already suggested in the introduction.
Multiple Imputation (MI)
Today, multiple imputation is becoming widely available in statistical software packages, even though references to this method still remain relatively rare in applied research.
In any case, multiple imputation has very appealing theoretical and practical properties. Without spelling
out the details of the computation, we will simply show the results from the MI function implemented in
SPSS 20.
The pooled parameter estimates turn out to be very close to the true parameter values, far better than what was estimated with pairwise deletion and mean imputation:

βx = 0.958 (0.007)
βz = 0.957 (0.013)
βu = 1.015 (0.009)
Constant = −0.028 (0.013)
SPSS automatically computed a small value for the intercept, which was not included in our DGP function specification, but the overall performance is still very good and far ahead of the mean imputation.
In the following dialog box, the large number of missing data points is highlighted in the information panel (28.28% of the dataset is missing).
The next step is of critical importance: we need to select how to deal with the missing values, and BayesiaLab gives us a number of options.
The first option, Filter, allows us to perform listwise/casewise deletion, which, as before, is not feasible, as no data points would remain available for analysis. The second option, Replace By, refers to mean imputation⁵, which, with all its drawbacks, can be applied here within the Bayesian network framework. We mention mean imputation for the sake of completeness, rather than as a recommendation. Also, Filter and/or Replace By can be selectively applied on a variable-by-variable basis. Once one of these static missing values processing options is applied to the selected variables, BayesiaLab considers these variables as completely observed. In other words, this approach would purely be a pre-treatment generating fixed values.
For our example we shall instead choose from among the BayesiaLab-specific options, i.e. Static Completion, Dynamic Completion and Structural EM, which we will subsequently explain:
Static Completion
Static Completion resembles mean imputation, but differs in one important aspect: while mean imputation is deterministic, Static Completion performs random draws from the marginal distributions of the observed data points and saves these randomly drawn values as placeholder values at the completion of the import process. Our artificially created missing values are thus filled in instantly with estimated values, which would then permit a parameter estimation as in the case of a complete dataset. BayesiaLab highlights the variables that contain missing values with a small question mark icon.

⁵ Mean imputation applies to continuous and discrete numerical variables. For categorical variables, the modal values are imputed.
The tables below compare the original data with missing points (left) with the values imputed by Static Completion (right).

[Table: Y, X, Z and U for the first 20 observations, before imputation (missing values blank, left) and after imputation by Static Completion (right).]
When Static Completion is selected, one can also subsequently perform imputation on demand by using the Learning > Probabilities Learning function. BayesiaLab then uses the current Bayesian network to infer, for each missing value, the posterior probability distribution of the corresponding variable given the observed values. Once the imputation is completed, one can relearn a network with the new dataset and once again perform the imputation using this most recent network. This process can be repeated until no structural change is observed in the learned Bayesian network.
Dynamic Completion
The workflow described above for Static Completion is fully automated with Dynamic Completion (and Structural Expectation-Maximization) and automatically applied after each modification of the network during learning, i.e. after every single arc addition, suppression and inversion.
In practice, this works as follows:
1. BayesiaLab starts with the fully unconnected network and uses the available data to estimate the marginal probability distributions of all the variables and then fills its conditional probability tables.
2. BayesiaLab uses the newly learned network and parameters to impute the missing values by drawing from the posterior probability distributions of the variables, given the values of the observed (not-missing or already imputed) variables. This is the Expectation step, i.e. a static imputation step.
3. BayesiaLab uses this new dataset, which no longer contains missing values, to learn the structure and estimate the corresponding parameters. This is the Maximization step.
4. The process alternates the Expectation and Maximization steps until convergence is achieved.
In this process, the Bayesian network grows from an initially unconnected network and evolves in its structure until its final state is reached.
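The alternating Expectation/Maximization idea can be sketched for a plain multivariate Gaussian. This is a simplified analogue for intuition only: BayesiaLab's Structural EM also relearns the network structure in each Maximization step, and a full EM would additionally carry a conditional-covariance correction term, both of which this sketch omits:

```python
import numpy as np

def em_like_imputation(data, n_iter=30):
    """EM-style imputation sketch for a multivariate Gaussian.

    E-step: replace each missing entry with its conditional expectation given
    the observed entries under the current (mu, sigma) estimate.
    M-step: re-estimate mu and sigma from the completed data.
    """
    X = data.copy()
    miss = np.isnan(X)
    # Initialize missing entries with the column means of the observed values.
    col_means = np.nanmean(data, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        mu, sigma = X.mean(axis=0), np.cov(X, rowvar=False)   # M-step
        for i in np.where(miss.any(axis=1))[0]:               # E-step
            m, o = miss[i], ~miss[i]
            if o.any():
                s_oo = sigma[np.ix_(o, o)]
                s_mo = sigma[np.ix_(m, o)]
                X[i, m] = mu[m] + s_mo @ np.linalg.solve(s_oo, X[i, o] - mu[o])
            else:
                X[i, m] = mu[m]
    return X

# Toy demonstration on correlated data with values knocked out at random.
rng = np.random.default_rng(3)
full = rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], size=500)
obs = full.copy()
obs[rng.random(500) < 0.3, 1] = np.nan
filled = em_like_imputation(obs)
print(np.abs(filled[np.isnan(obs)] - full[np.isnan(obs)]).mean())
```

With a correlation of 0.8, the conditional imputations land much closer to the true values than the marginal mean would, which is the essential benefit of conditioning on the observed variables.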
Depending on the number of variables, the chosen learning algorithm and the network complexity⁶, hundreds or thousands of iterations may be calculated. The final imputed dataset then remains available for subsequent analysis or export.
In the context of this iterative learning process, and beyond merely estimating the missing values, we have also established an interpretable structure of the underlying domain in the form of a Bayesian network. From a qualitative perspective, this network does indeed reflect all the relationships that we originally created with the DGP formula and the covariance structure from the original sampling process.
Having this fully estimated network available, we can also retrieve the quantitative knowledge contained therein and obtain the "parameters" of the structure. We are using "parameters" here loosely (hence the quotation marks), as the Bayesian network structures in BayesiaLab are always entirely nonparametric.
To obtain these quantitative characteristics from the network, we can immediately perform the Target Mean Analysis (Direct Effects), which produces a plot of the interactions of U, Z and X with Y.⁷ Even though there is no assumption of linearity in BayesiaLab, the shape of the curves confirms that the learned network indeed approximates the linear function from the DGP formula.

⁶ In BayesiaLab, the network complexity can be managed with the Structural Coefficient.
⁷ Our white paper, Direct Effects and Causal Inference, describes details of the Direct Effects computation.
Computing the Direct Effect on Target provides us with the partial derivatives of each of the above curves, ∂y/∂x, ∂y/∂z and ∂y/∂u, at the mean value of each variable:
Direct Effects on Target Y

Node | Standardized Direct Effect | Direct Effect
U    | 0.4283                     | 1.0311
Z    | 0.3733                     | 1.0238
X    | 0.3514                     | 0.9733
The derivatives represent the slopes of each curve (at the mean of each of the independent variables), and, now assuming linearity, we can directly interpret the Direct Effects as the parameter estimates of our DGP function:⁸

βx = 1.031
βz = 1.024
βu = 0.973
These values are indeed very close to the true parameters. Even though we are giving these results a linear
interpretation, it is important to point out that a functional form was neither assumed in the missing data
estimation process, nor in the Bayesian network structural and parametric learning.
Alternatively, we can also save the estimated missing values for subsequent external analysis with classical
methods (Data>Imputation).
We can maintain the uncertainty of the missing data by selecting the option Choose the Values According to the Law. This means that each to-be-saved value will be drawn from the conditional probability distribution of each variable. This option is clearly not the optimal choice at the individual level (one would have to choose Maximum Probability for that purpose), but it allows retaining the variance, i.e. the uncertainty, of the original data.
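The trade-off between the two options is easy to see on a single hypothetical posterior distribution (the states and probabilities below are invented for illustration and are not BayesiaLab output):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical posterior of one discretized variable, given a record's
# observed values: four states with inferred probabilities.
states = np.array([-1.0, 0.0, 1.0, 2.0])
posterior = np.array([0.1, 0.5, 0.3, 0.1])

# "Maximum Probability": the best single guess, but every such record gets the
# same value, so the imputed column loses variance.
best = states[posterior.argmax()]

# "Choose the Values According to the Law": draw from the distribution,
# preserving the uncertainty of the missing data.
draws = rng.choice(states, size=10_000, p=posterior)

print(best, draws.var())  # draws reproduce the posterior's own variance
```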
With this fully imputed dataset, and assuming linearity for the DGP, we can once again use traditional methods, e.g. OLS, to compute the parameter estimates. In this case, we would obtain the following:

βx = 0.954 (0.013)
βz = 0.898 (0.013)
βu = 1.163 (0.012)
Constant = 0.026

Given that no functional form was utilized in the missing values imputation process, and given that the variables were discretized, the results are extremely useful.
Nonlinear Example
As the previous (linear) example turned out to be manageable with a number of different approaches, we will now look at a more challenging situation: we introduce nonlinear relationships between the variables.
A ~ N₃(μ, Σ)

μ = (0, 0, 0)ᵀ

    ⎡ 1    0.5  0.5 ⎤
Σ = ⎢ 0.5  1    0.5 ⎥
    ⎣ 0.5  0.5  1   ⎦

So, we use the same DGP as for the linear example, but now create a new function for Y that combines exponential, quadratic and linear terms:
⎡ e^x₁ + z₁² + u₁   x₁   z₁   u₁ ⎤
⎢        ⋮           ⋮    ⋮    ⋮ ⎥
⎣ e^xₙ + zₙ² + uₙ   xₙ   zₙ   uₙ ⎦

With our knowledge of the underlying DGP, we can describe the functional form as:

y = βx e^x + βz z² + βu u,  with βx = βz = βu = 1
We can see that the overall network structure reflects the relationships that were stipulated by both the covariance matrix and the DGP formula. More specifically, we can check the Pearson correlation coefficients of the linear relationships between X, Z and U by displaying the values on the relevant arcs (Analysis > Graphic > Pearson's Correlation).
The computed values are indeed close to the 0.5 stated in the covariance matrix.⁹ It is important to point out here again that BayesiaLab does not compute the correlation on the assumption of a linear function, as BayesiaLab is entirely nonparametric.
We now turn to the nonlinear relationships specified in the DGP formula. To see how the principal dynamics with Y are reflected in the Bayesian network, we illustrate the dependencies in the following plot of the Target Mean Analysis by Direct Effect.
We can visually identify the exponential, quadratic and linear shapes of the curves and thus judge that the complete dataset is adequately represented.
⁹ Given that the standard deviation was set to 1, the covariance equals the correlation coefficient.
P(xᵢ = missing) = 1 − 1/(1 + e^(yᵢ/3))

P(zᵢ = missing) = 1 − 1/(1 + e^(−yᵢ/3))
[Plot: P(missing) as a function of Y over the range −10 to 10 for the nonlinear example, again showing two opposing logistic curves.]
Upon application of the MDM, we once again have the issue that not a single observation is complete. With that, we have a threefold challenge:
1. Data missing at random
2. No complete set of observations
3. Unknown dependencies between all variables, both for the DGP and the MDM
To further illustrate this challenge, we plot the remaining data points post-MDM. Given that we know the original DGP, we can just about make out the functional form, although the exponential and the linear terms seem to blur together.
Another way to represent the MDM is to plot Y vs. X for the complete observations (red) and for the observations where X is missing (black), shown below.
From a traditional statistical perspective, it would indeed present a formidable challenge to identify the functional form and its parameters and then impute the missing values.
As there is no functional form recorded in BayesiaLab, we need to look at a plot of the probabilistic dependencies to see whether we have successfully recovered the underlying dynamics. More specifically, we need to use Target Mean Analysis by Direct Effects to decompose the overall function (now represented by the Bayesian network) into the individual components that reflect the DGP terms.
A first visual inspection suggests that BayesiaLab has indeed captured the salient points of the DGP (keeping in mind that the variables have been discretized). As BayesiaLab is entirely nonparametric, we cannot retrieve parameter estimates; rather, we can utilize the learned functions as they are encoded in the structure of the Bayesian network and the associated conditional probability tables.
However, computing the Direct Effect on Target provides us with the partial derivatives of each of the above Direct Effect Function curves, ∂y/∂x, ∂y/∂z and ∂y/∂u, at the estimated mean value of each variable.
This facilitates a comparison with the true partial derivatives of the known DGP function at the mean values, x̄ = z̄ = ū = 0:

Variable | Partial Derivative of DGP | Direct Effect Computed with Missing Data
X        | 1                         | 1.150
Z        | 0                         | 0.296
U        | 1                         | 0.902
Given the large amount of missing data, these Direct Effect values turn out to be useful approximations. Now that we have the complete DGP function represented in a Bayesian network, we can proceed with our analysis within the BayesiaLab framework. A wide range of BayesiaLab's analysis and optimization functions are described in our other white papers.
Imputation
For the sake of completeness, we will once again make the completed dataset available for external analysis with standard statistical tools. As shown before in the linear case, we can export the imputed values by selecting Data > Imputation.
We can maintain the uncertainty of the missing data by selecting the option Choose the Values According to the Law. This means that each to-be-saved value will be drawn from the posterior probability distribution of each variable, which retains the variance, i.e. the uncertainty, of the original data.
The complete dataset is now saved as a CSV file and can be accessed by any statistical program for further analysis. Assuming that we know the functional form of the DGP (or are able to guess it from the plot of the Target Mean Analysis seen earlier), we can then use the filled-in dataset to estimate the parameters. For instance, we can use OLS to compute the parameter estimates for the functional form of the DGP:
y = βx·eˣ + βz·z² + βu·u
In this case, we would obtain the following coefficients, which are once again very close to the true values of the parameters:
βx = 1.025 (0.013)
βz = 0.998 (0.013)
βu = 1.153 (0.020)
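This last step can be sketched with NumPy's least-squares solver. The data below are simulated from the assumed functional form y = eˣ + z² + u (unit coefficients), standing in for the imputed CSV export; because the nonlinear terms enter the design matrix directly, the fit is linear in the parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Simulated stand-in for the filled-in dataset, drawn from the
# assumed DGP y = e^x + z^2 + u plus a little observation noise.
x = rng.normal(0, 1, n)
z = rng.normal(0, 1, n)
u = rng.normal(0, 1, n)
y = np.exp(x) + z**2 + u + rng.normal(0, 0.1, n)

# Design matrix with the known nonlinear terms; OLS then
# estimates the coefficients beta_x, beta_z, beta_u.
X = np.column_stack([np.exp(x), z**2, u])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 3))  # estimates close to the true unit coefficients
```

Any statistics package's OLS routine would do the same job; the point is only that a correctly filled-in dataset recovers parameters near their true values.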
Appendix
About the Authors
Stefan Conrady
Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia S.A.S. for North America.
Stefan Conrady studied Electrical Engineering in Germany and has extensive management experience in the fields of product planning, marketing, and analytics, working at Daimler and BMW Group in Europe, North America, and Asia. Prior to establishing his own firm, he headed the Analytics & Forecasting group at Nissan North America.
Lionel Jouffe
Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia S.A.S. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999, and it has emerged as the leading software package for knowledge discovery, data mining, and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia's strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.
Contact Information
Conrady Applied Science, LLC
312 Hamlets End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com
Bayesia S.A.S.
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com
Copyright
© 2011 Conrady Applied Science, LLC and Bayesia S.A.S. All rights reserved.
Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the
following:
You may print or download this document for your personal and noncommercial use only.
You may copy the content to individual third parties for their personal use, but only if you acknowledge
Conrady Applied Science, LLC and Bayesia S.A.S. as the source of the material.
You may not, except with our express written permission, distribute or commercially exploit the content.
Nor may you transmit it or store it in any other website or other form of electronic retrieval system.