STATISTICAL SCIENCE
FOR THE LIFE AND BEHAVIOURAL SCIENCES
Abstract
Food-borne disease outbreaks constitute a large, ongoing public health burden worldwide (Hald
et al., 2016). Early identification of contaminated food products plays an important role in
reducing health burdens of food-borne disease outbreaks (Jacobs et al., 2017). Case-control
studies together with logistic regression analysis are primarily used in food-borne outbreak in-
vestigations. However, the current methodology is associated with problems including response
misclassification, missing values and ignoring small sample bias.
Jacobs et al. (2017) developed a formal Bayesian variable selection method which deals
with the problems of missing covariates and misclassified response. The re-analysis of Dutch
Salmonella Thompson 2012 outbreak data (Friesema et al., 2014) has illustrated that this
Bayesian approach allows a relatively easy implementation of these concepts and performs better
than the standard logistic regression analysis in the identification of responsible food products.
The complete Bayesian variable selection model is composed of three different parts, namely,
misclassification correction, missing value imputation and Bayesian variable selection. In this
thesis, we are interested in how these different parts affect the performance of Bayesian variable
selection models in scenarios with (i) the same response misclassification rate and missingness
rate in an assumed responsible food product covariate as in the original food-borne disease out-
break dataset, (ii) different response misclassification rates, (iii) different missingness rates in an
assumed responsible food product and (iv) the combination of different response misclassifica-
tion rates and missingness rates. We answer this research question by designing and executing
a simulation study. Our results indicate that, for the four versions of the Bayesian variable
selection model studied in this thesis, increasing the response misclassification rate, the
missingness rate in the assumed responsible food product covariate, or both decreases model
performance. Bayesian variable selection, misclassification correction and missing value
imputation all contribute positively to model performance. Although missing value imputation is
the most computationally expensive of the three components, it also contributes the most to
model performance.
Contents

1 Introduction
  1.1 Case-control studies
    1.1.1 Case definition and control selection
    1.1.2 Data and data collection
    1.1.3 Limitations
  1.2 Methods for analysing data from case-control studies
    1.2.1 Standard methods
    1.2.2 Lasso logistic regression
    1.2.3 Bayesian variable selection method
  1.3 The goal of this thesis

2 Methodology
  2.1 Classical logistic regression without misclassification
    2.1.1 Estimation of conditional odds ratio
  2.2 Logistic regression with nondifferentially misclassified responses
  2.3 Variable selection methods
    2.3.1 Generalized linear models
    2.3.2 Classical variable selection
      2.3.2.1 Variable selection techniques
      2.3.2.2 Combination of univariable analysis and stepwise approaches
      2.3.2.3 Lasso
    2.3.3 Bayesian variable selection
  2.4 Missing covariates
    2.4.1 Missing data mechanism
    2.4.2 Methods for missing data
      2.4.2.1 Complete-case analysis
      2.4.2.2 Ad-hoc imputation
      2.4.2.3 Sequential full Bayesian approach
  2.5 Complete Bayesian variable selection model
    2.5.1 Logistic regression with nondifferentially misclassified responses
    2.5.2 Bayesian variable selection
    2.5.3 Missing imputation with variable selection

4 Simulation study
  4.1 Motivation for designing a simulation study based on real data
  4.2 Simulating new datasets
    4.2.1 General description
    4.2.2 Algorithm
  4.3 Misclassification scenarios
  4.4 Missingness scenarios
  4.5 Scenarios with increased misclassification and missingness rates
  4.6 Model fitting
    4.6.1 Prior specification
    4.6.2 Initial values of parameters
    4.6.3 Burn-in iterations and the posterior sample size
    4.6.4 Model fitting steps
    4.6.5 Performance measures

D R-code for the complete Bayesian variable selection model
Chapter 1
Introduction
Food-borne disease outbreaks are defined by the Centers for Disease Control and Prevention
(CDC) as an incident in which two or more persons experience a similar illness due to ingestion
of the same food (Centers for Disease Control and Prevention, 2015). The global burden of
food-borne disease outbreaks is estimated to be considerable by the World Health Organization
(WHO) (Hald et al., 2016). For example, in 2014, 5,521 food-borne disease outbreaks, resulting
in 45,665 human cases, 6,438 hospitalizations and 27 deaths, were reported in the European Union
(EU) (European Food Safety Authority (EFSA) and European Centre for Disease Prevention and
Control (ECDC), 2015). Early identification of contaminated food products plays an important
role in reducing health burdens of food-borne disease outbreaks (Jacobs et al., 2017).
cases in many aspects, such as age and gender, and must be unaffected individuals at the same
risk of developing the food-borne disease (Lewallen and Courtright, 1998).
1.1.3 Limitations
Despite the widespread use of case-control methods in food-borne outbreak investigations, there
are several limitations. First, case-control studies are subject to information bias (Dwyer et al.,
1994). One example of information bias is recall bias, which is caused by systematic differences
in the accuracy or completeness of historical self-reported information from respondents
(Last, 2000). True exposures are likely to be underreported by controls and overreported by cases.
This exaggerates the magnitude of the difference between cases and controls in reported rates of
exposure to suspected risk factors and thus leads to an inflation of the odds ratio (Raphael, 1987).
Recall bias also depends on the type of food product: compared with distinctive food products,
common food products might be more likely to be underreported. Second, misclassification of
disease outcomes in case-control studies results in misclassification bias. For instance, asymp-
tomatic cases may be selected as controls, i.e. a false negative. Likewise, false positives may
occur among individuals who show symptoms consistent with the case definition, while in fact
those symptoms result from a different etiology (Dwyer et al., 1994). Misclassification can bias
estimates and may lead investigators to make incorrect conclusions about an exposure (Gilbert
et al., 2016). Finally, item non-response may arise in the data collection for several reasons. For
example, subjects may fail to remember their dietary intake in a given time period, or forget to
fill in an answer due to carelessness.
1.2 Methods for analysing data from case-control studies
1.2.1 Standard methods
Classical logistic regression is typically applied in the statistical analysis of the questionnaires to
estimate exposure effects while controlling for confounders. The selection of relevant variables is
a critical component of a case-control outbreak investigation as a large number of food products
that people may have consumed are usually investigated. Because the number of food products
being investigated is usually larger than the number of observations, researchers commonly use a
combination of univariable analysis and stepwise, forward or backward variable selection (Hosmer
et al., 2013). Moreover, classical variable selection procedures are subject to small-sample bias.
1.3 The goal of this thesis
The complete Bayesian variable selection model is composed of three different parts, namely,
misclassification correction, missing value imputation with variable selection and Bayesian vari-
able selection. Different versions of Bayesian models are generated by different combinations of
these parts.
We are interested in investigating how these different parts affect the performance of Bayesian
variable selection models in scenarios with (i) the same missingness rate (i.e. the percentage
of missing values) in an assumed responsible food product covariate and the same response
misclassification rate as in the original dataset, (ii) different response misclassification rates, (iii)
different missingness rates in an assumed responsible food product and (iv) the combination of
different response misclassification rates and missingness rates. This main research question can
be divided into two smaller research sub-questions:
(i) What are the effects of missingness in the assumed responsible food product covariate and
response misclassification on the performance of Bayesian variable selection models? That is,
how well does a Bayesian variable selection model identify the responsible food product in the
above scenarios?
(ii) Which parts of the Bayesian variable selection model contribute to the performance in
each of the above scenarios?
In this thesis, we will answer these research questions by designing and executing a simulation
study. The simulated data will be based on a real dataset. More specifically, the simulated
datasets will be shuffled versions of a real dataset. The simulation study will consist of
four scenarios: (i) original response misclassification rate and missingness rate in the assumed
responsible food product covariate, (ii) increased response misclassification rate, (iii) increased
missingness rate in the assumed responsible food product, (iv) increased response misclassification
rate and missingness rate in the assumed responsible food product covariate. In each of the
scenarios, the data will be analysed using standard logistic regression (used as a baseline) and
different versions of the Bayesian variable selection model. By comparing the performance of the
different models in each scenario and the performance of one specific model in all scenarios, we
can gain better insight into the mechanisms of the Bayesian variable selection method.
This thesis is organized in the following way. In Chapter 2, we present the methods used in
the analysis of the case-control data including the standard logistic regression and the Bayesian
variable selection method developed by Jacobs et al. (2017). In Chapter 3, we present two
datasets from real food-borne disease outbreaks and show the results of the descriptive statistics
and standard logistic regression. In Chapter 4, we present the simulation study design. In
Chapter 5, we present the results of the simulation study. Chapter 6 contains the discussion and
conclusions.
Chapter 2
Methodology
$$\mathrm{logit}(\mu_i) = \beta_0 + x_i\beta, \tag{2.1}$$
where $x_i \in \mathbb{R}^{1\times p}$ denotes the $i$th row of the design matrix $X \in \mathbb{R}^{n\times p}$, and $\beta \in \mathbb{R}^{p\times 1}$ denotes a
$p$-dimensional column vector listing the unknown regression coefficients.
then $e^{\hat{\beta}z}$ is an estimate of the conditional odds ratio $\psi_{cond}(z)$. The conditional odds ratio $\psi_{cond}(z)$
is the odds ratio between Y and Z when the values of X1 , . . . , Xp are held fixed.
2.2 Logistic regression with nondifferentially misclassified
responses
Response misclassification occurs in a case-control study when the observed disease outcomes
do not truly reflect the true disease status. Let ỹi designate the true disease status, with
ỹi = 1 if the disease is present and ỹi = 0 otherwise, and let yi denote the observed disease
outcome, which is subject to misclassification, for person i = 1, 2, ..., n. We assume nondifferential
misclassification (i.e. the misclassification of disease status does not depend on the covariates):
This assumption implies that the critical diagnostic properties known as sensitivity (Se) and
specificity (Sp) do not vary according to exposure status (Thomas et al., 1993; Tang et al., 2015).
Thus we define Se = P(yi = 1 | ỹi = 1) and Sp = P(yi = 0 | ỹi = 0).
For an individual with covariate information xi , let the probability of having a true case be
πi = P (ỹi = 1|xi ).
By the law of total probability, a possibly misclassified, positive diagnostic test occurs with
probability
$$\begin{aligned}
\mu_i \equiv P(y_i = 1 \mid x_i) &= P(y_i = 1 \mid \tilde{y}_i = 1, x_i)P(\tilde{y}_i = 1 \mid x_i) + P(y_i = 1 \mid \tilde{y}_i = 0, x_i)P(\tilde{y}_i = 0 \mid x_i) \\
&= P(y_i = 1 \mid \tilde{y}_i = 1)P(\tilde{y}_i = 1 \mid x_i) + P(y_i = 1 \mid \tilde{y}_i = 0)P(\tilde{y}_i = 0 \mid x_i) \\
&= P(y_i = 1 \mid \tilde{y}_i = 1)P(\tilde{y}_i = 1 \mid x_i) + [1 - P(y_i = 0 \mid \tilde{y}_i = 0)][1 - P(\tilde{y}_i = 1 \mid x_i)] \\
&= Se\,\pi_i + (1 - Sp)(1 - \pi_i). \tag{2.6}
\end{aligned}$$
We assume yi |xi ∼ Bernoulli(µi ), with probability of success derived from Equation 2.6. In
the logistic regression case, logit(πi ) = β0 + xi β, where xi ∈ R1×p denotes the ith row of the
design matrix X ∈ Rn×p , β ∈ Rp×1 denotes a p-dimensional column vector listing unknown
regression coefficients.
The likelihood function is
$$L(\beta, Se, Sp \mid Y) = \prod_{i=1}^{n} f(y_i \mid \beta, Se, Sp) = \prod_{i=1}^{n} \left[\pi_i Se + (1-\pi_i)(1-Sp)\right]^{y_i} \left[\pi_i(1-Se) + (1-\pi_i)Sp\right]^{1-y_i}. \tag{2.7}$$
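Equation 2.7 is straightforward to evaluate numerically. The thesis's implementation is in R (Appendix D); the sketch below uses Python purely for illustration, with invented toy data and our own function names, and checks that with Se = Sp = 1 the misclassification-adjusted likelihood reduces to the standard Bernoulli log-likelihood.

```python
import numpy as np

def misclassified_loglik(beta0, beta, X, y, se, sp):
    """Log-likelihood of Eq. 2.7: observed outcomes y given
    true-case probabilities pi_i, sensitivity se and specificity sp."""
    pi = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))   # logit(pi_i) = beta0 + x_i beta
    mu = se * pi + (1.0 - sp) * (1.0 - pi)           # Eq. 2.6: P(y_i = 1 | x_i)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# toy data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50)
beta = np.array([0.5, -0.2, 0.1])

# with perfect classification the adjusted likelihood equals the standard one
ll_perfect = misclassified_loglik(0.1, beta, X, y, se=1.0, sp=1.0)
pi = 1.0 / (1.0 + np.exp(-(0.1 + X @ beta)))
ll_standard = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
assert np.isclose(ll_perfect, ll_standard)
```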
2.3 Variable selection methods
In many situations, researchers are interested in variable selection. Researchers may be interested
in gaining a good understanding of the real relationship between the response and the explanatory
variables, so it is important to select only those explanatory variables that best explain the
response. Or researchers may be interested in finding the best prediction model, i.e. selecting
those variables that give the best prediction. In the context of food-borne disease outbreak
investigations, investigators are particularly interested in variable selection with the purpose of
finding the one food product (i.e. variable) that best distinguishes between cases and controls.
In this section, we will discuss variable selection methods in the context of generalized linear
regression.
Here a(·), b(·) and c(·) are known functions which vary according to the distribution.
Specifically, when Y is binary and follows a Bernoulli distribution, GLMs with the canonical logit link
$$g(\mu) = \log\frac{\mu}{1-\mu} \tag{2.10}$$
are logistic regression models.
Variable selection in a generalized linear regression context can be seen as the exercise of
deciding which regression coefficients are equal to zero.
1. Stepwise approach. There are three main approaches: backward elimination, forward
selection and stepwise selection. Backward elimination starts with all candidate predictors
in the model and looks for predictors that are not statistically significant, i.e. whose deletion
does not significantly reduce the fit of the model based on a model comparison criterion.
Then the predictor that is least significant is removed and thus a new “simplified” model is
obtained. The above procedure can be repeated until all predictors in the model are found
significant. The forward selection method reverses the backward elimination method. It
starts with no variables in the model and sequentially adds the most significant predictors.
Stepwise selection is a combination of backward elimination and forward selection. At each
stage, a predictor may be added or deleted.
2. Best subsets approach. The best subsets approach searches all possible models with a
specific set of predictors and identifies the best-fitting models based on the values of the
quantitative criterion. If there are p predictors, the number of subsets is $2^p$.
The criteria used for model comparison include p-values, Akaike’s information criterion (AIC)
and Bayesian information criterion (BIC).
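To make these search strategies concrete, the following Python sketch (our own illustration, using a Gaussian linear model and its AIC as the comparison criterion rather than the logistic models of the outbreak setting) implements backward elimination and an exhaustive best subsets search over the same synthetic data.

```python
import itertools
import numpy as np

def aic_linear(X, y, cols):
    """AIC of a Gaussian linear model with intercept and predictor columns `cols`."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ coef) ** 2)
    n, k = len(y), Xs.shape[1]
    return n * np.log(rss / n) + 2 * k

def backward_elimination(X, y, criterion):
    """Drop predictors one at a time while the criterion keeps improving."""
    cols = list(range(X.shape[1]))
    best = criterion(X, y, cols)
    while cols:
        scores = [(criterion(X, y, [c for c in cols if c != j]), j) for j in cols]
        new_best, drop = min(scores)
        if new_best >= best:        # no single deletion improves the fit: stop
            return cols, best
        best, cols = new_best, [c for c in cols if c != drop]
    return cols, best

def best_subsets(X, y, criterion):
    """Exhaustive search over all 2^p candidate subsets of p predictors."""
    p = X.shape[1]
    subsets = [list(c) for r in range(p + 1)
               for c in itertools.combinations(range(p), r)]
    assert len(subsets) == 2 ** p
    return min(subsets, key=lambda cols: criterion(X, y, cols))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

stepwise_cols, stepwise_aic = backward_elimination(X, y, aic_linear)
subset_cols = best_subsets(X, y, aic_linear)
```

With only five predictors the exhaustive search evaluates 32 models; backward elimination evaluates far fewer, which is why stepwise methods are preferred when p is large.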
The above variable selection techniques are implicitly designed for situations where the num-
ber of the observations is larger than the number of candidate predictors. In the food-borne
disease outbreak investigation, the number of candidate predictors is usually larger than the
number of observations. In this case, the following two techniques are frequently used.
A pre-selection is performed before including all the candidate predictors in the full model. One
way to do a pre-selection is to use univariable models, where a model with just one predictor is tested at
a time. Then only those predictors which meet a preset criterion for significance are selected. This
criterion is often more relaxed than the conventional criterion for significance (for instance, p-
value < 0.20, instead of the usual p-value < 0.05), since the pre-selection aims to identify potential
predictors rather than to test a hypothesis. After the pre-selection procedure, a further variable
selection is performed by using backward, forward or stepwise selection procedures (Hosmer et al.,
2013; Harrell, 2015).
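The two-stage procedure can be sketched in a few lines; the covariate names and univariable p-values below are invented purely for illustration.

```python
# hypothetical univariable p-values for five food product covariates
univariable_p = {"ham": 0.03, "cheese": 0.18, "lettuce": 0.45,
                 "chicken": 0.07, "beef": 0.62}

# stage 1: relaxed pre-selection (p < 0.20, instead of the usual 0.05)
pre_selected = [v for v, p in univariable_p.items() if p < 0.20]

# stage 2: only pre_selected would now enter backward/forward/stepwise selection
print(pre_selected)
```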
2.3.2.3 Lasso
The least absolute shrinkage and selection operator (lasso) was first formulated by
Tibshirani (1996) for estimation in linear models, where it adds a penalty term to the residual
sum of squares. Based on this lasso method, Park and Hastie (2007) proposed a modified criterion
with regularization for estimating the coefficients β in GLMs:
where λ > 0 is the regularization parameter. The lasso method penalizes the coefficients of
candidate predictors, shrinking some of the unimportant coefficients to zero and thus achieves
variable selection.
11
Glmnet (Friedman et al., 2010) fits the lasso logistic regression by solving the following
problem:
" n #
1X
(β0 +xi β)
min − yi · (β0 + xi β) − log 1 + e + λkβk1 . (2.12)
(β0 ,β)∈Rp+1 n i=1
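The selection effect of the ℓ1 penalty in Equation 2.12 can be illustrated with a simple proximal-gradient (ISTA) solver; note that glmnet itself uses coordinate descent, so this Python sketch with synthetic data is only a conceptual stand-in, not glmnet's algorithm.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 norm: shrink towards zero, exact zeros inside [-t, t]."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_logistic(X, y, lam, step=0.01, iters=5000):
    """Proximal gradient descent on Eq. 2.12; the intercept is unpenalized."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))
        g0 = np.mean(mu - y)             # gradient of the averaged negative log-likelihood
        g = X.T @ (mu - y) / n
        b0 -= step * g0
        b = soft_threshold(b - step * g, step * lam)   # proximal step for lam * ||b||_1
    return b0, b

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
eta = 2.0 * X[:, 0]                      # only the first covariate truly matters
y = (rng.random(300) < 1.0 / (1.0 + np.exp(-eta))).astype(float)
b0, b = lasso_logistic(X, y, lam=0.1)
```

The soft-thresholding step sets small coefficients exactly to zero, which is how the lasso "achieves variable selection" in the sense described above.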
with
$$p(y \mid m) = \int p(y \mid \theta_m, m)\, p(\theta_m \mid m)\, d\theta_m, \tag{2.14}$$
where θm denotes the km -dimensional vector of parameters for model m. For example, for linear
regression θm is composed of σ 2 , β0 and dm regression coefficients βm .
Several Bayesian variable selection (BVS) methods have been developed in the last 25 years. The
BVS method that we apply in this research is Stochastic Search Variable Selection (SSVS),
proposed by George and McCulloch (1993).
Let a latent variable γj denote the inclusion indicator for βj, with γj = 1 denoting that the jth
covariate is included in the regression model and γj = 0 otherwise, for covariates j = 1, 2, ..., p.
Let ωj denote the inclusion probability of the jth covariate. The regression coefficients βj are
assumed to have a spike-and-slab prior distribution:
$$\beta_j \mid \gamma_j, \tau^2, c^2 \sim \gamma_j\,\mathrm{N}(0, \tau^2 c^2) + (1 - \gamma_j)\,\mathrm{N}(0, \tau^2). \tag{2.15}$$
In Equation 2.15, τ 2 c2 > 0 is the variance of the slab component and τ 2 > 0 is the variance of the
spike component. The density of the spike component is concentrated closely around zero. Note
that the two Gaussian densities intersect at the points $\pm\epsilon$, where $\epsilon = \sqrt{2\tau^2 c^2 \log(c)/(c^2 - 1)}$ (see
Figure 2.1). The point $\epsilon$ can be considered as a threshold for “practical significance”, because
all coefficients falling into the interval $[-\epsilon, \epsilon]$ can be interpreted as zero (Lesaffre and Lawson,
2012). This provides guidance for choosing the tuning parameters τ and c. When the parameter
c is fixed, the variance τ² can be chosen to reflect the researchers' perception of practical
significance. The choice of the prior distributions of the inclusion indicator variable γj, the
inclusion probability parameter ωj and the regression coefficients βj, and the choice of the
value of $\epsilon$, will be discussed in Section 4.6.1.
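The intersection points are easy to compute; the snippet below (our own check, with illustrative values τ = 0.1 and c = 10) evaluates ε and confirms that the spike and slab densities coincide there.

```python
import math

def spike_slab_threshold(tau, c):
    """Intersection points +/-eps of the N(0, tau^2) and N(0, tau^2 c^2) densities."""
    return math.sqrt(2 * tau**2 * c**2 * math.log(c) / (c**2 - 1))

def normal_pdf(x, sd):
    return math.exp(-x**2 / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

tau, c = 0.1, 10.0          # illustrative tuning parameters
eps = spike_slab_threshold(tau, c)

# at +/-eps the spike and slab densities are equal
assert math.isclose(normal_pdf(eps, tau), normal_pdf(eps, tau * c))
```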
After a model is set up, it is usually fitted using Markov chain Monte Carlo (MCMC) methods,
such as the Gibbs sampler, which generates samples from the joint posterior distribution of the
inclusion indicator variable γj, the inclusion probability parameter ωj and the regression
coefficients βj.

Figure 2.1: Spike-and-slab prior distribution used in the SSVS procedure (Lesaffre and Lawson,
2012)
Once samples are drawn from the joint posterior distribution of the parameters (γj, ωj, βj),
the two-sided posterior inclusion probability $P(\beta_j \notin [-\epsilon, \epsilon] \mid \text{Data})$ can be used as a criterion for
variable selection. In the context of food-borne disease outbreaks, an odds ratio of 1.0 (i.e. β = 0)
indicates that exposure to a food product is not associated with the disease. An odds ratio larger
than 1.0 (i.e. β > 0) indicates that the exposure might be a risk factor for the disease, and an
odds ratio less than 1.0 (i.e. β < 0) indicates that the exposure might be a protective factor
against the disease. Because only positive regression coefficients are of interest, the one-sided
posterior inclusion probability $P(\beta_j > \epsilon \mid \text{Data})$ is used (Jacobs et al., 2017).
We use the one-sided posterior inclusion probability $P(\beta_j > \epsilon \mid \text{Data})$, a marginal probability
for each variable, as the criterion for variable selection. We are aware that a different set of
variables might have the highest joint posterior probability. In this thesis, however, we want to
find the food product covariate with the highest one-sided posterior inclusion probability; we are
not interested in finding a model as a whole to explain the response. Therefore, it is not
problematic that the joint posterior probability of the selected variables might not be the highest.
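Given posterior draws of βj, the one-sided inclusion probability is simply the fraction of draws exceeding ε. A minimal Python sketch with simulated draws (illustrative only, not actual sampler output):

```python
import numpy as np

def one_sided_inclusion_prob(beta_draws, eps):
    """P(beta_j > eps | Data), estimated per column from posterior samples (rows = draws)."""
    return np.mean(beta_draws > eps, axis=0)

rng = np.random.default_rng(4)
# fake posterior draws for 3 covariates: one clearly positive, two near zero
draws = np.column_stack([rng.normal(1.0, 0.2, 10000),
                         rng.normal(0.0, 0.05, 10000),
                         rng.normal(0.0, 0.05, 10000)])

probs = one_sided_inclusion_prob(draws, eps=0.3)
selected = int(np.argmax(probs))   # covariate with highest inclusion probability
```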
the responsible food product. If these covariates are missing, the data analysis of a food-borne
disease outbreak investigation faces a challenge. There are different ways to deal with missing
covariates. In this section, we discuss how one could handle missing values in the context of
food-borne disease outbreaks.
(1) Missing completely at random (MCAR). A variable is MCAR if missingness depends neither
on the observed data nor on the unobserved data.
(2) Missing at random (MAR). A variable is MAR if missingness depends on the observed
data, but does not depend on the unobserved data.
(3) Missing not at random (MNAR). A variable is MNAR if missingness depends on the un-
observed data, perhaps in addition to the observed data.
A common missing data approach is complete-case analysis (CC), which deletes all subjects with
incomplete data. When missing data are MCAR, CC analysis provides unbiased results (Little
and Rubin, 2002). However, this does not mean that CC analysis is always a desirable method.
If the proportion of incomplete cases is large, CC analysis can lead to a reduction of statistical
power (Belin et al., 2000). For example, suppose that data are MCAR across 30 variables and
the missingness proportion for each variable is 5%. Using CC analysis will lose close to four fifths
of the subjects, because the percentage of the fully observed subjects in the original data is only
$(1 - 0.05)^{30} \approx 21\%$. In addition, when the missingness mechanism is not MCAR, the results can
be biased.
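This arithmetic is easy to verify by simulation; the snippet below (purely illustrative) draws MCAR missingness indicators for 30 variables and compares the simulated complete-case fraction with (1 − 0.05)^30.

```python
import numpy as np

rng = np.random.default_rng(5)
n_subjects, n_vars, miss_rate = 100_000, 30, 0.05

# True where a value is missing; a subject is complete if no variable is missing
missing = rng.random((n_subjects, n_vars)) < miss_rate
complete_fraction = np.mean(~missing.any(axis=1))

theoretical = (1 - miss_rate) ** n_vars   # roughly 0.21, i.e. four fifths of subjects lost
```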
For the analysis of food-borne disease outbreak data, many researchers apply ad-hoc imputation
methods to fill in missing values so that the standard software can be easily used to analyse com-
plete data. For example, in the analysis of the Salmonella Bovismorbificans 2016-2017 outbreak
(Brandwagt et al., 2018) and Salmonella Thompson 2012 outbreak (Friesema et al., 2014) in the
Netherlands, food product covariates that were not filled in questionnaires were assumed to be
zero. In addition, subjects who answered “maybe” to questions on the consumption of some food
products were assumed to have consumed these products. The first assumption
is reasonable, since subjects usually only mark the food products which they have consumed.
However, the second assumption seems implausible. Subjects who had not consumed a food
product may give a “maybe” answer because they failed to recall their consumption history and
were unsure about the consumption of the food product. Therefore, the second assumption
possibly overestimated food consumption, thus leading to invalid inferences.
The sequential full Bayesian (SFB) approach was proposed by Erler et al. (2016). By combining
the imputation models with the analysis model in one estimation procedure, this approach
jointly imputes missing covariates and obtains inferences on the posterior distribution of the
parameters. In a standard Bayesian setting with complete data, the probability density function
of interest is $p(\theta_{Y|X} \mid y_i, x_i)$, where $\theta_{Y|X}$ denotes the vector of parameters of the model (for
example $(Se, Sp, \beta_0, \beta^{\top})^{\top}$ for the model in Section 2.2). When some covariate values are missing,
X is composed of two parts: covariates containing completely observed values $X_{obs}$ and
covariates containing missing values X mis . The total number of covariates p is split up into q
observed covariates and r missing covariates. Then the posterior probability of interest becomes
$p(\theta_{Y|X}, \theta_X, x_{i,mis} \mid y_i, x_{i,obs})$, which can be written as
$$p(\theta_{Y|X}, \theta_X, x_{i,mis} \mid y_i, x_{i,obs}) \propto p(y_i \mid x_{i,obs}, x_{i,mis}, \theta_{Y|X})\; p(x_{i,mis} \mid x_{i,obs}, \theta_X)\; \pi(\theta_{Y|X})\; \pi(\theta_X),$$
where θX is a vector of parameters which are associated with the likelihood of partially observed
covariates X mis , and π(θY |X ) and π(θX ) are prior distributions (Erler et al., 2016). The joint
likelihood of the missing covariates p(xi,mis |xi,obs , θX ) can be specified in a convenient way by
using a sequence of conditional univariate distributions (Ibrahim et al., 2002):
$$p(x_{i,mis} \mid x_{i,obs}, \theta_X) = \prod_{j=q+1}^{p} p(x_{ij} \mid x_{i1}, \ldots, x_{i,j-1}, \theta_{X_j}). \tag{2.19}$$
After the prior distributions π(θY |X ) and π(θX ) are specified, samples can be drawn from
the joint distribution of all parameters and missing covariates using MCMC methods, such as
Gibbs sampling. The SFB approach obtains valid inferences only under ignorable missing data
mechanisms, that is, MCAR or MAR, and when the analysis model, together with the conditional
distributions of the covariates, are correctly specified (Erler et al., 2016).
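The sequential factorization in Equation 2.19 can be sanity-checked on a toy example: with logistic conditional models for three binary covariates (coefficient values invented for illustration), the chained conditionals define a proper joint distribution, i.e. the probabilities of all 2^3 patterns sum to one.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# invented coefficients: alpha[j] holds the intercept followed by weights
# on the previous covariates x_1, ..., x_{j-1}
alpha = {0: [0.2], 1: [-0.5, 1.0], 2: [0.1, -0.3, 0.7]}

def joint_prob(x):
    """p(x1, x2, x3) = prod_j p(x_j | x_1, ..., x_{j-1}) with logistic conditionals."""
    prob = 1.0
    for j, xj in enumerate(x):
        eta = alpha[j][0] + sum(a * xk for a, xk in zip(alpha[j][1:], x[:j]))
        pj = sigmoid(eta)
        prob *= pj if xj == 1 else (1.0 - pj)
    return prob

total = sum(joint_prob(x) for x in itertools.product([0, 1], repeat=3))
assert math.isclose(total, 1.0)   # a proper joint distribution
```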
Mitra and Dunson developed a 2-level variable selection model (Mitra and Dunson, 2010), in
which the variable selection is performed not only in the top level model relating the response
to covariates, but also in the covariate model characterizing the joint distribution functions in
Equation 2.19. In the re-analysis of the Dutch Salmonella Thompson outbreak data, this 2-level variable
selection model was applied as part of the complete Bayesian variable selection model. Some of
the parameters in θX were reasonably assumed to be zero due to sparse relationships among the
covariates (Jacobs et al., 2017). Therefore, a variable selection was performed in each covariate
model.
2.5 Complete Bayesian variable selection model
In the previous sections in this chapter, two types of logistic regression models, common variable
selection techniques and methods for missing data have been described. In this section, we
combine the methods into a complete Bayesian variable selection model. The complete Bayesian
variable selection model is composed of three different parts, namely, Bayesian variable selection,
misclassification correction and missing value imputation.
Food-borne disease outbreak data are “dynamic”: information on cases and controls is collected
over time during a food-borne disease outbreak investigation. For such dynamic data, time must
be taken into account. We deal with time by fitting the Bayesian variable selection model on the
data which are available at a certain date during the outbreak.
Additionally, one could add a time variable in the top model relating the response to covariates.
The way to perform Bayesian variable selection in the context of food-borne disease outbreak
investigations is not limited to the way described here. In Chapter 4, we assume that there is
only one responsible food product when generating new datasets. Then, for the Bayesian variable
selection models, we set a prior distribution on the probability for each food product covariate
to be in the model and choose the one with the highest one-sided posterior inclusion probability.
Alternatively, we could incorporate the information that there is only one responsible food
product by setting a prior distribution on the model size.
$$\begin{aligned}
y_i \mid x_i &\sim \mathrm{Bernoulli}(\mu_i) \\
\mu_i &= Se\,\pi_i + (1 - Sp)(1 - \pi_i) \\
\mathrm{logit}(\pi_i) &= \beta_0 + x_i\beta.
\end{aligned} \tag{2.20}$$
In the context of the food-borne outbreak datasets used in this thesis, we assume Sp = 1. A case
only entered the dataset if it had been twice laboratory-confirmed. Hence, it is safe to assume
that no non-infected subject was misclassified as a case, i.e. P(yi = 1 | ỹi = 0) = 0, indicating
that the specificity equals one.
$$\begin{aligned}
\beta_j \mid \gamma_j, \tau^2, c^2 &\sim \gamma_j\,\mathrm{N}(0, \tau^2 c^2) + (1 - \gamma_j)\,\mathrm{N}(0, \tau^2) \\
\gamma_j \mid \omega_j &\sim \mathrm{Bernoulli}(\omega_j) \\
\omega_j &\sim \mathrm{Beta}(a_{j,0}, b_{j,0}).
\end{aligned} \tag{2.21}$$
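The a priori behaviour of this hierarchy can be explored by simulation; in the Python sketch below (hyperparameter values are our own choice, purely illustrative) the fraction of prior draws coming from the slab matches the prior mean of ωj.

```python
import numpy as np

rng = np.random.default_rng(6)
n_draws, tau, c = 200_000, 0.1, 10.0
a0, b0 = 1.0, 1.0                      # Beta(1, 1) prior on the inclusion probability

omega = rng.beta(a0, b0, n_draws)      # omega_j ~ Beta(a0, b0)
gamma = rng.random(n_draws) < omega    # gamma_j | omega_j ~ Bernoulli(omega_j)
sd = np.where(gamma, tau * c, tau)     # slab sd if included, spike sd otherwise
beta = rng.normal(0.0, sd)             # beta_j ~ gamma N(0, tau^2 c^2) + (1 - gamma) N(0, tau^2)

slab_fraction = gamma.mean()           # should be close to E[omega] = a0 / (a0 + b0)
```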
2.5.3 Missing imputation with variable selection
Missing imputation with variable selection is applied as a component of the complete Bayesian
variable selection model. We assume that each partially observed covariate in Equation 2.19
depends on the previous covariates and is modelled by a generalized linear model with regression
coefficients $\theta_{X_j} = (\alpha_{0,j}, \alpha_{1,j}, \ldots, \alpha_{j-1,j})^{\top}$. Because all covariates are binary, a Bernoulli response
with a logistic regression model is used as the covariate model to obtain the probabilities in
Equation 2.19. How to choose the prior distributions of the α's and ω's in the covariate models
will be discussed in Section 4.6.1.
Chapter 3
Two datasets are used in this simulation study. The first one is from the Salmonella Bovis-
morbificans outbreak in 2016 to 2017 (Brandwagt et al., 2018) and the second one is from the
Salmonella Thompson 2012 outbreak in the Netherlands (Friesema et al., 2014). Our simulation
study is based on the first dataset. Both datasets reveal what real outbreak data look like and
provide reference information for setting up detailed simulation schemes, such as the
setup of missing covariates.
Figure 3.1: Number of all observations
products were merged into one pooled ham variable (raw, smoked and Coburg ham) and one
pooled cheese variable (unsliced, sliced and grated), during the analysis (Brandwagt et al., 2018),
resulting in 150 food products. All covariates except age are binary-valued. Age is a continuous
covariate which is standardized in the analysis.
The age distribution is a negatively skewed unimodal distribution with a mode age group of
70-79 (Figure 3.2). The cases were aged 5 to 89 (median 65.5) and the controls were aged 4 to
90 (median 69). For both groups, more than half were females: there were 14 females (58.3%)
in the case group and 20 females (54.1%) in the control group. Frequencies for observations, age
and gender, are summarized in Table 3.1.
A food product covariate takes the value 1 if a subject ate that product and the value 0
otherwise. A supermarket covariate takes the value 1 if a subject bought most of his or her
groceries at that supermarket and the value 0 otherwise. Food product covariates and
supermarket covariates that were not filled in are assumed to be zero. This is a reasonable
assumption, because subjects usually only mark the food products that they have eaten and the
supermarket where they have purchased most of their groceries. For food product covariates,
subjects were allowed to respond with “maybe” if they were not sure whether or not they had
eaten the product. In the analysis in 2017 (Brandwagt et al., 2018), a subject was assumed
to consume a product when he or she answered “maybe”, thus probably overestimating food
consumption. In our analysis, we treat these covariates as being missing.
The percentage of missing covariates per subject is up to 21.7% for cases and 35.5% for
Figure 3.2: Histogram of ages of all observations
Table 3.1: Characteristics of cases, controls and all subjects involved in the case-control studies
during the Salmonella Bovismorbificans outbreak
                 Cases                  Controls               All subjects
Age group        Male  Female  Total    Male  Female  Total    Male  Female  Total
0-9              1     0       1        4     0       4        5     0       5
10-19            0     0       0        0     0       0        0     0       0
20-29            0     1       1        1     1       2        1     2       3
30-39            1     1       2        2     2       4        3     3       6
40-49            3     1       4        0     3       3        3     4       7
50-59            0     2       2        0     0       0        0     2       2
60-69            1     3       4        6     3       9        7     6       13
70-79            2     4       6        3     10      13       5     14      19
≥ 80             2     2       4        1     1       2        3     3       6
Total            10    14      24       17    20      37       27    34      61
controls. The covariate with the highest percentage of missing values per covariate is smoked
sausage (25.0%) for cases and chicken breast (35.1%) for controls. Among 64.7% of the food
product covariates, the percentage of missing values per covariate for controls is higher than that
for cases, which reflects recall bias. The recall bias also existed for the pooled ham variable:
only 8.3% of cases responded with "maybe" to the consumption of the pooled ham variable,
while up to 27.0% of controls did so. In total, 12 subjects (19.67% of all subjects) responded
with "maybe".
3.2 Salmonella Thompson outbreak (2012)
On 15 August 2012, an increase in the number of cases of Salmonella Thompson infection in the
Netherlands was reported: that week, 11 S. Thompson cases were detected at the RIVM, and
four more had been detected two weeks earlier. An outbreak investigation was started in order to identify the source of
the outbreak and thereby prevent further disease spread. As part of the outbreak investigation,
the case-control study was conducted from 16 August 2012 to 28 September 2012 when smoked
fish was identified as the source. During the studies, four potential sources were indicated by the
case-control statistical analysis, namely minced meat (10 September), ready-to-eat raw vegetables
(17 September), ice cream (18 September) and finally smoked fish (24 September).
For each of the cases, four controls were drawn from the Dutch population from the same or
neighbouring municipality with comparable age and gender (Friedman et al., 2010). Finally, 109
cases and 193 controls participated in the case-control study. The numbers of cases and controls
are shown in Figure 3.3.
The food-consumption questionnaire was continuously updated during the outbreak inves-
tigation. Only food products that were investigated by all revisions of the questionnaire were
included in the dataset, resulting in 108 covariates (age, gender, 95 food products and 11 super-
market covariates).
Age has a bimodal distribution with two modes in the age groups of 10-19 years and 60-69
years (Figure 3.4). The median age of all subjects was 58 years (range: 2-93 years). The median
age of cases was 54 years (range: 2-93 years) and of controls was 60 years (range: 3-92 years).
Of the cases, 64.2% were female; of the controls, 72.0%. Summaries of the characteristics of
observations, age and gender, are given in Table 3.2.
For food product and supermarket covariates, the 0/1 values have the same meanings as
in Section 3.1. Similarly, food product covariates that were not filled in are assumed to be zero.
All of the supermarket covariates were filled in, so there is no need to impute the
supermarket covariates. In our analysis, we use the same definition of missing covariates as in
Section 3.1: a food product covariate with a "maybe" answer is considered missing.
Under this definition of missing covariates, the percentage of missing covariates per
subject is up to 39.6% for cases and 67.0% for controls. The covariate with the highest percentage
of missing values per covariate is iceberg lettuce (15.6%) for cases and minced beef (21.2%) for
controls. Among 74.7% of the food product covariates, the percentage of missing values per
covariate is higher for controls than for cases. The percentage of cases who responded with
"maybe" to the consumption of the smoked fish variable is 9.2%, slightly higher than the
corresponding percentage for controls (7.3%).
Table 3.2: Characteristics of cases, controls and all observations involved in the case-control
studies during the Salmonella Thompson outbreak
                              109 cases    193 controls   302 subjects
Sex             Female        70 (64.2)    139 (72.0)     209 (69.2)
                Male          39 (35.8)    54 (28.0)      93 (30.8)
Age group       0-9           11 (10.1)    14 (7.3)       25 (8.3)
in years        10-19         13 (11.9)    18 (9.3)       31 (10.3)
                20-29         12 (11.0)    10 (5.2)       22 (7.3)
                30-39         10 (9.2)     10 (5.2)       20 (6.6)
                40-49         7 (6.4)      13 (6.7)       20 (6.6)
                50-59         12 (11.0)    29 (15.0)      41 (13.6)
                60-69         13 (11.9)    49 (25.4)      62 (20.5)
                70-79         15 (13.8)    30 (15.5)      45 (14.9)
                ≥ 80          16 (14.7)    20 (10.4)      36 (11.9)
Chapter 4
Simulation study
4.2 Simulating new datasets
4.2.1 General description
The simulation study is based on the dataset from the Salmonella Bovismorbificans outbreak
in 2016 to 2017 (Brandwagt et al., 2018). The responsible contaminated food product in this
outbreak is a smoked Coburg ham, which falls in the category of the pooled ham covariate. In
this simulation study, we still assume that the source is from the pooled ham covariate by keeping
the pooled ham covariate fixed. New datasets are generated by shuffling a certain proportion of
food product covariates together within one stratum of cases and controls with similar age and
gender. Shuffling covariates is achieved by randomly reassigning covariate values of one subject
to a different subject in the same stratum.
Generally, variable selection methods are sensitive to correlation structures among covariates.
Therefore, in order to make the assumption that the most likely source is the pooled ham valid,
the correlation structures should be kept as constant as possible in the simulated data generation
process. On the other hand, the simulation study aims to answer the research questions by
testing and comparing Bayesian methods on different datasets. Hence, the proportion of food
product covariates to be shuffled should be large enough to break the original correlation
structures among food product covariates, but small enough to keep some of the correlations of
the food product covariates with the pooled ham covariate intact. As a trade-off, the proportion
of shuffled food product covariates is set to 50%.
In addition, the consumption of food products is probably influenced by the confounding
variables, age and gender. To control for age and gender, we shuffle food product covariates
within each stratum which is constructed based on gender and age groups. Considering that
people of different ages have different dietary habits and that there should be at least one case-
control pair in each stratum, six age-gender strata were constructed: 0-19 years (children and
teenagers), 20-59 years (young and middle-aged adults) and ≥ 60 years (older adults) for both
females and males. Because no females aged 0-19 years participated in the case-control study,
there are five age-gender strata in total.
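The stratified shuffling described above can be sketched as follows (a minimal illustration with a hypothetical data layout in which each subject is a dict of covariates; not the thesis code). One random permutation per stratum is applied to all selected covariates jointly, so the selected columns are shuffled together:

```python
import random

def stratum(subject):
    """Age-gender stratum: 0-19, 20-59 and >= 60 years, for each gender."""
    band = 0 if subject["age"] < 20 else (1 if subject["age"] < 60 else 2)
    return (subject["gender"], band)

def shuffle_within_strata(subjects, covariates, rng):
    """Randomly reassign the values of the selected covariates among subjects
    of the same stratum, using one permutation for all covariates jointly and
    leaving all other columns fixed."""
    groups = {}
    for idx, s in enumerate(subjects):
        groups.setdefault(stratum(s), []).append(idx)
    for members in groups.values():
        perm = members[:]
        rng.shuffle(perm)  # random permutation within the stratum
        for cov in covariates:
            vals = [subjects[i][cov] for i in perm]
            for i, v in zip(members, vals):
                subjects[i][cov] = v
```

Because only values within a stratum are permuted, the per-stratum distribution of each shuffled covariate is preserved, while associations with unshuffled covariates (such as the pooled ham) are broken.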
The simulation scheme used in this simulation study is summarized in Figure 4.1. The steps
of creating these scenarios in Figure 4.1 are listed in the following sections.
4.2.2 Algorithm
For each of 100 iterations, the following steps are taken:
• Step 2 Randomly select 50% of 150 food product covariates except the pooled ham covariate.
• Step 3 Randomly assign values of selected covariates of one subject to a different subject
in the same stratum.
Figure 4.1: A flow graph of the simulation scheme
• Step 1 Sample a variable j from the discrete uniform distribution: j ∼ DiscreteUnif(1, 4).
• Step 1 Sample a variable i from the discrete uniform distribution: i ∼ DiscreteUnif(5, 19).
• Step 2 Randomly draw i subjects from the non-NA pooled ham covariate values and set
those pooled ham covariate values to missing.
Finally, save the above 100 new datasets.
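The missingness-injection step above can be sketched as follows (a minimal illustration, assuming missing values are coded as None; not the thesis code):

```python
import random

def add_missingness(pooled_ham, rng):
    """Draw i ~ DiscreteUnif(5, 19) and set i randomly chosen non-missing
    pooled-ham values to missing (coded here as None)."""
    i = rng.randint(5, 19)  # inclusive bounds, as in the step above
    observed = [k for k, v in enumerate(pooled_ham) if v is not None]
    for k in rng.sample(observed, i):
        pooled_ham[k] = None
    return pooled_ham
```

With 61 subjects, drawing i between 5 and 19 corresponds to increasing the missingness rate in the pooled ham covariate by roughly 8.20% to 31.15%, which matches the randomized increases described in Chapter 6.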
this practice is not feasible in this thesis project. First, multivariable models probably cannot be
fitted on the data up to the first early dates due to convergence failures in logistic regression.
Second, multivariable models only give us estimates for the covariates which are selected by
the univariable models; we cannot obtain estimates for the other covariates.
our case because the posterior distributions were unknown in the beginning. On the other hand,
the HW diagnostic has no requirement on initial values. However, the HW diagnostic has the
disadvantage that it can only be used for testing the convergence of a single chain; it cannot
give a combined result over multiple chains. Considering the advantages and disadvantages
of these two diagnostic tests, we applied both.
For the implementation of the complete Bayesian variable selection model, we ran 8 chains
with a burn-in of 2000 iterations and then a further 1875 iterations per chain, resulting in a
posterior sample of size 15000. The required time was 3478.067 seconds (i.e. 0.966 hours). The
convergence of the 161 ω's and 1 Se, i.e. 162 parameters, was tested. The HW diagnostic showed
that for each chain, the number of parameters which failed the convergence diagnosis varied
from 15 to 25. The chains of the other parameters passed the convergence test when started at
iteration 1, 189, 376, 564 or 751. In the Gelman-Rubin diagnostic, the point estimate
of the PSRF of two of the parameters was 1.1; for the rest of the parameters, the point estimate of
the PSRF was between 1.00 and 1.09. Combining the results of the two diagnostic tests, we
conclude that adequate or partial convergence had been reached in the chains of most of the
parameters.
When 8 chains with a burn-in of 3000 iterations and a further 1875 iterations per chain were run,
the results of the convergence diagnostics improved. In the HW diagnostic, for each chain,
12-22 parameters failed the convergence test. The Gelman-Rubin diagnostic showed that all
parameters had point estimates of the PSRF less than 1.1. However, the required time increased
to 4342.600 seconds (i.e. 1.21 hours). It is possible that all parameters would pass these two
diagnostic tests if the number of burn-in iterations were increased greatly.
For the implementation of the model composed of the Bayesian variable selection and missing
value imputation parts, we ran 8 chains with a burn-in of 2000 iterations and then a further 1875
iterations per chain. The required time was 3377.166 seconds (i.e. 0.938 hours). A total of 161
parameters were tested in the convergence diagnosis. In the HW diagnostic, for each chain,
3-7 parameters failed the convergence test. In the Gelman-Rubin diagnostic, the point estimates
of the PSRF of all parameters were either 1.00 or 1.01. The convergence test results did not
improve when we ran 8 chains with a burn-in of 3000 iterations and a further 1875 iterations
per chain. In the HW diagnostic, 5-11 parameters failed the convergence test for each chain. In
the Gelman-Rubin diagnostic, the point estimate of the PSRF for all parameters was either 1.00
or 1.01. On the other hand, the required time increased greatly, to 5414.375 seconds (i.e. 1.50
hours).
Given the time required, we chose to run 8 chains with a burn-in of 2000 iterations and
then a further 1875 iterations per chain when fitting the Bayesian variable selection
models. This choice provides relatively good convergence while also limiting the computation
time, which is important given the time constraints of this thesis.
4.6.4 Model fitting steps
There are five models which are fitted on the datasets, namely, the standard logistic regression
model and four different versions of the Bayesian variable selection models. The five models are:
1. The standard logistic regression model.
2. Only Bayesian variable selection. The misclassified responses are not corrected here and
thus the response model is the one in Equation 2.1. In addition, the missing covariates are
set to 1.
3. Bayesian variable selection and misclassification correction. The missing covariates are set
to 1.
4. Bayesian variable selection and missing value imputation. The response model is the logistic
regression model in Equation 2.1.
5. The complete Bayesian variable selection model, with both misclassification correction and
missing value imputation.
For each dataset generated in Sections 4.2-4.5, the above five models are fitted:
• Step 1 Divide each dataset into five subsets which contain data up to 2 March, 9 March,
16 March, 22 March and 5 April.
• Step 2 Fit the five models on the five subsets. The model fitting is stopped as soon
as a model identifies the pooled ham as the most likely suspect; otherwise, the model
fitting is continued until the model has been fitted to the complete dataset (5 April). For the
standard logistic regression model, the pooled ham covariate is regarded as the most likely
suspect if it has a p-value of less than 0.05 and the highest positive estimated coefficient
among all food product covariates. For the Bayesian variable selection models, the pooled
ham is considered the most likely suspect if its one-sided posterior inclusion probability,
P(βham > 0.05 | Data), is the highest among the food product covariates.
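The two decision rules in Step 2 can be sketched as follows (hypothetical dictionaries mapping covariate names to fitted statistics; not the thesis code):

```python
def lr_most_likely_suspect(pvalues, coefs, food_products, alpha=0.05):
    """Standard logistic regression rule: among significant food products
    (p < alpha), return the one with the highest positive coefficient,
    or None if no product qualifies."""
    candidates = [fp for fp in food_products
                  if pvalues[fp] < alpha and coefs[fp] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda fp: coefs[fp])

def bayes_most_likely_suspect(incl_prob, food_products):
    """Bayesian rule: return the food product with the highest one-sided
    posterior inclusion probability P(beta > 0.05 | Data)."""
    return max(food_products, key=lambda fp: incl_prob[fp])
```

Note that the logistic regression rule can return no suspect at all on a given date, whereas the Bayesian rule always ranks some product highest; the thesis additionally requires the top-ranked product to be the pooled ham before recording a detection.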
For the standard logistic regression model, if the pooled ham is found to be the most likely suspect
at a certain date, we record this date and the time consumed by fitting the model on the data up
to this date. Otherwise, nothing is recorded. For the Bayesian variable selection models, if
the pooled ham is found to be the most likely suspect at a certain date, we record this date, the
time required for fitting the model on the data up to this date, and the one-sided posterior inclusion
probability of the pooled ham covariate, P(βham > 0.05 | Data), at this date. Otherwise, nothing
is recorded. We refer to these dates as the earliest detection dates.
models, we record the earliest detection date, the required time and the one-sided posterior
inclusion probability of the pooled ham covariate, P(βham > 0.05 | Data), at this date. For
model comparisons, the primary evaluation criterion is the correct detection number, which is
the number of datasets out of the 100 datasets in which a model correctly detected the contaminated
food product. The medians of the recorded earliest detection dates and the medians and standard
deviations of P(βham > 0.05 | Data) at the earliest detection dates are used as secondary
evaluation criteria. We use the secondary evaluation criteria for model comparison only if two
models give similar correct detection numbers. In addition, the average required time
of each model in each scenario is also compared, in order to gain insight into how long a
model takes on average.
Chapter 5
Results
In this chapter, we provide figures of the correct detection number, the frequency of earliest
detection dates, the one-sided posterior inclusion probabilities of the pooled ham covariate and
the average required time for the different models in the four scenarios to compare models. Tables
of performance measures on each model are provided in Appendix E. We refer to the scenarios
in the following way:
• S1: original response misclassification rate and missingness rate in the pooled ham covariate,
• S2: increased response misclassification rate,
• S3: increased missingness rate in the pooled ham covariate,
• S4: increased response misclassification rate and missingness rate in the pooled ham covariate.
Figure 5.1: A plot of the correct detection numbers for the different models in the four scenarios.
On the x-axis, we have scenarios: S1: original response misclassification rate and missingness
rate in the pooled ham covariate, S2: increased response misclassification rate, S3: increased
missingness rate in the pooled ham covariate, and S4: increased response misclassification rate
and missingness rate in the pooled ham covariate. In the legend, we have models: LR: standard
logistic regression model, M1: model with only Bayesian variable selection, M2: model with
Bayesian variable selection and misclassification correction, M3: model with Bayesian variable
selection and missing value imputation, and M4: complete Bayesian variable selection model.
decrease in the correct detection number. The increase in both the response misclassification
rate and the missingness rate in the pooled ham covariate has the largest negative influence on
the correct detection number for Bayesian variable selection models. For the standard logistic
regression model, the increase in the response misclassification rate has the largest negative
influence.
In particular, we are interested in two cases. First, we consider the performance of the model
with misclassification correction in Scenario S2 (i.e. the scenario with an increased response
misclassification rate). The model with misclassification correction has a correct detection
number of 41 in Scenario S1; its correct detection number decreases to 35 in Scenario S2, a
decrease of 14.6%. In contrast, the model with only Bayesian variable selection
experiences a decrease of 34.6% (the correct detection number decreases from 26 to 17). This
indicates that misclassification correction makes the performance of the Bayesian variable selection
model more resistant to increased response misclassification.
Second, we consider the performance of the model with missing value imputation in the
scenario S3 (i.e. the scenario with the increased missingness rate in the pooled ham covariate).
The model with missing value imputation has a correct detection number of 88 in Scenario
S1, which decreases to 50 in Scenario S3, a decrease of 43.2%.
In contrast, the model with only Bayesian variable selection experiences a decrease of 11.5%
(the correct detection number decreases from 26 to 23). This indicates that the performance
of the model with Bayesian variable selection and missing value imputation is more sensitive
to the increase in the missingness rate than that of the model with only Bayesian variable
selection. However, we expected missing value imputation to make the performance of the
Bayesian variable selection model more resistant to increased missingness. A possible
explanation of this counterintuitive result is discussed in Section 6.1.
Figure 5.2: A plot of the correct detection numbers for the different models in the four scenarios.
On the x-axis, we have models: LR: standard logistic regression model, M1: model with only
Bayesian variable selection, M2: model with Bayesian variable selection and misclassification
correction, M3: model with Bayesian variable selection and missing value imputation, and M4:
complete Bayesian variable selection model.
As previously mentioned in Section 1.3, the second sub-question of this thesis is which parts of
the Bayesian variable selection model contribute to the performance in each of the four scenarios.
Figure 5.2 facilitates answering this sub-question.
Figure 5.2 shows that in each scenario the complete Bayesian variable selection model performs
best and the standard logistic regression model performs worst in terms of the
correct detection number. Bayesian variable selection, misclassification correction and missing
value imputation all contribute positively to the model performance. In particular, missing value
imputation contributes the most among these three components.
5.3 Average required time at each date
Figure 5.5 shows the average required time at each date for each model in each scenario. The
required time is recorded only when the pooled ham covariate is successfully detected; when
a model fails to detect the pooled ham, the required time is not recorded. Therefore, missing
values exist in the recorded required times. We assume that the required time is not affected by
whether a model can successfully detect the pooled ham covariate. Under this assumption,
for each model in each scenario, the average required time at each date shown in Figure 5.5
is computed by averaging the recorded required times at that date. Although this
estimate of the average required time is unbiased under the above assumption, it
is not precise when the number of recorded required times is small. For example, for the
standard logistic regression model in Scenarios S2 and S4, the estimate of the average required
time at each date is based on a sample size of less than 6, and these estimates are not precise.
This may explain why the average required time does not increase monotonically with
date for the standard logistic regression model in Scenarios S2 and S4.
Figure 5.5 provides some general information about the required time for each model in each
scenario. From this figure, we can see that the standard logistic regression model requires the
least time: the average required time is less than 1 second. The model with only Bayesian
variable selection and the model with Bayesian variable selection and misclassification correction
consume relatively little time; on average, the former requires less than 35 seconds and the
latter less than 2 minutes. When the complete Bayesian variable selection model or
the model with Bayesian variable selection and missing value imputation is applied, the average
required time increases greatly: on average, they consume 0.5 to 1 hour. This substantial
increase in the average required time indicates that missing value imputation is the most
time-consuming component.
In addition, looking at the average required time for a model across dates in each scenario, we
find that in general the average required time increases monotonically with date. This suggests
that the computation time of a model increases with the sample size.
Figure 5.3: Histograms of earliest detection dates and failed detections for the different mod-
els in the four scenarios. In the horizontal direction, we have scenarios: S1: original response
misclassification rate and missingness rate in the pooled ham covariate, S2: increased response
misclassification rate, S3: increased missingness rate in the pooled ham covariate, and S4: in-
creased response misclassification rate and missingness rate in the pooled ham covariate. In the
vertical direction, we have models: LR: standard logistic regression model, M1: model with only
Bayesian variable selection, M2: model with Bayesian variable selection and misclassification
correction, M3: model with Bayesian variable selection and missing value imputation, and M4:
complete Bayesian variable selection model.
Figure 5.4: Boxplots of the one-sided posterior inclusion probabilities of the pooled ham covariate
for the different models in the four scenarios when the pooled ham covariate is detected as the
most likely suspect at the earliest detection date. In the horizontal direction, we have models:
M1: model with only Bayesian variable selection, M2: model with Bayesian variable selection
and misclassification correction, M3: model with Bayesian variable selection and missing value
imputation, and M4: complete Bayesian variable selection model. In the vertical direction, we
have scenarios: S1: original response misclassification rate and missingness rate in the pooled
ham covariate, S2: increased response misclassification rate, S3: increased missingness rate in
the pooled ham covariate, and S4: increased response misclassification rate and missingness rate
in the pooled ham covariate. In the boxplot, the lower and upper hinges correspond to the first
and third quartiles. The band inside the box is the second quartile (the median). The ends
of the whiskers represent the lowest datum still within 1.5 IQR of the lower quartile and the
highest datum still within 1.5 IQR of the upper quartile (where IQR is the inter-quartile range
or distance between the first and third quartiles). Data points located outside the whiskers are
outliers.
Figure 5.5: Plots of the average required time at each date for each of the five
models in each of the four scenarios. In the horizontal direction, we have scenarios: S1: original
response misclassification rate and missingness rate in the pooled ham covariate, S2: increased
response misclassification rate, S3: increased missingness rate in the pooled ham covariate, and
S4: increased response misclassification rate and missingness rate in the pooled ham covariate.
On the x-axis, we have dates: D1: 2 March, D2: 9 March, D3: 16 March, D4: 22 March and
D5: 5 April. In the vertical direction, we have models: LR: standard logistic regression model,
M1: model with only Bayesian variable selection, M2: model with Bayesian variable selection
and misclassification correction, M3: model with Bayesian variable selection and missing value
imputation, and M4: complete Bayesian variable selection model. Note that plots of different
models have different scales of y-axis.
Chapter 6
Discussion
clear and consistent guidance on how to choose the prior distributions. To some extent, the choice
of prior distributions is a subjective decision. In addition, different choices of prior distributions
may change the final result we obtain, i.e. the most likely suspect in the food-borne
disease outbreak.
However, one should not ignore that the standard logistic regression used in outbreak
investigations is subject to the same problem. The choice of the cut-off p-value in the pre-selection
is subjective; there is no standard screening criterion. Different choices of the p-value affect the
final result obtained from the standard logistic regression. The problem of subjective decisions
exists in both the Bayesian approach and the standard logistic regression.
1. Data simulation. Data simulation is a crucial component of this thesis. At the same time,
it is the most difficult component. On one hand, we are faced with inherent difficulties of
simulating food-borne disease outbreak data. As explained in Section 4.1, in the design
of a simulation study for food-borne disease outbreak data, one needs to capture features
of food-borne disease outbreak data. On the other hand, we need to consider the time
constraint on simulations. First, the total time of the whole thesis project is tight. Second,
the process of model fitting is unavoidably time-consuming. Hence, there is not much time
left for the process of simulating data.
In this thesis, we designed a simulation study based on real outbreak data and applied shuffling
to generate new datasets. This simulation method enables us to quickly simulate food-
borne disease outbreak data under the time constraint, while keeping the simulated data as
realistic as possible.
2. Scenario settings. In this simulation study, we set four different scenarios: (i) original
response misclassification rate and missingness rate in the pooled ham covariate, (ii) in-
creased response misclassification rate (+10%), (iii) increased missingness rate (+20%) in
the pooled ham covariate, (iv) increased response misclassification rate (+10%) and miss-
ingness rate (+20%) in the pooled ham covariate. In setting the latter three scenarios, we
introduce randomness by increasing the response misclassification rate by 4.17%-16.67%
and increasing the missingness rate by 8.20%-31.15%. On average, the expected increases
of the response misclassification rate and missingness rate are reached. Introducing ran-
domness brings more variations to the datasets compared with the practice in which the
response misclassification rate is increased by 10% and the missingness rate is increased by
20% in each dataset.
42
3. Model fitting. From our previous experience, the complete Bayesian variable selection
model and the model composed of the Bayesian variable selection and missing value
imputation parts are the most time-consuming among the five models. We implemented these
two models using different choices of burn-in iterations and the same posterior sample
size, and performed convergence diagnostic tests. The final choice of burn-in iterations
and posterior sample size provides relatively good convergence while also limiting the
computation time.
the generated datasets compared with our current method, because on average, the datasets are
the same as the original outbreak dataset.
Also, one could use a model-based simulation design, which is a popular design for a simulation
study from scratch. However, in the context of food-borne disease outbreaks, this design is not
practical: such a model would have to be complex enough to capture all the features of a real
food-borne disease dataset, and there is no simple model that one could use to generate new
datasets. We have already shown the poor performance of the standard logistic regression model.
In an ideal situation, one could assess the performance of different versions of Bayesian vari-
able selection models using hundreds or even thousands of real food-borne disease outbreak
datasets. This would be the most informative way to evaluate model performance. However, we
do not have a large volume of real outbreak datasets at hand. A simulation study is still needed.
6.3 Conclusions
In this thesis, we studied how different parts of Bayesian variable selection models affect model
performance in scenarios with (i) original response misclassification rate and missingness rate in
the pooled ham covariate, (ii) increased response misclassification rate (+10%), (iii) increased
missingness rate (+20%) in the pooled ham covariate, (iv) increased response misclassification
rate (+10%) and missingness rate (+20%) in the pooled ham covariate. This simulation study
reveals the following findings:
(i) For the four different versions of Bayesian variable selection models studied in this thesis,
the increase in the response misclassification rate or the missingness rate in the assumed
responsible food product covariate or the increase in both results in a decrease in the correct
detection number;
(ii) The increase in both the response misclassification rate and the missingness rate in the
assumed responsible food product covariate has the largest negative impact on the correct
detection number;
(iii) Bayesian variable selection, misclassification correction and missing value imputation all
contribute positively to the model performance in the context of food-borne disease outbreaks.
Although missing value imputation is the most computationally expensive, it contributes
the most to the model performance among these three components.
Based on the above findings, we recommend applying the complete Bayesian variable selection
model in the statistical analysis of food-borne disease outbreak data. We cannot simplify the
complete Bayesian variable selection model without hampering model performance.
Bibliography
Belin, T. R., Hu, M. Y., Young, A. S., and Grusky, O. (2000). Using multiple imputation to
incorporate cases with missing items in a mental health services study. Health Services and
Outcomes Research Methodology, 1(1):7–22.
Best, N. G., Cowles, K., Vines, K., and Plummer, M. (2006). CODA: Convergence Diagnosis
and Output Analysis for MCMC. R News, 6(1):7–11.
Brandwagt, D., van den Wijngaard, C., Tulen, A., Mulder, A., Hofhuis, A., Jacobs, R., Heck,
M., Verbruggen, A., van den Kerkhof, J., Slegers-Fitz-James, I., Mughini-Gras, L., and Franz,
E. (2018). Outbreak of Salmonella Bovismorbificans in the Netherlands, associated with the
consumption of uncooked ham products, 2016 to 2017. Euro Surveillance, 23(1):pii=17–00335.
Centers for Disease Control and Prevention (2015). Guide to Confirming an Etiology in
Foodborne Disease Outbreak. URL https://www.cdc.gov/foodsafety/outbreaks/investigating-
outbreaks/confirming_diagnosis.html [Accessed: 15 October, 2015].
Dwyer, D. M., Strickler, H., Goodman, R. A., and Armenian, H. K. (1994). Use of case-control
studies in outbreak investigations. [Review]. Epidemiologic Reviews, 16(1):109–123.
Erler, N. S., Rizopoulos, D., van Rosmalen, J., Jaddoe, V. W. V., Franco, O. H., and Lesaffre,
E. M. E. H. (2016). Dealing with missing covariates in epidemiologic studies: a comparison
between multiple imputation and a full Bayesian approach. Statistics in Medicine, 35(17):2955–
2974.
European Food Safety Authority (EFSA) and European Centre for Disease Prevention and Con-
trol (ECDC) (2015). The European Union summary Report on trends and sources of zoonoses,
zoonotic agents and foodborne outbreaks in 2014. EFSA Journal, 13(12):4329 [191 pp.].
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
Friesema, I., de Jong, A., Hofhuis, A., Heck, M., van den Kerkhof, H., de Jonge, R., Hameryck,
D., Nagel, K., van Vilsteren, G., van Beek, P., Notermans, D., and Van Pelt, W. (2014). Large
outbreak of Salmonella Thompson related to smoked salmon in the Netherlands, August to
December 2012. Eurosurveillance, 19(39):1–8.
Gelman, A., Carlin, J., Stern, H., and Rubin, D. (2004). Bayesian Data Analysis. Chapman and
Hall/CRC, Boca Raton, second edition.
Gelman, A. and Rubin, D. B. (1992). Inference from Iterative Simulation Using Multiple Se-
quences. Statistical Science, 7(4):457–472.
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of
the American Statistical Association, 88(423):881–889.
Gilbert, R., Martin, R. M., Donovan, J., Lane, J. A., Hamdy, F., Neal, D. E., and Metcalfe, C.
(2016). Misclassification of outcome in case-control studies: Methods for sensitivity analysis.
Statistical Methods in Medical Research, 25(5):2377–2393.
Hald, T., Aspinall, W., Devleesschauwer, B., Cooke, R., Corrigan, T., Havelaar, A. H., Gibb,
H. J., Torgerson, P. R., Kirk, M. D., Angulo, F. J., Lake, R. J., Speybroeck, N., and Hoffmann,
S. (2016). World Health Organization estimates of the relative contributions of food to the
burden of disease due to selected foodborne hazards: A structured expert elicitation. PLoS
ONE, 11(1):1–35.
Harrell, F. (2015). Regression modeling strategies with applications to linear models, logistic and
ordinal regression, and survival analysis. Springer, New York, second edition.
Heidelberger, P. and Welch, P. D. (1983). Simulation Run Length Control in the Presence of an
Initial Transient. Operations Research, 31(6):1109–1144.
Hosmer, D., Lemeshow, S., and Sturdivant, R. (2013). Applied Logistic Regression. John Wiley
and Sons, Hoboken, third edition.
Ibrahim, J. G., Chen, M. H., and Lipsitz, S. R. (2002). Bayesian Methods for Generalized Linear
Models with Covariates Missing at Random. The Canadian Journal of Statistics, 30(1):55–78.
Jacobs, R., Lesaffre, E., Teunis, P. F., Höhle, M., and van de Kassteele, J. (2017). Identifying
the source of food-borne disease outbreaks: An application of Bayesian variable selection.
Statistical Methods in Medical Research, pages 1–15.
Last, J. M. (2000). A Dictionary of Epidemiology. Oxford University Press, New York, fourth
edition.
Lesaffre, E. and Albert, A. (1989). Partial Separation in Logistic Discrimination. Journal of the
Royal Statistical Society. Series B (Methodological), 51(1):109–116.
Lesaffre, E. and Lawson, A. B. (2012). Bayesian Biostatistics. John Wiley and Sons, Chichester.
Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. Wiley, New
York.
Matignon, R. (2005). Neural Network Modeling Using SAS Enterprise Miner. AuthorHouse,
Bloomington.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall, London,
second edition.
Mitra, R. and Dunson, D. (2010). Two-level stochastic search variable selection in GLMs with
missing predictors. International Journal of Biostatistics, 6(1).
Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized linear models.
Journal of the Royal Statistical Society. Series B: Statistical Methodology, 69(4):659–677.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs
sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing
(DSC 2003), pages 20–22.
Pogreba-Brown, K., Ernst, K., and Harris, R. (2014). Case-case methods for studying enteric
diseases: A review and approach for standardization. OA Epidemiology, 7(1):1–9.
Raphael, K. (1987). Recall bias: A proposal for assessment and control. International Journal
of Epidemiology, 16(2):167–170.
Tang, L., Lyles, R. H., King, C. C., Celentano, D. D., and Lo, Y. (2015). Binary regression with
differentially misclassified response and exposure variables. Statistics in Medicine, 34(9):1605–
1620.
Thomas, D., Stram, D., and Dwyer, J. (1993). Exposure measurement error: influence on
exposure-disease relationships and methods of correction. Annual Review of Public Health,
14:69–93.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal
Statistical Society B, 58(1):267–288.
Appendix A
market1 <- which(colnames(quest.data) == "ah")
quest.data[, market1:food2][is.na(quest.data[, market1:food2])] <- 0
for (i in 1:nrow(quest.data)) {
  if (quest.data$brauwham[i] == 1 | quest.data$bgerham[i] == 1 |
      quest.data$bcoburgham[i] == 1) {
    quest.data$pooledHam[i] <- 1
  }
  if (quest.data$brauwham[i] == 0 & quest.data$bgerham[i] == 0 &
      quest.data$bcoburgham[i] == 0) {
    quest.data$pooledHam[i] <- 0
  }
}
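The row-wise loop above can also be expressed in a single vectorized step. The sketch below uses a toy stand-in for `quest.data` (illustrative values only) and assumes, as in the preceding lines, that the three ham indicator columns contain no missing values:

```r
# Toy stand-in for quest.data (illustrative values only)
quest.data <- data.frame(brauwham   = c(1, 0, 0, 1),
                         bgerham    = c(0, 0, 1, 1),
                         bcoburgham = c(0, 0, 0, 0))
# Vectorized pooling: pooledHam is 1 if any of the three ham products
# was consumed, and 0 otherwise
quest.data$pooledHam <- as.integer(quest.data$brauwham == 1 |
                                   quest.data$bgerham == 1 |
                                   quest.data$bcoburgham == 1)
quest.data$pooledHam  # 1 0 1 1
```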
# Load data from the Salmonella Thompson outbreak
# mydata <- load_data(filename = "Salmonella_2012.csv", ham = 0)
# Divide data into two groups: the case group and the control group
case.data <- mydata[mydata$case == 1, ]
control.data <- mydata[mydata$case == 0, ]
case.no <- nrow(case.data)
control.no <- nrow(control.data)
# Epi-curve
library("epitools")
date <- as.Date(case.data$date, format = "%m/%d/%Y")
x <- epicurve.weeks(date, format = "%y-%m-%d", axisnames = FALSE,
                    xlab = "Week of Year", ylab = "Cases per week",
                    tick.offset = 0.5, space = 0.5)
axis(1, at = x$xvals, labels = x$cweek)
axis(1, at = xx$xvals, labels = xx$cweek)
# Missing covariates
# Percentages of missing covariates per covariate for cases and controls
# Column index of the first food product
food.1 <- which(colnames(mydata) == "kipfilet")
na_numbers.total <- sapply(mydata[, food.1:ncol(mydata)],
                           function(y) sum(is.na(y)))
na_percentage.total <- na_numbers.total / nrow(mydata)
na.total <- data.frame(na_numbers.total, na_percentage.total)
# Percentages of missing pooledHam (vis_gerookt) values for cases and controls
na_count[which(rownames(na_count) == "pooledHam"), ]
# na_count[which(rownames(na_count) == "vis_gerookt"), ]
  if (na_count$na_percentage.case[i] < na_count$na_percentage.control[i]) {
    j[i] <- 1
  }
}
sum(j) / nrow(na_count)
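The comparison that this fragment performs (its loop header falls on the previous page) can also be written without an explicit loop. The `na_count` values below are illustrative stand-ins, not outbreak data:

```r
# Assumed na_count layout: one row per covariate, with the missingness
# percentages for cases and controls (illustrative values)
na_count <- data.frame(na_percentage.case    = c(0.10, 0.30, 0.05),
                       na_percentage.control = c(0.20, 0.25, 0.15))
# Proportion of covariates with lower missingness among cases than controls
mean(na_count$na_percentage.case < na_count$na_percentage.control)  # 2/3
```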
Appendix B
# Replace NA with 1
mydata[is.na(mydata)] <- 1
    }
  }
  sig.var <- sig.var[sig.var != 0]
  return(sig.var)
}
library(MASS)
multi_model <- function(sig.var, dataset) {
  # Perform multivariate analysis and build a final model
  # using a backward variable selection based on the AIC.
  #
  # Args:
  #   sig.var: A vector containing the indices of the covariates
  #            selected in the univariable analysis.
  #   dataset: A data frame which contains the outbreak data.
  #
  # Returns:
  #   A result summary of the final model.
  dataset$age <- (dataset$age - mean(dataset$age)) / sd(dataset$age)
  data.sig <- cbind(dataset$case, dataset$age, dataset$geslacht,
                    dataset[, sig.var])
  colnames(data.sig)[1] <- "case"
  colnames(data.sig)[2] <- "age"
  colnames(data.sig)[3] <- "geslacht"
  logitMod_mul <- glm(case ~ ., data = data.sig,
                      family = binomial(link = "logit"))
  step <- stepAIC(logitMod_mul, scope = list(lower = ~ age + geslacht),
                  direction = "backward", trace = 0)
  logitMod_final <- glm(formula(step), data = data.sig,
                        family = binomial(link = "logit"))
  summary(logitMod_final)$coefficients
}
date2 <- subset(mydata, as.Date(mydata$date, "%m/%d/%Y") <= as.Date("2017-03-09"))
date2.result <- multi_model(uni_model(date2), date2)
date3 <- subset(mydata, as.Date(mydata$date, "%m/%d/%Y") <= as.Date("2017-03-16"))
date3.result <- multi_model(uni_model(date3), date3)
date4 <- subset(mydata, as.Date(mydata$date, "%m/%d/%Y") <= as.Date("2017-03-22"))
date4.result <- multi_model(uni_model(date4), date4)
Appendix C
mydata <- load_data(filename = "Ham_2017.csv", ham = 1)
new.dataset <- vector("list", 100)
# MCAR
# Indices of the non-NA values of the pooled ham covariate
missingness.dataset <- new.dataset
nonNA.index <- which(!is.na(mydata$pooledHam))
for (i in 1:100) {
  set.seed(i)
  # The number of missing values is between 5 and 19.
  missing.index <- sample(nonNA.index, sample(5:19, 1))
  for (j in 1:length(mydata$pooledHam)) {
    missingness.dataset[[i]]$pooledHam[j] <- ifelse(j %in% missing.index,
                                                   NA, mydata$pooledHam[j])
  }
}
save(missingness.dataset, file = "missingness_dataset.Rda")
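The inner loop over `j` can be collapsed into a single `replace()` call. The sketch below reproduces the same MCAR deletion scheme on a toy stand-in for `mydata` (illustrative values only):

```r
# Toy stand-in for mydata (illustrative values; 24 observed, 6 missing)
mydata <- data.frame(pooledHam = rep(c(1, 0, NA, 1, 0), times = 6))
nonNA.index <- which(!is.na(mydata$pooledHam))
missingness.dataset <- vector("list", 100)
for (i in 1:100) {
  set.seed(i)
  missingness.dataset[[i]] <- mydata
  missing.index <- sample(nonNA.index, sample(5:19, 1))
  # Vectorized: set the sampled positions to NA in one step
  missingness.dataset[[i]]$pooledHam <- replace(mydata$pooledHam,
                                                missing.index, NA)
}
```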
Appendix D
dates <- c("2017-03-02", "2017-03-09", "2017-03-16", "2017-03-22", "2017-04-05")
setClass(Class = "Information", representation(time = "numeric",
         suspectedfood = "character", inclupro.highest = "numeric",
         dif = "numeric"))
                         & length(unique(x)) == 1)
if (sum(remove.market) != 0) {
  markets[names(remove.market[which(remove.market)])] <-
    rep(NULL, sum(remove.market))
  market.names <- market.names[-which(remove.market)]
}
for (i in 1:n.obs) {
  logit(mu.X[i, 1]) <- alpha0[1]
  x_cov[i, 1] ~ dbern(mu.X[i, 1])
  for (j in 2:n.cov) {
    logit(mu.X[i, j]) <- alpha0[j] + inprod(alpha[j, 1:(j - 1)], x_cov[i, 1:(j - 1)])
    x_cov[i, j] ~ dbern(mu.X[i, j])
  }
}
# Priors
for (j in 1:n.cov) {
  alpha0[j] ~ dnorm(0, 0.001)
}
beta0 ~ dnorm(0, 0.001)
Se ~ dbeta(33, 4)
beta.fixed[1] ~ dnorm(0, 0.001)
beta.fixed[2] ~ dnorm(0, 0.001)
for (j in 1:n.cov) {
  precisionb[j] <- equals(gammab[j], 0) * prec.spike +
                   equals(gammab[j], 1) * prec.slab
  beta[j] ~ dnorm(0, precisionb[j])
  gammab[j] ~ dbern(omegab[j])
  omegab[j] ~ dbeta(1, 2)
}
for (j in 1:n.cov) {
  for (k in 1:(j - 1)) {
    precisiona[j, k] <- equals(gammaa[j, k], 0) * prec.spike +
                        equals(gammaa[j, k], 1) * prec.slab
    alpha[j, k] ~ dnorm(0, precisiona[j, k])
    gammaa[j, k] ~ dbern(omegaa[j, k])
    omegaa[j, k] ~ dbeta(1, 2)
  }
}
}"
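The `equals()` construction in the model string implements the spike-and-slab mixture of precisions: a coefficient with indicator 0 gets the tight spike precision (concentrating it near zero), while an indicator of 1 gives it the diffuse slab precision. The plain-R sketch below illustrates the same selection mechanism, with illustrative (assumed) values for `prec.spike` and `prec.slab`:

```r
# Illustrative (assumed) spike and slab precisions
prec.spike <- 100   # tight spike around zero
prec.slab  <- 0.1   # diffuse slab
gammab     <- c(0, 1, 1, 0)
# Same mixture as the equals() lines in the JAGS model
precisionb <- (gammab == 0) * prec.spike + (gammab == 1) * prec.slab
precisionb  # 100.0 0.1 0.1 100.0
```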
inits.list <- function() { list(beta = beta_init, alpha = alpha_init,
  beta0 = beta0_init, alpha0 = alpha0_init, .RNG.name = "base::Mersenne-Twister",
  .RNG.seed = sample.int(n = 100000, size = 1)) }
N <- 15000
n.thin <- 1
n.adapt <- 0
n.cores <- 8
n.burnin <- 2000
## Parameters to monitor
parameters <- c("beta", "beta0", "omegab", "Se")
data.list <- list(y = y, x_fix = x_fix, x_cov = x_cov, n.obs = n.obs, n.cov = n.cov,
                  c = c, tau = tau)
ptm <- proc.time()
post.runjags <- run.jags(model = model.string4, data = data.list,
                         inits = inits.list, n.chains = n.cores, adapt = n.adapt,
                         burnin = n.burnin, sample = round(N / n.cores), thin = n.thin,
                         method = "parallel", modules = "glm", monitor = parameters)
time <- proc.time() - ptm
post.mat <- as.matrix(as.mcmc(post.runjags))
g.mcmc <- post.mat[, grep("omegab", colnames(post.mat))]
n.cov <- ncol(g.mcmc)
b.mcmc <- post.mat[, grep("beta", colnames(post.mat))][, 1:n.cov]
inclprob <- apply(b.mcmc, 2, function(x) mean(x > 0.05))
suspected.food <- food.names[which.max(inclprob[1:length(food.names)])]
highest.prob <- max(inclprob[1:length(food.names)])
difference <- max(inclprob[1:length(food.names)]) -
  sort(inclprob[1:length(food.names)],
       partial = length(food.names) - 1)[length(food.names) - 1]
return(new("Information", time = time[3], suspectedfood = suspected.food,
           inclupro.highest = highest.prob, dif = difference))
}
Appendix E
Table E.1: The performance of the model with only Bayesian variable selection in the four
scenarios

                                              Correct     Median      Median of inclusion probability of Ham
                                              detection   detection   at the earliest detection date
                                              number      date        (standard deviation in brackets)
  Original rates                              26          03-22       0.539 (0.118)
  Increased misclassification                 17          03-22       0.580 (0.129)
  Increased missingness                       23          03-09       0.536 (0.106)
  Increased misclassification & missingness   16          03-09       0.563 (0.095)
Table E.2: The performance of the model with Bayesian variable selection and misclassification
correction in the four scenarios

                                              Correct     Median      Median of inclusion probability of Ham
                                              detection   detection   at the earliest detection date
                                              number      date        (standard deviation in brackets)
  Original rates                              41          03-22       0.519 (0.104)
  Increased misclassification                 35          03-22       0.524 (0.097)
  Increased missingness                       27          03-22       0.508 (0.099)
  Increased misclassification & missingness   23          03-22       0.548 (0.124)
Table E.3: The performance of the model with Bayesian variable selection and missing value
imputation in the four scenarios

                                              Correct     Median      Median of inclusion probability of Ham
                                              detection   detection   at the earliest detection date
                                              number      date        (standard deviation in brackets)
  Original rates                              88          03-22       0.650 (0.111)
  Increased misclassification                 55          03-22       0.625 (0.094)
  Increased missingness                       50          03-22       0.562 (0.119)
  Increased misclassification & missingness   32          03-22       0.589 (0.110)
Table E.4: The performance of the complete Bayesian variable selection model in the four sce-
narios

                                              Correct     Median      Median of inclusion probability of Ham
                                              detection   detection   at the earliest detection date
                                              number      date        (standard deviation in brackets)
  Original rates                              92          03-22       0.592 (0.149)
  Increased misclassification                 66          03-22       0.556 (0.129)
  Increased missingness                       64          03-22       0.576 (0.125)
  Increased misclassification & missingness   53          03-22       0.548 (0.112)
Table E.5: The performance of the standard logistic regression model in the four scenarios

                                              Correct     Median
                                              detection   detection
                                              number      date
  Original rates                              23          04-05
  Increased misclassification                  6          03-22
  Increased missingness                       11          03-22
  Increased misclassification & missingness    9          04-05