STATISTICAL SCIENCE
FOR THE LIFE AND BEHAVIOURAL SCIENCES
Abstract
Food-borne disease outbreaks constitute a large, ongoing public health burden worldwide (Hald
et al., 2016). Early identification of contaminated food products plays an important role in
reducing health burdens of food-borne disease outbreaks (Jacobs et al., 2017). Case-control
studies together with logistic regression analysis are primarily used in food-borne outbreak in-
vestigations. However, the current methodology is associated with problems including response
misclassification, missing values and ignoring small sample bias.
Jacobs et al. (2017) developed a formal Bayesian variable selection method which deals
with the problems of missing covariates and misclassified response. The re-analysis of Dutch
Salmonella Thompson 2012 outbreak data (Friesema et al., 2014) has illustrated that this
Bayesian approach allows a relatively easy implementation of these concepts and performs better
than the standard logistic regression analysis in the identification of responsible food products.
The complete Bayesian variable selection model is composed of three different parts, namely,
misclassification correction, missing value imputation and Bayesian variable selection. In this
thesis, we are interested in how these different parts affect the performance of Bayesian variable
selection models in scenarios with (i) the same response misclassification rate and missingness
rate in an assumed responsible food product covariate as in the original food-borne disease out-
break dataset, (ii) different response misclassification rates, (iii) different missingness rates in an
assumed responsible food product and (iv) the combination of different response misclassifica-
tion rates and missingness rates. We answer this research question by designing and executing
a simulation study. Our results indicate that, for the four versions of the Bayesian variable
selection model studied in this thesis, increasing the response misclassification rate, the
missingness rate in the assumed responsible food product covariate, or both decreases model
performance. Bayesian variable selection, misclassification correction and missing value
imputation all contribute positively to model performance. Although missing value imputation is
the most computationally expensive of the three components, it also contributes the most to
model performance.
Contents

1 Introduction
  1.1 Case-control studies
    1.1.1 Case definition and control selection
    1.1.2 Data and data collection
    1.1.3 Limitations
  1.2 Methods for analysing data from case-control studies
    1.2.1 Standard methods
    1.2.2 Lasso logistic regression
    1.2.3 Bayesian variable selection method
  1.3 The goal of this thesis

2 Methodology
  2.1 Classical logistic regression without misclassification
    2.1.1 Estimation of conditional odds ratio
  2.2 Logistic regression with nondifferentially misclassified responses
  2.3 Variable selection methods
    2.3.1 Generalized linear models
    2.3.2 Classical variable selection
      2.3.2.1 Variable selection techniques
      2.3.2.2 Combination of univariable analysis and stepwise approaches
      2.3.2.3 Lasso
    2.3.3 Bayesian variable selection
  2.4 Missing covariates
    2.4.1 Missing data mechanism
    2.4.2 Methods for missing data
      2.4.2.1 Complete-case analysis
      2.4.2.2 Ad-hoc imputation
      2.4.2.3 Sequential full Bayesian approach
  2.5 Complete Bayesian variable selection model
    2.5.1 Logistic regression with nondifferentially misclassified responses
    2.5.2 Bayesian variable selection
    2.5.3 Missing imputation with variable selection

4 Simulation study
  4.1 Motivation for designing a simulation study based on real data
  4.2 Simulating new datasets
    4.2.1 General description
    4.2.2 Algorithm
  4.3 Misclassification scenarios
  4.4 Missingness scenarios
  4.5 Scenarios with increased misclassification and missingness rates
  4.6 Model fitting
    4.6.1 Prior specification
    4.6.2 Initial values of parameters
    4.6.3 Burn-in iterations and the posterior sample size
    4.6.4 Model fitting steps
    4.6.5 Performance measures

D R-code for the complete Bayesian variable selection model
Chapter 1
Introduction
Food-borne disease outbreaks are defined by the Centers for Disease Control and Prevention
(CDC) as an incident in which two or more persons experience a similar illness due to ingestion
of the same food (Centers for Disease Control and Prevention, 2015). The global burden of
food-borne disease outbreaks is estimated to be considerable by the World Health Organization
(WHO) (Hald et al., 2016). For example, in 2014, 5,521 food-borne disease outbreaks, resulting
in 45,665 human cases, 6,438 hospitalizations and 27 deaths, were reported in the European Union
(EU) (European Food Safety Authority (EFSA) and European Centre for Disease Prevention and
Control (ECDC), 2015). Early identification of contaminated food products plays an important
role in reducing health burdens of food-borne disease outbreaks (Jacobs et al., 2017).
cases in many aspects, such as age and gender, and must be unaffected individuals at the same
risk of developing the food-borne disease (Lewallen and Courtright, 1998).
1.1.3 Limitations
Despite the widespread use of case-control methods in food-borne outbreak investigations, there
are several limitations. First, case-control studies are subject to information bias (Dwyer et al.,
1994). One example of information bias is recall bias, which is caused by systematic differences
in the accuracy or completeness of historical self-reported information from respondents
(Last, 2000). True exposures are likely to be underreported by controls and overreported by cases.
This exaggerates the magnitude of the difference between cases and controls in reported rates of
exposure to suspected risk factors and thus leads to an inflation of the odds ratio (Raphael, 1987).
Recall bias also depends on the type of food product: compared with distinctive food products,
common food products might be more likely to be underreported. Second, misclassification of
disease outcomes in case-control studies results in misclassification bias. For instance, asymp-
tomatic cases may be selected as controls, i.e. a false negative. Likewise, false positives may
occur among individuals who show symptoms consistent with the case definition, while in fact
those symptoms result from a different etiology (Dwyer et al., 1994). Misclassification can bias
estimates and may lead investigators to make incorrect conclusions about an exposure (Gilbert
et al., 2016). Finally, item non-response may arise in the data collection for several reasons. For
example, subjects may fail to remember their dietary intake in a given time period, or forget to
fill in an answer due to carelessness.
1.2 Methods for analysing data from case-control studies
1.2.1 Standard methods
Classical logistic regression is typically applied in the statistical analysis of the questionnaires to
estimate exposure effects while controlling for confounders. The selection of relevant variables is
a critical component of a case-control outbreak investigation as a large number of food products
that people may have consumed are usually investigated. Because the number of food products
being investigated is usually larger than the number of observations, researchers commonly use a
combination of univariable analysis and stepwise, forward or backward variable selection (Hosmer
et al., 2013). Moreover, classical variable selection procedures are subject to small-sample bias.
1.3 The goal of this thesis
The complete Bayesian variable selection model is composed of three different parts, namely,
misclassification correction, missing value imputation with variable selection and Bayesian vari-
able selection. Different versions of Bayesian models are generated by different combinations of
these parts.
We are interested in investigating how these different parts affect the performance of Bayesian
variable selection models in scenarios with (i) the same missingness rate (i.e. the percentage
of missing values) in an assumed responsible food product covariate and the same response
misclassification rate as in the original dataset, (ii) different response misclassification rates, (iii)
different missingness rates in an assumed responsible food product and (iv) the combination of
different response misclassification rates and missingness rates. This main research question can
be divided into two smaller research sub-questions:
(i) What are the effects of missingness in the assumed responsible food product covariate and
response misclassification on the performance of Bayesian variable selection models? That is,
how well does a Bayesian variable selection model identify the responsible food product in the
above scenarios?
(ii) Which parts of the Bayesian variable selection model contribute to the performance in
each of the above scenarios?
In this thesis, we will answer these research questions by designing and executing a simulation
study. The simulated data will be based on a real dataset. More specifically, the simulated
datasets will be shuffled versions of a real dataset. The simulation study will consist of
four scenarios: (i) original response misclassification rate and missingness rate in the assumed
responsible food product covariate, (ii) increased response misclassification rate, (iii) increased
missingness rate in the assumed responsible food product, (iv) increased response misclassification
rate and missingness rate in the assumed responsible food product covariate. In each of the
scenarios, the data will be analysed using standard logistic regression (used as a baseline) and
different versions of the Bayesian variable selection model. By comparing the performance of the
different models in each scenario and the performance of one specific model in all scenarios, we
can gain better insight into the mechanisms of the Bayesian variable selection method.
This thesis is organized in the following way. In Chapter 2, we present the methods used in
the analysis of the case-control data including the standard logistic regression and the Bayesian
variable selection method developed by Jacobs et al. (2017). In Chapter 3, we present two
datasets from real food-borne disease outbreaks and show the results of the descriptive statistics
and standard logistic regression. In Chapter 4, we present the simulation study design. In
Chapter 5, we present the results of the simulation study. Chapter 6 contains the discussion and
conclusions.
Chapter 2
Methodology
$$\mathrm{logit}(\mu_i) = \beta_0 + x_i\beta, \tag{2.1}$$
where $x_i \in \mathbb{R}^{1\times p}$ denotes the $i$th row of the design matrix $X \in \mathbb{R}^{n\times p}$, and $\beta \in \mathbb{R}^{p\times 1}$ denotes a
$p$-dimensional column vector listing the unknown regression coefficients.
then $e^{\hat{\beta}z}$ is an estimate of the conditional odds ratio $\psi_{cond}(z)$. The conditional odds ratio $\psi_{cond}(z)$
is the odds ratio between Y and Z when the values of X1 , . . . , Xp are held fixed.
2.2 Logistic regression with nondifferentially misclassified
responses
Response misclassification occurs in a case-control study when the observed disease outcomes
do not truly reflect the true disease status. Let ỹi designate the true disease status, with
ỹi = 1 if the disease is present and ỹi = 0 otherwise, and let yi denote the observed disease
outcome, which is subject to misclassification, for person i = 1, 2, ..., n. We assume nondifferential
misclassification (i.e. the misclassification of disease status does not depend on the covariates):
This assumption implies that the critical diagnostic properties known as sensitivity (Se) and
specificity (Sp) do not vary according to exposure status (Thomas et al., 1993; Tang et al., 2015).
Thus we define Se = P(yi = 1 | ỹi = 1) and Sp = P(yi = 0 | ỹi = 0).
For an individual with covariate information xi , let the probability of having a true case be
πi = P (ỹi = 1|xi ).
By the law of total probability, a possibly misclassified, positive diagnostic test occurs with
probability
$$\begin{aligned}
\mu_i \equiv P(y_i = 1 \mid x_i) &= P(y_i = 1 \mid \tilde{y}_i = 1, x_i)P(\tilde{y}_i = 1 \mid x_i) + P(y_i = 1 \mid \tilde{y}_i = 0, x_i)P(\tilde{y}_i = 0 \mid x_i) \\
&= P(y_i = 1 \mid \tilde{y}_i = 1)P(\tilde{y}_i = 1 \mid x_i) + P(y_i = 1 \mid \tilde{y}_i = 0)P(\tilde{y}_i = 0 \mid x_i) \\
&= P(y_i = 1 \mid \tilde{y}_i = 1)P(\tilde{y}_i = 1 \mid x_i) + [1 - P(y_i = 0 \mid \tilde{y}_i = 0)][1 - P(\tilde{y}_i = 1 \mid x_i)] \\
&= Se\,\pi_i + (1 - Sp)(1 - \pi_i). \tag{2.6}
\end{aligned}$$
We assume yi |xi ∼ Bernoulli(µi ), with probability of success derived from Equation 2.6. In
the logistic regression case, logit(πi ) = β0 + xi β, where xi ∈ R1×p denotes the ith row of the
design matrix X ∈ Rn×p , β ∈ Rp×1 denotes a p-dimensional column vector listing unknown
regression coefficients.
The likelihood function is
$$L(\beta, Se, Sp \mid Y) = \prod_{i=1}^{n} f(y_i \mid \beta, Se, Sp) = \prod_{i=1}^{n} \left[\pi_i Se + (1-\pi_i)(1-Sp)\right]^{y_i} \left[\pi_i(1-Se) + (1-\pi_i)Sp\right]^{1-y_i}. \tag{2.7}$$
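Equation 2.7 is straightforward to evaluate numerically. The thesis's implementation is in R (Appendix D); the sketch below uses Python purely for illustration, with invented toy data and our own function names, and checks that with Se = Sp = 1 the misclassification-adjusted likelihood reduces to the standard Bernoulli log-likelihood.

```python
import numpy as np

def misclassified_loglik(beta0, beta, X, y, se, sp):
    """Log-likelihood of Eq. 2.7: observed outcomes y given
    true-case probabilities pi_i, sensitivity se and specificity sp."""
    pi = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))   # logit(pi_i) = beta0 + x_i beta
    mu = se * pi + (1.0 - sp) * (1.0 - pi)           # Eq. 2.6: P(y_i = 1 | x_i)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# toy data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50)
beta = np.array([0.5, -0.2, 0.1])

# with perfect classification the adjusted likelihood equals the standard one
ll_perfect = misclassified_loglik(0.1, beta, X, y, se=1.0, sp=1.0)
pi = 1.0 / (1.0 + np.exp(-(0.1 + X @ beta)))
ll_standard = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
assert np.isclose(ll_perfect, ll_standard)
```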
2.3 Variable selection methods
In many situations, researchers are interested in variable selection. Researchers may be interested
in gaining a good understanding of the real relationship between the response and the explanatory
variables, so it is important to select only those explanatory variables that best explain the
response. Or researchers may be interested in finding the best prediction model, i.e. selecting
those variables that give the best prediction. In the context of food-borne disease outbreak
investigations, investigators are particularly interested in variable selection with the purpose of
finding the one food product (i.e. variable) that best distinguishes between cases and controls.
In this section, we will discuss variable selection methods in the context of generalized linear
regression.
Here a(·), b(·) and c(·) are known functions which vary according to the distribution.
Specifically, when Y is binary and follows a Bernoulli distribution, GLMs with the canonical logit link
$$g(\mu) = \log\frac{\mu}{1-\mu} \tag{2.10}$$
are logistic regression models.
Variable selection in a generalized linear regression context can be seen as the exercise of
deciding which regression coefficients are equal to zero.
1. Stepwise approach. There are three main approaches: backward elimination, forward
selection and stepwise selection. Backward elimination starts with all candidate predictors
in the model and looks for predictors that are not statistically significant, i.e. whose deletion
does not significantly reduce the fit of the model based on a model comparison criterion.
Then the predictor that is least significant is removed and thus a new “simplified” model is
obtained. The above procedure can be repeated until all predictors in the model are found
significant. The forward selection method reverses the backward elimination method. It
starts with no variables in the model and sequentially adds the most significant predictors.
Stepwise selection is a combination of backward elimination and forward selection. At each
stage, a predictor may be added or deleted.
2. Best subsets approach. The best subsets approach searches all possible models with a
specific set of predictors and identifies the best-fitting models based on the values of the
quantitative criterion. If there are p predictors, the number of subsets is $2^p$.
The criteria used for model comparison include p-values, Akaike’s information criterion (AIC)
and Bayesian information criterion (BIC).
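To make these search strategies concrete, the following Python sketch (our own illustration, using a Gaussian linear model and its AIC as the comparison criterion rather than the logistic models of the outbreak setting) implements backward elimination and an exhaustive best subsets search over the same synthetic data.

```python
import itertools
import numpy as np

def aic_linear(X, y, cols):
    """AIC of a Gaussian linear model with intercept and predictor columns `cols`."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ coef) ** 2)
    n, k = len(y), Xs.shape[1]
    return n * np.log(rss / n) + 2 * k

def backward_elimination(X, y, criterion):
    """Drop predictors one at a time while the criterion keeps improving."""
    cols = list(range(X.shape[1]))
    best = criterion(X, y, cols)
    while cols:
        scores = [(criterion(X, y, [c for c in cols if c != j]), j) for j in cols]
        new_best, drop = min(scores)
        if new_best >= best:        # no single deletion improves the fit: stop
            return cols, best
        best, cols = new_best, [c for c in cols if c != drop]
    return cols, best

def best_subsets(X, y, criterion):
    """Exhaustive search over all 2^p candidate subsets of p predictors."""
    p = X.shape[1]
    subsets = [list(c) for r in range(p + 1)
               for c in itertools.combinations(range(p), r)]
    assert len(subsets) == 2 ** p
    return min(subsets, key=lambda cols: criterion(X, y, cols))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

stepwise_cols, stepwise_aic = backward_elimination(X, y, aic_linear)
subset_cols = best_subsets(X, y, aic_linear)
```

With only five predictors the exhaustive search evaluates 32 models; backward elimination evaluates far fewer, which is why stepwise methods are preferred when p is large.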
The above variable selection techniques are implicitly designed for situations where the num-
ber of the observations is larger than the number of candidate predictors. In the food-borne
disease outbreak investigation, the number of candidate predictors is usually larger than the
number of observations. In this case, the following two techniques are frequently used.
A pre-selection is performed before including all the candidate predictors in the full model. One
way to do a pre-selection is to use univariable models, where a model with just one predictor is tested at
a time. Then only those predictors which meet a preset criterion for significance are selected. This
criterion is often more relaxed than the conventional criterion for significance (for instance, p-
value < 0.20, instead of the usual p-value < 0.05), since the pre-selection aims to identify potential
predictors rather than to test a hypothesis. After the pre-selection procedure, a further variable
selection is performed by using backward, forward or stepwise selection procedures (Hosmer et al.,
2013; Harrell, 2015).
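The two-stage procedure can be sketched in a few lines; the covariate names and univariable p-values below are invented purely for illustration.

```python
# hypothetical univariable p-values for five food product covariates
univariable_p = {"ham": 0.03, "cheese": 0.18, "lettuce": 0.45,
                 "chicken": 0.07, "beef": 0.62}

# stage 1: relaxed pre-selection (p < 0.20, instead of the usual 0.05)
pre_selected = [v for v, p in univariable_p.items() if p < 0.20]

# stage 2: only pre_selected would now enter backward/forward/stepwise selection
print(pre_selected)
```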
2.3.2.3 Lasso
The least absolute shrinkage and selection operator (lasso) was first formulated by
Tibshirani (1996) for estimation in linear models, where it adds a penalty term to the residual
sum of squares. Based on this lasso method, Park and Hastie (2007) proposed a modified criterion
with regularization for estimating the coefficients β in GLMs:
where λ > 0 is the regularization parameter. The lasso method penalizes the coefficients of
candidate predictors, shrinking some of the unimportant coefficients to zero and thus achieves
variable selection.
11
Glmnet (Friedman et al., 2010) fits the lasso logistic regression by solving the following
problem:
" n #
1X
(β0 +xi β)
min − yi · (β0 + xi β) − log 1 + e + λkβk1 . (2.12)
(β0 ,β)∈Rp+1 n i=1
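The selection effect of the ℓ1 penalty in Equation 2.12 can be illustrated with a simple proximal-gradient (ISTA) solver; note that glmnet itself uses coordinate descent, so this Python sketch with synthetic data is only a conceptual stand-in, not glmnet's algorithm.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 norm: shrink towards zero, exact zeros inside [-t, t]."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_logistic(X, y, lam, step=0.01, iters=5000):
    """Proximal gradient descent on Eq. 2.12; the intercept is unpenalized."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))
        g0 = np.mean(mu - y)             # gradient of the averaged negative log-likelihood
        g = X.T @ (mu - y) / n
        b0 -= step * g0
        b = soft_threshold(b - step * g, step * lam)   # proximal step for lam * ||b||_1
    return b0, b

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
eta = 2.0 * X[:, 0]                      # only the first covariate truly matters
y = (rng.random(300) < 1.0 / (1.0 + np.exp(-eta))).astype(float)
b0, b = lasso_logistic(X, y, lam=0.1)
```

The soft-thresholding step sets small coefficients exactly to zero, which is how the lasso "achieves variable selection" in the sense described above.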
with
$$p(y \mid m) = \int p(y \mid \theta_m, m)\, p(\theta_m \mid m)\, d\theta_m, \tag{2.14}$$
where θm denotes the km -dimensional vector of parameters for model m. For example, for linear
regression θm is composed of σ 2 , β0 and dm regression coefficients βm .
Several Bayesian variable selection (BVS) methods have been developed in the last 25 years. The
BVS method that we apply in this research is Stochastic Search Variable Selection (SSVS),
proposed by George and McCulloch (1993).
Let a latent variable γj denote the inclusion indicator for βj, with γj = 1 denoting that the jth
covariate is included in the regression model and γj = 0 otherwise, for covariates j = 1, 2, ..., p.
Let ωj denote the inclusion probability of the jth covariate. The regression coefficients βj are
assumed to have a spike-and-slab prior distribution:
$$\beta_j \mid \gamma_j, \tau^2, c^2 \sim \gamma_j\,\mathrm{N}(0, \tau^2 c^2) + (1 - \gamma_j)\,\mathrm{N}(0, \tau^2). \tag{2.15}$$
In Equation 2.15, τ 2 c2 > 0 is the variance of the slab component and τ 2 > 0 is the variance of the
spike component. The density of the spike component is concentrated closely around zero. Note
that the two Gaussian densities intersect at the points $\pm\epsilon$, where $\epsilon = \sqrt{2\tau^2 c^2 \log(c)/(c^2 - 1)}$ (see
Figure 2.1). The point $\epsilon$ can be considered as a threshold for “practical significance”, because
all coefficients falling into the interval $[-\epsilon, \epsilon]$ can be interpreted as zero (Lesaffre and Lawson,
2012). This provides guidance for choosing the tuning parameters τ and c. When the parameter
c is fixed, the variance τ² can be chosen to reflect the researchers' perception of practical
significance. The choice of the prior distributions of the inclusion indicator variable γj, the
inclusion probability parameter ωj and the regression coefficients βj, and the choice of the
value of $\epsilon$, will be discussed in Section 4.6.1.
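The intersection points are easy to compute; the snippet below (our own check, with illustrative values τ = 0.1 and c = 10) evaluates ε and confirms that the spike and slab densities coincide there.

```python
import math

def spike_slab_threshold(tau, c):
    """Intersection points +/-eps of the N(0, tau^2) and N(0, tau^2 c^2) densities."""
    return math.sqrt(2 * tau**2 * c**2 * math.log(c) / (c**2 - 1))

def normal_pdf(x, sd):
    return math.exp(-x**2 / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

tau, c = 0.1, 10.0          # illustrative tuning parameters
eps = spike_slab_threshold(tau, c)

# at +/-eps the spike and slab densities are equal
assert math.isclose(normal_pdf(eps, tau), normal_pdf(eps, tau * c))
```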
After a model is set up, it is usually fitted using Markov chain Monte Carlo (MCMC) methods,
such as the Gibbs sampler, which generates samples from the joint posterior distribution of the
inclusion indicator variable γj, the inclusion probability parameter ωj and the regression
coefficients βj.

Figure 2.1: Spike-and-slab prior distribution used in the SSVS procedure (Lesaffre and Lawson,
2012)
Once samples are drawn from the joint posterior distribution of the parameters (γj, ωj, βj),
the two-sided posterior inclusion probability $P(\beta_j \notin [-\epsilon, \epsilon] \mid \text{Data})$ can be used as a criterion for
variable selection. In the context of food-borne disease outbreaks, an odds ratio of 1.0 (i.e. β = 0)
indicates that exposure to a food product is not associated with the disease. An odds ratio larger
than 1.0 (i.e. β > 0) indicates that the exposure might be a risk factor for the disease, and an
odds ratio less than 1.0 (i.e. β < 0) indicates that the exposure might be a protective factor
against the disease. Because only positive regression coefficients are of interest, the one-sided
posterior inclusion probability $P(\beta_j > \epsilon \mid \text{Data})$ is used (Jacobs et al., 2017).
We use the one-sided posterior inclusion probability $P(\beta_j > \epsilon \mid \text{Data})$, a marginal probability
for each variable, as the criterion for variable selection. We are aware that a different set of
variables might have the highest joint posterior probability. In this thesis, however, we want to
find the food product covariate with the highest one-sided posterior inclusion probability; we are
not interested in finding a model as a whole to explain the response. Therefore, it is not
problematic that the joint posterior probability of the selected variables might not be the highest.
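Given posterior draws of βj, the one-sided inclusion probability is simply the fraction of draws exceeding ε. A minimal Python sketch with simulated draws (illustrative only, not actual sampler output):

```python
import numpy as np

def one_sided_inclusion_prob(beta_draws, eps):
    """P(beta_j > eps | Data), estimated per column from posterior samples (rows = draws)."""
    return np.mean(beta_draws > eps, axis=0)

rng = np.random.default_rng(4)
# fake posterior draws for 3 covariates: one clearly positive, two near zero
draws = np.column_stack([rng.normal(1.0, 0.2, 10000),
                         rng.normal(0.0, 0.05, 10000),
                         rng.normal(0.0, 0.05, 10000)])

probs = one_sided_inclusion_prob(draws, eps=0.3)
selected = int(np.argmax(probs))   # covariate with highest inclusion probability
```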
the responsible food product. If these covariates are missing, the data analysis of a food-borne
disease outbreak investigation faces a challenge. There are different ways to deal with missing
covariates. In this section, we discuss how one could handle missing values in the context of
food-borne disease outbreaks.
(1) Missing completely at random (MCAR). A variable is MCAR if missingness depends neither
on the observed data nor on the unobserved data.
(2) Missing at random (MAR). A variable is MAR if missingness depends on the observed
data, but does not depend on the unobserved data.
(3) Missing not at random (MNAR). A variable is MNAR if missingness depends on the un-
observed data, perhaps in addition to the observed data.
A common missing data approach is complete-case analysis (CC), which deletes all subjects with
incomplete data. When missing data are MCAR, CC analysis provides unbiased results (Little
and Rubin, 2002). However, this does not mean that CC analysis is always a desirable method.
If the proportion of incomplete cases is large, CC analysis can lead to a reduction of statistical
power (Belin et al., 2000). For example, suppose that data are MCAR across 30 variables and
the missingness proportion for each variable is 5%. Using CC analysis will lose close to four fifths
of the subjects, because the percentage of the fully observed subjects in the original data is only
$(1 - 0.05)^{30} \approx 21\%$. In addition, when the missingness mechanism is not MCAR, the results can
be biased.
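This arithmetic is easy to verify by simulation; the snippet below (purely illustrative) draws MCAR missingness indicators for 30 variables and compares the simulated complete-case fraction with (1 − 0.05)^30.

```python
import numpy as np

rng = np.random.default_rng(5)
n_subjects, n_vars, miss_rate = 100_000, 30, 0.05

# True where a value is missing; a subject is complete if no variable is missing
missing = rng.random((n_subjects, n_vars)) < miss_rate
complete_fraction = np.mean(~missing.any(axis=1))

theoretical = (1 - miss_rate) ** n_vars   # roughly 0.21, i.e. four fifths of subjects lost
```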
For the analysis of food-borne disease outbreak data, many researchers apply ad-hoc imputation
methods to fill in missing values so that the standard software can be easily used to analyse com-
plete data. For example, in the analysis of the Salmonella Bovismorbificans 2016-2017 outbreak
(Brandwagt et al., 2018) and Salmonella Thompson 2012 outbreak (Friesema et al., 2014) in the
Netherlands, food product covariates that were not filled in questionnaires were assumed to be
zero. In addition, subjects who answered “maybe” to questions on the consumption of some food
products were assumed to have consumed these products. The first assumption
is reasonable, since subjects usually only mark the food products which they have consumed.
However, the second assumption seems implausible. Subjects who had not consumed a food
product may give a “maybe” answer because they failed to recall their consumption history and
were unsure about the consumption of the food product. Therefore, the second assumption
possibly overestimated food consumption, thus leading to invalid inferences.
The sequential full Bayesian (SFB) approach was proposed by Erler et al. (2016). By combining
the imputation models with the analysis model in one estimation procedure, this approach
jointly imputes missing covariates and obtains inferences on the posterior distribution of the
parameters. In a standard Bayesian setting with complete data, the probability density function
of interest is $p(\theta_{Y|X} \mid y_i, x_i)$, where $\theta_{Y|X}$ denotes the vector of parameters of the model (for
example $(Se, Sp, \beta_0, \beta^{\top})^{\top}$ for the model in Section 2.2). When some covariate values are missing,
X is composed of two parts: covariates containing completely observed values $X_{obs}$ and
covariates containing missing values X mis . The total number of covariates p is split up into q
observed covariates and r missing covariates. Then the posterior probability of interest becomes
$p(\theta_{Y|X}, \theta_X, x_{i,mis} \mid y_i, x_{i,obs})$, which can be written as
$$p(\theta_{Y|X}, \theta_X, x_{i,mis} \mid y_i, x_{i,obs}) \propto p(y_i \mid x_{i,obs}, x_{i,mis}, \theta_{Y|X})\; p(x_{i,mis} \mid x_{i,obs}, \theta_X)\; \pi(\theta_{Y|X})\; \pi(\theta_X),$$
where θX is a vector of parameters which are associated with the likelihood of partially observed
covariates X mis , and π(θY |X ) and π(θX ) are prior distributions (Erler et al., 2016). The joint
likelihood of the missing covariates p(xi,mis |xi,obs , θX ) can be specified in a convenient way by
using a sequence of conditional univariate distributions (Ibrahim et al., 2002):
$$p(x_{i,mis} \mid x_{i,obs}, \theta_X) = \prod_{j=q+1}^{p} p(x_{ij} \mid x_{i1}, \ldots, x_{i,j-1}, \theta_{X_j}). \tag{2.19}$$
After the prior distributions π(θY |X ) and π(θX ) are specified, samples can be drawn from
the joint distribution of all parameters and missing covariates using MCMC methods, such as
Gibbs sampling. The SFB approach obtains valid inferences only under ignorable missing data
mechanisms, that is, MCAR or MAR, and when the analysis model, together with the conditional
distributions of the covariates, are correctly specified (Erler et al., 2016).
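The sequential factorization in Equation 2.19 can be sanity-checked on a toy example: with logistic conditional models for three binary covariates (coefficient values invented for illustration), the chained conditionals define a proper joint distribution, i.e. the probabilities of all 2^3 patterns sum to one.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# invented coefficients: alpha[j] holds the intercept followed by weights
# on the previous covariates x_1, ..., x_{j-1}
alpha = {0: [0.2], 1: [-0.5, 1.0], 2: [0.1, -0.3, 0.7]}

def joint_prob(x):
    """p(x1, x2, x3) = prod_j p(x_j | x_1, ..., x_{j-1}) with logistic conditionals."""
    prob = 1.0
    for j, xj in enumerate(x):
        eta = alpha[j][0] + sum(a * xk for a, xk in zip(alpha[j][1:], x[:j]))
        pj = sigmoid(eta)
        prob *= pj if xj == 1 else (1.0 - pj)
    return prob

total = sum(joint_prob(x) for x in itertools.product([0, 1], repeat=3))
assert math.isclose(total, 1.0)   # a proper joint distribution
```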
Mitra and Dunson developed a 2-level variable selection model (Mitra and Dunson, 2010), in
which the variable selection is performed not only in the top level model relating the response
to covariates, but also in the covariate model characterizing the joint distribution functions in
Equation 2.19. In the re-analysis of the Dutch Salmonella Thompson outbreak data, this 2-level variable
selection model was applied as part of the complete Bayesian variable selection model. Some of
the parameters in θX were reasonably assumed to be zero due to sparse relationships among the
covariates (Jacobs et al., 2017). Therefore, a variable selection was performed in each covariate
model.
2.5 Complete Bayesian variable selection model
In the previous sections in this chapter, two types of logistic regression models, common variable
selection techniques and methods for missing data have been described. In this section, we
combine the methods into a complete Bayesian variable selection model. The complete Bayesian
variable selection model is composed of three different parts, namely, Bayesian variable selection,
misclassification correction and missing value imputation.
Food-borne disease outbreak data are “dynamic”: information on cases and controls is collected
over time during a food-borne disease outbreak investigation. For such dynamic data, time must
be taken into account. We deal with time by fitting the Bayesian variable selection model on the
data which are available at a certain date during the outbreak.
Additionally, one could add a time variable in the top model relating the response to covariates.
The way to perform Bayesian variable selection in the context of food-borne disease outbreak
investigations is not limited to the way described here. In Chapter 4, we assume that there is
only one responsible food product when generating new datasets. Then, for the Bayesian variable
selection models, we set a prior distribution on the probability for each food product covariate
to be in the model and choose the one with the highest one-sided posterior inclusion probability.
Alternatively, we could incorporate the information that there is only one responsible food
product by setting a prior distribution on the model size.
$$\begin{aligned}
y_i \mid x_i &\sim \mathrm{Bernoulli}(\mu_i) \\
\mu_i &= Se\,\pi_i + (1 - Sp)(1 - \pi_i) \\
\mathrm{logit}(\pi_i) &= \beta_0 + x_i\beta.
\end{aligned} \tag{2.20}$$
In the context of the food-borne outbreak datasets used in this thesis, we assume Sp = 1. A case
only entered the dataset if it had been twice laboratory-confirmed. Hence, it is safe to assume
that no non-infected subject was misclassified as a case, i.e. P(yi = 1 | ỹi = 0) = 0, indicating
that the specificity equals one.
$$\begin{aligned}
\beta_j \mid \gamma_j, \tau^2, c^2 &\sim \gamma_j\,\mathrm{N}(0, \tau^2 c^2) + (1 - \gamma_j)\,\mathrm{N}(0, \tau^2) \\
\gamma_j \mid \omega_j &\sim \mathrm{Bernoulli}(\omega_j) \\
\omega_j &\sim \mathrm{Beta}(a_{j,0}, b_{j,0}).
\end{aligned} \tag{2.21}$$
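The a priori behaviour of this hierarchy can be explored by simulation; in the Python sketch below (hyperparameter values are our own choice, purely illustrative) the fraction of prior draws coming from the slab matches the prior mean of ωj.

```python
import numpy as np

rng = np.random.default_rng(6)
n_draws, tau, c = 200_000, 0.1, 10.0
a0, b0 = 1.0, 1.0                      # Beta(1, 1) prior on the inclusion probability

omega = rng.beta(a0, b0, n_draws)      # omega_j ~ Beta(a0, b0)
gamma = rng.random(n_draws) < omega    # gamma_j | omega_j ~ Bernoulli(omega_j)
sd = np.where(gamma, tau * c, tau)     # slab sd if included, spike sd otherwise
beta = rng.normal(0.0, sd)             # beta_j ~ gamma N(0, tau^2 c^2) + (1 - gamma) N(0, tau^2)

slab_fraction = gamma.mean()           # should be close to E[omega] = a0 / (a0 + b0)
```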
2.5.3 Missing imputation with variable selection
Missing imputation with variable selection is applied as a component of the complete Bayesian
variable selection model. We assume that each partially observed covariate in Equation 2.19
depends on the previous covariates and is modelled by a generalized linear model with regression
coefficients $\theta_{X_j} = (\alpha_{0,j}, \alpha_{1,j}, \ldots, \alpha_{j-1,j})^{\top}$. Because all covariates are binary, a Bernoulli response
with a logistic regression model is used as the covariate model to obtain the probabilities in
Equation 2.19. How to choose the prior distributions of the α's and ω's in the covariate models
will be discussed in Section 4.6.1.
Chapter 3
Two datasets are used in this simulation study. The first one is from the Salmonella Bovis-
morbificans outbreak in 2016 to 2017 (Brandwagt et al., 2018) and the second one is from the
Salmonella Thompson 2012 outbreak in the Netherlands (Friesema et al., 2014). Our simulation
study is based on the first dataset. Both datasets reveal what real outbreak data look like and
provide reference information for setting up detailed simulation schemes, such as the
setup of missing covariates.
Figure 3.1: Number of all observations
products were merged into one pooled ham variable (raw, smoked and Coburg ham) and one
pooled cheese variable (unsliced, sliced and grated), during the analysis (Brandwagt et al., 2018),
resulting in 150 food products. All covariates except age are binary-valued. Age is a continuous
covariate which is standardized in the analysis.
The age distribution is a negatively skewed unimodal distribution with a mode age group of
70-79 (Figure 3.2). The cases were aged 5 to 89 (median 65.5) and the controls were aged 4 to
90 (median 69). For both groups, more than half were females: there were 14 females (58.3%)
in the case group and 20 females (54.1%) in the control group. Frequencies for observations, age
and gender, are summarized in Table 3.1.
A food product covariate takes the value 1 if a subject ate that product and the value 0
otherwise. A supermarket covariate takes the value 1 if a subject bought most of his or her
groceries at that supermarket and the value 0 otherwise. Food product covariates and
supermarket covariates that were not filled in are assumed to be zero. This is a reasonable
assumption, because subjects usually only mark the food products that they have eaten and the
supermarket where they have purchased most of their groceries. For food product covariates,
subjects were allowed to respond with “maybe” if they were not sure whether or not they had
eaten the product. In the analysis in 2017 (Brandwagt et al., 2018), a subject was assumed
to consume a product when he or she answered “maybe”, thus probably overestimating food
consumption. In our analysis, we treat these covariates as being missing.
The percentage of missing covariates per subject is up to 21.7% for cases and 35.5% for
Figure 3.2: Histogram of ages of all observations
Table 3.1: Characteristics of cases, controls and all subjects involved in the case-control studies
during the Salmonella Bovismorbificans outbreak
                 Cases                  Controls               All subjects
Age group        Male  Female  Total    Male  Female  Total    Male  Female  Total
0-9              1     0       1        4     0       4        5     0       5
10-19            0     0       0        0     0       0        0     0       0
20-29            0     1       1        1     1       2        1     2       3
30-39            1     1       2        2     2       4        3     3       6
40-49            3     1       4        0     3       3        3     4       7
50-59            0     2       2        0     0       0        0     2       2
60-69            1     3       4        6     3       9        7     6       13
70-79            2     4       6        3     10      13       5     14      19
≥ 80             2     2       4        1     1       2        3     3       6
Total            10    14      24       17    20      37       27    34      61
controls. The covariate with the highest percentage of missing values per covariate is smoked
sausage (25.0%) for cases and chicken breast (35.1%) for controls. Among 64.7% of the food
product covariates, the percentage of missing values per covariate for controls is higher than that
for cases, which reflects recall bias. The recall bias also existed for the pooled ham variable:
only 8.3% of cases responded with "maybe" to the consumption of the pooled ham variable,
while up to 27.0% of controls did so. In total, 12 subjects (19.67% of all subjects) responded
with "maybe".
3.2 Salmonella Thompson outbreak (2012)
On 15 August 2012, an increase in the number of cases of Salmonella Thompson infection in the
Netherlands was reported: that week, 11 S. Thompson cases were detected at the RIVM, and
four more had been detected two weeks earlier. An outbreak investigation was started in order to identify the source of
the outbreak and thereby prevent further disease spread. As part of the outbreak investigation,
the case-control study was conducted from 16 August 2012 to 28 September 2012 when smoked
fish was identified as the source. During the studies, four potential sources were indicated by the
case-control statistical analysis, namely minced meat (10 September), ready-to-eat raw vegetables
(17 September), ice cream (18 September) and finally smoked fish (24 September).
For each of the cases, four controls were drawn from the Dutch population from the same or
neighbouring municipality with comparable age and gender (Friedman et al., 2010). Finally, 109
cases and 193 controls participated in the case-control study. The numbers of cases and controls
are shown in Figure 3.3.
The food-consumption questionnaire was continuously updated during the outbreak inves-
tigation. Only food products that were investigated by all revisions of the questionnaire were
included in the dataset, resulting in 108 covariates (age, gender, 95 food products and 11 super-
market covariates).
Age has a bimodal distribution with two modes in the age groups of 10-19 years and 60-69
years (Figure 3.4). The median age of all subjects was 58 years (range: 2-93 years). The median
age of cases was 54 years (range: 2-93 years) and of controls was 60 years (range: 3-92 years).
Of the cases, 64.2% were female; of the controls, 72.0%. Summaries of the characteristics of
observations, age and gender, are given in Table 3.2.
For food product and supermarket covariates, the 0/1 values have the same meanings as
in Section 3.1. Similarly, food product covariates that were not filled in are assumed to be zero.
All of the supermarket covariates were filled in, so there is no need to impute the
supermarket covariates. In our analysis, we use the same definition of missing covariates as in
Section 3.1: a food product covariate with a "maybe" answer is considered missing.
Under this definition of missing covariates, the percentage of missing covariates per
subject is up to 39.6% for cases and 67.0% for controls. The covariate with the highest percentage
of missing values per covariate is iceberg lettuce (15.6%) for cases and minced beef (21.2%) for
controls. Among 74.7% of the food product covariates, the percentage of missing values per
covariate is higher for controls than for cases. The percentage of cases who responded with
"maybe" to the consumption of the smoked fish variable is 9.2%, slightly higher than the
corresponding percentage for controls (7.3%).
Table 3.2: Characteristics of cases, controls and all observations involved in the case-control
studies during the Salmonella Thompson outbreak
                              109 cases    193 controls   302 subjects
Sex             Female        70 (64.2)    139 (72.0)     209 (69.2)
                Male          39 (35.8)    54 (28.0)      93 (30.8)
Age group       0-9           11 (10.1)    14 (7.3)       25 (8.3)
in years        10-19         13 (11.9)    18 (9.3)       31 (10.3)
                20-29         12 (11.0)    10 (5.2)       22 (7.3)
                30-39         10 (9.2)     10 (5.2)       20 (6.6)
                40-49         7 (6.4)      13 (6.7)       20 (6.6)
                50-59         12 (11.0)    29 (15.0)      41 (13.6)
                60-69         13 (11.9)    49 (25.4)      62 (20.5)
                70-79         15 (13.8)    30 (15.5)      45 (14.9)
                ≥ 80          16 (14.7)    20 (10.4)      36 (11.9)
Chapter 4
Simulation study
4.2 Simulating new datasets
4.2.1 General description
The simulation study is based on the dataset from the Salmonella Bovismorbificans outbreak
in 2016 to 2017 (Brandwagt et al., 2018). The responsible contaminated food product in this
outbreak is a smoked Coburg ham, which falls in the category of the pooled ham covariate. In
this simulation study, we still assume that the source is from the pooled ham covariate by keeping
the pooled ham covariate fixed. New datasets are generated by shuffling a certain proportion of
food product covariates together within one stratum of cases and controls with similar age and
gender. Shuffling covariates is achieved by randomly reassigning covariate values of one subject
to a different subject in the same stratum.
Generally, variable selection methods are sensitive to correlation structures among covariates.
Therefore, in order to make the assumption that the most likely source is the pooled ham valid,
the correlation structures should be kept as constant as possible in the simulated data generation
process. On the other hand, the simulation study aims to answer the research questions by
testing and comparing Bayesian methods on different datasets. Hence, the proportion of food
product covariates to be shuffled should be large enough to break the original correlation
structures among food product covariates, but small enough to keep some of the correlations of
the food product covariates with the pooled ham covariate intact. As a trade-off, the proportion
of shuffled food product covariates is set to 50%.
In addition, the consumption of food products is probably influenced by the confounding
variables, age and gender. To control for age and gender, we shuffle food product covariates
within each stratum which is constructed based on gender and age groups. Considering that
people of different ages have different dietary habits and that there should be at least one case-
control pair in each stratum, six age-gender strata were constructed: 0-19 years (children and
teenagers), 20-59 years (young and middle-aged adults) and ≥ 60 years (older adults) for both
females and males. Because no females aged 0-19 years participated in the case-control study,
there are five age-gender strata in total.
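The stratified shuffling described above can be sketched as follows (a minimal illustration with a hypothetical data layout in which each subject is a dict of covariates; not the thesis code). One random permutation per stratum is applied to all selected covariates jointly, so the selected columns are shuffled together:

```python
import random

def stratum(subject):
    """Age-gender stratum: 0-19, 20-59 and >= 60 years, for each gender."""
    band = 0 if subject["age"] < 20 else (1 if subject["age"] < 60 else 2)
    return (subject["gender"], band)

def shuffle_within_strata(subjects, covariates, rng):
    """Randomly reassign the values of the selected covariates among subjects
    of the same stratum, using one permutation for all covariates jointly and
    leaving all other columns fixed."""
    groups = {}
    for idx, s in enumerate(subjects):
        groups.setdefault(stratum(s), []).append(idx)
    for members in groups.values():
        perm = members[:]
        rng.shuffle(perm)  # random permutation within the stratum
        for cov in covariates:
            vals = [subjects[i][cov] for i in perm]
            for i, v in zip(members, vals):
                subjects[i][cov] = v
```

Because only values within a stratum are permuted, the per-stratum distribution of each shuffled covariate is preserved, while associations with unshuffled covariates (such as the pooled ham) are broken.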
The simulation scheme used in this simulation study is summarized in Figure 4.1. The steps
of creating these scenarios in Figure 4.1 are listed in the following sections.
4.2.2 Algorithm
For each of 100 iterations, the following steps are taken:
• Step 2 Randomly select 50% of 150 food product covariates except the pooled ham covariate.
• Step 3 Randomly assign values of selected covariates of one subject to a different subject
in the same stratum.
Figure 4.1: A flow graph of the simulation scheme
• Step 1 Sample a variable j from the discrete uniform distribution: j ∼ DiscreteUnif(1, 4).
• Step 1 Sample a variable i from the discrete uniform distribution: i ∼ DiscreteUnif(5, 19).
• Step 2 Randomly draw i subjects from the non-NA pooled ham covariate values and set
those pooled ham covariate values to missing.
Finally, save the above 100 new datasets.
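The missingness-injection step above can be sketched as follows (a minimal illustration, assuming missing values are coded as None; not the thesis code):

```python
import random

def add_missingness(pooled_ham, rng):
    """Draw i ~ DiscreteUnif(5, 19) and set i randomly chosen non-missing
    pooled-ham values to missing (coded here as None)."""
    i = rng.randint(5, 19)  # inclusive bounds, as in the step above
    observed = [k for k, v in enumerate(pooled_ham) if v is not None]
    for k in rng.sample(observed, i):
        pooled_ham[k] = None
    return pooled_ham
```

With 61 subjects, drawing i between 5 and 19 corresponds to increasing the missingness rate in the pooled ham covariate by roughly 8.20% to 31.15%, which matches the randomized increases described in Chapter 6.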
this practice is not feasible in this thesis project. First, multivariable models probably cannot be
fitted on the data up to the first early dates due to convergence failures in logistic regression.
Second, multivariable models only give us estimates for the covariates which are selected by
the univariable models; we cannot obtain estimates for the other covariates.
our case because the posterior distributions were unknown in the beginning. On the other hand,
the HW diagnostic has no requirement on initial values. However, the HW diagnostic has the
disadvantage that it can only be used for testing the convergence of a single chain; it cannot
give a combined result over multiple chains. Considering the advantages and disadvantages
of these two diagnostic tests, we applied both.
For the implementation of the complete Bayesian variable selection model, we ran 8 chains
with a burn-in of 2000 iterations and then a further 1875 iterations per chain, resulting in a
posterior sample of size 15000. The required time was 3478.067 seconds (i.e. 0.966 hours). The
convergence of the 161 ω's and 1 Se, i.e. 162 parameters, was tested. The HW diagnostic showed
that for each chain, the number of parameters which failed the convergence diagnosis varied
from 15 to 25. The chains of the other parameters passed the convergence test when started at
iteration 1, 189, 376, 564 or 751. In the Gelman-Rubin diagnostic, the point estimate
of the PSRF of two of the parameters was 1.1; for the rest of the parameters, the point estimate of
the PSRF was between 1.00 and 1.09. Combining the results of the two diagnostic tests, we
conclude that adequate or partial convergence had been reached in the chains of most of the
parameters.
When 8 chains with a burn-in of 3000 iterations and a further 1875 iterations per chain were run,
the results of the convergence diagnostics improved. In the HW diagnostic, for each chain,
12-22 parameters failed the convergence test. The Gelman-Rubin diagnostic showed that all
parameters had point estimates of the PSRF less than 1.1. However, the required time increased
to 4342.600 seconds (i.e. 1.21 hours). It is possible that all parameters would pass these two
diagnostic tests if the number of burn-in iterations were increased greatly.
For the implementation of the model composed of the Bayesian variable selection and missing
value imputation parts, we ran 8 chains with a burn-in of 2000 iterations and then a further 1875
iterations per chain. The required time was 3377.166 seconds (i.e. 0.938 hours). A total of 161
parameters were tested in the convergence diagnosis. In the HW diagnostic, for each chain,
3-7 parameters failed the convergence test. In the Gelman-Rubin diagnostic, the point estimates
of the PSRF of all parameters were either 1.00 or 1.01. The convergence test results did not
improve when we ran 8 chains with a burn-in of 3000 iterations and a further 1875 iterations
per chain. In the HW diagnostic, 5-11 parameters failed the convergence test for each chain. In
the Gelman-Rubin diagnostic, the point estimate of the PSRF for all parameters was either 1.00
or 1.01. On the other hand, the required time increased greatly, to 5414.375 seconds (i.e. 1.50
hours).
Given the time required, we chose to run 8 chains with a burn-in of 2000 iterations and
then a further 1875 iterations per chain when fitting the Bayesian variable selection
models. This choice provides relatively good convergence while also limiting the computation
time, which is important given the time constraints of this thesis.
4.6.4 Model fitting steps
There are five models which are fitted on the datasets, namely, the standard logistic regression
model and four different versions of the Bayesian variable selection models. The five models are:
1. The standard logistic regression model.
2. Only Bayesian variable selection. The misclassified responses are not corrected here and
thus the response model is the one in Equation 2.1. In addition, the missing covariates are
set to 1.
3. Bayesian variable selection and misclassification correction. The missing covariates are set
to 1.
4. Bayesian variable selection and missing value imputation. The response model is the logistic
regression model in Equation 2.1.
5. The complete Bayesian variable selection model, with both misclassification correction and
missing value imputation.
For each dataset generated in Sections 4.2-4.5, the above five models are fitted:
• Step 1 Divide each dataset into five subsets which contain data up to 2 March, 9 March,
16 March, 22 March and 5 April.
• Step 2 Fit the five models on the five subsets. The model fitting is stopped as soon
as a model identifies the pooled ham as the most likely suspect; otherwise, the model
fitting is continued until the model has been fitted to the complete dataset (5 April). For the
standard logistic regression model, the pooled ham covariate is regarded as the most likely
suspect if it has a p-value of less than 0.05 and the highest positive estimated coefficient
among all food product covariates. For the Bayesian variable selection models, the pooled
ham is considered the most likely suspect if its one-sided posterior inclusion probability,
P(βham > 0.05 | Data), is the highest among the food product covariates.
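The two decision rules in Step 2 can be sketched as follows (hypothetical dictionaries mapping covariate names to fitted statistics; not the thesis code):

```python
def lr_most_likely_suspect(pvalues, coefs, food_products, alpha=0.05):
    """Standard logistic regression rule: among significant food products
    (p < alpha), return the one with the highest positive coefficient,
    or None if no product qualifies."""
    candidates = [fp for fp in food_products
                  if pvalues[fp] < alpha and coefs[fp] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda fp: coefs[fp])

def bayes_most_likely_suspect(incl_prob, food_products):
    """Bayesian rule: return the food product with the highest one-sided
    posterior inclusion probability P(beta > 0.05 | Data)."""
    return max(food_products, key=lambda fp: incl_prob[fp])
```

Note that the logistic regression rule can return no suspect at all on a given date, whereas the Bayesian rule always ranks some product highest; the thesis additionally requires the top-ranked product to be the pooled ham before recording a detection.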
For the standard logistic regression model, if the pooled ham is found to be the most likely suspect
at a certain date, we record this date and the time consumed by fitting the model on the data up
to this date. Otherwise, nothing is recorded. For the Bayesian variable selection models, if
the pooled ham is found to be the most likely suspect at a certain date, we record this date, the
time required for fitting the model on the data up to this date, and the one-sided posterior inclusion
probability of the pooled ham covariate, P(βham > 0.05 | Data), at this date. Otherwise, nothing
is recorded. We refer to these dates as the earliest detection dates.
models, we record the earliest detection date, the required time and the one-sided posterior
inclusion probability of the pooled ham covariate, P(βham > 0.05 | Data), at this date. For
model comparisons, the primary evaluation criterion is the correct detection number, which is
the number of datasets out of the 100 datasets in which a model correctly detected the contaminated
food product. The medians of the recorded earliest detection dates and the medians and standard
deviations of P(βham > 0.05 | Data) at the earliest detection dates are used as secondary
evaluation criteria. We use the secondary evaluation criteria for model comparison only if two
models give similar correct detection numbers. In addition, the average required time
of each model in each scenario is also compared, in order to gain insight into how long a
model takes on average.
Chapter 5
Results
In this chapter, we provide figures of the correct detection number, the frequency of earliest
detection dates, the one-sided posterior inclusion probabilities of the pooled ham covariate and
the average required time for the different models in the four scenarios to compare models. Tables
of performance measures on each model are provided in Appendix E. We refer to the scenarios
in the following way:
• S1: original response misclassification rate and missingness rate in the pooled ham covariate,
• S2: increased response misclassification rate,
• S3: increased missingness rate in the pooled ham covariate,
• S4: increased response misclassification rate and missingness rate in the pooled ham covariate.
Figure 5.1: A plot of the correct detection numbers for the different models in the four scenarios.
On the x-axis, we have scenarios: S1: original response misclassification rate and missingness
rate in the pooled ham covariate, S2: increased response misclassification rate, S3: increased
missingness rate in the pooled ham covariate, and S4: increased response misclassification rate
and missingness rate in the pooled ham covariate. In the legend, we have models: LR: standard
logistic regression model, M1: model with only Bayesian variable selection, M2: model with
Bayesian variable selection and misclassification correction, M3: model with Bayesian variable
selection and missing value imputation, and M4: complete Bayesian variable selection model.
decrease in the correct detection number. The increase in both the response misclassification
rate and the missingness rate in the pooled ham covariate has the largest negative influence on
the correct detection number for Bayesian variable selection models. For the standard logistic
regression model, the increase in the response misclassification rate has the largest negative
influence.
In particular, we are interested in two cases. First, we consider the performance of the model
with misclassification correction in Scenario S2 (i.e. the scenario with an increased response
misclassification rate). The model with misclassification correction has a correct detection
number of 41 in Scenario S1; its correct detection number decreases to 35 in Scenario S2, a
decrease of 14.6%. In contrast, the model with only Bayesian variable selection
experiences a decrease of 34.6% (the correct detection number decreases from 26 to 17). This
indicates that misclassification correction makes the performance of the Bayesian variable selection
model more resistant to increased response misclassification.
Second, we consider the performance of the model with missing value imputation in the
scenario S3 (i.e. the scenario with the increased missingness rate in the pooled ham covariate).
The model with missing value imputation has a correct detection number of 88 in Scenario
S1, which decreases to 50 in Scenario S3, a decrease of 43.2%.
In contrast, the model with only Bayesian variable selection experiences a decrease of 11.5%
(the correct detection number decreases from 26 to 23). This indicates that the performance
of the model with Bayesian variable selection and missing value imputation is more sensitive
to the increase in the missingness rate than that of the model with only Bayesian variable
selection. However, we expected missing value imputation to make the performance of the
Bayesian variable selection model more resistant to increased missingness. A possible
explanation of this counterintuitive result is discussed in Section 6.1.
Figure 5.2: A plot of the correct detection numbers for the different models in the four scenarios.
On the x-axis, we have models: LR: standard logistic regression model, M1: model with only
Bayesian variable selection, M2: model with Bayesian variable selection and misclassification
correction, M3: model with Bayesian variable selection and missing value imputation, and M4:
complete Bayesian variable selection model.
As previously mentioned in Section 1.3, the second sub-question of this thesis is which parts of
the Bayesian variable selection model contribute to the performance in each of the four scenarios.
Figure 5.2 facilitates answering this sub-question.
Figure 5.2 shows that in each scenario the complete Bayesian variable selection model performs
best and the standard logistic regression model performs worst in terms of the
correct detection number. Bayesian variable selection, misclassification correction and missing
value imputation all contribute positively to the model performance. In particular, missing value
imputation contributes the most among these three components.
5.3 Average required time at each date
Figure 5.5 shows the average required time at each date for each model in each scenario. The
required time is recorded only when the pooled ham covariate is successfully detected; when
a model fails to detect the pooled ham, the required time is not recorded. Therefore, missing
values exist in the recorded required times. We assume that the required time is not affected by
whether a model can successfully detect the pooled ham covariate. Under this assumption,
for each model in each scenario, the average required time at each date shown in Figure 5.5
is computed by averaging the recorded required times at that date. Although this
estimate of the average required time is unbiased under the above assumption, it
is not precise when the number of recorded required times is small. For example, for the
standard logistic regression model in Scenarios S2 and S4, the estimate of the average required
time at each date is based on a sample size of less than 6, and these estimates are not precise.
This may explain why the average required time does not increase monotonically with
date for the standard logistic regression model in Scenarios S2 and S4.
Figure 5.5 provides some general information about the required time for each model in each
scenario. From this figure, we can see that the standard logistic regression model requires the
least time: the average required time is less than 1 second. The model with only Bayesian
variable selection and the model with Bayesian variable selection and misclassification correction
consume relatively little time; on average, the former requires less than 35 seconds and the
latter less than 2 minutes. When the complete Bayesian variable selection model or
the model with Bayesian variable selection and missing value imputation is applied, the average
required time increases greatly: on average, they consume 0.5 to 1 hour. This substantial
increase in the average required time indicates that missing value imputation is the most
time-consuming component.
In addition, looking at the average required time for a model across dates in each scenario, we
find that in general the average required time increases monotonically with date. This suggests
that the computation time of a model increases with the sample size.
Figure 5.3: Histograms of earliest detection dates and failed detections for the different mod-
els in the four scenarios. In the horizontal direction, we have scenarios: S1: original response
misclassification rate and missingness rate in the pooled ham covariate, S2: increased response
misclassification rate, S3: increased missingness rate in the pooled ham covariate, and S4: in-
creased response misclassification rate and missingness rate in the pooled ham covariate. In the
vertical direction, we have models: LR: standard logistic regression model, M1: model with only
Bayesian variable selection, M2: model with Bayesian variable selection and misclassification
correction, M3: model with Bayesian variable selection and missing value imputation, and M4:
complete Bayesian variable selection model.
Figure 5.4: Boxplots of the one-sided posterior inclusion probabilities of the pooled ham covariate
for the different models in the four scenarios when the pooled ham covariate is detected as the
most likely suspect at the earliest detection date. In the horizontal direction, we have models:
M1: model with only Bayesian variable selection, M2: model with Bayesian variable selection
and misclassification correction, M3: model with Bayesian variable selection and missing value
imputation, and M4: complete Bayesian variable selection model. In the vertical direction, we
have scenarios: S1: original response misclassification rate and missingness rate in the pooled
ham covariate, S2: increased response misclassification rate, S3: increased missingness rate in
the pooled ham covariate, and S4: increased response misclassification rate and missingness rate
in the pooled ham covariate. In the boxplot, the lower and upper hinges correspond to the first
and third quartiles. The band inside the box is the second quartile (the median). The ends
of the whiskers represent the lowest datum still within 1.5 IQR of the lower quartile and the
highest datum still within 1.5 IQR of the upper quartile (where IQR is the inter-quartile range
or distance between the first and third quartiles). Data points located outside the whiskers are
outliers.
Figure 5.5: Plots of the average required time at each date for each of the five
models in each of the four scenarios. In the horizontal direction, we have scenarios: S1: original
response misclassification rate and missingness rate in the pooled ham covariate, S2: increased
response misclassification rate, S3: increased missingness rate in the pooled ham covariate, and
S4: increased response misclassification rate and missingness rate in the pooled ham covariate.
On the x-axis, we have dates: D1: 2 March, D2: 9 March, D3: 16 March, D4: 22 March and
D5: 5 April. In the vertical direction, we have models: LR: standard logistic regression model,
M1: model with only Bayesian variable selection, M2: model with Bayesian variable selection
and misclassification correction, M3: model with Bayesian variable selection and missing value
imputation, and M4: complete Bayesian variable selection model. Note that plots of different
models have different scales of y-axis.
Chapter 6
Discussion
clear and consistent guidance on how to choose the prior distributions. To some extent, the choice
of prior distributions is a subjective decision. In addition, different choices of prior distributions
may change the final result we obtain, i.e. the most likely suspect in the food-borne
disease outbreak.
However, one should not ignore that the standard logistic regression used in outbreak
investigations is subject to the same problem. The choice of the cut-off p-value in the pre-selection
is subjective; there is no standard screening criterion. Different choices of the p-value affect the
final result obtained from the standard logistic regression. The problem of subjective decisions
exists in both the Bayesian approach and the standard logistic regression.
1. Data simulation. Data simulation is a crucial component of this thesis. At the same time,
it is the most difficult component. On one hand, we are faced with inherent difficulties of
simulating food-borne disease outbreak data. As explained in Section 4.1, in the design
of a simulation study for food-borne disease outbreak data, one needs to capture features
of food-borne disease outbreak data. On the other hand, we need to consider the time
constraint on simulations. First, the total time of the whole thesis project is tight. Second,
the process of model fitting is unavoidably time-consuming. Hence, there is not much time
left for the process of simulating data.
In this thesis, we designed a simulation study based on real outbreak data and applied shuffling
to generate new datasets. This simulation method enables us to quickly simulate food-
borne disease outbreak data under the time constraint, while keeping the simulated data as
realistic as possible.
2. Scenario settings. In this simulation study, we set four different scenarios: (i) original
response misclassification rate and missingness rate in the pooled ham covariate, (ii) in-
creased response misclassification rate (+10%), (iii) increased missingness rate (+20%) in
the pooled ham covariate, (iv) increased response misclassification rate (+10%) and miss-
ingness rate (+20%) in the pooled ham covariate. In setting the latter three scenarios, we
introduce randomness by increasing the response misclassification rate by 4.17%-16.67%
and increasing the missingness rate by 8.20%-31.15%. On average, the expected increases
of the response misclassification rate and missingness rate are reached. Introducing ran-
domness brings more variations to the datasets compared with the practice in which the
response misclassification rate is increased by 10% and the missingness rate is increased by
20% in each dataset.
42
3. Model fitting. From our previous experience, the complete Bayesian variable selection
model and the model composed of the Bayesian variable selection and missing value
imputation parts are the most time-consuming among the five models. We implemented these
two models using different choices of burn-in iterations and the same posterior sample
size, and performed convergence diagnostic tests. The final choice of burn-in iterations
and posterior sample size provides relatively good convergence while also limiting the
computation time.
the generated datasets compared with our current method, because on average, the datasets are
the same as the original outbreak dataset.
Also, one could use a model-based simulation design, which is a popular design for a simulation
study from scratch. However, in the context of food-borne disease outbreaks, this design is not
practical: such a model would have to be complex enough to capture all the features of a real
food-borne disease dataset, and there is no simple model that one could use to generate new
datasets. We have already shown the poor performance of the standard logistic regression model.
In an ideal situation, one could assess the performance of different versions of Bayesian vari-
able selection models using hundreds or even thousands of real food-borne disease outbreak
datasets. This would be the most informative way to evaluate model performance. However, we
do not have a large volume of real outbreak datasets at hand. A simulation study is still needed.
6.3 Conclusions
In this thesis, we studied how different parts of Bayesian variable selection models affect model
performance in scenarios with (i) original response misclassification rate and missingness rate in
the pooled ham covariate, (ii) increased response misclassification rate (+10%), (iii) increased
missingness rate (+20%) in the pooled ham covariate, (iv) increased response misclassification
rate (+10%) and missingness rate (+20%) in the pooled ham covariate. This simulation study
reveals the following findings:
(i) For the four different versions of Bayesian variable selection models studied in this thesis,
the increase in the response misclassification rate or the missingness rate in the assumed
responsible food product covariate or the increase in both results in a decrease in the correct
detection number;
(ii) The increase in both the response misclassification rate and the missingness rate in the
assumed responsible food product covariate has the largest negative impact on the correct
detection number;
(iii) Bayesian variable selection, misclassification correction and missing value imputation all
contribute positively to the model performance in the context of food-borne disease outbreaks.
Although missing value imputation is the most computationally expensive, it contributes
the most to the model performance among these three components.
Based on the above findings, we recommend applying the complete Bayesian variable selection
model in the statistical analysis of food-borne disease outbreak data. We cannot simplify the
complete Bayesian variable selection model without hampering model performance.
Bibliography
Belin, T. R., Hu, M. Y., Young, A. S., and Grusky, O. (2000). Using multiple imputation to
incorporate cases with missing items in a mental health services study. Health Services and
Outcomes Research Methodology, 1(1):7–22.
Best, N. G., Cowles, K., Vines, K., and Plummer, M. (2006). CODA: Convergence Diagnosis
and Output Analysis for MCMC. R News, 6(1):7–11.
Brandwagt, D., van den Wijngaard, C., Tulen, A., Mulder, A., Hofhuis, A., Jacobs, R., Heck,
M., Verbruggen, A., van den Kerkhof, J., Slegers-Fitz-James, I., Mughini-Gras, L., and Franz,
E. (2018). Outbreak of Salmonella Bovismorbificans in the Netherlands, associated with the
consumption of uncooked ham products, 2016 to 2017. Euro Surveillance, 23(1):pii=17–00335.
Centers for Disease Control and Prevention (2015). Guide to Confirming an Etiology in
Foodborne Disease Outbreak. URL https://www.cdc.gov/foodsafety/outbreaks/investigating-
outbreaks/confirming_diagnosis.html [Accessed: 15 October, 2015].
Dwyer, D. M., Strickler, H., Goodman, R. A., and Armenian, H. K. (1994). Use of case-control
studies in outbreak investigations. [Review]. Epidemiologic Reviews, 16(1):109–123.
Erler, N. S., Rizopoulos, D., van Rosmalen, J., Jaddoe, V. W. V., Franco, O. H., and Lesaffre,
E. M. E. H. (2016). Dealing with missing covariates in epidemiologic studies: a comparison
between multiple imputation and a full Bayesian approach. Statistics in Medicine, 35(17):2955–
2974.
European Food Safety Authority (EFSA) and European Centre for Disease Prevention and Con-
trol (ECDC) (2015). The European Union summary Report on trends and sources of zoonoses,
zoonotic agents and foodborne outbreaks in 2014. EFSA Journal, 13(12):4329 [191 pp.].
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
Friesema, I., de Jong, A., Hofhuis, A., Heck, M., van den Kerkhof, H., de Jonge, R., Hameryck,
D., Nagel, K., van Vilsteren, G., van Beek, P., Notermans, D., and Van Pelt, W. (2014). Large
outbreak of Salmonella Thompson related to smoked salmon in the Netherlands, August to
December 2012. Eurosurveillance, 19(39):1–8.
Gelman, A., Carlin, J., Stern, H., and Rubin, D. (2004). Bayesian Data Analysis. Chapman and
Hall/CRC, Boca Raton, second edition.
Gelman, A. and Rubin, D. B. (1992). Inference from Iterative Simulation Using Multiple Se-
quences. Statistical Science, 7(4):457–472.
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of
the American Statistical Association, 88(423):881–889.
Gilbert, R., Martin, R. M., Donovan, J., Lane, J. A., Hamdy, F., Neal, D. E., and Metcalfe, C.
(2016). Misclassification of outcome in case-control studies: Methods for sensitivity analysis.
Statistical Methods in Medical Research, 25(5):2377–2393.
Hald, T., Aspinall, W., Devleesschauwer, B., Cooke, R., Corrigan, T., Havelaar, A. H., Gibb,
H. J., Torgerson, P. R., Kirk, M. D., Angulo, F. J., Lake, R. J., Speybroeck, N., and Hoffmann,
S. (2016). World Health Organization estimates of the relative contributions of food to the
burden of disease due to selected foodborne hazards: A structured expert elicitation. PLoS
ONE, 11(1):1–35.
Harrell, F. (2015). Regression modeling strategies with applications to linear models, logistic and
ordinal regression, and survival analysis. Springer, New York, second edition.
Heidelberger, P. and Welch, P. D. (1983). Simulation Run Length Control in the Presence of an
Initial Transient. Operations Research, 31(6):1109–1144.
Hosmer, D., Lemeshow, S., and Sturdivant, R. (2013). Applied Logistic Regression. John Wiley
and Sons, Hoboken, third edition.
Ibrahim, J. G., Chen, M. H., and Lipsitz, S. R. (2002). Bayesian Methods for Generalized Linear
Models with Covariates Missing at Random. The Canadian Journal of Statistics, 30(1):55–78.
Jacobs, R., Lesaffre, E., Teunis, P. F., Höhle, M., and van de Kassteele, J. (2017). Identifying
the source of food-borne disease outbreaks: An application of Bayesian variable selection.
Statistical Methods in Medical Research, pages 1–15.
Last, J. M. (2000). A Dictionary of Epidemiology. Oxford University Press, New York, fourth
edition.
Lesaffre, E. and Albert, A. (1989). Partial Separation in Logistic Discrimination. Journal of the
Royal Statistical Society. Series B (Methodological), 51(1):109–116.
Lesaffre, E. and Lawson, A. B. (2012). Bayesian Biostatistics. John Wiley and Sons, Chichester.
Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. Wiley, New
York.
Matignon, R. (2005). Neural Network Modeling Using SAS Enterprise Miner. AuthorHouse,
Bloomington.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall, London,
second edition.
Mitra, R. and Dunson, D. (2010). Two-level stochastic search variable selection in GLMs with
missing predictors. International Journal of Biostatistics, 6(1).
Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized linear models.
Journal of the Royal Statistical Society. Series B: Statistical Methodology, 69(4):659–677.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs
sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing
(DSC 2003), pages 20–22.
Pogreba-Brown, K., Ernst, K., and Harris, R. (2014). Case-case methods for studying enteric
diseases: A review and approach for standardization. OA Epidemiology, 7(1):1–9.
Raphael, K. (1987). Recall bias: A proposal for assessment and control. International Journal
of Epidemiology, 16(2):167–170.
Tang, L., Lyles, R. H., King, C. C., Celentano, D. D., and Lo, Y. (2015). Binary regression with
differentially misclassified response and exposure variables. Statistics in Medicine, 34(9):1605–
1620.
Thomas, D., Stram, D., and Dwyer, J. (1993). Exposure measurement error: influence on
exposure-disease relationships and methods of correction. Annual Review of Public Health,
14:69–93.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal
Statistical Society B, 58(1):267–288.
Appendix A
market1 <- which(colnames(quest.data) == "ah")
quest.data[, market1:food2][is.na(quest.data[, market1:food2])] <- 0
for (i in 1:nrow(quest.data)) {
  if (quest.data$brauwham[i] == 1 | quest.data$bgerham[i] == 1 |
      quest.data$bcoburgham[i] == 1) {
    quest.data$pooledHam[i] <- 1
  }
  if (quest.data$brauwham[i] == 0 & quest.data$bgerham[i] == 0 &
      quest.data$bcoburgham[i] == 0) {
    quest.data$pooledHam[i] <- 0
  }
}
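The row-wise loop above can also be expressed in a single vectorized step. The sketch below uses a toy stand-in for `quest.data` (illustrative values only) and assumes, as in the preceding lines, that the three ham indicator columns contain no missing values:

```r
# Toy stand-in for quest.data (illustrative values only)
quest.data <- data.frame(brauwham   = c(1, 0, 0, 1),
                         bgerham    = c(0, 0, 1, 1),
                         bcoburgham = c(0, 0, 0, 0))
# Vectorized pooling: pooledHam is 1 if any of the three ham products
# was consumed, and 0 otherwise
quest.data$pooledHam <- as.integer(quest.data$brauwham == 1 |
                                   quest.data$bgerham == 1 |
                                   quest.data$bcoburgham == 1)
quest.data$pooledHam  # 1 0 1 1
```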
# Load data from the Salmonella Thompson outbreak
# mydata <- load_data(filename = "Salmonella_2012.csv", ham = 0)
# Divide data into two groups: the case group and the control group
case.data <- mydata[mydata$case == 1, ]
control.data <- mydata[mydata$case == 0, ]
case.no <- nrow(case.data)
control.no <- nrow(control.data)
# Epi-curve
library("epitools")
date <- as.Date(case.data$date, format = "%m/%d/%Y")
x <- epicurve.weeks(date, format = "%y-%m-%d", axisnames = FALSE,
                    xlab = "Week of Year", ylab = "Cases per week",
                    tick.offset = 0.5, space = 0.5)
axis(1, at = x$xvals, labels = x$cweek)
axis(1, at = xx$xvals, labels = xx$cweek)
# Missing covariates
# Percentages of missing covariates per covariate for cases and controls
# Column index of the first food product
food.1 <- which(colnames(mydata) == "kipfilet")
na_numbers.total <- sapply(mydata[, food.1:ncol(mydata)],
                           function(y) sum(is.na(y)))
na_percentage.total <- na_numbers.total / nrow(mydata)
na.total <- data.frame(na_numbers.total, na_percentage.total)
# Percentages of missing pooledHam (vis_gerookt) values for cases and controls
na_count[which(rownames(na_count) == "pooledHam"), ]
# na_count[which(rownames(na_count) == "vis_gerookt"), ]
  if (na_count$na_percentage.case[i] < na_count$na_percentage.control[i]) {
    j[i] <- 1
  }
}
sum(j) / nrow(na_count)
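The comparison that this fragment performs (its loop header falls on the previous page) can also be written without an explicit loop. The `na_count` values below are illustrative stand-ins, not outbreak data:

```r
# Assumed na_count layout: one row per covariate, with the missingness
# percentages for cases and controls (illustrative values)
na_count <- data.frame(na_percentage.case    = c(0.10, 0.30, 0.05),
                       na_percentage.control = c(0.20, 0.25, 0.15))
# Proportion of covariates with lower missingness among cases than controls
mean(na_count$na_percentage.case < na_count$na_percentage.control)  # 2/3
```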
Appendix B
# Replace NA with 1
mydata[is.na(mydata)] <- 1
    }
  }
  sig.var <- sig.var[sig.var != 0]
  return(sig.var)
}
library(MASS)
multi_model <- function(sig.var, dataset) {
  # Perform multivariate analysis and build a final model
  # using a backward variable selection based on the AIC.
  #
  # Args:
  #   sig.var: A vector containing the indices of the covariates
  #            selected in the univariable analysis.
  #   dataset: A data frame which contains the outbreak data.
  #
  # Returns:
  #   A result summary of the final model.
  dataset$age <- (dataset$age - mean(dataset$age)) / sd(dataset$age)
  data.sig <- cbind(dataset$case, dataset$age, dataset$geslacht,
                    dataset[, sig.var])
  colnames(data.sig)[1] <- "case"
  colnames(data.sig)[2] <- "age"
  colnames(data.sig)[3] <- "geslacht"
  logitMod_mul <- glm(case ~ ., data = data.sig,
                      family = binomial(link = "logit"))
  step <- stepAIC(logitMod_mul, scope = list(lower = ~ age + geslacht),
                  direction = "backward", trace = 0)
  logitMod_final <- glm(formula(step), data = data.sig,
                        family = binomial(link = "logit"))
  summary(logitMod_final)$coefficients
}
date2 <- subset(mydata, as.Date(mydata$date, "%m/%d/%Y") <= as.Date("2017-03-09"))
date2.result <- multi_model(uni_model(date2), date2)
date3 <- subset(mydata, as.Date(mydata$date, "%m/%d/%Y") <= as.Date("2017-03-16"))
date3.result <- multi_model(uni_model(date3), date3)
date4 <- subset(mydata, as.Date(mydata$date, "%m/%d/%Y") <= as.Date("2017-03-22"))
date4.result <- multi_model(uni_model(date4), date4)
Appendix C
mydata <- load_data(filename = "Ham_2017.csv", ham = 1)
new.dataset <- vector("list", 100)
# MCAR
# Indices of the non-NA values of the pooled ham covariate
missingness.dataset <- new.dataset
nonNA.index <- which(!is.na(mydata$pooledHam))
for (i in 1:100) {
  set.seed(i)
  # The number of missing values is between 5 and 19.
  missing.index <- sample(nonNA.index, sample(5:19, 1))
  for (j in 1:length(mydata$pooledHam)) {
    missingness.dataset[[i]]$pooledHam[j] <- ifelse(j %in% missing.index,
                                                   NA, mydata$pooledHam[j])
  }
}
save(missingness.dataset, file = "missingness_dataset.Rda")
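The inner loop over `j` can be collapsed into a single `replace()` call. The sketch below reproduces the same MCAR deletion scheme on a toy stand-in for `mydata` (illustrative values only):

```r
# Toy stand-in for mydata (illustrative values; 24 observed, 6 missing)
mydata <- data.frame(pooledHam = rep(c(1, 0, NA, 1, 0), times = 6))
nonNA.index <- which(!is.na(mydata$pooledHam))
missingness.dataset <- vector("list", 100)
for (i in 1:100) {
  set.seed(i)
  missingness.dataset[[i]] <- mydata
  missing.index <- sample(nonNA.index, sample(5:19, 1))
  # Vectorized: set the sampled positions to NA in one step
  missingness.dataset[[i]]$pooledHam <- replace(mydata$pooledHam,
                                                missing.index, NA)
}
```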
Appendix D
dates <- c("2017-03-02", "2017-03-09", "2017-03-16", "2017-03-22", "2017-04-05")
setClass(Class = "Information", representation(time = "numeric",
         suspectedfood = "character", inclupro.highest = "numeric",
         dif = "numeric"))
                         & length(unique(x)) == 1)
if (sum(remove.market) != 0) {
  markets[names(remove.market[which(remove.market)])] <-
    rep(NULL, sum(remove.market))
  market.names <- market.names[-which(remove.market)]
}
for (i in 1:n.obs) {
  logit(mu.X[i, 1]) <- alpha0[1]
  x_cov[i, 1] ~ dbern(mu.X[i, 1])
  for (j in 2:n.cov) {
    logit(mu.X[i, j]) <- alpha0[j] + inprod(alpha[j, 1:(j - 1)], x_cov[i, 1:(j - 1)])
    x_cov[i, j] ~ dbern(mu.X[i, j])
  }
}
# Priors
for (j in 1:n.cov) {
  alpha0[j] ~ dnorm(0, 0.001)
}
beta0 ~ dnorm(0, 0.001)
Se ~ dbeta(33, 4)
beta.fixed[1] ~ dnorm(0, 0.001)
beta.fixed[2] ~ dnorm(0, 0.001)
for (j in 1:n.cov) {
  precisionb[j] <- equals(gammab[j], 0) * prec.spike +
                   equals(gammab[j], 1) * prec.slab
  beta[j] ~ dnorm(0, precisionb[j])
  gammab[j] ~ dbern(omegab[j])
  omegab[j] ~ dbeta(1, 2)
}
for (j in 1:n.cov) {
  for (k in 1:(j - 1)) {
    precisiona[j, k] <- equals(gammaa[j, k], 0) * prec.spike +
                        equals(gammaa[j, k], 1) * prec.slab
    alpha[j, k] ~ dnorm(0, precisiona[j, k])
    gammaa[j, k] ~ dbern(omegaa[j, k])
    omegaa[j, k] ~ dbeta(1, 2)
  }
}
}"
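The `equals()` construction in the model string implements the spike-and-slab mixture of precisions: a coefficient with indicator 0 gets the tight spike precision (concentrating it near zero), while an indicator of 1 gives it the diffuse slab precision. The plain-R sketch below illustrates the same selection mechanism, with illustrative (assumed) values for `prec.spike` and `prec.slab`:

```r
# Illustrative (assumed) spike and slab precisions
prec.spike <- 100   # tight spike around zero
prec.slab  <- 0.1   # diffuse slab
gammab     <- c(0, 1, 1, 0)
# Same mixture as the equals() lines in the JAGS model
precisionb <- (gammab == 0) * prec.spike + (gammab == 1) * prec.slab
precisionb  # 100.0 0.1 0.1 100.0
```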
inits.list <- function() { list(beta = beta_init, alpha = alpha_init,
  beta0 = beta0_init, alpha0 = alpha0_init, .RNG.name = "base::Mersenne-Twister",
  .RNG.seed = sample.int(n = 100000, size = 1)) }
N <- 15000
n.thin <- 1
n.adapt <- 0
n.cores <- 8
n.burnin <- 2000
## Parameters to monitor
parameters <- c("beta", "beta0", "omegab", "Se")
data.list <- list(y = y, x_fix = x_fix, x_cov = x_cov, n.obs = n.obs, n.cov = n.cov,
                  c = c, tau = tau)
ptm <- proc.time()
post.runjags <- run.jags(model = model.string4, data = data.list,
                         inits = inits.list, n.chains = n.cores, adapt = n.adapt,
                         burnin = n.burnin, sample = round(N / n.cores), thin = n.thin,
                         method = "parallel", modules = "glm", monitor = parameters)
time <- proc.time() - ptm
post.mat <- as.matrix(as.mcmc(post.runjags))
g.mcmc <- post.mat[, grep("omegab", colnames(post.mat))]
n.cov <- ncol(g.mcmc)
b.mcmc <- post.mat[, grep("beta", colnames(post.mat))][, 1:n.cov]
inclprob <- apply(b.mcmc, 2, function(x) mean(x > 0.05))
suspected.food <- food.names[which.max(inclprob[1:length(food.names)])]
highest.prob <- max(inclprob[1:length(food.names)])
difference <- max(inclprob[1:length(food.names)]) -
  sort(inclprob[1:length(food.names)],
       partial = length(food.names) - 1)[length(food.names) - 1]
return(new("Information", time = time[3], suspectedfood = suspected.food,
           inclupro.highest = highest.prob, dif = difference))
}
Appendix E
Table E.1: The performance of the model with only Bayesian variable selection in the four
scenarios

                                              Correct     Median      Median of inclusion probability of Ham
                                              detection   detection   at the earliest detection date
                                              number      date        (standard deviation in brackets)
  Original rates                              26          03-22       0.539 (0.118)
  Increased misclassification                 17          03-22       0.580 (0.129)
  Increased missingness                       23          03-09       0.536 (0.106)
  Increased misclassification & missingness   16          03-09       0.563 (0.095)
Table E.2: The performance of the model with Bayesian variable selection and misclassification
correction in the four scenarios

                                              Correct     Median      Median of inclusion probability of Ham
                                              detection   detection   at the earliest detection date
                                              number      date        (standard deviation in brackets)
  Original rates                              41          03-22       0.519 (0.104)
  Increased misclassification                 35          03-22       0.524 (0.097)
  Increased missingness                       27          03-22       0.508 (0.099)
  Increased misclassification & missingness   23          03-22       0.548 (0.124)
Table E.3: The performance of the model with Bayesian variable selection and missing value
imputation in the four scenarios

                                              Correct     Median      Median of inclusion probability of Ham
                                              detection   detection   at the earliest detection date
                                              number      date        (standard deviation in brackets)
  Original rates                              88          03-22       0.650 (0.111)
  Increased misclassification                 55          03-22       0.625 (0.094)
  Increased missingness                       50          03-22       0.562 (0.119)
  Increased misclassification & missingness   32          03-22       0.589 (0.110)
Table E.4: The performance of the complete Bayesian variable selection model in the four sce-
narios

                                              Correct     Median      Median of inclusion probability of Ham
                                              detection   detection   at the earliest detection date
                                              number      date        (standard deviation in brackets)
  Original rates                              92          03-22       0.592 (0.149)
  Increased misclassification                 66          03-22       0.556 (0.129)
  Increased missingness                       64          03-22       0.576 (0.125)
  Increased misclassification & missingness   53          03-22       0.548 (0.112)
Table E.5: The performance of the standard logistic regression model in the four scenarios

                                              Correct     Median
                                              detection   detection
                                              number      date
  Original rates                              23          04-05
  Increased misclassification                  6          03-22
  Increased missingness                       11          03-22
  Increased misclassification & missingness    9          04-05