
Statistical Methods & Applications

https://doi.org/10.1007/s10260-023-00727-9

ORIGINAL PAPER

A flexible Bayesian variable selection approach for modeling interval data

Shubhajit Sen1,3 · Damitri Kundu1 · Kiranmoy Das1,2

Accepted: 7 September 2023


© The Author(s), under exclusive licence to Società Italiana di Statistica 2023

Abstract
Interval datasets are common in many disciplines, including medical experiments, econometric studies, and environmental studies. Traditionally, separate models are used for the center and the radius of the response variable. In this article, we consider a Bayesian regression framework for jointly modeling the center and the radius of the intervals corresponding to the response, and then use appropriate priors for variable selection. Unlike the traditional setting, both the centers and the radii of all the predictors are used for modeling the center and the radius of the response. We consider spike and slab priors for the regression coefficients corresponding to the centers (radii) of the predictors while modeling the center (radius) of the response, and a global–local shrinkage prior for the coefficients corresponding to the radii (centers) of the predictors. Through extensive simulation studies, we illustrate the effectiveness of our proposed variable selection approach for the analysis and prediction of interval datasets. Finally, we analyze a real dataset from a clinical trial related to Acute Lymphocytic Leukemia (ALL), and select the important set of predictors for modeling the lymphocyte count, an important biomarker for ALL. Our numerical studies show that the proposed approach is efficient, and provides powerful statistical inference for handling interval datasets.

Keywords Global–local shrinkage prior · Interval data · Joint model · MCMC · Spike and slab prior

* Kiranmoy Das
kiranmoy.das@gmail.com
1 Applied Statistics Division, Indian Statistical Institute, Kolkata, India
2 Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China
3 Department of Statistics, North Carolina State University, Raleigh, USA


1 Introduction

It is a common practice in statistics to summarize big datasets by lists, intervals, or distributions. These summarized data are called symbolic data, and they occur frequently in various disciplines including medical, agricultural, econometric, and environmental studies. A specific type of symbolic data, namely interval-valued data, arises in many disciplines. A set of variables is measured for a certain period of time (e.g. a day, a week, a month), and then summarized as intervals (based on their respective minimum and maximum values). For example, the daily temperature of a city is often reported as the lowest and the highest temperature for each day.
Analysis of interval datasets is typically challenging, and has become an important research problem in recent years. Various statistical models have been proposed in the literature for handling such datasets. Lima Neto and Carvalho (2008) proposed the Center and Range Method (CRM) for analysing interval datasets by converting intervals to centers and radii, and then considered separate regression models for the center and the radius of the response variable. Giordani (2015) proposed a Lasso-constrained regression model for analysing interval datasets. Lima Neto et al. (2011) proposed a bivariate symbolic regression model, and Lim (2016) used a non-parametric additive model for a similar analysis. Lima Neto and Carvalho (2017) also proposed a non-linear regression model for interval datasets. Alternatively, Xu (2010) developed a Symbolic Covariance Model (SCM), and Ahn et al. (2012) proposed a resampling-based Monte Carlo Method (MCM) for analysing interval data. More recently, Sun et al. (2021) developed models for analysing interval time series data, modeling the center and log-radius of the interval values by a two-equation vector autoregressive model with exogenous covariates. However, these methods mostly consider separate modeling of the center and the radius of the response, and hence might result in biased estimates.
Our work is based on the joint modeling of the center and radius of the response to account for their probable dependence. Xu and Qin (2022) proposed a Bivariate Bayesian Regression Model (BBM) which considers the dependence between the center and the radius of the response variable. Through simulation studies and real applications, Xu and Qin (2022) illustrated that BBM outperforms the other methods (e.g. CRM, SCM, MCM), yielding more powerful inference (smaller root mean squared errors). In fact, their study illustrates that BBM performs particularly well for a small sample size (n) and a large number of predictors (p). We consider the Bivariate Bayesian Model proposed by Xu and Qin (2022), but we use both the centers and radii of all predictors for (jointly) modeling the center and the radius of the response. Thus, the variable selection problem occurs naturally in our setting: with p predictors we get 4p regression coefficients in our joint model, and therefore need to zero out the unimportant predictors for developing a good predictive model.
The literature on variable selection is quite rich in a Bayesian framework.
George and McCulloch (1993, 1997) proposed a stochastic search variable selection (SSVS) method which selects the important predictors by averaging over all possible submodels. This approach used a diffuse prior for the inclusion of a predictor in the model, but a highly informative prior for its exclusion.
considered independent priors for the regression coefficients and the selection
indicators. These priors can effectively select the best model among all possi-
ble submodels. Following their works, we consider independent spike and slab
prior for the coefficients corresponding to the centers (radii) of the predictors
while modeling the center (radius) of the response. The relative importance of
the predictors is assessed by computing the posterior marginal inclusion probabilities (MIP). On the other hand, for the coefficients corresponding to the radii
(centers) of the predictors while modeling the center (radius) of the response
we use a global–local shrinkage prior (Polson and Scott 2010). In particular,
we consider horse-shoe priors introduced by Carvalho et al. (2010) which can
effectively shrink these regression coefficients towards zero. Thus, we expect to
obtain a better predictive model where only the important components (centers
and/or radii) of the predictors will contribute.
Our work is motivated by a real dataset from an experiment conducted by
Tata Translational Cancer Research Center (TTCRC), Kolkata. A group of
236 children, diagnosed as acute lymphocytic leukemia (ALL) patients, were
treated with two standard drugs for nearly two years. Patients were tested for
some important biomarkers (for example, the lymphocyte count, the neutrophil
count, the platelet count), and based on the response the doses of the drugs were
adjusted. Patients visited the clinic once or twice a month for almost two years. In the end, the biomarkers and drug doses were summarized as intervals
for each patient. We analyze this dataset for assessing the effectiveness of the
drug doses and the effects of the other two biomarkers on the lymphocyte counts, since an uncontrolled growth of lymphocytes causes ALL.
The rest of this article is organized as follows. In Sect. 2, we describe the Bivari-
ate Bayesian Model, and the Bayesian variable selection approach. Details of the
simulation studies and the results are summarized in Sect. 3. We compare our
proposed joint model with the traditional models in this section through simula-
tion studies. Results from the real data analysis are described in Sect. 4. Finally,
some concluding remarks are given in Sect. 5.

2 Model and method

We propose our joint regression model for modeling the center (C) and the
radius (R) of the response following Xu and Qin (2022). Note that the radius
values cannot be negative and therefore, we consider the natural logarithm of
the (response) radius values in our setting. We consider a cross-sectional dataset with n observations, a single response variable Y, and p predictors X₁, X₂, …, Xₚ.


2.1 Bayesian joint model

Let $Y_i^C$ and $Y_i^R$, respectively, denote the center and the radius of the response variable for the i-th observation. Similarly, let $X_i^C$ and $X_i^R$, respectively, denote the corresponding vectors of predictor centers and radii. Our Bayesian joint model is given as follows:

$$Y_i^C = X_i^{C\,\top} \beta^{C:C} + X_i^{R\,\top} \beta^{C:R} + \epsilon_i^C, \tag{2.1}$$

$$\log\left(Y_i^R\right) = X_i^{C\,\top} \beta^{R:C} + X_i^{R\,\top} \beta^{R:R} + \epsilon_i^R, \tag{2.2}$$

for i = 1, 2, … , n. Here, we have four sets of p × 1 parameter vectors: 𝛽 C∶C denotes the regression coefficients corresponding to the centers of predictors for modeling the
center of the response, 𝛽 C∶R denotes the regression coefficient corresponding to the
radii of predictors for modeling the center (of the response), 𝛽 R∶C denotes the regres-
sion coefficient corresponding to the centers of predictors for modeling the radius
(of the response), and 𝛽 R∶R denotes the regression coefficient corresponding to the
radii of predictors for modeling the radius of the response. The vectors of random
errors 𝜖 C and 𝜖 R jointly follow a bivariate normal distribution:

$$\begin{bmatrix} \epsilon_i^C \\ \epsilon_i^R \end{bmatrix} \overset{iid}{\sim} N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \; \Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} \right). \tag{2.3}$$

We note that the correlation coefficient 𝜌 measures the dependence between the center and the radius of the response, and 𝜎₁² and 𝜎₂², respectively, denote the variances of the corresponding random errors. We note that our joint model is different from the one proposed in Xu and Qin (2022) since we introduce the coefficients 𝛽 C∶R and 𝛽 R∶C in our model.

2.2 Prior specification

As shown in Eqs. (2.1) and (2.2), we consider four types of regression coefficients:
𝛽 C∶C, 𝛽 R∶R, 𝛽 C∶R and 𝛽 R∶C. The first two types of the coefficients will be referred to as
“Type 1 coefficients” and the last two as “Type 2 coefficients”. Existing works on mod-
eling interval data consider only the “Type 1 coefficients”, but we argue that for data
with a stronger correlation between the center and the radius “Type 2 coefficients” will
provide some meaningful inference. However, to avoid possible over-fitting, we
consider shrinkage priors for the “Type 2 coefficients”. For the proposed joint model
we are interested in the individual shrinkage (local shrinkage) of the Type 2 coeffi-
cients to eliminate some of the radius predictors while modeling the center of response,
and vice versa. Hence, we consider a “Horseshoe Prior” as an ideal candidate for such
global–local shrinkage.


2.2.1 Horseshoe prior

A family of shrinkage priors, namely "global–local priors", is defined by a scale mixture of normals on the coefficient vector, with further priors on the variance components of the normal distribution. Such priors can be written as:

$$\beta_j \mid \lambda_j, \tau \sim N(0, \lambda_j^2 \tau^2), \quad \lambda_j^2 \sim \pi(\lambda_j^2), \quad \tau^2 \sim \pi(\tau^2).$$

Here 𝜏² (the global shrinkage parameter) controls the overall shrinkage of all the parameters, whereas 𝜆ⱼ² (the local scale parameter) is responsible for the local shrinkage (shrinkage of the coefficient of the j-th predictor). Specifically, the horseshoe prior is obtained by using the half-Cauchy C⁺(0, 1) distribution as the prior on 𝜆ⱼ and 𝜏. The rationale for this prior on the scale parameters is that most of its mass is near 0, which induces shrinkage, while its heavy tail allows certain predictors to escape the shrinkage. In a simple set-up with y | 𝛽 ∼ N(𝛽, I), and the prior for 𝛽 defined with 𝜏 = 1, it can be shown that
$$E(\beta_j \mid y, \lambda_j) = \left(\frac{\lambda_j^2}{1+\lambda_j^2}\right) y_j + \left(\frac{1}{1+\lambda_j^2}\right) \cdot 0. \tag{2.4}$$

Define $\kappa_j = \frac{1}{1+\lambda_j^2}$. In this set-up, 𝜅ⱼ is the weight put on 0 (complete shrinkage) in the posterior mean of 𝛽ⱼ. Hence, 𝜅ⱼ can be interpreted as the "shrinkage coefficient".
It can be shown that under the horseshoe prior, 𝜅ⱼ ∼ Beta(0.5, 0.5). If we compare this with the Bayesian Lasso (Park and Casella 2008), which in this set-up can be written in the aforementioned form with $\pi(\lambda_j^2) = \frac{\lambda^2}{2}\exp\left(-\lambda^2 \lambda_j^2 / 2\right)$, we can see that the shrinkage factor of the horseshoe prior has a density that is symmetric about 0.5 and unbounded at both 0 and 1, with most of its mass near 0 and 1, resulting in either almost no shrinkage or complete shrinkage. The Lasso does not exhibit this feature: although the larger coefficients might remain unaffected by shrinkage, it sometimes cannot shrink the noise coefficients as much as it should. This explains the advantage of the horseshoe prior in our setting. For better shrinkage and more accurate inference, one can also consider the horseshoe+ prior (Bhadra et al. 2017) for the Type 2 coefficients.
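As an illustrative sketch (ours, not the paper's code; the paper's computations use R JAGS), the Beta(0.5, 0.5) behavior of the shrinkage coefficient 𝜅ⱼ = 1/(1 + 𝜆ⱼ²) under the horseshoe prior, with 𝜏 = 1, can be checked by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000

# Draw local scales from the half-Cauchy C+(0, 1) prior:
# the absolute value of a standard Cauchy is half-Cauchy distributed.
lam = np.abs(rng.standard_cauchy(n_draws))

# Shrinkage coefficient kappa_j = 1 / (1 + lambda_j^2);
# under the horseshoe prior this follows Beta(0.5, 0.5).
kappa = 1.0 / (1.0 + lam**2)

# Beta(0.5, 0.5) is U-shaped: mass piles up near 0 and 1, i.e.,
# coefficients are either left almost alone or shrunk nearly to zero.
near_ends = np.mean((kappa < 0.1) | (kappa > 0.9))
print(f"P(kappa < 0.1 or kappa > 0.9) ~ {near_ends:.3f}")  # roughly 0.41
print(f"mean(kappa) ~ {kappa.mean():.3f}")                 # roughly 0.5
```

The U-shape is exactly what distinguishes the horseshoe from the Lasso's shrinkage profile discussed above.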

2.2.2 Spike and slab prior

We note that for modeling interval data Type 1 coefficients are the natural
choices, and hence are considered in the existing works. Although our model includes both the Type 1 and Type 2 coefficients, the Type 1 coefficients deserve more attention. We consider "Spike and Slab priors" (Kedia et al. 2022) for these coefficients; such priors provide the relative importance of each predictor through their posterior marginal inclusion probabilities (Kuo and Mallick 1998). We consider the following prior specification for the Type 1
coefficients of our proposed joint model:
$$\beta_j \mid \gamma_j, \tau_j^2, \phi_j^2 \sim \gamma_j\, N\!\left(0, \tau_j^2\right) + (1-\gamma_j)\, N\!\left(0, \phi_j^2\right), \quad \tau_j^2 \sim IG(0.5, 0.5), \quad \gamma_j \mid p_j \sim \text{Bernoulli}(p_j),$$
where IG stands for the Inverse Gamma distribution. Here, 𝜙ⱼ² is a very small number (≈ 0.001) that creates the spike. The other component is known as the slab; to estimate the variance of the slab we place a vague prior on it (for example, Inverse Gamma(0.5, 0.5)). The prior for the mixing parameter 𝛾ⱼ is Bernoulli(pⱼ). Note that pⱼ can be interpreted as the prior inclusion probability of the j-th predictor in the model. For model selection, out of the (2^p − 1) possible submodels we choose the one with the highest posterior inclusion probability (Kuo and Mallick 1998).
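A minimal sketch of drawing coefficients from this prior (the spike value ≈ 0.001 follows the text; the function and variable names are ours, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_spike_slab(p_incl, phi2=0.001, n_coef=1000):
    """Draw n_coef coefficients from the spike-and-slab prior:
    beta_j ~ gamma_j N(0, tau_j^2) + (1 - gamma_j) N(0, phi_j^2),
    with tau_j^2 ~ IG(0.5, 0.5) and gamma_j ~ Bernoulli(p_incl)."""
    # Inverse-Gamma(0.5, 0.5) draws: 1 / Gamma(shape=0.5, rate=0.5),
    # and rate 0.5 corresponds to scale 2.0 in NumPy's parameterization.
    tau2 = 1.0 / rng.gamma(shape=0.5, scale=2.0, size=n_coef)
    gamma = rng.binomial(1, p_incl, size=n_coef)   # inclusion indicators
    slab = rng.normal(0.0, np.sqrt(tau2))          # wide, heavy-tailed component
    spike = rng.normal(0.0, np.sqrt(phi2))         # near-zero component
    return np.where(gamma == 1, slab, spike), gamma

beta, gamma = draw_spike_slab(p_incl=0.2)
# Excluded coefficients sit in a tight spike around zero,
# while included ones come from the heavy-tailed slab.
print(np.abs(beta[gamma == 0]).max())   # on the order of 0.1 or less
print(np.abs(beta[gamma == 1]).mean())
```

The posterior proportion of MCMC draws with 𝛾ⱼ = 1 is exactly the marginal inclusion probability used for variable selection in the paper.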

2.2.3 Complete prior structure

We consider the spike and slab priors for the Type 1 coefficients, and horse-shoe
priors for the Type 2 coefficients. Also, for the covariance matrix (of the random
errors) Σ, we take an Inverse-Wishart (IW) prior for conjugacy (Das et al. 2021).
In our case, the Inverse-Wishart prior is specified by a 2 × 2 scale matrix V0, and
degrees of freedom (df) 𝜈0, with the restriction 𝜈0 > 1. Below, we give the complete
specification of the prior distributions for our proposed joint model.
$$\beta_j^{C:C} \mid \gamma_j^{C:C}, {\tau_j^{C:C}}^2, {\phi_j^{C:C}}^2 \sim \gamma_j^{C:C}\, N\!\left(0, {\tau_j^{C:C}}^2\right) + (1-\gamma_j^{C:C})\, N\!\left(0, {\phi_j^{C:C}}^2\right), \quad {\tau_j^{C:C}}^2 \sim IG(0.5, 0.5), \quad \gamma_j^{C:C} \mid p_j^C \sim \text{Bernoulli}(p_j^C), \tag{2.5}$$

$$\beta_j^{R:R} \mid \gamma_j^{R:R}, {\tau_j^{R:R}}^2, {\phi_j^{R:R}}^2 \sim \gamma_j^{R:R}\, N\!\left(0, {\tau_j^{R:R}}^2\right) + (1-\gamma_j^{R:R})\, N\!\left(0, {\phi_j^{R:R}}^2\right), \quad {\tau_j^{R:R}}^2 \sim IG(0.5, 0.5), \quad \gamma_j^{R:R} \mid p_j^R \sim \text{Bernoulli}(p_j^R), \tag{2.6}$$

$$\beta_j^{C:R} \mid \lambda_j^{C:R}, \tau^{C:R} \sim N\!\left(0, {\lambda_j^{C:R}}^2 {\tau^{C:R}}^2\right), \quad \lambda_j^{C:R} \sim C^+(0, 1), \quad \tau^{C:R} \sim C^+(0, 1), \tag{2.7}$$

$$\beta_j^{R:C} \mid \lambda_j^{R:C}, \tau^{R:C} \sim N\!\left(0, {\lambda_j^{R:C}}^2 {\tau^{R:C}}^2\right), \quad \lambda_j^{R:C} \sim C^+(0, 1), \quad \tau^{R:C} \sim C^+(0, 1), \tag{2.8}$$

$$\Sigma \sim IW(V_0, \nu_0). \tag{2.9}$$
We implement a Gibbs sampler for estimating the model parameters. Note that the C⁺(0, 1) distribution can be written as a scale mixture of a gamma and an inverse gamma distribution. Thus, $\lambda_j^{C:R} \sim C^+(0,1)$ can be written as ${\lambda_j^{C:R}}^2 \mid a_j^{C:R} \sim IG(0.5, a_j^{C:R})$ with $a_j^{C:R} \sim \text{Gamma}(0.5, 1)$. Similar representations are used for 𝜏 C∶R, 𝜆ⱼ R∶C, and 𝜏 R∶C.
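This scale-mixture identity is easy to sanity-check by simulation (our sketch, assuming the Gamma is parameterized by shape and rate): draws of 𝜆² obtained as IG(0.5, a) with a ~ Gamma(0.5, 1) should match direct half-Cauchy draws, e.g. the median of 𝜆 should be close to 1, the exact median of C⁺(0, 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

# Mixture representation: a ~ Gamma(0.5, rate=1), lambda^2 | a ~ IG(0.5, a).
# An IG(0.5, a) draw equals a divided by a Gamma(0.5, rate=1) draw.
a = rng.gamma(shape=0.5, scale=1.0, size=n)
lam2 = a / rng.gamma(shape=0.5, scale=1.0, size=n)
lam_mix = np.sqrt(lam2)

# Direct half-Cauchy draws for comparison.
lam_direct = np.abs(rng.standard_cauchy(n))

# The half-Cauchy C+(0, 1) has median exactly 1.
print(np.median(lam_mix), np.median(lam_direct))  # both close to 1.0
```

The mixture form is what makes the Gibbs sampler possible: each full conditional becomes a standard gamma or inverse-gamma update.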
A graphical representation of our proposed model is given in Figure S.1 in the web-appendix. Note that the dependence between the center and the radius of the response variable comes through the bivariate dependence structure of the corresponding random errors. Hence, conditional on Σ, Y C and Y R are independently distributed. Based on the proposed prior structure, we compute the joint posterior distribution. The explicit forms of the full conditional distributions are given in the web-appendix; these distributions are of known forms. Hence, we implement a Markov chain Monte Carlo (MCMC) algorithm, specifically a Gibbs sampler, to estimate the model parameters. Sample means (based on MCMC iterations) are used to estimate the corresponding model parameters. All our computations are performed in R JAGS.

3 Simulation study

We study the performance of our proposed joint model under various settings, for example, for different sample sizes (n), for different numbers of predictors (p), and, more importantly, for different correlations (𝜌) between the center and the radius of the response. We first compare the performance of our model with the Center and Range Method (CRM), where we assume two independent normal distributions for the two random errors (in Eqs. 2.1 and 2.2). Hence, we place two separate Inverse-Gamma priors on the variance parameters of the normal distributions as follows:

$$\begin{bmatrix} \epsilon_i^C \\ \epsilon_i^R \end{bmatrix} \overset{iid}{\sim} N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \; \Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix} \right), \quad \sigma_1^2 \sim IG(k_1, k_2), \quad \sigma_2^2 \sim IG(k_3, k_4), \tag{3.1}$$

where the density of IG(𝛼, 𝛽) for 𝜎² is given by

$$\pi\left(\sigma^2 \mid \alpha, \beta\right) \propto (\sigma^2)^{-(\alpha+1)} \exp\left\{-\frac{\beta}{\sigma^2}\right\}.$$

3.1 Simulation Study 1

First, we simulate the centers and the radii of the predictors. The centers are simulated from a Uniform(−1, 1) distribution, and the radii are simulated from a Uniform(0, 1) distribution. Next, we simulate the regression coefficients. In this regard, we consider two types of coefficients, namely the strong and the weak coefficients. The strong coefficients are chosen randomly from Uniform(2, 5), whereas the weak coefficients are chosen from Uniform(−0.5, 0.5).


For the Type 1 coefficients, we draw each coefficient randomly from the strong and weak sets. The corresponding sparsity parameters are simulated from a Bernoulli distribution with success probability 0.8, so we obtain a dense set of regression coefficients chosen from the weak and strong sets. Next, for the Type 2 coefficients, the coefficients are simulated from the strong set only, and the corresponding sparsity parameters are simulated from a Bernoulli distribution with success probability 0.2.
Once we simulate X C, X R, 𝛽 C∶C, 𝛽 R∶R, 𝛽 C∶R and 𝛽 R∶C, we generate the random errors of the center, 𝛜 C, and of the range, 𝛜 R, from a bivariate normal distribution with variances 1 and 0.1, respectively, and correlation 𝜌. Hence, we obtain Y C and Y R.
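Under the reading above (error variances 1 and 0.1, correlation 𝜌), one replicate of Simulation Study 1 can be sketched as follows; the variable names and the even strong/weak split for Type 1 coefficients are our assumptions, while the distributions follow the text:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, rho = 100, 5, 0.5

# Predictor centers ~ U(-1, 1), radii ~ U(0, 1).
Xc = rng.uniform(-1, 1, size=(n, p))
Xr = rng.uniform(0, 1, size=(n, p))

def coeffs(strong_only, p_incl):
    """Strong coefficients ~ U(2, 5), weak ~ U(-0.5, 0.5);
    sparsity (inclusion) indicators ~ Bernoulli(p_incl)."""
    strong = rng.uniform(2, 5, p)
    weak = rng.uniform(-0.5, 0.5, p)
    pick_strong = strong_only | (rng.random(p) < 0.5)  # assumed 50/50 split
    base = np.where(pick_strong, strong, weak)
    return base * rng.binomial(1, p_incl, p)

b_cc = coeffs(strong_only=False, p_incl=0.8)  # Type 1: dense
b_rr = coeffs(strong_only=False, p_incl=0.8)
b_cr = coeffs(strong_only=True, p_incl=0.2)   # Type 2: sparse, strong
b_rc = coeffs(strong_only=True, p_incl=0.2)

# Bivariate errors: variances 1 and 0.1, correlation rho.
cov = np.array([[1.0, rho * np.sqrt(0.1)], [rho * np.sqrt(0.1), 0.1]])
eps = rng.multivariate_normal([0.0, 0.0], cov, size=n)

Yc = Xc @ b_cc + Xr @ b_cr + eps[:, 0]        # Eq. (2.1)
logYr = Xc @ b_rc + Xr @ b_rr + eps[:, 1]     # Eq. (2.2)
Yr = np.exp(logYr)                            # radii are positive by construction
```

Repeating this B times with fresh error draws gives the replicated datasets described next.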
In our simulation study, we consider 𝜌 = 0.1, 0.5, and 0.9, to explore weak,
moderate and strong dependence, respectively, between the center and the radius
(of the response). For a given set of predictor values and the regression coef-
ficients we generate random errors B times, consequently B replicated datasets
are obtained. We fit the models for each of the generated datasets, and report the
average estimate or posterior inclusion probability over all the B replications. To
understand the nature of the model fitting, we also vary n and p. The table below
summarizes the different settings for our simulation studies.

Parameters Values
n {50, 100, 200, 500}
p {5, 10, 20}
𝜌 {0.1, 0.5, 0.9}
B 1000

Regression coefficients are estimated by a Markov chain Monte Carlo (MCMC) algorithm. We run 50,000 MCMC iterations, discard the first 10,000 as burn-in, save every 10-th draw, and consider 5 independent chains. Convergence of the chains is assessed by computing scale reduction factors.
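The scale reduction factor referred to here is the Gelman–Rubin statistic; a minimal sketch of computing it from saved chains (our illustration, not the authors' code):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one scalar parameter.
    `chains` has shape (m, n): m chains with n saved draws each."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(4)
# Five well-mixed chains targeting the same posterior: R-hat near 1.
good = rng.normal(0.0, 1.0, size=(5, 4000))
print(gelman_rubin(good))   # close to 1.0

# Chains stuck at different locations: R-hat well above the usual 1.1 cutoff.
bad = good + np.arange(5)[:, None]
print(gelman_rubin(bad))
```

Values below roughly 1.1 are conventionally taken as evidence of convergence, matching the criterion used in Sect. 4.1.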
We consider three different measures for comparing the effectiveness of the proposed Bayesian joint model with the traditional CRM discussed above. Our first measure is the accuracy in prediction, $A_\beta$. For each predictor, it computes the proportion of times the true value is contained in the 95% Credible Interval (CI),

$$A_\beta = \frac{1}{n}\sum_{i=1}^{n} I\!\left(\beta_{i,0} \in \left[L(\hat\beta_i), U(\hat\beta_i)\right]\right),$$

where $L(\hat\beta_i)$ and $U(\hat\beta_i)$ are, respectively, the lower and upper limits of the credible interval corresponding to the true coefficient $\beta_{i,0}$.
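Given posterior draws, the measure above reduces to checking whether each true coefficient falls inside its equal-tailed 95% interval; a small sketch under our own variable names:

```python
import numpy as np

def coverage(draws, beta_true):
    """draws: (n_samples, n_coef) posterior draws; beta_true: (n_coef,).
    Returns the fraction of coefficients whose true value lies inside
    the equal-tailed 95% credible interval."""
    lo = np.percentile(draws, 2.5, axis=0)
    hi = np.percentile(draws, 97.5, axis=0)
    return np.mean((beta_true >= lo) & (beta_true <= hi))

rng = np.random.default_rng(5)
beta_true = np.array([2.5, 0.0, -3.0, 4.0])
# Well-calibrated posterior draws centered at the truth:
draws = beta_true + rng.normal(0.0, 0.3, size=(5000, 4))
print(coverage(draws, beta_true))  # 1.0 here: the truth sits at each interval's center
```

Averaging this quantity over the B replications yields the accuracy surfaces plotted in Fig. 1.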
In Fig. 1, we plot the accuracy values of the 𝛽 C∶C coefficients for different
𝜌 , across different values of n and p. As the plot suggests, for low correlation
( 𝜌 = 0.1) both the models are pretty good as they exhibit more than 94% accuracy.
However, as the dimension increases the proposed joint model outperforms CRM,
even at this low correlation. As 𝜌 increases to 0.5, the proposed joint model shows
its efficacy in terms of the accuracy.

Fig. 1  The comparison between the CRM and the proposed joint model in terms of the accuracy in estimating 𝜷 C∶C. The lighter surface corresponds to the proposed joint model, whereas the darker surface corresponds to CRM.

While the accuracy of the CRM reduces to 80−85%, the proposed joint model still maintains more than 95% accuracy. Upon
further increase in 𝜌 to 0.9, the accuracy of CRM is reduced to 70−75%, whereas
the proposed joint model shows almost perfect accuracy. Similar phenomena
are observed for 𝛽 R∶R , 𝛽 C∶R , and 𝛽 R∶C (given in Figures S.2, S.3, and S.4, respec-
tively, in the web-appendix).
Our second measure is the width of the 95% credible intervals. Lower width indi-
cates lower uncertainty associated with the estimation. In Table 1, we tabulate the aver-
age width of the corresponding credible intervals for different values of 𝜌. The widths
corresponding to a particular type of coefficient are grouped together, and their medians are reported here as a robust estimate. As this table suggests, the proposed joint


Table 1  Width of the 95% CIs for the model parameters in the simulation Study 1 for different correla-
tions (𝜌 = 0.1, 0.5, 0.9)
Dim. (p) Sample size (n) Length of 95% credible interval (𝜌 = 0.1)

𝛽C∶C 𝛽R∶R 𝛽C∶R 𝛽R∶C
Joint CRM Joint CRM Joint CRM Joint CRM
5 50 1.0381 1.0501 0.3904 0.3960 1.3953 1.4080 0.3272 0.3300
100 0.7006 0.7040 0.4001 0.4037 1.1858 1.1925 0.2046 0.2051
200 0.4856 0.4883 0.2605 0.2621 0.7814 0.7844 0.1304 0.1306
500 0.3057 0.3067 0.1563 0.1565 0.4968 0.4978 0.0950 0.0954
10 50 1.0250 1.0493 0.6535 0.6685 1.4923 1.5154 0.3031 0.3076
100 0.6739 0.6830 0.4262 0.4312 1.1666 1.1770 0.1825 0.1838
200 0.4784 0.4809 0.3009 0.3025 0.5833 0.5874 0.1552 0.1560
500 0.3047 0.3058 0.1810 0.1816 0.4988 0.5001 0.0781 0.0782
20 50 1.4332 1.5205 0.9114 0.9318 2.5481 2.6633 0.3340 0.3354
100 0.7845 0.7924 0.5207 0.5297 0.9102 0.9098 0.2570 0.2580
200 0.5188 0.5208 0.3197 0.3212 0.8885 0.8918 0.1382 0.1390
500 0.5017 0.5139 0.2858 0.2769 0.8136 0.8029 0.0847 0.0818
Length of 95% credible interval ( 𝜌 = 0.5)
5 50 0.9846 1.1535 0.5583 0.9476 0.9736 1.2645 0.6385 0.8375
100 0.9715 1.1521 0.5185 0.9356 0.9105 1.2517 0.6118 0.8319
200 0.9628 1.1484 0.5028 0.9245 0.9018 1.2386 0.6062 0.8287
500 0.9436 1.0937 0.4948 0.9194 0.8938 1.2049 0.5985 0.8019
10 50 0.9527 1.1427 0.8372 1.0837 0.7838 0.9284 0.9184 1.0837
100 0.9516 1.1273 0.8364 1.0363 0.7736 0.9184 0.9125 1.0574
200 0.9417 1.1027 0.8275 1.0247 0.7627 0.9018 0.9019 1.0193
500 0.9184 1.0774 0.7938 1.0104 0.7473 0.8839 0.8873 0.9937
20 50 0.9726 1.1326 0.8937 1.0836 0.8463 0.9973 0.8636 0.9736
100 0.9627 1.1028 0.8837 1.0573 0.8363 0.9827 0.8526 0.9627
200 0.9587 1.0638 0.8725 1.0274 0.8275 0.9685 0.8473 0.9527
500 0.9463 1.0128 0.8527 0.9837 0.8184 0.9612 0.8374 0.9435
Length of 95% credible interval ( 𝜌 = 0.9)
5 50 0.9476 1.0385 0.4939 0.8375 0.8476 1.0574 0.4837 0.7395
100 0.9275 1.0275 0.4760 0.8323 0.8249 1.0383 0.3628 0.7386
200 0.9138 1.0184 0.4638 0.8274 0.8184 1.0249 0.3607 0.7303
500 0.9029 1.0136 0.4537 0.8194 0.8095 1.0085 0.3527 0.7288
10 50 0.9373 1.0463 0.8476 1.0194 0.7024 1.2019 0.8736 1.0564
100 0.9284 1.0387 0.8403 1.0138 0.7018 1.1838 0.8689 1.0484
200 0.9194 1.0194 0.8294 0.9948 0.7008 1.1593 0.8574 1.0383
500 0.9018 0.9928 0.8029 0.9564 0.6949 1.1394 0.8494 1.0194
20 50 1.0637 1.3426 1.0372 1.1536 0.9738 1.0848 0.8364 1.0375
100 1.0194 1.3284 1.0038 1.0948 0.9275 1.0375 0.8284 0.9885
200 0.9985 1.1028 0.9738 0.9837 0.8647 0.9639 0.8093 0.9738
500 0.9638 1.0375 0.9194 0.9739 0.8362 0.9593 0.7829 0.9418

Bold values correspond to the proposed joint model, and illustrate that the proposed model performs bet-
ter than its competitors


model almost consistently maintains a narrower CI than CRM, even in the case of lower correlation. Although the difference in the width of the CI is very small for 𝜌 = 0.1, as 𝜌 takes the values 0.5 and 0.9, the gap widens significantly, once again illustrating the usefulness of the proposed joint model over the CRM.
The final measure is a scale-independent measure of the closeness of the true coefficients and the predicted coefficients, namely the symmetric mean absolute percentage error (SMAPE), defined as follows:

$$\text{SMAPE} = \frac{1}{p}\sum_{i=1}^{p} \frac{|\hat\beta_i - \beta_{0,i}|}{|\hat\beta_i| + |\beta_{0,i}|},$$

where $\hat\beta_i$ are the predicted regression coefficients, and $\beta_{0,i}$ are the corresponding true values.

Fig. 2  The comparison between the CRM and the proposed joint model in terms of the SMAPE in estimating 𝜷 C∶C. The lighter surface corresponds to the proposed joint model, whereas the darker one corresponds to CRM.

Similar to the accuracy measure, here we provide surface plots for different values of 𝜌 across different n and p. In Fig. 2, we provide the same for the
𝜷 C∶C coefficients. As the plots suggest, for 𝜌 = 0.1, two surfaces (for two models) are
almost inseparable. However, as 𝜌 increases, the SMAPE surface of the proposed
joint model remains below that of the CRM. This reflects the better prediction accuracy of the proposed joint model. The plots for the rest of the coefficients are provided in the web-appendix (in Figures S.5, S.6, and S.7).
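A direct implementation of the SMAPE formula (our sketch; note this variant is bounded by 1, and we treat a 0/0 term as 0):

```python
import numpy as np

def smape(beta_hat, beta_true):
    """Symmetric mean absolute percentage error between estimated and
    true coefficients: mean of |b_hat - b0| / (|b_hat| + |b0|), in [0, 1]."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    denom = np.abs(beta_hat) + np.abs(beta_true)
    safe = np.where(denom > 0, denom, 1.0)          # avoid division by zero
    terms = np.where(denom > 0, np.abs(beta_hat - beta_true) / safe, 0.0)
    return terms.mean()

print(smape([2.0, -0.5], [2.0, -0.5]))  # 0.0: perfect estimates
print(smape([2.0, 0.0], [0.0, 2.0]))    # 1.0: maximal disagreement
```

Being scale-free, this measure lets the strong coefficients (of magnitude 2–5) and the weak ones be compared on the same surface plot.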
Next, we provide some empirical evidence of model consistency. In the simula-
tion setting described previously, we fix 𝜌 = 0.9, p = 5, and vary n to obtain the pos-
terior distribution of all the 20 regression coefficients. We compute the square of the
bias for each of the coefficients across different values of n, and also report the cor-
responding width of the 95% credible region. Figure 3 is the graphical representation
of the output. It is noted that the bias indeed reduces to 0, as n increases. Also, the
width of the 95% credible region decreases with n. In summary, this is empirical evidence of the consistency of our proposed joint model.

3.2 Simulation Study 2

In this study, we compare the performance of the proposed Bayesian joint model
with the Bivariate Bayesian Method (BBM) proposed in Xu and Qin (2022). Note that Xu and Qin (2022) illustrated through extensive simulation studies that BBM performs much better than the Center Range Additive Model (CRAM), the Symbolic Covariance Method (SCM), and the Monte Carlo Method (MCM). Therefore, we
We simulate the datasets using the same specifications of the model parameters
as in Simulation 1. We consider n=50, 100, and 200; p=5,10,20; and 𝜌=0.5. We
consider B=1000 replications for our computations.
Following Xu and Qin (2022), we consider three evaluation measurements as
follows:

1. The lower and upper root mean squared errors, RMSE_L and RMSE_U. Define Y_l and Y_u as the lower and upper limits of the interval for the response, and Ŷ_l and Ŷ_u as the respective predicted values. Then
   $$\text{RMSE}_L = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{il} - \hat Y_{il}\right)^2}, \qquad \text{RMSE}_U = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{iu} - \hat Y_{iu}\right)^2}.$$
2. The root mean squared error based on the adaptive Hausdorff distance is defined as
   $$\text{RMSE}_H = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(|Y_{ic} - \hat Y_{ic}| + |Y_{ir} - \hat Y_{ir}|\right)^2},$$
   where c and r, respectively, denote the center and the radius.
3. Finally, the rate of different intervals is defined as
   $$RI = \frac{1}{n}\sum_{i=1}^{n} \frac{w(Y_i \cap \hat Y_i)}{w(Y_i \cup \hat Y_i)},$$
   where w(·) denotes the width of the union or intersection of intervals.
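The three measurements above can be sketched directly (our implementation; intervals are passed as vectors of lower and upper endpoints):

```python
import numpy as np

def interval_metrics(yl, yu, yl_hat, yu_hat):
    """RMSE_L, RMSE_U, Hausdorff-style RMSE_H, and rate of intervals RI
    for observed intervals [yl, yu] and predicted intervals [yl_hat, yu_hat]."""
    yl, yu, yl_hat, yu_hat = map(np.asarray, (yl, yu, yl_hat, yu_hat))
    rmse_l = np.sqrt(np.mean((yl - yl_hat) ** 2))
    rmse_u = np.sqrt(np.mean((yu - yu_hat) ** 2))
    # Centers and radii of the observed and predicted intervals.
    c, r = (yl + yu) / 2, (yu - yl) / 2
    c_hat, r_hat = (yl_hat + yu_hat) / 2, (yu_hat - yl_hat) / 2
    rmse_h = np.sqrt(np.mean((np.abs(c - c_hat) + np.abs(r - r_hat)) ** 2))
    # Widths of the intersection and union of each interval pair.
    inter = np.maximum(0.0, np.minimum(yu, yu_hat) - np.maximum(yl, yl_hat))
    union = np.maximum(yu, yu_hat) - np.minimum(yl, yl_hat)
    ri = np.mean(inter / union)
    return rmse_l, rmse_u, rmse_h, ri

# Perfectly predicted intervals: zero errors and RI = 1.
out = interval_metrics([0, 1], [2, 3], [0, 1], [2, 3])
print(out)  # (0.0, 0.0, 0.0, 1.0)
```

Smaller RMSE values and RI closer to 1 indicate better interval prediction, which is how Table 2 should be read.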


Fig. 3  Squared bias and the width of the CI across different values of sample size (n) in the Simulation
Study. The 20 categories in the legend represent 20 regression coefficients (five for each of 𝜷 C∶C , 𝜷 R∶R,
𝜷 C∶R, 𝜷 R∶C)


Table 2  Comparison of the proposed joint model with bivariate Bayesian method (BBM) in simulation
Study 2
Dim. (p) Sample size (n) Evaluation measurements

RMSEL RMSEU RMSEH RI
Joint BBM Joint BBM Joint BBM Joint BBM
5 50 0.976 0.982 0.874 0.881 1.012 1.051 0.971 0.971
100 0.985 0.987 0.984 0.991 1.142 1.192 0.975 0.975
200 1.034 1.038 1.265 1.292 0.962 0.985 0.981 0.981
10 50 0.976 1.172 0.874 1.062 1.012 1.651 0.975 0.975
100 0.985 1.071 0.984 1.204 1.142 1.491 0.984 0.984
200 1.034 1.283 1.296 1.562 1.062 1.348 0.981 0.981
20 50 0.998 1.402 1.162 1.581 1.271 1.593 0.975 0.975
100 0.994 1.269 1.046 1.471 1.239 1.761 0.984 0.983
200 1.241 1.538 1.342 1.664 1.219 1.543 0.978 0.978

In Table 2 we summarize the results based on the average values from 1000 rep-
lications. We note that for p = 5, the results for BBM and the proposed model are comparable. However, for p = 10 and 20, the proposed model results in smaller
root mean squared errors than BBM. This illustrates the better performance of
our proposed joint model.

4 Real data analysis

We analyze a dataset obtained from a clinical study conducted by Tata Translational Cancer Research Center (TTCRC), Kolkata. The goal of this study is to assess the efficacy of two standard drugs on the lymphocyte count, an important biomarker for Acute Lymphocytic Leukemia (ALL).
The study was conducted by TTCRC during 2014–2019; for the first two years the patients were treated with two standard drugs, and were then followed for the next three years. The study included 236 children diagnosed as ALL patients. The patients were in the age interval [1, 17.5], with a median age of 4.7 years at the time of presentation. Two standard drugs, 6-mercaptopurine (6MP) and methotrexate (MTx), were used for the treatment, and the doses were adjusted based on the outcomes from time to time. The patients were tested for the lymphocyte count, the absolute neutrophil count (ANC), and the platelet count bi-weekly, or once a month. Once the treatment phase was over, the biomarkers, age, and the drug doses were summarized for each patient as intervals.
Since ALL is caused by the uncontrolled growth of lymphocytes, we consider the lymphocyte count as our response. It is important to study the effects of (i) the two biomarkers (i.e. ANC and platelet count), (ii) the doses of the two drugs


(6MP and MTx), and (iii) age of the patient on the response (lymphocyte count),
and therefore, we develop the proposed Bayesian joint model for analysing this
dataset.

4.1 Computational details

We note that here we have one response and five predictors. Biomarkers were
obtained as log-transformed, and we develop models given in equations (2.1) and
(2.2) for our analysis. For the Type 1 coefficients, we consider spike and slab priors
with a Beta (1, 2) prior for pj , and for the Type 2 coefficients we use horse-shoe
prior as discussed in Sect. 2.2.1. For the matrix Σ we consider Inverse Wishart prior
with V0 = I2 (identity matrix of order=2), and 𝜈0 = 2.
We develop a Gibbs sampler for estimating the model parameters. We run
five independent chains with different starting values for 50,000 iterations each,
and discard the first 10,000 iterations as burn-in to remove the effect of the starting
values. We also thin the chains by keeping every 10th draw, which reduces the auto-
correlation between successive iterations. Model parameters are
estimated by their respective posterior sample means. The marginal posterior inclusion
probability for each predictor is computed as the sample proportion of MCMC itera-
tions in which the predictor is included. The potential scale reduction factors are mostly
smaller than 1.1, indicating good convergence (Brooks and Gelman 1998).
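A minimal sketch of this post-processing (burn-in removal, thinning, the Brooks–Gelman potential scale reduction factor, and the inclusion-proportion estimate of the MIP) is given below. The synthetic chains and all names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for m chains of length n
    (post burn-in and thinning); values near 1 indicate convergence."""
    m, n = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * means.var(ddof=1)               # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(7)
# Five synthetic chains of 50,000 draws from a common stationary target
raw = rng.normal(0.0, 1.0, size=(5, 50_000))
kept = raw[:, 10_000::10]                   # drop burn-in, keep every 10th draw
rhat = gelman_rubin(kept)                   # close to 1 for well-mixed chains

# MIP: proportion of retained iterations with inclusion indicator = 1
indicators = rng.binomial(1, 0.67, size=kept.shape[1])
mip = indicators.mean()
```

For genuinely mixed chains from a common target, as here, the factor sits well below the 1.1 threshold quoted in the text; stuck or divergent chains inflate the between-chain term B and push it upwards.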

4.2 Findings

The estimated (posterior) marginal inclusion probabilities for the Type 1 coeffi-
cients are shown in Table 3. Note that for the C:C model (the response center
regressed on the predictor centers), the drug 6MP has the highest marginal inclusion
probability, and the marginal inclusion probabilities of MTx and ANC are also on
the higher side. Age and platelet count are moderately important in the C:C model.
For the R:R model (the response radius regressed on the predictor radii), however,
ANC and platelet count are not significant (MIP < 0.5), 6MP still has the highest
inclusion probability, and Age and MTx are moderately important (MIP > 0.5).
This reaffirms the effectiveness of the drug 6MP in the treatment of ALL.

Table 3  Marginal inclusion probabilities (MIP) for Type 1 coefficients in our data analysis

Model  Predictor        (Posterior) MIP
C:C    ANC              0.67
       Platelet count   0.54
       6MP              0.92*
       MTx              0.74
       Age              0.53
R:R    ANC              0.37
       Platelet count   0.41
       6MP              0.76*
       MTx              0.63
       Age              0.51

C:C and R:R represent, respectively, the center-to-center and the radius-to-radius model
In Figs. 4 and 5, we show the trace plots and the posterior density plots for
the Type 2 coefficients. We see that only the center of ANC has a significant
positive effect on the radius of the lymphocyte count. The effect of the radius
(center) of the platelet count on the center (radius) of the lymphocyte count is not
shrunk to zero, but its posterior is centered at zero (hence no overall effect). All
other Type 2 coefficients are shrunk towards zero by the horseshoe prior.

4.3 Model comparison

We compare models for interval data on the basis of their predictive power.
For interval data, if the radius is over-predicted then the overlap between the
true and the predicted intervals increases (other effects unchanged), whereas if
the radius is under-predicted then this overlap decreases. Hence, we introduce
two measures of overlap between the actual and the predicted intervals. One of
them computes the proportion of the true interval that is predicted correctly: if
the radius is over-predicted while the center is predicted more or less accurately,
this measure is high, whereas if the radius is under-predicted, its value is lower.
For the other measure, the consequences of over- and under-prediction of the
radius are interchanged.
For our analysis, we consider the (C, R) representation of the interval data. Let
$(C_i, R_i)$ be, respectively, the center and the radius of the $i$-th observation,
with predicted values $\hat{C}_i$ and $\hat{R}_i$. The minimum point and maximum
point representation of the same interval is $\alpha_i = C_i - R_i$ (min) and
$\lambda_i = C_i + R_i$ (max), with predicted values
$\hat{\alpha}_i = \hat{C}_i - \hat{R}_i$ and $\hat{\lambda}_i = \hat{C}_i + \hat{R}_i$.
The predictive power of a model can then be compared by the following three
measures:

Relative Overlap 1:
$$RO_1 = \frac{1}{n}\sum_{i=1}^{n} \frac{|(\alpha_i, \lambda_i) \cap (\hat{\alpha}_i, \hat{\lambda}_i)|}{|(\hat{\alpha}_i, \hat{\lambda}_i)|}$$

Relative Overlap 2:
$$RO_2 = \frac{1}{n}\sum_{i=1}^{n} \frac{|(\alpha_i, \lambda_i) \cap (\hat{\alpha}_i, \hat{\lambda}_i)|}{|(\alpha_i, \lambda_i)|}$$

Mean Magnitude of the Relative Error:
$$MMRE = \frac{1}{2n}\sum_{i=1}^{n} \left\{ \left|\frac{\alpha_i - \hat{\alpha}_i}{\alpha_i}\right| + \left|\frac{\lambda_i - \hat{\lambda}_i}{\lambda_i}\right| \right\}$$
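These three measures are straightforward to compute from the (C, R) representation. The sketch below (our own helper, with hypothetical names) assumes all intervals are non-degenerate and that the endpoints are nonzero so the MMRE is well defined.

```python
import numpy as np

def interval_measures(C, R, C_hat, R_hat):
    """RO1, RO2 and MMRE for true intervals (C - R, C + R) and
    predicted intervals (C_hat - R_hat, C_hat + R_hat)."""
    a, l = C - R, C + R                      # true min and max points
    a_h, l_h = C_hat - R_hat, C_hat + R_hat  # predicted min and max points
    # Length of the intersection of the two intervals (0 if disjoint)
    overlap = np.maximum(0.0, np.minimum(l, l_h) - np.maximum(a, a_h))
    ro1 = np.mean(overlap / (l_h - a_h))     # relative to predicted length
    ro2 = np.mean(overlap / (l - a))         # relative to true length
    mmre = 0.5 * np.mean(np.abs((a - a_h) / a) + np.abs((l - l_h) / l))
    return ro1, ro2, mmre

# Example: true interval (8, 12); the prediction over-covers it with (6, 14)
ro1, ro2, mmre = interval_measures(np.array([10.0]), np.array([2.0]),
                                   np.array([10.0]), np.array([4.0]))
# ro1 = 0.5 (half of the predicted interval is correct), ro2 = 1.0
```

The example shows the asymmetry discussed above: over-predicting the radius drives RO2 up to 1 while RO1 drops, since only half of the predicted interval is actually correct.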


Table 4  Model comparison results for the ALL data analysis

Method                 RO1     RO2     MMRE (Mean)   MMRE (Median)
Proposed joint model   0.830   0.727   0.674         0.348
CRM                    0.528   0.451   0.836         0.517
BBM                    0.827   0.729   0.713         0.429

Bold values correspond to the proposed joint model, and illustrate that the proposed
model performs better than its competitors

Fig. 4  Trace plots and the posterior density plots for the Type 2 coefficients (for C : R model) in the ALL
data analysis

The first measure reflects the proportion of the predicted interval that is cor-
rect, whereas the second measure gives the proportion of the actual interval that
is predicted correctly. The third measure considers the relative errors in predicting
the minimum and the maximum points of the interval separately, and aggregates them.


Fig. 5  Trace plots and the posterior density plots for the Type 2 coefficients (for R : C model) in the ALL
data analysis

Based on these three measures, we compare the proposed joint model with the
Center and Range Method (CRM) and the Bivariate Bayesian Method (BBM).
Table 4 summarizes the results. We note that the proposed Bayesian joint
model provides a better overlap and a lower relative error than the CRM. In addi-
tion, the BBM achieves a similar overlap to the proposed model, but its relative
error is higher. This justifies the usefulness of our proposed joint model for
analysing the ALL dataset.

5 Conclusion

In this article, we have developed a Bayesian joint model for analysing interval-
valued datasets, and have also proposed a Bayesian variable selection approach. We
have assessed the accuracy and precision of our estimation method through an
extensive simulation study, and the predictive power of the proposed model has
also been assessed through the simulations. We have shown that the joint model
performs better than the Center and Range Method (CRM) in terms of both
estimation and prediction. Our model also performs better than the Bivariate
Bayesian Method (BBM) proposed recently by Xu and Qin (2022), as illustrated
on a real dataset. We also note that our proposed approach can be easily extended
to a multivariate setting where multiple (interval) responses are available with
multiple (interval) predictors.
However, several issues related to modeling interval data need further investiga-
tion. First, there is no standard approach in the existing literature for modeling
nested interval data (for example, longitudinal or spatial interval data). Second,
the problem is even more severe for categorical interval data; to our knowledge,
there is no work on modeling categorical interval data. Finally, standard quantile
regression models might not work for modeling the quantiles of an interval-valued
response. These are some of our future research avenues.
Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1007/s10260-023-00727-9.

Funding The dataset used in this work for illustration was collected in a clinical trial partially funded by
the National Cancer Grid (2016/001; 2016-) and the Indian Council of Medical Research (79/159/2015/
NCD-III; 2017-19).

Data Availability The relevant dataset is available with this manuscript.

Declarations
Conflict of interest The authors declare no conflict of interest/competing interests.

Code availability Relevant R code is available with this manuscript.

References
Ahn J, Peng M, Park C, Jeon Y (2012) A resampling approach for interval-valued data regression. Stat
Anal Data Min ASA Data Sci J 5(4):336–348
Bhadra A, Datta J, Polson NG, Willard B (2017) The horseshoe+ estimator of ultra-sparse signals. Bayes-
ian Anal 12(4):1105–1131
Brooks SP, Gelman A (1998) General methods for monitoring convergence of iterative simulations. J
Comput Graph Stat 7(4):434–455
Carvalho CM, Polson NG, Scott JG (2010) The horseshoe estimator for sparse signals. Biometrika
97(2):465–480
Das K, Ghosh P, Daniels MJ (2021) Modeling multiple time-varying related groups: a dynamic hierar-
chical Bayesian approach with an application to the health and retirement study. J Am Stat Assoc
116(534):558–568
George EI, McCulloch RE (1993) Variable selection via Gibbs sampling. J Am Stat Assoc
88(423):881–889
George EI, McCulloch RE (1997) Approaches for Bayesian variable selection. Stat Sin 7(2):339–373
Giordani P (2015) Lasso-constrained regression analysis for interval-valued data. Adv Data Anal Classif
9(1):5–19
Kedia P, Kundu D, Das K (2023) A Bayesian variable selection approach to longitudinal quantile regres-
sion. Stat Methods Appl 32:149–168

13
S. Sen et al.

Kuo L, Mallick B (1998) Variable selection for regression models. Sankhyā Indian J Stat Ser B
60(1):65–81
Lim C (2016) Interval-valued data regression using nonparametric additive models. J Korean Stat Soc
45(3):358–370
Lima Neto EDA, Carvalho FDAD (2017) Nonlinear regression applied to interval-valued data. Pattern
Anal Appl 20(3):809–824
Lima Neto EDA, Carvalho FDAD (2008) Centre and range method for fitting a linear regression model to
symbolic interval data. Comput Stat Data Anal 52(3):1500–1515
Lima Neto EDA, Cordeiro GM, Carvalho FDAD (2011) Bivariate symbolic regression models for inter-
val-valued variables. J Stat Comput Simul 81(11):1727–1744
Park T, Casella G (2008) The Bayesian Lasso. J Am Stat Assoc 103(482):681–686
Polson NG, Scott JG (2010) Shrink globally, act locally: sparse Bayesian regularization and prediction.
Bayesian Stat 9:501–538
Smith M, Kohn R (1996) Nonparametric regression using Bayesian variable selection. J Econ
75(2):317–343
Sun Y, Zhang X, Wan AT, Wang S (2021) Model averaging for interval-valued data. Eur J Oper Res.
https://doi.org/10.1016/j.ejor.2021.11.015
Xu W (2010) Symbolic data analysis: interval-valued data regression. Doctoral dissertation, University
of Georgia
Xu M, Qin Z (2022) A bivariate Bayesian method for interval-valued regression models. Knowl-Based
Syst 235:107396

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and
applicable law.
