
Pattern Recognition 97 (2020) 107021


Decomposed slice sampling for factorized distributions


Jiachun Wang, Shiliang Sun∗
Department of Computer Science and Technology, East China Normal University, 3663 Zhongshan Road, Shanghai 200241, PR China

Article history:
Received 27 September 2018
Revised 19 July 2019
Accepted 25 August 2019
Available online 26 August 2019

Keywords:
Slice sampling
Markov chain Monte Carlo
Decomposed slice sampling
Hamiltonian Monte Carlo

Abstract

Slice sampling provides an automatic adjustment to match the characteristics of the distribution. Although this method has achieved great success in many situations, it becomes limited when the distribution is complex. Inspired by Higdon [1], in this paper, we present a decomposed sampling framework based on slice sampling called decomposed slice sampling (DSS). We suppose that the target distribution can be divided into two multipliers so that the information in each term can be used, respectively. The first multiplier is used in the first step of DSS to obtain horizontal slices and the last term is used in the second step. Simulations on four simple distributions indicate the effectiveness of our method. Compared with slice sampling and Hamiltonian Monte Carlo on Gaussian distributions in different dimensions and ten real-world datasets, the proposed method achieves better performance.

© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

Markov chain Monte Carlo (MCMC) [2,3] is a very general and powerful framework that can be used to sample from a large class of distributions encountered in pattern recognition and machine learning. MCMC aims to efficiently draw samples from an unnormalized density function by utilizing the properties of the Markov chain. It serves as a fundamental approach for probabilistic inference in many computational statistical problems.

Markov chains in MCMC methods with auxiliary variables may mix faster and be easier to sample than standard single-variable algorithms [1]. The Swendsen-Wang (SW) algorithm [4] is the first formal auxiliary variable method for MCMC [1,5]. It was introduced to overcome some of the drawbacks of the Gibbs sampler [6] for the Ising model and was generalized by Edwards and Sokal [7]. A further development of auxiliary variable methods for MCMC is the introduction of slice sampling. The slice sampling method [8] alternates between uniformly sampling from the vertical region determined by the density at the current point and uniformly sampling from the horizontal "slice" determined by the current vertical position. It is an auxiliary-variable method where the horizontal slice is the auxiliary variable. Slice sampling can automatically adjust to match the characteristics of the distribution, which makes it easy to tune. Thus in some cases, slice sampling has better performance than Gibbs sampling and Metropolis methods [8]. Because of its easy implementation and good mixing properties, the slice sampler can be applied in a variety of scenarios, for example, in variable selection [9,10], spatial models [11,12] and biological models [13-15].

Although slice sampling from univariate distributions is straightforward, in complex multivariate distributions, effectively finding an interval of the "slice" is difficult. Therefore, adaptive methods that consider the dependencies between variables were proposed, and other upgraded approaches such as random walk suppression methods, i.e., overrelaxed slice sampling and reflective slice sampling, were also developed [8]. Recently, Murray et al. [16] attempted to perform slice sampling in latent variable models with multivariate Gaussian priors by sampling from a high-dimensional elliptical curve.

Similar in introducing auxiliary variables but different in approach is Hamiltonian Monte Carlo (HMC) [17,18]. HMC casts the probabilistic simulation in the form of a Hamiltonian system. By introducing momentum as the auxiliary variable, HMC exploits the gradient information to generate samples. The combination with the Metropolis algorithm removes any bias associated with the discretization in HMC [18,19]. With careful tuning of parameters, HMC can achieve favorable mixing properties in many cases. However, proper tuning is difficult because HMC is sensitive to parameter changes. As a consequence, adaptive approaches [20,21] were proposed in the interest of automating parameter tuning. Recently, HMC with Riemann manifold information was also proposed [22]. Another attempt towards improving HMC has exploited non-canonical Hamiltonian dynamics to expand the range of the sampling space [23]. As a representative MCMC method, HMC will be taken as a compared method in our experiments.

The previous work mentioned above is dedicated to improving the specific algorithm for solving slice intervals.

∗ Corresponding author. E-mail address: slsun@cs.ecnu.edu.cn (S. Sun).


Fig. 1. Illustration of slice sampling. (a) For a given value x0, a vertical level, u, is drawn uniformly from (0, p̃(x0)), and used to define a "slice", shown by the solid horizontal lines. (b) Because in practice solving the exact endpoints of the slice is intractable, a new point is drawn uniformly from the interval [xmin, xmax] which contains the current point x0.

Unlike the previous work, in this paper, we present a novel sampling framework based on slice sampling called decomposed slice sampling (DSS) to efficiently sample more complex distributions, for the reason that traditional slice sampling becomes less efficient when the target distribution is complex. We suppose that the unnormalized target distribution is a factorized distribution which can be decomposed into two multipliers so that the information in each term can be used, respectively. Note that for most cases in practice, the target distribution can be factorized. For example, the posterior distribution is proportional to a likelihood function multiplied by a prior distribution. The first multiplier is used in the first step of DSS to uniformly generate an auxiliary variable and define the horizontal slice. Then we sample a new point from the second multiplier over the interval of slices given by the first multiplier. The correctness of DSS is verified theoretically and practically. The simulated experiments of DSS compared with slice sampling in different dimensions show that DSS performs better in the multi-modal cases. The results of experiments performed on ten real-world datasets indicate that methods based on the DSS perform better than the slice sampling and HMC in the case of complex distributions. Though the DSS is based on the standard slice sampling method in this paper, other upgraded slice sampling methods which improve the algorithm for solving slice intervals can also be used in the DSS.

The remainder of this paper is organized as follows. Section 2 presents related works. Section 3 introduces the slice sampling method. Section 4 presents the decomposed sampling framework and shows its correctness. Experimental results and analysis are provided in Section 5. Finally, we conclude this paper and discuss future work in Section 6.

2. Related works

Recent works about slice sampling mainly focus on adaptive multivariate slice sampling. A new multivariate slice sampling method that uses multiple auxiliary variables to perform multivariate updating was proposed in [5]. This method extends the bivariate slice sampling to the multivariate one, and generates samples by updating all dimensions at once. Our method is also a multivariate sampling method but with only one auxiliary variable.

In order to generate effective samples with highly correlated parameters, some work [24-26] has been put forward. Thompson and Neal [24] described two slice sampling methods for taking multivariate steps using the "crumb" framework [8]. These methods utilize the gradients at rejected samples to accommodate the local curvature of the log-density surface, which can produce better samples when the parameters are highly correlated. The shrinking rank method is also a variation of slice sampling, which can effectively sample multivariate distributions with highly correlated parameters [25]. In this method, the crumbs are Gaussian random variables centered at the current state. By sampling within a transformed space, the factor slice sampler [26] can generate nearly independent draws from a highly correlated, high-dimensional target distribution. Though the DSS is based on the standard slice sampling in this paper, the above mentioned methods are complementary to DSS (i.e., they can be equipped on the DSS to meet specific purposes).

Other approaches include parallelizing the multivariate slice sampling [27] and relaxing the requirement of evaluating the target distribution up to a constant by bringing together the advantages of pseudo-marginal methods and slice sampling [28]. It is interesting to consider improving the efficiency of the DSS through these methods in future work.

3. Slice sampling method

Slice sampling [8] provides an approach to match the distribution by automatically adjusting the step size. It can be seen as an auxiliary variable method because the sampling process involves the introduction of an auxiliary variable. Suppose we wish to sample from a distribution, p(x), which is proportional to a computable function p̃(x). We can first introduce an auxiliary variable u, sample from the joint distribution p̂(x, u) instead, and then ignore u. This joint distribution is defined as follows:

\hat{p}(x, u) = \begin{cases} \frac{1}{Z_p} & \text{if } 0 < u < \tilde{p}(x), \\ 0 & \text{otherwise}, \end{cases}   (1)

where Z_p = \int \tilde{p}(x)\, dx. The marginal distribution of x is then

\int \hat{p}(x, u)\, du = \int_0^{\tilde{p}(x)} \frac{1}{Z_p}\, du = \frac{\tilde{p}(x)}{Z_p} = p(x).   (2)

In order to sample from p̂(x, u) and then ignore u, we can alternately sample x and u. According to Eq. (1), fixing x, u is uniform from 0 to p̃(x). Thus, given the initial value of x and p̃(x), we sample u uniformly from (0, p̃(x)). Then we fix u and sample x uniformly from the horizontal "slice" defined by {x : p̃(x) > u}. The procedure of slice sampling is illustrated in Fig. 1a. Note that in practice, solving the exact endpoints of the "slice" is intractable. Hence, we turn to find an interval xmin ≤ x ≤ xmax that contains the current x. This interval is hoped to contain as much of the slice as possible, but not be much larger than the slice. The two most commonly used approaches for finding an appropriate slice interval are the "stepping out" procedure and the "doubling" procedure [8]. In this paper, we use the "stepping out" procedure.

Algorithm 1 The "stepping out" method.

Input:
  Function proportional to the density, f(x);
  The current point, xn;
  The vertical level defining the slice, u;
  Estimate of the scale of a slice, w;
  The integer that limits the maximum size of a slice interval to mw, m;
Output: The final interval, (L, R).

 1: Generate a random number, U ~ Uniform(0, 1)
 2: Set the value of the left bound of the initial interval, L = xn - w*U
 3: Set the value of the right bound of the initial interval, R = L + w
 4: Generate another random number, V ~ Uniform(0, 1)
 5: Compute the maximum number of times that the initial interval can be expanded to the left side, J = Floor(m*V)
 6: Compute the maximum number of times that the initial interval can be expanded to the right side, K = (m - 1) - J
 7: while J > 0 and u < f(L) do
 8:   Update the left bound of the interval, L = L - w
 9:   Update the residual number of expansions to the left side, J = J - 1
10: while K > 0 and u < f(R) do
11:   Update the right bound of the interval, R = R + w
12:   Update the residual number of expansions to the right side, K = K - 1
13: return (L, R)

The procedure of slice sampling in one update [8], i.e., from the current state xn to the next state xn+1, is summarized as follows:

Step 1. Generate a value u uniformly from (0, p̃(xn)), which defines a horizontal "slice": S = {x : u < p̃(x)}.
Step 2. Randomly choose an initial interval I0 that contains the current state xn with a size of w, where w is a rough estimate of the scale of S. Then expand the interval by w each time using the "stepping out" method (details are shown in Algorithm 1) until both of the endpoints are outside the slice region S, or the width of the interval has already reached the given maximum size mw. Notice that both parameters, m and w, are given manually.
Step 3. Sample one new point xn+1 each time uniformly from the interval found in Step 2 until xn+1 lies in the slice region. Points that are sampled outside the slice region S are used to shrink the interval, which is called the "shrinkage" procedure.

The proof of the correctness of slice sampling is provided in Appendix A. Though slice sampling has a very simple form, it can be applied to a wide variety of distributions and is more efficient than Gibbs sampling and Metropolis-Hastings in some simple cases [8]. However, slice sampling from complex distributions becomes difficult because of the time-consuming process of solving for the slice interval.

4. Decomposed slice sampling

Based on the idea of slice sampling, we propose a novel sampling method called decomposed slice sampling (DSS) which is able to simulate more complex distributions than the original slice sampling, especially in high dimensions. Specifically, we decompose the factorized target distribution into two multipliers. On the basis of the first part of the distribution, we generate the auxiliary variable and define the horizontal slice. Then we sample new points from the second part of the distribution over the slice interval provided by the first part. In this section, we will detail the decomposed slice sampling and show the correctness of this proposed framework.

4.1. Decomposed slice sampling

Suppose that the target distribution p(x) can be written in the form

p(x) \propto b_1(x)\, b_2(x),   (3)

where b2(x) is a distribution that is relatively simpler than b1(x). In practice, there are many cases in which the target distribution is a factorized distribution which can be decomposed into two factors. For example, the posterior distribution is usually the target distribution and it is proportional to a likelihood function multiplied by a prior distribution.

By introducing an auxiliary variable u, the distribution for u given x is defined as:

p(u \mid x) = \frac{1}{b_1(x)}, \quad \text{s.t. } 0 < u < b_1(x).   (4)

Given x, the auxiliary variable u is distributed as Uniform(0, b1(x)). Thus, the distribution for x given u is then

p(x \mid u) = b_2(x), \quad \text{s.t. } 0 < u < b_1(x).   (5)

That is, given u, x is distributed according to b2(x) over {x : 0 < u < b1(x)}. Notice that b2(x) is a distribution relatively simpler than b1(x) (e.g., a prior), which makes the simulation of p(x|u) easier.

Similar to slice sampling, this method also contains three steps in one update, from the current state xn to the next state xn+1:

Step 1. Generate a value u uniformly from (0, b1(xn)), which defines a horizontal "slice": S = {x : u < b1(x)}.
Step 2. Randomly choose an initial interval I0 that contains the current state xn with a size of w, where w is a rough estimate of the scale of S. Then expand the interval by w each time using the "stepping out" method until both of the endpoints are outside the slice region S, or the width of the interval has already reached the given maximum size mw.
Step 3. Sample one new point xn+1 each time from b2(x) over the interval found in Step 2, until xn+1 lies in the slice region. Points that are sampled outside the slice region S are used in the shrinkage procedure to shrink the interval.

This procedure is illustrated in Fig. 2, and a code sketch is given below after the caption of Fig. 2. Note that in Step 3, the sampling method adopted to sample new points can be any efficient MCMC method, which depends on the specific practical situation. In the section on experiments, we will employ slice sampling and HMC in Step 3, respectively.

In fact, related methods have been used in the past. The Swendsen-Wang (SW) algorithm is an early auxiliary variable method for MCMC. The original SW algorithm [4] applied only to the Ising/Potts model for image analysis; in the following year Edwards and Sokal [7] generalized the SW algorithm to make it available to other models. Later, Higdon [1] reinterpreted the generalized SW algorithm from the perspective of auxiliary variables. In these methods, the target distribution can be written in the form of a multiplication of k + 1 functions: p(x) ∝ p0(x) b1(x) ··· bk(x). By introducing k auxiliary variables, u1, ..., uk, the joint distribution p(x, u1, ..., uk) is defined as a uniform distribution over the region 0 < ui < bi(x) for i = 1, ..., k. Then p(x, u1, ..., uk) can be sampled using Gibbs sampling or some other effective MCMC methods.

Fig. 2. Illustration of decomposed slice sampling. The gray curves represent p(x). (a) For a given value xn, a vertical level, u, is drawn uniformly from (0, b1(xn)), and used to define a "slice". The slice interval is shown in bold. (b) Sample one new point xn+1 from b2(x) over the interval [xmin, xmax], until xn+1 lies in the slice region. The rejected point x∗ is used to shrink the interval [xmin, xmax].
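As a concrete illustration of the three DSS steps above, the sketch below implements one DSS update for a univariate target p(x) ∝ b1(x) b2(x). In Step 3 it draws candidates directly from b2 and discards those outside the bracketed interval, which is one simple way to realize "sampling from b2(x) over the interval"; the paper instead plugs in slice sampling (DSS-ss) or HMC (DSS-hmc) there. The toy factors follow the 1D bimodal example of Fig. 3b, reading N(μ, σ²) with the second argument as a variance; all names are our own assumptions.

```python
import numpy as np

def dss_update(x, b1, sample_b2, w, m, rng):
    """One decomposed slice sampling update for p(x) proportional to b1(x) * b2(x)."""
    # Step 1: the slice S = {x : u < b1(x)} is defined by the first factor only.
    u = rng.uniform(0.0, b1(x))

    # Step 2: "stepping out" on b1 to bracket the slice.
    left = x - w * rng.uniform()
    right = left + w
    j = int(np.floor(m * rng.uniform()))
    k = (m - 1) - j
    while j > 0 and u < b1(left):
        left, j = left - w, j - 1
    while k > 0 and u < b1(right):
        right, k = right + w, k - 1

    # Step 3: propose from b2 restricted to the interval; proposals inside the
    # interval but outside the slice shrink it, as in standard slice sampling.
    while True:
        x_new = sample_b2(rng)
        if x_new < left or x_new > right:
            continue                 # outside the bracketing interval, redraw
        if u < b1(x_new):
            return x_new             # inside the slice: accept
        if x_new < x:
            left = x_new
        else:
            right = x_new

# Toy example in the spirit of Fig. 3b: b1 bimodal, b2 a Gaussian "prior".
rng = np.random.default_rng(0)
b1 = lambda t: np.exp(-0.5 * t**2 / 4.0) + np.exp(-0.5 * (t - 10.0)**2 / 4.0)
sample_b2 = lambda r: r.normal(4.0, 2.0)     # N(4, 4): mean 4, standard deviation 2
x, samples = 4.0, []
for _ in range(1000):
    x = dss_update(x, b1, sample_b2, w=1.0, m=300, rng=rng)
    samples.append(x)
```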

The slice sampling is actually a special case of the generalized SW algorithm, where k = 1 and p0(x) = 1 [8], and the decomposed slice sampling procedure described above is also a special case where k = 1 but p0(x) ≠ 1.

4.2. Correctness of decomposed slice sampling

To show that the decomposed slice sampling is a correct procedure, it is necessary to prove that the distribution p(x) is invariant during each step in the Markov chain. To show invariance, we first suppose that the initial point x0 follows the distribution p(x). Thus the joint distribution of x0 and u follows:

p(x, u) = \begin{cases} b_2(x)/Z & \text{if } 0 < u < b_1(x), \\ 0 & \text{otherwise}, \end{cases}   (6)

where Z = \int b_1(x)\, b_2(x)\, dx. The marginal distribution of x is obtained by integrating u as follows:

\int p(x, u)\, du = \int_0^{b_1(x)} \frac{b_2(x)}{Z}\, du = \frac{b_1(x)\, b_2(x)}{Z} = p(x).   (7)

Note that in the subsequent update from x0 to x1, the auxiliary variable u has the same value. If the joint distributions p(x0, u) and p(x1, u) are identically distributed, then p(x0) and p(x1) are also identically distributed by integrating u. The same holds for the subsequent update from x1 to x2, and so on. Thus, we only need to show that p(xn, u) and p(xn+1, u) are identically distributed in the step of sampling xn+1 given xn (n = 0, 1, 2, ...). Because these steps do not change u, it is equivalent to prove that the conditional distribution for x given u is invariant in Step 2 and Step 3 of the DSS. As introduced above, p(x|u) is distributed according to b2(x) over S = {x : u < b1(x)}. We can show the invariance of the conditional distribution by proving that the updates satisfy the detailed balance [29]:

p(x_n \mid u)\, p(x_{n+1} \mid x_n, u) = p(x_{n+1} \mid u)\, p(x_n \mid x_{n+1}, u).   (8)

Denote

p(x_n \mid u) = b_{2S(u)}(x_n) = k_n, \qquad p(x_{n+1} \mid u) = b_{2S(u)}(x_{n+1}) = k_{n+1},   (9)

where b_{2S(u)}(x) is the b2 distribution over the slice S determined by u. It is worth noting that, here, S should be understood as the accurate analytic solution of the slice region. Now Eq. (8) can be simplified as:

k_n \times p(x_{n+1} \mid x_n, u) = k_{n+1} \times p(x_n \mid x_{n+1}, u).   (10)

We can demonstrate that the update procedure satisfies the detailed balance by proving the following stronger result [8]:

k_n \times p(x_{n+1}, r \mid x_n, u) = k_{n+1} \times p(x_n, \pi(r) \mid x_{n+1}, u),   (11)

where r denotes the random choices in the sampling procedure, or more specifically, U and J (or V, K) in the "stepping out" step, and the sequence of rejected points P in the "shrinkage" step. π(r) is a first-order one-to-one mapping that depends on xn and xn+1. Regarding r as a variable, we can obtain Eq. (10) by integrating over r (i.e., considering all the possible choices of r).

Next, we define the mapping π(r). We first define the π(U) that maps Un to Un+1 as follows:

U_{n+1} = \mathrm{Frac}(U_n + (x_{n+1} - x_n)/w),   (12)

where the function Frac(x) = x − Floor(x) retains the fractional part of x. The definition of π(U) makes the left bound Ln+1 of the initial slice interval In+1 that contains xn+1 align with the left bound Ln of the initial slice interval In that contains xn, or, expressed as a formula, (Ln+1 − Ln) mod w = 0.

The "stepping out" procedure will generate a random number J ∈ {0, ..., m − 1}, and so we define the mapping π(J) that maps Jn to Jn+1 as follows:

J_{n+1} = J_n + (x_{n+1}/w - U_{n+1}) - (x_n/w - U_n),   (13)

where (xn+1/w − Un+1) − (xn/w − Un) denotes the distance (in units of w) between the left bounds Ln and Ln+1.

When the transition between xn and xn+1 is possible, it is easy to find that the final slice interval will be the same for xn as for xn+1 with the above definitions. Actually, these two intervals will be the same as soon as the interval for xn first contains xn+1. Then in the following "stepping out" steps, the two intervals will stay the same. For the situation in which the transition between xn and xn+1 is impossible, Eq. (11) will hold because both sides of it are zero.

Finally, π maps the sequence of rejected points Pn used to shrink the interval that contains xn to the sequence of rejected points Pn+1 that are generated from the interval found from xn+1. Thus, Pn and Pn+1 are the same sequence.

After the completion of the definition of π, we return to prove Eq. (11), which can be expanded as:

k_n \times p(U_n \mid x_n)\, p(x_{n+1}, J_n, P_n \mid U_n, x_n) = k_{n+1} \times p(U_{n+1} \mid x_{n+1})\, p(x_n, J_{n+1}, P_{n+1} \mid U_{n+1}, x_{n+1}).   (14)

Here, the unchanged u is omitted for conciseness. Obviously, p(Un|xn) is uniform over (0, 1). Although the value of Un+1 depends on xn and xn+1, p(Un+1) is also uniform over (0, 1) because Un+1 is a linear transformation of Un. Therefore, the second term of both sides in Eq. (14) can be crossed out and the equation can be written as:

k_n \times p(J_n \mid U_n, x_n)\, p(x_{n+1}, P_n \mid J_n, U_n, x_n) = k_{n+1} \times p(J_{n+1} \mid U_{n+1}, x_{n+1})\, p(x_n, P_{n+1} \mid J_{n+1}, U_{n+1}, x_{n+1}).   (15)

Similarly, p(Jn|Un, xn) and p(Jn+1|Un+1, xn+1) are both uniform over (0, 1). Therefore, Eq. (15) can be simplified and transformed as:

k_n \times p(P_n \mid J_n, U_n, x_n)\, p(x_{n+1} \mid r, x_n) = k_{n+1} \times p(P_{n+1} \mid J_{n+1}, U_{n+1}, x_{n+1})\, p(x_n \mid \pi(r), x_{n+1}).   (16)

Here r represents Un, Jn and Pn, while π(r) represents Un+1, Jn+1 and Pn+1. As already described, the sequences of rejected points Pn and Pn+1 in the "shrinkage" procedure are the same, which means that the probability p(Pn|Jn, Un, xn) is equal to p(Pn+1|Jn+1, Un+1, xn+1). Thus, we only need to prove the following equation:

k_n \times p(x_{n+1} \mid r, x_n) = k_{n+1} \times p(x_n \mid \pi(r), x_{n+1}).   (17)

As we have demonstrated above, the definitions of π(U) and π(J) make sure that the final interval containing xn overlaps the final interval containing xn+1 completely, while the definition of π(P) keeps the shrinkage of these two intervals synchronized. Denote this identical interval as Ŝ and we obtain:

\frac{p(x_{n+1} \mid r, x_n)}{p(x_n \mid \pi(r), x_{n+1})} = \frac{b_{2\hat{S}}(x_{n+1})}{b_{2\hat{S}}(x_n)} = \frac{k_{n+1}}{k_n},   (18)

where b_{2\hat{S}}(x) is the b2 distribution over Ŝ.

So far we have proved that Eq. (17) holds, which means that the distribution p(x) is invariant during each step in the Markov chain. This completes the proof.

5. Experiments and results

In this section, we demonstrate the performance of the decomposed slice sampling by simulation experiments and real data experiments. We compare the proposed DSS methods with standard slice sampling and Hamiltonian Monte Carlo. For each compared method, the parameters are allowed to be tuned to be optimal. The estimate of the typical size of a slice, w, is manually set to be 15, and the integer that limits the maximum size of a slice interval (or the maximum number of iterations), m, is set to be 300 in all the related experiments.

5.1. Effectiveness test

Though we have theoretically proved the correctness of DSS, it is more intuitive to verify this proposed method through some tests. For the convenience of visualization, we evaluate the performance of DSS on two univariate distributions and two bivariate distributions. We employ slice sampling in the third step of DSS to pick new points and name this specific DSS method DSS-ss. We draw 20,000 samples in each experiment. Histograms of samples from the 1D distributions compared with the corresponding density contours are shown in Fig. 3. Scatter diagrams of samples taken from the 2D distributions are shown in Fig. 4.

It can be seen that by using the DSS, both the histograms of samples from the 1D distributions and the scatter diagrams of samples from the 2D distributions completely fit the corresponding target density contours. This means that the resulting samples correctly follow their target distributions, which provides intuitive evidence for the effectiveness of DSS.

5.2. Simulation experiment

In this subsection, we respectively sample unimodal and multi-modal Gaussian distributions from 1 to 60 dimensions by using DSS-ss and the slice sampling, and then compare the CPU time and the maximum mean discrepancy (MMD) of these two methods.

5.2.1. Evaluation criteria

In order to evaluate the time performance of different methods, we take the CPU time consumed during sampling as one of the evaluation criteria. For judging the correctness of the sampling results, we use the MMD as the second criterion.

MMD [30,31] was first proposed for the two-sample test problem, to determine whether two distributions p and q are identical. The idea is that, taking the samples generated by the two distributions as the input of a function f, if the means of the corresponding function values are equal, then the two distributions can be considered to be the same. It is now generally used to measure the similarity between two distributions. In this experiment, we choose the function f as the kernel function κ(x, x') = 1 + x^T x'. Thus the square of the MMD can be expressed as:

\mathrm{MMD}^2[\mathcal{F}, p, q] = \mathbb{E}_{x,x'}[\kappa(x, x')] - 2\,\mathbb{E}_{x,y}[\kappa(x, y)] + \mathbb{E}_{y,y'}[\kappa(y, y')],   (19)

where x is a random variable obeying the distribution p, and y is a random variable obeying the distribution q.

In our experiments, x represents samples picked by the proposed methods or the comparison methods, and y represents random vectors from the Gaussian distribution generated by the library function mvnrnd in Matlab. The similarity between x and y is measured by calculating the MMD. Smaller values of the MMD mean that x is more similar to y, that is, the distribution of x is closer to the target Gaussian distribution. A value of zero for the MMD indicates that x and y are equally distributed.
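For reference, the squared MMD in Eq. (19) with the kernel κ(x, x') = 1 + x^T x' can be estimated directly from two sample sets. The snippet below is a small sketch of that estimator in Python/NumPy (the paper used Matlab and mvnrnd); it uses the simple biased V-statistic form as an assumption, and the sample sizes are placeholders.

```python
import numpy as np

def mmd_squared(X, Y):
    """Biased estimate of MMD^2 between samples X (n, d) and Y (m, d)
    under the kernel k(x, x') = 1 + x^T x'."""
    k_xx = 1.0 + X @ X.T
    k_yy = 1.0 + Y @ Y.T
    k_xy = 1.0 + X @ Y.T
    return k_xx.mean() - 2.0 * k_xy.mean() + k_yy.mean()

# Example: sampler output X versus reference Gaussian draws Y.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(2000, 3))                         # e.g. DSS-ss samples
Y = rng.multivariate_normal(np.zeros(3), np.eye(3), size=2000)   # reference draws
print(mmd_squared(X, Y))   # close to zero when the two distributions match
```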
5.2.2. Settings

For the experiments of sampling unimodal Gaussians, we set b1(x) = N(μ1, Σ1) and b2(x) = N(μ2, Σ2). For the multi-modal case, we set b1(x) ∝ N(μ3, Σ3) + N(μ4, Σ3) and b2(x) ∝ N(μ5, Σ4) + N(μ6, Σ4). As for these hyper-parameters, μi is an n-dimensional vector randomly generated from [0, 4] and the diagonal covariance matrix Σj obtains n diagonal elements randomly from [0, 30], where i = 1, ..., 6, j = 1, ..., 4 and n = 1, ..., 60. We respectively sample the unimodal and multi-modal Gaussian distributions from 1 to 60 dimensions by using DSS-ss and the slice sampling. We draw 10,000 samples for each experiment, and perform the experiments 10 times on randomly assigned hyper-parameters. Note that in order to ensure the repeatability of the experiments, we fixed all the random seeds as constants.
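As an illustration of this setup, the snippet below draws one random configuration of the hyper-parameters for a given dimension n, following the stated uniform ranges; the variable names and the particular seed value are our own choices, not taken from the paper.

```python
import numpy as np

def random_setting(n, rng):
    """Means mu_1..mu_6 from [0, 4]^n and diagonal covariances Sigma_1..Sigma_4
    with diagonal entries drawn from [0, 30]."""
    mus = [rng.uniform(0.0, 4.0, size=n) for _ in range(6)]
    sigmas = [np.diag(rng.uniform(0.0, 30.0, size=n)) for _ in range(4)]
    return mus, sigmas

rng = np.random.default_rng(0)      # fixed seed for repeatability, as in the paper
mus, sigmas = random_setting(n=10, rng=rng)
```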
5.2.3. Results and analysis

The average CPU time and its standard deviation over the 10 repeated experiments for sampling 10,000 points by DSS and slice sampling in different dimensions are shown in Fig. 5. From the figures, we can see that in the case of unimodal distributions, DSS is always more time-consuming than slice sampling, but in the multi-modal situation it turns out to be the opposite. A reasonable explanation is that in the case of a unimodal Gaussian, slice sampling only needs to sample one unimodal Gaussian distribution, while DSS needs to obtain the slice interval from b1(x) and perform a full slice sampling on b2(x), which takes more time than only simulating one unimodal distribution. For the case of a multi-modal Gaussian, if the likelihood and the prior are mixed Gaussians of two components, the slice sampling needs to sample a mixed Gaussian with four components, while DSS samples a mixed Gaussian with only two components in each step. The results of CPU time indicate that DSS is more time-saving than slice sampling when the target distribution is not a simple unimodal distribution.

Fig. 3. Histograms of samples taken by DSS from 1D distributions compared with the corresponding density contour: (a) 1D unimodal Gaussian, b1(x) = N(3, 1) and b2(x) = N(1, 1), and (b) 1D bimodal Gaussian, b1(x) ∝ N(0, 4) + N(10, 4) and b2(x) = N(4, 4).

Fig. 4. Scatter diagrams of samples taken by DSS from 2D distributions compared with the corresponding density contour: (a) 2D unimodal Gaussian, b1(x) = N(3, I) and b2(x) = N(1, I), and (b) 2D bimodal Gaussian, b1(x) ∝ N(0, I) + N(8, I) and b2(x) = N(4, I).

Fig. 5. The averaged CPU time (s) of DSS and slice sampling in different dimensions: (a) unimodal Gaussian distributions and (b) multi-modal Gaussian distributions.

Fig. 6. The averaged MMDs of DSS and slice sampling: (a) unimodal Gaussian distributions and (b) multi-modal Gaussian distributions.

We also record the MMD of the samples averaged over every 10 repeated experiments; thus both methods have 60 recorded MMDs. Taking the MMD of DSS as the horizontal axis and the MMD of slice sampling as the vertical axis, we obtain Fig. 6. Note that the closer a point is to the line y = x, the more similar the MMDs of DSS and slice sampling are. We can see that the points are close to the line in both subfigures, especially in the multi-modal case. This means that DSS has similar performance to slice sampling in terms of MMD, especially in complex cases.

From the simulation experiments, we can conclude that as the target distribution becomes complex, the proposed DSS is more time-saving than slice sampling, while having similar performance to slice sampling in terms of MMD.

5.3. Real data experiment

In this subsection, we apply the proposed method to the Bayesian logistic regression model (BLR) [32] for classification tasks. BLR is a famous linear discriminant model for binary classification. The model has a great deal of theoretical support, and performs prominently well in terms of predictive accuracy [32,33]. BLR uses the sigmoid function to describe the relationship between data and prediction, i.e., y = σ(w^T φ), where y is the predictive probability, φ is the feature vector, and w is the parameter vector. By considering a Gaussian prior and a likelihood function with a multiplicative form, the non-normalized form of the posterior distribution of w can be obtained as the product of the prior and the likelihood function. In the following experiments, we employ four MCMC methods to approximate the integral of the posterior distribution, i.e., two methods that are based on the decomposed slice sampling framework, standard slice sampling, and HMC. The DSS methods applied to the BLR models set b1(x) proportional to the likelihood function and b2(x) as the prior.
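To make this factorization concrete, a minimal sketch of the two factors used by DSS for Bayesian logistic regression is given below: b1(w) is taken proportional to the likelihood and b2(w) is the N(0, I) prior. The helper names are our own, and a practical implementation would work on the log scale for numerical stability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_blr_factors(Phi, t):
    """Factors of the unnormalized posterior p(w | t) proportional to likelihood(w) * prior(w).
    Phi: (N, D) feature matrix; t: (N,) binary labels in {0, 1}."""
    def b1(w):                       # likelihood factor, defines the slice
        y = sigmoid(Phi @ w)
        return float(np.prod(y**t * (1.0 - y)**(1 - t)))

    def sample_b2(rng):              # N(0, I) prior: the second, simpler factor
        return rng.standard_normal(Phi.shape[1])

    return b1, sample_b2

# b1 is used in Steps 1-2 of DSS to define and bracket the slice; Step 3 then
# draws from b2 over the bracketed region, e.g. with slice sampling (DSS-ss)
# or HMC (DSS-hmc).
```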
In order to evaluate the performance of the decomposed slice sampling, we design two groups of comparison experiments on ten real-world datasets. The first comparison is DSS-ss versus the original slice sampling (SS). As mentioned in Section 5.1, here DSS-ss represents the decomposed slice sampling that employs the slice sampling in Step 3, which avoids the comparison being interfered with by other sampling methods. Similarly, in the second group, in order to be consistent with the method being compared, when compared with HMC, the decomposed slice sampling employs HMC (DSS-hmc) in its third step.

5.3.1. Datasets and setups

Table 1 shows the information about the datasets used in the classification tasks.

Table 1
Dataset description.

Dataset        # of instances   # of features
Iris           150              4
Mammographic   830              5
Brestcancer    683              9
Fertility      100              9
Heart          270              13
Pabomovie      2000             20
Parkinsons     195              22
Germannuber    1000             24
Chen-2002      179              85
Gordon-2002    181              1626

The first eight datasets can be found in the UCI data repository [34], and the last two datasets are cancer gene expression datasets that are available at [35]. Feature dimensions are between 4 and 1626, and the total numbers of data instances range from 100 to 2000. All the datasets contain 2 classes except for the classical Iris dataset, which has 3 classes. Thus, for Iris, we experimented with the instances of the labels "Versicolour" and "Virginica" because the classification accuracies of the other combinations are always 100% for each method.

After normalization, all of the datasets are randomly split into a training set and a test set by a ratio of 7:3. Since the Bayesian logistic regression seeks a Gaussian representation of the posterior distribution, for simplicity, we choose Gaussian priors N(0, I). We draw 5000 samples for each experiment, and perform the experiments 10 times on randomly assigned datasets. The average accuracies in percentage with the corresponding standard deviations are reported in Table 2.

5.3.2. Classification performances

From Table 2, we can see that methods based on the DSS perform better than the slice sampling and HMC in the higher-dimensional cases. For the first group, DSS-ss outperforms SS on the last five datasets, whose feature dimensions are no less than twenty. On the lower-dimensional datasets (Iris and Brestcancer) where the target distributions are simple, DSS-ss performs slightly inferior to SS. For the second group, the performance of DSS-hmc is never overtaken by HMC except on two low-dimensional datasets, Mammographic and Brestcancer.

Table 2
Averaged classification accuracies and standard deviations (%) for all datasets.

Dataset        Slice sampling methods                  HMC methods
               DSS-ss            SS                    DSS-hmc           HMC
Iris           89.00 ± 0.1610    90.00 ± 0             90.00 ± 0         85.33 ± 0.1146
Mammographic   73.49 ± 0         73.49 ± 0             73.29 ± 0         73.53 ± 0.0013
Brestcancer    85.02 ± 0.0024    85.07 ± 0.0025        84.93 ± 0.0043    85.07 ± 0.0034
Fertility      96.67 ± 0         96.67 ± 0             96.67 ± 0         96.67 ± 0
Heart          87.65 ± 0         87.65 ± 0             87.78 ± 0.0070    87.41 ± 0.0052
Pabomovie      81.53 ± 0.0028    81.33 ± 0.0030        81.38 ± 0.0025    81.32 ± 0.0018
Parkinsons     82.76 ± 0         81.55 ± 0.0116        81.03 ± 0         80.35 ± 0.0121
Germannuber    76.27 ± 0.0041    75.97 ± 0.0037        75.80 ± 0.0029    74.80 ± 0.0311
Chen-2002      90.74 ± 1.9520    90.19 ± 1.9618        90.93 ± 1.3664    89.63 ± 1.7891
Gordon-2002    96.53 ± 1.8383    95.99 ± 2.1673        99.21 ± 0.9887    98.68 ± 0.9027

To clarify the performance differences between the methods in the higher-dimensional cases, we run a paired t-test on the averaged accuracies of the datasets whose feature dimensions are no less than twenty. The p-value for the first group, DSS-ss vs SS, is 0.0335, and the p-value for the second group, DSS-hmc vs HMC, is 0.0276. Both p-values are less than 0.05, which indicates the significant improvements of the proposed method.
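Such a paired t-test can be run directly, for example with SciPy, on the per-dataset average accuracies of the five higher-dimensional datasets from Table 2. This pairing is our reading of the text; the authors may have paired per-run accuracies instead.

```python
from scipy.stats import ttest_rel

# DSS-ss vs SS average accuracies (%) on the datasets with >= 20 features (Table 2):
# Pabomovie, Parkinsons, Germannuber, Chen-2002, Gordon-2002.
acc_dss_ss = [81.53, 82.76, 76.27, 90.74, 96.53]
acc_ss     = [81.33, 81.55, 75.97, 90.19, 95.99]

stat, p_value = ttest_rel(acc_dss_ss, acc_ss)
print(p_value)   # two-sided p-value, roughly 0.03 for these values
```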
In short, the experimental results again reveal the superior performance of the proposed method, especially in high-dimensional cases where the target distributions are complex.

6. Conclusions and future work

In this paper, we have proposed a novel sampling framework, namely, the decomposed slice sampling method, which is used to efficiently sample from unnormalized factorized distributions. This slice sampling based method divides the target distribution into two multipliers, alternating between drawing an auxiliary variable from the first multiplier to define the horizontal slice and sampling a new point from the second term over the interval given by the first multiplier. The correctness and effectiveness of the proposed MCMC framework have been verified theoretically and experimentally. The experimental results have demonstrated that the DSS based sampling methods outperform the slice sampling and the HMC, especially in complex situations. Thus, it is recommended to use DSS for sampling when the target distribution is complex. Though the traditional slice sampling method is suitable for simple unimodal distributions, in practice, most target distributions are more complex. Therefore, the proposed DSS is a practical sampling method, and can be widely applied to sampling from factorized distributions. In future work, we will consider upgrading our method by using improved algorithms for solving slice intervals.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Project 61673179, Shanghai Knowledge Service Platform Project (No. ZF1213), and the Fundamental Research Funds for the Central Universities.

Appendix A. Proof of slice sampling

This proof is based on the idea in Section 4.8 of [8], and we provide a more detailed explanation of the correctness of slice sampling. Because the natural language description in [8] is difficult to understand, we use formulas to explain it rigorously. The proof is as follows.

To show that slice sampling is a correct procedure, it is necessary to prove that the target distribution p(x) is invariant during each step in the Markov chain. To show invariance, we first suppose that the initial point x0 follows the distribution p(x). Thus the joint distribution of x0 and u follows:

p(x, u) = \begin{cases} \frac{1}{Z_p} & \text{if } 0 < u < \tilde{p}(x), \\ 0 & \text{otherwise}, \end{cases}   (A.1)

where Z_p = \int \tilde{p}(x)\, dx and p̃(x) is a computable function that is proportional to p(x). The marginal distribution of x is obtained by integrating u as follows:

\int p(x, u)\, du = \int_0^{\tilde{p}(x)} \frac{1}{Z_p}\, du = \frac{\tilde{p}(x)}{Z_p} = p(x).   (A.2)

Note that in the subsequent update from x0 to x1, the auxiliary variable u is unchanged. If the joint distributions p(x0, u) and p(x1, u) are identically distributed, then p(x0) and p(x1) are also identically distributed by integrating u. The same holds for the subsequent update from x1 to x2, and so on. Thus, we only need to show that p(xn, u) and p(xn+1, u) are identically distributed in the step of sampling xn+1 given xn (n = 0, 1, 2, ...). Because these steps do not change u, it is equivalent to prove that the conditional distribution for x given u is invariant in Step 2 and Step 3 described in Section 3. As introduced in Section 3, p(x|u) is uniform over S = {x : u < p̃(x)}. We can show the invariance of the conditional distribution by proving that the updates satisfy the detailed balance:

p(x_n \mid u)\, p(x_{n+1} \mid x_n, u) = p(x_{n+1} \mid u)\, p(x_n \mid x_{n+1}, u).   (A.3)

For a uniform distribution, this reduces to proving that

p(x_{n+1} \mid x_n, u) = p(x_n \mid x_{n+1}, u),   (A.4)

for any xn and xn+1 in S.

We can demonstrate that the update procedure satisfies the detailed balance by proving the following stronger result:

p(x_{n+1}, r \mid x_n, u) = p(x_n, \pi(r) \mid x_{n+1}, u),   (A.5)

where r denotes the random choices in the sampling procedure, or more specifically, U and J (or V, K) in the "stepping out" step, and the sequence of rejected points P in the "shrinkage" step. π(r) is a first-order one-to-one mapping that depends on xn and xn+1. We can obtain Eq. (A.4) by integrating over all the possible choices of r.

Next, we define the mapping π(r). We first define the π(U) that maps Un to Un+1 as follows:

U_{n+1} = \mathrm{Frac}(U_n + (x_{n+1} - x_n)/w),   (A.6)

where the function Frac(x) = x − Floor(x) retains the fractional part of x. The definition of π(U) makes the left bound Ln+1 of the initial slice interval In+1 that contains xn+1 align with the left bound Ln of the initial slice interval In that contains xn, or, expressed as a formula, (Ln+1 − Ln) mod w = 0.

The "stepping out" procedure will generate a random number J ∈ {0, ..., m − 1}, and so we define the mapping π(J) that maps Jn to Jn+1 as follows:

J_{n+1} = J_n + (x_{n+1}/w - U_{n+1}) - (x_n/w - U_n),   (A.7)

where (xn+1/w − Un+1) − (xn/w − Un) denotes the distance (in units of w) between the left bounds Ln and Ln+1.

When the transition between xn and xn+1 is possible, it is easy to find that the final slice interval will be the same for xn as for xn+1 with the above definitions. Actually, these two intervals will be the same as soon as the interval for xn first contains xn+1. Then in the following "stepping out" steps, the two intervals will stay the same. For the situation in which the transition between xn and xn+1 is impossible, Eq. (A.5) will hold because both sides of it are zero.

Finally, π maps the sequence of rejected points Pn used to shrink the interval that contains xn to the sequence of rejected points Pn+1 generated from the interval found from xn+1. Thus, Pn and Pn+1 are the same sequence.

After the completion of the definition of π, we return to prove Eq. (A.5), which can be expanded as:

p(U_n \mid x_n)\, p(x_{n+1}, J_n, P_n \mid U_n, x_n) = p(U_{n+1} \mid x_{n+1})\, p(x_n, J_{n+1}, P_{n+1} \mid U_{n+1}, x_{n+1}).   (A.8)

Here, the unchanged u is omitted for conciseness. Obviously, p(Un|xn) is uniform over (0, 1). Although the value of Un+1 depends on xn and xn+1, p(Un+1) is also uniform over (0, 1) because Un+1 is a linear transformation of Un. Therefore, the first term of both sides in Eq. (A.8) can be crossed out and the equation can be written as:

p(J_n \mid U_n, x_n)\, p(x_{n+1}, P_n \mid J_n, U_n, x_n) = p(J_{n+1} \mid U_{n+1}, x_{n+1})\, p(x_n, P_{n+1} \mid J_{n+1}, U_{n+1}, x_{n+1}).   (A.9)

Similarly, p(Jn|Un, xn) and p(Jn+1|Un+1, xn+1) are both uniform over (0, 1). Therefore, Eq. (A.9) can be simplified and transformed as:

p(P_n \mid J_n, U_n, x_n)\, p(x_{n+1} \mid r, x_n) = p(P_{n+1} \mid J_{n+1}, U_{n+1}, x_{n+1})\, p(x_n \mid \pi(r), x_{n+1}).   (A.10)

Here r represents Un, Jn and Pn, while π(r) represents Un+1, Jn+1 and Pn+1. As already described, the sequences of rejected points Pn and Pn+1 in the "shrinkage" procedure are the same, which means that p(Pn|Jn, Un, xn) is equal to p(Pn+1|Jn+1, Un+1, xn+1). Thus, we only need to prove the following equation:

p(x_{n+1} \mid r, x_n) = p(x_n \mid \pi(r), x_{n+1}).   (A.11)

As we have demonstrated above, the definitions of π(U) and π(J) make sure that the final interval containing xn overlaps the final interval containing xn+1 completely, while the definition of π(P) keeps the shrinkage of these two intervals synchronized. Consequently, xn and xn+1 always obey the same uniform distribution, i.e., Eq. (A.11) holds, which means that the target distribution p(x) is invariant during each step in the Markov chain. This completes the proof.

References

[1] D.M. Higdon, Auxiliary variable methods for Markov chain Monte Carlo with applications, Publ. Am. Stat. Assoc. 93 (1998) 585–595.
[2] C. Andrieu, N. De Freitas, A. Doucet, M.I. Jordan, An introduction to MCMC for machine learning, Mach. Learn. 50 (2003) 5–43.
[3] S. Sun, A review of deterministic approximate inference techniques for Bayesian machine learning, Neural Comput. Appl. 23 (2013) 2039–2050.
[4] R.H. Swendsen, J.-S. Wang, Nonuniversal critical dynamics in Monte Carlo simulations, Phys. Rev. Lett. 58 (1987) 86–88.
[5] M.W. Liechty, J. Lu, Multivariate normal slice sampling, J. Comput. Graph. Stat. 19 (2010) 281–294.
[6] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721–741.
[7] R.G. Edwards, A.D. Sokal, Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm, Phys. Rev. D 38 (1988) 2009–2012.
[8] R.M. Neal, Slice sampling, Ann. Stat. (2003) 705–741.
[9] D.J. Nott, D. Leonte, Sampling schemes for Bayesian variable selection in generalized linear models, J. Comput. Graph. Stat. 13 (2004) 362–382.
[10] S.K. Kinney, D.B. Dunson, Fixed and random effects selection in linear and logistic models, Biometrics 63 (2007) 690–698.
[11] D.K. Agarwal, A.E. Gelfand, Slice sampling for simulation based fitting of spatial data models, Stat. Comput. 15 (2005) 61–69.
[12] T.D. Kulkarni, P. Kohli, J.B. Tenenbaum, V. Mansinghka, Picture: a probabilistic programming language for scene perception, in: IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, 2015, pp. 4390–4399.
[13] P.O. Lewis, M.T. Holder, K.E. Holsinger, Polytomies and Bayesian phylogenetic inference, Syst. Biol. 54 (2005) 241–253.
[14] B. Shahbaba, R.M. Neal, Gene function classification using Bayesian models with hierarchy-based priors, BMC Bioinform. 7 (2006) 448–456.
[15] S. Sun, C.M. Greenwood, R.M. Neal, Haplotype inference using a Bayesian hidden Markov model, Genet. Epidemiol. 31 (2007) 937–948.
[16] I. Murray, R.P. Adams, D.J.C. Mackay, Elliptical slice sampling, J. Mach. Learn. Res. 9 (2009) 541–548.
[17] S. Brooks, A. Gelman, G.L. Jones, X.L. Meng, Handbook of Markov chain Monte Carlo, Chance 25 (2011) 53–55.
[18] S. Duane, A.D. Kennedy, B.J. Pendleton, D. Roweth, Hybrid Monte Carlo, Phys. Lett. B 195 (1987) 216–222.
[19] C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, New York, 2006.
[20] Z. Wang, S. Mohamed, N.D. Freitas, Adaptive Hamiltonian and Riemann manifold Monte Carlo samplers, Statistics (2013) 1462–1470.
[21] M.D. Hoffman, A. Gelman, The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo, J. Mach. Learn. Res. 15 (2014) 1593–1623.
[22] M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc. 73 (2015) 123–214.
[23] N. Tripuraneni, M. Rowland, Z. Ghahramani, R. Turner, Magnetic Hamiltonian Monte Carlo, in: Proceedings of the 34th International Conference on Machine Learning, Vol. 70, 2017, pp. 3453–3461.
[24] M. Thompson, R.M. Neal, Covariance-adaptive slice sampling, 2010. arXiv:1003.3201.
[25] M.B. Thompson, R.M. Neal, Slice sampling with adaptive multivariate steps: the shrinking-rank method, 2010. arXiv:1011.4722.
[26] M.M. Tibbits, C. Groendyke, M. Haran, J.C. Liechty, Automated factor slice sampling, J. Comput. Graph. Stat. 23 (2014) 543–563.
[27] M.M. Tibbits, M. Haran, J.C. Liechty, Parallel multivariate slice sampling, Stat. Comput. 21 (2011) 415–430.
[28] I. Murray, M. Graham, Pseudo-marginal slice sampling, in: Artificial Intelligence and Statistics, Vol. 51, 2016, pp. 911–919.
[29] R.M. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
[30] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test, J. Mach. Learn. Res. 13 (2012) 723–773.
[31] A. Rozantsev, M. Salzmann, P. Fua, Beyond sharing weights for deep domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2019) 801–814.
[32] A. Genkin, D.D. Lewis, D. Madigan, Large-scale Bayesian logistic regression for text categorization, Technometrics 49 (2007) 291–304.
[33] G. Kumar, V. Govindaraju, Bayesian background models for keyword spotting in handwritten documents, Pattern Recognit. 64 (2017) 84–91.
[34] D. Dheeru, E. Karra Taniskidou, UCI machine learning repository, 2017.
[35] Cancer gene expression data sets, http://schlieplab.org/Static/Supplements/CompCancer/datasets.htm. Accessed July 12, 2019.

Jiachun Wang is a student in the Pattern Recognition and Machine Learning Research Group, East China Normal University, Shanghai, China. Her research interests include Markov chain Monte Carlo sampling, probabilistic models, and their applications.

Shiliang Sun received the Ph.D. degree in pattern recognition and intelligent systems from Tsinghua University, Beijing, China, in 2007. He is a Professor with the Department of Computer Science and Technology and the Head of the Pattern Recognition and Machine Learning Research Group, East China Normal University, Shanghai, China. From 2009 to 2010, he was a Visiting Researcher with the Department of Computer Science, Centre for Computational Statistics and Machine Learning, University College London, London, U.K. In 2014, he was a Visiting Researcher with the Department of Electrical Engineering, Columbia University, New York, NY, USA. His current research interests include approximate inference, sequential modeling, multi-view learning, deep learning, and their applications. His research results have been expounded in 100+ publications at peer-reviewed journals and conferences. Prof. Sun is on the Editorial Board of multiple international journals, including Pattern Recognition and IEEE Transactions on Neural Networks and Learning Systems.
