Journal of Computational and Graphical Statistics
DOI: 10.1080/10618600.2014.950376

On the Bayesian mixture model and identifiability


Ramsés H. Mena
Universidad Nacional Autónoma de México, Mexico
Stephen G. Walker
University of Texas at Austin, USA
This paper is concerned with Bayesian mixture models and identifiability issues. There are two sources of unidentifiability: the well-known likelihood invariance under label switching, and the perhaps less well-known parameter identifiability problem. When using latent allocation variables determined by the mixture model, these sources of unidentifiability create arbitrary labeling, which renders estimation of the model very difficult. We endeavour to tackle these problems by proposing a prior distribution on the allocations which provides an explicit interpretation for the labeling by removing gaps with high probability. An MCMC estimation method is proposed and supporting illustrations are presented.

Keywords: Allocation variable; Gibbs sampler; Identifiability; Label switching; Mixture model.

1 Introduction

The random allocation of objects into clusters or groups is of fundamental interest in statistics
and other areas that interact with it. A common way to assign a distribution able to disentangle the different features of the observations into homogeneous clusters, e.g. clusters characterized by the same parametric family K(∙ | θ), is through mixture models. Under this framework, independent and identically distributed observations, y := (y_i)_{i=1}^n, are modeled by
(1)    f(y_i | w, θ) = \sum_{j=1}^{M} w_j K(y_i | θ_j),    M ≥ 1

where the weights w := (w_j)_{j=1}^M, which add up to one, and the locations θ := (θ_j)_{j=1}^M are instrumental and may be treated as random under complete uncertainty. The literature backing up this approach
is vast, ranging from its origins in Pearson (1894) to expositions such as those presented by Titterington, Smith and Makov (1985), McLachlan and Peel (2000) or Marin, Mengersen and Robert (2005). In particular, the case M = ∞ was first considered and studied by Lo (1984), where the weights and locations are derived from the Dirichlet process (Ferguson, 1973). Indeed, this constitutes the workhorse of Bayesian nonparametric inference, both for density estimation and for clustering problems. A recent review in this direction can be found in Hjort, Holmes, Müller
and Walker (2010).

An alternative way to write such a model is via an allocation variable, which tells us which
cluster each observation belongs to. That is

(2)    f(y_i | d_i) = K(y_i | θ_{d_i})    with    P(d_i = j | w) = w_j,

independently for i = 1, . . . , n. The advantage of this latter representation is that there is an explicit characterization of the distribution of the group membership of each observation. Under full uncertainty about the clustering structure underlying y, the natural choice is a symmetric distribution such as the multinomial. However, care should be taken since, under such a scenario, the likelihood is invariant under the many possible permutations of the indices, namely

p(y_1, . . . , y_n | d_1, . . . , d_n) = p(y_1, . . . , y_n | ρ(d_1), . . . , ρ(d_n))

for any permutation ρ of the allocation variables d := (d_i)_{i=1}^n. Furthermore, under priors for θ and w that are symmetric across components, this invariance is inherited by the posterior. Clearly, having such an invariance implies model unidentifiability.
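To illustrate the invariance concretely, here is a minimal sketch (our own, not part of the paper) that simulates data from a two-component mixture via allocation variables and checks that the likelihood is unchanged when the component labels are permuted; the helper name log_lik and all numerical settings are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulate from a two-component normal mixture through allocation variables.
w = np.array([0.7, 0.3])             # mixture weights
theta = np.array([-2.0, 2.0])        # component means (unit-variance kernels)
d = rng.choice(len(w), size=50, p=w)         # d_i with P(d_i = j | w) = w_j
y = rng.normal(loc=theta[d], scale=1.0)

def log_lik(y, d, theta):
    """log p(y | d, theta) = sum_i log K(y_i | theta_{d_i})."""
    return norm.logpdf(y, loc=theta[d], scale=1.0).sum()

perm = np.array([1, 0])                       # swap the two labels
print(log_lik(y, d, theta))                   # original labeling
print(log_lik(y, perm[d], theta[perm]))       # permuted labeling: identical value
```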
In particular, this issue is well known to cause the label switching problem when performing MCMC-based inference; namely, the likelihood invariance prevents a proper identification of a
grouping structure. We refer to Titterington, Smith and Makov (1985); Celeux, Hurn and Robert
(2000) and Stephens (2000) for deeper discussions of the label switching problem. There are
various proposals dealing with the label switching problem, some aimed at selecting a particular
permutation, e.g. by means of a loss function (Frühwirth-Schnatter, 2001; Nobile and Fearnside, 2007; Papastamoulis and Iliopoulos, 2010; Rodriguez and Walker, 2014), and others aimed at imposing parameter constraints so as to break the aforementioned symmetry and favor a single labeling
(Diebolt and Robert, 1994; Richardson and Green, 1997; Chung, Loken and Schafer, 2004).
Having said that, there is a seemingly related but different identifiability issue under the above
approach, namely the correct identification of the components that better describe y. In other
words, besides the label switching problem, there is the problem that different parameter values
(w, θ) could lead to the same value of f (∙). To expand on the point of this kind of unidentifiability,
notice that we can recover a standard normal for f(∙) with w_1 = 1 and K(∙ | θ_1) = N(∙ | 0, 1), but also, approximately, with w_1 taking any value, w_2 = 1 − w_1, K(∙ | θ_1) = N(∙ | 0, 1) and K(∙ | θ_2) ≈ N(∙ | 0, 1). Indeed, an
observation in this direction can be traced back to Ferguson (1983) who states the following about
the parameters of the model: “We are not interested in estimating the parameters of the model . . ..
It would not make much sense to do so since the parameters are not identifiable.”
Solving these unidentifiability issues simultaneously, i.e. likelihood invariance under label switching and parameter unidentifiability, is still a challenging problem. Our attempt to solve them is based on a different starting point, namely a suitably constructed prior distribution on the allocation variables d, say p(d). Specifically, the idea is to attach some meaning to the allocations that removes the gaps and at the same time solves these unidentifiability problems with high probability. We then connect this to the mixture model via the standard conditional distributions
p(y, θ | d) and p(w | d).
Due to the fact that we start without the weights, we need to assign a prior distribution on
the allocations. Some approaches in this direction, e.g. Casella, Robert and Wells (2004), Nobile
(1994) and Nobile and Fearnside (2007), follow this idea and construct MCMC samplers to draw
posterior inference on the clustering structure of the data using p(d | y). However, given the large support of the sequence d, not all weight distributions will lead to a “good” p(d), and thus a “good” p(d | y).

There are some proposals working directly with p(d | y); but our contention is that if the form
for p(d) has not been chosen astutely then the problems of poor allocations will persist. Attempts
to solve the problem include constraining the support of d at this stage: set d_1 = 1, then sample d_2 from p(d_2 | d_1, y) ∝ p(d_1, d_2 | y) with d_2 chosen from {1, 2}. In general, take d_i = j from the density proportional to

p(d_1, . . . , d_{i−1}, j | y)

with j restricted to {1, . . . , k + 1} and k = max{d_1, . . . , d_{i−1}}. However, this constraint on the d leads to a non-exchangeable model, and so the result depends on which observation is assigned as y_1, and so on. The benefit is that sampling is direct and avoids MCMC, since we can calculate p(d | y) when the d are given or the choice of d values is limited. However, to avoid the order problem, one would need to repeat the process for each permutation of the first n integers. See Wang and Dunson (2012).
In a similar direction, Quintana and Iglesias (2003) proposed the idea of using product partition models, where weights are not considered and only the model f(y | d) is constructed. In this case,
it is only necessary to consider f (y | k, C1 , . . . , Ck ) where (k, C1 , . . . , Ck ) define a partition of the
observations into k groups and a prior is placed on it. Lau and Green (2007) propose an optimal
clustering procedure via the application of a specific loss function that penalises pairs of samples
assigned to different clusters when they should be in the same one. In a related proposal, restricted to one dimension and without the need for MCMC methods, Dahl (2009) gives an algorithm to find the modal partition based on a reallocation of overlapping partitions. Although these approaches are very appealing when looking for the most appropriate clustering of observations, the aforementioned unidentifiability issues persist.
Our aim is to construct p(d) so that with high probability label 1 is assigned to the largest
group, label 2 to the second largest group, and so on, without allowing for gaps. More accurately, we want the labels to be associated with a non-increasing sequence of group sizes, so we do not
make an issue of equally sized groups. This will remove the label switching problem with high
probability. The parameter identifiability problem is caused by too many groups being generated

ACCEPTED MANUSCRIPT
4
ACCEPTED MANUSCRIPT

than are actually needed and so p(d) will specifically penalize large numbers of groups.
The layout of the paper is as follows: In Section 2, we discuss and motivate our particular choice
of prior for the allocation variables. In this section, we also present a toy example to illustrate the
properties of the prior and posterior distributions. In Section 3, we describe the estimation method,
via MCMC to obtain samples from p(d | y) and then the procedure for estimating the weights
associated with the allocations. Section 4 contains an illustration using MCMC methods to sample
p(d, θ | y) and then calculate the weights. In particular, for the benchmark galaxy data set, we
obtain a clear three-component normal mixture representation of the data and provide the associated weights,
means and variances of each component. In this section we also discuss what happens when we
have groups of equal size. Section 5 concludes with a summary.

2 A prior distribution on allocation variables

In this section we describe a prior distribution for allocating n individuals to k ≤ n groups. This can be presented outside the context of the mixture model, though obviously we will return to it later on. The aim, for reasons expanded on in the introduction, is to put very high probability on the largest group having label 1, the second largest group having label 2, and so on. The prior needs to be
exchangeable for all n, that is:

(i) For all permutations ρ on {1, . . . , n}, and for all n ≥ 1, it is that

p(d_1, . . . , d_n) = p(d_{ρ(1)}, . . . , d_{ρ(n)}).

(ii) It is that

\sum_{j=1}^{∞} p(d_1, . . . , d_n, j) = p(d_1, . . . , d_n)

for all n ≥ 1.

It is well known that d = (d_1, . . . , d_n) defines k, the number of distinct d_i's in d, and (C_1, . . . , C_k), the groups. That is, C_1 is a group in which the d_i's all share the same value. However, the labeling, or the value that these d_i's take, is irrelevant; so C_1 has no implication other than that it describes one of the groups. In particular, due to the arbitrary labeling and the fact that the d_i's take any value in {1, 2, . . .},
we can have, even with very high probability, that

C_j ≠ {i : d_i = j},

since we have labeled the C_j's from 1 to k and yet it is not the case, in general, that the d_i's are labeled from 1 to k.
To think about how we plan to construct p(d), we ask how we should deal with (1, 2, 2, 4) and (1, 1, 2, 3). We do not need both of them. We do not want the gap associated with leaving out the 3 in (1, 2, 2, 4), and we actually want the group of size 2 to be labeled as 1. So what is needed is a model for which

p(1, 1, 2, 3) ≫ p(1, 2, 2, 4).

That is, we need the labeling to be biased towards

(1, 1, 2, 3) rather than (1, 2, 2, 4).

There are two things going on here. The first is that we do not want arbitrary numbers representing
the groups, so we want no gaps; and the second is that we want the size of the group to determine
the label, so that the larger groups occupy the smaller labels. This is not going to be 100% precise, but the ambition in this direction should be a motivation when proposing a p(d). It is clear from looking at (1, 1, 2, 3) that it minimizes the sum of the d_i's over all allocations with the same (k, C_1, . . . , C_k), and this forms the basis of the prior.
Therefore, and the thinking is straightforward, our proposal for having no gaps and the larger
groups to be labelled with the smaller integers is to have
p(d) ∝ g(D)    where    D = \sum_{i=1}^{n} d_i ≥ n,
for some decreasing function g(∙). Clearly, this will meet the exchangeability requirement (i) pro-
vided we make sure that
\sum_{d} g(D) < +∞.

We also need to demonstrate (ii) which puts some restrictions on the choice of g(∙). The preferred
choice is
g(D) = q^D

for some 0 < q < 1 which clearly meets condition (ii). To see this, write
p(d) = c^n q^D

so that (ii) implies



\sum_{j=1}^{∞} c^{n+1} q^{D} q^{j} = c^{n+1} q^{D} \frac{q}{1 − q}

and hence we need c = 1/q − 1. In the illustration we will assign a hyper-prior to q. The construction
of a suitable prior for q can be based on the result that
P( \max_{1≤i≤n} d_i ≤ k ) = \int (1 − q^k)^n π(dq)

and hence one can find a π which matches some probability on the number of groups, since

P( \max_{1≤i≤n} d_i ≤ k ) ≈ P(≤ k groups).
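As a concrete illustration of this matching idea (our own sketch, not from the paper; the function name and the Beta family for π are assumptions), the integral can be evaluated numerically so that the hyperparameters of a Beta prior on q can be tuned until the implied probability of at most k groups matches a desired value:

```python
from scipy import integrate
from scipy.stats import beta

def prob_at_most_k_groups(k, n, a, b):
    """P(max_i d_i <= k) = E[(1 - q^k)^n] under q ~ Beta(a, b).

    Under p(d) proportional to q^D the d_i are i.i.d. with
    P(d_i = j | q) = (1 - q) q^(j - 1), so P(d_i <= k | q) = 1 - q^k.
    """
    integrand = lambda q: (1.0 - q**k) ** n * beta.pdf(q, a, b)
    value, _ = integrate.quad(integrand, 0.0, 1.0)
    return value

# Example: implied prior probability of at most 5 groups for n = 82
# observations (the size of the galaxy data set) under a Beta(1/2, 1/2) prior.
print(prob_at_most_k_groups(k=5, n=82, a=0.5, b=0.5))
```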

A more general class of g(∙) is

g(D) = \frac{Γ(D − n + β)}{Γ(D + α + β)},

for some α, β > 0, though we will not pursue this case in the present paper.
Now, returning to the mixture model: this will yield a p(y | d) which we can combine with p(d) to obtain the posterior p(d | y). The p(y | d) which arises from a mixture model has the form
(3)    p(y | d) = \prod_{j=1}^{k} \int \prod_{i∈C_j} K(y_i | θ_j) π(dθ_j).
This does not involve any weight parameters since they have been superseded by the d. Hence,
in some way, we are building up a mixture model which is fundamentally different to the usual
starting point.
For (3), if a conjugate prior has been chosen for the θ_j's, written as π(θ), we can integrate out the θ_j's to obtain

p(y | d) = p(y | C_1, . . . , C_k, k).

For example, in the particular case when


K(y | θ_j) = N(y | μ_j, 1/λ_j)

and the conjugate prior N(μ | 0, (τλ)^{−1}) × Ga(λ | a, b) is chosen independently for each (μ_j, λ_j), we obtain
p(y | k, C_1, . . . , C_k) ∝ \prod_{j=1}^{k} \frac{Γ(a + n_j/2)}{(n_j + τ)^{1/2} [b + 0.5 (s_j^2 − n_j^2 ȳ_j^2/(n_j + τ))]^{a + n_j/2}},

where ȳ_j = n_j^{−1} \sum_{i∈C_j} y_i, n_j = |C_j|, and s_j^2 = \sum_{i∈C_j} y_i^2.

This form for p(y | C_1, . . . , C_k, k) is the classic clustering probability once k, the number of groups, is known. In other words, it provides the likelihood for “who is with who”.
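As a small sketch of how this clustering likelihood can be evaluated in practice (our own code, not the authors'; the function name and default values are illustrative, and the constant of proportionality is dropped, as in the display above):

```python
import numpy as np
from scipy.special import gammaln

def log_clustering_likelihood(y, d, a=0.5, b=0.1, tau=0.001):
    """log p(y | k, C_1, ..., C_k) up to an additive constant.

    y : observations; d : integer allocation labels of the same length.
    Each group j contributes
      Gamma(a + n_j/2) / { (n_j + tau)^(1/2)
        [b + 0.5 (s_j^2 - n_j^2 ybar_j^2 / (n_j + tau))]^(a + n_j/2) }.
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d)
    total = 0.0
    for j in np.unique(d):
        yj = y[d == j]
        n_j = len(yj)
        ybar_j = yj.mean()
        s2_j = np.sum(yj ** 2)
        b_star = b + 0.5 * (s2_j - n_j ** 2 * ybar_j ** 2 / (n_j + tau))
        total += (gammaln(a + n_j / 2)
                  - 0.5 * np.log(n_j + tau)
                  - (a + n_j / 2) * np.log(b_star))
    return total
```

Combined with log g(D), for instance D log q when g(D) = q^D, this gives the unnormalized log posterior of an allocation d.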
Our model therefore will be of the type

p(d, y) ∝ g(D) p(y | C_1, . . . , C_k, k),

where g(D) is a decreasing function of D. Clearly, we have

p(y | d) = p(y | C_1, . . . , C_k, k)


while
p(d) ∝ g(D).

Hence, the model splits neatly into two components: the g(D) which determines the labeling, and
p(y | d) which decides “who is with who”.
In order to have a clearer idea of the posterior distribution on d, and to see how our idea works,
here we consider a small dataset for which we can compute p(d | y) exactly. Specifically, we
consider n = 10 data points given by
−1.521, −1.292, −0.856, −0.104, 2.388, 3.079, 3.312, 3.415, 3.922, 4.194.

See Figure 1 for a histogram of the data. This data set has been considered before by Fuentes-García, Mena and Walker (2010) in a similar context. To compute the exact posterior values at each d we use the same kernel and prior on the θ_j's specified earlier in this section, with p(d) ∝ q^D and M = 3. In this case we fix M so we can do all the calculations exactly, though in practice, as we will see, M = ∞ can be dealt with using MCMC methods. That is, we have M^n possible
values for d. Setting a = 5, b = 1, τ = 0.001 and q = 0.1 we obtain that the mode is placed at
(d_1, d_2, . . . , d_{10}) = (2, 2, 2, 2, 1, 1, 1, 1, 1, 1), which has posterior probability 0.933. Table 1 shows
the top nine posterior values for d as well as their corresponding probabilities. It is worth noting
that increasing the value of M, e.g. to 4, 5, . . ., does not lead to any significant difference in the reported posterior values. We computed the corresponding posterior values for M = 4 and M = 5, but decided not to report them here as they are essentially unchanged. From these results it is evident that the highest mass is placed on the allocation values where the largest group is identified by label 1 and the second largest by label 2.
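A brief sketch of how such an exact enumeration can be carried out (our own illustration under the settings just stated; it recomputes the conjugate group marginals inline and does not reproduce the reported numbers, which come from the paper):

```python
import itertools
import numpy as np
from scipy.special import gammaln

y = np.array([-1.521, -1.292, -0.856, -0.104, 2.388,
              3.079, 3.312, 3.415, 3.922, 4.194])
M, q = 3, 0.1
a, b, tau = 5.0, 1.0, 0.001

def log_p_y_given_d(d):
    # conjugate normal-gamma group marginals, as in Section 2 (up to a constant)
    d = np.asarray(d)
    total = 0.0
    for j in np.unique(d):
        yj = y[d == j]
        n_j, ybar, s2 = len(yj), yj.mean(), np.sum(yj ** 2)
        b_star = b + 0.5 * (s2 - n_j ** 2 * ybar ** 2 / (n_j + tau))
        total += gammaln(a + n_j / 2) - 0.5 * np.log(n_j + tau) \
                 - (a + n_j / 2) * np.log(b_star)
    return total

allocs = list(itertools.product(range(1, M + 1), repeat=len(y)))  # all M^n vectors
logs = np.array([sum(d) * np.log(q) + log_p_y_given_d(d) for d in allocs])
probs = np.exp(logs - logs.max())
probs /= probs.sum()
for r in np.argsort(-probs)[:3]:
    print(allocs[r], round(float(probs[r]), 3))   # highest-posterior allocations
```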
In summary, for any fixed configuration S = (k, C_1, . . . , C_k) there exists a d^S which allocates the largest group to label 1, the second largest to label 2, and so on. This d^S minimizes, over all other d^{S∗} yielding the same configuration, the sum D = \sum_i d_i. Hence the choice of a prior of the type p(d) ∝ g(D).


It is important to point out here what the consequences are if we have two groups of exactly the
same size. To see what happens we can re-do the toy example giving both groups 5 observations
each, e.g. by changing 2.388 to −2 in the above data set. In this case we find the posterior
probability of (1, 1, 1, 1, 1, 2, 2, 2, 2, 2) is exactly the same as that of (2, 2, 2, 2, 2, 1, 1, 1, 1, 1), i.e. equal to 0.466 each. This is expected and is what causes the label switching problem when running MCMC algorithms; namely, the Markov chain will spend 50% of the time in one mode and 50% in the other. However, using our p(d) the chain we run will effectively not switch between modes
due to the extremely low probability of doing so. This point is expanded on in Section 4.2.

3 Estimation

Given the dimension of the support for the allocation variables, we need to obtain p(d | y) using
MCMC methods. We actually do this with the locations included in the estimating model, so we
sample from p(d, θ | y). Hence, the MCMC is based on the conditional distributions

p(d_i | θ, y_i, d_{−i}) and p(θ_j | d, y),

for i = 1, . . . , n, and j = 1, 2, . . ., respectively. Indeed, both of these conditional distributions are


easy to find. A specific case will be given in Section 4. In the finite M case it is direct, since each 1 ≤ d_i ≤ M. In the infinite case we need to be careful about how to sample the (d_i), due to the infinite number of possible values, but this can be resolved using latent variables, as in Kalli, Griffin and Walker
(2011). To expand on this point here, we have

p(y, d | θ) ∝ g(D) p(y | k, C_1, . . . , C_k, θ)

and we can introduce a latent variable u so that the joint density with (d, y), given θ, is

p(y, d, u | θ) ∝ 1(u < g(D)) p(y | k, C_1, . . . , C_k, θ).


Then, when it comes to the sampling of d_i, we have

p(d_i = j | y_i, θ, u) ∝ 1(u < g(D_{−i} + j)) N(y_i | μ_j, 1/λ_j)

where

D_{−i} = \sum_{i′≠i} d_{i′}

and so d_i is constrained to lie in the finite set

A_i = { j : g(D_{−i} + j) > u } ≡ { j : j ≤ g^{−1}(u) − D_{−i} }.

At each iteration we would simulate a sufficient number of θ_j values to be able to find the A_i, and hence we would need (θ_j)_{j=1}^{J} for J = \max_i { g^{−1}(u) − D_{−i} }.
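A minimal sketch of this slice step (our own code, assuming g(D) = q^D and normal kernels; the function and variable names are illustrative, and mu and lam are taken to already hold enough components, as one would ensure in the infinite case by drawing extra θ_j from the prior):

```python
import numpy as np

rng = np.random.default_rng(1)

def update_allocations(y, d, mu, lam, q):
    """One sweep of the slice update for d under p(d) proportional to q**D.

    d : integer numpy array of current allocations, modified in place.
    """
    D = d.sum()
    u = rng.uniform(0.0, q ** D)               # slice variable: u | d ~ U(0, g(D))
    for i in range(len(y)):
        D_minus = D - d[i]
        # A_i = { j >= 1 : q**(D_minus + j) > u }, i.e. j <= log(u)/log(q) - D_minus
        J = int(np.floor(np.log(u) / np.log(q) - D_minus))
        J = min(max(J, 1), len(mu))            # truncate to the components available
        j_vals = np.arange(1, J + 1)
        logw = (0.5 * np.log(lam[:J])
                - 0.5 * lam[:J] * (y[i] - mu[:J]) ** 2)  # log N(y_i | mu_j, 1/lam_j) + const
        w = np.exp(logw - logw.max())
        w /= w.sum()
        d[i] = rng.choice(j_vals, p=w)
        D = D_minus + d[i]
    return d
```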
Once this is done, we can use the MCMC estimate of p(d | y) and draw inference about any
functional of (d, y), say β(d, y). In particular, we could compute the expectation of this functional
with respect to the posterior for d, namely
β(y) = \sum_{d} β(d, y) p(d | y).

In the basic mixture model (1), estimation problems of the type


β(y) = \int γ(θ, w) p(θ, w | y) dθ dw,

for some function γ(θ, w), make no sense due to the fact that the (θ, w) are unidentifiable. However,
if we condition on the allocation variables, which are from p(d | y), then estimation becomes
identifiable (with very high probability), and hence we can estimate functionals such as
β(d, y) = \int γ(θ, w) p(θ, w | d, y) dθ dw

via

β(y) = \sum_{d} β(d, y) p(d | y).

Here
p(θ, w | d, y) = p(w | d) p(θ | d, y)


will be the posterior for estimating (w, θ).


To elaborate: the model so far, i.e. p(y | d) and p(d), is set up without the need for weights, since the d tell us exactly who is from which component. To recover the weights for each component we use p(w | d), which is proportional to p(d | w) p(w). Note that d is all one needs to learn about w. Now p(d | w) is the standard P(d_i = j | w) = w_j, and we are free to select p(w), the prior for the weights, which we take to be a stick-breaking prior. In fact we can do this for the standard Dirichlet process mixture model. In this case, we have w_1 = v_1 and, for j > 1,
w_j = v_j \prod_{l<j} (1 − v_l),

where each v j is an independent Beta(1, c) variable, for some c > 0, and then, it is easy to show
that independently for each j,
 

p(v_j | d) = Beta( v_j | 1 + #{d_i = j}, c + #{d_i > j} ).

The form for p(θ | y, d) is also well known.
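For instance, a small sketch (ours; the function name is illustrative) of the conditional posterior mean of the weights under this stick-breaking prior, using the fact that the v_j are conditionally independent with E[v] = α/(α + β) for a Beta(α, β) variable:

```python
import numpy as np

def posterior_mean_weights(d, n_components, c=1.0):
    """E[w_j | d] for j = 1, ..., n_components under stick-breaking with v_j ~ Beta(1, c).

    Given d, v_j | d ~ Beta(1 + #{d_i = j}, c + #{d_i > j}) independently, so
    E[w_j | d] = E[v_j | d] * prod_{l < j} (1 - E[v_l | d]).
    """
    d = np.asarray(d)
    w_bar = np.empty(n_components)
    remaining = 1.0
    for j in range(1, n_components + 1):
        a_j = 1.0 + np.sum(d == j)
        b_j = c + np.sum(d > j)
        e_vj = a_j / (a_j + b_j)
        w_bar[j - 1] = e_vj * remaining
        remaining *= 1.0 - e_vj
    return w_bar

# Example with the modal allocation from the toy example of Section 2:
print(posterior_mean_weights([2, 2, 2, 2, 1, 1, 1, 1, 1, 1], n_components=3))
```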


So, for example, a particular γ(θ, w) might be given by

γ(θ, w) = \sum_{j≥1} 1(w_j > ε)

for some small ε > 0, which would tell us how many groups have weights larger than ε.
Here we can see clearly the benefit of minimizing D with high probability. If the labeling is arbitrary then we will see d_i = j for larger values of j than are strictly necessary. This suggests that the number of w_j's we should be interested in, i.e. those larger than ε, is more than is actually needed to model the data. Keeping the same structure, i.e. “who is with who”, while minimizing the labeling, means we are interested only in the weights which are actually needed.
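As an illustration of estimating such a functional from MCMC output (our own sketch; the function name, defaults and the Monte Carlo treatment of p(w | d) are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_active_groups(d_samples, eps=0.01, c=1.0, n_components=10, n_inner=200):
    """Estimate E[ sum_j 1(w_j > eps) | y ] from MCMC draws of the allocations d.

    For each sampled d we draw stick-breaking weights from p(w | d), i.e.
    v_j ~ Beta(1 + #{d_i = j}, c + #{d_i > j}), and count how many w_j exceed eps.
    """
    counts = []
    for d in d_samples:
        d = np.asarray(d)
        a = 1.0 + np.array([np.sum(d == j) for j in range(1, n_components + 1)])
        b = c + np.array([np.sum(d > j) for j in range(1, n_components + 1)])
        v = rng.beta(a, b, size=(n_inner, n_components))
        stick = np.cumprod(np.hstack([np.ones((n_inner, 1)), 1.0 - v[:, :-1]]), axis=1)
        w = v * stick
        counts.append(np.mean(np.sum(w > eps, axis=1)))
    return float(np.mean(counts))
```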


4 Illustrations

In this section we look at two illustrations: one involving the benchmark galaxy data set and the other a simulated data set in which the observations fall into one of two distinct groups of equal size. It is well known that the label switching problem arises from the Markov chain switching between modes. If these modes are asymmetric then the chain will spend nearly all, if not all, of the time in one mode of d, since one has much higher probability than the others. This mode
will be the one with non-increasing group sizes coinciding with the increasing labels. However,
it is worth looking at the case when two modes have the same probability, see the toy example
in Section 2, which occurs when two groups have exactly the same size. In this case we will
demonstrate that jumping between modes does not happen due to the extremely low probability of
doing so.

4.1 Galaxy data

Here we re-analyse the galaxy data set. This data set consists of n = 82 observations and was originally presented by Roeder (1990). To expand on the description of the MCMC given earlier, we select M = 10 as an upper bound on the number of clusters, and hence we only have (μ_j, λ_j)_{j=1}^{10} and ten weights. As we have mentioned before, handling the infinite case involves the introduction
of a latent variable.
Thus, for j = 1, . . . , 10,

p(d_i = j | θ, y, d_{−i}) ∝ g( j + \sum_{l≠i} d_l ) × N(y_i | μ_j, 1/λ_j).

Here we take g(D) = q^D and place a prior distribution on q, which we take to be the Beta(1/2, 1/2) distribution. This is Jeffreys' prior and we use it as a default choice. One can note that the choice of (1/2, 1/2) will not be influential when one considers the posterior, see below, which is Beta with updated parameters (1/2 + D − n, 1/2 + n).


Then, taking the prior N(∙ | 0, τ^{−1}) for the (μ_j), with τ = 0.001, and the prior Ga(∙ | a, b) for the (λ_j), with a = 0.5 and b = 0.1, we have

p(μ_j | λ_j, y, d) = N( μ_j | n_j λ_j ȳ_j / (n_j λ_j + τ), 1/(n_j λ_j + τ) ),

p(λ_j | θ, y, d) = Ga( λ_j | a + 0.5 n_j, b + 0.5 \sum_{d_i = j} (y_i − θ_j)^2 ).

Finally, the conditional posterior for q is given by


p(q | d) = Beta(q | 0.5 + D − n, 0.5 + n).

Note here that


E(q|d) ≈ 1 − n/D

so when q is close to 0 we must have D ≈ n, implying one group, and when q is close to 1 we must have D ≫ n and hence there are many groups.
At each iteration of the chain, we have a set of di s. We also have a set of θ j s and so we know
the locations of the clusters. To find the weights we estimate the functionals, for j = 1, . . . , 10,
w_j(d) = \int w_j p(w_j | d) dw_j

with respect to the model p(d | y). Specifically we take p(w_j | d) to be a stick-breaking model with the v_j's independent Beta(1, c) with c = 1. Thus we are estimating and computing

w̄_j = \sum_{d} w_j(d) p(d | y).

In this analysis we are emphasising the fact that the model will, with high probability, return
weights w̄_j which are decreasing with j. It will also be possible to identify the number of groups by the size of the w̄_j.
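The following is a compact sketch of the Gibbs sampler just described (our own code, not the authors' implementation; it assumes g(D) = q^D, the finite bound M = 10, the priors stated above, and illustrative names throughout):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_mixture(y, M=10, n_iter=50_000, a=0.5, b=0.1, tau=0.001):
    """Gibbs sampler for (d, mu, lambda, q) with p(d) proportional to q**D."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = np.ones(n, dtype=int)                 # start with everyone in group 1
    mu = rng.normal(0.0, 1.0, size=M)
    lam = np.ones(M)
    q = 0.5
    d_out, mu_out, lam_out = [], [], []

    for _ in range(n_iter):
        # 1. d_i | rest: proportional to q**(j + sum_{l != i} d_l) N(y_i | mu_j, 1/lam_j);
        #    the common factor q**(sum_{l != i} d_l) cancels, leaving q**j.
        j_vals = np.arange(1, M + 1)
        for i in range(n):
            logw = j_vals * np.log(q) + 0.5 * np.log(lam) - 0.5 * lam * (y[i] - mu) ** 2
            w = np.exp(logw - logw.max())
            d[i] = rng.choice(j_vals, p=w / w.sum())

        # 2. (mu_j, lam_j) | rest from the conditionals given in the text.
        for j in range(1, M + 1):
            yj = y[d == j]
            n_j = len(yj)
            prec = n_j * lam[j - 1] + tau
            mean = lam[j - 1] * yj.sum() / prec if n_j > 0 else 0.0
            mu[j - 1] = rng.normal(mean, 1.0 / np.sqrt(prec))
            lam[j - 1] = rng.gamma(a + 0.5 * n_j,
                                   1.0 / (b + 0.5 * np.sum((yj - mu[j - 1]) ** 2)))

        # 3. q | d ~ Beta(0.5 + D - n, 0.5 + n).
        D = int(d.sum())
        q = rng.beta(0.5 + D - n, 0.5 + n)

        d_out.append(d.copy()); mu_out.append(mu.copy()); lam_out.append(lam.copy())

    return d_out, mu_out, lam_out
```

The w̄_j can then be obtained from the stored allocations exactly as in the stick-breaking sketch given earlier, and the (μ̄_j, λ̄_j), for example, by averaging the stored locations.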
With the settings described in this section, we ran the Gibbs sampler for 50,000 iterations and
so have 50,000 samples of d from p(d, θ | y). Hence, we have the estimates of the locations, i.e.


the (μ̄_j, λ̄_j). From the output of the allocations we obtain the w̄_j; the results are as follows:

w̄_1 = 0.868,    w̄_2 = 0.087,    w̄_3 = 0.035

with the remaining weights decreasing and with remaining mass of 0.01. The corresponding loca-
tions are given by

(μ̄_1, λ̄_1) = (21.40, 0.21),    (μ̄_2, λ̄_2) = (9.71, 4.85),    (μ̄_3, λ̄_3) = (32.72, 1.21).

This clearly suggests 3 groups and provides the normal distributions which model the data. A
figure of this mixture of 3 normals, with total mass of 0.999, is given in Figure 2, which for
comparison includes a histogram representation of the data. We believe this is a good three-component normal mixture representation of the data. Moreover, it has been obtained via posterior parameter values
rather than the usual predictive samples.
Lau and Green (2007) also find that three components are appropriate, using a loss function approach to find a posterior estimate of the allocations, and do this by concentrating exclusively on p(d | y).

Figure 2 here

4.2 Toy example 2: Equal group sizes

Here we simulate data so that there are clearly two groups of equal size. Hence, for i = 1, . . . , 20 we generate observations from N(−20, 1) and for i = 21, . . . , 40 we have observations from N(20, 1). Given this setup we modify the prior for q, allowing us to ensure no more than 2 groups with high
probability. We aim to show the label switching problem is solved with high probability due to the
fact that the chain does not move between the two modes, after a suitable burn-in to deal with the
starting position, even though the modes have equal probability.
Now
P(max{d_1, . . . , d_n} ≤ k | q) = (1 − q^k)^n


and hence if we want no more than 2 groups with high probability we should make q about 0.01.
In fact we choose the prior for q as Beta(1, 100). We could fix M = 2 but we prefer the prior route.
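As a quick check of this choice (our own arithmetic, using the relation displayed above): with q = 0.01 and k = 2 we get

(1 − q^2)^n = (1 − 10^{−4})^{40} ≈ 0.996,

so values of q near 0.01, and hence a prior such as Beta(1, 100) with mean 1/101 ≈ 0.01, place high prior probability on there being at most two groups.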
Running the chain for 20,000 iterations, and with a burn-in of 15,000, we collected 5,000 samples
and estimated the weights accordingly:

w̃_1 = 0.51 and w̃_2 = 0.49,

with locations estimated as


μ̃_1 = 19.92 and μ̃_2 = −19.72.

In the last 5,000 iterations the chain did not switch between modes. This lack of ability to jump can be understood in the following way: at some point in the chain all the 1's (20 of them) would need to switch to 2, and this move, effectively adding 20 to D, makes q^D so small that it essentially cannot happen.
We also ran the example adding 5 further observations from N(0, 1) and putting the prior for q as uniform on (0, 1). With the same settings for the MCMC, we now obtain

w̃_1 = 0.45, w̃_2 = 0.42 and w̃_3 = 0.11,

with locations estimated as

μ̃_1 = 19.73, μ̃_2 = −20.06 and μ̃_3 = 1.04.

The last value might seem high but the mean of the samples from the N(0, 1) distribution was 1.2.

5 Summary

The idea of the paper is to build the mixture model in a different way to the usual starting point of

f(y) = \sum_{j=1}^{M} w_j N(y | μ_j, 1/λ_j).


We start by constructing a prior model p(d) such that, with high probability, the larger groups are assigned the lower allocation labels and there are no gaps. We then adopt the standard p(y | d)
from the mixture model, and hence obtain p(d | y). With d from p(d | y) we can have easy access
to p(θ, w | y, d); i.e. the weights and locations of the groups.
In order to better grasp the idea behind our proposal, let us define, for each configuration S = (k, C_1, . . . , C_k), the d^S which minimizes the sum D. Then, for an equivalent allocation vector d^{S∗}, which yields an identical (k, C_1, . . . , C_k) in terms of clustering but with different labels, we have that

p(d^S) / p(d^{S∗})

increases as \sum_i d_i^{S∗} increases. It is precisely this property which emphasises the importance of our
choice of p(d). Even more interesting is the fact that this property also carries over to the posterior,
i.e.

p(d^S | y) / p(d^{S∗} | y) = p(d^S) / p(d^{S∗}),

since p(y | d^{S∗}) = p(y | d^S), and this is the key to the idea, specifically to solving the label switching problem. So the posterior really only focuses on the d which, for any choice of (k, C_1, . . . , C_k),
minimize D. Indeed, this then yields an interpretable p(d | y) from which we can estimate the
weights via the distribution of choice p(w | d) and then estimate functionals of choice. The point
is that the d coming from p(d | y) are “good” allocation variables.

Acknowledgements

The paper was completed during a visit by the second author to UNAM, Mexico, with support
from CONACyT project 131179. The first author gratefully acknowledges the support of PAPIIT-
UNAM project IN100411. The authors would like to thank the Editor, an Associate Editor and two
referees for their comments on an earlier version of the paper.


References

Casella, G., Robert, C.P. and Wells, M. T. (2004). Mixture models, latent variables and parti-
tioned importance sampling. Statistical Methodology 1, 1–18.

Celeux, G., Hurn, M. and Robert, C. (2000). Computational and inferential difficulties with mix-
ture posterior distributions. J. Amer. Statist. Assoc. 95, 957–970.

Chung, H., Loken, E. and Schafer, J.L. (2004). Difficulties in drawing inferences with finite-mixture models. Amer. Statist. 58, 152–158.

Dahl, D. B. (2009). Modal clustering in a class of product partition models. Bayes. Anal. 4, 243–
264

Diebolt J. and Robert C. P. (1994). Estimation of finite mixture distributions through Bayesian
sampling. J. Roy. Statist. Soc. Ser. B 56, 363–375.

Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209–
230.

Ferguson, T.S. (1983). Bayesian density estimation by mixtures of normal distributions. In Recent
Advances In Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday, eds: M.
H. Rizvi, J. Rustagi and D. Siegmund, Academic Press: New York.

Fuentes-García, R., Mena, R.H. and Walker, S.G. (2009). A nonparametric dependent process for Bayesian regression. Statist. Probab. Lett. 79, 1112–1119.

Fuentes-García, R., Mena, R.H. and Walker, S.G. (2010). A new Bayesian nonparametric mixture model. Comm. Statist. Sim. Comput. 39, 669–682.

Fuentes-García, R., Mena, R.H. and Walker, S.G. (2010). A probability for classification based on the mixture of Dirichlet process model. Journal of Classification 27, 389–403.


Frühwirth-Schnatter, S. (2001). Markov Chain Monte Carlo estimation of classical and dynamic
switching and mixture models. J. Amer. Statist. Assoc. 96, 194–209.

Hjort, N.L., Holmes, C.C. Müller, P., Walker, S.G. (Eds.) (2010). Bayesian Nonparametrics.
Cambridge University Press, Cambridge.

Kalli, M., Griffin, J.E. and Walker, S.G. (2011). Slice sampling mixture models. Statistics and
Computing, 21, 93-105.
Lau, J.W. and Green, P.J. (2007). Bayesian model based clustering procedures. J. Comp. Graphical
Statist. 16, 526–558.

Lijoi, A., Prünster, I. (2010). Models beyond the Dirichlet process. In Bayesian Nonparametrics,
(Hjort, N.L., Holmes, C.C., Müller, P., Walker, S.G. Eds.), Cambridge University Press.

Lo, A.Y. (1984). On a class of Bayesian nonparametric estimates I. Density estimates. Ann. Statist.
12, 351–357.

Marin, J., Mengersen, K. and Robert, C. (2005). Bayesian modelling and inference on mixtures of distributions. In Handbook of Statistics, D. Dey and C. Rao, eds., vol. 25. Elsevier-Sciences.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley-Interscience, New York.

Nobile, A. (1994). Bayesian Analysis of Finite Mixture Distributions. Ph.D. dissertation, Department of Statistics, Carnegie Mellon University, Pittsburgh. Available at http://www.stats.gla.ac.uk/~agostino.

Nobile, A. and Fearnside, A. T. (2007). Bayesian finite mixtures with an unknown number of
components: the allocation sampler. Statistics and Computing. 17, 147–162.

Papastamoulis, P. and Iliopoulos, G. (2010). Retrospective MCMC for Dirichlet process hierarchical models. J. Comp. Graphical Statist. 19, 313–331.


Pearson, K. (1894). Contributions to the mathematical theory of evolution. Phil. Trans. Roy. Soc. A 185, 71–110.

Quintana, F. A. and Iglesias, P. L. (2003). Bayesian clustering and product partition models. J. Roy. Statist. Soc. Ser. B 65, 557–574.

Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B 59, 731–792.
Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids
in the galaxies. J. Amer. Statist. Assoc. 85, 617–624.

Rodriguez, C.E. and Walker, S.G. (2014). Label switching in Bayesian mixture models: Deter-
ministic relabeling strategies. J. Comp. Graphical Statist. 23, 25–45.

Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons Ltd.

Stephens, M. (2000). Dealing with label switching in mixture models. J. Roy. Statist. Soc. Ser. B
62, 795–809.

Wang, L. & Dunson, D.B. (2012). Fast Bayesian inference in Dirichlet process mixture models. J.
Comp. Graphical Statist. 20, 198–216.


          d(1)   d(2)   d(3)   d(4)   d(5)   d(6)   d(7)   d(8)   d(9)

d1         2      2      1      3      2      2      2      3      2
d2         2      2      1      2      3      2      2      3      2
d3         2      2      1      2      2      2      3      2      3
d4         2      3      1      2      2      2      2      2      3
d5         1      1      2      1      1      3      1      1      1
d6         1      1      2      1      1      1      1      1      1
d7         1      1      2      1      1      1      1      1      1
d8         1      1      2      1      1      1      1      1      1
d9         1      1      2      1      1      1      1      1      1
d10        1      1      2      1      1      1      1      1      1

p(d | y)  0.933  0.027  0.009  0.008  0.004  0.004  0.003  0.002  0.002

Table 1: Highest exact posterior probabilities for d for toy example dataset, where d(r) has the rth
highest posterior probability. Note that the largest group is allocated label 1 and the second largest
group is allocated label 2, for the configuration with the highest posterior probability.


Figure 1: Histogram for the small data set.

Figure 2: Galaxy data set and the estimated mixture of three normal distributions.
