
Compound Distributions and Mixture Distributions
Applied Statistics 565, Fall 2011
Parameters
To date, we have considered parameters to be fixed indices that specified a
particular member of a distribution family. We have tinkered around the edges with
this idea. Now we will change the idea entirely. We will consider parameters to be
(unobservable) random variables. Our goal will be to find the marginal distribution
of the observable random variable.
You might ask what sort of problem this could address. Suppose that we are
employed by a public health agency, and working on a regional measure. This might
be neonatal mortality rates, or post-operative infection rates for hospitals.
There is no a priori reason to believe that these rates should be homogeneous
across hospitals. In fact, there are structural reasons to believe that they should
differ. Teaching hospitals serve a different clientele than community hospitals,
which in turn serve a different clientele than rural community hospitals. For
example, the Obstetrics department at a teaching hospital will handle many more
high-risk pregnancies. Its neonatal mortality rate should be higher than a
community hospital's, because obstetricians practicing in community hospitals
generally send high-risk cases to specialists in high-risk pregnancies. These
specialists often prefer to practice in teaching hospitals, because more
state-of-the-art equipment is available. Even with the high-tech equipment, the
death rate for these newborns will be higher.
So, we need marginal models that make allowance for possibly different rates
for different hospitals. Now, for any particular hospital, the death rate should be an
unknown value, (say) P. But the value of P will vary for different hospitals.
One way to manage this problem is to take P to be a random variable, and
allow the number of neonatal deaths to be Binomial with parameters n and P.¹ We
can take P ~ Beta(α, β) and X | P ~ binomial(n, P). This method of creating new
distributions is called compounding. This particular compounding is written

\[
\text{binomial}(n, p) \,\bigwedge_{p}\, \text{Beta}(\alpha, \beta).
\]

The notation is read "Data distribution, Compounding parameter, Parameter
distribution." So, in this instance we are compounding a binomial distribution
(on p) with a beta distribution.

¹ We could consider allowing n to be a random variable as well. My personal
preference in these situations is to make inferences conditional on n, rather than on
the marginal distribution.
The Beta-binomial distribution
If we take the distributions as defined above, we find the joint distribution is

\[
f(x, p \mid n, \alpha, \beta) = \binom{n}{x} p^{x} (1-p)^{n-x} \cdot
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1} (1-p)^{\beta-1},
\qquad x = 0, 1, \ldots, n; \; 0 < p < 1.
\]

The parameter space for this joint density is n = 1, 2, …; 0 < α; 0 < β. This simplifies
by grouping the terms in p together:

\[
f(x, p \mid n, \alpha, \beta) = \binom{n}{x}
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,
p^{x+\alpha-1} (1-p)^{n-x+\beta-1}.
\]

Of course, we are not especially interested in the joint density/mass function, except
as a tool to find the marginal mass function of X. This requires us to integrate out p:

\[
\begin{aligned}
f(x \mid n, \alpha, \beta)
&= \int_{0}^{1} \binom{n}{x}
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,
p^{x+\alpha-1}(1-p)^{n-x+\beta-1}\, dp \\
&= \binom{n}{x} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot
\frac{\Gamma(x+\alpha)\,\Gamma(n-x+\beta)}{\Gamma(n+\alpha+\beta)},
\qquad x = 0, 1, \ldots, n.
\end{aligned}
\]

By using the recursion relationship for gamma functions, we can simplify this
slightly:

\[
f(x \mid n, \alpha, \beta) = \binom{n}{x}\,
\frac{\bigl[\alpha(\alpha+1)\cdots(\alpha+x-1)\bigr]
\bigl[\beta(\beta+1)\cdots(\beta+n-x-1)\bigr]}
{(\alpha+\beta)(\alpha+\beta+1)\cdots(\alpha+\beta+n-1)},
\qquad x = 0, 1, \ldots, n.
\]
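As a sanity check on this marginal mass function, the sketch below (assuming SciPy ≥ 1.4, which provides `scipy.stats.betabinom`; the values n = 20, α = 1, β = 5 are illustrative) evaluates the formula via log-gamma functions, confirms it sums to 1, and compares it against SciPy's implementation:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import betabinom   # requires scipy >= 1.4

# Beta-binomial pmf computed from the gamma-function formula derived above.
def bb_pmf(x, n, a, b):
    logpmf = (gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)   # C(n, x)
              + gammaln(a + b) - gammaln(a) - gammaln(b)             # 1 / B(a, b)
              + gammaln(x + a) + gammaln(n - x + b) - gammaln(n + a + b))
    return np.exp(logpmf)

n, a, b = 20, 1.0, 5.0          # illustrative values
x = np.arange(n + 1)
pmf = bb_pmf(x, n, a, b)

print(pmf.sum())                                          # a valid pmf sums to 1
print(np.max(np.abs(pmf - betabinom.pmf(x, n, a, b))))    # agreement with scipy
```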

This is a discrete distribution with mean

\[
E(X) = \frac{n\alpha}{\alpha+\beta}.
\]

The variance is

\[
\operatorname{Var}(X) = \frac{n\alpha\beta}{(\alpha+\beta)^{2}} \cdot
\frac{\alpha+\beta+n}{\alpha+\beta+1}.
\]

These parameters are easily interpreted if we recall that the mean of the Beta
distribution is α/(α+β) and think of this as the average value of p. Then the mean
follows the binomial form exactly and the variance is multiplied by a constant
greater than 1. That is, the marginal distribution of X is more variable than a
binomial with p = α/(α+β). A graph of a binomial and a beta-binomial with the
same mean is shown on the next page. The standard deviation of the beta-binomial
is nearly twice the binomial value.
This distribution is called the Beta-binomial distribution. The analysis is
made possible by the relationship between the beta distribution density
\[
f(p \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,
p^{\alpha-1}(1-p)^{\beta-1}, \qquad 0 < p < 1,
\]
and the binomial mass function
\[
f(x \mid n, p) = \binom{n}{x}\, p^{x}(1-p)^{n-x}, \qquad x = 0, 1, \ldots, n.
\]
Notice that the functional forms are nearly identical. If we considered the binomial
mass function as a density for p with support 0 < p < 1, it works. (Rewrite the
combinatorial symbol as the equivalent gamma functions, and a Beta(x + 1, n − x + 1)
density results.) The beta density and the binomial mass function are conjugate
densities². Finding the conjugate density for a given distribution is a matter of
looking carefully at the kernel of the data density and asking, "When I consider this
kernel as a function of the parameter, what density/mass function results?" Other
pairs of conjugate distributions are the Poisson/gamma, negative binomial/beta, and
Normal/Normal (for μ) and Normal/gamma (for the precision 1/σ²).

All of this is also related to the problem of Bayesian inference. The
compounding distribution is called the prior (distribution of the parameter).
However, instead of finding the marginal distribution of X, in a Bayesian setting we
need to find the conditional distribution of p | X = x. This may (but usually doesn't)
involve finding the marginal density of X to produce the conditional distribution.
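Conjugacy is what makes this conditional distribution tractable: with a Beta(α, β) prior and a binomial likelihood, the posterior is Beta(α + x, β + n − x) in closed form, with no need for the marginal of X. A brief numerical check (assuming SciPy is available; the values α = 1, β = 5, n = 20, and observed x = 4 are illustrative, not from the notes) compares the closed form to brute-force normalization of prior × likelihood:

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Conjugate update: Beta(a, b) prior, binomial(n, p) likelihood, observed X = x.
# The posterior is Beta(a + x, b + n - x).  Illustrative numbers below.
a, b, n, x = 1.0, 5.0, 20, 4
posterior = beta_dist(a + x, b + n - x)

# Brute-force check: normalize prior(p) * likelihood(p) on a fine grid.
p = np.linspace(1e-6, 1 - 1e-6, 20_001)
unnorm = p**(a - 1) * (1 - p)**(b - 1) * p**x * (1 - p)**(n - x)
grid_pdf = unnorm / (unnorm.sum() * (p[1] - p[0]))   # Riemann-sum normalization

print(np.max(np.abs(grid_pdf - posterior.pdf(p))))   # small grid error only
```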
The Gamma-Poisson Distribution
We have previously considered the case where X | M = m ~ Poisson(m), and
M ~ χ²(1). We will generalize this to the case where M ~ Gamma(α, p/(1−p)). The
joint density is

\[
f(x, m) = \frac{e^{-m} m^{x}}{x!} \cdot
\frac{m^{\alpha-1}\, e^{-m(1-p)/p}}{\Gamma(\alpha)\,\bigl(p/(1-p)\bigr)^{\alpha}},
\qquad x = 0, 1, 2, \ldots; \; m > 0.
\]

² This sort of thing is another reason it is convenient (in linguistic terms) to refer to
the mass function of a discrete variable as a density.
[Figure: probability mass functions of the Binomial(20, 1/6) and the
Beta-binomial(1, 5, 20) distributions, plotted for x = 0 to 20.]
We now integrate out the Poisson parameter m to obtain the marginal distribution
of X:

\[
\begin{aligned}
f(x) &= \int_{0}^{\infty} \frac{e^{-m} m^{x}}{x!} \cdot
\frac{m^{\alpha-1}\, e^{-m(1-p)/p}}{\Gamma(\alpha)\,\bigl(p/(1-p)\bigr)^{\alpha}}\, dm \\
&= \frac{1}{x!\,\Gamma(\alpha)\,\bigl(p/(1-p)\bigr)^{\alpha}}
\int_{0}^{\infty} m^{x+\alpha-1}\, e^{-m/p}\, dm \\
&= \frac{\Gamma(x+\alpha)\, p^{x+\alpha}}{x!\,\Gamma(\alpha)\,\bigl(p/(1-p)\bigr)^{\alpha}}
= \frac{\Gamma(x+\alpha)}{x!\,\Gamma(\alpha)}\, p^{x}\,(1-p)^{\alpha},
\qquad x = 0, 1, 2, \ldots
\end{aligned}
\]

Perhaps you recognize this, perhaps you don't. In case you don't, this is a negative
binomial with parameters α and (1 − p). We can interpret the negative binomial
distribution as a gamma compounding of the Poisson distribution. The mean of this
negative binomial is αp/(1−p), and the mean of the gamma is (generically) αβ, which
in this specific case becomes αp/(1−p). The mean of the compound distribution
corresponds to the mean of the distribution of M. If X were Poisson with parameter
αp/(1−p), this would also be its variance. But the variance of the negative binomial
is αp/(1−p)²: this is (obviously) larger than the Poisson variance. The negative
binomial is an over-dispersed alternative to the Poisson distribution.
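The gamma-Poisson derivation can be checked by simulation. The sketch below (assuming NumPy and SciPy; α = 3 and p = 0.4 are illustrative values) draws M from the gamma, then X given M from the Poisson, and compares the result to SciPy's negative binomial, which uses the parameterization `nbinom(n=α, p=1−p)`:

```python
import numpy as np
from scipy.stats import nbinom

# Gamma compounding of the Poisson: M ~ Gamma(alpha, scale = p/(1-p)),
# then X | M ~ Poisson(M).  The marginal of X should be negative binomial
# with parameters alpha and success probability (1 - p).
rng = np.random.default_rng(1)
alpha, p, reps = 3.0, 0.4, 400_000

m = rng.gamma(shape=alpha, scale=p / (1 - p), size=reps)
x = rng.poisson(m)

print(x.mean())   # theory: alpha * p / (1 - p)    = 2.0
print(x.var())    # theory: alpha * p / (1 - p)**2 = 3.333...

# Empirical pmf vs the negative binomial pmf at the first few support points.
k = np.arange(6)
emp = np.bincount(x, minlength=k.size)[:k.size] / reps
print(np.max(np.abs(emp - nbinom.pmf(k, alpha, 1 - p))))
```

The variance exceeding the mean in the printout is exactly the over-dispersion described above.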
Mixture distributions
We mentioned in class that every random variable could be written as a
combination of a discrete random variable, a continuous random variable, and a
singular random variable³
. Some of the exercises in Chapter I suggested that a
convex combination of densities/mass/distribution functions is also a
density/mass/distribution function. Showing this is not difficult. Suppose that F₁,
F₂, …, Fₘ are distribution functions, and α₁, α₂, …, αₘ are non-negative numbers
summing to 1. Then

\[
F(x) = \sum_{i=1}^{m} \alpha_i F_i(x)
\]

is also a distribution function. Showing this involves passing to the limits as
x → ±∞ and showing that for every x and every h > 0, F(x + h) ≥ F(x). These
properties all follow immediately from the fact that the individual terms in the sum are

³ Singular random variables have a distribution function, but the distribution
function is such a mess that it does not admit a density. Such variables are
continuous, but not absolutely continuous. They are of theoretical interest only.
themselves distribution functions. From this fact, it follows that the analogous
convex combinations of densities and mass functions are themselves densities or
mass functions, by taking the derivative with respect to the appropriate measure.
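The distribution-function properties are easy to verify numerically for a particular convex combination. The sketch below (assuming SciPy; the two components and the weights 0.3/0.7 are arbitrary illustrative choices) checks the limits and monotonicity of F on a grid:

```python
import numpy as np
from scipy.stats import norm, expon

# A convex combination of distribution functions is itself a distribution
# function.  Illustrative check: F = 0.3 * Normal(0, 1) + 0.7 * Exponential(1).
w = np.array([0.3, 0.7])
x = np.linspace(-10, 10, 2001)
F = w[0] * norm.cdf(x) + w[1] * expon.cdf(x)

print(F[0], F[-1])               # limits: essentially 0 at -10, essentially 1 at +10
print(bool(np.all(np.diff(F) >= 0)))   # nondecreasing everywhere on the grid
```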
One application of mixture distributions is in modeling data distributions
with heavier-than-Normal tails. The ε-contamination model asks us to suppose that
for some 0 < ε < 1, our data follow a contaminated Normal distribution; that is, the
distribution function of the data is

\[
F(x) = (1-\varepsilon)\,\Phi\!\left(\frac{x-\mu}{\sigma}\right)
+ \varepsilon\,\Phi\!\left(\frac{x-\mu}{k\sigma}\right), \qquad k > 1.
\]

That is, we usually observe data with standard deviation σ, but we occasionally get
an observation from a distribution with the same mean, but a much larger standard
deviation kσ. One question of interest is, "How large must ε be before the mixture
seriously damages the SD of the sampling model of the mean?" The answer
(surprisingly) is, "Not very large at all." According to John Tukey, with this sort of
contamination, ε as small as 3% is sufficient to make the median superior to the
mean.
We can also perform mixtures with respect to a continuous distribution. For
example, it can be shown that the T distribution is a gamma mixture (with respect to
the precision 1/σ²) of Normal distributions. At this point, the distinction between
mixing and compounding grows blurry indeed.