Before proceeding to discuss the computation of the likelihood estimate of $\phi$, the concept of identifiability is considered for finite mixture distributions.

1.5 IDENTIFIABILITY
In order to be able to estimate all the parameters in $\phi$ from $x_1, \ldots, x_n$, it is necessary that they should be identifiable. In general, the parametric family of p.d.f.'s $f(x; \phi)$ is said to be identifiable if distinct values of $\phi$ determine distinct members of the family. This can be interpreted as follows in the case where $f(x; \phi)$ defines a class of finite mixture densities according to (1.4.1). A class of finite mixtures is said to be identifiable for $\phi \in \Omega$ if for any two members

$$f(x; \phi) = \sum_{i=1}^{g} \pi_i f_i(x; \theta)$$

and

$$f(x; \phi^*) = \sum_{i=1}^{g^*} \pi_i^* f_i(x; \theta^*),$$

then

$$f(x; \phi) = f(x; \phi^*)$$

if and only if $g = g^*$ and we can permute the component labels so that

$$\pi_i = \pi_i^* \quad \text{and} \quad f_i(x; \theta) \equiv f_i(x; \theta^*) \qquad (i = 1, \ldots, g).$$
Here $\equiv$ implies equality of the densities for almost all $x$ relative to the underlying measure on $R^p$ appropriate for $f(x; \phi)$. Titterington, Smith
and Makov (1985, Section 3.1) have given a lucid account of the concept of
identifiability for mixtures, including several examples and a survey of the
literature on this topic.
It was tacitly assumed in the previous section that $\phi$ is identifiable. There is a difficulty with mixtures in that if $f_i(x; \theta)$ and $f_j(x; \theta)$ belong to the same parametric family, then $f(x; \phi)$ will have the same value when the component labels $i$ and $j$ are interchanged in $\phi$. That is, although this class of mixtures may be identifiable, $\phi$ is not. Indeed, if all the $f_i(x; \theta)$ belong to the same parametric family, then $f(x; \phi)$ is invariant under the
$g!$ permutations of the component labels in $\phi$. In this case, if $\hat\phi$ corresponds to a local maximum, $L(\phi)$ will have at least $g!$ local maxima of the same value. This matter is to be discussed further in considering the consistency of the likelihood estimator of $\phi$ in Section 1.8.
However, this lack of identifiability of $\phi$ due to the interchanging of component labels is of no concern in practice, as it can be easily overcome by the imposition of an appropriate constraint on $\phi$. For example, the approach of Aitkin and Rubin (1985) in the case where all the components belong to the same parametric family is to impose the restriction that

$$\pi_1 \geq \pi_2 \geq \cdots \geq \pi_g, \tag{1.5.1}$$

but to carry out likelihood estimation without this restriction. Finally, the component labels are permuted to achieve this restriction on the estimates of the mixing proportions. In the work to be presented here, we do not explicitly impose any restriction on the mixing proportions, but with any application of the mixture likelihood approach we report the result for only one of the possible arrangements of the elements of $\hat\phi$.
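As an illustration, a restriction like (1.5.1) can be enforced after fitting simply by permuting the component labels. The following sketch (in Python, with our own illustrative function and variable names) sorts a fitted solution so that the mixing proportions are nonincreasing:

```python
import numpy as np

def enforce_label_order(pi, theta):
    """Permute the component labels so that pi[0] >= pi[1] >= ... >= pi[g-1],
    the restriction (1.5.1); the rows of theta are permuted in step."""
    order = np.argsort(-pi)            # component indices by decreasing proportion
    return pi[order], theta[order]

# Example: a fitted two-component solution reported with the labels swapped.
pi_hat = np.array([0.3, 0.7])
theta_hat = np.array([[5.0, 1.2], [1.0, 0.8]])  # e.g. (mean, variance) rows
pi_hat, theta_hat = enforce_label_order(pi_hat, theta_hat)
# pi_hat is now [0.7, 0.3]; theta_hat follows the same permutation.
```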

1.6 LIKELIHOOD ESTIMATION FOR MIXTURE MODELS VIA EM ALGORITHM
The computation of solutions of the likelihood equation under the finite mixture model (1.4.2) is described now for arbitrary component densities. This work is specialized to the important and commonly used model of normal component densities in Sections 2.1 and 2.2.

The likelihood equation for $\phi$, $\partial L(\phi)/\partial\phi = 0$, can be so manipulated that the likelihood estimate of $\phi$, $\hat\phi$, satisfies

$$\hat\pi_i = \sum_{j=1}^{n} \hat\tau_{ij}/n \qquad (i = 1, \ldots, g) \tag{1.6.1}$$

and

$$\sum_{i=1}^{g} \sum_{j=1}^{n} \hat\tau_{ij}\, \partial \log f_i(x_j; \hat\theta)/\partial\theta = 0, \tag{1.6.2}$$

where $\hat\tau_{ij} = \tau_i(x_j; \hat\phi)$ denotes the estimated posterior probability that $x_j$ belongs to $G_i$.

These manipulations were carried out by various researchers in the past in their efforts to solve the likelihood equation for mixture models with specific component densities; see, for example, Hasselblad (1966, 1969), Wolfe (1967, 1970), and Day (1969). They observed in their special cases
that the equations (1.6.1) and (1.6.2) suggest an iterative computation of the solution, which can be identified now with the direct application of the EM algorithm of Dempster, Laird and Rubin (1977); see Orchard and Woodbury (1972) for an example of earlier work on a general approach to likelihood estimation from incomplete data. On the use in particular instances of iterative schemes essentially equivalent to applications of the EM algorithm, Butler (1986) has pointed out examples in a mixture framework dating back to Newcomb (1886) and Jeffreys (1932), as described earlier in Section 1.2.
As one of the many examples in their general account on the use of the EM algorithm for the computation of likelihood estimates when the data can be viewed as incomplete, Dempster, Laird and Rubin (1977) considered the mixture problem. Corresponding to the formulation of finite mixture models in Section 1.4, we let the vector of indicator variables $z_j = (z_{1j}, \ldots, z_{gj})'$ be defined by

$$z_{ij} = \begin{cases} 1, & x_j \in G_i, \\ 0, & x_j \notin G_i, \end{cases} \tag{1.6.3}$$

where $z_1, \ldots, z_n$ are independently and identically distributed according to a multinomial distribution consisting of one draw on $g$ categories with probabilities $\pi_1, \ldots, \pi_g$, respectively. We write

$$z_1, \ldots, z_n \overset{\text{iid}}{\sim} \text{Mult}_g(1, \pi). \tag{1.6.4}$$

In accordance with the mixture model (1.4.2), it is further assumed that $x_1, \ldots, x_n$ given $z_1, \ldots, z_n$, respectively, are conditionally independent, and $x_j$ given $z_j$ has log density

$$\sum_{i=1}^{g} z_{ij} \log f_i(x_j; \theta) \qquad (j = 1, \ldots, n).$$

Hence the log likelihood for the complete data,

$$X = (x_1', \ldots, x_n')' \quad \text{and} \quad Z = (z_1', \ldots, z_n')',$$

is given by

$$L_c(\phi) = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \{\log \pi_i + \log f_i(x_j; \theta)\}. \tag{1.6.5}$$
The EM algorithm is applied to the mixture model (1.4.2) by treating $z$ as missing data. It is easy to program and it proceeds iteratively in two steps, E (for expectation) and M (for maximization). Using some initial value for $\phi$, say $\phi^{(0)}$, the E step requires the calculation of

$$Q(\phi, \phi^{(0)}) = E\{L_c(\phi) \mid X; \phi^{(0)}\},$$

the expectation of the complete-data log likelihood, $L_c(\phi)$, conditional on the observed data and the initial fit $\phi^{(0)}$ for $\phi$. It can be easily seen from (1.6.5) that this step is effected here simply by replacing each indicator variable $z_{ij}$ by its expectation conditional on $x_j$, given by

$$E(z_{ij} \mid x_j; \phi^{(0)}) = \tau_i(x_j; \phi^{(0)}) \qquad (i = 1, \ldots, g). \tag{1.6.6}$$

That is, $z_{ij}$ is replaced by the initial estimate of the posterior probability that $x_j$ belongs to $G_i$ $(i = 1, \ldots, g;\ j = 1, \ldots, n)$.

On the M step the first time through, the intent is to choose the value of $\phi$, say $\phi^{(1)}$, that maximizes $Q(\phi, \phi^{(0)})$, which, from the E step, is equal here to $L_c(\phi)$ with each $z_{ij}$ replaced by $\tau_i(x_j; \phi^{(0)})$. It leads therefore to solving equations (1.6.1) and (1.6.2) with $\hat\tau_{ij}$ replaced by $\tau_i(x_j; \phi^{(0)})$. One nice feature of the EM algorithm is that the solution to the M step often exists in closed form, as is to be demonstrated for mixtures of normals in Section 2.1.
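To make the two steps concrete, here is a minimal sketch of the EM algorithm for a mixture of $g$ univariate normal densities, the model specialized in Section 2.1. The function names and the use of NumPy/SciPy are our own illustrative choices, not the book's FORTRAN implementation:

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pi, mu, sigma):
    """E step: replace each z_ij by tau_i(x_j; phi), the current posterior
    probability that x_j belongs to G_i, as in (1.6.6)."""
    dens = np.array([p * norm.pdf(x, m, s)
                     for p, m, s in zip(pi, mu, sigma)])   # shape (g, n)
    return dens / dens.sum(axis=0)

def m_step(x, tau):
    """M step: maximize Q(phi, phi_old); for normal components the
    solution exists in closed form."""
    n_i = tau.sum(axis=1)
    pi = n_i / len(x)                                      # equation (1.6.1)
    mu = tau @ x / n_i
    sigma = np.sqrt(np.array([t @ (x - m) ** 2 / ni
                              for t, m, ni in zip(tau, mu, n_i)]))
    return pi, mu, sigma

# Alternate the two steps from an initial fit phi^(0); by (1.6.7) the
# incomplete-data log likelihood can never decrease along the sequence.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(3.0, 1.0, 50)])
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.ones(2)
for k in range(200):
    pi, mu, sigma = m_step(x, e_step(x, pi, mu, sigma))
```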
The E and M steps are alternated repeatedly, where in their subsequent executions the initial fit $\phi^{(0)}$ is replaced by the current fit for $\phi$, say $\phi^{(k-1)}$ at the $k$th stage. Another nice feature of the EM algorithm is that the log likelihood for the incomplete data specification can never be decreased after an EM sequence. Hence the log likelihood under the mixture model satisfies

$$L(\phi^{(k+1)}) \geq L(\phi^{(k)}), \tag{1.6.7}$$

which implies that $L(\phi^{(k)})$ converges to some $L^*$ for a sequence bounded above. Dempster, Laird and Rubin (1977) showed that if the very weak condition that $Q(\phi, \psi)$ is continuous in both $\phi$ and $\psi$ holds, then $L^*$ will be a local maximum of $L(\phi)$, provided the sequence is not trapped at some saddle point. In that case, a small random perturbation of $\phi$ away from the saddle point will cause the EM algorithm to diverge from the saddle point. The condition that $Q(\phi, \psi)$ be continuous in both $\phi$ and $\psi$ should hold in most practical situations, and it has been shown that it is satisfied for component densities belonging to the important curved exponential family. The reader is referred to the detailed account of the convergence properties of the EM algorithm in a general setting given by Wu (1983), who addressed, in particular, the problem that the convergence of $L(\phi^{(k)})$ to $L^*$ does not automatically imply the convergence of $\phi^{(k)}$ to a point $\phi^*$. On this same matter, Boyles (1983) presented an example of a generalized EM sequence which converges to the circle of unit radius and not to a single point.
Unfortunately, convergence with the EM algorithm is generally quite slow, and it does not yield directly the observed information matrix, which may be needed for the estimation of the standard errors of the elements of $\hat\phi$. However, Louis (1982) has devised a procedure for extracting the observed information matrix when using the EM algorithm, as well as developing a method for speeding up its convergence. The extraction of the observed information matrix is to be described in Section 1.9.
Concerning other possible alternatives to the EM algorithm, the Newton-Raphson method may not increase the likelihood after each iteration, although when it does converge the convergence can be quite rapid, in particular when the log likelihood is well approximated by a quadratic function. As a Newton iterative scheme requires the inversion of the matrix of second derivatives of the log likelihood, it automatically provides an estimate of the covariance matrix of $\hat\phi$, although this inversion may not be computationally convenient at each step if the dimension of $\phi$ is high. For a mixture of two univariate normal densities, Everitt (1984a) has compared the EM algorithm with five other methods for finding the likelihood estimates. Titterington (1984a) has considered recursive estimation for a mixture of two univariate normal densities as part of a wider study of recursive estimation for incomplete data and its link with the EM algorithm. More recently, Hathaway (1986a) showed that the EM algorithm for mixture problems can be interpreted as a method of coordinate descent on a particular objective function.
The Appendix contains the FORTRAN listing of computer programs for likelihood estimation via the EM algorithm for mixtures of normal distributions. Agha and Ibrahim (1984) have provided an ALGOL 60 listing of this algorithm for likelihood estimation for univariate mixtures of binomial, Poisson, exponential and normal distributions. Also, as noted by Aitkin (1980a), the iterative computations required for the EM algorithm in the fitting of mixture models are particularly simple in GLIM.

1.7 STARTING VALUES FOR EM ALGORITHM


As explained in the previous section, the EM algorithm is started from some initial value of $\phi$, $\phi^{(0)}$. We can either give a value for $\phi^{(0)}$, or initially partition the data into the specified number of groups $g$ according, perhaps, to some ad hoc criterion, and then take $\phi^{(0)}$ to be the estimate of $\phi$ based on this partition as if it represented a true grouping of the data. That is, we execute the M step of the EM algorithm with the initial estimate of the posterior probability, $\tau_i(x_j; \phi^{(0)})$, set equal to 1 or 0 according as to whether $x_j$ lies in $G_i$ or not under this partition. For example, one ad hoc way of initially partitioning the data in the case of, say, $g = 2$ normal populations with the same covariance matrices would be to plot the data for selections of two of the $p$ variables, and then draw a line which divides the bivariate data into two groups which have a scatter that appears normal; see, for example, O'Neill (1978) and Ganesalingam and McLachlan (1979b). On the selection of a reduced set of variables from the full observation vector, note that Chang (1983) has shown that the practice of applying principal components to reduce the dimension of the data before clustering is not justified in general. By considering a mixture of two normal distributions with a common covariance matrix, it was shown that the components with the larger eigenvalues do not necessarily provide more information (distance) between the groups.
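For instance, in the univariate normal case an initial hard partition (here an arbitrary median split, standing in for whatever ad hoc criterion is used) can be converted into $\phi^{(0)}$ by executing the M step with 0/1 posterior probabilities; this is a sketch, not the book's procedure:

```python
import numpy as np

def init_from_partition(x, labels, g):
    """Form phi^(0) by executing the M step with the initial posterior
    probabilities tau_i(x_j; phi^(0)) set to 1 or 0 according to the
    hard partition of the data (univariate normal components)."""
    pi0 = np.array([np.mean(labels == i) for i in range(g)])
    mu0 = np.array([x[labels == i].mean() for i in range(g)])
    sigma0 = np.array([x[labels == i].std() for i in range(g)])
    return pi0, mu0, sigma0

rng = np.random.default_rng(0)
x = rng.normal(size=100)
labels = (x > np.median(x)).astype(int)   # an arbitrary ad hoc two-group split
pi0, mu0, sigma0 = init_from_partition(x, labels, g=2)
```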
As commented in the last section, convergence with the EM algorithm is very slow, and the situation will be exacerbated by a poor choice of $\phi^{(0)}$. Indeed, in some cases where the likelihood is unbounded on the edge of the parameter space, the sequence of estimates $\{\phi^{(k)}\}$ generated by the EM algorithm may diverge if $\phi^{(0)}$ is chosen too close to the boundary. This matter is discussed further in the sequel. Another problem with mixture models is that the likelihood equation will usually have multiple roots, and so the EM algorithm should be applied from a wide choice of starting values in any search for all local maxima.
It will be illustrated in Section 3.2 that a key factor in obtaining a satisfactory clustering of the data at hand is the accuracy of the estimate of $\pi$, the vector of mixing proportions. For univariate mixtures Fowlkes (1979) suggested the determination of the point of inflexion in quantile-quantile (Q-Q) plots to estimate first the mixing proportions of the underlying populations. The remaining parameters can then be estimated from the sample partitioned into groups in accordance with the estimates of the mixing proportions. After the completion of a search for the various local maxima, there is the problem of which root of the likelihood equation to use. If there is available some consistent estimator of $\phi$, then the estimator defined by the root of the likelihood equation closest to it is also consistent, and hence efficient (Lehmann, 1983, page 421), providing regularity conditions hold for the existence of a sequence of consistent and asymptotically efficient solutions of the likelihood equation. Where computationally feasible, the method of moments as discussed in Section 1.2 provides a way of constructing a $\sqrt{n}$-consistent estimator of $\phi$. Quandt and Ramsey (1978)
proposed a method for obtaining a $\sqrt{n}$-consistent estimator of $\phi$ in terms of minimizing a distance function based on the moment generating function, but there can be numerical complexities; see the discussion following that paper. Minimum distance estimators of this type are to be discussed further in Section 4.6. The usefulness of a $\sqrt{n}$-consistent estimator of $\phi$, say $\tilde\phi$, is that under the same regularity conditions referred to above, a consistent and asymptotically efficient estimator of $\phi$ is obtained by a one-step approximation to the solution of the likelihood equation according to a Newton iterative scheme started from $\tilde\phi$. This approach does not require the determination of the closest root of the likelihood equation to $\tilde\phi$, but its implementation does require the inversion of a symmetric matrix whose dimension is equal to the total number of unknown parameters. Hence it may not be as convenient computationally as the EM algorithm applied from $\tilde\phi$.
In the absence of the observed value of any known consistent estimator of $\phi$, an obvious choice for the root of the likelihood equation is the one corresponding to the largest of the local maxima (assuming all have been located), although it does not necessarily follow that the consequent estimator is consistent and asymptotically efficient (Lehmann, 1980, page 234). However, at least for mixtures of univariate normal distributions, the estimator of $\phi$ corresponding to the largest of the local maxima is consistent and efficient. The properties of this estimator for mixtures of multivariate normal distributions are to be examined in Section 2.1.

1.8 PROPERTIES OF LIKELIHOOD ESTIMATORS FOR MIXTURE MODELS
If the maximum likelihood estimator of $\phi$ exists for a mixture distribution with compact parameter space, then the conditions of Wald (1949) ensure that it is strongly consistent; see Redner (1981). This is assuming that $\phi$ is identifiable. It has been seen in Section 1.5 that, although a class of mixture densities may be identifiable, $\phi$ may not be. Let $\phi_0$ denote the true value of $\phi$ and let $\Omega_0$ denote the subset of the parameter space for which

$$f(x; \phi) = f(x; \phi_0)$$

for almost all $x$ in $R^p$. This set will contain more than the single point $\phi_0$ in the case of component densities belonging to the same parametric family. For then, as discussed in Section 1.5, a permutation of the component labels in $\phi$ does not alter the value of $f(x; \phi)$. This particular identifiability problem can be avoided by the imposition of a constraint on $\phi$ like (1.5.1).

Redner's (1981) work, however, was specifically aimed at families of distributions which are not identifiable. His results imply for mixture families with compact parameter space that, under the conditions of Wald (1949) except for the identifiability of $\phi$, $L(\phi)$ is almost surely maximized in a neighborhood of $\Omega_0$. More precisely, Redner (1981) referred to it as convergence of the maximum likelihood estimator in the topology of the quotient space obtained by collapsing $\Omega_0$ into a single point; see Ghosh and Sen (1985), Li and Sedransk (1986), and Redner and Walker (1984) for further discussion. In the sequel it is implicitly assumed that $\Omega_0$ is a singleton set; that is, $\phi$ is identifiable or, in the terminology of Ghosh and Sen (1985), the class of mixture densities is strongly identifiable.
As noted earlier, in some instances the log likelihood $L(\phi)$ under the mixture model (1.4.2) is unbounded, and so the maximum likelihood estimator of $\phi$ does not exist. However, as emphasized recently by Lehmann (1980, 1983), the essential aim of likelihood estimation is the determination of a sequence of roots of the likelihood equation that is consistent and asymptotically efficient. Under suitable regularity conditions, it is well known (Cramér, 1946) that there will exist such a sequence of roots. With probability tending to one, these roots correspond to local maxima in the interior of the parameter space. As stressed by Lehmann (1980, 1983), these roots do not necessarily correspond to global maxima. But usually this sequence of roots with the desirable asymptotic properties will correspond to global maxima, or if the latter do not exist, to the largest of the local maxima located in the interior of the parameter space. Whether this is the case for mixtures of normal distributions will be discussed in Section 2.1, where available results for this specific parametric family are reviewed.
Concerning likelihood estimation for mixtures in general, Redner and Walker (1984) and Peters and Walker (1978) have given, for the class of identifiable mixtures, the regularity conditions which must be satisfied in order for there to exist a sequence of roots of the likelihood equation $\partial L(\phi)/\partial\phi = 0$ with the properties of consistency, efficiency and asymptotic normality. The form of the conditions, which are essentially multivariate generalizations of Cramér's (1946) results for the corresponding properties in a general context, suggests that they should hold for many parametric families.

1.9 INFORMATION MATRIX FOR MIXTURE MODELS
We now consider the amount of information in the sample $x_1, \ldots, x_n$ under the mixture model (1.4.2). Let $I(\phi)$ be the matrix defined by

$$I(\phi) = \left(\left(-\partial^2 L(\phi)/\partial\phi_i\, \partial\phi_j\right)\right). \tag{1.9.1}$$

In a general context, Efron and Hinkley (1978) have termed $I(\phi)$ the observed Fisher information matrix, as distinguished from the expected Fisher information matrix, $\mathcal{I}(\phi)$, given by the expectation of (1.9.1). For one-parameter families they presented a frequentist justification for preferring $1/I(\hat\phi)$ to $1/\mathcal{I}(\hat\phi)$ as an estimate of the variance of the likelihood estimator $\hat\phi$.

For the mixture model the observed information matrix $I(\hat\phi)$ is easier to compute than $\mathcal{I}(\hat\phi)$, as Louis (1982) showed that $I(\hat\phi)$ can be expressed in terms of the gradient and curvature of the log likelihood function $L_c(\phi)$ for the complete data $X$ and $Z$. This applies to any use of the EM algorithm and not just to mixture models. It was shown that $I(\hat\phi)$ is equal to the value at $\phi = \hat\phi$ of

$$E\left[\{I_1(\phi) - I_2(\phi)\} \mid X; \phi\right], \tag{1.9.2}$$

where

$$I_1(\phi) = \left(\left(-\partial^2 L_c(\phi)/\partial\phi_i\, \partial\phi_j\right)\right)$$

and

$$I_2(\phi) = \left(\left(\{\partial L_c(\phi)/\partial\phi_i\}\{\partial L_c(\phi)/\partial\phi_j\}\right)\right).$$

The log likelihood for the complete data, $L_c(\phi)$, can be written as

$$L_c(\phi) = \sum_{j=1}^{n} L_c(\phi; x_j, z_j),$$

where

$$L_c(\phi; x_j, z_j) = \sum_{i=1}^{g} z_{ij} \{\log \pi_i + \log f_i(x_j; \theta)\}$$

is the log likelihood for the single complete data point $(x_j', z_j')'$. For independent data as assumed here, it follows that (1.9.2) evaluated at $\phi = \hat\phi$ can be reduced to give

$$I(\hat\phi) \approx \sum_{j=1}^{n} \hat h_j \hat h_j', \tag{1.9.3}$$
where

$$h(\phi; x_j, z_j) = \partial L_c(\phi; x_j, z_j)/\partial\phi$$

and where

$$\hat h_j = h(\hat\phi; x_j, \hat\tau_j)$$

and

$$\hat\tau_j = (\hat\tau_{1j}, \ldots, \hat\tau_{gj})'$$

for $j = 1, \ldots, n$. Hence $I(\hat\phi)$ is approximated solely in terms of the gradient vector of $L_c(\phi; x_j, z_j)$ at $\phi = \hat\phi$, where the unknown vector $z_j$ of indicator variables is replaced by the vector $\hat\tau_j$ of the estimated posterior probabilities of group membership for $x_j$ $(j = 1, \ldots, n)$. If only the mixing proportions are unknown, (1.9.3) holds exactly. The computation of $I(\hat\phi)$ from (1.9.3) for normal component distributions will be discussed in Section 2.4.
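As a small illustration of (1.9.3), consider the exact case just noted, in which only the mixing proportions are unknown. With $\pi_g = 1 - \pi_1 - \cdots - \pi_{g-1}$, the complete-data score for a single point has elements $z_{ij}/\pi_i - z_{gj}/\pi_g$, so replacing $z_{ij}$ by $\hat\tau_{ij}$ gives a sum of outer products (a sketch with our own function name):

```python
import numpy as np

def observed_info_mixing(tau, pi):
    """Observed information via (1.9.3) when only the mixing proportions
    are unknown, the case in which (1.9.3) holds exactly.
    tau has shape (g, n); the free parameters are pi_1, ..., pi_{g-1}."""
    # h_j has elements tau_ij/pi_i - tau_gj/pi_g (i = 1, ..., g-1),
    # i.e. the complete-data score with z_ij replaced by tau_ij.
    h = tau[:-1] / pi[:-1, None] - tau[-1] / pi[-1]   # shape (g-1, n)
    return h @ h.T                                    # sum over j of h_j h_j'

# Standard errors of pi_1, ..., pi_{g-1} would then follow from
# np.sqrt(np.diag(np.linalg.inv(observed_info_mixing(tau, pi)))).
```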

1.10 TESTS FOR THE NUMBER OF COMPONENTS IN A MIXTURE
Testing for the number of components $g$ in a mixture is an important but very difficult problem which has not been completely resolved. For example, with applications of mixture models in cluster or latent class analyses, the problem arises with the question of how many clusters or latent classes there are. An obvious way of approaching the problem is to use the likelihood ratio test statistic $\lambda$ to test for the smallest value of $g$ compatible with the data.

Unfortunately, with mixture models, regularity conditions do not hold for $-2\log\lambda$ to have its usual asymptotic null distribution of chi-squared with degrees of freedom equal to the difference in the number of parameters in the two hypotheses. To briefly illustrate why this is so, consider a mixture of two univariate normal densities; that is, $p = 1$ in the model (1.4.1). The null hypothesis that there is one underlying population,

$$H_0: g = 1, \tag{1.10.1}$$

can be approached by testing whether $\pi_1 = 1$, which is on the boundary of the parameter space with a consequent breakdown in the standard regularity conditions. Alternatively, we can view $H_0$ as testing for whether $\mu_1 = \mu_2$ and $\sigma_1^2 = \sigma_2^2$, where now the value of $\pi_1$ is irrelevant. If for a specified value of $\pi_1$ regularity conditions held, so that $-2\log\lambda$ under $H_0$ were distributed asymptotically as chi-squared, then the null distribution of $-2\log\lambda$, where $\pi_1$ is unspecified, would correspond to the maximum of a set of dependent chi-squared variables. However, even for specified mixing proportions the usual regularity conditions do not hold; see Quinn and McLachlan (1986).
The recent article by Ghosh and Sen (1985) provides a comprehensive account of the breakdown in regularity conditions for the classical asymptotic theory to hold for the likelihood ratio test. Other relevant references in addition to those to be discussed below include Bock (1985), Hartigan (1977), and Moran (1973).
For a mixture of two known but general univariate densities in unknown proportions, Titterington (1981) and Titterington, Smith and Makov (1985) considered the likelihood ratio test of $H_0: g = 1$ $(\pi_1 = 1)$ versus $H_1: g = 2$ $(\pi_1 < 1)$. They showed asymptotically under $H_0$ that $-2\log\lambda$ is zero with probability 0.5 and, with the same probability, is distributed as chi-squared with one degree of freedom. Another way of expressing this is that the asymptotic null distribution of $-2\log\lambda$ is the same as the distribution of

$$\{\max(0, Y)\}^2,$$

where $Y$ is a standard normal random variable.


Hartigan (1985a,b) obtained the same result for the asymptotic null distribution of $-2\log\lambda$ in the case of a mixture in unknown proportions of two univariate normal densities with known common variance and known means $\mu_1$ and $\mu_2$, where, as in the previous example, the null hypothesis $H_0: g = 1$ was specified by $\pi_1 = 1$. This example was considered also by Ghosh and Sen (1985) in the course of their development of asymptotic theory for the distribution of the likelihood ratio test statistic for mixture models. They were able to derive the limiting null distribution of $-2\log\lambda$ for unknown but identifiable $\mu_1$ and $\mu_2$, where $\mu_2$ lies in a compact set. They showed in the limit that $-2\log\lambda$ is distributed as a certain functional,

$$\left[\max\Bigl\{0,\ \sup_{\mu_2} Y(\mu_2)\Bigr\}\right]^2,$$

where $Y(\cdot)$ is a Gaussian process with zero mean and covariance kernel depending on the true value of $\mu_1$ under $H_0$, and the variance of $Y(\mu_2)$ is unity for all $\mu_2$. Ghosh and Sen (1985) established a similar result for component densities from a general parametric family under certain conditions. For the case where the vector of parameters $\phi$ was not assumed to be identifiable, they imposed a separation condition on the values of $\phi$ under $H_0$ and $H_1$.
The above result of Hartigan (1985a,b) was derived as a preliminary step in his argument that if $\mu_2$ is unknown with no restrictions on it, then $-2\log\lambda$ is asymptotically infinite. He conjectured that if $\mu_2$ were given a prior distribution, then the resulting test may have better asymptotic behavior. This approach was advocated too by Ghosh and Sen (1985), and they suggested that the prior distribution should probably be chosen so that the associated test is asymptotically locally minimax. As a first step in this direction, they noted for their normal mixture example that the likelihood ratio test is not asymptotically locally minimax, but it is if $\mu_2$ were known; that is, if a degenerate prior were placed on $\mu_2$. A similar result was proved for general exponential component densities. Li and Sedransk (1986) have recently considered a Bayesian approach to testing for the number of components in a mixture.
One of the initial attempts on this problem was by Wolfe (1971). He considered the likelihood ratio test for assessing the null hypothesis that the data arise from a mixture of $g_1$ populations versus the alternative that they arise from $g_2$ $(g_1 < g_2)$. He put forward a recommendation on the basis of a small scale simulation study performed for $g_1 = 1$ and $g_2 = 2$. It follows from his proposal that the null distribution of $-2\log\lambda$ would be approximated as

$$-2c\log\lambda \sim \chi_d^2, \tag{1.10.2}$$

where the degrees of freedom, $d$, is taken to be twice the difference in the number of parameters in the two hypotheses, not including the mixing proportions. His suggested value of $c$ is

$$(n - 1 - p - \tfrac{1}{2}g_2)/n,$$

which is similar to the correction factor

$$\{n - 1 - \tfrac{1}{2}(p + g_2)\}/n \tag{1.10.3}$$

derived by Bartlett (1938) for improving the asymptotic chi-squared distribution of $-2\log\lambda$ for the problem of testing the equality of $g_2$ means in a multivariate analysis of variance. It applies to the case of only one observation on each entity to be clustered; that is, $r = 1$. When there are $r$ replications per entity, as in the clustering of treatments in a randomized complete block design (RCBD), the corresponding value of $c$ would be

$$\{r(n - 1) - \tfrac{1}{2}(p + g_2)\}/nr.$$

It is analogous to the value of the correction factor obtained by Kendall (1965, page 134) by first eliminating the block differences from the variation before using Bartlett's approximation for testing between $g_2$ means in a multivariate analysis of variance.
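As a sketch of how (1.10.2) with Wolfe's correction might be used in practice (the per-component parameter count is left as an input, since it depends on the chosen component densities, and this is our own illustrative function, not Wolfe's):

```python
from scipy.stats import chi2

def wolfe_p_value(minus_2_log_lam, n, p, g1, g2, params_per_component):
    """Approximate P-value from Wolfe's modified likelihood ratio test
    (1.10.2); d is twice the difference in the number of parameters in
    the two hypotheses, not including the mixing proportions."""
    d = 2 * (g2 - g1) * params_per_component
    c = (n - 1 - p - 0.5 * g2) / n    # Wolfe's suggested correction factor
    return chi2.sf(c * minus_2_log_lam, df=d)

# e.g. g1 = 1 vs g2 = 2 univariate normals (mean and variance per component):
# wolfe_p_value(6.3, n=50, p=1, g1=1, g2=2, params_per_component=2)
```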
On the basis of some simulations for normal populations, Hernandez-Avila (1979) concluded it was reasonable to work with the suggestion (1.10.2) of Wolfe (1971). Also, some recent simulations performed by Everitt (1981b) for testing $g = 1$ versus $g = 2$ normal populations with equal covariance matrices suggest that $-2c\log\lambda$ with $c$ defined by (1.10.3) is very well approximated by a chi-squared distribution under (1.10.1). However, in these simulations the sample size $n$ was equal to 50 and the ratio of $n$ to the number of dimensions $p$ was not less than five. The more extensive simulations of Everitt (1981a) suggest that this ratio $n/p$ needs to be at least five if Wolfe's approximation is to be applicable for the determination of $P$-values. Even then the simulated power of the test was low in cases where the Mahalanobis distance $\Delta$ was not greater than two.
Hence it is recommended in general that the outcome of Wolfe's modified likelihood ratio test should not be rigidly interpreted, but rather used as a guide to the possible number of underlying groups. In the numerical examples to be presented later our use of (1.10.2) will be in this spirit. The constant $c$ will be taken to be unity, since we are only after a rough guide. Also, there is no theoretical justification for taking the null distribution of $-2\log\lambda$ to be chi-squared in the first place, let alone for its subsequent refinement through the use of the multiplicative constant $c$ by analogy to results in the regular case. It is suggested that use be made also of the estimates of posterior probabilities of group membership in the choice of $g$. They can be examined for values of $g$ near to the value accepted according to the likelihood ratio test, and may therefore be of assistance in leading to a final decision as to the number of underlying groups. On another cautionary note on the use of the likelihood ratio for testing for the number of groups, it is sensitive to the normality assumption concerning the group densities. As reiterated recently by Hartigan and Hartigan (1985), this test may well reject the null hypothesis of $g = 1$ with a high probability in the case of a long-tailed unimodal distribution, since the latter may resemble more closely a mixture of normal densities than a single normal. For example, the lognormal density can be well approximated by a mixture of two univariate normal densities with a common variance; see Titterington, Smith and Makov (1985, page 30).


Aitkin, Anderson and Hinde (1981) have reservations about the adequacy of chi-squared approximations like (1.10.2) for the null distribution of $-2\log\lambda$, and in the context of latent class models outlined a solution to the problem using essentially a bootstrap approach. The "bootstrap" was introduced by Efron (1979), who has investigated it further in a series of articles; see Efron (1982, 1983, 1984) and the references therein. It is a powerful technique which permits the variability in a random quantity to be assessed using just the data at hand. An estimate $\hat F$ of the underlying distribution is formed from the observed sample. Conditional on the latter, the sampling distribution of the random quantity of interest with $F$ replaced by $\hat F$ defines its so-called bootstrap distribution, which provides an approximation to its true distribution. It is assumed that $\hat F$ has been so formed that the stochastic structure of the model has been preserved. Usually, it is impossible to express the bootstrap distribution in simple form, and it must be approximated by Monte Carlo methods whereby pseudorandom samples (bootstrap samples) are drawn from $\hat F$. The bootstrap method can be implemented nonparametrically by using the empirical distribution function constructed from the original data. In the following application the bootstrap is applied in a parametric framework in which the bootstrap samples are drawn from the parametric likelihood estimate of the underlying distribution function.
The log likelihood ratio statistic for the test of the null hypothesis $H_0: g = g_1$ groups versus the alternative $H_1: g = g_2$ can be bootstrapped as follows. Proceeding under $H_0$, a bootstrap sample is generated from a mixture of $g_1$ groups where, in the specified form of their densities, unknown parameters are replaced by their likelihood estimates formed under $H_0$ from the original sample. The value of $-2\log\lambda$ is computed for the bootstrap sample after fitting mixture models for $g = g_1$ and $g_2$ in turn to it. This process is repeated independently a number of times $K$, and the replicated values of $-2\log\lambda$ formed from the successive bootstrap samples provide an assessment of the bootstrap, and hence of the true, null distribution of $-2\log\lambda$. It enables an approximation to be made to the achieved level of significance $P$ corresponding to the value of $-2\log\lambda$ evaluated from the original sample.
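In outline, the resampling loop might look as follows; `fit_mixture` and `simulate` are hypothetical stand-ins for whatever fitting and sampling routines are used, and the final line uses one common convention for a Monte Carlo significance level:

```python
import numpy as np

def bootstrap_lrt(x, g1, g2, K, fit_mixture, simulate, rng):
    """Parametric bootstrap of -2 log lambda for H0: g = g1 vs H1: g = g2.
    fit_mixture(x, g) -> (params, maximized log likelihood);
    simulate(params, n, rng) -> sample of size n from the fitted g1-mixture."""
    params0, l0 = fit_mixture(x, g1)
    _, l1 = fit_mixture(x, g2)
    observed = -2.0 * (l0 - l1)
    reps = np.empty(K)
    for k in range(K):                 # K bootstrap samples generated under H0
        xb = simulate(params0, len(x), rng)
        _, lb0 = fit_mixture(xb, g1)
        _, lb1 = fit_mixture(xb, g2)
        reps[k] = -2.0 * (lb0 - lb1)
    # the replications assess the null distribution of -2 log lambda and
    # hence the achieved significance level of the observed value
    p_value = (1 + np.sum(reps >= observed)) / (K + 1.0)
    return observed, reps, p_value
```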
If a very accurate estimate of the $P$-value were required, then $K$ may have to be very large. Indeed, for less complicated models than mixtures, Efron (1984), Efron and Tibshirani (1986), and Tibshirani (1985) have shown that whereas 50 to 100 bootstrap replications may be sufficient for standard error and bias estimation, a larger number, say 350, is needed to give a useful estimate of a percentile or $P$-value, and many more for a highly accurate assessment. Usually, however, there is no interest in estimating a $P$-value with high precision. Even with a limited replication number $K$, the amount of computation involved is still considerable, in particular for values of $g_1$ and $g_2$ not close to one.
In the narrower sense where the decision to be made concerns solely the rejection or retention of the null hypothesis at a specified significance level $\alpha$, Aitkin, Anderson and Hinde (1981) noted how, analogous to the Monte Carlo test procedure of Hope (1968), the bootstrap replications can be used to provide a test of approximate size $\alpha$. The test which rejects $H_0$ if $-2\log\lambda$ for the original data is greater than the $j$th smallest of its $K$ bootstrap replications has size

$$\alpha = 1 - j/(K + 1) \tag{1.10.4}$$

approximately. For if any difference between the bootstrap and true null distributions of $-2\log\lambda$ is ignored, then the original and subsequent bootstrap values of $-2\log\lambda$ can be treated as the realizations of a random sample of size $K + 1$, and the probability that a specified member is greater than $j$ of the others is $1 - j/(K + 1)$. For some hypotheses the null distribution of $\lambda$ will not depend on any unknown parameters, and so then there will be no difference between the bootstrap and true null distributions of $-2\log\lambda$. An example is the case of normal populations with all parameters unknown where $g_1 = 1$ under $H_0$. The normality assumption is not crucial in this example.
Note that the result (1.10.4) applies to the unconditional size of the test and not to its size conditional on the $K$ bootstrap values of $-2\log\lambda$. For a specified significance level $\alpha$, the values of $j$ and $K$ can be appropriately chosen according to (1.10.4). For example, for $\alpha = 0.05$, the smallest value of $K$ needed is 19, with $j = 19$. As cautioned above on the estimation of the $P$-value for the likelihood ratio test, $K$ needs to be very large to ensure an accurate assessment. In the present context the size of $K$ manifests itself in the power of the test; see Hope (1968). Although the test may have essentially the prescribed size for small $K$, its power may be well below its limiting value as $K \to \infty$. For the 0.05 level test of a single normal population versus a mixture of two normal homoscedastic populations, McLachlan (1986b) performed some simulations to demonstrate the improvement in the power as $K$ is increased from 19 through 39 to 99.
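The rejection rule itself is a one-liner; under (1.10.4), $\alpha = 0.05$ with $K = 19$ corresponds to $j = 19$, that is, rejecting only when the original statistic exceeds all 19 replications (a sketch with our own function name):

```python
import numpy as np

def monte_carlo_reject(observed, reps, alpha=0.05):
    """Hope (1968)-style test of approximate size alpha = 1 - j/(K+1):
    reject H0 if the original -2 log lambda exceeds the j-th smallest
    of its K bootstrap replications."""
    K = len(reps)
    j = int(round((1.0 - alpha) * (K + 1)))   # alpha = 0.05, K = 19 gives j = 19
    return observed > np.sort(reps)[j - 1]
```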
In an attempt to overcome some of the problems with the use of the likelihood ratio statistic in testing for the number of groups completely within a frequentist framework, Aitkin and Rubin (1985) adopted an approach which places a prior distribution on the vector of mixing proportions $\pi$. Likelihood estimation is undertaken for the remaining parameters in $\phi$, $\theta$, after first integrating the likelihood of $\phi$ over the prior distribution of $\pi$, $f_\pi(\pi)$. In this account of their proposal we shall write $L(\phi)$ as $L(\pi, \theta)$, as we wish to let $L(\theta)$ denote the log likelihood given by

$$L(\theta) = \log\left[\int \exp\{L(\pi, \theta)\}\, f_\pi(\pi)\, d\pi\right].$$

The likelihood estimate of $\theta$ using $L(\theta)$ is denoted here by $\tilde\theta$ and, as before, $\hat\phi = (\hat\pi', \hat\theta')'$ denotes the likelihood estimate of $\phi$ based on $L(\pi, \theta)$.
Aitkin and Rubin (1985) noted that $\tilde\theta$ can be obtained using the EM algorithm with the vector of indicator variables, $z = (z_1', \ldots, z_n')'$, and $\pi$ treated as unknown. It follows that $\tilde\theta$ can be computed iteratively in the same manner as $\hat\theta$ from (1.6.2). Whereas at the $(k+1)$th stage of the E step in the computation of $\hat\theta$, $z_{ij}$ is replaced by the posterior probability of $x_j$ using the current estimate $\phi^{(k)}$ for $\phi$, $\tau_i(x_j; \pi^{(k)}, \theta^{(k)})$, in the computation of $\tilde\theta$, $z_{ij}$ is replaced by

$$\int \tau_i(x_j; \pi, \theta^{(k)})\, f_\pi(\pi \mid X, \theta^{(k)})\, d\pi, \tag{1.10.5}$$

where the posterior density of $\pi$ given $X$ and $\theta = \theta^{(k)}$ satisfies

$$f_\pi(\pi \mid X, \theta^{(k)}) \propto f_\pi(\pi) \exp\{L(\pi, \theta^{(k)})\}.$$
It can be seen with this proposal that there is the need to perform a numerical integration in (1.10.5). However, according to Aitkin and Rubin (1985), $\tilde\theta$ is often quite close to $\hat\theta$, which can be used for $\theta$ in initially forming the posterior probabilities from (1.10.5). Likelihood ratio tests for intermediate numbers of groups, for example, $H_0: g = 2$ versus $H_1: g = 3$, are not recommended, since in general the $g = 2$ model is not nested in the $g = 3$ model.
An advantage of this proposal is that any null hypothesis about the number of groups is specified in the interior of the parameter space of $\theta$, and so some of the problems arising in the formulation using $\phi$ are avoided. However, it follows from the recent results of Quinn, McLachlan, and Hjort (1987) that the asymptotic null distribution of $-2\log\lambda$ will not necessarily be chi-squared, as regularity conditions still do not hold. In particular, for the test of $H_0: g = 1$ versus $H_1: g = g_2\ (> 1)$ for component densities belonging to the same parametric family, they showed that $1/n$ times the observed information matrix has negative eigenvalues with nonzero probability under $H_0$, as $n \to \infty$.
Sclove (1983) and Bozdogan and Sclove (1984) have proposed using Akaike's information criterion (AIC) for the choice of the number of groups $g$ in clustering. Their proposal in the context of the mixture model (1.4.2) is to choose the $g$ which minimizes

$$\text{AIC}(g) = -2L(\hat\phi) + 2N(g),$$

where $N(g)$ is the number of free parameters within the model. However, as explained by Titterington, Smith and Makov (1985), this criterion relies essentially on the same regularity conditions needed for $-2\log\lambda$ to have its usual asymptotic distribution under the null hypothesis.
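In sketch form, with `fit_mixture` and `n_params` as hypothetical helpers supplying the maximized mixture log likelihood and the free-parameter count $N(g)$, their proposal amounts to:

```python
def choose_g_by_aic(x, g_max, fit_mixture, n_params):
    """Choose g to minimize AIC(g) = -2 L(phi_hat) + 2 N(g)."""
    aic = {}
    for g in range(1, g_max + 1):
        _, loglik = fit_mixture(x, g)   # maximized mixture log likelihood
        aic[g] = -2.0 * loglik + 2.0 * n_params(g)
    return min(aic, key=aic.get), aic
```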
Another way of approaching the problem of the number of components in a mixture is through an assessment of the number of modes. Care, however, must be exercised, as bimodality can occur in a sample from a single normal population. On the other hand, a mixture distribution can be unimodal; for example, a mixture in equal proportions of two univariate normal densities with means $\mu_1$ and $\mu_2$ and common variance $\sigma^2$ is bimodal if and only if $|\mu_1 - \mu_2|/\sigma > 2$. Conditions for a mixture of two univariate normal densities to be bimodal are discussed by Everitt and Hand (1981, Section 2.2) and also by Titterington, Smith and Makov (1985, Section 5.5), who included the multivariate case. Inferential procedures for assessing the number of modes of a distribution are described by Titterington, Smith and Makov (1985, Section 5.6), including the univariate technique of Silverman (1981, 1983), which uses the kernel method to estimate the density function nonparametrically and which permits the assessment of modality under certain circumstances. More recently, Wong (1985) showed in the univariate case that a bootstrap procedure based on the $k$th nearest neighbor clustering method of Wong and Lane (1983) is an effective tool for determining the number of populations under the assumption that a population corresponds to a mode in the mixture density. His statistic is similar to that of Silverman (1981) for kernel density estimates.
On this problem of assessing the number of components in a mixture, there have also been various attempts made in special cases. For instance, Anderson (1985) considered this problem in association with the mixture likelihood approach for component densities which were univariate normal with a known common variance. Using the score statistic $\partial L(\pi)/\partial\pi_1$ for a mixture of two known univariate distributions, Durairajan and Kale (1979) developed the locally most powerful test of the null hypothesis $\pi_1 = 1$ against the alternative $\pi_1 < 1$. Durairajan and Kale (1982) subsequently gave the locally most powerful similar test in the case where the two component densities were known up to a single nuisance parameter common to both. For a mixture of known univariate distributions, Henna (1985) proposed that the number of components be estimated by first choosing $\hat\pi$ to minimize the average squared distance between the mixture distribution function $F(x; \pi)$ and its empirical version,

$$D(\pi) = \sum_{j=1}^{n} \{F(x_{(j)}; \pi) - j/n\}^2 / n, \tag{1.10.6}$$

where $x_{(j)}$ denotes the $j$th order statistic $(j = 1, \ldots, n)$. If $\hat\pi_{MD}$ denotes the estimate of $\pi$ so obtained, then the number of components is estimated by $\hat g$, the minimal integer $g$ such that

$$D(\hat\pi_{MD}) < \Psi^2(n)/n,$$

where $\Psi(n) \uparrow \infty$ and $\Psi^2(n)/n \to 0$ as $n \to \infty$, and

$$\sum_{n=1}^{\infty} \{\Psi(n)/n\} \exp\{-2\Psi^2(n)\} < \infty.$$

For an identifiable class of finite mixtures, Henna (1985) established that $\hat g$ converges to the true value of $g$, $g_0$, with probability one as $n \to \infty$, providing $0 < \pi_i < 1$ $(i = 1, \ldots, g_0)$. An account of the estimation of mixing proportions through the use of (1.10.6) and other distance measures is to be given in Section 4.6.
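The distance (1.10.6) itself is straightforward to evaluate at the order statistics; in the sketch below, `mix_cdf` stands for the mixture distribution function $F(\cdot; \pi)$ built from the known component distributions at a candidate $\pi$:

```python
import numpy as np

def henna_distance(x, mix_cdf):
    """Average squared distance (1.10.6) between the mixture distribution
    function F(x; pi) and the empirical distribution function, evaluated
    at the order statistics x_(1) <= ... <= x_(n)."""
    xs = np.sort(x)
    n = len(xs)
    j = np.arange(1, n + 1)
    return np.sum((mix_cdf(xs) - j / n) ** 2) / n
```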

1.11 PARTIAL CLASSIFICATION OF THE DATA


We now consider the situation where the mixture model is not just an artificial device by which to effect a clustering of the unclassified data $x_1, \ldots, x_n$, but where the superpopulation $G$ is a genuine mixture of $g$ populations $G_1, \ldots, G_g$ in proportions $\pi_1, \ldots, \pi_g$, respectively. It is assumed that in addition to the unclassified data $x_1, \ldots, x_n$ taken from $G$, there are also available observations $y_{ij}$ $(j = 1, \ldots, m_i)$ known to come from $G_i$ $(i = 1, \ldots, g)$, and $m = \sum_{i=1}^{g} m_i$. We shall refer to each $y_{ij}$ as a classified observation in the sense that its population of origin is known. An example of this situation is the case study to be reported in Section 4.7 on the estimation of mixing proportions. There the classified data $y_{ij}$ $(j = 1, \ldots, m_i)$ have been obtained by sampling separately from $G_i$ $(i = 1, \ldots, g)$, and so provide no information about the unknown mixing proportions $\pi_1, \ldots, \pi_g$, which have to be estimated solely on the basis of the unclassified data. Another example might occur in a discriminant analysis context, where the existence of the populations $G_1, \ldots, G_g$ is thus known, but where it is very difficult to obtain data of known origin. The primary aim then is not to estimate the mixing proportions, although they will have to be estimated along the way, but rather to use the unclassified data to improve the initial estimate of the component densities $f_i(x; \theta)$ and hence the performance of the discriminant rule, as assessed by its overall error rate in allocating a subsequent unclassified observation.
The question of whether it is a worthwhile exercise to update a discriminant rule on the basis of a limited number of unclassified observations has been considered by McLachlan and Ganesalingam (1982) for a mixture of two normal distributions. Updating procedures appropriate for nonnormal situations have been suggested by Murray and Titterington (1978), who expounded various approaches using distribution-free kernel methods, and Anderson (1979), who gave a method for the logistic discriminant function. A Bayesian approach to the problem was considered by Titterington (1976), who also considered sequential updating. The more recent work of Smith and Makov (1978) and their other papers on the Bayesian approach to the finite mixture problem, where the observations are obtained sequentially, are covered in Titterington, Smith and Makov (1985, Chapter 6).
Concerning the computation of the likelihood estimate of $\phi$ on the basis of the classified and unclassified data combined, we can easily incorporate the additional information of the known origin of the $y_{ij}$ into the likelihood equation. In the case where the $y_{ij}$ have been sampled from the mixture $G$, the modified versions of (1.6.1) and (1.6.2) are

$$\hat\pi_i = \Bigl(m_i + \sum_{j=1}^{n} \hat\tau_{ij}\Bigr)/(m + n) \qquad (i = 1, \ldots, g) \tag{1.11.1}$$

and

$$\sum_{i=1}^{g} \Bigl\{\sum_{j=1}^{m_i} \partial \log f_i(y_{ij}; \hat\theta)/\partial\theta + \sum_{j=1}^{n} \hat\tau_{ij}\, \partial \log f_i(x_j; \hat\theta)/\partial\theta\Bigr\} = 0. \tag{1.11.2}$$

However, (1.6.1) and not (1.11.1) is appropriate for the estimate of $\pi_i$ where the $y_{ij}$ provide no information on the mixing proportions, for example, where they have been sampled separately from each population $G_i$ $(i = 1, \ldots, g)$, or where they have been sampled from a mixture of $G_1, \ldots, G_g$ but not in the proportions $\pi_1, \ldots, \pi_g$. The actual form of (1.11.2) for a mixture of normal distributions is displayed in Section 2.1.
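For the mixing proportions, (1.11.1) amounts to pooling the known group counts $m_i$ with the expected counts from the unclassified data; a minimal sketch, with our own function name:

```python
import numpy as np

def update_mixing_proportions(tau, m_i):
    """Modified estimate (1.11.1) of the mixing proportions when the
    classified observations y_ij were also sampled from the mixture G:
    pi_i = (m_i + sum_j tau_ij) / (m + n)."""
    m_i = np.asarray(m_i, dtype=float)   # classified counts per group, length g
    n = tau.shape[1]                     # number of unclassified observations
    return (m_i + tau.sum(axis=1)) / (m_i.sum() + n)
```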
Likelihood estimation is facilitated by the presence of some data of known origin with respect to each of the populations. For example, it has been pointed out for the case of normal populations with unequal covariance matrices that there are singularities in the likelihood on the edge of the