1.5 IDENTIFIABILITY
In order to be able to estimate all the parameters in φ from x₁, ..., xₙ, it is
necessary that they be identifiable. In general, the parametric family
of p.d.f.'s f(x; φ) is said to be identifiable if distinct values of φ determine
distinct members of the family. This can be interpreted as follows in the
case where f(x; φ) defines a class of finite mixture densities according to
(1.4.1). A class of finite mixtures is said to be identifiable for φ ∈ Ω if for
any two members

$$f(x;\phi) = \sum_{i=1}^{g} \pi_i f_i(x;\theta)$$

and

$$f(x;\phi^{*}) = \sum_{i=1}^{g^{*}} \pi_i^{*} f_i(x;\theta^{*}),$$

then
but to carry out likelihood estimation without this restriction. Finally, the
component labels are permuted to achieve this restriction on the estimates
of the mixing proportions. In the work to be presented here, we do not
explicitly impose any restriction on the mixing proportions, but with any
application of the mixture likelihood approach we report the result for only
one of the possible arrangements of the elements of φ̂.
$$\hat{\pi}_i = \sum_{j=1}^{n} \hat{\tau}_{ij}\big/ n \qquad (i = 1, \ldots, g) \tag{1.6.1}$$

and

$$\sum_{i=1}^{g} \sum_{j=1}^{n} \hat{\tau}_{ij}\, \partial \log f_i(x_j;\theta)/\partial\theta = 0. \tag{1.6.2}$$
14 Chapter 1
$$z_{ij} = \begin{cases} 1, & x_j \in G_i, \\ 0, & x_j \notin G_i, \end{cases} \tag{1.6.3}$$

where z₁, ..., zₙ are independently and identically distributed according
to a multinomial distribution consisting of one draw on g categories with
probabilities π₁, ..., π_g, respectively. The log likelihood for the complete
data is then given by
$$L_c(\phi) = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij}\,\{\log \pi_i + \log f_i(x_j;\theta)\} \tag{1.6.5}$$
That is, z_{ij} is replaced by the initial estimate of the posterior probability
that x_j belongs to G_i (i = 1, ..., g; j = 1, ..., n).
On the M step the first time through, the intent is to choose the value of
φ, say φ^(1), that maximizes Q(φ, φ^(0)) which, from the E step, is equal
here to L_c(φ) with each z_{ij} replaced by τ_i(x_j; φ^(0)). It leads therefore to
solving equations (1.6.1) and (1.6.2) with τ̂_{ij} replaced by τ_i(x_j; φ^(0)). One
nice feature of the EM algorithm is that the solution to the M step often
exists in closed form, as is to be demonstrated for mixtures of normals in
Section 2.1.
The E and M steps are alternated repeatedly, where in their subsequent
executions the initial fit φ^(0) is replaced by the current fit for φ, say φ^(k−1)
at the kth stage. Another nice feature of the EM algorithm is that the log
likelihood for the incomplete data specification can never be decreased after
an EM sequence. Hence the log likelihood under the mixture model satisfies

$$L(\phi^{(k+1)}) \geq L(\phi^{(k)}), \tag{1.6.7}$$
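As a concrete illustration of the scheme just described, the sketch below implements the E and M steps for a univariate two-component normal mixture with common variance, and records the incomplete-data log likelihood to check the monotonicity property (1.6.7). The data, starting values, and function names are illustrative choices, not taken from the text.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    # Univariate normal density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_normals(data, pi1, mu1, mu2, sigma, n_iter=50):
    """EM for a two-component normal mixture with common variance.

    E step: posterior probabilities tau_ij that x_j belongs to G_i.
    M step: closed-form updates corresponding to (1.6.1) and (1.6.2).
    """
    loglik_trace = []
    n = len(data)
    for _ in range(n_iter):
        # E step: tau1[j] = posterior probability that x_j belongs to G_1
        tau1 = []
        for x in data:
            p1 = pi1 * normal_pdf(x, mu1, sigma)
            p2 = (1 - pi1) * normal_pdf(x, mu2, sigma)
            tau1.append(p1 / (p1 + p2))
        # M step: (1.6.1) for the mixing proportion; closed form for mu, sigma
        s1 = sum(tau1)
        pi1 = s1 / n
        mu1 = sum(t * x for t, x in zip(tau1, data)) / s1
        mu2 = sum((1 - t) * x for t, x in zip(tau1, data)) / (n - s1)
        sigma = math.sqrt(sum(t * (x - mu1) ** 2 + (1 - t) * (x - mu2) ** 2
                              for t, x in zip(tau1, data)) / n)
        # Incomplete-data log likelihood; never decreases, as in (1.6.7)
        loglik_trace.append(sum(
            math.log(pi1 * normal_pdf(x, mu1, sigma)
                     + (1 - pi1) * normal_pdf(x, mu2, sigma)) for x in data))
    return pi1, mu1, mu2, sigma, loglik_trace

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(150)] + \
       [random.gauss(4.0, 1.0) for _ in range(50)]
pi1, mu1, mu2, sigma, trace = em_two_normals(data, 0.5, -1.0, 5.0, 2.0)
```

Each pass through the loop leaves the recorded log likelihood no smaller than the previous value, which is a useful runtime check that the E and M steps have been coded correctly.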
to some ad hoc criterion, and then take φ^(0) to be the estimate of φ based
on this partition, as if it represented a true grouping of the data. That is,
we execute the M step of the EM algorithm with the initial estimate of the
posterior probability, τ_i(x_j; φ^(0)), set equal to 1 or 0 according as
x_j lies in G_i or not under this partition. For example, one ad hoc way of
initially partitioning the data in the case of, say, g = 2 normal populations
with the same covariance matrices would be to plot the data for selections
of two of the p variables, and then draw a line which divides the bivariate
data into two groups whose scatter appears normal; see, for
example, O'Neill (1978) and Ganesalingam and McLachlan (1979b). On
the selection of a reduced set of variables from the full observation vector,
note that Chang (1983) has shown that the practice of applying principal
components to reduce the dimension of the data before clustering is not
justified in general. By considering a mixture of two normal distributions
with a common covariance matrix, it was shown that the components with
the largest eigenvalues do not necessarily provide more information (distance)
between the groups.
As commented in the last section, convergence with the EM algorithm
is very slow, and the situation will be exacerbated by a poor choice of φ^(0).
Indeed, in some cases where the likelihood is unbounded on the edge of the
parameter space, the sequence of estimates {φ^(k)} generated by the EM
algorithm may diverge if φ^(0) is chosen too close to the boundary. This
matter is discussed further below. Another problem with mixture
models is that the likelihood equation will usually have multiple roots, and
so the EM algorithm should be applied from a wide choice of starting values
in any search for all local maxima.
It will be illustrated in Section 3.2 that a key factor in obtaining a
satisfactory clustering of the data at hand is the accuracy of the estimate of π,
the vector of mixing proportions. For univariate mixtures Fowlkes (1979)
suggested the determination of the point of inflexion in quantile-quantile
(Q-Q) plots to estimate first the mixing proportions of the underlying
populations. The remaining parameters can then be estimated from the sample
partitioned into groups in accordance with the estimates of the mixing
proportions. After the completion of a search for the various local maxima,
there is the problem of which root of the likelihood equation to use. If
there is available some consistent estimator of φ, then the estimator
defined by the root of the likelihood equation closest to it is also consistent,
and hence efficient (Lehmann, 1983, page 421), provided regularity
conditions hold for the existence of a sequence of consistent and asymptotically
efficient solutions of the likelihood equation. Where computationally
feasible, the method of moments as discussed in Section 1.2 provides a way of
constructing a √n-consistent estimator of φ. Quandt and Ramsey (1978)
families with compact parameter space that, under the conditions of Wald
(1949) except for the identifiability of φ, L(φ) is almost surely maximized
in a neighborhood of Ω₀. More precisely, Redner (1981) referred to it as
convergence of the maximum likelihood estimator in the topology of the
quotient space obtained by collapsing Ω₀ into a single point; see Ghosh
and Sen (1985), Li and Sedransk (1986), and Redner and Walker (1984)
for further discussion. In the sequel it is implicitly assumed that Ω₀ is a
singleton set; that is, φ is identifiable or, in the terminology of Ghosh and
Sen (1985), the class of mixture densities is strongly identifiable.
As noted earlier, in some instances the log likelihood L(φ) under the
mixture model (1.4.2) is unbounded, and so the maximum likelihood
estimator of φ does not exist. However, as emphasized recently by Lehmann
(1980, 1983), the essential aim of likelihood estimation is the determination
of a sequence of roots of the likelihood equation that is consistent
and asymptotically efficient. Under suitable regularity conditions, it is well
known (Cramér, 1946) that there will exist such a sequence of roots. With
probability tending to one, these roots correspond to local maxima in the
interior of the parameter space. As stressed by Lehmann (1980, 1983),
these roots do not necessarily correspond to global maxima. But usually
this sequence of roots with the desirable asymptotic properties will
correspond to global maxima or, if the latter do not exist, to the largest of the
local maxima located in the interior of the parameter space. Whether this
is the case for mixtures of normal distributions will be discussed in Section
2.1, where available results for this specific parametric family are reviewed.
Concerning likelihood estimation for mixtures in general, Redner and
Walker (1984) and Peters and Walker (1978) have given, for the class of
identifiable mixtures, the regularity conditions which must be satisfied in
order for there to exist a sequence of roots of the likelihood equation
∂L(φ)/∂φ = 0 with the properties of consistency, efficiency and asymptotic
normality. The form of the conditions, which are essentially multivariate
generalizations of Cramér's (1946) results for the corresponding properties
in a general context, suggests that they should hold for many parametric
families.
In a general context, Efron and Hinkley (1978) have termed I(φ̂) the
observed Fisher information matrix, as distinguished from the expected Fisher
information matrix, 𝓘(φ), given by the expectation of (1.9.1). For
one-parameter families they presented a frequentist justification for preferring
1/I(φ̂) to 1/𝓘(φ̂) as an estimate of the variance of the likelihood estimator
φ̂.
For the mixture model the observed information matrix I(φ̂) is easier to
compute than 𝓘(φ̂), as Louis (1982) showed that I(φ̂) can be expressed in
terms of the gradient and curvature of the log likelihood function L_c(φ) for
the complete data x and z. This applies to any use of the EM algorithm
and not just to mixture models. It was shown that I(φ̂) is equal to the
value at φ = φ̂ of expression (1.9.2).
The log likelihood for the complete data, L_c(φ), can be written as

$$L_c(\phi) = \sum_{j=1}^{n} L_c(\phi;\, x_j, z_j),$$

where L_c(φ; x_j, z_j) is the log likelihood for the single complete data point
(x_j', z_j')'. For independent data as assumed here, it follows that (1.9.2)
evaluated at φ = φ̂ can be reduced to give
$$I(\hat{\phi}) \approx \sum_{j=1}^{n} \hat{h}_j \hat{h}_j', \tag{1.9.3}$$
where

$$h(\phi;\, x_j, z_j) = \partial L_c(\phi;\, x_j, z_j)/\partial\phi$$

and where

$$\hat{h}_j = h(\hat{\phi};\, x_j, \hat{z}_j).$$
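The outer-product form (1.9.3) can be checked numerically in a simple special case. In the sketch below only the mixing proportion π is unknown (the two normal components are taken as known), so φ is scalar and ĥ_j reduces to the ordinary score (f₁(x_j) − f₂(x_j))/{πf₁(x_j) + (1 − π)f₂(x_j)}; for this one-parameter family the sum of squared scores at π̂ coincides with the negative second derivative of the log likelihood. The model, data, and names are hypothetical illustrations, not from the text.

```python
import math
import random

def npdf(x, mu):
    # Unit-variance normal density with mean mu
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def score(pi, x):
    # d/dpi of log{pi f1(x) + (1 - pi) f2(x)} for known components
    f1, f2 = npdf(x, 0.0), npdf(x, 4.0)
    return (f1 - f2) / (pi * f1 + (1 - pi) * f2)

random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(120)] + \
       [random.gauss(4.0, 1.0) for _ in range(80)]

# Solve the likelihood equation sum_j s_j(pi) = 0 by bisection; the score
# sum is monotone decreasing in pi, so bisection is safe here
lo, hi = 1e-6, 1 - 1e-6
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if sum(score(mid, x) for x in data) > 0:
        lo = mid
    else:
        hi = mid
pi_hat = 0.5 * (lo + hi)

# Outer-product form (1.9.3): in one dimension, the sum of squared scores
info_outer = sum(score(pi_hat, x) ** 2 for x in data)

# Numerical negative second derivative of the log likelihood for comparison
def loglik(pi):
    return sum(math.log(pi * npdf(x, 0.0) + (1 - pi) * npdf(x, 4.0))
               for x in data)

h = 1e-4
info_numeric = -(loglik(pi_hat + h) - 2 * loglik(pi_hat)
                 + loglik(pi_hat - h)) / h ** 2
```

The two quantities agree closely, which is the content of (1.9.3) in this scalar setting.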
$$H_0:\ g = 1, \tag{1.10.1}$$

If for each specified value of π₁ regularity conditions held, so that −2 log λ under H₀
were distributed asymptotically as chi-squared, then the null distribution
of −2 log λ, where π₁ is unspecified, would correspond to the maximum of
a set of dependent chi-squared variables. However, even for specified
mixing proportions the usual regularity conditions do not hold; see Quinn and
McLachlan (1986).
The recent article by Ghosh and Sen (1985) provides a comprehensive
account of the breakdown in regularity conditions for the classical asymp-
totic theory to hold for the likelihood ratio test. Other relevant references
in addition to those to be discussed below include Bock (1985), Hartigan
(1977), and Moran (1973) .
For a mixture of two known but general univariate densities in unknown
proportions, Titterington (1981) and Titterington, Smith and Makov (1985)
considered the likelihood ratio test of H₀: g = 1 (π₁ = 1) versus H₁: g = 2
(π₁ < 1). They showed that asymptotically under H₀, −2 log λ is zero
with probability 0.5 and, with the same probability, is distributed as
chi-squared with one degree of freedom. Another way of expressing this is that
the asymptotic null distribution of −2 log λ is the same as the distribution
of

$$\{\max(0, Y)\}^2,$$

where Y is a standard normal random variable.
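This limit is easy to simulate: {max(0, Y)}² for standard normal Y is a half-half mixture of a point mass at zero and a chi-squared variable with one degree of freedom. A minimal check (the simulation size and seed are arbitrary choices):

```python
import random

random.seed(3)
N = 100000
draws = [max(0.0, random.gauss(0.0, 1.0)) ** 2 for _ in range(N)]

frac_zero = sum(d == 0.0 for d in draws) / N   # mass at zero, about one half
mean_stat = sum(draws) / N                     # E{max(0, Y)}^2 = 0.5

# 95th percentile solves 0.5 * P(chi2_1 > c) = 0.05, i.e. c is the 90th
# percentile of chi2_1 (about 2.706), smaller than the naive chi2_1
# critical value 3.84
q95 = sorted(draws)[int(0.95 * N)]
```

The practical point is that referring −2 log λ to the usual chi-squared critical value of 3.84 would make the 0.05-level test conservative; the correct cutoff under this limit is about 2.71.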
Aitkin, Anderson and Hinde (1981) have reservations about the
adequacy of chi-squared approximations like (1.10.2) for the null distribution
of −2 log λ, and in the context of latent class models outlined a solution
to the problem using essentially a bootstrap approach. The "bootstrap"
was introduced by Efron (1979), who has investigated it further in a series
of articles; see Efron (1982, 1983, 1984) and the references therein. It is a
powerful technique which permits the variability in a random quantity to
be assessed using just the data at hand. An estimate F̂ of the underlying
distribution is formed from the observed sample. Conditional on the latter,
the sampling distribution of the random quantity of interest with F
replaced by F̂ defines its so-called bootstrap distribution, which provides
an approximation to its true distribution. It is assumed that F̂ has been so
formed that the stochastic structure of the model has been preserved.
Usually, it is impossible to express the bootstrap distribution in simple form,
and it must be approximated by Monte Carlo methods whereby
pseudorandom samples (bootstrap samples) are drawn from F̂. The bootstrap
method can be implemented nonparametrically by using the empirical
distribution function constructed from the original data. In the following
application the bootstrap is applied in a parametric framework in which
the bootstrap samples are drawn from the parametric likelihood estimate
of the underlying distribution function.
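Before turning to the parametric application, the nonparametric version can be seen in miniature: the bootstrap estimate of the standard error of a sample mean, obtained by resampling with replacement from the empirical distribution function and compared against the familiar s/√n formula. The data and the number of replications below are illustrative choices.

```python
import math
import random

random.seed(4)
sample = [random.gauss(10.0, 2.0) for _ in range(40)]
n = len(sample)

# Bootstrap the standard error of the sample mean: each bootstrap sample
# is drawn with replacement from the observed data (the empirical d.f.)
B = 2000
boot_means = []
for _ in range(B):
    resample = [random.choice(sample) for _ in range(n)]
    boot_means.append(sum(resample) / n)

m = sum(boot_means) / B
boot_se = math.sqrt(sum((b - m) ** 2 for b in boot_means) / (B - 1))

# Textbook formula s / sqrt(n) for comparison
xbar = sum(sample) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
```

The two standard errors agree to within Monte Carlo error, illustrating that the bootstrap recovers a known answer before being trusted on problems, such as the mixture likelihood ratio, where no formula is available.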
The log likelihood ratio statistic for the test of the null hypothesis
H₀: g = g₁ groups versus the alternative H₁: g = g₂ can be bootstrapped
as follows. Proceeding under H₀, a bootstrap sample is generated from a
mixture of g₁ groups where, in the specified form of their densities, unknown
parameters are replaced by their likelihood estimates formed under H₀ from
the original sample. The value of −2 log λ is computed for the bootstrap
sample after fitting mixture models for g = g₁ and g₂ in turn to it. This
process is repeated independently a number of times K, and the replicated
values of −2 log λ formed from the successive bootstrap samples provide
an assessment of the bootstrap, and hence of the true, null distribution of
−2 log λ. It enables an approximation to be made to the achieved level of
significance P corresponding to the value of −2 log λ evaluated from the
original sample.
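The resampling scheme just described can be sketched end to end for the simplest case, H₀: g = 1 versus H₁: g = 2 with univariate normal components of common variance. The EM fit, the median-split starting values, and the choice K = 19 are all illustrative; as discussed below, K this small is adequate only for a fixed-level test, not for an accurate P-value.

```python
import math
import random

def npdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_g1(data):
    # Likelihood estimates and maximized log likelihood under H0: g = 1
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma, sum(math.log(npdf(x, mu, sigma)) for x in data)

def fit_g2_loglik(data, n_iter=40):
    # EM for two normal components with common variance, median-split start
    xs = sorted(data)
    n = len(xs)
    mu1 = sum(xs[: n // 2]) / (n // 2)
    mu2 = sum(xs[n // 2:]) / (n - n // 2)
    sigma = fit_g1(xs)[1]
    pi1 = 0.5
    for _ in range(n_iter):
        tau = [pi1 * npdf(x, mu1, sigma) /
               (pi1 * npdf(x, mu1, sigma) + (1 - pi1) * npdf(x, mu2, sigma))
               for x in xs]
        s1 = sum(tau)
        pi1 = s1 / n
        mu1 = sum(t * x for t, x in zip(tau, xs)) / s1
        mu2 = sum((1 - t) * x for t, x in zip(tau, xs)) / (n - s1)
        sigma = math.sqrt(sum(t * (x - mu1) ** 2 + (1 - t) * (x - mu2) ** 2
                              for t, x in zip(tau, xs)) / n)
    return sum(math.log(pi1 * npdf(x, mu1, sigma) +
                        (1 - pi1) * npdf(x, mu2, sigma)) for x in xs)

def minus_2_log_lambda(data):
    return -2.0 * (fit_g1(data)[2] - fit_g2_loglik(data))

random.seed(5)
original = [random.gauss(0.0, 1.0) for _ in range(60)]
stat = minus_2_log_lambda(original)

# Bootstrap under H0: draw K samples from the fitted single normal,
# refit both models to each, and record the replicated statistics
mu0, sigma0, _ = fit_g1(original)
K = 19
boot = [minus_2_log_lambda([random.gauss(mu0, sigma0) for _ in range(60)])
        for _ in range(K)]

# Assessed significance: rank of the original statistic among all K + 1 values
p_value = (1 + sum(b >= stat for b in boot)) / (K + 1)
```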
If a very accurate estimate of the P-value were required, then K may
have to be very large. Indeed, for less complicated models than mixtures,
Efron (1984), Efron and Tibshirani (1986), and Tibshirani (1985) have
shown that whereas 50 to 100 bootstrap replications may be sufficient for
standard error and bias estimation, a larger number, say 350, are needed to
give a useful estimate of a percentile or P-value, and many more for a highly
accurate assessment. Usually, however, there is no interest in estimating
a P-value with high precision. Even with a limited replication number K,
the test that rejects H₀ when the value of −2 log λ for the original sample is
greater than j of the K bootstrap values has size

$$\alpha = 1 - j/(K + 1) \tag{1.10.4}$$

approximately. For if any difference between the bootstrap and true null
distributions of −2 log λ is ignored, then the original and subsequent
bootstrap values of −2 log λ can be treated as the realizations of a random
sample of size K + 1, and the probability that a specified member is greater
than j of the others is 1 − j/(K + 1). For some hypotheses the null
distribution of λ will not depend on any unknown parameters, and so then there
will be no difference between the bootstrap and true null distributions of
−2 log λ. An example is the case of normal populations with all parameters
unknown where g₁ = 1 under H₀. The normality assumption is not crucial
in this example.
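The size formula (1.10.4) pins down the smallest workable design. With j = K, so that H₀ is rejected only when the original statistic is the largest of all K + 1 values, the size is 1/(K + 1); a two-line check recovers the K = 19, j = 19 choice discussed in the text.

```python
def bootstrap_test_size(j, K):
    # Unconditional size (1.10.4) of the test that rejects H0 when the
    # original -2 log lambda exceeds j of the K bootstrap values
    return 1 - j / (K + 1)

# Smallest K achieving size 0.05 when j = K; the tiny tolerance guards
# against floating-point rounding in 1 - K/(K + 1)
K = 1
while bootstrap_test_size(K, K) > 0.05 + 1e-12:
    K += 1
```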
Note that the result (1.10.4) applies to the unconditional size of the
test and not to its size conditional on the K bootstrap values of −2 log λ.
For a specified significance level α, the values of j and K can be
appropriately chosen according to (1.10.4). For example, for α = 0.05, the smallest
value of K needed is 19, with j = 19. As cautioned above on the
estimation of the P-value for the likelihood ratio test, K needs to be very large
to ensure an accurate assessment. In the present context the size of K
manifests itself in the power of the test; see Hope (1968). Although the
test may have essentially the prescribed size for small K, its power may be
well below its limiting value as K → ∞. For the 0.05 level test of a single
normal population versus a mixture of two normal homoscedastic
populations, McLachlan (1986b) performed some simulations to demonstrate the
improvement in the power as K is increased from 19 through 39 to 99.
In an attempt to overcome some of the problems with the use of the
likelihood ratio statistic in testing for the number of groups completely
within a frequentist framework, Aitkin and Rubin (1985) adopted an
approach which places a prior distribution on the vector of mixing proportions
π. Likelihood estimation is undertaken for the remaining parameters in φ,
namely θ, after first integrating the likelihood of φ over the prior distribution of π,
f_π(π). In this account of their proposal we shall write L(φ) as L(π, θ), as
It can be seen with this proposal that there is the need to perform the
numerical integration (1.10.5). However, according to Aitkin and Rubin
(1985), θ̃ is often quite close to θ̂, which can be used for θ in initially
forming the posterior probabilities from (1.10.5). Likelihood ratio tests for
intermediate numbers of groups, for example H₀: g = 2 versus H₁: g = 3,
are not recommended, since in general the g = 2 model is not nested in the
g = 3 model.
An advantage of this proposal is that any null hypothesis about the
number of groups is specified in the interior of the parameter space of θ,
and so some of the problems arising in the formulation using φ are avoided.
However, it follows from the recent results of Quinn, McLachlan, and Hjort
(1987) that the asymptotic null distribution of −2 log λ will not necessarily
be chi-squared, as regularity conditions still do not hold. In particular, for
the test of H₀: g = 1 versus H₁: g = g₂ (> 1) for component densities
belonging to the same parametric family, they showed that 1/n times the
observed information matrix has negative eigenvalues with nonzero
probability under H₀ as n → ∞.
Sclove (1983) and Bozdogan and Sclove (1984) have proposed using
Akaike's information criterion (AIC) for the choice of the number of groups
$$\Delta(\pi) = \sum_{j=1}^{n} \{F(x_{(j)};\, \pi) - j/n\}^2\big/ n, \tag{1.10.6}$$

where x_{(j)} denotes the jth order statistic (j = 1, ..., n). If π̂_MD denotes
the estimate of π so obtained, then the number of components is estimated
by ĝ, the minimal integer g such that

$$\Delta(\hat{\pi}_{MD}) < c_2(n)/n,$$
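A sketch of a Cramér–von Mises-type distance of the form appearing in (1.10.6): the average squared discrepancy between a fitted distribution function evaluated at the order statistics and the plotting positions j/n. The normal c.d.f. and the grid of data values below are illustrative stand-ins; in the text F depends on the fitted mixture parameters.

```python
import math

def norm_cdf(x, mu, sigma):
    # Normal distribution function via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def distance(data, cdf):
    # Average squared discrepancy between the fitted c.d.f. at the order
    # statistics x_(j) and the plotting positions j/n, as in (1.10.6)
    xs = sorted(data)
    n = len(xs)
    return sum((cdf(x) - j / n) ** 2 for j, x in enumerate(xs, start=1)) / n

data = [i / 10.0 for i in range(-20, 21)]  # illustrative grid of 41 points
well_fitting = distance(data, lambda x: norm_cdf(x, 0.0, 1.2))
badly_fitting = distance(data, lambda x: norm_cdf(x, 3.0, 1.2))
```

A model whose c.d.f. tracks the data yields a much smaller distance than a badly located one, which is what makes the criterion usable for choosing between fitted models.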
along the way, but rather to use the unclassified data to improve the initial
estimate of the component densities f_i(x; θ) and hence the performance of
the discriminant rule, as assessed by its overall error rate in allocating a
subsequent unclassified observation.
The question of whether it is a worthwhile exercise to update a dis-
criminant rule on the basis of a limited number of unclassified observations
has been considered by McLachlan and Ganesalingam (1982) for a mixture
of two normal distributions. Updating procedures appropriate for nonnor-
mal situations have been suggested by Murray and Titterington (1978) who
expounded various approaches using distribution-free kernel methods, and
Anderson (1979) who gave a method for the logistic discriminant function.
A Bayesian approach to the problem was considered by Titterington (1976)
who also considered sequential updating. The more recent work of Smith
and Makov (1978) and their other papers on the Bayesian approach to the
finite mixture problem, where the observations are obtained sequentially,
are covered in Titterington, Smith and Makov (1985, Chapter 6) .
Concerning the computation of the likelihood estimate of φ on the basis
of the classified and unclassified data combined, we can easily incorporate
the additional information of the known origin of the y_ij into the likelihood
equation. In the case where the y_ij have been sampled from the mixture
G, the modified versions of (1.6.1) and (1.6.2) are

$$\hat{\pi}_i = \Big(m_i + \sum_{j=1}^{n} \hat{\tau}_{ij}\Big)\Big/(m + n) \qquad (i = 1, \ldots, g) \tag{1.11.1}$$

and
However, (1.6.1) and not (1.11.1) is appropriate for the estimate of π_i where
the y_ij provide no information on the mixing proportions; for example,
where they have been sampled separately from each population G_i (i =
1, ..., g), or where they have been sampled from a mixture of G₁, ..., G_g
but not in the proportions π₁, ..., π_g. The actual form of (1.11.2) for a
mixture of normal distributions is displayed in Section 2.1.
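The update (1.11.1) is simple arithmetic: classified observations enter as counts, while each unclassified observation contributes its posterior probability of membership. A hypothetical illustration (all counts and probabilities below are invented for the example):

```python
def updated_proportion(m_i, tau_i, m, n):
    # (1.11.1): m_i classified observations known to come from G_i, plus the
    # summed posterior probabilities of the n unclassified observations
    return (m_i + sum(tau_i)) / (m + n)

# m = 25 classified points, 10 of them from G_1; n = 4 unclassified points
# with these posterior probabilities of belonging to G_1
tau_1 = [0.9, 0.8, 0.1, 0.2]
pi_1_hat = updated_proportion(10, tau_1, 25, 4)  # roughly (10 + 2) / 29
```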
Likelihood estimation is facilitated by the presence of some data of
known origin with respect to each of the populations. For example, it has
been pointed out for the case of normal populations with unequal covariance
matrices that there are singularities in the likelihood on the edge of