
A Score Test for Zero Inflation in a Poisson Distribution

Author(s): Jan van den Broek


Source: Biometrics, Vol. 51, No. 2 (Jun., 1995), pp. 738-743
Published by: International Biometric Society
Stable URL: https://www.jstor.org/stable/2532959
Accessed: 29-04-2020 10:19 UTC

BIOMETRICS 51, 738-743
June 1995

A Score Test for Zero Inflation in a Poisson Distribution

Jan van den Broek

Center for Biostatistics, University of Utrecht,
Yalelaan 7, 3584 CL Utrecht, The Netherlands

SUMMARY

When analyzing Poisson-count data sometimes a lot of zeros are observed. When there are too many
zeros a zero-inflated Poisson distribution can be used. A score test is presented to test whether the
number of zeros is too large for a Poisson distribution to fit the data well.

1. Introduction
Johnson, Kotz, and Kemp (1992, pp. 312-318) discuss a simple way of modifying a discrete
distribution to handle extra zeros. An extra proportion of zeros, $\omega$, is added to the proportion of
zeros from the original discrete distribution, $f(0)$, while decreasing the remaining proportions in an
appropriate way:

$$P(Y_i = 0) = \omega + (1 - \omega) f(0)$$

$$P(Y_i = y_i) = (1 - \omega) f(y_i), \quad y_i > 0 \qquad (1)$$

They state that it is possible to take $\omega$ less than zero, provided that

$$\omega \ge -\frac{f(0)}{1 - f(0)},$$

with equality for left truncation.
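As an illustration of (1) with a Poisson $f$, the following minimal sketch evaluates the modified probabilities and enforces the lower bound on $\omega$. The function names `poisson_pmf` and `zip_pmf` are hypothetical and not part of the paper.

```python
import math


def poisson_pmf(y: int, lam: float) -> float:
    """Poisson probability f(y) with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zip_pmf(y: int, lam: float, omega: float) -> float:
    """Probabilities of model (1) with f taken to be Poisson(lam).

    omega may be negative as long as omega >= -f(0) / (1 - f(0));
    at that bound P(Y = 0) = 0, i.e. the zero-truncated Poisson.
    """
    f0 = poisson_pmf(0, lam)
    lower = -f0 / (1.0 - f0)
    if not (lower <= omega <= 1.0):
        raise ValueError("omega outside the admissible range")
    if y == 0:
        return omega + (1.0 - omega) * f0
    return (1.0 - omega) * poisson_pmf(y, lam)


if __name__ == "__main__":
    # Illustrative values only: mean 1.5 and 30% extra zeros.
    print([round(zip_pmf(y, 1.5, 0.3), 4) for y in range(5)])
```

The probabilities sum to one for any admissible $\omega$, because the mass removed from the positive counts is exactly the mass added to zero.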
Farewell and Sprott (1988) discuss an inflated binomial as a mixture model for count data. They
also point out the two-population interpretation of this model: in one population one observes only
zeros, while in the other one observes counts from a discrete distribution.
As an example of such an interpretation consider a population which consists of two groups: one
of people who are not at risk of developing a certain disease and one of people who are at risk and
may develop the disease several times. Of course such a model should be plausible in a given
situation.
Another example is discussed by Lambert (1992). Manufacturing equipment may be in two states:
a perfect state in which the machine produces no defects and an imperfect state in which the
machine produces a number of mistakes according to a Poisson distribution. She discusses maximum
likelihood estimation and testing in the zero-inflated Poisson regression using:

$$\ln(\lambda) = X\beta \quad \text{(with } \lambda \text{ the mean of the Poisson distribution)}$$

$$\ln\left(\frac{\omega}{1 - \omega}\right) = G\gamma$$

for covariate matrices $X$ and $G$. Two cases are considered: $\lambda$ and $\omega$ functionally not related, and $\lambda$
and $\omega$ functionally related. She also proves the asymptotic normality of the distribution of the
parameter estimates and shows that the likelihood ratio statistic is asymptotically distributed as a $\chi^2$
with appropriate degrees of freedom.
These examples assume that the zero-inflated distribution is appropriate or, to put it differently,
that the population considered consists of two subpopulations as described above. This is not always
obvious. One would like to see if there is some evidence from the observed data to support such an
assumption. To achieve this for the zero-inflated Poisson distribution a score test is proposed in the
next section.

Key words: Poisson; Score test; Zero inflation.

2. A Score Test
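A score test for $\omega = 0$ in the inflated Poisson has the advantage that one need not fit the inflated
Poisson but just a Poisson, which is the distribution under the null hypothesis.
Using (1) the density for the inflated Poisson is:

$$P(Y_i = 0) = \omega + (1 - \omega)e^{-\lambda_i}$$

$$P(Y_i = y_i) = (1 - \omega)\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \quad y_i > 0.$$

Using $\theta = \omega/(1 - \omega) = \text{constant}$ ($-f(0) \le \theta < \infty$), the log likelihood can be written as:

$$l(\lambda, \theta; y) = \sum_i \left\{ -\ln(1 + \theta) + 1_{(y_i = 0)}\ln(\theta + e^{-\lambda_i}) + 1_{(y_i > 0)}\left[-\lambda_i + y_i\ln(\lambda_i) - \ln(y_i!)\right] \right\} \qquad (2)$$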

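where $1_{(\text{condition})}$ takes the value 1 if the condition is true and 0 otherwise. From this, taking
$\ln(\lambda_i) = x_i\beta$, the score function $U(\beta, \theta)$ and the expected information $J(\beta, \theta)$ can be calculated.
The score statistic for testing $\theta = 0$ is then:

$$S(\hat\beta) = S(\hat\beta, 0) = U^T(\hat\beta, 0)[J(\hat\beta, 0)]^{-1}U(\hat\beta, 0) = \frac{\left\{\sum_{i=1}^n \left(1_{(y_i = 0)}e^{\hat\lambda_i} - 1\right)\right\}^2}{\sum_{i=1}^n \left(e^{\hat\lambda_i} - 1\right) - \hat\lambda^T X[X^T \mathrm{diag}(\hat\lambda) X]^{-1}X^T\hat\lambda} \qquad (3)$$

where $\hat\beta$ and $\hat\lambda_i$ are the estimates of $\beta$ and $\lambda_i$ under the null hypothesis. (See the Appendix
and, for instance, Cox and Hinkley (1974), pp. 321-325, for a discussion of the score test.)
If the model contains a constant then $\hat\lambda^T X[X^T \mathrm{diag}(\hat\lambda) X]^{-1}X^T\hat\lambda = \sum_i \hat\lambda_i = n\bar y$, the latter equality
being true due to the estimating equations under the null hypothesis (see Appendix).
If one writes $\hat p_{0i} = P(Y_i = 0) = e^{-\hat\lambda_i}$, then the score statistic can be written as:

$$S(\hat\beta) = \frac{\left\{\sum_{i=1}^n \dfrac{1_{(y_i = 0)} - \hat p_{0i}}{\hat p_{0i}}\right\}^2}{\sum_{i=1}^n \dfrac{1 - \hat p_{0i}}{\hat p_{0i}} - n\bar y} \qquad (4)$$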
Under the null hypothesis this statistic will have an asymptotic chi-squared distribution with 1
degree of freedom.
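The computation of (3) and (4) can be sketched in a few lines. The example below is a minimal sketch, assuming a null model with a constant and a single covariate, fitted by plain Newton-Raphson; the helper names `fit_poisson` and `score_statistic` and the small data set are hypothetical and only serve to show that the two forms of the statistic coincide when the model contains a constant.

```python
import math


def fit_poisson(x, y, iters=50):
    """Newton-Raphson fit of the null model ln(lambda_i) = b0 + b1 * x_i."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iters):
        lam = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector X^T (y - lambda) of the Poisson model.
        u0 = sum(yi - li for yi, li in zip(y, lam))
        u1 = sum((yi - li) * xi for xi, yi, li in zip(x, y, lam))
        # Information X^T diag(lambda) X as the 2 x 2 matrix [[a, b], [b, c]].
        a = sum(lam)
        b = sum(li * xi for xi, li in zip(x, lam))
        c = sum(li * xi * xi for xi, li in zip(x, lam))
        det = a * c - b * b
        b0 += (c * u0 - b * u1) / det
        b1 += (a * u1 - b * u0) / det
    return b0, b1


def score_statistic(x, y):
    """Return the score statistic computed from (3) and from (4)."""
    b0, b1 = fit_poisson(x, y)
    lam = [math.exp(b0 + b1 * xi) for xi in x]
    p0 = [math.exp(-li) for li in lam]  # fitted P(Y_i = 0)
    num = sum(((1.0 if yi == 0 else 0.0) - p) / p for yi, p in zip(y, p0)) ** 2
    a = sum(lam)
    b = sum(li * xi for xi, li in zip(x, lam))
    c = sum(li * xi * xi for xi, li in zip(x, lam))
    # lambda^T X [X^T diag(lambda) X]^{-1} X^T lambda, with X^T lambda = (a, b).
    quad = (c * a * a - 2.0 * b * a * b + a * b * b) / (a * c - b * b)
    den3 = sum(math.exp(li) - 1.0 for li in lam) - quad   # denominator of (3)
    den4 = sum((1.0 - p) / p for p in p0) - sum(y)        # denominator of (4)
    return num / den3, num / den4


# Hypothetical illustration data: 20 counts with one covariate.
x = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4
y = [0, 0, 0, 1, 2, 0, 0, 1, 0, 3, 0, 0, 0, 2, 1, 0, 1, 0, 0, 4]
s3, s4 = score_statistic(x, y)
print(s3, s4)
```

Only the Poisson fit under the null hypothesis is needed; the inflated model is never fitted. The resulting value would be referred to a chi-squared distribution with 1 degree of freedom.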
The term $\lambda^T X[X^T \mathrm{diag}(\lambda) X]^{-1}X^T\lambda$ can be read as $E(X^T Y)^T[\mathrm{var}(X^T Y)]^{-1}E(X^T Y)$. This can be
interpreted as an "F value"-like statistic as it relates to $X^T E(Y)$. If $X^T E(Y)$ departs much from zero, the
second term in the denominator of (3) will have substantial influence. The statistic $S(\hat\beta)$ can be seen
as a goodness-of-fit statistic. It looks at the fit of the zeros but also accounts for the means of the
fitted Poisson distribution. This is reasonable, because the question of too many zeros is not only
answered by looking at the number of zeros but also by comparing this number with the mean of the
observations. A moderate number of zeros and a high mean of the observations can indicate that the
number of zeros is too high. If, instead of the Poisson distribution, the binomial distribution $B(n_i, p_i)$
is used, the same statistic is obtained except for the term $E(X^T Y)^T[\mathrm{var}(X^T Y)]^{-1}E(X^T Y)$, which will
have values according to the binomial distribution. $\hat p_{0i}$ then becomes $(1 - \hat p_i)^{n_i}$. However, under the
same conditions under which the binomial can be approximated by the Poisson (small $p_i$, $n_i p_i$
constant, and $n_i$ large), the statistic (4) will be an approximation of the same statistic obtained with
the binomial distribution.

3. The case of no covariates


Consider the case where there are $n$ observations, among them $n_0$ zeros, and no covariates. The
score statistic for testing whether the Poisson distribution fits the number of zeros well is, in this
case:

$$S(\hat\beta_0) = \frac{(n_0 - n\hat p_0)^2}{n\hat p_0(1 - \hat p_0) - n\bar y\,\hat p_0^2} \qquad (5)$$

where $\hat p_0 = e^{-\bar y}$ is the fitted Poisson probability of a zero.

Table 1
Percentile points of the statistic $S(\hat\beta_0)$ based on 5,000 samples of size $n$ from a Poisson
distribution with mean $\lambda$ and the same points of a $\chi^2(1)$ distribution

                                  P.7     P.8     P.9     P.95    P.99
Percentile points of a chi^2(1)   1.07    1.64    2.71    3.84    6.63
n = 100    lambda = .5            1.13    1.66    2.73    3.86    6.77
           lambda = 1             1.11    1.68    2.75    3.97    6.87
n = 200    lambda = .5            1.13    1.37    2.67    3.73    6.52
           lambda = 1             1.02    1.58    2.56    3.68    6.61


In order to see if the chi-square approximation is appropriate a simulation study was carried out.
From a Poisson distribution with mean .5, 5,000 samples were taken, once with sample size 100 and
once with sample size 200. The same was done with a mean of 1. These small values for $\lambda$ were
chosen because, if there are a lot of zeros, the mean of the Poisson distribution under the null
hypothesis is low. For every sample the score statistic was calculated and afterwards percentile
points were obtained. These are to be compared with the percentile points of a chi-square
distribution with one degree of freedom (Table 1). The $\chi^2(1)$ approximation for $S(\hat\beta_0)$ looks very
reasonable. A reasonably high mean and hardly any zeros give problems with the approximate
chi-squared distribution unless the sample size is large. In this case, however, there might be no need
for a test on the fit of the zeros.
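A simulation of this kind can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's program: the function names, the seed, and the simple Poisson sampler (Knuth's multiplication method, adequate for small means) are choices made here for illustration, and the percentile points it produces will differ somewhat from Table 1 because the random samples differ.

```python
import math
import random


def score_stat_no_covariates(y):
    """Score statistic for the case of no covariates: n0 zeros among n counts."""
    n = len(y)
    n0 = sum(1 for v in y if v == 0)
    ybar = sum(y) / n
    p0 = math.exp(-ybar)  # fitted Poisson probability of a zero
    # The denominator is zero only if every count is zero (ybar = 0),
    # which is practically impossible for the sample sizes used here.
    return (n0 - n * p0) ** 2 / (n * p0 * (1.0 - p0) - n * ybar * p0 ** 2)


def poisson_draw(rng, lam):
    """One Poisson(lam) draw by Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_percentiles(n, lam, n_samples, probs, seed=0):
    """Empirical percentile points of the statistic under the null hypothesis."""
    rng = random.Random(seed)
    stats = sorted(
        score_stat_no_covariates([poisson_draw(rng, lam) for _ in range(n)])
        for _ in range(n_samples)
    )
    return [stats[min(n_samples - 1, int(p * n_samples))] for p in probs]


if __name__ == "__main__":
    probs = [0.7, 0.8, 0.9, 0.95, 0.99]
    for n in (100, 200):
        for lam in (0.5, 1.0):
            print(n, lam, [round(q, 2) for q in simulate_percentiles(n, lam, 5000, probs)])
```

The empirical points can then be set next to the $\chi^2(1)$ points 1.07, 1.64, 2.71, 3.84 and 6.63 quoted in Table 1.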
Cochran (1954) proposed a statistic for comparing the observed and expected frequencies of a
single outcome from a Poisson distribution. If one uses this statistic to compare the observed and
expected zero-frequencies, the score statistic (5) is obtained.
Another statistic for looking at the number of zeros in the case of no covariates was proposed by
Rao and Chakravarti (1956):

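$$\frac{n_0 - n\left(\frac{n-1}{n}\right)^{n\bar y}}{\sqrt{n\left(\frac{n-1}{n}\right)^{n\bar y}\left[1 - n\left(\frac{n-1}{n}\right)^{n\bar y}\right] + n(n-1)\left(\frac{n-2}{n}\right)^{n\bar y}}}$$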

This statistic is obtained by conditioning on the sum of the observations.


In a simulation study El-Shaarawi (1985) compared the above two statistics together with the
likelihood ratio statistic. This simulation study, using sample sizes of 15 and 50 and a mean of 5,
showed that the likelihood ratio statistic gives the closest estimate of the true significance level but has
much lower power than the other two. The significance levels of the score statistic and the statistic
of Rao and Chakravarti are closer to the true level for a sample size of 50 than for a sample size of
15. El-Shaarawi concludes that the score statistic and the one of Rao and Chakravarti are preferable
to the likelihood ratio statistic because of the higher power. The score statistic has the advantage of
being easier to compute. Besides this, it can be used in the case where there are covariates.

4. An example
For 98 human immunodeficiency virus (HIV)-infected men attending the department of internal
medicine at the Utrecht University Hospital, the number of times they had a urinary tract infection
(number of episodes) was recorded (Hoepelman et al., 1992). Besides this, the immune status of
every patient was determined by measuring the CD4+ cell count. Table 2 shows that many patients
did not have a urinary tract infection.

Table 2
Frequencies of the number of episodes

Number of episodes 0 1 2 3
Frequencies 81 9 7 1

To assess whether there are too many zeros for the data to have arisen from a Poisson distribution,
the score statistic (4) can be calculated with

$$\hat p_{0i} = \exp\left(-e^{\hat\beta_0 + \hat\beta_1 \mathrm{CD4+}_i}\right)$$


The outcome of the score statistic is 5.96, which exceeds 3.84, the 95% point of the $\chi^2(1)$
distribution (Table 1), giving evidence that too many zeros are observed for the Poisson to fit the
data well. Without the covariate CD4+ the score statistic has an outcome of 15.35, which illustrates
the importance of the use of covariates.
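The value without the covariate can be checked by hand from the frequencies in Table 2 and the no-covariate statistic (5) of Section 3. The following worked computation uses rounded intermediate values, so the last digit may differ slightly from the reported figure.

```latex
\begin{align*}
n &= 98, \qquad n_0 = 81, \qquad
\bar y = \frac{0 \cdot 81 + 1 \cdot 9 + 2 \cdot 7 + 3 \cdot 1}{98} = \frac{26}{98} \approx 0.2653, \\
\hat p_0 &= e^{-\bar y} \approx 0.7670, \qquad n \hat p_0 \approx 75.16, \\
S(\hat\beta_0) &= \frac{(n_0 - n \hat p_0)^2}{n \hat p_0 (1 - \hat p_0) - n \bar y \, \hat p_0^2}
\approx \frac{(81 - 75.16)^2}{17.52 - 15.29}
\approx \frac{34.07}{2.22} \approx 15.3 .
\end{align*}
```

This agrees with the reported 15.35 up to rounding.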
An alternative distribution for the Poisson then is the inflated Poisson. As pointed out, the model
then used is that the population can be thought of as consisting of two parts: a proportion $\omega$
consisting of patients not being at risk of developing a urinary tract infection and another part
consisting of patients who are at risk of developing a urinary tract infection.
The covariate can be used to model the mean number of episodes of the patients being at risk:

$$\ln(\lambda) = \beta_0 + \beta_1 \mathrm{CD4+}.$$

It can also be used to model the "probability" of not being at risk:

$$\ln\left(\frac{\omega}{1 - \omega}\right) = \alpha_0 + \alpha_1 \mathrm{CD4+}.$$

If one fits an inflated Poisson with both, the log-likelihood is -53.21 on 94 degrees of freedom.
Lambert (1992) discusses the fitting procedure for which iterative methods are needed. The
likelihood ratio statistic for testing $\beta_1 = 0$ has an outcome of .101, indicating that $\ln(\lambda)$ can be
taken as a constant: $\ln(\lambda) = \beta_0$. Using this and the same model for the "probability" of not being at
risk as above, the log-likelihood is -53.26 on 95 degrees of freedom. The results of this fit are in
Table 3.

Table 3
Estimation results

                                       Asymptotic correlations
                          Standard     between estimates
Parameter     Estimates   errors       alpha_0   alpha_1   beta_0
alpha_0       -.487       .699         1         -.66      .64
alpha_1        .007       .003                   1         -.23
beta_0        -.094       .317                             1

To see if the model can be simplified any further the likelihood ratio statistic for testing $\alpha_1 = 0$ was
calculated. The outcome is 12.19, so there is a relation between the "probability" of not being at risk
and the CD4+ cell count. Roughly: the odds of not being at risk for a patient who has a CD4+ cell
count of 100 higher than another patient are about twice as high.
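The "about twice" follows from the logit model for the "probability" of not being at risk and the estimate .007 for the CD4+ coefficient in Table 3: a difference of 100 in CD4+ multiplies the odds by

```latex
\exp(100 \, \hat\alpha_1) = \exp(100 \times 0.007) = e^{0.7} \approx 2.01 .
```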
It might be possible that there is a relation between the follow-up time and the number of observed
episodes. The model was refitted with ln(follow-up time) as an additional covariate. This gave no
improvement (a likelihood ratio statistic of .56).
As El-Shaarawi (1985) points out, rejecting the hypothesis that the Poisson distribution fits the
number of zeros well does not imply that the inflated Poisson is the appropriate model. There might
be other distributions that fit the data well. For instance, the negative binomial can be a good
candidate. Fitting the negative binomial with mean $\mu$, variance $\mu(1 + \sigma\mu)$ and a log-link, in this
example, gives a log-likelihood of -55.67 on 95 degrees of freedom. Table 4 shows some summary
statistics of the Pearson residuals, defined as $[Y_i - E(Y_i)]/\sqrt{\mathrm{var}(Y_i)}$ with $Y_i$ the number of episodes
of patient $i$, for the negative binomial and the inflated Poisson. Inspection of these residuals shows
that they are, in absolute sense, more often smaller for the inflated Poisson.

Table 4
Summary statistics of the Pearson residuals

                     First quartile   Mean    Third quartile   Range
Negative binomial    -.513            -.010   -.096            6.42
Inflated Poisson     -.544            -.003   -.042            5.39
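The Pearson residuals summarized in Table 4 need the mean and variance of each fitted distribution. The sketch below is an illustration under assumptions: it uses the standard moments of the inflated Poisson, $E(Y) = (1 - \omega)\lambda$ and $\mathrm{var}(Y) = (1 - \omega)\lambda(1 + \omega\lambda)$, together with the negative binomial mean and variance quoted in the text; the function names and the CD4+ value of 200 are hypothetical, and the patient-level data are not reproduced here.

```python
import math


def pearson_residual(y, mean, var):
    """[Y - E(Y)] / sqrt(var(Y))."""
    return (y - mean) / math.sqrt(var)


def zip_moments(lam, omega):
    """Mean and variance of the inflated Poisson with mean lam and proportion omega."""
    mean = (1.0 - omega) * lam
    return mean, mean * (1.0 + omega * lam)


def negbin_moments(mu, disp):
    """Mean mu and variance mu * (1 + disp * mu) of the negative binomial."""
    return mu, mu * (1.0 + disp * mu)


# Estimates from Table 3; cd4 = 200 is a hypothetical patient, not from the data.
alpha0, alpha1, beta0 = -0.487, 0.007, -0.094
cd4 = 200.0
omega = 1.0 / (1.0 + math.exp(-(alpha0 + alpha1 * cd4)))  # "probability" of not being at risk
lam = math.exp(beta0)                                     # mean for patients at risk
mean, var = zip_moments(lam, omega)
print(pearson_residual(0, mean, var), pearson_residual(2, mean, var))
```

A patient with no episodes always has a negative residual, which is why the quartiles in Table 4 are negative when most counts are zero.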


ACKNOWLEDGEMENTS

I would like to thank Jim Lindsey, Byron J. T. Morgan, and an Associate Editor for their helpful
comments and suggestions and Andy Hoepelman for permission to use the data.

RESUME

When analyzing count data, a large number of zeros is sometimes observed. In such a situation a
modified Poisson distribution can be used. We present a score test to decide whether the number of
observed zeros is too large for an unmodified Poisson distribution to be compatible with the data.

REFERENCES

Cochran, W. G. (1954). Some methods for strengthening the common $\chi^2$ tests. Biometrics 10,
417-451.
Cox, D. R., and Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
El-Shaarawi, A. H. (1985). Some goodness-of-fit methods for the Poisson plus added zeros
distribution. Applied and Environmental Microbiology 49, 1304-1306.
Farewell, V. T., and Sprott, D. A. (1988). The use of a mixture model in the analysis of count data.
Biometrics 44, 1191-1194.
Hoepelman, A. I. M., Van Buren, M., Van den Broek, J., and Borleffs, J. C. C. (1992). Bacteriuria
in men infected with HIV-1 is related to their immune status (CD4+ cell count). AIDS 6,
179-184.
Johnson, N. L., Kotz, S., and Kemp, A. W. (1992). Univariate Discrete Distributions, second
edition. New York: John Wiley & Sons, Inc.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in
manufacturing. Technometrics 34, 1-14.
Rao, C. R., and Chakravarti, I. M. (1956). Some small sample tests of significance for a Poisson
distribution. Biometrics 12, 264-282.

Received December 1993; revised December 1994; accepted February 1995.

APPENDIX

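The model under the null hypothesis has link function $\ln(\lambda) = X\beta$ with $\beta$ a $p \times 1$ vector, where the design
matrix $X$ includes a constant. Then differentiating the log likelihood (2) with respect to $\beta$ and $\theta$ gives:

$$\frac{\partial l(\cdot)}{\partial \beta_r} = \sum_i \left\{ -1_{(y_i = 0)}\frac{e^{-\lambda_i}}{\theta + e^{-\lambda_i}}\lambda_i x_{ir} + 1_{(y_i > 0)}(y_i - \lambda_i)x_{ir} \right\}, \quad r = 1, \ldots, p \qquad (6)$$

$$\frac{\partial l(\cdot)}{\partial \theta} = \sum_i \left\{ -\frac{1}{1 + \theta} + 1_{(y_i = 0)}\frac{1}{\theta + e^{-\lambda_i}} \right\} \qquad (7)$$

Under the null hypothesis, $\theta = 0$, with $\hat\beta$ and $\hat\lambda_i$ the maximum likelihood estimates under this
hypothesis, (6) becomes:

$$\sum_i (y_i - \hat\lambda_i)x_{ir} = 0, \quad r = 1, \ldots, p \qquad (8)$$

and (7) then is $\sum_i \{1_{(y_i = 0)} - \hat p_{0i}\}/\hat p_{0i}$, so

$$U^T(\hat\beta, 0) = \left(0, \ldots, 0, \sum_i \frac{1_{(y_i = 0)} - \hat p_{0i}}{\hat p_{0i}}\right) \qquad (9)$$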

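The second derivatives are given by:

$$\frac{\partial^2 l(\cdot)}{\partial \beta_r \partial \beta_s} = \sum_i \left\{ -1_{(y_i = 0)}\frac{e^{-\lambda_i}\left[(1 - \lambda_i)\theta + e^{-\lambda_i}\right]}{(\theta + e^{-\lambda_i})^2}\lambda_i x_{ir}x_{is} - 1_{(y_i > 0)}\lambda_i x_{ir}x_{is} \right\}, \quad r = 1, \ldots, p, \; s = 1, \ldots, p$$

$$\frac{\partial^2 l(\cdot)}{\partial \beta_r \partial \theta} = \sum_i 1_{(y_i = 0)}\frac{e^{-\lambda_i}}{(\theta + e^{-\lambda_i})^2}\lambda_i x_{ir}, \quad r = 1, \ldots, p$$

$$\frac{\partial^2 l(\cdot)}{\partial \theta^2} = \sum_i \left\{ \frac{1}{(1 + \theta)^2} - 1_{(y_i = 0)}\frac{1}{(\theta + e^{-\lambda_i})^2} \right\}$$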

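Using $E[1_{(y_i = 0)}] = P(y_i = 0) = \dfrac{\theta + e^{-\lambda_i}}{1 + \theta}$ and $E[1_{(y_i > 0)}] = P(y_i > 0) = \dfrac{1 - e^{-\lambda_i}}{1 + \theta}$ it can be seen that
the expected information has entries:

$$J_{r,s} = -E\left(\frac{\partial^2 l(\cdot)}{\partial \beta_r \partial \beta_s}\right) = \sum_i \left\{ \frac{-e^{-\lambda_i}\lambda_i\theta + \theta + e^{-\lambda_i}}{(1 + \theta)[\theta + e^{-\lambda_i}]} \right\}\lambda_i x_{ir}x_{is}, \quad r = 1, \ldots, p, \; s = 1, \ldots, p$$

$$J_{r,p+1} = -E\left(\frac{\partial^2 l(\cdot)}{\partial \beta_r \partial \theta}\right) = -\sum_i \left( \frac{e^{-\lambda_i}}{(1 + \theta)[\theta + e^{-\lambda_i}]} \right)\lambda_i x_{ir}, \quad r = 1, \ldots, p$$

$$J_{p+1,p+1} = -E\left(\frac{\partial^2 l(\cdot)}{\partial \theta^2}\right) = \sum_i \left( \frac{1 - e^{-\lambda_i}}{(1 + \theta)^2[\theta + e^{-\lambda_i}]} \right)$$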

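and so $J(\hat\beta, 0)$ has entries

$$J_{r,s} = \sum_i \hat\lambda_i x_{ir}x_{is}, \quad r = 1, \ldots, p, \; s = 1, \ldots, p; \qquad J_{r,p+1} = -\sum_i \hat\lambda_i x_{ir}, \quad r = 1, \ldots, p;$$

$$J_{p+1,p+1} = \sum_i \frac{1 - \hat p_{0i}}{\hat p_{0i}}.$$

Partition $J(\hat\beta, 0)$ as $\begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix}$ where $J_{11} = X^T \mathrm{diag}(\hat\lambda) X$; $J_{12} = -X^T\hat\lambda$; $J_{21} = -\hat\lambda^T X$;
$J_{22} = \sum_i (e^{\hat\lambda_i} - 1)$. Now denote the inverse of $J(\hat\beta, 0)$ as $C$, which can be partitioned as $\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}$.
Due to the structure of $U(\hat\beta, 0)$ only $C_{22}$ is needed.

$$C_{22} = \left[J_{22} - J_{21}J_{11}^{-1}J_{12}\right]^{-1} = \left[\sum_i (e^{\hat\lambda_i} - 1) - \hat\lambda^T X[X^T \mathrm{diag}(\hat\lambda) X]^{-1}X^T\hat\lambda\right]^{-1}.$$

Using this with (9), equation (3) follows.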


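Since the model contains a constant, $\hat\lambda$ can be written as $\hat\lambda = \mathrm{diag}(\hat\lambda)Xe_p$, where $e_p$ is a $(p \times 1)$
vector having a 1 as first element and having other elements equal to zero. So
$\hat\lambda^T X[X^T \mathrm{diag}(\hat\lambda) X]^{-1}X^T\hat\lambda = e_p^T X^T \mathrm{diag}(\hat\lambda) X[X^T \mathrm{diag}(\hat\lambda) X]^{-1}X^T \mathrm{diag}(\hat\lambda) Xe_p = e_p^T X^T \mathrm{diag}(\hat\lambda) Xe_p = 1^T\hat\lambda$.
From (8) this can be seen to equal $n\bar y$.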
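As a check on these expressions, the case of no covariates can be worked through directly. The following is a sketch of that special case, assuming $X$ is a single column of ones so that $\hat\lambda_i = \bar y$ for all $i$ and $\hat p_0 = e^{-\bar y}$; it recovers the statistic (5) of Section 3.

```latex
\begin{align*}
J_{11} &= n\bar y, \qquad J_{12} = J_{21} = -n\bar y, \qquad J_{22} = n\,(e^{\bar y} - 1), \\
C_{22} &= \left[\, n\,(e^{\bar y} - 1) - \frac{(n\bar y)^2}{n\bar y} \right]^{-1}
        = \left[\, n\,(e^{\bar y} - 1 - \bar y) \right]^{-1}, \\
U_{p+1} &= \sum_i \left( 1_{(y_i = 0)}\, e^{\bar y} - 1 \right) = n_0\, e^{\bar y} - n, \\
S &= \frac{(n_0\, e^{\bar y} - n)^2}{n\,(e^{\bar y} - 1 - \bar y)}
   = \frac{(n_0 - n \hat p_0)^2}{n \hat p_0 (1 - \hat p_0) - n \bar y\, \hat p_0^2}.
\end{align*}
```

The last equality follows by multiplying numerator and denominator by $\hat p_0^2 = e^{-2\bar y}$.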
