Communications in Statistics - Simulation and Computation

ISSN: 0361-0918 (Print) 1532-4141 (Online) Journal homepage: https://www.tandfonline.com/loi/lssp20

Several two-component mixture distributions for count data

Razik Ridzuan Mohd Tajuddin, Noriszura Ismail & Kamarulzaman Ibrahim

To cite this article: Razik Ridzuan Mohd Tajuddin, Noriszura Ismail & Kamarulzaman Ibrahim
(2020): Several two-component mixture distributions for count data, Communications in Statistics -
Simulation and Computation, DOI: 10.1080/03610918.2020.1722834

To link to this article: https://doi.org/10.1080/03610918.2020.1722834

Published online: 03 Feb 2020.

Razik Ridzuan Mohd Tajuddin, Noriszura Ismail, and Kamarulzaman Ibrahim
Department of Mathematical Sciences, Universiti Kebangsaan Malaysia, Bangi, Malaysia

ABSTRACT

A finite mixture model is a flexible approach for modeling multimodal data. Multimodality
can be present when the data comprise several subpopulations. In this study, several
two-component mixture distributions for count data are proposed and described to cater for
the bimodality issue. The distributions considered in developing the mixture distributions
are Poisson (P), Poisson Lindley (PL), negative binomial (NB) and negative binomial Lindley
(NBL), and altogether a total of ten two-component mixture distributions are obtained. The
maximum likelihood estimators for each mixture distribution are obtained by employing the
L-BFGS-B method. A comparison study based on a graphical approach is conducted to
investigate the effect of the mixing proportion on the resulting mixture distribution for
different shapes of the probability curve and positions of the mode. A simulation study is
conducted to investigate the performance of each mixture distribution in fitting data that
come from two subpopulations with different mean and dispersion values. Three mixture
models, namely P-NB, PL-NB and NB-NB, are most commonly identified as adequate in
describing the simulated data with various types of mixing properties. These three
distributions are considered to be the most flexible and are thus suggested for real data
applications.

ARTICLE HISTORY
Received 8 April 2019
Accepted 23 January 2020

KEYWORDS
Count data; finite mixture; negative binomial; Poisson-Lindley

CONTACT Razik Ridzuan Mohd Tajuddin razikridzuan@siswa.ukm.edu.my Department of Mathematical
Sciences, Universiti Kebangsaan Malaysia, Bangi, Malaysia.
© 2020 Taylor & Francis Group, LLC

1. Introduction
In some cases, data arise from multiple subpopulations that affect the population
differently, and the Poisson and negative binomial models are then insufficient to capture
the overall data. As an alternative, a finite mixture distribution can be employed, as it
is very flexible (McLachlan and Peel 2002). The need for a finite mixture distribution is
generally judged by whether there are multiple modes in the plot of the data (Everitt 2006).
Mixture distributions have been applied to count data many times over the years. For
example, two- and three-component mixtures of Poisson distributions were fitted to a
Malaysian insurance claim count dataset for the year 2000 (Ismail, Mohd Ali, and Chiew
2004). Based on the concept of parsimony, as well as on adequacy tests, Ismail, Mohd Ali,
and Chiew (2004) deduced that the two-component Poisson mixture distribution is adequate to
describe the insurance claim count data. Bimodal truncated count data were fitted using
Com-Poisson mixtures to cater for both dispersion and bimodality properties (Bose et al.
2013). The contributions of several discrete mixture distributions to the presence of extra
zeros were explored extensively by Zamzuri, Sapuan, and Ibrahim (2018). The authors found
that a large proportion of zeros in the data is contributed by a high mixing proportion of
a distribution with low mean values. In past studies, the mixture distributions consist of
the same distribution but with different parameters. However, several researchers have
suggested combining different distributions into finite mixture models (Benecha et al.
2017; Deni, Jemain, and Ibrahim 2009; Dobi-Wantuch, Mika, and Szeidl 2000).
This study is motivated by the demand for a discrete distribution that is flexible enough
to fit count data with a bimodality property. It aims to propose several new two-component
mixture distributions for count data to cater for the bimodality issue. Since the Poisson
Lindley (Sankaran 1970) and the negative binomial Lindley (Zamani and Ismail 2010) have
proven to be better than the Poisson and the negative binomial respectively, the mixture
distributions are built from the Poisson, Poisson Lindley, negative binomial and negative
binomial Lindley distributions. The new two-component mixture distributions obtained are
Poisson-Poisson Lindley (P-PL), Poisson Lindley-Poisson Lindley (PL-PL), Poisson
Lindley-Negative Binomial (PL-NB), Poisson-Negative Binomial Lindley (P-NBL), Poisson
Lindley-Negative Binomial Lindley (PL-NBL), Negative Binomial-Negative Binomial Lindley
(NB-NBL) and Negative Binomial Lindley-Negative Binomial Lindley (NBL-NBL). The
two-component Poisson-Poisson (P-P), Poisson-Negative Binomial (P-NB) and Negative
Binomial-Negative Binomial (NB-NB) distributions will also be discussed.
This paper is organized as follows. In Sec. 2, the two-component mixture distributions are
developed and summarized in terms of their probability mass functions and means. In Sec. 3,
the effect of the mixing proportion, the trend shapes of the probability values and the
positions of the modes of the two distributions on the resulting mixture distribution is
studied using a graphical approach. In Sec. 4, a simulation study is conducted to
investigate the performance of the mixture models when the subpopulations come from
quasi-Poisson distributions with different mean and dispersion values. The applications of
the four selected mixture distributions to two datasets are studied in Sec. 5. Finally,
Sec. 6 concludes the study.

2. Mixture model
2.1. A k-component discrete mixture distribution
Generally, for a random variable $X$ which follows a $k$-component discrete mixture
distribution with parameters $\theta_i$ and $p_i$, where $i = 1, 2, \ldots, k$, the
probability mass function of $X$ (Everitt 2006; Ismail, Mohd Ali, and Chiew 2004; Johnson,
Kemp, and Kotz 2005; McLachlan and Peel 2002) can be written as

$$f(x \mid \theta; p) = f(x \mid \theta_1, \theta_2, \ldots, \theta_k; p_1, p_2, \ldots, p_k) = \sum_{i=1}^{k} p_i \, h(x \mid \theta_i),$$

where $\sum_{i=1}^{k} p_i = 1$, $0 < p_i < 1$ and $h(x \mid \theta_i)$ is the probability
mass function for the $i$th component with parameter $\theta_i$. Since
$p_k = 1 - \sum_{i=1}^{k-1} p_i$, for a $k$-component discrete mixture distribution, one
less parameter will be estimated. The $r$th moment about the origin for $X$, $\mu_r$
(Johnson, Kemp, and Kotz 2005), can be written as

$$\mu_r = E(X^r) = \sum_{i=1}^{k} p_i \, E(X^r \mid \theta_i). \tag{1}$$

The log-likelihood function for $X$ with $N$ data can be written as

$$\ln f(x \mid \theta; p) = \ln\left[\prod_{j=1}^{N} f(x_j \mid \theta; p)\right] = \sum_{j=1}^{N} \ln\left[\sum_{i=1}^{k} p_i \, h(x_j \mid \theta_i)\right].$$

For example, the probability mass function and the log-likelihood function for a random
variable $Y$ which follows a $k$-component Poisson distribution with $N$ data are given by

$$f(y \mid \lambda; p) = \sum_{i=1}^{k} p_i \, h(y \mid \lambda_i) = \sum_{i=1}^{k} p_i \, \frac{\lambda_i^{y} \exp(-\lambda_i)}{y!}, \quad y = 0, 1, 2, \ldots,$$

and

$$\ln f(y \mid \lambda; p) = \sum_{j=1}^{N} \ln\left[\sum_{i=1}^{k} p_i \, \frac{\lambda_i^{y_j} \exp(-\lambda_i)}{y_j!}\right],$$

respectively. By using Eq. (1), the mean for $Y$ is $\mu = \sum_{i=1}^{k} p_i \lambda_i$.
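
As a concrete illustration of the k-component Poisson mixture above, the following Python sketch evaluates the probability mass function, the log-likelihood and the mean obtained from Eq. (1). The function names, weights, rates and the small dataset are hypothetical choices made for illustration only; the study itself reports using R and does not list its code.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability mass at y, computed on the log scale for stability."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))


def mixture_pmf(y, weights, lams):
    """k-component Poisson mixture pmf: sum_i p_i * Pois(y | lambda_i)."""
    return sum(p * poisson_pmf(y, lam) for p, lam in zip(weights, lams))


def mixture_loglik(data, weights, lams):
    """Log-likelihood: sum over observations of the log mixture pmf."""
    return sum(math.log(mixture_pmf(y, weights, lams)) for y in data)


def mixture_mean(weights, lams):
    """Mean from Eq. (1) with r = 1: sum_i p_i * lambda_i."""
    return sum(p * lam for p, lam in zip(weights, lams))


# Illustrative values (hypothetical, not taken from the paper).
weights = [0.3, 0.7]
lams = [1.0, 10.0]
data = [0, 1, 2, 9, 10, 11, 12]

# Sanity checks: the pmf should sum to about 1 and reproduce the closed-form mean.
total_mass = sum(mixture_pmf(y, weights, lams) for y in range(200))
numeric_mean = sum(y * mixture_pmf(y, weights, lams) for y in range(200))
print(total_mass, numeric_mean, mixture_mean(weights, lams))
print(mixture_loglik(data, weights, lams))
```

Summing the pmf over a long finite range is only a numerical check; the truncation point of 200 is an arbitrary choice that is far in the tail for these rates.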

2.2. Two-component discrete mixture distributions


A two-component discrete mixture distribution can be obtained by setting $k$ to 2. Based on
the $r$th moment in (1), the first two moments can be obtained and subsequently, the
variance, $\sigma^2$, can be obtained as

$$\sigma^2 = p(1 - p)(\mu_1 - \mu_2)^2 + p\sigma_1^2 + (1 - p)\sigma_2^2, \tag{2}$$

where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the distributions involved in the
mixture respectively. The derivation of the variance for the two-component mixture
distribution is given below. The first two moments about the origin are

$$E(X) = p\mu_1 + (1 - p)\mu_2,$$

and

$$E(X^2) = pE(X_1^2 \mid \theta_1) + (1 - p)E(X_2^2 \mid \theta_2) = p(\sigma_1^2 + \mu_1^2) + (1 - p)(\sigma_2^2 + \mu_2^2),$$

respectively. Therefore, the variance can be obtained by taking

$$\begin{aligned}
\mathrm{Var}(X) &= E(X^2) - [E(X)]^2 \\
&= p(\sigma_1^2 + \mu_1^2) + (1 - p)(\sigma_2^2 + \mu_2^2) - [p\mu_1 + (1 - p)\mu_2]^2 \\
&= p(1 - p)(\mu_1 - \mu_2)^2 + p\sigma_1^2 + (1 - p)\sigma_2^2.
\end{aligned}$$
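
The variance formula in Eq. (2) can be checked numerically. The sketch below, written in Python as an illustration, compares the closed form with the variance computed directly from the pmf of a two-component Poisson (P-P) mixture, for which each component has variance equal to its mean. The function names and the parameter values (p = 0.2, means 1 and 10) are assumptions chosen for the example.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability mass at x, computed on the log scale."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def mixture_variance_formula(p, mu1, mu2, var1, var2):
    """Eq. (2): p(1-p)(mu1-mu2)^2 + p*var1 + (1-p)*var2."""
    return p * (1 - p) * (mu1 - mu2) ** 2 + p * var1 + (1 - p) * var2


def mixture_variance_numeric(p, lam1, lam2, upper=300):
    """Variance computed directly from the P-P mixture pmf on 0..upper-1."""
    pmf = [p * poisson_pmf(x, lam1) + (1 - p) * poisson_pmf(x, lam2)
           for x in range(upper)]
    mean = sum(x * f for x, f in enumerate(pmf))
    second_moment = sum(x * x * f for x, f in enumerate(pmf))
    return second_moment - mean ** 2


# Hypothetical parameter values for the check.
p, lam1, lam2 = 0.2, 1.0, 10.0
# For Poisson components the variance equals the mean.
closed_form = mixture_variance_formula(p, lam1, lam2, lam1, lam2)
numeric = mixture_variance_numeric(p, lam1, lam2)
print(closed_form, numeric)
```

The first term of Eq. (2) is the between-component contribution, which is why a mixture of two equidispersed Poisson components is itself overdispersed whenever the two means differ.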
In this study, there are 10 two-component discrete mixture distributions considered. For
each mixture distribution, the probability mass function and the mean formula are obtained
and given in Table 1. As given in Table 1, there are 3 three-parameter mixture
distributions, 4 four-parameter mixture distributions and 3 five-parameter mixture
distributions which are considered in this study.

2.3. Optimization method


Generally, the derivative of the log-likelihood function for each two-component mixture
distribution is quite complicated, so an iterative method is required to maximize the
log-likelihood. This can be easily implemented in R (R Core Team 2018). Since the
estimation involves several constraints on the parameter values, the optimization method
known as L-BFGS-B (Byrd et al. 1995) is used. This method is a limited-memory version of
the BFGS method that handles bound constraints. For further explanation of the L-BFGS-B
method, one can refer to Byrd et al. (1995).
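
To make the role of the bound constraints concrete, the following Python sketch maximizes the log-likelihood of a two-component Poisson (P-P) mixture while keeping the mixing proportion inside (0, 1) and both rates positive. It is not L-BFGS-B: it is a much simpler projected gradient descent with a finite-difference gradient, swapped in only to show how box constraints enter the optimization using the standard library alone. The bounds, the starting values and the small dataset are hypothetical.

```python
import math

# Box constraints on (p, lambda1, lambda2); the numeric limits are illustrative assumptions.
BOUNDS = [(1e-4, 1 - 1e-4), (1e-4, 1e3), (1e-4, 1e3)]


def neg_loglik(params, data):
    """Negative log-likelihood of a two-component Poisson (P-P) mixture."""
    p, lam1, lam2 = params
    total = 0.0
    for y in data:
        f1 = math.exp(y * math.log(lam1) - lam1 - math.lgamma(y + 1))
        f2 = math.exp(y * math.log(lam2) - lam2 - math.lgamma(y + 1))
        total -= math.log(max(p * f1 + (1 - p) * f2, 1e-300))  # floor avoids log(0)
    return total


def project(params):
    """Clip each parameter back into its box."""
    return [min(max(v, lo), hi) for v, (lo, hi) in zip(params, BOUNDS)]


def gradient(f, x, h=1e-6):
    """Central finite-difference gradient."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def projected_descent(f, start, max_iter=300):
    """Projected gradient descent with step halving; stops when no step improves f."""
    x = project(start)
    fx = f(x)
    for _ in range(max_iter):
        g = gradient(f, x)
        step, improved = 1.0, False
        while step > 1e-12:
            cand = project([xi - step * gi for xi, gi in zip(x, g)])
            fc = f(cand)
            if fc < fx:
                x, fx, improved = cand, fc, True
                break
            step /= 2
        if not improved:
            break
    return x, fx


# Hypothetical bimodal counts: one group near 1, one group near 10.
data = [0, 1, 1, 0, 2, 1, 0, 1, 3, 0, 9, 11, 10, 8, 12, 10, 9, 13, 11, 10]
objective = lambda params: neg_loglik(params, data)
start = [0.5, 2.0, 8.0]
estimate, nll = projected_descent(objective, start)
print(estimate, nll)
```

In practice a quasi-Newton routine with bound support, such as the L-BFGS-B option the authors use in R, would be preferred over this stand-in because it converges in far fewer function evaluations.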

2.4. Model selection


The selection of the best model will be based on the AIC (Akaike 1974) and the BIC
(Schwarz 1978; Wit, Heuvel, and Romeijn 2012). The AIC and BIC formulae are given as

$$AIC = -2\ln L + 2r,$$

and

$$BIC = -2\ln L + r\ln N,$$

respectively, where $r$ is the number of parameters estimated by the model.
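
The two criteria differ only in the penalty term, so their difference does not depend on the log-likelihood. The Python sketch below implements both formulae and uses that fact as a consistency check against the PL-NB row reported in Table 3 (AIC 926.12, BIC 940.21, four parameters, 250 observations). The function names are illustrative assumptions.

```python
import math


def aic(loglik, n_params):
    """AIC = -2 ln L + 2r."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    """BIC = -2 ln L + r ln N."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def bic_minus_aic(n_params, n_obs):
    """The log-likelihood cancels: BIC - AIC = r * (ln N - 2)."""
    return n_params * (math.log(n_obs) - 2.0)


# Table 3 reports AIC = 926.12 and BIC = 940.21 for PL-NB (r = 4) fitted to N = 250 observations.
gap_reported = 940.21 - 926.12
gap_formula = bic_minus_aic(4, 250)
print(round(gap_reported, 2), round(gap_formula, 2))
```

Because ln N exceeds 2 once N is larger than about 7, the BIC penalizes each extra parameter more heavily than the AIC for both datasets in Sec. 5, which explains why the five-parameter NB-NB model can have the smallest AIC without having the smallest BIC.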

3. Comparison study
A comparison study based on a graphical approach is conducted to investigate the effect of
the mixing proportion, the trend shapes of the probability values and the positions of the
modes of the two distributions on the resulting mixture distribution. The general procedure
of the comparison study can be outlined as follows:

i. Generate two different datasets from the Poisson distribution, each with a different
mean value, to depict two subpopulations, known as SubP 1 and SubP 2, with different trend
shapes of probability values and mode values.
ii. Create new datasets by combining the two datasets with $p = 0.1, 0.3, 0.5, 0.7, 0.9$.
iii. Plot the trend of the probability values for the generated and the newly mixed
datasets on the graph.
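
The procedure above can be approximated without plotting by counting the local modes of the mixture curve. The Python sketch below is a simplification: it uses exact Poisson probabilities rather than generated datasets, and the helper names and the truncation at 60 are assumptions. The four pairs of means are those listed in the caption of Figure 1.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability mass at x, computed on the log scale."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def mixture_curve(p, lam1, lam2, upper=60):
    """Probabilities of the two-component Poisson mixture on 0..upper-1."""
    return [p * poisson_pmf(x, lam1) + (1 - p) * poisson_pmf(x, lam2)
            for x in range(upper)]


def local_modes(curve):
    """Indices strictly higher than both neighbours (the left edge needs only the right one)."""
    modes = []
    for x in range(len(curve) - 1):
        left = curve[x - 1] if x > 0 else -1.0
        if curve[x] > left and curve[x] > curve[x + 1]:
            modes.append(x)
    return modes


# Pairs of Poisson means for panels (i) to (iv) of Figure 1.
pairs = {"i": (10.5, 14.5), "ii": (0.5, 0.1), "iii": (3.5, 12.5), "iv": (0.5, 5.5)}
for label, (lam1, lam2) in pairs.items():
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(label, p, local_modes(mixture_curve(p, lam1, lam2)))
```

A single reported index corresponds to a unimodal mixture, while two indices indicate bimodality for that mixing proportion.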

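From the comparison study based on the graphical approach, several conclusions can be made.
Based on Figure 1 (i) and 1 (ii), for any value of the mixing proportion, $p$, when both
datasets have similar shapes and modes close to each other, the resulting mixture data
follow a similar shape, with only one mode. In this case, it is impractical to consider a
two-component mixture distribution unless it is known that the datasets come from two
different subpopulations.
However, as seen from Figure 1 (iii), when both datasets have similar bell-shaped curves
and the values of the modes are far from each other, a small value of $p$ can cause the
resulting mixture data to be bimodal. The same conclusion can be obtained when the two
datasets have different shapes, i.e., exponential-shaped and bell-shaped, as seen from
Figure 1 (iv). In the case of datasets with similar bell-shaped curves and modes far from
each other, or with different shapes, the two-component mixture distribution needs to be
considered so that the bimodality property in the data is retained.

Figure 1. The four plots in Figure 1 refer to: (i) data generated from
Pois($\lambda_1 = 10.5$) and Pois($\lambda_2 = 14.5$) so that both datasets have
bell-shaped curves with modes close to each other; (ii) data generated from
Pois($\lambda_1 = 0.5$) and Pois($\lambda_2 = 0.1$) so that both datasets have
exponential-shaped curves with modes close to each other; (iii) data generated from
Pois($\lambda_1 = 3.5$) and Pois($\lambda_2 = 12.5$) so that both datasets have bell-shaped
curves with modes far from each other; (iv) data generated from Pois($\lambda_1 = 0.5$) and
Pois($\lambda_2 = 5.5$) so that the first dataset has an exponential-shaped curve and the
second dataset has a bell-shaped curve, with modes far from each other.

Table 1. The mixture distributions, their probability mass functions and the parameters
involved in estimation for fitting count data.

P-P
Probability mass function: $\Pr(X = x) = p\,\dfrac{\lambda_1^{x} e^{-\lambda_1}}{x!} + (1 - p)\,\dfrac{\lambda_2^{x} e^{-\lambda_2}}{x!}$
Mean: $\mu = p\lambda_1 + (1 - p)\lambda_2$

P-PL
Probability mass function: $\Pr(X = x) = p\,\dfrac{\lambda^{x} e^{-\lambda}}{x!} + (1 - p)\,\dfrac{\theta^{2}(\theta + x + 2)}{(\theta + 1)^{x+3}}$
Mean: $\mu = p\lambda + (1 - p)\,\dfrac{\theta + 2}{\theta(\theta + 1)}$

PL-PL
Probability mass function: $\Pr(X = x) = p\,\dfrac{\theta_1^{2}(\theta_1 + x + 2)}{(\theta_1 + 1)^{x+3}} + (1 - p)\,\dfrac{\theta_2^{2}(\theta_2 + x + 2)}{(\theta_2 + 1)^{x+3}}$
Mean: $\mu = p\,\dfrac{\theta_1 + 2}{\theta_1(\theta_1 + 1)} + (1 - p)\,\dfrac{\theta_2 + 2}{\theta_2(\theta_2 + 1)}$

P-NB
Probability mass function: $\Pr(X = x) = p\,\dfrac{\lambda_1^{x} e^{-\lambda_1}}{x!} + (1 - p)\dbinom{r + x - 1}{x}\left(\dfrac{r}{r + \lambda_2}\right)^{r}\left(\dfrac{\lambda_2}{r + \lambda_2}\right)^{x}$
Mean: $\mu = p\lambda_1 + (1 - p)\lambda_2$

PL-NB
Probability mass function: $\Pr(X = x) = p\,\dfrac{\theta^{2}(\theta + x + 2)}{(\theta + 1)^{x+3}} + (1 - p)\dbinom{r + x - 1}{x}\left(\dfrac{r}{r + \lambda}\right)^{r}\left(\dfrac{\lambda}{r + \lambda}\right)^{x}$
Mean: $\mu = p\,\dfrac{\theta + 2}{\theta(\theta + 1)} + (1 - p)\lambda$

P-NBL
Probability mass function: $\Pr(X = x) = p\,\dfrac{\lambda^{x} e^{-\lambda}}{x!} + (1 - p)\,\dfrac{\theta^{2}}{\theta + 1}\dbinom{r + x - 1}{x}\sum_{l=0}^{x}\dbinom{x}{l}(-1)^{l}\,\dfrac{\theta + r + l + 1}{(\theta + r + l)^{2}}$
Mean: $\mu = p\lambda + (1 - p)\,r\,\dfrac{\theta^{2} + \theta - 1}{(\theta + 1)(\theta - 1)^{2}}$

PL-NBL
Probability mass function: $\Pr(X = x) = p\,\dfrac{\theta_1^{2}(\theta_1 + x + 2)}{(\theta_1 + 1)^{x+3}} + (1 - p)\,\dfrac{\theta_2^{2}}{\theta_2 + 1}\dbinom{r + x - 1}{x}\sum_{l=0}^{x}\dbinom{x}{l}(-1)^{l}\,\dfrac{\theta_2 + r + l + 1}{(\theta_2 + r + l)^{2}}$
Mean: $\mu = p\,\dfrac{\theta_1 + 2}{\theta_1(\theta_1 + 1)} + (1 - p)\,r\,\dfrac{\theta_2^{2} + \theta_2 - 1}{(\theta_2 + 1)(\theta_2 - 1)^{2}}$

NB-NB
Probability mass function: $\Pr(X = x) = p\dbinom{r_1 + x - 1}{x}\left(\dfrac{r_1}{r_1 + \lambda_1}\right)^{r_1}\left(\dfrac{\lambda_1}{r_1 + \lambda_1}\right)^{x} + (1 - p)\dbinom{r_2 + x - 1}{x}\left(\dfrac{r_2}{r_2 + \lambda_2}\right)^{r_2}\left(\dfrac{\lambda_2}{r_2 + \lambda_2}\right)^{x}$
Mean: $\mu = p\lambda_1 + (1 - p)\lambda_2$

NB-NBL
Probability mass function: $\Pr(X = x) = p\dbinom{r_1 + x - 1}{x}\left(\dfrac{r_1}{r_1 + \lambda}\right)^{r_1}\left(\dfrac{\lambda}{r_1 + \lambda}\right)^{x} + (1 - p)\,\dfrac{\theta^{2}}{\theta + 1}\dbinom{r_2 + x - 1}{x}\sum_{l=0}^{x}\dbinom{x}{l}(-1)^{l}\,\dfrac{\theta + r_2 + l + 1}{(\theta + r_2 + l)^{2}}$
Mean: $\mu = p\lambda + (1 - p)\,r_2\,\dfrac{\theta^{2} + \theta - 1}{(\theta + 1)(\theta - 1)^{2}}$

NBL-NBL
Probability mass function: $\Pr(X = x) = p\,\dfrac{\theta_1^{2}}{\theta_1 + 1}\dbinom{r_1 + x - 1}{x}\sum_{l=0}^{x}\dbinom{x}{l}(-1)^{l}\,\dfrac{\theta_1 + r_1 + l + 1}{(\theta_1 + r_1 + l)^{2}} + (1 - p)\,\dfrac{\theta_2^{2}}{\theta_2 + 1}\dbinom{r_2 + x - 1}{x}\sum_{l=0}^{x}\dbinom{x}{l}(-1)^{l}\,\dfrac{\theta_2 + r_2 + l + 1}{(\theta_2 + r_2 + l)^{2}}$
Mean: $\mu = p\,r_1\,\dfrac{\theta_1^{2} + \theta_1 - 1}{(\theta_1 + 1)(\theta_1 - 1)^{2}} + (1 - p)\,r_2\,\dfrac{\theta_2^{2} + \theta_2 - 1}{(\theta_2 + 1)(\theta_2 - 1)^{2}}$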

4. Simulation study
A simulation study is conducted to investigate the performance of the two-component
mixture distributions in fitting count data which have the property of bimodality. The
data are randomly generated using a two-component quasi-Poisson distribution, denoted as
$QP_1$-$QP_2(p, \lambda_1, \lambda_2, \phi_1, \phi_2)$. In this simulation study, it is
presumed that the population comes from either two equidispersed subpopulations, two
overdispersed subpopulations, or one equidispersed and one overdispersed subpopulation,
where all subpopulations have different mean values. The algorithm for the simulation is
given below:

i. Generate a value of $u$, where $U \sim \mathrm{unif}(0, 1)$.
ii. For a given value of $p$, if $u < p$, generate one observation from
$QP_1(\lambda_1, \phi_1)$ and set it to random variable $X$. If $u \ge p$, generate one
observation from $QP_2(\lambda_2, \phi_2)$ and set it to random variable $X$. We call
$QP_1(\lambda_1, \phi_1)$ the first subpopulation and $QP_2(\lambda_2, \phi_2)$ the second
subpopulation.
iii. Repeat the process $N$ times (in this study, $N = 1000$).
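
The three steps above can be sketched in Python as follows. The paper does not state how a quasi-Poisson observation is drawn, so the generator here is an assumption: a plain Poisson draw when the dispersion is 1, and a gamma-mixed Poisson (negative binomial) draw with variance equal to dispersion times mean when the dispersion exceeds 1. The function names and the seed are likewise hypothetical.

```python
import math
import random


def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the moderate means used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def quasi_poisson_draw(lam, phi, rng):
    """ASSUMPTION: variance = phi * mean is obtained from a gamma-mixed Poisson
    (negative binomial) when phi > 1, and from a plain Poisson when phi == 1."""
    if phi <= 1.0:
        return poisson_draw(lam, rng)
    shape = lam / (phi - 1.0)                         # gamma shape
    rate_draw = rng.gammavariate(shape, phi - 1.0)    # gamma scale phi - 1, so mean = lam
    return poisson_draw(rate_draw, rng)


def simulate_mixture(n, p, lam1, lam2, phi1, phi2, seed=0):
    """Draw n observations from the two-component mixture following steps i to iii."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        u = rng.random()                              # step i
        if u < p:                                     # step ii, first subpopulation
            sample.append(quasi_poisson_draw(lam1, phi1, rng))
        else:                                         # step ii, second subpopulation
            sample.append(quasi_poisson_draw(lam2, phi2, rng))
    return sample                                     # step iii: n repetitions


# One of the simulation settings described in Sec. 4: p = 0.5, phi1 = 1, phi2 = 3.
sample = simulate_mixture(n=1000, p=0.5, lam1=1.0, lam2=10.0, phi1=1.0, phi2=3.0)
mean = sum(sample) / len(sample)
variance = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)
print(mean, variance)
```

Under this assumed generator, Eq. (2) gives a theoretical mean of 5.5 and variance of 35.75 for the setting shown, which the sample moments should approach as the sample size grows.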

For all simulations, $\lambda_1$ is set to 1 and $\lambda_2$ is set to 10, to portray low
and high mean values and to ensure the existence of the bimodality property. For the first
set of simulations, let $\phi_1 = \phi_2 = 1$. For the second set, let $\phi_1 = 1$ and
$\phi_2 = 3$. For the third set, let $\phi_1 = 3$ and $\phi_2 = 1$. For the fourth set, let
$\phi_1 = \phi_2 = 3$. When the dispersion parameter is set to 1, the subpopulation is
presumed to be equidispersed, and when it is set to 3, the subpopulation is presumed to be
overdispersed. The dispersion values identified in the literature for the real data of
interest usually range from 1.16 to 2.95 (see Bose et al. 2013; Ismail, Mohd Ali, and Chiew
2004; Sankaran 1970; Zamani and Ismail 2010). We believe that a dispersion value of 3
represents a larger overdispersion than those in the literature of interest, and is
suitable for describing overdispersed data in the simulation study.
For each simulation, it is assumed that $p = 0.2$, $0.5$ and $0.7$, to mimic low, equal and
high mixing proportions of the first subpopulation. The simulated data are fitted to all 10
two-component mixture distributions considered in this study. The results of the
goodness-of-fit tests of the fitted models on the simulated data from the two-component
quasi-Poisson distribution with different parameter values are given in Appendix A,
Supplemental material. The results of fitting two-component mixture distributions to
simulated data based on $QP_1$-$QP_2(p, \lambda_1, \lambda_2, \phi_1, \phi_2)$ are
summarized in Table 2.
Out of the 12 simulation cases in Table 2, the mixture models found to adequately fit the
simulated data are listed here from the most frequent to the least frequent. The NB-NB
distribution fits nine of them, P-NB fits seven of them, PL-NB fits six of them, P-P fits
three of them, P-PL fits two of them, PL-PL and P-NBL fit only one each, and the rest do
not fit the simulated data at all. Three mixture distributions, namely P-NB, PL-NB and
NB-NB, are most commonly identified as adequate in describing the simulated data with
various types of mixing properties. These three distributions are considered to be the most
flexible and are thus used for real data applications. The P-P distribution is selected for
comparison purposes since this mixture distribution is able to fit data from two
subpopulations with the equidispersed property, which is supported by the simulation study.

Table 2. The results of fitting two-component mixture distributions to simulated data based
on $QP_1$-$QP_2(p, \lambda_1, \lambda_2, \phi_1, \phi_2)$.
Simulated data p
Low ($\lambda_1 = 1$) High ($\lambda_2 = 10$)
Equidispersed Overdispersed Equidispersed Overdispersed
($\phi_1 = 1$) ($\phi_1 = 3$) ($\phi_2 = 1$) ($\phi_2 = 3$) 0.2 0.5 0.7
✓ ✓ P-P P-P P-P
P-PL P-NB P-NB
P-NB NB-NB
PL-NB
NB-NB
✓ ✓ PL-NB P-NB P-NB
NB-NB NB-NB
✓ ✓ P-PL PL-NB P-NB
PL-NB P-NBL NB-NB
NB-NB NB-NB
✓ ✓ P-NB PL-NB PL-PL
NB-NB NB-NB P-NB
PL-NB
NB-NB

5. Applications
The P-NB, PL-NB and NB-NB mixture distributions selected based on the simulation study are
fitted to two datasets. The P-P distribution is included for comparison because it is able
to fit data from two subpopulations with the equidispersed property. The adequacy of the
mixture models is judged by the smallest AIC and BIC values as well as by a non-significant
p-value from the goodness-of-fit tests.
Example 1. The first dataset consists of 250 observations on the number of visitors
visiting a fishing park (Bruin 2006), which has also been investigated by Zamzuri, Sapuan,
and Ibrahim (2018). Three modes are present in the data, as shown in Figure 2: the first
mode is observed at the beginning of the data, the second in the middle of the data and the
last at the end of the data due to accumulation. Since there is more than one mode, a
finite mixture distribution is applicable to the data.
Figure 2. A vertical line graph describing the number of visitors visiting a fishing park.
The '*' in the graph refers to the accumulation of data with values of at least 16.

Results of model fittings of the number of visitors to a fishing park are summarized in
Table 3. Based on Table 3, the NB-NB distribution gives the smallest AIC value but not the
smallest BIC value, with a significant p-value from the goodness-of-fit test. Therefore,
the best model is PL-NB because it has relatively small AIC and BIC values with a
non-significant p-value from the goodness-of-fit test. The fitted function based on the
PL-NB distribution can be written as

$$\Pr(X = x) = \hat{p}\,\frac{\hat{\theta}^{2}(\hat{\theta} + x + 2)}{(\hat{\theta} + 1)^{x+3}} + (1 - \hat{p})\binom{\hat{r} + x - 1}{x}\left(\frac{\hat{r}}{\hat{r} + \hat{\lambda}}\right)^{\hat{r}}\left(\frac{\hat{\lambda}}{\hat{r} + \hat{\lambda}}\right)^{x},$$

where $\hat{p} = 0.0524$, $\hat{\theta} = 0.0543$, $\hat{r} = 0.2775$ and
$\hat{\lambda} = 1.4909$. Based on the best mixture distribution, it is reasonable to
believe that the population of visitors to a fishing park consists of two subpopulations:
one follows a Poisson-Lindley distribution with a mean of 35.8839 and a dispersion of
19.8779, and the other follows a negative binomial distribution with a mean of 1.4909 and a
dispersion of 6.3726. The mean and dispersion formulae for a random variable that follows a
Poisson-Lindley distribution can be found in Ghitany and Al-Mutairi (2009). For a random
variable that follows a negative binomial distribution, the mean and dispersion formulae
can be found in Lloyd-Smith (2007).
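
The reported component means and dispersions can be reproduced from the fitted estimates. The Python sketch below assumes that "dispersion" means the variance-to-mean ratio, uses the Poisson-Lindley mean from Table 1 together with the standard closed-form Poisson-Lindley variance (which is not written out in the paper), and takes the negative binomial variance as $\lambda + \lambda^2 / r$ for the parameterization in Table 1. The function names are illustrative.

```python
def poisson_lindley_mean(theta):
    """Mean of the Poisson-Lindley distribution, as in Table 1."""
    return (theta + 2) / (theta * (theta + 1))


def poisson_lindley_dispersion(theta):
    """Variance-to-mean ratio of the Poisson-Lindley distribution."""
    variance = (theta**3 + 4 * theta**2 + 6 * theta + 2) / (theta**2 * (theta + 1) ** 2)
    return variance / poisson_lindley_mean(theta)


def negative_binomial_dispersion(lam, r):
    """Variance-to-mean ratio when the variance is lam + lam^2 / r."""
    return 1 + lam / r


# Fitted PL-NB estimates reported for Example 1.
theta_hat, r_hat, lam_hat = 0.0543, 0.2775, 1.4909
pl_mean = poisson_lindley_mean(theta_hat)
pl_disp = poisson_lindley_dispersion(theta_hat)
nb_disp = negative_binomial_dispersion(lam_hat, r_hat)
print(round(pl_mean, 4), round(pl_disp, 4), round(nb_disp, 4))
```

Under these assumptions the computed values agree with the figures quoted in the text up to rounding of the parameter estimates, which supports reading "dispersion" as the variance-to-mean ratio.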
Table 3. AIC and BIC values for each distribution in describing the number of visitors to
fishing park.
Distribution   AIC       BIC       p-value
P-P            1450.20   1460.76   <0.0001
P-NB           926.11    940.20    0.0427
PL-NB*         926.12    940.21    0.1344
NB-NB          925.17    942.77    0.0246
* The distribution that provides the most adequate fit.

Example 2. The second dataset refers to the frequency of occurrence of the symptoms fever,
cough, running nose or all three together among 602 preschool children over the period June
1982 until September 1985. The data are obtained from the R package 'CAMAN' developed by
Schlattmann, Hoehne, and Verba (2016). The distribution of the data can be seen in Figure
3, where one mode is observed at the beginning of the data and another mode in the middle
of the data, suggesting that a finite mixture distribution can suitably be used for model
fitting.

Figure 3. A vertical line graph describing the frequency of occurrence of the symptoms
among the preschool children.

Results of model fittings of the frequency of occurrence of the symptoms among the
preschool children are summarized in Table 4. Based on Table 4, the P-NB, PL-NB and NB-NB
distributions provide an adequate fit based on the non-significant p-values from the
goodness-of-fit test. However, the PL-NB distribution gives the smallest AIC and BIC values
compared to the other three distributions, and is thus selected as the best model, with a
fitted function which can be written as

$$\Pr(X = x) = \hat{p}\,\frac{\hat{\theta}^{2}(\hat{\theta} + x + 2)}{(\hat{\theta} + 1)^{x+3}} + (1 - \hat{p})\binom{\hat{r} + x - 1}{x}\left(\frac{\hat{r}}{\hat{r} + \hat{\lambda}}\right)^{\hat{r}}\left(\frac{\hat{\lambda}}{\hat{r} + \hat{\lambda}}\right)^{x},$$

where $\hat{p} = 0.9091$, $\hat{\theta} = 0.3552$, $\hat{r} = 1.3552$ and
$\hat{\lambda} = 0.0009$. Based on the best mixture distribution, it is fair to believe
that the population of occurrence of the symptoms among the preschool children comprises
two subpopulations: one follows a Poisson-Lindley distribution with a mean of 4.8927 and a
dispersion of 4.1286, and the other follows a negative binomial distribution with a mean of
0.0009 and a dispersion of 1.0007.

Table 4. AIC and BIC values for each distribution in describing the frequency of occurrence
of the symptoms among preschool children.
Distribution   AIC       BIC       p-value
P-P            3273.06   3286.26   <0.0001
P-NB           3131.54   3149.14   0.0710
PL-NB*         3123.27   3140.88   0.2969
NB-NB          3124.88   3146.88   0.2754
* The distribution that provides the most adequate fit.

6. Conclusions
In this study, several two-component mixture distributions are proposed for fitting count
data that have a bimodality property. A comparison study based on a graphical approach is
conducted to investigate the effect of the mixing proportion, the trend shapes of the
probability values and the positions of the modes of the two distributions on the resulting
mixture distribution. From the comparison study, it can be concluded that a small value of
the mixing proportion affects the overall shape of the graph if the plots of the two
subpopulations have different shapes or if the two modes are located far from each other.
In addition, a simulation study is conducted to investigate the performance of each mixture
distribution in describing count data that come from two subpopulations, each with a
different mean and dispersion property. From the simulation study, it can be concluded that
the three distributions that most often fit adequately are the P-NB, PL-NB and NB-NB
distributions. These three distributions are considered for fitting two real datasets, and
the results are compared with those for the P-P distribution, which has been included for
comparison in this study. For both datasets, the PL-NB distribution can provide the best
fit based on the AIC and BIC values and the p-value from the goodness-of-fit test.
The properties of the PL-NB distribution can be studied in future work. This may include
considering PL-NB as the distribution for the response variable in a linear model. The
PL-NB distribution can be modified into a truncated version, and the resulting mixture
distribution can be employed in fitting meteorology and social science data. In addition,
the Bayesian approach can also be considered when modeling mixture distributions with some
prior information regarding the behavior of the subpopulations.

Funding
The authors gratefully acknowledge the financial support received through research grants
(FRGS/1/2019/STG06/UKM/01/5) from Ministry of Education, Malaysia and (GUP-2019-031)
from Universiti Kebangsaan Malaysia. The authors would also like to thank the referees for the
constructive comments.

ORCID
Razik Ridzuan Mohd Tajuddin http://orcid.org/0000-0001-6534-3678

References
Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on
Automatic Control 19 (6):716–23.
Benecha, H. K., B. Neelon, K. Divaris, and J. S. Preisser. 2017. Marginalized mixture models for
count data from multiple source populations. Journal of Statistical Distributions and
Applications 4 (1):1–17.
Bose, S., G. Shmueli, P. Sur, and P. Dubey. 2013. Fitting Com-Poisson mixtures to bimodal count
data. In 1st International Conference on Information, Operations Management and Statistics
(ICIOMS 2013).
Bruin, J. 2006. Newtest: Command to compute new test. UCLA: Statistical Consulting Group.
Accessed February 9, 2019. https://stats.idre.ucla.edu/r/dae/zip/.
Byrd, R. H., P. Lu, J. Nocedal, and C. Zhu. 1995. A limited memory algorithm for bound
constrained optimization. SIAM Journal on Scientific Computing 16 (5):1190–208.
doi:10.1137/0916069.
Deni, S. M., A. A. Jemain, and K. Ibrahim. 2009. The best probability models for dry and wet
spells in Peninsular Malaysia during monsoon seasons. International Journal of Climatology 30
(8):1194–205.
Dobi-Wantuch, I., J. Mika, and L. Szeidl. 2000. Modelling wet and dry spells with mixture distri-
butions. Meteorology and Atmospheric Physics 73 (3–4):245–56. doi:10.1007/s007030050076.
Everitt, B. S. 2006. Mixture distributions-I. Encyclopedia of statistical sciences 7. New York: John
Wiley and Sons, Inc.
Ghitany, M. E., and K. Al-Mutairi. 2009. Estimation methods for the discrete Poisson-Lindley
distribution. Journal of Statistical Computation and Simulation 79 (1):1–9.
doi:10.1080/00949650701550259.
Ismail, N., K. M. Mohd Ali, and A. C. Chiew. 2004. A model for insurance claim count with sin-
gle and finite mixture distribution. Sains Malaysiana 33:173–94.
Johnson, N. L., A. W. Kemp, and S. Kotz. 2005. Mixture distributions. In Univariate discrete dis-
tributions. 3rd ed., eds. D. J. Balding, N. A. C. Cressie, N. I. Fisher, I. M. Johnstone, J. B.
Kadane, G. Molenberghs, L. M. Ryan, D. W. Scott, A. F. M. Smith and J. L. Teugels, 343–80.
Hoboken, NJ: John Wiley & Sons.
Lloyd-Smith, J. 2007. Maximum likelihood estimation of the negative binomial dispersion param-
eter for highly overdispersed data, with applications to infectious diseases. PloS One 2 (2):e180.
McLachlan, G., and D. Peel. 2002. General introduction. In Finite mixture models. eds. N. A. C.
Cressie, N. I. Fisher, I. M. Johnstone, J. N. Kadane, D. W. Scott, B. W. Silverman, A. F. M.
Smith, J. L. Teugels, V. Barnett, R. A. Bradley, J. S. Hunter, D. G. Kendall, 1–39. New York:
John Wiley & Sons.
R Core Team. 2018. R: A language and environment for statistical computing. Vienna, Austria: R
Foundation for Statistical Computing.
Sankaran, M. 1970. The discrete Poisson-Lindley distribution. Biometrics 26 (1):145–9.
doi:10.2307/2529053.
Schlattmann, P., J. Hoehne, and M. Verba. 2016. Package ‘CAMAN’.
Schwarz, G. 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2):461–4.
doi:10.1214/aos/1176344136.
Wit, E., E. V. D. Heuvel, and J. W. Romeijn. 2012. All models are wrong … . Statistica
Neerlandica 66 (3):217–36.
Zamani, H., and N. Ismail. 2010. Negative binomial-Lindley distribution and its application.
Journal of Mathematics and Statistics 6:4–9.
Zamzuri, Z. H., M. S. Sapuan, and K. Ibrahim. 2018. The extra zeros in traffic accident
data: A study on the mixture of discrete distributions. Sains Malaysiana 47 (8):1931–40.
doi:10.17576/jsm-2018-4708-35.
