Professional Documents
Culture Documents
Methods
Volume 8 | Issue 2 Article 8
11-1-2009
B. M. Golam Kibria
Florida International University, kibriag@fiu.edu
Recommended Citation
Banik, Shipra and Kibria, B. M. Golam (2009) "On Some Discrete Distributions and their Applications with Real Life Data," Journal of
Modern Applied Statistical Methods: Vol. 8 : Iss. 2 , Article 8.
DOI: 10.22237/jmasm/1257034020
Available at: http://digitalcommons.wayne.edu/jmasm/vol8/iss2/8
This Regular Article is brought to you for free and open access by the Open Access Journals at DigitalCommons@WayneState. It has been accepted for
inclusion in Journal of Modern Applied Statistical Methods by an authorized editor of DigitalCommons@WayneState.
Journal of Modern Applied Statistical Methods Copyright © 2009 JMASM, Inc.
November 2009, Vol. 8, No. 2, 423-447 1538 – 9472/09/$95.00
On Some Discrete Distributions and their Applications with Real Life Data
This article reviews some useful discrete models and compares their performance in terms of the high
frequency of zeroes, which is observed in many discrete data (e.g., motor crash, earthquake, strike data,
etc.). A simulation study is conducted to determine how commonly used discrete models (such as the
binomial, Poisson, negative binomial, zero-inflated and zero-truncated models) behave if excess zeroes
are present in the data. Results indicate that the negative binomial model and the ZIP model are better
able to capture the effect of excess zeroes. Some real-life environmental data are used to illustrate the
performance of the proposed models.
Key words: Binomial Distribution; Poisson distribution; Negative Binomial; ZIP; ZINB.
423
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
greater than expected. Reasons may include guidelines to fit a discrete process appropriately.
failing to observe an event during the First, a simulation study was conducted to
observational period and an inability to ever determine the performance of the considered
experience an event. Some researchers (Warton, models when excess zeroes are present in a
2005; Shankar, et al., 2003; Kibria, 2006) have dataset. Second, the following real-life data (For
applied zero-inflated models to model this type details, see Table 4.1) were analyzed, the
of process (known as a dual-states process: one numbers of:
zero-count state and one other normal-count
state). These models generally capture apparent 1. Road accidents per month in the Dhaka
excess zeroes that commonly arise in some district,
discrete processes, such as road crash data, and 2. People visiting the Dhaka medical hospital
improve statistical fit when compared to the (BMSSU) per day,
Poisson and the negative binomial model. The 3. Earthquakes in Bangladesh per year, and
reason is that data obtained from a dual-state 4. Strikes (hartals) per month in Dhaka.
process often suffer from over-dispersion
because the number of zeroes is inflated by the Statistical Distribution: The Binomial
zero-count state. A zero-inflated model Distribution
(introduced by Rider, 1961) is defined by If X~B(n, p), then the probability mass
function (pmf) of X is defined by
1 − θ if X = 0
P( X = k ) = P( X = k ; n , p ) = nCK p k ( 1 − p ) n − k ,
θP( X ; μ i ) if X > 0
k = 0, 1, 2, …, n. (2.1)
where θ is the proportion of non-zero values of
X and P(X;μi) is a zero-truncated probability where n is the total number of trials and p is the
model fitted to normal-count states. To address probability of success of each trial. The moment
phenomena with zero-inflated counting generating function (mgf) of (2.1) is
processes, the zero-inflated Poisson (ZIP) model
and the zero-inflated negative binomial (ZINB) M X (t ) = ( p + qet )n ,
model have been developed. A ZIP model is a
mix of a distribution that is degenerate at zero thus, E(X), V(X) and skewness (Sk) of (2.1) are
and a variant of the Poisson model. Conversely, np, npq and [( 1 − 2 p )2 / npq ] respectively.
the ZINB model is a mix of zero and a variant of
negative binomial model.
Statistical Distribution: The Poisson Distribution
Opposite situations from the zero-
inflated models are also encountered; this article In (2.1) if n→∞ and p→0, then X
examines processes that have no zeroes: the follows the Poisson distribution with parameter
zero-truncated models. If the Poisson or the λ(>0) (denoted X~P(λ)). The pmf is defined as
negative binomial model is used with these
processes, the procedure tries to fit the model by e −λ λk
P( X = k ; λ ) = , k = 0, 1, 2, … (2.2)
including probabilities for zero values. More k!
accurate models that do not include zero values
should be able to be produced. If the value of where λ denotes expected number of
zero cannot be observed in any random occurrences. The mgf of (2.2) is
experiment, then these models may be used.
Two cases are considered: (1) the zero-truncated t
M X (t ) = eλ ( e −1) ,
Poisson model, and (2) the zero-truncated
negative binomial model.
Given a range of possible models, it is thus, E(X) and V(X) of (2.2) are the same, which
difficult to fit an appropriate discrete model. The is λ and Sk is equal to 1/λ.
main purpose of this article is to provide
424
BANIK & KIBRIA
Γ( k + X ) X − ( k + X ) 1 − θ if X = 0
P( X ; k , p) = p q ,
X !Γk P( X ;θ , k , p ) = P( X ; k , p )
p > 0, k > 0, X = 0, 1, 2,... θ 1 − q−k
if X > 0
(2.3) (2.5)
where p is the chance of success in a single trial where P(X; k, p) is defined in (2.3) and θ is the
and k are the number of failures of repeated proportion of non-zero values of X. The mgf of
identical trials. If k→∞, then X~P(λ), where λ = (2.5) is
kp. The mgf of (2.3) is
θ
M X (t ) = (q − pet )− k , M X (t ) = (1 − θ ) + [(q − pet ) − k − 1],
1 − q−k
thus, E(X), V(X) and Sk of (2.3) are kp, kpq and thus, E(X), V(X) and Sk of (2.5) are
[( 1 + 2 p )2 / kpq ] respectively.
θ kp θ kp θ kp
Statistical Distribution: The Zero-Inflated −k
, −k p( k + 1 ) + 1 − −k
Poisson (ZIP) Distribution 1− q 1− q 1− q
If X~ZIP(θ, λ) with parameters θ and λ,
then the pmf is defined by and
1 − θ if X = 0 θ kp θ kp 2
( kp ) 2 + 3kp + 1 − [ 3kp + 3 −
−k
]
P( X ; θ , λ ) = P( X ; λ ) (2.4) 1− q 1− q
−k
θ 1 − e − λ if X > 0
θ kp θ kp 3
−k
[ kp + 1 − ]
−k
1− q 1− q
where P(X; λ) is defined in (2.2) and θ is the
proportion of non-zero values of X. The mgf of
(2.4) is respectively. As k→∞, ZINB (θ, k, p) ~ZIP(θ,
λ), where λ= kp.
θ e− λ λe t
M X (t ) = (1 − θ ) + (e − 1),
1 − e− λ Statistical Distribution: Zero-Truncated Poisson
(ZTP) Distribution
thus, E(X), V(X) and Sk of (2.4) are If X~ZTP(λ), then the pmf of X is given
by
θλ θλ θλ P ( X = x)
, λ +1− P ( X = x | X > 0) =
1 − e −λ 1 − e − λ 1 − e − λ P ( X > 0)
and P( X ; λ )
θλ θλ 2 =
λ 2 + 3λ + 1 − −λ
[ 3λ + 3 − −λ ] 1 − P ( X = 0)
1 − e 1−e
P( X ; λ )
θλ θλ 3 =
−λ
[λ + 1− −λ
] (1 − e − λ )
1−e 1−e
respectively.
425
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
thus,
thus, E(X) and V(X) are λ( 1 − e − λ ) −1 and Sk is
X i2 n
X i3
n
( 1 − e − λ )( λ + 3 + λ−1 ) − 3( λ + 1 ) − ( 1 − e − λ ) −1 . M 1 = E( X ) , M 2 = , M3 = .
i =1 n i =1 n
Statistical Distribution: Zero-Truncated
Negative Binomial (ZTNB) Distribution The Maximum Likelihood Estimation Method
If X~ZTNB(k, p), then the pmf of X is (MLE)
given by Find the log-likelihood function for a
given distribution and take a partial derivative of
P( X ; k , p ) P( X ; k , p ) this function with respect to each parameter and
P( X = x | X > 0 ) = = set it equal to 0; solve it to find the parameters
1 − P( X = 0 ) 1 − q −k estimate.
respectively. ∂ log L( X ; n, p ) Xi n − Xi
= i =1
− i =1
.
∂p p 1− p
Parameter Estimation
To estimate the parameters of the Simplifying results in:
considered models, the most common methods n
426
BANIK & KIBRIA
Estimator of λ
The log-likelihood expression of (2.2) is Differentiating the above expression with
respect to p and k, the following equations
n n result:
LogL(X;λ) = − nλ + X i log λ − (log X i )! n n
i =1 i =1
∂LogL( X ; k , p ) Xi ( k + X i )
= i =1
− i =1
(2.8)
Differentiating the above expression with ∂p p 1+ p
respect to λ, results in and
p
∂ log L( X ; λ ) Xi
∂LogL( X ; k , p ) 1 n X i −1 j
∂λ
= − n + i =1
λ =
∂k n i =1 j =0 1 + jk −1
and, after simplification, kp ( E ( X ) + k )
+ k 2 ln(1 + p ) −
1+ p
Pλˆ ( ml ) = M 1 (2.9)
thus
Pλˆ ( mom ) = Pλˆ ( ml ) = M 1 . Solving (2.8), results in NB p ( ml ) = M 1 / kˆ. It
was observed that NBk̂ ( ml ) does not exist in
Negative Binomial Distribution: Moment
closed form, thus, NBk̂ ( ml ) was obtained by
Estimators of p and k
Based on (2.3), it is known that E(X) = optimizing numerically (2.9) using the Newton-
kp and V(X) = kpq, thus Raphson optimization technique where p= p .
n − 1 i =1 M
ZIPλˆ ( mom ) = 2 − 1
M1
Negative Binomial Distribution: Maximum and
Likelihood Estimators of p and k
The log-likelihood expression of (2.3) is
427
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
M (1 − e − λ )
ˆ LogL( X ; θ , k , p ) =
ZIPθˆ ( ml ) = 1 .
λˆ
n n
I(X i
= 0) log(1 − θ ) + I ( X i > 0) log θ
i =1 i =1
ZIP Distribution: Maximum Likelihood n n
+ log( ) − ( k + X i ) log q
Γ( Xi +K )
Estimators of θ and λ X i ! ΓK
The log-likelihood expression of (2.4) is i =1 i =1
n n
+ X i log p − log(1 − q − k )
LogL(X; θ, λ) = i =1 i =1
n
I(X i = 0) log(1 − θ )
i =1 Differentiating the above with respect to each of
n
+ I ( X i > 0){log θ + X i log λ − log(e − 1) − log X i !} λ parameters, results in the following estimators
i =1 for θ, k, and p:
n n
I( X i = 0 ) I( X i > 0 ) Other estimates p̂ and k̂ were found
∂ log L( X ;θ , λ )
= − i =1 + i =1 iteratively: k, p is given by
∂θ 1 −θ θ
−1
ZINBpˆ ( ml ) = E ( X ){1 − (1 + kp)− k } (2.10)
n n n
∂ log L(X; θ, λ ) e −λ
∂λ
= i =1
Xi / λ −
i =1
I ( X i > 0) − I( X
i =1
i > 0) thus, the solution of p̂ has the same properties
e λ −1
as described above. Because the score equation
After the above equations are simplified for θ for k̂ does not have a simple form, k was
and λ, the following are obtained: estimated numerically given the current estimate
n of p̂ from (2.10) (for details, see Warton, 2005).
ZIPθˆ ( mom ) = I ( X i > 0 ) / n
i =1
428
BANIK & KIBRIA
n
Γ( X i + k ) n excess zeroes - it is necessary to investigate
log(
i =1 X i ! Γk
) + X i log p
i =1
which model would best fit a discrete process.
Thus, a series of simulation experiments was
n n
conducted to determine the effect of excess
− (k + X i ) log q − log(1 − q − k ) zeroes on selected models. These simulation
i =1 i =1
studies reflect how commonly used discrete
models behave if excess zeroes are present in a
Methods for Comparison of the Distributions:
set of data.
Goodness of Fit (GOF) Test
The GOF test determines whether a
Simulation Experiment Design
hypothesized distribution can be used as a model
A sample, X = {X1, X2, …, Xn}, was
for a particular population of interest. Common
obtained where data were generated from a
tests include the χ2, the Kolmogorov-Smirnov Poisson model with:
and the Anderson-Darling tests. The χ2 test can λ: 1.0, 1.5, 2.0 and 2.5;
be applied for discrete models; other tests tend n: 10, 20, 30, 50, 100, 150, 200; and
to be restricted to continuous models. The test 10%, 20%, 80% zeroes.
procedure for the χ2 GOF is simple: divide a set Different data sets for different sample sizes and
of data into a number of bins and the number of λs were generated to determine which model
points that fall into each bin is compared to the performs best if zeroes (10% to 80%) are present
expected number of points for those bins (if the in a dataset. To select the possible best model,
data are obtained from the hypothesized the Chi-square GOF statistic defined in (2.11)
distribution). More formally, suppose: for all models were calculated. If the test was
not statistically significant, then the data follows
H0: Data follows the specified population the specified (e.g., binomial, Poisson, or other)
distribution. population distribution. Tables 3.1-3.8 show the
H1: Data does not follow the specified GOF statistic values for all proposed
population distribution. distributions. Both small and large sample
behaviors were investigated for all models, and
If the data is divided into bins, then the test all calculations were carried out using the
statistic is: programming code MATLAB (Version 7.0).
s (O − E )
2
χ cal
2
= i i
(2.11)
i =1 Ei Results
Tables 3.1 to 3.8 show that the performance of
where Oi and Ei are the observed and expected the models depends on the sample size (n), λ
frequencies for bin i. The null, Ho, is rejected if and the percentage of zeroes included in the
χ cal
2
> χ df2 ,α , where degrees of freedom (df) is sample. It was observed that, as the percentage
of zeroes increases in the sample, the proportion
calculated as (s−1− # of parameters estimated) of over dispersion decreases and performance of
and α is the significance level. the binomial and Poisson distributions decrease.
For small sample sizes, most of the models fit
Methodology well; for large sample sizes, however, both the
Simulation Study binomial and Poisson performed poorly
Because the outcome of interest in many compared to others. For samples containing a
fields is discrete in nature and generally follows moderate to high percentage of zeroes, the
the binomial, Poisson or the negative binomial negative binomial performed best followed by
distribution. It is evident from the literature that ZIP and ZINB. Based on simulations, therefore,
these types of variables often contain a high in the presence of excess zeroes the negative
proportion of zeroes. These zeroes may be due binomial model and the ZIP (moment estimator
to either the presence of a population with only of parameters) model to approximate a real
zero counts and/or over-dispersion. Hence, it discrete process are recommended.
may be stated that - to capture the effect of
429
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
indicates over-dispersion; mom denotes moment estimator, and ml denotes MLE of model parameters.
430
BANIK & KIBRIA
431
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
indicates over-dispersion; mom denotes moment estimator, and ml denotes MLE of model parameters.
432
BANIK & KIBRIA
indicates over-dispersion; mom denotes moment estimator, and ml denotes MLE of model parameters.
433
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
indicates over-dispersion; mom denotes moment estimator, and ml denotes MLE of model parameters.
434
BANIK & KIBRIA
indicates over-dispersion; mom denotes moment estimator, and ml denotes MLE of model parameters.
435
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
436
BANIK & KIBRIA
indicates over-dispersion; mom denotes moment estimator, and ml denotes MLE of model parameters.
437
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
Applications to Real Data Sets cases, the negative binomial model can improve
The selected processes were fitted using statistical fit to the process. In the literature, it
theoretical principles and by understanding the has been suggested that over-dispersed processes
simulation outcomes. Theoretical explanations may also be characterized by excess zeroes
of a discrete process are reviewed as follows: (more zeroes than expected under the P(λ)
Generally, a process with two outcomes (see process) and zero-inflated models can be a
Lord, et al., 2005, for details) follows a statistical solution to fit these types of processes.
Bernoulli distribution. To be more specific,
consider a random variable, which is NOA. Each Traffic Accident Data
time a vehicle enters any type of entity (a trial) The first data set analyzed was the
on a given transportation network, it will either number of accidents (NOA) causing death in the
be involved in an accident or it will not. Dhaka district per month; NOA were counted
Thus, X~B(1,p), where p is the for each of 64 months for the period of January
probability of an accident when a vehicle enters 2003 to April 2008 and the data are presented in
any transportation network. In general, if n Table 4.1.1 and in Figure 4.1.1.
vehicles are passing through the transportation
network (n trials) they are considered records of
NOA in n trials, thus, X~B(n,p). However, it was Table 4.1.1: Probability
observed that the chance that a typical vehicle Distribution of NOA
will cause an accident is very small out when Observed
NOA
considering the millions of vehicles that enter a Months
transportation network (large number of n trials). 0 0.12
Therefore, a B(n,p) model for X is approximated 1 0.24
by a P(λ) model, where λ represents expected
2 0.17
number of accidents. This approximation works
well when λs are constant, but it is not 3 0.22
reasonable to assume that λ across drivers and 4 0.17
road segments are constant; in reality, this varies 5 0.05
with each driver-vehicle combination. 6 0
Considering NOA from different roads with 7 0.02
different probabilities of accidents for drivers,
Total 64 months
the distribution of accidents have often been
observed over-dispersed: if this occurs, P(λ) is
unlikely to show a good fit. In these
438
BANIK & KIBRIA
A total of 141 accidents occurred during the frequencies with GOF statistic values are
considered periods (see Figure 4.1.1) and the tabulated and shown in Table 4.1.2. The sample
rate of accidents per month was 2.2 (see Table mean and variance indicate that the data shows
4.1.2). Figure 4.1.1 also shows that since 2004 over dispersion; about 16% of zeroes are present
the rates have decreased. Main causes of road in the NOA data set. According to the GOF
accidents identified according to Haque, 2003 statistic, the binomial model fits poorly, whereas
include: rapid increase in the number of the Poisson and the negative binomial appear to
vehicles, more paved roads leading to higher fit. The excellent fits of different models are
speeds, poor driving and road use knowledge, illustrated in Figure 4.1.2; based on the figure
skill and awareness and poor traffic and the GOF statistic, the negative binomial
management. The observed and expected model was shown to best fit NOA.
21 20
20
10 2
0
2003 2004 2005 2006 2007 04/2008
439
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
20
# of months
15
10
0
1 2 3 4 5 6 7 8
Number of Patient Visits (NOPV) at Hospital working hour (see Table 4.2.2). Expected
The number of patient visits (NOPV) frequencies and GOF statistic values were
data were collected from the medical unit of the tabulated and are shown in Table 4.2.2 and
Dhaka BMSSU medical hospital for the period Figure 4.2.2 shows a bar chart of observed vs.
of 26 April 2007 to 23 July 2007, where the expected frequencies. Tabulated results and the
variable of interest is the total number of chart show that the negative binomial model and
patients visit in BMSSU per day. The frequency the ZTNB model (ZTNB model had the best fit)
distribution for NOPV is reported in Table 4.2.1, fit NOPV data well compared to other models.
which shows that the patients visiting rate per Based on this analysis, the ZTNB model is
day is 142.36; this equates to a rate of 14.23 per recommended to accurately fit NOPV per day.
Table 4.2.1: Frequency Figure 4.2.1: Trend to Visits in BMSSU per Day
Distribution of NOPV 250
Observed
NOPV 200
Days
No. of Patients
150
51-83 1
100
84-116 12
50
117-149 32
0
150-182 23 26.04.07 17.05.07 05.06.07 24.06.07 12.07.07
183-215 6
440
BANIK & KIBRIA
50
Number of Days
40
30
20
10
0
51-83 84-116 117-149 150-182 183-215
441
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
10
8
NEQ
0
1973 1978 1983 1988 1993 1998 2003 2008
5
Magnitude
0
2/11/73 14/10/86 28/5/92 6/1/98 27/7/03 14/2/07
442
BANIK & KIBRIA
GOF Statistic 1.9e+004* 7.7e+003* 97.72* 16.23 15.25 77.09* 83.67* 2.28e+005
7
No. of Years
5
4
3
2
0
0 1 2 3 4 5 6 7 8 9 10 11
443
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
Hartal (Strike) Data 413 hartals were observed and the monthly
The fourth variable is the number of hartal rate is 0.96 per month (see Table 4.4.2).
hartals (NOH) per month observed in Dhaka city Figure 4.4.1 shows the NOH for two periods:
from 1972 to 2007. Data from 1972 to 2000 was 1972-1990 (post-independence) and 1991-2007
collected from Dasgupta (2001) and from 2001- (parliamentary democracy). It has been observed
2007 was collected from the daily newspaper, that the NOH have not decreased since the
the Daily Star. Historically, the hartal Independence in 1971. Although there were
phenomenon has respectable roots in Ghandi’s relatively few hartals in the early years
civil disobedience against British colonialism following independence, the NOH began to rise
(the word hartal, derived from Gujarati, is sharply after 1981, with 101 hartals between
closing down shops or locking doors). In 1982 and 1990. Since 1991(during the
Bangladesh today, hartals are usually associated parliamentary democracy), the NOH have
with the stoppage of vehicular traffic, closure of continued to rise with 125 hartals occurring from
markets, shops, educational institutions and 1991-1996. Thus, the democratic periods (1991-
offices for a specific period of time to articulate 1996 and 2003-2007) have experienced by far
agitation (Huq, 1992). When collecting monthly the largest number of hartals. Lack of political
NOH data, care was taken to include all events stability was found to be the main cause for this
that were consistent with the above definition of higher frequency of hartals (for details, see
hartal (e.g., a hartal lasting 4 to 8 hours was Beyond Hartals, 2005, p. 11). From Table 4.4.1,
treated as a half-day hartal, 9 to 12 hours as a it may be observed that the hartal data contains
full-day hartal; for longer hartals, each 12 hour about 60% of zeroes. Table 4.4.2 indicates that
period was treated as a full-day hartal). NOH process displays over-dispersion with a
Historical patterns of hartals in Dhaka city, variance to mean > 1. According to data in this
NOH with respect to time are plotted in Figure study (Table 4.4.2), the negative binomial
4.4.1, and the frequency distribution of NOH is distribution to model NOH with 31% chance of
shown in Table 4.4.1. Between 1972 and 2007, hartal per month is recommended.
Figure 4.4.1: Total Hartals in Dhaka City: 1972-2007 Table 4.4.1: Frequency
140 Distribution of NOH
125
Number of
120 111 NOH
101 Parliamentary Months
Democracy
100 0 257
NOH
80 1 86
Post- 54
60 2 41
Independence
40 3 14
13
20 9 4 9
0 5 8
72-75 76-81 82-90 91-96 97-02 03-07
6 10
7 or More 7
444
BANIK & KIBRIA
250
No. of Months
200
150
100
50
0
0 1 2 3 4 5 6 7 8 9
445
DISCRETE DISTRIBUTIONS AND THEIR APPLICATIONS WITH REAL LIFE DATA
446
BANIK & KIBRIA
447