Journal of Applied Statistics

ISSN: 0266-4763 (Print) 1360-0532 (Online) Journal homepage: http://www.tandfonline.com/loi/cjas20

Generalized Poisson–Lindley linear model for count data

Weerinrada Wongrin & Winai Bodhisuwan

To cite this article: Weerinrada Wongrin & Winai Bodhisuwan (2016): Generalized
Poisson–Lindley linear model for count data, Journal of Applied Statistics, DOI:
10.1080/02664763.2016.1260095

To link to this article: http://dx.doi.org/10.1080/02664763.2016.1260095

Published online: 24 Nov 2016.



JOURNAL OF APPLIED STATISTICS, 2016
http://dx.doi.org/10.1080/02664763.2016.1260095

Generalized Poisson–Lindley linear model for count data


Weerinrada Wongrin and Winai Bodhisuwan
Department of Statistics, Kasetsart University, Bangkok, Thailand

ABSTRACT
The purpose of this paper is to develop a new linear regression model for count data, namely the generalized Poisson–Lindley (GPL) linear model. The GPL linear model is obtained by applying the generalized linear model framework to the GPL distribution. The model parameters are estimated by maximum likelihood estimation. We use the GPL linear model to fit two real data sets and compare it with the Poisson, negative binomial (NB) and Poisson-weighted exponential (P-WE) models for count data. We find that the GPL linear model can fit over-dispersed count data, and that it shows the highest log-likelihood and the smallest AIC and BIC values. Consequently, the linear regression model based on the GPL distribution is a valuable alternative to the Poisson, NB and P-WE models.

ARTICLE HISTORY
Received 1 December 2014
Accepted 19 October 2016

KEYWORDS
Count data; generalized Poisson–Lindley distribution; generalized linear model; maximum likelihood estimation; over-dispersion

CLASSIFICATION CODES
62J12

1. Introduction
The Poisson regression model is a standard model for count data. It assumes that the number of occurrences of an event is random and that its conditional mean equals its conditional variance (equi-dispersion) [1,2,8]. However, in real data the assumption that the conditional mean equals the conditional variance is often rejected. The inequality can arise in two ways. The first is under-dispersion, which occurs when the conditional variance is less than the conditional mean. The other is over-dispersion, where the conditional variance exceeds the conditional mean [12,21]. The over-dispersed case has been widely studied, as shown in the work of Karlis and Xekalaki [9].
One way to analyze over-dispersed count data is to use a mixed Poisson distribution. Several such distributions have been applied to count data, such as the Poisson-gamma or negative binomial (NB) distribution [7], the Poisson–Lindley distribution [18] and the negative binomial-Lindley (NB-L) distribution [10]. In addition, mixed Poisson distributions with auxiliary variables have been proposed to provide more appropriate models for predicting the behavior of a count response variable than the Poisson regression model [2,7].
Lord and Geedipally introduced an NB-L generalized linear model [10], and Cheng et al. introduced a Poisson–Weibull generalized linear model [3]. Both models were developed within the Bayesian framework and used to predict the number of accidents. In addition, many mixed Poisson regression models have been established in the generalized linear model framework, with the model parameters estimated by maximum likelihood estimation (MLE), for instance, the NB regression model [7], the Poisson-inverse Gaussian regression model [19], the generalized Waring regression model [16], the Poisson-normal or Hermite regression model [6] and the hyper-Poisson regression model [17]. In 2014, a Poisson-weighted exponential (P-WE) distribution and its regression model were presented by Zamani et al. [23]; that regression model was also formulated as a generalized linear model and fitted by MLE.

CONTACT Winai Bodhisuwan fsciwnb@ku.ac.th

© 2016 Informa UK Limited, trading as Taylor & Francis Group

The aim of this work is to apply the generalized linear model framework to construct a new linear regression model based on the generalized Poisson–Lindley (GPL) distribution, called the GPL linear model. The GPL distribution was introduced by Mahmoudi and Zakerzadeh in 2010 [11]. It is a mixed Poisson distribution, obtained by mixing the Poisson distribution with a generalized Lindley (GL) distribution [22], and it provides a flexible model for over-dispersed count data. The parameters of the model are estimated by MLE, and the results are compared with those of some traditional regression models for count data, namely the Poisson, NB and P-WE regression models. The GPL linear model is an alternative model for explaining the relationship between a count response and a set of covariates, and it gave the best fit among the models compared in our applications.
In this paper, the mixed Poisson distribution underlying the model is described in Section 2. A new linear regression model for count data based on the GPL distribution is formulated in Section 3. Section 4 presents statistical inference for the parameter estimators. In Section 5 we apply the GPL linear model to real data sets and report the data summaries and model performance. Finally, conclusions are given in Section 6.

2. GPL distribution
The Poisson distribution is a basic distribution in count data analysis. If a random variable Y follows the Poisson distribution with parameter λ, its probability mass function (pmf) is

\[
p(y; \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!}, \quad y = 0, 1, 2, \ldots, \quad \lambda > 0, \tag{1}
\]

and consequently E(Y) = λ = Var(Y).
Zakerzadeh and Dolati [22] introduced the GL distribution with parameters α, γ and θ, denoted GL(α, γ, θ), whose probability density function (pdf) is

\[
g(\lambda; \alpha, \gamma, \theta) = \frac{\theta^{\alpha+1} \lambda^{\alpha-1} (\alpha + \gamma\lambda)\, e^{-\theta\lambda}}{(\theta + \gamma)\,\Gamma(\alpha + 1)}, \quad \lambda > 0, \quad \alpha, \gamma, \theta > 0,
\]

where $\Gamma(k) = \int_0^\infty t^{k-1} e^{-t}\,dt$ stands for the gamma function.
Mahmoudi and Zakerzadeh [11] focused on λ distributed as the GL with γ = 1, that is, λ ∼ GL(α, γ = 1, θ), so that the pdf of λ can be written as

\[
g(\lambda; \alpha, \gamma = 1, \theta) = \frac{\theta^{\alpha+1} \lambda^{\alpha-1} (\alpha + \lambda)\, e^{-\theta\lambda}}{(\theta + 1)\,\Gamma(\alpha + 1)}, \quad \lambda > 0, \quad \alpha, \theta > 0. \tag{2}
\]

Then

\[
E(\lambda) = \frac{\alpha(\theta+1)+1}{\theta(\theta+1)}, \qquad
\mathrm{Var}(\lambda) = \frac{(\alpha+1)(\alpha(\theta+1)+2)}{\theta^{2}(\theta+1)} - \left[\frac{\alpha(\theta+1)+1}{\theta(\theta+1)}\right]^{2}.
\]
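
As a worked check (our own derivation from the pdf in Equation (2), not reproduced from [11]), the mean of λ follows from two gamma integrals, using $\int_0^\infty \lambda^{k-1} e^{-\theta\lambda}\,d\lambda = \Gamma(k)/\theta^{k}$:

```latex
\begin{align*}
E(\lambda)
  &= \frac{\theta^{\alpha+1}}{(\theta+1)\,\Gamma(\alpha+1)}
     \int_0^\infty \left(\alpha\lambda^{\alpha} + \lambda^{\alpha+1}\right) e^{-\theta\lambda}\,d\lambda \\
  &= \frac{\theta^{\alpha+1}}{(\theta+1)\,\Gamma(\alpha+1)}
     \left[\frac{\alpha\,\Gamma(\alpha+1)}{\theta^{\alpha+1}}
           + \frac{\Gamma(\alpha+2)}{\theta^{\alpha+2}}\right] \\
  &= \frac{1}{\theta+1}\left[\alpha + \frac{\alpha+1}{\theta}\right]
   = \frac{\alpha(\theta+1)+1}{\theta(\theta+1)}.
\end{align*}
```

The second moment is obtained in the same way with one more power of λ, which gives the first term of Var(λ).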
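Hence, the marginal pmf of Y ∼ GPL(α, γ = 1, θ) is

\[
\begin{aligned}
f(y; \alpha, \gamma = 1, \theta)
  &= \int_0^\infty p(y; \lambda)\, g(\lambda; \alpha, \gamma = 1, \theta)\, d\lambda \\
  &= \frac{\Gamma(y+\alpha)\,\theta^{\alpha+1}\left(\alpha + \dfrac{y+\alpha}{\theta+1}\right)}{y!\,\Gamma(1+\alpha)\,(\theta+1)^{y+\alpha+1}}.
\end{aligned} \tag{3}
\]

Some specific properties of the GPL distribution, such as the mean and variance, are provided in [11].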
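
The pmf in Equation (3) can be evaluated numerically. The following is a minimal Python sketch, not part of the original analysis; the function names and the parameter values α = 2 and θ = 0.5 are our own illustrative assumptions. It works on the log scale to avoid overflow in the gamma functions, then checks that the probabilities sum to one and that the numerical mean agrees with E(λ).

```python
import math


def gpl_log_pmf(y, alpha, theta):
    """Log of the GPL pmf in Equation (3), with gamma fixed at 1."""
    if alpha <= 0 or theta <= 0:
        raise ValueError("alpha and theta must be positive")
    return (
        math.lgamma(y + alpha)
        - math.lgamma(y + 1)              # log(y!)
        - math.lgamma(alpha + 1)
        + (alpha + 1) * math.log(theta)
        - (y + alpha + 1) * math.log(theta + 1)
        + math.log(alpha + (y + alpha) / (theta + 1))
    )


def gpl_pmf(y, alpha, theta):
    return math.exp(gpl_log_pmf(y, alpha, theta))


# Illustrative parameter values (not taken from the paper).
alpha, theta = 2.0, 0.5
probs = [gpl_pmf(y, alpha, theta) for y in range(400)]

total = sum(probs)                                   # should be close to 1
mean = sum(y * p for y, p in enumerate(probs))       # numerical E(Y)
mean_formula = (alpha * (theta + 1) + 1) / (theta * (theta + 1))  # E(lambda)
```

Truncating the support at 400 is an assumption that is harmless here because the tail probabilities decay geometrically; heavier-tailed parameter choices may need a longer range.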

3. GPL linear model


For count data, the GPL distribution is a valuable alternative to the traditional Poisson and NB distributions [11]. However, the GPL variable has not yet been related to covariates. In this study, we embed the GPL distribution in the generalized linear model framework to obtain the GPL linear model.
The generalized linear model extends the classical linear regression model to cases where the response variable is not continuous, for example when the response is a count. We therefore have to specify a link function, which can be any monotonic, invertible and differentiable function that relates the mean of the response variable (E(Y) > 0) to the linear predictor Xβ ∈ R^n [1,2,12].
For the GPL linear model, the log link is used for the mean. The vector-valued link function is defined as η = g(μ), where η_i = g(μ_i) = log(μ_i) = x_i^T β and x_i^T is the ith row of the n × (k + 1) design matrix X. Therefore, E(Y_i) = exp(x_i^T β), that is, the mean of the response variable equals the exponential of the linear predictor [5,12].

Proposition 3.1: Let $Y_i \mid x_i^T$ be a response variable given a set of covariates $x_i^T$. Suppose the conditional distribution of $Y_i$ given $x_i^T$ is the GPL(α, γ = 1, θ) distribution with parameters α > 0 and θ > 0, that is, $Y_i \mid x_i^T \sim$ GPL(α, γ = 1, θ). Then the pmf of $Y_i \mid x_i^T$ can be written as

\[
f(y_i \mid x_i^T) = \frac{\Gamma(y_i+\alpha_i)}{y_i!\,\Gamma(1+\alpha_i)} \cdot \frac{\theta^{\alpha_i+1}}{(\theta+1)^{y_i+\alpha_i+1}} \times \left(\alpha_i + \frac{y_i+\alpha_i}{\theta+1}\right),
\qquad \alpha_i = \frac{e^{x_i^T\beta}\,\theta(\theta+1) - 1}{\theta+1}. \tag{4}
\]

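Proof: If $Y_i \mid x_i^T$ has, conditionally on $\lambda_i$, the pmf in Equation (1) and $\lambda_i$ has the pdf in Equation (2), then the marginal pmf of $Y_i \mid x_i^T \sim$ GPL(α, γ = 1, θ) can be expressed as in Equation (3).
Furthermore, the mean of the GPL distribution is

\[
E(Y_i \mid x_i^T) = E\big[E[(Y_i \mid x_i^T) \mid \lambda_i]\big]
 = \frac{\alpha_i(\theta+1)+1}{\theta(\theta+1)} = \mu_i = e^{x_i^T\beta},
\]

where $\alpha_i = (e^{x_i^T\beta}\,\theta(\theta+1) - 1)/(\theta+1)$ is obtained by parameterizing the mean of the GPL distribution. Because $\alpha_i$ must be positive, this parameterization requires $e^{x_i^T\beta}\,\theta(\theta+1) > 1$.
The variance of the GPL distribution can be decomposed as

\[
\begin{aligned}
\mathrm{Var}(Y_i \mid x_i^T)
 &= E\big[\mathrm{Var}[(Y_i \mid x_i^T) \mid \lambda_i]\big] + \mathrm{Var}\big[E[(Y_i \mid x_i^T) \mid \lambda_i]\big] \\
 &= E\big[E[(Y_i \mid x_i^T) \mid \lambda_i]\big] + \mathrm{Var}(\lambda_i) \\
 &= \mu_i + \frac{(\alpha_i+1)(\alpha_i(\theta+1)+2)}{\theta^{2}(\theta+1)} - \mu_i^{2}.
\end{aligned}
\]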

Therefore, the pmf of $Y_i \mid x_i^T \sim$ GPL(α, γ = 1, θ) can be expressed in the form of a linear model with a log-link function by substituting $\alpha_i = (e^{x_i^T\beta}\,\theta(\theta+1) - 1)/(\theta+1)$ into Equation (3):

\[
f(y_i \mid x_i^T) = \frac{\Gamma(y_i+\alpha_i)}{y_i!\,\Gamma(1+\alpha_i)} \cdot \frac{\theta^{\alpha_i+1}}{(\theta+1)^{y_i+\alpha_i+1}} \times \left(\alpha_i + \frac{y_i+\alpha_i}{\theta+1}\right).
\]
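
The mean parameterization can be illustrated with a short Python sketch. It is not the authors' code: the function names, the coefficient vector, the covariate values and θ = 0.5 are hypothetical choices made only to show how α_i is recovered from the linear predictor and how the conditional variance compares with the conditional mean.

```python
import math


def alpha_from_mean(mu, theta):
    """alpha_i implied by the mean mu_i = exp(x_i' beta) and theta."""
    alpha = (mu * theta * (theta + 1) - 1) / (theta + 1)
    if alpha <= 0:
        raise ValueError("need mu * theta * (theta + 1) > 1 so that alpha > 0")
    return alpha


def gpl_mean(alpha, theta):
    """Mean of the GPL distribution, equal to E(lambda)."""
    return (alpha * (theta + 1) + 1) / (theta * (theta + 1))


def gpl_variance(mu, theta):
    """Var(Y_i | x_i) = mu_i + Var(lambda_i) under the mean parameterization."""
    alpha = alpha_from_mean(mu, theta)
    second_moment = (alpha + 1) * (alpha * (theta + 1) + 2) / (theta**2 * (theta + 1))
    return mu + second_moment - mu**2


# Hypothetical coefficients and covariates, chosen only for illustration.
beta = [1.0, 0.5]          # intercept and one slope
x = [1.0, 1.0]             # the leading 1.0 is the intercept column
theta = 0.5

mu = math.exp(sum(b * v for b, v in zip(beta, x)))   # log link
alpha = alpha_from_mean(mu, theta)
variance = gpl_variance(mu, theta)
```

Since Var(λ_i) is positive, the computed variance exceeds the mean for any admissible (μ_i, θ), which is the over-dispersion the model is designed to capture.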

4. Model estimation
In this section, the estimation of the model parameters is derived. We also discuss the asymptotic properties of the estimated parameters of the model.

4.1. Maximum likelihood estimation


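The regression coefficients (β) and the distribution parameter (θ) are estimated by maximizing the log-likelihood function of the parameters, ℓ(Θ); the resulting estimator is called the MLE. Let $\Theta = (\beta^T, \theta)^T$ be the vector of parameters and write $\alpha_i = (e^{x_i^T\beta}\,\theta(\theta+1) - 1)/(\theta+1)$. Then the log-likelihood function is

\[
\begin{aligned}
\ell(\Theta) ={}& n \log\left(\frac{\theta}{\theta+1}\right)
 + \sum_{i=1}^{n} \log\Gamma(y_i + \alpha_i)
 - \sum_{i=1}^{n} \log\Gamma(1 + \alpha_i)
 - \sum_{i=1}^{n} \log y_i! \\
&+ \sum_{i=1}^{n} \alpha_i \log\left(\frac{\theta}{\theta+1}\right)
 - \sum_{i=1}^{n} (y_i + 1)\log(\theta+1)
 + \sum_{i=1}^{n} \log\left(e^{x_i^T\beta}\,\theta(\theta+1) - 1 + y_i + \alpha_i\right).
\end{aligned}
\]

Differentiating ℓ(Θ) with respect to each parameter gives the score functions

\[
\frac{\partial \ell(\Theta)}{\partial \beta} = \sum_{i=1}^{n} x_i^T\, \theta e^{x_i^T\beta}
\left\{ \psi(y_i + \alpha_i) - \psi(1 + \alpha_i) + \log\left(\frac{\theta}{\theta+1}\right)
 + \frac{\theta+2}{e^{x_i^T\beta}\,\theta(\theta+1) - 1 + y_i + \alpha_i} \right\}
\]

and

\[
\begin{aligned}
\frac{\partial \ell(\Theta)}{\partial \theta} ={}& \frac{n}{\theta(\theta+1)}
 + \sum_{i=1}^{n} \Bigg\{
 \frac{\left((\theta+1)^{2} e^{x_i^T\beta} + 1\right)\psi(y_i + \alpha_i)}{(\theta+1)^{2}}
 - \frac{\left((\theta+1)^{2} e^{x_i^T\beta} + 1\right)\psi(1 + \alpha_i)}{(\theta+1)^{2}} \\
&+ \frac{(\theta+1)\theta e^{x_i^T\beta} + \theta \log\left(\frac{\theta}{\theta+1}\right)\left((\theta+1)^{2} e^{x_i^T\beta} + 1\right) - 1}{\theta(\theta+1)^{2}}
 - \frac{y_i + 1}{\theta+1} \\
&+ \frac{2 e^{x_i^T\beta}(\theta+1) + (\theta+1)^{-2}}{e^{x_i^T\beta}\,\theta(\theta+1) - 1 + y_i + \alpha_i}
 \Bigg\},
\end{aligned}
\]

where $\psi(k) = \Gamma'(k)/\Gamma(k)$ is the digamma function, $1 + \alpha_i = \theta\left(e^{x_i^T\beta}(\theta+1) + 1\right)/(\theta+1)$, and $\left((\theta+1)^{2} e^{x_i^T\beta} + 1\right)/(\theta+1)^{2}$ is the derivative of $\alpha_i$ with respect to θ.

Setting the score functions ∂ℓ(Θ)/∂β = 0 and ∂ℓ(Θ)/∂θ = 0 and solving the resulting nonlinear equations by a numerical method, or searching numerically and directly for the maximum of the log-likelihood surface, yields the maximum likelihood estimators.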
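
The log-likelihood and its score functions can be cross-checked numerically. The sketch below is our own Python illustration, not the authors' implementation: it codes the log-likelihood as a sum of log pmf values, codes the score functions for β and θ with α_i = (exp(x_i^T β)θ(θ + 1) − 1)/(θ + 1), and compares them with central finite differences. The digamma helper, the tiny data set and the parameter values are assumptions introduced only for this check.

```python
import math


def digamma(x):
    """Digamma function via recurrence plus an asymptotic series (x > 0)."""
    result = 0.0
    while x < 6.0:                      # shift the argument upward
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def alpha_i(mu, theta):
    return (mu * theta * (theta + 1) - 1) / (theta + 1)


def loglik(beta, theta, X, y):
    """GPL log-likelihood under the log link (sum of log pmf values)."""
    total = 0.0
    for xi, yi in zip(X, y):
        mu = math.exp(sum(b * v for b, v in zip(beta, xi)))
        a = alpha_i(mu, theta)
        if a <= 0:
            return -math.inf            # outside the parameter space
        total += (math.lgamma(yi + a) - math.lgamma(yi + 1) - math.lgamma(a + 1)
                  + (a + 1) * math.log(theta) - (yi + a + 1) * math.log(theta + 1)
                  + math.log(a + (yi + a) / (theta + 1)))
    return total


def score(beta, theta, X, y):
    """Analytic score: (list of d/d beta_j, d/d theta)."""
    g_beta = [0.0] * len(beta)
    g_theta = 0.0
    log_ratio = math.log(theta / (theta + 1))
    for xi, yi in zip(X, y):
        mu = math.exp(sum(b * v for b, v in zip(beta, xi)))
        a = alpha_i(mu, theta)
        A = mu * theta * (theta + 1) - 1 + yi + a
        d_psi = digamma(yi + a) - digamma(1 + a)
        common = theta * mu * (d_psi + log_ratio + (theta + 2) / A)
        for j, v in enumerate(xi):
            g_beta[j] += v * common
        da = ((theta + 1) ** 2 * mu + 1) / (theta + 1) ** 2   # d alpha_i / d theta
        g_theta += (1 / (theta * (theta + 1))
                    + da * d_psi
                    + ((theta + 1) * theta * mu
                       + theta * log_ratio * ((theta + 1) ** 2 * mu + 1) - 1)
                    / (theta * (theta + 1) ** 2)
                    - (yi + 1) / (theta + 1)
                    + (2 * mu * (theta + 1) + (theta + 1) ** -2) / A)
    return g_beta, g_theta


def shifted(vec, j, d):
    """Copy of vec with d added to component j."""
    return [v + d if k == j else v for k, v in enumerate(vec)]


# Tiny made-up data set: intercept plus one binary covariate.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
y = [3, 9, 5, 14, 7]
beta, theta, h = [1.2, 0.6], 0.7, 1e-6

g_beta, g_theta = score(beta, theta, X, y)
num_beta = [(loglik(shifted(beta, j, h), theta, X, y)
             - loglik(shifted(beta, j, -h), theta, X, y)) / (2 * h)
            for j in range(len(beta))]
num_theta = (loglik(beta, theta + h, X, y) - loglik(beta, theta - h, X, y)) / (2 * h)
```

A check of this kind is a cheap safeguard before passing analytic gradients to an optimizer, since a sign or factor error in a score function can otherwise go unnoticed.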
In this study, we use the optim() function of the R programming language [15] to find local minima of the negative log-likelihood. The optim() function provides basic optimization capabilities and is among the most widely used functions in R [13,20]. It offers several algorithms, including Nelder–Mead, quasi-Newton, conjugate-gradient and simulated annealing. The parameter estimates are obtained from several initial values to reduce the risk of stopping at a local rather than the global optimum. Differences in the estimates have been detected only when the estimates lie on the boundary of the parameter space, because they are usually very large there [16].
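
The multi-start strategy can be sketched as follows. This Python example is an illustration under our own assumptions and does not reproduce the R optim() calls used in the study: it fits an intercept-only GPL model to made-up counts, replaces the optim() algorithms with a simple derivative-free compass search, and works with log θ so that θ stays positive. The function names, starting values and data are hypothetical.

```python
import math


def neg_loglik(params, y):
    """Negative GPL log-likelihood for an intercept-only model.

    params = (b0, log_theta); the mean is exp(b0) and theta = exp(log_theta).
    """
    b0, log_theta = params
    if not (-20 < b0 < 20 and -20 < log_theta < 20):
        return math.inf                 # keep the search in a bounded box
    mu, theta = math.exp(b0), math.exp(log_theta)
    a = (mu * theta * (theta + 1) - 1) / (theta + 1)
    if a <= 0:
        return math.inf                 # outside the parameter space
    ll = 0.0
    for yi in y:
        ll += (math.lgamma(yi + a) - math.lgamma(yi + 1) - math.lgamma(a + 1)
               + (a + 1) * math.log(theta) - (yi + a + 1) * math.log(theta + 1)
               + math.log(a + (yi + a) / (theta + 1)))
    return -ll


def compass_search(f, start, step=0.5, tol=1e-6, max_iter=10_000):
    """Derivative-free coordinate search; it never accepts a worse point."""
    x, fx = list(start), f(start)
    for _ in range(max_iter):
        improved = False
        for j in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[j] += delta
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
        if not improved:
            step /= 2.0                 # refine the mesh around the current point
            if step < tol:
                break
    return x, fx


y = [0, 1, 1, 2, 3, 3, 4, 6, 9, 15]     # made-up over-dispersed counts
starts = [(1.0, 0.0), (2.0, -1.0), (2.5, 1.0), (0.5, 0.5)]

fits = [compass_search(lambda p: neg_loglik(p, y), s) for s in starts]
best_params, best_nll = min(fits, key=lambda fit: fit[1])
mu_hat, theta_hat = math.exp(best_params[0]), math.exp(best_params[1])
```

Comparing the final objective values across starting points gives a rough indication of whether the runs agree; runs that end at clearly different values suggest several local optima or a solution on the boundary.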

4.2. Asymptotics of the maximum likelihood estimators

The maximum likelihood estimators are consistent and asymptotically efficient. Let $\hat{\Theta} = (\hat{\beta}^T, \hat{\theta})^T$ be the maximum likelihood estimator of Θ. The sampling distribution of $\hat{\beta}$ can be found from the asymptotic normality of the maximum likelihood estimators as n tends to ∞ [12,21].

Theorem 4.1: Let $\{\hat{\beta}_n\}$ be a sequence of independent and identically distributed (iid) random vectors with $E(\hat{\beta}_n) = \beta$ and covariance matrix $I^{-1}(\hat{\beta})$ satisfying $0 < I^{-1}(\hat{\beta}) < \infty$. Then, for n large,

\[
\sqrt{n}\,(\hat{\beta}_n - \beta) \xrightarrow{d} N\big(0, I^{-1}(\hat{\beta})\big)
\]

and hence

\[
\sqrt{n}\, I^{1/2}(\hat{\beta})\,(\hat{\beta}_n - \beta) \xrightarrow{d} N\big(0, I_{(k+1)}\big),
\]

where $I(\hat{\beta})$ is the observed information matrix of $\hat{\beta}$ under MLE and $I_{(k+1)}$ is the (k + 1) × (k + 1) identity matrix [14,21].

The observed information matrix of $\hat{\beta}$ is presented in the appendix.

Thus, the standardized estimator of β is asymptotically standard normal, and the test of H0: β = 0 against H1: β ≠ 0 is performed with a z-test.
The performance of each model is assessed with the log-likelihood, Akaike's information criterion, AIC = −2ℓ + 2p, and the Bayesian information criterion, BIC = −2ℓ + log(n)p, where ℓ is the maximized log-likelihood, p is the number of parameters in the specified model and n is the sample size.
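
The criteria and the z-test are simple to compute. The Python sketch below is our own illustration: it evaluates AIC and BIC from the log-likelihoods reported in Table 3 and a two-sided z-test from one estimate and standard error in Table 5. The parameter counts p = 5 and p = 6 are our assumptions (five regression coefficients, plus θ for the GPL model); they are not stated explicitly in the text.

```python
import math


def aic(loglik, p):
    return -2.0 * loglik + 2.0 * p


def bic(loglik, p, n):
    return -2.0 * loglik + math.log(n) * p


def z_test(estimate, std_error):
    """Two-sided z-test of H0: coefficient = 0; returns (z, p-value)."""
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Log-likelihoods reported in Table 3 (Arizona data, n = 3589).
n = 3589
aic_poisson = aic(-11189.9, p=5)        # intercept + 4 covariates (assumed count)
bic_poisson = bic(-11189.9, p=5, n=n)
aic_gpl = aic(-9970.6, p=6)             # 5 coefficients + theta (assumed count)

# GPL estimate and standard error for SEX from Table 5.
z_sex, p_sex = z_test(-0.1132, 0.0177)
```

With these assumed parameter counts the computed values agree with the tabulated AIC and BIC up to rounding of the reported log-likelihoods.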

5. Applications
We consider two applications of the GPL linear model to over-dispersed data. The first data set comes from the 1991 Arizona cardiovascular patient files (AZPRO), which include 3589 observations [7]. We model the length of stay (LOS) of patients entering the hospital to receive one of two standard cardiovascular procedures (PROCEDURE): Coronary Artery Bypass Graft (CABG) and Percutaneous Transluminal Coronary Angioplasty (PTCA). The LOS is explained by the covariates PROCEDURE, sex (SEX), type of admission (ADMIT) and age (AGE). Figure 1(a) shows the distribution of LOS, which indicates over-dispersion (mean = 8.831 and variance = 47.973).
The other data set is the United States National Medical Expenditure Survey 1987 and 1988 (NMES) data from Deb and Trivedi [4]. The NMES data have been used to model the demand for medical care, captured by the number of physician office visits and the number of hospital outpatient visits. For illustration, the response variable, the number of physician office visits (OFP), is modeled by the covariates number of hospital stays (HOSP), number of chronic conditions (NUMCHRON), a condition that limits activities of daily living (ADLDIFF), age (AGE), sex (SEX), marital status (MARIED), family income in $10,000 (FAMINC), employment status (EMPLOYED), Medicaid indicator (MEDICAID), private insurance indicator (PRIVINS) and self-perceived health status (POORHLTH and EXCLHLTH).
The distribution of OFP is presented in Figure 1(b), which shows that OFP is over-dispersed (mean = 5.774 and variance = 45.687).
The summary statistics for the AZPRO and NMES data are shown in Tables 1 and 2, respectively.

[Figure 1: two histograms; y-axis: frequency; x-axis: length of stay (days) in panel (a) and number of physician office visits in panel (b).]

Figure 1. (a) The distribution of length of stay for patients entering the hospital and (b) the distribution of the number of physician office visits.

Table 1. Summary statistics for the Arizona cardiovascular patient data set (n = 3589).
Variables                          Min   Max   Median
LOS                                1     83    8
PROCEDURE (1 = CABG, 0 = PTCA)     0     1     8
SEX (1 = male, 0 = female)         0     1     1
ADMIT (1 = urgency, 0 = other)     0     1     1
AGE (1 = > 75, 0 = other)          0     1     0
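
The degree of over-dispersion can be summarized by the variance-to-mean ratio. The short Python sketch below is our own illustration using only the sample means and variances quoted in the text; the function name is hypothetical.

```python
def dispersion_index(mean, variance):
    """Variance-to-mean ratio; values above 1 indicate over-dispersion."""
    return variance / mean


# Sample moments quoted in the text.
los_index = dispersion_index(8.831, 47.973)    # AZPRO length of stay
ofp_index = dispersion_index(5.774, 45.687)    # NMES physician office visits
```

Both ratios are well above 1, the value implied by the Poisson assumption of equi-dispersion.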

5.1. Results
This section presents the modeling results for the GPL, Poisson, NB and P-WE models on the two observed data sets. We use the enter method to include the covariates in the model, and then refit the model using the significant covariates.

Table 2. Summary statistics for the US National Medical Expenditure Survey data set (n = 4406).
Variables                                                  Min      Max     Median
OFP                                                        0        89      4
HOSP                                                       0        8       0
NUMCHRON                                                   0        8       1
ADLDIFF                                                    0        1       0
AGE (divided by 10)                                        6.6      10.9    7.3
SEX (1 = male, 0 = female)                                 0        1       0
MARIED (1 = yes, 0 = no)                                   0        1       1
FAMINC (in 10000$)                                         −1.012   54.84   1.6980
EMPLOYED (1 = yes, 0 = no)                                 0        1       0
PRIVINS (1 = yes, 0 = no)                                  0        1       1
MEDICAID (1 = yes, 0 = no)                                 0        1       0
HLTHEXCELLENT (self-perceived health status: excellent)    0        1       0
HLTHPOOR (self-perceived health status: poor)              0        1       0

Table 3. The performance of the models for the Arizona cardiovascular patient data set.
Criterion        Poisson      NB           P-WE         GPL
log-likelihood   −11,189.9    −10,578.9    −10,428.6    −9970.6
AIC              22,389.8     21,169.8     20,869.3     19,953.2
BIC              22,420.7     21,206.9     20,906.4     19,990.4

Table 4. The performance of the models for the US National Medical Expenditure Survey data set.
Criterion        Poisson      NB           P-WE         GPL
log-likelihood   −17,993.7    −12,199.4    −12,183.9    −12,164.2
AIC              36,013.4     24,416.9     24,385.8     24,346.5
BIC              36,096.4     24,474.4     24,443.3     24,404.0

Table 5. Modeling results for the Arizona cardiovascular patient data set (estimates with standard errors in parentheses, followed by p-values).
Covariates   Poisson                     NB                          P-WE                        GPL
             Estimate (s.e.)   p-Value   Estimate (s.e.)   p-Value   Estimate (s.e.)   p-Value   Estimate (s.e.)   p-Value
Intercept    1.4558 (0.0158)   < 0.001   1.0780 (0.0298)   < 0.001   1.4558 (0.0345)   < 0.001   1.4812 (0.0234)   < 0.001
PROCEDURE    0.9606 (0.0122)   < 0.001   1.0866 (0.0243)   < 0.001   0.9606 (0.0271)   < 0.001   0.9410 (0.0181)   < 0.001
SEX          −0.1240 (0.0118)  < 0.001   0.0724 (0.0249)   0.003     −0.1240 (0.0284)  < 0.001   −0.1132 (0.0177)  < 0.001
ADMIT        0.3266 (0.0121)   < 0.001   0.5319 (0.0249)   < 0.001   0.3266 (0.0281)   < 0.001   0.2998 (0.0179)   < 0.001
AGE          0.1224 (0.0124)   < 0.001   0.3161 (0.0273)   < 0.001   0.1224 (0.0302)   < 0.001   0.1175 (0.0186)   < 0.001

For the AZPRO data, we construct a predictive model for the length of stay of patients entering the hospital from the covariates. Comparing the models on the log-likelihood, AIC and BIC shows that the GPL model gives the best fit to the AZPRO data (Table 3). The estimated parameters and fitted models are summarized in Table 5. The expected length of stay increases if patients receive the Coronary Artery Bypass Graft procedure, are admitted as urgent cases, or are older. In the GPL and Poisson models, women stay in the hospital longer than men. We therefore predict the length of stay with the GPL linear model, and the predictive model can be written as

\[
\hat{\mu}_i = \exp(1.4812 + 0.9410\,\mathrm{CABG}_i - 0.1132\,\mathrm{SEX}_i + 0.2998\,\mathrm{ADMIT}_i + 0.1175\,\mathrm{AGE}_i).
\]
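
The fitted equation can be evaluated directly. The Python sketch below is our own illustration using the GPL coefficients reported for the Arizona data; the function name and the patient profiles are hypothetical. It also shows that, under the log link, exp(coefficient) is the multiplicative change in the expected length of stay.

```python
import math

# GPL coefficients reported for the Arizona data (Table 5).
coef = {"Intercept": 1.4812, "CABG": 0.9410, "SEX": -0.1132,
        "ADMIT": 0.2998, "AGE": 0.1175}


def predict_los(cabg, sex, admit, age):
    """Expected length of stay under the log link; all inputs are 0/1."""
    eta = (coef["Intercept"] + coef["CABG"] * cabg + coef["SEX"] * sex
           + coef["ADMIT"] * admit + coef["AGE"] * age)
    return math.exp(eta)


# Hypothetical patient profiles, chosen only to illustrate the formula.
baseline = predict_los(cabg=0, sex=0, admit=0, age=0)     # exp(intercept)
with_cabg = predict_los(cabg=1, sex=0, admit=0, age=0)

# Under a log link each coefficient acts multiplicatively on the mean.
rate_ratios = {name: math.exp(b) for name, b in coef.items() if name != "Intercept"}
```

For example, the ratio with_cabg / baseline equals rate_ratios["CABG"], so the CABG coefficient scales the expected length of stay by the same factor whatever the values of the other covariates.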



Table 6. Modeling results for the US National Medical Expenditure Survey data set.
Poisson NB P-WE GPL

Estimate Estimate Estimate Estimate


Covariates (s.e.) p-Value (s.e.) p-Value (s.e.) p-Value (s.e.) p-Value
Intercept 1.6156 < 0.001 1.0219 < 0.001 1.0350 < 0.001 1.1637 < 0.001
(0.0836) (0.0406) (0.0471) (0.0404)
HOSP 0.1610 < 0.001 0.2194 < 0.001 0.21426 < 0.001 0.1307 < 0.001
(0.006) (0.0210) (0.0216) (0.0130)
NUMCHRON 0.1444 < 0.001 0.1728 < 0.001 0.1803 < 0.001 0.1531 < 0.001
(0.0046) (0.0122) (0.0125) (0.0092)
ADLDIFF 0.0545 0.001 0.0974 0.007
(0.0168) (0.0362)
AGE −0.0642 < 0.001
(0.0108)
SEX −0.0890 < 0.001 −0.1136 < 0.001 −0.1203 < 0.001
(0.0143) (0.0318) (0.0259)
MARIED −0.0359 0.014
(0.0147)
FAMINC 0.0049 0.024
(0.0022)
EMPLOYED 0.0667 0.002
(0.0219)
PRIVINS 0.3636 < 0.001 0.3561 < 0.001 0.4108 < 0.001 0.3776 < 0.001
(0.0191) (0.0368) (0.0435) (0.0360)
MEDICAID 0.2438 < 0.001 0.2268 < 0.001 0.2453 < 0.001 0.2343 < 0.001
(0.0247) (0.0488) (0.0625) (0.0488)
HLTHEXCELLENT −0.3477 < 0.001 −0.1333 0.007 -0.3159 < 0.001 −0.1906 < 0.001
(0.0304) (0.0493) (0.0612) (0.0509)
HLTHPOOR 0.1994 < 0.001 0.1774 < 0.001 0.2363 < 0.001 0.0977 0.012
(0.0184) (0.0409) (0.0483) (0.0391)

Comparing the models for the NMES data, we find that the GPL model is the best, with the highest log-likelihood and the smallest AIC and BIC (Table 4). The fitted models for the number of physician office visits differ across the models in the covariates they retain (Table 6). In the Poisson model all covariates help to explain the response, whereas the NB, P-WE and GPL models account for the response with only some of the covariates. However, the estimated coefficients have the same signs in all models. The predictive model for the NMES data set can be written as

\[
\begin{aligned}
\hat{\mu}_i = \exp(&1.1637 + 0.1307\,\mathrm{HOSP}_i + 0.1531\,\mathrm{NUMCHRON}_i - 0.1203\,\mathrm{SEX}_i \\
&+ 0.3776\,\mathrm{PRIVINS}_i + 0.2343\,\mathrm{MEDICAID}_i - 0.1906\,\mathrm{HLTHEXCELLENT}_i \\
&+ 0.0977\,\mathrm{HLTHPOOR}_i).
\end{aligned}
\]

Figures 2 and 3 show the fitted values for the AZPRO and NMES data, respectively, from the GPL linear model. Because the model uses a log link, each coefficient is the change in the log of the expected response, so exp(coefficient) is the multiplicative change in the expected response. Figure 2(a) shows that the log of the expected LOS increases by 0.9410 if the patient receives the CABG procedure, that is, the expected LOS is multiplied by exp(0.9410) ≈ 2.56. In Figure 2(b), the log of the expected LOS is 0.1132 larger for females than for males. The log of the expected LOS increases by 0.2998 for urgent admissions and by 0.1175 for older patients, as in Figure 2(c) and 2(d), respectively.
Figure 3(a) shows that the log of the expected OFP increases by 0.1307 for each additional hospital stay. It also increases with the number of chronic conditions, by 0.1531 for each additional condition, as in Figure 3(b). According to Figure 3(c) and 3(e), the log of the expected OFP decreases by 0.1203 for males and by 0.1906 for those with self-perceived excellent health, respectively. On the other hand, the private insurance indicator, the Medicaid indicator and self-perceived poor health increase the log of the expected OFP by 0.3776, 0.2343 and 0.0977, as shown in Figure 3(d), 3(f) and 3(g), respectively.
Figures 2 and 3 suggest that the estimates obtained from the optim() function fit the real data.

[Figure 2: four panels plotting LOS (0 to 80) against PROCEDURE (PTCA, CABG), SEX (female, male), ADMIT (other, urgency) and AGE (less than 75, greater or equal 75).]

Figure 2. The fitted lines of length of stay for patients entering the hospital with each covariate.

6. Conclusions
The new linear model based on the GPL distribution, namely the GPL linear model, provides a practical tool for modeling count data. It extends the Poisson regression model, which is widely used to model count data. The model is developed within the generalized linear model framework, and its parameters are estimated by MLE. The covariates are selected with the enter method, using z-tests to eliminate covariates that are not significant. We apply the GPL linear model to real data, the AZPRO and NMES data, and compare it with the Poisson, NB and P-WE regression models. In both applications the GPL linear model has the highest log-likelihood and the smallest AIC and BIC values. The GPL linear model therefore fits these count data better than the Poisson regression model and can be considered an alternative model for predicting count responses with covariates.

[Figure 3: seven panels plotting OFP (0 to 80) against HOSP (0 to 8), NUMCHRON (0 to 8), SEX (female, male), PRIVINS (no, yes), MEDICAID (no, yes), HLTH (excellent, other) and HLTH (other, poor).]

Figure 3. The fitted lines of the number of physician office visits with each covariate.

Acknowledgments
This paper was made possible by the Department of Statistics, Faculty of Science, Kasetsart University, the Graduate School of Kasetsart University and the Science Achievement Scholarship of Thailand (SAST).
The authors are indebted to the referees for helpful comments made on an earlier version of this
paper and for many inspiring discussions.

Disclosure statement
No potential conflict of interest was reported by the authors.

References
[1] A.C. Cameron and P.K. Trivedi, Regression Analysis of Count Data, Cambridge University Press,
Cambridge, 1998.
[2] A.C. Cameron and P.K. Trivedi, Regression Analysis of Count Data, 2nd ed., Cambridge
University Press, Cambridge, 2013.
[3] L. Cheng, S.R. Geedipally, and D. Lord, The Poisson-Weibull generalized linear model for
analyzing motor vehicle crash data, Safety Sci. 54 (2013), pp. 38–42.
[4] P. Deb and P.K. Trivedi, Demand for medical care by the elderly: A finite mixture approach, J.
Appl. Econometrics 12 (1997), pp. 313–336. Special Issue: Econometric Models.
[5] E.G. Déniz, A new discrete distribution: Properties and applications in medical care, J. Appl. Stat.
40 (2013), pp. 2760–2770. doi:10.1080/02664763.2013.827161.
[6] D.E. Giles, Hermite regression analysis of multi-modal count data, Tech. Rep., Econometrics
Working Paper EWP1001, Department of Economics, University of Victoria, Canada, 2010.
[7] J.M. Hilbe, Negative Binomial Regression, 2nd ed., Cambridge University Press, Cambridge,
2011.
[8] N.L. Johnson, A.W. Kemp, and S. Kotz, Univariate Discrete Distributions, 3rd ed., Wiley Series
in Probability and Statistics, John Wiley & Sons, Inc., Hoboken, 2005.
[9] D. Karlis and E. Xekalaki, Mixed Poisson distributions, Int. Statist. Rev. 73 (2005), pp. 35–58.
[10] D. Lord and S.R. Geedipally, The negative binomial-Lindley distribution as a tool for analyzing
crash data characterized by a large amount of zeros, Accid. Anal. Prev. 43 (2011), pp. 1738–1742.
[11] E. Mahmoudi and H. Zakerzadeh, Generalized Poisson – Lindley distribution, Comm. Statist.
Theory Methods 39 (2010), pp. 1785–1798.
[12] P. McCullagh and J. Nelder, Generalized Linear Models, 2nd ed., Chapman and Hall/CRC,
Washington, DC, 1989.
[13] J.C. Nash, On best practice optimization methods in R, J. Stat. Softw. 60 (2014), pp. 1–14.
Available at http://www.jstatsoft.org/v60/i02.
[14] E. Ohlsson and B. Johansson, Non-Life Insurance Pricing with Generalized Linear Models, EAA
Series, Springer, Berlin, 2010.
[15] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for
Statistical Computing, Vienna, Austria, 2014. Available at http://www.R-project.org.
[16] J. Rodríguez-Avi, A. Conde-Sánchez, A.J. Sáez-Castillo, M.J. Olmo-Jiménez, and A.M.
Martínez-Rodríguez, A generalized Waring regression model for count data, Comput. Statist.
Data Anal. 53 (2009), pp. 3717–3725.
[17] A.J. Sáez-Castillo and A. Conde-Sánchez, A hyper-Poisson regression model for overdispersed
and underdispersed count data, Comput. Statist. Data Anal. 61 (2013), pp. 148–157.
[18] M. Sankaran, The discrete Poisson-Lindley distribution, Biometrics 26 (1970), pp. 145–149.
[19] M.M. Shoukri, M.H. Asyali, R. VanDorp, and D. Kelton, The Poisson inverse Gaussian regression
model in the analysis of clustered counts data, J. Data Sci. 2 (2004), pp. 17–32.
[20] R. Varadhan, Numerical optimization in R: Beyond optim, J. Stat. Softw. 60 (2014), pp. 1–3.
Available at http://www.jstatsoft.org/v60/i01.
[21] R. Winkelmann, Econometric Analysis of Count Data, 5th ed., Springer, Berlin, 2008.
[22] H. Zakerzadeh and A. Dolati, Generalized Lindley distribution, J. Math. Extension 3 (2009), pp.
13–25.
[23] H. Zamani, N. Ismail, and P. Faroughi, Poisson-weighted exponential univariate version and
regression model with applications, J. Math. Statist. 10 (2014), pp. 148–154.

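Appendix. The observed information matrix

The observed information matrix is given by the negative expectation of the Hessian matrix $H(\hat{\beta})$, that is, $I(\hat{\beta}) = -E_{\hat{\beta}}[H(\hat{\beta}; Y, x_i^T)]$, where $H(\hat{\beta}) = \partial^2 \ell(\beta)/\partial\beta\,\partial\beta^T$. Writing $\alpha_i = (e^{x_i^T\beta}\,\theta(\theta+1) - 1)/(\theta+1)$, and with $\psi^{(0)}$ and $\psi^{(1)}$ denoting the digamma and trigamma functions,

\[
\begin{aligned}
\frac{\partial^2 \ell(\Theta)}{\partial\beta\,\partial\beta^T}
 = \sum_{i=1}^{n} x_i\, \theta e^{x_i^T\beta} \Bigg[
 & e^{x_i^T\beta}\theta\, \psi^{(1)}(y_i + \alpha_i) + \psi^{(0)}(y_i + \alpha_i)
 - e^{x_i^T\beta}\theta\, \psi^{(1)}(1 + \alpha_i) - \psi^{(0)}(1 + \alpha_i) \\
 & + \log\left(\frac{\theta}{\theta+1}\right)
 + \frac{(\theta+1)(\theta+2)\big(y_i(\theta+1) - (\theta+2)\big)}{\big[(e^{x_i^T\beta}\theta(\theta+1) - 1)(\theta+2) + y_i(\theta+1)\big]^{2}}
 \Bigg] x_i^T.
\end{aligned}
\]

Hence, by the law of large numbers, $(1/n)H(\hat{\beta}) \to E_{\hat{\beta}}[H(\hat{\beta}; Y, x_i^T)] = -I(\hat{\beta})$, and finally we obtain

\[
\begin{aligned}
I(\beta) = \frac{1}{n}\sum_{i=1}^{n} x_i\, \theta e^{x_i^T\beta} \Bigg[
 & -\log\left(\frac{\theta}{\theta+1}\right)
 - e^{x_i^T\beta}\theta\, \psi^{(1)}(y_i + \alpha_i) - \psi^{(0)}(y_i + \alpha_i) \\
 & + e^{x_i^T\beta}\theta\, \psi^{(1)}(1 + \alpha_i) + \psi^{(0)}(1 + \alpha_i)
 - \frac{(\theta+1)(\theta+2)\big(y_i(\theta+1) - (\theta+2)\big)}{\big[(e^{x_i^T\beta}\theta(\theta+1) - 1)(\theta+2) + y_i(\theta+1)\big]^{2}}
 \Bigg] x_i^T.
\end{aligned}
\]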
