JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2017
https://doi.org/10.1080/00949655.2017.1327590

Simulating comparisons of different computing algorithms fitting zero-inflated Poisson models for zero abundant counts

Xueyan Liu (a), Bryan Winter (a), Li Tang (a), Bo Zhang (b), Zhiwei Zhang (c) and Hui Zhang (a)

(a) Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN, USA; (b) Office of Surveillance and Biometrics, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD, USA; (c) Department of Statistics, The University of California, Riverside, CA, USA

ABSTRACT
Zero-inflated Poisson models are frequently used to analyse count data with excess zeroes. However, results generated by different algorithms, implemented in various statistical packages or procedures in R and SAS, are often inconsistent, especially for small sample sizes or when the proportion of zero inflation is not large. In this study, we compared the underlying nonlinear optimization approaches and the statistical theories on which common packages and procedures are based. Then, multiple sets of simulated data of small, medium, and large sample sizes were fitted to test the performance of the algorithms in available R packages and SAS procedures. They were also compared by using a real-data example. The zeroinfl function with methods CG type 1, CG type 2, and CG type 3 in the PSCL package in R and the GENMOD procedure in SAS generally outperformed the other methods in the simulation studies and produced consistent results for the real-data example.

ARTICLE HISTORY
Received 9 February 2017; Accepted 3 May 2017

KEYWORDS
Count data; overdispersion; zero-inflated Poisson; excess zeroes; SAS; R; nonlinear optimization; zeroinfl; PSCL; GENMOD

1. Introduction
Poisson and related log-linear regression models are frequently used to analyse count data, which are commonly encountered in biomedical, economic, and social studies. However, a Poisson distribution assumes that the mean equals the variance, a stringent assumption that is often questionable in the real world. Examples include many applications in which data contain an excessively large number of zeroes or have a much larger variance than the mean (i.e. overdispersion). For such applications, the traditional Poisson regression model is not appropriate.
To address overdispersion, a negative binomial model can be fitted, or quasi-likelihood estimation can be performed under a quasi-Poisson model, which includes an additional dispersion parameter [1]. However, these approaches may not be adequate to address the excess zeroes. In cases in which overdispersion is caused by abundant zeroes, or zero inflation, zero-inflated models are more appropriate [2].
In the real world, count data with a cluster of zeroes are very common. An example is the number
of physician or hospital visits made by a group of people over a given time period. A person may
have zero counts of physician or hospital visits because he or she is generally healthy or is unavailable
for visits during a certain time period. The first type of zeroes is called structural zeroes; the second type, random zeroes, may be modelled by a well-known count model such as the Poisson or negative binomial. Models that accommodate both types of zeroes are known as
zero-inflated count models. Another example is the number of eruptions of volcanoes located in a

CONTACT Hui Zhang hui.zhang@stjude.org




certain area during the last 10 years. Data on extinct volcanoes produce the structural zeroes, whereas
those on live volcanoes with no eruptions during the investigated time period generate the random
zeroes.
To account for the 2 types of zeroes in these situations, mixed distributions have been used to
combine constant zeroes with a Poisson (or negative binomial) distribution that generates counts
including zeroes. The earliest work on zero-inflated Poisson (ZIP) models was done by Lambert in
1992 [2]. Since then, zero-inflated discrete models have been rapidly developed, motivated in part by
the rapid advances in the development of biomedical techniques such as next-generation sequencing
and proteomics, as well as in economics and sociology [3–9]. In addition to ZIP models, excess zeroes
in count data can also be accommodated by zero-inflated negative binomial models or the hurdle
model [10].
Numerous statistical software packages are available for fitting ZIP models to zero-inflated count data. The most popular is the zeroinfl function offered by the PSCL package in R, with 5 possible nonlinear optimization (NLO) methods. SAS offers 3 procedures, COUNTREG, GENMOD, and NLMIXED, for ZIP modelling. COUNTREG offers 7 NLO methods, of which 3 have several updates, thereby yielding a total of 14 choices. GENMOD is simpler and offers only 1 NLO method. Similar to COUNTREG, 14 optimization algorithms are available in NLMIXED.
NLO methods frequently generate discordant parameter estimates and inferential conclusions. In
light of many (34 in total) methods available for ZIP modelling and the discordance in the results
they produce, it is important to compare these methods and identify those that perform the best. In
this study, we discuss the algorithms underlying the different procedures for ZIP modelling across
different packages in R and SAS, evaluate and compare their performances in simulation studies, and
illustrate their differences by using a real-data example.
The remaining paper is organized as follows: in Section 2, we briefly review ZIP model theory; in Section 3, we summarize and discuss the packages and procedures available in R and SAS for ZIP model fitting, as well as their accompanying NLO methods and updates; in Section 4, we perform simulations with the most commonly used packages and procedures for comparison and fit a real-data example; and in Section 5, we give concluding remarks and a discussion.

2. ZIP models
Log-linear regression is the most commonly used method to model count responses. Considering a
sample of n subjects and denoting yi as the count response and xi as the vector of explanatory variables
of interest for the ith subject with i = 1, 2, . . . , n, the traditional log-linear model is given by

$$
y_i \mid x_i \overset{\text{i.d.}}{\sim} \text{Poisson}(\mu_i), \quad \mu_i = E(y_i \mid x_i), \quad \log(\mu_i) = x_i^T\beta, \quad i = 1, 2, \ldots, n, \qquad (1)
$$

where i.d. stands for independently distributed, $\text{Poisson}(\mu_i)$ denotes the Poisson distribution with mean $\mu_i$, and $\beta$ is the vector of regression coefficients. In this model, the logarithm is the link function, that is, the function of $\mu_i$ that links this conditional mean to the linear predictor $x_i^T\beta$. The probability mass function is $P(y_i \mid x_i) = e^{-\mu_i}\mu_i^{y_i}/y_i!$, which leads to the log-likelihood function

$$
l(\beta) = \sum_{i=1}^{n} \left( y_i x_i^T\beta - e^{x_i^T\beta} - \ln y_i! \right).
$$

The log-likelihood is used as the objective function for computing maximum-likelihood estimates of
the regression coefficient β.
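
In R, for instance, this baseline log-linear model can be fitted in one line; the following is a minimal sketch in which y, x, and d are generic stand-ins for the response, covariate, and data frame.

```r
# Poisson log-linear regression; glm() maximizes the log-likelihood l(beta)
# above by iteratively reweighted least squares.
fit_pois <- glm(y ~ x, family = poisson(link = "log"), data = d)
```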

For count data with excess zeroes, the ZIP regression model, as proposed by Lambert [2], treats
the data as a mixture of constant zeroes and Poisson counts. The model is as follows:


$$
y_i \mid x_{1i}, x_{2i} \overset{\text{i.d.}}{\sim}
\begin{cases}
0, & \text{with probability } \rho_i,\\
\text{Poisson}(\mu_i), & \text{with probability } 1 - \rho_i,
\end{cases}
$$
$$
\log(\mu_i) = x_{1i}^T\beta, \qquad \text{logit}(\rho_i) = x_{2i}^T\gamma, \qquad i = 1, 2, \ldots, n, \qquad (2)
$$

where $x_{2i}$ denotes a vector of explanatory variables for the excess-zeroes process and $\gamma$ the vector of regression coefficients. In this model, the logit function is the link function for $\rho_i$ and the log function is the link function for the expected mean count $\mu_i$, which yields

$$
\mu_i = e^{x_{1i}^T\beta}, \qquad \rho_i = \frac{e^{x_{2i}^T\gamma}}{1 + e^{x_{2i}^T\gamma}}. \qquad (3)
$$

The resulting distribution gives

$$
P(y_i = 0) = \rho_i + (1 - \rho_i)\,e^{-\mu_i}, \qquad
P(y_i = j) = (1 - \rho_i)\,\frac{e^{-\mu_i}\mu_i^{j}}{j!}, \quad j = 1, 2, \ldots.
$$
In this model, zero counts can come from both the excess-zeroes and the Poisson-count processes.
The excess-zeroes process is thought to be a certain stochastic process that only allows zero counts.
The variables that affect the count process can also be involved in the excess-zeroes process; that is,
the variables in x1i and x2i need not be mutually exclusive and can be overlapping or the same.
Lambert [2] proposed a latent class construction that yields this model with an unobserved
Bernoulli random variable ci :

$$
c_i =
\begin{cases}
1, & \text{if } y_i = 0,\\
0, & \text{if } y_i \text{ is from Poisson}(\mu_i),
\end{cases}
$$

which establishes

$$
E(y_i) = E(E(y_i \mid c_i)) = \mu_i(1 - \rho_i) = \frac{e^{x_{1i}^T\beta}}{1 + e^{x_{2i}^T\gamma}}, \qquad (4)
$$
$$
\text{Var}(y_i) = E(\text{Var}(y_i \mid c_i)) + \text{Var}(E(y_i \mid c_i)) = \mu_i(1 - \rho_i)(1 + \mu_i\rho_i). \qquad (5)
$$
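
As a short check of (4) and (5): conditional on $c_i$, the mean $E(y_i \mid c_i)$ equals $0$ with probability $\rho_i$ and $\mu_i$ with probability $1 - \rho_i$, so

$$
E(\text{Var}(y_i \mid c_i)) = (1 - \rho_i)\,\mu_i, \qquad \text{Var}(E(y_i \mid c_i)) = \mu_i^2\,\rho_i(1 - \rho_i),
$$

and their sum is $\mu_i(1 - \rho_i)(1 + \mu_i\rho_i)$. Since $E(y_i) = \mu_i(1 - \rho_i)$, the variance exceeds the mean by the factor $1 + \mu_i\rho_i > 1$ whenever $\rho_i > 0$.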


Clearly, the ZIP model accommodates overdispersion. The log-likelihood function is

$$
l(\beta, \gamma \mid y) = \sum_{y_i = 0} \log\!\left[\rho_i + (1 - \rho_i)\, e^{-e^{x_{1i}^T\beta}}\right] + \sum_{y_i > 0} \log\!\left[(1 - \rho_i)\, \frac{e^{-e^{x_{1i}^T\beta}}\, e^{y_i x_{1i}^T\beta}}{y_i!}\right]
$$
$$
= \sum_{y_i = 0} \log\!\left(e^{x_{2i}^T\gamma} + e^{-e^{x_{1i}^T\beta}}\right) + \sum_{y_i > 0} \left(y_i x_{1i}^T\beta - e^{x_{1i}^T\beta}\right) - \sum_{y_i > 0} \log y_i! - \sum_{i=1}^{n} \log\!\left(1 + e^{x_{2i}^T\gamma}\right). \qquad (6)
$$

Because this log-likelihood function is difficult to maximize explicitly, Lambert [2] constructed the
so-called complete log-likelihood function that was based on both the response variable yi and the

latent variable ci , which makes the analytic procedure of maximizing the log-likelihood function
much easier. At the same time, the expectation maximization (EM) algorithm was employed for the
maximization. On the basis of the joint probability mass function of the bivariate pair $(y, c)$, the complete log-likelihood function is

$$
l_c(\beta, \gamma \mid y, c) = \sum_{i=1}^{n} (1 - c_i)\left(y_i x_{1i}^T\beta - e^{x_{1i}^T\beta}\right) - \sum_{i=1}^{n} (1 - c_i)\log y_i! + \sum_{i=1}^{n} \left(c_i x_{2i}^T\gamma - \log\!\left(1 + e^{x_{2i}^T\gamma}\right)\right). \qquad (7)
$$

In R and SAS, the parameters β and γ are numerically estimated by maximizing the log-likelihood
function by various NLO methods.
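
As a concrete illustration of what the packages optimize, the following minimal R sketch codes Equation (6) directly and hands it to optim; the names zip_loglik, X1, and X2 are ours for illustration and do not reflect the internals of any particular package.

```r
# ZIP log-likelihood of Equation (6); X1 and X2 are design matrices for the
# count and zero-inflation components, par = c(beta, gamma).
zip_loglik <- function(par, y, X1, X2) {
  k1    <- ncol(X1)
  beta  <- par[1:k1]
  gamma <- par[-(1:k1)]
  mu    <- exp(drop(X1 %*% beta))            # Poisson mean, log link
  eta   <- drop(X2 %*% gamma)                # linear predictor for logit(rho)
  ll0   <- log(exp(eta) + exp(-mu))          # contributions of y_i = 0
  llpos <- y * log(mu) - mu - lfactorial(y)  # contributions of y_i > 0
  sum(ifelse(y == 0, ll0, llpos)) - sum(log1p(exp(eta)))
}
# Maximize with any of optim()'s NLO methods, e.g. BFGS:
# fit <- optim(rep(0, ncol(X1) + ncol(X2)), zip_loglik, y = y, X1 = X1, X2 = X2,
#              method = "BFGS", control = list(fnscale = -1), hessian = TRUE)
```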

3. Software for ZIP models


3.1. R
The zeroinfl function in the package PSCL [11] has been commonly used [12] to fit a ZIP model in R [13]. In this package, the starting values are provided either by generalized linear regression (default) or by the EM algorithm, which uses

$$
E(c_i) = P(c_i = 1 \mid y_i = 0) = \frac{P(c_i = 1 \text{ and } y_i = 0)}{P(y_i = 0)} = \left(1 + e^{-x_{2i}^T\gamma - e^{x_{1i}^T\beta}}\right)^{-1}
$$

and applies the weights (1 − E(ci )) for generalized linear regression for β. Then, the maximization
procedure is passed to the optim function to execute the optimization and calculate the Hessian matrix
by using the NLO methods. Five NLO methods are available: Broyden–Fletcher–Goldfarb–Shanno
(BFGS); Conjugate gradient (CG) type 1, CG type 2, CG type 3; and Nelder–Mead.
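
As a brief sketch of how these methods are selected (assuming the pscl package; bioChemists is an example data set shipped with it):

```r
library(pscl)
data("bioChemists", package = "pscl")
# Fit the same ZIP model under three different optim() methods.
fits <- lapply(c("BFGS", "CG", "Nelder-Mead"), function(m)
  zeroinfl(art ~ fem + ment | fem + ment, data = bioChemists,
           control = zeroinfl.control(method = m)))
sapply(fits, coef, model = "zero")   # zero-inflation estimates per optimizer
# The CG types 1-3 of the text correspond to optim()'s 'type' control
# parameter; extra arguments to zeroinfl.control() are, as we understand it,
# passed on to optim(), e.g. zeroinfl.control(method = "CG", type = 2).
```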
The BFGS algorithm, which was published simultaneously by four scientists in 1970 [14–17], is a
quasi-Newton method.
At each iterative step, the method uses an approximate Hessian matrix, constructed from gradient and function values, to determine the search direction. The next point is then chosen closer to the target, in the sense that the gradient is closer to zero, where the likelihood attains its maximum in well-behaved cases. The BFGS approximation need not converge to the true Hessian matrix, but when the method converges, it generally does so in fewer iterations than the CG methods.
The CG method does not directly calculate the Hessian matrix. In the first step, the algorithm finds
the steepest descent direction. Then, at each iteration, the algorithm performs a line search to deter-
mine the optimal distance to move along the current search direction. The next search direction is
determined by a combination of the new steepest descent direction and the previous search direction
in such a way that it is conjugate to the current search direction. Three updates based on three distinct
ways of making this combination have been published: Fletcher and Reeves [18] for CG type 1, Polak
and Ribière [19] for CG type 2, and Beale [20] and Sorenson [21] for CG type 3.
The Nelder–Mead method [22], one of the most well-known derivative-free methods, needs only the values of the function ($F: \mathbb{R}^p \to \mathbb{R}$) to be optimized and performs a direct search without any line searches. It consists of building a simplex with $p + 1$ vertices and moving or shrinking this simplex in a favourable direction during the iterations. Although this method is robust, it is relatively slow. Since it does not calculate derivatives, it can work reasonably well for non-differentiable functions.
In R, the package glmmADMB fits zero-inflated models with only a constant zero-inflation term and therefore does not apply to our case. We did not use the package MCMCglmm because it implements Bayesian methods rather than maximum likelihood estimation.

3.2. SAS
The three procedures available in SAS (version 9.2 or higher) to fit a ZIP model are COUNTREG,
GENMOD, and NLMIXED. Table 1 lists all the available NLO options and their updates.
The SAS procedure NLMIXED maximizes the likelihood by integrating over the random effects
and presents the parameter estimates. If no random effects are involved, it directly maximizes the
likelihood function, which is automatically generated by a specified distribution in a standard form
(normal, binomial, or Poisson) or a general distribution that can be coded by using SAS program-
ming statements. The parameters must be specified and initialized in the PARMS statement; any parameter not listed there is assigned a default initial value of 1. Programmers are strongly encouraged to provide favourable initial values for the parameter estimates. Along
with this procedure, seven alternative optimization techniques are available to conduct the maximiza-
tion: CONGRA with four updates PB (default), FR, PR, and CD; DBLDOG with two updates DBFGS
(default) and DDFP; NMSIMP; NEWRAP; NRRIDG; QUANEW (default) with four updates DBFGS
(default), DDFP, BFGS, and DFP; and TRUREG.
CONGRA performs a conjugate-gradient optimization which has four updates: the default PB
update by Powell [23] and Beale [20]; the FR update by Fletcher and Reeves [18]; the PR update by
Polak and Ribière [19], and the CD update by Fletcher [24].
The double-dogleg optimization (DBLDOG) utilizes both the quasi-Newton and trust region
methods without performing a line search [25]. The DBFGS update performs the dual Broyden [17],
Fletcher [14], Goldfarb [15], and Shanno [16] update of the Cholesky factor of the approximate Hes-
sian matrix. The DDFP performs the dual Davidon [26], Fletcher [24], and Powell [23] update of the
Cholesky factor of the approximate Hessian matrix.
The Newton–Raphson optimization with line search (NEWRAP) is a second-order optimization that uses the gradient, the Hessian matrix, and a line search to obtain a local optimum.
The Newton–Raphson ridge optimization (NRRIDG) uses a strategy of decomposing orthogonally
the approximate Hessian. Each iteration can be slower than that of NEWRAP. However, NRRIDG
usually needs fewer iterations than does NEWRAP.
Quasi-Newton optimization (QUANEW) is the default optimization algorithm of NLMIXED. It
computes only the gradients, without calculating the Hessian matrix. It has 4 updates: BFGS, DBFGS, DDFP, and DFP.
Trust region optimization (TRUREG) requires first- and second-order continuous differentiability
to provide a quadratic approximation to the objective function.
Of the 14 algorithms, the Nelder–Mead simplex optimization (NMSIMP) is the only derivative-
free technique. All the remaining 13 methods require first-order derivatives. In particular, TRUREG,

Table 1. SAS procedures with various optimization techniques, and updates.


SAS procedures NLOa Description Updates
COUNTREG CONGRA Conjugate gradient CD, FR, PB (df.), PR
COUNTREG DBLDOG Double-dogleg DBFGS, DDFP
COUNTREG NMSIMP Nelder–Mead simplex NA
COUNTREG NRA (df.) Newton–Raphson with line search NA
COUNTREG NRRIDG Newton–Raphson ridge NA
COUNTREG QN Quasi-Newton BFGS, DBFGS, DDFP, DFP
COUNTREG TR Trust region NA
GENMOD NEWRR Ridge-stabilized Newton–Raphson NA
NLMIXED CONGRA Conjugate gradient CD, FR, PB, PR
NLMIXED DBLDOG Double-dogleg DBFGS, DDFP
NLMIXED NEWRAP Newton–Raphson with line search NA
NLMIXED NMSIMP Nelder–Mead simplex NA
NLMIXED NRRIDG Newton–Raphson ridge NA
NLMIXED QUANEW (df.) Quasi-Newton BFGS, DFP, DBFGS, DDFP
NLMIXED TRUREG Trust region NA
a Abbreviations: NLO, nonlinear optimization.

NEWRAP, and NRRIDG require second-order derivatives. A detailed comparison of these techniques
is given in the introduction of NLMIXED [25].
GENMOD maximizes the log-likelihood function with respect to the regression parameters by a ridge-stabilized Newton–Raphson algorithm, which compensates for problems with weakly positive definite matrices. Weighted least-squares estimates based on the response data are employed as the initial parameter estimates.
COUNTREG is designed specifically for count data and directly performs maximum-likelihood estimation by the 14 optimization methods, which are the same techniques as those used by NLMIXED. The default initial values are computed by ordinary least-squares regression. The default NLO method is NRA.

4. Illustration of evaluations
In this section, we performed simulations to evaluate the performance and reliability of the SAS and R procedures and packages in fitting the ZIP model. All the code for the simulations is available from the authors upon request. The significance level α was set to 0.05.

4.1. Simulation studies


Simulated data were generated from a ZIP model, which is a combination of Poisson and constant
zeroes. To evaluate performance, we repeated the analysis for four different sample sizes, n = 20, 50,
100, and 250, representing very small, small, medium, and large samples, respectively. In each setting, we
performed 1000 Monte Carlo (MC) replications and calculated the average and standard deviation
of the 1000 estimates of parameters and the empirical type I error rate, that is, the proportion of MC
replicates rejecting the null hypothesis incorrectly. Our simulation was based on


$$
y_i \mid x_{1i}, x_{2i} \sim
\begin{cases}
0, & \text{if } c_i = 1,\\
\text{Poisson}(\mu_i), & \text{if } c_i = 0,
\end{cases}
\qquad c_i \sim \text{Bernoulli}(\rho_i),
$$
$$
\mu_i = e^{\beta_0 + \beta_1 x_{1i}}, \qquad \rho_i = \frac{e^{\gamma_0 + \gamma_1 x_{2i}}}{1 + e^{\gamma_0 + \gamma_1 x_{2i}}},
$$
$$
x_{1i} \sim N(0, 0.5), \qquad x_{2i} \sim N(3, 0.5), \qquad i = 1, 2, \ldots, n.
$$

To better evaluate the performance of all packages/procedures in different situations, we used 2 different sets of coefficients: Model 1: β0 = 0.1, β1 = 1.0, γ0 = 0.5, γ1 = −1.0; Model 2: β0 = 1.0, β1 = 1.5, γ0 = 0.5, γ1 = −0.5. The 2 explanatory random variables were generated from normal distributions with different means but the same variance; they did not overlap and were independent of each other. By Equation (4), Model 2 yields a larger μ value than does Model 1. Also, $\rho_{\text{Model 1}} = e^{0.5 - x}/(1 + e^{0.5 - x})$ and $\rho_{\text{Model 2}} = e^{0.5 - 0.5x}/(1 + e^{0.5 - 0.5x})$. A simple comparison of the 2 expressions shows that $\rho_{\text{Model 1}} < \rho_{\text{Model 2}}$ for $x > 0$. Since $x_{2i} \sim N(3, 0.5)$, the probability of $x_{2i}$ being positive is approximately 99.77%, so $\rho_{\text{Model 1}} < \rho_{\text{Model 2}}$ for almost all $x_{2i}$. A larger ρ contributes to larger zero inflation; accordingly, Model 1 has a lower mean percentage of structural zeroes (7.59%) and Model 2 a higher one (26.89%).
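
A single MC replicate under this design can be generated as in the following sketch (we read N(0, 0.5) and N(3, 0.5) with 0.5 as the standard deviation, an assumption the paper leaves implicit):

```r
simulate_zip <- function(n, beta = c(0.1, 1.0), gamma = c(0.5, -1.0)) {
  x1  <- rnorm(n, mean = 0, sd = 0.5)
  x2  <- rnorm(n, mean = 3, sd = 0.5)
  mu  <- exp(beta[1] + beta[2] * x1)        # Poisson mean
  rho <- plogis(gamma[1] + gamma[2] * x2)   # zero-inflation probability
  ci  <- rbinom(n, 1, rho)                  # ci = 1 marks a structural zero
  data.frame(y = ifelse(ci == 1, 0, rpois(n, mu)), x1, x2)
}
set.seed(1)
d20 <- simulate_zip(20)   # Model 1 coefficients, n = 20
# fit <- pscl::zeroinfl(y ~ x1 | x2, data = d20)
```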
Fitting the data by using all R and SAS packages/procedures revealed that all of them were excellent
in estimating β0 and β1 for both Models 1 and 2. However, for the zero-inflation component, almost
all methods performed much better for Model 2 than for Model 1 on the inference of γ0 and γ1 ,
implying that they work better for data with a higher percentage of structural zeroes. Because of space limitations, we show only the inference for γ1 here; the results for γ0 have similar patterns.

Our hypothesis setting used to test the inferential performance of the R zeroinfl function and the SAS procedures was $H_0: \gamma_1 = -1.0$ vs. $H_1: \gamma_1 \neq -1.0$ for Model 1 and $H_0: \gamma_1 = -0.5$ vs. $H_1: \gamma_1 \neq -0.5$ for Model 2. For each MC replicate, we calculated the p-value $= 2 \times (1 - \Phi(|\hat{\gamma}_1 - \gamma_1| / \hat{\sigma}_{\hat{\gamma}_1}))$, where $\Phi$ is the standard normal CDF. The frequency of p-values below 0.05 over the 1000 MC replicates gives the empirical type I error rate, that is, $\text{type I error} = \frac{1}{1000}\sum_{i=1}^{1000} I\{p_i < 0.05\}$.
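
In R, this amounts to the following sketch, where gamma1_hat, se1, and pvals are hypothetical holders for the per-replicate estimate, its standard error, and the 1000 collected p-values:

```r
# Wald p-value for one MC replicate; gamma1_0 is the true value under H0.
wald_p <- function(gamma1_hat, se1, gamma1_0) {
  2 * (1 - pnorm(abs(gamma1_hat - gamma1_0) / se1))
}
type1_error <- function(pvals) mean(pvals < 0.05)   # empirical type I error
```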
We compared the estimates, standard errors, and type I error rates of the ZIP regression by R and
SAS procedures for Models 1 and 2, respectively, for the settings n = 20, 50, 100, and 250. Figures 1–4
give the results of this analysis.
In terms of bias and type I error rates, the inferences by almost all functions/procedures on Model 2 were better than those on Model 1. This is because Model 2 had a larger proportion of structural zeroes than Model 1; with a very small portion of constant zeroes, as in Model 1, these zeroes are more easily mistaken for random zeroes from the Poisson process than recognized as inflated zeroes [27]. Across sample sizes, as expected, the inference became more reliable as the sample size increased. In particular, for Model 2, bias was smaller and the type I error rate was closer to the nominal value of 0.05 as the sample size increased. Across all scenarios, all estimates of γ1 were downwardly biased, with no exceptions in both R and SAS.

Figure 1. Comparison of Models 1 and 2 for estimates, standard errors, and type I error rates of the ZIP regression by the R package and SAS procedures for simulated data for n = 20.

Figure 2. Comparison of Models 1 and 2 for estimates, standard errors, and type I error rates of the ZIP regression by the R package and SAS procedures for simulated data for n = 50.

Most of the listed procedures were severely biased, even after removing the outliers (≤ 1%
of 1000 MC replicates). However, the zeroinfl function with the CG type 1, CG type 2, and CG type 3
methods in R and GENMOD in SAS performed much better than the other methods. They showed the least bias and the smallest standard errors in all settings for both Models 1 and 2, and their type I error rates were the closest to the nominal value of 0.05 in every condition. The phenomenon of downward bias and small type I error rates based on the Wald test shown in our simulation, especially when the sample size is not large, is consistent with Lambert's simulation results [2] and recent reports [27].
For large samples (i.e. more than a few hundred), results by other choices, such as zeroinfl with BFGS in R and the CONGRA method with updates FR and PR in the SAS procedures COUNTREG and NLMIXED, could also be reliable.
Although a low convergence rate did not have a strong effect, the convergence rates of all procedures were much higher for Model 2 than for Model 1. However, for Model 1, with the SAS procedure COUNTREG under method CONGRA with update CD or method DBLDOG with update DDFP, and the procedure NLMIXED under method DBLDOG with update DDFP or method NMSIMP, more than half of the simulated data sets either failed to converge or exceeded the iteration limit. Their inferences are not shown in Figures 1–4.

Figure 3. Comparison of Models 1 and 2 for estimates, standard errors, and type I error rates of the ZIP regression by the R package
and SAS procedures for simulated data for n = 100.

4.2. Real-data example


We applied all the above-mentioned procedures/methods to a real-data set [28] on modelling the
number of days of absence of students from 2 schools. The number of absence days, a stan-
dardized math score, and the type of programs in which each student was enrolled (i.e. gen-
eral, academic, or vocational) were considered in our analysis. The aim of school administrators
was to study the students’ attendance behaviours in relation to their math scores and program
types.
The count data showed overdispersion [28]; the IDRE [28] fit the data by using negative binomial regression. Since the data had abundant zeroes, a ZIP model was fitted, considering the scenario in which some students were enrolled but later did not attend school at all for some reason.
The program types and math scores were used as covariates for the excess-zero process. All initial
values were automatically calculated by default in the zeroinfl function, COUNTREG, and GENMOD. In NLMIXED, parameters were initialized to zero, except for the dummy variables for program types, which were categorical. Table 2 shows the results of this analysis; the inference on the coefficients of program types in the Poisson component is not included because we are more interested in the parameter estimates of the zero-inflation component.
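
For reference, the R side of this fit looks like the following sketch; the variable names (daysabs, math, prog) follow the cited UCLA/IDRE example and the file name is hypothetical.

```r
library(pscl)
absence <- read.csv("school_absence.csv")
absence$prog <- factor(absence$prog)   # general / academic / vocational
fit <- zeroinfl(daysabs ~ math + prog | math + prog, data = absence)
summary(fit)   # count component and zero-inflation component
```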

Figure 4. Comparison of Models 1 and 2 for estimates, standard errors, and type I error rates of the ZIP regression by the R package
and SAS procedures for simulated data for n = 250.

As expected, the zeroinfl function in the PSCL package of R and GENMOD in SAS gave fairly consistent parameter estimates. Another Newton–Raphson method, SAS COUNTREG_NRA, yielded exactly the same results as did PSCL and GENMOD. However, the overall results in both the count and excess-zeroes components were quite different across the package and procedures, especially for the excess zeroes, although they were similar within each of the PSCL package in R, SAS COUNTREG, and SAS NLMIXED. Some methods in SAS, namely COUNTREG_DBLDOG_DDFP, COUNTREG_NRRIDG, NLMIXED_DBLDOG_DDFP, NLMIXED_NEWRAP, NLMIXED_NRRIDG, NLMIXED_NMSIMP, and NLMIXED_TRUREG, either did not converge or ran beyond their default iteration limits. Hence, they are not included in Table 2.

5. Conclusion and discussion


In this study, we examined the performance of 34 possible methods of various R and SAS statisti-
cal packages and procedures corresponding to an assortment of NLO methods and updates by fitting

Table 2. Estimates of parameters, standard errors, and p-values for the school absence study using different procedures/packages.
Intercept in count Intercept in zero Coeff. of math score in zero
Packages/procedures Estimate (s.e.) p-Value Estimate (s.e.) p-Value Estimate (s.e.) p-Value
R_pscl_zeroinfl_bfgs 1.338(0.064) 7.48e−97 −2.512(0.389) 1.10e−10 0.018(0.006) 0.005
R_pscl_zeroinfl_cg1 1.338(0.064) 7.47e−97 −2.512(0.389) 1.10e−10 0.018(0.006) 0.005
R_pscl_zeroinfl_cg2 1.338(0.064) 7.47e−97 −2.512(0.389) 1.10e−10 0.018(0.006) 0.005
R_pscl_zeroinfl_cg3 1.338(0.064) 7.47e−97 −2.512(0.389) 1.10e−10 0.018(0.006) 0.005
R_pscl_zeroinfl_neldermead 1.338(0.064) 7.47e−97 −2.512(0.389) 1.10e−10 0.018(0.006) 0.005
Sas_countreg_congra_cd 0.983(0.059) <.0001 6.038(43.704) 0.891 −7.146(43.658) 0.87
Sas_countreg_congra_fr 0.983(0.059) <.0001 7.939(112.68) 0.944 −9.041(112.66) 0.936
Sas_countreg_congra_pb 0.983(0.059) <.0001 17.63(0.579) <.0001 −18.732(0.579) <.0001
Sas_countreg_congra_pr 0.983(0.059) <.0001 5.852(39.611) 0.883 −6.956(39.56) 0.8604
Sas_countreg_dbldog_dbfgs 0.983(0.059) <.0001 15.557(0.579) <.0001 −16.659(0.579) <.0001
Sas_countreg_nmsimp 0.983(0.059) <.0001 42.994(0.579) <.0001 −44.047(0.579) <.0001
Sas_countreg_nra 1.338(0.064) <.0001 −2.512(0.389) <.0001 0.018(0.006) 0.005
Sas_countreg_qn_bfgs 0.983(0.059) <.0001 20.1016(0.579) <.0001 −21.204(0.579) <.0001
Sas_countreg_qn_dbfgs 0.983(0.059) <.0001 21.01(0.579) <.0001 −22.11(0.579) <.0001
Sas_countreg_qn_ddfp 0.983(0.059) <.0001 7.81(104.8) 0.941 −8.904(104.8) 0.932
Sas_countreg_qn_dfp 0.983(0.059) <.0001 20.102(0.579) <.0001 −21.204(0.579) <.0001
Sas_countreg_tr 0.983(0.059) <.0001 10.521(410.0) 0.980 −11.623(410.0) 0.977
Sas_genmod_newrr 1.338(0.064) <.0001 −2.512(0.389) <.0001 0.018(0.006) 0.005
Sas_nlmixed_congra_cd 0.596(329.9) 0.9986 −0.0352(2.805) 0.99 −1.5914(1.977) 0.4215
Sas_nlmixed_congra_fr 0.596(346.2) 0.9986 −0.0353(2.807) 0.99 −1.5921(1.979) 0.4217
Sas_nlmixed_congra_pb 0.597(346.2) 0.9986 −0.0353(2.807) 0.99 −3.305(5.271) 0.5311
Sas_nlmixed_congra_pr 0.596(109.4) 0.9975 −0.032(2.505) 0.99 −1.46(1.6632) 0.381
Sas_nlmixed_dbldog_dbfgs 0.596(1415) 0.9997 9.7475(280.31) 0.9723 −10.856(280.3) 0.969
Sas_nlmixed_quanew_bfgs 0.622(0.376) 0.0099 −0.03(2.292) 0.9896 −1.3494(1.443) 0.3503
Sas_nlmixed_quanew_dbfgs 0.622(0.376) 0.0099 −0.03(2.292) 0.9896 −1.3494(1.443) 0.3503
Sas_nlmixed_quanew_ddfp 0.597(14.56) 0.9673 −0.025(3.844) 0.9948 −1.9311(3.059) 0.5284
Sas_nlmixed_quanew_dfp 0.597(14.56) 0.9673 −0.025(3.844) 0.9948 −1.9311(3.059) 0.5284

them to simulated ZIP count data. This assessment is important for practical purposes because statisticians often cannot easily decide which method offers the highest likelihood of obtaining reliable inference when fitting data to ZIP models.
We briefly reviewed and discussed the operating characteristics of the 34 approaches in R and SAS. The zeroinfl function in R and the SAS procedures COUNTREG, GENMOD, and NLMIXED all directly execute NLO methods to maximize the log-likelihood function of ZIP models. A major difference among the methods is the process of calculating the initial values of the parameters: the zeroinfl function uses a GLM fit, COUNTREG implements ordinary least-squares regression, and GENMOD uses weighted least-squares estimates based on the response data. In contrast, NLMIXED does not provide a reasonable guess of the initial values within the procedure; they must be manually input by programmers. When the sample size or the zero inflation in a data set is small, different initial estimates can produce very diverse results.
The results of our simulation study show that the zeroinfl function in R and the procedure GENMOD in SAS perform the best in terms of bias and empirical type I error rate and are the most likely to provide reliable estimates and inferences of parameters, even in difficult situations such as a small sample size (as small as n = 20) and small zero inflation (as small as 7.59%). They were also very consistent with each other in the real-data modelling. When the sample size and zero inflation are large, other procedures and methods can also be employed for reasonable estimates. However, NLMIXED is not recommended because it does not effectively estimate the initial values of the parameters.
The real-data fitting results of the R package PSCL (zeroinfl) and the SAS procedure GENMOD coincided with each other, which supports the conclusions of the simulations. The SAS procedure COUNTREG_NRA also gave almost the same results as zeroinfl and GENMOD. However, we investigated the reason for its poorer performance in the simulations and found that it was due to some outlier estimates: when zeroinfl and GENMOD do not converge for a data set with abnormal observations and many outliers, COUNTREG_NRA still returns a result, but an unreliable one. Therefore, in real-data practice, we recommend the R package PSCL (zeroinfl) and the SAS procedure GENMOD based on our investigation. In all, for fitting ZIP regression, based on both our simulations and the real-data example, zeroinfl in the R package PSCL and GENMOD in SAS would be the best choices.

Acknowledgements
We thank Ms Lu Huang for assistance with plotting the figures. We also thank the referee for valuable comments, which helped greatly improve our paper.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
XL, BW, LT, and HZ received financial support from ALSAC.

References
[1] Pounds SB, Gao CL, Zhang H. Empirical Bayesian selection of hypothesis testing procedures for analysis of sequence count expression data. Stat Appl Genet Mol Biol. 2012;11(5). DOI:10.1515/1544-6115.1773
[2] Lambert D. Zero-inflated Poisson regression with an application to defects in manufacturing. Technometrics. 1992;34(1):1–14.
[3] Cameron AC, Trivedi PK. Regression analysis of count data. Cambridge: Cambridge University Press; 1998.
[4] Cameron AC, Trivedi PK. Microeconometrics: methods and applications. Cambridge: Cambridge University
Press; 2005.
[5] Agresti A. Categorical data analysis. 2nd ed. Hoboken (NJ): Wiley; 2002.
[6] Min YY, Agresti A. Modeling nonnegative data with clumping at zero: a survey. J Iran Stat Soc. 2002;1:7–34.
[7] Lord D, Mannering F. The statistical analysis of crash-frequency data: a review and assessment of methodological
alternatives. Transp Res A. 2010;44:291–305.
[8] Staub KE, Winkelmann R. Consistent estimation of zero-inflated count models. Health Econ. 2013;22(6):673–86.
[9] Preissser JS, Stamm JW, Long DL, et al. Review and recommendations for zero-inflated count regression modeling
of dental caries indices in epidemiological studies. Caries Res. 2012;46:413–423.
[10] Winkelmann R. Econometric analysis of count data. 5th ed. Berlin: Springer; 2008.
[11] Zeileis A, Kleiber C, Jackman S. Regression models for count data in R. J Stat Softw. 2008;27(8). Available from: http://www.jstatsoft.org/v27/i08/
[12] Jackman S. pscl: Classes and methods for R developed in the Political Science Computational Laboratory, Department of Political Science, Stanford University, Stanford, CA. R package version 1.4.9 [cited 2016 Oct 1]. Available from: http://pscl.stanford.edu/
[13] R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical
Computing. Vienna, Austria; 2010. ISBN 3-900051-07-0 [cited 2016 Oct 1]. Available from: http://www.R-project.
org
[14] Fletcher R. A new approach to variable metric algorithms. Comput J. 1970;13(3):317–322.
[15] Goldfarb D. A family of variable metric updates derived by variational means. Math Comp. 1970;24(109):23–26.
[16] Shanno DF. Conditioning of quasi-Newton methods for function minimization. Math Comp. 1970;24(111):647–656.
[17] Broyden CG. The convergence of a class of double-rank minimization algorithms. J Inst Math Appl. 1970;6:76–90.
[18] Fletcher R, Reeves CM. Function minimization by conjugate gradients. Comput J. 1964;7:149–154.
[19] Polak E, Ribière G. Note sur la convergence de méthodes de directions conjuguées. Rev Française Informat Recherche Opérationnelle. 1969;3(16):35–43.
[20] Beale EML. A derivation of conjugate gradients. In Lootsma FA, editor. Numerical methods for nonlinear
optimization. London: Academic Press; 1972. p. 39–43.
[21] Sorenson HW. Comparison of some conjugate direction procedures for function minimization. J Franklin Inst.
1969;288:421–441.
[22] Nelder JA, Mead R. A simplex algorithm for function minimization. Comput J. 1965;7:308–313.
[23] Powell MJD. Restart procedures for the conjugate gradient method. Math Program. 1977;12:241–254.

[24] Fletcher R. Practical methods of optimization, vol. 1: unconstrained optimization. New York: John Wiley & Sons;
1987.
[25] SAS Institute Inc. Enhancements in SAS/STAT 9.2 Software. Cary (NC): SAS Institute Inc.; 2008.
[26] Davidon WC. Variable metric method for minimization. SIAM J Optim. 1991;1(1):1–17.
[27] Miller JM. Comparing Poisson, Hurdle, ZIP model fit under varying degrees of skew and zero-inflation [PhD
dissertation]. USA: University of Florida; 2007.
[28] Introduction to SAS. UCLA: statistical consulting group [cited 2007 Nov 24]. Available from: http://www.ats.ucla.
edu/stat/sas/notes2/
