LR      likelihood ratio
HNB1    hurdle model with threshold 1, and negative binomial distribution
HNB2    hurdle model with threshold 2, and negative binomial distribution
HP1     hurdle model with threshold 1, and Poisson distribution
HP2     hurdle model with threshold 2, and Poisson distribution
MLE     maximum likelihood estimation
MSD     minimum s-significant difference
MSE     mean squared error
NBRM    negative binomial regression model
PDF     probability density function
PRM     Poisson regression model
ZIP     zero-inflated Poisson
ZINB    zero-inflated negative binomial

I. INTRODUCTION

GIVEN the goal of delivering a software product that has minimal corrective maintenance, software quality modeling techniques are useful tools for applying timely software quality improvement efforts. For example, by predicting the number of faults in program modules, a software fault prediction model can direct the software quality assurance team in targeting the most faulty modules first.

The relationship between software complexity metrics, and the occurrence of faults in program modules has been used by software quality prediction models, such as case-based reasoning [1], regression trees [2], [3], fuzzy logic [4], and multiple linear regression [5]. Typically, a software quality model of a given software system is calibrated using software metrics, and fault data collected from a previous system release, or similar projects. The model can then be applied to predict the software quality of a release currently under development, or of similar projects.

Software fault prediction based on count model techniques [6] is attractive because a specific count model can be chosen such that it best represents the fault occurrence process of the given software system. In addition, count models have a unique feature in that they can provide the probability that a given number of faults will occur in any given program module. Count models can also be used for software quality classification of program modules, i.e., fault-prone, and not fault-prone [7], [8]. We feel that the application of count models in software engineering has been very limited [6], [9].

Poisson regression is the basis of the various count models which are derived from it. It has a statistical characteristic called equidispersion, which implies the equality of the mean to the variance of the dependent variable (a non-negative discrete integer). In the software fault prediction problem, it is often observed that the distribution of the number of faults (the dependent variable) is such that its variance exceeds its mean value. Such a scenario is referred to as overdispersion. One way of accounting for overdispersion is introducing an unobserved heterogeneity term (gamma distributed), yielding the negative binomial regression model (NBRM).

Another variation of count models is the family of zero-inflated, and hurdle models, which are used when there is an excess of zeros for the dependent variable. The proportion of faulty modules of a high assurance software system (such as telecommunications) is usually very small. Under such scenarios, the zero-inflated, and hurdle count models may be more appropriate. The zero-inflated model assumes that the population of the software modules consists of two groups, i.e., perfect modules, and non-perfect modules. For the perfect modules, no faults occur; while for the non-perfect modules, the number of faults follows some standard distribution, such as Poisson, or negative binomial. If the non-perfect group is assumed to follow a Poisson distribution, then a zero-inflated Poisson (ZIP) model is obtained. If the non-perfect group is assumed to follow a negative binomial distribution, then a zero-inflated negative binomial (ZINB) model is obtained.

The hurdle model also consists of two parts. However, instead of having perfect, and non-perfect groups, it divides the modules into lower, and higher count groups, based on a binary distribution. The dependent variable of these groups is assumed to follow a separate distribution process. In the case of hurdle count models, two factors may affect their predictive quality: the dependent variable's threshold² value that is used to form the two groups, and the specific distribution each group is assumed to follow. We investigated these two factors to study their impact on the prediction performances of hurdle models. For the case study presented, we chose two threshold values for the number of faults: 1, and 2.

The distributions chosen for the count groups of the hurdle models are the Poisson, and negative binomial distributions. Hence, we have four kinds of hurdle models: 1) hurdle model with threshold 1, and Poisson distribution (HP1); 2) hurdle model with threshold 2, and Poisson distribution (HP2); 3) hurdle model with threshold 1, and negative binomial distribution (HNB1); and 4) hurdle model with threshold 2, and negative binomial distribution (HNB2). Previous works related to hurdle models were focused on applying HP1, and HNB1 in econometrics [10]–[12]. The application of hurdle regression techniques with different threshold values to software quality estimation modeling is a contribution of this study. To our knowledge, this is the first study that has investigated hurdle models for software quality estimation.

In related literature, we did not find any comprehensive empirical study that provides clear guidance for applying existing count models for software quality estimation. In one of the very few studies, Evanco [9] applied a Poisson Regression Model (PRM) to determine the fault locality, and fault correction effort during unit, system, and acceptance testing of a software project. However, no reasoning was provided regarding the appropriateness of the PRM for the software data set. Moreover, the study did not use any evaluation criteria to assess the quality of the fitted count model.

² In our study, the dependent variable is the number of faults in a software module.
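The overdispersion check described above amounts to comparing the sample mean, and variance of the module fault counts; a minimal sketch (the counts below are illustrative only, not the case-study data):

```python
# Check equidispersion vs. overdispersion for a vector of module fault counts.
# Illustrative counts only; under a Poisson model the dispersion ratio
# variance/mean should be close to 1.
faults = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 5, 0, 1, 0, 0, 9, 0, 3]

n = len(faults)
mean = sum(faults) / n
# Sample variance (denominator n - 1).
var = sum((y - mean) ** 2 for y in faults) / (n - 1)

# A ratio well above 1 signals overdispersion, motivating the NBRM.
dispersion = var / mean
print(f"mean={mean:.2f}, variance={var:.2f}, dispersion={dispersion:.2f}")
```

A dispersion ratio far above 1, as in this excess-zero sample, is the scenario for which the NBRM, and the zero-inflated models are intended.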
GAO AND KHOSHGOFTAAR: EMPIRICAL STUDY OF COUNT MODELS FOR SOFTWARE FAULT PREDICTION 225
Our preliminary efforts investigated the benefits of applying ZIP models for software fault prediction [6]. This study presents a comprehensive empirical investigation of eight different count models for software fault prediction. Six of the count models are commonly used in non-software engineering fields, such as econometrics [11], [13]. The other two, i.e., HP2, and HNB2, are investigated for the first time in this study. A statistically-based comparative analysis, including several model selection techniques, is presented to provide definitive guidance for applying count models for software fault prediction.

The case study presented is of software metrics, and fault data collected from two Windows-based embedded software systems configured for wireless telecommunications. The eight count models built for the system were compared with each other. A one-way ANOVA model, and Tukey's multiple comparison technique were used in our comparative analysis.

The remainder of this paper presents details of the count models in Section II, model selection tests & techniques in Section III, a case study in Section IV, and conclusions in Section V.

II. COUNT MODELING TECHNIQUES

Count models are generally a variation of the commonly used Poisson regression model. Another commonly used count model is the negative binomial regression model, which is a derivative of the PRM. Other count models have been derived by combining different distributions, for example, finite-mixture models [13]. In a finite-mixture model, a random variable is postulated as a draw from a super-population that is an additive mixture of a group of distinct populations, each of which has its own unique distribution, such as Poisson, or negative binomial. Zero-inflated count models, such as the zero-inflated Poisson, and zero-inflated negative binomial models, are specific cases of finite-mixture models. Hurdle models, including the hurdle Poisson, and hurdle negative binomial models, are also special cases of finite-mixture models.

In addition to the finite-mixture models discussed in this paper, we investigated several other mixture models. However, it was observed that they were either unsuitable for the embedded software system being studied, or yielded similar results as compared to some of the count models presented. For example, upon investigating the Poisson-inverse Gaussian model [14], it was observed that its prediction results were very similar to those of the NBRM. Therefore, the modeling results of other mixture models are not presented in the paper.

A. Poisson Regression Model

The Poisson regression model is derived from the Poisson distribution by allowing the expected value of the dependent variable to be a function associated with the independent variables. Let $(y_i, x_i)$ be an observation in a data set, such that $y_i$, and $x_i$ are the dependent variable, and vector of independent variables for the $i$-th observation, respectively. Given $x_i$, assume $y_i$ is Poisson distributed with the probability density function (PDF) of

    P(y_i \mid x_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots    (1)

where $\mu_i$ is the mean value of the dependent variable $y_i$. The expected value, and the variance of $y_i$ are given by

    E(y_i \mid x_i) = \mathrm{Var}(y_i \mid x_i) = \mu_i.    (2)

Equation (2) demonstrates an important property of the Poisson regression model, i.e., equidispersion, which implies that the expected value of $y_i$ is equal to its variance.

To ensure that the expected value of $y_i$ is nonnegative, the link function which displays a relationship between the expected value, and the independent variables should have the form [13]

    \mu_i = \exp(x_i'\beta)    (3)

where $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ denotes an unknown parameter vector, and $x_i'$ represents the transpose of the vector $x_i$, which is equal to $(1, x_{i1}, \ldots, x_{ik})$. Note that both $\beta$, and $x_i$ are $(k+1) \times 1$ vectors, where $k$ is the number of independent variables used in the data set. Equations (1), and (3) jointly define the Poisson regression model.

The maximum likelihood estimation (MLE) technique is often used in the parameter estimation for regression models [15]. Given a set of $n$ observations, the log-likelihood function of the PRM is given by

    \ln L(\beta) = \sum_{i=1}^{n}\left[ y_i x_i'\beta - \exp(x_i'\beta) - \ln(y_i!) \right].    (4)

The commonly used method in MLE is the Newton-Raphson iteration technique. Other methods, such as the EM algorithm, also perform successfully for parameter estimation [16].

B. Zero-Inflated Poisson Model

A data set with an excess of zeros for the dependent variable is a commonly observed phenomenon in software quality modeling. This observation is especially prevalent for high assurance, and mission-critical software systems.

The ZIP model, first introduced by Lambert [16], assumes that all zeros come from two sources: the source representing the perfect modules, in which no faults occur; and the source representing the non-perfect modules, in which the number of faults in the modules follows the Poisson distribution. In the ZIP model, a parameter $\varphi_i$ is introduced to denote the probability of a module being perfect. Hence, the probability of the module being non-perfect is $1 - \varphi_i$. The PDF of the dependent variable under the ZIP model is therefore

    P(y_i = 0 \mid x_i) = \varphi_i + (1 - \varphi_i)e^{-\mu_i},
    P(y_i = r \mid x_i) = (1 - \varphi_i)\frac{e^{-\mu_i}\mu_i^{r}}{r!}, \quad r = 1, 2, \ldots    (5)
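The two-part ZIP probability function in (5) can be written directly in code; a minimal sketch, with illustrative parameter values (not estimates from the case study):

```python
import math

def zip_pmf(r, mu, phi):
    """P(y = r) under a zero-inflated Poisson: with probability phi the
    module is perfect (always zero faults); otherwise the fault count is
    Poisson(mu)."""
    poisson = math.exp(-mu) * mu ** r / math.factorial(r)
    if r == 0:
        return phi + (1.0 - phi) * poisson
    return (1.0 - phi) * poisson

# Illustrative parameters: 60% of modules perfect, Poisson mean 2 otherwise.
mu, phi = 2.0, 0.6
p0_zip = zip_pmf(0, mu, phi)
p0_poisson = math.exp(-mu)  # zero probability under a plain Poisson(2)
# The ZIP model places far more mass at zero than the plain Poisson can.
print(p0_zip, p0_poisson)
```

This illustrates why the ZIP model suits high-assurance fault data: the zero-inflation parameter lets the probability of a fault-free module be much larger than any Poisson mean could accommodate.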
226 IEEE TRANSACTIONS ON RELIABILITY, VOL. 56, NO. 2, JUNE 2007
A ZIP model may be obtained by adding the following two link functions:

    \ln(\mu_i) = x_i'\beta    (6)
    \mathrm{logit}(\varphi_i) = \ln\left(\frac{\varphi_i}{1 - \varphi_i}\right) = x_i'\gamma    (7)

where $\gamma$ denotes the parameter vector of the zero-inflation part. The expected value, and the variance of $y_i$ under the ZIP model are

    E(y_i \mid x_i) = (1 - \varphi_i)\mu_i    (8)
    \mathrm{Var}(y_i \mid x_i) = (1 - \varphi_i)\mu_i(1 + \varphi_i\mu_i)    (9)

Equations (8), and (9) display that the variance of $y_i$ exceeds its expected value, which is known as overdispersion. An excess of zeros for the dependent variable implies the overdispersion phenomenon.

We use the MLE technique as the standard parameter estimation technique for the ZIP model. Let $d_i$ denote an indicator variable that takes the value of 1 when $y_i = 0$, and the value of 0 otherwise. The log-likelihood function for the ZIP is therefore given by

    \ln L = \sum_{i=1}^{n}\left[ d_i \ln\left(\varphi_i + (1 - \varphi_i)e^{-\mu_i}\right) + (1 - d_i)\left(\ln(1 - \varphi_i) - \mu_i + y_i\ln\mu_i - \ln(y_i!)\right)\right].

C. Negative Binomial Regression Model

The negative binomial PDF can be written as

    f(y_i \mid x_i) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})}\left(\frac{\alpha^{-1}}{\alpha^{-1} + \mu_i}\right)^{\alpha^{-1}}\left(\frac{\mu_i}{\alpha^{-1} + \mu_i}\right)^{y_i}

where the gamma function, denoted by $\Gamma(\cdot)$, is defined in [13], and $\alpha$ is a constant parameter. The expected value of $y_i$ for the negative binomial distribution is the same as that for the Poisson distribution, i.e., $E(y_i \mid x_i) = \mu_i$. However, the conditional variance of the negative binomial is given by

    \mathrm{Var}(y_i \mid x_i) = \mu_i(1 + \alpha\mu_i).

D. Zero-Inflated Negative Binomial Model

The zero-inflated negative binomial model is similar to the ZIP model. The primary difference is that, in the case of the ZINB model, the negative binomial distribution is used for the non-perfect modules group, as compared to the Poisson distribution used in the ZIP model.

The probability density function for the ZINB model is given by

    P(y_i = 0 \mid x_i) = \varphi_i + (1 - \varphi_i)(1 + \alpha\mu_i)^{-\alpha^{-1}},
    P(y_i = r \mid x_i) = (1 - \varphi_i)\,\frac{\Gamma(r + \alpha^{-1})}{\Gamma(r + 1)\,\Gamma(\alpha^{-1})}\left(\frac{\alpha^{-1}}{\alpha^{-1} + \mu_i}\right)^{\alpha^{-1}}\left(\frac{\mu_i}{\alpha^{-1} + \mu_i}\right)^{r}, \quad r = 1, 2, \ldots    (13)

The ZINB model is obtained by adding the following two link functions:

    \ln(\mu_i) = x_i'\beta    (14)
    \mathrm{logit}(\varphi_i) = x_i'\gamma    (15)
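Fitting a ZIP model by maximizing its log-likelihood, with the zero indicator $d_i$ described above, can be sketched as follows. This is a simplified intercept-only fit on synthetic data (constant $\mu$, and $\varphi$, no covariates); a full ZIP regression would model $\ln\mu_i$, and $\mathrm{logit}(\varphi_i)$ through the link functions (6), and (7):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(0)
# Synthetic fault counts from a known ZIP process: phi = 0.6, mu = 2.0.
n = 2000
perfect = rng.random(n) < 0.6
y = np.where(perfect, 0, rng.poisson(2.0, size=n))

def neg_loglik(theta):
    # log(mu), and logit(phi) keep the parameters in their valid
    # ranges during unconstrained optimization.
    mu = np.exp(theta[0])
    phi = 1.0 / (1.0 + np.exp(-theta[1]))
    d = (y == 0)  # indicator d_i = 1 when y_i = 0
    ll = np.where(
        d,
        np.log(phi + (1.0 - phi) * np.exp(-mu)),
        np.log(1.0 - phi) - mu + y * np.log(mu) - gammaln(y + 1.0),
    )
    return -ll.sum()

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
mu_hat = np.exp(res.x[0])
phi_hat = 1.0 / (1.0 + np.exp(-res.x[1]))
print(f"mu_hat={mu_hat:.3f}, phi_hat={phi_hat:.3f}")
```

With 2000 synthetic modules, the recovered estimates land close to the generating values, even though zeros from the perfect, and the Poisson sources are indistinguishable in the data.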
modules that had a number of faults greater than or equal to the selected threshold are categorized into the upper count group, and into the lower count group otherwise. Moreover, for each threshold value, we chose two distributions (Poisson, and negative binomial) to represent the groups of the hurdle model. When the Poisson distribution is used, the resulting model is a hurdle Poisson (HP) model; whereas, when the negative binomial distribution is used, the resulting model is a hurdle negative binomial (HNB) model. The hurdle model has a general form of

    P(y_i = r \mid x_i) = f_1(r), \quad r = 0, 1, \ldots, d - 1,
    P(y_i = r \mid x_i) = \left(1 - \sum_{j=0}^{d-1} f_1(j)\right)\frac{f_2(r)}{1 - \sum_{j=0}^{d-1} f_2(j)}, \quad r = d, d + 1, \ldots    (16)

where $f_1$, and $f_2$ are some density functions that may have the same format, but with different parameters; and $d$ represents the selected threshold value. In this paper, we discuss four different hurdle models.

1) Case 1 (HP1): For a threshold value of 1 fault, both $f_1$, and $f_2$ are defined to have Poisson distributions. Therefore, the HP1 density function becomes

    P(y_i = 0 \mid x_i) = e^{-\mu_{1i}},
    P(y_i = r \mid x_i) = \left(1 - e^{-\mu_{1i}}\right)\frac{e^{-\mu_{2i}}\mu_{2i}^{r}}{r!\left(1 - e^{-\mu_{2i}}\right)}, \quad r = 1, 2, \ldots

4) Case 4 (HNB2): For a threshold value of 2 faults, both $f_1$, and $f_2$ are defined to have negative binomial distributions. The probability density function of HNB2 is therefore given by

    P(y_i = r \mid x_i) = f_1(r), \quad r = 0, 1,
    P(y_i = r \mid x_i) = \left(1 - f_1(0) - f_1(1)\right)\frac{f_2(r)}{1 - f_2(0) - f_2(1)}, \quad r = 2, 3, \ldots    (21)

where $f_1$ is the same as in (20), and

    f_2(r) = \frac{\Gamma(r + \alpha_2^{-1})}{\Gamma(r + 1)\,\Gamma(\alpha_2^{-1})}\left(\frac{\alpha_2^{-1}}{\alpha_2^{-1} + \mu_{2i}}\right)^{\alpha_2^{-1}}\left(\frac{\mu_{2i}}{\alpha_2^{-1} + \mu_{2i}}\right)^{r}    (22)

$f_2$ has the same form as $f_1$, except for the replacement of parameters $\mu_{1i}$, and $\alpha_1$ by $\mu_{2i}$, and $\alpha_2$, respectively.

The log-likelihood function for each hurdle model can be written as a sum of two individual terms [13]: $\ln L_1$, and $\ln L_2$.

    \ln L_1 = \sum_{i:\, y_i < d} \ln f_1(y_i) + \sum_{i:\, y_i \ge d} \ln\left(1 - \sum_{j=0}^{d-1} f_1(j)\right)    (23)

    \ln L_2 = \sum_{i:\, y_i \ge d} \left[\ln f_2(y_i) - \ln\left(1 - \sum_{j=0}^{d-1} f_2(j)\right)\right]    (24)
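The general two-part hurdle form in (16) can be sketched for the Poisson case with an arbitrary threshold $d$. The parameter values below are illustrative assumptions, not estimates from the case study:

```python
import math

def poisson_pmf(r, mu):
    return math.exp(-mu) * mu ** r / math.factorial(r)

def hurdle_poisson_pmf(r, mu1, mu2, d):
    """Two-part hurdle density: counts below the threshold d follow the
    first Poisson process; counts at or above d follow the second Poisson
    process, left-truncated at d, and rescaled by the probability of
    clearing the hurdle under the first process."""
    low1 = sum(poisson_pmf(j, mu1) for j in range(d))  # P(lower group), part 1
    low2 = sum(poisson_pmf(j, mu2) for j in range(d))  # truncation mass, part 2
    if r < d:
        return poisson_pmf(r, mu1)
    return (1.0 - low1) * poisson_pmf(r, mu2) / (1.0 - low2)

# HP1-style (d = 1), and HP2-style (d = 2) variants with illustrative means.
hp1 = [hurdle_poisson_pmf(r, 0.4, 2.5, d=1) for r in range(80)]
hp2 = [hurdle_poisson_pmf(r, 0.4, 2.5, d=2) for r in range(80)]
print(sum(hp1), sum(hp2))  # each distribution sums to 1
```

The rescaling term is what makes the two parts fit together into a proper probability distribution, regardless of the chosen threshold.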
Vuong's statistic is given by

    v = \frac{\sqrt{n}\,\bar{m}}{s_m}    (28)

where $\bar{m}$ denotes the mean of $m_i$, for $i = 1, \ldots, n$; and $s_m$ denotes the standard deviation of $m_i$, for $i = 1, \ldots, n$. Vuong's statistic, $v$, is used to test the hypothesis that $P_1(y_i \mid x_i) = P_2(y_i \mid x_i)$, i.e., for each observation, the two models have the same predicted probabilities for a given dependent variable $y_i$. Vuong showed that the statistic is bidirectional, and asymptotically s-normal. For a given s-significance level $\alpha$, if $v > z_{\alpha/2}$, then Model 1 is chosen; if $v < -z_{\alpha/2}$, then Model 2 is selected; otherwise, i.e., $|v| \le z_{\alpha/2}$, either of the two models can be selected as the appropriate model. In this case, a simpler model is always preferred. For our study, Vuong's test is also suitable for testing the relative appropriateness of the NBRM, and the ZINB model.

C. Information Criteria

Information criteria-based model selection techniques, which are based on the fitted log-likelihood function, can be used for both nested, and non-nested count models. It is assumed that the log-likelihood will increase as more parameters are added to a model. The penalty for increasing the log-likelihood takes into account the number of parameters, as well as the number of observations. The three information criteria measures considered in this study include the AIC, BIC, and CAIC.

2) Absolute, and Relative Errors: The accuracy of the fault prediction models is measured by the average absolute error (AAE), and the average relative error (ARE), i.e.,

    \mathrm{AAE} = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|, \quad \mathrm{ARE} = \frac{1}{n}\sum_{i=1}^{n} \frac{|\hat{y}_i - y_i|}{y_i + 1}

where $n$ is the number of modules, $y_i$ are the actual values of the dependent variable, and $\hat{y}_i$ are the predicted values of the dependent variable. In the case of ARE, because the actual value of the dependent variable may be zero, we add a "1" to the denominator to make the definition always well-defined [5].

IV. EMPIRICAL CASE STUDY

A. System Description

The software metrics, and fault data for this case study were collected from initial releases of two large Windows-based embedded software applications used primarily for customizing the configuration of wireless telecommunications products. The two applications, written in C++, provided similar functionalities, and contained common source code; and hence, are analysed as one software system. Each application contained more than 27.5 million lines of code. A program module is comprised of a single source file.
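The AAE, and ARE measures defined in Section III reduce to a few lines of code; a minimal sketch with illustrative values (not the case-study data):

```python
def aae(actual, predicted):
    """Average absolute error over n modules."""
    n = len(actual)
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / n

def are(actual, predicted):
    """Average relative error; a '1' is added to the denominator because
    the actual fault count of a module may be zero [5]."""
    n = len(actual)
    return sum(abs(p - a) / (a + 1) for a, p in zip(actual, predicted)) / n

# Illustrative actual, and predicted fault counts for six modules.
actual = [0, 0, 3, 1, 0, 7]
predicted = [0.2, 0.1, 2.5, 1.4, 0.0, 5.0]
print(aae(actual, predicted), are(actual, predicted))
```

Note how the "+1" in the ARE denominator both avoids division by zero, and down-weights errors on high-fault modules relative to AAE.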
TABLE I
SOFTWARE METRICS

TABLE II
HYPOTHESIS TESTING RESULTS

Upon preprocessing & cleaning of the collected software metrics, and fault data, i.e., removing incomplete or illogical data points, 1211 modules remained in the data set.

The software metrics for each module were collected using a combination of tools, and databases. Among the 1211 modules, over two-thirds had no faults, and the maximum number of faults in one module was 97. The dependent variable, Fault, indicated the number of the software faults discovered in a source file during system test. In the context of this case study, Table I lists the five software metrics that are used as independent variables for calibrating software quality models.

The fit, and test data sets were obtained by applying an impartial data splitting technique to the original data set of 1211 program modules. Consequently, two-thirds of the modules, i.e., 807, were assigned to the fit data set; whereas, the remaining one-third of the modules, i.e., 404, were assigned to the test data set. To avoid biased results due to a lucky (or unlucky) data splitting, the original data set was randomly split 50 times, yielding 50 pairs of the fit, and test data sets. Empirical studies were performed for all 50 data splits.

B. Results & Analysis

According to the modeling methodology of each count model (see Section II), eight count models were calibrated for each of the fifty data splits. To determine the best fitted models for the case study, we employed pairwise hypothesis testing techniques. Because we considered eight different count models, the number of possible pairwise comparisons is $\binom{8}{2} = 28$. However, we present only ten of the pairwise comparisons, and subsequently selected the three best count models. Other comparisons were either not performed because hypothesis testing was not feasible, or are not presented because the respective pairwise tests were irrelevant. The eight pairwise likelihood ratio-based hypothesis tests included PRM versus NBRM, ZIP versus ZINB, PRM versus HP1, PRM versus HP2, NBRM versus HNB1, NBRM versus HNB2, HP1 versus HNB1, and HP2 versus HNB2. In addition, the non-nested models were examined by applying Vuong's hypothesis testing to two pairwise comparisons: PRM versus ZIP, and NBRM versus ZINB.

The testing results, which were consistent for all fifty data splits, are summarized in Table II. The first column of the table lists the ten pairwise comparisons, the second column shows the hypothesis testing techniques utilized, the third column indicates the recommended model according to the results of the respective pairwise testing, and the fourth column presents the p-values of the comparisons. The p-values indicate the s-significance by which the recommended model is s-better than its counterpart. The overall conclusion from the hypothesis testing indicated that the ZINB, and HNB models are s-significantly better than the other count models, for both threshold values. Determining the best model among these three is not feasible with hypothesis testing, i.e., to our knowledge, no hypothesis testing-based technique exists that can be used to compare ZINB versus HNB models, and the respective HNB models with different thresholds. Therefore, with respect to hypothesis testing-based model selection, these three models are considered to have similar performance. However, if one were to select the best count model among them, then information criteria (IC)-based model selection techniques could be used. The next step involved IC-based model selection, i.e., AIC, BIC, and CAIC, and evaluating the results against those obtained by hypothesis testing. It was observed that for all three information criteria, similar model selection results were obtained. Moreover, the top three models selected by IC were the same as those determined by hypothesis testing. More specifically, the ZINB, and HNB models have lower IC values, indicating that they are the preferred models for this case study. In contrast, the PRM demonstrated the largest values with respect to information criteria, whereas the NBRM, ZIP, and HP models depicted intermediate information criteria values.
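Once each candidate model's fitted log-likelihood is known, comparing the three IC measures is mechanical. The sketch below assumes the standard definitions of AIC, BIC, and Bozdogan's CAIC (the paper's exact formulations are not reproduced in this excerpt), with illustrative log-likelihood values:

```python
import math

def information_criteria(loglik, k, n):
    """Standard definitions (assumed): AIC = -2 ln L + 2k,
    BIC = -2 ln L + k ln n, CAIC = -2 ln L + k (ln n + 1).
    loglik: fitted log-likelihood; k: number of estimated parameters;
    n: number of observations (modules in the fit data set)."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    caic = -2.0 * loglik + k * (math.log(n) + 1.0)
    return aic, bic, caic

# Illustrative values only: a 6-parameter model vs. a richer 13-parameter
# model (e.g., a two-part model) fit to 807 modules.
aic_a, bic_a, caic_a = information_criteria(loglik=-950.0, k=6, n=807)
aic_b, bic_b, caic_b = information_criteria(loglik=-940.0, k=13, n=807)
# Lower values are preferred; the richer model's extra parameters must
# "pay for themselves" through a sufficiently higher log-likelihood.
print(aic_a, aic_b, bic_a, bic_b)
```

With these illustrative numbers, AIC prefers the richer model while BIC, and CAIC prefer the simpler one, showing how the per-parameter penalty grows with the number of observations.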
TABLE X
MULTIPLE PAIRWISE COMPARISONS: AAE, AND ARE (FIT DATA SETS)
Note: Means with the same letter are not s-significantly different.
TABLE XI
MULTIPLE PAIRWISE COMPARISONS: AAE, AND ARE (TEST DATA SETS)
Note: Means with the same letter are not s-significantly different.
[5] T. M. Khoshgoftaar, J. C. Munson, B. B. Bhattacharya, and G. D. Richardson, "Predictive modeling techniques of software quality from software measures," IEEE Trans. Software Engineering, vol. 18, no. 11, pp. 979–987, Nov. 1992.
[6] T. M. Khoshgoftaar, K. Gao, and R. M. Szabo, "An application of zero-inflated Poisson regression for software fault prediction," in Proceedings of the Twelfth International Symposium on Software Reliability Engineering, Hong Kong, China, Nov. 2001, pp. 66–73, IEEE Computer Society.
[7] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Trans. Software Engineering, vol. 28, no. 7, pp. 706–720, July 2002.
[8] T. M. Khoshgoftaar, B. Cukic, and N. Seliya, "Predicting fault-prone modules in embedded systems using analogy-based classification models," International Journal of Software Engineering and Knowledge Engineering, vol. 12, no. 2, pp. 201–221, Apr. 2002, World Scientific Publishing.
[9] W. M. Evanco, "Modeling the effort to correct faults," Journal of Systems and Software, vol. 29, pp. 75–84, 1995.
[10] P. Deb and P. K. Trivedi, "Demand for medical care by the elderly: A finite mixture approach," Journal of Applied Econometrics, vol. 12, pp. 313–336, 1997.
[11] J. Mullahy, "Specification and testing of some modified count data models," Journal of Econometrics, vol. 33, pp. 341–365, 1986.
[12] W. Pohlmeier and V. Ulrich, "An econometric model of the two-part decision making process in the demand for health care," The Journal of Human Resources, vol. 30, pp. 339–361, 1995.
[13] A. C. Cameron and P. K. Trivedi, Regression Analysis of Count Data. Cambridge University Press, 1998.
[14] C. Dean, J. F. Lawless, and G. E. Willmot, "A mixed Poisson-inverse-Gaussian regression model," The Canadian Journal of Statistics, vol. 17, no. 2, pp. 171–181, 1989.
[15] W. H. Greene, Econometric Analysis, 4th ed. Upper Saddle River, New Jersey: Prentice-Hall, 2000.
[16] D. Lambert, "Zero-inflated Poisson regression, with an application to defects in manufacturing," Technometrics, vol. 34, no. 1, pp. 1–14, Feb. 1992.
[17] Q. H. Vuong, "Likelihood ratio tests for model selection and non-nested hypotheses," Econometrica, vol. 57, no. 2, pp. 307–333, Mar. 1989.
[18] SAS/STAT User's Guide. Cary, NC, USA: SAS Institute Inc., 1990, vol. 2.
[19] M. L. Berenson, D. M. Levine, and M. Goldstein, Intermediate Statistical Methods and Applications: A Computer Package Approach. Englewood Cliffs, New Jersey: Prentice Hall, 1983.

Kehan Gao received the Ph.D. degree in Computer Engineering from Florida Atlantic University, Boca Raton, FL, USA, in 2003. She is currently an Assistant Professor in the Department of Mathematics and Computer Science at Eastern Connecticut State University. Her research interests include software engineering, software metrics, software reliability and quality engineering, computer performance modeling, computational intelligence, and data mining. She is a member of the IEEE Computer Society, and the Association for Computing Machinery.

Taghi M. Khoshgoftaar is a professor of the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering Laboratory, and the Data Mining and Machine Learning Laboratory. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 300 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the Program Chair, and General Chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004, and 2005, respectively. He has served on technical program committees of various international conferences, symposia, and workshops. Also, he has served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality and Fuzzy Systems.