
The Journal of Systems and Software 80 (2007) 1349–1361


Predicting object-oriented software maintainability using multivariate adaptive regression splines
Yuming Zhou, Hareton Leung *

Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, PR China

Received 21 February 2006; received in revised form 26 October 2006; accepted 29 October 2006
Available online 8 December 2006

Abstract

Accurate software metrics-based maintainability prediction can not only enable developers to better identify the determinants of software quality, and thus help them improve design or coding, but can also provide managers with useful information to help them plan the use of valuable resources. In this paper, we employ a novel exploratory modeling technique, multivariate adaptive regression splines (MARS), to build software maintainability prediction models using metric data collected from two different object-oriented systems. The prediction accuracy of the MARS models is evaluated and compared with that of multivariate linear regression models, artificial neural network models, regression tree models, and support vector regression models. The results suggest that for one system MARS can predict maintainability more accurately than the other four typical modeling techniques, and that for the other system MARS is as accurate as the best modeling technique.
© 2006 Elsevier Inc. All rights reserved.

Keywords: Object-oriented; Maintainability; Prediction; Multivariate adaptive regression splines

* Corresponding author. Tel.: +86 852 27667252; fax: +86 852 27740842.
E-mail address: cshleung@inet.polyu.edu.hk (H. Leung).

doi:10.1016/j.jss.2006.10.049

1. Introduction

The largest cost associated with any software product over its lifetime is the software maintenance cost. One approach to controlling maintenance costs is to utilize software metrics during the development phase (Bandi et al., 2003). Studies examining the link between OO software metrics and maintainability have found that in general these metrics can be used as predictors of maintenance effort (van Koten and Gray, 2006; Fioravanti and Nesi, 2001; Li and Henry, 1993; Briand et al., 2001; Bandi et al., 2003; Misra, 2005; Thwin and Quah, 2005). In Fioravanti and Nesi (2001), Li and Henry (1993), and Misra (2005), linear regression techniques are used to build maintainability prediction models. In Thwin and Quah (2005), neural networks are used to build maintainability prediction models. In Fenton et al. (2002) and Fenton and Neil (1999, 2000), Fenton et al. state that Bayesian Belief Networks (BBNs) are the most promising technique for software quality prediction because of their ability to handle uncertainties, incorporate expert knowledge, and model the complex relationships among variables. The introduction of BBNs to software quality prediction is certainly a positive step forward. However, the limitations of BBNs have also been recognized by researchers (Ma et al., 2006; Weaver, 2003; Yu and Johnson, 2002). First, for software quality prediction, the links in a BBN represent causal relationships between variables that are generally software metrics, but not all related software metrics have causal relationships. Second, it is not easy to express uncertainty in terms of probabilities. Third, the inference of a BBN requires a (subjective) prior probability distribution, which may not always be available or reasonable. The prediction accuracy may be greatly decreased if the subjective priors are specified incorrectly. Fourth, it is difficult to identify the correct structure of a network that one has never explicitly described in detail
before. In practice, whether a modeling technique is suitable for maintainability prediction not only depends on its ability to capture causal relationships but also depends on the ease of building prediction models. Although BBNs are strong in causal modeling, the limitations mentioned above might make it difficult for practitioners to develop a maintainability model with high prediction accuracy. In van Koten and Gray (2006), van Koten and Gray make the first use of BBNs in building software maintainability prediction models. They use a special type of Bayesian network called the Naïve-Bayes classifier, which assumes no expert knowledge about the prior probability distribution but learns it from data by batch learning. The results show that the BBN model is more accurate than regression-based models for one system but is less accurate than regression-based models for another system. Accurate software metrics-based maintainability prediction is desirable first because it reduces future maintenance efforts by enabling developers to better identify the determinants of software quality and thereby improve design or coding, and second because it provides managers with information for more effectively planning the use of valuable resources. Although a number of maintainability prediction models have been developed in the last decade, they have low prediction accuracies according to the criteria suggested in the literature (Conte et al., 1986). Therefore, it is necessary to explore new techniques, which are easy to use, for building maintainability prediction models with high prediction accuracy.

This paper investigates the applicability to software maintainability prediction of a novel exploratory multivariate analysis technique, MARS. MARS differs from other techniques in that its approach to model building does not require the specification in advance of a functional form. Rather, it attempts to adapt to the unknown functional form using a series of piecewise regression splines. This makes it very suitable for modeling complex relationships that other modeling techniques find difficult, if not impossible, to reveal. The power of MARS for building prediction models has been demonstrated in many applications such as biomedical analysis (Deconinck et al., 2005), network intrusion detection (Peddabachigari et al., 2007), credit scoring (Lee and Chen, 2005), and cancer diagnosis (Chou et al., 2004). However, MARS has been little used in software engineering (Briand et al., 2002, 2004). In this paper, we describe its first use in building software maintainability prediction models with Li and Henry's metric data collected from two different object-oriented systems (Li and Henry, 1993). The maintainability of a software system can be measured in different ways. In past studies, maintainability has been defined as "time required to make changes" (Gibson and Senn, 1989) and "time to understand, develop, and implement modification" (Rising, 1992). In this paper, it is measured as the number of changes made to the code during a maintenance period. To evaluate the benefits of using MARS over typical modeling techniques, we also build classical multivariate linear regression models, artificial neural network models, regression tree models, and support vector regression models. We compare the MARS models with these four models in terms of their prediction performance using leave-one-out cross-validation. The results suggest that for one system MARS can predict maintainability more accurately than the other four typical modeling techniques, and for the other system MARS is as accurate as the best modeling technique.

The rest of this paper is organized as follows. Section 2 introduces four typical prediction modeling techniques. Section 3 describes the main modeling technique used: multivariate adaptive regression splines (MARS). Section 4 discusses the prediction accuracy measures, cross-validation method, and significance test technique employed in evaluating prediction models. Section 5 presents the OO software data sets used in our study. Section 6 reports the results of the MARS models and compares them with the other models. Section 7 concludes the paper and outlines directions for future work.

2. Typical modeling techniques

In this section, we first introduce three commonly used modeling techniques: multivariate linear regression, artificial neural network, and regression tree. Then, we describe the recently proposed support vector regression, which is regarded as an excellent technique for producing prediction models with good generalization.

2.1. Multivariate linear regression

Multivariate linear regression (MLR) is the most commonly used technique for modeling the relationship between two or more independent variables and a dependent variable by fitting a linear equation to observed data. The main advantages of this technique are its simplicity and that it is supported by many popular statistical packages (Heiat, 2002). The general form of an MLR model can be given by:

\hat{y}_i = a_0 + a_1 x_{i1} + \cdots + a_k x_{ik}
y_i = a_0 + a_1 x_{i1} + \cdots + a_k x_{ik} + e_i

where x_{i1}, ..., x_{ik} are the independent variables, a_0, ..., a_k the parameters to be estimated, \hat{y}_i the dependent variable to be predicted, y_i the actual value of the dependent variable, and e_i is the error in the prediction of the ith case (Khoshgoftaar and Seliya, 2003).

When building such an MLR model, the use of a variable selection method ensures that only important independent variables are included in the model. Three commonly used variable selection methods are forward selection, backward elimination, and stepwise selection. The forward selection method starts with a model that includes the intercept only. Based on certain evaluation criteria, independent variables are then selected for inclusion in the
model, until a stopping criterion is fulfilled. The backward elimination method starts with a model that includes all independent variables. A number of these independent variables are then deleted one at a time until a stopping criterion is fulfilled. Neither the forward selection nor the backward elimination method is guaranteed to find the subset with the highest evaluation criterion. In contrast, stepwise selection selects an optimal subset of independent variables for the model. At each step of the stepwise selection process, independent variables are either added to or deleted from the regression model.

Once the significant independent variables have been determined, i.e. the model is selected, the parameters a_0, ..., a_k are estimated using the least squares method. More specifically, the values of the parameters are selected in order to minimize \sum_{i=1}^{n} e_i^2, where n is the number of observations in the fit data set.
2.2. Artificial neural network

An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems (Belsley et al., 2004; Rosenblatt, 1962).

Feed-forward supervised-learning neural networks are the most commonly used model-building technique as an alternative to MLR for software quality or effort prediction (Heiat, 2002; Srinivasan and Fisher, 1995; Karunaithi et al., 1992; Khoshgoftaar and Lanning, 1995). Fig. 1 shows a fully connected feed-forward ANN, which consists of four layers of nodes: one input layer, two hidden layers, and one output layer. Each line between nodes has a corresponding and distinct weight. The feed-forward ANN allows signals to travel one way only, from input to output. There is no feedback (loops), i.e. the output of any layer does not affect that same layer. The output of each node is determined by its inputs and the weights associated with the lines between these inputs and this node. In other words, each node is associated with a transfer function. The behavior of an ANN depends on both the weights and the transfer functions that are specified for the nodes.

In order to train an ANN to perform some task, we must adjust the weight of each line in such a way as to reduce the error between the expected output and the actual output. The back-propagation algorithm is the most widely used training method for multilayer neural networks. It initializes the network with a random set of weights, and the network is trained from a set of input-output pairs. Given an input-output pair, the error between the actual output and the expected output is "back-propagated" through the network and guides the adjustment of the weights in a way that reduces the collective error between actual and expected outputs on the training set (Minsky and Papert, 1969). The details of this algorithm can be found in Hochman et al. (2003).

Fig. 1. A feed-forward neural network.
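The following minimal numpy sketch illustrates the back-propagation idea described above for a single-hidden-layer feed-forward network with one linear output node. The data, layer size, and learning rate are made up, and the paper itself relies on WEKA's implementation rather than code like this.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: one output, four inputs (stand-ins for metric values)
X = rng.normal(size=(60, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.2 * X[:, 2] ** 2).reshape(-1, 1)

# one hidden layer with 6 units; weights start random, as in back-propagation
W1 = rng.normal(scale=0.5, size=(4, 6)); b1 = np.zeros(6)
W2 = rng.normal(scale=0.5, size=(6, 1)); b2 = np.zeros(1)
lr = 0.05

for epoch in range(2000):
    # forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations
    out = h @ W2 + b2                 # linear output node
    err = out - y                     # prediction error

    # backward pass: propagate the error and adjust every weight
    grad_W2 = h.T @ err / len(X)
    grad_b2 = err.mean(axis=0)
    dh = (err @ W2.T) * h * (1 - h)   # error signal at the hidden layer
    grad_W1 = X.T @ dh / len(X)
    grad_b1 = dh.mean(axis=0)

    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

print("final training MSE:", float(np.mean(err ** 2)))
```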
2.3. Regression tree

A regression tree (RT) is a variant of a decision tree that predicts values of continuous variables instead of class labels (Breiman et al., 1993). Fig. 2 shows an example regression tree over 209 different computer configurations, adapted from Witten and Frank (2000).

Fig. 2. A regression tree model for CPU performance data.

A regression tree is built through a recursive partitioning process. This is an iterative process of splitting the data into partitions, and then splitting each partition up further on each of the branches. Initially all of the records in the training set are put together in one node. The algorithm chooses an independent variable whose values minimize the sum of the squared deviations from the mean in the separate parts. More specifically, assume that the values of each independent variable IV_i partition the entire training data set T into subsets T_{ij}, where every observation in T_{ij} takes on the same value, say V_j, for IV_i. The independent variable IV_i that minimizes \sum_j MSE(T_{ij}) is selected to divide the tree, where MSE(T_{ij}) is the mean squared deviation from the mean in T_{ij} for the dependent variable y, i.e.,

MSE(T_{ij}) = \frac{\sum_{y_k \in T_{ij}} (y_k - \bar{y})^2}{|T_{ij}|}
where \bar{y} is the mean of the y_k values in T_{ij}. This partitioning is then applied to each of the new branches. The process continues until each node reaches a user-specified minimum node size and becomes a terminal node. Note that if the sum of squared deviations from the mean in a node is zero, then that node is considered a terminal node even if it has not reached the minimum size. In practice, the most commonly used regression trees are binary trees.
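A sketch of the splitting criterion just described, for a single level of the tree: for each candidate variable, the observations are grouped by value and the within-group MSEs are summed; the variable with the smallest sum is chosen. Recursion, the minimum node size, and binary splitting are omitted, and the toy data are invented.

```python
import numpy as np

def mse_of_partition(values, y):
    """Sum over the subsets T_ij induced by one variable of the
    within-subset squared deviation from the mean (the criterion above)."""
    total = 0.0
    for v in np.unique(values):
        y_part = y[values == v]
        total += np.mean((y_part - y_part.mean()) ** 2)
    return total

def best_split_variable(X, y):
    """Pick the column whose value-based partition minimizes sum_j MSE(T_ij)."""
    scores = [mse_of_partition(X[:, j], y) for j in range(X.shape[1])]
    return int(np.argmin(scores)), scores

# toy data: three integer-valued predictors
rng = np.random.default_rng(2)
X = rng.integers(0, 4, size=(30, 3))
y = 10.0 * (X[:, 1] == 3) + rng.normal(scale=1.0, size=30)
j, scores = best_split_variable(X, y)
print("split on column", j, "scores:", np.round(scores, 3))
```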
2.4. Support vector regression

The support vector machine (SVM) was originally developed for solving classification problems (Vapnik, 1995; Cortes and Vapnik, 1995), but it has recently been extended to the domain of regression problems (Vapnik et al., 1997; Smola, 1996; Smola and Scholkopf, 1998). Since SVM employs the structural risk minimization (SRM) principle rather than the empirical risk minimization (ERM) principle adopted in ANN, it can, unlike ANN approaches, produce prediction models with excellent generalization performance. As a result, SVM is gaining in popularity in the machine learning community.

Suppose we are given a set of l training data {(x_1, y_1), ..., (x_l, y_l)}, where x_i \in R^d denotes the ith input pattern from the d-dimensional input space and has a corresponding target value y_i \in R for i = 1, ..., l, where R is the set of real numbers. The goal of support vector regression (SVR) is to find a function that approximates the actually obtained targets y_i for all the training data and has a minimum generalization error. The general form of an SVR function can be given by:

f(x) = w \cdot \phi(x) + b

where w \in R^n, b \in R, \cdot denotes the dot product in R^n, and \phi is a non-linear transformation from R^d to the high-dimensional space R^n (i.e., n > d). Our goal is to determine w and b such that f(x) can be obtained by minimizing the regression risk

R_{reg}(f) = C \sum_{i=1}^{l} C(f(x_i) - y_i) + \frac{1}{2} \|w\|^2

where C(\cdot) is a cost function and C is a constant that represents the penalty for estimation error (a large value of C means errors are penalized heavily whereas a small value of C means errors are penalized only lightly). A heavier penalty trains the regression to minimize errors by making fewer generalizations. w can be written in terms of the data points as

w = \sum_{i=1}^{l} (a_i - a_i^*) \phi(x_i)

where a_i and a_i^* are Lagrange multipliers with the properties a_i, a_i^* \ge 0. Therefore, the general SVR function can be reformulated as

f(x) = \sum_{i=1}^{l} (a_i - a_i^*) (\phi(x_i) \cdot \phi(x)) + b = \sum_{i=1}^{l} (a_i - a_i^*) k(x_i, x) + b

where k(x_i, x) is called the kernel function, which enables the dot product to be performed in the high-dimensional feature space using low-dimensional input data without the need to know the transformation \phi. Common kernel functions include the linear, polynomial, and radial basis functions (Smola and Scholkopf, 1998). The most widely used cost function is the so-called \varepsilon-insensitive loss function, which takes the form

C(f(x) - y) = \begin{cases} |f(x) - y| - \varepsilon & \text{if } |f(x) - y| \ge \varepsilon \\ 0 & \text{otherwise} \end{cases}

In this loss function, errors are not an issue as long as they are less than \varepsilon, but any deviation larger than this is not acceptable. Such deviations are penalized in a linear fashion.

When applying SVR in real applications, we need to specify a kernel function, the penalty C, and the radius \varepsilon, which determines the data inside the \varepsilon-tube to be ignored in regression.
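As a hedged illustration of how the three user-supplied quantities (the kernel, the penalty C, and \varepsilon) enter an SVR fit, the sketch below uses scikit-learn's SVR on synthetic data; the paper's actual SVR models were built with WEKA's default settings.

```python
import numpy as np
from sklearn.svm import SVR

# toy data standing in for (metrics, CHANGE) pairs; not the paper's data sets
rng = np.random.default_rng(3)
X = rng.normal(size=(70, 5))
y = 30 + 8 * X[:, 0] - 4 * X[:, 3] + rng.normal(scale=2.0, size=70)

# the three user-supplied quantities discussed above:
# a kernel, the penalty C, and the epsilon-tube radius
model = SVR(kernel="rbf", C=10.0, epsilon=0.5)
model.fit(X, y)
pred = model.predict(X)
print("training mean absolute error:", float(np.mean(np.abs(pred - y))))
```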
3. Multivariate adaptive regression splines

Multivariate adaptive regression splines (MARS), proposed by Friedman (1991), is a non-parametric regression technique which models complex relationships that are difficult, if not impossible, for other modeling methods to reveal. In a sense, MARS is based on a divide-and-conquer strategy, partitioning the training data set into separate regions, each of which gets its own regression equation. This makes MARS particularly suitable for problems with high input dimensions.

Fig. 3. Example knots in MARS.

Fig. 3 shows a simple example of how MARS would use piecewise linear regression splines to attempt to fit data in a two-dimensional space (where Y is the dependent variable
and X is the independent variable). A key concept is the notion of knots, which are the points that mark the end of a region of data where a distinct regression equation is run, i.e. where the behavior of the modeled function changes. Fig. 3 shows two knots: 16 and 24. They delimit three intervals where different linear relationships are identified.

MARS makes no assumption about the underlying functional relationship between the dependent and independent variables. It builds flexible regression models by fitting separate splines (or basis functions) to distinct intervals of the independent variables. Both the variables to be used and the end points of the intervals for each variable (i.e. knots) are found through a fast but intensive search procedure. In addition to searching variables one by one, MARS also searches for interactions between independent variables, allowing any degree of interaction to be considered as long as the model that is built can better fit the data. The general MARS model can be represented using the following equation:

\hat{y} = c_0 + \sum_{m=1}^{M} c_m \prod_{k=1}^{K_m} b_{km}(x_{v(k,m)})

where \hat{y} is the dependent variable predicted by the MARS model, c_0 is a constant, b_{km}(x_{v(k,m)}) is the truncated power basis function with v(k, m) being the index of the independent variable used in the mth term of the kth product, and K_m is a parameter that limits the order of interactions (the resulting model will be additive for K_m = 1, and pairwise interactions are allowed for K_m = 2). The splines b_{km} are defined in pairs:

b_{km}(x) = (x - t_{km})_+^q = \begin{cases} (x - t_{km})^q & \text{if } x > t_{km} \\ 0 & \text{otherwise} \end{cases}

and

b_{km+1}(x) = (t_{km} - x)_+^q = \begin{cases} (t_{km} - x)^q & \text{if } t_{km} > x \\ 0 & \text{otherwise} \end{cases}

for m an odd integer, where t_{km}, one of the unique values of x_{v(k,m)}, is known as the knot of the spline, and q \ge 0 is the power to which the splines are raised in order to manipulate the degree of smoothness of the resultant regression models. When q = 1, simple linear splines are applied.

The optimal MARS model is built in two stages: a forward stepwise selection process followed by a backward "pruning" process. The forward stepwise selection of basis functions starts with the constant basis function. At each step, from all the possible splits in each basis function, the process chooses the split that minimizes some "lack of fit" criterion. This search continues until the model reaches some predetermined maximum number of basis functions. In the backward "pruning" process, the "lack of fit" criterion is used to evaluate the contribution of each basis function to the descriptive abilities of the model. The basis functions contributing the least to the model are eliminated stepwise. The optimal model in the stepwise sequence is then selected. Here, the lack of fit measure used is based on the generalized cross-validation criterion (GCV), defined as

GCV(M) = \frac{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\left(1 - \frac{C(M)}{n}\right)^2}

where n is the number of observations in the data set, M the number of non-constant terms in the model, and C(M) is a complexity penalty function. The purpose of C(M) is to penalize model complexity, to avoid overfitting, and to promote the parsimony of models. It is usually defined as

C(M) = M + cd

where c is a user-defined cost penalty factor for each basis function optimization, and d is the effective degrees of freedom, which is equal to the number of independent basis functions in the model. The higher the factor c, the more basis functions will be excluded. In practice, c is increased during the pruning step in order to obtain smaller models.

Once the model is built, it is possible to estimate, on a scale between 0 and 100, the relative importance of a variable in terms of its contribution to the fit of the model. To calculate the relative importance of a variable, we delete all terms containing the variable in question, refit the model, and then calculate the reduction in fit. The most important (and highest scoring) variable is the one that, when deleted, most reduces the fit of the model. Less important variables receive lower scores. These scores correspond to the ratio of the reduction in fit produced by these variables to that of the most important variable.
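To make the paired hinge splines and the GCV criterion concrete, the sketch below fits a single-knot, q = 1 model by least squares and scores it with the GCV formula above. The penalty constant and the assumption d = M are our own simplifications, and the data only loosely echo the single-knot situation of Fig. 3.

```python
import numpy as np

def hinge_pair(x, knot):
    """The paired truncated linear splines (q = 1): (x - t)_+ and (t - x)_+."""
    return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

def gcv(y, y_hat, n_terms, penalty=3.0):
    """GCV score with C(M) = M + c*d, taking d = M (an assumption)."""
    n = len(y)
    c_m = n_terms + penalty * n_terms
    return np.mean((y - y_hat) ** 2) / (1.0 - c_m / n) ** 2

# toy one-dimensional example with a knot at 16
rng = np.random.default_rng(4)
x = rng.uniform(0, 40, size=50)
y = 5 + 0.2 * x + 1.5 * np.maximum(0, x - 16) + rng.normal(scale=1.0, size=50)

left, right = hinge_pair(x, 16.0)
design = np.column_stack([np.ones_like(x), left, right])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ coef
print("GCV with one knot:", round(gcv(y, y_hat, n_terms=2), 3))
```

In the forward stage MARS would try many candidate knots and variables and keep the splits that most reduce the lack of fit; the backward stage would then prune basis functions using this GCV score.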
4. Model evaluation

In this section, we first introduce the criteria for evaluating the prediction accuracy of a model. Then, we discuss the cross-validation method used in this study to obtain nearly unbiased estimators of the prediction error. Finally, we describe the significance test method used for comparing the prediction capability of different models.

4.1. Prediction accuracy

An important question that needs to be asked of any prediction model is "How accurate are its predictions?" Researchers have used various measures of accuracy, the most popular being the residual, the magnitude of relative error, and Pred. All of these are based on only two terms, the actual and the predicted values.

Suppose the training set consists of n observations. Given an observation i, the corresponding residual is the difference between the actual value and the predicted value (Maxwell, 2002), that is,

Res_i = y_i - \hat{y}_i
where y_i is the ith value of the dependent variable as observed in the data set and \hat{y}_i is the corresponding predicted value from the prediction model. Thus, positive numbers mean underestimates and negative numbers mean overestimates. The absolute value of this residual is called the absolute residual error (ARE), that is,

ARE_i = |Res_i|

To characterize the distribution of the AREs, we use the sum of AREs (SumARE), the median of AREs (MedARE), and the standard deviation of AREs (SDARE). SumARE measures the total residuals over the data set, MedARE measures the central tendency of the ARE distribution, and SDARE measures the dispersion of the ARE distribution. These measures can be used to compare the prediction performance of different models. In Pickard et al. (1999), however, Pickard et al. pointed out that it is often more useful to look at boxplots of the residuals, which allow different models to be compared visually. Moreover, residual boxplots show whether or not the predictions are biased (i.e. whether the median differs from zero) and whether the model has a tendency to underestimate or overestimate.

Although ARE can be used to compare models on the same data set, it cannot be used as an indicator alone (De Lucia et al., 2005). For this reason, a normalized measure, the magnitude of relative error (MRE), is often used. Given an observation i, MRE_i is defined as

MRE_i = \frac{|y_i - \hat{y}_i|}{y_i}

Similar to van Koten and Gray (2006), this paper uses the maximum value of MRE (MaxMRE) to compare the prediction performance of different models. MaxMRE measures the maximum relative discrepancy, which is the maximum error relative to the actual value in the prediction. The mean magnitude of relative error (MMRE) is calculated with the following formula:

MMRE = \frac{1}{n} \sum_{i=1}^{n} MRE_i

where n is the number of observations in the data set. MMRE is regarded as a versatile assessment criterion and has a number of advantages: it can be used to make comparisons across data sets and all kinds of prediction model types, and it is independent of measurement units and is scale independent (Walkerden and Jeffery, 1999; Conte et al., 1986; Strike et al., 2001). Although some research has shown that MMRE may produce misleading results (Myrtveit et al., 2005; Foss et al., 2003), it has been the de facto standard accuracy measure for prediction models because it is easy to understand and meaningful for the final users of prediction models.

Pred is a measure of the proportion of the predicted values that have an MRE less than or equal to a specified value, given by

Pred(q) = \frac{k}{n}

where q is the specified value, k the number of observations whose MRE is less than or equal to q, and n is the total number of observations in the data set. In this paper, Pred(0.25) and Pred(0.30) are used because they are commonly used in the empirical software engineering literature (De Lucia et al., 2005; Kitchenham et al., 2002). The former reports the percentage of the estimates with an MRE of 25% or less, and the latter reports the percentage of the estimates with an MRE of 30% or less.

4.2. Cross validation

When a model is built, we should perform some kind of cross-validation. Cross-validation is a way of obtaining a realistic estimate of the predictive power of a model when it is applied to data sets other than those from which the model was derived. In general, a data set is divided into two subsets: a training set and a test set. The training set is used to fit the model, and the test set is used to validate the model. This is called split-sample validation. Since data sets used in effort prediction are usually of limited size, split-sample validation is often difficult. As a remedy, v-fold cross-validation is a way of obtaining nearly unbiased estimators of the prediction error. For a data set with n observations, v-fold cross-validation divides the data set into v approximately equal partitions, and each in turn is used for testing while the remainder is used for training. Leave-one-out (LOO) is simply n-fold cross-validation, where n is the number of observations in the data set. We choose to use LOO cross-validation in this study for three reasons. First, as Myrtveit et al. pointed out, LOO cross-validation is a widely used variant of v-fold cross-validation; more importantly, from a practitioner's standpoint it is closer to a real-world situation than k-fold cross-validation with k < n (Myrtveit et al., 2005). Second, unlike k-fold cross-validation (k < n), LOO is deterministic, i.e. no sampling is involved (Witten and Frank, 2000). Third, LOO ensures that the greatest possible amount of data is used for training in each case, which presumably increases the chance of getting as accurate an estimate as can possibly be obtained (Witten and Frank, 2000). The disadvantage of LOO is that it is computationally intensive, because the entire learning procedure must be executed n times. However, this is not a problem for our study because of the relatively small sample size.

4.3. Significance test

We are interested in observing whether MARS is superior to MLR, ANN, RT, and SVR for building maintainability prediction models. As such, the Wilcoxon signed-rank test for matched samples is used to test and compare the MARS models and the other models in terms of the two types of error, ARE and MRE. This test was chosen
because: (a) it is a distribution-free technique that does not require any underlying distributions in the data; and (b) it deals with the signs and ranks of the values and not with their magnitudes (and is thus not influenced by outlier data points).

The test procedure first calculates the differences between the paired observations, ranks them from smallest to largest by absolute value (the rank assigned to tied observations is the mean of the ranks that would have been assigned to the observations had they not been tied), and then affixes the sign of each difference to the corresponding rank (Zar, 1984). The sum of the ranks having a plus sign is called T+ and the sum of the ranks having a minus sign is called T-. When the sample size n is larger than 25, the distribution of T (where either T+ or T- may be used for T) is closely approximated by a normal distribution with a mean of

\mu_T = \frac{n(n+1)}{4}

and a standard error of

\sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{24}}

Thus, the test statistic can be calculated from

Z = \frac{|T - \mu_T| - 0.5}{\sigma_T}

where for T we may use, with identical results, either T+ or T-. Then, for a two-tailed test, if Z is greater than or equal to the critical value Z_{\alpha(2)}, the null hypothesis H0 (\mu_1 = \mu_2) is rejected, where \mu_1 and \mu_2 are the two population means of the matched pairs. That is, the alternative hypothesis H1 (\mu_1 \ne \mu_2) is accepted. In particular, if Z \ge Z_{\alpha(2)} and T+ < T-, we conclude that \mu_1 < \mu_2; if Z \ge Z_{\alpha(2)} and T+ > T-, we conclude that \mu_1 > \mu_2. The level of statistical significance \alpha provides an insight into the accuracy of the testing. The selection of \alpha may depend on the analyst and the properties of the investigated problem. In the empirical software engineering literature, it is common to set \alpha = 0.05 or 0.1.

In this study, the null and alternative hypotheses using ARE are as follows:

H0: ARE_{MARS} = ARE_X
H1: ARE_{MARS} \ne ARE_X

where X denotes MLR, ANN, RT, or SVR. Similarly, the null and alternative hypotheses using MRE are as follows:

H0: MRE_{MARS} = MRE_X
H1: MRE_{MARS} \ne MRE_X

The minimum significance level for rejecting a null hypothesis was set at \alpha = 0.05.
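To tie the pieces of this section together, the sketch below runs the full evaluation procedure on synthetic data: leave-one-out predictions for two candidate models, the Section 4.1 accuracy measures, and a matched-pairs Wilcoxon signed-rank test on the paired absolute errors. The models, data, and parameter choices are illustrative only (the study itself uses SPSS, WEKA, and MARS 2.0), SDARE is taken as the sample standard deviation, and scipy's exact Wilcoxon implementation is used rather than the normal approximation given above.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

def loo_predictions(model, X, y):
    """Leave-one-out predictions: refit on n - 1 cases, predict the held-out one."""
    pred = np.empty_like(y)
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])
        pred[test_idx] = model.predict(X[test_idx])
    return pred

def accuracy_measures(y, pred):
    """The Section 4.1 measures computed from actual and predicted values."""
    are = np.abs(y - pred)     # absolute residual errors
    mre = are / y              # magnitudes of relative error
    return {"MaxMRE": mre.max(), "MMRE": mre.mean(),
            "Pred(0.25)": np.mean(mre <= 0.25), "Pred(0.30)": np.mean(mre <= 0.30),
            "SumARE": are.sum(), "MedARE": np.median(are), "SDARE": are.std(ddof=1)}

# synthetic stand-in for a metrics/CHANGE data set (39 cases, 5 predictors)
rng = np.random.default_rng(7)
X = rng.normal(size=(39, 5))
y = 50 + 12 * X[:, 0] - 6 * X[:, 2] + rng.normal(scale=8.0, size=39)

pred_a = loo_predictions(LinearRegression(), X, y)
pred_b = loo_predictions(SVR(kernel="rbf", C=10.0, epsilon=0.5), X, y)

print("MLR:", accuracy_measures(y, pred_a))
print("SVR:", accuracy_measures(y, pred_b))

# matched-pairs Wilcoxon signed-rank test on the two models' absolute errors
stat, p = wilcoxon(np.abs(y - pred_a), np.abs(y - pred_b))
print("Wilcoxon signed-rank on ARE:", stat, "p =", p)
```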
5. Data sets

In this section, we describe the data sets used for this study. We first introduce the metrics under study and then give the distribution and correlation analysis results of the metrics that were investigated.

5.1. Studied metrics

This study makes use of two OO software data sets published by Li and Henry (1993). The first data set, UIMS, contains the metric data of 39 classes collected from a user interface management system. The second data set, QUES, contains the metric data of 71 classes collected from a quality evaluation system. Both systems were implemented in Ada.

The metric data of both the UIMS and QUES data sets consist of eleven metrics: nine object-oriented metrics, one traditional size metric, and one maintainability metric. Among the object-oriented metrics, WMC, DIT, RFC, NOC, and LCOM were proposed by Chidamber and Kemerer (1994), and MPC, DAC, NOM, and SIZE2 were proposed by Li and Henry (1993); SIZE1 is the traditional lines-of-code size metric. The object-oriented metrics capture the concepts of cohesion, coupling, inheritance, and size. Maintainability was measured with the CHANGE metric by counting the number of lines in the code that were changed per class over a 3-year maintenance period. A line change could be an addition or a deletion; a change in the content of a line is counted as a deletion followed by an addition.

Table 1 provides definitions of these metrics. Essentially, all the object-oriented metrics measure some aspect of the complexity of software. WMC is used to measure the complexity of an individual class attributable to control flow structures. DIT and NOC determine the complexity of a class attributable to inheritance relationships. RFC, MPC, and DAC measure the complexity of a class caused by coupling, while LCOM quantifies the complexity of a class caused by (lack of) cohesion. NOM, SIZE1, and SIZE2 measure the complexity of a class caused by its size. It is believed that the complexity of an object-oriented system strongly affects its maintainability (van Koten and Gray, 2006; Li and Henry, 1993; Thwin and Quah, 2005). Therefore, it is reasonable to use these object-oriented metrics to predict software maintainability.

5.2. Characteristics of data sets

In the following we describe the results of the distribution and correlation analysis of the eleven metrics as they appear in our data sets. Tables 2 and 3 present some common descriptive statistics for the UIMS and QUES data sets, respectively. As can be seen, both the UIMS and QUES data sets have low medians and means for DIT. This indicates that inheritance was not used a lot in either system. Similar medians and means of NOM and SIZE2 are found in the UIMS and QUES data sets, suggesting that the class sizes at the design level in both systems are similar.

Table 1
Definitions of metrics
Metric Definition
WMC (Weighted methods per class) The sum of McCabe’s cyclomatic complexity of all local methods in a given class
DIT (Depth of inheritance tree) The length of the longest path from a given class to the root in the inheritance hierarchy
RFC (Response for a class) The number of methods that can potentially be executed in response to a message
being received by an object of a given class
NOC (Number of children) The number of classes that directly inherit from a given class
LCOM (Lack of cohesion in methods) The number of pairs of local methods in a given class using no attribute in common
MPC (Message-passing coupling) The number of send statements defined in a given class
DAC (Data abstraction coupling) The number of abstract data types defined in a given class
NOM (Number of methods) The number of methods implemented within a given class
SIZE1 (Lines of code) The number of semicolons in a given class
SIZE2 (Number of properties) The total number of attributes and the number of local methods in a given class
CHANGE (Number of lines changed in the class) Insertion and deletion are independently counted as 1; a change of the contents is counted as 2

Table 2
Descriptive statistics of the UIMS data set
Metric Maximum 75% Median 25% Minimum Mean Standard deviation Skewness Kurtosis
WMC 69 12 5 1 0 11.38 15.90 2.03 3.98
DIT 4 3 2 2 0 2.15 0.90 0.54 0.09
RFC 101 30 17 11 2 23.21 20.19 2.00 4.94
NOC 8 1 0 0 0 0.95 2.01 2.24 4.28
LCOM 31 8 6 4 1 7.49 6.11 2.49 6.86
MPC 12 6 3 1 1 4.33 3.41 0.731 0.70
DAC 21 3 1 0 0 2.41 4.00 3.33 12.87
NOM 40 13 7 6 1 11.38 10.21 1.67 1.94
SIZE1 439 131 74 27 4 106.44 114.65 1.71 2.04
SIZE2 61 16 9 6 1 13.97 13.47 1.89 3.44
CHANGE 289 39 18 10 2 46.82 71.89 2.29 4.35

Table 3
Descriptive statistics of the QUES data set
Metric Maximum 75% Median 25% Minimum Mean Standard deviation Skewness Kurtosis
WMC 83 22 9 2 1 14.96 17.06 1.77 3.33
DIT 4 2 2 2 0 1.92 0.53 0.10 5.46
RFC 156 62 40 34 17 54.44 32.62 1.62 1.96
NOC 0 0 NA 0 0 0 0.00 NA NA
LCOM 33 14 5 4 3 9.18 7.34 1.35 1.10
MPC 42 21 17 12 2 17.75 8.33 0.88 1.17
DAC 25 4 2 1 0 3.44 3.91 2.99 12.82
NOM 57 21 6 5 4 13.41 12.00 1.39 1.40
SIZE1 1009 333 211 172 115 275.58 171.60 2.11 5.23
SIZE2 82 25 10 7 4 18.03 15.21 1.71 3.42
CHANGE 217 85 52 35 6 64.23 43.13 1.36 2.17

However, the medians and means of SIZE1 in the QUES data set are significantly larger than those in the UIMS data set. This may suggest that the complexities of the problems processed by the two systems are rather different. Moreover, the medians and means of RFC and MPC in the QUES data set are far higher than the corresponding medians and means in the UIMS data set. This suggests that the coupling between classes in the quality evaluation system is higher than that in the user interface management system. In contrast, the median and mean of LCOM in the QUES data set are similar to the median and mean of LCOM in the UIMS data set, implying that the two systems have similar cohesion. It can also be seen that the mean of CHANGE in the QUES data set is larger than that in the UIMS data set.

As stated in Briand et al. (2000), metrics that vary little do not differentiate classes very well and therefore are not likely to be useful predictors. Only measures with more than five non-zero data values will be considered for subsequent analysis. From Table 2, for all metrics except NOC, DIT, and DAC, there are large differences between the lower 25th percentile, the median, and the 75th percentile, thus showing strong variations across classes.

Table 4
Correlations between the metrics in the data sets UIMS (upper triangle) and QUES (lower triangle)
WMC DIT RFC NOC LCOM MPC DAC NOM SIZE1 SIZE2 CHANGE
* * * * * * *
WMC 1 0.22 0.91 0.23 0.80 0.63 0.44 0.84 0.97 0.77 0.65*
DIT 0.13 1 0.23 0.47 0.19 0.06 0.43* 0.36* 0.19 0.41* 0.43*
RFC 0.74* 0.11 1 0.21 0.79* 0.74* 0.62* 0.93* 0.91* 0.89* 0.64*
NOC NA NA NA 1 0.13 0.03 0.32* 0.23 0.17 0.27 0.56*
LCOM 0.57* 0.12 0.82* NA 1 0.50* 0.36* 0.75* 0.82* 0.68* 0.57*
MPC 0.14 0.02 0.33* NA 0.10 1 0.44* 0.55* 0.67* 0.55* 0.45*
DAC 0.57* 0.39* 0.64* NA 0.56* 0.02 1 0.75* 0.52* 0.87* 0.63*
NOM 0.70* 0.13 0.81* NA 0.88* 0.11 0.81* 1 0.87* 0.98* 0.64*
SIZE1 0.89* 0.01 0.80* NA 0.54* 0.37* 0.64* 0.69* 1 0.82* 0.63*
SIZE2 0.69* 0.20 0.81* NA 0.84* 0.08 0.89* 0.99* 0.71* 1 0.67*
CHANGE 0.43* 0.09 0.39* NA 0.05 0.46* 0.08 0.14 0.64* 0.15 1
* Correlation is significant at the 0.01 level (2-tailed).

For NOC, DIT, and DAC, the number of non-zero data points is larger than five (Briand et al., 2000). Therefore, all the metrics from the UIMS data set will be used in the following analysis. However, from Table 3, we can see that there is no non-zero data point for NOC. Thus, the metric NOC in the QUES data set is removed from the following analysis.

Table 4 shows the results of a linear Pearson correlation analysis for the data sets. More specifically, the upper triangular matrix represents the correlations between the metrics in the UIMS data set, and the lower triangular matrix represents the correlations between the metrics in the QUES data set. Assuming a reasonably sized data set, Hopkins calls a correlation value of less than 0.1 trivial, 0.1-0.3 minor, 0.3-0.5 moderate, 0.5-0.7 large, 0.7-0.9 very large, and 0.9-1 almost perfect (Hopkins, 2003). From Table 4, it can be seen that almost all metrics except DIT and NOC are related to each other in both data sets. In particular, in the UIMS data set, all the object-oriented metrics are statistically related to CHANGE. However, in the QUES data set, only four metrics are statistically related to CHANGE. From Tables 2-4, we conclude that the characteristics of the UIMS data set are different from those of the QUES data set. Therefore, to some extent, the UIMS and QUES data sets are heterogeneous.

6. Empirical results

In this section, we analyze the results of the MARS models using the UIMS and QUES data sets. The tool used is MARS 2.0 developed by Salford Systems [1]. We use the default settings provided by this tool to build the MARS models. To assess the benefits of using MARS, we compare the prediction performance of the MARS models with that of MLR models, ANN models, RT models, and SVR models under LOO cross-validation. Furthermore, we employ the Wilcoxon signed-rank test between the MARS models and the other models to check whether their differences are significant.

In this study, we use the stepwise selection procedure to build the MLR prediction models. The software used is SPSS 13.0 [2]. The entry criterion used in the stepwise selection is a p-value of the F statistic smaller than or equal to 0.05. The eliminating criterion used is a p-value of the F statistic larger than or equal to 0.10. In particular, the multi-collinearity between the predictor variables is tested using the condition index of the correlation matrix of the covariates in the model (Belsley et al., 2004). On the other hand, we use the tool WEKA 3.4.6 [3] to build the backpropagation-trained feed-forward ANN models, the RT models, and the SVR models. The default settings provided by this tool are used to build these models.

[1] www.salford-systems.com.
[2] www.spss.com.
[3] www.cs.waikato.ac.nz/ml/weka.

6.1. Results from UIMS data set

Table 5 shows the optimal MARS model for the UIMS data set. The columns show, from left to right, the basis functions selected as significant covariates in the model, the coefficients estimated, the associated standard errors, the t-ratios, and the p-values indicating the significance of these coefficients. In this study, the interaction effects between basis functions are not examined, since we use the default setting provided by MARS 2.0. Therefore, this model is actually a "main-effects" model; its adjusted R2 is 0.656.

Table 6 provides a ranking of the independent variables by order of importance. Variables having no impact at all are not shown. The loss in GCV is denoted as "gcv" in Table 6. The column "Importance" shows the relative importance of a variable in terms of the percentage of the highest gcv, that is, the highest reduction of goodness of fit among all variables.

Table 7 shows the prediction accuracy measures achieved by the MARS model, the MLR model, the SVR model, the ANN model, and the RT model for the UIMS data set within LOO cross-validation. Although these results do not satisfy the criteria for an accurate prediction
model, it is reported that the prediction accuracy of software maintenance effort prediction models is often low and thus it is very difficult to satisfy the criteria (De Lucia et al., 2005). As can be seen, the MARS model has the smallest MedARE, while its MaxMRE, MMRE, Pred(0.25), and Pred(0.30) are inferior only to those of the SVR model.

Table 5
The MARS model for UIMS data set
Metric                        Coefficient  Standard error  t-Ratio  p-Value
Intercept                     2.184        8.637           0.253    0.802
BF2 = (NOC - 0.131E-08)+      12.957       3.586           3.613    0.009
BF3 = (WMC - 0.337E-06)+      1.888        0.481           3.928    0.004
BF14 = (DAC - 1.000)+         6.316        2.033           3.106    0.004

Table 6
Variable importance in the UIMS data set
Metrics  Importance  gcv
WMC      100.000     3277.827
NOC      85.673      3123.369
DAC      59.394      2902.030

Fig. 4 shows the residual boxplots of all five prediction models for the UIMS data set, allowing a visual comparison. The line in the middle of each box represents the median of the residuals. As can be seen, both the MARS model and the SVR model have a median residual close to zero. In contrast, the RT model has a median residual that is far below zero; thus, there is a tendency to significantly overestimate in the RT model. We also find that the MLR model tends to slightly underestimate while the ANN model tends to slightly overestimate. The MARS model also has the narrowest box and the smallest whiskers (i.e. the lines above and below the box) but, at the same time, it has the most outliers. In summary, comparing these boxplots does not clearly show whether the MARS model is better than the other models.

Fig. 4. Residual boxplots of MARS and commonly used modeling techniques for UIMS data set.

Table 8 presents the Z statistic values of the two-tailed Wilcoxon signed-rank test for the ARE and MRE values of the five models, where the numbers in parentheses are the corresponding p-values. As can be seen, the MARS model is not significantly different from the SVR and ANN models. However, although the test of ARE values indicates that the MARS model is not much different from the MLR model (p = 0.364), the test of MRE values does show strong evidence that the MARS model outperforms the MLR model (p = 0.009). Moreover, both the Wilcoxon signed-rank tests of ARE and MRE values show that the MARS model is significantly better than the RT model.

Table 7
Prediction accuracy for UIMS data set (within LOO cross-validation)
Method MaxMRE MMRE Pred(0.25) Pred(0.30) SumARE MedARE SDARE
MARS 14.06 1.86 0.28 0.28 1532.78 9.26 59.76
MLR 18.88 2.70 0.15 0.21 1457.26 13.64 54.66
SVR 9.13 1.68 0.31 0.36 1242.59 11.03 48.66
ANN 19.63 1.95 0.15 0.15 1473.14 10.84 55.50
RT 24.57 4.95 0.10 0.10 1988.96 29.54 62.87
Table 8
Wilcoxon signed-rank test for UIMS data set
Modeling method  MLR              SVR              ANN              RT
                 ARE      MRE     ARE      MRE     ARE      MRE     ARE      MRE
MARS             0.907a   2.610a  0.809b   0.777b  0.112b   0.334b  2.344a   2.686a
                 (0.364)  (0.009) (0.418)  (0.437) (0.911)  (0.739) (0.019)  (0.007)
a T+ < T-.
b T+ > T-.

6.2. Results from QUES data set

Table 9 shows the optimal MARS model for the QUES data set. The adjusted R2 of this model is 0.837, which is much larger than that (0.656) of the model for the UIMS data set. Table 10 shows the relative importance of the independent variables for the QUES data set. Interestingly, the results from both the UIMS and QUES data sets suggest that WMC and DAC are important indicators of object-oriented software maintainability.
Table 9
The MARS model for QUES data set
Metric                          Coefficient  Standard error  t-Ratio  p-Value
Intercept                       70.649       6.982           10.250   0.000
BF1 = (WMC - 13.000)+           3.663        0.885           4.137    0.001
BF2 = (13.000 - WMC)+           4.863        0.629           7.734    0.000
BF3 = (SIZE1 - 252.000)+        0.786        0.166           4.728    <0.001
BF4 = (252.000 - SIZE1)+        0.612        0.066           9.229    0.000
BF5 = (DAC - 0.122050E-06)+     3.719        0.791           4.702    <0.001
BF6 = (WMC - 26.000)+           5.714        1.121           5.096    <0.001
BF9 = (SIZE1 - 310.000)+        0.651        0.179           3.626    0.006
BF11 = (MPC - 18.000)+          1.982        0.449           4.418    <0.001
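As an illustration of how the basis functions in Table 9 combine into a prediction, the sketch below evaluates the model for a hypothetical QUES-like class. The coefficients and knots are taken as printed above; any minus signs lost in the scanned table are not restored, so this reproduces the structure of the model rather than the exact published fit.

```python
import numpy as np

def hinge(u):
    """Truncated linear spline (x - t)_+ used by the basis functions in Table 9."""
    return np.maximum(0.0, u)

def predict_change_ques(wmc, size1, dac, mpc):
    """Evaluate the Table 9 model; values are taken as printed (signs may be lost)."""
    return (70.649
            + 3.663 * hinge(wmc - 13.0)
            + 4.863 * hinge(13.0 - wmc)
            + 0.786 * hinge(size1 - 252.0)
            + 0.612 * hinge(252.0 - size1)
            + 3.719 * hinge(dac - 0.0)      # knot 0.122050E-06 is effectively 0
            + 5.714 * hinge(wmc - 26.0)
            + 0.651 * hinge(size1 - 310.0)
            + 1.982 * hinge(mpc - 18.0))

# hypothetical QUES-like class: WMC = 9, SIZE1 = 211, DAC = 2, MPC = 17
print(round(predict_change_ques(9, 211, 2, 17), 1))
```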

Table 10
Variable importance in the QUES data set
Metrics Importance gcv
SIZE1 100.000 1471.475
WMC 82.008 1171.803
DAC 36.684 679.524
MPC 33.426 658.626

Table 11 shows the prediction accuracy measures achieved by the MARS model, the MLR model, the SVR model, the ANN model, and the RT model for the QUES data set within LOO cross-validation. Of all the models, the MARS model has the best prediction accuracy.

Fig. 5 shows the residual boxplots of all five prediction models for the QUES data set. As can be seen, the MARS model, the SVR model, and the RT model all have a median residual close to zero. In contrast, the MLR and the ANN models both have a median residual that is slightly below zero, causing them to have a tendency to slightly overestimate. Of all the models, the MARS model has the narrowest box and the smallest whiskers, as well as the fewest outliers. Based on these boxplots, we can say that the MARS model is clearly the best of the five models. This result is in line with the findings in Table 11.

Fig. 5. Residual boxplots of MARS and commonly used modeling techniques for QUES data set.

Table 12 presents the Z statistic values of the two-tailed Wilcoxon signed-rank test for the ARE and MRE values between the MARS model and the other four models, where the numbers in parentheses are the corresponding p-values. As can be seen, both the test of MRE values and the test of ARE values show that the MARS model is significantly better than the other four models.

Table 11
Prediction accuracy for QUES data set (within LOO cross-validation)
Method MaxMRE MMRE Pred(0.25) Pred(0.30) SumARE MedARE SDARE
MARS 1.91 0.32 0.48 0.59 1153.10 13.45 14.66
MLR 2.03 0.42 0.37 0.41 1566.08 16.71 19.90
SVR 2.07 0.43 0.34 0.46 1563.67 15.57 19.79
ANN 3.07 0.59 0.37 0.45 2396.85 18.48 49.34
RT 4.82 0.58 0.41 0.45 1983.88 16.67 28.87
Table 12
Wilcoxon signed-rank test for QUES data set
Modeling method  MLR              SVR              ANN              RT
                 ARE      MRE     ARE      MRE     ARE      MRE     ARE      MRE
MARS             3.217a   3.134a  3.226a   3.155a  3.037a   3.158a  3.134a   2.854a
                 (0.001)  (0.002) (0.001)  (0.002) (0.002)  (0.002) (0.002)  (0.004)
a T+ < T-.

7. Conclusions

In this paper, we presented an empirical study that sought to build object-oriented software maintainability prediction models using a novel exploratory modeling technique, MARS. To build the MARS models, we made use of Li and Henry's data sets, UIMS and QUES, obtained from two different object-oriented systems. The prediction performance of the MARS models was assessed and compared with that of the multivariate linear regression models, the artificial neural network models, the regression tree models, and the support vector regression models. The results show that the MARS models can effectively predict the maintainability of OO software systems. For the UIMS data set, the MARS model is as accurate as the best prediction model. For the QUES data set, the MARS model achieved significantly better prediction accuracy than the other four prediction models. Thus, overall, it is concluded that the MARS model is better than the other models built using typical modeling techniques and that it can be a useful modeling technique for software maintainability prediction.

One limitation of our study is that the metric data were collected from two systems implemented in a single language. Future work will replicate this study across programming languages. Such studies would allow us to further investigate the capability of MARS in software maintainability prediction. Another interesting direction for future work is to develop more accurate prediction models by combining MARS with other prediction techniques.

Acknowledgements

We are very grateful to Prof. Wei Li for giving us a scanned copy of their paper, and we thank Dr. C. van Koten for providing us with an electronic version of Li and Henry's data sets. We also wish to thank the anonymous reviewers of this paper for their helpful comments and suggestions.

References

Bandi, R.K., Vaishnavi, V.K., Turk, D.E., 2003. Predicting maintenance performance using object-oriented design complexity metrics. IEEE Transactions on Software Engineering 29 (1), 77–87.
Belsley, D., Kuh, E., Welsch, R., 2004. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1993. Classification and Regression Trees. Chapman & Hall, London.
Briand, L.C., Wust, J., Daly, J.W., Porter, D.V., 2000. Exploring the relationships between design measures and software quality in object-oriented systems. Journal of Systems and Software 51 (3), 245–273.
Briand, L.C., Bunse, C., Daly, J.W., 2001. A controlled experiment for evaluating quality guidelines on the maintainability of object-oriented designs. IEEE Transactions on Software Engineering 27 (6), 513–530.
Briand, L.C., Melo, W.L., Wüst, J., 2002. Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Transactions on Software Engineering 28 (7), 706–720.
Briand, L.C., Freimut, B., Vollei, F., 2004. Using multiple adaptive regression splines to support decision making in code inspections. Journal of Systems and Software 73 (2), 205–217.
Chidamber, S.R., Kemerer, C.F., 1994. A metrics suite for object-oriented design. IEEE Transactions on Software Engineering 20 (6), 476–493.
Chou, S.M., Lee, T.S., Shao, Y.E., Chen, I.F., 2004. Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines. Expert Systems with Applications 27 (1), 133–142.
Conte, S.D., Dunsmore, H.E., Shen, V.Y., 1986. Software Engineering Metrics and Models. Benjamin/Cummings, Menlo Park, Calif.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Machine Learning 20 (3), 273–297.
Deconinck, E., Xu, Q.S., Put, R., Coomans, D., Massart, D.L., Vander Heyden, Y., 2005. Prediction of gastro-intestinal absorption using multivariate adaptive regression splines. Journal of Pharmaceutical and Biomedical Analysis 39 (5), 1021–1030.
De Lucia, A., Pompella, E., Stefanucci, S., 2005. Assessing effort estimation models for corrective maintenance through empirical studies. Information and Software Technology 47 (1), 3–15.
Fenton, N.E., Neil, M., 1999. A critique of software defect prediction models. IEEE Transactions on Software Engineering 25 (5), 675–689.
Fenton, N.E., Neil, M., 2000. Software metrics: roadmap. In: Proceedings of the International Conference on Software Engineering, Future of Software Engineering Track, Limerick, Ireland, pp. 357–370.
Fenton, N.E., Krause, P., Neil, M., 2002. Software measurement: uncertainty and causal modeling. IEEE Software 19 (4), 116–122.
Fioravanti, F., Nesi, P., 2001. Estimation and prediction metrics for adaptive maintenance effort of object-oriented systems. IEEE Transactions on Software Engineering 27 (12), 1062–1084.
Foss, T., Stensrud, E., Kitchenham, B., Myrtveit, I., 2003. A simulation study of the model evaluation criterion MMRE. IEEE Transactions on Software Engineering 29 (11), 985–995.
Friedman, J., 1991. Multivariate adaptive regression splines. Annals of Statistics 19, 1–141.
Gibson, V.R., Senn, J.A., 1989. System structure and software maintenance performance. Communications of the ACM 32 (3), 347–358.
Heiat, A., 2002. Comparison of artificial neural network and regression models for estimating software development effort. Information and Software Technology 44 (15), 911–922.
Hochman, R., Khoshgoftaar, T.M., Allen, E.B., Hudepohl, J.P., 2003. Improved fault-prone detection analysis of software modules using an evolutionary neural network approach. In: Khoshgoftaar, T.M. (Ed.), Software Engineering with Computational Intelligence. Kluwer Academic Publishers, pp. 69–100.
Hopkins, W.G., 2003. A New View of Statistics. SportScience, Dunedin, New Zealand.
Karunaithi, N., Whitley, D., Malaiya, Y.K., 1992. Prediction of software reliability using connectionist models. IEEE Transactions on Software Engineering 18 (7), 563–574.
Khoshgoftaar, T.M., Lanning, D.L., 1995. A neural network approach for early detection of program modules having high risk in the maintenance phase. Journal of Systems and Software 29 (1), 85–91.
Khoshgoftaar, T.M., Seliya, N., 2003. Fault prediction modeling for software quality estimation: comparing commonly used techniques. Empirical Software Engineering 8 (3), 255–283.
Kitchenham, B., Pfleeger, S.L., McColl, B., Eagan, S., 2002. An empirical study of maintenance and development estimation accuracy. Journal of Systems and Software 64 (1), 57–77.
Lee, T.S., Chen, I.F., 2005. A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Systems with Applications 28 (4), 743–752.
Li, W., Henry, S., 1993. Object-oriented metrics that predict maintainability. Journal of Systems and Software 23 (2), 111–122.
Ma, Y., Guo, L., Cukic, B., 2006. A statistical framework for the prediction of fault-proneness. In: Advances in Machine Learning Applications in Software Engineering. Idea Group Inc.
Maxwell, K.D., 2002. Applied Statistics for Software Managers. Prentice Hall.
Minsky, M., Papert, S., 1969. Perceptrons. MIT Press.
Misra, S.C., 2005. Modeling design/coding factors that drive maintainability of software systems. Software Quality Journal 13 (3), 297–320.
Myrtveit, I., Stensrud, E., Shepperd, M., 2005. Reliability and validity in comparative studies of software prediction models. IEEE Transactions on Software Engineering 31 (5), 380–391.
Peddabachigari, S., Abraham, A., Grosan, C., Thomas, J., 2007. Modeling intrusion detection system using hybrid intelligent systems. Journal of Network and Computer Applications 30 (1), 114–132.
Pickard, L.M., Kitchenham, B.A., Linkman, S.J., 1999. An investigation of analysis techniques for software test data sets. Technical Report TR99-05, Department of Computer Science, Keele University, Staffordshire, UK.
Rising, L.S., 1992. Information hiding metrics for modular programming languages. PhD dissertation, Arizona State University.
Rosenblatt, F., 1962. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Press.
Smola, A.J., 1996. Regression estimation with support vector learning machines. Master's thesis, Technische Universitat Munchen.
Smola, A.J., Scholkopf, B., 1998. A tutorial on support vector regression. Technical Report NC2-TR-1998-030, NeuroCOLT2.
Srinivasan, K., Fisher, D., 1995. Machine learning approaches to estimating software development effort. IEEE Transactions on Software Engineering 21 (2), 126–137.
Strike, K., El-Emam, K., Madhavji, N., 2001. Software cost estimation with incomplete data. IEEE Transactions on Software Engineering 27 (10), 890–908.
Thwin, M.M.T., Quah, T.S., 2005. Application of neural networks for software quality prediction using object-oriented metrics. Journal of Systems and Software 76 (2), 147–156.
van Koten, C., Gray, A.R., 2006. An application of Bayesian network for predicting object-oriented software maintainability. Information and Software Technology 48 (1), 59–67.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, New York.
Vapnik, V., Golowich, S., Smola, A., 1997. Support vector method for function approximation, regression estimation, and signal processing. In: Advances in Neural Information Processing Systems 9. MIT Press, pp. 281–287.
Walkerden, F., Jeffery, R., 1999. An empirical study of analogy-based software effort estimation. Empirical Software Engineering 4 (2), 135–158.
Weaver, R.A., 2003. The safety of software – constructing and assuring arguments. Ph.D. thesis, Department of Computer Science, University of York.
Witten, I.H., Frank, E., 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers.
Yu, Y.Y., Johnson, B.W., 2002. Modeling COTS systems for safety-critical applications using system safety standards by Bayesian Belief Networks. In: Proceedings of the 6th International Conference on Probabilistic Safety Assessment and Management (PSAM-6), June.
Zar, J.H., 1984. Biostatistical Analysis. Prentice Hall, Englewood Cliffs, New Jersey.
