Editor: Dybå, SINTEF
Editor: Helen Sharp, The Open University, London, h.c.sharp@open.ac.uk
Developing Fault-Prediction Models: What the Research Can Show Industry

Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell
IEEE Software | Published by the IEEE Computer Society | 0740-7459/11/$26.00 © 2011 IEEE

Analysis of 206 fault-prediction models reported in 19 mature research studies reveals key features to help industry developers build models suitable to their specific contexts.

CODE FAULTS ARE a daily reality for software development companies. Finding and removing them costs the industry billions of dollars each year. The lure of potential cost savings and quality improvements has motivated considerable research on fault-prediction models, which try to identify areas of code where faults are most likely to lurk. Developers can then focus their testing efforts on these areas. Effectively focused testing should reduce the overall cost of finding faults, eliminate them earlier, and improve the delivered system's quality. Problem solved.

So why do so few companies seem to be developing fault-prediction models? One reason is probably the sheer number and complexity of studies in this field. Before companies can start to develop the models, they must understand questions such as which metrics to include, which modeling techniques perform best, and how context affects fault prediction.

To answer these questions, we conducted a systematic review of the published studies of fault prediction in code from the beginning of 2000 to the end of 2010.1 On the basis of this review and subsequent analysis of 206 models, we present key features of successful models here.

Context Is Key
We found 208 studies published on predicting code faults over the 11-year period covered in our review. These studies contain many hundreds of individual models, whose construction varies considerably.

Although it would be nice if companies could simply select one of these published models and use it in their own environment, the evidence suggests that fault-prediction models perform less well when transferred to different contexts.2 This means that practitioners must understand how existing models might or might not relate to their own model development. In particular, they must understand the specific context in which an existing model was developed, so they can base their own models on those that are contextually compatible.

However, we found three challenges to this apparently simple requirement. First, many studies presented insufficient information about the development context. Second, most studies reported models built using open source data, which can limit their compatibility with commercial systems. Third, it remains unclear which context variables (application domain, programming language, size, system maturity, and so on) are tied to a fault-prediction model's performance.

Establishing Confidence in Existing Models
Practitioners need a basic level of confidence in an existing model's performance.
Such confidence is based on understanding how well the model has been constructed, its development context, and its performance relative to other models.

Our review suggests that few of the published models provide sufficient information to adequately support this understanding. We developed a set of criteria based on research (for example, see Kai Petersen and Claes Wohlin3) to assess whether a fault-prediction study reports the basic information to support confidence in a model. Figure 1 shows a checklist based on these criteria (details are available elsewhere1).

FIGURE 1. Checklist of criteria for establishing confidence in a model. Only 36 of 208 studies from the systematic literature review met all criteria.

Phase One: Prediction Criteria
• Is a prediction model reported?
• Is the prediction model tested on unseen data?

Phase Two: Context Criteria
• Is the source of data reported?
• Is the maturity of data reported?
• Is the size of data reported?
• Is the application domain of data reported?
• Is the programming language of data reported?

Phase Three: Model Criteria
• Are the independent variables clearly reported?
• Is the dependent variable clearly reported?
• Is the granularity of the dependent variable reported?
• Is the modeling technique used reported?

Phase Four: Data Criteria
• Is the fault data acquisition process described?
• Is the independent variables data acquisition process described?
• Is the faulty/nonfaulty balance of data reported?

When we applied these criteria to the 208 studies we reviewed, only 36 passed all criteria. Each of these 36 studies presents clear models and provides the information necessary to understand the model's relevance to a particular context.

We quantitatively analyzed the performance of the models presented in 19 of the 36 studies. These 19 studies all report categorical predictions, such as whether a code unit was likely to be faulty or not (see the sidebar). Such predictions use performance measures that usually stem from a confusion matrix (see Figure 2). This matrix supports a performance comparison across categorical studies.

FIGURE 2. Confusion matrix. The two columns and two rows capture the ways in which a prediction can be correct or incorrect.

Number of code units     Predicted faulty      Predicted not faulty
Actually faulty          True positive (TP)    False negative (FN)
Actually not faulty      False positive (FP)   True negative (TN)

We omitted 13 studies from further analysis because they reported continuous predictions, such as how many faults are likely to occur in each code unit. Such studies employ a wide variety of performance measures that are difficult to compare. Likewise, we omitted four categorical studies that reported performance data based on the area under a curve.

The 19 studies reporting categorical predictions contained 206 individual models. For each model, we extracted performance data for

• precision = TP/(TP + FP): proportion of units predicted as faulty that were faulty;
• recall = TP/(TP + FN): proportion of faulty units correctly classified; and
• f-measure = (2 × recall × precision)/(recall + precision): the harmonic mean of precision and recall.

For studies that didn't report these measures, we recomputed them from the available confusion-matrix-based data. This let us compare the performance of all 206 models across the 19 studies and draw quantitative conclusions.
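These three measures follow mechanically from the confusion-matrix counts. As a minimal illustration in Python (the counts are invented for the example, not taken from any reviewed study):

```python
def classification_measures(tp, fp, fn):
    """Precision, recall, and f-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)  # proportion of units predicted faulty that were faulty
    recall = tp / (tp + fn)     # proportion of faulty units correctly classified
    f_measure = (2 * recall * precision) / (recall + precision)  # harmonic mean
    return precision, recall, f_measure

# Invented counts for 100 code units: 20 true positives, 10 false positives,
# 10 false negatives (true negatives don't enter these three measures).
p, r, f = classification_measures(tp=20, fp=10, fn=10)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
# → precision=0.67 recall=0.67 f-measure=0.67
```

This kind of recomputation is what allows studies that report only raw confusion-matrix counts to be compared against those that report the derived measures directly.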
VOICE OF EVIDENCE
Our systematic literature review suggests first that successful fault-prediction models are built or optimized to specific contexts. The 36 mature studies we identified can support this task with clearly defined models that include development contexts and methodologies. Our quantitative analysis of the 19 categorical studies from this set further suggests that successful models are based on both simple modeling techniques and a wide combination of metrics. Practitioners can use these results in developing their own fault-prediction models.

Acknowledgments
We thank the UK's Engineering and Physical Sciences Research Council, which supported this research at Brunel University under grant EPSRC EP/E063039/1, and the Science Foundation Ireland, which partially supported this work at Lero under grant 3/CE2/I303_1. We are also grateful to Sue Black and Paul Wernick for providing input to the early stages of the work reported.

There are three essential elements to a fault-prediction model.

PREDICTOR VARIABLES
These independent variables are usually metrics based on software artifacts, such as static code or change data. Models have used a variety of metrics—from simple LOC metrics to complex combinations of static-code features, previous fault data, and information about developers.

OUTPUT VARIABLES
A model's output, or dependent variable, is a prediction of fault proneness in terms of faulty versus nonfaulty code units. This output typically takes the form of either categorical or continuous output variables. Categorical outputs predict code units as either faulty or nonfaulty. Continuous outputs usually predict the number of faults in a code unit. Predictions can address varying units of code—from high-granularity units, such as plug-in level, to low-granularity units, such as method level.

MODELING TECHNIQUES
Model developers can use one or more techniques to explore the relationship between the predictor (or independent) variables and the output (or dependent) variables. Among the many available techniques are statistically based regression techniques and machine-learning techniques such as support vector machines. Ian Witten and Eibe Frank provide an excellent guide to using machine-learning techniques.1

Reference
1. I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
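To make the three elements concrete, here is a minimal sketch of a categorical fault-prediction model: two predictor metrics per code unit (LOC and change count), a faulty/nonfaulty output, and Gaussian naïve Bayes as the modeling technique (one of the simple techniques the review found effective). All metric values and labels below are invented for illustration; none come from a reviewed study.

```python
import math

# Toy training data: one row per code unit, metrics are (LOC, number of changes).
# Both the metrics and the labels are invented for illustration; a real model
# would be trained on fault data mined from the target context.
faulty = [(450.0, 12.0), (300.0, 9.0), (500.0, 15.0)]
nonfaulty = [(80.0, 2.0), (120.0, 1.0), (60.0, 3.0), (150.0, 2.0)]

def gaussian_stats(rows):
    """Per-metric mean and sample variance for one class."""
    n = len(rows)
    stats = []
    for j in range(len(rows[0])):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        stats.append((mean, var))
    return stats

def log_likelihood(unit, stats, prior):
    """Log of prior times a product of Gaussian densities, one per metric."""
    total = math.log(prior)
    for x, (mean, var) in zip(unit, stats):
        total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
    return total

def predict_faulty(unit):
    """Categorical output: True if the unit is predicted faulty."""
    n = len(faulty) + len(nonfaulty)
    score_f = log_likelihood(unit, gaussian_stats(faulty), len(faulty) / n)
    score_n = log_likelihood(unit, gaussian_stats(nonfaulty), len(nonfaulty) / n)
    return score_f > score_n

print(predict_faulty((400.0, 10.0)))  # a large, frequently changed unit → True
print(predict_faulty((90.0, 1.0)))    # a small, stable unit → False
```

The same skeleton accommodates richer predictor sets (change data, developer data) by widening the metric tuples, which is exactly where context-specific tuning enters.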
What Works in Existing Models
When compared with each other, most of the 206 models peaked at about 70 percent recall—that is, they correctly predict about 70 percent of actual faults. Some models performed considerably higher (for example, see Shivkumar Shivaji and his colleagues4), while others performed considerably lower (see Erik Arisholm and his colleagues5).

Models based on techniques such as naïve Bayes and logistic regression seemed to perform best. Such techniques are comparatively easy to understand and simple to use.

Additionally, the models that performed relatively well tended to combine a wide range of metrics, typically including metrics based on the source code, change data, and data about developers (see Christian Bird and his colleagues6). Often, the models performing best had optimized this set of metrics (for example, by using Principal Component Analysis or Feature Selection as in Shivaji4). Models using source-code text directly as a predictor yielded promising results (see Osamu Mizuno and Tohru Kikuno7). Models using static-code or change-based metrics alone performed least well.

References
1. T. Hall et al., "A Systematic Review of Fault Prediction Performance in Software Engineering," accepted for publication in IEEE Trans. Software Eng.; preprint available at http://bura.brunel.ac.uk/handle/2438/5743.
2. N. Nagappan, T. Ball, and A. Zeller, "Mining Metrics to Predict Component Failures," Proc. 28th Int'l Conf. Software Eng. (ICSE 06), ACM Press, 2006, pp. 452–461.
3. K. Petersen and C. Wohlin, "Context in Industrial Software Eng. Research," Proc. 3rd Int'l Symp. Empirical Software Eng. and Measurement (ESEM 09), IEEE CS Press, 2009, pp. 401–404.
4. S. Shivaji et al., "Reducing Features to Improve Bug Prediction," Proc. 24th IEEE/ACM Int'l Conf. Automated Software Eng. (ASE 09), IEEE CS Press, 2009, pp. 600–604.
5. E. Arisholm, L.C. Briand, and E.B. Johannessen, "A Systematic and Comprehensive Investigation of Methods to Build and Evaluate Fault Prediction Models," J. Systems and Software, vol. 83, no. 1, 2010, pp. 2–17.
6. C. Bird et al., "Putting It All Together: Using Socio-Technical Networks to Predict
DAVID GRAY is a PhD student at the University of Hertfordshire, UK. Contact him at d.gray@herts.ac.uk.

STEVE COUNSELL is a reader in software engineering at Brunel University, UK. Contact him at steve.counsell@brunel.ac.uk.