In one form or another, quantifying the maintainability of software has been attempted
for decades, in pursuit of two goals:
• predicting the likelihood of corrective maintenance,
• predicting the difficulty of maintenance activities, corrective or otherwise.
Despite the extensive activity in this area, we are not much further along than when we
started. This paper looks at some of the reasons why, and the factors necessary for future
progress.
Predicting the likelihood of defects in source code has long been a major focus of software
metrics research (Melton, 1996). Two factors must be considered
in any such prediction: characteristics of the source code itself (including the design it
embodies), and characteristics of the testing process used on the code up to the point when
the prediction is made. Let us examine each of these in turn.
Some thirty years of research has yielded two basic classes of metrics: size metrics (most
famously, lines of code and the Halstead metrics) and complexity metrics (most famously,
McCabe’s cyclomatic complexity metric). Size metrics do predict likelihood of defects, but
in precisely the same way as any exposure variable does: the more source code, the more
defects. If an application’s source code were to be evenly divided into equal-sized files,
size would no longer accurately predict the likelihood of defects per file. For this reason, size
metrics are useful only as covariates when evaluating the effect of other, non-size metrics;
that is to say, predictions must be adjusted for size to be valid.
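The exposure-variable role of size can be illustrated with a toy calculation (the module names and counts below are invented for illustration):

```python
# Hypothetical module data: name -> (lines of code, observed defects).
modules = {
    "parser.c": (4_000, 12),
    "lexer.c":  (1_000, 3),
    "report.c": (500,   2),
}

# Raw defect counts track size almost mechanically: more code, more defects.
by_count = sorted(modules, key=lambda m: modules[m][1], reverse=True)

# Adjusting for size (defects per KLOC) can reorder the ranking entirely,
# which is why size is treated as a covariate rather than a predictor.
by_density = sorted(modules,
                    key=lambda m: modules[m][1] / (modules[m][0] / 1000),
                    reverse=True)

print(by_count)    # the largest file ranks first
print(by_density)  # the size-adjusted view ranks the smallest file first
```

Here the largest file has the most defects, yet the smallest file has the highest defect density; any claim about a non-size metric must survive exactly this kind of adjustment.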
174 EMPIRICAL SOFTWARE ENGINEERING
Although complexity metrics are somewhat correlated with size metrics (since it is diffi-
cult for a module to be very complex without also being somewhat large), factor analyses
typically show them to be somewhat distinct from the latter. Of the many different complex-
ity metrics (cf. Zuse, 1991), cyclomatic complexity is the most frequently cited example
associated with the presence of defects (Shepperd and Ince, 1993; Rosenberg, 1996a). The
association is nevertheless weak and sometimes missing altogether.
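McCabe's metric is defined over the control-flow graph as V(G) = E − N + 2P, where E is the number of edges, N the number of nodes, and P the number of connected components. A minimal sketch, using an invented control-flow graph:

```python
# Cyclomatic complexity V(G) = E - N + 2P for a control-flow graph.
def cyclomatic(edges, nodes, components=1):
    return len(edges) - len(nodes) + 2 * components

# CFG of a function containing a single if/else:
# entry -> test -> (then | else) -> exit.
nodes = ["entry", "test", "then", "else", "exit"]
edges = [("entry", "test"), ("test", "then"), ("test", "else"),
         ("then", "exit"), ("else", "exit")]

print(cyclomatic(edges, nodes))  # 5 - 5 + 2 = 2 linearly independent paths
```

A straight-line function scores 1, and each decision point adds one, which is why the metric correlates with size: long functions tend to accumulate decisions.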
There are a variety of reasons for the failure to find a compelling connection between
source code characteristics and defects: for example, the various studies are done quite
differently, on different samples of source code (often in different languages). Misunderstandings
about measurement are also rampant (Fenton, 1991; Briand et al., 1996). Yet the
fundamental reason is that researchers have failed to appreciate just how important the
distinction between syntactic and semantic defects is. Syntactic defects, from typographic
errors to abuse of language features like goto, are easy to detect and measure. As a result,
such syntactic defects are increasingly caught by development tools, from compilers to
memory-leak checkers.1 This reduction in syntactic defects both reduces the effectiveness
of syntactic predictions and heightens the role of semantically based defects, yet semanti-
cally based metrics are few and rarely studied. This is a pity, because the semantic errors are
the “deep” ones, due either to a conceptual oversight or a fundamental misunderstanding
of the problem or its solution. Even when they are not the most numerous, they are the
hardest to find and fix (see the review of studies in Roper, 1994).
If there are to be predictive source code characteristics, then, it is likely they will be
semantically based metrics, rather than the syntactic ones currently in use. This does not
mean that source code analysis is pointless, however, merely that it will have to be much
more sophisticated. For example, metrics based on the results of program slicing (Weiser,
1984; Gallagher and Lyle 1991) may give a better indication of complexity than any current
syntactic metric.
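As an illustration of what a slice-based metric might look like, consider a backward slice computed by transitive closure over a dependence graph (the graph below is invented, not the output of any real slicer):

```python
# Hypothetical data-dependence edges: statement -> statements it depends on.
deps = {
    5: [3, 4],   # line 5 uses values defined at lines 3 and 4
    4: [2],
    3: [1],
    2: [],
    1: [],
}

def backward_slice(stmt, deps):
    """All statements that can affect `stmt`: transitive closure over deps."""
    seen, stack = set(), [stmt]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(deps.get(s, []))
    return seen

# One possible slice-based metric: the fraction of the program that a
# statement's slice covers. Large slices suggest tangled, hard-to-change code.
slice_of_5 = backward_slice(5, deps)
coverage = len(slice_of_5) / len(deps)
```

Here every statement feeds into line 5, so its slice covers the whole program; a program whose slices are small and disjoint would, on this view, be semantically simpler.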
While the software metrics community has been trying to predict the occurrence of defects
based on syntactic features of the code itself, the testing community has been independently
trying to assess the presence or absence of defects based on characteristics of the testing
process the source code has undergone. Here again a basic distinction has been made, but
not fully appreciated: the distinction between structural and functional testing. Both are
necessary, but they have very different roles: since, by definition, only a sample of possible
inputs can be used to test software, stochastically based functional testing is the only way
to rigorously and objectively assess the reliability of the software. Since infrequently used
and/or safety critical parts of the code cannot be adequately tested by functional means,
structural testing must be used there instead.
Unfortunately, just as the ease of syntactic measurements has distracted metrics re-
searchers, the ease of structural testing compared to stochastic testing has distracted testing
researchers into unhelpful comparisons of different coverage methods and partitioning
strategies (Hamlet and Taylor, 1990). Even though there is no evidence that degree of
test coverage is a cause rather than a consequence of a thorough test program, increasing
test coverage is usually taken as the goal of test process improvement. This misdirection
of effort leaves the critical issues of functional testing, such as the construction of valid
operational profiles and the automation of the test oracle, unaddressed.
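Stochastically based functional testing driven by an operational profile can be sketched as follows (the operations and usage figures are invented):

```python
import random

# Hypothetical operational profile: estimated share of field usage per operation.
profile = {"open_file": 0.60, "save_file": 0.30, "export_pdf": 0.10}

def draw_test_cases(profile, n, seed=0):
    """Sample operations in proportion to expected field usage, so that the
    observed failure rate estimates operational reliability."""
    rng = random.Random(seed)
    ops, weights = zip(*profile.items())
    return [rng.choices(ops, weights=weights)[0] for _ in range(n)]

cases = draw_test_cases(profile, 1000)
# Sampled frequencies approximate the profile; rarely used (or safety-critical)
# operations receive few cases, which is why structural testing must cover them.
print(cases.count("export_pdf") / len(cases))
```

The sketch also makes the open problems concrete: the validity of the whole exercise rests on the profile's accuracy, and each sampled case still needs an oracle to judge its outcome.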
The basic principle still stands, however, that the type and amount of testing carried
out on the source code must also be factored into the equation predicting likelihood of
defects: complex code that has been heavily tested may actually be less likely to have
defects than simple code that has not been extensively tested. Exactly what the relationship
is remains unknown, since to my knowledge no one has done the experiment.
Beyond the likelihood that corrective maintenance may need to be performed, we would
like to be able to tell from the source code how difficult it will be to make a change, for
whatever reason. The difficulty is measured both in the amount of time needed to make
the change, and in the probability of successfully making the change on the first attempt.
Again we are faced with essentially semantic problems rather than syntactic ones.
Code Intelligibility
Code Modularity
The more modular the code, the less likely a change to it will create a ripple effect involving
changes, especially unforeseen changes, elsewhere in it. The challenge, then, is to create
valid measures of modularity which can guide the process of “software change impact anal-
ysis” (Bohner and Arnold, 1996). Such measures are partly syntactic and partly semantic,
and will require some sophisticated tools to be effective. Program slicing techniques may
also be valuable here, as metrics based on them could indicate how far-reaching the effects
of one module are. The advent of object-oriented programming languages creates new
opportunities, and new challenges, in defining and assessing modularity. While object-oriented
metrics (Lorenz and Kidd, 1994) and object-oriented design principles (Lakos,
1996) have been proposed, there is little empirical work based on them.
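The ripple effect described above can be given a simple, purely illustrative measure: the size of the transitive impact set in a module dependency graph (the graph here is hypothetical):

```python
# Hypothetical "depends on" relation, inverted: module -> modules that use it.
dependents = {
    "core":   ["io", "ui"],
    "io":     ["ui", "report"],
    "ui":     [],
    "report": [],
}

def impact_set(changed, dependents):
    """Modules that may need re-examination when `changed` is modified:
    the transitive closure of the dependents relation."""
    seen, stack = set(), [changed]
    while stack:
        m = stack.pop()
        for d in dependents.get(m, []):
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

# The smaller a module's impact set relative to the system, the more modular
# the code: changing "ui" touches nothing, changing "core" touches everything.
print(impact_set("core", dependents))
print(impact_set("ui", dependents))
```

A real impact analysis would of course need semantic information, not just import edges, but even this crude closure shows why modularity bounds the cost of change.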
Code Decay
Just as the testing history of source code must be taken into account in predicting its like-
lihood of defects, the amount of modification the code has undergone must be taken into
account in assessing its ease of modifiability. The phenomenon of “code decay” discussed
by other participants in this workshop, whereby continually modified code becomes pro-
gressively harder to comprehend and maintain, is a kind of limiting factor in the natural
history of programs. Understanding and predicting this phenomenon is one of the central
issues of maintainability.
Future Prospects
Given that there has been so little progress to date, were past trends to continue there would
be little hope for future progress, but I am optimistic because much of the lack of progress
has been due to a lack of appreciation of how other disciplines are needed to solve these
problems. The unique aspects of software seem to have convinced software engineers that
they have little to learn from other disciplines, with a consequent tendency to sporadically
and poorly re-invent methods that have been known for decades in other domains. Slowly
an awareness is growing of the need for participation by psychologists, statisticians, and
traditional industrial engineers in solving software engineering problems (see, for example,
NRC 1996 and Rosenberg 1996b). There are two factors that I believe are critical to success
in quantifying maintainability:
Notes
1. Indeed, Hatton (1995) makes the claim that even a defect-causing language like C can be made safe to use if
tools are used to catch the stereotypic errors the language induces.
References
Bohner, S., and Arnold, R. eds. 1996. Software Change Impact Analysis. Los Alamitos, CA: IEEE Computer
Society Press.
Briand, L., El Emam, K., and Morasca, S. 1996. On the application of measurement theory in software engineering.
Empirical Software Engineering 1(1): 61–81.
Cook, C., Scholtz, J., and Spohrer, J. eds. 1993. Empirical Studies of Programmers: Fifth Workshop. Norwood,
NJ: Ablex.
Fenton, N. 1991. Software Metrics: A Rigorous Approach. London: Chapman and Hall.
Gallagher, K., and Lyle, J. 1991. Using program slicing in software maintenance. IEEE Trans. Software Eng.
17(8): 751–761.
Hamlet, R., and Taylor, R. 1990. Partition testing does not inspire confidence. IEEE Trans. Software Eng. 16(12):
1402–1411.
Hatton, L. 1995. Safer C. NY: McGraw-Hill.
Lakos, J. 1996. Large Scale C++ Software Design. Reading, MA: Addison-Wesley.
Lorenz, M., and Kidd, J. 1994. Object-Oriented Software Metrics. Englewood Cliffs, NJ: Prentice-Hall.
Melton, A. ed. 1996. Software Measurement. London: International Thomson Computer Press.
National Research Council (NRC) 1996. Statistical Software Engineering. Committee on Applied and Theoretical
Statistics, Board on Mathematical Sciences. Washington, D.C.: National Academy Press.
Roper, M. 1994. Software Testing. NY: McGraw-Hill.
Rosenberg, J. 1996a. Linking internal and external quality measures. Proceedings of the Ninth International
Software Quality Week. San Francisco, California, May 1996.
Rosenberg, J. 1996b. Software testing as acceptance sampling. Proceedings of the Fourteenth Pacific Northwest
Software Quality Conference. Portland, Oregon. Oct. 1996.
Shepperd, M., and Ince, D. 1993. Derivation and Validation of Software Metrics. Oxford: Clarendon Press.
Weiser, M. 1984. Program slicing. IEEE Trans. Software Eng. 10(6): 352–357.
Zuse, H. 1991. Software Complexity Methods and Measures. NY: De Gruyter.