

Problems and Prospects in Quantifying Software Maintainability
JARRETT ROSENBERG jarrett.rosenberg@sun.com; jarrett.rosenberg@acm.org
Sun Microsystems, 2550 Garcia Avenue, Mountain View, CA 94043

In one form or another, quantifying the maintainability of software has been attempted
for decades, in pursuit of two goals:
• predicting the likelihood of corrective maintenance,
• predicting the difficulty of maintenance activities, corrective or otherwise.
Despite the extensive activity in this area, we are not much further along than when we
started. This paper looks at some of the reasons why, and the factors necessary for future
progress.

Goal #1: Predicting Likelihood of Corrective Maintenance

Predicting the likelihood of defects in source code has long been a major focus of software
metrics research (Melton, 1996). Two factors must be considered
in any such prediction: characteristics of the source code itself (including the design it
embodies), and characteristics of the testing process used on the code up to the point when
the prediction is made. Let us examine each of these in turn.

Source Code Characteristics

Some thirty years of research has yielded two basic classes of metrics: size metrics (most
famously, lines of code and the Halstead metrics) and complexity metrics (most famously,
McCabe’s cyclomatic complexity metric). Size metrics do predict likelihood of defects, but
in precisely the same way as any exposure variable does: the more source code, the more
defects. If an application’s source code were to be evenly divided into equal-sized files,
size would no longer accurately predict the likelihood of defects per file. For this reason, size
metrics are useful only as covariates in evaluating the effect of other, non-size metrics; that
is, predictions must be adjusted for size to be valid.
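To make this concrete, one standard way of adjusting for size is to treat it as an exposure
variable, for instance in a Poisson regression with log(size) as an offset, so that any effect
attributed to another metric refers to defect density rather than raw defect counts. A minimal
sketch (in Python with statsmodels; the per-file data are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-file data: defect counts, lines of code, cyclomatic complexity.
defects = np.array([2, 5, 1, 9, 3, 0, 7, 4])
loc     = np.array([200, 800, 150, 900, 400, 100, 700, 500])
cc      = np.array([5, 12, 4, 30, 8, 3, 22, 10])

# Treating size as exposure: a Poisson model with log(LOC) as an offset
# estimates the effect of complexity on defect *density*, not raw counts.
X = sm.add_constant(cc)
model = sm.GLM(defects, X, family=sm.families.Poisson(), offset=np.log(loc))
print(model.fit().summary())
```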

Although complexity metrics are somewhat correlated with size metrics (since it is difficult
for a module to be very complex without also being fairly large), factor analyses typically
show them to be distinct from size. Of the many different complexity metrics (cf. Zuse,
1991), cyclomatic complexity is the most frequently cited example associated with the
presence of defects (Shepperd and Ince, 1993; Rosenberg, 1996a). The association is
nevertheless weak and sometimes missing altogether.
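For reference, cyclomatic complexity is a purely syntactic quantity computed from the
control-flow graph, V(G) = E − N + 2P. A minimal illustration (Python, with a toy graph
for a function containing a single if/else):

```python
def cyclomatic_complexity(edges, nodes, connected_components=1):
    """McCabe's V(G) = E - N + 2P for a control-flow graph."""
    return len(edges) - len(nodes) + 2 * connected_components

# Toy control-flow graph of a function containing a single if/else.
nodes = {"entry", "cond", "then", "else", "exit"}
edges = {("entry", "cond"), ("cond", "then"), ("cond", "else"),
         ("then", "exit"), ("else", "exit")}
print(cyclomatic_complexity(edges, nodes))  # 2: one decision point
```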
There are a variety of reasons for the failure to find a compelling connection between
source code characteristics and defects: for example, the various studies are done quite
differently, on different samples of source code (often in different languages). Misun-
derstandings about measurement are also rampant (Fenton, 1991; Briand et al., 1996). Yet the
fundamental reason is that researchers have failed to appreciate just how important the
distinction between syntactic and semantic defects is. Syntactic defects, from typographic
errors to abuse of language features like goto, are easy to detect and measure. As a result,
such syntactic defects are increasingly caught by development tools, from compilers to
memory-leak checkers.1 This reduction in syntactic defects both reduces the effectiveness
of syntactic predictions and heightens the role of semantically based defects, yet semanti-
cally based metrics are few and rarely studied. This is a pity, because the semantic errors are
the “deep” ones, due either to a conceptual oversight or a fundamental misunderstanding
of the problem or its solution. Even when they are not the most numerous, they are the
hardest to find and fix (see the review of studies in Roper, 1994).
If there are to be predictive source code characteristics, then, it is likely they will be
semantically based metrics, rather than the syntactic ones currently in use. This does not
mean that source code analysis is pointless however, merely that it will have to be much
more sophisticated. For example, metrics based on the results of program slicing (Weiser,
1984; Gallagher and Lyle 1991) may give a better indication of complexity than any current
syntactic metric.
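To indicate what such a measure might look like, the sketch below computes backward
slices by reverse reachability over a hypothetical statement-level dependence graph and
reports the average slice size relative to module size, in the spirit of slice-based measures;
it is a toy rendering of the idea, not a proposal.

```python
from collections import defaultdict

def backward_slice(pdg, criterion):
    """Statements that may affect `criterion`, via reverse reachability
    over data/control dependence edges (a much-simplified program slice)."""
    reverse = defaultdict(set)
    for src, dst in pdg:
        reverse[dst].add(src)
    seen, stack = {criterion}, [criterion]
    while stack:
        for pred in reverse[stack.pop()]:
            if pred not in seen:
                seen.add(pred)
                stack.append(pred)
    return seen

# Hypothetical dependence edges (statement -> statement it influences).
pdg = [(1, 3), (2, 3), (3, 5), (4, 5)]
module_statements = {1, 2, 3, 4, 5}
slices = [backward_slice(pdg, s) for s in module_statements]
# Coverage-style ratio: mean slice size relative to module size;
# values near 1 suggest highly interdependent, harder-to-maintain code.
coverage = sum(len(s) for s in slices) / (len(slices) * len(module_statements))
print(round(coverage, 2))
```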

Test Process Characteristics

While the software metrics community has been trying to predict the occurrence of defects
based on syntactic features of the code itself, the testing community has been independently
trying to assess the presence or absence of defects based on characteristics of the testing
process the source code has undergone. Here again a basic distinction has been made, but
not fully appreciated: the distinction between structural and functional testing. Both are
necessary, but they have very different roles: since, by definition, only a sample of possible
inputs can be used to test software, stochastically based functional testing is the only way
to rigorously and objectively assess the reliability of the software. Since infrequently used
and/or safety critical parts of the code cannot be adequately tested by functional means,
structural testing must be used there instead.
Unfortunately, just as the ease of syntactic measurements has distracted metrics re-
searchers, the ease of structural testing compared to stochastic testing has distracted testing
researchers into unhelpful comparisons of different coverage methods and partitioning
strategies (Hamlet and Taylor, 1990). Even though there is no evidence that degree of
test coverage is a cause rather than a consequence of a thorough test program, increasing

test coverage is usually taken as the goal of test process improvement. This misdirection
of effort leaves the critical issues of functional testing, such as the construction of valid
operational profiles and the automation of the test oracle, unaddressed.
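For concreteness, stochastic functional testing amounts in outline to the following sketch
(Python; the operational profile, usage classes, and oracle are all hypothetical): test cases
are drawn according to the profile, and reliability is estimated from the observed success
rate.

```python
import random

def estimate_reliability(operational_profile, run_test, n_samples=10_000):
    """Draw test cases according to an operational profile (usage class ->
    probability) and estimate reliability as the observed success rate."""
    classes, weights = zip(*operational_profile.items())
    failures = 0
    for _ in range(n_samples):
        usage_class = random.choices(classes, weights=weights)[0]
        if not run_test(usage_class):  # oracle: did the software behave correctly?
            failures += 1
    return 1 - failures / n_samples

# Hypothetical profile and oracle, purely for illustration.
profile = {"query": 0.7, "update": 0.25, "admin": 0.05}
print(estimate_reliability(profile, run_test=lambda c: random.random() > 0.01))
```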
The basic principle still stands, however, that the type and amount of testing carried
out on the source code must also be factored into the equation predicting likelihood of
defects: complex code that has been heavily tested may actually be less likely to have
defects than simple code that has not been extensively tested. Exactly what the relationship
is remains unknown, since to my knowledge no one has done the experiment.

Goal #2: Predicting the Difficulty of Maintenance

Beyond the likelihood that corrective maintenance may need to be performed, we would
like to be able to tell from the source code how difficult it will be to make a change, for
whatever reason. The difficulty is measured both in the amount of time needed to make
the change, and in the probability of successfully making the change on the first attempt.
Again we are faced with essentially semantic problems rather than syntactic ones.

Code Intelligibility

The intelligibility or comprehensibility of code is obviously a major factor in successfully
making changes to it. While various tools have been developed to visualize code and thus
aid its comprehension (e.g., Eick 1992), the real issues are cognitive, not technical. Psy-
chologists have become increasingly interested in programming as a domain to explore
theories of learning and cognition (as in the series of symposia on empirical studies of pro-
grammers, e.g., Cook et al., 1993) but they unfortunately are rarely sufficiently acquainted
with industrial software development and developers to make a major impact. Conversely,
computer scientists grappling with the issue of program comprehension (as in the IEEE
symposia on program comprehension) seem unaware of the existence of cognitive psychol-
ogy. This lack of awareness of both halves of the puzzle has led each side to focus on what
it knows best, leaving unaddressed some of the most basic issues, such as how the factors
of programmer skill, domain knowledge, language features, and development environment
interact to allow the creation and comprehension of intelligible code.

Code Modularity

The more modular the code, the less likely a change to it will create a ripple effect involving
changes, especially unforeseen changes, elsewhere in it. The challenge, then, is to create
valid measures of modularity which can guide the process of “software change impact anal-
ysis” (Bohner and Arnold, 1996). Such measures are partly syntactic and partly semantic,
and will require some sophisticated tools to be effective. Program slicing techniques may
also be valuable here, as metrics based on them could indicate how far-reaching the effects
of one module are. The advent of object-oriented programming languages creates new

opportunities, and new challenges, for defining and assessing modularity. While object-
oriented metrics (Lorenz and Kidd, 1994) and object-oriented design principles (Lakos,
1996) have been proposed, there is little empirical work based on them.
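As one deliberately simplistic illustration, a module's fan-out (the number of other modules
it references) can serve as a rough, purely syntactic coupling indicator; the module graph
below is hypothetical, and a serious measure would, as argued above, also need semantic
information.

```python
def coupling(modules):
    """Per-module fan-out: how many other modules each module references.
    Lower values suggest more self-contained code, so a change to the
    module is less likely to ripple outward."""
    return {name: len(deps - {name}) for name, deps in modules.items()}

# Hypothetical mapping: module -> set of modules it references.
system = {
    "parser":  {"lexer", "ast"},
    "lexer":   set(),
    "ast":     set(),
    "codegen": {"ast", "parser", "lexer"},
}
print(coupling(system))  # {'parser': 2, 'lexer': 0, 'ast': 0, 'codegen': 3}
```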

Code Decay

Just as the testing history of source code must be taken into account in predicting its like-
lihood of defects, the amount of modification the code has undergone must be taken into
account in assessing its ease of modifiability. The phenomenon of “code decay” discussed
by other participants in this workshop, whereby continually modified code becomes pro-
gressively harder to comprehend and maintain, is a kind of limiting factor in the natural
history of programs. Understanding and predicting this phenomenon is one of the central
issues of maintainability.

Future Prospects

Given that there has been so little progress to date, were past trends to continue there would
be little hope for future progress, but I am optimistic because much of the lack of progress
has been due to a lack of appreciation of how other disciplines are needed to solve these
problems. The unique aspects of software seem to have convinced software engineers that
they have little to learn from other disciplines, with a consequent tendency to sporadically
and poorly re-invent methods that have been known for decades in other domains. Slowly
an awareness is growing of the need for participation by psychologists, statisticians, and
traditional industrial engineers in solving software engineering problems (see, for example,
NRC 1996 and Rosenberg 1996b). There are two factors that I believe are critical to success
in quantifying maintainability:

• A genuinely scientific approach

There needs to be a serious emphasis on formulating hypotheses and conducting prop-
erly designed empirical research to test those hypotheses. A large body of publicly
available source code and related data (e.g., defect data, configuration data) is needed
to support the normal scientific practices of re-analysis and replication; far too much of
current work is based on either toy systems or proprietary data. It will be hard to build
a science under those conditions. Moreover, many of the basic phenomena under study
need to be more fully defined: for example, while it is generally accepted that programmer
characteristics such as expertise are critical factors in software quality and productivity,
there are as yet no generally accepted definitions or measuring instruments for them.
Without such standard measures, there is no way to empirically demonstrate those ef-
fects, let alone compare experiments and accumulate a body of knowledge about them.
Finally, until computer scientists are trained in research methodology and data analysis,
there needs to be an awareness that experts in these areas should be consulted in doing
empirical research.

• Active interdisciplinary collaboration with other fields

Few of software engineering’s critical problems can be solved without the active in-
volvement of other disciplines, whether they be methodologists such as statisticians,
or those with critical theories and results, such as cognitive psychologists or industrial
engineers. Such collaborations have already started to occur, and they need to be ac-
tively supported; interdisciplinary work is not easy, but progress without it is simply
not possible.

Notes

1. Indeed, Hatton (1995) makes the claim that even a defect-causing language like C can be made safe to use if
tools are used to catch the stereotypic errors the language induces.

References

Bohner, S., and Arnold, R. eds. 1996. Software Change Impact Analysis. Los Alamitos, CA: IEEE Computer
Society Press.
Briand, L., El Emam, K., and Morasca, S. 1996. On the application of measurement theory in software engineering.
Empirical Software Engineering 1(1): 61–81.
Cook, C., Scholtz, J., and Spohrer, J. eds. 1993. Empirical Studies of Programmers: Fifth Workshop. Norwood,
NJ: Ablex.
Gallagher, K., and Lyle, J. 1991. Using program slicing in software maintenance. IEEE Trans. Software Eng.
17(8): 751–761.
Fenton, N. 1991. Software Metrics: A Rigorous Approach. London: Chapman and Hall.
Hamlet, R., and Taylor, R. 1990. Partition testing does not inspire confidence. IEEE Trans. Software Eng. 16(12):
1402–1411.
Hatton, L. 1995. Safer C. NY: McGraw-Hill.
Lakos, J. 1996. Large Scale C++ Software Design. Reading, MA: Addison-Wesley.
Lorenz, M., and Kidd, J. 1994. Object-Oriented Software Metrics. Englewood Cliffs, NJ: Prentice-Hall.
Melton, A. ed. 1996. Software Measurement. London: International Thomson Computer Press.
National Research Council (NRC) 1996. Statistical Software Engineering. Committee on Applied and Theoretical
Statistics, Board on Mathematical Sciences. Washington, D.C.: National Academy Press.
Roper, M. 1994. Software Testing. NY: McGraw-Hill.
Rosenberg, J. 1996a. Linking internal and external quality measures. Proceedings of the Ninth International
Software Quality Week. San Francisco, California, May 1996.
Rosenberg, J. 1996b. Software Testing As Acceptance Sampling. Proceedings of the Fourteenth Pacific Northwest
Software Quality Conference. Portland, Oregon. Oct. 1996.
Shepperd, M., and Ince, D. 1993. Derivation and Validation of Software Metrics. Oxford: Clarendon Press.
Weiser, M. 1984. Program slicing. IEEE Trans. Software Eng. 10(6): 352–357.
Zuse, H. 1991. Software Complexity Methods and Measures. NY: De Gruyter.
