
Status Report on Software Measurement
SHARI LAWRENCE PFLEEGER, Systems/Software and Howard University
ROSS JEFFERY, University of New South Wales
BILL CURTIS, TeraQuest Metrics
BARBARA KITCHENHAM, Keele University

The most successful measurement programs are ones in which researcher, practitioner, and customer work hand in hand to meet goals and solve problems. But such collaboration is rare. The authors explore the gaps between these groups and point toward ways to bridge them.

In any scientific field, measurement generates quantitative descriptions of key processes and products, enabling us to understand behavior and results. This enhanced understanding lets us select better techniques and tools to control and improve our processes, products, and resources. Because engineering involves the analysis of measurements, software engineering cannot become a true engineering discipline unless we build a solid foundation of measurement-based theories.
One obstacle to building this base is the gap between measurement research and measurement practice. This status report describes the state of research, the state of the art, and the state of the practice of software measurement. It reflects discussion at the Second International Software Metrics Symposium, which we organized. The aim of the symposium is to encourage researchers and practitioners to share their views, problems, and needs, and to work together to define future activities that will address common goals. Discussion at the symposium revealed that participants had different and sometimes conflicting motivations.

IEEE SOFTWARE 0740-7459/97/$10.00 © 1997 IEEE 33



♦ Researchers, many of whom are in academic environments, are motivated by publication. In many cases, highly theoretical results are never tested empirically, new metrics are defined but never used, and new theories are promulgated but never exercised and modified to fit reality.
♦ Practitioners want short-term, useful results. Their projects are in trouble now, and they are not always willing to be a testbed for studies whose results won't be helpful until the next project. In addition, practitioners are not always willing to make their data available to researchers, for fear that the secrets of technical advantage will be revealed to their competitors.
♦ Customers, who are not always involved as development progresses, feel powerless. They are forced to specify what they need and then can only hope they get what they want.
It is no coincidence that the most successful examples of software measurement are the ones where researcher, practitioner, and customer work hand in hand to meet goals and solve problems. But such coordination and collaboration are rare, and there are many problems to resolve before reaching that desirable and productive state. To understand how to get there, we begin with a look at the right and wrong uses of measurement.

MEASUREMENT: USES AND ABUSES

Software measurement has existed since the first compiler counted the number of lines in a program listing. As early as 1974, in an ACM Computing Surveys article, Donald Knuth reported on using measurement data to demonstrate how Fortran compilers can be optimized, based on actual language use rather than theory. Indeed, measurement has become a natural part of many software engineering activities.
♦ Developers, especially those involved in large projects with long schedules, use measurements to help them understand their progress toward completion.
♦ Managers look for measurable milestones to give them a sense of project health and progress toward effort and schedule commitments.
♦ Customers, who often have little control over software production, look to measurement to help determine the quality and functionality of products.
♦ Maintainers use measurement to inform their decisions about reusability, reengineering, and legacy code replacement.

Proper usage. IEEE Software and other publications have many articles on how measurement can help improve our products, processes, and resources. For example, Ed Weller described how metrics helped to improve the inspection process at Honeywell;1 Wayne Lim discussed how measurement supports Hewlett-Packard's reuse program, helping project managers estimate module reuse and predict the savings in resources that result;2 and Michael Daskalantonakis reported on the use of measurement to improve processes at Motorola.3 In each case, measurement helped make visible what is going on in the code, the development processes, and the project team.
For many of us, measurement has become standard practice. We use structural-complexity metrics to target our testing efforts, defect counts to help us decide when to stop testing, or failure information and operational profiles to assess code reliability. But we must be sure that the measurement efforts are consonant with our project, process, and product goals; otherwise, we risk abusing the data and making bad decisions.

Real-world abuses. For a look at how dissonance in these goals can create problems, consider an example described by Michael Evangelist.4 Suppose you measure program size using lines of code or Halstead measures (measures based on the number of operators and operands in a program). In both cases, common wisdom suggests that module size be kept small, as short modules are easier to understand than large ones. Moreover, as size is usually the key factor in predicting effort, small modules should take less time to produce than large ones. However, this metrics-driven approach can lead to increased effort during testing or maintenance. For example, consider the following code segment:

   FOR i = 1 to n DO
      READ (x[i])

Clearly, this code is designed to read a list of n things. But Brian Kernighan and William Plauger, in their classic book The Elements of Programming Style, caution programmers to terminate input by an end-of-file or marker, rather than using a count. If a count ends the loop and the set being read has more or fewer than n elements, an error condition can result. A simple solution to this problem is to code the read loop like this:

   i = 1
   WHILE NOT EOF DO
      READ (x[i])
      i := i + 1
   END

This improved code is still easy to read but is not subject to the counting errors of the first code.
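The contrast between the two loops can also be seen in a modern language. The following Python sketch is our illustration, not the article's; the function names and the decision to raise an EOFError on short input are assumptions made for the example.

```python
import io

def read_counted(stream, n):
    # Mirrors "FOR i = 1 to n DO READ (x[i])": the count n, not the data, ends the loop.
    x = []
    for _ in range(n):
        line = stream.readline()
        if line == "":
            # Fewer than n items: the error condition the text warns about.
            raise EOFError("input ended before n items were read")
        x.append(line.strip())
    return x  # items beyond n are silently ignored

def read_until_eof(stream):
    # Mirrors "WHILE NOT EOF DO READ (x[i])": the end of the data ends the loop.
    return [line.strip() for line in stream]
```

With three lines of input, read_until_eof returns all three items regardless of any count, while read_counted with a count of 2 silently drops the third item and a count of 5 fails outright.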
On the other hand, if we judge the two pieces of code in terms of minimizing size, then

34 MARCH/APRIL 1997

the first code segment is better than the second. Had standards been set according to size metrics (as sometimes happens), the programmer could have been encouraged to keep the code smaller, and the resulting code would have been more difficult to test and maintain.
Another abuse can occur when you use process measures. Scales such as the US Software Engineering Institute's Capability Maturity Model can be used as an excuse not to implement an activity. For example, managers complain that they cannot institute a reuse program because they are only a level 1 on the maturity scale. But reuse is not prohibited at level 1; the CMM suggests that such practices are a greater risk if basic project disciplines (such as making sensible commitments and managing product baselines) have not been established. If productivity is a particular project goal, and if a rich code repository exists from previous projects, reuse may be appropriate and effective regardless of your organization's level.

Roots of abuse. In each case, it is not the metric but the measurement process that is the source of the abuse: The metrics are used without keeping the development goals in mind. In the code-length case, the metrics should be chosen to support goals of testability and maintainability. In the CMM case, the goal is to improve productivity by introducing reuse. Rather than prevent movement, the model should suggest which steps to take first.
Thus, measurement, like any technology, must be used with care. Any application of software measurement should not be made on its own. Rather, it should be an integral part of a general assessment or improvement program, where the measures support the goals and help to evaluate the results of the actions. To use measurement properly, we must understand the nature and goals of measurement itself.

MEASUREMENT THEORY

One way of distinguishing between real-world objects or entities is to describe their characteristics. Measurement is one such description. A measure is simply a mapping from the real, empirical world to a mathematical world, where we can more easily understand an entity's attributes and relationship to other entities. The difficulty is in how we interpret the mathematical behavior and judge what it means in the real world.
None of these notions is particular to software development. Indeed, measurement theory has been studied for many years, beginning long before computers were around. But the issues of measurement theory are very important in choosing and applying metrics to software development.

Scales. Measurement theory holds, as a basic principle, that there are several scales of measurement—nominal, ordinal, interval, and ratio—and each captures more information than its predecessor. A nominal scale puts items into categories, such as when we identify a programming language as Ada, Cobol, Fortran, or C++. An ordinal scale ranks items in an order, such as when we assign failures a progressive severity like minor, major, and catastrophic.
An interval scale defines a distance from one point to another, so that there are equal intervals between consecutive numbers. This property permits computations not available with the ordinal scale, such as calculating the mean. However, there is no absolute zero point in an interval scale, and thus ratios do not make sense. Care is thus needed when you make comparisons. The Celsius and Fahrenheit temperature scales, for example, are interval, so we cannot say that today's 30-degree Celsius temperature is twice as hot as yesterday's 15 degrees.
The scale with the most information and flexibility is the ratio scale, which incorporates an absolute zero, preserves ratios, and permits the most sophisticated analysis. Measures such as lines of code or numbers of defects are ratio measures. It is for this scale that we can say that A is twice the size of B.
The importance of measurement type to software measurement rests in the types of calculations you can do with each scale. For example, you cannot compute a meaningful mean and standard deviation for a nominal scale; such calculations require an interval or ratio scale. Thus, unless we are aware of the scale types we use, we are likely to misuse the data we collect. Researchers such as Norman Fenton and Horst Zuse have worked extensively in applying measurement theory to proposed software metrics. Among the ongoing questions is whether popular metrics such as function points are meaningful, in that they include unacceptable computations for their scale types. There are also questions about what entity function points measure.
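The scale hierarchy can be made operational as a small lookup table that an analysis tool might consult before computing a statistic. This Python sketch is ours, not from the article, and the statistic names are illustrative only:

```python
# Each scale admits the statistics of the weaker scales plus its own.
NOMINAL = {"mode", "frequency"}
ORDINAL = NOMINAL | {"median", "percentile"}
INTERVAL = ORDINAL | {"mean", "standard_deviation"}
RATIO = INTERVAL | {"geometric_mean", "ratio_comparison"}

SCALES = {"nominal": NOMINAL, "ordinal": ORDINAL,
          "interval": INTERVAL, "ratio": RATIO}

def is_meaningful(statistic, scale):
    """True if the statistic is admissible for data measured on the given scale."""
    return statistic in SCALES[scale]
```

Here is_meaningful("ratio_comparison", "interval") is False, which is exactly the Celsius example: 30 degrees is not twice as hot as 15.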
Validation. We validate measures so we can be sure that the metrics we use are actually measuring what they claim to measure. For example, Tom McCabe proposed that we use cyclomatic number, a property of a program's control-flow graph, as a measure of testing complexity. Many researchers are careful to state that cyclomatic number is a measure of structural complexity, but it does not capture all aspects of the difficulty we have in understanding a program. Other examples include Ross Jeffery's study of programs from a psychological perspective, which applies notions


about how much we can track and absorb as a way of measuring code complexity. Maurice Halstead claimed that his work, too, had psychological underpinnings, but the psychological basis for Halstead's "software science" measures has been soundly debunked by Neil Coulter.5 (Bill Curtis and his colleagues at General Electric found, however, that Halstead's count of operators and operands is a useful measure of program size.6)
We say that a measure is valid if it satisfies the representation condition: if it captures in the mathematical world the behavior we perceive in the empirical world. For example, we must show that if H is a measure of height, and if A is taller than B, then H(A) is larger than H(B). But such a proof must by its nature be empirical and it is often difficult to demonstrate. In these cases, we must consider whether we are measuring something with a direct measure (such as size) or an indirect measure (such as using the number of decision points as a measure of size) and what entity and attribute are being addressed.
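The representation condition can be checked directly against observed data. This Python sketch is our illustration; the entities and height values are hypothetical:

```python
def representation_holds(observed_order, measure):
    # For every observed pair (a, b) where a is perceived greater than b,
    # the mapped numbers must preserve the ordering: measure[a] > measure[b].
    return all(measure[a] > measure[b] for a, b in observed_order)

# Hypothetical data: A is observed to be taller than B.
taller_than = [("A", "B")]
height = {"A": 1.85, "B": 1.70}
```

Here representation_holds(taller_than, height) returns True, so H behaves as a measure of height on this data; as the text notes, such a check is empirical and covers only the cases actually observed.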
Several attempts have been made to list a set of rules for validation. Elaine Weyuker suggested rules for validating complexity,7 and Austin Melton and his colleagues have proffered a similar, general list for the behavior of all metrics.8 However, each of these frameworks has been criticized and there is not yet a standard, accepted way of validating a measure.
The notion of validity is not specific to software engineering, and general concepts that we rarely consider—such as construct validity and predictive validity—should be part of any discussion of software engineering measurement. For example, Kitchenham, Pfleeger, and Fenton have proposed a general framework for validating software engineering metrics based on measurement theory and statistical rules.9

Apples and oranges. Measurement theory and validation should not distract us from the considerable difficulty of measuring software in the field. A major difficulty is that we often try to relate measures of a physical object (the software) with human and organizational behaviors, which do not follow physical laws.
Consider, for example, the capability maturity level as a measure. The maturity level reflects an organization's software development practices and is purported to predict an organization's ability to produce high-quality software on time. But even if an organization at level 2 can be determined, through extensive experimentation, to produce better software (measured by fewer delivered defects) than a level 1 organization, it doesn't hold that all level 2 organizations develop software better than level 1 organizations. Some researchers welcome the use of capability maturity level as a predictor of the likelihood (but not a guarantee) that a level n organization will be better than a level n−1. But others insist that, for CMM level to be a measure in the measurement theory sense, level n must always be better than level n−1.
Still, a measure can be useful as a predictor without being valid in the sense of measurement theory. Moreover, we can gather valuable information by applying—even to heuristics—the standard techniques used in other scientific disciplines to assess association by analyzing distributions. But let's complicate this picture further. Suppose we compare a level 3 organization that is constantly developing different and challenging avionics systems with a level 2 organization that develops versions of a relatively simple Cobol business application. Obviously, we are comparing sliced apples with peeled oranges, and the domain, customer type, and many other factors moderate the relationships we observe.
This situation reveals problems not with the CMM as a measure, but with the model on which the CMM is based. We begin with simple models that provide useful information. Sometimes those models are sufficient for our needs, but other times we must extend the simple models in order to handle more complex situations. Again, this approach is no different from other sciences, where simple models (of molecular structure, for instance) are expanded as scientists learn more about the factors that affect the outcomes of the processes they study.

State of the gap. In general, measurement theory is getting a great deal of attention from researchers but is being ignored by practitioners and customers, who rely on empirical evidence of a metric's utility regardless of its scientific grounding.
Researchers should work closely with practitioners to understand the valid uses and interpretations of a software measure based on its measurement-theoretic attributes. They should also consider model validity separate from measurement validity, and develop more accurate models on which to base better measures. Finally, there is much work to be done to complete a framework for measurement validation, as well as to achieve consensus within the research community on the framework's accuracy and usefulness.

MEASUREMENT MODELS

A measurement makes sense only when it is associated with one or more


models. One essential model tells us the domain and range of the measure mapping; that is, it describes the entity and attribute being measured, the set of possible resulting measures, and the relationships among several measures (such as productivity is equal to size produced per unit of effort). Models also distinguish prediction from assessment; we must know whether we are using the measure to estimate future characteristics from previous ones (such as effort, schedule, or reliability estimation) or determining the current condition of a process, product, or resource (such as assessing defect density or testing effectiveness).
There are also models to guide us in deriving and applying measurement. A commonly used model of this type is the Goal-Question-Metric paradigm suggested by Vic Basili and David Weiss (and later expanded by Basili and Dieter Rombach).10 This approach uses templates to help prospective users derive measures from their goals and the questions they must answer during development. The template encourages the user to express goals in the following form:

   Analyze the [object] for the purpose of [purpose] with respect to [focus] from the viewpoint of [viewpoint] in the [environment].

For example, an XYZ Corporation manager concerned about overrunning the project schedule might express the goal of "meeting schedules" as

   Analyze the project for the purpose of control with respect to meeting schedules from the viewpoint of the project manager in the XYZ Corporation.

From each goal, the manager can derive questions whose answers will help determine whether the goal has been met. The questions derived suggest metrics that should be used to answer the questions. This top-down derivation assists managers and developers not only in knowing what data to collect but also in understanding the type of analysis needed when the data is in hand.
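The template lends itself to a simple data structure that ties each goal to its derived questions and metrics. This Python sketch is ours; the class and field names are assumptions for illustration, not part of the GQM literature:

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    # One field per slot in the GQM goal template.
    obj: str
    purpose: str
    focus: str
    viewpoint: str
    environment: str
    questions: list = field(default_factory=list)  # (question, [metrics]) pairs

    def statement(self):
        return (f"Analyze the {self.obj} for the purpose of {self.purpose} "
                f"with respect to {self.focus} from the viewpoint of "
                f"the {self.viewpoint} in the {self.environment}.")

schedule_goal = Goal("project", "control", "meeting schedules",
                     "project manager", "XYZ Corporation")
schedule_goal.questions.append(
    ("Are milestones being met?", ["planned vs. actual milestone dates"]))
```

Calling schedule_goal.statement() reproduces the XYZ Corporation goal statement above, and each appended question carries the metrics that would answer it.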
Some practitioners, such as Bill Hetzel, encourage a bottom-up approach to metrics application, where organizations measure what is available, regardless of goals.11 Other models include Ray Offen and Jeffery's M3P model, derived from business goals and described on page 45, and the combination of goal-question-metric and capability maturity built into the European Community's ami project framework.

Model research. Experimentation models of measurement are essential for case studies and experiments in software engineering research. For example, an organization may build software using two different techniques: one a formal method, another not. Researchers would then evaluate the resulting software to see if one method produced higher quality software than the other.
An experimentation model describes the hypothesis being tested, the factors that can affect the outcome, the degree of control over each factor, the relationships among the factors, and the plan for performing the research and evaluating the outcome. To address the lack of rigor in software experimentation, projects such as the UK's Desmet—reported upon extensively in ACM Software Engineering Notes beginning in October of 1994—have produced guidelines to help software engineers design surveys, case studies, and experiments.

Model future. As software engineers, we tend to neglect models. In other scientific disciplines, models act to unify and explain, placing apparently disjoint events in a larger, more understandable framework. The lack of models in software engineering is symptomatic of a much larger problem: a lack of systems focus. Few software engineers understand the need to define a system boundary or explain how one system interacts with another. Thus, research and practice have a very long way to go in exploring and exploiting what models can do to improve software products and processes.

MEASURING THE PROCESS

For many years, computer scientists and software engineers focused on measuring and understanding code. In recent years—as we have come to understand that product quality is evidence of process success—software process issues have received much attention. Process measures include large-grain quantifications, such as the CMM scale, as well as smaller-grain evaluations of particular process activities, such as test effectiveness.

Process perspective. Process research can be viewed from several perspectives. Some process researchers develop process description languages, such as the work done on the Alf (Esprit) project. Here, measurement supports the description by counting tokens that indicate process size and complexity. Other researchers investigate the actual process that developers use to build software. For example, early work by Curtis and his colleagues at MCC revealed that the way we analyze and design software is typically more iterative and complex than top-down.12
Researchers also use measurement to help them understand and improve


MORE INFORMATION ABOUT MEASUREMENT

Books
♦ D. Card and R. Glass, Measuring Software Design Complexity, Prentice Hall, Englewood Cliffs, N.J., 1991.
♦ S.D. Conte, H.E. Dunsmore, and V.Y. Shen, Software Engineering Metrics and Models, Benjamin Cummings, Menlo Park, Calif., 1986.
♦ T. DeMarco, Controlling Software Projects, Dorset House, New York, 1982.
♦ J.B. Dreger, Function Point Analysis, Prentice Hall, Englewood Cliffs, N.J., 1989.
♦ N. Fenton and S.L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, second edition, International Thomson Press, London, 1996.
♦ R.B. Grady, Practical Software Metrics for Project Management and Process Improvement, Prentice Hall, Englewood Cliffs, N.J., 1992.
♦ R.B. Grady and D.L. Caswell, Software Metrics: Establishing a Company-Wide Program, Prentice Hall, Englewood Cliffs, N.J., 1987.
♦ T.C. Jones, Applied Software Measurement: Assuring Productivity and Quality, McGraw Hill, New York, 1992.
♦ K. Moeller and D.J. Paulish, Software Metrics: A Practitioner's Guide to Improved Product Development, IEEE Computer Society Press, Los Alamitos, Calif., 1993.
♦ P. Oman and S.L. Pfleeger, Applying Software Metrics, IEEE Computer Society Press, Los Alamitos, Calif., 1996.

Journals
♦ IEEE Software (Mar. 1991 and July 1994, special issues on measurement; January 1996, special issue on software quality)
♦ Computer (September 1994, special issue on product metrics)
♦ IEEE Transactions on Software Engineering
♦ Journal of Systems and Software
♦ Software Quality Journal
♦ IEE Journal
♦ IBM Systems Journal
♦ Information and Software Technology
♦ Empirical Software Engineering: An International Journal

Key Journal Articles
♦ V.R. Basili and H.D. Rombach, "The TAME Project: Towards Improvement-Oriented Software Environments," IEEE Trans. Software Eng., Vol. 14, No. 6, 1988, pp. 758-773.
♦ C. Billings et al., "Journey to a Mature Software Process," IBM Systems Journal, Vol. 33, No. 1, 1994, pp. 46-61.
♦ B. Curtis, "Measurement and Experimentation in Software Engineering," Proc. IEEE, Vol. 68, No. 9, 1980, pp. 1144-1157.
♦ B. Kitchenham, L. Pickard, and S.L. Pfleeger, "Using Case Studies for Process Improvement," IEEE Software, July 1995.
♦ S.L. Pfleeger, "Experimentation in Software Engineering," Annals Software Eng., Vol. 1, No. 1, 1995.
♦ S.S. Stevens, "On the Theory of Scales of Measurement," Science, Vol. 103, 1946, pp. 677-680.

Conferences
♦ Applications of Software Measurement, sponsored by Software Quality Engineering, held annually in Florida and California on alternate years. Contact: Bill Hetzel, SQE, Jacksonville, FL, USA.
♦ International Symposium on Software Measurement, sponsored by IEEE Computer Society (1st in Baltimore, 1993; 2nd in London, 1994; 3rd in Berlin, 1996; 4th upcoming in Nov. 1997 in Albuquerque, New Mexico); proceedings available from IEEE Computer Society Press. Contact: Jim Bieman, Colorado State University, Fort Collins, CO, USA.
♦ Oregon Workshop on Software Metrics, sponsored by Portland State University, held annually near Portland, Oregon. Contact: Warren Harrison, PSU, Portland, OR, USA.
♦ Minnowbrook Workshop on Software Performance Evaluation, sponsored by Syracuse University, held each summer at Blue Mountain Lake, NY. Contact: Amrit Goel, Syracuse University, Syracuse, NY, USA.
♦ NASA Software Engineering Symposium, sponsored by NASA Goddard Space Flight Center, held annually at the end of November or early December in Greenbelt, Maryland; proceedings available. Contact: Frank McGarry, Computer Sciences Corporation, Greenbelt, MD, USA.
♦ CSR Annual Workshop, sponsored by the Centre for Software Reliability, held annually at locations throughout Europe; proceedings available. Contact: Bev Littlewood, Centre for Software Reliability, City University, London, UK.

Organizations
♦ Australian Software Metrics Association. Contact: Mike Berry, School of Information Systems, University of New South Wales, Sydney 2052, Australia.
♦ Quantitative Methods Committee, IEEE Computer Society Technical Council on Software Engineering. Contact: Jim Bieman, Department of Computer Science, Colorado State University, Fort Collins, CO, USA.
♦ Centre for Software Reliability. Contact: Bev Littlewood, CSR, City University, London, UK.
♦ Software Engineering Laboratory. Contact: Vic Basili, Department of Computer Science, University of Maryland, College Park, MD, USA.
♦ SEI Software Measurement Program. Contact: Anita Carleton, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, USA.
♦ Applications of Measurement in Industry (ami) User Group. Contact: Alison Rowe, South Bank University, London, UK.
♦ International Society of Parametric Analysts. Contact: J. Clyde Perry & Associates, PO Box 6402, Chesterfield, MO 63006-6402, USA.
♦ International Function Point Users Group. Contact: IFPUG Executive Office, Blendonview Office Park, 5008-28 Pine Creek Drive, Westerville, OH 43081-4899, USA. ◆


existing processes. A good example is an ICSE 1994 report in which Larry Votta, Adam Porter, and Basili reported that scenario-based inspections (where each inspector looked for a particular type of defect) produced better results than ad hoc or checklist-based inspections (where each inspector looks for any type of defect).13 Basili and his colleagues at the NASA Software Engineering Laboratory continue to use measurement to evaluate the impact of using Ada, cleanroom, and other technologies that change the software development process. Billings and his colleagues at Loral (formerly IBM) are also measuring their process for building space shuttle software.

Remeasuring. The reuse community provides many examples of process-related measurement as it tries to determine how reuse affects quality and productivity. For example, Wayne Lim has modeled the reuse process and suggested measurements for assessing reuse effectiveness.14 Similarly, Shari Lawrence Pfleeger and Mary Theofanos have combined process maturity concepts with a goal-question-metric approach to suggest metrics to instrument the reuse process.15
Reengineering also offers opportunities to measure process change and its effects. At the 1994 International Software Engineering Research Network meeting, an Italian research group reported on their evaluation of a large system reengineering project. In the project, researchers kept an extensive set of measurements to track the impact of the changes made as a banking application's millions of lines of Cobol code were reengineered over a period of years. These measures included the system structure and the number of help and change requests. Measurement let the team evaluate the success and payback of the reengineering process.

Process problems. Use of these and other process models and measurements raises several problems. First, large-grained process measures require validation, which is difficult to do. Second, project managers are often intimidated by the effort required to track process measures throughout development. Individual process activities are usually easier to evaluate, as they are smaller and more controllable. Third, regardless of the granularity, process measures usually require an underlying model of how they interrelate; this model is usually missing from process understanding and evaluation, so the results of research are difficult to interpret. Thus, even as attention turns increasingly to process in the larger community, process measurement research and practice lag behind the use of other measurements.

MEASURING THE PRODUCTS

Because products are more concrete than processes and resources and are thus easier to measure, it is not surprising that most measurement work is directed in this area. Moreover, customers encourage product assessment because they are interested in the final product's characteristics, regardless of the process that produced it. As a result, we measure defects (in specification, design, code, and test cases) and failures as part of a broader program to assess product quality. Quality frameworks, such as McCall's or the proposed ISO 9126 standard, suggest ways to describe different aspects of product quality, such as distinguishing usability from reliability from maintainability.

Measuring risk. Because failures are the most visible evidence of poor quality, reliability assessment and prediction have received much attention. There are many reliability models, each focused on using operational profile and mean-time-to-failure data to predict when the next failure is likely to occur. These models are based on probability distributions, plus assumptions about whether new defects are introduced when old ones are repaired. However, more work is required both in making the assumptions realistic and in helping users select appropriate models. Some models are accurate most of the time, but there are no guarantees that a particular model will perform well in a particular situation.
looks for any type of defect).13 Basili more controllable. Third, regardless of perform well in a particular situation.
and his colleagues at the NASA the granularity, process measures usually Most developers and customers do
Software Engineering Laboratory con- require an underlying model of how not want to wait until delivery to
tinue to use measurement to evaluate they interrelate; this model is usually determine if the code is reliable or
the impact of using Ada, cleanroom, missing from process understanding and maintainable. As a result, some practi-
and other technologies that change the evaluation, so the results of research are tioners measure defects as evidence of
software development process. Billings difficult to interpret. Thus, even as code quality and likely reliability. Ed
and his colleagues at Loral (formerly attention turns increasingly to process in Adams of IBM showed the dangers of
IBM) are also measuring their process the larger community, process measure- this approach. He used IBM operating
for building space shuttle software. ment research and practice lag behind system data to show that 80 percent of
the use of other measurements. the reliability problems were caused by
Remeasuring. The reuse community only 2 percent of the defects. 16
provides many examples of process- Research must be done to determine
related measurement as it tries to MEASURING THE PRODUCTS which defects are likely to cause the
determine how reuse affects quality most problems, as well as to prevent
and productivity. For example, Wayne Because products are more concrete such problems before they occur.
Lim has modeled the reuse process and than processes and resources and are
suggested measurements for assessing thus easier to measure, it is not surpris- Early measurement. Earlier life-cycle
reuse effectiveness.14 Similarly, Shari ing that most measurement work is products have also been the source of
Lawrence Pfleeger and Mary Theo- directed in this area. Moreover, cus- many measurements. Dolly Samson
fanos have combined process maturity tomers encourage product assessment and Jim Palmer at George Mason
concepts with a goal-question-metric because they are interested in the final University have produced tools that
approach to suggest metrics to instru- product’s characteristics, regardless of measure and evaluate the quality of
ment the reuse process.15 the process that produced it. As a informal, English-language require-
Reengineering also offers opportuni- result, we measure defects (in specifica- ments; these tools are being used by the
ties to measure process change and its tion, design, code, and test cases) and US Federal Bureau of Investigation and
effects. At the 1994 International failures as part of a broader program to
Software Engineering Research assess product quality. Quality frame- There are no
Network meeting, an Italian research works, such as McCall’s or the pro-
group reported on their evaluation of a posed ISO 9126 standard, suggest ways guarantees that
large system reengineering project. In to describe different aspects of product
the project, researchers kept an extensive quality, such as distinguishing usability a particular model
set of measurements to track the impact from reliability from maintainability.
of the changes made as a banking appli-
will perform well
cation’s millions of lines of Cobol code Measuring risk. Because failures are in a particular
were reengineered over a period of the most visible evidence of poor quali-
years. These measures included the sys- ty, reliability assessment and prediction situation.
tem structure and the number of help have received much attention. There
and change requests. Measurement let are many reliability models, each the Federal Aviation Authority on pro-
the team evaluate the success and pay- focused on using operational profile jects where requirements quality is
back of the reengineering process. and mean-time-to-failure data to pre- essential. Similar work has been pur-
dict when the next failure is likely to sued by Anthony Finkelstein’s and
Process problems. Use of these and occur. These models are based on Alistair Sutcliffe’s research groups at
other process models and measurements probability distributions, plus assump- City University in London. Suzanne

I E E E S O FT W A R E 39
.

Robertson and Shari Pfleeger are currently working with the UK Ministry of Defence to evaluate requirements structure as well as quality, so requirements volatility and likely reuse can be assessed. However, because serious measurement of requirements attributes is just beginning, very little requirements measurement is done in practice.

Design and code. Researchers and practitioners have several ways of evaluating design quality, in the hope that good design will yield good code. Sallie Henry and Dennis Kafura at Virginia Tech proposed a design measure based on the fan-in and fan-out of modules. David Card and Bill Agresti worked with NASA Goddard developers to derive a measure of software design complexity that predicts where code errors are likely to be. But many of the existing design measures focus on functional descriptions of design; Shyam Chidamber and Chris Kemerer at MIT have extended these types of measures to object-oriented design and code.

The fact that code is easier to measure than earlier products does not prevent controversy. Debate continues to rage over whether lines of code are a reasonable measure of software size. Bob Park at SEI has produced a framework that organizes the many decisions involved in defining a lines-of-code count, including reuse, comments, executable statements, and more. His report makes clear that you must know your goals before you design your measures. Another camp measures code in terms of function points, claiming that such measures capture the size of functionality from the specification in a way that is impossible for lines of code. Both sides have valid points, and both have attempted to unify and define their ideas so that counting and comparing across organizations is possible. However, practitioners and customers have no time to wait for resolution. They need measures now that will help them understand and predict likely effort, quality, and schedule.

Thus, as with other types of measurement, there is a large gap between the theory and practice of product measurement. The practitioners and customers know what they want, but the researchers have not yet been able to find measures that are practical, scientifically sound (according to measurement theory principles), and cost-effective to capture and analyze.

MEASURING THE RESOURCES

For many years, some of our most insightful software engineers (including Jerry Weinberg, Tom DeMarco, Tim Lister, and Bill Curtis) have encouraged us to look at the quality and variability of the people we employ for the source of product variations. Some initial measurement work has been done in this area.

DeMarco and Lister report in Peopleware on an IBM study which showed that your surroundings—such as noise level, number of interruptions, and office size—can affect the productivity and quality of your work. Likewise, a study by Basili and David Hutchens suggests that individual variation accounts for much of the difference in code complexity;17 these results support a 1979 study by Sylvia Sheppard and her colleagues at ITT, showing that the average time to locate a defect in code is not related to years of experience but rather to breadth of experience. However, there is relatively little attention being paid to human resource measurement, as developers and managers find it threatening.

Nonhuman resources. More attention has been paid to other resources: budget and schedule assessment, and effort, cost, and schedule prediction. A rich assortment of tools and techniques is available to support this work, including Barry Boehm's Cocomo model, tools based on Albrecht's function points model, Larry Putnam's Slim model, and others. However, no model works satisfactorily for everyone, in part because of organizational and project differences, and in part because of model imperfections. June Verner and Graham Tate demonstrated how tailoring models can improve their performance significantly. Their 4GL modification of an approach similar to function points was quite accurate compared to several other alternatives.18,19 Barbara Kitchenham's work on the Mermaid Esprit project demonstrated how several modeling approaches can be combined into a larger model that is more accurate than any of its component models.20 And Boehm is updating and improving his Cocomo model to reflect advances in measurement and process understanding, with the hope of increasing its accuracy.21

Shaky models. The state of the practice in resource measurement lags far behind the research. Many of the research models are used once, publicized, and then die. Those models that are used in practice are often implemented without regard to the underlying theory on which they are built. For example, many practitioners implement Boehm's Cocomo model, using not only his general approach but also his cost factors (described in his 1981 book, Software Engineering Economics). However, Boehm's cost factor values are based on TRW data captured in the 1970s and are irrelevant to other environments, especially given the radical change in development techniques and
40 MARCH/APRIL 1997

tools since Cocomo was developed. Likewise, practitioners adopt the equations and models produced by Basili's Software Engineering Laboratory, even though the relationships are derived from NASA data and are not likely to work in other environments. The research community must better communicate to practitioners that it is the techniques that are transferable, not the data and equations themselves.

STORING, ANALYZING AND REPORTING THE MEASUREMENTS

Researchers and practitioners alike often assume that once they choose the metrics and collect the data, their measurement activities are done. But the goals of measurement—understanding and change—are not met until they analyze the data and implement change.

Measuring tools. In the UK, a team led by Kitchenham has developed a tool that helps practitioners choose metrics and builds a repository for the collected data. Amadeus, an American project funded by the Advanced Research Projects Agency, has some of the same goals; it monitors the development process and stores the data for later analysis. Some Esprit projects are working to combine research tools into powerful analysis engines that will help developers manipulate data for decision making. For example, Cap Volmac in the Netherlands is leading the Squid project to build a comprehensive software quality assessment tool.

It is not always necessary to use sophisticated tools for metrics collection and storage, especially on projects just beginning to use metrics. Many practitioners use spreadsheet software, database management systems, or off-the-shelf statistical packages to store and analyze data. The choice of tool depends on how you will use the measurements. For many organizations, simple analysis techniques such as scatter charts and histograms provide useful information about what is happening on a project or in a product. Others prefer to use statistical analysis, such as regression and correlation, box plots, and measures of central tendency and dispersion. More complex still are classification trees, applied by Adam Porter and Rick Selby to determine which metrics best predict quality or productivity. For example, if module quality can be assessed using the number of defects per module, then a classification tree illustrates which of the metrics collected predicts modules likely to have more than a threshold number of defects.22

Process measures are more difficult to track, as they often require traceability from one product or activity to another. In this case, databases of traceability information are needed, coupled with software to track and analyze progress. Practitioners often use their configuration management system for these measures, augmenting existing configuration information with measurement data.

Analyzing tools. For storing and analyzing large data sets, it is important to choose appropriate analysis techniques. Population dynamics and distribution are key aspects of this choice. When sampling from data, it is essential that the sample be representative so that your judgments about the sample apply to the larger population. It is equally important to ensure that your analysis technique is suitable for the data's distribution. Often, practitioners use a technique simply because it is available on their statistical software packages, regardless of whether the data is distributed normally or not. As a result, invalid parametric techniques are used instead of the more appropriate nonparametric ones. Many of the parametric techniques are robust enough to be used with nonnormal distributions, but you must verify this robustness.

Applying the appropriate statistical techniques to the measurement scale is even more important. Measures of central tendency and dispersion differ with the scale, as do appropriate transformations. You can use mode and frequency distributions to analyze nominal data that describes categories, but you cannot use means and standard deviations. With ordinal data—where an order is imposed on the categories—you can use medians, maxima, and minima for analysis. But you can use means, standard deviations, and more sophisticated statistics only when you have interval or ratio data.

Presentation. Presenting measurement data so that customers can understand it is problematic because metrics are chosen based on business and development goals and the data is collected by developers. Typically, customers are not experts in software engineering; they want a "big picture" of what the software is like, not a large vector of measures of different aspects. Hewlett-Packard has been successful in using Kiviat diagrams (sometimes called radar graphs) to depict multiple measures in one picture, without losing the integrity of the individual measures. Similarly, Contel used multiple metrics graphs to report on software switch quality and other characteristics.

Measurement revisited. A relatively new area of research is the packaging of previous experience for use by new development and maintenance projects. Since many organizations produce new software that is similar to their old software or developed using similar techniques, they can save time and money
by capturing experience for reuse at a later time. This reuse involves not only code but also requirements, designs, test cases, and more. For example, as part of its Experience Factory effort, the SEL is producing a set of documents that suggests how to introduce techniques and establish metrics programs. Guillermo Arango's group at Schlumberger has automated this experience capture in a series of "project books" that let developers call up requirements, design decisions, measurements, code, and documents of all kinds to assist in building the next version of the same or similar product.23

Refining the focus. In the past, measurement research has focused on metric definition, choice, and data collection. As part of a larger effort to examine the scientific bases for software engineering research, attention is now turning to data analysis and reporting. Practitioners continue to use what is readily available and easy to use, regardless of its appropriateness. This is in part the fault of researchers, who have not described the limitations of and constraints on techniques put forth for practical use.

Finally, the measurement community has yet to deal with the more global issue of technology transfer. It is unreasonable for us to expect practitioners to become experts in statistics, probability, or measurement theory, or even in the intricacies of calculating code complexity or modeling parameters. Instead, we need to encourage researchers to fashion results into tools and techniques that practitioners can easily understand and apply.

Just as we preach the need for measurement goals, so too must we base our activities on customer goals. As practitioners and customers cry out for measures early in the development cycle, we must focus our efforts on measuring aspects of requirements analysis and design. As our customers request measurements for evaluating commercial off-the-shelf software, we must provide product metrics that support such purchasing decisions. And as our customers insist on higher levels of reliability, functionality, usability, reusability, and maintainability, we must work closely with the rest of the software engineering community to understand the processes and resources that contribute to good products.

We should not take the gap between measurement research and practice lightly. During an open-mike session at the metrics symposium, a statistician warned us not to become like the statistics community, which he characterized as a group living in its own world with theories and results that are divorced from reality and useless to those who must analyze and understand them. If the measurement community remains separate from mainstream software engineering, our delivered code will be good in theory but not in practice, and developers will be less likely to take the time to measure even when we produce metrics that are easy to use and effective. ◆

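The metric-based classification idea attributed to Porter and Selby above can likewise be suggested in miniature: given historical per-module metrics and defect counts, find the metric and cutoff that best flag modules exceeding a defect threshold. This one-level split is a toy stand-in for a real classification tree, which would split recursively on many metrics; the module data and threshold below are invented for illustration.

```python
def best_split(modules, metrics, defect_threshold=5):
    """Find the metric and cutoff that best separate defect-prone modules.

    modules: dicts holding metric values plus a 'defects' count.
    Returns (metric, cutoff, accuracy) for the best single split.
    """
    labels = [m["defects"] > defect_threshold for m in modules]
    best = (None, None, 0.0)
    for metric in metrics:
        for cutoff in sorted({m[metric] for m in modules}):
            predicted = [m[metric] >= cutoff for m in modules]
            accuracy = sum(p == l for p, l in zip(predicted, labels)) / len(modules)
            if accuracy > best[2]:
                best = (metric, cutoff, accuracy)
    return best

# Invented module history: coupling (fan_out) and size in lines of code.
history = [
    {"fan_out": 2,  "loc": 150,  "defects": 1},
    {"fan_out": 3,  "loc": 300,  "defects": 2},
    {"fan_out": 9,  "loc": 1200, "defects": 11},
    {"fan_out": 12, "loc": 900,  "defects": 8},
    {"fan_out": 4,  "loc": 2000, "defects": 3},
]

metric, cutoff, acc = best_split(history, ["fan_out", "loc"])
print(f"flag modules with {metric} >= {cutoff} (training accuracy {acc:.0%})")
```

On this toy history the coupling metric separates the defect-prone modules perfectly while size does not, which is exactly the kind of answer a classification tree gives a project manager deciding where to focus inspections.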
REFERENCES

1. E.F. Weller, “Using Metrics to Manage Software Projects,” Computer, Sept. 1994, pp. 27-34.
2. W.C. Lim, “Effects of Reuse on Quality, Productivity, and Economics,” IEEE Software, Sept. 1994, pp. 23-30.
3. M.K. Daskalantonakis, “A Practical View of Software Measurement and Implementation Experiences within Motorola,” IEEE Trans. Software Eng., Vol. 18,
No. 11, 1992, pp. 998-1010.
4. W.M. Evangelist, “Software Complexity Metric Sensitivity to Program Restructuring Rules,” J. Systems Software, Vol. 3, 1983, pp. 231-243.
5. N. Coulter, “Software Science and Cognitive Psychology,” IEEE Trans. Software Eng., Vol. 9, No. 2, 1983, pp. 166-171.
6. B. Curtis et al., “Measuring the Psychological Complexity of Software Maintenance Tasks with the Halstead and McCabe Metrics,” IEEE Trans. Software
Eng., Vol. 5, No. 2, 1979, pp. 96-104.
7. E.J. Weyuker, “Evaluating Software Complexity Measures,” IEEE Trans. Software Eng., Vol. 14, No. 9, 1988, pp. 1357-1365.
8. A.C. Melton et al., “Mathematical Perspective of Software Measures Research,” Software Eng. J., Vol. 5, No. 5, 1990, pp. 246-254.
9. B. Kitchenham, S.L. Pfleeger, and N. Fenton, “Toward a Framework for Measurement Validation,” IEEE Trans. Software Eng., Vol. 21, No. 12, 1995,
pp. 929-944.
10. V.R. Basili and D. Weiss, “A Methodology For Collecting Valid Software Engineering Data,” IEEE Trans. Software Eng., Vol. 10, No. 3, 1984, pp. 728-738.
11. W. Hetzel, Making Software Measurement Work: Building an Effective Software Measurement Program, QED Publishing, Boston, 1993.
12. B. Curtis, H. Krasner, and N. Iscoe, “A Field Study of the Software Design Process for Large Systems,” Comm. ACM, Nov. 1988, pp. 1268-1287.

13. A.A. Porter, L.G. Votta, and V.R. Basili, “An Experiment to Assess Different Defect Detection Methods for Software Requirements Inspections,” Proc. 16th
Int’l Conf. Software Eng., 1994, pp. 103-112.
14. W.C. Lim, “Effects of Reuse on Quality, Productivity, and Economics,” IEEE Software, Sept. 1994, pp. 23-30.
15. M.F. Theofanos and S.L. Pfleeger, “A Framework for Creating a Reuse Measurement Plan,” tech. report, 1231/D2, Martin Marietta Energy Systems, Data
Systems Research and Development Division, Oak Ridge, Tenn., 1993.
16. E. Adams, “Optimizing Preventive Service of Software Products,” IBM J. Research and Development, Vol. 28, No. 1, 1984, pp. 2-14.
17. V.R. Basili and D.H. Hutchens, “An Empirical Study of a Syntactic Complexity Family,” IEEE Trans. Software Eng., Vol. 9, No. 6, 1983, pp. 652-663.
18. J.M. Verner and G. Tate, “Estimating Size and Effort in Fourth-Generation Language Development,” IEEE Software, July 1988, pp. 173-177.
19. J.M. Verner and G. Tate, “A Software Size Model,” IEEE Trans. Software Eng., Vol. 18, No. 4, 1992, pp. 265-278.
20. B.A. Kitchenham, P.A.M. Kok, and J. Kirakowski, “The Mermaid Approach to Software Cost Estimation,” Proc. Esprit, Kluwer Academic Publishers,
Dordrecht, the Netherlands, 1990, pp. 296-314.
21. B.W. Boehm et al., “Cost Models for Future Life Cycle Processes: COCOMO 2.0,” Annals of Software Eng., Nov. 1995, pp. 1-24.
22. A. Porter and R. Selby, “Empirically Guided Software Development Using Metric-Based Classification Trees,” IEEE Software, Mar. 1990, pp. 46-54.
23. G. Arango, E. Schoen, and R. Pettengill, “Design as Evolution and Reuse,” in Advances in Software Reuse, IEEE Computer Society Press, Los Alamitos, Calif.,
March 1993, pp. 9-18.

Shari Lawrence Pfleeger is director of the Center for Research in Evaluating Software Technology (CREST) at Howard University in Washington, DC. The Center establishes partnerships with industry and government to evaluate the effectiveness of software engineering techniques and tools. She is also president of Systems/Software Inc., a consultancy specializing in software engineering and technology evaluation. Pfleeger is the author of several textbooks and dozens of articles on software engineering and measurement. She is an associate editor-in-chief of IEEE Software and is an advisor to IEEE Spectrum. Pfleeger is a member of the executive committee of the IEEE Technical Council on Software Engineering, and the program cochair of the next International Symposium on Software Metrics in Albuquerque, New Mexico.
Pfleeger received a PhD in information technology and engineering from George Mason University. She is a member of the IEEE and ACM.

Ross Jeffery is a professor of information systems and director of the Centre for Advanced Empirical Software Research at the University of New South Wales, Australia. His research interests include software engineering process and product modeling and improvement, software metrics, software technical and management reviews, and software resource modeling. He is on the editorial boards of the IEEE Transactions on Software Engineering, the Journal of Empirical Software Engineering, and the Wiley international series on information systems. He is also a founding member of the International Software Engineering Research Network.

Bill Curtis is co-founder and chief scientist of TeraQuest Metrics in Austin, Texas, where he works with organizations to increase their software development capability. He is a former director of the Software Process Program in the Software Engineering Institute at Carnegie Mellon University, where he is a visiting scientist. Prior to joining the SEI, Curtis worked at MCC, ITT's Programming Technology Center, GE's Space Division, and the University of Washington. He was a founding faculty member of the Software Quality Institute at the University of Texas. He is co-author of the Capability Maturity Model for software and the principal author of the People CMM. He is on the editorial boards of seven technical journals and has published more than 100 technical articles on software engineering, user interface, and management.

Barbara Kitchenham is a principal researcher in software engineering at Keele University. Her main interest is in software metrics and their support for project and quality management. She has written more than 40 articles on the topic as well as the book Software Metrics: Measurement for Software Process Improvement. She spent 10 years working for ICL and STC, followed by two years at City University and seven years at the UK National Computing Centre, before joining Keele in February 1996.
Kitchenham received a PhD from Leeds University.

Address questions about this article to Pfleeger at CREST, Howard University Department of Systems and Computer Science, Washington, DC 20059;
s.pfleeger@ieee.org.
