
Evaluating Software Engineering Methods and Tools
Part 1: The Evaluation Context and Evaluation Methods

Barbara Ann Kitchenham
NCC Services Ltd
National Computing Centre
Oxford Road
Manchester, M1 7ED, England
+44-161-228-6333 - fax: +44-161-242-2400
barbarak@ncc.co.uk
Copyright NCC Services Ltd. 1995

INTRODUCTION

In the last five issues of SIGSOFT Notes, Shari Lawrence Pfleeger has discussed the use of formal experiments to evaluate software engineering methods and tools [1]. Shari's articles were based on work she performed for the U.K. DESMET project, which aimed to develop a methodology for evaluating software engineering methods and tools.

The DESMET project identified a number of useful evaluation methods in addition to formal experiments, and Shari asked me to continue this column by describing some of the other methods. As a starting point, I will give an overview of the scope of the DESMET methodology in this article and describe the nine different evaluation methods DESMET identified. In the next few articles I will discuss criteria for selecting a specific method in particular circumstances. Later I will present the DESMET guidelines for performing quantitative case studies and feature analysis.

THE DESMET METHODOLOGY: SCOPE, TERMINOLOGY AND LIMITATIONS

DESMET users

DESMET is concerned with the evaluation of methods or tools within a particular organisation. The term organisation is meant to apply to a software development group in a particular company/division performing broadly similar tasks under similar conditions. In addition, the DESMET methodology can be used by academic institutions interested in experimental software engineering.

The DESMET methodology is intended to help an evaluator in a particular organisation to plan and execute an evaluation exercise that is unbiased and reliable (i.e. one that maximises the chance of identifying the best method/tool). One of the problems with undertaking evaluations in industry is that, as well as technical difficulties, there are sociological and managerial factors that can bias an evaluation exercise. For example, if staff undertaking an evaluation believe a new tool is superior to their current tool, they are likely to get good results; however, the favourable results might not carry over to other software staff who do not have the same confidence in the tool. We have attempted to address this problem in our evaluation guidelines.

Evaluation context

A DESMET evaluation exercise is comparative. That is, we assume that there are several alternative ways of performing a software engineering task and we want to identify which of the alternatives is best in specific circumstances. In some cases, comparisons will be made against a theoretical ideal (i.e. a gold standard).

With the exception of formal experiments, DESMET evaluations are context-dependent, which means that we do not expect a specific method/tool to be the best in all circumstances. It is possible that an evaluation in one company would result in one method/tool being identified as superior, but a similar evaluation in another company would come to a different conclusion. For example, suppose two companies were comparing the use of formal specification notations with the use of Fagan Inspections as a means of reducing the number of defects that reach the customer. If one of the companies had a mathematically sophisticated workforce, it might find formal specification superior in an evaluation; another company with a commercially-oriented workforce might find Fagan Inspections superior. Hence, the difference in the results of the evaluations might be due to the properties of the companies that perform them, not to the methods themselves.

However, as Shari has described, formal experiments are not context dependent. Under fully controlled conditions, we expect a formal experiment to give the same results irrespective of the particular evaluator or the particular organisation.

Evaluation objects

DESMET assumes that you may want to evaluate any one of the following objects:

• a generic method, which is a generic paradigm for some aspect of software development (e.g., structured design);
• a method, which is a specific approach within a generic paradigm (e.g., Yourdon structured design);
• a tool, which is a software application that supports a well-defined activity.

Sets of methods or tools being evaluated together are treated in the same way as an individual method or tool. A method/tool combination is usually treated as a method if different paradigms are being compared, and as a tool if different software support packages for the same paradigm are being compared.

Evaluation types

The DESMET evaluation methodology separates evaluation exercises into two main types:

1. evaluations aimed at establishing the measurable effects of using a method/tool;
2. evaluations aimed at establishing method/tool appropriateness, i.e. how well a method/tool fits the needs and culture of an organisation.
Measurable effects are usually based on reducing production, rework or maintenance time or costs. DESMET refers to this type of evaluation as a quantitative or objective evaluation. Quantitative evaluations are based on identifying the benefits you expect a new method or tool to deliver in measurable terms and collecting data to determine whether the expected benefits are actually delivered.

The appropriateness of a method/tool is usually assessed in terms of the features provided by the method/tool, the characteristics of its supplier, and its training requirements. The specific features and characteristics included in the evaluation are based on the requirements of the user population and any organisational procurement standards. Evaluators assess the extent to which the method/tool provides the required features in a usable and effective manner, based (usually) on personal opinion. DESMET refers to this type of evaluation as feature analysis and identifies such an evaluation as a qualitative or subjective evaluation.

Evaluation procedure

In addition to the separation between quantitative, qualitative and hybrid evaluations, there is another dimension to an evaluation: the way in which the evaluation is organised. DESMET has identified three rather different ways of organising an evaluation exercise:

• as a formal experiment, where many subjects (i.e. software engineers) are asked to perform a task (or variety of tasks) using the different methods/tools under investigation. Subjects are assigned to each method/tool such that the results are unbiased and can be analysed using standard statistical techniques (a minimal sketch of such an allocation and analysis appears at the end of this subsection);
• as a case study, where each method/tool under investigation is tried out on a real project using the standard project development procedures of the evaluating organisation;
• as a survey, where staff/organisations that have used specific methods or tools on past projects* are asked to provide information about the method or tool. Information from the method/tool users can be analysed using standard statistical techniques.

Although the three methods of organising an evaluation are usually associated with quantitative investigations, they can equally well be applied to qualitative evaluations.

* In some circumstances it may be possible to organise a survey based on the assumption that respondents would be interested enough to set up a trial of a method/tool and return the information once they have finished the trial. However, this may result in a low response to the survey and a rather long timescale.
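To make the formal experiment option concrete, here is a minimal sketch of random allocation followed by a conventional statistical comparison. It is not part of DESMET: the subjects, task times, group sizes and the use of Python with scipy are all illustrative assumptions.

    # Sketch only: randomly allocate subjects to two methods and compare the
    # task times they record. All figures are invented; scipy is assumed.
    import random
    from scipy import stats

    subjects = [f"engineer_{i}" for i in range(1, 21)]
    random.shuffle(subjects)                         # unbiased allocation
    group_a, group_b = subjects[:10], subjects[10:]  # method A vs. method B

    # Hypothetical task completion times (hours) collected after the experiment.
    times_a = [12.5, 14.0, 11.8, 13.2, 15.1, 12.9, 14.4, 13.7, 12.2, 13.9]
    times_b = [10.1, 11.4, 9.8, 10.9, 12.0, 10.5, 11.1, 10.7, 9.9, 11.6]

    # A two-sample t-test is one standard way to ask whether the observed
    # difference is larger than chance variation would explain.
    t_stat, p_value = stats.ttest_ind(times_a, times_b)
    print(f"mean A = {sum(times_a) / len(times_a):.1f} h, "
          f"mean B = {sum(times_b) / len(times_b):.1f} h, p = {p_value:.3f}")

Replication across subjects is what makes this kind of analysis possible; the case study and survey forms trade some of that control for realism, as discussed below.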
Limitations of the DESMET methodology

The DESMET methodology is of limited use if:

1. you want to mix and match methods and tools;
2. your organisation does not have a controllable development process.

The DESMET methodology is aimed at evaluating specific methods/tools in specific circumstances. With the exception of hybrid methods such as the use of expert opinion, the evaluation methods do not give any indication as to how different methods and tools interact.

For example, suppose a company undertakes two parallel, independent evaluations: one investigating Fagan Inspections and the other investigating the use of a formal specification language such as Z. Suppose the results of evaluation 1 suggest that Fagan Inspections are better than the company's current walkthrough standard for detecting specification defects (e.g., 80% of specification defects found before product release when Fagan Inspections are used compared with 45% using walkthroughs). In addition, suppose that the results of evaluation 2 suggest that the use of Z specifications results in fewer residual specification defects than the company's current natural language specifications (e.g., 1 specification fault per 1000 lines of code found after release when Z is used compared with 10 specification faults per 1000 lines of code with current methods). The evaluator cannot assume that the results of changing the organisation's process to use both Z specifications and Fagan Inspections will be independent. That is, the DESMET methodology does not guarantee that the number of specification defects per 1000 lines of code after release will be 0.2 per 1000 lines instead of 4.5 per 1000 lines. It may be a reasonable assumption, but the combined effect of Z and Fagan Inspections would still need to be explicitly confirmed. In this case, the effect may not be what you expect because each technique might eliminate the same type of fault as the other.

Undertaking industrial evaluations is usually only worthwhile if the organisation intends to make widespread process changes. If you work for an organisation which undertakes a variety of different applications using methods/tools specified by your customers (e.g., you produce custom-made software products using client-dictated development procedures), you may not have a standard development process. Thus, the results of an evaluation exercise undertaken on a specific project may not generalise to other projects because each future project is likely to use a different process. In such circumstances the value of any evaluation exercise is likely to be limited.

In general, the parts of the DESMET methodology you will be able to use depend on the organisation in which you work. You will need to assess the "maturity" of your organisation from the viewpoint of the type of evaluation it is capable of performing.

EVALUATION METHODS

Quantitative evaluation methods

Quantitative evaluations are based on the assumption that you can identify some measurable property (or properties) of your software product or process that you expect to change as a result of using the methods/tools you want to evaluate.
Quantitative evaluations can be organised in three different ways: case studies, formal experiments, and surveys. DESMET has concentrated on developing guidelines for undertaking formal experiments and case studies of software methods/tools. Formal experiments are the basic scientific method of assessing differences between methods/tools. DESMET has tried to adapt as many of the principles of formal experiments (i.e. terminology and techniques) as possible in developing guidelines for performing case studies [2].

Shari Lawrence Pfleeger has already discussed the principles of formal experiments in a series of articles in SIGSOFT Notes [1], so I will not discuss them any further.

Case Studies

Case studies involve the evaluation of a method/tool after it has been used on a "real" software project. This means that you can be sure that the effect of the method/tool applies in your own organisation and scales up to the particular type of projects you undertake.

Case studies are easier for software development organisations to perform because they do not require replication. However, you can have only limited confidence that a case study will allow you to assess the true effect of a method/tool.

In practice, there are a number of difficulties with case studies:

1. It is difficult to ensure that there is a means of interpreting the results of a case study. For example, a single productivity value of 0.02 function points per hour for a project using a new 4GL cannot be interpreted unless there is some baseline productivity value for normal projects with which to compare it (a small sketch of such a baseline comparison follows this list).
2. It is difficult to assess the extent to which the results of one case study can be used as the basis of method/tool investment decisions. For example, if we found the productivity of a project using a new method were twice that of other projects in our organisation, we would still not be advised to invest in that tool if the case study project was an in-house project management database and the projects we normally undertook for clients were embedded, real-time command and control systems.
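The following sketch illustrates the baseline comparison described in point 1 above. The figures are invented; a real comparison would use productivity values from the organisation's own project records.

    # Sketch only: place a single case-study productivity value against a
    # baseline of earlier, comparable projects. All figures are invented.
    from statistics import mean, stdev

    baseline_fp_per_hour = [0.008, 0.011, 0.009, 0.012, 0.010, 0.013, 0.009]
    case_study_value = 0.02   # the project that used the new 4GL

    mu, sigma = mean(baseline_fp_per_hour), stdev(baseline_fp_per_hour)
    z = (case_study_value - mu) / sigma

    print(f"baseline mean = {mu:.3f} FP/h, case study = {case_study_value:.3f} FP/h")
    print(f"{z:.1f} standard deviations above the organisational baseline")
    # A large positive difference is encouraging, but with a single project
    # there is no replication, so the result is indicative rather than conclusive.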
Advantages of case studies

1. They can be incorporated into the normal software development activities.
2. If they are performed on real projects, they are already "scaled-up" to life size.
3. They allow you to determine whether (or not) expected effects apply in your own organisational and cultural circumstances.

Disadvantages of case studies

1. With little or no replication, they may give inaccurate results.
2. There is no guarantee that similar results will be found on other projects.
3. There are few agreed standards/procedures for undertaking case studies. Different disciplines have different approaches and often use the term to mean different things.

Surveys

Surveys are used when various method(s)/tool(s) have been used in an organisation (or group of organisations). In this situation, the users of the method(s)/tool(s) can be asked to provide information about any property of interest (e.g. development productivity, post-release defects, etc.).

Information from a number of different tool users implies replication. This means that we can use statistical techniques to evaluate the differences between method(s)/tool(s) with respect to the properties of interest. The major difference between a formal experiment and a survey is that a survey is usually post-hoc and less well-controlled than an experiment. This results in several problems with surveys:

1. They are only able to demonstrate association, not causality. An example of the difference between association and causality is that it has been observed that there is a strong correlation between the birth rate and the number of storks in Scandinavia; however, few people believe that shooting storks would decrease the birth rate. This limitation holds even if we believe a causal link is very plausible. For example, the basis for the tobacco industry's claim that a link between smoking and cancer is not proven is the fact that scientists have only demonstrated an association, not a causal link.
2. The results may be biased if the relationship between the population (i.e. the total number of persons in a position to respond to the survey) and the respondents is unknown. For example, in a survey aimed at obtaining information about the quality and productivity of projects using specific methods/tools, the project managers who reply are likely to be those that are well-organised enough to have extracted and stored such information. Projects run by well-organised project managers may gain more from the use of methods/tools than projects run by badly-organised project managers.

Advantages of surveys

1. They make use of existing experience (i.e. existing data).
2. They can confirm that an effect generalises to many projects/organisations.
3. They make use of standard statistical analysis techniques.

Disadvantages of surveys

1. They rely on different projects/organisations keeping comparable data.
2. They only confirm association, not causality.
3. They can be biased due to differences between those who respond and those who do not respond.

Card et al. [3] provide a good example of a quantitative survey (although they have not recognised the problem of association versus causality).
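As an illustration of the kind of analysis a survey permits, the sketch below compares a property of interest across respondents who did and did not use the method under evaluation. The figures are invented and the use of Python with scipy is an assumption, not part of DESMET; a non-parametric test is chosen because survey data are rarely well-behaved.

    # Sketch only: compare productivity reported by survey respondents.
    # All figures are invented; scipy is assumed to be available.
    from scipy import stats

    # Function points per staff-month, grouped by whether the respondent's
    # project used the method under evaluation.
    used_method = [14.2, 16.5, 13.8, 17.1, 15.0, 16.2]
    did_not_use = [11.9, 12.4, 13.1, 10.8, 12.7, 11.5, 12.0]

    u_stat, p_value = stats.mannwhitneyu(used_method, did_not_use,
                                         alternative="two-sided")
    print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")
    # Even a convincing p-value demonstrates association only: respondents who
    # adopted the method may differ systematically from those who did not.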
Qualitative Methods

DESMET uses the term Feature Analysis to describe a qualitative evaluation. Feature Analysis is based on identifying the requirements that users have for a particular task/activity and mapping those requirements to features that a method/tool aimed at supporting that task/activity should possess. This is the type of analysis you see in personal computer magazines when different word processors or graphics packages are compared feature by feature.

In the context of software methods/tools, the users would be the software managers, developers or maintainers. An evaluator then assesses how well the identified features are provided by a number of alternative methods/tools. Feature Analysis is referred to as qualitative because it usually requires a subjective assessment of the relative importance of different features and how well a feature is implemented.

Feature analysis can be done by a single person who identifies the requirements, maps them to various features and then assesses the extent to which alternative methods/tools support them by trying out the method/tool or reviewing sales literature. This is the method that would normally be used when screening a large number of methods and tools. However, using an analogy with quantitative studies, DESMET suggests that it is also possible to organise Feature Analysis evaluations more formally using the same three ways (a small sketch of the underlying scoring appears after this list):

1. Qualitative experiment: A feature-based evaluation done by a group of potential users. This involves devising the evaluation criteria and the method of analysing the results using the standard Feature Analysis approach, but organising the evaluation activity as an experiment (e.g. using a random selection of potential users to undertake the evaluation).
2. Qualitative case study: A feature-based evaluation undertaken after a method/tool has been used in practice on a real project.
3. Qualitative survey: A feature-based evaluation done by people who have experience of using, or have studied, the methods/tools of interest. This involves devising the evaluation criteria and a method of analysing the results using the standard Feature Analysis approach, but organising the evaluation as a survey.
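Whatever way the evaluation is organised, the underlying bookkeeping is a weighted feature-by-feature score. The sketch below is illustrative only: the features, importance weights and scores are invented, and the DESMET feature analysis guidelines presented later in this series define the scales more carefully.

    # Sketch only: weighted feature scoring for two candidate tools.
    # Features, weights (1 = nice to have, 5 = essential) and scores (0-5)
    # are invented for illustration.
    features = {
        "supports required notation": 5,
        "integrates with existing tools": 4,
        "quality of supplier support": 3,
        "training effort required": 4,
    }

    scores = {
        "Tool A": {"supports required notation": 4, "integrates with existing tools": 2,
                   "quality of supplier support": 5, "training effort required": 3},
        "Tool B": {"supports required notation": 3, "integrates with existing tools": 5,
                   "quality of supplier support": 3, "training effort required": 4},
    }

    maximum = 5 * sum(features.values())
    for tool, tool_scores in scores.items():
        weighted = sum(weight * tool_scores[feature]
                       for feature, weight in features.items())
        print(f"{tool}: {weighted}/{maximum} weighted score")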
Hybrid Evaluation Methods

When considering the different ways that methods/tools might be evaluated, we identified two specific evaluation methods that each seemed to have some quantitative and qualitative elements. We call these hybrid methods and describe them below as Qualitative Effects Analysis and Benchmarking. We do not claim these are the only hybrid methods; they were, however, the only ones we could think of! We have not provided any detailed guidelines for these types of evaluation.

Collated expert opinion - Qualitative Effects Analysis

A common way of deciding which method/tool to use on a project is to rely on the expert opinion of a senior software engineer or manager. Gilb [4] suggested assessing techniques in terms of the contribution they make towards achieving a measurable product requirement. This approach was used by Walker and Kitchenham [5] to specify a tool to evaluate the quality levels likely to be achieved by a certain selection of methods/techniques. In DESMET, we refer to this method of using expert opinion to assess the quantitative effects of different methods and tools as Qualitative Effects Analysis. Thus, Qualitative Effects Analysis provides a subjective assessment of the quantitative effect of methods and tools.

DESMET assumes that a knowledge base of expert opinion about generic methods and techniques is available. A user of the knowledge base can request an assessment of the effects of individual methods and/or the combined impact of several methods. This is quite a useful approach because the information held in a database containing expert opinion can be updated as and when the results of objective studies become available.

Benchmarking

Benchmarking is a process of running a number of standard tests/trials using a number of alternative tools/methods (usually tools) and assessing the relative performance of the tools in those tests. Benchmarking is usually associated with assessing the performance characteristics of computer hardware, but there are circumstances when the technique is appropriate to software. For example, compiler validation, which involves running a compiler against a series of standard tests, is a type of benchmarking. The compiler example illustrates another point about benchmarking: it is most likely to be useful if the method or tool does not require human expertise to use.

When you perform a benchmarking exercise, the selection of the specific tests is subjective, but the actual performance aspects measured can be quite objective, for example the speed of processing and/or the accuracy of results. Benchmarks are most suitable for comparisons of tools, when the output of the tool can be characterised objectively.
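A benchmarking harness of this kind can be sketched in a few lines. The "tools" below are placeholder Python functions standing in for real products, and the test cases are invented; only the shape of the exercise (the same standard tests, with objective speed and accuracy measures) is the point.

    # Sketch only: run two candidate tools over the same standard tests and
    # record speed and accuracy. The tools and tests are placeholders.
    import time

    def tool_a(case):      # placeholder for invoking the first tool
        return sorted(case)

    def tool_b(case):      # placeholder: faster but unreliable on large inputs
        return case if len(case) > 2000 else sorted(case)

    standard_tests = [list(range(n, 0, -1)) for n in (100, 1000, 5000)]
    expected = [sorted(case) for case in standard_tests]

    for name, tool in (("tool A", tool_a), ("tool B", tool_b)):
        start = time.perf_counter()
        results = [tool(case) for case in standard_tests]
        elapsed = time.perf_counter() - start
        accuracy = sum(r == e for r, e in zip(results, expected)) / len(expected)
        print(f"{name}: {elapsed * 1000:.2f} ms over {len(standard_tests)} tests, "
              f"accuracy = {accuracy:.0%}")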
SUMMARY

This discussion has identified nine distinct evaluation types:

1. Quantitative experiment: An investigation of the quantitative impact of methods/tools organised as a formal experiment.
2. Quantitative case study: An investigation of the quantitative impact of methods/tools organised as a case study.
3. Quantitative survey: An investigation of the quantitative impact of methods/tools organised as a survey.
4. Qualitative screening: A feature-based evaluation done by a single individual who not only determines the features to be assessed and their rating scale but also does the assessment. For initial screening, the evaluations are usually based on literature describing the software methods/tools rather than actual use of the methods/tools.
5. Qualitative experiment: A feature-based evaluation done by a group of potential users who are expected to try out the methods/tools on typical tasks before making their evaluations.
6. Qualitative case study: A feature-based evaluation performed by someone who has used the method/tool on a real project.
7. Qualitative survey: A feature-based evaluation done by people who have had experience of using the method/tool, or have studied the method/tool. The difference between a survey and an experiment is that participation in a survey is at the discretion of the subject.
8. Qualitative effects analysis: A subjective assessment of the quantitative effect of methods and tools, based on expert opinion.
9. Benchmarking: A process of running a number of standard tests using alternative tools/methods (usually tools) and assessing the relative performance of the tools against those tests.

All these methods of evaluation might be regarded as an embarrassment of riches, because they lead directly to the problem of deciding which one to use. The problem of selecting an appropriate evaluation method given your specific circumstances is discussed in the next article in this series.

REFERENCES

[1] Pfleeger, S.L. Experimental Design and Analysis in Software Engineering, Parts 1 to 5. SIGSOFT Notes, 1994 and 1995.

[2] Kitchenham, B.A., Pickard, L.M., and Pfleeger, S.L. Case studies for method and tool evaluation. IEEE Software, vol. 12, no. 4, July 1995, pp. 52-62.

[3] Card, D.N., McGarry, F.M., and Page, G.T. Evaluating software engineering technologies. IEEE Transactions on Software Engineering, 13(7), 1987.

[4] Gilb, T. Principles of Software Engineering Management. Addison-Wesley, 1987.

[5] Walker, J.G. and Kitchenham, B.A. Quality Requirements Specification and Evaluation. In Measurement for Software Control and Assurance (B.A. Kitchenham and B. Littlewood, eds.), Elsevier Applied Science, 1989.


4th International Software Reuse Conference Overview

Murali Sitaraman
West Virginia University
murali@cs.wvu.edu

YOU ARE INVITED to participate in the Fourth International Conference on Software Reuse (ICSR4), to be held April 23-26, 1996 at the Royal Plaza hotel in the middle of Walt Disney World Village in Orlando, Florida, USA. The event is sponsored by the IEEE Computer Society in cooperation with ACM SIGSOFT. This is the first time this premier international conference on software reuse is being held in the United States. The conference will bring together researchers, practitioners, and entrepreneurs to make reuse "business as usual" in software development and management.

The 1996 conference features three prominent KEYNOTE SPEAKERS: Barry Boehm from the University of Southern California, Joseph Goguen from the University of Oxford, and Bjarne Stroustrup from AT&T Bell Labs.

The highly selective conference program includes PAPER SESSIONS and PANELS spanning technical and business aspects of reuse. Technical sessions focus on components and composition, domain analysis and modeling, object-based computing, programming languages, and software architectures. Reuse business sessions address internet issues, reuse investment strategies, legal and process-related issues, and aspects of technology transfer.

A day and a half of pre- and post-conference TUTORIAL sessions (13 half-day tutorials) will address a variety of topics of interest to practitioners, managers, and researchers. Most of them are to be offered by internationally-known speakers at exceptionally low rates. In addition, there will be an exhibition of several ongoing software reuse efforts and poster sessions.

Interested participants are invited to visit the conference web page at

http://gww.cis.ohio-state.edu/icsr4

for the advance program announcement (including complete details on the papers, panels, and tutorials at the conference), general REGISTRATION information, and our special rates for early registration and students. Participants may also contact the registration chair, Dr. Steven D. Fraser (Nortel, Inc.), by electronic mail: sdfraser@bnr.ca.

Come to Orlando and join the software reuse extravaganza!
