Evaluating Software Engineering Methods and Tools
Part 1: The Evaluation Context and Evaluation Methods

Barbara Ann Kitchenham
NCC Services Ltd
National Computing Centre
Oxford Road
Manchester, M1 7ED, England
+44-161-228-6333 - fax: +44-161-242-2400
barbarak@ncc.co.uk

Copyright NCC Services Ltd. 1995

INTRODUCTION

In the last five issues of SIGSOFT Notes, Shari Lawrence Pfleeger has discussed the use of formal experiments to evaluate software engineering methods and tools [1]. Shari's articles were based on work she performed for the U.K. DESMET project, which aimed to develop a methodology for evaluating software engineering methods and tools.

The DESMET project identified a number of useful evaluation methods in addition to formal experiments, and Shari asked me to continue this column by describing some of the other methods. As a starting point, I will give an overview of the scope of the DESMET methodology in this article and describe the nine different evaluation methods DESMET identified. In the next few articles I will discuss criteria for selecting a specific method in particular circumstances. Later I will present the DESMET guidelines for performing quantitative case studies and feature analysis.

THE DESMET METHODOLOGY: SCOPE, TERMINOLOGY AND LIMITATIONS

DESMET users

DESMET is concerned with the evaluation of methods or tools within a particular organisation. The term organisation is meant to apply to a software development group in a particular company/division performing broadly similar tasks under similar conditions. In addition, the DESMET methodology can be used by academic institutions interested in experimental software engineering.

The DESMET methodology is intended to help an evaluator in a particular organisation to plan and execute an evaluation exercise that is unbiased and reliable (e.g. maximises the chance of identifying the best method/tool). One of the problems with undertaking evaluations in industry is that, as well as technical difficulties, there are sociological and managerial factors that can bias an evaluation exercise. For example, if staff undertaking an evaluation believe a new tool is superior to their current tool, they are likely to get good results; however, the favourable results might not carry over to other software staff who do not have the same confidence in the tool. We have attempted to address this problem in our evaluation guidelines.

Evaluation context

A DESMET evaluation exercise is comparative. That is, we assume that there are several alternative ways of performing a software engineering task and we want to identify which of the alternatives is best in specific circumstances. In some cases, comparisons will be made against a theoretical ideal (i.e. a gold standard).

With the exception of formal experiments, DESMET evaluations are context-dependent, which means that we do not expect a specific method/tool to be the best in all circumstances. It is possible that an evaluation in one company would result in one method/tool being identified as superior, but a similar evaluation in another company would come to a different conclusion. For example, suppose two companies were comparing the use of formal specification notations with the use of Fagan Inspections as a means of reducing the number of defects that reach the customer. If one of the companies had a mathematically sophisticated workforce, it might find formal specification superior in an evaluation; another company with a commercially-oriented workforce might find Fagan Inspections superior. Hence, the difference in the results of the evaluation might be due to the properties of the companies that perform the evaluations, not to the methods themselves.

However, as Shari has described, formal experiments are not context-dependent. Under fully controlled conditions, we expect a formal experiment to give the same results irrespective of the particular evaluator or the particular organisation.

Evaluation objects

DESMET assumes that you may want to evaluate any one of the following objects:

• a generic method, which is a generic paradigm for some aspect of software development (e.g., structured design);
• a method, which is a specific approach within a generic paradigm (e.g., Yourdon structured design);
• a tool, which is a software application that supports a well-defined activity.

Sets of methods or tools being evaluated together are treated in the same way as an individual method or tool. A method/tool combination is usually treated as a method if different paradigms are being compared, and as a tool if different software support packages for the same paradigm are being compared.

Evaluation types

The DESMET evaluation methodology separates evaluation exercises into two main types:

1. evaluations aimed at establishing the measurable effects of using a method/tool;
2. evaluations aimed at establishing method/tool appropriateness, i.e. how well a method/tool fits the needs and culture of an organisation.
Measurable effects are usually based on reducing production, rework or maintenance time or costs. DESMET refers to this type of evaluation as a quantitative or objective evaluation. Quantitative evaluations are based on identifying the benefits you expect a new method or tool to deliver in measurable terms, and collecting data to determine whether the expected benefits are actually delivered.

The appropriateness of a method/tool is usually assessed in terms of the features provided by the method/tool, the characteristics of its supplier, and its training requirements. The specific features and characteristics included in the evaluation are based on the requirements of the user population and any organisational procurement standards. Evaluators assess the extent to which the method/tool provides the required features in a usable and effective manner, based (usually) on personal opinion. DESMET refers to this type of evaluation as feature analysis and identifies such an evaluation as a qualitative or subjective evaluation.

Evaluation procedure

In addition to the separation between quantitative, qualitative and hybrid evaluations, there is another dimension to an evaluation: the way in which the evaluation is organised. DESMET has identified three rather different ways of organising an evaluation exercise:

• as a formal experiment, where many subjects (i.e. software engineers) are asked to perform a task (or variety of tasks) using the different methods/tools under investigation. Subjects are assigned to each method/tool such that results are unbiased and can be analysed using standard statistical techniques;
• as a case study, where each method/tool under investigation is tried out on a real project using the standard project development procedures of the evaluating organisation;
• as a survey, where staff/organisations that have used specific methods or tools on past projects are asked to provide information about the method or tool. Information from the method/tool users can be analysed using standard statistical techniques. (In some circumstances it may be possible to organise a survey on the assumption that respondents would be interested enough to set up a trial of a method/tool and return the information once they have finished the trial. However, this may result in a low response to the survey and a rather long timescale.)

Although the three methods of organising an evaluation are usually associated with quantitative investigations, they can equally well be applied to qualitative evaluations.

Limitations of the DESMET methodology

The DESMET methodology is of limited use if:

1. you want to mix and match methods and tools;
2. your organisation does not have a controllable development process.

The DESMET methodology is aimed at evaluating specific methods/tools in specific circumstances. With the exception of hybrid methods such as the use of expert opinion, evaluation methods do not give any indication as to how different methods and tools interact.

For example, suppose a company undertakes two parallel, independent evaluations: one investigating Fagan Inspections and the other investigating the use of a formal specification language such as Z. Suppose the results of evaluation 1 suggest that Fagan Inspections are better than the company's current walkthrough standard for detecting specification defects (e.g., 80% of specification defects found before product release when Fagan Inspections are used, compared with 45% using walkthroughs). In addition, suppose that the results of evaluation 2 suggest that the use of Z specifications results in fewer residual specification defects than the company's current natural language specifications (e.g., 1 specification fault per 1000 lines of code found after release when Z is used, compared with 10 specification faults per 1000 lines of code with current methods). The evaluator cannot assume that the results of changing the organisation's process to using both Z specifications and Fagan Inspections will be independent. That is, the DESMET methodology does not guarantee that the number of specification defects per 1000 lines of code after release will be 0.2 per 1000 lines instead of 4.5 per 1000 lines. It may be a reasonable assumption, but the combined effect of Z and Fagan Inspections would still need to be explicitly confirmed. In this case, the effect may not be what you expect, because each technique might eliminate the same type of fault as the other.
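To see what the independence assumption amounts to, here is the naive calculation spelt out as a small Python sketch; the figures are the illustrative ones from the example above, not real data:

    # Naive "independent effects" calculation for the Z + Fagan example.
    # All numbers are the illustrative figures from the text, not real data.
    z_faults_per_kloc = 1.0    # post-release spec faults when Z is used alone
    fagan_detection = 0.80     # fraction of spec defects Fagan finds pre-release

    # If the two effects were independent, Fagan would remove 80% of the
    # faults that Z lets through:
    combined = z_faults_per_kloc * (1.0 - fagan_detection)
    print(f"naive combined estimate: {combined:.1f} faults per 1000 lines")  # 0.2

    # The flaw: if the Z notation prevents the very defects that Fagan
    # inspections would otherwise catch, the true combined figure could be
    # close to 1.0. The overlap must be measured, not assumed.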
Undertaking industrial evaluations is usually only worthwhile if the organisation intends to make widespread process changes. If you work for an organisation which undertakes a variety of different applications using methods/tools specified by your customers (e.g., you produce custom-made software products using client-dictated development procedures), you may not have a standard development process. Thus, the results of an evaluation exercise undertaken on a specific project may not generalise to other projects, because each future project is likely to use a different process. In such circumstances the value of any evaluation exercise is likely to be limited.

In general, the parts of the DESMET methodology you will be able to use depend on the organisation in which you work. You will need to assess the "maturity" of your organisation from the viewpoint of the type of evaluation it is capable of performing.

EVALUATION METHODS

Quantitative evaluation methods

Quantitative evaluations are based on the assumption that you can identify some measurable property (or properties) of your software product or process that you expect to change as a result of using the methods/tools you want to evaluate.
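As a concrete illustration, a quantitative evaluation ultimately reduces to comparing a measured property across projects that did and did not use the method/tool. The sketch below is a minimal example of such a comparison; the rework-hours figures are invented, and the choice of a Mann-Whitney test (from the scipy library) is my own assumption rather than a DESMET prescription:

    # Minimal sketch: did the new method/tool change a measurable property?
    # Hypothetical rework hours per project; requires scipy.
    from scipy.stats import mannwhitneyu

    old_method = [120, 95, 140, 110, 130, 105]   # invented figures
    new_method = [90, 85, 100, 70, 95, 80]       # invented figures

    # A non-parametric test avoids assuming normally distributed effort data.
    stat, p_value = mannwhitneyu(old_method, new_method, alternative="greater")
    print(f"U = {stat}, p = {p_value:.3f}")
    # A small p suggests rework really is lower with the new method on these
    # projects; it says nothing about other projects or other contexts.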
Quantitative evaluations can be organised in three different ways: case studies, formal experiments, and surveys. DESMET has concentrated on developing guidelines for undertaking formal experiments and case studies of software methods/tools. Formal experiments are the basic scientific method of assessing differences between methods/tools. DESMET has tried to adapt as much of the principles of formal experiments (i.e. terminology and techniques) as possible when developing guidelines for performing case studies [2].

Shari Lawrence Pfleeger has already discussed the principles of formal experiments in a series of articles in SIGSOFT Notes [1], so I will not discuss them any further.

Case studies

A case study evaluates each method/tool under investigation on a real project, using the organisation's standard development procedures, as described above. The problems with case studies are:

1. With little or no replication, they may give inaccurate results.
2. There is no guarantee that similar results will be found on other projects.
3. There are few agreed standards/procedures for undertaking case studies. Different disciplines have different approaches and often use the term to mean different things.

Surveys

Surveys are used when various method(s)/tool(s) have been used in an organisation (or group of organisations). In this situation, the users of the method(s)/tool(s) can be asked to provide information about any property of interest (e.g. development productivity, post-release defects, etc.). The problems with surveys are:

1. They rely on different projects/organisations keeping comparable data.
2. They only confirm association, not causality.
3. They can be biased due to differences between those who respond and those who do not respond.

Card et al [3] provide a good example of a quantitative survey (although they have not recognised the problem of association versus causality).
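The association/causality caveat is easy to see in miniature. In the sketch below, with invented survey records, projects using a new tool show lower defect density, but nothing in the data says the tool is the reason:

    # Survey data reveal association, not causality. Each invented record
    # is (used_new_tool, post-release defects per 1000 lines) for a project.
    projects = [(True, 2.1), (True, 2.5), (True, 1.8),
                (False, 4.0), (False, 3.6), (False, 4.4)]

    mean = lambda xs: sum(xs) / len(xs)
    with_tool = mean([d for used, d in projects if used])
    without_tool = mean([d for used, d in projects if not used])
    print(f"with tool: {with_tool:.2f}, without tool: {without_tool:.2f}")

    # The gap is only an association: perhaps the adopting projects also had
    # more experienced staff or smaller products. A survey cannot rule out
    # such confounding factors; a formal experiment can.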
Qualitative Methods

DESMET uses the term Feature Analysis to describe a qualitative evaluation. Feature Analysis is based on identifying the requirements that users have for a particular task/activity and mapping those requirements to features that a method/tool aimed at supporting that task/activity should possess. This is the type of analysis you see in personal computer magazines when different word processors or graphics packages are compared feature by feature.

In the context of software methods/tools, the users would be the software managers, developers or maintainers. An evaluator then assesses how well the identified features are provided by a number of alternative methods/tools. Feature Analysis is referred to as qualitative because it usually requires a subjective assessment of both the relative importance of different features and how well a feature is implemented.
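One common way of making such subjective judgements comparable is a weighted scoring scheme: each feature receives an importance weight and each method/tool a rating for how well it provides the feature. The features, weights and ratings below are invented purely to show the shape of the calculation (the detailed DESMET guidelines for feature analysis come later in this series):

    # Feature Analysis as a weighted score. The features, importance weights
    # and 0-5 conformance ratings are all invented for illustration; in a
    # real evaluation they come from the user population's requirements.
    weights = {"ease of learning": 3, "supplier stability": 2, "integration": 5}

    ratings = {
        "Tool A": {"ease of learning": 4, "supplier stability": 3, "integration": 2},
        "Tool B": {"ease of learning": 2, "supplier stability": 4, "integration": 5},
    }

    for tool, scores in ratings.items():
        total = sum(weights[f] * scores[f] for f in weights)
        print(f"{tool}: weighted score = {total}")

    # The ranking depends entirely on the subjective weights and ratings --
    # which is why DESMET classifies Feature Analysis as qualitative.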
Feature analysis can be done by a single person who identifies the requirements, maps them to various features, and then assesses the extent to which alternative methods/tools support them by trying out the method/tool or reviewing sales literature. This is the method that would normally be used when screening a large number of methods and tools. However, by analogy with quantitative studies, DESMET suggests that it is also possible to organise Feature Analysis evaluations more formally in the same three ways:

1. Qualitative experiment: a feature-based evaluation done by a group of potential users. This involves devising the evaluation criteria and the method of analysing results using the standard Feature Analysis approach, but organising the evaluation activity as an experiment (e.g. using a random selection of potential users to undertake the evaluation).
2. Qualitative case study: a feature-based evaluation undertaken after a method/tool has been used in practice on a real project.
3. Qualitative survey: a feature-based evaluation done by people who have experience of using, or have studied, the methods/tools of interest. This involves devising the evaluation criteria and the method of analysing the results using the standard Feature Analysis approach, but organising the evaluation as a survey.

Hybrid Evaluation Methods

When considering the different ways that methods/tools might be evaluated, we identified two specific evaluation methods that each seemed to have some quantitative and some qualitative elements. We call these hybrid methods and describe them below as Qualitative Effects Analysis and Benchmarking. We do not claim these are the only hybrid methods; they were, however, the only ones we could think of! We have not provided any detailed guidelines for these types of evaluation.

Collated expert opinion - Qualitative Effects Analysis

A common way of deciding which method/tool to use on a project is to rely on the expert opinion of a senior software engineer or manager. Gilb [4] suggested assessing techniques in terms of the contribution they make towards achieving a measurable product requirement. This approach was used by Walker and Kitchenham [5] to specify a tool to evaluate the quality levels likely to be achieved by a certain selection of methods/techniques. In DESMET, we refer to this method of using expert opinion to assess the quantitative effects of different methods and tools as Qualitative Effects Analysis. Thus, Qualitative Effects Analysis provides a subjective assessment of the quantitative effect of methods and tools.

DESMET assumes that a knowledge base of expert opinion about generic methods and techniques is available. A user of the knowledge base can request an assessment of the effects of individual methods and/or the combined impact of several methods. This is quite a useful approach because the information held in a database containing expert opinion can be updated as and when the results of objective studies become available.
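DESMET does not prescribe how such a knowledge base is implemented. Purely as a sketch, it might record an expert-estimated effect per generic method and combine estimates on request, with the combination rule itself being an assumption the experts must validate; the structure and every figure below are my own:

    # Sketch of a Qualitative Effects Analysis knowledge base: each entry is
    # an expert-estimated multiplier on post-release defect density. The
    # structure and all figures are assumptions, not part of DESMET.
    knowledge_base = {
        "Fagan inspection":     0.4,   # expert estimate: 60% fewer escapes
        "formal specification": 0.5,   # expert estimate: half the defects
    }

    def combined_effect(methods):
        """Multiply the individual estimates -- itself an independence
        assumption the experts may need to override for overlapping
        techniques (see the limitations discussed earlier)."""
        effect = 1.0
        for m in methods:
            effect *= knowledge_base[m]
        return effect

    print(combined_effect(["Fagan inspection", "formal specification"]))  # 0.2
    # Estimates can be revised as objective case-study or experimental
    # results become available, which is the strength noted above.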
Benchmarking

Benchmarking is the process of running a number of standard tests/trials using a number of alternative tools/methods (usually tools) and assessing the relative performance of the tools in those tests. Benchmarking is usually associated with assessing the performance characteristics of computer hardware, but there are circumstances when the technique is appropriate to software. For example, compiler validation, which involves running a compiler against a series of standard tests, is a type of benchmarking. The compiler example illustrates another point about benchmarking: it is most likely to be useful if the method or tool does not require human expertise to use.

When you perform a benchmarking exercise, the selection of the specific tests is subjective, but the actual performance aspects measured can be quite objective, for example, the speed of processing and/or the accuracy of results. Benchmarks are most suitable for comparisons of tools whose output can be characterised objectively.
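As a minimal illustration, a benchmarking harness simply runs every candidate tool over the same fixed test set and records objective measures such as speed and accuracy. The two "tools" below are stand-in functions, since no specific tools are being evaluated here:

    # Minimal benchmarking harness: run each candidate "tool" over the same
    # fixed test set, recording speed and accuracy. The two sort routines
    # are stand-ins for real tools; all details here are illustrative.
    import time

    def tool_builtin(data):
        return sorted(data)

    def tool_naive(data):                  # deliberately slow alternative
        out = list(data)
        for i in range(len(out)):
            for j in range(i + 1, len(out)):
                if out[j] < out[i]:
                    out[i], out[j] = out[j], out[i]
        return out

    tests = [list(range(n, 0, -1)) for n in (100, 500, 1000)]

    for name, tool in [("builtin", tool_builtin), ("naive", tool_naive)]:
        start = time.perf_counter()
        correct = all(tool(t) == sorted(t) for t in tests)  # accuracy of results
        elapsed = time.perf_counter() - start               # speed of processing
        print(f"{name}: correct={correct}, time={elapsed:.4f}s")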
SUMMARY

This discussion has identified nine distinct evaluation types:

1. Quantitative experiment: An investigation of the quantitative impact of methods/tools organised as a formal experiment.
2. Quantitative case study: An investigation of the quantitative impact of methods/tools organised as a case study.
3. Quantitative survey: An investigation of the quantitative impact of methods/tools organised as a survey.