BEHAVIOUR & INFORMATION TECHNOLOGY, 1991, VOL. 10, NO. 1, 65-79

A framework for human factors evaluation

ANDY WHITEFIELD*, FRANK WILSON† and
JOHN DOWELL*
* Ergonomics Unit and † London HCI Centre, University College London,
26 Bedford Way, London WC1H 0AP, UK

Abstract. Successful human factors evaluation of interactive computer
systems has tended to rely heavily on the experience of the practitioner, who
has had little explicit support on which to draw. This paper concerns
support for evaluation in the form of a framework for describing and
guiding the general activity. The paper starts with a critique of current
approaches to evaluation, and particularly of evaluation within the 'design
for usability' approach. Following a definition of evaluation, a framework is
proposed that attempts to clarify what can be done towards which goals and
how it can be done. This highlights and discusses notions of system
performance, of assessment statements, and of assessment methods. The
paper concludes with a discussion of the implications of the framework for
evaluation practice.

1. Introduction
Evaluation is an integral part of any development process, whether for
considerations of cost, safety, production, maintainability, or whatever.
Evaluating the human factors aspects of interactive computer systems is simply
a part (albeit an important part) of this wider evaluation process. In practice,
human factors evaluation has tended to rely heavily on the experience of the
human factors specialist, who has had very little in the way of formal,
theoretical or tool support. This has begun to change recently, with an
increasing interest in understanding the nature of human factors evaluation and
in developing support for it (for instance, Lea 1988, Clegg et al. 1988, Dowell
and Long 1989a, Denley and Long 1990). This paper is intended to extend that
understanding and hence the consequent support.
The paper aims firstly to characterize and comment upon current
approaches to human factors evaluation. This is done in section 2. A
framework for evaluation, based on a definition of evaluation and on an
analysis of its methods, is outlined in section 3. The paper concludes with a
discussion of this framework in section 4.

2. Current approaches
Both personal experience and accounts of evaluations in the literature
suggest that human factors evaluation practice varies widely. This is due in part
to the variability in the processes and products of interactive system
development, but we would suggest it is also partly due to the absence of any
accepted framework for evaluation. Whilst acknowledging this variety in
human factors evaluation practice, it is important to characterize that practice
by identifying its major features. This will allow us to identify what would be
improvements to practice.

Explicit human factors evaluations of early interactive systems (when they
were done at all) were poorly integrated with development and therefore
ineffective. They tended to be done too late for any substantial changes to the
system still to be feasible and, in common with other human factors
contributions to development, they were often unfavourably received (for
example, Hammond et al. 1983). This situation has been improved in recent
years in a number of ways. The general progress in human-computer interaction
(HCI) knowledge and methods has helped, but there have also been at least three
areas of research specifically concerned with evaluation. The first has focused on
the development of analytic methods of evaluation, i.e., models or other
descriptions of a system that can be manipulated to describe or predict the
system's behaviour. Although a promising development, this work has so far had
little impact on practice (Bellotti 1988). Analytic methods will be discussed
further in section 3.3.
The second research improvement in human factors evaluation has been the
development of tools to support the practitioner. Among the most prominent
examples are those proposed by Clegg et al. (1988) and by Ravden and Johnson
(1989). These seek to provide the (non-specialist) practitioner with wide-ranging
questionnaire and checklist tools with which to evaluate a company's current or
potential computer systems. While these will almost certainly be useful to many
people, their orientation towards the non-specialist evaluator, and their
apparently implicit and experiential basis, make them difficult to use in
characterizing human factors evaluation practice.
The third area of research concerned with improving human factors
evaluation has been as part of a general approach to incorporating human factors
in system development known as usability engineering or design for usability.
These terms cover loosely-related work by several human factors researchers in
both industry and academia. Although the approach concerns systems
development in general and is therefore not concerned solely with evaluation, it
has included a number of explicit and implicit proposals about it. We shall use
Shackel (1986) as our principal reference for the work, since he attempts to
provide an integration and overview of the approach. We shall also use his
favoured term, design for usability. The reader should bear in mind, however,
that design for usability is not a single, unified, fully-developed and widely-
adopted approach that is completely and clearly represented by Shackel's paper.
Another view is presented by Whiteside et al. (1988).
Shackel summarizes the fundamental features of design for usability as:
User centred design - focused from the start on users and tasks;
Participative design - with users as members of the design team;
Experimental design - with formal tests of usability in pilot trials,
simulations, and full prototype evaluations;
Iterative design - design, test and measure, and redesign as a regular cycle
until results satisfy the usability specification;
User supportive design - help systems, manuals, training, and selection
should be treated as part of the design.
Evaluation within this approach is characterized by a reliance on iterative
design, on experimental testing with users, and on usability criteria. As Shackel
(1986, p. 54) puts it
perhaps the most important feature of this process is that the usability goals
thus set become criteria by which to test the design as it evolves and to
improve it by iterative redesign. Such tests are embodied first in trials of early
versions of the design and later in formal evaluations of prototypes.

Design for usability does appear to be influential in many current
developments in practice. For example, researchers for many major
manufacturers (including IBM and DEC) have contributed to the literature on
the approach, as have (indirectly) many major user organizations (for example,
Gardner and McKenzie 1988). Also most manufacturers appear to have invested
in usability laboratories (see Dillon 1988) - a forum for evaluation entirely
suited to the role of evaluation within design for usability.
An assessment of the limitations and weaknesses of current approaches to
evaluation, as represented by design for usability, is therefore an appropriate
starting point that should allow us to extend and to clarify our understanding of
human factors evaluation. The assessment of the current approach focuses on
each of its three characteristics as mentioned above.


The first such characteristic is iterative design. Iterative design is generally
recommended in most development contexts, and it has many advantages for
developing quality interfaces, particularly in the absence of good human factors
methods for system development. However, as a procedure for evaluation (and
particularly for evaluations of prototypes) it does have some drawbacks. One is
that it is not easily applicable to all systems. For example, iteration may be
difficult for very large systems and for systems requiring very formal methods,
such as safety-critical systems. The second drawback is that it does not alter the
type of human factors input to the development. Iterative evaluation tries to
circumvent the difficulties of a single evaluation late in development, but it does
so by replicating the same activities in a shorter timescale. Thus some of the
weaknesses of the single late evaluation will also be found with the iterative
evaluation. For example, the resource requirements of typical late evaluations
are considerable - an important drawback for repeating them too often. The
solution to both these drawbacks is to seek evaluation inputs of different kinds
and to seek them at other stages of development. Thus, while recognizing
iterative design as a laudable and useful practice, one should also recognize that
evaluation can take place in other ways.
The second characteristic of design for usability evaluation is experimental
testing with users. There are two points to be made here. The first is that the
approach in general emphasizes the inclusion of users within the development in
various ways, but principally by participation and testing. Clearly these are both
important and productive, but we would propose that the crucial idea is of the user
being represented in the development rather than necessarily being included in it.
Inclusion assumes the users' ability to present themselves in ways useful to the
developers. However, users may not be able to articulate or otherwise usefully
disclose what they know; they may not have a clear and conscious appreciation of
what they know; and they may not be aware of how what they know might support
development. This is not to belittle the potential contribution of including users in
development, but it does point out that there are representations of the user, for
example, conceptual models, which may be recruited as well as, or instead of, the
users' physical presence in the development activity.
The second point about testing with users is whether it must be experimental.
It is not clear just what Shackel means by experimental. By talking about
comparative testing and 'formal and empirical' studies, one infers that he means
full control and manipulation of relevant variables, or something approaching it.
We prefer to adopt the view that there is a wide range of methods for user testing,
including informal observation, interviews, and questionnaires as well as
controlled experimentation, and all of these can be used for the purposes of
evaluation. To equate evaluation with fully controlled experiments is to confuse
the goal with a method of reaching the goal.
The third characteristic of evaluation within design for usability, and one on
which it lays strong emphasis, is the use of usability goals and criteria. Shackel
defines usability in terms of the operational factors of effectiveness, learnability,
flexibility, and attitude. Learnability, for example, is defined operationally as the
users' ability to perform a set range of tasks within some specified time frame,
based on some specified amount of training and user support, and within some
specified relearning time each time for intermittent users. Usability goals are set
early in development by giving numerical values to the parameters just listed (the
range of tasks, the time frame, the amount of training and support, and the
relearning time). These then become criteria against which the system is tested.
(Note that these criteria
can be set for a number of different levels, e.g., best, planned, and worst cases -
see Whiteside et al. 1988.) Failure to meet the criteria results in redesign and
testing, thus leading to iterative design. Goals for the other usability factors of
effectiveness, flexibility, and attitude are set in a similar way.
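By way of illustration only, a learnability goal of this kind might be recorded and
then checked against observed values as in the following sketch; the structure,
names, and figures are illustrative and are not taken from Shackel's scheme.

    # Illustrative sketch only: a Shackel-style learnability goal recorded as
    # numerical criteria. All names and values are invented for this example.
    learnability_goal = {
        "task_range": ["create record", "edit record", "print report"],
        "max_task_time_mins": 30,        # specified time frame for the set tasks
        "training_hours": 2,             # specified amount of training and support
        "max_relearning_time_mins": 10,  # specified relearning time (intermittent users)
    }

    def meets_learnability_goal(observed, goal):
        """Compare observed values with the numerical criteria; best, planned,
        and worst-case levels could be handled by keeping three such goals."""
        return (observed["task_time_mins"] <= goal["max_task_time_mins"] and
                observed["relearning_time_mins"] <= goal["max_relearning_time_mins"])

The notation is unimportant; the point is simply that each parameter of the
operational definition receives an explicit value that can later be checked.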
The use of such criteria requires the ability to specify clearly and to make
quantitative measurements of these various factors. While one can see (at least
on an informal basis) how this might be done for learning time, it is less clear for
other factors, such as 'acceptable levels of human cost in terms of tiredness,
discomfort, frustration and personal effort' (the operational definition of
attitude). At the moment the approach offers no support to developers in
specifying the criteria and measuring against them (although Taylor and Bonner
(1989) report work in progress that addresses this problem).
Moreover the approach concentrates on quantitative performance and does
not address either qualitative assessments or the behaviour that underlies
performance. Thus there may well be both aspects of system performance and
occasions during development (especially early in development) for which
quantitative assessments cannot be made and only more qualitative judgements
will be possible. In addition, the concentration on comparing observed
performance with some prespecified criterion may well fail to provide
developers with the most useful information. If performance fails to meet the
criterion, it is important that developers have some understanding of why, so
that they can direct their redesign efforts appropriately. That is, they need to
know something of the user and computer behaviour that produced the
performance - what problems the user encountered, how easy or difficult
certain parts of the task were, what were the constraints on performance, and so
on. This is the type of evaluation that Long and Whitefield (1986) termed
diagnostic, as opposed to the measurement evaluation characteristic of the use
of criteria.
Thus the problems with criterial evaluation are that it is not diagnostic with
respect to the underlying behaviour and sources of problems, and it relies too
heavily on quantitative measurements. Whiteside et al. (1988) make a similar
point when they say 'Not all, or even most, aspects of usability experience can be
given operationally defined criteria and measured.' (p. 814).
Implicit in this criticism of usability goals and criteria is the view that the
notions of usability and performance within design for usability are too poorly
specified (and insufficiently distinguished) to support fully evaluation practice.
Essentially the view of usability is inflexible and its derivation is unclear. Thus it
follows from Shackel's statement (1986, p. 52) that 'usability can be defined in
terms of the interaction between user, task and system in the environment', that
usability is not an inherent property of a computer but is a property of a
particular system interaction. The implication of this is that usability must be
constructed differently for different system configurations and instances, but in
this respect Shackel's definition does not allow sufficient variation. The
definition does two things: it identifies four factors (effectiveness, learnability,
flexibility, attitude) and it specifies them in terms of performance measures
(time, errors, percentage of completed tasks, amount of training, etc). One
problem is that the factors are not explicitly derived and there is no clear
argument (other than experience) for this set as opposed to any other. Other
factors would certainly be possible. Sutcliffe (1988), for example, has proposed
the additional factor of coverage. Second, the particular performance
specification of the factors may be unnecessarily constraining. For example, why
should learnability always be specified in terms of time? Why not errors, or
memory recall, or ... ?
Whiteside et al. (1988) have a better approach here, in that they do not limit
in advance the usability factors or their measurement criteria. On the other
hand, they do not limit the factors ('attributes') at all and provide no assistance
in identifying suitable factors for any evaluation - an approach equally unhelpful
in its own way.
It has been suggested (Brooke 1990) that some of the proponents of the use of
usability goals and criteria were, from an early stage, aware of many of the
criticisms made above. Nevertheless, they decided that these problems were
acceptable given their goal of establishing human factors considerations within
system development on the same (quantitative) terms as those of other factors.
While we accept this as a motivation, and accept that considering human factors
in these terms may well have been beneficial rather than detrimental, this does
not of course mean that the criticisms above are invalid.
This completes the review of the design for usability approach to evaluation.
Although this section has concentrated very much on the one approach, it has
not been the intention simply to criticize this particular approach. Rather our
intention has been to examine critically current approaches to evaluation, of
which design for usability is the clearest and most prevalent example, with the
aim of improving our understanding of evaluation, and hence its practice. Our
attempt to do this is presented in the next section.

3. A framework for evaluation


The previous analysis has identified several weaknesses and problems with
current approaches to evaluation. Many of these arise because the relevant
concepts (for example, performance) and the relations between concepts (for
example, what criterial evaluation can address about performance) are
insufficiently defined. Overcoming many of these weaknesses could be achieved
by developing a more complete and coherent framework for evaluation. This
should enable an improvement in practice by clarifying what can be done
towards which goals and how it can be done. The intention of this section is to
outline such a framework.

3.1. Definition
In an attempt to make explicit the concepts and relations of evaluation, we
begin by proposing a definition. Definitions are not absolute but are fashioned
for particular purposes; the purpose here is to characterize all instances of
evaluation so as to identify important commonalities. The definition is a re-
expression of that used by Long and Whitefield (1986): human factors evaluation
is an assessment of the conformity between a system's performance and its desired
performance. The various terms in the definition will be discussed in turn.

3.2. Discussion of terms


It is important to clarify what is meant by a system. A system in the
ergonomic sense (and as used throughout this paper) is a user and a computer
engaged upon some task within an environment. A set of hardware and software
components is therefore not a system in this sense but simply a computer. Note
that this interpretation requires that a complete evaluation of human-computer
interaction within a system must consider not just users and how they perform
but also computers and how they perform. One must understand how the
structures and behaviours of software and hardware determine performance as
well as how the users' mental and physical structures and behaviours do the
same.
The definition of course requires clarification of the concept of performance.
In their discussion of HCI engineering, Dowell and Long (1989b) define
performance as the system's effectiveness in accomplishing tasks. It is a two-
factor concept expressed as the quality of task product (i.e., how well the task's
outcome meets its goal) and the incurred resource costs (i.e., the resources
employed by both the user and the computer in accomplishing the task). A most
effective system would minimize the resource costs in performing a specified
task with a given quality of product.
Dowell and Long (1989b) state that the resource costs incurred by the human
are of two kinds: structural and behavioural. Structural human resource costs are
the costs incurred in establishing and maintaining the mental and physical
structures that support behaviour during the task. Such costs are typically
incurred in educating and training users in the relevant skills and knowledge.
Behavioural human resource costs are the costs incurred in recruiting structures
to express behaviour during the task. They are both mental costs (e.g., in
memorizing, planning, and decision-making) and physical costs (e.g., in the use
of keyboards and pointing devices). Both mental structural and mental
behavioural costs can be differentiated into cognitive, conative, and affective
costs, relating to knowledge, motivation, and emotion respectively. Because they
concentrate on human factors (and not on software engineering) within HCI,
Dowell and Long treat computer resource costs simply as an undifferentiated
processing cost.
While this treatment of resource costs should be sufficient for our current
purposes, it is worth noting that further decomposition would almost certainly
be required for any instance of evaluation (e.g., cognitive structural costs could
be divided into the various classes of knowledge recruited for the task).
Decomposition will also be required for computer resource costs.
The performance of a system is determined by its behaviour. The system's
behaviour comprises the interacting behaviours of the user (e.g., sequencing
subtasks, remembering command syntax, pressing keys) and the computer (e.g.,
calculating balances, parsing inputs, displaying characters). Both user and
computer will have important behavioural limitations that constrain
performance (e.g., memory search mechanisms).
This notion of performance and behaviour therefore distinguishes what is
achieved from how it is achieved (the same performance can be achieved by
different behaviours) and distinguishes the quality of the task product from how
effectively it is produced (the same quality can be produced by systems of
different effectiveness). Both these distinctions help for evaluation purposes by
focusing on exactly what is to be assessed.
The notion of desired performance has similarities to that of specifying
usability goals. To specify fully a system's desired performance means to identify
both the required quality of the task product and the acceptable levels of
resource costs to be incurred in accomplishing the task. For example, for a
computer that sells train tickets to passengers (users), the task goal of the system
might be the successful exchange of the correct ticket for the correct money; a
specified desired performance might be that the correct ticket is indeed
exchanged for the correct money (the required quality of task product) and that
the resource costs are within some limits. Examples of such resource cost limits
might be not more than two minutes spent in planning and decision-making (a
cognitive behavioural cost); no increase in frustration (an affective behavioural
cost); not more than one reference to a Help facility (a cognitive structural
cost); no discomfort in entering the relevant data (a physical behavioural cost);
and a computer response time of less than one second (a computer processing
cost).
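To make the structure of such a specification concrete, the ticket-machine example
can be written down as in the following sketch; the data structure is ours and purely
illustrative, while the values are those quoted above.

    # Illustrative sketch: the ticket-machine desired performance expressed as a
    # required task product quality plus acceptable resource cost limits.
    desired_performance = {
        "task_product_quality": "correct ticket exchanged for the correct money",
        "resource_cost_limits": {
            "cognitive_behavioural": "<= 2 min planning and decision-making",
            "affective_behavioural": "no increase in frustration",
            "cognitive_structural":  "<= 1 reference to a Help facility",
            "physical_behavioural":  "no discomfort in entering the relevant data",
            "computer_processing":   "response time < 1 s",
        },
        # Costs not listed here are left unspecified, i.e. treated as
        # infinitely large defaults (see the discussion below).
    }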
There are several points to be made here. First, the quality of task product
will not always be so easily specifiable; specifying the quality of a word processed
letter sufficient to persuade the bank not to close one's account would be much
more difficult. Second, the specification of acceptable resource costs could take
many forms. Thus cognitive behavioural costs could have been specified as not
more than 10 s searching for particular key inputs, or no requirement to
remember information from previous screens, or not more than two command
syntax errors, and so on. Specifying desired performance is therefore not
restricted to a particular format as usability goals are with design for usability.
Further, some resource costs could be left unspecified - or, more accurately, left
as infinitely large defaults, for example, any amount of affective behavioural
costs. Both these aspects thus allow one to concentrate on the particular aspects
of performance that are of interest. The final point about desired performance is
that, as with design for usability, no help is currently given to the evaluator in
identifying and specifying suitable levels of performance.
The notion of conformity does not involve any technical meaning here - it
simply means how good or bad is the match between desired performance and
performance as assessed in the evaluation. The aim is therefore to assess how
similar or different they are.
The final term in the definition is assessment. An assessment involves both a
method (the process by which it is done) and a statement (the resulting
product). Evaluation methods are discussed in section 3.3. Evaluation
statements are of two kinds (to borrow some medical terminology): presentation
(those that simply report performance and conformity) and diagnostic (those
that relate the behavioural causes of the performance). Thus the use of usability
criteria as discussed in section 2 would be a typical example of presentation
evaluation - it reports how closely observed performance matches some
measurable criterion, but it reports nothing about how that performance is
achieved. Diagnostic evaluation, on the other hand, attempts to identify the
critical structures and behaviours underlying the performance - which aspects
are disfluent, what are the causes of particular errors, how users structure the
task, and so on.
Some examples should illustrate this difference. Imagine an evaluation of a
data record entry system that used the following criteria (for given users,
computers, tasks, and environments): (a) 80% of the set tasks should be correctly
completed (the desired product quality); (b) the percentage of syntactically
incorrect command inputs should be less than 5%; (c) the data input rate should
average 20 records per hour; and (d) hand transitions between input devices
should not be required within a record entry. Straightforward presentation
evaluations would report the observed performance and compare it with these
criteria. For example, correct task completion rate was 65%; command syntax
errors were at the rate of 10%; data input rate was 15 records per hour; six
records required hand transitions between input devices. Such information is
sufficient to indicate that improvements are required, but it says very little about
what these improvements might be. In contrast, diagnostic evaluations ought to
identify the reasons for the unsatisfactory performance - why tasks were
incorrectly completed, what particular syntactic features led to errors, the
constraints operating to reduce the data input rate, and what types of records
required hand transitions between devices. This requires an understanding of the
system's behaviour - the points where incompatibilities between the user and
computer task structures occurred, the particular memory failures resulting in
syntactic errors, the delays caused by needs to look up missing data that
constrained input rate, and exactly why certain records required transitions
between input devices.
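The difference between the two kinds of statement for this example can be sketched
as follows; the code is purely illustrative, the figures are those quoted above, and the
diagnostic entries are hypothetical.

    # Illustrative sketch: presentation versus diagnostic statements for the
    # data record entry example. Criteria (a)-(d) and observed values as above.
    criteria = {"tasks_correct_pct": 80, "syntax_error_pct": 5,
                "records_per_hour": 20, "records_with_hand_transitions": 0}
    observed = {"tasks_correct_pct": 65, "syntax_error_pct": 10,
                "records_per_hour": 15, "records_with_hand_transitions": 6}
    lower_is_better = {"syntax_error_pct", "records_with_hand_transitions"}

    # Presentation statement: reports observed performance and its conformity
    # with the criteria, and nothing more.
    presentation = {
        k: {"criterion": criteria[k], "observed": observed[k],
            "met": (observed[k] <= criteria[k]) if k in lower_is_better
                   else (observed[k] >= criteria[k])}
        for k in criteria
    }

    # Diagnostic statement: additionally records the behavioural causes
    # (hypothetical entries, for illustration only).
    diagnosis = {
        "tasks_correct_pct": "incompatibility between user and computer task structures",
        "syntax_error_pct": "memory failures for command syntax",
        "records_per_hour": "delays in looking up missing data",
        "records_with_hand_transitions": "certain record types force a device switch",
    }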
Of course in practice even purely presentation evaluations are likely to offer
opportunities to observe some relevant aspects of system behaviour. However,
unless the evaluation is intended and set up from the start to investigate the
behaviour, any such observations will inevitably be ad hoc, incomplete, and
unreliable.
That this distinction between presentation and diagnostic evaluation is used
in practice (although not called by these terms) is illustrated by the work of
Brooke (1986). He describes how usability goals can be set and tested, and states
that 'should the product fail to achieve the criteria' only then is a detailed
analysis of its usability problems (an 'impact analysis') conducted.
The above definition of evaluation and discussion of its terms should have
served to clarify what can be evaluated towards which goals (the notion of
assessment statements). The following section will concentrate on how
evaluation can be done (the notion of assessment methods).
3.3. Methods
The system development literature contains many discussions of what is
meant by method. There seems to be at least some basic agreement that a
method must contain both procedure and notation - neither is sufficient on its
own. For our purposes, we wish to consider as a method any means of
investigating human-computer systems that would support an assessment of
performance conformity. Thus a method must involve the system in some
form, and it must address system behaviour or task product quality and
resource costs. This is of course a very broad sense of the term method. It
includes on the one hand notations with little procedural content (for example,
theories, models, and representations), and on the other hand procedures with
little notational content (for example, knowledge of scientific and engineering
methodology).
Clearly this view of methods means that the number of methods to be
considered is very large. It is necessary to identify classes of method, both for
an adequate description and as a step towards the selection of appropriate
methods for a given evaluation. Various groupings of evaluation methods
have been proposed but each is inadequate in some way. Some of their faults
are:
(a) their bases are either absent or unclear (for example, Howard and
Murray 1987, Long and Whitefield 1986);
(b) their bases are spurious (for example, Sweeney and Dillon's (1987)
notion of depth of analysis);
(c) they are either too general or mix levels (for example Karat (1988) does
both);
(d) not all evaluation methods are included (for example, Meister (1986)
includes only methods requiring direct user involvement; Lea (1988)
excludes the class of methods termed specialist reports below); and
(e) inappropriate terms are applied (for example, Lea's use of 'formal' to
refer to non-observational methods).
The framework for evaluation requires a clear basis for grouping evaluation
methods, which includes the full range of methods and identifies important
differences between them. The definition and discussion of evaluation above can
be employed to derive terms for grouping the methods. Given that a method
must relate to the system in some form, one way to distinguish methods is on the
basis of how the user and computer components of the system to be evaluated are
present in the evaluation process. This presence could be real or representational
in each case. As shown in figure 1, this groups evaluation methods into four
classes, which are termed here analytic methods, specialist reports, user reports
and observational methods. These classes are the same as those identified by
Wilson (1988) and similar to those of Long and Whitefield (1986). Some general
comments about the groupings will be made before each of these classes is
discussed in turn.
The distinctions between real and representational computers, and between
real and representational users, are not straightforward dichotomies. Applied to
the computer component of the system, real here means the physical presence of
the computer or an approximation of it.

                                    USER
                       Representational          Real

  COMPUTER
    Representational   Analytic Methods          User Reports
    Real               Specialist Reports        Observational Methods

                Figure 1. Classes of evaluation method.

Thus implemented systems, prototypes, and simulations all count as a real
computer presence in the evaluation. They
utilize the medium of the computer to demonstrate, to varying degrees of
completeness and fidelity, how the computer under development will appear and
function, and they could all be set before users to interact with in evaluation
tests. On the other hand, specifications and notational models are
representational computer presences, as are users' mental representations of the
computer. They utilize symbolic representation in a non-computer medium and
require transformation or manipulation to demonstrate computer appearance
and functionality. Thus asking users to remember interface features, as might be
done in interviews or questionnaires, involves their symbolic mental
representations of the computer and not the real computer. Similarly, a code
inspection to check that a program meets a functional requirement involves
representations of the computer only and not the real computer.
Applied to the user component of the system, real similarly means actual
users or approximations of them. Thus experiments involving subjects from the
system's actual user population or from a different population (for example,
students) are both observational methods with the presences of real users
(although the former group would provide more accurate and reliable results). In
contrast, the presence of representational users means (explicit or implicit)
descriptions or models of the users. Thus some analytic methods contain explicit
models of users, and human factors engineers performing a specialist report
operate with written or implicit representations of users.
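The grouping can be summarized in a small sketch of our own; it is simply a restatement
of figure 1, not a selection procedure.

    # Illustrative sketch: the class of evaluation method follows from whether
    # the user and the computer are present for real or only as representations.
    def method_class(user_real: bool, computer_real: bool) -> str:
        if not user_real and not computer_real:
            return "analytic methods"       # models of both user and computer
        if not user_real and computer_real:
            return "specialist reports"     # e.g. expert assessment of a prototype
        if user_real and not computer_real:
            return "user reports"           # questionnaires, interviews, ratings
        return "observational methods"      # real users interacting with a real computer

    assert method_class(user_real=False, computer_real=True) == "specialist reports"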
Given this basis for distinguishing classes of methods, we can now consider
each class in turn. To help in understanding the use of the methods for
evaluation, a number of comments about characteristics of the methods will be
made. The characteristics selected for comment are mostly drawn from the
operational characteristics of methods, as identified by Wilson and Whitefield
(1989). The comments themselves are based both on personal experience with
the methods and on published accounts of evaluations. More detailed
descriptions of the various methods can be found in a number of places, for
example, Meister (1986) and Lea (1988) discuss user reports and observational
methods; Reisner (1983) and Whitefield (1990) discuss analytic methods;
Hammond et al. (1985) discuss specialist reports.
It should be clear that analytic methods of evaluation involve the use of
representations of both system components - the user and the computer.
Normally this would mean manipulating models of the system to predict
performance. Examples in the human factors literature are the Keystroke Level
Model (Card et al. 1983) and the Blackboard Design Model (Whitefield 1989,
1990). The principal advantages of such methods are that they can be used early
in development (before any real computers exist), require few resources to apply
(including neither real users nor real computers), and are potentially fast. The
current disadvantages are that suitable modelling techniques are still under
investigation and development, and consequently the validity and reliability of
the methods are uncertain.
Specialist reports involve one or more people who are not real users assessing
a version of the real computer. A typical method would be for a human factors
specialist to evaluate the screen design of a prototype version, making use of
relevant handbooks, guidelines, and experience. The specialist's use of the
computer could be unstructured or structured around particular tasks. Real users
are not involved, but the specialist will be using representations of them both
mentally and in the reference works. Note that these methods can be used by
other specialists, such as application domain experts or software engineers
(although to perform a human factors evaluation these other specialists would
need to focus on human factors issues). The advantages are that the methods are
relatively fast, use few resources, provide an integrated view, and can address a
wide range of behaviour. On the other hand, their reliability will vary between
specialists, and since their assessments are inevitably somewhat subjective, their
reports are likely to be incomplete, biased, and difficult to validate. A
comparison of specialist reports with observational methods is reported in
Hammond et al. (1985).
User reports involve real users but not a real computer. They typically involve
the use of questionnaires, interviews, or rating methods (which Lea (1988) refers
to collectively as survey methods). The methods are used to obtain data or
opinions from the users on some aspect of the system, but where the users have to
rely on their knowledge of the computer rather than interacting with it directly. Because
the data are subjective they are open to a number of inaccuracies. However,
relatively formal techniques for data collection and analysis are available, the
methods can be relatively quick, and it is the real users who are being involved.
Lea describes survey methods as indispensable.
The final class of evaluation methods is observational methods, which involve
a real system, i.e., real users interacting with a real computer. The set of such
methods is itself very large, ranging from informal observation of a single user to
full-scale experimentation with appropriate numbers of subjects and control of
variables. Although this range makes it difficult to generalize, observational
methods have the major advantage of investigating the performance of the real
system and ought therefore to reflect that performance more accurately than the
other methods. Simple observational methods (which can be used in conjunction
with user reports such as interviews) can be easy to conduct, and provide a
wealth of detailed data, but they can be difficult to integrate and interpret, and
time-consuming to analyse fully. More formal experimental methods provide
detailed, quantitative, and reliable answers to particular questions, but they tend
to be slow to conduct, use many resources, and require expertise in experimental
design as well as in interface issues.
This discussion of the advantages and disadvantages of the various classes of
method ends the description of the evaluation framework. The following section
concludes the paper with a discussion of some aspects of the framework.

4. Discussion
Given the framework described in section 3, this section considers some of its
implications and discusses its relationship with improvements to evaluation
practice.
Conceiving of evaluation as in the above framework has a number of
implications and consequences. Some of these have been identified as the
framework has been described (for example, that purely presentation evaluations
do not address the behaviour underlying performance; that the quality of the
task product should be distinguished from how effectively it is produced; and
that representations of users can serve as a means of including users in
development). However, it is worth identifying and discussing some of the major
implications of the framework that have not yet been spelled out.
The first point is that evaluation can be done during system development at
any time after the first description of the proposed system is produced (whether
this be a formal specification or a prototype or whatever). Of course the levels of
description of the system and its desired performance would need to be
comparable - thus highly abstract descriptions are unlikely to support
assessments against detailed low-level specifications of desired performance -
but some forms of evaluation could be conducted from an early stage. Whether
evaluations early in development can be as accurate as those later in
development is currently a question for empirical research (see, for example,
Dowell and Long 1989a).
The second important implication of the framework should be fairly clear:
evaluations address particular aspects of system performance and not
performance in general. Any evaluation requires a focus on certain system
behaviours and certain task products - it must be tailored to the given system
and to the intended goals of that system. It is therefore not the case that a
human factors evaluation can address all human factors aspects of a system -
something that it appears may be assumed by many developers, and something
that is to some extent assumed by the operational definition of usability in the
design for usability approach. An important requirement for the future
development of sound human factors evaluation practice is to construct tools
and methods that will enable evaluators first to select an appropriate focus for a
system evaluation, and second to recruit methods that will allow that focus to
be addressed.
The requirement to understand computer behaviour as well as user
behaviour is a third implication of the framework. This means evaluation must
be conducted by personnel who are at least knowledgeable about general system
development issues, constraints and practices, or who are directly involved in
the particular system development in question. Brooke (1986) advocates the
latter approach. One way to ensure that the evaluators have such knowledge is
for the system developers themselves to conduct the evaluations. While this may
well lead to less reliable and less accurate evaluations as a whole, there is some
evidence that developers unskilled in human factors are capable of making
good use of some forms of observational methods (Jorgensen 1990).
It is important to remember, however, that the role of evaluation in
development does not directly require the generation of new designs for all or
part of the system. An evaluation statement simply assesses the conformity of
performance with desired performance, and is not necessarily required to suggest
alterations or alternatives to the current design. Naturally ideas for improving
the system are likely to arise from (diagnostic) evaluations, but these need to be
considered by the developers in the light of all the other non-human factors
concerns that affect the system. The relationship between evaluation and
generation in design is a very close and complex one (Whitefield 1989) but one
should not make the mistake of requiring solutions as well as problems to arise
from an evaluation.
One aim of the framework described in section 3 is to contribute to
improvements in evaluation practice. How might it do this? First, we must point
out that the framework is not a prescriptive procedure for how to carry out an
evaluation. It might provide support for prescriptions (for example, Wilson and
Whitefield (1989) use part of the framework to discuss the selection and
configuration of appropriate methods) but it does not itself prescribe a practice
procedure. Its main contribution to practice, therefore, is as a means of clarifying
what can be done towards which goals and how it can be done. Thus it allows one
to identify the possible types of evaluation statements, the areas of performance
and behaviour that might be addressed, and the range of available methods. Such
clarifications support the appropriate recruitment and allocation of evaluation
resources.
Unfortunately it is extremely hard to demonstrate that the framework does
indeed lead to such improvements in practice. For one thing, it is a novel
proposal and therefore has had little opportunity to influence practice. For
another, any convincing empirical test of the issue would require excessive
resources - a wide range of systems, of evaluators, of methods, of evaluation
statements, and of performance variables. We have no plans to attempt such a
test.
The case that the framework does, or could, contribute to improvements is
therefore threefold. First, we have found the ideas in the framework to be useful
for our own evaluation practice (e.g., Dowell and Long 1989a, Sutcliffe and
Whitefield 1989, Wilson 1989). Second, the framework does enable one to
identify problems with, or omissions from, particular evaluations or evaluation
approaches, and as such it suggests potential areas for improvement. Third, an
important demonstration would be that others choose to recruit (all or part of)
the framework in their own practice; this paper is part of an attempt to make the
framework available for that purpose.
Acknowledgments
This work was done while the authors were working on projects funded by the
Department of Trade and Industry, and by the Alvey Programme (projects
MMIII22 and MMIII51). We would like to thank our colleagues at the
Ergonomics Unit for discussions and for comments on an earlier draft.

References
BELLOTTI, V. 1988, Implications of current design practice for the use of HCI techniques.
In D. M. Jones and R. Winder (eds), People and Computers IV. Proceedings of HCI
88 (Cambridge: Cambridge University Press).
BROOKE, J. B. 1986, Usability engineering in office product development. In M. D.
Harrison and A. F. Monk (eds), People and Computers: Designing For Usability.
Proceedings of HCI 86 (Cambridge: Cambridge University Press).
BROOKE, J. B. 1990, Personal communication.
CARD, S. K., MORAN, T. and NEWELL, A. 1983, The Psychology of Human-Computer
Interaction (Hillsdale, New Jersey: Lawrence Erlbaum Associates).
CLEGG, C., WARR, P., GREEN, T., MONK, A., KEMP, N., ALLISON, G., LANSDALE, M.,
POTTS, C., SELL, R. and COLE, I. 1988, People and Computers: How To Evaluate
Your Company's New Technology (Chichester: Ellis Horwood).
DENLEY, I. and LONG, J. B. 1990, A framework for evaluation practice. In E. J. Lovesey
(ed.), Contemporary Ergonomics 1990 (London: Taylor & Francis).
DILLON, A. P. 1988, The role of usability labs in system design. In E. D. Megaw (ed.),
Contemporary Ergonomics 1988 (London: Taylor & Francis).
DOWELL, J. and LONG, J. B. 1989a, The 'late' evaluation of a messaging system design and
the target for 'early' evaluation methods. In A. Sutcliffe and L. Macauley (eds),
People And Computers V. Proceedings of HCI 89 (Cambridge: Cambridge
University Press).
DOWELL, J. and LONG, J. B. 1989b, Towards a conception for an engineering discipline of
human factors. Ergonomics, 32, 1513-1535.
GARDNER, A. and MCKENZIE, J. 1988, Human Factors Guidelines For The Design Of
Computer-Based Systems. Ministry of Defence and Department of Trade and
Industry.
HAMMOND, N., JORGENSEN, A., MACLEAN, A., BARNARD, P. and LONG, J. 1983, Design
practice and interface usability: evidence from interviews with designers. In
Proceedings of CHI 83 (New York: ACM), 40-44.
HAMMOND, N., HINTON, G., BARNARD, P., MACLEAN, A., LONG, J. and WHITEFIELD, A.
1985, Evaluating the interface of a document processor: a comparison of expert
judgement and user observation. In B. Shackel (ed.), Human-Computer Interaction
INTERACT '84 (Amsterdam: North-Holland).
HOWARD, S. and MURRAY, D. M. 1987, A taxonomy of evaluation techniques for HCI. In
H.-J. Bullinger and B. Shackel (eds), Human-Computer Interaction INTERACT '87
(Amsterdam: Elsevier Science Publishers).
JORGENSEN, A. 1990, Thinking-aloud in user interface design: a method promoting
cognitive ergonomics. Ergonomics, 33, 501-507.
KARAT, J. 1988, Software evaluation methodologies. In M. Helander (ed.), Handbook Of
Human-Computer Interaction (Amsterdam: Elsevier Science Publishers).
LEA, M. 1988, Evaluating user interface designs. In T. Rubin, User Interface Design For
Computer Systems (Chichester: Ellis Horwood).
LONG, J. B. and WHITEFIELD, A. D. 1986, Evaluating Interactive Systems. Tutorial given at
HCI '86, University of York, September 1986.
MEISTER, D. 1986, Human Factors Testing and Evaluation (New York: Elsevier).
RAVDEN, S. and JOHNSON, G. 1989, Evaluating Usability Of Human-Computer Interfaces
(Chichester: Ellis Horwood).
REISNER, P. 1983, Analytic tools for human factors of software. In A. Blaser and M.
Zoeppritz (eds), Lecture Notes in Computer Science No. 150 (Berlin: Springer-
Verlag).
SHACKEL, B. 1986, Ergonomics in design for usability. In M. D. Harrison and A. F. Monk
(eds), People and Computers: Designing For Usability. Proceedings of HCI '86
(Cambridge: Cambridge University Press).
SUTCLIFFE, A. 1988, Human-Computer Interface Design (London: Macmillan Education).
SUTCLIFFE, A. and WHITEFIELD, A. 1989, Evaluation of human-computer interaction with
CASE tools. Proceedings of CASE '89 Conference, London.
SWEENEY, M. and DILLON, A. 1987, Methodologies employed in the psychological
evaluation of HCI. In H.-J. Bullinger and B. Shackel (eds), Human-Computer
Interaction INTERACT '87 (Amsterdam: Elsevier Science Publishers).
TAYLOR, B. and BONNER, J. 1989, HUFIT: Usability specification for evaluation. In E. D.
Megaw (ed.), Contemporary Ergonomics 1989 (London: Taylor & Francis).
WHITEFIELD, A. D. 1989, Constructing appropriate models of computer users: the case of
engineering designers. In J. B. Long and A. D. Whitefield (eds), Cognitive
Ergonomics and Human-Computer Interaction (Cambridge: Cambridge University
Press).
WHITEFIELD, A. D. 1990, Human-computer interaction models and their roles in the
design of interactive systems. In P. Falzon (ed.), Cognitive Ergonomics:
Understanding, Learning and Designing Human-Computer Interaction (London:
Academic Press).
WHITESIDE, J., BENNETT, J. and HOLTZBLATT, K. 1988, Usability engineering: our experience
and evolution. In M. Helander (ed.), Handbook Of Human-Computer Interaction
(Amsterdam: Elsevier Science Publishers).
WILSON, F. 1988, Human factors evaluations in the development and maintenance of
interactive computer systems. London HCI Centre Report LHC/EXT/ALV/EV/1.1F.
WILSON, F. 1989, Case studies in interactive systems design and evaluation: 1. The real-
time subtitling system. London HCI Centre Report LHC/EXT/ALV/EV/4.1F.
WILSON, F. and WHITEFIELD, A. D. 1989, Interactive systems evaluation: mapping
methods to contexts. In E. Megaw (ed.), Contemporary Ergonomics 1989 (London:
Taylor & Francis).
