
Case study in human factors evaluation

A Whitefield and A Sutcliffe*

Ergonomics Unit, University College London, 26 Bedford Way, London WC1H 0AP, UK. *Department of Business Systems Analysis, City University, Northampton Square, London EC1V 0HB, UK


A human factors (HF) evaluation, carried out as part of the development of a set of computer-aided software engineering (CASE) tools, is presented and is used as an example of the processes and products of typical HF evaluation practice. The role of HF evaluation as a part of software quality assurance is identified, and typical current practice of HF evaluation is characterized. The details of the particular evaluation are then reported. First, its processes are described; these are determined by relating features of the system under development to the desired focus, actual context, and possible methods of the evaluation. Then the products of the evaluation are described; these products or outcomes are formulated as the user-computer interaction difficulties that were identified, grouped into three types (termed task, presentation, and device difficulties). The characteristics of each type of difficulty are discussed, in terms of their ease of identification, their generality across application domains, the HF knowledge that they draw on, and their relationship to redesign. The conclusion considers the usefulness of the evaluation, the inadequacies of system development practice it implies, and how to incorporate HF evaluation into an improved system development practice.

Keywords: human-computer interaction, human factors, system development, evaluation, usability, software quality assurance, case study

This paper reports a human factors (HF) evaluation of a set of computer-aided software engineering (CASE) tools. The purpose of the paper is to illustrate, for system developers who are not knowledgeable about HF, the processes and products of a fairly typical HF evaluation, and to identify some important features of HF evaluation practice. The intention is, therefore, to talk where possible about evaluation in general rather than solely in the particular context of the application domain of system developers using CASE tools.

Well engineered human-computer interfaces are acknowledged to be a critical determinant of product success, and the importance of HF input into software engineering has been demonstrated in industrial practice1. One of the major forms of HF input has been as evaluation, although HF evaluation has not yet been incorporated into standard software engineering practice, as illustrated by Nielsen's report of design flaws in several commercial software products2. HF evaluation should be regarded as part of software quality assurance; it augments the accepted concerns of testing software for reliability, integration, and safety, with an assessment of the effectiveness and efficiency of the interaction between the user and the computer. In common with the general motivation for quality assurance, HF evaluation aims to improve design by diagnosing failures and by ensuring software meets acceptable levels of performance. This paper uses the application domain of CASE tools to illustrate HF evaluation. This choice is pertinent in the light of reported problems of current CASE-tool technology and its inability to support software developers' working practices3.

An introduction to HF evaluation is given in the next section. The third section presents the background and methods for the particular CASE-tools evaluation. The outcomes of this evaluation are then given, and the paper concludes with a discussion of important aspects of the evaluation.

HUMAN FACTORS EVALUATION

Typical current practice in human factors evaluation


In common with much HF activity, the current practice of HF evaluation of interactive computer systems can be described as a craft4,5. That is, it relies heavily on the experience and intuitions of the HF specialist, who may draw on a body of largely undocumented or implicit knowledge and practice. It should not be surprising, therefore, that current practice varies widely. Practitioners have different motivations for conducting evaluations (e.g., diagnosis of problems, testing against standards, performance comparison of two or more products), they use various approaches (e.g., controlled experiments, laboratory observations, field observations), and they use numerous techniques for data recording and analysis (e.g., video recording, questionnaires, interviews, system logs). There is no agreed framework for relating these various purposes, approaches, and techniques that could be used to construct reliably sound HF evaluations.

Given this variability, it is hard to identify unequivocally any typical form of HF evaluation practice. However, evidence from a number of sources suggests that there is a typical form, at least in the sense of the type of HF evaluation most commonly practised. (Note that there is no suggestion that this is an ideal form.) This evidence includes published reports and discussions of evaluations, personal reports from system developers and from HF consultants, and the emphasis placed by major computer system manufacturers on HF evaluation laboratories6 and on usability engineering and its variants7. The typical form of HF evaluation that these sources suggest can be characterized as follows:

- It occurs late in system development (usually when some working version of the system under development exists).
- It is conducted by HF personnel.
- It involves either or both of two methods: 'walkthroughs' by HF specialists, and observational study (often including video and audio recording) of interactions between the user and the computer. (A useful comparison of these methods is presented by Hammond et al.8.)
- It tends to have a broad focus rather than concentrating on specific issues.
- It does not involve formal analytical and statistical procedures.

Although this typical form of HF evaluation does have a number of drawbacks9, it seems likely to continue as the most common form of HF evaluation for the foreseeable future, for a number of reasons. First, such evaluations can be very productive. There is a good deal of experience in their conduct, supported by useful published recommendations10. Further, resource constraints in system development will only unusually allow for more detailed or formal analyses. This form of evaluation is also well suited to incorporation within an iterative prototyping approach to system development, which is itself widely recommended as an approach to user-interface design11,12. Finally, there are as yet no really reliable and effective alternative forms of evaluation. Thus at the moment there are no good means of evaluating early in development13, of evaluating without some form of user being directly involved, or of integrating HF evaluation into software engineering methods so that it can be carried out by non-HF personnel. On this last point, there is evidence that system developers can make good use of some HF evaluation techniques14,15, but it is also clear that HF evaluations by system developers are likely to be incomplete16. There is, therefore, a good case for seeking to develop and improve (rather than to replace) typical current practice in HF evaluation. It is an evaluation of this form that will be presented later in this paper.


Characterization of human factors evaluation


This paper aims both to illustrate the processes and products of a typical HF evaluation and to identify some important features of this form of evaluation practice. To meet both these aims, it is necessary to discuss the general nature of HF evaluation. Unfortunately, there is no consensus view on this in the HF literature. Thus there is no agreed definition of evaluation or of its relevant concepts. Whitefield et al.9 have put forward a framework for HF evaluation that is intended to make explicit the concepts and relations of evaluation and that identifies a wide range of evaluation methods. The discussion of HF evaluation that follows in this section is based on this framework. The intention of this discussion is to enable an informed and principled assessment of the evaluation case study to follow.

The framework defines HF evaluation as an assessment of the conformity between a system's performance and its desired performance. System is used here in the sense of a user and a computer engaged on some task within an environment. A set of hardware and software components is therefore not a system in this sense but simply a computer. The notion of performance is taken from Dowell and Long17, who define performance in terms of two factors: the quality of the task product (i.e., how well the work done meets its goal) and the incurred resource costs (i.e., the resources employed by both the user and the computer in accomplishing the task). Different types of user and computer resource costs can be identified. A most effective system would minimize the resource costs in performing a specified task with a given quality of product. The performance of a system is determined by its behaviour. The system's behaviour comprises the interacting behaviours of the user and the computer. Both user and computer will have important behavioural limitations that constrain performance. This notion of performance and behaviour therefore distinguishes what is achieved from how it is achieved, and also the quality of the task product from how effectively it is produced.

It follows from this view that to specify fully a system's desired performance means to identify both the required quality of the task product and the acceptable levels of resource costs to be incurred in accomplishing the task. The current state of the art in HF means that both of these will rarely be straightforward. The expressions of both required task quality and acceptable resource costs could take many forms and are not restricted to a particular format. Importantly, this allows the aspects of performance that are of particular interest for the evaluation (e.g., ease of learning, problems with the interactive dialogue, attitudes of users, appropriateness of functionality, and so on) to be concentrated on.

An evaluation involves both a method (the process by which it is done) and a statement (the resulting product). Evaluation statements are of two kinds:

- those that simply report performance and conformity and therefore tend to be quantitative (here called measurement statements)
- those that relate the behavioural causes of the performance and therefore tend to be qualitative (here called diagnostic statements)

Thus a typical measurement statement reports how closely observed performance matches some measurable criterion, but it reports nothing about how that performance is achieved. Diagnostic statements, on the other hand, attempt to identify the critical behaviour that underlies the performance.

Whitefield et al.9 discuss four classes of HF evaluation method, distinguished according to whether the user and the computer are each real or representational presences in the evaluation. The four classes of method are:

- Analytic methods, involving the use of representations of both the user and the computer; normally this involves the manipulation of models of the system to predict performance.
- Specialist reports, involving an HF or other specialist (i.e., not the intended users) using the computer to assess an implemented or prototype version.
- User reports, involving survey methods (questionnaires, interviews, rating scales, etc.) to obtain data or opinions from users when they are not directly interacting with the computer.
- Observational methods, involving real users interacting with the computer; the set of such methods is large, ranging from informal observation of a single user to full-scale experimentation with appropriate numbers of subjects and control of variables.

The framework for HF evaluation just outlined has a number of implications and consequences, some of which have been mentioned already. Some further implications are as follows. First, evaluation can be done during system development at any time after the first description of the proposed system is produced (whether this be a specification or a prototype or whatever). It need not wait until some working version exists. Related to this is the second point that models or other representations of users can serve as a means of including users in development. The actual physical presence of users is not mandatory. Finally, evaluations will address particular aspects of system performance and not all aspects. Any evaluation requires a focus on certain system behaviours and certain task products: it must be tailored to the given system and to the intended goals of that system. It is therefore not the case that an evaluation can address all HF aspects of a system (at least not without unlimited resources with which to conduct a multi-part evaluation).

The framework is intended to encompass all forms of HF evaluation activity. It should therefore be possible to describe typical current practice (as outlined earlier) in these terms. Viewed in this way, typical current practice can be seen to involve HF personnel in specialist reports and in observational methods of a relatively informal kind (little experimental control and little formal or statistical data analysis). It rarely has a complete and well specified view of desired performance: it usually ignores the task quality component (how well the work is done), preferring to concentrate on the resource cost component (usually in terms of users' errors and difficulties), and it tends to have a broad focus rather than a specific one that addresses particular performance criteria. It can produce either measurement or diagnostic statements, often depending on whether the evaluation is a comparison of alternative products or part of a detailed redesign of a single product.
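To make the framework's vocabulary concrete for readers from a software engineering background, the following minimal sketch (added here for illustration only; it is not from the original framework, and all names are assumptions) models a system's performance as task quality plus resource costs, and an evaluation statement as either a measurement or a diagnosis produced by one of the four method classes.

```python
# Illustrative sketch only: one possible encoding of the framework's concepts
# (performance, statement kinds, method classes). All names are assumptions.
from dataclasses import dataclass
from enum import Enum


class MethodClass(Enum):
    ANALYTIC = "analytic"              # user and computer both represented by models
    SPECIALIST_REPORT = "specialist"   # specialist uses the real computer
    USER_REPORT = "user report"        # real users report away from the computer
    OBSERVATIONAL = "observational"    # real users interact with the real computer


@dataclass
class Performance:
    task_quality: str      # how well the task product meets its goal
    user_costs: str        # resource costs incurred by the user
    computer_costs: str    # resource costs incurred by the computer


@dataclass
class MeasurementStatement:
    """Reports how closely observed performance matches a criterion."""
    criterion: str
    observed: Performance
    conforms: bool


@dataclass
class DiagnosticStatement:
    """Relates the behavioural causes underlying the observed performance."""
    behaviour: str                 # the critical user or computer behaviour identified
    effect_on_performance: str
    method: MethodClass
```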

The late occurrence of the typical HF evaluation within the system development life-cycle probably reflects the current weakness of analytic evaluation methods, as well as the requirement for some form of implemented software with which to conduct the preferred methods (observational methods and specialist reports).

Now that a framework for characterizing HF evaluation, and also typical current practice within this framework, has been described, it is possible to present the particular evaluation that constitutes the case study.

CASE-TOOLS EVALUATION: BACKGROUND AND METHODS


The system for evaluation was a toolset for development using the Structured Systems Analysis and Design Methodology (SSADM)18. The toolset was developed as part of a software engineering research project funded by the UK Government's Alvey Programme. First, important aspects of the various system components at the time of the evaluation are described. This is followed by details of the evaluation focus. The context of the evaluation is then described, while the methods used are outlined in the final subsection.

System for evaluation

Hardware and software


The system ran on a monochrome Sun-3 workstation. A version for IBM PCs or compatibles was under development. The software comprised three diagram editors, a data dictionary, and links between them. The data dictionary used a proprietary database product and therefore could not be changed in any substantial way. At the time of the evaluation, there were plans to replace it with another proprietary product. The diagram editors (DEs) were all project developments. They were concerned with editing dataflow diagrams, logical data structure diagrams, and entity life history diagrams. All three DEs had a good deal in common, including a similar screen layout. The basic features of this layout are shown in Figure 1. Of particular interest are the drawing area, the palette (indicating the items that can be drawn), and the menu bar (containing pull-down menus under a number of headings). All three DEs were tested using advanced prototypes. The software proved generally reliable, although there were several minor bugs; the most serious of these concerned the display handling in the DEs, which occasionally crashed out. The first issue of the toolset's user guide was also supplied. This was substantially, but not entirely, complete.
Users

The intended users of the toolset were SSADM analysts and designers. It was assumed that the users would already be familiar both with analysis and design in general and with the SSADM method in particular.

[Figure 1. Basic screen layout of diagram editors: window header bar, menu bar, drawing area, and palette]

Tasks
The toolset was not intended to cover every aspect of SSADM development. It was aimed at the three major types of diagram in SSADM: dataflow and logical data structure diagrams, which are used throughout the three phases of SSADM development, and the less important entity life history diagrams. The toolset was also intended to include the activities associated with the data dictionary, i.e., with maintaining a database of information on the system design. Examples of SSADM development activities not covered by the toolset are the specific techniques of dialogue design and process outlines and the general activities of planning and quality assurance. The tasks for which the toolset was intended, therefore, are:

- creating and editing dataflow diagrams
- creating and editing logical data structure diagrams
- creating and editing entity life history diagrams
- developing and maintaining a data dictionary of design information, in particular with respect to the information in the three sorts of diagrams

Environments

The system was intended for installation in standard office environments.

Focus

It was stated earlier that an HF evaluation needs to be focused on certain aspects of system performance and not on performance in general. The particular foci need to be expressed as criteria for desired performance. Discussions with the system developers about the issues that might be addressed by the evaluation identified two foci of particular interest to the developers.

The first focus was that the DEs should be as easy to learn as possible. Ease of learning is influenced by multiple determinants, and this focus therefore needed to consider a wide range of system behaviour (but not all system behaviour). With such a broad focus, the relevant criterion was also expressed rather broadly:

- Naive users should be able to use each DE to complete basic editing tasks without encountering serious difficulties or making serious errors.

The second focus was that users should understand the relationships between the various tools, and especially between the data dictionary and the DEs. They should be able to make use of all appropriate facilities and to change tools as necessary. The relevant criteria were:

- Users should understand what information about diagram elements is stored in the data dictionary.
- Users should understand how to enter the diagram elements in the data dictionary.
- Users should know how to access and edit specified information related to the diagrams in the data dictionary.

These four were the principal criteria against which the system was to be evaluated. In addition, other aspects of performance and behaviour would be evaluated in passing (i.e., without stated criteria and not in depth).

Context of evaluation

To decide on the appropriate method or methods to be used, it is necessary to know something about the context of the evaluation. In particular, it needs to be known how the system development is being done, what form or forms of the system will be available for evaluation, and what resources are available for the evaluation. The important aspects of these issues for the case in question were as follows.

With regard to the system development, this was being conducted using some semiformal notations (e.g., dataflow diagrams) within a generally top-down approach. The developers had reached the stage of producing an advanced prototype. None of the developers was knowledgeable about HF.

The system actually used for the evaluation was described earlier. The main points to note here are as follows:

- No complete set of specifications existed, so any analytic evaluation method would first have had to create them.
- There were no experienced users, as there was no extant version in use.
- The tasks for the evaluation needed to be developed, consistent both with the target tasks for the toolset and with the foci of the evaluation.
- Versions of the actual hardware and early software were available.

The available resources included basic recording equipment (video and audio), access to a small number of HF specialists and two SSADM specialists, and a maximum elapsed time period of two months.

Methods
The evaluation focus requested by the client concentrated on assessment of the learnability and the functionality of the editors and data dictionary. The need to assess learnability suggested the use of observational methods, while the need to assess functionality indicated the use of specialist reports in addition. Taking these concerns into account, along with the context described in the last subsection, led to the evaluation making use of three forms of the evaluation methods mentioned earlier. (Such decisions about methods based on foci and context are by no means fixed and are poorly understood; initial discussions of these decisions have been published19,20.) The three forms were:

- Specialist reports by three HF engineers, in two sessions lasting approximately one hour each.
- Specialist reports by two domain (i.e., SSADM) specialists, in one session of approximately one hour.
- Observational methods involving four users untrained on the system. The users were computer-science students familiar with structured methods. Each user worked individually for up to two hours on set tasks. The tasks involved the creation and editing of appropriate diagrams, based in some cases on the tutorial exercises in the user guide. One of the evaluators was present at all times to elicit from, and discuss with, the users their views on the interaction as it progressed. The sessions were audiotaped and the evaluator made notes in addition. The tapes and notes were subsequently informally analysed to identify the difficulties encountered by the users.
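As an illustration only (not part of the original study), the reasoning described above, from evaluation foci and context to candidate method classes, could be recorded along the following lines; the function and its rules are assumptions introduced here for exposition and mirror the text rather than any published tool.

```python
# Illustrative sketch: recording why particular method classes were chosen,
# given the evaluation foci and context. All names and rules are assumptions.

def suggest_methods(foci, context):
    """Return candidate method classes with a short rationale for each."""
    suggestions = []
    if "learnability" in foci:
        # Learning behaviour is best seen with real, untrained users.
        suggestions.append(("observational", "naive users attempting set tasks"))
    if "functionality" in foci:
        # Coverage of task support can be judged by HF and domain specialists.
        suggestions.append(("specialist report", "HF and SSADM specialist walkthroughs"))
    if not context.get("experienced_users_available", False):
        suggestions.append(("note", "no extant user population; recruit proxy users"))
    if not context.get("specifications_available", False):
        suggestions.append(("note", "analytic methods costly; models would need building"))
    return suggestions


if __name__ == "__main__":
    case_study_context = {
        "experienced_users_available": False,
        "specifications_available": False,
        "prototype_available": True,
        "elapsed_time_limit_months": 2,
    }
    for method, rationale in suggest_methods({"learnability", "functionality"},
                                             case_study_context):
        print(f"{method}: {rationale}")
```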

Example of typical current practice

The evaluation as described in this section clearly corresponds with typical current practice as outlined earlier: it involves an advanced prototype and is therefore late in system development, it is conducted by HF personnel, it involves both observational methods and specialist reports, and it does not involve formal analytical and statistical procedures. The major difference is that this evaluation is more focused than much current evaluation practice, although it is certainly not as focused as it might be.

CASE-TOOLS EVALUATION: OUTCOMES

Outcomes to be presented


Given the principal focus of the evaluation on identifying the difficulties and errors encountered by the users, it is appropriate to present and discuss the outcomes of the evaluation in the same terms. But rather than present the full list of difficulties (75 for the DEs, plus those for the data dictionary), the authors have chosen to present a selection of difficulties with the DEs that is intended to illustrate the various types identified. (The term difficulty will be employed from here on as a general term to include errors and problems.) Before presenting the difficulties, it should be said that the system (except for the data-dictionary component) generally performed well in the evaluation and was at least adequate for the diagram-editing tasks for which it was designed. The difficulties identified relate to areas in which it could be improved.

There is no single fixed way in which to group and present the difficulties identified. The most appropriate way will depend on many factors, not least on the evaluator, the system under evaluation, and the evaluation focus. In this case, the difficulties were divided into three kinds, termed device difficulties, presentation difficulties, and task difficulties. This division is not formally defined, but it is based on, and has many parallels with, more formal layered descriptions of the human-computer interface21,22.

Device difficulties

The device difficulties are those encountered in the direct use of the input-output devices (in this instance a keyboard, a mouse, and a bit-mapped monochrome display). There were not many such difficulties, but those that did occur were certainly frequent and problematic for users new to the particular hardware platform, especially early in the testing session. Example device difficulties included:

- All users experience some problems with which button to use on the (three-button) mouse.
- The difference between bold and plain text is small and easily overlooked.

Users would be expected to overcome difficulties of this type relatively easily as they gained experience with the particular devices in this system, and some initial training would make them less frequent and problematic from the start.

Presentation difficulties

Much more common and varied were the presentation difficulties. These concern the dialogue aspects of the interaction, such as command names, screen appearance, and feedback. Some examples of each type are as follows.

Commands

- The distinctions between some menu options are unclear (e.g., between SAVE and STORE or between PRINT and PRINT DFD).
- The DELETE and BACKSPACE keys on the keyboard are used inconsistently.
- Some command names are misleading; for example, ADD FILE TO DATA DICTIONARY only adds the file name, whereas users assumed it meant the file contents.
- The single graphical menu (for selecting between horizontal or vertical alignment) is ambiguous and therefore confusing.
- Users sometimes type into the wrong window, especially when using dialogue boxes, because they fail to remember the importance of the pointer position with respect to the current window.

Screen appearance

- The buttons in some dialogue boxes are extremely small and therefore hard to select.
- The items on the menu bar are made to stand out by use of a grey background, but some users (based on their experience with other systems) initially take this to mean that the items are not available for selection.
- Within a pull-down menu, choices that are unavailable, and are therefore greyed-out, are virtually illegible. This means users who are seeking a particular menu choice cannot identify it if it is unavailable, which leads to unnecessary searching through the menus.
- When dataflow lines are selected, the selection indicators (small black boxes on the line) obscure the arrowheads. This means users cannot tell which way the arrows are pointing, which can lead to uncertainty in performing some operations (e.g., reversing arrow direction).

Feedback

- Many menu options give little or no feedback about the operation of the command, and users therefore often tried to reselect them.
- Some changes to object names displayed on the screen were made without indicating to the user that this had happened.
- On selecting one particular item on the menu bar, the pull-down menu (because of its dynamic content) only appeared after a delay. Without any feedback that the menu would soon appear, users found this disconcerting.

These presentation difficulties are all consequences of poor HF design. General principles of good human-computer interface design have been proposed by a number of authors23, while others have proposed large numbers of detailed design guidelines24,25. Examination of the difficulties found in this system in the light of the general principles should enable the identification of the kinds of HF concerns that the system has ignored (e.g., principles about consistency relate to the use of the Delete and Backspace keys; principles about compatibility relate to the grey background for the menu-bar items). Further, examination of the difficulties with reference to the available guidelines will almost certainly suggest some design solutions. There exists, therefore, a good deal of HF knowledge relating to presentation difficulties. This is in part because the same difficulties recur in a variety of application domains; there is nothing about the above difficulties that is exclusive to CASE tools. Furthermore, presentation difficulties are relatively easy to identify by HF evaluation and are often simple to rectify (particularly if the interface is well separated from the application).

Task difficulties

The final group of difficulties is the task difficulties. These concern the functional aspects of the system, or the kind of support offered to help the users achieve their task goals. Unlike the presentation difficulties, therefore, the task difficulties will tend to be specific to an application domain, as tasks vary between domains. Difficulties found with this system are that the support offered can be unnecessary, inappropriate, inadequate, or missing. Examples of each type are as follows.

Unnecessary support

Certain menu items (e.g., for drawing process boxes) are redundant because the functionality is offered by other means that are easier to use (e.g., on the palette, or by use of the mouse).

Inappropriate support

The concept of an A4 sheet assumes an exaggerated importance in the way the software represents and manipulates a diagram, as it constrains the diagram layout and cannot always be changed easily. When a lower-level child process (i.e., a detailed expansion of a single higher-level process) is created, a number of assumptions are made about how it will relate to its parent process (e.g., that each dataflow in the parent process will relate to a separate subprocess in the child); some of these assumptions are too strong and then cannot be changed.

Inadequate support
Arrangement of a number of diagram objects into a rectangular layout (something users frequently attempt) is an extremely cumbersome and error-prone procedure, and there is no form of 'Tidy' facility to assist it. When loading a file, it is not possible to obtain a list of the names of available files from within the application. Entity life history diagrams are always arranged in rows. Although horizontal alignment is possible, it is not provided automatically in this DE.

Missing support
It is not possible to have two files loaded at the same time, for the purposes of comparing them or copying between them. Users occasionally want to divide a diagram they have already started into two separate diagrams, which cannot be accomplished easily. In hierarchically related sets of diagrams, changing lower-level diagrams in ways that make them inconsistent with higher-level ones is strictly disallowed. Although keeping the diagrams consistent is obviously valuable, users will often want to change lower-level diagrams first and have them temporarily inconsistent with the higher-level ones.

While some of these difficulties (e.g., obtaining a list of file names) could be dealt with by the kinds of HF principles discussed for the presentation difficulties, most of them concern the way the software does, or does not, support the user's task activity. The origin of such difficulties usually lies in inadequate task analysis during requirements definition. For instance, better observation of software engineers' working practices would have identified the occasional needs to divide diagrams in two and to change lower-level diagrams before changing higher-level ones. It is clear from observation that software engineers use dataflow diagrams iteratively, that they sometimes make mistakes in levelling, and that they may need to repartition the analysis. Similarly, the inadequate support offered for rectangular layout and for horizontal alignment in the diagrams stems from a failure to analyse in detail what support users require most to achieve their task goals. The DEs give the impression of working from an analysis of the diagrams themselves rather than of how people produce the diagrams. Thus the general pattern is that the DEs are well able to support the copying of existing paper diagrams, but are less able to support the normal process of producing diagrams. That is, design is characterized by frequent major amendments, by borrowing from previous designs, and by being both a bottom-up and a top-down process. It is these aspects of diagram production that the DEs do not support as well as they might.

In general, task difficulties will often be hard to identify, because they require detailed understanding of users' task behaviours and thus will vary between application domains. There is, therefore, no large database of HF knowledge to consult as there is with the presentation difficulties. Moreover, HF has not so far been able to develop a range of clearly effective methods, tools, or techniques for understanding users' tasks. All this makes the accurate identification of task difficulties the major problem in HF evaluation. In addition, because the difficulties relate to the functionality of the system, they tend to be harder to rectify than the presentation difficulties.
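To show how the three kinds of difficulty identified above might be recorded and summarized during informal analysis of the session notes, the following sketch tags each difficulty with a type and tallies the totals. It is an illustration added here, not a tool used in the study; the classification scheme follows the paper, and the example entries paraphrase difficulties reported above.

```python
# Illustrative sketch: recording evaluation difficulties by type and producing
# a simple summary. The scheme (device, presentation, task) follows the paper;
# the code itself is an assumption added for exposition.
from collections import Counter
from dataclasses import dataclass


@dataclass
class Difficulty:
    description: str
    kind: str          # "device", "presentation", or "task"
    subcategory: str   # e.g., "commands", "feedback", "missing support"


difficulties = [
    Difficulty("Confusion over which mouse button to use", "device", "input devices"),
    Difficulty("DELETE and BACKSPACE keys used inconsistently", "presentation", "commands"),
    Difficulty("Greyed-out menu choices are virtually illegible", "presentation", "screen appearance"),
    Difficulty("Cannot load two files at once for comparison or copying", "task", "missing support"),
    Difficulty("No 'Tidy' facility for rectangular layout", "task", "inadequate support"),
]

# Tally difficulties by kind, e.g., for the summary section of the written report.
summary = Counter(d.kind for d in difficulties)
for kind, count in summary.items():
    print(f"{kind}: {count}")
```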

Difficulties with data dictionary


It is not the intention to present difficulties associated with the data dictionary, for two main reasons. First, it is possible to illustrate the evaluation without discussing the details of the data dictionary. Second, the data dictionary was not evaluated to the same extent as the DEs since, as already stated, it was a proprietary product and therefore not easily changed, and it was in any case liable to be replaced by another product. Suffice to say here that there were no classes of difficulties identified in the use of the data dictionary that were not also found with the DEs, and the data dictionary generally performed poorly in the evaluation, in large part due to the contrast between the database-oriented terminology and structure of the data dictionary (row, column, master, detail, etc.) and the domain-oriented terminology and structure of the DEs (process, dataflow, entity, etc.). Performance with the data dictionary essentially failed to meet all three of the relevant criteria mentioned earlier.

Criterial assessment of DEs

The relevant criterion given earlier was that 'naive users should be able to use each DE to complete basic editing tasks without encountering serious difficulties or making serious errors'. Against this criterion, the DEs performed adequately, although not optimally. All users tested could complete the basic diagram-editing tasks without serious difficulty or error. The worst problems encountered were in 'tidying up' the diagrams to improve their appearance, the lack of feedback from some commands, and the mismatches of task support with respect to normal working practices.

CONCLUSIONS

What can the case study illustrate about typical current practice in HF evaluation? First, that such evaluations can succeed in identifying a wide range of user-interface problems while consuming modest resources. The report of this evaluation was well received by the system developers, who found it contributed to the improvement of the design in an area in which they themselves were inexperienced. The total effort involved for all personnel (including HF specialists, domain specialists, and users) was between three and four person-weeks, of which around one third was spent on the written report. There is no fixed level of effort required for an HF evaluation. In any instance, the chosen level will be determined by a number of factors. However, the less effort is involved, the less confidence there can be in the outcomes in terms of their accuracy, completeness, and consistency. The authors would suggest that, for this system, the level of effort for the evaluation was small as a part of the overall development and was worthwhile in terms of its consequences for the quality of the final system.

As far as the range of difficulties identified is concerned, there is clearly an important distinction to be made between what have been called here presentation and task difficulties. They differ in a number of ways, including their identifiability, their generality across application domains, the HF knowledge to which they relate, and the ease with which they can be remedied. They also tend to differ in the aspects of performance to which they relate. While both will adversely affect the resource costs required of a user in carrying out a task, the task difficulties are much more likely also to affect task product quality. Identification of task difficulties requires either detailed knowledge about the application domain (in this case, SSADM design) or good techniques for task analysis and representation, or probably both.


In this instance the knowledge of the domain was contributed in the form of the tasks the users were asked to perform, and by the specialist reports from the domain specialists. Both of these contributed heavily to the identification of task difficulties. It is worth pointing out that there is at least one type of difficulty that has been missed (or rather ignored) in this evaluation: organizational difficulties. These concern how the system impacts the working structures and practices of the organization rather than those of the individual. They have been ignored here because they were not included in the focus of the evaluation as negotiated with the client. Had they been included in the focus, there would have been changes to the methods used and to the types of specialist knowledge recruited.

What consequences for system development arise from considering the different types of difficulties? The task difficulties highlight the need for better requirements analysis in the software engineering sense, and the need for an HF input to requirements analysis in the form of task analysis. Requirements analysis is acknowledged to be the source of many errors in system development26, and the importance of tracing and validating requirements in specifications and products has been emphasized27,28. Either a better appreciation of empirical studies of software design29 or simple observation and analysis of SSADM developers' activities would have indicated the need for support of actual work practices rather than of the top-down procedures implied by the structured methods. Many CASE tools fail to adopt a flexible approach to supporting working practices, and consequently they tend to be used as documentation tools rather than as genuine support for the software engineering process3.

The presentation difficulties, on the other hand, suggest the need for the incorporation of known HF principles and guidelines into software engineering practice. There is a good deal of relevant HF knowledge which, for one reason or another, had not been used in the development of these tools. Clearly, ways need to be found to ensure that what is known can be recruited to software engineering practice as easily and effectively as possible.

Improvements to requirements analysis, and the recruitment of extant HF knowledge, would both reduce the reliance on evaluation as a form of HF input to system development and enable HF to contribute in other ways. The framework for evaluation summarized earlier indicated that evaluation can take place at any time during system development, although it has been symptomatic of HF evaluation that it has tended to occur very late, and often only once, in development. A typical HF evaluation, as described here, clearly requires some physical form of the computer to be present. For a typical evaluation to have the potential for maximum benefit by occurring early in system development therefore requires the adoption of an iterative approach, if not a prototyping approach. Indeed, evaluation is an essential component of prototyping, for without a methodical approach to testing, the value of prototyping for trapping errors early may be nullified by cursory examination. Methods for HF evaluation can systematize prototype testing so that testing scenarios are correctly constructed and testing is thorough. The need for a methodical approach to prototyping and requirements validation has been emphasized27, and scenarios have proved effective in requirements testing30.

The framework and practice reported in this paper should enable readers to take some initial steps in systematizing and refining HF evaluation practice. Specifying the focus of the evaluation, and selecting methods appropriate for the focus, given the context, are critical activities whose importance the authors have tried to emphasize (while acknowledging that they are currently poorly understood and underspecified). The client-centric focus reported here represents a departure from previous practice, which has viewed evaluation from a very general HF brief. It is important that the focus is clear and that, while it may express the client's concerns, it is expressed in user-oriented terms. The authors believe that their approach, of attempting to systematize and refine current HF evaluation practice, offers greater promise for improving evaluation than more radical alternatives, such as increasing the reliance on users' subjective experiences or the development of quantitative evaluation metrics with doubtful construct validity.

The final point concerns the integration of HF evaluation into the quality-assurance role within software engineering. HF evaluation is capable of making a significant contribution to quality computer systems. Just as software engineering methodology has evolved to address other quality-assurance concerns (e.g., reliability), it needs to evolve further to encompass HF aspects of quality assurance. Incorporation of the kinds of evaluation practice described in this paper would be one move in that direction.

ACKNOWLEDGEMENTS

This work was done while the first author was working on a project funded by the Department of Trade and Industry, UK. The authors thank personnel on the project that developed the tools for their assistance, and colleagues at the Ergonomics Unit for discussions and for comments on an earlier draft.
REFERENCES

1 Gould, J D 'How to design usable systems' in Bullinger, H-J and Shackel, B (eds) Human-Computer Interaction - INTERACT'87 North-Holland (1987)
2 Nielsen, J 'Traditional dialogue design applied to modern user interfaces' Commun. ACM Vol 33 No 10 (1990) pp 109-118
3 Chikofsky, E J and Rubenstein, B L 'Reliability engineering for information systems' IEEE Software (March 1988) pp 11-16
4 Long, J B and Dowell, J 'Conceptions of the discipline of HCI: craft, applied science and engineering' in Sutcliffe, A and Macaulay, L (eds) People and Computers V, Proc. HCI'89 Cambridge University Press (1989)
5 Denley, I and Long, J B 'A framework for evaluation practice' in Lovesey, E J (ed) Contemporary Ergonomics 1990 Taylor and Francis (1990)
6 Dillon, A P 'The role of usability labs in system design' in Megaw, E D (ed) Contemporary Ergonomics 1988 Taylor and Francis (1988)
7 Whiteside, J, Bennett, J and Holtzblatt, K 'Usability engineering: our experience and evolution' in Helander, M (ed) Handbook of Human-Computer Interaction Elsevier (1988)
8 Hammond, N, Hinton, G, Barnard, P, MacLean, A, Long, J and Whitefield, A 'Evaluating the interface of a document processor: a comparison of expert judgement and user observation' in Shackel, B (ed) Human-Computer Interaction - INTERACT'84 North-Holland (1985)
9 Whitefield, A D, Wilson, F and Dowell, J 'A framework for human factors evaluation' Behav. Inf. Technol. Vol 10 No 1 (1991) pp 65-79
10 Ravden, S and Johnson, G Evaluating Usability of Human-Computer Interfaces: A Practical Method Ellis Horwood (1989)
11 Hartson, H R and Smith, E C 'Rapid prototyping in human-computer interface development' Interact. Comput. Vol 3 No 1 (1991) pp 51-91
12 Gould, J D and Lewis, C 'Designing for usability: key principles and what designers think' Commun. ACM Vol 28 No 3 (1985) pp 300-311
13 Dowell, J and Long, J B 'The "late" evaluation of a messaging system design and the target for "early" evaluation methods' in Sutcliffe, A and Macaulay, L (eds) People and Computers V, Proc. HCI'89 Cambridge University Press (1989)
14 Wright, P C and Monk, A F 'A cost-effective evaluation method for use by designers' Int. J. Man-Mach. Stud. (in press)
15 Jorgensen, A H 'Thinking-aloud in user interface design: a method promoting cognitive ergonomics' Ergonomics Vol 33 No 4 (1990) pp 501-507
16 Molich, R and Nielsen, J 'Improving a human-computer dialogue' Commun. ACM Vol 33 No 3 (1990)
17 Dowell, J and Long, J B 'Towards a conception for an engineering discipline of human factors' Ergonomics Vol 32 No 11 (1989) pp 1513-1535
18 Downs, E, Clare, P and Coe, I Structured Systems Analysis and Design Method: Application and Context Prentice Hall (1988)
19 Hill, R, Denley, I and Long, J 'Towards an evaluation planning aid: classifying and selecting evaluation methods' in Lovesey, E J (ed) Contemporary Ergonomics 1991 Taylor and Francis (1991)
20 Wilson, F and Whitefield, A D 'Interactive systems evaluation: mapping methods to contexts' in Megaw, E (ed) Contemporary Ergonomics 1989 Taylor and Francis (1989)
21 Moran, T P 'The command language grammar: a representation for the user interface of interactive computer systems' Int. J. Man-Mach. Stud. Vol 15 (1981) pp 3-50
22 Taylor, M M 'Layered protocols for computer-human dialogue. I: principles' Int. J. Man-Mach. Stud. Vol 28 (1988) pp 175-218
23 Sutcliffe, A G Human Computer Interface Design Macmillan (1988)
24 Smith, S L and Mosier, J N 'Guidelines for designing user interface software' MITRE Corporation report ESD-TR-86-278 (1986)
25 Gardiner, M M and Christie, B (eds) Applying Cognitive Psychology to User-Interface Design John Wiley (1987)
26 van Assche, F, Layzell, P J, Loucopoulos, P and Speltincx, G 'RUBRIC: a rule based representation of information system constructs' in Proc. 5th Esprit Conf. North-Holland (1988) pp 438-452
27 Roman, G-C 'A taxonomy of current issues in requirements engineering' Computer (April 1985) pp 14-22
28 Sailor, J D 'Systems engineering: an introduction' in Thayer, R H and Dorfman, M (eds) System and Software Requirements Engineering IEEE Computer Society Press (1990)
29 Curtis, W 'Empirical studies of the software design process' in Diaper, D, Gilmore, D, Cockton, G and Shackel, B (eds) Human-Computer Interaction - INTERACT'90 North-Holland (1990) pp 35-50
30 Benner, K M and Johnson, W L 'The use of scenarios for the development and validation of specifications' in Proc. AIAA Conf. Computers in Aerospace Monterey, CA, USA (1989)