
Armen Chakmakjian

HF750 – SP11

3/24/11

Reliability of Usability Evaluations
Review of Literature and How To Minimize Problems

Table of Contents
Introduction: Is Usability Testing Reliable?
The optimal number under question
The layering of effects on results
The evaluator effect bias
Analytical data as a solution
Evaluation of the Evaluation Literature
Summary and Conclusion
Works Cited

Introduction: Is Usability Testing Reliable?
Usability testing has become an integral part of the software design life cycle for products in many industries. Much literature has been written describing how to conduct an effective usability test. This paper will describe some of those methods and will address the fundamental issue being argued by practitioners, namely that the results are not reliable or comprehensive in a scientific sense. All phases and methodologies have been under scrutiny for several years now. Some examples of topics under scrutiny:

- The optimal number of users to test
- The observer's effect on the test
- The moderator's effect on the test
- The evaluators' effect on the conclusions
- Whether expert reviews are better than user testing
- Whether a specific task plan is better than free-form use

In reviewing the pertinent literature, this paper will attempt to explain each of these items and the expert opinions in those areas. Finally, the paper will describe the effect of these issues on the practitioner and propose some possible solutions.

The optimal number under question
Starting with the work of Virzi at GTE Laboratories (Virzi, 1992), a body of research coalesced around an optimal number of about 5 test participants. Those 5 would generally find 80% of the detectable issues, and the most severe usability issues would be detected by the first few participants. More recently, Law and Hvannberg (Law & Hvannberg, 2004) have shown, from a probabilistic point of view, that this magic number and the methodology behind it may be problematic, because individual task-scenario events are not independent and individual problems are not equally likely to be identified. Lindgaard and Chattratichart point out that the magic number 5 is relied upon excessively by practitioners (Lindgaard & Chattratichart, 2007). They contend that by concentrating on the number of participants rather than on task coverage, many problems are being missed. They based their study on their own evaluation of the results from the CUE-4 study (Molich & Dumas, 2006), which compared the results of expert review teams against each other. That study and one of its predecessors, CUE-2 (Molich, Ede, Kaasgaard, & Karyukin, 2004), examined the consistency of findings between teams and organizations. In both cases the results showed that about 75% of the problems reported were unique to a single team and that only a small number of problems were found by more than one of the teams involved. As Molich pointed out, and Lindgaard later may have inferred, consistency of method and task may be a cure for this.
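The arithmetic behind the magic number 5, and the reason Law and Hvannberg's critique bites, is worth making explicit. The rule of thumb rests on a simple model: if every problem has the same independent probability p of being found by each participant, the expected proportion of problems found by n participants is 1 - (1 - p)^n, and with p around 0.3 (the figure commonly associated with Virzi's data) five participants appear to uncover roughly 80% of problems. The short Python sketch below is written for this paper, not taken from any of the cited studies; the per-problem probabilities in the second part are invented to show how unequal detection likelihoods erode the 80% figure.

```python
# Illustration of the "magic number 5" discovery model, written for this
# review; it is not taken from any of the cited studies.
# Standard assumption: each participant independently finds a given problem
# with probability p, so n participants find an expected 1 - (1 - p)**n of them.

def proportion_found(p: float, n: int) -> float:
    """Expected proportion of problems found by n participants."""
    return 1 - (1 - p) ** n

# Equal-likelihood case, with p = 0.31 (roughly the figure associated with
# Virzi's data): five participants appear to uncover over 80% of problems.
for n in (1, 3, 5, 10):
    print(n, round(proportion_found(0.31, n), 2))

# Unequal-likelihood case, with invented per-problem probabilities: when some
# problems are much harder to detect, five participants find far less than
# 80% of them, which is the substance of Law and Hvannberg's critique.
problem_probs = [0.9, 0.5, 0.3, 0.1, 0.05, 0.02]  # hypothetical values
expected = sum(proportion_found(p, 5) for p in problem_probs) / len(problem_probs)
print("unequal probabilities, 5 participants:", round(expected, 2))
```

With the invented unequal probabilities, expected coverage drops to roughly 60%, which illustrates why task coverage matters more than hitting a fixed participant count.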

The layering of effects on results
The difficulty here is that there appear to be layered effects introduced by the human evaluators, and simply having the same task list may not achieve consistent results. For example, observers of the same test may record different findings. As far back as 1992, it was apparent (Donkers, Tombaugh, & Dillon, 1992) that the performance of observers recording usability problems was affected both by the obviousness of the usability problems and by the observers' prior knowledge of them. Simply watching and recording (for example, from behind a two-way mirror) is affected by ancillary factors in the process. Further studies have shown that post-test evaluations can also be affected (Jacobsen, Hertzum, & John, 1998). Jacobsen points out that their study "questions the use of data from usability tests as a baseline for comparison to other usability evaluation methods" (Jacobsen, Hertzum, & John, 1998).

The evaluator effect bias
Further clouding the waters of usability-testing reliability is a study of the evaluator effect by Hertzum, Jacobsen, and Molich (Hertzum, Jacobsen, & Molich, 2002), in which they examined what happens when evaluators compare their results to each other. In this case, the mere fact that two evaluators found issues in the same area led them to feel that their own results were validated and that it was unnecessary to involve more team members in the evaluation. Even more damning, all of the evaluators missed several of the same problems that were clearly usability issues (Hertzum, Jacobsen, & Molich, 2002).
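One way to make the evaluator effect concrete is to quantify how much evaluators' problem lists actually overlap. The sketch below uses an any-two agreement measure of the kind often reported in evaluator-effect research (it is not described in the paper above); the evaluator problem sets are invented for illustration.

```python
# Hedged sketch: "any-two agreement" between evaluators' problem lists, a
# measure commonly used in evaluator-effect research. The problem identifiers
# below are invented for illustration only.
from itertools import combinations

def any_two_agreement(problem_sets):
    """Average Jaccard overlap |Pi & Pj| / |Pi | Pj| over all evaluator pairs."""
    pairs = list(combinations(problem_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

evaluators = [
    {"P1", "P2", "P3", "P7"},  # evaluator A's reported problems
    {"P2", "P4", "P5"},        # evaluator B
    {"P2", "P3", "P6"},        # evaluator C
]
print(round(any_two_agreement(evaluators), 2))  # about 0.26: low overlap
```

Note that in this toy example every evaluator reports a problem in the same area (P2), which could easily be taken as mutual confirmation even though overall agreement is low; that is precisely the false sense of validation the study describes.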

Analytical data as a solution
Some have argued that, to ameliorate the problems of rigor and inconsistency of results, quantitative and qualitative methodologies have to be used in conjunction, or that if the testing is done with only one method, the client should be made aware of the limitations of the resulting data (Hughes, 1999). Others feel that using more quantitative techniques within the user's contextual environment could remove bias. In particular, equipment that can record behavior as video, along with mouse and keyboard manipulation, during a think-aloud test would help (Christensen & Frøkjær, 2010). In other words, this might make remote testing a more reliable format if the quantitative data could be collected easily.
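The kind of instrumentation described above can be illustrated with a small example: log low-level interaction events with timestamps and later line them up with an evaluator's time-stamped observations. The sketch below is a hypothetical illustration of that correlation step, not the tooling from Christensen and Frøkjær (2010); the event names, timestamps, and note text are all invented.

```python
# Hypothetical illustration of correlating a logged interaction stream with an
# evaluator's time-stamped notes. Not the instrumentation from Christensen &
# Frøkjær (2010); event data and notes are invented.
from dataclasses import dataclass

@dataclass
class Event:
    t: float        # seconds from session start
    kind: str       # "click", "keypress", ...
    detail: str

@dataclass
class Note:
    t: float
    text: str

def events_near(note: Note, events: list[Event], window: float = 5.0) -> list[Event]:
    """Return logged events within +/- window seconds of an observer note."""
    return [e for e in events if abs(e.t - note.t) <= window]

events = [Event(12.0, "click", "Save button"),
          Event(13.5, "keypress", "Ctrl+Z"),
          Event(41.0, "click", "Help menu")]
notes = [Note(14.0, "User seemed unsure whether the save succeeded")]

for note in notes:
    print(note.text)
    for e in events_near(note, events):
        print("  ", e.t, e.kind, e.detail)
```

Even this trivial alignment gives an evaluator's note an objective anchor in the recorded data stream, which is the sense in which quantitative instrumentation could make remote testing more dependable.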

Evaluation of the Evaluation Literature
The literature in the area of usability evaluation seems to fall into two camps: promoting a particular technique, or damning the reliability of many of the techniques and rubrics that are considered etched in stone (like the magic number 5). One article tries mightily to point out that usability evaluations are part of an HCI tool kit and that they have an appropriate place in the design cycle, but that they are not the only tool in the kit (Greenberg & Buxton, 2008).

Summary and Conclusion
In the end, the lesson for the practitioner is that usability evaluation is neither just a think-aloud qualitative test nor just Google Analytics (or eye-tracking) recording a user's every click. As many of the studies presented here have shown, the reliability of usability evaluation depends on the rigor applied to the particular technique and on the makeup of the team doing the evaluation. In fact, usability professionals sharing and comparing results can end up erroneously affirming the correctness of a single set of results. The implications for the practitioner are profound, and some practical lessons are:

- Keep the test target and expectations small and focused; the software under test can be big.
- Have a well-defined task list that keeps the user's think-aloud evaluation focused on the problem being evaluated.
- Use some amount of technically objective data, such as keystrokes and mouse clicks, to create a data stream that can be correlated with evaluators' observations.
- Treat expert reviews and user evaluations as complementary techniques, but be prepared for results to vary widely.

Works Cited
Christensen, L., & Frøkjær, E. (2010). Distributed Usability Evaluation: Enabling Large-scale Usability Evaluation with User-controlled Instrumentation. Proceedings of NordiCHI 2010 (pp. 118-127). Reykjavik, Iceland: ACM.
Donkers, A. M., Tombaugh, J. W., & Dillon, R. F. (1992). Observer Accuracy in Usability Testing: The Effects of Obviousness and Prior Knowledge of Usability Problems. Carleton University, Department of Psychology, Ottawa, Ontario, Canada.
Greenberg, S., & Buxton, B. (2008). Usability Evaluation Considered Harmful (Some of the Time). CHI 2008 Proceedings (pp. 111-121). Florence, Italy: ACM.
Hertzum, M., Jacobsen, N. E., & Molich, R. (2002). Usability Inspections by Groups of Specialists: Perceived Agreement in Spite of Disparate Observations. CHI 2002 (pp. 662-663). Denmark.
Hughes, M. (1999, November). Rigor in Usability Testing. Technical Communication, (4), 488-494.
Jacobsen, N. E., Hertzum, M., & John, B. E. (1998). The Evaluator Effect in Usability Tests. CHI 98 (pp. 255-256). ACM.
Law, E. L.-C., & Hvannberg, E. T. (2004). Analysis of Combinatorial User Effect in International Usability Tests. CHI 2004, 6, pp. 9-16. Vienna, Austria.
Lindgaard, G., & Chattratichart, J. (2007). Usability Testing: What Have We Overlooked? CHI 2007 Proceedings (pp. 1415-1424). San Jose, CA.
Molich, R., & Dumas, J. S. (2006). Comparative Usability Evaluation (CUE-4). Behaviour & Information Technology, preprint.
Molich, R., Ede, M. R., Kaasgaard, K., & Karyukin, B. (2004). Comparative Usability Evaluation. Behaviour & Information Technology, 23(1), 65-74.
Virzi, R. A. (1992). Refining the Test Phase of Usability Evaluation: How Many Subjects Is Enough? Human Factors, 34, 457-471.
