This is an author's accepted manuscript of an article published in Proceedings of NordiCHI '08 (pp. 467-470). Copyright: ACM. Available online at: http://dl.acm.org/citation.cfm?id=1463221.

The Effect of Group Discussions in Usability Inspection: A Pilot Study

Asbjørn Følstad
SINTEF ICT
Forskningsveien 1
0373 Oslo, Norway
+47 22067515
asf@sintef.no

ABSTRACT
How do group discussions affect the output of a usability inspection, as compared to individual evaluators only? This question was investigated in association with usability inspections of a music community website. Potential users of the website participated as evaluators in two usability inspection sessions. All evaluators made individual predictions prior to group discussions. Twenty-five percent of the usability issues generated in the group discussions were new, viz. not predicted by individual evaluators. Also, the group discussions served to discard or modify the majority of usability issues predicted by the individual evaluators. The discarded individual predictions were typically of low severity.

Categories and Subject Descriptors
H.5.m [Information interfaces and presentation]: Miscellaneous

General Terms
Human Factors

Keywords
Usability inspection methods, group discussion

1. INTRODUCTION
Individuals and groups perform differently when conducting analyses and reaching conclusions, something that has been shown in research within the fields of psychology and organization studies. Given this difference, it is interesting to investigate the effect of group discussions in usability inspections.

Usability inspection methods (UIMs) vary greatly regarding the use of group discussions. At one extreme, heuristic evaluation is constructed to support the individual evaluator to such a degree that, in the case of multiple evaluators, communication between evaluators is discouraged until all evaluations have been completed [1]. In contrast, UIMs involving user representatives, such as the pluralistic walkthrough [2] and the group-based expert walkthrough [3], rely heavily on the use of group discussions. In between these extremes, the cognitive walkthrough can be performed either individually or in groups [4].

This paper presents a pilot study of the possible effect of group discussions in usability inspections. The paper makes an important contribution by providing initial insight into this effect; insight that is important both as background knowledge for practitioners and as input to future UIM development.

2. PREVIOUS RESEARCH
In his review of the empirical literature comparing the problem-solving and decision-making performance of groups versus individuals, Hill [5] reported that group performance is generally superior to the average individual performance, but inferior to the performance of the best individual in a statistical aggregate. Group productivity seems to increase when members can correct each other and pool information.

Miner [6], building on the work of Hill, investigated the effect of the decision process sequence on the performance of groups vs. individuals. Miner concluded tentatively that when individual decisions precede the decision of a group, the group outperforms the best individual decision. Groups without previous individual decision making were outperformed by groups whose individuals had made decisions for themselves prior to entering the group.

Related HCI work includes studies by Nielsen and Sears, indicating that pooling the results of several evaluators increases the number of usability problems identified [1] and the evaluation's thoroughness [7], as compared to the results of a single evaluator. These findings indicate that the individuals of a group of evaluators will most likely contribute heterogeneous input, possibly complementing each other in the discussion. Similarly, a study on the evaluator effect by Hertzum et al. [8] indicates low overlap between individual evaluators' usability problems. In that study, when the evaluators discussed their individual findings, they concluded that they were in general agreement despite their low overlap of identified problems.

3. RESEARCH QUESTION AND HYPOTHESES
The research question of the present study was: Which effect will the use of group discussions have on the output of a usability inspection, as compared to the pooled output of the individual evaluators alone?
Possible effects of using evaluation groups were formulated as three hypotheses:

H1: The group discussion will lead to the prediction of new usability issues.

H2: The group discussion will lead to the discarding or modification of usability issues predicted by the individual evaluators.

H3: The group discussion will lead to changes in the severity ratings of the usability issues predicted by the individual evaluators.

4. METHOD
The pilot study was conducted as an empirical comparison of the pooled output of individual evaluators and the output of subsequent group discussions in a given case.

The case was a usability inspection of a music community website, Urørt (www.nrk.no/urort), used by unsigned Norwegian artists to share their demo material with the general public. Urørt is run by the Norwegian state broadcaster NRK, contains the music of ~21,000 artists, and is visited by ~12,000 unique users each day.

The inspection method was the group-based expert walkthrough [3], a UIM developed particularly to allow non-usability experts to participate as evaluators. The group-based expert walkthrough is structured as a series of individual note-taking and plenary discussions, where the evaluators decide individually on usability issues prior to the plenary discussions.

Two evaluation sessions were conducted, with 7 participants in each. Each session lasted 2.5 hours. The sessions were led by a usability professional (the author of the present paper); in addition, one secretary and one observer participated. In total, 12 task-scenarios were included in the two evaluations, targeting navigation mechanisms, functionality for playing audio content, and profile page interaction. In one of the sessions 9 task-scenarios were walked through, in the other 10. For each task-scenario the test leader first presented the steps required to solve the task, following which the evaluators were required to make individual notes on predicted usability issues (problems or design suggestions). Discussions were not allowed during the individual note-taking. Upon completion of the individual notes, the participants' individual issues were presented and discussed in the group. The group decided whether or not each issue should be sustained and, if sustained, how it should be formulated. The discussions were structured by the test leader, but the input to the discussions was to come from the evaluators only. For each of the steps of the task-scenarios, high-severity issues were presented and discussed prior to low-severity issues.

The walkthroughs resulted in 14 individual records of predicted usability issues, as well as two plenary records, one from each session. All issues, both individually recorded and plenary decided, were required to be associated with a severity rating. The following severity ratings were employed: Cosmetic (minor obstacles or sources of irritation), Serious (major obstacles or sources of irritation), and Critical (insurmountable obstacles or sources of irritation).

The evaluators were potential, but not actual, users of the website. All evaluators reported music to be of high importance to them, and all but three reported listening to music online weekly or more. All but four had never used the Urørt website at all. All used the Internet on a daily basis, and all but five also used online communities on a daily basis. Eight were male and six female. Mean age was 22 years (min. 18, max. 24).

When the evaluation sessions were finished, the individual records were matched with their associated plenary record. For each item of an individual record, one analyst (the author of this paper) judged whether or not it matched any item of the associated plenary record. All items were reported in a format clearly identifying the task-scenario context, something that facilitated the matching process. The strength of each match was rated according to three categories: Weak match (major changes made to the individually recorded item), Fair match (some changes), and Good match (minor or no changes).

Examples of individual and plenary notes associated with the three match categories are given in Table 1.

Table 1: Examples of match between individually and plenary recorded items

Weak match
  Individual record: "Link to individual genres"
  Corresponding plenary record: "Not a good idea to mix genres in the recommendation section if this section only reflects parts of the available genre"

  Individual record: "Should be placed in a nearby container"
  Corresponding plenary record: "Wish more information about the artist available in the same window as the player"

Fair match
  Individual record: "Why include artists you cannot listen to?"
  Corresponding plenary record: "Annoying with songs in the list that you cannot listen to"

  Individual record: "The house icon is usually front page, slightly misguiding, but ok if it is there. Somewhat discreet!"
  Corresponding plenary record: "Difficult to understand the house-icon of the player leading to the artist's own page. Interpreted as link to the Urørt main page."

Good match
  Individual record: "Name of the song or the artist? Doubled up?"
  Corresponding plenary record: "Difficult to differentiate between song name and artist name"

  Individual record: "The player opens in a separate window. Would prefer a player embedded in the artist's page"
  Corresponding plenary record: "Annoying that the player opens in a separate window. Wish the player to be embedded at the artist's own page, as MySpace"
After the matching, the following analyses were conducted:

New usability issues predicted by the group (H1): Plenary items not matching any individuals' items were regarded as new; plenary items matching individuals' items were regarded as sustained. The proportion of new items to the totality of plenary items was calculated. The severity ratings of new vs. sustained items were compared.

The groups' discarding or modification of usability issues predicted by individuals (H2): Individual items not matching any plenary items were regarded as discarded; individual items associated with the categories Weak match and Fair match were regarded as modified. The proportions and severity ratings of discarded, modified, and unchanged items were calculated.

Changes in severity rating between individuals and group (H3): The severity rating of each sustained plenary item was compared with the average severity rating of its associated items recorded by individual evaluators. Average severity ratings were calculated upon assigning the following values to each severity level: Cosmetic=1, Serious=2, Critical=3.
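
As a concrete illustration of this matching bookkeeping, the sketch below implements the classification rules and severity scoring in Python. The classification rules and severity values come from the descriptions above; the function names and record format are hypothetical.

```python
# Severity values as assigned in the analysis (the mapping is the one
# described in the paper; the surrounding helper code is illustrative).
SEVERITY = {"Cosmetic": 1, "Serious": 2, "Critical": 3}

def classify_individual_item(match_category):
    """H2 bookkeeping: match_category is None when no plenary match was
    found, otherwise 'Weak match', 'Fair match', or 'Good match'."""
    if match_category is None:
        return "discarded"
    if match_category in ("Weak match", "Fair match"):
        return "modified"
    return "unchanged"  # Good match

def classify_plenary_item(has_individual_match):
    """H1 bookkeeping: plenary items with no individual counterpart are new."""
    return "sustained" if has_individual_match else "new"

def average_severity(individual_ratings):
    """H3 bookkeeping: average severity score of the individual items
    associated with one plenary item, e.g. ['Cosmetic', 'Serious'] -> 1.5."""
    return sum(SEVERITY[r] for r in individual_ratings) / len(individual_ratings)
```
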
5. RESULTS
The individual evaluators' notes contained a total of 175 items (an additional four items were discarded due to interpretational issues). The mean number of items for each individual evaluator was 12.5 (SD=5.2). Thirty-six percent of the individually recorded items were classified as cosmetic, 35 percent as serious, and 18 percent as critical. Eleven percent of the individual evaluators' items were not rated for severity.

The plenary discussions resulted in 76 items. Thirteen percent of the plenary recorded items were classified as cosmetic, 38 percent as serious, and 21 percent as critical. Twenty-eight percent of the plenary recorded items were not rated for severity.

Results of relevance to H1: Twenty-five percent of the plenary recorded items were not found to match any individually recorded item, and were thus regarded as new. The severity ratings of new vs. sustained items were compared through a Fisher exact test. Items not rated for severity were excluded from the test. No significant difference was found (n=55; p(two-tailed)=0.19).
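
The underlying new-vs.-sustained severity counts are not published in the paper, so the following is a sketch of the test procedure only, with made-up counts. Note that scipy.stats.fisher_exact is limited to 2x2 tables; with three severity levels, the test would require either collapsing categories, as below, or an RxC exact test (e.g., fisher.test in R).

```python
from scipy.stats import fisher_exact

# Made-up counts for illustration only (n=55, as in the study, but the
# cell values are invented): rows = new vs. sustained plenary items,
# columns = cosmetic vs. serious-or-critical severity ratings.
observed = [[ 4, 10],
            [13, 28]]

odds_ratio, p = fisher_exact(observed, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p(two-tailed) = {p:.2f}")
```
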
Results of relevance to H2: The distribution of individually recorded items as discarded, modified, or unchanged is presented in Table 2.

Table 2: Distribution of individually recorded items according to match with plenary recorded items

Match                            Count      %
No match found (discarded)          43     25
Weak or fair match (modified)       66     37
Good match (unchanged)              66     38
Sum                                175    100

The severity ratings of discarded vs. sustained (modified and unchanged) items were compared through a Chi-square test. Items not rated for severity were excluded from the test. A significant difference was found (n=153; X2(2)=15.9; p(two-tailed)<0.01). Distribution details are presented in Table 3.

Table 3: Distribution of discarded vs. sustained individually recorded items, across severity categories

Severity                Discarded items     Modified or unchanged items
(individual ratings)    Count      %        Count      %
Cosmetic                   25     58           36     27
Serious                     9     21           52     39
Critical                    3      7           28     21
Not rated                   6     14           16     12
Sum                        43    100          132    100
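
The reported Chi-square statistic can be reproduced from the severity-rated counts in Table 3 (the "Not rated" row excluded). A minimal check using scipy:

```python
from scipy.stats import chi2_contingency

# Severity-rated counts from Table 3 (n=153):
# rows = discarded vs. sustained (modified or unchanged) items,
# columns = Cosmetic, Serious, Critical.
observed = [[25,  9,  3],
            [36, 52, 28]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"X2({dof}) = {chi2:.1f}, p = {p:.4f}")  # X2(2) = 15.9, p = 0.0004
```
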
Results of relevance to H3: Forty-four of the plenary items had been rated for severity and associated with one or more individually recorded items rated for severity. Of these plenary items, 21 had a severity rating identical to the average severity score of their individually recorded counterparts, 16 had higher severity ratings, and 7 had lower. The severity ratings of plenary items vs. individuals' items were compared through a Wilcoxon signed-rank test. No significant difference between the groups was found (Z=1.40; p(two-tailed)=0.16).
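
A sketch of the H3 comparison using scipy's Wilcoxon signed-rank test; the paired severity scores below are invented for illustration (the study's raw pairs are not published). By default scipy discards zero-difference pairs, which matters here given the high proportion of ties reported above.

```python
from scipy.stats import wilcoxon

# Invented paired data for illustration only: severity ratings of sustained
# plenary items vs. the average severity scores of their matched
# individually recorded items (Cosmetic=1, Serious=2, Critical=3).
plenary    = [2, 3, 2, 1, 3, 2, 2, 3]
individual = [1.5, 2.0, 2.0, 1.0, 2.5, 2.0, 1.0, 3.0]

# Zero-difference pairs (ties) are dropped by default, as in the classic
# Wilcoxon procedure; many ties therefore reduce the effective sample size.
stat, p = wilcoxon(plenary, individual)
print(f"W = {stat}, p(two-tailed) = {p:.2f}")
```
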
6. DISCUSSION
The discussion is structured according to the three research hypotheses and is concluded by suggestions for refined hypotheses and further work.

6.1 New usability issues
One fourth of the items in the plenary records were interpreted as new, resulting from the group discussions. These new items were not found to differ in severity from the sustained items, implying that the usability issues generated in the group discussions were judged by the evaluators to be of equal importance to the issues brought into the discussion by one or more of the group's individuals.

H1 is sustained: group discussions indeed seem to enable the identification of new usability issues. However, as the great majority of the issues had already been identified by the individuals of the groups, the group discussions as such seemed to be a fairly modest source of new issues.

6.2 Discarded or modified usability issues
Twenty-five percent of the individually recorded items were interpreted as discarded and an additional 37 percent as modified in the group discussions. The group discussions thus had a strong impact on the items identified by individual evaluators, leaving less than half of these items unchanged.

Consequently, even though a significant aspect of group discussions may be to identify new usability issues, more important is their contribution to qualifying the set of usability issues already identified by the individual evaluators. This effect of the group discussions is in line with previous research on group vs. individual performance by Hill [5] and Miner [6].

Such qualification of individual items through group discussions may be highly valuable. The modification of usability issues allows the group of evaluators itself to re-focus identified issues, instead of this being done by a test leader when analyzing the individual evaluators' contributions. Also, the modified usability issues will most likely represent the best judgment of several evaluators rather than of one.

It is interesting to note that the items discarded in the group discussions had been given significantly lower severity ratings by the individual evaluators than the sustained items. Thus it may well be that a characteristic of group discussions is to remove low-importance predictions, something that will be important in practical development contexts with limited resources for redesign.

H2 is sustained, and the results seem to provide grounds for a refinement of the hypothesis: the group discussions will impact the majority of the individually identified items.
6.3 Changed severity ratings?
The comparison of the plenary recorded items' severity ratings with the average severity scores of the individually recorded items indicated a tendency towards increased severity ratings in plenum as opposed to by individuals. This tendency was, however, not found to be statistically significant, reflecting a high proportion of ties (situations where the severity rating of the plenary recorded item and the severity score of the associated individually recorded items were identical).

It may be possible that the tendency towards increased severity ratings as a consequence of group discussions corresponds to a general effect. The present case, however, seems to indicate that such an effect may not be very great.

6.4 Refined hypotheses
The current pilot study allows, in spite of its weaknesses regarding validity, a refinement of the current hypotheses and inspiration for future work. The following refinements of the hypotheses are suggested:

H1': The group discussion will lead to the prediction of new usability issues, however fewer than the sustained usability issues identified by individual evaluators.

H2': The group discussion will lead to the discarding or modification of the majority of usability issues predicted by individual evaluators.

H3': The group discussion will be associated with a minor increase in the severity ratings of the predicted usability issues, as compared to the severity ratings provided by individual evaluators.
6.5 Limitations and future work
This study has several limitations, in line with its character as a pilot. Future studies set up to investigate the refined hypotheses resulting from this pilot should in particular consider the following in order to improve validity:

• Include a larger number of cases and evaluator types (external validity)

• Include objects of evaluation from different application areas (construct validity)

• Include qualitative analyses of the group discussions (construct validity)

• Use multiple analysts and/or blind matching of individually and plenary recorded items (internal validity)

In this pilot study, possible effects of group discussions on the thoroughness, validity, and downstream utility of usability inspections were not investigated. Such effects on the performance of usability inspections would make highly relevant objectives for future work.

Other relevant future work may be to investigate whether or not the effect of group discussions found in the present study depends on the particular group process of the group-based expert walkthrough. Other relevant group processes could be group discussions associated with adaptations of heuristic evaluation or the cognitive walkthrough, or possibly groups not organized by a test leader.

It is the hope of the author that the present study serves as an inspiration within the HCI community to continue research on the effect of group discussions in usability inspections. Such research may in time change the way we conduct usability inspections.

7. ACKNOWLEDGEMENTS
The reported study was conducted as part of the research project RECORD, supported by the Norwegian Research Council's VERDIKT program.

8. REFERENCES
[1] Nielsen, J. 1992. Finding usability problems through heuristic evaluation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Monterey, California, 1992). CHI 1992. ACM Press, New York, NY, 373-380.

[2] Bias, R. 1994. The Pluralistic Walkthrough: Coordinated Empathies. In J. Nielsen and R. L. Mack, Eds., Usability Inspection Methods. John Wiley & Sons, New York, NY, 63-76.

[3] Følstad, A. 2007. Group-based Expert Walkthrough. In D. Scapin and E. L.-C. Law, Eds., R3UEMs: Review, Report and Refine Usability Evaluation Methods. Proceedings of the 3rd COST294-MAUSE International Workshop, 58-60.

[4] Wharton, C., Rieman, J., Lewis, C., and Polson, P. 1994. The Cognitive Walkthrough: A Practitioner's Guide. In J. Nielsen and R. L. Mack, Eds., Usability Inspection Methods. John Wiley & Sons, New York, NY, 105-140.

[5] Hill, G. W. 1982. Group Versus Individual Performance: Are N + 1 Heads Better Than One? Psychological Bulletin 91(3), 517-539.

[6] Miner, F. C. 1984. Group versus Individual Decision Making: An Investigation of Performance Measures, Decision Strategies, and Process Losses/Gains. Organizational Behavior and Human Performance 33, 112-124.

[7] Sears, A. 1997. Heuristic walkthroughs: Finding the problems without the noise. International Journal of Human-Computer Interaction 9(3), 213-234.

[8] Hertzum, M., Jacobsen, N. E., and Molich, R. 2002. Usability Inspections by Groups of Specialists: Perceived Agreement in Spite of Disparate Observations. In Extended Abstracts of the ACM CHI 2002 Conference (Minneapolis, MN, April 20-25, 2002). ACM Press, New York, NY, 662-663.
