
This chapter presents the preliminary steps for an evaluation study, up to the stage at which the study has
been crystallized and captured in an evaluation plan. Such a plan clarifies and outlines the evaluator's
decisions about a range of issues: What aspects of the product will be evaluated? Where should the evaluation take
place? Who are the participants? What is the evaluation’s goal, expected outcome, and final result? What
methods will be used for evaluating the product?

These issues are not normally dealt with in such a neat order—it isn’t that simple! In practice, some of these
decisions are interdependent and must be made together. Some decisions may also be assumed implicitly
by project stakeholders or, worse, may have been overlooked completely. Part of the job of an evaluator is
to help clients make the right decisions and use their efforts and resources appropriately.

This chapter explains some of the reasoning and trade-offs involved in making these decisions. On the way,
it introduces terminology and concepts that will be useful when contemplating and discussing evaluation
studies. It presents different approaches for, and purposes of, evaluations and outlines the different criteria
that might be used to formulate evaluation goals. Further, it discusses how these different approaches
might be served by different evaluation methods and analysis procedures and demonstrates how any
evaluation study represents a trade-off between control and the realism that a test setup can achieve.

Defining the Purpose of the Evaluation

The first step is deciding what the evaluation study is trying to achieve. Different purposes may require
different approaches to evaluation, different data to be collected, and different analysis procedures. An
evaluation can have several purposes; this section introduces some important distinctions.

Diagnostic Evaluations
In most cases, an evaluation is conducted to find out how to improve a product or a prototype, and its main
output is a list of problems regarding interaction with a product. The goal of the evaluation here is
diagnostic. It is not enough to find and name problems; they should be accompanied by explanations for the
causes of the problems and suggestions for correcting them.

The following are example outcomes of diagnostic evaluations:

The icons chosen for the navigation are not well understood by the children because they don't look like
navigation icons that the children are familiar with.

The wolf character in the game was too scary for the children of the chosen age group, who did not enjoy
that part of the game.

Formative and Summative Evaluations

Evaluation outcomes like the preceding ones are more useful during the design process than after its
completion. Their purpose is formative—the outputs help shape the intended product. Summative
evaluations aim to document the quality of a finished product. These definitions blur in practice: A
summative evaluation concerning a finished product may itself feed the design of a new product. For
example, in the early design phases it may be appropriate to evaluate an earlier version of the product, or
some competitor product, to identify areas of improvement.

A summative evaluation may involve comparison with a benchmark or a competitor product. A comparison
between products or intermediate design prototypes with respect to some desired set of attributes can also
inform the choice between different design options—for example, different ways of navigating, different
input techniques, and so forth.

Exploratory, Measurement, and Experimental Studies

Questions about the product can be “open” or “closed.” Open questions are those for which the range of
answers is not known in advance, such as “How will children react to a robot that can display facial
expressions?” This is in contrast to closed questions (where a list of all possible answers can be prepared in
advance) that aim to produce numeric measures or to check prespecified expectations one might have for a
product—for example, “Can children find books in the library faster using interface A rather than interface B?”

Open questions can be asked at all phases of the design lifecycle. They may concern usage patterns,
unpredicted problems of use, areas of improvement, or areas of opportunities for future products. If the
study starts with an open question, it will be more exploratory in nature. It will seek to uncover and explain
occurrences of events, emerging patterns of behavior, and so on.

The first statement concerns ease of learning, or learnability, which is a measure of how well or how fast
users learn to use the product. Learnability is especially important for children. Adult workers can be trained
to use a product that is difficult to learn but useful for their work, or they may persevere because using it is
part of their job. On the other hand, children are more likely to abandon a product that they cannot easily
learn how to use.

The second example is about the enjoyment experienced from playing a game. Enjoyment is not usually
considered a component of usability, but the two are connected. Good usability may contribute to
enjoyment, whereas low usability may prevent children from enjoying the game or using a product at all.

The preceding two examples also demonstrate a distinction between subjective and objective evaluation
criteria. The first statement suggests observation of child testers. Any criterion that can be assessed with
observed or logged data is often called objective. On the other hand, the second statement clearly refers to
how children experienced aspects of the design during a test, and such a question can be answered by
asking (surveying) them about their thoughts, feelings, and experiences. A goal and an evaluation outcome
described in terms of opinions or judgments by test participants are considered subjective. It is relatively
easy to evaluate subjective goals with adults by asking test participants to judge how much a product has
met a particular design goal. This is not as straightforward for young children, who may find it hard to reflect
on and evaluate their experiences.

It is not always possible to find objective measures for answering questions that evaluate how products
support user performance or a particular user experience. For example, how much fun a game is may well
be described as a subjective outcome when based on surveyed opinions, but it is difficult to observe
directly. One can, however, observe behaviors that relate to fun, such as gestures and facial expressions, or
even record physiological measures such as skin conductivity, which relates to the child’s arousal during
the evaluation session. Finding meaningful and reliable objective measures for assessing interactive
experiences is the subject of several ongoing research investigations worldwide.

Whether objective or subjective, evaluation criteria may be expressed, and therefore evaluated, either
qualitatively or quantitatively. For example, the enjoyment from a game could be evaluated quantitatively, such as
“How many children mentioned that they enjoyed the product?” It could also be evaluated using qualitative
data, in which case anecdotes and quotes can be listed in the evaluation to support and clarify the
conclusions. A qualitative analysis can uncover crucial oversights that render a game impossible to play, or
it may tease out unexpected reactions to the game story, the graphics, and so on. On the other hand, when
evaluating the performance of users using two different input devices, a quantitative analysis may be a
more meaningful basis for the evaluation.

When a project adopts a quantitative rather than a qualitative criterion, the evaluation becomes precise and
provides a clear criterion of success rather than one that is elastic and may be relaxed under time pressure.
On the other hand, a quantitative criterion may become an end in itself and create a tunnel vision effect for
the design team. Making sure a child can learn to start a game and personalize its character in less than
two minutes may be a valid target, but focusing on it prematurely may mean that insufficient attention is paid
to fixing more substantial problems with the story line or the game controls that are crucial for making the
game fun to play. The design team should balance the need for a clear measure of success against the
need for that measure to be relevant to the success of the product.

Products may be evaluated against criteria that are project specific; it may require some skill and effort to
formulate such criteria in an operational manner, so it is good practice to use prior art. There is a wide range
of criteria defined in the literature together with relevant measures that can be used. The criteria in the next
box should give you an idea of their diversity, as well as a starting point for exploring related sources.

Criteria to Evaluate Against

Usability: ISO 9241-11 defines usability as the “extent to which a product can be used by specified users to
achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use” (ISO,
1998). The repeated use of the word specified suggests that usability is not a singular, concrete,
measurable criterion but a framework for measuring related aspects of the interaction with a product.

Effectiveness: The accuracy and completeness with which specified users can achieve specified goals in
particular environments.

Efficiency: The level of use of resources (usually time) for reaching a specified goal.

Satisfaction: According to ISO 9241-11, satisfaction is “freedom from discomfort and positive attitudes
toward the use of the product” (ISO, 1998). Satisfaction as a dimension of usability can encompass many
users’ perceptions, attitudes, and feelings regarding the interaction with the product. Satisfaction with the
interaction can be even more important than performance. Children are much more likely to be motivated to
use a product again if they get some satisfaction from it rather than if they obtain some measurable
performance gains.

Usefulness: The degree to which a product provides benefits to users or helps them address their needs
and goals. Usefulness concerns the supported functionality and how relevant and important this is for the
users. The usability of a product directly affects how useful it is perceived to be (see Davis, 1989).

Learnability: How easy it is to reach a certain level of competence with a product; often the critical level is
the point at which the user can use the product effectively without help. Contrary to adult workers, children
cannot be assumed to follow a training manual, though in some cases—for example, classroom products—
an instructor may be at hand to help.

Fun: This is hard to define! Carroll (2004) summarizes it well: “Things are fun when they attract, capture,
and hold our attention by provoking new or unusual emotions in contexts that typically arouse none, or [by]
arousing emotions not typically aroused in a given context.” It could be added that the emotions should be
happy ones. Carroll isn't writing about children here, and children may not experience fun in exactly the
same way. Draper (1999) suggests that fun is associated with playing for pleasure and that activities should
be done for their own sake through freedom of choice. In other words, “fun” is not goal related, whereas
“satisfaction” is.

Accessibility: The extent to which a product can be accessed by the full range of potential users in the full
range of potential situations of use. The word accessibility is frequently used in the context of users with
disabilities and whether they are capable of using a product.

Safety: Where children are concerned, two aspects of product safety are important: whether a product has
sharp edges or poses an electrical hazard, and whether interactive products, such as websites, or
communication tools, such as cell phones, expose children to inappropriate Web content or lead them to
give personal information to strangers.
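The effectiveness and efficiency criteria above are often operationalized numerically. The following is a minimal sketch under common operationalizations (task completion rate for effectiveness, effectiveness per unit of time for efficiency); these particular formulas and the session figures are illustrative, not mandated by ISO 9241-11:

```python
# Hedged sketch: one common way to turn the effectiveness and
# efficiency criteria into numbers from (hypothetical) task logs.

def effectiveness(completed_tasks, total_tasks):
    """Effectiveness as task completion rate (a common operationalization)."""
    return completed_tasks / total_tasks

def efficiency(eff, time_seconds):
    """Efficiency as effectiveness achieved per unit of time spent."""
    return eff / time_seconds

# Hypothetical session: a child completes 4 of 5 test tasks in 300 seconds.
eff = effectiveness(4, 5)
print(f"effectiveness: {eff:.2f}")
print(f"efficiency: {efficiency(eff, 300):.4f} per second")
```

Such figures only become meaningful when compared across prototypes, user groups, or benchmark products.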

Choosing Evaluation Methods

Suppose that an evaluator wants to compare two design prototypes for an interactive toy for 5- to 6-year-
olds to determine which one is more fun. Three approaches can be used here:
■ Do an expert assessment of the two designs, examining whether the designs follow the rules for a “fun”
product. A suitable method would be an inspection method (see Chapter 15). Unfortunately, appropriate
rules or methods for evaluation may not exist or may not work well for the criterion of interest; testing the
product with children may be the only recourse.
■ Look for “signs” of having or not having fun in a sample of users. With children, such signs might be
laughing and smiling or, conversely, being easily distracted, showing reluctance to continue, or taking frequent
breaks. One could do an observation study (see Chapter 10) in which the number of smiles is counted or
what the children say while using the product is noted.
■ Consider the “symptoms” of having or not having fun. Some, such as pain, cannot be observed. So a
candidate method for assessing fun is asking children which design they think is more fun, using a suitable
survey method (see Chapter 13).
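The second approach, observing signs of fun, ultimately reduces to coding and tallying observed events per design. The sketch below illustrates such a tally with hypothetical event names and log data; the coding scheme is illustrative, not a standard one:

```python
# Hedged sketch: tallying coded "signs of fun" per prototype from a
# (hypothetical) observation log. Event names and data are illustrative.
from collections import Counter

# Each record: (prototype, observed event)
observation_log = [
    ("A", "smile"), ("A", "laugh"), ("A", "smile"),
    ("B", "smile"), ("B", "distracted"), ("B", "asks_to_stop"),
]

POSITIVE = {"smile", "laugh"}          # signs of having fun
NEGATIVE = {"distracted", "asks_to_stop"}  # signs of not having fun

def tally(log):
    counts = {}
    for prototype, event in log:
        c = counts.setdefault(prototype, Counter())
        if event in POSITIVE:
            c["positive"] += 1
        elif event in NEGATIVE:
            c["negative"] += 1
    return counts

for prototype, c in sorted(tally(observation_log).items()):
    print(prototype, dict(c))
```

A tally like this supports the comparison between designs but still needs triangulation with, for example, survey data before drawing conclusions.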

In the chapters that follow, several variants of these three approaches will be discussed, each with its
strengths and weaknesses. In general, it is wise to combine more than one of these approaches,
triangulating the results. Triangulation may help corroborate conclusions. Potential differences between
what is found with each method can help evaluators to be more critical of their findings and get a deeper
understanding of how the product is experienced by its users.

Triangulate at all levels. Combine methods; involve different children, different test administrators, and
different tasks; and compare your product against others.

Reliability of Evaluation Results

Compare the following statements:
Three out of four children preferred the stylus and seemed also to be consistently faster with it when
compared to the speech input.

The average time for entering an age-appropriate text presented to 10-year-olds on a piece of paper is
significantly higher when they use the keyboard, compared to spoken text input.

The first statement describes a small-scale evaluation in which the numbers are clearly too small to allow
generalizations. The second statement suggests higher reliability for the results. It indicates that a
statistical analysis has been carried out and that the observed speed difference between keyboard and
speech input is not considered to be a chance occurrence.
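The kind of analysis behind the second statement can be sketched as a paired t-test on the text-entry times of the same children under both input conditions. The times below are hypothetical, and a paired t-test is only one of several analyses that could support such a claim:

```python
# Hedged sketch: paired t-test on (hypothetical) text-entry times, in
# seconds, for the same eight children using keyboard vs. speech input.
from math import sqrt
from statistics import mean, stdev

keyboard = [95, 102, 88, 110, 97, 105, 92, 99]
speech   = [78, 85, 90, 83, 76, 88, 81, 80]

diffs = [k - s for k, s in zip(keyboard, speech)]
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / sqrt(n))   # paired t statistic, df = n - 1

# Compare |t| against the two-tailed 5% critical value for df = 7 (~2.365).
print(f"t = {t:.2f}, significant at 5%: {abs(t) > 2.365}")
```

When |t| exceeds the critical value, the speed difference is unlikely to be a chance occurrence at the 5% level, which is what licenses a claim of a "significant" difference.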
Reliability is important because design teams must be assured that when effort is invested in modifying a
product, it is because more than just the test participants are likely to benefit from the changes. Any study that
involves people is likely to be affected by random errors in the measurements. This is an inherent part of
any type of measurement and can’t be avoided entirely.

Some early qualitative input by a few children may have much more impact on the quality of the product
than a large, formal evaluation at later stages. When planning an evaluation, you must weigh how important
it is that your findings are reliable in the preceding sense. Ideally, one would want to have certainty for every
single design decision made in the type of design process described in Chapter 3. In practice, this is usually
not possible, and you must spread the resources available for evaluation over several iterations of the
product design and over several design choices. As an evaluator, it is important not only that you are able
to collect and then analyze data, whether qualitative or quantitative, but also that you can convey how
reliable your data and analyses are and how your audience should use the results. These issues are
discussed further in Chapter 8.

Field versus Lab: More Than Just a Location!

An important decision that must be made when planning an evaluation, particularly when children are
involved, is where the evaluation should take place (see Figure 5.2). Typically, this means choosing
between testing in either a laboratory or a field setting. Large organizations and research establishments
have specialized usability testing laboratories (see Chapter 9). Testing in the field means testing in the
actual physical and social context where children are likely to encounter the technology evaluated, such as
at home or school.

Control and Realism

Choosing to test in the field or in the lab is more than a choice of location. A better way to think about it is as
a trade-off regarding the degree of control the evaluator has in the testing situation versus the realism of the
evaluation. Control might refer to environmental factors such as noise and lighting but also to practical
things like knowing that the test application will work as expected and using a familiar technical
infrastructure that works consistently and reliably for all participants. Control can also refer to social context.
Evaluation sessions may lead to completely different results when a child’s friends or parents are present.
Laboratory testing can help in controlling the context and knowing exactly who will be present, ensuring this
will not distort the results. More control also means a more rigorous process and a more formal setup.
Giving predefined test tasks and even simulating intermediate results are ways that an evaluator might
enhance control.

Evaluating in a Laboratory
When an installation can be tested only in a laboratory, designers should be cognizant of the physical and
social contexts of the lab environment, as it might influence the children’s experiences with the product.
Scorpiodrome (Metaxas et al., 2005) is a mixed-reality installation in which children construct a physical
landscape on which they drive remote-controlled cars (scorpios). The installation combines the advantages
of playing with toy cars and construction kits with those of playing a video game. The design team’s goal was to
compare alternative design options by measuring the level of engagement and fun that the children experienced.

As it turned out, the children became so animated, excited, and engaged that any option presented to them
would not have made a difference. In response, the designers attempted to degrade the game to evaluate
single features, but to no avail. It was clear that the evaluation setup could not separate the impact of the
design options from the excitement of coming to the lab and the novelty of taking on the role of a test
participant. A possible remedy might be to have the same children participate in repeated sessions so that the
“newness” of both the lab and the designers eventually wears off. Only then will comparisons between design
options yield useful results.

Evaluating in the Field
Testing in the field means relinquishing control to ensure that conditions of actual use (relating to social
context, equipment used, and physical environment) are reflected as far as possible. This realism helps us
understand how a product is used in the unpredictable situations in real life, lending ecological validity to the
evaluation results so that conclusions are likely to hold for the intended context of use rather than just for
the protected environment of the laboratory.

The control–realism trade-off can be better understood as a continuum rather than a binary choice. Factors
that influence this trade-off are the goals set for the evaluation, the maturity and robustness of the
technology that is being tested, and other practical constraints—for example, children may have to be at
school during working hours or they live far away from the lab.

Figure 5.3 Children building the landscape and driving the scorpios during an evaluation session of the Scorpiodrome in the KidsLab.