
Springer Series on Bio- and Neurosystems 12

Céline Jost · Brigitte Le Pévédic ·
Tony Belpaeme · Cindy Bethel ·
Dimitrios Chrysostomou ·
Nigel Crook · Marine Grandgeorge ·
Nicole Mirnig   Editors

Human-Robot
Interaction
Evaluation Methods and Their
Standardization
Springer Series on Bio- and Neurosystems

Volume 12

Series Editor
Nikola Kasabov, Knowledge Engineering and Discovery Research Institute,
Auckland University of Technology, Penrose, New Zealand
The Springer Series on Bio- and Neurosystems publishes fundamental principles
and state-of-the-art research at the intersection of biology, neuroscience, informa-
tion processing and the engineering sciences. The series covers general informatics
methods and techniques, together with their use to answer biological or medical
questions. Of interest are both basics and new developments on traditional methods
such as machine learning, artificial neural networks, statistical methods, nonlinear
dynamics, information processing methods, and image and signal processing. New
findings in biology and neuroscience obtained through informatics and engineering
methods, topics in systems biology, medicine, neuroscience and ecology, as well as
engineering applications such as robotic rehabilitation, health information tech-
nologies, and many more, are also examined. The main target group includes
informaticians and engineers interested in biology, neuroscience and medicine, as
well as biologists and neuroscientists using computational and engineering tools.
Volumes published in the series include monographs, edited volumes, and selected
conference proceedings. Books purposely devoted to supporting education at the
graduate and post-graduate levels in bio- and neuroinformatics, computational
biology and neuroscience, systems biology, systems neuroscience and other related
areas are of particular interest.
The books of the series are submitted for indexing to Web of Science.

More information about this series at http://www.springer.com/series/15821


Céline Jost · Brigitte Le Pévédic · Tony Belpaeme · Cindy Bethel ·
Dimitrios Chrysostomou · Nigel Crook · Marine Grandgeorge · Nicole Mirnig
Editors

Human-Robot Interaction
Evaluation Methods and Their Standardization
Editors

Céline Jost
University Paris 8
Saint-Denis, France

Brigitte Le Pévédic
IUT de Vannes—Département STID
University of South Brittany
Vannes, France

Tony Belpaeme
IDLab—imec—ELIS
Ghent, Belgium

Cindy Bethel
Social, Therapeutic, and Robotic Systems Lab
Mississippi State University
Mississippi State, MS, USA

Dimitrios Chrysostomou
Robotics and Automation Group
Aalborg University
Aalborg, Denmark

Nigel Crook
Research and Knowledge Exchange
Faculty of Technology, Design and Environment
Oxford Brookes University
Oxford, UK

Marine Grandgeorge
Laboratoire Ethologie Animale et Humaine
Université Rennes 1
Paimpont, France

Nicole Mirnig
Center for Human-Computer Interaction
University of Salzburg
Salzburg, Austria

ISSN 2520-8535 ISSN 2520-8543 (electronic)
Springer Series on Bio- and Neurosystems
ISBN 978-3-030-42306-3 ISBN 978-3-030-42307-0 (eBook)
https://doi.org/10.1007/978-3-030-42307-0
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword

Robotic agents and the domains we spontaneously associate with them (AI, robotics,
mechanics, computer science, etc.) are at the heart of a growing number of debates
in our society, as the industrial, economic and technological challenges are crucial.
However, essential as these stakes are, the current or future presence of these
robotic agents in all our living spaces (e.g. industry, hospitals, institutions for
elderly people, at home, at school) raises questions related to human factors: what
interactions will we have with them? What are the impacts on our behaviours and
our activities? What ethical dimensions must be taken into account? How do we
evaluate in order to design better?
Beyond extreme opinions and positions (technophobia versus technophilia), it is
therefore necessary to study and, above all, to anticipate expectations, uses, mental
representations and human behaviours during HRI (Human-Robot Interaction).
Indeed, while some technological barriers still limit robot design, the real obstacles
are human and societal: acceptability, trust, emotional relationships, impact on
performance and behaviour, and impact on organisations. It is these human
dimensions that this book proposes to address, through chapters produced by
experts in these different domains.
Concretely, the introduction of robots into our homes or our institutions raises
above all human and ethical questions, since it profoundly modifies our relationships
not only with objects but also with other individuals: these robots are supposed to
provide monitoring functions (e.g. fall detection), cognitive stimulation functions
(e.g. offering and initiating interactions), social functions (e.g. preventing social
isolation) or play functions. But how can these functions be reconciled with respect
for the human individual? In addition, how do these robots affect the relationships
between individuals in these different environments?
Robots are not only technical objects; they are socio-technical objects in the sense
that, on the one hand, they influence the relationships between other agents in the
environment (whether these agents are humans or other socio-technical objects)
and, on the other hand, are themselves modified by these same relationships. In
other words, robots first and foremost raise human questions. Moreover, the
interactions we have or will have with robots raise questions mainly in the case of
so-called “social”, assistance or “companion” robots, i.e. robots with which physical
proximity will be strong, exchanges frequent and interdependencies important. The
replacement of humans by robots in tasks considered dangerous or in extreme
environments is commonly accepted (e.g. industrial robots, robots for underwater or
space exploration); their presence in our daily and “intimate” living spaces is much
less accepted a priori.
The design of robots is crucial since it largely determines the acceptance and the
interactions that will develop. The evaluation of interactions is equally decisive, and
several chapters are devoted to the methodological aspects of HRI.
Recommendations and concrete approaches are also proposed to help researchers,
designers and decision-makers take human factors into account from the early
stages of robot agent design. Whether it concerns the collection and analysis of
quantitative or qualitative data, in an ecological or laboratory context, behavioural
or attitudinal, this book provides valuable insights for designing evaluation
protocols and methods that will finally produce objective, verifiable, reproducible,
in other words scientific, data.
The User Experience (UX) domain is particularly relevant for designing
interactions between robotic agents and human agents, because it proposes to take
into account and anticipate the experiences (real or imagined) that individuals have
with technical objects such as robots. Similarly, as some chapters of this book
demonstrate, ethology and ethnography are disciplines that need to be reintroduced
into robotic systems research and development projects, since they provide not only
theoretical concepts to better understand what is at stake in interactions between
humans and robots, but also tools and techniques for analyzing highly relevant
situations.
The field of HRI (Human-Robot Interaction) therefore raises questions related to
human nature and its relationship with specific objects (here, robots). Some ques-
tions still seem to be science fiction. For example, can we collaborate with a robot,
or even be in a hierarchical relationship in which we must obey a robot? Likewise,
can one love a robot or be loved by a robot? Finally, can we adopt or be adopted by
a robot? The common point of these questions is that they are less technical than
human (psychological, ethological, ethical, sociological, legal, etc.).
This book proposes, on the basis of proven theories and methodologies, to address
these questions in order to anticipate our future.

Lorraine, France
September 2019

Jérôme Dinet
Acknowledgements

We thank Leontina Di Cecco, who has supported us since the very first idea for this
book. She was immediately interested in this project, and since 2014 she has given
us advice and her full support throughout the writing of this book.
We thank the entire Springer team for their interest in this project, their help,
kindness, professionalism, support, and patience. This book exists thanks to all
of them.
We thank all the authors for their important engagement with this book. All
of them have given their best to provide high-quality contributions, in spite of the
absence of specific funding to do this job and in spite of their busy lives. Each
author is indispensable to this book and we are grateful that they trusted in this
project and were patient until its publication.
We thank Aurélie Clodic, CNRS researcher at the LAAS laboratory, who urged
us to organize the first workshop which was the trigger of all this work and sup-
ported us during all these years.
We thank the workshop chairs of the conferences that hosted our three
workshops: Ginevra Castellano and François Ferland for ICSR 2015 in Paris;
Fulvio Mastrogiovanni for RO-MAN 2016 and RO-MAN 2017, who has also
followed this work and the book; and Diego Faria for RO-MAN 2017.
were essential for the authors to meet each other, to grow our team, to start
reflections, to understand the main concerns and issues, to delineate the context, and
in one word to prepare the basis for this book.
We thank Jérôme Dinet, who agreed to write the foreword of this book,
demonstrating his interest in it in spite of a busy schedule.
We thank all the workshop participants and supporters, who are far more
numerous than the contributors to this book. Many researchers are
interested in this project and have contributed to the reflections that were essential
for the compilation of this book.
We thank our institutions and our colleagues who supported us in this project,
who gave us advice, who asked questions, who showed their interest, and who
answered our questions.

Introduction and Book Overview

We are very pleased to introduce this work, which is the fruit of five years of
exceptional collaborative work. We did not receive specific funding, and each
author contributed voluntarily, with a real passion for our cause. For researchers, is
there a nobler cause than scientific rigor?
This work originates from a series of workshops, which were rich in encounters
and open discussions. Our adventure began in August 2014 during the RO-MAN
conference. Céline Jost, Marine Grandgeorge, and Brigitte Le Pévédic presented a
paper that received an award for their multidisciplinary work in designing an
evaluation for Human-Robot Interaction (HRI). It was from this common interest
that they started to exchange with Tony Belpaeme and Cindy Bethel. As chance
would have it, Leontina Di Cecco from Springer was attending the conference as an
exhibitor with books published by Springer. A few discussions later, the project was
born and the organizers’ core group was formed. We quickly started the first phase
of this work and organized workshops in order to
assemble a larger group of researchers, a community that is still growing. Through
these workshops, we met researchers who were also organizing workshops on the
same theme of standardizing metrics and measures for HRI. Our communities
joined together to form an even larger group of members with this area of interest.
To clearly understand the history and evolution of this book, Initial Context sums
up the problems encountered in 2014. Our Workshop Series and Related
Community presents the workshop series we organized as well as the people who
contributed to this work. Structure of the Book explains how the book is organized
and what the reader can find within it. Book Overview provides a summary of each
chapter.


Initial Context

This part presents the evolution of contributions over time, focusing on two periods.
During the first period, researchers pointed out methodological issues and current
practice mistakes, while during the second period, they proposed new evaluation
methods.

First Period: Researchers Pointed Out Practice Mistakes

In the beginning, robots were industrial machines built in order to perform tasks in
place of humans. However, with the emergence of technology, they rapidly offered
perspectives related to humans’ daily life as service robots and then companion
robots. Humans took up more space in this new relationship, and researchers began
to consider them more as partners for robots.
Scholtz [1] indicated, in 2002, that evaluation issues arise from a mismatch
between the system’s representation and the user’s expectations. She pointed out
the need for robots to respect social norms, to be understood by people, to act in a
coherent manner. She proposed to evaluate the limits of humans’ acceptance of
robots and to make the results publicly available so that other researchers would be
able to reproduce experimental designs and compare results.
At the same time, Dautenhahn and Werry [2] wondered how best to analyze
human-robot interactions knowing that questionnaires and interviews cannot be
applied in contexts where robots do not have an explicit task to do. They proposed
an analysis technique “inspired by a technique used in psychology that is based on
‘micro-behaviours’.” They highlighted the importance of objective measures to
avoid biases related to influences of the experimenters or participant’s attitudes or
expectations of the study outcomes.
The importance of observation was also stressed by Sabanovic et al. [3] in 2006.
They argued that robots should be observed, objectively and analytically in
real-world environments, as it is not possible to obtain consistent results in the
laboratory. They stressed the importance of testing the interaction with untrained
participants because their knowledge, history, and life may influence results.
The phrase “human in the loop” became obvious to researchers from various
fields. In 2008, Tsui et al. [4] argued that focusing on a human’s performance makes
it possible to correlate the performance of the system with that of the human.
However, they pointed out the need to choose appropriate performance measures,
introducing an interdisciplinary approach: in their view, choosing suitable
performance measures requires consulting a specialist in the field of evaluation.
Thus, the validity of these studies rapidly became the main concern, and some
researchers started to criticize the practices of other researchers. For example, in
2009, Syrdal et al. [5] analyzed the use of the Negative Attitudes Towards Robots
Scale (NARS). They pointed out the danger of using standardized questionnaires
for cross-cultural evaluations in different languages (e.g., translation may not
reproduce the original purpose of the questions). The use of questionnaires for
people of a different culture (from the original one) requires revalidating the
questionnaire, which is a long and complex task. As researchers needed common
methods to compare results, some of them used standardized questionnaires, but
sometimes in a way that distorted their original aim and called validity into
question. Six years later, this problem still existed, as Weiss and Bartneck [6]
reached the same conclusion when they analyzed the use of the standardized
Godspeed questionnaire, one of those most frequently used by HRI researchers.
Indeed, some researchers made modifications without revalidating their new
version, leading to corrupted data. Weiss
and Bartneck [6] proposed to combine questionnaire results with objective data
(e.g., behavioral measurements) to obtain valid results.
Finally, in 2011, Young et al. [7] published a new point of view indicating that
the nature of robots is complex and must therefore be analyzed as a whole. They
criticized reductionism, which assumes that a phenomenon can be divided into
parts, and focused instead on holism, which holds that systems and their properties
should be viewed as wholes rather than as collections of parts, the whole being
greater than the sum of its parts. Robots are seen as “an active physical
and social player in our everyday world.” They argued that HCI (Human-Computer
Interaction) methods are not applicable to HRI because of the complex nature of
robots. Moreover, “there is a need for structures and methodologies that aid eval-
uators in applying specific techniques such as the ones outlined above to the
evaluation of social interaction experiences with robots.” Thus, they proposed a
new approach to evaluate this complex phenomenon.

Second Period: Researchers Made Proposals

Other researchers proposed new evaluation methods adapted to HRI. For example,
in 2010, Bethel and Murphy [8] in their “Review of Human Studies Methods in
HRI and Recommendations” pointed out that “standards need to be established for
conducting reliable and quality studies where methods of measurement can be
validated for use by the HRI community.” As evaluation experts, they presented all
the terminologies required for human studies such as alpha level, between-subjects
design, within-subjects design, conditions, counterbalance, and so on. They also
presented factors that must be taken into consideration such as type of study,
number and quality of participants, type of robot, and so on. Sharing the idea
supported by Weiss and Bartneck [6] concerning the use of several evaluation
methods, Bethel and Murphy [8] indicated that the more different methods were
used, the more robust the study would be, citing five possible methods:
self-assessment, observational or behavioral measures, psychophysiology mea-
surements, interviews, and task performance metrics. They concluded by advo-
cating important recommendations that would allow researchers to design robust
studies. Their paper is a reference document and an important contribution because
the authors inform HRI researchers that, in fact, evaluation methods applicable to
HRI do exist.
However, in spite of this contribution, it seems that the HRI community still
needs further information about evaluation designs. For example, in 2015, Sim and
Loo [9] reviewed evaluation methods and provided recommendations for designing
experiments. They proposed possible hybrids of HRI methods or combinations
of HRI methods to be applied according to the research questions. The same year,
Xu et al. [10] highlighted methodological issues in scenario-based evaluation,
arguing that the characteristics of scenario media influence users’ acceptance of
robots and affect their attitudes, and that HCI methods may not be applicable to
HRI. They proposed five guidelines to help choose the scenario medium that best
suits the evaluation purpose, knowing that the medium could induce an important
bias. Also in 2015, Seo et al. [11] argued that a virtual robot does not
induce the same effect as a real robot because humans may empathize with a
physical robot more than with a simulated robot. Their contribution resulted in
proposing a reproducible HRI experimental design (to induce empathy toward
robots in the laboratory) and an empathy measuring instrument that would allow
researchers to reproduce this result.

Initial Motivation for This Work

The contributions discussed above show that evaluations have a crucial role to
validate research results and that acquiring standardized methods is necessary to
obtain a background and knowledge common to all HRI researchers.
The main concern is the validity of results, which is obtainable if, and only if,
studies are exemplary. To help design studies, researchers are warned about
recurrent biases: studies in the laboratory versus in a real-life context, participants
recruited from the university community versus from a more general population,
size of participant sample, the need to compare the experimental condition with
other condition(s), robots controlled by Wizard of Oz versus autonomous robots.
Each of these choices influences the results and can introduce bias.
Of course, not all biases can be avoided. However, they must be identified and
controlled because they may influence the results directly. For example, Weiss and
Bartneck [6] indicated that Wizard of Oz may be a problem as “participants actually
measure the intelligence of the human operator instead of measuring the perceived
intelligence of the robot.” In the same way, the fact that the experimenter stays with
the participant during an evaluation, thus being a bystander, may influence the
results because the interaction is not really dyadic [1] as it involves an audience
effect [12]. Biases can also emerge because some standardized questionnaires are
reused for people from different cultures, in other languages and in unpredicted
contexts [5, 6]. For further explanations concerning factors to take into consider-
ation when planning and designing a study, please refer to Bethel and Murphy’s [8]
reference paper, which is the basis for all experimentation and the starting point for
this book.

Our Workshop Series and Related Community

EMSHRI 2015

The first workshop was held in conjunction with the Seventh International
Conference on Social Robotics (ICSR) in Paris (France) on October 26th, 2015.
It was organized by the seven people presented in Table 1.

Table 1 EMSHRI 2015 committee

Organizing committee:
• Céline Jost, University of Paris 8
• Marine Grandgeorge, University of Rennes I
• Pierre De Loor, University of Occidental Brittany
• Tony Belpaeme, Plymouth University

Program committee:
• Brigitte Le Pévédic, University of South Brittany
• Virginie Demulier, University of South Paris
• Kerstin Dautenhahn, University of Hertfordshire

The objective of this first workshop was to answer the following questions:
• Q1: Which methodology from Human-Human Interaction and Human-Animal
Interaction can be applicable to Human-Robot Interaction?
• Q2: Which are good or bad practices?
• Q3: Which common mistakes or biases should be avoided when designing an
evaluation, whatever the partners studied?
The workshop featured the following five presentations:
• Ethology and Human-Machine Interaction: Does it fit?
• Objectivity and Human-Machine Interaction: Does it fit?
• Evaluation Good Practices
• Evaluation Methods Survey
• Interpersonal Synchrony in Human-Robot Interaction.

1. https://sites.google.com/site/emshri2015/.

EMSHRI 2016

The second workshop was held in conjunction with the 25th IEEE International
Symposium on Robot and Human Interactive Communication (RO-MAN 2016) in
New York City (USA) on August 26th, 2016.
It was organized by the 12 people presented in Table 2; most of the new committee
members had taken part in EMSHRI 2015.

Table 2 EMSHRI 2016 committee

Organizing committee:
• Céline Jost, Paris 8 University, France
• Tony Belpaeme, Plymouth University, UK
• Marine Grandgeorge, University of Rennes I, France
• Brigitte Le Pévédic, University of South Brittany, France
• Nicole Mirnig, University of Salzburg, Austria

Program committee:
• Kim Baraka, Carnegie Mellon University, USA
• Matthieu Courgeon, ENIB, France
• Nigel Crook, Oxford Brookes University, UK
• Pierre De Loor, ENIB, France
• Alexandre Kabil, INSERM, France
• Eleuda Nunez, University of Tsukuba, Japan
• Franz Werner, Vienna University of Technology and “raltec”, Austria

The objective of this second workshop was to understand how to design
evaluations in order to avoid biases and to ensure valid results. The workshop
aimed at exploring the methods used in existing studies in order to know which
methods fit which scientific questions. It also aimed at extending knowledge about
good and bad practices, and at elaborating recommendations and guidelines in
collaboration with participants of the first workshop.
The workshop featured the following five presentations:
• Introduction and Feedback about the Previous EMSHRI Workshop
• Feedback about a Related Workshop: Towards Standardized Experiments in
Human Robot Interactions
• Lessons Learned from Human-Robot Interaction Evaluations for Different
Applications
• Interpreting Survey Items Exploratory Factor Analysis
• Ethnographic Methods to Study Human-Robot Interaction: Experiences in the
Field.

2. https://sites.google.com/site/emshri2016/.

EMSHRI 2017

The third workshop was held in conjunction with the 26th IEEE International
Symposium on Robot and Human Interactive Communication (RO-MAN 2017) in
Lisbon (Portugal) on August 28th, 2017.
It was organized by the 17 people presented in Table 3; most of the new committee
members had taken part in EMSHRI 2016.

Table 3 EMSHRI 2017 committee

Organizing committee:
• Tony Belpaeme, Plymouth University, UK
• Cindy Bethel, Mississippi State University, USA
• Dimitrios Chrysostomou, Aalborg University, Denmark
• Nigel Crook, Oxford Brookes University, UK
• Marine Grandgeorge, University of Rennes I, France
• Céline Jost, Paris 8 University, France
• Brigitte Le Pévédic, University of South Brittany, France
• Nicole Mirnig, University of Salzburg, Austria

Program committee:
• Kim Baraka, Carnegie Mellon University, USA / Universidade de Lisboa, Portugal
• Ravi T Chadalavada, Chalmers University of Technology / Örebro, Sweden
• Shirley Elprama, Vrije Universiteit Brussel, Belgium
• Fulvio Mastrogiovanni, University of Genoa, Italy
• Renato Paredes, Pontificia Universidad Católica del Perú, Peru
• Karola Pitsch, University of Duisburg-Essen, Germany
• Matt Rueben, Oregon State University, USA
• Jivko Sinapov, University of Texas at Austin, USA
• Franz Werner, University of Applied Sciences of Wien, Austria

The objective of this third workshop was to answer the following questions:
• How to evaluate Human-Robot Interaction?
• How to ensure valid results and replicability?
• Which existing evaluation methods can/cannot be applied to HRI?
• Which protocols can/cannot be replicated?
• Which questionnaires can/cannot be reused?
• Which criteria are needed to evaluate HRI?
• Which criteria are needed to ensure valid results and replicability?
• Which rules can be established about statistical analyses?
The workshop featured the following five presentations:
• The Use of a Forensic Interview Approach Using Robots for Gathering
Sensitive Information from Children—Lessons Learned and Recommendations
• Evaluating the Moral Competence of Robots
• UX Evaluation of Social Robots with the USUS Goals Framework

• AMPH: A New Short Anthropomorphism Questionnaire
• Employing Nonparametric Statistics in Human-Robot Interaction Experimental
Studies: An Alternative Approach.

3. https://sites.google.com/view/emshri2017/.

We joined forces with a multidisciplinary group of researchers who had organized
nine international workshops focused on reproducible HRI experiments, presented
at the International Conference on Robotics and Automation (ICRA), the
International Conference on Intelligent Robots and Systems (IROS), the
International Conference on Human-Robot Interaction (HRI), and the European
Robotics Forum (ERF).

EMSHRI Community

Our three workshops involved a total of 53 participants from all over the world and
were a real success. Our community grew, and some researchers were inspired to
organize their own workshops. The community is now much larger, though
impossible to count, and we have observed a real passion for this topic for several
years.
This book is the result of the collaboration of 34 people from 10 countries, as
shown in Table 4: Austria, Belgium, Denmark, Finland, France, Peru, Portugal,
Sweden, the United Kingdom, and the USA. Note that one author has two
affiliations and is counted as 0.5 for each.
This book is also the result of a perfectly balanced collaboration between
“sciences for robots” and “sciences for humans”, as each group is composed of 17
people. Table 5 shows this representativeness (the first two rows correspond to
“sciences for robots”). The “computer science” category is not broken down further
because it is quite difficult to distinguish the disciplines composing computer
science nowadays; in all cases, researchers who claim an affiliation to HRI (and
who studied computer science) are in this category. This balanced distribution is
pure chance, as we did not try to obtain it.
Table 4 Contributors’ distribution by country


Country Number of participants %
Austria 2 6
Belgium 4 12
Denmark 4 12
Finland 1 3
France 8 23
Peru 2 6
Portugal 2.5 7
Sweden 4 12
United Kingdom 1 3
USA 5.5 16

Table 5 Contributors’ distribution by discipline


Discipline Number of participants %
Computer science (HRI) 14 41
Robotics 3 9
Psychology 5 14
Cognitive science 3 9
Anthropology 2 6
Ethology 2 6
Sociology 2 6
Ergonomics 1 3
Philosophy 1 3
User Experience 1 3

We are really pleased about this, because the majority of disciplines involved in
HRI are represented in this book, and we are convinced that the book is
representative of the community as a whole.

Structure of the Book

This book is a multidisciplinary work resulting from the collaboration of 34
researchers. It is not a compilation of conference proceedings, nor a collection of
the articles presented in the workshops, even if naturally a few presentations have
led to a chapter in this book. We built this work together, with total scientific
freedom, without editorial policies, and with the complete support of Springer. Our
objective was to collect opinions about the questions we raised during the EMSHRI
workshops. Some researchers wrote a chapter together although they had never
collaborated before. This book is thus a very large undertaking that kept us busy for
a year and a half. Together, we decided on the topics for this book, we defined the
common thread, and we worked for overall coherence. Each author wrote with
knowledge of the global project and of the planned chapters, allowing them to make
references to other chapters where required.
This book is organized into five parts.
General Context (composed of three chapters) was written by five authors with the
objective of presenting the general context, which is, as a reminder, “a human being
and a social robot are interacting with each other.” In this context, we are interested
in three elements: a human, a robot, and an interaction, and we question the
evaluation methods for this interaction. This part starts with the presentation of
humans and the associated communication (which is the basis of interaction),
follows with the presentation of robots and the associated challenges, and finishes
with an overview of current practices in HRI evaluations. At the end of this part, the
reader has a complete vision of what HRI is in our context and of the evaluation
methods currently used.
Methodologies to Design Evaluations (composed of three chapters) was written by
eight authors with the objective of presenting the good practices that allow
designing reliable evaluations whose results are robust. As a reminder, the literature
currently suffers from a large number of evaluations whose results are invalidated
because of bad practices. This part therefore offers guidelines for designing
evaluations, questionnaires, and qualitative interviews. At the end of this part, the
reader will understand how to design a reliable evaluation for HRI.
Some Standardization Proposals (composed of two chapters) was written by five
authors who propose standardizations. These proposals give rise to reflections,
exchanges, and debates, and the chapters open the way to new practices. At the end
of this part, the reader will have a more general vision of our evaluation problems
and will begin to perceive the extent of the problem and the related possibilities.
Disciplinary Points of View (composed of four chapters) was written by nine
authors with the collaboration of five experts, who discuss the methods used in
different disciplines to design HRI evaluations. First, three chapters present
methodologies used in User Experience Design, ethology, and ethnography.
Second, a fourth chapter presents the results of a qualitative survey conducted with
five experts belonging to Human-Technology Interaction, cognitive psychology,
sociology, ethology, and ergonomics. This part thus gathers points of view coming
from seven different disciplines. At the end of this part, the reader begins to form
her/his own point of view on the question of evaluation methods standardization.
At this point in the book, numerous debates are possible and welcome.
Last, Recommendations and Conclusions (composed of two chapters and
conclusions) was written by ten authors who give recommendations and their
opinions about evaluation methods standardization. As our work highlights the
misuse of statistics, the first chapter proposes to guide researchers in using statistics
adapted to our context. The second chapter gives invaluable feedback aimed at
advising uninitiated researchers. It brings an original point of view, dissecting
numerous mistakes that lead to invalidated results. With these many pieces of
advice, this chapter helps the reader avoid the recurrent mistakes often found in the
literature. This part ends with eight personal conclusions from the eight co-editors
of this book, allowing them to give their opinions, without censorship, on the
question of standardization. Last, a general conclusion summarizes the essential
points mentioned throughout the book.
The end of this introduction provides a summary sheet for each chapter, giving the
reader a synthetic view of it. Note that these sheets are subjective, as they represent
the point of view of the two authors of this introduction, and we apologize for any
possible misunderstanding or misinterpretation.
A careful reader will notice that the sum of distinct authors is 33. The 34th
researcher involved is the author of the foreword who, like each author, has
followed and participated in this project since the beginning. Other people were
involved in this work but were not able to contribute because of busy schedules,
and certainly not because of a lack of interest in our identified problem space!

Book Overview

General Context
Title Communication between humans:
Towards an Interdisciplinary Model of Intercomprehension
Author Marine Grandgeorge
Discipline Ethology
Pages 17
Keywords Communication, interaction, relationships, intercomprehension
Paper objective The objective is to explain what communication is and to propose an
interdisciplinary model of intercomprehension between individuals that
could be used to improve communication with robots
Strong idea The strong idea is that humans are constantly communicating,
consciously or unconsciously. Communication consists of cooperation
and co-construction
Paper overview First, the paper presents some theoretical models of communication and
their limits. This paper highlights the evolution of thought, and the needs
to include the social part, non-verbal communication, and
metacommunication in models
Second, this paper presents what verbal and non-verbal communication
is
Third, this paper presents factors that can modulate communication, for
example, the degree of knowledge between two individuals
(relationship), socio-cultural factors, emotions, multimodal and
multichannel communication
Last, the paper presents an intercomprehension model common to
humans, animals and machines, which aims at integrating all aspects of
communication and identity in a dynamic process, changing across time
Positioning about EMSHRI: This book deals with Human-Robot Interaction; thus,
our objective is first to define what interaction is. This first chapter focuses on
humans and introduces the complex process of communication between
individuals, which is the basis of interaction. This chapter is useful to understand
who humans are and what we have to know to build robots which can communicate
with them.

Title An extended framework for characterizing social robots
Author Kim Baraka, Patricia Alves-Oliveira and Tiago Ribeiro
Discipline Robotics, Psychology and Computer science
Pages 44
Keywords Human-Robot Interaction, framework, classification, social robots
Paper objective This chapter provides a broad-ranging overview of the main
characteristics that arise when one considers social robots and their
interactions with humans
Strong idea Robots are classified according to 7 dimensions: appearance, social
capabilities, purpose and application area, relational role, autonomy and
intelligence, proximity, and temporal profile
Paper overview This chapter introduces social robots which are the result of
multidisciplinary work. Then it summarizes some of the existing
classifications which inspired this work (Fong et al., Yanco et al., Shibata,
and Dautenhahn)
Then it presents a classification based on the 7 dimensions above:
Appearance: bio-inspired (human-inspired, animal inspired),
artifact-shaped, functional
Social capabilities: the depth of the robot’s actual social cognition
mechanisms, the human perception of the robot’s social aptitude
Purpose and application area: healthcare and therapy; industry;
education, entertainment and art; home and workplace; search and
rescue; public service; social sciences
Relational role: for you, as you, with you, around you, as part of you, as
if you
Autonomy and intelligence. Autonomy requires that the robot can learn
in order to be able to “operate in the tasks it was designed for without
external intervention”. Intelligence is defined by “the ability to
determine behavior that will maximize the likelihood of goal satisfaction
under dynamic and uncertain conditions […]”. The higher the autonomy
and intelligence is, the higher the complexity of the system is. A robot
which has intelligence and autonomy should possess the following
capabilities: perception of environment-related and human-related
factors, modeling of environment and human(s), planning actions to
interact with environment and human(s), executing plans under physical
and social constraints and learning through interaction with the
environment or humans
Proximity: remote, co-located, physical
Temporal profile: timespan (period of time in which the human is
exposed to the robot: short-term, medium-term, long-term and life-long),
duration (of each interaction session), frequency of interactions (from
very occasional to multiple times per day)
Then the chapter provides a brief discussion of design approaches for
social robots (human-centered design, robot-centered design, and
symbiotic design)
Positioning about EMSHRI: This second chapter is complementary to the first one.
It focuses on robots and explains what a robot is and what challenges are to be
faced.

Title A survey on current practices in user evaluation of companion robots
Author Franz Werner
Discipline Software science
Pages 24
Keywords Evaluation methods, companion robots, review, older people
Paper objective The objective of this paper is to provide a survey on current
methodologies and practices used to evaluate “companion robots” (here
homecare robots for the elderly)
Strong idea Almost 60% of the papers only partially describe the conducted
evaluations. Incomplete information prevents the reproduction of evaluations and
undermines their validity
Technical issues are very present in most evaluation phases and can lead to biases
Paper overview The paper begins with the state of the art about methodologies to
evaluate robots
The next section presents the methodology used to select the papers to
review among European projects and related institutions
The next section discusses evaluation methods according to the
technology readiness levels (proposed by NASA), which gives
discussions about: laboratory trials of the integrated prototype,
short-term user trials of the integrated prototype within realistic
environments, field trials in real environments. This section ends with a
discussion about evaluation aims and user groups
Then, the paper presents the identified methodological challenges such
as the lack of technical robustness and functionality of prototypes, the
difficulties in conducting user trials with the group of older users, the
lack of accepted methodologies, issues regarding long-term field trials,
further issues and concludes with limitations of this review
Positioning about EMSHRI: This chapter is complementary to the two introductory
chapters, as it presents the methodologies currently used by researchers to evaluate
robots. It focuses only on homecare robots for the elderly in order to compare the
same types of evaluations.

Methodologies to Design Evaluations


Title Conducting studies in human-robot interaction
Author Cindy L. Bethel, Zachary Henkel, Kenna Baugus
Discipline Computer science
Pages 34
Keywords HRI evaluation, sample size, evaluation methods, recommendations
Paper objective The objective of the paper is to explain how to design a reliable
evaluation and to give recommendations
Strong idea This paper gives a chronology of items required for planning, designing,
and executing human studies in HRI: type of study design, number of
groups, sample size, methods of evaluation, study location, type and
number of robots, other types of equipment, failures and contingencies,
study protocol and scripts, methods of recruiting participants, IRB/ethic
documents, participants, and conducting the study
Moreover, this paper highlights the need to have an appropriate sample
size that represents the population and the need to use three or more
methods of assessment to obtain reliable results
Paper overview This chapter first presents related work on experimental designs
and methods and provides the terminology related to evaluations
Then, it introduces the factors (named “items” in “Strong idea”) that need to be
considered when planning and designing an evaluation, and gives explanations for
each one
Next, the chapter presents an exemplar study which allows giving
examples for each item
Last, the conclusion provides some recommendations to assist
researchers with “the development of a comprehensive experimental
design that should provide successful study results”
Positioning about EMSHRI: This chapter lays the foundations for designing
evaluations. It contains the minimum knowledge needed to plan, design and execute
an evaluation. It is an invaluable help for researchers who are not evaluation
specialists.

Title Introduction to (Re)using questionnaires in human-robot interaction research
Author Matthew Rueben, Shirley A. Elprama, Dimitrios Chrysostomou, An Jacobs
Discipline Robotics, Computer science, Sociology
Pages 20
Keywords Questionnaire, standardization, process
Paper objective The objective of this paper is to discuss the standardization of
questionnaires and to give a methodology for creating questionnaires, which have to
be reliable and valid
Strong idea What is standardized is not the questionnaire itself, which cannot be
used in all evaluations, but the process for using questionnaires: formulating the
research question, identifying relevant concept(s), searching for relevant scale(s),
adapting scales if required, pilot testing scale(s) and validating scale(s)
Paper overview This chapter first explains what a questionnaire is
It then discusses the word “standardization” applied to questionnaires
The third section explains how to identify the concepts which need to be
measured depending on the research question
The fourth section discusses the research of relevant questionnaires with
three possibilities: use an existing questionnaire, modify a questionnaire
or create a new questionnaire
The fifth section provides the procedure for adapting questionnaires. It is
important to remember that “changing a questionnaire will affect its
validity and performance”
The sixth section discusses pilot testing questionnaires and the seventh
section discusses validating questionnaires (for them to be reliable and
valid)
The chapter ends with recommendations for further reading
Positioning about EMSHRI: This chapter proposes “a process for finding and using
the appropriate scales in a questionnaire for a study”. It is an important step toward
evaluation methods standardization.

Title Qualitative interview techniques for human-robot interactions
Author Cindy L. Bethel, Jessie E. Cossitt, Zachary Henkel, Kenna Baugus
Discipline Computer science
Pages 30
Keywords Qualitative data, structured interview, children, methods
Paper objective The objective of this chapter is to present the forensic interview
approach which “is a structured protocol that has been established for
obtaining critical information especially from children”. The target
population is children who “had experienced maltreatment or were
eyewitnesses to crimes”
Strong idea Children under the age of 11 can only partially understand vague and
abstract questions and labeled responses. Thus, questionnaires are not a good tool
for them
“The forensic interview approach is beneficial in obtaining a person’s
feelings about an experience and less likely to introduce confounds into
that process”
Interviews “can provide additional insights that may not be discerned
using other methods of evaluation”
Paper overview This chapter begins with the introduction of related work on qualitative
interviews (structured and semi-structured)
Then, the chapter presents the approach used to design the forensic
interview (introductory phase and guidelines, rapport building,
participants narrative practice, substantive disclosure interview phase,
cool-down/wrap-up phase)
The next section presents how to transcribe and code data (transcription
process, coding the transcribed data, coding written responses)
Last, this chapter gives the example of a qualitative interview study
Positioning about EMSHRI: This chapter proposes a process for designing and
conducting a semi-structured interview. It is an important step toward evaluation
methods standardization.

Some Standardization Proposals


Title Design and development of the USUS goals evaluation framework
Author Josefine Wallström and Jessica Lindblom
Discipline User Experience Design, Cognitive Science
Pages 25
Keywords User Experience (UX), UX goals, USUS Framework
Paper objective The objective of this paper is to present the USUS Goals evaluation
framework, derived from the USUS framework, which provides HRI
evaluation methods taking User Experience into account and adding UX
goals in the framework
Strong idea It is necessary to create a positive user experience (UX) for human users
who interact with robots to favor robots’ acceptance. Defining the UX
goals of users is fundamental when designing products, software and
services
UX Goals are transformed into UX measures which provide UX metrics
(which can be used for comparison)
Paper overview The chapter explains first what User Experience is, why it is important
and describes the UX design (UXD) lifecycle process (denoted as the
UX wheel). It also explains the importance of UX goals, which are
absent from the USUS Framework
The chapter then presents the method used to develop the USUS Goals
evaluation framework, inspired by Blandford and Green’s iterative
method development process: analysis (related work on existing
methods to evaluate UX), design and evaluation, implementation and
evaluation, results, and recommendations
Then the chapter presents the USUS Goals evaluation framework itself
The chapter ends with a discussion containing six recommendations
Positioning about EMSHRI: User Experience is a new field of research which needs
to be taken into account when designing an HRI evaluation, as a main objective for
robots is to avoid giving humans a bad experience and, at best, to maximize their
well-being. The USUS Goals evaluation framework is a standardization proposal.

Title Testing for ‘anthropomorphization’: A case for mixed methods in human-robot
interaction
Author Malene Flensborg Damholdt, Christina Vestergaard, Johana Seibt
Discipline Cognitive psychology, Anthropology, Philosophy
Pages 25
Keywords Anthropomorphism, methodologies for human-robot interaction, social
robots
Paper objective The objective of this chapter is to discuss evaluations of “social
robotics” in the context of “Human-Robot Interaction” concerning the
“tendency to anthropomorphize”
Strong idea This chapter proposes a new questionnaire, named AMPH, to assess the
tendency to “anthropomorphize”
The authors think that “HRI will become a transdiscipline in the long run”
Qualitative analysis should be added to quantitative analysis to improve our
understanding of HRI
The tendency to sociomorphize should be investigated, instead of only measuring
anthropomorphizing
Paper overview This chapter first presents questions raised by the nature of HRI
and by current methodological problems in HRI related to the notion of
“anthropomorphizing”
Then, it presents “the tools currently used in HRI research to assess tendencies to
anthropomorphize” (Godspeed Questionnaire Series and Individual Differences in
Anthropomorphism Questionnaire)
The next section presents the AMPH questionnaire with quantitative and
qualitative analysis of a pilot study
The chapter ends with a discussion widely focusing on the question of
anthropomorphizing versus sociomorphizing
Positioning about EMSHRI: This chapter is an attempt to propose a new
standardized questionnaire to evaluate the tendency toward anthropomorphization.
It is a first step in the very long effort to reach evaluation methods standardization
for HRI.

Disciplinary Points of View


Title Evaluating the user experience of human-robot interaction
Author Jessica Lindblom, Beatrice Alenljung and Erik Billing
Discipline Cognitive science, Computer science
Pages 26
Keywords User Experience, evaluation, methods
Paper objective The objective of this chapter is to introduce UX evaluation—UX
standing for User Experience—in order to facilitate the use of UX
evaluation methods in HRI
Strong idea “Positive user experience is necessary in order to achieve the intended
benefits and societal relevance of human-robot interaction.” A positive
user experience increases users’ acceptance
Paper overview This chapter first introduces motivations and objectives to design UX
evaluation
Then it introduces and discusses HRI
The third section introduces User Experience and User Experience
evaluation. It lists the existing evaluation methods, makes recommendations for
designing evaluations well, and explains how to define
UX goals. Then, it proposes a UX evaluation process (planning,
conducting, analyzing the data, and considering the obtained findings)
Positioning about EMSHRI: This chapter describes the methodologies used to
design UX evaluations of Human-Robot Interaction. This is a valuable point of view
on the evaluation of HRI, with a different approach which adds an element of
reflection towards the standardization of evaluation methods for HRI.

Title Evaluating human-robot interaction with ethology
Author Marine Grandgeorge
Discipline Ethology
Pages 12
Keywords Interaction, relationships, ethology, methods
Paper objective The objective of this paper is to propose that ethology could be
used to evaluate HRI, and to explain how. Ethology is the “scientific and
objective study of animal behavior, usually with a focus on behavior
under natural conditions, and viewing behavior as an evolutionarily
adaptive trait”
Strong idea Robots can be considered as “another entity with which we could
communicate”, which is studied by ethology
“Ethological concepts, methods, and analyses can be applied to HRI”
Ethological methods were necessary for some HRI researchers to affirm
or deny some hypotheses
Ethology and robotics mutually enhance each other
Ethology methods are made to evaluate interactions and relationships
over the long term in either natural or experimental settings without
being invasive
Ethologists are used to using methodologies from other fields to
complete their observations
Paper overview This chapter begins with an introduction to ethology, which is a
behavioral science. Ethologists are experts in evaluation
Then, it discusses the use of ethology for HRI answering the question:
can ethology form the basis of HRI evaluation?
The third section describes the methodology used in ethology: choosing the study
context, describing behavior, observing behavior, and analyzing and interpreting
data. For each behavior, ethology has to answer four kinds of questions, about
function, causation, development and evolution
The fourth section gives some examples of research using the
ethological approach
Positioning about EMSHRI: This chapter describes the ethological methods used to
design evaluations of Human-Robot Interaction. This is a valuable point of view on
the evaluation of HRI, with a different approach which adds an element of reflection
towards the standardization of evaluation methods for HRI.

Title Evaluating human-robot interaction with ethnography
Author An Jacobs, Shirley A. Elprama and Charlotte I. C. Jewell
Discipline Sociology, Computer science, Anthropology
Pages 18
Keywords Ethnography, Human-Robot Interaction, Qualitative Research Methods,
Observation, Interview
Paper objective The objective of this paper is to propose that ethnography could be
used to evaluate HRI, and to explain how. Ethnography is a research
process that aims to detail “knowledge of the multiple dimensions of life
within the studied milieu and aims to understand members’
taken-for-granted assumptions and roles”
Strong idea The term ethnography refers both to a research process and to a result
(a written text): doing ethnography produces an ethnography
It aims at discovering a reality that can be influenced by people and that differs
according to the context
It mainly uses observations and a qualitative approach
Qualitative research is an added value to the field of HRI
Ethnography “can help identify and address new ethical, legal and
societal issues in robot design and implementation”
Paper overview This chapter begins with a review of the current use of Ethnography in
the HRI community where it is shown that ethnography is rarely
mentioned in HRI research papers
The next section introduces Ethnography with a review of its history
(coming from anthropology), on its state of mind (coming from
positivism), and with a discussion on qualitative research (following four
quality criteria)
Then, the chapter describes methods to collect data in ethnography
(observation and interview)
The final section discusses current practices of reporting ethnographic
research in HRI
Positioning about This chapter describes the ethnographic methods used to design evaluations of
EMSHRI Human-Robot Interaction. This is a valuable point of view on the
evaluation of HRI with a different approach which adds an element of
reflection towards standardization of evaluation methods for HRI

Title Designing evaluations: Researchers’ insights interview of five experts


Author Céline Jost and Brigitte Le Pévédic
With the expertise of Sophie Lemonnier, Jérôme Michalon, Cédric
Sueur, Gérard Uzan and Iina Aaltonen
Discipline Computer science
With Cognitive psychology, sociology, ethology, ergonomics and
human-technology interaction
Pages 44
Keywords Evaluations, meta-evaluation, qualitative survey, disciplines, methods,
standardization, personal view
Paper objective The objective of this chapter is to qualitatively investigate how experts
in evaluation proceed to answer a research question. The focus
is on the methodology itself, not on the answer to the research question.
Experts are not expected to design an experimental protocol
Strong idea We observed that all experts followed the same approach to answer a
research question. First, they investigated the topic of the question in
order to obtain perfect knowledge about it. Second, they reformulated
the question with a lot of details in order to remove ambiguities (which
robot, which task, which criteria to observe…) and they chose an
experimental context that was in their area of expertise. Third, experts
thought about appropriate methodologies to answer the question and
about the appropriate metrics to produce. All the proposed
experimental settings were totally different and were in the scope of the
experts’ domain. In some cases, experts proposed to use existing
experimental settings or to slightly modify existing experimental
settings
Each expert had to think about three research questions, resulting in 15
proposed experimental settings. In all of them, experts proposed an
evaluation where it was required to compare several conditions
To conclude, we observed that a research question can be answered by
several different valid evaluations
Paper overview The introduction explains the motivations and methods used in the
evaluation conducted with 5 experts
Section 2 describes the survey protocol which was followed to recruit
and interview experts
Section 3 is dedicated to presenting the experts who collaborated in the
survey and their disciplines
Section 4 presents the first analysis of experts' answers, followed by
Section 5, which provides a first discussion
Section 6 presents the complementary survey which was conducted to
obtain better precision on the point of comparison: is there anything
other than comparison to evaluate results?
Section 7 proposes a more general discussion
Section 8 concludes the chapter. Section 9 is an annex that contains all
the answers provided by experts
Positioning about This paper is a meta-evaluation. It evaluates the process used by
EMSHRI researchers to design evaluations. It is a valuable but personal and
subjective point of view that provides important information about
evaluation methods standardization. It opens numerous debates

Recommendations and Conclusions


Title Experimental research methodology and statistics insights
Author Renato Paredes Venero and Alex Davila
Discipline Cognitive science, Psychology
Pages 21
Keywords Statistics, research methodology, non-parametric tests, experimental
designs, Human-Robot Interaction
Paper objective The objective of this chapter is to highlight common mistakes or misuses
in statistical analyses and to discuss the use of parametric vs
nonparametric analysis
Strong idea Working on the internal validity of experiments should be a priority
when designing an evaluation
Within-subjects designs should be preferred
It is important to understand when to employ parametric or
nonparametric statistics
Paper overview On the one hand, the introduction gives the basics of statistics and
explains specific vocabulary such as dependent/independent variables,
parametric/nonparametric test… On the other hand, it highlights some
common problems and shows that some empirical studies are not
designed rigorously or are not analyzed with the appropriate statistical
tools, which invalidates them
The next section discusses experiment requisites, experiment internal
and external validity, experimental designs, symmetrical distributions,
Likert items and scales, nonparametric statistics for factorial designs
Then, the chapter presents a simulation study to illustrate previous
discussions and show that nonparametric tests are better to determine
within-subjects differences than parametric tests when analyzing
Likert-scale responses (a minimal illustrative sketch of such a
comparison follows this entry)
Positioning about This chapter gives a focus on statistics which is one of the recurrent
EMSHRI problems in the literature. The authors give recommendations about the use
of statistics to evaluate HRI, which is really important when discussing
HRI evaluations. This chapter is written to explain the challenges to naïve
readers and requires the first two chapters to be fully understood
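
To make this kind of analysis concrete, here is a minimal, hypothetical sketch (not taken from the chapter's own simulation study) of how within-subjects Likert-style responses can be analyzed with a parametric paired t-test and with the nonparametric Wilcoxon signed-rank test. The simulated data, the sample size and the use of numpy and scipy.stats are assumptions made purely for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_participants = 30  # hypothetical sample size

# Simulated 5-point Likert answers for two within-subjects conditions,
# with a small shift in favor of condition B (purely illustrative data).
cond_a = rng.integers(1, 6, n_participants)
cond_b = np.clip(cond_a + rng.choice([0, 0, 1], n_participants), 1, 5)

t_stat, t_p = stats.ttest_rel(cond_a, cond_b)  # parametric test, assumes interval-scaled data
w_stat, w_p = stats.wilcoxon(cond_a, cond_b)   # nonparametric, rank-based alternative
print(f"paired t-test: t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Wilcoxon test: W = {w_stat:.1f}, p = {w_p:.3f}")

Running both tests on the same simulated answers is one simple way to see how conclusions may diverge when the interval-scale assumption behind the parametric test is questionable, which is the point the chapter's simulation study develops in detail.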

Title Advice to new human-robot interaction researchers


Author Tony Belpaeme
Discipline Computer science
Pages 15
Keywords HRI evaluations, common mistakes, bad practices, good practices,
recommendations
Paper objective The objective of this chapter is to explain how to avoid bad practices in
designing and conducting evaluations. It is complementary to the
previous one as it focuses on common mistakes and gives
recommendations to make reliable evaluations
Strong idea This paper has an original approach as it highlights common mistakes
made when designing and conducting evaluations. It gives pride of place
to errors, allowing the reader to learn from them and to avoid them in the
future
Paper overview “This chapter contains some of the most prevalent and fundamental
errors committed in HRI studies, and suggests potential solutions”
This chapter first presents the current practice in HRI studies (lab vs in
the wild, Wizard of Oz vs full autonomy, on-screen vs real robot,
convenience sampling vs representative sampling and single vs
long-term interaction)
Then it discusses the problems of using Null-Hypothesis Significance
Testing (NHST) to validate the sample while results are really unstable
Next, it presents alternatives to NHST in order to obtain reliable results,
highlighting that descriptive statistics are too often missing in papers and
that the chosen tests or presented results are sometimes inappropriate. It
discusses selective publication of data, the Hawthorne effect,
crowdsourcing data, the replication crisis, and short-term studies
Positioning about This chapter is very important because it gives experience to new
EMSHRI researchers so that they can avoid making common mistakes. Learning is
based on errors. Thus, analyzing errors gives us good knowledge about
evaluations. It is complementary to the previous one

Title Editors’ Personal conclusions


Author Tony Belpaeme, Cindy Bethel, Dimitrios Chrysostomou, Nigel Crook, Marine
Grandgeorge, Céline Jost, Brigitte Le Pévédic, and Nicole Mirnig
Discipline Computer science, Ethology
Pages 7

Title Book overview: Towards new perspectives


Author Céline Jost and Brigitte Le Pévédic
Discipline Computer science
Pages 7
Strong The work presented in this book raises more questions than it solves. The
idea reproducibility of an experiment can be ensured by following a rigorous
experimental protocol, thus having a scientific approach and choosing a method
adapted to answer the research question. But why do we need standardization? To
ensure reliable and valid results? That seems to be ensured by a rigorous
experimental protocol. To be able to compare evaluation results with each other (for
example to assess the effect of culture or of an evaluation context)? Why do that,
knowing the number of biases it will bring (different participants, different
contexts, different rooms, different experimenters…)? Do we really need to
compare several different evaluations with each other? And if it is really required, is
it enough to agree on the metrics/indicators of the evaluation?

Céline Jost
Brigitte Le Pévédic

Contents

General Context
Communication Between Humans: Towards an Interdisciplinary
Model of Intercomprehension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Marine Grandgeorge
An Extended Framework for Characterizing Social Robots . . . . . . . . . . 21
Kim Baraka, Patrícia Alves-Oliveira, and Tiago Ribeiro
A Survey on Current Practices in User Evaluation
of Companion Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Franz Werner

Methodologies to Design Evaluations


Conducting Studies in Human-Robot Interaction . . . . . . . . . . . . . . . . . . 91
Cindy L. Bethel, Zachary Henkel, and Kenna Baugus
Introduction to (Re)Using Questionnaires in Human-Robot
Interaction Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Matthew Rueben, Shirley A. Elprama, Dimitrios Chrysostomou,
and An Jacobs
Qualitative Interview Techniques for Human-Robot Interactions . . . . . 145
Cindy L. Bethel, Jessie E. Cossitt, Zachary Henkel, and Kenna Baugus

Some Standardization Proposals


Design and Development of the USUS Goals Evaluation
Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Josefine Wallström and Jessica Lindblom
Testing for ‘Anthropomorphization’: A Case for Mixed Methods
in Human-Robot Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
M. F. Damholdt, C. Vestergaard, and J. Seibt


Disciplinary Points of View


Evaluating the User Experience of Human–Robot Interaction . . . . . . . . 231
Jessica Lindblom, Beatrice Alenljung, and Erik Billing
Evaluating Human-Robot Interaction with Ethology . . . . . . . . . . . . . . . 257
Marine Grandgeorge
Evaluating Human-Robot Interaction with Ethnography . . . . . . . . . . . . 269
An Jacobs, Shirley A. Elprama, and Charlotte I. C. Jewell
Designing Evaluations: Researchers’ Insights Interview
of Five Experts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
Céline Jost and Brigitte Le Pévédic

Recommendations and Conclusions


Experimental Research Methodology and Statistics Insights . . . . . . . . . 333
Renato Paredes Venero and Alex Davila
Advice to New Human-Robot Interaction Researchers . . . . . . . . . . . . . . 355
Tony Belpaeme

Editors’ Personal Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371


Book Overview: Towards New Perspectives . . . . . . . . . . . . . . . . . . . . . . . 379
Editors and Contributors

About the Editors

Céline Jost is an Associate Professor in Computer
Science at Paris 8 University in France, working in the
CHArt laboratory for her research. She obtained her
Ph.D. in Computer Science from South Brittany
University (France). She was a Postdoctoral Researcher
at the National Engineering School of Brest (ENIB) in
France, working in the Lab-STICC laboratory.
She mostly conducts multidisciplinary research with
different disciplines, for which she received the
“RSJ/KROS Distinguished Interdisciplinary Research
Award” at RO-MAN 2014. She has co-organized
various conferences and workshops on Human-Robot
Interaction and Assistive technology for disabilities, and
is actively involved in the IFRATH Society (Federative
Institute for Research on Assistive Technology for
People with Disabilities).
She has also been involved in many research projects
funded by the French Research Agency and is currently
leading the EMSHRI project. She is also leading the
MemoRob project, which aims at studying the distractor
effect of a robot during learning tasks, and co-leading
the StimSense project, which aims at studying the
importance of multisensoriality in learning tasks, especially
during cognitive stimulation exercises.
Her research interests include natural interaction,
individualized interaction, multisensory interaction,
human-machine interaction, interaction paradigm, eval-
uation methods, cognitive ergonomics, serious games,
mulsemedia, artificial companion, disabilities, education, and cognitive stimulation.

Brigitte Le Pévédic is an Assistant Professor at the
University of South Brittany. She obtained her Ph.D. in
Natural Language Processing from the University of
Nantes and defended her Habilitation in November
2012 at the University of South Brittany. Her research
interests include Human-Computer Interaction, cogni-
tive assistive technologies, and multisensory interaction.

Tony Belpaeme is Professor at Ghent University and
Professor in Robotics and Cognitive Systems at the
University of Plymouth, UK. He received his Ph.D. in
Computer Science from the Vrije Universiteit Brussel
(VUB) and currently leads a team studying cognitive
robotics and human-robot interaction. He coordinated
the H2020 L2TOR project, studying how robots can be
used to support children with learning a second language,
and coordinated the FP7 ALIZ-E project, which studied
long-term human-robot interaction and its use in pedi-
atric applications. He worked on the FP7 DREAM
project, studying how robot therapy can be used for Autism
Spectrum Disorder. Starting from the premise that
intelligence is rooted in social interaction, Belpaeme
and his research team try to further the science and
technology behind artificial intelligence and social
human-robot interaction. This results in a spectrum of
results, from theoretical insights to practical applications.

Cindy Bethel, Ph.D. (IEEE and ACM Senior
Member) is a Professor in the Computer Science and
Engineering Department and holds the Billie J. Ball
Endowed Professorship in Engineering at Mississippi
State University (MSU). She is the 2019 U.S. Fulbright
Senior Scholar at the University of Technology Sydney.
Dr. Bethel is the Director of the Social, Therapeutic,
and Robotic Systems (STaRS) lab. She is a member
of the Academy of Distinguished Teachers in the
Bagley College of Engineering at MSU. She also was
awarded the 2014–2015 ASEE New Faculty Research
Award for Teaching. She was an NSF/CRA/CCC
Computing Innovation Postdoctoral Fellow in the
Social Robotics Laboratory at Yale University. From
2005 to 2008, she was a National Science Foundation
Graduate Research Fellow and was the recipient of the
2008 IEEE Robotics and Automation Society Graduate
Fellowship. She graduated in August 2009 with her Ph.
D. in Computer Science and Engineering from the
University of South Florida. Her research interests
include human-robot interaction, human-computer
interaction, robotics, and artificial intelligence. Her
research focuses on applications associated with robotic
therapeutic support, information gathering from chil-
dren, and the use of robots for law enforcement and
military.

Dimitrios Chrysostomou received his Diploma
degree in production engineering in 2006, and the Ph.D.
degree in robot vision from Democritus University
of Thrace, Greece in 2013. He is currently an Assistant
Professor with the Department of Materials and
Production, Aalborg University, Denmark. He was a
Postdoctoral Researcher at the Robotics and
Automation Group of the Department of Mechanical
and Manufacturing Engineering, Aalborg University,
Denmark. He has co-organized various conferences and
workshops in Mobile Robotics, Robot Ethics and
Human-Robot Interaction. He has served as guest
editor in various journals and books on robotics and
HRI, associate editor for several conferences including
IROS and ICRA and regular reviewer for the major
journals and conferences in robotics. He has been
involved in numerous research projects funded by the
European Commission, the Greek state, and the Danish
state. His research interests include robot vision,
skill-based programming, and human-robot interaction
for intelligent robot assistants.

Nigel Crook is Associate Dean for Research and
Knowledge Exchange and Professor of Artificial
Intelligence and Robotics at Oxford Brookes
University. He graduated from Lancaster University
with a B.Sc. (Hons) in Computing and Philosophy in
1982. He has a Ph.D. in medical expert systems and
more than 30 years of experience as a lecturer and a
researcher in AI. He is an expert reviewer for the
European Commission and serves on several scientific
committees for international conferences. His research
interests include machine learning, embodied conver-
sational agents, and social robotics. His most recent
work is in autonomous moral robots in which he is
exploring how it might be possible to equip robots with
a degree of moral competence. Professor Crook is also
working on other aspects of ethical AI, including
developing systems that can explain the decisions of
trained machine learning models. He is the founder
of the Ethical AI institute at Oxford Brookes
University. His work in robotics has attracted some
media attention, including 16 appearances on regional,
national, and international television channels.

Marine Grandgeorge is Lecturer in Ethology at the
Human and Animal Ethology Lab at the University of
Rennes 1. She belongs to Pegase team focused on
cognitive processes and social factors associated with
scientific and societal issues that include communica-
tion, brain plasticity, perception and understanding of
conspecific and heterospecific signals, remediation, and
welfare. Her research is mainly focused on heterospeci-
fic communications such as human-robot interactions as
well as human-pet interactions and relationships, espe-
cially on animal-assisted interventions (e.g., dog, horse).

Dr. Nicole Mirnig is an expert in Human-Robot
Interaction. She completed her Ph.D. in Human-
Computer Interaction at the Center for HCI, University
of Salzburg, Austria and she holds a Master’s Degree in
Communication Studies from the University of Salzburg.
Her thesis “Essentials of Robot Feedback: On
Developing a Taxonomy for Human-Robot Interaction”
presents a substantial body of related research and
empirical data from a user-centered perspective on how
to design feedback strategies in HRI.
Nicole’s overall research aim is to facilitate the design
of understandable (social) robots. Her focus lies in the
cooperation between humans and robots, taking into
account different factors that foster a positive user
experience. Her most recent work on “imperfect robots”
was prominently discussed in the media.
Nicole was engaged in the EU-projects IURO
(Interactive Urban Robot) and ReMeDi (Remote
Medical Diagnostician), focusing on improving
human-robot interaction by the means of adequate
feedback strategies. She further researched human-
robot collaboration in industrial contexts within the
Christian Doppler Laboratory “Contextual Interfaces”.
During her Ph.D. years, she spent nine months as a
visiting researcher at the A*STAR Institute for
Infocomm Research in Singapore, deepening her
research in robot feedback.
The idea for this book was born while Nicole was
working at the Center for HCI. At the time of publication,
she is working as an expert in user experience, usability,
and user-centered design at Porsche Holding in Salzburg,
Austria.

Contributors

Beatrice Alenljung University of Skövde, Skövde, Sweden


Patrícia Alves-Oliveira Instituto Universitário de Lisboa (ISCTE-IUL) and
CIS-IUL, Lisbon, Portugal;
INESC-ID, Porto Salvo, Portugal

Kim Baraka Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA;
INESC-ID, Porto Salvo, Portugal;
Instituto Superior Técnico, Universidade de Lisboa, Porto Salvo, Portugal
Kenna Baugus Department of Computer Science and Engineering, Mississippi
State University, Mississippi State, MS, USA
Tony Belpaeme Ghent University, Ghent, Belgium;
University of Plymouth, Plymouth, UK
Cindy L. Bethel Department of Computer Science and Engineering, Mississippi
State University, Mississippi State, MS, USA
Erik Billing University of Skövde, Skövde, Sweden
Dimitrios Chrysostomou Aalborg University, Aalborg East, Denmark
Jessie E. Cossitt Department of Computer Science and Engineering, Mississippi
State University, Mississippi State, MS, USA
M. F. Damholdt Unit for Psychooncology and Health Psychology, Department of
Oncology, Aarhus University Hospital and Department of Psychology &
Behavioural Science, Aarhus University, Aarhus, Denmark
Alex Davila Department of Psychology, Pontifical Catholic University of Peru,
Lima, Peru
Shirley A. Elprama imec-SMIT-Vrije Universiteit Brussel, Brussel, Belgium
Marine Grandgeorge University of Rennes 1, University of Normandie, CNRS,
EthoS (Éthologie Animale et Humaine), Paimpont, France
Zachary Henkel Department of Computer Science and Engineering, Mississippi
State University, Mississippi State, MS, USA
An Jacobs imec-SMIT-Vrije Universiteit Brussel, Brussel, Belgium
Charlotte I. C. Jewell imec-SMIT-Vrije Universiteit Brussel, Brussel, Belgium
Céline Jost Laboratory EA 4004 CHArt, Paris 8 University, Saint-Denis, France
Brigitte Le Pévédic Laboratory UMR 6285 Lab-STICC, South Brittany
University, Vannes, France
Jessica Lindblom University of Skövde, Skövde, Sweden
Renato Paredes Venero Department of Psychology, Pontifical Catholic
University of Peru, Lima, Peru
Tiago Ribeiro INESC-ID, Porto Salvo, Portugal;
Instituto Superior Técnico, Universidade de Lisboa, Porto Salvo, Portugal

Matthew Rueben University of Southern California, Los Angeles, CA, USA


J. Seibt Research Unit for Robophilosophy, School of Culture and Society,
Aarhus University, Aarhus, Denmark
C. Vestergaard Research Unit for Robophilosophy, School of Culture and
Society, Aarhus University, Aarhus, Denmark
Josefine Wallström Uptive, Göteborg, Sweden
Franz Werner University of Applied Sciences, FH Campus Wien, Vienna,
Austria
General Context
Communication Between Humans:
Towards an Interdisciplinary Model
of Intercomprehension

Marine Grandgeorge

Abstract Communication, to communicate… These are words used daily in
common speech (e.g. media, science, business, advertising and so on). Although
these words are familiar, the correct definition of communication remains complex.
Here, our aim is to gather knowledge from different scientific disciplines to better
understand what communication is. After some theoretical models of communica-
tion, we detail what verbal and nonverbal communication are, how researchers
try to classify them and which factors could influence them. We propose, at last,
an interdisciplinary model of intercomprehension between individuals that could be
used to improve communication with robots.

Keywords Communication · Interaction · Relationships · Intercomprehension

1 Some Theoretical Models of Communication

Social life concerns associations of individuals belonging to the same species, e.g.
humans. For example, communication ensures coordination between individuals.
Thus, communication was initially defined as a social phenomenon of exchanges
between two or more congeners. It uses specific signals, to survive (i.e. reproduction,
protection, feeding) and maintain group cohesion [1].
First, communication was conceptualized as an information source (i.e. source’s
message and a transmitter) that transmits a signal to a receiver and a destination [2]
(Fig. 1). Information is considered as a sequence of signals combined according to
precise rules which modifies the receiver’s state. Notice that the message could be
modified by noise. Here, communication is a linear and mechanical system without a
social component. The design of this first model, so-called telegraphic, has since
evolved to better include the complexity of communication. Indeed, communication
is not limited to verbal language. It is multichannel, including signals of various kinds
such as sounds, gestures, mimics, tactile or even electrical signals [3]. It takes into
account not just “what is said […] but rather how it is said, and who says it” [4].

Fig. 1 Shannon and Weaver model of communication [2]

M. Grandgeorge (B)
University of Rennes 1, University of Normandie, CNRS, EthoS (Éthologie Animale et
Humaine)—UMR 6552, F-35380 Paimpont, France
e-mail: marine.grandgeorge@univ-rennes1.fr

The first models were simple, excluding contexts. Individuals were not considered
as part of their environment. However, humans are social entities, an essential
element that was later introduced in the model proposed by Riley & Riley [5]. They
use the notions of affiliation to human groups (e.g. judging each other) as well as
of a feedback loop between the sender and the receiver, highlighting the existence of
a reciprocity phenomenon. The conceptualization of communication thus moves from
a linear view to a circular process.
At the same time, as the first models were too simple, some researchers developed a
linguistic approach to communication. For example, Jakobson defines six functions
of language or communication functions that are necessary for communication to
occur: context, addresser or sender, addressee or receiver, contact, common code and
message; all work together [6]. Here, the importance of the communicative context
appears and is defined as “either verbal or capable of being verbalized”.
Later, Barnlund [7] postulates that interpersonal communication is a dynamic
process in which participants are both sender and receiver of the messages, a weak-
ness found in the models used until his work. Thus, in communication, coding and
decoding are not alternative processes but are interdependent, that is, each contributes
to the meaning of the communication [8]. We could consider it as a co-construction.
Today, some authors define communication as orchestral, that is, all partici-
pants are immersed in the communication. Each one plays her/his score, as a
member of an orchestra [9]. In addition, all behaviors may be meaningful to others,
whether they are intentional or not. This difference suggests that we don’t necessarily
communicate what we are trying to communicate and we communicate even if we
don’t try to do so [7]. This could be linked to one axiom of communication: “one
cannot not communicate” [10]. That is “every behaviour is a kind of communication,
people who are aware of each other are constantly communicating. Any perceivable
behaviour, including the absence of action, has the potential to be interpreted by other
people as having some meaning”. The 4 other axioms—an axiom being a statement that is
taken to be true, to serve as a premise or starting point for further reasoning and arguments—are:
1. every communication has a content and relationship aspect such that the latter
classifies the former and is therefore a metacommunication. In other words, we
always try to communicate something different from the exchange content. Here,
the interest is in “how” the communication act is performed, i.e. in non-verbal communication
(e.g. gaze, intonation, gesture, mimicry …).
2. the nature of a relationship is dependent on the punctuation of the partners' com-
munication procedures, that is, communication is an exchange between partners
and what one does impacts the other, and reciprocally.
3. human communication involves both digital and analog modalities. If I want to
communicate the information “the road turns” to someone who does not speak
English, I can use my body, my arms, my hands… to make curved movements
from left to right. My gestures are similar to what they mean. It’s the analog
language. If both partners speak the same language, it is possible to use it and
therefore, “to not show anything”. Only the common knowledge of language or
of a common code makes it possible to understand each other. It’s the digital
language. Notice that both are needed to communicate.
4. inter-human communication procedures are either symmetric or complemen-
tary. A symmetrical relationship is an equal relationship that diminishes differ-
ences between individuals. In contrast, a complementary relationship maximizes
the differences, with two positions, one high and the other low.
More recently, Blanchet [11] argues that the old models are not enough to understand
the richness, flexibility and complexity of language (Fig. 2). For this, he suggests
some changes. The first one is the circularity in which speech operates, forming a
loop of exchanges that act on each other. Speaking as well as other cues such as
gestures, mimicry, images, symbols and so on overlap simultaneously. An exchange
therefore never really has a beginning or an end. Then, the contexts are temporal,
spatial and socio-cultural. The same statement has different meanings according to

Fig. 2 Ethno-sociolinguistic conceptualization of communication adapted from Blanchet [11]



the participants, in different places or at different times, all elements are concerned.
In addition, the contexts also gather objects, noises, and the people present, whose mere
presence influences the behaviors (i.e. audience effect, first described [12]) and
what is communicated (i.e. emotions on the face), the events around the exchanges,
the ethno-socio-cultural setting in which the exchange takes place, and so on. With
a different context, the implicit information and presuppositions are different, and
hence have different meanings. Then each individual and group has their own codes,
some are common, others not, even with a “same language” and “same culture”.
Moreover, each individual emits intentional signals (e.g. linguistic, gestural, sym-
bolic, etc.) but also unintentional signals which are nevertheless perceived and inter-
preted by others because “one cannot not communicate”. Intentionality includes the
strategies of interaction, by which each one seeks to reach her/his goal (e.g. con-
vince, inform, move and be recognized). Likewise, to exchange, each individual
must engage in a form of collaboration with others, try to interpret the intentions of
others, and reciprocally seek to produce valuable signals that could be interpreted by
others. We can call it cooperation or co-construction. Therefore, the interpretation
that partners construct during communication is done by inferences that integrate
both the meanings and contextualization signals. The meaning is not reduced to the
sense or the message. As for metacommunication, the responsibility for the “success” or
“failure” of the exchange never falls to only one partner. As the possible meanings
and modalities of the exchange are multiple, everything comes from interpretation.
When there is misunderstanding or apparently too large gap between intentions
and results, we can then metacommunicate. To metacommunicate is to communi-
cate about communication. In short, it is to interrupt the circle of exchange, to
explain strategies and interpretations. The discrepancies could be elucidated and
then resolved. The exchange is reframed and may start again. This is a higher level
of cooperation that requires more flexibility, openness and listening to others.
The proposed models often focus only on verbal communication. It is therefore
important to open up such models to a larger view of communication if we want, in the
end, to include both animals and humans, and by extension, robots.

2 Verbal and Nonverbal Communication

As previously stated, communication models give prominence to verbal communi-
cation. However, we can't omit nonverbal or para-verbal communication when we
are interested in human communication.

2.1 Verbal Communication

We consider that verbal communication corresponds to language in humans. Numer-
ous definitions exist and this area of research is still evolving; here, we don't want
to review all theoretical and experimental approaches but seek to just give a general
overview.
In the dictionary of the CNRS (i.e. the National Center for Scientific Research
in France), language is defined as the ability of humans to express their thoughts and to
communicate with each other through a system of vocal and/or graphic signs
constituting a language. There are two main types of language. On the one hand, there
is the articulated language that is language with different, identifiable and meaningful
sounds. It is possible to analyze it in significant and minimum units (i.e. monemes),
themselves could be analyzed in distinctive and minimum units (i.e. phonemes). This
is a common characteristic of all languages. On the other hand, there is also an inner
language, a form of internalized and self-directed dialogue that is not expressed.
Thus, in a general way, language can be defined as a system of communication by
speaking, writing, or making signs in a way that can be understood, that allows the
mention of present but also past and future situations.
Here, we mention only an old but interesting work of Hockett [13]. He established
that 13 characteristics are common to all languages:
1. The first characteristic is the vocal-auditory channel that leaves the body free
for other simultaneous activities. Notice that other communication systems use
other channels, e.g. sign language.
2. Multi-directional transmission and directional reception, meaning that
the source of emission is localizable. It is also based on the physical aspect of
sounds.
3. Quick disappearance of the signal is one of the advantageous features of sound
communication, compared to other which are persistent (e.g. chemical or visual
communication).
4. Interchangeability means that a human can, in general, produce any linguistic
message that she/he understands.
5. Feedback allows the speaker to hear what is relevant in the message, and especially to
internalize the message.
6. Specialization means that the sound produced has no other function: it is
specialized to ensure communication.
7. In humans, words have meaning regardless of the context of transmission. For
example, the word “table” evokes the object “table”, even if it is absent. It
corresponds to semantics, the study of meaning.
8. Link between the message elements (i.e. sounds) and the referent can be arbi-
trary or not. Words don’t need to be similar to the object they designate. We
could have a long word for a small thing (e.g. microorganism) or short for a big
beast (e.g. lion). The link between words and referents is arbitrary.
9. Next characteristic is discrete units. Human vocal abilities are very extensive,
but they use only a small number of sounds to speak. For example, in French,
there are 37 basic sound units or phonemes. The language is produced from
these basic discrete sound units, easily identifiable.
10. Displacement refers to the fact that humans would apparently be the only
species able—using language—to refer to objects and events removed from
the time or place where the speaker is. This property makes it possible to evoke
distant objects in time or space, but also to verbally evoke things that have no
spatial location or that never occur. Humans are disconnected from the object
to which they refer, and it has a meaning regardless of a given context.
11. Productivity can be defined as the ability to utter new messages, i.e. to say
things that have never been said and heard before, but that can be understood
by someone who speaks the same language. Language is really an open system,
as we could say “she has naturally green hair”.
12. Cultural transmission means that all humans are genetically able to acquire
language. While everyone has structures for language production and processing,
learning and education remain essential for acquisition.
13. Double articulation is defined as the combination of basic sound units, these
basic sound units have no specific meaning. Morphemes result from the combi-
nation of a small number of distinct and meaningless sounds, i.e. the phonemes.
For example, the words “team” and “meat” are two words that have a very dif-
ferent meaning, but result from the combination of the same basic sound units,
the same phonemes, but not associated in the same order.
The ethologist Thorpe [14] added three new characteristics to this list: the ability to
lie, the metalinguistic ability (i.e. the ability to speak about the system itself) and the
learning of derived systems (e.g. learning another language).
To date, there is significant debate about this question: does language belong
only to humans? The linguist Chomsky [15] states that “human language appears
to be a unique phenomenon, without significant analogue in the animal world”.
However, this previous classification [13] may be used to compare verbal language
with other animal communication systems, even if the choice of the items used for
comparison is still under debate. For example, should we try to take into account the
complexity or, on the contrary, is it better to agree on a minimum and essential core?
Indeed, recent data showed that we need to rethink the limits between humans and
animals (as well as robots [16]) according to several parameters [17, 18]. For example,
learning, attachment, culture, laughter, identity and so on are now being rethought and no
longer belong only to humans. But it remains that animal vocal communication, like human
language, is—before all—a social act. And language, as verbal communication, is
associated with nonverbal communication.

2.2 Nonverbal Communication

One of the first researchers to work on this topic was Charles Darwin who described
the biological and innate origins of nonverbal communication and especially emo-
tions [19]. He proposed the existence of universal emotions. Nonverbal communi-
cation can be defined as construction and sharing of meanings that happen without
speech use [20]. Para-verbal communication is then a component of the nonverbal
communication that is relative to the voice, while excluding a semantic component.
Other authors use a different dichotomy, proposing a separation between speech-based
communication and non-speech communication [21]. Other definitions propose that
nonverbal communication be called “bodily communication” because most non-
verbal items are expressed through the gestures and movements of some body parts
[22]. Para-verbal communication concerns intonations, rhythm, latency between
words, volume whereas nonverbal communication concerns gestures, gaze, mimicry
and posture. There is neither a single theory about nonverbal communication, nor a sin-
gle discipline that deals with the study of nonverbal communication characteristics
and functions [20].

2.3 How to Classify Nonverbal Communication?

Here, we propose a non-exhaustive list of classifications of nonverbal items that
researchers have proposed to better understand nonverbal communication. First, Bonaiuto
& Maricchiolo [23] proposed a scale where the items of nonverbal communication
are graduated (Fig. 3), from the most obvious items (i.e. external appearance and
spatial behavior) to the least obvious items (i.e. vocal cues).
Specifically, we propose to gather 5 classification proposals of para-verbal
communication items (see Table 1 for details). The literature seems to agree with para-
verbal communication as a sub-component of nonverbal communication that corre-
sponds to “verbal vocal signs with para-verbal meaning, non-verbal vocal cues, and
silences” [20].
Some authors focused on particular human body elements to classify items con-
stituting nonverbal communication. For example, Bonaiuto et al. [30] proposed a
classification of hand gestures, whether or not related to speech. More precisely,
Rozik [31] illustrated the role of the hands in the theatrical play situation with a spe-
cific classification. Some other authors focused on the whole body parts. Gestures
could be analyzed as we analyze language [32]. The term kineme was created by
comparison to phoneme. The kineme itself is meaningless. Its repertoire is based
on human body division into 8 parts: head, face, neck, trunk, arms, hands, legs and
feet. Then, each part is subdivided. This method of formal classification is extremely
fine, but is unusable in direct observation, unless you record the interaction and view
frame by frame, which involves significant effort.
Another classification, not based on structure but on function, proposed to sep-
arate communicative gestures from so-called extra-communicative gestures [3] as
previously suggested by Morris [27]. Among communicative gestures, 3 broad cat-
egories are distinguished: (1) quasi-linguistic gestures that is conventional form
and use of gestures according to the culture that can be used independently of the
speech, although they often have an equivalent verbal expression, (2) syllinguistic
gestures, that is, gestures necessarily associated with speech, and at last (3) synchro-
nizers, which are centered on the interaction and ensure that the exchange runs well. Among
extra-communicative gestures, the author also distinguishes 3 broad categories:
(1) comfort gestures, that is, change of position, (2) self-centered gestures or self-
body manipulation and at last (3) playful gestures equivalent to the previous ones,
but centered on the object.

Fig. 3 Classifications of nonverbal items with the most obvious at the top and the least obvious at the bottom [23]
Notice that when a gesture is accompanied by language, the gesture becomes a support,
whereas when a gesture is not accompanied by language, it becomes language per
se (e.g. sign language; deaf children have difficulty learning the conventions and implicit
rules governing language use).
As we mentioned above, nonverbal communication also contains silences. Our
daily experience reveals that all silences are not the same (e.g. silence following an
embarrassing question, silence due to reflection, silence to ignore others). Silence—
depending on context and partners involved—may have a positive or negative valence
that could impact relationships (e.g. dominance). For example, silence becomes pos-
itive when it is used in cases of emotions so strong that they can’t be expressed
verbally (e.g. love at first sight) or to express approval. If the silence is accompanied
by gaze avoidance, it may indicate that the partner is embarrassed or wishes to close
the conversation, for example.

Table 1 Summary of five classifications of paraverbal items

Authors                Classification of para-verbal items
Trager [24]            1. Voice quality
                       2. Vocalisations (vocal characteristics, vocal qualifications, sounds)
Harrow [25]            1. Reflex movements
                       2. Fundamental movements
                       3. Perceptual skills
                       4. Physical skills
                       5. Motor skills
                       6. Gestural communication
Argyle [26]            According to the type of speech (e.g. friendly)
Morris [27]            1. Intentional gestures
                       2. Non-intentional gestures
Laver & Trudgill [28]  1. Extralinguistic characteristics of the voice
                       2. Paralinguistic characteristics of the voice tone
                       3. Characteristics of the phonetics
Anolli et al. [29]     1. Vocal and verbal cues
                       2. Vocal but non-verbal cues (tone, intensity, velocity)

Sachs et al. [33] developed a classification of silence
in conversation in 3 parts: gap, lapse and pause. A silence of gap-type corresponds
to the moment when you take your turn in speech. A silence of lapse-type defines
situations where none of the interlocutors speak, causing the interruption of the con-
versation. At last, a silence of pause-type corresponds to a delay by the partner observed
following a request, a question, or a greeting. The latter can be considered, at least in
our culture, as a violation of the informal rules of the conversation.

2.4 Factors That Modulate Communication

Communication may be influenced by several factors. For example, Anolli et al.
[29] proposed four main types of factors: biological factors (e.g. gender, age), social
factors (e.g. culture, social norms, environmental context, the degree of knowledge
about each other), personality factors and emotional factors.

2.4.1 Degree of Knowledge About Each Other: Interaction or Relationships

Many definitions coexist [34, 35]. Here, we privileged the one proposed by Hinde
[36], which is used in several disciplines such as psychology and ethology: “By an
interaction, we usually mean a sequence in which individual A shows behavior X to
individual B, or A shows X to B and B responds with T”. This sequence of interaction
can be repeated, identically or not. The description of an interaction is based on what
individuals do together (content) and how they do it (quality). Hinde [37] argued
that “in human interactions, such qualities can be as or more important than what
the interactants actually did together”. When two individuals encounter each other for the
first time, their level of uncertainty is high, in the sense that there is indecision
about the beliefs and behaviors that the other is likely to display [38]. Getting to
know each other corresponds to reducing this uncertainty, so that the other appears
predictable, a decision can be made about the desirability of future interactions and
the level of intimacy that is desirable [39]. That is how, from an interaction, we move
to a relationship.
A relationship involves a series of interactions in time: partners have, on the
basis of the past experiences, expectations on the other individual’s responses [36].
Depending on the interaction perceptions (i.e. positive or negative valence), relation-
ships could range from trust and comfort to fear and stress. Once the relationship is
established, it is not static: each interaction may influence the relationship or may
persist despite a long separation [40]. Relationships expressed by (1) strong attraction
between individuals, (2) proximity seeking, (3) preferences, (4) psychophysiolog-
ical imbalance after isolation, (5) co-operation, (6) activity coordination, (7) affil-
iations and (8) predisposition to social facilitation (according to Laurence Henry,
lecturer in University of Rennes 1). Henry and her collaborators [41] reports a gen-
eral trend, through the animal kingdom including humans: vocal communication is
important to establish a relationship between two individuals. Vocal communication
is often used during first interactions but tends to decrease across time (e.g. number
of occurrences), when a stable relationship is established.
Throughout the works of several authors from different disciplines [42–44], we
propose a synthetic view of the word “relationship”. The relationship between two
individuals comes from the first encounter. It is instantiated from a very general
model of the partner (e.g. man or woman, age range). Interaction after interaction,
the “model” of the partner would be refined (e.g. from interaction valence, iden-
tity of each partner) to become an individual model. When the individual model is
established, the relationship corresponds to the one defined by Hinde [36]. During
interactions, a continuous process brings the individual model into conformity with
what the partner really is; the identity of each partner is dynamic and progressive.

2.4.2 Socio-Cultural Factors

While it seems common sense that verbal communication is subject to socio-cultural
factors, nonverbal communication is also subject to such factors, even at a young
age and in a ubiquitous way. For example, Efron [45] compared gestures of Italian immi-
grants to Jewish ones from Eastern Europe in the United States. He shows that
gestures used are different in the first generation of both groups. These differences
diminish in the second generations to finally become typically American for both
populations. Differences are also observed on common gestures. For example, the
gesture using hand to mean “come here” depends on the culture of people who use it.
While, in France, the movement of the hand and fingers is palm up, in many Mediter-
ranean countries, the gesture is palm down. Differences exist also in facial expres-
sions as shown by cross-cultural studies. Facial expression of emotions corresponds
to universal patterns, but society provides rules of use: emotional expressions are
not equally accepted according to cultures. For example, the number of words used to
qualify emotions varies among cultures. In France, a hundred are identified while in
Chewong, ethnologists found only 7 words [46].

2.4.3 Emotions

Defining “emotion” is very complex. Literature offers more than a hundred defini-
tions [47] and more than 150 theories [48], showing that currently no consensus is
reached. At the interpersonal level, function of emotions is to coordinate immediate
social interactions, particularly through emotional expressions that help people to
know the partner’s emotions as well as beliefs and intentions [49, 50]. It has long
been considered that Arousal (i.e. strength of the emotional stimulus) and Valence
(i.e. emotionally positive or negative characteristics of stimulus) were only relevant
components of emotion [51]. However, Scherer [52] identified a greater number of
specific dimensions: Intensity, Dominance and Impact of emotion.
Finally, Tcherkassof [49] argues that emotion may have importance in communi-
cation as well as other components: affective state (i.e. sensation of pleasure or dis-
pleasure), sense (i.e. affective states of moral origin, emotionally charged attitudes),
mood (i.e. chronic phenomenon compared to the emotion characteristic, i.e. acute
phenomenon, that affects behavior), temperament (i.e. stable affective dispositions)
and affect (i.e. experience of pleasure and displeasure).

2.4.4 Multimodal and Multichannel Communication

While it is true that verbal language, through the voice channel, is the preferred way
to communicate, it is far from being the only channel used by humans. Observing
an everyday communication situation allows us to realize the importance
of multimodal and multichannel communication in our interactions [53]. Human
communication can, therefore, use different channels to transmit the signal: (1) the
auditory (e.g. sound) channel, linked to verbality and vocalizations, (2) the visual
channel linked to gestures and (3) the olfactory, thermal and tactile channels unfortu-
nately often neglected in adults of Western cultures. These channel uses are subject
to strong socio-cultural factors. For example, the regulation of interactions between
human adults (e.g. to ensure the attention of the partner) is mainly displayed by
glances and gazes and not by direct physical contact, especially if the partners don't
know each other. In the context of robots, both the robot's and the human's multimodality
must also be considered [54]. But researchers encounter several difficulties related
to this question, in particular on the integration of several modalities in human-robot
communication (e.g. synchronization of different modalities, gesture recognition).

3 A Proposed Intercomprehension Model

To communicate well, intercomprehension between individuals is a prerequisite. How-
ever, this word is rarely defined, even in dictionaries. Nevertheless, it seems consis-
tent that for intercomprehension, reciprocity is necessary. Jamet [55] proposes that
intercomprehension could be defined as the ability to understand and be understood
in an unknown language through different communication channels, both verbal
and non-verbal. It cannot be limited to the verbal language and must include all
components of the communication including, for example, smelling, touching or
visual modality. For our model, we propose that intercomprehension is the ability
to understand and be understood through different modalities of communication.
This model was proposed by a consortium of interdisciplinary researchers working
together in 2010–2011 in a project named MIAC (“Modélisation interdisciplinaire
de l’acceptabilité et de l’intercompréhension dans les interactions” that is “Interdis-
ciplinary Model of Acceptability and Intercomprehension in Interactions” between
humans, animals and robots) [56]. In this model, we were first interested in the con-
cept of identity. Identity is complex and dynamic, changing across time and according
to the individual with whom you interact as well as the context you are in [57].
Each individual has her/his own identity with several sides (e.g. biology, personality,
skills, knowledge, uses, emotions, and so on) [58, 59].
During an interaction, each individual activates one of “her/his identity”, accord-
ing to the partner’s identity as well as to the context, i.e. moment, place, social
environment and interactive situation/co-activity [11]. This is called proposed iden-
tity (e.g. A/A; Fig. 4). Likewise, each individual has her/his own identity perceived
by others, that we called perceived identity (e.g. B/A) that could be influenced by
past interactions. In this model, everyone conceives of what the others may represent
about her/him: we called that the represented identity (e.g. (A/B)/A). Notice that “/”
means “for”. Based on the dynamic described above, we proposed a model of inter-
comprehension that could be common to humans, animals and machines including
robots (Fig. 5). Thus, if the identity proposed by the individual A (e.g. status, aim…
in the interaction) is consistent with the represented identity of A, and conversely for
these two identities of the individual B, we can then talk about intercomprehension.

Fig. 4 Dynamics at the identity level. A and B are 2 different individuals (e.g. human, animal,
robot). The arrows indicate the direction of the action (e.g. the proposed identity of A activates in B
the perceived identity of A). The circular arrow indicates that the phenomenon is always in process. Notice
that “/” means “for”

This definition is usable for interactions between individuals of the same species. However, for individuals of different species, or individuals of the same species with particular ways of communicating (e.g. people with autism, blind people), an additional condition is required to talk about intercomprehension: functional communication between individual A and individual B (upper part of Fig. 5). This notion is based on Von Uexküll's concept [60] of the Umwelt, i.e. the environment-world of each organism. Each organism perceives the experience of living in terms of species-specific, spatio-temporal and 'self-in-world' subjective reference frames. Each individual's Umwelt has a meaning and imposes determinations. In order to communicate, two individuals must have functional concordances between their own perceptive devices, i.e. their senses must allow them to perceive each other, and therefore the signals emitted can be interpreted at a min-
imum level. Thus, among all the signals that a species uses, some are selected as
having a meaning for other species [60]. In addition, these signals could become sig-
nificant by learning. Therefore, communication between individuals from different
species is limited by the reception and recognition of the signals used. This is a necessary but not sufficient condition for intercomprehension. Taking into account functional
communication appears essential, for example, in human-robot interactions. Indeed,
if we are interested in blind children, the visual modality may not be preferred in the
design of the robot; likewise for elderly people with hearing problems, etc. Thus,
if these conditions are fulfilled, individuals from different species can communi-
cate and thus activate their different identities allowing intercomprehension between
them.

Fig. 5 Model of intercomprehension common to humans, animals and machines
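To make the structure of the model easier to inspect, the sketch below encodes its two conditions, identity consistency on each side and functional communication between the partners, in a few lines of Python. This is only one possible and deliberately crude reading: reducing identities to sets of descriptors and "consistency" to an overlap threshold is an assumption made here for illustration, not part of the model itself.

```python
# Toy formalization of the intercomprehension conditions of Figs. 4 and 5.
# Identities are reduced to sets of descriptors; "consistency" is an arbitrary
# overlap threshold. Everything below is an illustrative assumption.
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Agent:
    name: str
    emits: Set[str]       # signal modalities the agent can produce
    perceives: Set[str]   # signal modalities the agent can perceive
    proposed: Set[str] = field(default_factory=set)     # X/X: the identity X activates
    represented: Set[str] = field(default_factory=set)  # (X/Y)/X: what X thinks Y perceives of X


def functional_communication(a: Agent, b: Agent) -> bool:
    """Each partner must emit at least one signal the other can perceive."""
    return bool(a.emits & b.perceives) and bool(b.emits & a.perceives)


def consistent(x: Set[str], y: Set[str], threshold: float = 0.5) -> bool:
    """Crude consistency measure: Jaccard overlap above an arbitrary threshold."""
    return len(x & y) / max(len(x | y), 1) >= threshold


def intercomprehension(a: Agent, b: Agent) -> bool:
    return (functional_communication(a, b)
            and consistent(a.proposed, a.represented)
            and consistent(b.proposed, b.represented))


if __name__ == "__main__":
    human = Agent("human", emits={"speech", "gesture"}, perceives={"speech", "light"},
                  proposed={"teacher"}, represented={"teacher"})
    robot = Agent("robot", emits={"speech", "light"}, perceives={"speech", "gesture"},
                  proposed={"assistant"}, represented={"assistant"})
    print(intercomprehension(human, robot))  # True under these toy assumptions
```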

4 Conclusions

Throughout this chapter, we have shown that communication is dynamic, circular, complex and flexible. Both intentional and unintentional signals are used, in a multichannel and multimodal way. All communication has a purpose. Finally, communication consists of cooperation and co-construction.
We proposed here a model of intercomprehension that takes into account several scientific disciplines. A real need for research appears in order to improve human-robot communication, and we hope that this model may help in this respect. Indeed, as stated previously [61], "it is highly probable that the way humans "view" their robots has an important influence on the way they use them. Daily human-robot interactions are very varied and include both positive (e.g. services) and negative (e.g. breakdown) events, leading to more or less balanced relationships." Now, with such a model and the adaptations that it could involve, "further research is needed in order to assess how to maximize robot acceptance in the human environment - at least in some societies, what type of robot (e.g. size, functions) may help develop a positive relationship, what influence human involvement has on the relationship and so on". These now constitute major challenges for future research.

References

1. Vauclair, J.: L’intelligence de l’animal. Seuil, Paris (1992)


2. Shannon, C.E., Weaver, W.: A Mathematical Model of Communication. University of Illinois
Press, Urbana, IL (1949)
3. Cosnier, J.: L’étho-anthropologie de la gestualité dans les interactions quotidiennes. In: Laurent,
M., Therme, P. (eds.) Recherche en A.P.S., pp. 15–22 (1987)
4. Rendall, D., Owren, M.J.: Animal vocal communication: say what? In: Bekoff, C.A.M.,
Burghardt, G. (eds.) The Cognitive Animal. Empirical and Theoretical Perspectives on Animal
Cognition, pp. 307–313. MIT Press, Cambridge, MA (2002)
5. Riley, J., Riley, M.: Mass communication and the social system. Sociol. Today 537 (1959)
6. Jakobson, R.: Linguistics and poetics. In: Sebeok, T.A. (ed.) Style in Language, pp. 350–377.
M.I.T. Press, Cambridge, MA (1960)
7. Barnlund, D.C.: A transactional model of communication. In: Foundations of Communication
Theory. Harper & Row, New York (1970)
8. Anderson, R., Ross, V.: Questions of Communication: A Practical Introduction to Theory. St.
Martin’s Press, New York (1994)
9. Winkin, Y.: La nouvelle communication. Éditions du Seuil, Paris (1981)
10. Watzlawick, P., Beavin-Bavelas, J., Jackson, D.: Some tentative axioms of communication.
In: Pragmatics of Human Communication: A Study of Interactional Patterns, Pathologies and
Paradoxes. W. W. Norton, New York (1967)
11. Blanchet, P.: Linguistique de terrain, méthode et théorie (une approche ethnosociolinguistique),
150 p. Presses Universitaires de Rennes, Rennes (2000)
12. Meumann, E.: Haus-und schularbeit: Experimente an kindern der volksschule. J. Klinkhardt
(1904)
13. Hockett, C.F.: The origin of speech. Sci. Am. 203, 89–96 (1960)
14. Thorpe, W.H.: Duetting and antiphonal song in birds—its extent and significance. Behav.
Monogr. Supplement 18(3):1–197 (1972)

15. Chomsky, N.: La linguistique cartésienne: un chapitre de l'histoire de la pensée rationaliste, suivie de La nature formelle du langage. Edition du Seuil, Paris (1969)
16. Chapouthier, G., Kaplan, F.: L’homme, l’animal et la machine, p. 224. CNRS éditions, Paris
(2011)
17. Engesser, S., Crane, J.M.S., Savage, J.L., Russell, A.F., Townsend, S.M.: Experimental evidence
for phonemic contrasts in a nonhuman vocal system. PLoS Biol. 13(6), e1002171 (2015)
18. Ouattara, K., Lemasson, A., Zuberbühler, K.: Campbell’s monkeys concatenate vocalizations
into context-specific call sequences. Proc. Natl. Acad. Sci. 106(51), 22026–22031 (2009)
19. Darwin, C.: The Expression of the Emotions in Man and Animals. Cambridge library collection:
Francis Darwin (1872)
20. Hennel-Brzozowska, A.: La communication non-verbale et paraverbale - perspective d’un
psychologue. Synergies Pologne 5, 21–30 (2008)
21. Greene, J., Burleson, B.: Handbook of Communication and Social Interaction Skills. Purdue
University, Lawrence Erlbaum Associates, New York (1980)
22. Argyle, M.: Bodily Communication. Methuen, London (1974)
23. Bonaiuto, M., Maricchiolo, F.: La comunicazione non verbale. Carocci Editore, Roma (2007)
24. Trager, G.: Paralanguage: A first approximation. Stud. Linguist. 13, 1–12 (1958)
25. Harrow, A.J.: Taxonomie des objectifs pédagogiques. Tome 3, domaine psychomoteur. Presses
de l’Université du Québec, Montréal (1977)
26. Argyle, M.: The Social Psychology of Everyday Life. Routledge, London (1992)
27. Morris, D.: Bodytalk: A World Guide to Gestures. Jonathan Cape (1994)
28. Laver, J., Trudgill, P.: Phonetic and linguistic markers in speech. In: Sherer, K.R., Giles, H.
(eds.) Social Markers in Speech, pp. 1–31. Cambridge University Press, New York (1979)
29. Anolli, L., Ciceri, R., Riva, G.: Say Not to Say: New Perspectives on Miscommunication. IOS
Press, Amsterdam (2002)
30. Bonaiuto, M., Gnisci, A., Maricchiolo, F.: Proposta e verifica di una tassonomia per la codifica
dei gesti delle mani in discussioni di piccolo gruppo. Giornale Italiano di Psicologia 29, 777–807
(2002)
31. Rozik, E.: Les Gestes Metaphoriques de la Main au Théâtre. Prothée 21(3), 8–19 (1993)
32. Birdwhistell, R.: Kinesics and Context. University of Pennsylvania Press, Philadelphia (1970)
33. Sacks, H., Schegloff, E.A., Jefferson, G.: A simplest systematics for the organization of turn-
taking for conversation. Language 50(4), 696–735 (1974)
34. Goffman, E.: Les rites d’interaction. Minuit, Paris (1974)
35. Kerbrat-Orecchioni, C.: Les interactions verbales, Tome 1. Armand Colin, Paris (1990)
36. Hinde, R.: Towards Understanding Relationships. Academic Press, London (1979)
37. Hinde, R.: On Describing Relationships. J. Child Psychol. Psychiatry 17, 1–19 (1976)
38. Berger, C.: Beyond initial interaction: uncertainty, understanding, and the development of
interpersonal relationship. In: Giles, H., St. Clair, R. (eds.) Language and Social Psychology,
pp. 122–144. Blackwell, Oxford (1979)
39. Moser, G.: Les relations interpersonnelles. P.U.F., Collection Le psychologue, Paris (1994)
40. Sankey, C., Richard-Yris, M.A., Leroy, H., Henry, S., Hausberger, M.: Positive interactions
lead to lasting positive memories in horses, Equus caballus. Anim. Behav. 79(4), 869–875
(2010)
41. Henry, L., Barbu, S.L., Lemasson, A., Hausberger, M.: Dialects in animals: evidence,
development and potential functions. Anim. Behav. Cogn. 2(2):132–155 (2015)
42. Habermas, J.: Théorie de l’agir communicationnel. Fayard (1981)
43. Goffman, E.: Strategic Interaction. University of Pensylvania Press, Philadelphia (1969)
44. Fiske, A.P.: The four elementary forms of sociality: framework for a unified theory of social
relations. Psychol. Rev. 99, 689–723 (1992)
45. Efron, D.: Gesture, Race and Culture. La Hague, Mouton, Paris (1941)
46. Russell, J.A.: Culture and the categorization of emotions. Psychol. Bull. 110(3), 426–450
(1991)
47. Kleinginna, P.R., Kleinginna, A.M.: A categorized list of emotion definitions with suggestions
for a consensual definition. Motiv. Emot. 5, 345–379 (1981)

48. Strongman, K.T.: The Psychology of Emotion: Theories of Emotion in Perspective. Wiley,
New York (2000)
49. Tcherkassoff, A.: Les émotions et leurs expressions. PUG, Grenoble (2008)
50. Doise, W.: Levels of Explanation in Social Psychology. CUP, Cambridge (1986)
51. Schlosberg, H.: The description of facial expressions in terms of two dimensions. J. Exp.
Psychol. 44, 229–237 (1952)
52. Scherer, K.R.: Appraisal theory. In: Dalgleish, T., Power, M. (eds.) Handbook of Cognition
and Emotion, pp. 637–663. Wiley, New York (1999)
53. Guyomarc’h, J.C.: Abrégé d’éthologie, 2ème édition ed. Masson, Paris (1995)
54. Carbonell, N., Valot, C., Mignot, C., Dauchy, P.: Etude empirique: usage du geste et de la parole
en situation de communication homme-machine. Presented at the ERGO’IA’94 (1994)
55. Jamet, M.C.: L’intercompréhension: de la définition d’un concept à la délimitation d’un champ
de recherche ou vice versa? Autour de la définition. Publifarum 10 (2010)
56. Grandgeorge, M., Le Pévédic, B., Pugnière-Saavedra, F.: Interactions et Intercompréhension:
une approche comparative, p. 342. E.M.E. Editions, Collection Echanges (2013)
57. Lipiansky, E.M.: Psychologie de l’identité. Dunod, Paris (2005)
58. James, W.: Psychology: Briefer Course (1892)
59. Mead, G.H.: L’esprit, le soi et la société. PUF, Paris (1963)
60. von Uexküll, J.: Mondes Animaux et Monde Humain, suivi de la Théorie de la Signification.
Gonthier, Paris (1965)
61. Grandgeorge, M., Duhaut, D.: Human-Robot: from interaction to relationship. In: CLAWAR
2011. Paris (2011)

Marine Grandgeorge is a lecturer in ethology at the Human and Animal Ethology laboratory at the University of Rennes 1. She belongs to the Pegase team, which focuses on cognitive processes and social factors associated with scientific and societal issues that include communication, brain plasticity, perception and understanding of conspecific and heterospecific signals, remediation and welfare. Her research is mainly focused on heterospecific communication, such as human-robot interactions as well as human-pet interactions and relationships, especially in the context of animal assisted interventions (e.g. dog, horse).
An Extended Framework
for Characterizing Social Robots

Kim Baraka, Patrícia Alves-Oliveira and Tiago Ribeiro

Abstract Social robots are becoming increasingly diverse in their design, behavior,
and usage. In this chapter, we provide a broad-ranging overview of the main character-
istics that arise when one considers social robots and their interactions with humans.
We specifically contribute a framework for characterizing social robots along seven
dimensions that we found to be most relevant to their design. These dimensions
are: appearance, social capabilities, purpose and application area, relational role,
autonomy and intelligence, proximity, and temporal profile. Within each dimension,
we account for the variety of social robots through a combination of classifications
and/or explanations. Our framework builds on and goes beyond existing frameworks,
such as classifications and taxonomies found in the literature. More specifically, it
contributes to the unification, clarification, and extension of key concepts, drawing
from a rich body of relevant literature. This chapter is meant to serve as a resource for
researchers, designers, and developers within and outside the field of social robotics.
It is intended to provide them with tools to better understand and position existing
social robots, as well as to inform their future design.

Keywords Human-Robot Interaction · Framework · Classification · Social robots

Kim Baraka and Patrícia Alves-Oliveira have contributed equally to this chapter.

K. Baraka (B)
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
e-mail: kbaraka@andrew.cmu.edu
P. Alves-Oliveira (B)
Instituto Universitário de Lisboa (ISCTE-IUL) and CIS-IUL,
1649-026 Lisbon, Portugal
e-mail: patricia_alves_oliveira@iscte-iul.pt
K. Baraka · P. Alves-Oliveira · T. Ribeiro
INESC-ID, 2744-016 Porto Salvo, Portugal
e-mail: tiago.ribeiro@gaips.inesc-id.pt
K. Baraka · T. Ribeiro
Instituto Superior Técnico, Universidade de Lisboa,
2744-016 Porto Salvo, Portugal


1 Introduction

1.1 Social Humans, Social Robots

Humans are inherently social beings, spending a great deal of their time establishing
a diverse range of social connections. Their social nature is not only demonstrated by their social behavior [90], but also has a biological basis [72]. This social dimension prompts human beings to involuntarily ascribe social qualities even to non-human media, such as technological artifacts, often treating them similarly to how they would treat humans or other living beings [138]. This disposition stems from the general human tendency to ascribe human-like qualities to non-human entities, called anthropomorphism, which has been observed and demonstrated in several contexts [60]. These phenomena make technologies capable of social interaction with humans unique technological innovations. In particular,
social robots, i.e., robots deliberately designed to interact with humans in a social
way, open up a new paradigm for humans to communicate, interact, and relate to
robotic technologies.
The integration of a social dimension in the design of robots has generally been
following two approaches. First, existing robotic technologies are being enhanced
with social capabilities for more fluid interactions with humans. Second, social robots
are being developed for new application areas where the social dimension is central,
and beyond a mere interface. As a result of these approaches, social robots have
been deployed in a wide variety of contexts, such as healthcare [37], education [23],
companionship [54], and others (refer to Sect. 2.3 for a discussion of application
areas). They offer a spectrum of interactions that is being continuously enriched by
researchers from a variety of disciplines. The field of human-robot interaction (HRI),
as an expanding field of research, reflects this observation.
HRI is a multidisciplinary field bringing together researchers from an eclectic set
of disciplines, including robotics, computer science, engineering, artificial intelli-
gence (AI), machine learning, human-computer interaction (HCI), design, art, ani-
mation, cognitive science, psychology, sociology, ethology, and anthropology [9,
21, 62, 69, 136]. The multidisciplinarity inherent to this field of research provides
contributions and advancements nurtured by scholars from different backgrounds in
the conception, design, and implementation of social robots. In addition to develop-
ment, HRI aims to evaluate how well such robots perform or serve the purpose they
were designed for, being concerned with proper evaluation, testing, and refinement
of these technologies. The result is a rich multidisciplinary effort to create engaging
robots that can sustain personalized interactions with humans, adapt to the task at
hand and to the interaction flow, but also understand and model aspects pertaining
to the human, such as affect and cognition [86, 113].
In this chapter, we provide a framework for characterizing social robots that
encompasses major aspects to consider when designing them and their interactions
with humans. Our framework is focused on interactive robots that possess a social
component in their design. Specifically, we use the term “social robots” to denote
"socially interactive robots" as defined by Fong et al. [69], namely robots that have one or more of the following abilities: (1) communicating using natural language or non-verbal modalities (such as lights, movements, or sound), (2) expressing affective behaviors and/or perceiving human emotions, (3) possessing a distinctive personality or character, (4) modeling social aspects of humans, (5) learning and/or developing social skills, and (6) establishing and maintaining social relationships [69].

Fig. 1 Visual summary of the seven dimensions of our framework, positioned in relation to the robot, the interaction, and the context. Each dimension will be further broken down and discussed separately in Sect. 2
Our framework builds upon existing work within the field of HRI, providing a
holistic understanding about the state of the art, while aiming at unifying, clarify-
ing, and extending key concepts to be considered in the design of social robots.
Specifically, our framework comprises several dimensions we identified to be of
major relevance to the design of social robots. We summarize the seven dimensions
considered in Fig. 1. Some of these dimensions relate to the robot itself—namely
appearance, social capabilities, and autonomy/intelligence—, others relate to the
interaction—namely proximity and temporal profile—, and the remaining ones relate
to the context—namely robot relational role and purpose/application area. We envi-
sion this framework to be used broadly in order to gain a better understanding of
existing social robots, as well as to inform the design and development of future
ones.

1.2 Brief Summary of Frameworks for Characterizing Social Robots

Before outlining the content of our framework, it is useful to first look at existing
frameworks for classifying social robots. In particular, existing taxonomies, such as those from Fong et al. [69], Yanco and Drury [203], Shibata [167], and Dautenhahn [52], are useful to get an understanding of different aspects that may be included in the design space of social robots in HRI research. While this list of frameworks is not exhaustive, we chose these particular ones to base our framework on, as they provide a broad range of classifications and definitions that relate to the scope of this chapter.
As such, Fong et al. [69] contributed a taxonomy of design methods and system
components used to build socially interactive robots. These components include robot
social capabilities, several design characteristics, and application domains.
Additionally, Yanco and Drury [203] provided a framework that included elements
of social robot’s design, such as the role that a robot can have when interacting
with humans, the types of tasks that robots can perform, different types of robot
morphology, and the level of autonomy at which robots can operate.
Similarly, Shibata [167] provided a taxonomy for the function and purpose of
social robots by considering different ways of using them for psychological enrich-
ment. To this end, Shibata classified human-robot interactions in terms of the duration
of these interactions and in terms of design characteristics (e.g., robot’s appearance,
hardware, and software functionalities), accounting for culture-sensitive aspects.
Moreover, Dautenhahn [52] focused on different evaluation criteria to identify
requirements on social skills for robots in different application domains. The author
identified four criteria, including contact between the robot and the human (which
can vary from no contact or remote contact to repeated long-term contact), the extent
of the robot’s functionalities (which can vary from limited to learning and adapting),
the role of the robot (which can vary from machine or tool to assistant, companion,
or partner), and the requirement of social skills that a robot needs to have in a
given application domain (which can vary from not required/desirable to essential).
The author further explains that each evaluation criterion should be considered on a
continuous scale.
Taken together, these classifications and taxonomies have gathered essential
aspects for the characterization and design of social robots. Despite each of them
being unique in its contribution, we can see the existence of some overlapping terms
and ideas between them. We now discuss our extended framework in the next section.

1.3 Overview of Our Extended Framework

Our framework leverages the existing ones discussed previously as a starting point
and goes beyond the individual frameworks discussed. In particular, it focuses on
the following features:

• Unification—The existence of multiple available perspectives in HRI often results in scattered concepts and classifications. In this chapter, we aim at merging aspects
of the literature on social robots and related fields in a self-contained and consistent
resource.
• Breadth—Existing individual taxonomies often focus on specific aspects relevant
to the main line of research of their authors, and may not provide satisfactory
coverage. Our framework includes dimensions related to the design of the robot
itself, but also of the interaction and context.
• Recency—In recent years, we have observed some important developments in
robotic technologies, which have taken robots outside of research laboratory set-
tings and enabled them to be deployed “in the wild”. We incorporate those recent
developments in our work.
• Clarity—Concepts associated with HRI are often difficult to define, and as a result
clear definitions may not always be available. This lack of clarity may impede
communication within the field, or result in inconsistent concepts. In this chapter,
we attempt to clarify some important key concepts, such as the distinction between
embodiment and purpose, or the concepts of autonomy and intelligence for social
robots.
With these points in mind, we list below our focuses within each of the 7 dimen-
sions considered.
1. Appearance—We present a broad classification of robot appearances, synthesiz-
ing and going beyond existing ones (Sect. 2.1).
2. Social capabilities—We contribute a repositioning of existing classifications aim-
ing to clarify how existing categories related to each other (Sect. 2.2).
3. Purpose and application area—We discuss a cross-section of purposes for social
robots, and benefiting application areas, with selected examples that include recent
developments in the field (Sect. 2.3).
4. Relational role—We provide a straightforward and broad classification of the
robot’s role in relation to the human(s) (Sect. 2.4).
5. Autonomy and intelligence—We clarify the related but distinct concepts of
autonomy and intelligence, and discuss their quantification (Sect. 2.5).
6. Proximity—We classify interactions according to their spatial features (Sect. 2.6).
7. Temporal profile—We look at several time-related aspects of the interaction,
namely timespan, duration, and frequency (Sect. 2.7).
It is to be noted that our framework is not meant to be exhaustive, but rather to
provide the reader with major aspects that shape social robots and their interactions
with humans. While our focus in illustrating the presented concepts will be on single
human–single robot interactions, the concepts may also apply to group interactions
involving more than one robot and/or more than one human. Additionally, even
though this framework was developed with social robots in mind, some dimensions
may also be of relevance to robots without a social component in their design, such as the "appearance" dimension. In the following section, we delve
into each of the 7 dimensions of our framework. We then end this chapter with a
brief discussion on designing social robots within the resulting design space.
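As a practical aside, the seven dimensions can be treated as a simple record to fill in when surveying existing robots or sketching a new one. The snippet below is only an illustration of that idea: the field types, the free-form string values, and the example entries for NAO are assumptions made for this example, not categories prescribed by the chapter (the admissible values for each dimension are discussed in Sect. 2).

```python
# Illustrative record for characterizing a robot along the seven dimensions.
# Field values are free-form strings here purely for the sake of the example.
from dataclasses import dataclass
from typing import List


@dataclass
class SocialRobotProfile:
    appearance: str             # Sect. 2.1
    social_capabilities: str    # Sect. 2.2
    purpose_areas: List[str]    # Sect. 2.3
    relational_role: str        # Sect. 2.4
    autonomy_intelligence: str  # Sect. 2.5
    proximity: str              # Sect. 2.6
    temporal_profile: str       # Sect. 2.7


# Hypothetical entry for NAO, based on the uses mentioned in this chapter;
# the last four fields would depend on the particular deployment.
nao = SocialRobotProfile(
    appearance="bio-inspired / human-inspired (humanoid)",
    social_capabilities="depends on the software running on the platform",
    purpose_areas=["education", "healthcare/therapy", "entertainment (robot soccer)"],
    relational_role="depends on deployment",
    autonomy_intelligence="depends on deployment",
    proximity="depends on deployment",
    temporal_profile="depends on deployment",
)
print(nao)
```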

2 Framework Description

We now provide a description of each of the 7 dimensions of our framework. The dimensions purposefully operate at different levels, according to the aspects that
are most relevant to the design of social robots. In some dimensions, we provide a
classification into different categories and possibly subcategories (namely Sects. 2.1,
2.3, 2.4, 2.6, and 2.7). In others, we focus on clarifying or reinterpreting existing
distinctions in categories or scales (namely Sects. 2.2 and 2.5). Due to different
levels of research and relevant content in each, some dimensions are addressed in
more depth than others. Also, since the discussions of dimensions are not dependent
on each other, we invite the reader to jump to their subsections of interest.

2.1 Appearance

The mere physical presence of robots in a shared time and space with humans sparks
crucial aspects of a social interaction. Indeed, embodiment, a term used to refer to
the idea that “intelligence cannot merely exist in the form of an abstract algorithm
but requires a physical instantiation, a body” [146], plays an important role in the
perception and experience of interaction with intelligent technology. The literature supports that physical embodiment influences the interaction between humans and robots [63, 102, 112, 118, 134, 149, 197]. In particular, the physical appearance of a robot per se was shown to have a strong influence on people regarding aspects
like perception, expectations, trust, engagement, motivation, and usability [35, 55,
95].
Several taxonomies were developed in order to create representative classifica-
tions for a robot’s appearance. To cite a few, Shibata [167] classified robots as being
human type, familiar animal type, unfamiliar animal type, or imaginary animals/new
character type. Additionally, Fong et al. [69] considered anthropomorphic, zoomor-
phic, caricatured, and functional categories. The number of classifications present in the literature calls for a unified and broad classification of social robot appearances.
Building upon the existing classifications, we introduce a broad classification that
encompasses main categories described by other authors, as well as new categories
and subcategories. Our classification targets only and exclusively a robot’s physical
appearance, as distinct from any type of robot behavior, i.e., “robot at rest”.
We contribute to the study of social robots’ appearance in the following ways:
(1) we integrate similar terms already present in the robot appearance classification
literature, (2) we add new terms that were not represented in the literature but called for classification, and (3) we attempt to clarify concepts
related to different categories. Our unified classification is visually represented in
Fig. 2. We considered the following categories of robot appearances: bio-inspired,
including human-inspired and animal-inspired, artifact-shaped, and functional, each
with several further subcategories (see Fig. 2). We generated this classification with
a holistic mindset, meaning it can serve to classify existing robots, but also to inform the design of future ones. Although devised with social robots in mind, it is general enough to be applied to any robot, independent of its social capabilities. We now provide a description of each category in our classification.

Fig. 2 Summary of our robot appearance classification. This classification was based on prior work from Fong et al. [69] and Shibata [167], and was unified, extended, elaborated, and clarified in the present chapter. Although the focus is on social robots, its scope is general enough to encompass appearances of robots without a social component in their design. List of robots shown (left-to-right, top-to-bottom) Bio-inspired robots: HI-4, ERICA, Kodomoroid, NAO, LOLA, Robotic Eyes, Elumotion, EMYS, AIBO, PARO, DragonBot, Keepon, GoQBot, Meshworm, Robotic Flower, Lollipop Mushroom. Artifact-shaped robots: Travelmate, AUR, Google self-driving car, Greeting Machine, YOLO. Functional robots: CoBot, Quadcopter, Beam, TurtleBot
1. Bio-inspired—Robots in this category are designed after biological organisms
or systems. These include human-inspired and animal-inspired robots (described
next), as well as other bio-inspired robots, such as robotic plants (e.g., the robotic
flower1 ) and fungi (e.g., the Lollipop Mushroom robot2 ).
a. Human-inspired—Robots in this category are inspired by features of the
human body, including structure, shape, skin, and facial attributes. Human-
inspired robots not only include full-body designs, but also robots designed
after human body parts. When designed after the full-human body, they are
called humanoids. The level of fidelity can vary from a highly mechanical
appearance, such as the LOLA robot [41], to a highly human-like appear-
ance that includes skin and clothes, such as the ERICA robot [77], or even an intermediate between these two, as in the case of the NAO robot.3 For
humanoids, it is worth mentioning the case in which they strongly resemble
the human outer appearance and are covered with flesh- or skin- like materi-
als, in which case they are often referred to as androids (if they possess male
physical features) or gynoids (if they possess female physical features). An
example of a gynoid is the Kodomoroid robot.4 Additionally, special cases of
androids/gynoids are geminoids, which are designed after an existing human
individual (i.e., it is a “robotic twin”) such as Geminoid HI-4,5 the tele-operated
robotic twin of Hiroshi Ishiguro. On the other hand, some robots are inspired
by individual parts of the human body. These include robotic arms, e.g., Elu-
motion Humanoid Robotic Arm,6 robotic hands [121], robotic heads such as
the EMYS robot [101], robotic torsos [169], and robotic facial features, such
as robotic eyes [43]. It is worth mentioning that high-fidelity human-inspired
robots are often subject to uncanny valley effects [133]. Being highly but
not totally human-like, they elicit feelings of eeriness, and hence should be
designed bearing these possible effects in mind.
b. Animal-inspired—Robots in this category are inspired by animals or by crea-
tures possessing animal traits of appearance. On the one hand, they may be
inspired by real animals, for which we consider inspiration from familiar
animals, like the AIBO dog-inspired robot,7 and inspiration from unfamil-
iar animals, such as the PARO baby seal robot.8 The distinction between
familiar and unfamiliar animals is emphasized by Shibata [167]. According
to the author, familiar animals are those whose behavior can be easily rec-
ognized, such as pets, while unfamiliar animals are those that most people know something about but are not totally familiar with and have rarely interacted with before, such as savanna animals. The same author mentioned

1 http://www.roboticgizmos.com/android-things-robotic-flower/.
2 https://www.amazon.com/Lollipop-Cleaner-Mushroom-Portable-Sweeper/dp/B01LXCBM3E.
3 https://www.softbankrobotics.com/emea/en/nao.
4 http://www.geminoid.jp/en/robots.html.
5 http://www.geminoid.jp/projects/kibans/resources.html.
6 http://elumotion.com/index.php/portfolio/project-title-1.
7 https://us.aibo.com/.
8 http://www.parorobots.com/.

that when robots are designed to resemble an unfamiliar animal they can be
more easily accepted due to the lack of exposure to their typical behavior. It is
documented in the literature that people hold strong expectations when faced
with the possibility of interacting with a social robot [179], wherein robots
whose embodiment matches their abilities are perceived more positively [79,
105, 117]. However, it is to be noted that familiarity is a subjective concept
depending on culture and individual experiences, making this distinction flex-
ible. On the other hand, animal-inspired robots can also be imaginary, mean-
ing they possess animal-like features but are not designed after a real animal.
They can either be familiar, i.e., designed after familiar imaginary animals
“existing” in fantasy worlds, like cartoon characters or legendary creatures
(e.g., DragonBot [174]), or unfamiliar, i.e., robots that are purely created from
imagination, such as Miro9 and Keepon.10 In addition, this category includes
robots designed after animal body parts, such as the GoQBot designed as a
caterpillar part [120], the Meshworm designed after the oligochaeta [163], and
robotic soft tentacles [96].
2. Artifact-shaped—Robots in this category bear the appearance of physical human
creations or inventions. They may be inspired by objects, such as furniture and
everyday objects, e.g., the AUR robotic desk lamp [89], the Mechanical Ottoman
robotic footstool [176], and the Travelmate robotic suitcase.11 They may also be
inspired by an existing apparatus, demonstrating how existing apparatuses can
become robotic systems while maintaining the same appearance, such as self-
driving cars (e.g., the Google self-driving car12 ), but also everyday apparatuses
like toasters, washing machines, etc. Additionally, artifact-shaped robots may be
imaginary, i.e., translating the invention of the designer, such as the Greeting
Machine robot [11] or YOLO [7, 8].
3. Functional—The appearance of robots included in this category is merely the
sum of appearances of the technological pieces needed to achieve a given task
or function. This means that their appearance leans more towards mechanical
aspects. Examples are quadcopters, or mobile robots such as the CoBots [196],
the TurtleBot,13 and the Beam.14
As a side note, shape-shifting robots, modular robots, or polymorphic robots [15,
116, 204, 205] are all examples of hybrid robots that can fit into more than one
category depending on their configuration. Also, robotic swarms are examples of
multi-robot systems that may be perceived as a single entity, i.e., more than the sum
of individual robots (homogeneous or heterogeneous) [104], however they are not
part of our classification, because they are too dependent on the configuration and

9 http://consequentialrobotics.com/miroe/.
10 https://beatbots.net/my-keepon.
11 https://travelmaterobotics.com/.
12 https://waymo.com/.
13 https://www.turtlebot.com/.
14 https://suitabletech.com/.

behavior of the swarm. Moreover, the actual process of assigning categories to exist-
ing robots always carries a certain degree of subjectivity, related to different possible perceptions of the same robot appearance, which may or may not depend on the context, the behavior of the robot, etc. The clearest example in our classification would be the
distinction between familiar and unfamiliar, which strongly depends on people’s cul-
tural background and personal experiences. Those differences in perception should
be accounted for when designing robot appearances.
Our presented classification is not intended to offer a clear-cut or rigid boundary
between categories of robots. Rather, it represents a useful guideline for categoriz-
ing robots based on major distinguishing features. It does encourage the view of
robot design as a spectrum, providing fluidity to their design and allowing for the
combination of different elements of the classification system.
A robot’s appearance is the most obvious and unique visual attribute, which con-
tributes highly to the interaction [68]. Nonetheless, in addition to appearance, there
are several factors related to embodiment, such as size, weight, noise, material tex-
ture, among others [56] that may contribute to the perception of the robot during an
interaction. More research is needed in order to develop classifications that account
for the other factors mentioned above.
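The classification of Fig. 2 can also be written down as a nested structure, which makes it easy to attach one or more labels to a given robot, in line with the hybrid cases discussed above. The encoding below is only a sketch: the dictionary layout, the helper function, and the example labels are assumptions made for illustration.

```python
# Sketch of the appearance classification of Fig. 2 as a nested dictionary.
# A robot may carry several taxonomy paths (hybrid or shape-shifting designs).
APPEARANCE_TAXONOMY = {
    "bio-inspired": {
        "human-inspired": ["full body (humanoid, android/gynoid, geminoid)",
                           "body parts (arm, hand, head, torso, eyes)"],
        "animal-inspired": ["real familiar", "real unfamiliar",
                            "imaginary familiar", "imaginary unfamiliar",
                            "body parts"],
        "other bio-inspired": ["plant-inspired", "fungus-inspired"],
    },
    "artifact-shaped": ["object-inspired", "apparatus-inspired", "imaginary"],
    "functional": ["appearance given by the task hardware alone"],
}


def label(robot: str, *paths: tuple) -> dict:
    """Attach one or more taxonomy paths to a robot; hybrids simply get several."""
    return {"robot": robot, "appearance": list(paths)}


# Example labels, following examples given in the text.
print(label("PARO", ("bio-inspired", "animal-inspired", "real unfamiliar")))
print(label("AUR desk lamp", ("artifact-shaped", "object-inspired")))
print(label("CoBot", ("functional",)))
```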

2.2 Social Capabilities

Social robots vary greatly in their social capabilities, i.e., how they can engage in
and maintain social interactions of varying complexities. As such, researchers have
classified and defined them according to those social capabilities. Based on the work
of Fong et al. [69], we list the different components of a social robot’s capabilities
as follows:
• Communicating using natural language or non-verbal modalities—Examples
of these ways of communication are natural speech [200], motion [57, 103]—
possibly including gaze [4], gestures or facial expressions—, lights [19, 187],
sounds [24], or a combination of them [122]. Mavridis [128] provided a review
on verbal and non-verbal interactive communication between humans and robots,
defining different types of existing communications such as interaction grounding,
affective communications, and speech for purpose and planning, among others.
• Expressing affect and/or perceiving human emotions—Beyond Ekman’s six
basic emotions [58]—anger, disgust, fear, happiness, sadness, and surprise—, this
component may include more complex affective responses such as empathy. For
example, Paiva et al. [145] analyzed different ways by which robots and other
artificial agents can simulate and trigger empathy in their interactions with humans.
• Exhibiting distinctive personality and character traits—The major compo-
nents to be considered, according to Robert [153], are human personality when
interacting with a robot, robot personality when interacting with humans, dissimi-
larities or complementarity in human-robot personalities, and aspects that facilitate
robot personality. Some companies such as Misty Robotics15 are prioritizing the
user customization of a robot’s personality as an important feature for future com-
mercial social robots.
• Modeling and recognizing social aspects of humans—Modeling human agents
allows for robots to interpret aspects of human behavior or communication and
appropriately respond to them. Rossi et al. [154] provide a survey of sample works
aimed at profiling users according to different types of features. More advanced
models may have to consider theory of mind approaches [158].
• Learning and developing new social skills and competencies—In addition to
being programmed to have social skills, social robots may have the ability to refine
those skills with time through adaptation, or even developing new skills altogether.
An active area of research that looks at such paradigms is the area of developmental
robotics [124].
• Establishing and maintaining social relationships—Relationships operate over
a timespan that goes beyond a few interactions. A number of questions arise when
one considers long-term interactions of robots with humans and what it means for
a robot to proactively establish and maintain a relationship that is two-sided. Leite
et al. [113] established some initial guidelines for the design of social robots for
long-term interaction. These include continuity and incremental robot behaviors
(e.g., recalling previous activities and self-disclosure), affective interactions and
empathy (e.g., displaying contextualized affective reactions), and memory and
adaptation (e.g., identifying new and repeated users).
Complementary to these components, Breazeal [32] distinguished 4 categories of
robot social capabilities: (1) Socially evocative, denoting robots that were designed
mainly to evoke social and emotional responses in humans, leveraging the human ten-
dency to anthropomorphize [60]. Therefore, although social responsiveness is expected of them, the robot's behavior does not necessarily reciprocate; (2) Social interface, denoting
robots that provide a “natural” interface by using human-like social cues and com-
munication modalities. In this sense, the social behavior of humans is only modeled
at the interface level, which normally results in shallow models of social cognition in
the robot; (3) Socially receptive, denoting robots that are socially passive but that can
benefit from interaction. This category of robots is more aware of human behavior,
allowing humans to shape the behavior of the robot using different modalities, such
as learning by demonstration. Also, these robots are socially passive, responding to
humans’ efforts without being socially pro-active; and (4) Sociable, denoting robots
that pro-actively engage with humans, having their own internal goals and needs in
order to satisfy internal social aims (drives, emotions, etc.). These robots require
deep models of social cognition not only in terms of perception but also of human
modeling.
In addition to this list, Fong et al. [69] added the following three categories: (5)
Socially situated, denoting robots that are surrounded by a social environment that
they can perceive and react to. These robots must be able to distinguish between other

15 https://www.mistyrobotics.com/.
Fig. 3 Positioning of the classifications of Breazeal [32] and Fong et al. [69] according to our proposed two-dimensional space formed by (1) the depth of the robot's social cognition mechanisms, and (2) the expected human-perceived level of robot social aptitude. Categories shown, roughly ordered along both axes: socially evocative, social interface, socially receptive, socially situated, socially embedded, sociable, socially intelligent. This figure is merely illustrative and color patches deliberately fuzzy, as we do not pretend to have the tools to actually quantify these dimensions according to any scale

social agents and different objects that exist in the environment; (6) Socially embed-
ded, denoting robots that are situated in a social environment and interact with other
artificial agents and humans. Additionally, these robots can be structurally coupled
with their social environment, and have partial awareness of human interactional
structures, such as the ability to perform turn-taking; and (7) Socially intelligent,
including robots that present aspects of human-style social intelligence, which is
based on deep models of human cognition and social competence.
Although robots have been classified according to their different social capabil-
ities, it is yet unclear how these categories relate to each other. Are they part of a
spectrum? Are they separate categories altogether? We argue that the social capabilities of robots can be understood along two main dimensions:
1. The depth of the robot’s actual social cognition mechanisms.
2. The human perception of the robot’s social aptitude.
Given these dimensions, and in light of the existing categories presented above,
we propose a two-dimensional space map, providing a clearer understanding of the
social capabilities of robots. This map is presented in Fig. 3 for illustrative purposes.
As can be seen in the figure, socially evocative robots have the least depth of social
cognition but are perceived as rather socially apt. A social interface typically pos-
sesses some additional cognition mechanisms to allow for easy communication with
the range of the robot’s functionality; it also possibly results in a slightly higher per-
ceived social aptitude thanks to its more versatile nature. Socially receptive, socially situated, and socially embedded robots possess increasing depth in their
social cognition, and as a result increasing perceived social aptitude. For socially
embedded robots, the perceived aptitude may vary according to the degree of aware-
ness about interactional structure the robot has. On the outskirts of our map we find
sociable and socially intelligent robots, with much deeper models of social cognition.
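Since the two dimensions are explicitly not quantified here, any numbers attached to them are a matter of convention. The sketch below merely encodes the relative ordering described in the text with made-up coordinates, so that the categories can be compared or plotted programmatically; the values themselves are assumptions, not measurements.

```python
# Made-up coordinates encoding only the relative ordering discussed in the text:
# (depth of social cognition, perceived social aptitude), both on an arbitrary 0-1 scale.
SOCIAL_CAPABILITY_MAP = {
    "socially evocative":   (0.10, 0.40),
    "social interface":     (0.20, 0.50),
    "socially receptive":   (0.40, 0.55),
    "socially situated":    (0.55, 0.65),
    "socially embedded":    (0.70, 0.70),  # perceived aptitude varies with interactional awareness
    "sociable":             (0.90, 0.85),
    "socially intelligent": (0.95, 0.90),
}


def rank_by(axis: int) -> list:
    """Order categories along one axis (0 = cognition depth, 1 = perceived aptitude)."""
    return sorted(SOCIAL_CAPABILITY_MAP, key=lambda category: SOCIAL_CAPABILITY_MAP[category][axis])


print(rank_by(0))
```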

2.3 Purpose and Application Area

In this section, we discuss social robots according to their purpose, i.e., what types of
goals they are designed to achieve, as well as benefiting application areas. Figure 4
summarizes the main purposes and application areas included in this section, with
illustrative examples.
A note on purpose as being distinct from embodiment
In traditional engineering practice, the physical characteristics of a technological
device (e.g., toaster, microwave, typewriter, manufacturing machine) tend to be
strongly coupled with its purpose, i.e., the task it was designed to achieve. With
the advent of personal computers and smartphones, we moved away from defining
those devices solely by their purpose. For instance, it would be inappropriate to
call a modern computer an “electronic typewriter” or even a smartphone an “elec-
tronic phone”, because those devices can serve an immense variety of uses, thanks to
software applications that constantly create new purposes for them. Similarly, even
though some robots may currently be designed for a specific purpose in mind, some
robots may possess a set of skills that can prove useful in a variety of scenarios,
sometimes across completely different application areas. As a result, (1) many dif-
ferent robots can be programmed to be used for the same purpose, but also (2) a
single robot can be used for many different purposes. For example, a robot such as
NAO has been used across a large variety of purposes, both in research and industry,
from playing soccer [82] to assisting individuals with cognitive impairments [61,
164] or teaching children [10, 201].
There remains, however, a general tendency to define robots by characteristics of
their programmed behavior, which can be limiting or inappropriate. As an example,
we see locutions of the form “educational robots”, “therapeutic robots”, “pet robots”,
and so on. The Baxter robot,16 for instance, is often referred to as a “collaborative
industrial robot” (or co-bot), because it has been used quite often in such a setting.
However, it has also been used in very different applications, such as assistance for
the blind [31], or education [66], and hence the naming is reductive. Similarly, a “pet
robot” such as the AIBO dog-inspired robot has been used in contexts where it is far
from being considered a pet, such as playing soccer with other robots [185].
Of course, the embodiment of the robot may restrict its capabilities and hence
the type of tasks it may be able to physically achieve. Also, the robot’s hardware
may be optimized for a specific interactive application (e.g., Baxter has compli-
ant joints for safer collaboration). Moreover, a robot’s appearance, which goes
beyond its hardware specifications, may be optimized for human perceptions such as

16 https://www.rethinkrobotics.com/baxter/.

Fig. 4 A cross-section of main application areas for social robots with selected examples, and emphasis on the possibility of more than one purpose for the same physical robot, e.g., Baxter appears in healthcare, industry, and education. Education and entertainment/art were merged for conciseness. All images were adapted with permission of the authors, publishers, companies, or copyright owners. Additional credits (left-to-right, top-to-bottom) NAO: Image belongs to K. Baraka; Paro: Adapted from [168] with permission of authors and Karger, © 2011 Karger, Credits AIST, Japan; Baxter for the blind: Adapted from [31] with permission of authors; Baxter (industry): Courtesy of Rodney Brooks; Robota: Adapted from [29] with permission of authors and Taylor & Francis group, © 2010 Taylor & Francis; Pearl: Courtesy of Sebastian Thrun; SeRoDi: Source Fraunhofer IPA, Photographer Rainer Bez (2015); Robear: Credits RIKEN; Locusbots: Courtesy of Locusbots™; Baxter with children: Adapted from [66] with permission of authors and Elsevier, © 2018 Elsevier; Bee-bot: Credits Ben Newsome, Fizzics Education; CoBot: Image belongs to K. Baraka; Care-O-bot: Source Phoenix Design (2015); Inuktun/Packbot: Adapted from [26] with permission of author; HERB on stage: Adapted from [209] with permission of authors, Credits Michael Robinson; Furby: Credits Robert Perry; Bossa Nova robot: Courtesy of Sarjoun Skaff; HERB in kitchen: Courtesy of Siddhartha Srinivasa; Survivor buddy: Courtesy of Robin Murphy and Cindy Bethel; Robovie: Courtesy of Masahiro Shiomi; Roboceptionist: Courtesy of Reid Simmons; Pepper: Retrieved from Wikimedia Commons under the GNU Free Documentation License, Author Nesnad; Robotinho: Adapted from [141] with permission of Sven Behnke, Credits University of Freiburg; Cog: Retrieved from Wikimedia Commons; Robota (social sciences): Adapted from [29] with permission of authors and Taylor & Francis group, © 2010 Taylor & Francis

acceptability, likeability, trust, and so on, for a specific intended purpose. However,
given the considerations above, we believe that robots should not be defined solely by
their purpose, the same way humans are (hopefully) not defined by their profession.
As a result, we personally prefer a slightly different language to characterize robots
according to their purpose(s): “robots for education” instead of “educational robots”,
“robots for therapy” instead of “therapeutic robots”, and so on. Using this slightly
modified language, we now discuss the main purposes and application areas that are
benefiting from the use of social robots. In light of our discussion, the presented categories are not meant to be mutually exclusive, as the same robot may be used for more than one purpose.

2.3.1 Robots for Healthcare and Therapy

Robots are being introduced in the healthcare sector to assist patients and providers in
hospitals, at home, or in therapy settings. The type of assistance the robot provides can
be generally categorized into physical and/or social. Physically assistive applications
include helping patients with reduced mobility or dexterity, such as the elderly [70]
or people with physical impairments [39]. These robots can help to carry out daily
tasks, like getting out of bed, manipulating objects, eating, and so on, which can
give them a higher sense of autonomy and dignity [165]. They may also help in
therapy to assist patients in regaining lost physical skills or building new ones [39].
On the other hand, socially assistive robotics (SAR) focuses on providing assistance
primarily through social interactions. Feil-Seifer et al. [64] identified a number of
applications where SAR may have a strong impact, namely in therapy for individuals
with cognitive disorders [42, 160], companionship to the elderly and individuals
with neurological disorders or in convalescent care [40], and students in special
education. We also believe that robots in the healthcare domain may be used to
benefit healthcare providers directly, for example training therapists through robotic
simulation of interactions with patients [17].

2.3.2 Robots for Education

Robots in education are mainly used with children [99, 107, 189] because they can
increase engagement in learning while favoring an interactive and playful compo-
nent, which may be lacking in a traditional classroom setting. When designing such
educational robots, it is crucial to design for and evaluate long-term interactions, to
avoid successes merely due to strong novelty effects [113].
There is a number of formats that educational scenarios can take, where the robot
has a different role. Beyond being a teacher delivering material, the robot can also
act as a social mediator between children, encouraging dyadic, triadic, and group
interactions [106]. Moreover, the robot may play the role of a learner in learning-by-
teaching scenarios, in which the child teaches the robot and in this process develops
their own skills [93].

2.3.3 Robots for Entertainment and the Arts

The entertainment industry has benefited from the use of robots for their engaging
and interactive capabilities. Personal entertainment creations emerged with robotic
toys, such as Furby17 or Bee-Bot,18 and robotic dolls, such as Hasbro’s My Real
Baby.19 Public entertainment robots have appeared in theme parks and other public
entertainment spaces [126]. More complex robots with both verbal and non-verbal
communication capabilities have been used for more prolonged interaction scenarios
such as storytelling [47] or comedy [38]. Other entertainment applications include
interactive shows [6], acrobatic robots for movie stunts [148], and sex robots [115],
among others.
More artistic-oriented applications include robots in the visual arts20 [144] and
installation art [12]. Social robots have also been deployed in fields of performative
arts such as drama [209] or dance [44, 186], where their embodied intelligence in real-time contexts and their interactivity remain a rich and challenging research problem. Generally, the inclusion of intelligent robots in the arts and the broader
field of computational creativity [49] are questioning definitions and criteria of art,
authorship, and creativity.

2.3.4 Robots for Industry

As industrial robots are becoming more intelligent, they are being equipped with
interactional capabilities that allow them to collaborate with humans, mainly in tasks
involving manipulation skills. Schou et al. [162] identified several tasks that can ben-
efit from a human-robot collaborative setting, possibly including multi-robot/multi-
human teams. These are: logistic tasks (namely transportation and part feeding),
assistive tasks (namely machine tending, (pre)assembly, inspection, and process exe-
cution), and service tasks (namely maintenance and cleaning).
Research has shown that robots exhibiting social communication cues in industrial
settings are perceived as social entities [157]. Moreover, Fong et al. [69] emphasized
that in order to achieve true collaboration between humans and robots, the robot must
have sufficient introspection to detect its own limitations, must enable bidirectional
communication and information exchange, and must be able to adapt to a variety of
humans from the novice to the experienced.

17 https://furby.hasbro.com/en-us.
18 https://www.bee-bot.us/.
19 https://babyalive.hasbro.com/.
20 An annual robot art competition is held to encourage the use of robots in the visual arts
http://robotart.org/.

2.3.5 Robots for Search and Rescue

Search and rescue is one of the applications in which robots are being investigated
as replacements for humans in dangerous environments, such as natural or human-made
disasters. Even though typical robots in this domain have not been designed with
social capabilities, research has shown the importance of “social intelligence” in
this domain [67]. Bethel and Murphy [25] identified the importance of different
modalities of social communication in the context of victim approach, across the
scale of proxemic zones (i.e., the distance between the robot and the human), ranging from
the public to the personal space. Such modalities include body movement, posture,
orientation, color, and sound.

2.3.6 Robots for Assistance in Home and Workplace

With the advent of personal robots [74], the vision is that anyone will have the ability
to own and operate a robot, regardless of their skills or experience, thanks to natural
and intuitive interfaces [119]. Such robots can be deployed in home or workplace
environments to assist individuals, reduce their mental and physical load, and increase
their comfort and productivity. In the home, personal robots are already cleaning floor
surfaces autonomously,21 cooking full meals,22 and doing laundry,23 just to name a
few. More ambitious research projects have aimed at designing versatile “robotic
butlers” [180], that can operate in a variety of tasks across the home.
In the workplace, robots are being used on a daily basis to transport objects, catalogue inventory, escort people, and deliver messages, among other tasks, in
settings such as offices, hospitals,24 supermarkets,25 and hotels. The majority of these
robots are called service robots and have the capability of navigating in structured
indoor environments, mainly corridors as opposed to open public spaces. An example
of such service robots is the CoBots [196], developed and deployed at Carnegie Mel-
lon University, servicing multiple floors and having navigated more than 1,000 km
autonomously [30]. Other types of robots used in the workplace include tele-presence
robots for tele-conferencing and virtual visits of remote places [193].

2.3.7 Robots for Public Service

Robots have been deployed in public spaces including malls [170], museums [141],
exhibition spaces [94], and receptions [78]. Some (but not all) of those robots are
mobile, and can navigate in open spaces or in crowds, which makes the design of their

21 https://www.irobot.com/for-the-home/vacuuming/roomba.
22 http://www.moley.com/.
23 http://www.laundry-robotics.com/.
24 https://aethon.com/.
25 http://www.bossanova.com.

behavior challenging and subject to a variety of social constraints [123]. Interactions
with such robots have to account for the fact that the robot will interact with a very
large number of people, with inevitable individual differences, and over short durations.
Hence, personalizing the interaction and making it as intuitive as possible (as there
is very little adaptation time on the human side) are important design considerations.

2.3.8 Robots for the Social Sciences

Due to the possibility of programming robots to exhibit mechanisms of cognition
similar to those of humans, a less publicized purpose of robots lies in the social
sciences, for the study of social development, social interaction, emotion, attachment,
and personality [69]. The idea is to use robots as test subjects in controlled laboratory
experiments, leveraging the fact that such robots can reproduce consistent behaviors
repeatedly and can be controlled to test predictions of human models of cognition.
For example, the Cog robot [159] was used to investigate models of human social
cognition. Similarly, a doll-like robot, Robota [29], was used in comparative studies
for social development theories [53]. Additionally, robots (human-inspired or other
types) can be used as stimuli to elicit behaviors from humans for the development
and refinement of theories about human behavior and cognition. For a more detailed
discussion on cognitive robotics and its applications outside of technology-related
fields, consult Lungarella et al. [125].

2.3.9 Other Application Areas

The list of application areas and purposes listed above is not comprehensive, but
reflects major developments and deployments. To this list we can add:
• Robots for companionship—Dautenhahn [51] presented a perspective on differ-
ent possible relationships with personalized (possibly life-long) robotic compan-
ions, drawing on literature from human-animal relationships. Situated somewhere
between animal pets and lifeless stuffed animals, robotic companions may pro-
vide support for socially isolated populations. The technical and design challenges
associated with robotic companions are numerous due to the long-term nature of the
relationship, and the deployment of robotic pets has raised ethical concerns [178].
Examples of robotic companions include the Huggable robot [184], the AIBO dog-inspired
robot [71], and the Lovot robot.26
• Robots for personal empowerment—Perhaps the most ethically minded use of
robots is to expand human abilities instead of replacing them, and to empower
people at an individual level. Examples of personal empowerment that robots may
facilitate are physically assistive robots that help people with impairments gain
autonomy and dignity, such as prosthetics, exoskeletons, brain-controlled robotic

26 https://groove-x.com/en/.

arms [87], and other assistive robots (see Sect. 2.3.1). Other examples include
robots that are designed to enhance creativity in individuals, such as the YOLO
robot [8], or tele-presence robots for workers who cannot physically perform the
required tasks, such as in the “Dawn ver. β” cafe in Japan, which hired paralyzed
people to serve customers through mobile robots controlled by their eye
movements.27
• Robots for transportation—The rise of autonomous driving will revolutionize
transportation and the urban environment. Autonomous vehicles (cars, trucks,
public transportation, etc.) are expected to operate in environments populated by
humans (drivers, pedestrians, bicyclists, etc.), and research is looking at adding
social dimensions to their behavior [129, 137, 198]. Additionally, drones will be
used in the near future for package delivery28 and will have to (socially) interact
with customers.
• Robots for space—Robots for space exploration are historically known for their
low level of interaction with humans. However, as humans become more
involved in space exploration, social robots are being introduced to assist astro-
nauts in their tasks and daily routines, e.g., NASA’s Robonaut and Valkyrie [202].
• Robots for technology research—Robots can also be used to test theories in
fields related to technology, such as testing algorithms and architectures on phys-
ical platforms. More generally, robots can provide a platform for developing and
testing new ideas, theories, solutions, prototypes, etc., for effective embodied tech-
nological solutions and their adoption in society.
The application areas mentioned above provide a cross-section of purposes that
social robots hold in existing developments and deployments. If we view robots as
embodied agents that can intelligently carry out complex tasks in the physical and social
world, we expect, in the future, to see robots introduced in virtually any application
where they can complement, assist, and collaborate with humans in existing roles
and expand their capabilities, as well as potentially assume new roles that humans
cannot or should not assume.

2.4 Relational Role

One of the relevant dimensions that shapes human-robot interaction is the role that the
robot is designed to fulfill. The concept of role is an abstract one, for which various
perspectives can be presented. In this section, we specifically look at the
relational role of the robot towards the human. This is the role that a robot is designed
to fulfill within an interaction, and is not necessarily tied to an application area. The
relational role the robot has been designed to have is critical to the perception, or
even the relationship, that arises between robot and human.

27 https://www.bbc.com/news/technology-46466531.
28 https://www.amazon.com/Amazon-Prime-Air/b?ie=UTF8&node=8037720011.

Towards clarifying the concept of relational role, it is important to immediately
distinguish relational role from role in an activity or application. In a specific activity
or application, we may expect to find activity-specific roles (as in role-playing),
such as teacher, driver, game companion, cook, or therapist. These types of roles are
defined by the type of activity performed between the robot and humans, therefore
making it an open-ended list that is likely to stay in constant evolution as robots
become applied to new fields and tasks.
Given the fuzziness of this concept, there have not been many attempts at general-
izing the concept of role of robots within a relation with humans. For the rest of this
section, we will present and analyze some broader definitions from the existing liter-
ature, and conclude by contributing a broad classification that attempts to merge
the main concepts of the pre-existing ones while extending them.
Scholtz et al. presented a list of interaction models found in HRI [161]. They
included roles that humans may have towards a robot in any HRI application. The
list defines the roles of the Supervisor, who monitors and controls the overall sys-
tem (single or multiple robots), while acting upon the system’s goals/intentions; the
Operator, who controls the task indirectly, by triggering actions (from a set of pre-
approved ones), while determining if these actions are being carried out correctly by
the robot(s); the Mechanic, who is called upon to control the task, robot, and envi-
ronment directly, by performing changes to the actual hardware or physical set-up;
the Peer, who takes part in the task or interaction, while suggesting goals/intentions
for the supervisor to perform; and the Bystander, who may take part in the task or
interaction through a subset of the available actions, while most likely not previously
informed about which those are. These five roles were initially adapted from HCI
research, namely from Norman’s HCI Model [142]. As such, they refer mostly to
the role of the human within a technological system, whereas in this section we look
for a classification to support the roles of robots in relation to humans within their
interaction with each other.
Later, Goodrich et al. [81] built upon this list to propose a classification of roles
that robots can assume in HRI. In their list, it is not specified whether the role refers
to a human or to a robot. Their proposed classification can be vague, as they take
Scholtz’s roles (for humans) and directly apply them to both robots and humans with
no discussion provided. They also extended the list by adding two more roles, but
these are defined only for the robot. In the Mentor role, the robot is in a teaching
or leadership role for the human; in the Informer role, the robot is not controlled by
the human, but the latter uses information coming from the robot, for example in a
reconnaissance task.
The concept of robot roles was also addressed by Breazeal [33], who proposed
four interaction paradigms of HRI. In these paradigms, the robot can either take the
role of a Tool, directed at performing specific tasks, with various levels of autonomy;
a Cyborg extension, in which it is physically merged with the human in a way that
the person accepts it as an integral part of their body; an Avatar, through which the
person can project themselves in order to communicate with another from far away;
or a Sociable partner, as in classic science-fiction fantasy.

Fig. 5 Our classification of relational roles of robots towards humans (represented as “you”)

Based on the many different proposed classifications, and on the various interaction
scenarios and applications found throughout the literature and presented through-
out this chapter, we have outlined our own classification for the role of robots within
a relation with humans. Our classification attempts to merge the various dimensions
of interaction while stepping away from explicit types of scenarios or applications.
It does not necessarily add or propose new roles, but instead redefines them from
a relational perspective, placing emphasis on how the robot relates to the human,
from the human’s perspective, as depicted in Fig. 5.
In our classification for relational roles of robots, we view HRI as including both
the robot and you (the human). As such, we consider the following roles that a robot
may have towards you:
• A robot “for you” serves some utility on a given task. This is the most traditional
role of a tool or a servant, and is inspired by most previous classifications. Although
closely related to the concept of a tool, as proposed by other authors, we frame
this role as a broader type of robotic tool, which can even include robots like
autonomous cars.
• A robot “as you” plays the role of a proxy, namely, but not limited to, tele-presence.
However, it does not necessarily imply interaction from far away as in Breazeal’s
classification [33]. This type of role can exist even when inter-actors are co-located,
as long as the robot is acting in place of another person who operates it (e.g., shared
autonomy scenarios).
• A robot “with you” is typically collaborative, with various levels of autonomy,
including being part of a group with you. It is used in applications in which both
the human and the robot act together, as a team, or towards common goals, and
also includes robots for companionship. The robot and human are not necessarily
co-located, as for example in human-robot teams that have to communicate
remotely.
• A robot “as if you” emulates particular social or psychological traits found in
humans. These robots are mainly used as social sciences research tools (see
Sect. 2.3.8). To date, robots have been used to examine, validate, and refine the-
ories of social and biological development, psychology, neurobiology, emotional
and non-verbal communication, and social interaction.
• A robot “around you” shares a physical space and common resources with the
human. It differs from a “robot with you” by the fact that it is necessarily co-
located with the human, but not necessarily collaborating with them. These are
typically called co-operating, co-present, or bystanders, as previously proposed in
Scholtz’s classification [161].
• A robot “as part of you” extends the human body’s capabilities. These robots typi-
cally have nonexistent or very limited autonomy, but provide humans with physical
capabilities that they would not otherwise have using their own biological body.
Such robots can be used for pure embodiment extension (e.g., strength-enhancing
exoskeletons), or for close-range HRI collaboration, such as the robotic wearable
forearm [195] whose function is to serve as a supernumerary third arm for shared
workspace activities.
The list of relational roles that we present defines non-exclusive roles, meaning
that for some particular applications, we may design and develop robots that take
more than one of these roles, or take a different role when more than one human is
involved in the interaction. An example would be a robot used in an office, which
operates for the users by delivering mail and packages to different locations, while
at the same time acting around them when navigating the office space. Another
example would be an autonomous vehicle operating for the passenger(s), but around
pedestrians and other human drivers.
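To make the non-exclusive and per-human nature of these roles concrete, the following minimal sketch (our own illustration, not part of the framework’s formal definition; all names and role assignments are hypothetical) models a robot as holding a set of relational roles towards each human involved, mirroring the office robot and autonomous vehicle examples above.

from enum import Enum, auto

class RelationalRole(Enum):
    """Relational roles of a robot towards a human (Sect. 2.4)."""
    FOR_YOU = auto()         # serves some utility on a given task
    AS_YOU = auto()          # acts as a proxy for the human
    WITH_YOU = auto()        # collaborates with the human
    AS_IF_YOU = auto()       # emulates human traits (e.g., research tool)
    AROUND_YOU = auto()      # shares space and resources with the human
    AS_PART_OF_YOU = auto()  # extends the human body's capabilities

# Roles are non-exclusive and may differ for each human in the interaction.
office_robot = {
    "mail recipient": {RelationalRole.FOR_YOU},
    "passerby in the corridor": {RelationalRole.AROUND_YOU},
}
autonomous_car = {
    "passenger": {RelationalRole.FOR_YOU},
    "pedestrian": {RelationalRole.AROUND_YOU},
}

for human, roles in office_robot.items():
    print(human, "->", sorted(r.name for r in roles))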

2.5 Autonomy and Intelligence

Necessary aspects to consider when characterizing the behavior of social robots are
those of autonomy and intelligence. Although related, these are two distinct concepts
that are often inconsistently and confusingly used in existing literature [73, 84]. In
particular, it is often assumed that a high level of robot autonomy implies both a high
level of intelligence and of complexity. In reality, some fully autonomous systems
can possess very low intelligence (e.g., a traditional manufacturing machine) or
complexity (e.g., a simple self-operated mechanism). A better clarification of the
concepts of autonomy and intelligence, and their relation, is needed, especially in
the context of social robotics.

2.5.1 Definitions (or Lack Thereof)

The concepts of autonomy and intelligence are hard to define, and there does not
seem to be a single accepted definition of either [22]. In particular, existing definitions in the
literature seem to differ depending on the context of application, and the main field
of focus of the author(s). Based on existing literature, we propose below extended
working definitions of those two concepts in the context of social robotics.

2.5.2 Autonomy

It may seem somewhat paradoxical to talk about autonomy in the context of interac-
tive robots, because traditionally fully autonomous robots are involved in minimal
interactions with humans; in other words, reduced interaction with humans is a by-
product of increased robot autonomy. For social robots however, this relation between
amount of human interaction and robot autonomy is questioned. Highly autonomous
social robots are expected to carry out more fluid, natural, and complex interactions,
which does not make them any less autonomous. There exists a very large number
of definitions of autonomy for general agents, however central to most existing def-
initions is the amount of control the robot has over performing the task(s) it was
designed to fulfill (or that it sets to itself), as emphasized by Beer et al. [22]. For
social robots, tasks may include well-defined goal states (e.g., assembling furniture)
or more elusive ones (e.g., engaging in conversation).
We claim that in addition to control, the concept of autonomy should also
account for learning. Indeed, many learning paradigms include human-in-the-loop
approaches, and we believe these should be taken into account. These include active
learning [46], learning by demonstration [155], and corrective human feedback learn-
ing [131], used within the context of interactions in applications involving human
teachers such as learning-by-teaching educational scenarios [93] or general collab-
orative scenarios [34]. As a result, we extend the definition from Beer et al. [22] to
make it applicable to social robots, and define autonomy of a social robot as follows:
Autonomy—“The extent to which a robot can operate in the tasks it was designed
for (or that it creates for itself) without external intervention.”
Note the use of the term intervention as opposed to interaction.

2.5.3 Intelligence

There is no real consensus on the definition of general intelligence [73]. In the context
of robotics and AI, intelligence is generally emphasized as related to problem solv-
ing [139]. For social robots, we propose the following extension of the definition of
Gunderson et al. [84]:
Intelligence—“The ability to determine behavior that will maximize the likeli-
hood of goal satisfaction under dynamic and uncertain conditions, linked to the
environment and the interaction with other (possibly human) agents.”

Note that intelligence is also dependent on the difficulty of the goals to be achieved.
Based on this definition, it can be seen that intelligence and autonomy are distinct
concepts, but that, for a given task, intelligence creates a bound on achievable auton-
omy. In other words, the level of intelligence of a robot may limit its ability to reach
a given level of autonomy for fixed robot capabilities [84]. A final important note
concerning the design of social robots is that a robot’s perceived intelligence [20]
can be drastically different from its actual intelligence. As a result, minimizing the
gap between the two is crucial for maintaining adequate expectations and appropriate
levels of trust on the human side. Now that we have defined the concepts of autonomy
and intelligence, we discuss approaches to quantify them.

2.5.4 Quantifying Autonomy and Intelligence

Unlike scales from the automation [59] or tele-operation [80, 91, 166, 203] fields,
and more recently for autonomous vehicles [156], all of which are based on the
idea that more autonomy requires less HRI, some researchers have developed scales
of autonomy that apply to social robots [22, 65, 81, 191]. These emphasize the
fact that autonomy has to be understood as a dynamic entity [81]. On the other hand,
measuring robot intelligence has been the subject of some investigation, from both
practical [3] and theoretical perspectives [27]. Both autonomy and intelligence can
be seen as belonging to a continuum, taking into account aspects of robot perception,
cognition, execution, and learning [84, 203]. As a result, autonomy is a dimension
that one designs for, constrained by possible achievable levels of intelligence. As a
general rule, the higher the autonomy and intelligence, the higher the complexity
of the system.
The importance of dimensional thinking
For a highly heterogeneous technology such as a social robot, which involves a combina-
tion of hardware, software architecture, cognition mechanisms, and intelligent hardware
control, to name a few, it is important to define dimensions for aspects such
as autonomy and intelligence. The overall assessment of these aspects would then
depend on a combination of assessments over individual dimensions. Researchers
at IBM have proposed to define “dimensions of (general artificial) intelligence”
as a way to construct an updated version of the Turing test [194]. Their list is more
task-oriented, but can serve as a basis to think about general dimensions for both
intelligence and autonomy. We propose the following dimensions of intelligence
and autonomy, accounting for the socially interactive factor:
1. Perception of environment-related and human-related factors—In order to
engage in successful interactions, social robots need to be able to assess the
dynamic state of the physical environment and of humans, to inform their deci-
sion making. On the human side, this includes estimating the human’s physical
parameters (pose, speed, motion, etc.), speech, and non-verbal social cues (ges-
tures, gaze, prosody, facial expressions, etc.).
2. Modeling of environment and human(s)—In order to interpret robot percep-
tions, models of the environment and of humans are needed. For example, models
of the humans can allow the robot to infer their intents, personality, emotional
or affective states, and predict future human states or behavior. If models are
parametrized to capture individual differences, then they can be a powerful tool
to inform personalization and adaptation mechanisms in HRI [154].
3. Planning actions to interact with environment and human(s)—Decision-
making on a robot can be reduced to creating plans for robot actions that take
into account the shape of the task, the goal, and the current state of the world,
including the robot, the environment, and the human(s). A social robot needs to
plan its motion, speech, and any other modality of social behavior it may be able
to exhibit.
4. Executing plans under physical and social constraints—In the same way that the
environment poses physical constraints on how the robot interacts with it, culture
and society impose social constraints on how interactions with a robot should take
place [111]. Robot decision-making should take human social norms into account
while planning and executing generated plans [45]. Note that the execution of the
plan may not be successful, hence the robot needs to account for all possible
outcomes.
5. Learning through interaction with the environment or humans—On top of
the four basic dimensions mentioned above, some robots may be endowed with
learning capabilities, which allow them to improve with time, throughout their
interactions with the environment or humans (including human-in-the-loop learn-
ing). Note that this dimension does not necessarily encompass machine learning
as a general technique, as many offline machine learning methods would fall
under the dimensions of perception and modeling.
The dimensions above span most existing building blocks for the intelligence of
a social robot. However, depending on their implementation and complexity, some
robots may not include one or more of the above dimensions. Those dimensions
are generally separated in the design and implementation of most robots, and as a
result, intelligence and autonomy on each dimension may be completely different.
For example, some semi-autonomous robots include completely human-controlled
perception [183], or rely on human input for learning [46, 131, 155] or for verifying the
suitability of robot plans [61].
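As a minimal illustration of this dimensional view (our own sketch, not a standardized instrument; the dimension names follow the list above, and the numeric scores in [0, 1] are purely hypothetical), autonomy and intelligence could be recorded per dimension rather than as single scalars, with an overall assessment derived from the individual ones:

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SocialRobotProfile:
    """Per-dimension assessment of a social robot; scores are illustrative."""
    autonomy: Dict[str, float] = field(default_factory=dict)
    intelligence: Dict[str, float] = field(default_factory=dict)

    def overall(self, aspect: str) -> float:
        # One possible combination rule: a plain average over dimensions.
        scores = getattr(self, aspect)
        return sum(scores.values()) / len(scores) if scores else 0.0

# Example: a semi-autonomous robot with human-controlled perception
# but largely autonomous planning and execution.
profile = SocialRobotProfile(
    autonomy={"perception": 0.2, "modeling": 0.6, "planning": 0.9,
              "execution": 0.9, "learning": 0.3},
    intelligence={"perception": 0.5, "modeling": 0.7, "planning": 0.8,
                  "execution": 0.6, "learning": 0.4},
)
print(f"overall autonomy: {profile.overall('autonomy'):.2f}")
print(f"overall intelligence: {profile.overall('intelligence'):.2f}")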
As technology advances, higher amounts of robot intelligence will be achievable,
unlocking new possible levels of autonomy for more complex tasks; however, the
amount of autonomy of a system (within possible technological limits) will remain
a design choice. As a design principle for future social robots, we advocate for
the notion of symbiotic autonomy [50, 196], where both humans and robots can
overcome their limitations and potentially learn from each other.

2.6 Proximity

Spatial features of the interaction may have a strong influence on the type of possible
interactions and their perception by humans. In this section, we focus on the prox-
imity of the interaction, i.e., the physical distance between the robot and the human.
In particular, we consider three general categories of interactions according to the
proximity dimension: remote, co-located, and physical.

2.6.1 Remote HRI

Several applications in HRI require the human and the robot to be in physically
remote places. Tele-operation applications generally involve tasks or environments
that are dangerous or inaccessible to humans, and historically represent one of the
first involvements of humans with robots. In traditional tele-operation contexts, the
human is treated as an operator, intervening to shape the behavior of one or more
robots. Such types of HRI scenarios have been extensively studied and a number of
metrics have been developed for them [182]. However, they are often excluded from
the literature in social robotics [69].
More recent developments in the field of tele-operation gave rise to tele-presence
applications, which treat the robot as a physical proxy for the human [108, 193],
allowing the latter for example to be virtually present in tele-conferencing settings,
or to visit remote places. Because the robot is used to interact with humans in
the remote environment, its design may include a strong focus on socially embodied
aspects of the interaction beyond mere audio and video, such as distancing and gaze
behavior [2].
In the previously cited literature, several commonly faced issues are noted that
should be addressed when developing social robots for tele-
presence applications, such as concerns regarding privacy, a proper control interface
for the pilot (including a map of the environment and the robot’s surroundings), adapt-
ability to people’s height and stance (e.g., sitting, standing, behind a desk), robustness
towards communication failures (e.g., loss of WiFi connection), and dynamic volume
control.
Finally, an important aspect of remote interaction is the translation of the operator’s
input into robot behaviors. Many interfaces have been developed for controlling tele-
presence robots, including graphical and tangible interfaces [110], as well as virtual
reality tools [140] and brain-machine interfaces [192].

2.6.2 Co-located HRI

This category includes all interactions in which the robot and the human are located
in a shared space and interact directly without explicit physical contact. This is the
case for most existing social robotics scenarios.

Within these cases we are most interested in mentioning the ones in which the
robot has some form of locomotion ability (e.g., legged robots, aerial robots, wheeled
robots), and also the ability to perceive and measure the distance to the human, in
order to be able to actively control the distance between them. The social meaning
of proximity in this context is referred to as proxemics, and constitutes an important
part of non-verbal robot behavior [135].
Mead et al. [130] have explored this topic by taking into account not only the
psycho-physical and social aspects of proximity from the human’s perspective, but
also the robot’s needs. In terms of needs related to proximity, social robots
may require or prefer certain distances to people in order for their sensors to work
properly (e.g., vision, speech interaction).
Depending on the actual distance of the co-located robot, different modalities
of communication may be more suitable. For example, robots in the private space
may interact using speech or sound, and use a touch screen for human input. However,
robots at a greater distance but within line of sight, such as mobile robots, autonomous
cars, or drones may use visual signals instead, such as expressive lights [19, 187].
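As a toy sketch of such distance-dependent modality selection (ours, for illustration only; the zone boundaries loosely follow common proxemics values and the modality choices paraphrase the examples above, so none of the thresholds should be read as prescriptive):

def choose_modalities(distance_m: float) -> list:
    """Pick interaction modalities based on robot-human distance (illustrative)."""
    if distance_m < 1.2:       # roughly personal/private space
        return ["speech", "sound", "touch screen input"]
    elif distance_m < 3.6:     # roughly social space
        return ["speech", "gestures"]
    else:                      # public space, within line of sight
        return ["expressive lights", "large, legible motion"]

for d in (0.8, 2.5, 8.0):
    print(f"{d:.1f} m -> {choose_modalities(d)}")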

2.6.3 Physical HRI

Interactions happening in a shared space may involve an additional modality, namely
physical contact between the human and the robot. Such interactions pertain to a blos-
soming subfield of HRI, commonly designated as Physical Human-Robot Interac-
tion, or pHRI for short [28, 85, 208]. From a hardware perspective, robots involved
in pHRI are being designed with compliant joints (e.g., Baxter robot) for safety.
Also, the design of robot outer shells is taking texture and feel into account [206].
Moreover, novel paradigms for robot hardware are emerging with soft robotics [127].
Examples of pHRI include physically assistive applications, where a robot has to
be in physical contact with the person to execute its tasks, such as getting patients out
of a chair [173], or helping them feed [177] or dress themselves [100]. In industrial
settings, physical proximity has also been shown, for some tasks, to improve the
interaction and its perception by the workers [92].
On the other hand, physical contact may be used as a communication modality
in itself, using a combination of touch, motion, pressure, and/or vibration, known as
haptic communication [132]. Such a communication modality is especially useful
when others (e.g., visual) are not feasible. In particular, research has looked at how
robots can communicate or guide people with visual impairments using physical
contact. For example, Bonani et al. [31] investigated the use of movement of a
Baxter’s arm that blind people held to complement verbal instructions in a playful
assembly task. Additionally, mobile robots have been used to guide people in indoor
environments using physical contact [109, 172].
Moreover, physical contact may possess a social component. This is the case
when a robot behavior utilizing physical contact with a human is meant to induce
or influence their behavior. For example, a mobile robot may use physical contact
when navigating through a human-crowded environment, inducing people to move
away [175]. Also, affective robot behaviors involving contact, such as a hug or a
handshake, have been shown to have an influence on the social behavior of the
humans in their interaction with the robot (e.g., self-disclosure or general perception
of the robot) [13, 171]. Human-robot haptics have also been investigated by studying
the role of physical contact in human-animal interactions [207].
While the spatial features discussed in this section pertain to different fields of
research, one would expect in future robotic technologies a range of interactions that
would incorporate a combination of the three, according to the task and situation at
hand.

2.7 Temporal Profile

In this section, we look at time-related aspects of interactions with a social robot.
Knowing the intended temporal profile of these interactions may have a strong impact
on the design of such robots. We specifically discuss the timespan, the duration, and
the frequency of interactions.

2.7.1 Timespan

Interactions with robots can be classified according to timespan, meaning the period
of time in which the human is exposed to the robot. We consider four timespan
categories, namely short-term, medium-term, long-term, and life-long. There does
not exist, in the HRI literature, a quantitative way to establish the boundaries between
these four categories, as they may be context-dependent. Our aim is hence to provide
a useful guideline for thinking about implications of such categories in the design of
social robots, as well as their evaluation.
• Short-term interactions typically consist of a single or only a few consecutive
interactions, e.g., a robot giving directions in a mall. Of special importance for
these types of interactions are design factors that influence the first impression
of the human towards the robot (e.g., appearance, size, motion “at rest”, prox-
emics/approach behavior, initiation of the interaction). Usually very present in
short-term interactions is the novelty effect, a fundamental characteristic of any
innovation stemming from its newness or freshness in the eyes of the adopter [199].
It is a salient effect that plays a role in the adoption and use of novel media,
producing higher initial achievements not because actual improvements occur,
but due to the increased interest in the technology [48]. This
effect may help or harm the interaction depending on its content and outcome, but
it should be kept in mind in the design of robots for short-term use, also accounting
for different expectations based on the users’ demographics.
• Medium-term interactions go beyond a single or a few interaction(s) but do not
extend over a timespan long enough to be considered part of the long-term category.
They typically span several days or weeks. An example is a robot used to teach
children a module in their curriculum over a few weeks. During repeated interac-
tions, the novelty effect may wear off after the first few interactions, resulting in
potential loss of interest or changes in attitudes towards robots over time [78, 98].
When considering repeated interactions with the same robot, it is hence essential
to take this dynamic aspect into account by incrementally incorporating novelty
or change in the behavior of the robot as well as maintaining a sense of continuity
across interactions [10, 113]. This will help sustain engagement and satisfaction
both within and across individual interactions.
• Long-term interactions include prolonged interactions that go beyond the period
needed for the novelty effect to fade [113]. An example is a personal robot operating
in a home. Long-term interactions typically create a sense of predictability for the
human, who knows they will encounter the robot again. Additionally, humans
may start to feel a sense of attachment to the robot, and even develop relationships
with it. In addition to the points mentioned for the medium-term category, it is
crucial to consider how the robot can both personalize and adapt its interactions
with the human. Personalization means that the robot will accommodate for inter-
individual differences, usually focusing on static or semi-static features of the
human such as personality, preferences, or abilities. Adaptation means that the
robot accommodates for intra-individual changes, focusing on dynamic features
of the human such as physical, psychological and emotional state, performance, or
behavior. For surveys about personalization and adaptation in HRI, please consult
Rossi et al. [154] and Ahmad et al. [5]. Personalization can also include a dynamic
component; for example, an algorithm has been developed for an office robot to
learn not only preferences of robot behaviors but also how to switch between them
across interactions, according to personality traits of the human [18].
• Life-long interactions differ from long-term interactions by the fact that the human
may go through large changes, for example, transitioning from childhood to adult-
hood, or progressively losing some capabilities during old age. These types of
interactions are much rarer with existing robots, but we do have examples that
include robotic pets adopted in life-long timespans such as the AIBO or PARO
robots. Another example is robots meant to accompany people until the end of
their lives, such as robots assisting the elderly while gaining skills over time, hence
compensating for the decrease in their users’ capabilities [75]. In the future, the
vision of robotic companions [51] may include richer interactions including mutual
learning and evolution, emotional support, and building deeper bidirectional rela-
tionships.

2.7.2 Duration and Frequency

In addition to timespan, an important temporal aspect of the interaction is the average
duration of individual interactions. For example, a human can interact with a robot
in short-term but prolonged interactions (e.g., in an educational context), or on the
contrary in short interactions over a long timespan (e.g., office robot), or in other
combinations and levels of the above. An important question to consider for longer
durations is how to maintain engagement, especially with populations with a short
attention span, such as children. For short durations, it is important to design for
intuitiveness and efficiency of the interaction, in order to reduce the cognitive load
or adaptation time of the human.
It is worth mentioning that duration is often imposed by the task itself, but may
also be imposed by the human’s willingness to end it. For example, the Robocep-
tionist [78] interacts with people in a building over large timespans. It was designed
as a conversational chatbot, hence every person that interacts with it can initiate
and end the interaction at any moment. The authors reported short interactions gen-
erally under 30 s, and aimed at increasing this duration by designing for long-term
interactions with engagement in mind, using techniques from the field of drama.
In addition to timespan and duration, the frequency of interactions plays a role in
their perception by humans, and in the resulting design considerations. The frequency
of interactions with the same robot can vary from very occasional (e.g., robots in
stores visited sporadically) to multiple times per day (e.g., workplace robots). For
high frequencies, a lack of incorporation of novelty, or at least variation in the robot’s
behavior, may result in fatigue and lack of engagement. Also, achieving continuity
through memory is a particularly relevant factor [113]. Currently, the effect of
frequency on the perception and effectiveness of interactions appears to be largely
unexplored in the HRI literature.
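Bringing timespan, duration, and frequency together, the sketch below (our own condensation of the considerations in this section; the categories, thresholds, and hints are illustrative rather than validated guidelines) shows how a temporal profile could be turned into a checklist of design considerations:

from dataclasses import dataclass

@dataclass
class TemporalProfile:
    timespan: str            # "short", "medium", "long", or "life-long"
    avg_duration_min: float  # average duration of a single interaction
    freq_per_week: float     # average number of interactions per week

def design_hints(p: TemporalProfile) -> list:
    """Illustrative design considerations derived from the discussion above."""
    hints = []
    if p.timespan == "short":
        hints.append("design for first impressions; expect a strong novelty effect")
    else:
        hints.append("incorporate novelty/variation and continuity across interactions")
    if p.timespan in ("long", "life-long"):
        hints.append("personalize (inter-individual) and adapt (intra-individual)")
    if p.avg_duration_min > 20:
        hints.append("plan for sustained engagement within a single interaction")
    if p.freq_per_week > 7:
        hints.append("vary behavior and use memory to avoid fatigue")
    return hints

office_robot = TemporalProfile(timespan="long", avg_duration_min=1.0, freq_per_week=10)
for hint in design_hints(office_robot):
    print("-", hint)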
This concludes our discussion of time-related aspects of the interaction, as well
as the discussion of our framework as a whole. Before concluding this chapter, we
provide a brief discussion of design approaches for social robots.

3 Working Within the Social Robot Design Space

The framework presented in this chapter outlined major dimensions of relevance to
the understanding of existing social robots and the design of future ones. Moving
forward, it effectively defines a design space for social robots, where each of the
aspects discussed will involve a set of design decisions. For example: What role
should my robot play in relation to humans? What should it look like? What kind
of social capabilities should it have? What level of autonomy is best fitted for the
task(s) and should it be fixed? etc. Higher-level decisions in the design process also
arise such as: Are the requirements feasible with current technology, or will it require
developing new technology? What are the practical considerations associated with
the “theoretically best” design, as well as the costs, and are they outweighed by the
benefits?
The actual design process of social robots and their interactions with humans has
benefited from a number of design approaches inspired by design practices from a
variety of fields such as engineering, computer science, HCI, and human factors. For
example, some researchers in HRI have looked at developing design patterns that
can be reused without having to start from scratch every time [97]. There generally
exist three broad design approaches, each of which may be valid depending on the
intended context and objectives: human-centered design, robot-centered design, and
symbiotic design. We briefly discuss these approaches next.

3.1 Robots as Technology Adapted to Humans (Human-Centered Design)

Human-centered design (HCD) is the central paradigm of HCI, and much of HRI
design as a result. It aims to involve the intended user population as part of most
development stages, including identifying needs and requirements, brainstorming,
conceptualizing, creating solutions, testing, and refining prototypes through an iter-
ative design process [1].
In the HRI context, the main assumption is that humans have their own com-
munication mechanisms and unconsciously expect robots to follow human social
communication modalities, rules, conventions, and protocols. Important aspects of
the robot behavior and embodiment design that play a strong role in terms of the
human’s perception of the interaction include physical presence [14], size [149],
embodiment [112, 197], affective behaviors [114], role expectations [54], just to cite
a few. From an evaluation point of view, HCD relies heavily on subjective self-reports of
users to measure their perceptions, which complement more objective measures such
as task performance.
While many HCD approaches exist for social robots, one of particular interest is
treating robots as expressive characters, i.e., robots with the ability to express iden-
tity, emotion, and intention during autonomous interaction with human users [152].
Designing for expressivity can be achieved for example by bringing professional
animators to work side by side with robotic and AI programmers. The idea is to
utilize concepts of animation developed over several decades [190] and apply them
to robotic platforms [36, 76, 88, 150, 151, 188].

3.2 Robots as Goal-Oriented Technology (Robot-Centered Design)

Historically, robots were developed solely by engineers who gave little consideration
to the human beyond the interface. While the focus in HRI has now shifted to a
more human-centered approach, as discussed in the previous section, HCD as a
general design paradigm has been criticized by many researchers who consider it
harmful in some respects [83, 143]. For example, it has been criticized for its focus on
usability (how easy it is to use) as opposed to usefulness (what benefits it provides), and
for its focus on incremental contributions based on human input conditioned by current
technologies, which discourages pushing technological boundaries. Additionally,
adapting the technology to the user may sometimes be more costly than having the
user adapt to the technology.
As a result, there are cases where a more robot-centered approach may work
best. Excessively adapting robots to humans may result in suboptimal performance,
high cost of development, or unmatched expectations. It is important to recognize
that in some cases, it may be better to ask the human to adapt to the robot (maybe
through training) in order to achieve better performance in the long run. Humans
have a much better ability to adapt than robots, and it is crucial to identify when
robots should not adapt because it would be more efficient to ask or expect humans
to do it [143]. In many cases, the robot may have needs that may incur an immediate
cost on humans, but result in a better future performance. Examples include robots
asking for help from humans when they face limitations [196], or teaching the robot
to perform a certain task so that it can perform better in subsequent tasks. A robot-
centered approach may also include the adaptation of our environments to make
them suitable for robots. Examples include avoiding construction materials that are
not compatible with the robot’s sensors, interfacing the robot with building facilities
(such as elevators), and so on.

3.3 Robots as Symbiotic Embodied Agents (Symbiotic Design)

Both of the approaches discussed above, human-centered and robot-centered, are
valid options that one can use when designing social robots and their associated
tasks. As a general design process for such robots, we advocate for the careful
identification of the strengths and weaknesses of each party, and for designing for an increased
symbiosis between the human(s) and the robot(s). One way to achieve this symbiosis
is to adopt a holistic view that focuses on the overall system behavior, as a function
of robot(s), human(s), and the environment [183]. For example, the CoBot robots are
autonomous mobile robots [196] servicing human users in a building, designed with
the ability to utilize the presence of other humans in the environment (i.e., passersby)
to overcome their limitations. For instance, they ask for assistance in pressing the
elevator button or putting objects in their basket since they do not have arms. This
is an example of symbiotic autonomy where humans and robots service each other
mutually in the same shared environment, and where both parties have to adapt to
the other party’s needs.

4 Conclusion

In this chapter, we have introduced a framework for characterizing social robots
and their interactions with humans along principal dimensions reflecting important
design considerations. In particular, we (1) presented a broad classification of robot
appearances, (2) repositioned existing classifications of robot social capabilities, (3)
discussed a cross-section of purposes and application areas, (4) provided a straight-
forward and broad classification of the robot’s relational role, (5) clarified the related
but distinct concepts of autonomy and intelligence, and discussed their quantifica-
tion, (6) analyzed interactions according to their spatial features, and (7) looked at
time-related aspects of the interactions. While this framework is aimed primarily at
characterizing social robots by drawing from a large body of literature to illustrate
the concepts discussed, it also serves as a useful guide to inform the design of future
social robots. Towards this end, we briefly touched upon different design approaches,
namely human-centered, robot-centered, and symbiotic.
Social robotics is a growing multidisciplinary field that bridges aspects of human
nature with aspects of robotic technology. The scope of what a social robot means,
does, or serves, will be shaped by future developments in the field. In this journey
towards creating interactive intelligent machines, we are hopeful that as they become
more socially apt, they will contribute to expanding, not reducing, the fundamental aspects
of our humanity.

Acknowledgements We would first like to thank Céline Jost for inviting us to be part of
this book project and for contributing to the initial stages of the manuscript. Additionally, this
book chapter would have not been possible without the valuable comments and suggestions
of Prof. Ana Paiva. We would also like to thank the participants and co-organizers of the
HRI Reading Group at Instituto Superior Técnico for sparking many discussions that influenced the
content of this chapter. We would finally like to acknowledge the Global Communication Center at
CMU for their feedback on one of our drafts. K. Baraka acknowledges the CMU-Portugal INSIDE
project grant CMUP-ERI/HCI/0051/2013 and Fundação para a Ciência e a Tecnologia (FCT) grants
with ref. SFRH/BD/128359/2017 and UID/CEC/50021/2019. P. Alves-Oliveira acknowledges a
grant from FCT with ref. SFRH/BD/110223/2015. The views and conclusions in this document are
those of the authors only.

References

1. Abras, C., Maloney-Krichmar, D., Preece, J.: User-centered design. In: Bainbridge, W. (ed.)
Encyclopedia of Human-Computer Interaction, pp. 445–456. Sage Publications, Thousand
Oaks, vol. 37(4) (2004)
2. Adalgeirsson, S.O., Breazeal, C.: Mebot: a robotic platform for socially embodied presence.
In: Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction,
pp. 15–22. IEEE Press (2010)
3. Adams, S.S., Banavar, G., Campbell, M.: I-athlon: towards a multidimensional Turing test. AI
Mag. 37(1), 78–84 (2016)
4. Admoni, H., Scassellati, B.: Social eye gaze in human-robot interaction: a review. J. Hum.-
Robot. Interact. 6(1), 25–63 (2017)
5. Ahmad, M., Mubin, O., Orlando, J.: A systematic review of adaptivity in human-robot inter-
action. Multimodal Technol. Interact. 1(3), 14 (2017)
6. Alonso-Mora, J., Siegwart, R., Beardsley, P.: Human-robot swarm interaction for entertain-
ment: from animation display to gesture based control. In: Proceedings of the 2014 ACM/IEEE
International Conference on Human-Robot Interaction, p. 98. ACM (2014)
7. Alves-Oliveira, P., Arriaga, P., Paiva, A., Hoffman, G.: Yolo, a robot for creativity: a co-
design study with children. In: Proceedings of the 2017 Conference on Interaction Design
and Children, pp. 423–429. ACM (2017)

8. Alves-Oliveira, P., Chandak, A., Cloutier, I., Kompella, P., Moegenburg, P., Bastos Pires, A.E.:
Yolo-a robot that will make your creativity boom. In: Companion of the 2018 ACM/IEEE
International Conference on Human-Robot Interaction, pp. 335–336. ACM (2018)
9. Alves-Oliveira, P., Küster, D., Kappas, A., Paiva, A.: Psychological science in HRI: striving
for a more integrated field of research. In: 2016 AAAI Fall Symposium Series (2016)
10. Alves-Oliveira, P., Sequeira, P., Melo, F.S., Castellano, G., Paiva, A.: Empathic robot for
group learning: a field study. ACM Trans. Hum.-Robot. Interact. (THRI) 8(1), 3 (2019)
11. Anderson-Bashan, L., Megidish, B., Erel, H., Wald, I., Hoffman, G., Zuckerman, O., Grishko,
A.: The greeting machine: an abstract robotic object for opening encounters. In: 2018 27th
IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN),
pp. 595–602. IEEE (2018)
12. Augugliaro, F., Lupashin, S., Hamer, M., Male, C., Hehn, M., Mueller, M.W., Willmann,
J.S., Gramazio, F., Kohler, M., D’Andrea, R.: The flight assembled architecture installation:
cooperative construction with flying machines. IEEE Control. Syst. 34(4), 46–64 (2014)
13. Avelino, J., Moreno, P., Bernardino, A., Correia, F., Paiva, A., Catarino, J., Ribeiro, P.: The
power of a hand-shake in human-robot interactions. In: 2018 IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems (IROS), pp. 1864–1869. IEEE (2018)
14. Bainbridge, W.A., Hart, J., Kim, E.S., Scassellati, B.: The effect of presence on human-robot
interaction. In: The 17th IEEE International Symposium on Robot and Human Interactive
Communication. RO-MAN 2008, pp. 701–706. IEEE (2008)
15. Balch, T., Parker, L.E.: Robot teams: from diversity to polymorphism. AK Peters/CRC Press
(2002)
16. Baraka, K., Couto, M., Melo, F.S., Veloso, M.: An optimization approach for structured
agent-based provider/receiver tasks. In: Proceedings of the 18th International Conference
on Autonomous Agents and MultiAgent Systems, pp. 95–103. International Foundation for
Autonomous Agents and Multiagent Systems (2019)
17. Baraka, K., Melo, F.S., Veloso, M.: Interactive robots with model-based ‘autism-like’ behav-
iors. Paladyn J. Behav. Robot. 10(1), 103–116 (2019)
18. Baraka, K., Veloso, M.: Adaptive interaction of persistent robots to user temporal preferences.
In: International Conference on Social Robotics, pp. 61–71. Springer (2015)
19. Baraka, K., Veloso, M.: Mobile service robot state revealing through expressive lights: for-
malism, design, and evaluation. Int. J. Soc. Robot. 10(1), 65–92 (2018)
20. Bartneck, C., Kulić, D., Croft, E., Zoghbi, S.: Measurement instruments for the anthropomor-
phism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc.
Robot. 1(1), 71–81 (2009)
21. Baxter, P., Kennedy, J., Senft, E., Lemaignan, S., Belpaeme, T.: From characterising three
years of HRI to methodology and reporting recommendations. In: The Eleventh ACM/IEEE
International Conference on Human Robot Interaction, pp. 391–398. IEEE Press (2016)
22. Beer, J.M., Fisk, A.D., Rogers, W.A.: Toward a framework for levels of robot autonomy in
human-robot interaction. J. Hum.-Robot. Interact. 3(2), 74–99 (2014)
23. Belpaeme, T., Kennedy, J., Ramachandran, A., Scassellati, B., Tanaka, F.: Social robots for
education: a review. Sci. Robot. 3(21) (2018)
24. Bethel, C.L., Murphy, R.R.: Auditory and other non-verbal expressions of affect for robots.
In: 2006 AAAI Fall Symposium Series, Aurally Informed Performance: Integrating Machine
Listening and Auditory Presentation in Robotic Systems, Washington, DC (2006)
25. Bethel, C.L., Murphy, R.R.: Survey of non-facial/non-verbal affective expressions for
appearance-constrained robots. IEEE Trans. Syst. Man Cybern. Part C (Applications and
Reviews) 38(1), 83–92 (2008)
26. Bethel, C.L.: Robots without faces: non-verbal social human-robot interaction. Ph.D. thesis,
University of South Florida (2009)
27. Bien, Z., Bang, W.C., Kim, D.Y., Han, J.S.: Machine intelligence quotient: its measurements
and applications. Fuzzy Sets Syst. 127(1), 3–16 (2002)
28. Billard, A., Bonfiglio, A., Cannata, G., Cosseddu, P., Dahl, T., Dautenhahn, K., Mastrogio-
vanni, F., Metta, G., Natale, L., Robins, B., et al.: The roboskin project: challenges and results.
In: Romansy 19–Robot Design, Dynamics and Control, pp. 351–358. Springer (2013)

29. Billard, A., Robins, B., Nadel, J., Dautenhahn, K.: Building Robota, a mini-humanoid robot
for the rehabilitation of children with autism. Assist. Technol. 19(1), 37–49 (2007)
30. Biswas, J., Veloso, M.: The 1,000-km challenge: insights and quantitative and qualitative
results. IEEE Intell. Syst. 31(3), 86–96 (2016)
31. Bonani, M., Oliveira, R., Correia, F., Rodrigues, A., Guerreiro, T., Paiva, A.: What my eyes
can’t see, a robot can show me: exploring the collaboration between blind people and robots.
In: Proceedings of the 20th International ACM SIGACCESS Conference on Computers and
Accessibility, pp. 15–27. ACM (2018)
32. Breazeal, C.: Toward sociable robots. Robot. Auton. Syst. 42(3–4), 167–175 (2003)
33. Breazeal, C.: Social interactions in HRI: the robot view. IEEE Trans. Syst. Man Cybern. Part
C (Applications and Reviews) 34(2), 181–186 (2004)
34. Breazeal, C., Hoffman, G., Lockerd, A.: Teaching and working with robots as a collaboration.
In: Proceedings of the Third International Joint Conference on Autonomous Agents and
Multiagent Systems, vol. 3, pp. 1030–1037. IEEE Computer Society (2004)
35. Breazeal, C.L.: Designing Sociable Robots. MIT Press (2004)
36. Breemen, A.V.: Animation engine for believable interactive user-interface robots. In:
IEEE/RSJ International Conference on Intelligent Robots and Systems. IROS 2004, vol. 3,
pp. 2873–2878 (2004). https://doi.org/10.1109/IROS.2004.1389845
37. Broadbent, E., Stafford, R., MacDonald, B.: Acceptance of healthcare robots for the older
population: review and future directions. Int. J. Soc. Robot. 1(4), 319 (2009)
38. Bruce, A., Knight, J., Listopad, S., Magerko, B., Nourbakhsh, I.R.: Robot improv: using
drama to create believable agents. In: ICRA, p. 4003 (2000)
39. Burgar, C.G., Lum, P.S., Shor, P.C., Van der Loos, H.M.: Development of robots for rehabil-
itation therapy: the Palo Alto VA/Stanford experience. J. Rehabil. Res. Dev. 37(6), 663–674
(2000)
40. Burton, A.: Dolphins, dogs, and robot seals for the treatment of neurological disease. Lancet
Neurol. 12(9), 851–852 (2013)
41. Buschmann, T., Lohmeier, S., Ulbrich, H.: Humanoid robot lola: design and walking control.
J. Physiol.-Paris 103(3–5), 141–148 (2009)
42. Cabibihan, J.J., Javed, H., Ang, M., Aljunied, S.M.: Why robots? A survey on the roles
and benefits of social robots in the therapy of children with autism. Int. J. Soc. Robot. 5(4),
593–618 (2013)
43. Cannata, G., D’Andrea, M., Maggiali, M.: Design of a humanoid robot eye: models and
experiments. In: 2006 6th IEEE-RAS International Conference on Humanoid Robots, pp.
151–156. IEEE (2006)
44. Cappo, E.A., Desai, A., Collins, M., Michael, N.: Online planning for human–multi-robot
interactive theatrical performance. Auton. Robot., 1–16 (2018)
45. Carlucci, F.M., Nardi, L., Iocchi, L., Nardi, D.: Explicit representation of social norms for
social robots. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pp. 4191–4196. IEEE (2015)
46. Chao, C., Cakmak, M., Thomaz, A.L.: Transparent active learning for robots. In: 2010 5th
ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 317–324. IEEE
(2010)
47. Chen, G.D., Wang, C.Y., et al.: A survey on storytelling with robots. In: International Con-
ference on Technologies for E-Learning and Digital Entertainment, pp. 450–456. Springer
(2011)
48. Clark, R.E.: Reconsidering research on learning from media. Rev. Educ. Res. 53(4), 445–459
(1983)
49. Colton, S., Wiggins, G.A., et al.: Computational creativity: the final frontier? In: ECAI 2012,
pp. 21–26. Montpellier (2012)
50. Coradeschi, S., Saffiotti, A.: Symbiotic robotic systems: humans, robots, and smart environ-
ments. IEEE Intell. Syst. 21(3), 82–84 (2006)
51. Dautenhahn, K.: Robots we like to live with! a developmental perspective on a personalized,
life-long robot companion. In: Proceedings of the 13th IEEE International Workshop on Robot
and Human Interactive Communication, RO-MAN (2004)
52. Dautenhahn, K.: Socially intelligent robots: dimensions of human-robot interaction. Philos.
Trans. R. Soc. B Biol. Sci. 362(1480), 679 (2007)
53. Dautenhahn, K., Billard, A.: Studying robot social cognition within a developmental psychol-
ogy framework. In: 3rd European Workshop on Advanced Mobile Robots, Eurobot 1999, pp.
187–194. IEEE (1999)
54. Dautenhahn, K., Woods, S., Kaouri, C., Walters, M.L., Koay, K.L., Werry, I.: What is a
robot companion-friend, assistant or butler? In: 2005 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2005), pp. 1192–1197. IEEE (2005)
55. DiSalvo, C., Gemperle, F.: From seduction to fulfillment: the use of anthropomorphic form
in design. In: Proceedings of the 2003 International Conference on Designing Pleasurable
Products and Interfaces, pp. 67–72. ACM (2003)
56. DiSalvo, C.F., Gemperle, F., Forlizzi, J., Kiesler, S.: All robots are not created equal: the
design and perception of humanoid robot heads. In: Proceedings of the 4th Conference on
Designing Interactive Systems: Processes, Practices, Methods, and Techniques, pp. 321–326.
ACM (2002)
57. Dragan, A.D., Lee, K.C., Srinivasa, S.S.: Legibility and predictability of robot motion. In:
Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction,
pp. 301–308. IEEE Press (2013)
58. Ekman, P.: An argument for basic emotions. Cogn. Emot. 6(3–4), 169–200 (1992)
59. Endsley, M.R.: Level of automation effects on performance, situation awareness and workload
in a dynamic control task. Ergonomics 42(3), 462–492 (1999)
60. Epley, N., Waytz, A., Cacioppo, J.T.: On seeing human: a three-factor theory of anthropo-
morphism. Psychol. Rev. 114(4), 864 (2007)
61. Esteban, P.G., Baxter, P., Belpaeme, T., Billing, E., Cai, H., Cao, H.L., Coeckelbergh, M.,
Costescu, C., David, D., De Beir, A., et al.: How to build a supervised autonomous system for
robot-enhanced therapy for children with autism spectrum disorder. Paladyn J. Behav. Robot.
8(1), 18–38 (2017)
62. Eyssel, F.: An experimental psychological perspective on social robotics. Robot. Auton. Syst.
87, 363–371 (2017)
63. Fasola, J., Mataric, M.: Comparing Physical and Virtual Embodiment in a Socially Assistive
Robot Exercise Coach for the Elderly. Center for Robotics and Embedded Systems, Los
Angeles, CA (2011)
64. Feil-Seifer, D., Mataric, M.J.: Defining socially assistive robotics. In: 9th International Con-
ference on Rehabilitation Robotics. ICORR 2005, pp. 465–468. IEEE (2005)
65. Feil-Seifer, D., Skinner, K., Matarić, M.J.: Benchmarks for evaluating socially assistive
robotics. Interact. Stud. 8(3), 423–439 (2007)
66. Fernández-Llamas, C., Conde, M.A., Rodríguez-Lera, F.J., Rodríguez-Sedano, F.J., García,
F.: May I teach you? Students' behavior when lectured by robotic vs. human teachers. Comput.
Hum. Behav. 80, 460–469 (2018)
67. Fincannon, T., Barnes, L.E., Murphy, R.R., Riddle, D.L.: Evidence of the need for social
intelligence in rescue robots. In: Proceedings. 2004 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2004), vol. 2, pp. 1089–1095. IEEE (2004)
68. Fink, J.: Anthropomorphism and human likeness in the design of robots and human-robot
interaction. In: International Conference on Social Robotics, pp. 199–208. Springer (2012)
69. Fong, T., Nourbakhsh, I., Dautenhahn, K.: A survey of socially interactive robots: concepts,
design and applications. Technical Report CMU-RI-TR-02-29, Robotics Institute, Carnegie
Mellon University (2002)
70. Forlizzi, J., DiSalvo, C., Gemperle, F.: Assistive robotics and an ecology of elders living
independently in their homes. Hum.-Comput. Interact. 19(1), 25–59 (2004)
71. Friedman, B., Kahn Jr, P.H., Hagman, J.: Hardware companions?: What online AIBO dis-
cussion forums reveal about the human-robotic relationship. In: Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems, pp. 273–280. ACM (2003)
72. Frith, U., Frith, C.: The social brain: allowing humans to boldly go where no other species
has been. Philos. Trans. R. Soc. B Biol. Sci. 365(1537), 165–176 (2010)
73. Gardner, H., Kornhaber, M.L., Wake, W.K.: Intelligence: Multiple Perspectives. Harcourt
Brace College Publishers (1996)
74. Gates, B.: A robot in every home. Sci. Am. 296(1), 58–65 (2007)
75. Georgiadis, D., Christophorou, C., Kleanthous, S., Andreou, P., Santos, L., Christodoulou,
E., Samaras, G.: A robotic cloud ecosystem for elderly care and ageing well: the growmeup
approach. In: XIV Mediterranean Conference on Medical and Biological Engineering and
Computing 2016, pp. 919–924. Springer (2016)
76. Gielniak, M.J., Thomaz, A.L.: Enhancing interaction through exaggerated motion synthesis.
In: ACM/IEEE International Conference on Human-Robot Interaction. HRI 2012, p. 375
(2012). https://doi.org/10.1145/2157689.2157813
77. Glas, D.F., Minato, T., Ishi, C.T., Kawahara, T., Ishiguro, H.: Erica: The ERATO intelligent
conversational android. In: 2016 25th IEEE International Symposium on Robot and Human
Interactive Communication (RO-MAN), pp. 22–29. IEEE (2016)
78. Gockley, R., Bruce, A., Forlizzi, J., Michalowski, M., Mundell, A., Rosenthal, S., Sellner,
B., Simmons, R., Snipes, K., Schultz, A.C., et al.: Designing robots for long-term social
interaction. In: 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS 2005), pp. 1338–1343. IEEE (2005)
79. Goetz, J., Kiesler, S., Powers, A.: Matching robot appearance and behavior to tasks to improve
human-robot cooperation. In: The 12th IEEE International Workshop on Robot and Human
Interactive Communication. Proceedings. ROMAN 2003, pp. 55–60. IEEE (2003)
80. Goodrich, M.A., Olsen, D.R.: Seven principles of efficient human robot interaction. In: SMC
2003 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and
Cybernetics. Conference Theme-System Security and Assurance (Cat. No. 03CH37483),
vol. 4, pp. 3942–3948. IEEE (2003)
81. Goodrich, M.A., Schultz, A.C., et al.: Human–robot interaction: a survey. Found. Trends®
Hum. Comput. Interact. 1(3), 203–275 (2008)
82. Graf, C., Härtl, A., Röfer, T., Laue, T.: A robust closed-loop gait for the standard platform
league humanoid. In: Proceedings of the Fourth Workshop on Humanoid Soccer Robots in
Conjunction with the 2009 IEEE-RAS International Conference on Humanoid Robots, pp.
30–37 (2009)
83. Greenberg, S., Buxton, B.: Usability evaluation considered harmful (some of the time). In:
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 111–
120. ACM (2008)
84. Gunderson, J., Gunderson, L.: Intelligence = autonomy = capability. In: Performance Metrics
for Intelligent Systems, PERMIS (2004)
85. Haddadin, S., Croft, E.: Physical Human–Robot Interaction, pp. 1835–1874. Springer Inter-
national Publishing, Cham (2016). https://doi.org/10.1007/978-3-319-32552-1_69
86. Ho, W.C., Dautenhahn, K., Lim, M.Y., Du Casse, K.: Modelling human memory in robotic
companions for personalisation and long-term adaptation in HRI. In: BICA, pp. 64–71 (2010)
87. Hochberg, L.R., Bacher, D., Jarosiewicz, B., Masse, N.Y., Simeral, J.D., Vogel, J., Haddadin,
S., Liu, J., Cash, S.S., van der Smagt, P., et al.: Reach and grasp by people with tetraplegia
using a neurally controlled robotic arm. Nature 485(7398), 372 (2012)
88. Hoffman, G.: Dumb robots, smart phones: a case study of music listening companionship. In:
IEEE International Symposium on Robot and Human Interactive Communication. RO-MAN
2012, pp. 358–363 (2012). https://doi.org/10.1109/ROMAN.2012.6343779
89. Hoffman, G., Breazeal, C.: Effects of anticipatory perceptual simulation on practiced human-
robot tasks. Auton. Robot. 28(4), 403–423 (2010)
90. Homans, G.C.: Social Behavior: Its Elementary Forms. Harcourt Brace Jovanovich (1974)
91. Huang, H.M., Pavek, K., Albus, J., Messina, E.: Autonomy levels for unmanned systems
(ALFUS) framework: an update. In: Unmanned Ground Vehicle Technology VII, vol. 5804,
pp. 439–449. International Society for Optics and Photonics (2005)
92. Huber, A., Weiss, A.: Developing human-robot interaction for an industry 4.0 robot: How
industry workers helped to improve remote-HRI to physical-HRI. In: Proceedings of the
Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction,
pp. 137–138. ACM (2017)
93. Jacq, A., Lemaignan, S., Garcia, F., Dillenbourg, P., Paiva, A.: Building successful long child-
robot interactions in a learning context. In: 2016 11th ACM/IEEE International Conference
on Human-Robot Interaction (HRI), pp. 239–246. IEEE (2016)
94. Jensen, B., Tomatis, N., Mayor, L., Drygajlo, A., Siegwart, R.: Robots meet humans: Inter-
action in public spaces. IEEE Trans. Ind. Electron. 52(6), 1530–1546 (2005)
95. Jordan, P.W.: Human factors for pleasure in product use. Appl. Ergon. 29(1), 25–33 (1998)
96. Jørgensen, J.: Interaction with soft robotic tentacles. In: Companion of the 2018 ACM/IEEE
International Conference on Human-Robot Interaction, p. 38. ACM (2018)
97. Kahn, P.H., Freier, N.G., Kanda, T., Ishiguro, H., Ruckert, J.H., Severson, R.L., Kane, S.K.:
Design patterns for sociality in human-robot interaction. In: Proceedings of the 3rd ACM/IEEE
International Conference on Human Robot Interaction, pp. 97–104. ACM (2008)
98. Kanda, T., Hirano, T., Eaton, D., Ishiguro, H.: Interactive robots as social partners and peer
tutors for children: a field trial. Hum.-Comput. Interact. 19(1–2), 61–84 (2004)
99. Kanda, T., Sato, R., Saiwaki, N., Ishiguro, H.: A two-month field trial in an elementary school
for long-term human-robot interaction. IEEE Trans. Robot. 23(5), 962–971 (2007)
100. Kapusta, A., Yu, W., Bhattacharjee, T., Liu, C.K., Turk, G., Kemp, C.C.: Data-driven haptic
perception for robot-assisted dressing. In: 2016 25th IEEE International Symposium on Robot
and Human Interactive Communication (RO-MAN), pp. 451–458. IEEE (2016)
101. Kędzierski, J., Muszyński, R., Zoll, C., Oleksy, A., Frontkiewicz, M.: Emys-emotive head of
a social robot. Int. J. Soc. Robot. 5(2), 237–249 (2013)
102. Kennedy, J., Baxter, P., Belpaeme, T.: Comparing robot embodiments in a guided discovery
learning interaction with children. Int. J. Soc. Robot. 7(2), 293–308 (2015)
103. Knight, H.: Eight lessons learned about non-verbal interactions through robot theater. In:
International Conference on Social Robotics, pp. 42–51. Springer (2011)
104. Kolling, A., Walker, P., Chakraborty, N., Sycara, K., Lewis, M.: Human interaction with robot
swarms: a survey. IEEE Trans. Hum.-Mach. Syst. 46(1), 9–26 (2016)
105. Komatsu, T., Kurosawa, R., Yamada, S.: How does the difference between users’ expectations
and perceptions about a robotic agent affect their behavior? Int. J. Soc. Robot. 4(2), 109–116
(2012)
106. Kozima, H., Michalowski, M.P., Nakagawa, C.: Keepon. Int. J. Soc. Robot. 1(1), 3–18 (2009)
107. Kozima, H., Michalowski, M.P., Nakagawa, C., Kozima, H., Nakagawa, C., Kozima, H.,
Michalowski, M.P.: A Playful Robot for Research, Therapy, and Entertainment (2008)
108. Kristoffersson, A., Coradeschi, S., Loutfi, A.: A review of mobile robotic telepresence. Adv.
Hum.-Comput. Interact. 2013, 3 (2013)
109. Kulyukin, V., Gharpure, C., Nicholson, J., Pavithran, S.: RFID in robot-assisted indoor nav-
igation for the visually impaired. In: Proceedings. 2004 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS 2004), vol. 2, pp. 1979–1984. IEEE (2004)
110. Lazewatsky, D.A., Smart, W.D.: A panorama interface for telepresence robots. In: Proceedings
of the 6th International Conference on Human-Robot Interaction, pp. 177–178. ACM (2011)
111. Lee, H.R., Sabanović, S.: Culturally variable preferences for robot design and use in South
Korea, Turkey, and the United States. In: Proceedings of the 2014 ACM/IEEE International
Conference on Human-Robot Interaction, pp. 17–24. ACM (2014)
112. Lee, K.M., Jung, Y., Kim, J., Kim, S.R.: Are physically embodied social agents better than dis-
embodied social agents?: the effects of physical embodiment, tactile interaction, and people’s
loneliness in human-robot interaction. Int. J. Hum.-Comput. Stud. 64(10), 962–973 (2006)
113. Leite, I., Martinho, C., Paiva, A.: Social robots for long-term interaction: a survey. Int. J. Soc.
Robot. 5(2), 291–308 (2013)
114. Leite, I., Pereira, A., Martinho, C., Paiva, A.: Are emotional robots more fun to play with? In:
The 17th IEEE International Symposium on Robot and Human Interactive Communication.
RO-MAN 2008, pp. 77–82. IEEE (2008)
115. Levy, D.: Love and Sex with Robots: The Evolution of Human-Robot Relationships, New
York (2009)
116. Li, B., Ma, S., Liu, J., Wang, M., Liu, T., Wang, Y.: Amoeba-i: a shape-shifting modular robot
for urban search and rescue. Adv. Robot. 23(9), 1057–1083 (2009)
117. Li, D., Rau, P.P., Li, Y.: A cross-cultural study: effect of robot appearance and task. Int. J.
Soc. Robot. 2(2), 175–186 (2010)
118. Li, J.: The benefit of being physically present: a survey of experimental works comparing
copresent robots, telepresent robots and virtual agents. Int. J. Hum.-Comput. Stud. 77, 23–37
(2015)
119. Liang, Y.S., Pellier, D., Fiorino, H., Pesty, S., Cakmak, M.: Simultaneous end-user program-
ming of goals and actions for robotic shelf organization. In: 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pp. 6566–6573. IEEE (2018)
120. Lin, H.T., Leisk, G.G., Trimmer, B.: Goqbot: a caterpillar-inspired soft-bodied rolling robot.
Bioinspiration Biomim. 6(2), 026007 (2011)
121. Liu, H., Meusel, P., Seitz, N., Willberg, B., Hirzinger, G., Jin, M., Liu, Y., Wei, R., Xie, Z.:
The modular multisensory DLR-HIT-Hand. Mech. Mach. Theory 42(5), 612–625 (2007)
122. Löffler, D., Schmidt, N., Tscharn, R.: Multimodal expression of artificial emotion in social
robots using color, motion and sound. In: Proceedings of the 2018 ACM/IEEE International
Conference on Human-Robot Interaction, pp. 334–343. ACM (2018)
123. Luber, M., Spinello, L., Silva, J., Arras, K.O.: Socially-aware robot navigation: a learning
approach. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pp. 902–907. IEEE (2012)
124. Lungarella, M., Metta, G.: Beyond Gazing, Pointing, and Reaching: A Survey of Develop-
mental Robotics (2003)
125. Lungarella, M., Metta, G., Pfeifer, R., Sandini, G.: Developmental robotics: a survey. Connect.
Sci. 15(4), 151–190 (2003)
126. Madhani, A.J.: Bringing physical characters to life. In: 2009 4th ACM/IEEE International
Conference on Human-Robot Interaction (HRI), p. 1. IEEE (2009)
127. Majidi, C.: Soft robotics: a perspective-current trends and prospects for the future. Soft Robot.
1(1), 5–11 (2014)
128. Mavridis, N.: A review of verbal and non-verbal human-robot interactive communication.
Robot. Auton. Syst. 63, 22–35 (2015)
129. Mavrogiannis, C., Hutchinson, A.M., Macdonald, J., Alves-Oliveira, P., Knepper, R.A.:
Effects of distinct robot navigation strategies on human behavior in a crowded environment.
In: 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp.
421–430. IEEE (2019)
130. Mead, R., Matarić, M.J.: Perceptual models of human-robot proxemics. In: Experimental
Robotics, pp. 261–276. Springer (2016)
131. Meriçli, Ç., Veloso, M., Akın, H.L.: Task refinement for autonomous robots using comple-
mentary corrective human feedback. Int. J. Adv. Robot. Syst. 8(2), 16 (2011)
132. Miyashita, T., Tajika, T., Ishiguro, H., Kogure, K., Hagita, N.: Haptic communication between
humans and robots. In: Robotics Research, pp. 525–536. Springer (2007)
133. Mori, M.: The uncanny valley. Energy 7(4), 33–35 (1970)
134. Mumm, J., Mutlu, B.: Designing motivational agents: the role of praise, social comparison,
and embodiment in computer feedback. Comput. Hum. Behav. 27(5), 1643–1650 (2011)
135. Mumm, J., Mutlu, B.: Human-robot proxemics: physical and psychological distancing in
human-robot interaction. In: Proceedings of the 6th International Conference on Human-
Robot Interaction, pp. 331–338. ACM (2011)
136. Murphy, R.R., Nomura, T., Billard, A., Burke, J.L.: Human-robot interaction. IEEE Robot.
Autom. Mag. 17(2), 85–89 (2010)
137. Nass, C., Jonsson, I.M., Harris, H., Reaves, B., Endo, J., Brave, S., Takayama, L.: Improving
automotive safety by pairing driver emotion and car voice emotion. In: CHI 2005 Extended
Abstracts on Human Factors in Computing Systems, pp. 1973–1976. ACM (2005)
138. Nass, C., Steuer, J., Tauber, E.R.: Computers are social actors. In: Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems, pp. 72–78. ACM (1994)
139. Newell, A., Simon, H.A., et al.: Human problem solving, vol. 104. Prentice-Hall Englewood
Cliffs, NJ (1972)
140. Nguyen, L.A., Bualat, M., Edwards, L.J., Flueckiger, L., Neveu, C., Schwehr, K., Wagner,
M.D., Zbinden, E.: Virtual reality interfaces for visualization and control of remote vehicles.
Auton. Robot. 11(1), 59–68 (2001)
141. Nieuwenhuisen, M., Behnke, S.: Human-like interaction skills for the mobile communication
robot Robotinho. Int. J. Soc. Robot. 5(4), 549–561 (2013)
142. Norman, D.A.: Cognitive engineering. User Cent. Syst. Des. 31, 61 (1986)
143. Norman, D.A.: Human-centered design considered harmful. Interactions 12(4), 14–19 (2005)
144. Pagliarini, L., Lund, H.H.: The development of robot art. Artif. Life Robot. 13(2), 401–405
(2009)
145. Paiva, A., Leite, I., Boukricha, H., Wachsmuth, I.: Empathy in virtual agents and robots: a
survey. ACM Trans. Interact. Intell. Syst. (TiiS) 7(3), 11 (2017)
146. Pfeifer, R., Scheier, C.: Understanding intelligence. MIT Press (2001)
147. Pollack, M.E., Brown, L., Colbry, D., Orosz, C., Peintner, B., Ramakrishnan, S., Engberg, S.,
Matthews, J.T., Dunbar-Jacob, J., McCarthy, C.E., et al.: Pearl: a mobile robotic assistant for
the elderly. In: AAAI Workshop on Automation as Eldercare, vol. 2002, pp. 85–91 (2002)
148. Pope, M.T., Christensen, S., Christensen, D., Simeonov, A., Imahara, G., Niemeyer, G.: Stick-
man: towards a human scale acrobatic robot. In: 2018 IEEE International Conference on
Robotics and Automation (ICRA), pp. 2134–2140. IEEE (2018)
149. Powers, A., Kiesler, S., Fussell, S., Fussell, S., Torrey, C.: Comparing a computer agent with
a humanoid robot. In: Proceedings of the ACM/IEEE International Conference on Human-
Robot Interaction, pp. 145–152. ACM (2007)
150. Ribeiro, T., Dooley, D., Paiva, A.: Nutty tracks: symbolic animation pipeline for expressive
robotics. In: ACM International Conference on Computer Graphics and Interactive Techniques
Posters. SIGGRAPH 2013, p. 4503 (2013)
151. Ribeiro, T., Paiva, A.: The illusion of robotic life principles and practices of animation for
robots. In: ACM/IEEE International Conference on Human-Robot Interaction. HRI 2012, pp.
383–390 (2012)
152. Ribeiro, T., Paiva, A.: Animating the Adelino robot with ERIK: the expressive robotics inverse
kinematics. In: Proceedings of the 19th ACM International Conference on Multimodal Inter-
action. ICMI 2017, pp. 388–396. ACM, New York, NY, USA (2017). https://doi.org/10.1145/
3136755.3136791, http://doi.acm.org/10.1145/3136755.3136791
153. Robert, L.P.: Personality in the human robot interaction literature: a review and brief critique.
In: Proceedings of the 24th Americas Conference on Information Systems, pp. 16–18 (2018)
154. Rossi, S., Ferland, F., Tapus, A.: User profiling and behavioral adaptation for HRI: a survey.
Pattern Recognit. Lett. 99, 3–12 (2017)
155. Rybski, P.E., Yoon, K., Stolarz, J., Veloso, M.M.: Interactive robot task training through
dialog and demonstration. In: Proceedings of the ACM/IEEE International Conference on
Human-Robot Interaction, pp. 49–56. ACM (2007)
156. SAE International: Automated Driving: Levels of Driving Automation are Defined in New
SAE International Standard J3016 (2014)
157. Sauppé, A., Mutlu, B.: The social impact of a robot co-worker in industrial settings. In:
Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems,
pp. 3613–3622. ACM (2015)
158. Scassellati, B.: Theory of mind for a humanoid robot. Auton. Robot. 12(1), 13–24 (2002)
159. Scassellati, B.: Investigating models of social development using a humanoid robot. In: Pro-
ceedings of the International Joint Conference on Neural Networks, vol. 4, pp. 2704–2709.
IEEE (2003)
160. Scassellati, B., Admoni, H., Matarić, M.: Robots for use in autism research. Annu. Rev.
Biomed. Eng. 14, 275–294 (2012)
161. Scholtz, J.: Theory and evaluation of human robot interactions. In: Proceedings of the 36th
Annual Hawaii International Conference on System Sciences, pp. 10–pp. IEEE (2003)
162. Schou, C., Andersen, R.S., Chrysostomou, D., Bøgh, S., Madsen, O.: Skill-based instruction
of collaborative robots in industrial settings. Robot. Comput. Integr. Manuf. 53, 72–80 (2018)
163. Seok, S., Onal, C.D., Wood, R., Rus, D., Kim, S.: Peristaltic locomotion with antagonistic
actuators in soft robotics. In: 2010 IEEE International Conference on Robotics and Automa-
tion, pp. 1228–1233. IEEE (2010)
164. Shamsuddin, S., Yussof, H., Ismail, L., Hanapiah, F.A., Mohamed, S., Piah, H.A., Zahari, N.I.:
Initial response of autistic children in human-robot interaction therapy with humanoid robot
NAO. In: 2012 IEEE 8th International Colloquium on Signal Processing and its Applications
(CSPA), pp. 188–193. IEEE (2012)
165. Sharkey, A., Sharkey, N.: Granny and the robots: ethical issues in robot care for the elderly.
Ethics Inf. Technol. 14(1), 27–40 (2012)
166. Sheridan, T.B., Verplank, W.L.: Human and computer control of undersea teleoperators. Tech-
nical report, Massachussetts Institute of Technology Cambridge man-machine systems lab
(1978)
167. Shibata, T.: An overview of human interactive robots for psychological enrichment. Proc.
IEEE 92(11), 1749–1758 (2004)
168. Shibata, T., Wada, K.: Robot therapy: a new approach for mental healthcare of the elderly-a
mini-review. Gerontology 57(4), 378–386 (2011)
169. Shidujaman, M., Zhang, S., Elder, R., Mi, H.: “RoboQuin”: a mannequin robot with natural
humanoid movements. In: 2018 27th IEEE International Symposium on Robot and Human
Interactive Communication (RO-MAN), pp. 1051–1056. IEEE (2018)
170. Shiomi, M., Kanda, T., Glas, D.F., Satake, S., Ishiguro, H., Hagita, N.: Field trial of networked
social robots in a shopping mall. In: IEEE/RSJ International Conference on Intelligent Robots
and Systems. IROS 2009, pp. 2846–2853. IEEE (2009)
171. Shiomi, M., Nakata, A., Kanbara, M., Hagita, N.: A robot that encourages self-disclosure by
hug. In: International Conference on Social Robotics, pp. 324–333. Springer (2017)
172. Shomin, M.: Navigation and physical interaction with balancing robots. Ph.D. thesis, Robotics
Institute, Carnegie Mellon University (2016)
173. Shomin, M., Forlizzi, J., Hollis, R.: Sit-to-stand assistance with a balancing mobile robot. In:
2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 3795–3800.
IEEE (2015)
174. Short, E., Swift-Spong, K., Greczek, J., Ramachandran, A., Litoiu, A., Grigore, E.C., Feil-
Seifer, D., Shuster, S., Lee, J.J., Huang, S., et al.: How to train your dragonbot: socially assistive
robots for teaching children about nutrition through play. In: The 23rd IEEE International
Symposium on Robot and Human Interactive Communication, pp. 924–929. IEEE (2014)
175. Shrestha, M.C., Nohisa, Y., Schmitz, A., Hayakawa, S., Uno, E., Yokoyama, Y., Yanagawa,
H., Or, K., Sugano, S.: Using contact-based inducement for efficient navigation in a congested
environment. In: 2015 24th IEEE International Symposium on Robot and Human Interactive
Communication (RO-MAN), pp. 456–461. IEEE (2015)
176. Sirkin, D., Mok, B., Yang, S., Ju, W.: Mechanical ottoman: how robotic furniture offers and
withdraws support. In: Proceedings of the Tenth Annual ACM/IEEE International Conference
on Human-Robot Interaction, pp. 11–18. ACM (2015)
177. Song, W.K., Kim, J.: Novel assistive robot for self-feeding. In: Robotic Systems-Applications,
Control and Programming. InTech (2012)
178. Sparrow, R.: The march of the robot dogs. Ethics Inf. Technol. 4(4), 305–318 (2002)
179. Spence, P.R., Westerman, D., Edwards, C., Edwards, A.: Welcoming our robot overlords:
initial expectations about interaction with a robot. Commun. Res. Rep. 31(3), 272–280 (2014)
180. Srinivasa, S.S., Ferguson, D., Helfrich, C.J., Berenson, D., Collet, A., Diankov, R., Gallagher,
G., Hollinger, G., Kuffner, J., Weghe, M.V.: Herb: a home exploring robotic butler. Auton.
Robot. 28(1), 5 (2010)
181. Srinivasan, V., Henkel, Z., Murphy, R.: Social head gaze and proxemics scaling for an affective
robot used in victim management. In: 2012 IEEE International Symposium on Safety, Security,
and Rescue Robotics (SSRR), pp. 1–2. IEEE (2012)
182. Steinfeld, A., Fong, T., Kaber, D., Lewis, M., Scholtz, J., Schultz, A., Goodrich, M.: Com-
mon metrics for human-robot interaction. In: Proceedings of the 1st ACM SIGCHI/SIGART
Conference on Human-Robot Interaction, pp. 33–40. ACM (2006)
183. Steinfeld, A., Jenkins, O.C., Scassellati, B.: The oz of wizard: simulating the human for inter-
action research. In: Proceedings of the 4th ACM/IEEE International Conference on Human
Robot Interaction, pp. 101–108. ACM (2009)
184. Stiehl, W.D., Lee, J.K., Breazeal, C., Nalin, M., Morandi, A., Sanna, A.: The huggable: a
platform for research in robotic companions for pediatric care. In: Proceedings of the 8th
International Conference on Interaction Design and Children, pp. 317–320. ACM (2009)
185. Stone, P.: Intelligent autonomous robotics: a robot soccer case study. Synth. Lect. Artif. Intell.
Mach. Learn. 1(1), 1–155 (2007)
186. Sun, A., Chao, C., Lim, H.A.: Robot and human dancing. In: UNESCO CID 50th World
Congress on Dance Research (2017)
187. Szafir, D., Mutlu, B., Fong, T.: Communicating directionality in flying robots. In: Proceedings
of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pp.
19–26. ACM (2015)
188. Takayama, L., Dooley, D., Ju, W.: Expressing thought: improving robot readability with
animation principles. In: 2011 6th ACM/IEEE International Conference on Human-Robot
Interaction (HRI), pp. 69–76. IEEE (2011)
189. Tanaka, F., Cicourel, A., Movellan, J.R.: Socialization between toddlers and robots at an early
childhood education center. Proc. Natl. Acad. Sci. 104(46), 17954–17958 (2007)
190. Thomas, F., Johnston, O.: The Illusion of Life: Disney Animation. Hyperion (1995)
191. Thrun, S.: Toward a framework for human-robot interaction. Hum.-Comput. Interact. 19(1),
9–24 (2004)
192. Tonin, L., Carlson, T., Leeb, R., Millán, J.D.R.: Brain-controlled telepresence robot by motor-
disabled people. In: Engineering in Medicine and Biology Society, EMBC, 2011 Annual
International Conference of the IEEE, pp. 4227–4230. IEEE (2011)
193. Tsui, K.M., Desai, M., Yanco, H.A., Uhlik, C.: Exploring use cases for telepresence robots.
In: Proceedings of the 6th International Conference on Human-Robot Interaction, pp. 11–18.
ACM (2011)
194. Turing, A.M.: Computing machinery and intelligence. In: Parsing the Turing Test, pp. 23–65.
Springer (2009)
195. Vatsal, V., Hoffman, G.: Design and analysis of a wearable robotic forearm. In: 2018 IEEE
International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE (2018)
196. Veloso, M.M., Biswas, J., Coltin, B., Rosenthal, S.: Cobots: robust symbiotic autonomous
mobile service robots. In: IJCAI, p. 4423. Citeseer (2015)
197. Wainer, J., Feil-Seifer, D.J., Shell, D.A., Mataric, M.J.: Embodiment and human-robot inter-
action: a task-based perspective. In: The 16th IEEE International Symposium on Robot and
Human Interactive Communication. RO-MAN 2007, pp. 872–877. IEEE (2007)
198. Wei, J., Dolan, J.M., Litkouhi, B.: Autonomous vehicle social behavior for highway entrance
ramp management. In: Intelligent Vehicles Symposium (IV), 2013 IEEE, pp. 201–207. IEEE
(2013)
199. Wells, J.D., Campbell, D.E., Valacich, J.S., Featherman, M.: The effect of perceived novelty
on the adoption of information technology innovations: a risk/reward perspective. Decis. Sci.
41(4), 813–843 (2010)
200. Williams, T., Thames, D., Novakoff, J., Scheutz, M.: Thank you for sharing that interesting
fact!: effects of capability and context on indirect speech act use in task-based human-robot
dialogue. In: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot
Interaction, pp. 298–306. ACM (2018)
201. Yadollahi, E., Johal, W., Paiva, A., Dillenbourg, P.: When deictic gestures in a robot can harm
child-robot collaboration. In: Proceedings of the 17th ACM Conference on Interaction Design
and Children, CONF, pp. 195–206. ACM (2018)
202. Yamokoski, J., Radford, N.: Robonaut, Valkyrie, and NASA Robots. Humanoid Robotics: A
Reference, pp. 201–214 (2019)
203. Yanco, H.A., Drury, J.: Classifying human-robot interaction: an updated taxonomy. In: 2004
IEEE International Conference on Systems, Man and Cybernetics, vol. 3, pp. 2841–2846.
IEEE (2004)
204. Yim, M., Shen, W.M., Salemi, B., Rus, D., Moll, M., Lipson, H., Klavins, E., Chirikjian,
G.S.: Modular self-reconfigurable robot systems [grand challenges of robotics]. IEEE Robot.
Autom. Mag. 14(1), 43–52 (2007)
205. Yim, M., Zhang, Y., Duff, D.: Modular robots. IEEE Spectr. 39(2), 30–34 (2002)
206. Yohanan, S., MacLean, K.E.: A tool to study affective touch. In: CHI 2009 Extended Abstracts
on Human Factors in Computing Systems, pp. 4153–4158. ACM (2009)
207. Yohanan, S., MacLean, K.E.: The role of affective touch in human-robot interaction: human
intent and expectations in touching the haptic creature. Int. J. Soc. Robot. 4(2), 163–180
(2012)
208. Youssefi, S., Denei, S., Mastrogiovanni, F., Cannata, G.: Skinware 2.0: a real-time middleware
for robot skin. SoftwareX 3, 6–12 (2015)
209. Zeglin, G., Walsman, A., Herlant, L., Zheng, Z., Guo, Y., Koval, M.C., Lenzo, K., Tay, H.J.,
Velagapudi, P., Correll, K., et al.: Herb’s sure thing: a rapid drama system for rehearsing and
performing live robot theater. In: 2014 IEEE Workshop on Advanced Robotics and Its Social
Impacts (ARSO), pp. 129–136. IEEE (2014)

Kim Baraka is currently a dual degree Ph.D. candidate at
Carnegie Mellon’s Robotics Institute (Pittsburgh, PA, USA), and
Instituto Superior Técnico (Lisbon, Portugal). He holds an M.S.
in Robotics from Carnegie Mellon, and a Bachelor in Electri-
cal and Computer Engineering from the American University
of Beirut. He was a summer student at CERN, and a recipient
of the IEEE Student Enterprise Award. His research interests
lie at the intersection of Artificial Intelligence, Machine Learn-
ing and Human-Robot Interaction, aimed at making robots more
adaptive and more transparent to humans. His doctoral thesis
focuses on the use of Artificial Intelligence to enrich social inter-
actions between robots and humans, specifically in the context
of robot-assisted autism therapy. In parallel with his scientific work, he is a professionally trained contemporary dancer, performing, teaching, and creating artistic work.

Patrícia Alves-Oliveira is a Ph.D. candidate in Human-Robot
Interaction. Patrícia is enrolled in a multidisciplinary and inter-
national Ph.D. program supported by 3 institutions across
Europe and the US: Institute of Social Sciences (ISCTE-
IUL) in Lisbon, Portugal; Institute for Systems and Computer
Engineer-ing-Research and Development (INESC-ID) in Lis-
bon, Portugal; and Cornell University, in Ithaca, NY, USA.
Patrícia is passionate about investigating ways to use social
robots to empower and nurture intrinsic human abilities, such as
creativity, curiosity, and exploration. Patrícia is the founder of
The Robot-Creativity Project, a project dedicated to the design
and fabrication of social robots for creativity-stimulation pur-
poses. Patrícia was involved in the organizing committees of the
2020 HRI Conference, 2020 RSS Pioneers, 2017 HRI Pioneers,
2015 Symposium AI for HRI, and several workshops within the
field of design and HRI. She has published in conferences such as HRI, IDC, ICSR, RO-MAN,
RSS, and IROS, and in high-quality journals. Patrícia has an interest in science communication,
performing outreach activities to bring scientific knowledge to day-to-day communication.
Tiago Ribeiro is an animation scientist in pursuit of harmony
between arts and interactive AI-driven characters, both in the
fields of HRI and IVA. Since 2011 he has worked in the EU
FP7 LIREC and EMOTE projects, and has provided technical
direction to many MSc- and PhD-student projects at the GAIPS
lab from INESC-ID in Lisbon, Portugal. His PhD, at IST—
University of Lisbon, focused on developing Robot Animation
as a field that fully integrates roboticists, AI/software engineers,
and animation artists, by merging robotics, CGI-animation tech-
niques, and principles of traditional animation. Some of his
notable achievements are the Nutty Tracks programmable ani-
mation engine, the SERA-Thalamus architecture, and fully cre-
ating the Adelino craft robot, along with ERIK—an expressive
inverse kinematics system. He has collaborated with Carnegie-
Mellon University and Yale University, is part of various journal
and conference program committees in HRI/Social Robotics, and has organized academic events
such as the HRI Pioneers Workshop 2016 (general chair) and the AAAI Fall Symposium 2016
(on AI for HRI). He has authored and co-authored over 30 peer-reviewed scientific papers, pub-
lished and distinguished at conferences such as the ACM/IEEE SIGGRAPH (Student Competition
Finalist), the ACM/IEEE HRI (Best Paper recipient and nomination), AAMAS, IVA and others.
A Survey on Current Practices in User
Evaluation of Companion Robots

Franz Werner

Abstract The applicability of robotics to help older users at home has been investigated in a number of research projects in recent years. The evaluation of
assistive robotics within a user study is a challenging task due to the complexity of
the technologies used and the vulnerability of the involved target group. This chapter
reviews research methods applied during the evaluation of companion robots and
provides details on the implemented methods, involved target groups, test settings
and evaluation aims. Typical pitfalls and methodological challenges are discussed
and recommendations for the planning of future user evaluations are given.

Keywords Evaluation methods · Companion robots · Review · Older people

1 Introduction and Background

1.1 Introduction

Recent years have witnessed an intensification of efforts in assistive robotics within
the active assisted living (AAL) community but also in the HRI and robotics research
communities at large. This can be illustrated by the prototypes developed, for exam-
ple, in the projects KSERA,1 Companionable,2 Hobbit,3 GrowMeUp4 and Mario.5
These and other projects have resulted in channelling more funding into the AAL

1 http://ksera.ieis.tue.nl.
2 http://www.companionable.net.
3 http://hobbit.acin.tuwien.ac.at.
4 http://www.growmeup.eu.
5 http://www.mario-project.eu.

F. Werner (B)
University of Applied Sciences, FH Campus Wien,
Favoritenstrasse 226, 1100 Vienna, Austria
e-mail: franz.werner@fh-campuswien.ac.at

robotics domain, and at the time of this review at least 11 research projects at the European level alone were running to target the development of robotics to support older users at home or in care facilities.
Although various types of robots are imaginable to help older users with activities
of daily living at home, multi-purpose companion robots, which are able to target a
wide range of individual user needs by providing an easy to understand, anthropo-
morphic, multimodal user interface, are of particular interest for the AAL robotics
community as shown by the large number of scientific projects within this field.
A large variety of user research methods can be applied to evaluate the developed
prototypes of companion robots from multiple perspectives such as the technical
performance in a real-life setting, the usability of prototypes, the acceptance among
users and the impact on care and the users’ lives. Although evaluation aims between
projects are to some extent similar, the evaluation methods implemented vary strongly in both quality and quantity. Bemelmans et al. found that the methodology used to study the effects of assistive robots in current research suffers from limitations, is rather vague and often not replicable, which limits the scientific value of the presented evidence [1].
The aim of this paper is to shed light on currently used methods in the particular field of user evaluation of companion-type robots by providing an overview including information on typically involved user groups, test settings and evaluation aims. Secondly, this paper contributes to the aim of creating a set of methodologies and tools that researchers can use to allow others to replicate and validate their user studies, which was already proposed by Kerstin Dautenhahn in 2007 [2].

1.2 Background and Related Work

Even though the aim to provide a methodological overview on the evaluation of robotics is not new, there seems to be a lack of reviews in the particular area
of companion robots, which constitute a unique challenge for user evaluation due
to their technical complexity, social interaction capabilities and vulnerable targeted
user group.
Tina Ganster et al. reviewed literature specifically on methods usable for long-term evaluation with robots and agents [3]. They distinguish between subjective and objective methods and discuss the benefits and shortcomings of surveys, field diaries, eye tracking, psychophysiology, neuroimaging and video analysis. The aim of the review presented in this paper is to provide a more recent and wider overview of methods, with a focus on the particular type of companion robots, and to renew the findings, as the field of AAL robotics is currently advancing rapidly.
David Feil-Seifer et al. took a conceptual approach in 2007 and proposed potential evaluation aims for the then recently developing field of socially assistive robots, including already existing HRI benchmarks. In comparison, the work presented in this paper is based on methods that were actually used and discusses them in relation to such conceptual approaches [4].
Iolanda Leite et al. surveyed the particular field of social robots for long-term interaction, focusing on the main findings and evaluation methods [5]. Companion robots, as
described in the presented work, have so far rarely been used in long-term interactions
and were not covered by Leite et al.
Several websites such as [6] and [7] provide overviews and coarse descriptions of user research methods, including methods for field trials with technology. These, however, are not focused on the evaluation of robotics or on vulnerable user groups such as older users and users with disabilities.

2 Methodology

This narrative review of primary literature adheres to the methodology and consid-
erations presented by Green et al. [8].
Several steps were taken to acquire the literature used as sources for the review.
As a first step, literature was searched for in Google Scholar using the following list of keywords: "assistive robot* method", "assistive robot* evaluation", "robot* evaluation method" and "user research robot*". Interestingly, only a few papers could be found detailing information on the evaluation of companion robots. To compensate for this lack of publications, research projects were screened for publications on methodologies.
The development of companion robots currently depends on large resources, owing to the technical complexity and the need for experts from several research domains such as technology, sociology and healthcare. The focus of the literature research was hence laid on the results of larger European projects, which were able to provide the necessary resources.
Projects funded by the European Commission within the framework programmes6 and the Horizon 2020 programme are listed in the "Cordis" database.7 In addition, projects funded by another relevant research funding initiative, the "Ambient Assisted Living Joint Programme" (AAL-JP), are listed on the AAL-JP website.8 The Cordis database was searched for the terms "robot elderly", "robot senior", and "robot older", which gave 71 results, of which 23 were projects developing robots to assist older users. The AAL-JP website hosts the abstracts of 249 AAL projects, which were hand-searched for aims to develop or evaluate a robotic solution. Seven AAL-JP projects with this aim were found, giving together 30 relevant projects within the field of assistive robotics for older users.

6 http://ec.europa.eu/research/fp7/index_en.cfm.
7 http://cordis.europa.eu/home_en.html.
8 http://www.aal-europe.eu.

Since the evaluation usually takes place towards the end of a project, projects that were still running and ending later than Q2 2018 were highly unlikely to have already published evaluation results and were excluded from further analysis. Of the remaining 39 projects, 15 were excluded as they did not develop companion robots but other assistive robots for older users, such as exoskeletons, rehabilitation robots or a pedestrian assistant. One project (KSERA) was excluded, to avoid a possible review bias, since the author participated in its evaluation. Finally, 24 projects were selected for detailed analysis. The earliest project ended in 2010, the latest in April 2018.
For the remaining 24 projects, papers and public deliverables were searched for:
(a) on the project’s website, (b) by contacting responsible investigators, (c) by search-
ing through publications of institutions which were responsible for the evaluation
tasks within the project and (d) by searching through the project's list of publications typically provided within the public dissemination deliverables.
49 publications (43 from the project-based search, 6 from the general database search) were identified as containing relevant information on the user evaluation of robotic technologies. Publications in English and German with a publication date later than 2007 were considered. Figure 1 details the process of literature selection.

Fig. 1 Summary of the paper selection process (AAL-JP database: 249 AAL projects, of which 7 assistive robotics projects; Cordis database: 71 robotic projects, of which 23 assistive robotics projects; Google Scholar: 9 relevant projects; together 39 assistive robotics projects, 24 projects in the field of companion robotics, 49 papers identified providing information on evaluation methods, and 20 key publications selected after quality assessment)
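As a purely illustrative aid, the following minimal Python sketch shows how the inclusion and exclusion criteria described above could be applied to candidate projects. The record fields, the include function and the example entries are assumptions of this sketch; they do not reproduce the actual screening tooling or project data.

from dataclasses import dataclass

@dataclass
class CandidateProject:
    name: str
    develops_companion_robot: bool   # vs. exoskeletons, rehabilitation robots, pedestrian assistants
    end_year: int
    end_quarter: int
    author_involved: bool            # KSERA was excluded for this reason, to avoid review bias

def include(p: CandidateProject) -> bool:
    """Apply the exclusion criteria described above to a single candidate project."""
    finished_in_time = (p.end_year, p.end_quarter) <= (2018, 2)   # ended no later than Q2 2018
    return p.develops_companion_robot and finished_in_time and not p.author_involved

# Hypothetical example records, not the real project data
candidates = [
    CandidateProject("ProjectA", True, 2016, 4, False),
    CandidateProject("ProjectB", True, 2013, 1, True),    # excluded: author involvement
    CandidateProject("ProjectC", False, 2017, 2, False),  # excluded: not a companion robot
]
print([p.name for p in candidates if include(p)])          # -> ['ProjectA']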



2.1 Assessing the Quality of Studies

Papers were selected based on the quality and level of detail provided. Literature that did not provide basic information on the evaluation methods, including the evaluation aims, trial setup, participating users and methods used for results generation, was omitted. Based on these criteria, 20 key publications from 10 projects were selected out of the 49 publications for detailed analysis within this review. For the remaining 14 projects, the information found on the evaluation procedures and methods was either too scarce, or planned procedures rather than actual trial results were reported.

2.2 Data Extraction

Data about the evaluation aims, evaluation setup, participating user groups and the methods and metrics used was extracted from the literature and inserted into evidence tables for further analysis. The evidence tables are reported within this paper.
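The structure of such an evidence-table entry can be illustrated with a minimal Python sketch; the field names mirror the row labels of Tables 2, 3 and 4, while the class itself and the example instance are assumptions of this example rather than the tooling actually used for data extraction.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EvidenceRecord:
    references: List[str]              # e.g. ["[18]"]
    project_name: str
    robot_type: str                    # e.g. "Companion"
    robotic_platform: str
    aims: List[str]                    # evaluation aims as stated by the project
    setup: str                         # trial setting (laboratory, living lab, field, ...)
    users: str                         # participating user groups and sample sizes
    methods: List[str] = field(default_factory=list)   # methods and metrics used

# Example loosely paraphrasing one column of Table 2
record = EvidenceRecord(
    references=["[18]"],
    project_name="Accompany",
    robot_type="Companion",
    robotic_platform="CareOBot",
    aims=["First exploration of how to deploy a robot at a trial site"],
    setup="Activities room of a sheltered housing facility",
    users="10 older users",
    methods=["Technology probe", "Interview", "Observation"],
)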

2.3 Categorization of Data

To provide a comprehensive overview of current methodologies and flows of research within robotic projects, the data was structured along common themes. As the methodologies used depend on the aims and on the technology readiness of the technical probes, it was decided to structure the data along the typical workflow within European projects, which in turn is linked to the technological advancement of the research prototypes over the evaluation phases.
By analysing the literature on evaluation methods from European robotics projects, typical evaluation phases could be identified and linked to the model of technology readiness proposed by the "National Aeronautics and Space Administration"
(NASA).
“Technology Readiness Levels (TRL) are a type of measurement system used to
assess the maturity level of a particular technology" [9]. In particular for projects with the aim to develop an assistive companion robot, the technology readiness influences the aims and methodologies selected for evaluation, as the common main goal of the evaluation is to derive new design guidelines for later stages of development according to the user-centered design process. The model was adopted by scientists within the robotics community, and in particular by the team at euRobotics, which used it within the "Multi-Annual Roadmap" [10] to describe the future goals of robotics research. See Table 1 for an overview of the levels of technology readiness; the technology readiness levels achieved by the reviewed projects range from TRL 4 to TRL 7.

Table 1 Technology readiness levels as proposed by NASA [9]


TRL Description
1 Basic principles observed and reported
2 Technology concept and/or application formulated
3 Analytical and experimental critical function and/or characteristic proof of concept
4 Component and/or bread board validation in laboratory environment
5 Component and/or bread board validation in relevant environment
6 System/subsystem model or prototype demonstration in a relevant environment
7 System prototype demonstration in an operational environment
8 Actual system completed and qualified through test and demonstration
9 Actual system proven through successful mission operations
10 Commercial
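For orientation, the mapping between the TRL reached by a prototype and the evaluation-phase categories used to structure the following discussion can be summarised in a small Python sketch; the function name and the return strings are chosen for this example only and are not part of the TRL model itself.

def evaluation_phase(trl: int) -> str:
    """Map a technology readiness level to the evaluation-phase category of Sect. 3."""
    if trl in (4, 5):
        return "Laboratory trials of the integrated prototype (Sect. 3.1)"
    if trl == 6:
        return "Short-term user trials within realistic environments (Sect. 3.2)"
    if trl == 7:
        return "Field trials in real environments (Sect. 3.3)"
    return "Not covered by the reviewed evaluations"

print(evaluation_phase(6))   # -> "Short-term user trials within realistic environments (Sect. 3.2)"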

3 Discussion

The following discussion is structured along the presented TRLs. The evaluation aims, methods, involved user groups and test settings are reported for each category, as they vary between the categories depending on the technology readiness level of the solution used.

3.1 Laboratory Trials of the Integrated Prototype (TRL-4/5)

The goals of this phase are to verify the proper operation of all system parts in
conjunction with each other and to guarantee sufficient reliability and stability of the
prototype to allow for later evaluation phases involving user groups.
One example is given by Merten et al., who report laboratory trials of the Compan-
ionable9 robot regarding the mechanical design of the drive system, the mechanical
framework of the robot, the system architecture including the communication net-
works and implemented software functionalities. Further, the safety concept was reviewed in cooperation with an independent testing laboratory. The usability of the system's interactive components was validated against ergonomic standards [11].
Methods used within this phase include:
(a) Integration tests, such as checklist-type tests, validate the correct functionality of all integrated technical modules. Ad hoc lists that define single test cases are used [12] (see the first sketch after this list). Integration tests typically take place within a laboratory setting or within a setting mimicking a real-life environment such as a living-lab [11, 13].
(b) Usability evaluation by experts who walk through the concept description
and mark all positive and negative aspects they think affect the user experience is

9 www.companionable.net.

undertaken in [14]. Heuristic evaluation as proposed by Nielsen et al. is a specific method performed within this phase by HCI experts from within the project to validate the system's usability prior to conducting user trials [13, 15].
(c) System pre-tests were conducted at the homes of project members and project-related users, such as grandparents of researchers, who are easy to recruit and rather tolerant of the probable lack of functionality and usability [16, 17]. Checklist-type functional tests are conducted similarly to integration tests, with the exception of the setup within a real environment [17]. As the prototype at this stage is typically not yet stable, the Wizard of Oz technique [18] is of strategic importance to simulate functionalities that are not yet fully integrated or not yet working smoothly enough but that are needed in order to test other functions depending on them [17] (see the second sketch after this list).
(d) Integratability tests were conducted to gather information on potential issues regarding the integration of the robotic platform and surrounding technologies, such as smart home equipment, into a real environment in case field trials at users' homes are planned with this prototype, as described in Pérez et al. [19].
Table 2 provides details of the selected papers or reports that undertook laboratory
trials of integrated prototypes.

3.2 Short-Term User Trials of the Integrated Prototype Within Realistic Environments (TRL-6)

A wide array of research questions is targeted by implementing user trials of the inte-
grated prototype outside of the laboratory within controlled but still realistic settings.
Most projects were found to focus on usability evaluation and on the evaluation of acceptance and social aspects resulting from the anthropomorphic characteristics of the robots used.

3.2.1 Workshops, Focus Groups and Group Discussions

Within focus groups, group discussions or questionnaire sessions, scenarios that showcase typical assistive functionalities are presented to groups of primary, secondary or tertiary users, and user feedback is gathered [13, 20]. The scenarios might be demonstrated with the actual prototype or shown as video recordings of it. In the first case, a tryout session was included as well to give participants a deeper understanding of the system's capabilities and behaviour.
The aims of using these methods are to gain, from a diverse user group, early input on advantages and disadvantages of the demonstrated functionalities as well as suggestions for improvements. They have the advantage of providing input from several participants and experts from different fields within one test session, which makes them cost-efficient compared to short-term single-user trials.

Table 2 Part of evidence table for laboratory trials of the integrated prototype

Companionable (a), reference [11, 14]. Robot type: Companion. Robotic platform: Scietos G3.
Aims: Verification of technical specification; validation of usability for user trials.
Setup: Laboratory setup.
Users: Unknown.
Methods: Functional tests of all technical systems; safety evaluation by German TÜV.

SRS (b), reference [16, 17]. Robot type: Companion. Robotic platform: CareOBot.
Aims: Investigate and measure technical performance, effectiveness, usability and acceptability of the advanced prototype to generate feedback for improvement.
Setup: Whole-system pre-test in a real home of project-affine users (grandparents of one researcher); functional test for a duration of 1.5 days.
Users: Two older users aged 80 and 81.
Methods: Evaluation list for technical performance measurements of system components; semi-structured interviews with participants, implementing Wizard of Oz.

Accompany (c), reference [18]. Robot type: Companion. Robotic platform: CareOBot.
Aims: Get a first exploration on how to deploy a robot at a trial site; get general opinions on the robotic use cases from potential users; usability evaluation.
Setup: The used robot was placed inside the activities room of a sheltered housing facility.
Users: 10 older users from an elderly activities facility.
Methods: Technology probe, interview, observation.
a https://www.tu-ilmenau.de/neurob/projects/finished-projects/companionable/
b http://srs-project.eu
c https://cordis.europa.eu/project/n/100743_en.html

3.2.2 Short-Term Scenario-Based User Trials Under Controlled Conditions

Short-term, scenario-based user trials within a setup mimicking a real user’s home or
a living-lab are the most commonly used method implemented to evaluate assistive
companions [21–24].
Within this method, individual users are typically invited for a duration of about two hours. After an explanation of the goals and the informed consent procedure, measurements are undertaken, followed by a block of pre-defined scenario-based interaction with the robotic solution in which the developed usage scenarios are demonstrated
one after the other. Sometimes the scenarios are embedded in larger user stories, to
give the participants an impression of how they could use the system in real life.
Either final interviews or questionnaires or both conclude the test session.
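As a simple illustration of how such a session can be organised, the following Python sketch lists the phases of a roughly two-hour visit as a small data structure; the phase names and the indicative durations are assumptions of this example, not a protocol taken from any of the reviewed studies.

# Hypothetical session plan for a single participant (durations are illustrative only)
SESSION_PLAN = [
    ("Welcome, explanation of goals and informed consent", 15),
    ("Pre-trial measurements (interview / questionnaires)", 20),
    ("Scenario-based interaction with the robot, one scenario after the other", 60),
    ("Final interview and/or questionnaires", 20),
]

total = sum(minutes for _, minutes in SESSION_PLAN)
for phase, minutes in SESSION_PLAN:
    print(f"{minutes:3d} min  {phase}")
print(f"Total: {total} min")   # roughly two hours in this hypothetical plan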
This method is used to cover a wide variety of research questions, such as those related to technical performance or reliability, usability, acceptance and perceived value, which were also taken up by most authors. In one case, impact measurements regarding the user's autonomy and perceived safety were undertaken [22]. In another case, Fischinger et al. reported aiming for information on the perceived value and willingness to pay, which is similar to a concept already mentioned by Coradeschi et al., who measure use-worthiness, reflecting whether people think this technology might be worth trying [24, 25].
Typically, primary older users were the core group of participants. The number of users varies strongly between the studies, ranging from four [23] to 49 [24], but is generally low; hence, qualitative methods such as interviews, thinking aloud and observations are the main assessment tools during scenario execution. In one case, experience-sampling cards with single closed questions were used to assess the user's impression of a scenario directly after its conduction [21].
Other authors conducted interviews and questionnaires prior to and after showing the scenarios. Typically, customized questionnaires were used to specifically target the evaluation aims [21, 22, 24], which indicates a lack of standardized or well-accepted questionnaires. Lucia et al. used the "AttrakDiff" questionnaire [26], which is composed of 28 items to evaluate factors of usability and user experience and can be used in lab as well as field studies. Within the AttrakDiff questionnaire, hedonic and pragmatic dimensions of the user experience are studied by means of semantic differentials.
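To illustrate how responses to such a semantic-differential questionnaire are typically aggregated, the Python sketch below averages bipolar item ratings per dimension; the item-to-dimension assignment and the -3 to +3 coding are assumptions of this example and do not reproduce the published AttrakDiff scoring key.

from statistics import mean

# Hypothetical mapping of item identifiers to questionnaire dimensions
ITEM_DIMENSION = {
    "item_01": "pragmatic",
    "item_02": "pragmatic",
    "item_03": "hedonic",
    "item_04": "hedonic",
}

def dimension_scores(responses):
    """Average the bipolar ratings (-3..+3) of one participant per dimension."""
    scores = {}
    for dim in sorted(set(ITEM_DIMENSION.values())):
        items = [value for item, value in responses.items() if ITEM_DIMENSION.get(item) == dim]
        scores[dim] = mean(items)
    return scores

# Prints the per-dimension mean ratings for this (made-up) participant
print(dimension_scores({"item_01": 2, "item_02": 1, "item_03": -1, "item_04": 3}))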
Most authors additionally involved informal caregivers as a secondary user group, firstly to gain their views on the research questions, and in particular in the case of tele-care or communication functionalities, which need a counterpart for communication. To evaluate these specific functionalities from both the client's and the carer's side, the evaluation was done with teams of participants consisting of one primary and one secondary user [17, 21].

3.2.3 Longer User Trials Under Controlled Conditions

Schröter et al. report trials conducted in a living-lab situation to which users were invited to stay for a longer duration than the typical two hours of short-term
trials. The authors clearly tried to go as close to field trials as possible without leaving
the controlled environment necessary to safely conduct trials. Users stayed for two
consecutive days but slept at their own homes [14]. In contrast to the short-term trials described above, in this case the developed usage scenarios are embedded into the users' daily routine, providing a more realistic experience for the participants, including possibly repetitive or annoying situations. Only primary users were involved in the described evaluation, and the aims were comparable to those of the short-term scenario-based interactions described above.

Tables 3 and 4 provide details of the selected papers or reports that undertook
short-term user trials in realistic settings with integrated prototypes.

Table 3 Evidence table for user trials of the integrated prototype (part 1)

Companionable (a), reference [11, 14, 49]. Robot type: Companion. Robotic platform: Scietos G3.
Aims: Validate "interaction" between robot and smart home; evaluation of usability and acceptance in real life.
Setup: 6 × 2-day trials within an environment mimicking a real user's flat.
Users: 6 older users with mild cognitive impairments.
Methods: Semi-structured interviews, observations, diary, ad hoc questionnaires.

ExCite (b), reference [20, 50]. Robot type: Telepresence (companion). Robotic platform: Giraff.
Aims: Assess users' reaction towards the adoption of the robotic system; assess willingness to adopt the robotic solution, possible domains of application, advantages and disadvantages, and suggestions for improvements.
Setup: Workshop with a group of participants; interviews with older users.
Users: 10 older adults; 44 health workers (26f, 18m) from different disciplines.
Methods: Workshop with health workers: presentation, tryout session, focus group and final ad hoc questionnaire. Interviews with older adults: (video) presentation of the robot, interview and qualitative analysis thereof.

Florence (c), reference [21, 51]. Robot type: Companion. Robotic platform: Florence (developed within the project).
Aims: Technical performance of the prototype; usability evaluation to give recommendations for future prototypes; gather the overall impression of the users.
Setup: Short-term demos of scenarios in a living-lab setting mimicking a real user's flat.
Users: 5 primary older users (4m, 1f, 68–86y), 5 informal carers, 2 tertiary users (professional tele-care support staff).
Methods: Pre-test interview; experience sampling cards (tailored closed-question ad hoc questionnaire); post-test interview; observations during the tests.
a https://www.tu-ilmenau.de/neurob/projects/finished-projects/companionable/
b http://www.aal-europe.eu/projects/excite/
c https://cordis.europa.eu/project/rcn/93917_en.html

Table 4 Evidence table for user trials of the integrated prototype (part 2)

Project: SRS (a); reference: [16, 17]
Robot type: Companion
Robotic platform: CareOBot
Aims: Evaluation of technical effectiveness; impact on autonomy and safety; usability; acceptability/intention to adopt
Setup: Scenario-based test sessions with users in teams consisting of an elderly user together with an informal caregiver and/or remote operator within a test site
Users: 16 elderly users, 1 young disabled man; 12 informal caregivers (relatives); 5 professional operators (tertiary users from a 24 h call centre)
Methods: Evaluation check-list for technical performance; interactive think-aloud with moderators; ad hoc developed questionnaires; AttrakDiff questionnaire; focus group on safety, ethical and privacy issues after the test session

Project: ALIAS (b); reference: [23, 43]
Robot type: Companion
Robotic platform: Scietos A5
Aims: Evaluation of usability, user friendliness, system performance
Setup: Scenario-based individual user tryout-sessions; two main trial iterations with users with 1 year in between to allow for technical modifications
Users: 4 primary users (2f, 2m); 2 care givers
Methods: Task-oriented test methods taking users’ behaviour and comments into account; observation during the conduction of trial scenarios; analysis of user comments

Project: Hobbit (c); reference: [24, 39]
Robot type: Companion
Robotic platform: Hobbit (developed within project)
Aims: Usability of multimodal interaction possibilities; acceptance of the robot; perceived value with respect to affordability and willingness to pay
Setup: Short-term, scenario-based individual trials at 3 similar test-sites in simulated real homes (living labs) decorated as living rooms
Users: 49 primary users aged 70+ with typical age impairments
Methods: Wizard of Oz; ad hoc developed questionnaires for usability, acceptance and affordability

a http://srs-project.eu
b http://www.aal-europe.eu/projects/alias/
c http://hobbit.acin.tuwien.ac.at

3.3 Field Trials in Real Environments (TRL-7)

Field trials were undertaken by projects in more recent years, in particular using
either product-grade off-the-shelf robotic systems or functionally minimal robotic
solutions, e.g. with a restricted ability to interact (compare also [5]). Mucchiani et al.
were able to use a technically advanced robotic system that was initially developed
for the commercial setting of delivering goods to hotel guests [27].
The goals and research questions of most projects were to gain information on
the impact of a robot on care and on the health and quality of life of the targeted
user groups. Nevertheless, aspects of all research goals of earlier phases were also
included, such as measurements of social aspects, usability and technical
performance within a real-life setting.
Typically, a within-subjects design was chosen for field trials in light of the
inter-individual differences among older users and users with disabilities. Questionnaires,
(semi-structured) interviews and medical measurements were used as repeated mea-
surements prior to, during and after the integration of the robot into users’ homes or
care facilities to gain information on the impact of such systems on the users. User
diaries and technical data-logging were the methods used most often to gain contin-
uous information about the user experience over time and the technical performance
of the systems (see also the evidence in Tables 5 and 6).
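As an illustration of what such technical data-logging can look like in practice, the following minimal Python sketch appends timestamped events to a log file so that usage frequency, duration and failures can be analysed after a trial; the file name and event fields are illustrative assumptions and are not taken from any of the reviewed projects.

import json
import time

def log_event(event_type, details=None, path="robot_usage_log.jsonl"):
    # Append one timestamped event as a JSON line; the schema is hypothetical.
    record = {"timestamp": time.time(), "event": event_type, "details": details or {}}
    with open(path, "a") as log_file:
        log_file.write(json.dumps(record) + "\n")

# Example events a companion robot might record during a field trial.
log_event("interaction_started", {"modality": "touchscreen"})
log_event("navigation_failure", {"room": "kitchen"})
log_event("interaction_ended", {"duration_s": 312})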
Heylen et al. reported a technique for video logging within the homes of users that
used cameras which would only activate when the participant confirmed by button
press, in order to account for privacy needs [28].
Authors that report standard questionnaires name the Software Usability Measure-
ment Inventory (SUMI) [29], which was initially developed to evaluate the usability
of software, and the System Usability Scale (SUS) [30] to measure aspects of
usability. The “Positive and Negative Affect Schedule” (PANAS) [31], the “Short
Form Health Survey” (SF12) [32], the “Geriatric Depression Scale” [33] and the
“UCLA Loneliness Scale” [34] were used to gain insights into the impact of the
introduced systems on quality of life and health. The “Multidimensional Scale of
Perceived Social Support” (MSPSS) [35] was used to measure the impact of the system
on the subjective feeling of social support, which influences depression and anxiety
symptomatology and hence is also a factor in quality of life.
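To make the scoring of such instruments concrete, the sketch below computes a System Usability Scale score from the ten item responses according to the standard scoring rule (odd items contribute the response minus 1, even items contribute 5 minus the response, and the sum is multiplied by 2.5); the function name and the example responses are illustrative only.

def sus_score(responses):
    # Standard SUS scoring: ten items rated 1-5 are converted to a 0-100 score.
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses on a 1-5 scale")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # index 0 is item 1, an odd-numbered item
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

# One participant's hypothetical responses to the ten SUS items.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # prints 85.0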
To assess the acceptance and usability factors of the systems, the “Almere
Model” [36] and the “Godspeed questionnaire” [37] were employed, as both
were specifically developed to assess acceptance factors of social robots as
companions.
Tables 5 and 6 provide details of the selected papers or reports that undertook
field-trials in real environments.
Table 5 Evidence table for field trials in real environments (part 1)

Project: ExCite (a); reference: [20]
Robot type: Tele-presence (Companion)
Robotic platform: Giraff
Aims: Monitor robot usage over time; measure impact on user’s health and quality of life
Setup: Field trials for a duration of 3–12 months in users’ homes
Users: Pairs of primary older users (who have the system at home) and secondary formal or informal caregivers (who teleoperate the system)
Methods: Repeated measurements prior to, during and after integration of the robot; evaluation with carers: ad hoc questionnaires, SUMI questionnaire [29], Temple Presence Inventory [52], PANAS [31], structured interviews, diary; additionally for evaluation with primary users: UCLA Loneliness Scale [34], Short Form Health Survey (SF12) [32], Multidimensional Scale of Perceived Social Support [35], Geriatric Depression Scale [33], Almere model [36]

Project: Accompany (b); reference: [19]
Robot type: Companion
Robotic platform: Developed within the project, similar appearance as CareOBot3
Aims: Evaluate (a) perceptions and attitudes towards the robot, (b) impact on daily routines, (c) impact on physical and psychological health
Setup: Field trial for 3 weeks at the participant’s home (in the living room)
Users: 1 older user, male, 74 years, living alone at home, technically experienced
Methods: Pre-/post interview, daily diary; objective methods: frequency and duration of use, performance score of a health exercise, heart rate; Godspeed questionnaire [37]; Almere model [36]; Source Credibility Scale [53] to measure trust in the technical system; Personal Opinion Survey (POS) [54] to measure impact on stress

Project: SERA (c); reference: [28]
Robot type: Companion
Robotic platform: Nabaztag (d)
Aims: Study HRI aspects such as attitudes towards the robot and their change over time, interaction of participants with the device
Setup: Field trial for a duration of approx. 10 days each
Users: 6 healthy primary users (aged 50+)
Methods: Analysis of video recordings; semi-structured interviews before, during and after the test; diary to note interesting aspects during the test duration

a http://www.aal-europe.eu/projects/excite/
b https://cordis.europa.eu/project/rcn/100743_en.html
c https://cordis.europa.eu/project/rcn/89259_en.html
d http://www.nabaztag.fr
Table 6 Evidence table for field trials in real environments (part 2)

Project: Radio (a); reference: [38, 44]
Robot type: Companion
Robotic platform: Developed within the project
Aims: Evaluation of the usability for primary users (older people)
Setup: The users were involved for five days (two days for deployment of the system at the participants’ homes, three days of actual pilot study)
Users: Two users were recruited from beneficiaries of a home care service and from volunteers of a social care activities network. Users were excluded from the trials if unable to operate the robotic system
Methods: Users complete a set of assessments over the course of three days. Day one is used primarily for a pre-assessment and training on the usage of the robotic system. On day two the system is used and the user experiences different scenarios such as “pill intake”, “bed transfer”, “chair transfer”, “meal preparation” over the course of the day. On day three usability satisfaction and quality of life questionnaires are filled out within an in-depth interview for qualitative analysis. Standardized assessments used include: Long-Term Care Facilities Form [55], System Usability Scale (SUS) [30], the Psychological Impact of Assistive Device Scale (PIADS) [56]

Project: none; reference: [27]
Robot type: Companion
Robotic platform: Savioke (c)
Aims: Understand efficacy of human-robot interaction and enhance future robot versions
Setup: Field trial for one week (four users) or two days (12 users)
Users: 16 older users living in supported apartment living
Methods: Immediately after each interaction (e.g. the robot delivered water, or the robot guided the users through the building) a post-interaction survey including a questionnaire based on the Almere Model [36] was conducted. Further, an observation was undertaken and project-specific parameters were noted down

Project: Teresa (b); reference: [40]
Robot type: Tele-presence (Companion)
Robotic platform: Developed within the project, based on the Giraff platform
Aims: Investigate user acceptance and experience
Setup: Deployment of the robotic system during four sessions of a weekly activity to groups of participants. The robot was controlled using a Wizard of Oz approach
Users: Older users within a nursing home who were already part of coffee and quiz activities
Methods: Qualitative approach: observation and retrospective video analysis, group discussion as well as a final semi-structured interview with older residents and unstructured meetings with care staff

a http://radio-project.eu
b https://teresaproject.eu
c http://www.savioke.com

3.4 General Considerations

The following paragraphs present general insights derived from the analysis of
implemented research methods.

3.4.1 Evaluation Aims

Three main evaluation aims could be identified across the literature. Most reviewed
studies had the main goal of developing an assistive companion for older users and of
using the study results to provide insights on how to further improve the developed
companion relative to the current solution (see e.g. [11, 21, 24]).
Another main goal was to show that the developed prototype has an impact on the
quality of care, health and/or quality of life. In this case the evaluation methods
were selected to evaluate or demonstrate impacts, resulting in the need for
long-term interactions [19, 20, 38].
A third major goal was to push the state of the art in a particular research field
such as HRI. In that case evaluation was used to gain insights on the use of robotic
companions in general, rather than to validate a particular development [13, 19].

3.4.2 User Groups

User groups were typically split into two to three sub-groups with different interests;
often named “primary”, “secondary” and “tertiary” users.
In all reviewed studies older users comprised the group of primary users. Different
inclusion criteria were used; in most cases healthy older users were included based on
their age, such as in [21] or [24]. Secondary users were often included and referred
to informal and formal carers [20]. Tertiary users, such as technical support staff and
professional tele-operators, were included in some trials [17, 20].

3.5 Methodological Challenges

This section presents several methodological challenges that were brought up by the
authors of the reviewed literature or that emerged during the review process.

3.5.1 Lack of Technical Robustness and Functionality of Prototypes

Several projects reported technical issues that influenced the end user evaluation,
in particular regarding measurements of user experience and acceptance [14, 17].
The issues were mainly due to a lack of robustness and reliability of prototype-level
components and to the complex integration of many prototype parts, which sums up
the individual probabilities of failure. Pigini et al. report the use of complex scenarios
as an issue. The same authors also report that during some evaluation phases a high
proportion of the scenarios demonstrated to the users (up to 70%) showed technical
issues. Users noted these issues, and the reports suggest that this influenced the
evaluation results [17]. Schröter et al. found that speech recognition rates in particular
were unsatisfactory; users therefore did not use this often-preferred mode of
communication but an alternative input via touch-screen on the robot [14]. This
implies that one of the core aspects of companion robots, multimodal human-like
interaction, could not be evaluated. The technical systems lacked robustness in
uncontrolled real-life settings in particular. Pripfl et al. report that the core
functionality of the robotic system was fully operational for only about 18% of the
time within the conducted field trials [39]. It can be expected that low performance
rates negatively influence study results, as Heylen et al. found that a poorly designed
robot frustrated people and hence biased results on acceptance [28].
In addition to lacking robustness, the functional capabilities of current prototypes
also did not allow for real-life trials, as shown by Pigini et al., who report necessary
changes to the environment in order to successfully integrate the robots. In one case,
objects made from glass needed to be covered, as the robotic sensors could otherwise
not recognize them. In other cases, furniture in particular needed to be moved to
allow the robot to navigate along obstacle-free paths [13, 17].
In other cases the trial methodology had to be altered to compensate for the lack of
robustness. Vroon et al. changed their initial plan of field trials over a period of three
weeks because they were not able to log their robot into the test site’s (care center’s)
wireless LAN [40].
Low technical reliability and functionality are issues particularly in early prototypes;
nevertheless, users were involved early in the design and evaluation process to
gather early results on user experience, such as in [17] and [19]. It is difficult to assess
whether such early user interactions can provide valuable input, given the influence
of technical malfunctions on the perceived usability and on the overall impression of
the participating users.

3.5.2 Difficulties in Conducting User Trials with the Group of Older Users

Older users are a heterogeneous group with high inter-individual differences. These
differences seem not to have been taken into account by parts of the literature base, as
most reviewed projects report selecting their participants mainly based on chronological
age, which assumes they would otherwise have similar conditions. This is not the
case, as Britt Östlund also argues: “… chronological age is not a sufficient measure
for older people’s life situation” [41]. This issue leads to heterogeneous user groups
within the trials, making it hard to derive design conclusions from the experiences and
results gathered, which was also found by Payr [42].

The inclusion of vulnerable participants carries the risk of either a higher
number of user dropouts or the need to strip down the initially planned trials to
methods suitable for this particular user group. Rehrl et al. report changes in the test
flow that left out important parts of the planned trials because the poor health status of
participants did not allow their further involvement and hence further investigations
[43]. Within the research project “Teresa”, the trial setup was altered after researchers
found that users within a nursing home were incapable of filling in a questionnaire and
seemed scared to participate in a formal experiment, as they feared being “not good
enough” for the project and hence hesitated to sign an informed consent form [40].
Within the Radio project, out of the initially planned ten users in a real-life evaluation
only two users were finally recruited for the trials, and furthermore only 3 days of trial
duration were planned [44].

3.5.3 Lack of Accepted Methodologies

Feil-Seifer et al. critique that “although it is difficult to compare robotic systems
designed for different tasks, it is important to do so to establish benchmarks for
effective and ethical SAR design” [4]. Currently, it does not seem feasible to compare
results between studies because a respective methodology is lacking. Hardly any
standardized research instruments were used in the reviewed literature, implying
that the research field of assistive robotics is still in an “exploratory” state where
qualitative methods and subjective measurements are predominant. Ganster et al.
raised this point as well [3].
In addition to missing methodologies, Amirabdollahian et al. argue that some of
the few existing and commonly used methods are not appropriate for long-term real-
life trials, as neither the Almere Model nor the earlier UTAUT model [45] is specific
enough, and both are based on lab studies only rather than real-life studies. The authors
argue that the constructs used in the Almere model are not sufficient to predict future use, as
“… self-efficacy and self-esteem moderate the relation between intention to use and
actual use” but are not included in the model, and further in general: “What people
respond in a questionnaire about the intention to use in general does not comply with
their actual use of the system in the long run” [46]. However, the reviewed literature used
this method mainly to gain insights on acceptance factors, not to predict future use.

3.5.4 Issues Regarding Long-Term Field Trials

Only one of the reviewed field trials so far (Pripfl et al. [39]) reached a minimum
duration of two months, which is necessary to gain information on acceptance with-
out the bias of the participants’ initial excitement [47]. In more recent years an increasing
number of projects and studies have tried to undertake real-life field trials. However, as the
presented results suggest, almost all of these studies faced severe method-
ological problems in conducting the trials, leading in most cases to a steep decrease in
the number of study participants and/or a methodological shift towards a more qualitative
approach (compare also [38, 39, 44]).
Heylen, van Dijk and Nijholt found that real-life trials at users’ homes do not neces-
sarily reduce the experimental biases of typical experimental procedures, such as socially
accepted answers and biases in engagement with the prototype [28].
The same authors argue that although real-life experiments were conducted in
real users’ homes, the character of an experiment was still evident to the users, and
according to interviews, users also behaved differently during interaction phases
with the nature of a research project in mind. The situation is therefore not com-
parable with the situation after deciding to acquire a robot and using it at home of
one’s own volition.

3.5.5 Further Issues

Impact measurements, such as measurements of the user’s quality of life (perceived
safety) or the user’s care, were undertaken within short-term user trials in a living-lab
situation. Impacts are typically measured within long-term investigations by means
of pre-post measurements, as shown by [19] and [20]. It seems an open question
whether impact factors measured over the short term can provide valuable information
on later long-term impacts in the field.
Authors report that individual short-term user trials are time-consuming, which
allows for only approximately two trials a day because of the large effort needed to set up
and control the robotic prototypes; this directly limits the number of users involved and
introduces budgetary limits. The number of primary users that participated in trials was
hence low: typically about 10 in short-term and 4 in long-term evaluations.
As stated in the methodology section, finding information on evaluation meth-
ods and study results from user trials on companion robots is surprisingly difficult,
although this information represents one of the main outcomes of such research
projects. This might be because the evaluation phase within funded projects mostly
takes place at the end of the project; hence publication of the results might not be
possible within the duration of the project, which raises a funding issue.
Another likely reason is that researchers do not feel comfortable publishing
evaluation results because of the aforementioned common methodological and technical
issues and their impact on the quality of results.

3.5.6 Limitations of This Review

The literature review is limited to sources from projects funded at the European level.
In particular, no purely national or overseas publications were considered.
Within this paper, because of the lack of information present in peer-reviewed
sources, project reports, namely public deliverables of European projects from
the EU-FP7 and AAL-JP programmes, were also analysed. The scientific quality of
the information presented in public deliverables is not validated, as they are commonly
not peer-reviewed. It could be argued that deliverables are work targeted
towards the reviewers of funding organizations and might hence be phrased rather
positively. However, the author believes this is not the case for the reviewed descriptions
of the methodologies used.
The methodologies used strongly depend on the research aim, which varies
between the literature presented and does not always fit well with the chosen categorization
by technology readiness levels. In that way the categorization is limited, but the
author still thinks that the presented overview is helpful to other researchers, as it can be
used to find potentially fitting methods for future studies.

4 Conclusion

An overview of current practices and methodologies used for the user eval-
uation of companion robots was given, covering typical research aims,
research methods, test setups and user groups.
In addition to the overview, several methodological points for discussion were
found, some of which were already raised by other authors: the common lack
of technical robustness and its consequences, a lack of scientific rigour in the selection
of methods, partly caused by a general lack of commonly accepted methodologies
that would allow for the comparison of data between research projects,
and a low number of published results in general.
Technical issues hinder the evaluation of the user experience and acceptance of com-
panion robots. Due to the complex technical nature of assistive robots, which involve
artificial intelligence with less than 100% accuracy and reliability as well as non-
product-grade hardware components, it seems clear that technical issues were and
will be present in most evaluation phases. This has to be taken into account by
user researchers, who have to make sure that the system they implement in a real
setting is functioning properly in order not to bias the evaluation results, in particular
on acceptance and user experience.
Even in large European projects funded extensively by the European Commission
and lasting for three years or more, the ideal of the user-centred design process, which
is to iterate several times around the cycle of design, development and evaluation
until the prototype is mature enough to advance to the next step of productization,
does not hold: most literature reports only one or two main trial phases, implying a
maximum of two cycles within the process with the integrated prototype. The reason
for this seems to be the exceptionally high technical complexity of the prototypes and
the high research efforts needed from different disciplines, resulting in long development
times.
All reviewed projects that tried to perform real-life field-trials with robotic pro-
totypes reported severe issues in trial execution. The lesson learned seems to be that
only product-grade robotic platforms should be used within real-life trials.
Out of 39 researched projects in the field of assistive robotics, 15 (38%) belong to
the field of companion robotics, even though 9 other potentially interesting fields exist,
as identified and clustered by Payr et al. [48]. Hence the research has a focus on this
particular type of robot. Later projects focussed less on companion-type robots.
The method of searching for literature based on relevant scientific projects in
this area resulted in a considerably larger literature base compared with a classic
database search, since the selection of keywords (both by authors, who link their
publications to certain keywords, and by the reviewer, who searches for them) plays a
subordinate role. That means that searching for projects first and, on that basis, searching
for the literature of these projects can be a viable approach in case the literature base is
small. Further, this method does not depend on search keywords in the same way,
as project databases exist that can be screened by hand.
Hence the method seems to be a viable option in the case of a scarce literature
base within a reviewed field.

References

1. Bemelmans, R., et al.: Socially assistive robots in elderly care: a systematic review into effects
and effectiveness. J. Am. Med. Dir. Assoc. 13(2), 114–120 (2012)
2. Dautenhahn, K.: Socially intelligent robots: dimensions of human–robot interaction. Philos.
Trans. R. Soc. B: Biol. Sci. 362(1480), 679–704 (2007)
3. Ganster, T., Eimler, S.C., Von Der Pütten, A.M., Hoffmann, L., Krämer, N.: Methodological
considerations for long-term experience with robots and agents. In: Proceedings of EMCSR
(2010)
4. Feil-Seifer, D., Skinner, K., Matarić, M.J.: Benchmarks for evaluating socially assistive
robotics. Interact. Stud. 8, 423–439 (2007)
5. Leite, I., Martinho, C., Paiva, A.: Social robots for long-term interaction: a survey. Int. J. Social
Robot. 5(2), 291–308 (2013)
6. allaboutux: “Field-methods” [Online]. Available: http://www.allaboutux.org/field-methods.
Accessed 2 July 2018
7. Koppa.jyu.fi: “Method map” [Online]. Available: https://koppa.jyu.fi/avoimet/hum/
menetelmapolkuja/en/methodmap. Accessed 2 July 2018
8. Green, B.N., Johnson, C.D., Adams, A.: Writing narrative literature reviews for peer-reviewed
journals: secrets of the trade. J. Chiropr. Med. 5(3), 101–117 (2006)
9. NASA: “Technology readiness level” [Online]. Available: http://www.nasa.gov/content/
technology-readiness-level/#.VOXJUlpTOXs. Accessed 2 July 2018
10. euRobotics: “Robotics 2020 multi-annual roadmap,” 2016 [Online]. Available: https://www.
eu-robotics.net. Accessed 2 July 2018
11. Merten, M., et al.: A mobile robot platform for socially assistive home-care applications. In:
Robotics; Proceedings of ROBOTIK 2012; 7th German Conference on, VDE (2012)
12. University of the West of England: Mobiserv project D7.3: final system prototype. Public
Report (2010)
13. University of the West of England: Mobiserv project D2.4: evaluation plan. Public Report
(2013)
14. Schröter, C., et al.: CompanionAble–ein robotischer Assistent und Begleiter für Menschen
mit leichter kognitiver Beeinträchtigung. In: Wohnen–Pflege–Teilhabe–„Besser leben durch
Technik“ (2014)
15. Nielsen, J., Molich, R.: Heuristic evaluation of user interfaces. In: Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems, ACM (1990)
16. Pigini, L., Facal, D., Mast, M., Blasi, L., López, R., Arbeiter, G.: SRS project D6.1: testing site
preparation and protocol development. Public Report (2012)

17. Pigini, L., Mast, M., Facal, D., Noyvirt, A., Qiu, R., Claudia, S., Alvaro, G., Rafael, L.: SRS
deliverable D6.2: user validation results. Public Report (2013)
18. Green, A., Huttenrauch, H., Eklundh, K.S.: Applying the Wizard-of-Oz framework to cooper-
ative service discovery and configuration. In: Robot and Human Interactive Communication,
2004. ROMAN 2004. 13th IEEE International Workshop on, IEEE (2004)
19. Pérez, J.G., Lohse, M., Evers, V.: Accompany D6.3, acceptability of a home companion robot.
Public Report (2014)
20. Cesta, A., Cortellessa, G., Orlandini, A., Tiberio, L.: Evaluating telepresence robots in the field.
In: Agents and Artificial Intelligence, pp. 433–448. Springer, Berlin, Heidelberg (2013)
21. Kosman, R., Eertink, H., Van der Wal, C., Ebben, P., Reitsma, J., Quinones, P., Isken, M.:
Florence D6.6: evaluation of the FLORENCE system (2013)
22. Lucia, P., Marcus, M., David, F., Alexander, N., Renxi, Q., Claudia, S., Alvaro, G., Rafael, L.:
SRS D6.2: user validation results (2013)
23. Ihsen, S., Scheibl, K., Schneider, W., Glende, S., Kohl, F.: ALIAS D1.5, analysis of pilot’s
second test-run with qualitative advices on how to improve specific functions/usability of the
robot (2013)
24. Fischinger, D., Einramhof, P., Papoutsakis, K., Wohlkinger, W., Mayer, P., Panek, P., Hofmann,
S., Koertner, T., Weiss, A., Argyros, A., Vincze, M.: Hobbit, a care robot supporting independent
living at home: first prototype and lessons learned. Rob. Auton. Syst. 75, 60–78 (2016). https://
www.sciencedirect.com/science/article/abs/pii/S0921889014002140
25. Coradeschi, S., Cesta, A., Cortellessa, G., Coraci, L., Gonzalez, J., Karlsson, L., Furfari, F.,
Loutfi, A., Orlandini, A., Palumbo, F., Pecora, F., von Rump, S., Stimec, A., Ullberg, J.,
Otslund, B.: GiraffPlus: combining social interaction and long term monitoring for promoting
independent living. In: 2013 6th International Conference on Human Systems Interaction,
pp. 578–585, June 2013
26. Hassenzahl, M., Burmester, M., Koller, F.: AttrakDiff: Ein Fragebogen zur Messung
wahrgenommener hedonischer und pragmatischer Qualität. In: Mensch & Computer 2003,
pp. 187–196. Vieweg + Teubner Verlag (2003)
27. Mucchiani, C., Sharma, S., Johnson, M., Sefcik, J., Vivio, N., Huang, J., Cacchione, P., Johnson,
M., Rai, R., Canoso, A., Lau, T., Yim, M.: Evaluating older adults’ interaction with a mobile
assistive robot. In: IEEE International Conference on Intelligent Robots and Systems, vol.
2017, pp. 840–847, September 2017
28. Heylen, D., van Dijk, B., Nijholt, A.: Robotic rabbit companions: amusing or a nuisance? J.
Multimodal User Interfaces 5(1–2), 53–59 (2012)
29. Sumi: Software Usability Measurement Inventory, University College Cork (2011).
http://sumi.ucc.ie/. Last checked Feb 2015
30. Brooke, J.: SUS—a quick and dirty usability scale. Usability Eval. Ind. 189(194), 4–7 (1996)
31. Terracciano, A., McCrae, R.R., Costa, P.T.: Factorial and construct validity of the Italian positive
and negative affect schedule (PANAS). Eur. J. Psychol. Assess. Off. Organ Eur. Assoc. Psychol.
Assess. 19, 131–141 (2003)
32. Ware, J.E.J., Kosinski, M., Keller, S.D.: A 12-item short-form health survey: construction of
scales and preliminary tests of reliability and validity. Med. Care 34, 220–233 (1996)
33. Yesavage, J.A., Brink, T.L., Rose, T.L., Lum, O., Huang, V., Adey, M., Leirer, V.O.: De-
velopment and validation of a geriatric depression screening scale: a preliminary report. J.
Psychiatr. Res. 17, 37–49 (1983)
34. Russell, D., Peplau, L.A., Cutrona, C.E.: The revised UCLA loneliness scale: concurrent and
discriminant validity evidence. J. Pers. Soc. Psychol. 39, 472–480 (1980)
35. Zimet, G.D., Dahlem, N.W., Zimet, S.G., Farley, G.K.: The multidimensional scale of perceived
social support. J. Pers. Assess. 52, 30–41 (1988)
36. Heerink, M., Kröse, B.J.A., Evers, V., Wielinga, B.J.: Assessing acceptance of assistive social
agent technology by older adults: the Almere model. Int. J. Soc. Robot. 2, 361–375 (2010)
37. Bartneck, C., Kulić, D., Croft, E., Zoghbi, S.: Measurement instruments for the anthropomor-
phism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc.
Robot. 1(1), 71–81 (2008)

38. Radio, C.: DELIVERABLE 6.11 User Evaluation Report of the Radio Project (2017)
39. Pripfl, J., Kortner, T., Batko-Klein, D., Hebesberger, D., Weninger, M., Gisinger, C., Frennert,
S., Eftring, H., Antona, M., Adami, I., Weiss, A., Bajones, M., Vincze, M.: Results of a real
world trial with a mobile social service robot for older adults. In: 2016 11th ACM/IEEE
International Conference on Human Robot Interaction, pp. 497–498 (2016)
40. Vroon, J., Evers, V., Englebienne, G.: Deliverable 3.2: Longitudinal Effects Report of the
Project ‘Teresa’, no. 611153 (2015)
41. Östlund, B., et al.: STS-inspired design to meet the challenges of modern aging. Welfare
technology as a tool to promote user driven innovations or another way to keep older users
hostage? Technol. Forecast. Soc. Chang. 93, 82–90 (2014)
42. Payr, S.: Virtual butlers and real people: styles and practices in long-term use of a companion.
In: Trappl, R. (ed.) Virtual Butlers: The Making of. Springer, Heidelberg (2013)
43. Rehrl, T., Troncy, R., Bley, A., Ihsen, S.: The ambient adaptable living assistant is meeting its
users. In: AAL Forum (2012)
44. Radio, C.: DELIVERABLE 6.4 Piloting Plan IV of the Radio Project (2017)
45. Venkatesh, V., et al.: User acceptance of information technology: toward a unified view. MIS
Q. 27(3), 425–478 (2003)
46. Amirabdollahian, F., et al.: Accompany: acceptable robotics companions for ageing years—
multidimensional aspects of human-system interactions. In: 2013 The 6th International
Conference on Human System Interaction (HSI), IEEE (2013)
47. Broekens, J., Heerink, M., Rosendal, H.: Assistive social robots in elderly care: a review.
Gerontechnology 8(2), 94–103 (2009)
48. Payr, S., Werner, F., Werner, K.: Potential of Robotics for Ambient Assisted Living. FFG
Benefit, Vienna (2015)
49. Schroeter, C., Mueller, S., Volkhardt, M., Einhorn, E., Huijnen, C., van den Heuvel, H., van
Berlo, A., Bley, A., Gross, H.-M.: Realization and user evaluation of a companion robot for
people with mild cognitive impairments. In: 2013 IEEE International Conference Robotics and
Automation, pp. 1153–1159, May 2013
50. Cesta, A., et al.: Into the wild: pushing a telepresence robot outside the lab. In: Social Robotic
Telepresence (2012)
51. Melenhorst, M., Isken, M., Lowet, D., Van de Wal, C., Eertink, H.: Florence D6.4: Report on
the Testing and Evaluation Methodology for the Living Lab Testing (2013)
52. Lombard, M., Ditton, T., Weinstein, L.: Measuring telepresence: the temple presence inventory.
In: Proceedings of the Twelfth International Workshop on Presence, Los Angeles,
San Francisco, CA, USA (2009)
53. McCroskey, J.C., Teven, J.J.: Goodwill: a reexamination of the construct and its measurement.
Commun. Monogr. 66(1), 90–103 (1999)
54. McCraty, R., et al.: The impact of a new emotional self-management program on stress, emo-
tions, heart rate variability, DHEA and cortisol. Integr. Physiol. Behav. Sci. 33(2), 151–170
(1998)
55. Kim, H., Jung, Y.I., Sung, M., Lee, J.Y., Yoon, J.Y., Yoon, J.L.: Reliability of the interRAI
long term care facilities (LTCF) and interRAI home care (HC). Geriatr. Gerontol. Int. 15(2),
220–228 (2015)
56. Jutai, J.W., Day, H.: Psychosocial impact of assistive devices scale (PIADS). Technol. Disabil.
14, 107–111 (2002)

Franz Werner is head of the interdisciplinary master’s program
“Health Assisting Engineering” at the University of Applied
Sciences, FH Campus Wien. He is responsible for the manage-
ment of teaching activities as well as the applied research in
the field of health and care technologies undertaken at the same
institution.
Previously, he studied medical software science as well as
software management at the Technical University of Vienna,
Austria. Since 2007 he specialises in the development of
eHealth-solutions and undertakes research in the field of assis-
tive technologies on national and European level. He focuses his
research on user-centred development and evaluation of assistive
technologies and technologies for health-care.
Since 2010, one of his main research areas has been the design
and analysis of human-robot interaction and, in particular, the
development of evaluation methodologies for the assessment of
assistive robotics. During his research he took part in the EU-FP7
funded project KSERA (ID: 248085) and the EU AAL-JP funded
project ReMIND, and led several national projects targeting the
development of assistive robotic solutions that support the care
of older users.
Methodologies to Design Evaluations
Conducting Studies in Human-Robot
Interaction

Cindy L. Bethel, Zachary Henkel and Kenna Baugus

Abstract This chapter provides an overview of approaches for planning, designing,
and executing human studies for Human-Robot Interaction (HRI). Recent literature
on approaches used for conducting studies in human-robot interaction is presented.
There is a detailed section on terminology commonly used in HRI studies, along
There is a detailed section on terminology commonly used in HRI studies, along
with some statistical calculations that can be performed to evaluate the effect sizes
of the data collected during HRI studies. Two improvements are described, using
insights from the psychology and social science disciplines. First is to use appro-
priate sample sizes to better represent the populations being investigated to have a
higher probability of obtaining statistically significant results. Second is the appli-
cation of three or more methods of evaluation to have reliable and accurate results,
and convergent validity. Five primary methods of evaluation exist: self-assessments,
behavioral observations, psychophysiological measures, interviews, and task perfor-
mance metrics. The chapter describes specific tools and procedures to operationalize
these improvements, as well as suggestions for recruiting participants. A large-scale,
complex, controlled human study in HRI using 128 participants and four methods
of evaluation is presented to illustrate planning, design, and execution choices. The
chapter concludes with ten recommendations and additional comments associated
with the experimental design and execution of human studies for human-robot inter-
actions.

Keywords HRI evaluation · Sample size · Evaluation methods · Recommendations

C. L. Bethel (B) · Z. Henkel · K. Baugus


Department of Computer Science and Engineering, Mississippi State University,
P.O. Box 9637, Mississippi State, MS 39762-9637, USA
e-mail: clb821@msstate.edu; cbethel@cse.msstate.edu
Z. Henkel
e-mail: zmh68@msstate.edu
K. Baugus
e-mail: kbb269@msstate.edu


1 Introduction

Human-Robot Interaction (HRI) is a rapidly advancing area of research, and as such
there is a growing need for strong experimental designs and methods of evaluation
[1]. These bring credibility and validity to scientific research that involves humans
as subjects, comparable to the standards established in psychology and the other social
sciences. Two primary concerns observed in HRI studies are (1) the lack of appropriate
sample sizes that closely represent the populations being studied and provide sufficient
statistical power, and (2) the lack of three or more methods of assessment used to obtain
convergent validity in HRI studies [21, 30, 32].
The focus until recently in HRI was on the development of specific robotic systems
and applications while neglecting methods of evaluation and metrics. Some methods
of evaluation have been adopted and/or modified from such fields as human-computer
interaction, psychology, and social sciences [32]; however, the manner in which a
human interacts with a robot is similar but not identical to interactions between
a human and a computer or a human interacting with another human. As robots
become more prevalent, it will be important to develop accurate methods to assess
how humans respond to robots, how they feel about their interactions with robots,
and how they interpret the actions of robots, in addition to how they may operate
robots [1, 3, 6, 9].
There are five primary methods of evaluation used for human studies in HRI:
(1) self-assessments, (2) interviews, (3) behavioral measures, (4) psychophysiology
measures, and (5) task performance metrics [3, 6, 7, 9, 21, 32, 49]. From the review
of HRI literature, it appears the most common methods used in HRI studies are
self-assessment, behavioral measures, and task performance metrics. There is lim-
ited research on the use of psychophysiological measures and interviews, though
the use of psychophysiological measures in HRI appears to be increasing. A cau-
tionary note: there has been an increase in the use of psychophysiological measures
accompanied by claims that specific emotions can be identified from physiological
signals alone; however, physiological signals measure levels of arousal, not
valence (how positive or negative a person is feeling). Therefore, it is not possible
from physiological signals alone to determine specific emotions such as happiness,
or to distinguish between surprise and anger, which are both high-arousal emotions.
Arousal levels are what physiological measures are intended to capture
[12]. Each method of evaluation has advantages and disadvantages; however, the dis-
advantages can be overcome by using more than one method of evaluation [3, 6, 9,
32].
The design of quality research studies for use in HRI applications with results
that are verifiable, reliable, and reproducible is a major challenge [1, 6]. The use of a
single method of measurement is not sufficient to interpret accurately the responses
of participants to a robot with which they are interacting. Steinfeld et al. describe the
need for the development of common metrics as an open research issue in HRI [49].
This is still an issue today, though more researchers are dedicated to addressing this
important issue to improve the quality of HRI research. Until recently, the discussion
about the development of common metrics for HRI has been oriented toward an
engineering perspective, and this does not completely address the social interaction
perspective. Both the engineering and social interaction perspectives require further
investigation to develop standard and common metrics and methods of evaluation.
There is a significant need for new methods and metrics to advance and validate the
field of human-robot interaction [33]. Some social science researchers are
working toward establishing methodological approaches to HRI [15]. There is
a strong need to explore longitudinal studies in real-world settings, moving studies
away from lab settings with college-aged students and convenience samples [17, 33].
This chapter begins with a discussion of some related work on experimental
designs and methods used in HRI. There is a brief summary of terminology and
related information presented in Sect. 3. Next is a discussion of the process of plan-
ning and designing a human study in HRI in Sect. 4. This section covers how to deter-
mine the type of study to use, the number of participants, and the methods and measures
of assessment (advantages and disadvantages). Additionally, there is a discussion of
how to design a high-fidelity study site, select robots and other equipment, find
assistants to conduct the study, recruit the required number of participants, develop
contingency plans to deal with failures, and prepare Institutional
Review Board (IRB) and/or ethics documents. In Sect. 5, illustrative examples are
drawn from a large-scale, complex, controlled human study in HRI using 128 partic-
ipants and four methods of evaluation in a high fidelity, simulated disaster site. The
focus of this study was to determine if humans interacting in close proximity with
non-anthropomorphic robots would view interactions as more positive and calming
when the robots were operated in an emotive mode versus a standard, non-emotive
mode. Section 6 summarizes three categories of recommendations for designing and
executing large-scale, complex, controlled human studies in HRI using appropriate
samples sizes with three or more methods of evaluation, and discusses how improv-
ing experimental design can benefit the HRI community. Portions of this chapter
come from two previous articles published by the lead author [5, 6].

2 Survey of Human Studies for HRI

This section summarizes a representative set of previous human studies conducted in
HRI that employ at least one of the various methods of evaluation; in some cases
more than one method of evaluation was utilized in the studies. One issue observed
with these studies is that sample sizes are often relatively small, and therefore may
not have been representative of the population being investigated, which may have
influenced the results. There needs to be an appropriate sample size used in order to
have sufficient statistical power to make claims and generalize the results obtained
from the data collected [6]. In some cases with smaller sample sizes statistical signif-
icance appears to be achieved; however the results then only apply to that particular
population and do not generalize to a larger or different population.
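As a concrete illustration of how an appropriate sample size can be estimated in advance, the short Python sketch below performs an a priori power analysis for a two-group comparison; the chosen effect size, alpha and power values are assumptions for the example rather than recommendations.

# A priori power analysis for a between-subjects two-group design, assuming a
# medium effect (Cohen's d = 0.5), alpha = 0.05 and a desired power of 0.80.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.8, alternative='two-sided')
print(f"Participants needed per group: {n_per_group:.0f}")  # roughly 64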

The most commonly used method of evaluation observed in HRI studies has been
self-assessments. In general, most of the studies in HRI include some form of ques-
tionnaire or survey; however in some cases, the researchers add other methods of
assessment such as video observations and coding (although the results are often not
presented for this data), psychophysiology measurements, and/or task performance
measures. A thorough evaluation of three years of human-robot interaction papers
was conducted by Baxter et al. with recommendations provided by the authors for
areas of improvement when conducting human studies in HRI. Their first recom-
mendation was to not only have larger sample sizes, but to ensure that these samples
are consistent with the domain and application area being explored (e.g., using chil-
dren for child HRI, using elderly when investigating the impact of robots in nursing
homes) [2]. It is important to not conduct studies using a convenience sample of
college-aged students especially when investigating topics that are not relevant to
that population. Their second recommendation is to take the experiments out of the
controlled laboratory setting and move them into the real world, or “in the wild”, in order
to establish ecological validity [2]. The third trend in HRI studies is that studies
typically last hours and at most a few days. It is important, when determin-
ing the impact of robots, that longer-term interactions are observed and evaluated to
have a better understanding of how people respond to robots once the novelty effect
wears off [2, 17]. Another significant issue is how statistics are reported. Baxter
et al. discuss the use of p-values as the standard for statistically significant results.
It is important to also include the actual statistics, descriptive statistics, and also
include the effect size calculations for all statistically significant results with a scale
for the interpretation of the effect sizes [2, 6]. This information provides details on
how much of an impact the statistically significant results may have. There is typi-
cally a test for effect size for most statistical tests and it may require some manual
calculations, but should always be included with all statistically significant results
reported. The final point made in the paper by Baxter et al. is that studies should be
designed and reported for future replication by the HRI community. Sufficient detail
needs to be provided so that others can replicate the procedures of any HRI
study, with the expectation that the results would be consistent. This is rarely
done and needs to be more carefully considered.

2.1 HRI Study Examples

Dautenhahn et al.: One of the more comprehensive studies performed by Dauten-
hahn et al. utilized self-assessments, unstructured interviews, and video observations
from one camera angle [18]. The study included 39 participants from a conference
venue and 15 participants that were involved in a follow-up study in a controlled lab-
oratory environment. In this study, the researchers were able to obtain statistically
significant results [6].
Moshkina and Arkin: Another study that incorporated multiple methods of eval-
uation was performed by Moshkina and Arkin [37] in which they used

self-assessment measures including a measure commonly used in psychology stud-
ies called the Positive and Negative Affect Schedule [52]. Video observation and
coding were performed, though results were not presented. Their study included 20
participants in a controlled, laboratory setting. The study results were mixed and
may have been more conclusive had a larger sample size been used [6].
Mutlu et al.: A study conducted by Mutlu et al. used both task performance mea-
sures and self-assessments [39]. The sample size for this study was 20 participants
and the results were mixed. One hypothesis showed statistically significant results;
however other items of interest were not statistically significant. The results may have
been different had a larger participant pool been utilized. The use of larger sample
sizes makes it possible for smaller effects to be discovered with significance [6].
Kulić and Croft: One of the larger studies in HRI to use psychophysiology
measurements along with self-assessments was conducted by Kulić and Croft with
a sample size of 36 participants [34]. Multiple psychophysiological signals were
measured which is highly recommended for this type of study for reliability and
validity in the results [3, 7, 29, 32, 36, 43, 45]. As a result of having a larger
participant pool than previous studies, they found statistically significant results and
were able to discover the best psychophysiology measures to determine valence and
arousal responses from participants. The results may have been even more prominent
with a larger sample size [6]. A major issue is the use of physiological measures
for determining valence. These signals are intended to measure arousal, and are
not intended to determine specific emotions or valence of a participant. There is a
growing trend to use these measures for determining specific emotions through the
use of machine learning, and this is not what these measures are intended for or
accurately measure. Facial coding could be used to determine the actual emotional
expression of the participant, but it is really not possible to identify specific emotions
with any accuracy through the use of physiological measures alone (e.g., heart rate,
respiration, skin conductance response, etc.) [7, 9].
Mutlu et al.: They conducted a study using two groups, with the first having 24
participants and the second having 26 participants for a total sample size of 50 [40].
The study relied heavily on the use of self-assessments developed from their previous
studies and adapted from psychology. The study found that several of the human-
human interaction scales were not useful in human-robot interaction activities. The
results may have been different had a larger sample size been utilized, which was
also mentioned in their conclusions.
Fisicaro et al.: More recently, Fisicaro et al. [23] performed a preliminary
evaluation of plush social robots in the context of engaging people with neuro-
developmental disorders. Eleven participants with medium-low intellectual functioning
listened to a story told by a human speaker one week and told by the ELE plush
elephant robot the next week. Behavioral measures of the time looking at the speaker
and the number of times the participant shifted focus away from the speaker indicated
that the ELE robot speaker was more engaging than the human speaker. Though this
study was conducted on a small sample, it illustrates that a larger study of this type
of interaction should be investigated in the future. It is still quite common to use

smaller sample sizes, which limits the impact and generalizability of the results of
the research.
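Behavioral measures such as looking time and the number of attention shifts are typically derived from coded video annotations; the sketch below assumes annotations are available as chronologically ordered (start, end, label) tuples, which is an illustrative data format rather than the coding scheme used by Fisicaro et al.

def gaze_metrics(annotations, target="speaker"):
    # Total time spent looking at the target and number of shifts away from it,
    # computed from time-stamped video-coding segments (start, end, label).
    looking_time = sum(end - start for start, end, label in annotations if label == target)
    shifts_away = 0
    previous_label = None
    for _, _, label in annotations:
        if previous_label == target and label != target:
            shifts_away += 1
        previous_label = label
    return looking_time, shifts_away

# Hypothetical coded segments (in seconds) for one participant.
coded = [(0, 12, "speaker"), (12, 15, "elsewhere"), (15, 40, "speaker"),
         (40, 43, "elsewhere"), (43, 60, "speaker")]
print(gaze_metrics(coded))  # prints (54, 2)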
Vogt et al.: [51] examined the effects of a social robot added to a tablet computer
application designed to teach new English vocabulary words to young children. In
total 194 children participated in the study, which involved seven sessions taking
place over a three-week time period. Depending upon condition, participants expe-
rienced vocabulary lessons administered by the tablet application alone, the tablet
application and a robot using iconic gestures, the tablet application and a robot not
using iconic gestures, or were part of a control group that danced each week rather
than learning English vocabulary. Performance was measured using two translation
tasks and a comprehension task within two days of the final session and again within
two weeks of the final session. Statistical comparisons between groups indicated
vocabulary gains for all groups (except the control condition) and found that the
presence of a social robot did not significantly affect the level of vocabulary gains
achieved by participants. This is a good example of using a larger sample size with
appropriate participants. It is also good that the study was conducted over multiple
sessions versus a single interaction. This reduces the impact of any potential novelty
effects. In this case, having a larger sample likely would not provide any additional
knowledge and indicates the importance of having an appropriate sample size for the
research.
Conti et al.: [16] recruited fifty-two students from the same kindergarten and
visited the children’s school for three sessions throughout a three-week period. The
sessions were held in a classroom with their teacher to ensure the children felt com-
fortable.
During the first session, children were asked to design a robot and given a blank
piece of paper, colored pencils, and encouragement to discuss robots with their
peers. Once the drawing was completed, each child participated in an interview
and answered open-ended questions about the characteristics he or she attributed to
the robot. The second session consisted of interacting with the Softbank Robotics
humanoid robot, Nao. In the study, the robot introduced himself, danced and played
music, and told the children a story. In the third and final session, children were again
asked to design and draw a robot, then completed the same interview process as the
initial session.
After analyzing the children’s responses and drawings, the researchers found
that after meeting the robot, children included more colors in their designs. They
observed a ten percent decrease in reports of aggressive behaviours performed by
the robot when asked “What is the robot doing?” and an overall positive reaction
and decrease of distress regarding meeting a robot. The researchers emphasized
that an early introduction to robots can facilitate positive perceptions of robots and
technology. Given that this is a vulnerable population, this study represents a relatively
large sample size. The study might have had more impact with more participants; however,
it demonstrates the benefit of larger samples and repeated measures over time
versus a single interaction.
Chien et al.: [13] recruited twenty-four younger (20–24 years) and twenty-four
older (59–86 years) participants to compare implicit and explicit attitudes towards

assistive robots. The study measured explicit attitudes using self-report question-
naires and implicit attitudes via an implicit association test (IAT) [27].
Prior to interacting with a Zenbo home robot, participants completed self-report
surveys that included: Negative Attitudes toward Robots Scale (NARS) [41], Tech-
nology Acceptance Model (TAM) [19], and Subjective Technology Adaptivity Inven-
tory (STAI) [31]. After interacting with the robot, participants completed these measures again, along with an implicit association test (IAT) that examined pairing the concept of robot or human with negative or positive words. IATs measure the ability
of and time taken for a participant to correctly categorize a stimulus that appears
on a computer screen. Shorter categorization times indicate a stronger association
between the two concepts. By asking participants to respond as quickly as possi-
ble and balancing categorization conditions, social desirability biases in participant
responses can be avoided.
The results indicated that although both younger and older participants had similar
increases in explicit measures of positive attitudes towards assistive robots after
interacting with a robot, on average older participants had more implicit negative
associations towards assistive robots than younger participants. The sample size is larger than in many HRI studies, but the study may have benefited from an even larger sample.

2.2 Summary

It is clear from previous studies conducted to date in HRI that standards need to be
established for conducting reliable and quality studies where methods of measure-
ment can be validated for use by the HRI community. It is essential to use three or
more methods of evaluation to establish study validity. Additionally, it is important
to determine the appropriate sample size necessary to obtain statistically significant
results. This can be accomplished through careful planning and the use of study design techniques that are the state of practice in the psychology and social science communities.

3 Terminology and Related Information for Conducting Human Studies in HRI

This section contains terminology and background information needed for planning,
designing, and conducting human studies in HRI. The information presented will
provide a general understanding of research methods and statistical terminology to
form a very basic foundation. For a more in-depth understanding it is recommended
that readers refer to research methods and/or statistical textbooks (e.g., [21, 26, 28,
35]).
Alpha level: the probability of making a Type I error, which occurs when the null hypothesis is rejected when it is in fact true.
Between-subjects design: participants are placed in different groups with each group
experiencing different experimental conditions.
Confound: any extraneous variable that covaries with the independent variable and
might provide another explanation for findings discovered in the study.
Contingency plans: plans that are developed for cases of failures or unexpected
events that were not part of the original design or plan (robot failures, equipment
problems, participants not showing up, etc.).
Control condition: one of the groups in an experimental design that does not receive
the experimental condition being evaluated in a study.
Counterbalance: a procedure used in within-subjects designs that changes the order
variables are presented to control for sequence effects.
Dependent variable: the behavior that is evaluated as the outcome of an experiment.
Effect size: the amount of variance in the dependent variable that can be explained by
the independent variable. The amount of influence one variable can have on another
variable. (Additional information and an example follow this terminology list)
Experimental condition: the group(s) in an experimental design that receive the
experimental condition being evaluated in a study.
Independent variable: the variable(s) manipulated by the experimenter and of primary interest in the study.
Interaction: in a mixed-model factorial design, an interaction occurs when the effect
of one independent variable that is manipulated depends on the level of a different
independent variable.
Main effect: in a mixed-model factorial design, whether or not there is a statistically significant difference between the levels of a single independent variable.
Mixed-model factorial design: this type of design includes both between-subjects
and within-subjects design components.
Objectivity: this occurs when observations can be verified by multiple observers
with a high level of inter-rater reliability.
Power: the probability that the null hypothesis will be rejected when it is false. It is
impacted by alpha level, effect size, and sample size.
Reliability: the consistency in obtaining the same results from the same test, instru-
ment, or procedure.
Sample: a portion or subset of a population.
Type I error: occurs when the null hypothesis is rejected when it is true.
Type II error: failure to reject the null hypothesis when it is false. It occurs when
there is failure to find a statistically significant effect when it does exist.
Validity: a method of evaluation (test, instrument, procedure) that measures what it
claims to measure.
Within-subjects design: each participant is exposed to all levels of the independent
variable(s).
Effect size information.


The formula for calculating Cohen’s d effect (for statistically significant t-tests) is [14]:

$$d = \frac{M_1 - M_2}{\sigma}$$

where, d is the effect size index for t-tests of means in standard units, M1 and M2 are the population means expressed in raw (original measurement) units, and σ is the standard deviation of either population (since they are assumed equal).
The scale used to interpret Cohen’s d effect is:
• 0.00–0.19 negligible effect
• 0.20–0.49 small effect
• 0.50–0.79 medium effect
• 0.80 + large effect
The formula for calculating the Phi ϕ effect (for statistically significant Chi-Square χ² tests) is:

$$\varphi = \sqrt{\frac{\chi^2}{n}}$$

where, χ² is the statistically significant result of a Chi-square test and n = the total number of observations.
The scale used to interpret a ϕ effect is:
• 0.00–0.09 negligible effect
• 0.10–0.29 small effect
• 0.30–0.49 medium effect
• 0.50 + large effect
The formula for calculating Cohen’s fˆ effect (for statistically significant F tests) is [14]:

$$\hat{f} = \sqrt{df\left(\frac{F}{N}\right)}$$

where, df = the degrees of freedom for the numerator (number of groups - 1), F = the statistically significant result of an F-test (e.g., from an ANOVA statistical test), and N = the total sample size.
The scale used to interpret Cohen’s fˆ effect is:
• 0.00–0.09 negligible effect
• 0.10–0.24 small effect
• 0.25–0.39 medium effect
• 0.40 + large effect
The following is an example of an effect size calculation using a significant main
effect for arousal from the exemplar study:
The following F-test result is used to calculate the effect size: F(1,127) = 12.05.

$$\hat{f} = \sqrt{1\left(\frac{12.05}{127}\right)} = 0.31$$
Based on Cohen’s scale, this is a medium effect.
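For readers who prefer to compute these effect sizes programmatically rather than by hand, the following Python sketch implements the three formulas above; the function names are illustrative, and only the worked F-test example from this section is reproduced.

import math

def cohens_d(mean_1, mean_2, sd):
    """Cohen's d for a significant t-test: the difference of means in standard units."""
    return (mean_1 - mean_2) / sd

def phi_effect(chi_square, n_observations):
    """Phi effect size for a significant Chi-square test with n observations."""
    return math.sqrt(chi_square / n_observations)

def cohens_f_hat(df_numerator, f_value, n):
    """Cohen's f-hat effect size for a significant F-test (e.g., from an ANOVA)."""
    return math.sqrt(df_numerator * (f_value / n))

# Worked example from the text: F(1, 127) = 12.05 yields a medium effect.
print(round(cohens_f_hat(1, 12.05, 127), 2))  # 0.31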
4 Planning and Experimental Design

A successful human study in HRI requires careful planning and design. There are
many factors that need to be considered (see Fig. 1). When planning and designing
a human study in HRI the following questions should be considered:

• What type of study will be conducted (within-subjects, between-subjects, mixed-model, etc.)?
• How many groups will be in the study?
• How many participants per group will be required?
• What methods of evaluation will be used (self-assessments, behavioral measures,
interviews, psychophysiological measures, task performance, etc.)?
• What type of environment and space is required to conduct the study (field, labo-
ratory, virtual, etc.)?
• What type of robots will be used?
• How many different types of robots will be used?
• What type of equipment will be needed (computers, measurement equipment,
recording devices, etc.)?
• How will contingencies and failures be handled?
• What types of tasks will the participants perform or observe (Study Protocol)?
• How will participants be recruited for the study?
• What type of assistance is needed to conduct the study?

4.1 Type of Study and Number of Groups

The first step in designing a human study in HRI is to determine the research ques-
tion(s) and hypotheses. From the research questions the researcher can determine how
many groups are needed and whether the study design should be a within-subjects,
between-subjects, or a mixed-model factorial approach.
Within-Subjects Design: The most comprehensive approach is the within-
subjects design in which every participant experiences all of the experimental condi-
tions being investigated. The within-subjects design requires less time to conduct the
study because all experimental conditions can be performed in one session and there
is no waiting to run different participants through different experimental conditions
as observed in between-subjects designs. Within-subjects designs require fewer participants, keep variables constant across the different experimental conditions, increase statistical power, and reduce error variance. However, within-subjects
designs are prone to confounds from demand characteristics, in which participants
make inferences about the purpose of the study in an effort to cooperate. Participants
may also experience a concept known as habituation or practice effect in which their
responses are reduced due to repetitive presentation of the same tasks, or because the robot performs in a similar manner each time, reducing the novelty effect of the robot over time. Participants are also likely to be impacted by side effects of different events that can occur during a study (e.g., robots breaking or behaving in ways that were not anticipated) [47].

Fig. 1 The chronology of items required for planning, designing, and executing human studies in HRI
Between-Subjects Design: In a between-subjects design, participants experience
only one of the experimental conditions. The number of experimental groups depends
on the number of experimental conditions being investigated [35].
A common reason for using a between-subjects design is that the participants themselves may differ dramatically, as in experiments that evaluate human-robot interactions for typically developing children versus children diagnosed with autism, or for male versus female participants. In these situations, the participants can be classified into only one of the groups. The results between the groups
are then compared. Between-subject designs are typically cleaner because partici-
pants are exposed to only one experimental condition and typically do not experience
practice effects or learn from other task conditions. The time to run the participant
through one condition is less than in a within-subjects design where the participants
experience all experimental conditions. The between-subjects design reduces con-
founds such as fatigue and frustration from repeated interactions with the robot. A
limitation of the between-subject design is that results are compared between the
groups, which can result in substantial impacts from individual differences between
the participants of the different groups. Therefore, it makes it more difficult to detect
differences in responses to the robots and Type II errors can be more prevalent [35].
Mixed-Model Factorial Design: A mixed-model factorial design uses both
between-subjects and within-subjects designs. This can be useful when there are one
or more independent variables being investigated with a between-subjects design and
the other variables are explored through a within-subjects approach in the same study.
In this type of design, the variables being investigated have different levels, such as
there may be two types of robots used in the interactions. This design allows the investigator to explore whether there are significant effects from each individual variable, and it also allows for the exploration of the interaction effects between two or more
independent variables [35]. The limitations previously mentioned for within-subjects
and between-subjects designs can apply in the mixed-model factorial design as well
and must be considered. Section 5.1 provides an example of a mixed-model factorial
design.

4.2 Determining Sample Size

Determining the appropriate sample size is often a challenge in human studies in HRI, though it is important to do so during the planning phase of a study. An
a priori power analysis is a statistical calculation that can be performed to determine
the appropriate number of participants needed to obtain accurate and reliable results
based on the number of groups in the study, the alpha level (typically α = 0.05),
the expected or calculated effect size, and a certain level of statistical power (com-
monly 80%). There are power analysis tables in the appendices of most statistical
textbooks (e.g., refer to Appendix C in [50]) that will provide group size values.
Additionally, there is software available online that will assist with this type of calcu-
lation (e.g., G*Power3.1—http://www.psycho.uni-duesseldorf.de/abteilungen/aap/
gpower3/). Section 5.2 presents an example for calculating sample size using statisti-
cal tables and the G*Power3.1 software. When calculating the sample size, if you do
not know what your effect size is expected to be based on results from prior studies,
then a common approach is to use a medium effect size for the power analysis (in
the example case that was 0.25).
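As an alternative to power tables or G*Power, the same kind of a priori calculation can be sketched in Python with the statsmodels library. The example below is only a sketch: it assumes a simple between-groups F-test with a medium effect (f = 0.25), α = 0.05, power = 0.80, and two groups, and it approximates rather than reproduces the repeated-measures options that G*Power offers.

# A minimal a priori power analysis sketch (assumes the statsmodels package is installed);
# it approximates the G*Power result for a between-groups F-test.
from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()
total_n = analysis.solve_power(effect_size=0.25, alpha=0.05, power=0.80, k_groups=2)
print(round(total_n))  # roughly 128 participants in total, i.e., about 64 per group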

4.3 Methods of Evaluation

There are five primary methods of evaluation used in human studies in HRI: (1) self-
assessment, (2) observational or behavioral measures, (3) psychophysiology mea-
surements, (4) interviews, and (5) task performance metrics. Each of these methods
has advantages and disadvantages; however most problems can be overcome with
the use of three or more appropriate methods of evaluation. For results to be valid and
reliable, use at least three different credible forms of evaluation in a process known
as triangulation to obtain convergent validity [3, 7, 9, 32, 35, 44]. Due to the fact
that no one method of evaluation is without problems, researchers should not rely
solely on the results of one source of evaluation. The use of only two sources of evaluation may produce conflicting results; with three or more methods, it is expected that at least two of them will support each other, adding validity and reliability to the findings. If three or more methods of evaluation are in conflict, reconsider the research question(s) and possibly restructure them.
When planning a human study, it is important to use credible measures that are
appropriate for the study and for the participants being investigated. For example, if
the researcher is evaluating anxiety and stress levels of participants interacting with
a robot, then they may want to use validated self-assessments, video observations,
interviews and/or psychophysiology measures such as respiration rate, electrocardio-
graphy (EKG/ECG), and skin conductance response. Each of these measures would
be appropriate for this type of study and these are credible measures for assessing
participants’ levels of stress and anxiety. The use of multiple methods of evaluation
that are not credible or appropriate will result in data that is not meaningful to answer
the hypotheses and research questions.
Self-assessments are among the most commonly used methods of evaluation in
HRI studies; however obtaining validated assessments designed for HRI studies can
be a challenge. There are several surveys available in the HRI community but very
few have been validated. There are surveys available in the psychology literature
that are validated that may be useful, but if they are modified then they need to be
validated again. Self-assessment measures include paper or computer-based psycho-
metric scales, questionnaires, or surveys. With this method, participants provide a
personal assessment of how they felt or their motivations related to an object, situ-
ation, or interactions. Self-assessments can provide valuable information but there
are often problems with validity and corroboration. Participants may not answer the
questions based on how they are feeling but rather respond based on how they feel
others would answer the questions or in a way they think the researcher wants them
answered. Another issue with self-assessment measures is that observers are unable
to corroborate the information provided by participants immediately and directly
[21]. Participants may not be in touch with what they are feeling about the object,
situation, and/or interaction, and therefore may not report their true feelings. Also,
the responses to self-assessments and other measures could be influenced by par-
ticipants’ mood and state of mind on the day of the study [21, 30]. Another issue
with self-assessments is that participants will complete them after performing tasks
and may not recall exactly how they felt during the actual interaction. For these rea-
sons, it is important to perform additional types of measurements such as behavioral,
interviews, task performance and/or psychophysiological measures to add another
dimension of understanding of participants’ responses and performance in HRI stud-
ies [3, 9].
Behavioral measures are the second most common method of evaluation in HRI
studies, and sometimes are included along with psychophysiological evaluations and
participants’ self-assessment responses for obtaining convergent validity. Johnson
and Christensen define observation as “the watching of behavioral patterns of people
in certain situations to obtain information about the phenomenon of interest” [30].
The “Hawthorne effect” is a concern with observational as well as self-assessment
studies. It is a phenomenon in which participants know that they are being observed,
and it impacts their behaviors [21, 30]. For this reason, psychophysiological measures
can assist with obtaining an understanding of participants’ underlying responses at
the time of the observations. The benefit of behavioral measures is that researchers
are able to record the actual behaviors of participants and do not need to rely on
participants to report accurately their intended behaviors or preferences [3, 9, 21].
Video observations of human-robot interactions are often recorded and later coded
for visual and/or auditory information using two or more independent raters [11].
Interpreting audio and video data does require training to provide valid, accurate, and
reliable results. There are multiple approaches to interpreting this type of data, which
is beyond the scope of this chapter (refer to [25, 35, 44, 47, 48]). Behavioral measures
and observations are often collected, but the data is rarely analyzed and published. Analysis is a tedious and time-consuming process that requires significant personnel effort to complete. When it is analyzed, the resulting data is usually very rich and more informative than self-assessment data alone. Unfortunately, many researchers never end up processing this data.
Psychophysiology measures are gaining popularity in HRI studies. The primary
advantage for using psychophysiological measurements is that participants cannot
consciously manipulate the activities of their autonomic nervous system [29, 32, 34,
36, 43, 45]. Also, psychophysiological measures offer a minimally-invasive method
that are used to determine the stress levels and arousal responses of participants inter-
acting with technology [29, 34, 36, 43, 45]. Psychophysiological measurements can
complicate the process because the results may not be straightforward and confounds
can lead to misinterpretation of data. There is a tendency to attribute more meaning
to results due to the tangible nature of the signals. Information needs to be obtained
from participants prior to beginning a study to reduce these confounds (e.g., health
information, state of mind, etc.). Multiple physiological signals should be used to
obtain correlations in the results [3, 7, 9]. An issue of concern that is becoming more
frequent is the use of data mining and machine learning techniques with physiological signals, with researchers claiming that they can use these signals to identify specific emotions. Physiological signals such as heart rate and respiration rate measure levels of arousal; they do not indicate the valence aspects of emotion. Results in the literature are often inconclusive precisely because this is not what these signals are intended to measure. They will provide only the level of arousal a participant is experiencing at the time of the interactions. Another issue is that these signals reflect only what a person feels physiologically; if the interaction does not provoke significant arousal, then the signals will not provide any definitive insights. These are signals that cannot be manipulated consciously.
Interviews, which are closely related to self-assessments, are another method of evaluation. Interviews can be structured, in which the researcher develops a series of questions that can be closed-ended or open-ended; however, the same questions are given in the same order to every participant in the study. The interview can be
audio and/or video recorded for evaluation at a later time. Unstructured interviews
are used less frequently in research studies. In unstructured interviews, the questions are changed and developed based on participants’ responses to previously presented
questions. It is an adaptive process and more difficult to have consistency and to
evaluate in research studies. Interviews often provide additional information and
details that may not be gathered through self-assessments; however there are numer-
ous issues that may arise from using interviews. Response style of participants can
influence responses to interview questions. There are three types of response styles,
(1) response acquiescence—participants answer in the affirmative or yea-saying,
(2) response deviation—participants answer in the negative or nay-saying, and (3)
social desirability—participants provide what they perceive as socially acceptable
responses. It can be a challenge to obtain responses that are reflective of participants’
true behaviors and feelings [21]. Another issue related to interviews is that participants who volunteer for the research study may not answer interview questions in a manner consistent with those who are not volunteers. Some of these
challenges can be overcome by using other methods of evaluation to obtain conver-
gent validity among different measures. The data from interviews, like behavioral measures, are often difficult, tedious, and time-consuming to analyze. Even though interview data is collected, it is often not analyzed. Analysis also requires training in transcrib-
ing, coding, and analyzing the data. If the time is taken to perform the analysis, then
the result is often rich data that gives additional meaning to the research questions.
When performing research with children, the interview is helpful in obtaining accu-
rate responses from them. Young children are not able to respond to survey questions
that are rated on a scale. They are capable of responding to yes/no questions and will
provide additional knowledge and information through interviews.
Task performance metrics are becoming more common in HRI studies, espe-
cially in studies where teams are evaluated and/or more than one person is interacting
with one or more robots [11, 20, 39, 42, 49]. These metrics are designed to measure
how well a person or team performs or completes a task or tasks. This is essen-
tial in some HRI studies and should be included with other methods of evaluation
such as behavioral and/or self-assessments. Task performance measures are useful in
determining if technology is helping to improve the performance of a specific task.
It typically measures aspects like time of completion for a task and/or the number
of errors or problems encountered. It is an excellent method for evaluating how well
teams of people work together and how they work with technology. Task performance
metrics are considered objective measures that have concrete values associated with
them.
No single method of measurement is sufficient to evaluate any interaction; there-
fore it is important to include three or more methods of evaluation in a comprehensive
study to gain a better understanding of Human-Robot Interaction. Within a single
method of evaluation there should be multiple measures used. For example, in self-
assessments, more than one credible assessment should be used for validity purposes.
In behavioral studies, obtain observations from more than one angle or perspective.
For psychophysiological studies use more than one signal to obtain validity and
correlation. Measure task performance in more than one way. This ensures a com-
prehensive study with reliable and accurate results that can be validated. Compare
the findings from three or more of these measures to determine if there is meaningful
support in at least two or more of the evaluations as it relates to the research ques-
tion(s). Normalize the means for each type of measurement to a meaningful, common
scale and then perform a separate analysis on this data. Conduct a correlation analysis
to determine if there are positive or negative correlations in the results. Essentially it
is important to interpret the meaning of the results discovered and determine if there
are commonalities in the results of two or more of the methods of evaluation.
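One way to carry out the normalization and correlation steps described above is sketched below in Python; the per-participant scores and variable names are invented for illustration and are not from the exemplar study.

# A minimal sketch of normalizing summary scores from three methods of evaluation
# to a common (standardized) scale and checking whether they converge.
import numpy as np
from scipy.stats import spearmanr, zscore

# Hypothetical per-participant summaries (one value per participant per measure).
self_report_arousal = np.array([3, 4, 2, 5, 4, 3, 5, 2])
mean_heart_rate = np.array([72, 80, 65, 88, 78, 70, 90, 66])
observed_restlessness = np.array([1, 3, 1, 4, 3, 2, 4, 1])

# Normalize each measure to z-scores so they share a common scale.
standardized = np.column_stack([zscore(self_report_arousal),
                                zscore(mean_heart_rate),
                                zscore(observed_restlessness)])

# Pairwise Spearman correlations indicate positive or negative agreement between measures.
rho, p_values = spearmanr(standardized)
print(np.round(rho, 2))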

4.4 Study Location and Environment

A major factor to consider when planning any study is where the study will be
conducted: in the field, laboratory, virtual environment, or online. For a successful
study, the environment should reflect realistically the application domain and the
situations that would likely be encountered so that participants respond in a natural
manner. In some cases, it is just not practical or possible to place participants in
the exact situation that is being investigated, so it is important to closely simulate
that situation and/or environment. It is important to consider lighting conditions,
temperature, and design the environment to appear as close as possible to the actual
setting by including props and/or sound effects. Use draping or other means of
privacy to ensure the integrity of the site is preserved prior to the start of the study.
In psychophysiology studies, if skin conductance measures are used, it is extremely
important that the temperature is controlled in the study environment [3, 7].
In the example study, provided in Sect. 5, placing participants in an actual search
and rescue site of a building collapse was not practical or possible. Therefore a simulated disaster site was created that contained rubble, was dark, and the temperature
was kept on the cooler side. Participants were placed in a confined space similar to
what might be experienced in an actual disaster situation.
4.5 Type and Number of Robots

Another consideration when designing a human study in HRI is the selection of robots
for the study. The selection of a robot needs to be congruous with the application
and domain being investigated. It is important to select robots that have been used or
would be expected in the type of tasks/applications being examined in the research
study. Sometimes you may not have access to a particular robot, but it is important
to try and get a robot that would work well in the environment and situation being
investigated. For example, you would not likely put a Nao humanoid robot, which is small and has difficulty with mobility, into a search and rescue site.
The use of more than one type of robot provides a mechanism to detect if the
concepts being investigated can be generalized or if they are specific to a particular
robot. The results are more meaningful if they can be extended to more than one
specific robot. This is often difficult to do with the cost of robots; however it does
add another dimension to the study and increases the usefulness to the HRI and
robotics communities.

4.6 Other Equipment

Determining what equipment will be used in an HRI study impacts the success and
results of the study. Whenever possible, equipment choices should be redundant.
Unfortunately, equipment failures are more common than you want to believe, and it is important to make sure that there are contingency plans in place in case of failure.
Video Observation Studies: When performing video observation or behavioral
studies, the first step is to determine the number of different perspectives to be
recorded for the study. It is important to obtain multiple viewing angles because each perspective may contain unique information, provides a more comprehensive record of the events that occur in a study, and gives you some redundancy in case there is a problem with a particular recording. It is important that cameras are synchronized
and extra batteries and SD cards/tapes are readily available. There should be at least one or two extra cameras on hand to swap out in case of equipment failure. It is also advisable not to reuse tapes if at all possible because
it can impact the integrity of the recordings. This is less of a problem with current
video cameras as they typically now use SD cards for recordings. To preserve data
and prevent mishaps, it is important to off-load recordings quickly to a more stable
media. It is recommended that all data be backed up to multiple locations and media.
Psychophysiological Studies: For psychophysiological studies, it is necessary
to determine if the equipment needs to be connected to a stationary computer or if
the participant will be mobile. Historically, there were limited options for ambulatory psychophysiology equipment that allows the participant to be mobile. Currently, there are more options available, but they are often expensive if you are using research-grade systems. Another option is to use smart watches, as some now offer recording capabilities and are becoming more accurate in their recordings. Whether that is an option depends on what is being measured. In a study I conducted recently, heart rate data
was compared between an Apple watch and a QardioCore system and the heart rates
were identical (https://www.getqardio.com/qardiocore-wearable-ecg-ekg-monitor-
iphone/). This was a very limited study, and a more formal and extensive study should
be conducted, but it is expected to be a less expensive option for recording heart rate
data.
It is recommended to keep on hand multiple sensors in case of failure, which
seems to be common due to the sensitive nature of this type of equipment. This
can make the difference between a productive, successful, and organized study and
one that produces stress, delays, and sometimes failure. Physiological equipment is
becoming more stable and reliable, but it is still prone to issues.

4.7 Failures and Contingencies

Even with careful planning, failures and problems are likely to occur when conducting
studies. It is imperative to plan for as many potential failures as can be anticipated.
Robots can fail, cameras can fail, computers and sensors can fail; therefore it is
important whenever possible to have redundancy in all necessary equipment. It needs
to be available immediately to prevent delays in the study. It is also recommended
that there be redundancy in personnel as well. Develop a call list for participants
and essential personnel who might be available on short notice to fill a timeslot
where a participant or assistant does not arrive as scheduled. It is common to expect approximately 20% of scheduled participants not to appear for their appointment. When calculating the number of participants required for a study, this number should be increased to take into account the likelihood that some participants will miss their appointment and to account for any possible data failures (for example, if 128 completed participants are needed, scheduling roughly 160 allows for a 20% no-show rate). Even with contingency
plans, problems will occur and it is important to be as prepared as possible to deal
with these problems to avoid delays and risks to your data collection.

4.8 Study Protocol

Another important phase of the planning and study design process is the development
of the study protocol. The protocol involves determining exactly how the study will
proceed from start to finish once a participant arrives. It is a detailed description of
instructions that will be provided to the participant, what assessments will be done
and in what order, what tasks the participant will perform, the timing of events,
recording of information, how the data and personal information will be handled,
and where this information will be stored for security purposes. This is necessary
for completing Institutional Review Board (IRB) or Ethics Committee paperwork
required for human studies and for determining risks and maintaining privacy of
participants in the United States and in many countries. The process may be different
depending on the country, but regardless of whether it is required in your country
or not, the preparation of a study protocol is good practice to ensure a high quality
study.
Trial runs of experiments should be conducted until the study can be executed
smoothly from start to finish. This is the only way to determine where problems can
and likely will occur. Systems and study designs do not always execute as expected
and until several trial runs of the protocol are performed there is no way to ensure that
all the problems are resolved for the process to run smoothly when data is actually
collected. Once the study begins, it is important that the study protocol is discussed with each participant as part of the instruction process, and that this information is also provided as part of the informed consent form participants will sign.

4.9 Methods of Recruiting Participants

Recruiting participants is a challenge that most human studies face in any field
including HRI. That may be a significant reason why many of the studies conducted
in HRI do not have large or appropriate sample sizes. It is important to recruit
participants who will appropriately represent the population being studied. If your research involves questions associated with children, then your study population should consist of children as participants. If you are performing research with first
responders then it is important to include them as your participants in the study. In
the case of special populations, it may be more difficult to get larger or appropriate
sample sizes, but every effort should be made to use the target population for the
research questions. Too often, the population selected is a convenience sample, typically college-age students, because that is what is most available to university researchers. It may be challenging, but it is important to the validity of the
research to use a population that represents the end user and evaluates the technology
being investigated. The novelty effect of robots in some cases is not enough to entice
participants to be involved in a study. There are several methods of recruitment
available, and they should all be implemented for a successful study. Flyers are a good
method of recruitment on campus with the added bonus of some type of incentive to
participate (e.g., door prizes, payment for participation, extra credit in courses). In
some cases, the psychology department may have research participation requirements
and a system for advertising research studies on campus. Establish relationships with
management of other participant pools or databases. These are excellent sources of
recruitment on college campuses; however, limiting participation to these sources will bias your participant pool. In many cases the population of interest extends beyond college-educated participants, and the results of studies using only these sources of participants will not generalize. Therefore, it is important to explore other methods
of recruitment such as word of mouth to family, friends, and acquaintances. It is
also possible to contact other resources for permission to solicit participants, such
as a local mall for testing the general public, and kindergarten through 12th grade
educational institutions for recruiting children (the use of children requires informed
consent from the parents and informed assent from the children to participate). These
methods are more involved, but can serve as rich sources of recruitment. You must
obtain written permission to recruit participants from these different populations. If
you work with first responders, you need to contact the agencies and request written
permission to use their employees as participants. For schools, it may be necessary to
present your study before the school board to obtain permission to use the students.
There are special considerations if you are working with military personnel. If you
are developing research for these populations, you need to use them to determine the
viability of the research.

4.10 Preparing Institutional Review Board or Ethics Committee Documents

The next step in planning, designing, and executing a human study is the prepara-
tion of the Institutional Review Board or Ethics Committee documentation (this is applicable to all human studies conducted in the United States; there may be similar requirements in other countries). The Institutional Review Board (IRB) or the Ethics
Board are committees at each university. The process is similar at most universities
and in other countries if they have this requirement. In some cases it may be more
extensive than what is described in this section. The committee members are from
diverse backgrounds and they review proposals for research conducted with human
subjects. The IRB or Ethics Boards were established to review the procedures of proposed research projects, to determine if there are any known or anticipated risks to participants in the study, to examine the methods of recruiting participants, and to verify how the participants’ confidentiality is maintained. A part of the IRB application is the informed
consent, or in the case of studies with children, parental informed consent/child
assent documents. This document is provided and explained to each participant prior
to being involved in any research study and includes the study protocol, informed
consent/assent, permissions to audio and/or video record the study, any risks or ben-
efits to the participants, and how confidentiality of the data will be maintained [47].
Also included in the informed consent is a statement that participants can terminate
participation in the study at any point without any penalty and they will still receive
any incentives provided in the study. It is important to include this type of language
in any informed consent form provided to participants for ethical reasons. The IRB
or Ethics Board reviews these documents and provides approval to proceed with the
study, requests revisions to the study documents, or can deny approval of the study.
Typically there is training required by the personnel involved with the study but the
requirements can vary by institution; therefore it is important to investigate all the
requirements of the institution(s) that will be involved in the study. Start this process
early because in some cases this process can take considerable time, especially if
your research involves vulnerable populations such as children, elderly, prisoners,
etc. If you are using a vulnerable population or your study has risks involved, then the
study will often have to be approved by the entire board and this can take a month or
more to complete that process, especially if changes are requested by the committee.
It is important to keep this in mind when planning your data collection.

4.11 Recruiting Participants

Once IRB or Ethics Board approval is received, the next step in the study process is the
actual recruitment of participants and ensuring they follow through once recruited. A
significant challenge in many human studies is participants’ attendance once they are
scheduled. Schedule researchers, assistants, and participants for mutually convenient
times. It is important to remind participants of their appointment time at least 24 hours in advance, but it may also be helpful to send a reminder an hour before the appointment. A participant is more likely to show up if they have a specific time. It is also recommended to allow adequate time between participants to account
for time delays in the study or in case a participant is running late. Even with planning,
problems occur and participants do not show up, but this time can be used to process
and backup data.
A helpful scheduling tool is software to make timeslots available. This is often
included in the software for participant pools, such as SONA or PRP that many
university Psychology departments use in the United States. There are also soft-
ware products like Calendly (https://calendly.com/) and YouCanBook.me (https://
youcanbook.me/). These allow the researcher to link to their personal calendar and
participants can sign up through the software that will place them on the researcher’s
calendar. This can be quite useful, and the software can be set up to send reminders. In the case of the calendly.com software, you can also include questions related to inclusion/exclusion criteria, which allows researchers to screen participants before they sign up for the study.

4.12 Conducting the Study

In most cases, running a successful human study requires assistance beyond the principal investigator. This is especially true when running a large-scale, com-
plex study with a significant sample size and three or more methods of evaluation.
Finding research assistants can be a challenge for some researchers, especially when
economic times are tough and there may not be funding available to pay for research
assistants. One option available is to contact the Honors College or Program if the
university or institution has this type of program. These students typically desire
research experience and often are willing to volunteer their time for the experience
and knowledge they may gain. Depending on the study, often students can easily be
trained to assist and do not necessarily need to be in the field of study. Psychology and
pre-medical students often need a certain amount of volunteer hours and assisting in
a research study can fulfill these requirements.
It is important to ensure the volunteers understand the need for reliability and
attention to detail. Whenever possible, schedule an additional person to be available
in case of emergencies or when plans do not proceed as expected. Volunteer research assistants will each typically provide between five and ten hours per week; therefore, it is necessary to consider their availability when designing the study and the timeline.
It is also advisable to schedule assistants for data processing as well as assisting with
conducting the actual experiments. These recommendations may not be applicable
to all institutions or studies.

5 An Exemplar Study

This section presents examples from a recent HRI study on determining study design,
sample size, methods of evaluation, study location, and how failures and contingen-
cies were handled. This was a large-scale, complex, controlled study involving 128
participants responding to two different search and rescue robots (Inuktun Extreme-
VGTV and iRobot Packbot Scout) operated in either the standard or emotive modes.
The experiment was conducted in the dark in a high-fidelity simulated disaster site,
using four different methods of evaluation. To date, this study remains one of the
most complex studies of this scale in the HRI community.

5.1 Type of Study and Number of Groups

This study was a mixed-model factorial design, in which the between-subjects factor
was the robot operating mode (standard versus emotive) and the within-subjects factor was the robot type, the Inuktun Extreme-VGTV versus the iRobot Packbot Scout (see Fig. 2). This design was selected because there were four conditions, which was
too many for a within-subjects design. We did not want to expose participants to
both the emotive and standard operating modes in the same study or they would
likely determine the purpose of the experiment. Participants were randomly assigned
to one of two groups (standard-operated or emotive-operated). Every participant
experienced both robots within their assigned group. The order in which the robots
appeared was counterbalanced (e.g., Inuktun viewed first or Packbot viewed first),
and operating mode assignments were balanced for age and gender.
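A simple way to generate this kind of counterbalanced assignment schedule is sketched below in Python; it is not the authors' actual procedure, and the additional balancing for age and gender is omitted for brevity.

# A minimal sketch of random assignment to a between-subjects condition with a
# counterbalanced within-subjects robot order, built from balanced blocks of four.
import itertools
import random

conditions = ["standard", "emotive"]            # between-subjects operating mode
robot_orders = [("Inuktun", "Packbot"),
                ("Packbot", "Inuktun")]         # counterbalanced presentation order

n_participants = 128
block = list(itertools.product(conditions, robot_orders))  # the 4 combinations

schedule = []
while len(schedule) < n_participants:
    random.shuffle(block)      # randomize order within each balanced block of 4
    schedule.extend(block)

for participant_id, (mode, order) in enumerate(schedule[:n_participants], start=1):
    print(participant_id, mode, order)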
Fig. 2 The Robots: Inuktun Extreme-VGTV (left) and iRobot Packbot Scout (right)

5.2 Determining Sample Size

An a priori power analysis was conducted for this study based on two groups, power of 0.80, a medium effect size of 0.25, and α = 0.05; the calculation resulted in two groups of 64 participants for a total of 128 participants, based on Table C.1 on page 384 of [50]. The same sample size was calculated using the G*Power3.1
software [22]. In this example, the test family was the F-test, the statistical test was the
MANOVA: repeated measures, within-between interaction, and the type of power
analysis was—A priori: compute required sample size-given α, power, and effect
size. The effect size was based on a medium effect using Cohen’s fˆ = 0.25 (results from prior data collections were used to determine the effect size), α = 0.05, power was
set at 0.80, the number of groups was 2 (standard versus emotive), and the number
of measurements was also 2 for the two robots (Inuktun versus Packbot) resulting
in a calculated sample size of 128 participants (see Fig. 3 with the input values
and sample size highlighted with red boxes). Based on the analysis of prior data, the effect sizes were small to medium depending on the analyses performed, and had there not been such a large sample size used for the study, some of the results may not
have been statistically significant. In the self-assessment data, statistically significant
results were obtained for the main effect of arousal and a three-way interaction was
significant for valence [8]. If there is existing data available, then the effect size can
be calculated using Cohen’s fˆ effect for any significant F-tests (refer to Sect. 3) and
used as input in the a priori power analysis [14].

5.3 Methods of Evaluation

This study utilized four methods of evaluation (self-assessments, video-recorded observations, psychophysiology measurements, and a structured audio-recorded interview) so that convergent validity could be obtained to determine the effectiveness of the use of non-facial and non-verbal affective expression for naturalistic social interaction in a simulated disaster application.

Fig. 3 G*Power3.1 example using the data for the exemplar study
Self-Assessments: Multiple self-assessments were used in this study. Some of the
assessments were adopted and/or modified from existing scales used in psychology,
the social sciences, and other HRI studies. The assessments were given to the partic-
ipants prior to any interactions and after each robot interaction. It is recommended
to conduct pilot studies of all the assessments to ensure that they are understandable and test exactly what was expected. In this study, some of the questions were
confusing to the participants and were not considered as part of the data analyses. It
is important to make note of the questions that participants found confusing and/or
required further explanation. This can be done in a participant log or as part of their
paperwork so that you are not relying on memory after the data collection.
In the case of one assessment, the Self-Assessment Manikin (SAM) [10], the valence and arousal questions were easily interpreted; however, the questions related to the dominance dimension were often misunderstood. That dimension was not
included as part of the data analyses. The questions associated with the dominance
dimension of the SAM assessment will need to be reworded and then validated;
however the valence and arousal portions of the SAM assessment have been validated
for future HRI studies and are available [3].
As part of the validation process for self-assessments it is important to ask each
question in at least two different but similar ways and then perform a statistical
test known as Cronbach’s alpha to determine tau-equivalent reliability or internal consistency for the items on the assessment. A Cronbach’s alpha value greater than 0.70 indicates acceptable reliability for the items evaluated. For more information on
Cronbach’s alpha consult [24] or a statistical textbook.
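For reference, Cronbach's alpha can be computed directly from an item-response matrix using the standard formula α = (k/(k - 1))(1 - Σ item variances / variance of the totals). The short Python sketch below uses an invented set of responses (rows are participants, columns are items) purely for illustration.

# A minimal sketch of computing Cronbach's alpha; the response matrix is illustrative only.
import numpy as np

def cronbach_alpha(item_responses):
    """Tau-equivalent reliability for items scored by the same participants."""
    items = np.asarray(item_responses, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_score_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_score_variance)

responses = np.array([[4, 5, 4],
                      [2, 3, 2],
                      [5, 5, 4],
                      [3, 3, 3],
                      [4, 4, 5]])
print(round(cronbach_alpha(responses), 2))  # values above 0.70 suggest acceptable reliability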
Another issue associated with self-assessment data is in the analysis of this data.
Likert scales and Semantic Differential scales do not produce continuous data. This type of data is sometimes analyzed using t-tests, but a more appropriate approach is to perform a Chi-square or a related test such as the Kruskal-Wallis test for non-parametric
data. Unfortunately, many researchers use an ANOVA test, which is not appropriate
given that the scales are not continuous. The issue with the Chi-square test is that
there needs to be at least five (5) items in each bin to run the test. This may be accomplished by combining categories, such as merging the strongly disagree and disagree categories and the agree and strongly agree categories of a Likert scale. For the purpose of this chapter, the Chi-square and Kruskal-Wallis rank tests will not be covered in detail, but most statistical software can be used to perform these types of analyses.
It is important when analyzing data that the appropriate statistical test is used. It is also important, when reporting those statistical tests, that the results from the actual tests are presented and, for statistically significant results, that the p-value and the effect size for the statistical test used are also reported. You should include a scale
for interpreting the effect size for the reader as a reference and state the level of effect
(e.g., small, medium, large).
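As an illustration of the non-parametric approach described above, the short Python sketch below compares Likert ratings from two hypothetical groups with a Kruskal-Wallis test using scipy; the ratings themselves are invented.

# A minimal sketch of a non-parametric comparison of Likert-scale ratings between
# two groups; report the test statistic, p-value, and an effect size when significant.
from scipy.stats import kruskal

standard_group_ratings = [2, 3, 3, 4, 2, 3, 4, 3]
emotive_group_ratings = [4, 5, 4, 3, 5, 4, 4, 5]

h_statistic, p_value = kruskal(standard_group_ratings, emotive_group_ratings)
print(f"H = {h_statistic:.2f}, p = {p_value:.3f}")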
Psychophysiology Measures: There were five different psychophysiological sig-
nals recorded as part of this study: (1) EKG, (2) Skin Conductance Response, (3)
Thoracic Respiration, (4) Abdominal Respiration, and (5) Blood Volume Pulse, using
the Thought Technology ProComp5 Infiniti system (http://www.thoughttechnology.
com/pro5.htm). Five signals were used to obtain reliable and accurate results. Cor-
relations were conducted between the different signals to determine the validity of
participants’ responses and there was support between the heart rate variability and
respiration rates. There was also support in the findings for heart rate variability,
respiration rates, and the self-assessment data to provide validity in the results of this
study.
Video-Recorded Observations: Videotaped observations were obtained from four different camera angles (face view—including the upper torso, overhead view,
participant view, and robot view) using night vision cameras and infrared illumina-
tors. When recording video observation data, synchronizing multiple cameras can
be a challenge. In the case of this study, the interactions were all conducted in the dark. Turning video cameras on before the lights were turned off and turning the
lights back on before shutting off the cameras made a good synchronizing point for
the multiple cameras. Another technique is to use a sound that all cameras can detect
through built-in microphones. A visual summary of this study can be viewed in a
video format in [4].
Structured Interviews: After the interactions were complete, each participant was interviewed in a structured interview format that was audio recorded. Participants
were required to read and sign IRB approved informed and video/audio recording
consent forms prior to participating in the study. They were given the option to deny
publication of their video and audio recordings and three participants elected to deny
publication of their recordings. It is important to note this denial in all files and related documents to protect those participants and to avoid accidentally using the materials. The
interviews can be transcribed in detail, then coded and analyzed in a quantitative
manner through categorizing the comments.

5.4 Study Location and Environment

The application domain for the study was Urban Search & Rescue, which required a
confined-space environment that simulated a collapsed building (see Fig. 4). Partic-
ipants were placed in a confined space box with a cloth cover to simulate a trapped
environment. Actual rubble was brought into the lab to give the look and feel of a
building collapse. The robots were all pre-programmed so that the movements would
be consistent and reproducible for all participants with the robots exhibiting either
standard or emotive behaviors. The medical assessment path, traveled by the robots,
was developed from video observations of experiments conducted by Riddle et al.
with emergency responders and medical personnel based on how they would operate
a robot to conduct a medical assessment of a trapped victim [38, 46]. Ideally, it would
have been better to conduct the study in a real disaster or even a training exercise; however, due to practicality and the requirements of the physiological measures, the study was conducted in a temperature-controlled environment.
Performing a large-scale, complex human study in HRI has many pitfalls and
rewards. Even with the most careful planning and study design it becomes apparent
through the course of the study that changes could be made to improve the study. An
example from the exemplar study was the design and development of the simulated
disaster site. It was high fidelity and based on real-world knowledge; however it would
have been more realistic had the confined space box been more confining. The box
was designed based on human factors standards for designing spaces to accommodate 95% of the population. In the case of this study, most of the participants had smaller body sizes than average, and the space was truly confining to only a small portion of the participants.

Fig. 4 Confined space simulated disaster site with the lights on

To increase the feeling of confinement, a blanket or rough
heavy plastic or fabric that would crinkle or make audible sounds should be utilized
in the future. Additionally, a soundtrack playing in the background with sounds from
an actual disaster or a training exercise would have improved the fidelity of the
site and the experiences of the participants. Even without these changes, the results were statistically significant; however, the impact and effect might have been greater if the environment had been more realistic.

5.5 Failures and Contingencies

In this study, the “no show” percentage was much lower than expected, at approximately 8%; however, equipment failures did occur. The importance of having contingencies for equipment cannot be stressed enough. This study experienced a one-week delay due to
the failure of an EKG sensor which was essential to the psychophysiology portion of
the study. Planning ahead and having extra sensors could have prevented delays and
the loss of participants who could not be rescheduled. Following that experience, extra sensors were ordered and kept on hand, and they were needed. Video cameras had
auto-focus problems that were not noticed until the video data was being off-loaded.
Also, one video camera was moved between the two different robots and the zoom was accidentally activated, making some of the robot-view video data unusable. It is
always important to double-check equipment settings and verify that all equipment is
working properly so that no data is lost or determined to be unusable. The primary
failure that ended the study and resulted in the cancellation of 18 participants was
the failure of the one robot for which there was no redundancy; however, the goal
of 128 participants was attained.

6 Conclusions

Planning, designing, and executing a human study for HRI can be challenging; how-
ever with careful planning many of these challenges can be overcome. There are two
main improvements that need to be made in human studies conducted in HRI and
those are (1) having larger sample sizes to appropriately represent the population
being studied, and so that small to medium effects can be determined with statisti-
cally significant results; and (2) the use of three or more methods of evaluation to
establish reliable and accurate results that will have convergent validity. From the
experiences gained in completing a large-scale, complex, controlled human study in
HRI, recommendations are presented that fall into three categories: (A) Experimen-
tal Design Recommendations, (B) Recommendations for Study Execution, and (C)
Other Recommendations.

6.1 Experimental Design Recommendations

These recommendations are presented to assist with the planning and design of
large-scale, complex human studies in HRI. They will assist researchers with the
development of a comprehensive experimental design that should provide successful
study results.

1. Determine the most appropriate type of study for the hypotheses being investigated
using a within-subjects, between-subjects, or mixed-model factorial design.
2. Perform an a priori power analysis to estimate the appropriate number of partic-
ipants required for the study in order to have a better opportunity of obtaining
statistically significant results that are valid, reliable, and accurate. This can be
accomplished through power analysis tables or available software (an illustrative
sketch follows at the end of this subsection).
3. Determine the best methods of evaluation for the hypotheses being investigated;
however, it is recommended to utilize three or more methods to obtain convergent
validity in the study.
4. Design a study environment that closely reflects the real-world setting being studied,
to elicit more natural participant responses. When conducting psychophysiology stud-
ies using skin conductance response, a temperature-controlled environment is
essential.
5. If the goal of the research is to generalize results to different robots, perform the
study with more than one type of robot.
6. Include participants who would be the expected population associated with the
research questions being investigated. For example, if the study is researching the
responses of children to a robot, then use children as participants in the study.
These recommendations offer guidelines and practical tips to determine the best
approach to design a comprehensive study in robotics and human-robot interactions.
This will increase the probability that results will be statistically significant.
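To make recommendation 2 concrete, the sketch below shows how an a priori power analysis for a simple two-group comparison might be run in Python with the statsmodels package, analogous to using power tables or G*Power [22]. The effect size, alpha, and power values are illustrative assumptions only, not values prescribed by these recommendations.

# A minimal a priori power analysis sketch using statsmodels (assumed to be installed).
# Effect size, alpha, and power are illustrative assumptions for a two-group t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,          # medium effect (Cohen's d), chosen only for illustration
    alpha=0.05,               # significance level
    power=0.80,               # desired statistical power
    alternative='two-sided',
)
print(f"Participants needed per group: {n_per_group:.0f}")

Under these assumptions the estimate is roughly 64 participants per group; a smaller expected effect raises the required sample size substantially.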

6.2 Recommendations for Study Execution

The following recommendations are provided to facilitate the execution of the study’s
experimental design. These recommendations will assist in revealing potential flaws
in the experimental design so that corrections can be implemented resulting in a
smooth running, efficient study. However, even with the best designs you can expect
equipment failures, participants and assistants arriving late or not at all, and other
pitfalls. The key is to have contingency plans in place and anticipate worst case
scenarios because they do occur.

1. Develop a written study protocol of all instructions, assessments with ordering,
participant tasks in order of execution, timing of events, coordination of data
collection, and any associated activities. This study protocol document will be
used when preparing IRB or Ethics Board paperwork, creating instructions for
participants, and preparing informed consent documents.
2. Perform multiple test runs of the planned study protocol until all glitches and
problems have been discovered and resolved and there is a smooth running system
in place.
3. Make sure that there is redundancy in all equipment that is required for the study
and that backup equipment is always ready to use because failures are common.
4. Always prepare for the unexpected with contingency plans in place to handle
equipment failures, participants and/or research assistants not arriving at their
designated times, or other events.
5. Always allow time for study delays, participants arriving late, or equipment fail-
ures that may cause the cancellation of participants and delay of the study.

6.3 Other Recommendations

The following recommendations concern the recruitment of participants and volunteer
research assistants; they are based on our experiences and may not apply to all
researchers and universities. These approaches were excellent resources for our particular study
and we are aware of similar programs available at many United States and European
universities and institutions.
• Recruit quality volunteer research assistants from an Honors College or Program if
available at the university or institution. Additionally, pre-medical and psychology
students often have a volunteer hours requirement and are willing to volunteer.
• Recruit participants through flyers posted across campus; word of mouth to friends,
family, and associates; incentives such as door prizes, payment for participation, and
extra course credit; and research study participant pools through the psychology
department and/or other departments on campus, if offered.
• Recruit the general public by requesting permission to post flyers at local malls,
stores, or applicable agencies to represent the target population.
• Obtain permission to recruit children from local schools, museums, and organizations.

Conducting human studies can be challenging and also very rewarding. Careful
planning and design can make the experience more positive and successful. Follow-
ing the above recommendations should improve the chances of having a successful
study with accurate and reliable statistically significant results. Through the use of
appropriate sample sizes and three or more methods of evaluation, convergent validity
should be obtainable. Readers are directed to [21, 30, 50] or other research methods
books for further reference.

6.4 Impact of Valid Human Studies on HRI and the Use of Robots in Society

The area of Human-Robot Interaction is an emerging field and as such it is essential
to use good research methods and statistical testing when conducting human studies.
The use of appropriate sample sizes and three or more methods of evaluation can
provide validity and credibility to the human studies that are performed associated
with HRI. This will not only improve the overall field but also result in stronger public
acceptance of robots. The public will be more likely to accept robots in their homes,
schools, work environments, and as entertainment if they know that the use of these
robots has been thoroughly tested for safety and effectiveness using good experi-
mental methodology. Additionally, the engineering community will be able to use
the information obtained from well-conducted user studies to design and build better
robots.

References

1. Bartneck, C., Kulic, D., Croft, E., Zoghbi, S.: Measurement instruments for the anthropomor-
phism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc.
Robot. 1(1), 71–81 (2009)
2. Baxter, P., Kennedy, J., Senft, E., Lemaignan, S., Belpaeme, T.: From characterising three
years of HRI to methodology and reporting recommendations. In: The Eleventh ACM/IEEE
International Conference on Human-Robot Interaction, pp. 391–398. IEEE Press (2016)
3. Bethel, C.L.: Robots without faces: non-verbal social human-robot interaction. Dissertation,
University of South Florida (2009)
4. Bethel, C.L., Bringes, C., Murphy, R.R.: Non-facial and non-verbal affective expression in
appearance-constrained robots for use in victim management: robots to the rescue! In: 4th
ACM/IEEE International Conference on Human-Robot Interaction (HRI2009). San Diego,
CA (2009)
5. Bethel, C.L., Murphy, R.R.: Use of large sample sizes and multiple evaluation methods in
human-robot interaction experimentation. In: 2009 AAAI Spring Symposium Series, Experi-
mental Design for Real-World Systems. Palo Alto, CA (2009)
6. Bethel, C.L., Murphy, R.R.: Review of human studies methods in HRI and recommendations.
Int. J. Soc. Robot. (2010). https://doi.org/10.1007/s12369-010-0064-9
7. Bethel, C.L., Salomon, K., Burke, J.L., Murphy, R.R.: Psychophysiological experimental
design for use in human-robot interaction studies. In: The 2007 International Symposium
on Collaborative Technologies and Systems (CTS 2007). IEEE, Orlando, FL (2007)
8. Bethel, C.L., Salomon, K., Murphy, R.R.: Preliminary results: humans find emotive non-
anthropomorphic robots more calming. In: 4th ACM/IEEE International Conference on
Human-Robot Interaction (HRI2009). San Diego, CA (2009)
9. Bethel, C.L., Salomon, K., Murphy, R.R., Burke, J.L.: Survey of psychophysiology measure-
ments applied to human-robot interaction. In: 16th IEEE International Symposium on Robot
and Human Interactive Communication. Jeju Island, South Korea (2007)
10. Bradley, M.M., Lang, P.J.: Measuring emotion: the self-assessment manikin and the semantic
differential. J. Behav. Ther. Exp. Psychiatry 25, 49–59 (1994)
11. Burke, J.L., Murphy, R.R., Riddle, D.R., Fincannon, T.: Task performance metrics in human-
robot interaction: taking a systems approach. In: Performance Metrics for Intelligent Systems.
Gaithersburg, MD (2004)
12. Cacioppo, J.T., Tassinary, L.G., Berntson, G.G.: Handbook of Psychophysiology. Cambridge
Handbooks in Psychology, 4th edn. Cambridge University Press, United Kingdom (2017)
13. Chien, S.E., Chu, L., Lee, H.H., Yang, C.C., Lin, F.H., Yang, P.L., Wang, T.M., Yeh, S.L.:
Age difference in perceived ease of use, curiosity, and implicit negative attitude toward robots.
ACM Trans. Hum. Robot Interact. 8(2), 9:1–9:19 (2019). https://doi.org/10.1145/3311788
14. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Lawrence Earlbaum
Associates, Hillsdale, NJ (1988)
15. Compagna, D., Marquardt, M., Boblan, I.: Introducing a methodological approach to evaluate
HRI from a genuine sociological point of view. In: International Workshop in Cultural Robotics,
pp. 55–64. Springer (2015)
16. Conti, D., Di Nuovo, S., Di Nuovo, A.: Kindergarten children attitude towards humanoid robots:
what is the effect of the first experience? In: 2019 14th ACM/IEEE International Conference
on Human-Robot Interaction (HRI), pp. 630–631. IEEE (2019)
17. Dautenhahn, K.: Some brief thoughts on the past and future of human-robot interaction. ACM
Trans. Hum. Robot Interact. 7(1), 1–3 (2018). https://doi.org/10.1145/3209769
18. Dautenhahn, K., Walters, M., Woods, S., Koay, K.L., Nehaniv, C.L., Sisbot, A., Alami, R.,
Siméon, T.: How may i serve you?: a robot companion approaching a seated person in a helping
context. In: 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction (HRI2006),
pp. 172–179. ACM Press, New York, NY, USA, Salt Lake City, UT (2006)
19. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information
technology. MIS Quarterly, pp. 319–340 (1989)
20. Elara, M.R., Wijesoma, S., Acosta Calderon, C.A., Zhou, C.: Experimenting false alarm demand
for human robot interactions in humanoid soccer robots. Int. J. Soc. Robot. 2009(1), 171–180
(2009)
21. Elmes, D.G., Kantowitz, B.H., Roediger III, H.L.: Research Methods in Psychology, 8th edn.
Thomson-Wadsworth, Belmont, CA (2006)
22. Faul, F., Erdfelder, E., Lang, A.G., Buchner, A.: G*power 3: a flexible statistical power analysis
program for social, behavioral, and biomedical sciences. Behav. Res. Methods 39(2), 175–191
(2007)
23. Fisicaro, D., Pozzi, F., Gelsomini, M., Garzotto, F.: Engaging persons with neuro-
developmental disorder with a plush social robot. In: 2019 14th ACM/IEEE International
Conference on Human-Robot Interaction (HRI), pp. 610–611. IEEE (2019)
24. Gliem, J.A., Gliem, R.R.: Calculating, interpreting, and reporting Cronbach’s alpha reliabil-
ity coefficient for Likert-type scales. In: Midwest Research-to-Practice Conference in Adult,
Continuing, and Community Education (2003)
25. Goodwin, C.J.: Research in Psychology-Methods and Design. Wiley, Hoboken (2003)
26. Gravetter, F.J., Forzano, L.A.B.: Research Methods for the Behavioral Sciences, 5th edn. Cen-
gage Learning, Stamford, CT, USA (2016)
27. Greenwald, A.G., McGhee, D.E., Schwartz, J.L.: Measuring individual differences in implicit
cognition: the implicit association test. J. Pers. Soc. Psychol. 74(6), 1464 (1998)
28. Hatcher, L.: Advanced Statistics in Research: Reading, Understanding, and Writing Up Data
Analysis Results. Shadow Finch Media, Saginaw (2013)
29. Itoh, K., Miwa, H., Nukariya, Y., Zecca, M., Takanobu, H., Roccella, S., Carrozza, M.C., Dario,
P., Atsuo, T.: Development of a bioinstrumentation system in the interaction between a human
and a robot. In: International Conference of Intelligent Robots and Systems, pp. 2620–2625.
Beijing, China (2006)
30. Johnson, B., Christensen, L.: Educational Research Quantitative, Qualitative, and Mixed
Approaches, 2nd edn. Pearson Education Inc., Boston (2004)
31. Kamin, S.T., Lang, F.R.: The subjective technology adaptivity inventory (STAI): a motivational
measure of technology usage in old age. Gerontechnology (2013)
32. Kidd, C.D., Breazeal, C.: Human-robot interaction experiments: Lessons learned. In: Pro-
ceeding of AISB’05 Symposium Robot Companions: Hard Problems and Open Challenges in
Robot-Human Interaction, pp. 141–142. Hatfield, Hertfordshire, UK (2005)
33. Kiesler, S., Goodrich, M.A.: The science of human-robot interaction. ACM Trans. Hum. Robot
Interact. 7(1), 1–3 (2018). https://doi.org/10.1145/3209701
34. Kulić, D., Croft, E.: Physiological and subjective responses to articulated robot motion. Robot
15 (2006) (Forthcoming)
35. Lazar, J., Feng, J.H., Hochheiser, H.: Research Methods in Human-Computer Interaction.
Wiley, West Sussex (2010)
36. Liu, C., Rani, P., Sarkar, N.: Affective state recognition and adaptation in human-robot interac-
tion: a design approach. In: International Conference on Intelligent Robots and Systems (IROS
2006), pp. 3099–3106. Beijing, China (2006)
37. Moshkina, L., Arkin, R.C.: Human perspective on affective robotic behavior: a longitudinal
study. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2005),
pp. 2443–2450 (2005)
38. Murphy, R.R., Riddle, D., Rasmussen, E.: Robot-assisted medical reachback: a survey of how
medical personnel expect to interact with rescue robots. In: 13th IEEE International Workshop
on Robot and Human Interactive Communication (RO-MAN 2004), pp. 301–306 (2004)
39. Mutlu, B., Hodgins, J.K., Forlizzi, J.: A storytelling robot: Modeling and evaluation of
human-like gaze behavior. In: 2006 IEEE-RAS International Conference on Humanoid Robots
(HUMANOIDS’06). IEEE, Genova, Italy (2006)
40. Mutlu, B., Osman, S., Forlizzi, J., Hodgins, J.K., Kiesler, S.: Task structure and user attributes
as elements of human-robot interaction design. In: 15th IEEE International Workshop on Robot
and Human Interactive Communication (RO-MAN 2006). IEEE, University of Hertfordshire,
Hatfield, UK (2006)
41. Nomura, T., Suzuki, T., Kanda, T., Kato, K.: Altered attitudes of people toward robots: inves-
tigation through the negative attitudes toward robots scale. In: Proceedings of the AAAI-06
Workshop on Human Implications of Human-robot Interaction, vol. 2006, pp. 29–35 (2006)
42. Olsen, D.R., Goodrich, M.A.: Metrics for evaluating human-robot interactions. In: Performance
Metrics for Intelligent Systems Workshop (2003)
43. Picard, R.W., Vyzas, E., Healey, J.: Toward machine emotional intelligence: analysis of affec-
tive physiological state. IEEE Trans. Pattern Anal. Mach. Intell. 23(10), 1175–1191 (2001)
44. Preece, J., Rogers, Y., Sharp, H.: Interaction Design-Beyond Human-Computer Interaction,
2nd edn. Wiley, West Sussex (2007)
45. Rani, P., Sarkar, N., Smith, C.A., Kirby, L.D.: Anxiety detecting robotic system-towards implicit
human-robot collaboration. Robotica 22(1), 85–95 (2004)
46. Riddle, D.R., Murphy, R.R., Burke, J.L.: Robot-assisted medical reachback: using shared visual
information. In: IEEE International Workshop on Robot and Human Interactive Communica-
tion (ROMAN 2005), pp. 635–642. IEEE, Nashville, TN (2005)
47. Schweigert, W.A.: Research Methods and Statistics for Psychology. Brooks/Cole Publishing
Company, Pacific Grove (1994)
48. Shaughnessy, J.J., Zechmeister, E.B.: Research Methods in Psychology. McGraw-Hill Inc.,
New York (1994)
49. Steinfeld, A., Fong, T., Kaber, D., Lewis, M., Scholtz, J., Schultz, A., Goodrich, M.: Common
metrics for human-robot interaction. In: 1st ACM SIGCHI/SIGART Conference on Human-
Robot Interaction. ACM Press, Salt Lake City, Utah, USA (2006)
50. Stevens, J.P.: Intermediate Statistics: A Modern Approach, 2nd edn. Lawrence Erlbaum Asso-
ciates, Publishers (1999)
51. Vogt, P., van den Berghe, R., de Haas, M., Hoffman, L., Kanero, J., Mamus, E., Montanier,
J.M., Oranç, C., Oudgenoeg-Paz, O., García, D.H., et al.: Second language tutoring using social
robots: a large-scale study. In: 2019 14th ACM/IEEE International Conference on Human-
Robot Interaction (HRI), pp. 497–505. IEEE (2019)
52. Watson, D., Clark, L.A., Tellegen, A.: Development and validation of brief measures of positive
and negative affect: the PANAS scales. J. Pers. Soc. Psychol. 54(6), 1063–1070 (1988)

Cindy L. Bethel Ph.D. (IEEE and ACM Senior Member) is
a Professor in the Computer Science and Engineering Depart-
ment and holds the Billie J. Ball Endowed Professorship in
Engineering at Mississippi State University (MSU). She is the
2019 U.S. Fulbright Senior Scholar at the University of Tech-
nology Sydney. Dr. Bethel is the Director of the Social, Ther-
apeutic, and Robotic Systems (STaRS) lab. She is a member
of the Academy of Distinguished Teachers in the Bagley Col-
lege of Engineering at MSU. She also was awarded the 2014–
2015 ASEE New Faculty Research Award for Teaching. She
was a NSF/CRA/CCC Computing Innovation Postdoctoral Fel-
low in the Social Robotics Laboratory at Yale University. From
2005–2008, she was a National Science Foundation Gradu-
ate Research Fellow and was the recipient of the 2008 IEEE
Robotics and Automation Society Graduate Fellowship. She
graduated in August 2009 with her Ph.D. in Computer Science and Engineering from the Uni-
versity of South Florida. Her research interests include human-robot interaction, human-computer
interaction, robotics, and artificial intelligence. Her research focuses on applications associated
with robotic therapeutic support, information gathering from children, and the use of robots for
law enforcement and military.

Zachary Henkel is a computer science PhD student at Missis-
sippi State University. He received a bachelor’s degree in com-
puter science from Texas A&M University, College Station, TX,
USA, in 2011. His research interests include human-robot inter-
action and human-computer interaction.

Kenna Baugus is pursuing a Bachelor of Science in Software
Engineering at Mississippi State University. She enjoys learning
about human-machine interaction and works as an undergradu-
ate researcher in the Social, Therapeutic, and Robotic Systems
(STaRS) Lab. Her current focus is developing social robots that
act as intermediaries to gather sensitive information from chil-
dren.
Introduction to (Re)Using Questionnaires
in Human-Robot Interaction Research

Matthew Rueben, Shirley A. Elprama, Dimitrios Chrysostomou
and An Jacobs

Abstract In the domain of Human-Robot Interaction (HRI), questionnaires are often
used to measure phenomena. In this chapter, we focus on the use of validated scales.
Scales consist of a series of questions that, combined, are used to measure a particular
phenomenon or concept. The goal of this chapter is to guide researchers originat-
ing from different backgrounds through the process of choosing and (re)using such
an instrument. In addition, we explain how researchers can verify that the scale is
measuring what they intend to measure (is “valid”). We also give practical advice
throughout the process based on our own experience with using scales in our own
research. We recommend a standardized process for using scales in HRI research.
Existing scales should be validated in a study very similar to the study that is being
designed before being trusted to perform correctly. Scales that do not quite fit a
study should be modified, but must then be re-validated. Even though some scales
are prevalent and often used in HRI studies across different contexts, researchers
should still know their limitations and maintain a healthy suspicion about whether they are working
as expected. We expand upon recommendations like these as we describe our rec-
ommended process for (re)using questionnaires. This chapter gives an introductory
overview of this process in plain language and then points towards the more formal
and complete texts for any details that are needed.

Keywords Robotics · Computer science · Sociology

M. Rueben (B)
University of Southern California, 3710 McClintock Ave, Room 423, Los Angeles, CA 90089,
USA
e-mail: mrueben@usc.edu
S. A. Elprama · A. Jacobs
imec-SMIT-Vrije Universiteit Brussel, Pleinlaan 9, Brussels, Belgium
e-mail: selprama@vub.ac.be
D. Chrysostomou
Aalborg University, Fibigerstraede 16, Aalborg East, Denmark
e-mail: dimi@mp.aau.dk


1 What Is a Questionnaire?

Questionnaires are often used as measurement tools in Human-Robot Interaction
(HRI) research. By questionnaire, we mean a series of questions presented to participants
as a means of evaluating an HRI experiment. We will focus on describing the
different types of scales often used in questionnaires: sets of multiple questions created
so that, when their responses are combined, they yield an estimate of one or more
concepts. Scales in questionnaires are used to measure something that the respondent
knows about, usually something inside themselves like a belief or attitude [7]. This
internal variable does not have to be about the respondent; it could also be about
something else, like a robot, an interaction with a robot, robots as a general category,
or another person, depending on the focus of the research. Scales usually consist of
multiple items. An example of a 3-item scale is
the Intention to Use [a robot] scale from Heerink et al. [14]. This scale consists of
the following three items: (1) “I think I will not use iCat the next few days”, (2) “I
am certain to use iCat the next few days”, (3) “I am planning to use iCat the next
few days”. If the scale is well-made, the responses to the individual questions can
be combined to triangulate the concept of interest more accurately than any single
question would.
When responding to scales, respondents choose from at least two options. For
instance, the Intention to Use [a robot] scale has a 5-point response format; i.e., there
were five options: totally agree, agree, do not know, do not agree, and totally do not
agree [14]. A scale usually comes with a specific range of answers. Questionnaires
are a method to research interactions with several types of robots, and in our domain,
they have been mainly used in experimental settings [9].
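As a purely illustrative aside, a multi-item scale such as the one above can be represented directly in analysis code. The item texts follow Heerink et al. [14], but the data structure, variable names, and sample responses below are our own assumptions:

# Illustrative sketch: a 3-item scale with a 5-point labelled response format.
intention_to_use = {
    "items": [
        "I think I will not use iCat the next few days",   # negatively worded item
        "I am certain to use iCat the next few days",
        "I am planning to use iCat the next few days",
    ],
    "anchors": {1: "totally do not agree", 2: "do not agree",
                3: "do not know", 4: "agree", 5: "totally agree"},
}

# One (invented) participant's raw responses, given in item order.
responses = [2, 4, 5]

Because the first item is negatively worded, it would need to be reverse-coded before the three responses are combined into a single score (see Sects. 4.1.1 and 7).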

2 A Recommended Process for (Re)Using Questionnaires

The theme of this book is standardization, which raises the question: Should we
create standardized questionnaires for HRI research—that is, questionnaires that are
designed to be used without modification in a broad variety of HRI studies? We argue
that while it may sometimes be useful to create and use some scales, the emphasis
should instead be on encouraging all HRI researchers to use a well-defined process
for finding and using the appropriate scales in a questionnaire for a study.
This chapter presents the process that we have learned from textbooks, courses,
and mentors—a process that we have found to be helpful in HRI research. It is
depicted in Fig. 1. This is how to do it, or how to review what someone else has done
(e.g., when reading a paper about an experiment that uses a questionnaire). The steps,
expanded upon for the remainder of the chapter, are as follows. First, the concepts to
be measured should be identified, and then defined very precisely (Sect. 3). Next, it
is essential to search for relevant scales (Sect. 4). This literature review reveals how
different scales have worked in various study contexts and in questionnaires in the past.

Fig. 1 Schematic overview of a standardized process for using questionnaires in HRI research

Sometimes it is appropriate to reuse an existing scale without modifying it, but
we walk the reader through several things to look for before doing so (Sect. 4.1). If
existing scales do not measure exactly the right thing, or are not designed for the
study context being considered, we recommend adapting one or more of them for
this new purpose (Sect. 5). Changes could be minor, such as switching just a few
words on a scale but keeping everything else the same. In other cases, a few items
could be borrowed from several different scales to create something more suitable
for the study being designed. As explained in Sect. 4.3, creating a completely new
scale from scratch is out of scope for this chapter. Any modifications to an existing
scale warrant re-testing to make sure it still works as intended. In fact, pilot testing
(Sect. 6) is always recommended even when using an unmodified scale in a new
study context, and can even help reveal when modifications to a scale (or the study
protocol) might be needed. The final section (Sect. 7) introduces the concepts of
reliability and validity, which are crucial for characterizing the performance of an
existing scale in a questionnaire or evaluating one that has been modified. We present
this process as sufficient for HRI researchers who want to reuse or modify existing
scales for evaluating their experiments and conclude the chapter by referring to some
additional resources that describe the various steps in greater detail.

3 Identify Which Concept(s) to Measure

The first step is to identify which concept(s) need to be measured in your study.
This depends on your research question. For instance, if you want to know whether
participants like robot A more than robot B, the concept to be measured could be how
much someone likes a robot. The next step is to investigate whether scales already
exist for your concept(s) by conducting a literature study. The advantage of using
existing scales is that it enables the comparison of the study with other studies (which
encourages standardization in HRI). However, existing scales might not always be
available in the HRI domain. For our example on rating how much one likes a robot,
we found a scale that could be used: one of the Godspeed subscales, called Likeability,
which measures how much someone likes a robot using pairs of opposite concepts
on a 5-point scale [1].
In most cases, questionnaires aim to measure something that is not directly
observable, such as attitudes, intentions, beliefs, motivations, or emotions [7].
Corresponding research questions in HRI might be:
• “How much do you like the robot?” (attitude)
• “Do you intend to use the robot?” (intention)
• “Do you believe the robot will deceive you?” (belief)
• “What motivates you to talk to the robot?” (motivation)
• “How did the robot’s failure make you feel?” (emotion).
The measurement is indirect because these concepts cannot be observed directly;
they are, therefore, inferred from the responses to the questions. Anything that
is not directly observable must be measured indirectly, and in the social sciences
researchers have developed an impressive set of tools for doing this [18].
A major problem is that the indirectness of the measurements introduces error.
Imagine a scenario wherein researchers want to know whether someone feels
stressed—they might measure the person’s blood pressure—or maybe they want
to know if someone trusts a robot—they might observe whether that person allows
the robot to do a dangerous task autonomously. Here, blood pressure is a proxy for
stress, and letting the robot act freely is a proxy for trusting it, but each proxy will
only give an imperfect estimate of the true value of stress or trust. After all, blood
pressure is affected by much more than stress levels, and decision-making depends
on more than trust. Questionnaires have the same problem: the response to each
question is only an imperfect estimate of the concept it is designed to measure [20].
So questionnaires are one of several ways to measure things that are not directly
observable about a person or group of people, but this indirectness introduces the
possibility that the measurement will be imperfect [16]. This is just one of several
sources of measurement error: if the questionnaire is poorly made or misused it might
perform inconsistently or even measure the wrong concept.
In sum, questionnaires are one way to measure concepts that are not directly
observable. Direct measurement or observation are alternative methods for things
that can be observed (for instance, the number of times a person interacted with a
robot in a shopping mall).

4 Searching for Relevant Questionnaires

After addressing what someone wants to measure, and agreeing that a questionnaire
is a good way to measure it, the next question might be: “Do I need to choose an
existing questionnaire that has been tested by other researchers?” In fact, there are
three options, each of which can be a good one in certain situations: use an existing
questionnaire, modify a questionnaire (or combine items from several) [14], or create
a new questionnaire from scratch [3].
This section will help the reader to choose between these three options, including
how to choose which questionnaire to use “as-is” if that is the chosen option. This
section will also attempt to assist the reader with locating and examining existing
questionnaires and their scales, which is an important step regardless of which is the
chosen option. The next section (Sect. 5) is about modifying questionnaires to suit the
undertaken research. The final option—creating a new scale from scratch—will be
mentioned briefly, but is a serious undertaking for which we recommend DeVellis’s
book on Scale Development [7].

4.1 Considerations for Using a Questionnaire in Its Current Form

The main recommendation that we provide in this section is about choosing a ques-
tionnaire that measures exactly what needs to be measured and is suited for the
given scenario and population. We explore how to understand which scales the
questionnaire is using and how to evaluate the validity evidence collected for them.

4.1.1 Examining the Scale Itself

It is possible to partially evaluate the suitability of a potentially interesting scale by
merely reading the statements of the questionnaire. Here are some guiding questions
that can be of assistance during an initial literature search:
Are the authors trying to measure exactly the same concept?

1. For example, consider the timing of the concept: is there the need to capture the
feeling of the participant at that very moment of the experiment, or on average
over the entire study?
2. Is it a dispositional measurement—i.e., a trait: a fact about someone that does
not change over time—or a situational measurement—i.e., a state, which might
change from moment to moment? For example, a person might have a disposition
towards trusting machines that is relatively stable over time, but how much they
trust a particular robot varies based on their interactions with it.
3. Does the questionnaire measure the target concept with the level of specificity or
generality that is needed? For example, a scale that measures “anxiety about using
new technology” might work differently than a more specific scale that measures
“anxiety about using a new robot” or even “...using the Baxter robot for the first
time”.

Is the “target” correct?

If robots are the focus of the study, then the questionnaire should ask about robots,
not people or computers or telepresence systems or virtual agents. At a minimum, any
language that does not make sense for the “target” should be changed.
1. For example, suppose the study is about measuring people’s privacy concerns
about robots, and that after a literature survey for measures of privacy concerns
the three online privacy scales developed by Buchanan et al. [2]
are found. These scales measure the required privacy concepts, but with respect
to the wrong target: the questions ask about Internet usage, not interactions with
robots. However, this scale might still be relevant as inspiration as it could be used
to form a new question about robots based on each question about the Internet—
see Sect. 5 for some guidance about how to do that.
2. On the other hand, if aspects about each participant are needed without any ref-
erence to a robot or the study context, an existing questionnaire might work well.
Examples might include questionnaires that measure personality, demographics,
workload [13], and emotion [10].
Are you confident that it will work in an HRI study if it was not originally designed
for that?

As an example, in the case where the participants are required to judge the “warmth
of an interaction” with a robot using a questionnaire developed for human-human
interaction, the questionnaire might not measure precisely the desired aspect. It is
possible that extra pilot studies, measurements, or analyses will be required to check
that the questionnaire measures what it is really intended to measure.
Does it appear to be well-written?

Sometimes it is possible to predict if a questionnaire will perform poorly just by
reading the items. We have listed a few types of items below that should be removed
or changed if they are present in a questionnaire, but refer to Step 2 of Chap. 5 in
DeVellis [7] and Chap. 10 of Furr and Bacharach [12] for more complete guidance.
• Pay extra attention for items that might accidentally measure the wrong concept
(called a “confounding variable”). E.g., “I felt safe around the robot” might mea-
sure whether people think the robot will hurt them, but their answers might also
be influenced by other safety hazards around the robot, like if they are worried
about tripping over the robot’s tether cable. A related problem is when an item is
“double-barrelled”—that is, it asks about two things at once, so different respondents
may answer different parts of it or interpret it in different ways. Each item should be
written clearly so as to be interpreted in the same way by everyone responding to
the questionnaire.
• It can be risky to use a questionnaire without any “reverse-coded items.” In other
words, a scale about trusting the robot should have both positive items (e.g., “I
trust the robot”) and negative items (e.g., “I do not think the robot is trustworthy”).
This can sometimes reveal that respondents think about positive and negative atti-
tudes differently, and also helps catch careless respondents who always choose
“agree” or always choose “disagree” [4] (an illustrative scoring sketch follows this
list). Especially beware of questions that participants might be motivated to lie
about, e.g., if lying would make them feel better about themselves or make the
experimenter happy. For example, be suspicious of positive responses from partici-
pants who may believe that such answers personally compliment the creator of the
robot or the programmer of its behaviour.
• Questionnaires should not be so long that the participants get tired or bored, but it
usually1 takes multiple items per concept (the exact number varies) to get accurate
measurements [7].
• Questionnaires are completed in a specific order: some questions are seen before
the others. Special care should be given to whether any of the earlier items could
impact the way people interpret the later items. For example, people might initially
disagree with the statement, “There are things about the robot that I would like to
change”, but not if they are first reminded of problems with the robot by reading
items like, “the robot’s driving is unpredictable”, or, “sometimes the robot takes
a long time to do a task.” These are called “order effects,” and are usually found
during the scale creation process by testing the scale with many different item
orders.
• Ensure that the language is appropriate for the people who are reading the
questionnaire—it is not advisable to use advanced grammar or vocabulary in a
questionnaire for children, or technical jargon for the general population. Every-
body who takes the survey should understand it in the same way, as it was intended
to be understood.
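The sketch below, referenced in the reverse-coding point above, illustrates how negatively worded items might be reverse-coded before scoring and how straight-lining (always giving the same raw answer) could be flagged. The column names and responses are invented for illustration and are not a prescribed procedure:

# Illustrative sketch: reverse-coding a negatively worded item before scoring.
import pandas as pd

SCALE_POINTS = 5  # a 5-point response format is assumed
data = pd.DataFrame({
    "trust_1": [5, 4, 2],        # "I trust the robot"
    "trust_2_raw": [1, 2, 4],    # "I do not think the robot is trustworthy"
})

# Reverse-code the negatively worded item so that higher values always mean more trust.
data["trust_2"] = (SCALE_POINTS + 1) - data["trust_2_raw"]

# Flag respondents who gave the same raw answer to every item (possible carelessness).
straight_liners = data[["trust_1", "trust_2_raw"]].nunique(axis=1) == 1
print(data[["trust_1", "trust_2"]])
print("Possibly careless respondents:", int(straight_liners.sum()))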

4.1.2 Examining Validity Evidence and Estimates of Scale Properties

If a questionnaire looks good according to all the criteria above, it should be expected
that it will perform well in practice—i.e., that it will measure the right concept every
time it is used. Here we examine how to check whether the available evidence
actually supports that expectation.
Well-made questionnaires are usually tested. Often at least one “validity study” is
performed right after the questionnaire is finished to prove that it is a strong indicator
of the same concept every time it is used (i.e., the questionnaire is reliable) and that
this concept that it measures, is the concept that the researchers wanted (i.e., the
questionnaire is valid). After that, other researchers might report their own evidence
about the questionnaire’s performance when they use it in their studies, which shows
how it performs in different scenarios or with different types of people.
This information on the performance of a specific questionnaire in studies that are
as similar as possible to yours is useful and should be carefully studied. A new study
will always be at least a little different than these published studies—e.g., it might be
done with people of different ages, cultures, or professions, or in a different type of
room, time of day, or with a different robot doing another type of task, or even with a
different experimenter. Each one of these variations could change the performance of
a questionnaire from what is published in other studies. The risk is probably larger if
there are multiple, large differences, but even a single, small difference could cause
a questionnaire to perform very differently.

1 Efforts to create single-item measures such as those reported by Denissen et al. [5] can demonstrate
the concerns and difficulties involved.


Only use an existing questionnaire “as-is” if you become convinced by the evi-
dence that it will perform the way it is needed in your study. If you are not completely
convinced by the evidence that is available but you believe that the questionnaire
might work in your study without any changes, you should consider running a fresh
validation study in your study context—see Sect. 7. If you are convinced that the
questionnaire will not work “as-is”, you can either modify it to better fit your study,
perhaps just taking one or two of its items—see Sect. 5—or abandon it and try some-
thing else.

4.2 Where to Search for Existing Questionnaires

“A good place to begin searching for an appropriate measurement instrument is in
published studies that have examined the concept of interest in ways and contexts
similar to what you have planned” [6] (p. 239). For example, if one of the focus
points of the study is to measure the level of trust in the interaction with the robot, the
papers from the previously mentioned literature review should provide a good idea
of which measures those researchers used. Naturally, you should not just replicate
what you find in those Methods sections just because other researchers have done it;
instead, evaluate all existing scales using the principles in this section and in Sect. 7.
Several other bodies of literature might contain information on existing questionnaires
and their scales as well: one for the concept of interest (e.g., about personality or
privacy or teamwork), one for the population if it is a special population (e.g., older
adults), one for the robot or class of robot used (e.g., the Nao, or humanoid robots
in general), and one for the scenario or application area (e.g., search and rescue,
warehouse logistics, or autism therapy).
There are several places where researchers have listed existing questionnaires so
others can find them. We recommend searching online databases such as PsycTESTS2
or the Health and Psychosocial Instruments (HaPI) database by Behavioral Measure-
ment Database Services (BMDS). Even in the case where an entire questionnaire is
not used “as-is”, you can always use some of the individual items, or perhaps an
entire subscale, provided that the appropriate researchers receive the credit. However,
be sure that the new questionnaire is tested, because those items may work differently
alongside any new items that are written or taken from other questionnaires.

4.3 What About Creating a Completely New Questionnaire?

Maybe you are trying to measure something that is completely different than any-
thing that has been measured before using a questionnaire format. Maybe you are
dissatisfied with existing survey measures for a certain concept—maybe you believe

2 https://www.apa.org/pubs/databases/psyctests.

them to be misguided about exactly what that concept is—and you want to start
from the beginning to make sure the concept is being measured accurately and in its
entirety with sufficient reliability. Creating a new questionnaire from scratch is its
own adventure. It is a serious undertaking and a research project by itself. The rec-
ommended process (which will not be described any further in this chapter) includes
working with experts on your concept to create a long list of potential items that are
carefully reviewed, tested, and reduced down to a shorter list. There are also more
technical steps like choosing response formats and the total number of items. It is not
common in HRI to start from the beginning like this—many researchers select and
modify items from existing questionnaires to create a new questionnaire, which is the
subject of the next section. It should also be kept in mind that changing a validated
scale is still a bit like starting from nothing in terms of validation (see Sect. 7).

5 Adapting Questionnaires

The previous section talked about situations where it is advisable to use an existing
questionnaire as it is. This section tackles situations where existing questionnaires
provide many of the wanted elements for the study, so creating a new questionnaire
is unnecessary, but they still require some significant changes to fit the context of
the study. There are several possible reasons
from our list in the previous section:
1. Existing questionnaires do not measure exactly the right concept. They pos-
sibly measure a broad concept when a narrower one is needed, such as if they
measure perceived safety of the robot in general whereas you want to measure
the perceived gentleness of the robot’s movements around people.
2. Especially when measuring attitudes, perhaps existing questionnaires do not
measure concepts that target the right object. For example, they might measure
optimism towards self-driving car technology instead of towards your household
robot, or maybe they measure resentment towards a human instead of towards a
social robot.
3. You are dissatisfied with the quality of the items, such as the phrasing, variety,
or order.
That said, if the existing questionnaires that you have found do seem
to provide you with a helpful start for creating your questionnaire—whether the
items could be directly used with some changes or just serve as inspiration for
different types of items or the sorts of diversity that are possible—then you should
consider adapting the contents of those questionnaires to the study instead of writing
a completely new questionnaire from scratch.

5.1 A Reminder: Changing a Questionnaire Will Affect Its Validity and Performance

Making any changes to existing scales of a questionnaire might change the way
it performs [11]. This includes even small changes to item wordings, the order of
the items, and the way the scale is introduced to participants. A seemingly mild
adaptation of an existing scale like changing the word “person” to “robot” can have
large effects on how people understand and respond to the scale.
Scales are usually tested to see whether they measure the right concept (i.e., are
valid) and work consistently over and over again (i.e., are reliable). Other character-
istics of the scale are also measured, like the mean and variance of each individual
item and how the items correlate with each other (e.g., to form subscales that measure
distinct facets of the concept of interest). All of these properties—validity, reliabil-
ity, means, variances, and correlations, among others—could change drastically when
you adapt existing scales. It might be possible to guess how much they change and in
which direction, but only careful testing can show whether such a guess is correct.
The scale might not retain the properties that have been reported in previous
tests after you modify it; instead, you should remain skeptical of what the new scale
is measuring until it is tested. Consult Sect. 7 for an introduction to validity testing
and characterization of scales.

5.2 Taking Items from Several Scales and Making Mild Changes to Item Wordings

There are different ways to adapt existing scales into a scale for your study. First,
we will talk about ways that do not require anything more than mild changes to the
wording of the items.
It might be possible to avoid changing the wordings of the items at all if they
already fit the study’s context. You might just decide to remove a few items that
you do not want from an existing scale. If it is divided into subscales that measure
different facets or components of a concept then you might remove some of those.
You might also use items from several different scales to cover your entire concept
or to get a good diversity of items.
As mentioned in the previous section (Sect. 4), minor changes to the items might
be desired to make them fit the context. Note that even a minor change in the
wording might significantly alter the way people interpret an item and respond to it. For
example, you might change the target of an item—i.e., the entity you are asking
the respondent to judge. In HRI studies participants might need to rate one of many
different targets: themselves, other people, a robot, a group of robots, an interaction,
a human-robot team, or something else. Here is a simple example: in a personality
test the items might be written so that you are rating yourself, like “It is easy for
you to get excited about something”, but you want respondents to judge the robot
instead; you could simply replace the target in the statement so it says, “It is easy for
the robot to get excited about something”. You might also need to change the item
to suit a different scenario or use case. As an example, an item from a trust scale
made for search-and-rescue applications can be borrowed and changed to fit a
medical task: “I worry that the robot will somehow fail to find the person” might be
changed to, “I worry that the robot will somehow fail to pick up the syringe”. Finally,
a change of the wording of items to suit your population might be necessary, such
as by simplifying the language to suit respondents who are children or by changing
the interface to be more accessible if your respondents have motor disabilities.

5.3 Writing New or Significantly Modified Items

Even if most of the items are copied or lightly adapted from existing scales, there
may still be some gaps that you decide to fill with items you write on your own.
This section will give a brief introduction to writing new items or making significant
modifications to existing items.

5.3.1 The Importance of Related Questionnaires in the Literature

Although the exact text from existing questionnaires might not be used, you might still
use those existing items as inspiration. For example, maybe you notice a certain kind
of diversity that you want to emulate among the items in an existing questionnaire—
perhaps you are inspired by the various types of items. There might also be clever
phrasings to borrow, or certain topics or scenarios to ask about. Existing questionnaires
might also help you see a new facet of your concept of interest that you would like to
include in your own questionnaire. For example, maybe you mostly use items from
an existing HRI trust scale, but then notice that a human-human trust scale has a
subscale for “trust in hypothetical situations”. You decide that you want to include
this facet in your questionnaire to see if it exists in HRI, but the existing items do not
make sense for an HRI experiment, so you use them as loose inspiration and write
some of your own.

5.3.2 A Primer on Item Creation

Here we present some of the principles from DeVellis’ book [7] that you would use
to create items for a completely new scale. The chapter by Krosnick on questionnaire
design is also a good resource [15]. These principles will provide guidance when you
are creating new items or making significant changes to items in an existing scale.
One important detail we exclude is how to write the response format, i.e., the options
that respondents can choose from—both the DeVellis book and the Krosnick chapter
have insightful sections about this [7, 15].

The purpose of a scale is to measure the strength or level of a particular concept.
For example, the NARS [17] measures the strength of the respondent’s negative
attitudes towards robots—whether these attitudes are strong, mild, or nonexistent.3
The prevailing philosophy for creating items for such a scale is to think of each
item as a miniature scale in itself. In particular, each individual item should measure
the strength of your concept of interest by itself. In other words, using that single
item could provide you an estimate of what you are trying to measure.4
When this principle is used, every good scale is highly “redundant” in the sense
that it measures the same thing over and over again in different ways and from
different angles. We can measure this intentional “overlap” between items by looking
at the correlations (or shared variance) between item responses for a sufficiently
large sample of respondents. Larger correlations between two items mean they are
mostly measuring the same thing(s), whereas a lack of correlation is due to unwanted
factors that are specific to the individual items, like accidentally measuring a different
concept, item quality issues like ambiguous wording, and other random variance in
how people respond to each item. By creating items to be independent measures of
the same concept, we can assume that the “redundancy” or “overlap” identified by
the shared variance between item responses is the part we want—an estimate of the
level of our concept of interest—and that the rest is not helpful to us.
The purpose of a scale is to isolate this “overlap” to yield a more robust and
accurate estimate of the true level of our concept than we could have gotten from
any one of the individual items. In practice this is done by simply adding together
the responses from the items, boosting the signal of our concept of interest over any
other signals that are not shared by multiple items. This works best with a diversity
of items that approach the concept from many different angles—the common things
should be amplified and shine through whereas the unwanted or incidental things
and any other noise should cancel out with itself or fade into the background.
How, then, do we choose which items to include in a questionnaire? There are
many different items—perhaps an infinite number of variations—that could be used
to measure your concept. Different items might use different words or phrasing (e.g.,
“I trusted the robot” versus “I relied on the robot”), refer to a different part of the
situation (e.g., “The robot’s speech was humanlike” versus “The robot’s motions
were humanlike”), or offer a different part of the concept’s range of values (e.g., “I
would be glad if robots became common in society” versus “I would be worried if
robots became common in society”). The main goal when choosing items for the
questionnaire is to evenly sample from this infinite pool of different items such that
the concept of interest is the only thing that is shared by all the items. That way, when
the responses to all the items are combined, only the main concept of interest will
be amplified.

3 More specifically, the NARS has three subscales, each of which measures the strength of a certain
type of negative attitudes about robots.
4 DeVellis [7] (p. 109): “Each item can be thought of as a test, in its own right, of the strength of the
latent variable.” And, “…each [item] must still be sensitive to the true score of the latent variable”.
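To illustrate the combination of items described above, the following sketch inspects inter-item correlations (the intended “overlap”) and forms a summed scale score with pandas. The responses are invented and carry no empirical meaning:

# Illustrative sketch: inter-item correlations and a summed scale score.
import pandas as pd

items = pd.DataFrame({
    "item_1": [4, 2, 5, 3, 4],
    "item_2": [5, 2, 4, 3, 4],
    "item_3": [4, 1, 5, 2, 5],
})

# Larger inter-item correlations suggest the items share variance, i.e. that they
# are mostly measuring the same underlying concept.
print(items.corr())

# Summing (or averaging) the items amplifies the shared signal over item-specific noise.
scale_score = items.sum(axis=1)
print(scale_score)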

6 Pilot Testing Questionnaires

After multiple questionnaires are explored and finally selected to use in the research,
it is important to pilot test the scales.5 In other words, it is time now for people
from your target population to fill in the survey. It is important to test the duration of
the entire survey because if it is too long, people might drop out of your study or start
randomly answering the items. If you are using the questionnaire in an experimental
setting, it is important to also test the whole protocol, to know how long your entire
experiment lasts from receiving the participant and informing them about your study
until the debriefing after their participation.
Other than learning about the duration of the study, you can also learn if the items
make sense to your target audience. For instance, your items might unintentionally be
interpreted differently by different respondents, or your respondents might interpret
them differently than you intended them to. This can be assessed by analysis of the
responses or by interviewing the respondents.
Another important point is that brief descriptions of how your respondents
should fill in the survey’s items should be provided. For instance, if the 5-point
response format is chosen, ask them to select only one point. Also, think about
how the labels of the different response options are structured. Just writing down the
numbers 1–5 can be interpreted differently by various respondents and, therefore, it
is important to provide some clarifications, or “anchors”. For instance, (1) strongly
agree, (2) agree, (3) neither agree nor disagree, (4) disagree, and (5) strongly disagree.
It is not always advisable to label every option, and there are a lot of different labels
to use. Additionally, make sure that the positive responses are always on the same
side—do not switch the order of your answer options halfway in your questionnaire.
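As a small illustration of a consistently labelled response format (the code structure and function are our own sketch, not a standard survey tool), the anchors from the example above can be defined once and reused for every item so that the positive end always stays on the same side:

# Illustrative sketch: one labelled 5-point response format, reused for every item.
ANCHORS = ["strongly agree", "agree", "neither agree nor disagree",
           "disagree", "strongly disagree"]
LABEL_TO_SCORE = {label: 5 - i for i, label in enumerate(ANCHORS)}  # "strongly agree" -> 5

def score(label: str) -> int:
    """Convert a labelled pilot response into a numeric score."""
    return LABEL_TO_SCORE[label]

print(score("agree"))  # 4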
You can keep piloting your survey until your pilot respondents start saying the
same things and you are not learning anything new anymore. At that point you can
stop and make improvements based on this first round of feedback. If you changed
your questionnaire significantly, you could pilot it again for a second round.

7 Validating Questionnaires

7.1 Reliability

The “reliability” of a scale is the extent to which the scores from the scale are indi-
cating the true strength of some concept and not measurement error, assuming there
is only one concept being measured and that all measurement error is random [12].
Note the two distinct aspects of this definition: first, that the scale (or subscale) items
are all measuring just one concept, and second, that there is not too much random

5 DeVellis refers to this as “Administer[ing] Items to a Development Sample”—see his section for more details [7].

error in the responses, as this would obscure that measurement. A more formal def-
inition for reliability is the correlation between respondents’ scores on the scale and
the true level of the concept being measured. Another, equivalent definition is that
differences between people’s scores on the test should reflect the actual (“true”) dif-
ferences between them in the level of the trait being measured. Of course, we never
know this true level, so psychometricians have invented several ways to estimate
the reliability of a scale. Note that the identity of the concept being measured is not
considered—reliability is simply about measuring a single concept (regardless of
which one) and being untainted by the various sources of measurement error. Valid-
ity (discussed in Sect. 7.2) is the property that additionally requires that the correct
concept is being measured.
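For readers who prefer a formula, one common formalization, based on the Classical Test Theory model referred to later in this section, is the following (stated here for orientation rather than drawn from a specific page of the sources cited):

X = T + E, \qquad \operatorname{Cov}(T, E) = 0, \qquad
\text{reliability} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2},

where X is the observed score, T the true score, and E random error; the correlation between observed and true scores is the square root of this variance ratio. The estimation methods described next can be read as different ways of approximating this quantity from observable data.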

7.1.1 Empirical Estimates of Reliability

If a scale is reliable, it will track with the true level of the construct it is measuring
across multiple respondents and in different situations that cause that true level to
fluctuate. Several different methods have been created to measure whether this is the
case for a particular scale. One is to administer two versions of the same scale to
participants in one sitting, but this assumes that the two versions are similar enough
to be directly compared (“alternate forms reliability” or “parallel forms reliability”).
A second method is to administer the same scale twice, at different sittings, and
compare the results (“test-retest reliability”). This only works for stable concepts
like personality that are not supposed to change much over time. An alternative
to both of these first two methods is to examine the intercorrelations between the
scale’s items, where each item is thought of as a parallel form of the scale (“internal
consistency reliability”). This is more convenient than the other methods because
participants must take only one test, and during only a single sitting. “Coefficient
alpha” or “Cronbach’s alpha” is the most popular measure of internal consistency
reliability [6]. In all these methods the assumption is that the extent to which responses
to the different test administrations (or to the different individual items) are correlated
indicates the extent to which they are measuring the same thing.
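To make the computation concrete, here is a minimal sketch of coefficient alpha applied to a small, invented response matrix; in practice one would normally rely on an established statistics package rather than hand-rolled code, and the data below are purely illustrative.

import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Coefficient alpha for a respondents-by-items matrix of scores.

    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)
    """
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_score_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_score_variance)

# Hypothetical data: 6 respondents answering 4 items on a 5-point scale.
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])
print(round(cronbach_alpha(scores), 2))  # approximately 0.96 for this toy data

Because the items in this toy matrix move together across respondents, alpha is high; if one item tracked a different concept, the total-score variance would shrink relative to the sum of the item variances and alpha would drop.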

7.1.2 Scales Must Be Reliable to Be Valid

We have begun this section about validation of scales with an introduction to relia-
bility because it is a prerequisite for validity. A scale must first measure something
consistently (i.e., be reliable) before it can measure what you want it to measure
(i.e., to be valid). Hence, estimating a scale’s reliability can be considered part of the
validation process of a whole questionnaire. You should just be sure to go beyond
estimating reliability when you are evaluating a scale, since it is very possible to
accidentally create a reliable measure of the wrong construct.
Also, remember that this formulation of reliability (based on Classical Test The-
ory [12]) assumes that all measurement error is random. It is a common problem for

a scale to accidentally measure one or more other variables in addition to the target
one, and these other variables might be correlated (either positively or negatively)
with your target variable. Reliability estimates will not detect these circumstances—
they can only quantify the “error” but not identify its source. Additional analy-
ses such as “factor analysis” [7, 8] or metrics that contain more information than
Cronbach’s alpha does, such as the coefficient of equivalence and stability [19],
would be required instead.

7.2 Validity

Validity is “the degree to which evidence and theory support the interpretations of
test scores entailed by the proposed uses” [12] (p. 168). Hence, instead of talking
about the validity of a scale or of certain items we should talk about the validity of
our interpretations of responses to the scale. For example, we could talk about the
validity of concluding from a certain person’s responses to the items of the Perceived
Intelligence scale in the “Godspeed” series of questionnaires [1] that they perceive
the robot to have some particular level of intelligence. This interpretation is valid to
the extent that the scale is really measuring perceived intelligence of the robot for that
participant, and those particular responses are really indicative of the stated level of
perceived intelligence. Validity is a continuous variable, a matter of “degree”. Finally,
the above definition says that “evidence and theory” are what we should consult to
see whether our interpretations are valid. We will now discuss four different types
of validity evidence from Furr and Bacharach [12].

7.2.1 Types of Validity Evidence

1. Test Content. Most theories about the development and use of scales assume that
each and every item is a measure of the target concept and not any other concepts.
Not even one item should measure a different concept. It is also important for
the items to cover all facets of your concept—this is called “content validity”.
2. Response Processes. The processes that influence the participants’ responses
should be the ones that are expected. For example, if the participants are asked
how much they agree or disagree with the statement, “I often ignore advertise-
ments for new robotic toys,” it matters whether they respond by remembering
times they have encountered such ads and whether they ignored them, or instead
by thinking about whether they consider themselves to be interested in robotic
toys. Depending on the psychological process that produces the response, the
item could end up measuring different concepts.
3. Internal Structure of the Test. The items in a scale (or groups of items in case
of subscales) should relate to each other in an expected way given the concept(s)
you are trying to measure. For example, imagine a scale that is supposed to
measure the respondents’ perception of the precision of the robot’s movements

with two subscales: one for navigation movements and another for manipulation
movements. We would first expect that each subscale measures only one concept,
so responses to all of that subscale’s items should be highly intercorrelated.
We would also expect that the items in each subscale would be more strongly
correlated with each other than with the items in the other subscale—i.e., that
the items form two distinct clusters. If a scale has multiple dimensions like this
one, or if it has more than just a few items, it becomes difficult to manually
inspect the correlations between individual items. Researchers use a class of
techniques called “factor analysis” to evaluate the dimensionality of large or
multidimensional scales [7, 8].
4. Associations with Other Variables. It is important to know how your concept is
related to other ones—concepts with which it is synonymous, correlated but not
identical, and practically unrelated—in order to be able to properly validate any
measure of that concept. If you measure your target concept along with some of
these related concepts on the same participants, you can use the correlations to
evaluate whether your measure is working as expected. Concepts that you would
expect to co-occur with your target construct should be measured at the same
time (“concurrent validity”), whereas if one construct is expected to causally
influence the other then the measurements should be taken at different times
(“predictive validity”). Measures of unrelated concepts should be uncorrelated
(“discriminant validity”). It can be important to make several of these checks
in case one is misleading—for example, if you find that your trust measure
successfully predicts people’s willingness to interact with the robot, you might
also do a discriminant validity check to make sure you are not accidentally
measuring people’s comfort around robots in general.
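To illustrate this last type of evidence with the trust example above, the following minimal sketch correlates a hypothetical trust score with a measure it should relate to and with one it should not. All scores are invented, and real validation work would of course involve proper sample sizes and significance testing.

import numpy as np

# Hypothetical scores for six participants (illustrative only): a trust
# scale, a measure of willingness to interact with the robot, and a
# measure of general comfort around robots.
trust       = np.array([4.1, 2.6, 3.4, 4.8, 2.2, 3.9])
willingness = np.array([4.4, 2.9, 3.1, 4.6, 2.5, 3.7])
comfort     = np.array([3.0, 3.8, 2.4, 3.9, 3.3, 2.8])

# Expected association: trust should correlate substantially with
# willingness to interact with the robot.
r_expected = np.corrcoef(trust, willingness)[0, 1]

# Discriminant check: trust should correlate only weakly with general
# comfort around robots; otherwise the scale may be measuring comfort.
r_discriminant = np.corrcoef(trust, comfort)[0, 1]

print(f"trust vs. willingness: r = {r_expected:.2f}")
print(f"trust vs. comfort:     r = {r_discriminant:.2f}")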

7.2.2 What Is an Acceptable Level of Validity?

Validity studies performed by someone else should raise the question of whether the evidence they report applies to the way the scale will be used in your experiment. Did they use a different population, context, or scale items? Even if you replicated the conditions of that validation study exactly, you would get different numbers, because both studies sample from a larger population and your results are therefore unstable estimates of population values. Our point is: the more your study differs from the validation study, the more skeptical you should be that their evidence applies to your study.
If you want to use a scale but are skeptical about one (or several) of these four facets
of validity then you can perform some additional validity tests, either as standalone
studies or via validity checks in your main study. Even if you do not perform any
checks, at least avoid making claims about scores using validity assumptions that
have not been supported by evidence. You might have to make weaker or more
conservative conclusions, or provide clear disclaimers about possible flaws with
your measure. You might also simply decide not to use the scale for this study.

8 Summary

This chapter is based on our belief that the idea of standardizing questionnaires—that
is, of creating questionnaires to be reused without modification in a wide variety of
experiments—should not be taken too far in HRI research. Standardization might be
possible for scales that measure something about humans that can be measured in
the same way across many different contexts, such as personality or affective state
or attitudes towards robots as a general category, but often existing questionnaires
ought to be modified to fit a new study context, and this requires sensitive handling
and precise alterations. Therefore, instead of standardized questionnaires, we have laid out a standardized process that we recommend applying in all HRI research when choosing, modifying, and testing questionnaires.
We started by describing how to decide whether a questionnaire and its scales are the proper measurement tools for a particular research goal. Next, we talked about
choosing the concept(s) to be measured by this questionnaire and defining them
precisely. We then recommended ways to perform a careful search for relevant scales used in previously published literature. As a next step we outlined how to find ways in which existing scales are not suitable for the study under consideration, and how those scales might be modified or combined to fit the context and measure the right thing. We then talked about pilot testing scales and closed with an introduction to the key concepts of reliability and validity. Although we believe that the major steps of the proposed standardized process are covered adequately in the current chapter, two important operations are not discussed because they require extensive treatment of their specific details: (1) creating a new scale from scratch, and (2) designing validation studies. The following Sect. 9 on recommendations for further reading concludes the chapter and offers some guidance to readers who want more information.

9 Recommendations for Further Reading

We have tried our best to explain the basics of using questionnaires in HRI research.
This might be enough to have an intelligent conversation with someone about their
questionnaire research, but if you are reviewing a paper or designing a study that
uses a questionnaire then you will probably need more details. We have listed several
books below that we refer to in those situations. We hope that after reading this chapter
you will be able to use the Table of Contents or Index in each of them to find the
information you need.

• Furr and Bacharach—Psychometrics. We find this book easy to read and under-
stand. It covers all the main topics that are important for using questionnaires.
• DeVellis—Scale Development. This book covers many of the same topics, but
more briefly, and we find it less clear and harder to read. It gives step-by-step
instructions, however, on how to create a new scale.

• Rosenthal and Rosnow—Essentials of Behavioral Research. There are sections on reliability and validity (Chap. 4), forming composite variables (i.e., factor anal-
ysis; end of this chapter), and questionnaires (Chap. 6). This book may sometimes
use simple, outdated formulas and overly neat and tidy examples, but the purpose
is to give the reader quite a deep understanding of the basic concepts and the
intuitions behind more advanced techniques.

Acknowledgements The authors are grateful to Amber Fultz, Prof. John Edwards, and several
anonymous reviewers for their helpful comments on a draft of this chapter.

References

1. Bartneck, C., Kulić, D., Croft, E., Zoghbi, S.: Measurement instruments for the anthropomor-
phism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc.
Robot. 1(1), 71–81 (2009)
2. Buchanan, T., Paine, C., Joinson, A.N., Reips, U.D.: Development of measures of online privacy
concern and protection for use on the internet. J. Am. Soc. Inf. Sci. Technol. 58(2), 157–165
(2007)
3. Charalambous, G., Fletcher, S., Webb, P.: The development of a scale to evaluate trust in
industrial human-robot collaboration. Int. J. Soc. Robot. 8(2), 193–209 (2016). https://doi.org/
10.1007/s12369-015-0333-8
4. Couch, A., Keniston, K.: Yeasayers and naysayers: agreeing response set as a personality
variable. J. Abnorm. Soc. Psychol. 60(2), 151 (1960)
5. Denissen, J.J., Geenen, R., Selfhout, M., Van Aken, M.A.: Single-item big five ratings in a
social network design. Eur. J. Pers.: Publ. Eur. Assoc. Pers. Psychol. 22(1), 37–54 (2008)
6. DeVellis, R.F.: A consumer’s guide to finding, evaluating, and reporting on measurement instruments. Arthritis Rheum. Off. J. Am. Coll. Rheumatol. 9(3), 239–245 (1996)
7. DeVellis, R.F.: Scale development: Theory and applications, 4th edn. Sage Publications (2016)
8. Fabrigar, L.R., Wegener, D.T., MacCallum, R.C., Strahan, E.J.: Evaluating the use of
exploratory factor analysis in psychological research. Psychol. Methods 4(3), 272 (1999)
9. Fink, A.: How to conduct surveys: A step-by-step guide. Sage Publications (2015)
10. Fischer, K., Jung, M., Jensen, L.C., aus der Wieschen, M.V.: Emotion expression in HRI–when and why. In: 2019 14th ACM/IEEE International Conference on Human-Robot Interaction
(HRI), pp. 29–38. IEEE (2019)
11. Furr, M.: Scale Construction and Psychometrics for Social and Personality Psychology. SAGE
Publications (2011)
12. Furr, M.R., Bacharach, V.R.: Psychometrics. An introduction. Sage Publications, Thousand
Oaks, CA (2008)
13. Hart, S.G.: Nasa-Task Load Index (NASA-TLX); 20 Years Later. In: Proceedings of the Human
Factors and Ergonomics Society Annual Meeting 50(9), 904–908 (2006) . https://doi.org/10.
1177/154193120605000909
14. Heerink, M., Kröse, B., Evers, V., Wielinga, B.: The influence of social presence on acceptance
of a companion robot by older people. J. Phys. Agents 2(2), 33–40 (2008)
15. Krosnick, J.A.: Questionnaire design. In: The Palgrave Handbook of Survey Research, pp.
439–455. Springer (2018)
16. Ninomiya, T., Fujita, A., Suzuki, D., Umemuro, H.: Development of the multi-dimensional
robot attitude scale: constructs of people’s attitudes towards domestic robots. In: International
Conference on Social Robotics, pp. 482–491. Springer (2015)
17. Nomura, T., Kanda, T., Suzuki, T.: Experimental investigation into influence of negative atti-
tudes toward robots on human-robot interaction. Ai Soc. 20(2), 138–150 (2006)

18. Porfirio, D., Sauppé, A., Albarghouthi, A., Mutlu, B.: Computational tools for human-robot
interaction design. In: 2019 14th ACM/IEEE International Conference on Human-Robot Inter-
action (HRI), pp. 733–735. IEEE (2019)
19. Schmidt, F.L., Le, H., Ilies, R.: Beyond alpha: an empirical examination of the effects of
different sources of measurement error on reliability estimates for measures of individual-
differences constructs. Psychol. Methods 8(2), 206 (2003)
20. Tomarken, A.J.: A psychometric perspective on psychophysiological measures. Psychol.
Assess. 7(3), 387 (1995)

Matthew Rueben is a human-robot interaction researcher working as a postdoctoral scholar in the Interaction Lab. Matt
received his PhD in Robotics from Oregon State University for
research on user privacy in human-robot interaction. His under-
graduate degree was the H.B.S. Degree in Mechanical Engineer-
ing, also from Oregon State University. Matt has collaborated
with legal scholars and social psychologists in an effort to make
human-robot interaction research more multi-disciplinary. Be-
sides privacy, his current interests include how humans form
mental models of robots—and how robots can be more transpar-
ent to humans.

Shirley A. Elprama has been a senior researcher at imec-SMIT-VUB since 2011. In her research, she investigates social robots, col-
laborative robots and exoskeletons at work and particularly
under which circumstances these different technologies are
accepted by end users. In her PhD, she focuses on the accep-
tance of different types of robots (healthcare robot, collaborative
robots, exoskeletons) in different user contexts (car factories,
hospitals, nursing homes) by different users (workers, nurses,
surgeons).

Dimitrios Chrysostomou received his Diploma degree in production engineering in 2006, and the Ph.D. degree in robot
vision from Democritus University of Thrace, Greece in 2013.
He is currently an Associate Professor with the Department
of Materials and Production, Aalborg University, Denmark. He
was a Postdoctoral Researcher at the Robotics and Automa-
tion Group of the Department of Mechanical and Manufac-
turing Engineering, Aalborg University, Denmark. He has co-
organized various conferences and workshops in Mobile Robotics,
Robot Ethics and Human Robot Interaction. He has served as
guest editor in various journals and books on robotics and HRI,
associate editor for several conferences including IROS and
ICRA and regular reviewer for the major journals and confer-
ences in robotics. He has been involved in numerous research
projects funded by the European Commission, the Greek state,

and the Danish state. His research interests include robot vision, skill-based programming and
human-robot interaction for intelligent robot assistants.

An Jacobs holds a Ph.D. in Sociology and is a part-time lecturer in Qualitative Research Methods (Vrije Universiteit Brussel). She is also program manager of the unit Data and Society within the imec-SMIT-VUB research group. She
is a founding member of BruBotics, a collective of multiple
research groups at Vrije Universiteit Brussel that together con-
duct research on robots. In her research, she focuses on future
human-robot interaction in healthcare and production environ-
ments in various research projects.
Qualitative Interview Techniques
for Human-Robot Interactions

Cindy L. Bethel, Jessie E. Cossitt, Zachary Henkel and Kenna Baugus

Abstract The objective of this chapter is to provide an overview of the use of the
forensic interview approach for conducting qualitative interviews especially with
children in human-robot interaction studies. Presented is a discussion of related
work for using qualitative interviews in human-robot interaction studies. A detailed
approach to the phases of a forensic interview is presented, which includes intro-
duction and guidelines, rapport building, narrative practice, substantive disclosure
interview, and the cool-down and wrap-up. There is a discussion on the process of
transcription and coding of the qualitative data from the forensic interview approach.
A presentation is provided detailing an exemplar study including the analyses of the
qualitative data. There is a brief discussion of the methods for reporting this type of
data and results along with the conclusions from using this approach for human-robot
interaction studies.

Keywords Qualitative data · Structured interview · Children · Methods

1 Introduction

A significant challenge associated with data collection for studies in human-robot interaction, especially with children, is determining the best methods for obtaining
rich and meaningful data. In general, children under the age of 11 have a difficult time distinguishing between different levels of information such as those

C. L. Bethel (B) · J. E. Cossitt · Z. Henkel · K. Baugus


Department of Computer Science and Engineering, Mississippi State University,
P.O. Box 9637, Mississippi State, MS 39762-9637, USA
e-mail: clb821@msstate.edu
J. E. Cossitt
e-mail: jec570@msstate.edu
Z. Henkel
e-mail: zmh68@msstate.edu
K. Baugus
e-mail: kbb269@msstate.edu

found in Likert scale-based surveys, understanding vague and abstract questions, and understanding partially labeled responses [3, 11]. As cognitive development increases after around age 11, the use of survey data may become more accurate.
However, the data from surveys and self-report questionnaires is limited in scope
when compared with the richness of information obtained from structured and semi-
structured interviews. This is typically true for different age groups including ado-
lescents and adults. One approach that has been helpful is the use of the forensic
interview approach to structured and semi-structured interviews [9]. The forensic
interview approach is a structured protocol that has been established for obtaining
critical information especially from children. It can also be useful for obtaining
information from adolescents in addition to adults. The emphasis of the structured
forensic interview is the use of open-ended prompts for information versus the use
of focused recognition prompts that are frequently used in structured interviews [9].
The open-ended nature of the interview questions allows for more free responses
that help to reduce the likelihood of the interviewer influencing the memory of the
person being interviewed. It is important for accuracy of information gathering that
leading questions or focused/closed-ended questions are not used as it could impact
the person’s memory of the event or experience.
The qualitative analysis of the data from using structured and semi-structured
interviews can be tedious and challenging to perform, though the results can be
beneficial to advancing our knowledge of interactions between humans and robots
and other related technologies. This chapter begins with a brief overview of research
related to the use of interviews for gathering information located in Sect. 2. In Sect. 3
there are details on how to perform studies using the forensic interview approach.
The next section will cover the process of transcription and coding of the data, refer
to Sect. 4. In Sect. 5, an example study is presented along with the transcription,
coding, and analyses performed. Lastly, conclusions are presented in Sect. 6.

2 Related Work on Qualitative Interviews

This section includes some representative examples of studies that have used inter-
views for gathering information related to the topic of interest. It is not meant to be
a comprehensive review of all possible studies, but rather to provide a foundation of
examples related to the use of structured and semi-structured interviews in research.
A study conducted by da Silva et al. focused on doing a qualitative analysis of
participants’ interactions with a robot as the robot conducted a motivational interview
about physical activity [14]. This style of interview is a counseling approach that focuses on interactive, two-way communication with the interviewees and leads them to make their own conclusions about the topic. A NAO robot was used to deliver the motivational interview using a script devised in such a way as to provide a flow of conversation regardless of how questions were answered, and the interview was evaluated using Shingelton and Palfai’s criteria for assessing technology-driven adaptations of motivational interviewing. Participants who had expressed a desire to

increase their level of activity engaged in the motivational interview with the robot
where the robot asked them questions, and they responded before tapping the robot’s
head to indicate that they had finished speaking. The interviews were not recorded
due to concerns regarding participants’ levels of discomfort in the situation, and
instead qualitative data was acquired in the form of typed answers to free response
questions about the experience. The authors do express the desire to further their
research in the future by recording the interviews after a short acclimation phase.
Participants’ responses were coded using thematic content analysis that defined a
unit of the data set as all of the responses of a participant. The data set was analyzed
to develop codes and create a list of themes and sub-themes. The data was coded by
two coders working independently, who then came together to discuss discrepancies and revise the coding scheme until agreement was achieved. Then two more coders
with no knowledge of the study worked independently following the scheme to re-
code the data. Inter-rater reliability was then determined based on the percentage
of agreement on coded data for each unit of the data set. The analysis showed that
participants found the experience helpful and were especially fond of getting to hear
themselves talk about their concerns while not having to fear being judged by a
human interviewer [14].
Qualitative interviews were used to gather data about the effects of using a robotic
seal as a therapeutic tool for the elderly in a study conducted by Birks et al. [2]. The study took place at an eldercare facility in Australia where three recreational therapists, who were employees of the facility, were trained to use Paro robotic seals as therapeutic tools and did so with residents of the facility over the duration of
four months while keeping detailed journals of each session. After this time, the
therapists were interviewed about their experiences using the seals as a therapeutic
tool. Transcripts from the interviews with the therapists were created and provided
verbatim responses for each interview. Following transcription, two researchers did
a thematic analysis to code the collected data both from the interviews and the
therapists’ session notes. A third researcher was used to check the finished coding.
The thematic coding showed three themes in the responses that the authors list as
“(1) a therapeutic tool that’s not for everybody, (2) every interaction is powerful,
and (3) keeping the momentum.” The data was coded based on these three themes;
however, some of the data was split into sub-themes. Results of the analysis showed
that the robotic seals had high potential to be a useful therapeutic tool based on the
experiences of these recreational therapists [2].
Jeong et al. investigated the potential use of a social robot to offer therapeutic
interactions with children in pediatric hospitals [7]. The project utilized a robotic
teddy bear called Huggable and compared patients’ interactions with it to their inter-
actions with a virtual teddy bear on a screen as well as with a normal plush teddy
bear. Children who were patients at a pediatric hospital were given one of the three
bears for a duration of four hours to interact with as they wished. During the sessions
involving the Huggable robot and the virtual bear, the bears were operated remotely
by a researcher outside the room who used the bears to communicate and play games
with the children.

The interactions in each condition were video-recorded following an initial thirty-minute introductory period. The video recordings were transcribed to include ver-
bal data as well as data of physical movements. A transcription of the verbal data
was created by a professional transcriber, and this transcription identified who was
speaking for each utterance with possible speakers including “Patient, Huggable,
Moderator, and Other.” Physical movement was transcribed by the use of a number
between 0 and 1 with 0 meaning no movement and 1 meaning fully active movement.
The data that was coded from the transcriptions included average movement score,
number of utterances, and length of interaction. Post-study questionnaires were also given to the staff members who acted as moderators, though it is not clearly stated what kind of questionnaires were given. Results showed that, of the three conditions,
children were the most verbally and physically engaged with the Huggable robot.
These results were partially corroborated by the moderators whose questionnaires
reflected that they found the virtual bear and Huggable to be equal but both better
than the standard plush bear [7].
In a study by A. M. Rosenthal-von der Pütten and N. C. Krämer, participants were
interviewed about their opinions of pictures of robots in order to collect qualitative
data about their attitudes towards robots that could be considered “uncanny” or very
human-like without appearing fully human [12]. Both adults and children participated
in the interviews where they were shown pictures and videos of robots with varying
degrees of likeness to humans. Interviewers asked the participants questions based
on an interview guide. The questions prompted participants to freely respond about
their thoughts regarding robots in general as well as their emotions and anxiety levels
when shown specific robots. Auditory recordings of the interviews were collected and
then transcribed including pauses and excluding filler words. Responses were coded
by two different individuals working independently using the software MAXQDA
(https://www.maxqda.com/). Coding schemes for the data were determined based on
responses to questions. For example, the response to a question regarding how a robot
made a person feel would be coded as “positive,” “negative,” or “no response.” The
coded data was used to identify an extensive amount of information about perceptions
of uncanny robots [12].
In an attempt to determine what causes breakdowns in interactions between chil-
dren and technological tutoring systems, Serholt conducted a study where a robot
tutor was placed in an elementary classroom to interact with students over the course
of six months [13]. A NAO robot was used for the study as well as an interactive
screen. The robot could carry out scripted interactions including talking and ges-
turing to the screen while instructing students on certain classroom tasks such as
map reading and sustainability. Video recordings of the students’ interactions with
the robot were collected and later analyzed to determine when breakdowns occurred
based on a list of breakdown indicators. Videos were always viewed at least twice for
the sake of reliability and accuracy. Video segments that were determined to show
breakdowns were coded based on the indicators that were shown as well as details of
the interaction. Thematic analysis of the data showed that there were common themes
causing breakdowns including “the robot’s inability to evoke initial engagement and

identify misunderstandings, confusing scaffolding, lack of consistency and fairness, and controller problems.”
These studies provide some examples of different ways in which qualitative data
may be collected and analyzed. It also provides some information regarding the
richness of qualitative data and the possible benefits for using this approach. The use
of interviews and open-ended questioning either through verbal interviews or from
written responses, provides the ability to learn trends in data and responses that may
not be evident from using surveys and self-report assessments. This approach allows
users to make statements without being constrained by a set of survey questions for
identifying their thoughts, feelings, and interactions. Overall, if performed correctly,
there is less suggestibility and influence from the researchers, who may be able to obtain information that otherwise would not be possible. One caution: the Hawthorne Effect may still play a role with this approach, whereby participants may say what they feel the researcher wants to hear instead of what they are truly feeling
[10]. Therefore, it is important when conducting a study that the researchers do not
express any feedback on the responses given by participants or their performance
during the study.

3 Approach

There are several approaches that could be used to perform a structured interview;
however for the purpose of this chapter, the focus will be on the use of the forensic
interview process and approach. This process was developed for use with children
who had experienced maltreatment or were eyewitnesses to crimes to gather infor-
mation for a legal case [9]. The forensic interview approach has been very effective
especially when working with children. The use of open-ended questions allows the
person being interviewed to rely on their memory of an experience rather than the
recognition of options presented by the interviewer, which can be misleading. The
forensic interview approach is beneficial in obtaining a person’s feelings about an
experience and less likely to introduce confounds into that process. This process
encourages the interviewer to avoid yes/no questions and questions that provide lim-
ited responses, such as multiple choice. If a multiple choice question type is used then
it is important to follow that up again with an open-ended question to gain additional
knowledge. As part of this approach there is a protocol in place for the interviewer
to follow. The protocol will be outlined in the following sections.

3.1 Introductory Phase and Guidelines

The first part of the forensic interview process is for the interviewer to introduce him
or herself to the person being interviewed. Next, it is important to explain the purpose of their participation in the study and the tasks they are expected

to do. Also the interviewer should tell the interviewee what his or her role is in the
study and a bit about their background. Once that has been completed it is important
to provide guidelines for the interview process. The following is an example of the
items to include as part of the guidelines used in a forensic interview. It is important
to provide these guidelines regardless of the age of the participant, but it is even more
important when working with children.
Guidelines:

1. If you do not know or remember the answer to a question, then just let me know.
With children it is recommended that you then practice that by giving an example,
such as “Tell me the name of my dog.” It is expected that the child would say that
they did not know the name of your dog.
2. If you do not understand the question, then ask me to provide the question in
a different way to help you better understand. This would be followed by an
example such as “Tell me your ocular color.” The child would likely say they did
not understand. The interviewer would then change the question to be, “Tell me
your eye color.” The child should then be able to answer that question.
3. If you need me to repeat a question or information, please ask me to repeat it.
This typically does not require an example, even when working with children.
4. If I say something that is not correct, please let me know and give me the correct
information. As an example, you can tell a male child that is 10 years old, “What
if I said you are a 5 year old girl?” The male child should correct you and tell
you that he is not a 5 year old girl. Then provide the correct response.
5. I ask that you please tell the truth as you recall it for any questions you are
answering. Do you promise to tell the truth today during our time together? It is
important for whoever is being interviewed to agree to tell the truth throughout
the process.

NOTE: it is important to audio/video record the interactions, including the guideline process, so that transcriptions can be accurately performed at a later date, using proper informed consent as to how the data may be handled.

3.2 Rapport Building

The next phase of the forensic interview process is considered rapport building
with the person being interviewed. Most people will not be comfortable sharing
information with someone they do not know and this is especially true with children.
It is important to establish some level of rapport to overcome this. Rapport can
be established by asking some general questions that do not seek critical information. Some examples of questions that can be asked are:

• Tell me about who you live with.
• Tell me about the place in which you live.
• Tell me about what you like to do for fun.
When asking questions using this approach it is important to actually make them
statements instead of questions. Commonly, people will ask questions in the format
of “Can you tell me about who you live with?” and that typically leads to a yes/no response. By making a statement, it requires the person to give more than just a one-
word response. This is one of the most challenging aspects to performing this type
of interview for the interviewer.

3.3 Narrative Practice

The next phase of the forensic interview approach is called narrative practice. This
involves asking the person being interviewed to discuss a topic in as much detail
as they can remember. For example, it is common to ask them to tell you about
everything they did that day from the moment they woke up until they arrived to
meet with you. As part of this process, you may pick one aspect of their detailed
account and request that they provide more details, such as “tell me more about what
you had for breakfast” or “tell me more about playing basketball with your friends
today.” This gets the person used to discussing in detail about a topic area. Some
additional possible follow on statements may be:

• Tell me about how you were feeling when ... happened.
• Tell me about how that made you feel physically in your body.
• Tell me more about that.—this is commonly used in the open-ended interview
questions.
• Tell me about what happened next.
• Tell me about the first time, the last time, or the most recent time that this happened.
• Tell me about how you learned to do ...
• Tell me about anything else you would like to share that happened.

The important aspect of this phase is to get the person being interviewed used to
sharing information in as much detail as possible on a topic that is not related to the
exact topic of interest you are trying to investigate.

3.4 Substantive Disclosure Interview Phase

The substantive interview phase of the forensic interview protocol is the part in which
questions of interest or investigation are discussed. The substantive interview is the
key component of this qualitative interview approach. This is where you ask the per-
son being interviewed about what is being studied. In the case of our exemplar study

presented in Sect. 5, the primary topic being investigated was children’s experiences with bullying. This is a sensitive topic that most people are not comfortable discussing right away with someone they do not know well. That is why following the
forensic interview protocol is important to have a more comfortable interaction.
In the case of the exemplar study, participants were interviewed using the forensic
interview protocol by either a human interviewer or a robot interviewer. In the case
of the robot interviewer, following the forensic interview protocol allowed time for
the person engaged with the robot to overcome the novelty effect [1]. The novelty
effect is the initial excitement and positive interactions as a result of new technology.
It may impact how a person views the technology and this may change over time
when the newness of the interaction “wears off.”
It is during the substantive interview phase that the interviewer asks the partici-
pant the key questions or statements of interest. Some recommended statements for
eliciting information may be:
• Tell me about how you felt emotionally ...
• Tell me about how you felt physically in your body when ...
• Tell me what symptoms you felt when you ...
• Tell me about what happened next.
These are just some examples to guide the interview process. It is also possible
to use the prompts and statements from the narrative practice phase of the forensic
interview (refer to Sect. 3.3). The important aspect is to make sure that the statements
or questions being asked are open-ended in nature to get the person talking in detail
about the topic of interest. This may not seem natural to the interviewer at first,
because so often in conversations there is a tendency to ask closed-ended questions
or multiple choice questions. If that is needed to get people talking, then it is rec-
ommended to follow that type of question with an open-ended question or statement
to obtain more details or information. This is where this approach holds the most
value because it provides rich data and deeper insights than what can typically be
obtained through a survey or self-assessment questionnaire. Also it is typically done
at the time of interaction or immediately following an interaction so the information
is fresh in the mind of the person being interviewed.

3.5 Cool-down/Wrap-up Phase

This is the last phase of the forensic interview protocol. In this phase it is important
to return the person being interviewed to everyday life and normalcy. In this case, you may use a statement such as “Tell me about your plans for the rest of today.”
This redirects the focus away from the topic of interest and gets them thinking about
what is to come in their day and returns them to everyday life. At this point, for
research purposes you can do a debrief of the study where you discuss the purpose
of the study. You may also want to ask the participant not to share any information about the study with others, so that others can have a similar experience if they participate.

This may also be a good time to have participants complete any relevant surveys
on what is being investigated. It is always recommended that the interviewer thank
the participant for his or her time and effort in being involved in the research. After
this point, the researcher/interviewer can go through normal procedures for ending
participation in the study as they would in any study such as paying out incentives
or other closing activities.

4 Transcription and Coding of Data

Once all the data is collected for a study, the next phase of a qualitative study is to begin transcription of the interview data. This is a tedious process of listening
to the audio portion of the audio-video recording of the interviews and typing out a
transcript or record of everything that was said during the interview. There is some
disagreement as to whether filler words, such as “um,” “ahhh,” and similar, should be included in the transcription. Depending on what is being studied, a person who is more nervous or uncertain may have higher numbers of these filler words, and that may be important if the investigation concerns, for example, interactions between a human and a robot. It is up to the discretion of the researcher whether these types of utterances are included; however, it is important to be consistent in what is included within a particular study. Once all of the data is transcribed and there is a complete written record of the interview, this data is coded and categorized in a manner that can be interpreted and compared. This section presents details on the
transcription and then the coding process.

4.1 Transcription Process

Once video of an interview has been collected, it is necessary to create a written record or transcription of all the auditory data. The transcription process must reliably produce an accurate representation of everything said in the interview, usually including filler words like “um” and “ah.” It is helpful to have more than one person
working on transcribing the same data in order to ensure a level of accuracy higher
than just one person’s understanding of what was said.
Typically, the transcription process involves watching a small bit of video and
pausing to denote what is said before moving on to another small bit of video or
re-watching, if necessary to capture more details. The process can be quite tedious
and time consuming, but there are some tools to help it go more smoothly. An
example of one such tool is the use of foot controls. When using foot controls for
transcription, one has the ability to control the video playback with a foot pedal, keeping the hands free for typing a written record of the data. One example that our research team has used is the Infinity in-USB-1 USB Computer Transcription

Foot Pedal https://smile.amazon.com/gp/product/B008EA1K66/ref=ppx_yo_dt_b_asin_title_o03_s00?ie=UTF8&psc=1.
Another useful tool is software designed for use in video transcription. One exam-
ple of transcription software is ELAN, which is produced by The Language Archive.
ELAN is a program that allows the user to add in-time annotations to video or audio
files and also supports searching and exporting the transcribed data [5]. This software
is open source and available for download at https://tla.mpi.nl/tools/tla-tools/elan/.
While there is no method to effortlessly create an audio-video transcription, taking
advantage of available tools greatly eases the process. There is no easy way to per-
form this process, which is why many researchers prefer quantitative data collection methods such as surveys because it is much easier and more straightforward to process the data and quickly obtain results. Often researchers will report in publications that qualitative data was collected as part of the study, and that the analyses will be presented at a later date. Those results often do not appear in publications.
It takes time and personnel to process this type of data, but it is often worthwhile
because deeper knowledge and insights can be obtained from open-ended interview
data.

4.2 Coding the Transcribed Data

After video data has been transcribed, it must be coded or translated into a format so
that the data can be analyzed. The first step to coding transcribed data is to decide
on a system to use to create quantitative data that makes sense with the kind of
responses that are being coded. This could take the form of rating responses using a
Likert scale, determining whether responses display certain qualities using a binary
system, deciding which of a set of categories is the best fit for responses, sorting
responses by recurring themes found through thematic analysis [2, 13, 14], or any
other similar scientific system that makes sense for the data being analyzed. Adequate
training of whatever system is chosen must be provided to the people who will be serving as data coders. Training usually involves providing the coders with sufficient documented information on how the data should be coded as well as testing the coders
with a set of dummy data to ensure that they are capable of performing the task before
they begin working with the actual data from the study.
It is essential to ensure that the coding of transcribed data produces consistent and
reliable quantitative data. Such consistency can be reached by establishing strong
inter-rater reliability. This is accomplished by having multiple people involved in
coding the same data. The process requires at least three coders but can include
more for an even higher level of reliability. In the case of three coders, two of these
coders work independently coding the same transcribed data before coming together
and finding any discrepancies in their coding. The third coder determines how the
data should be coded in the case of the discrepancies [16]. Additional coders can be
used in the initial coding process to provide more viewpoints as well as in the tie-

breaking process to form a committee to objectively determine coding in the case of discrepancies.
Data coding is a time-intensive process, and because of the necessity of establishing reliability, it can be very labor-intensive as well. While these factors are certainly drawbacks of the transcription and coding process, the effort is worthwhile because of the robust data that is produced when taking the time to properly analyze the qualitative
data.
In addition to coding and analyzing transcribed data, the behaviors of the person
during an interview may also be coded. This involves a similar process of having
a minimum of two coders categorizing behaviors and/or facial expressions during
an interaction. This may be important in human-robot interaction studies. They may
code body positions, postures, and facial expressions or other behaviors of interest.
This data may then be quantified and compared across participants of the study to
determine trends in behaviors within the study and during the interactions.
A common measure for determining the reliability of coding is a statistic known as Cohen’s kappa. Cohen’s kappa is used to measure the level of agreement
between two coders of qualitative data, whether it is transcription data or behavioral
data. Many statistical analysis software packages can be used to perform this statis-
tical test and there are also online calculators available such as https://idostatistics.
com/cohen-kappa-free-calculator/. The scale to evaluate Cohen’s Kappa and relia-
bility is:

• 0.01–0.20 slight agreement
• 0.21–0.40 fair agreement
• 0.41–0.60 moderate agreement
• 0.61–0.80 substantial agreement
• 0.81–1.00 almost perfect or perfect agreement
It is commonly agreed that a Cohen’s Kappa score of 0.60 or greater indicates
satisfactory reliability within the coding of a set of data [10]. It is important to perform
this evaluation and report it as part of any study that involves qualitative data.
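As a minimal illustration of how the statistic is computed from two coders’ labels, consider the sketch below. The codes and data are invented for illustration, and in practice the statistical packages or online calculators mentioned above would normally be used.

from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' categorical labels on the same units.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of
    agreement and p_e is the agreement expected by chance given each coder's
    label frequencies.
    """
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes assigned by two independent coders to ten interview
# responses ("P" = positive, "N" = negative, "O" = no response).
coder_1 = ["P", "P", "N", "O", "P", "N", "N", "P", "O", "P"]
coder_2 = ["P", "P", "N", "O", "N", "N", "N", "P", "P", "P"]
print(round(cohens_kappa(coder_1, coder_2), 2))  # 0.67, substantial agreement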

4.3 Coding Written Responses

Qualitative interview data can sometimes take the form of written responses to open-
ended questions rather than transcribed audio or video data. In such situations, coding
the data works essentially the same way that it would if it did come from a transcribed
video, the main difference being that transcription is not necessary because the data
is already in a text format. This allows studies that use written responses to collect
qualitative data (see [4, 14]) to do so while skipping the transcription step. This can
save time and equipment, but does not always provide data that is as robust as that
gathered by transcribing audio or video. Participants may get tired of writing or may not want to write out a response, often resulting in shorter responses.

5 Exemplar Qualitative Interview Study

This section provides an exemplar study that used qualitative interviews for informa-
tion gathering with children. This was part of a broader project investigating the use
of social robots as intermediaries for collecting sensitive information from children.
Our research team conducted two studies focused on conducting forensic interviews
with children concerning their experiences with bullying at school.
Though a rich array of data was collected during these studies (e.g., behavioral
measures, participant self-reports, parental surveys, etc.) this section focuses on
the qualitative data captured in a verbal structured forensic interview with partic-
ipants concerning their perceptions of the interviewer (human or robot) immediately
following their session.
Detailed findings from this research effort are available in a journal article (see
Henkel et al. [6]); however, in order to provide an example for this chapter, the relevant details of the research questions, study design, data collected, analysis approach, and findings are presented in this section, with an expanded focus on the qualitative data.
Excerpts from that article are presented as part of this chapter [6].

5.1 Research Questions

The exemplar research effort focused on the question of how a forensic interviewer’s
characteristics (i.e., robot or human, and male or female) would affect the likelihood
of children disclosing sensitive information related to their experiences with bullying
at school. In addition to assessing disclosure behavior across conditions, we were
interested in gaining an understanding of how participants perceived the robot or
human conducting the forensic interview.
Though our past experiences conducting studies with children using social robots
in sensitive domains helped to inform this inquiry, the approach can be characterized
as exploratory in nature. While it was hypothesized that differences would exist
between interviewer conditions, no prior predictions about the areas or directions in
which differences would be observed were made. Determining the trends and exact
differences between interviewer types was an investigative process that was well
supported by the use of the open-ended forensic interview approach [9].

5.2 Study Design

As the context surrounding data is critical to interpreting and making use of the
data, this section describes the design of the two larger studies from which data
regarding participant perceptions of the forensic interviewer were obtained. In Study

A participants were between ages 8 and 12, while in Study B participants were between
ages 12 and 17.
Both studies focused on using a structured forensic interview technique (as devel-
oped and investigated by Lamb et al. [9]) to obtain information about a participant’s
personal experiences with bullying at school (refer to Sect. 3). An interdisciplinary
approach was taken to developing a dynamic script to guide the interview, with the
research team’s sociologist (experienced in this area of inquiry) ensuring full cov-
erage of data typically collected during an investigation of bullying. This research
was approved by and conducted with guidance from the Mississippi State University
Institutional Review Board.
Both Study A and Study B followed the same base script addressing the areas of
physical, relational, and verbal aggression, but Study B (older children) also included
additional questions specifically addressing cyberbullying. Though following a pre-
scripted structure during the interview, in all conditions a participant’s responses
determined the specific follow-up questions delivered by the interviewer. Addition-
ally, information provided by the participant during the interview was incorporated
into the follow-up prompts and interviewers responded to requests for clarifications
or other spontaneous requests if needed.
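The interviewing software used in these studies is not described at the code level in this chapter. Purely to illustrate the idea of a dynamic script whose follow-up prompts depend on, and reuse, what the participant has said, a hypothetical sketch might look like the following; all prompts, keys, and function names are invented and do not describe the authors’ actual system.

# Hypothetical sketch of a branching interview script; not the software
# actually used in the studies described here. Each topic has an open-ended
# opening prompt and a follow-up template that reuses a detail the wizard
# extracted from the participant's previous answer.
script = {
    "verbal": {
        "prompt": "Tell me about a time someone at school said something that hurt your feelings.",
        "follow_up": "Tell me more about how you felt when {detail}.",
    },
    "physical": {
        "prompt": "Tell me about a time someone at school pushed or hit you.",
        "follow_up": "Tell me about what happened next after {detail}.",
    },
}

def cue_prompt(topic, detail_from_response=None):
    """Return the opening prompt for a topic, or a follow-up prompt that
    incorporates a detail from the participant's last response."""
    node = script[topic]
    if detail_from_response is None:
        return node["prompt"]
    return node["follow_up"].format(detail=detail_from_response)

# Example flow, as the hidden researchers might cue it:
print(cue_prompt("verbal"))
print(cue_prompt("verbal", detail_from_response="a classmate called you names at recess"))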
Both studies used a between-participants design, employing random assignment
to pair a participant and interviewer. In Study A possible interviewer assignments
were: female human, male human, female humanoid RoboKind robot, or male
humanoid RoboKind robot. In Study B the possible interviewers were: male human,
male humanoid RoboKind robot, or male humanoid Nao robot.
Study sessions were conducted in a dedicated lab space at the Social Science
Research Center on the campus of Mississippi State University (refer to Fig. 1).
Two separate rooms were used for participant interactions, one for the researcher
and participant to interact and the other for the participant and forensic interviewer
to interact. All sessions (human and robot) were conducted using a Wizard-of-Oz
[8] approach in which two hidden researchers worked collaboratively using camera
feeds and a custom software system to remotely direct the flow of the interaction.
These researchers entered spoken information into the software system, cued the
questions an interviewer would deliver next, and ensured the interviewer remained
engaged and responsive during the session. In the case of robot interviewers, the
output of the software system directly activated the robot’s speech and movement
behaviors. Human interviewers delivered the same prompts, but were guided by a
concealed teleprompter that projected prompts and cues on the wall directly behind
the participant or onto a tablet the human interviewer held in their lap.
In total three robotic platforms were used between the two studies. In Study A male
and female RoboKind R25 robots were used, while in Study B a male RoboKind R25
robot and a blue-colored humanoid Nao V5 robot with the same synthetic male voice
were used (refer to Fig. 2).

Fig. 1 Layout of the study rooms and space at the MSU social science research center

5.2.1 Forensic Interview About Bullying Experiences

The main portion of each study session was an interview conducted by either a
robot or human. At the beginning of the interview the interviewer communicated the
guidelines consistent with forensic interviews. The introduction script was as follows
with excerpts from one participant:

Fig. 2 Male RoboKind R25 robot (left), Nao robot (center), and female RoboKind R25 robot (right)

interviewer: Hi <participant name>, it’s nice to meet you. I’m the interviewer
for the study today and would like to talk to you for a little while about your
experiences at home and school. Does that sound okay to you?
participant: Yes [nods]
interviewer: Okay, great. If you can take a seat over there, we can get started.
Once we are finished <researcher name> will come back and take you to the
next part of the study.
interviewer: Like <researcher name> said before, my name is <interviewer name>
and I spend a lot of time talking with kids about the different things they
see and do at home and at school. Before we get started, do you have any
questions for me?
participant: No.
interviewer: Before we begin, there are a few important things you should
know. The first is if at anytime during the interview you are not sure about
the answer to a question, just tell me that you don’t know. Let’s practice this.
So if I asked you, what’s my dog’s name, what would you say?
participant: I don’t know.
interviewer: If I ask a question during the interview that is confusing or that
you don’t understand, just tell me and I will explain what I mean. Let’s
practice this. So if I asked you, what is your ocular color, what would you
say?
participant: I don’t know what that means.
interviewer: It’s okay to tell me you don’t know what that means. So I’ll ask
it in another way. What color are your eyes?
participant: Blue.
interviewer: It’s also important that if I say something wrong during the inter-
view, you should tell me. For example, if I said you were a 4 year old boy,
what would you say?
participant: I am not a 4 year old boy.
interviewer: And the last important thing is that you tell the truth during the
interview. Do you promise you will tell the truth today?
participant: Yes, I promise.

Next, rapport with the participant was built via questions about the participant’s
hobbies, family, and events that occurred recently. This sequence also allowed the
participant to practice providing a narrative description. An example exchange is as
follows:

interviewer: So first, I’d like to get to know you better. What’s something you
like to do for fun?
participant: I like speech and debate. I’m in the speech and debate team at
school.
interviewer: How did you learn to do that?
participant: Well this is my first year doing it. My best friend is on the team
and she recruited me to do it because she thought I’d be good at it because
I’m into politics. She asked me if I would join the team and then I did. My
coaches have taught me a lot.
interviewer: How often do you get to do that?
participant: We have tournaments almost every weekend, and we have prac-
tice almost every day after school.
interviewer: Are there other people that you do that with?
participant: Yeah, we have a pretty big team.
interviewer: Thank you for telling me about that. I’d like to know about things
that happen to you. Can you tell me everything that happened yesterday in
as much detail as possible, from the time you woke up until you went to
bed?
participant: I went to school, ate breakfast, went through my classes, and went
home. I took a nap, then played a video game for a little while. I ate supper
with my mom then went back to bed.
interviewer: Tell me more about playing a video game.
participant: I wasn’t really doing much.
interviewer: Earlier you mentioned eating supper with your mom, tell me
more about that.
participant: She had bought something for us to eat.
interviewer: Is there anything else you remember about yesterday?
participant: No.

Following the rapport-building and narrative practice phases, the interviewer
began to explore the substantive issue of bullying experiences at school. After
discussing how kids get along with each other at school the interviewer inquired
about verbal aggression, relational aggression, cyber aggression (Study B only), and
physical aggression. If the participant disclosed relevant aggression experiences at
any stage, follow-up questions were asked to fully characterize their experience.
A representative interaction is as follows:

interviewer: Now I’d like to learn a little bit more about you and your family.
[pause] There are a lot of different types of families today. Tell me about
your family and who lives with you.
participant: My parents are still married. They have been married for a really
long time, and I have three other siblings. I have an older brother, older
sister, and younger brother.
interviewer: Who is the person that spends the most time with you when you
are at home?
participant: Probably my sister. Me and sister are best friends.
interviewer: How do you feel about that person?
participant: I love her. We act the exact same. I got really lucky that she was
my sister.
interviewer: Now let’s talk about your friends. Can you tell me about your
closest friends at school?
participant: One of my closest friends, the one who recruited me to the debate
team, we spend a lot of time together. Especially now that I got my license.
We hang out all the time.
interviewer: If you were going to tell someone a secret, who would it be and
why?
participant: Probably her because I know I can trust her with all of it, and she
won’t tell anyone else.
interviewer: I’d like to talk some about how the kids at your school get along
with each other. Let’s start with the way kids talk to each other. Please tell
me about the different ways kids talk to each other at school.
participant: Well, it’s really really clique-y at my school so there is the cheer-
leading clique and then there is like the baseball clique and just like different
things like that but most people, they’re pretty nice to you. A lot of people
are pretty nice to you, to your face, but a lot of people talk when they’re
not with you. There are people who are a little bit rude to people who are
different than them, which isn’t fun to see but for the most part people are
pretty nice face-to-face. The real problem is whenever they’re not together.
interviewer: Sometimes kids say mean or unkind things to other kids to make
them feel bad. Can you tell me about this happening at your school?
participant: Yeah, I’ve witnessed that a few times, especially with people who
are different, like people with disabilities and stuff. I’ve witnessed that a lot,
which is terrible to see. I have a friend who has Asperger’s Syndrome, so
obviously he functions a little bit differently, but he’s still really really nice,
but people just treat him so differently and kinda talk down on him and make
fun of him when he does stuff that they’re just not really used to, which isn’t
fair. There are some just rude things that–especially whenever people don’t
agree with each other about things, you shouldn’t be mean about but a lot
of people, if someone is different they automatically criticize.
interviewer: Do the kids who say mean things to other kids do this a lot?
participant: Yeah
interviewer: How often do they do this?
participant: I have witnessed it. Not so much about the disabilities thing but
whenever people disagree about pretty much about anything; it happens
almost everyday. Especially on social media, it happens so much, within
the school. Then making fun of people that are different like people with
disabilities and stuff, I witness that maybe like twice a week or three times
a week.
interviewer: Which kids do they normally say mean things to?
participant: Recently, surprisingly, you wouldn’t think this would be going
on today in 2017, but recently it’s been a lot of like political issues, which
is crazy. Our volleyball team knelt during the national anthem so it made a
big deal in the school, so there were people who agreed with it and people
who disagreed with it, and all day people were just criticizing each other
about it literally all day. High school students today have a really hard time
accepting that someone else can believe something different than they do,
and if someone thinks differently it is so quick and easy to criticize than it
is to listen, so that happens a lot.
interviewer: How often do these kids say mean things to you?
participant: I am pretty confident in my opinions, if I think someone is doing
something that I don’t think is okay, I’ll be like, “hey that’s not okay,” and
I get criticized about that a good deal.
interviewer: What happens when they say mean things to other kids?
participant: A lot of times people don’t stand up and say anything, and some-
times people will agree and say the same thing. But sometimes students will
say, “hey that’s not okay.” But it’s a lot easier to just go with the crowd and
just laugh about it.
interviewer: How do you feel when they say mean things to others?
participant: It really sickens me to see it, because it’s just so unfair I think.
interviewer: Sometimes people talk about other people and say mean or untrue
things about them. For example, kids might spread rumors, gossip, or tell lies
about someone to hurt them. Can you think of examples of this happening
at your school?
participant: Oh yes. That happens all the time. I think that’s the primary issue
at my high school: gossiping and rumors being spread all the time.
interviewer: Let’s think about the different groups of kids at your school.
Sometimes kids will leave others out of the group or ignore them on purpose.
Can you tell me about this happening at your school?
participant: Yeah, I think that does happen a lot.
interviewer: Are there kids who do this kind of thing a lot?
participant: Probably everyday. You see it everyday at lunch.
interviewer: Which kids do they normally leave out?
participant: The people who are not so popular, not dressed with the trends
or whatever.
interviewer: Are there kids at your school who leave you out?
participant: Yeah, there has been friend groups that I was part of that I nec-
essarily wouldn’t always like get invited to things.
interviewer: What happens when they do this to you?
participant: It does kinda hurt cause it’s like, “Why wasn’t I good enough?”
but most of the time I get over it pretty quickly and go talk to someone else.
interviewer: How do you feel when kids leave others out?
participant: It makes me feel bad, and sometimes I’ll be like, “Do you want
to come sit with us?” but sometimes I don’t, because [pause] it’s – it’s a lot
easier to not say anything. It is sad to admit, but sometimes it’s just easier
to say, “Well that really sucks,” and kind of pity them but not do anything
about it.
interviewer: Is there anything else you would like to share about that?
participant: I don’t think so.

At the end of each aggression section, participants responding with relevant expe-
riences were prompted to characterize the aggressors to identify the power dynamic
between aggressor and victim. The interview’s final phases focused on the partic-
ipant’s definition of bullying and on closing the interview by thanking the participant
and discussing any fun things they were planning to do in the near future.

5.2.2 Perceptions of the Interviewer Interview (PII)

After completing the forensic interview, a researcher led the participant to a separate
room and verbally administered a semi-structured interview about their perceptions
of the human or robot performing the forensic interview. The Perceptions of the
Interviewer Interview (PII) was a set of open-ended questions developed and refined
over the course of six different studies involving robots and humans interviewing
children about sensitive topics. The interview was structured as follows:
General Perception Questions

• Q1: What did you think about <interviewer name> during the study?
• Q2—Robot only: Do you think <interviewer name> was aware of what was going
on around her/him? Why or why not?

Understanding, Feelings, and Advice Questions

• Q3: How well do you think <interviewer name> understood what you said?
• Q4: How well do you think <interviewer name> understood how you felt?
• Q5: Do you think <interviewer name> could give you helpful advice if you had a
problem? Why or why not?
• Q6: Sometimes people hide how they feel from others. Do you think you could
hide how you feel from <interviewer name>? Why or why not?
• Q7: Are there things you could talk to <interviewer name> about that you could
not talk to other people about? What kind of things could you talk about with
<interviewer name>?
• Q8—Robot only: How is <interviewer name> different from a human?

Helpfulness Questions

• Q9: Was <interviewer name> helpful to you?
• Q10: In what ways did you feel like <interviewer name> was helpful to you?
• Q11: How could <interviewer name> be more helpful to you?

Likability and Social Norms Questions

• Q12: Did you like <interviewer name>? Why or why not?
• Q13: Do you think <interviewer name> liked you? Why or why not?
• Q14: What would you do if <interviewer name> did not listen to you while you were
trying to talk to her/him?
• Q15: What would happen if you did not listen to <interviewer name> while he was
trying to talk to you?

Other Comments

• Q16: Do you have any other thoughts about <interviewer name> from the study that
you’d like to share with us?

5.3 Study Protocol

In both studies each session was an hour long and divided into four segments: (1) pre-
interview tasks, (2) forensic interview about bullying experiences, (3) post-interview
tasks, and (4) character guessing game with a robot. Prior to the interview, a researcher
explained the study, obtained informed consent and participant assent, and adminis-
tered a paper-based demographics survey in the “research room.” After completing
the demographics survey, the participant was guided to a separate “interview room”
and spent about 30 min engaged in the forensic interview about bullying experiences.
For children especially, it is important to keep interviews short; they should not
go beyond 30 min. Once the forensic interview was complete, the researcher
administered the PII in the “research room” and then the participants also completed
a paper-based interviewer perception survey. Once that was completed, participants
returned to the “interview room” and played a character guessing game with the
robot, so that participants who did not experience a robot interviewer condition still
had the opportunity to interact with the robot. After the character guessing game,
participants were compensated for their involvement in the study; in this case they
received a payment of $10 and a small gift.

5.4 Participants

Participants were recruited from a database of local children and parents that had
expressed an interest in participating in research studies. The database is maintained
by university researchers and is advertised through fliers, newspaper advertisements,
and targeted advertisements on popular social media networks. Researchers used the
database to contact parents with children that were eligible for each study. Partici-
pants who took part in the first study were ineligible for participation in the second
study. Between the two studies, 70 female participants and 71 male participants
were interviewed; 75 interviews were conducted by robots, and 67 interviews were
conducted by humans.

5.5 Data and Analysis

A total of 147 one-hour sessions were conducted during the summer and fall of 2017,
yielding 142 usable cases. Participants in Study A (younger children) were distributed
between conditions to balance the interviewer-participant gender pairings. In Study B
(older children), participants were randomly assigned to one of three interviewers [6].
The same male and female human interviewers were used across all human con-
dition sessions in Study A and were social science majors in their early twenties. In
Study B human interviews were conducted by a male undergraduate in the last year
of his social science program and in his early twenties.
In this section, the analysis and results are presented from the Perceptions of
the Interviewer Interview (PII) conducted by the researcher after the participant
completed their forensic interview interaction concerning their experiences with
bullying. Verbal and behavioral data captured during the main interview is currently
undergoing analysis and will be reported in the future. The analysis in this section
of the chapter examines effects present when responses from Study A and Study B
were pooled as well as when they were analyzed independently.

5.5.1 Transcription and Data Coding Approach

Verbal answers provided by participants during the Perceptions of the Interviewer
Interview (PII) were converted to text by two independent research assistants using
audio recordings of each session and the ELAN software package [5]. A third research
assistant examined and resolved any discrepancies between the transcriptions, yield-
ing a final text transcript for each participant. If responses were unable to be deter-
mined from audio recordings alone, video of the session was consulted to clarify
verbal responses and capture non-verbal responses.
Two researchers independently evaluated text transcriptions of participant
responses to items Q2–Q13 from Sect. 5.2.2, coding responses first for agreement or
appraisal (depending upon the question) and then for any social factors present in
the response.
A five-point coding scheme for indicating agreement or disagreement was devel-
oped for items Q2, Q5–Q7 and Q9–Q13 using the following coding guidelines, see
Sect. 5.2.2 [6]:

• No—A verbal or non-verbal response indicating complete disagreement.
• Indecisive negative—A verbal response that primarily indicated disagreement but
also included reservations, conditions, minor uncertainty, or hypothetical alterna-
tives.
• Indecisive—A non-verbal or verbal response that ultimately indicated uncertainty.
• Indecisive positive—A verbal response that primarily indicated agreement but
also included reservations, conditions, minor uncertainty, or hypothetical alterna-
tives.
• Yes—A verbal or non-verbal response indicating complete agreement.

Similarly, a five-point coding scheme for appraising performance was developed
for items Q3 and Q4 with the following coding guidelines, refer to Sect. 5.2.2 [6]:

• Very poor—A verbal response that indicated exceptionally poor performance.
• Poor—A verbal response that indicated performance that was slightly problematic
or did not fully meet expectations.
• Indecisive—A verbal or non-verbal response that ultimately indicated uncertainty.
• Well—A verbal response that primarily indicated performance was acceptable or
met expectations.
• Very well—A verbal response that indicated superb performance or exceeding
expectations.

In addition to the five established agreement and appraisal codes, a not applicable
(NA) category was created for cases when a participant was not asked or did not
provide an answer to a specific item. For the analysis presented in this chapter the
agreement and appraisal scales were collapsed from five points to three points by
combining the first two and last two categories on each scale [6].
Each response was also examined for supporting social factors that participants
used to justify their answers. Items Q2, Q5–Q7 and Q9–Q13 included explicit follow-
up prompts, which often elicited social factors, while responses to Q8 were primarily
composed of social factors. Two researchers collaboratively generated a list of six
base social factors from their observations of study sessions and by examining a small
sample of transcribed responses to each item. During the coding process researchers
discussed and created additional sub-categories within these six main factors when
doing so assisted in more precisely characterizing responses. Responses were coded
for the following main social factors [6]:

• Appearance: Responses that referenced the interviewer’s physical appearance
but did not incorporate elements of the interviewer’s behavior. (Positive Example:
“yeah because I liked her blue hair”, Negative Example: “He doesn’t have life in
his, in his eyes so you don’t feel like you’re talking to somebody alive”.)
• Demeanor: Responses that referenced the interviewer’s behavior or personality,
often as it related to being understanding, interested, concerned, or helpful. (Posi-
tive Example: “she was really nice and, uh she would listen a lot more than some
other people would to me”, Negative Example: “I don’t think she could see me uh
really feel”.)
• Interviewer behavior: Responses that highlighted a specific action the interviewer
took during the interaction with the participant. (Example: “I could tell he was
listening at least most of the time because he like when I would say something he
would ask like questions like specifically about that and stuff like that”.)
• Knowledge: Responses that referenced the interviewer’s knowledge. (Positive
Example: “he seemed pretty smart”, Negative Example: “I don’t think she under-
stood it at all because like I said before she was most likely programmed to do
that”.)
• Social confidence: Responses that discussed social judgment, privacy, or trust as
it related to the interviewer. (Positive Example: “I feel probably better talking to
him than a person because I mean like when you’re saying it to the person it’s
basically just like a guilt trip right there”, Negative Example: “it felt good to talk
about it and at the same time I had to stop myself because it’s a robot and I don’t
know where it was going to, so I just kinda like stop myself ”.)
• Non-specific: Responses that expressed only uncertainty, were not specific, refer-
enced intuition, or emphasized the participant’s own traits rather than evaluating
the interviewer’s traits or behaviors.

Each response was coded for social factors independently and any conflicts were
resolved through discussion. A majority of conflicts in coding involved one coder
selecting a less specific area of the same main category or one coder applying a single
code when multiple were merited. After all coding conflicts were resolved, a final
review of the responses associated with each social factor was conducted to ensure
consistency.

5.5.2 Agreement and Appraisal

After the initial data coding process, responses to items that included an agree-
ment or appraisal prompt were sorted into three categories (positive, negative, or
indecisive) for analysis. For agreement prompts Yes and Indecisive Positive were
grouped into the Positive category and No and Indecisive Negative were placed in

Fig. 3 Participant responses to questions about the interviewer’s ability to understand what they
said and how they felt (percentage of participants appraising Well, Unsure, or Not Well, grouped
by interviewer type and by individual interviewer)

the Negative category. For appraisal prompts, Very Poor and Poor were grouped as Negative,
while Very Well and Well formed the Positive group. In a small number of cases
a researcher inadvertently skipped an item, a participant offered no decipherable
response, or a technical error prevented capturing the participant’s response; in these
cases the participant was excluded from the analysis for the affected items [6].
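As an illustration of this collapsing step (a sketch only, not the authors’ analysis scripts), the following Python snippet maps the five-point agreement codes onto the three analysis categories; the column names and example values are hypothetical.

import pandas as pd

# Hypothetical coded responses: one row per participant, one column per PII
# item, holding the five-point agreement codes assigned during coding.
coded = pd.DataFrame({
    "participant_id": [1, 2, 3, 4],
    "q5": ["Yes", "Indecisive Positive", "No", "NA"],
    "q9": ["Indecisive", "Yes", "Indecisive Negative", "Yes"],
})

# Collapse the five-point scale into three categories, as described in the text:
# Yes / Indecisive Positive -> Positive, No / Indecisive Negative -> Negative,
# Indecisive stays Indecisive, and NA responses remain excluded from analysis.
collapse = {
    "Yes": "Positive",
    "Indecisive Positive": "Positive",
    "Indecisive": "Indecisive",
    "Indecisive Negative": "Negative",
    "No": "Negative",
    "NA": pd.NA,
}

for item in ["q5", "q9"]:
    coded[item + "_3pt"] = coded[item].map(collapse)

print(coded)

An analogous mapping (Very Poor/Poor to Negative, Well/Very Well to Positive) handles the appraisal items Q3 and Q4.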
The responses from both studies were merged and the frequencies of coded
responses for each item were compared when participants were grouped by study,
interviewer type (human or robot), and participant gender. Further analyses were
conducted within the context of each study to better understand the source of any
significant differences. No statistically significant differences on any agreement or
appraisal items were identified when splitting responses into two groups based on
the study they participated in (younger children versus older children) [6].
Robot versus human interviewers
When responses were split into those with a human interviewer (male or female) and
those with a robot interviewer (female R25, male R25, or male Nao), statistically
significant differences were observed for items related to how well the interviewer
understood the participant and the perceived ability of the interviewer to give helpful
advice (Q3–Q5).
Participants with a human interviewer were more likely to respond that they were
uncertain how well the interviewer understood what they said (Q3) (14.93%) in
comparison to those with a robot interviewer (4.23%, Fisher’s Exact Test p = 0.01).
Furthermore, 7.04% of participants with robot interviewers felt the interviewer did
not understand what they said, while none of the participants in human interviewer
conditions reported a lack of understanding (refer to Fig. 3) [6].
When asked how well the interviewer understood the way they felt (Q4) (see
Fig. 3), participants in the robot conditions were more likely to indicate Negative
(14.29%) or Indecisive (22.86%) responses compared with the responses in the
human interviewer conditions, with 1.54% indicating Negative and 13.85% indi-
cating Indecisive (Fisher’s Exact Test p < 0.01).
Participants in robot interviewer conditions were more likely (5.33%) than
those with a human interviewer (1.54%) to indicate that they were uncertain if the

Fig. 4 Older participants’ assessment of how well the interviewer understood their feelings
(“Emotional Understanding”; percentage of participants appraising Well, Unsure, or Not Well),
divided by interviewer (Human Male, Nao Robot, RK25 Male)

interviewer could provide them helpful advice if they had a problem (Fisher’s Exact
Test p = 0.01). Furthermore, those in the robot conditions were more likely to report
that the interviewer would be unable to provide helpful advice (9.33%) in comparison
to those in the human condition (0%).
An examination of Study A independently shows that the only statistically signifi-
cant difference (Fisher’s Exact Test p < 0.01) that occurred between the human and
robot interviewers was in responses to Q3 (see Fig. 3). In the human conditions (male
and female interviewers) 78.72% of participants appraised the interviewer’s ability to
understand what they said as Positive, while 21.28% were Indecisive. In comparison,
88.58% of participants in robot interviewer conditions (male and female) appraised
the interviewer’s ability to understand what they said as Positive, 2.86% were Inde-
cisive and 8.57% reported the interviewer did not understand well [6].
Figure 4 illustrates that when analyzed independently, responses from Study B
yield a statistically significant difference between participants in the different inter-
viewer conditions (Human male, Nao male, R25 male) on Q4 (Fisher’s Exact Test,
p = 0.05). In the male Nao robot condition 66.67% of participants reported that the
interviewer understood how they felt, while 22.22% indicated the robot did not have
a good understanding of how they felt. Of the participants in the male R25 condition
62.5% indicated the robot understood how they felt, while 12.5% responded that the
robot did not understand how they felt. In the human interviewer condition 95% of
participants felt the interviewer understood how they felt [6].
Participant gender
When responses were split into groups based on participant gender (male or female)
statistically significant differences were present for items concerning interviewer
understanding of feelings and ability to provide advice (Q4, Q5), refer to Fig. 5. As
displayed in Fig. 5, participants identifying as female were more likely (26.47%)
to report uncertainty when asked how well the interviewer understood how they
felt in comparison to male participants (10.45%, Fisher’s Exact Test p = 0.01), who
were more likely to endorse the Negative option (13.43%) in comparison to females
(2.94%). Furthermore, as shown in Fig. 5, participants identifying as female were
more likely (97.18%) to indicate the interviewer could provide them helpful advice

Fig. 5 Gender differences among participants in each study across all types of interviewers
(percentage of Yes, Unsure, and No responses by participant gender, for Q5 among older
participants and for Q6 and Q13 among younger participants)

in comparison to participants identifying as male (85.51%, Fisher’s Exact Test p =
0.04).
Figure 5 illustrates significant differences found within each study. When exam-
ined separately, responses from Study A show significant differences between partic-
ipant reported genders and appraisals of the interviewer for Q6 and Q13. Compared
to 35.71% of male participants, 66.67% of female participants reported that they
could not hide how they felt from the interviewer (Fisher’s Exact Test p = 0.02) [6].
Furthermore, within Study A female participants (64.29%) were more likely than
male participants (50%) to perceive that the interviewer liked them (Fisher’s Exact
Test p = 0.02). While 16.67% of male participants responded that they felt the
interviewer did not like them, none of the female participants reported the perception
that the interviewer did not like them [6].
All female participants (100%) in Study B reported that the interviewer could
provide helpful advice, while 78.57% of male participants indicated the interviewer
could provide helpful advice (Fisher’s Exact Test p = 0.02) [6].

5.5.3 Social Factor Mentions

Following the analysis of agreement and appraisal responses, an examination of
the explanations that participants provided for their responses was conducted. Each
response was tagged with all relevant social factors (described in Sect. 5.5.1). Data
from Study A and Study B were combined and analyzed as a whole [6].
For each participant it was computed whether or not the participant made mention
of each social factor across their entire response to the Perceptions of the Interviewer
Interview (PII). This was done to limit the influence of participants who cited the same
factors repeatedly for multiple items. As each factor could have a positive or negative
valence, this resulted in a set of twelve binary variables for each participant indicating
whether or not the participant discussed the factor. For the analysis associated with
this chapter, the sub-categories for each factor were not examined, rather they were
counted as representing their higher level category [6].
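A minimal sketch of how such per-participant indicator variables can be computed, assuming the coded social-factor mentions are available in a long-format table; the layout, names, and values below are illustrative only and are not the study data.

import pandas as pd

# Hypothetical long-format coding output: one row per social-factor mention,
# where 'factor' combines the main category with its valence (+ / -).
mentions = pd.DataFrame({
    "participant_id": [1, 1, 1, 2, 2, 3],
    "item":   ["Q5", "Q8", "Q12", "Q2", "Q12", "Q8"],
    "factor": ["Demeanor+", "Appearance-", "Demeanor+",
               "Knowledge+", "Social confidence+", "Appearance-"],
})

# One binary variable per participant and factor: did the participant mention
# the factor anywhere in their PII responses, regardless of how many times?
factor_mentioned = (
    pd.crosstab(mentions["participant_id"], mentions["factor"]) > 0
).astype(int)

print(factor_mentioned)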

Fig. 6 Positive and negative mentions of each social factor (Appearance, Demeanor, Interviewer
Behavior, Social Confidence, Knowledge) for human and robot interviewers, as a percentage of
participants in each condition

Figure 6 compares the percentage of participants in human and robot interviewer
conditions citing each social factor. With the exception of the Knowledge + social fac-
tor, all other positive social factors were referenced significantly more by participants
in a robot interviewer condition [6].
Since participants in robot interviewer conditions had the opportunity to respond
to two additional questions (Q2 and Q8), an analysis was conducted in which all
responses to these questions were removed. Removing these items resulted in a
loss of statistical significance between human and robot interviewers for the area
of Social Confidence, which includes components of general trust, social judgment,
and maintaining privacy (χ²(1) = 2.67, p = 0.10, Cramer’s V = 0.19, small effect).
When incorporating all questions, 41.33% of participants in the robot conditions
discussed factors related to Social Confidence, but when excluding responses to the
question How is the interviewer different from a human? (Q8) this declined to 34.67%
of participants in the robot interviewer conditions. Of the participants in the human
interviewer conditions, 20.9% identified positive factors related to Social Confidence
[6].

5.6 Reporting

The exemplar study presented in Sect. 5 provides different ways in which qual-
itative data can be analyzed, interpreted, and reported [6]. There are different types of
statistical tests required depending on the type of data being evaluated and how it is
converted from textual transcriptions, coded, and evaluated as a quantitative value.
In most cases the data was evaluated using non-parametric statistical measures such
as a Chi-square test using Cramer’s V for determining effect size (refer to http://
vassarstats.net/newcs.html) or a Fisher’s Exact Test (see https://www.socscistatistics.
com/tests/fisher/default2.aspx), which is used when sample sizes are smaller than
what is needed to calculate a Chi-square. It is recommended that a statistical meth-
ods book (e.g., [15]) be consulted or that software packages such as SPSS, SAS, or
similar be used for a better understanding of these techniques. A detailed discussion
of these statistical tests is beyond the scope of this chapter.
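For readers who prefer a scriptable alternative to the web calculators mentioned above, the sketch below shows how the same two tests, together with Cramer’s V as an effect-size measure, can be computed with scipy; the contingency table is invented for illustration and is not data from these studies.

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 contingency table: interviewer type (human / robot) versus a
# collapsed response to one PII item (Positive / not Positive).
table = np.array([[57, 10],    # human interviewer
                  [68,  7]])   # robot interviewer

# Fisher's Exact Test: appropriate when expected cell counts are too small for
# a reliable Chi-square approximation.
odds_ratio, p_fisher = fisher_exact(table)

# Chi-square test with Cramer's V for effect size.
chi2, p_chi2, dof, expected = chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"Fisher's exact p = {p_fisher:.3f}")
print(f"Chi-square = {chi2:.2f}, p = {p_chi2:.3f}, Cramer's V = {cramers_v:.2f}")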

6 Conclusions

This chapter presents the basics associated with using a forensic structured interview
approach for gathering richer data from human-robot interaction studies. Although
survey and self-assessment data is a place to start and is useful in human-robot
interaction studies, the content is limited. Study participants can provide only the
information requested through the items selected by the researcher. This type of
data can only provide a certain amount of
information about interactions between robots and humans.
In order to obtain additional information from participants of user studies, it is
recommended that researchers enhance studies through the use of structured or semi-
structured interview questions. This allows participants of the study to provide their
own insights and feelings regarding the interactions and gives an opportunity for them
to provide additional knowledge and insight. This type of data can be challenging,
time intensive, and tedious to evaluate; however, it can provide additional insights
that may not be discerned using other methods of evaluation.
As discussed in the Related Work Sect. 2, there are many different types of inter-
view techniques that can be used. The focus of this chapter was on the use of an inves-
tigative interview technique, known as the forensic interview [9]. This approach has
been successfully used in different types of investigations and it is especially effective
for use in studies involving children. The approach for using the forensic interview
protocol was presented in Sect. 3 and included the introduction, guidelines, rapport
building, narrative practice, substantive disclosure, and cool-down/wrap-up tech-
niques. A discussion on the process of transcription and coding of the qualitative
data was presented in Sect. 4. An exemplar study was presented to help reinforce and
demonstrate the process of how to conduct a forensic interview, followed by examples
of how the transcription, coding, and analysis process was performed for this type
of study. Although the process of using structured interviews with open-ended types
of questions seems to incur significant effort, the results may transform human-robot
interaction research and researchers are encouraged to consider using this approach
in the design of their studies.

References

1. Baxter, P., Kennedy, J., Senft, E., Lemaignan, S., Belpaeme, T.: From characterising three
years of HRI to methodology and reporting recommendations. In: The Eleventh ACM/IEEE
International Conference on Human-Robot Interaction, pp. 391–398. IEEE Press (2016)
2. Birks, M., Bodak, M., Barlas, J., Harwood, J., Pether, M.: Robotic seals as therapeutic tools in
an aged care facility. J. Aging Res. (2016)
3. Borgers, N., Hox, J., Sikkel, D.: Response quality in survey research with children and adoles-
cents: The effect of labeled response options and vague quantifiers. Int. J. Public Opin. Res.
15(1), 83–94 (2003)
4. Foster, R.K.: An investigation of training, schemas, and false recall of diagnostic features.
Masters thesis, Mississippi State University (2015)
5. Hellwig, B., Van Uytvanck, D., Hulsbosch, M., Somasundaram, A., Tacchetti, M., Geerts, J.:
ELAN - Linguistic Annotator, 5th edn. The Language Archive, MPI for Psycholinguistics,
Nijmegen, The Netherlands (2018)
6. Henkel, Z., Baugus, K., Bethel, C.L., May, D.C.: User expectations of privacy in robot assisted
therapy. Paladyn J. Behav. Robot. 10(1), 140–159 (2019)
7. Jeong, S., Breazeal, C., Logan, D., Weinstock, P.: Huggable: Impact of embodiment on pro-
moting verbal and physical engagement for young pediatric inpatients. In: 2017 26th IEEE
International Symposium on Robot and Human Interactive Communication (RO-MAN), pp.
121–126 (2017)
8. Kelley, J.F.: An iterative design methodology for user-friendly natural language office infor-
mation applications. ACM Trans. Inf. Syst. (TOIS) 2(1), 26–41 (1984)
9. Lamb, M.E., Orbach, Y., Hershkowitz, I., Esplin, P.W., Horowitz, D.: A structured forensic
interview protocol improves the quality and informativeness of investigative interviews with
children: A review of research using the NICHD investigative interview protocol. Child Abus.
Negl. 31(11–12), 1201–1231 (2007)
10. Lazar, J., Feng, J.H., Hochheiser, H.: Research Methods in Human-Computer Interaction.
Wiley, West Sussex, United Kingdom (2010)
11. de Leeuw, E.D.: Improving data quality when surveying children and adolescents: Cognitive
and social development and its role in questionnaire construction and pretesting. In: Report
prepared for the Annual Meeting of the Academy of Finland: Research programs public health
challenges and health and welfare of children and young people, pp. 1–50 (2011)
12. Rosenthal-von der Pütten, A.M., Krämer, N.C.: Individuals’ evaluations of and attitudes towards
potentially uncanny robots. Int. J. Soc. Robot. 7(5), 799–824 (2016)
13. Serholt, S.: Breakdowns in children’s interactions with a robotic tutor: A longitudinal study.
Computers in Human Behavior 81, 250–264 (2018)
14. da Silva, J.G.G., Kavanagh, D.J., Belpaeme, T., Taylor, L., Bleeson, K., Andrade, J.: Experi-
ences of a motivational interview delivered by a robot: qualitative study. Journal of Medical
Internet Research 20(5), (2018)
15. Stevens, J.P.: Intermediate Statistics A Modern Approach, 2nd edn. Lawrence Erlbaum Asso-
ciates, Publishers, Mahwah, NJ (1999)
16. Syed, M., Nelson, S.C.: Guidelines for establishing reliability when coding narrative data.
Emerging Adulthood 3(6), 375–387 (2015)

Cindy L. Bethel Ph.D. (IEEE and ACM Senior Member) is
a Professor in the Computer Science and Engineering Depart-
ment and holds the Billie J. Ball Endowed Professorship in
Engineering at Mississippi State University (MSU). She is the
2019 U.S. Fulbright Senior Scholar at the University of Tech-
nology Sydney. Dr. Bethel is the Director of the Social, Ther-
apeutic, and Robotic Systems (STaRS) lab. She is a member
of the Academy of Distinguished Teachers in the Bagley Col-
lege of Engineering at MSU. She also was awarded the 2014–
2015 ASEE New Faculty Research Award for Teaching. She
was a NSF/CRA/CCC Computing Innovation Postdoctoral Fel-
low in the Social Robotics Laboratory at Yale University. From
2005–2008, she was a National Science Foundation Gradu-
ate Research Fellow and was the recipient of the 2008 IEEE
Robotics and Automation Society Graduate Fellowship. She
graduated in August 2009 with her Ph.D. in Computer Science and Engineering from the Uni-
versity of South Florida. Her research interests include human-robot interaction, human-computer
interaction, robotics, and artificial intelligence. Her research focuses on applications associated
with robotic therapeutic support, information gathering from children, and the use of robots for
law enforcement and military.

Jessie E. Cossitt received a B.S. in Psychology from Missis-
sippi State University in 2017 and is currently enrolled in the
Ph.D. program in computer science in the Bagley College of
Engineering at Mississippi State University. She works on driv-
ing simulator research as a graduate research assistant at the
university’s Center for Advanced Vehicular Systems, and her
main research interest is the interactions between humans and
autonomous vehicles.

Zachary Henkel is a computer science PhD student at Missis-
sippi State University. He received a bachelor’s degree in com-
puter science from Texas A&M University, College Station, TX,
USA, in 2011. His research interests include human-robot inter-
action and human-computer interaction.

Kenna Baugus is pursuing a Bachelor of Science in Software
Engineering at Mississippi State University. She enjoys learning
about human-machine interaction and works as an undergradu-
ate researcher in the Social, Therapeutic, and Robotic Systems
(STaRS) Lab. Her current focus is developing social robots that
act as intermediaries to gather sensitive information from chil-
dren.

Some Standardization Proposals

Design and Development of the USUS
Goals Evaluation Framework

Josefine Wallström and Jessica Lindblom

Abstract For social robots to provide long-term added value to people’s lives, it is
of major importance to emphasize the need for developing a positive user experience
(UX). In this chapter, we address the identified lack of available and suitable UX
evaluation methods in social human-robot interaction (HRI). Inspired by Blandford’s
and Green’s iterative method development process, this lack was mainly handled by a
state-of-the art review of current HRI evaluation methods that identified some tenta-
tive candidates, of which the USUS framework was considered the most prominent.
However, upon closer examination it was revealed that the USUS framework explic-
itly omitted UX goals, which are considered a significant aspect in UX evaluation.
We designed and developed an enhanced version of the USUS framework in order to
include UX goals that we denoted the USUS Goals evaluation framework. Besides
the modified framework, some recommendations are presented that may contribute
to the ongoing work of integrating UX in the HRI field.

Keywords User experience (UX) · UX goals · USUS framework

1 Introduction

The recent and rapid development of autonomous technology emphasizes the impor-
tance of considering various aspects of human-robot interaction (HRI) from a human-
centered perspective. Socially interactive robots are expected to have an increasing
importance in everyday life for a growing number of people [1]. There has been an
increased number of socially interactive robots in human environments, and their level
of participation in everyday activities is becoming more sophisticated [e.g. 1–4].
Taking on the human-centered view, highlighting the importance of evaluating the
quality of the human-robot interaction, is of major concern in order for technology

J. Wallström (B)
Uptive, Lindholmspiren 7, 41756 Göteborg, Sweden
e-mail: josefine.wallstrom@uptive.se
J. Lindblom
University of Skövde, Box 408, 541 28 Skövde, Sweden
e-mail: jessica.lindblom@his.se

to provide a long-term added value to people’s lives [1, 2]. Consequently, many eval-
uation methods and techniques have been developed [5–8], resulting in evaluations
of different aspects; including acceptance, usability, user experience, learnability,
safety, trust, and credibility. While some of the aspects are covered in depth, some
are just briefly touched upon in HRI research. Lately, the importance of creating a
positive user experience (UX) when a human user is interacting with a social robot
is widely stressed [1, 5, 7–9]. Briefly stated, UX is about people’s feelings, as caused
and shaped by the use of technology in a particular context [e.g. 10–12]. It is argued
that positive UX is necessary in order for socially interactive robots to achieve the
intended benefits [1, 2, 5, 7–9].
This chapter addresses the identified lack of available and suitable UX evaluation
methods in social HRI. In order to address this need, we present the design and devel-
opment process of the so-called USUS Goals evaluation framework. This process
was mainly influenced by Blandford’s and Green’s iterative method development
process [13], in which a state-of-the-art review of current UX evaluation methods
(see Sect. 3.1) was conducted. This review identified some tentative candidates, of
which the USUS evaluation framework [5] was considered the most prominent and
well developed, which has been used in several evaluation studies. However, upon
closer examination it was revealed that the USUS evaluation framework [5] explic-
itly omitted UX goals, which are considered a significant aspect in UX evaluation in
general. The aim of this chapter is to investigate and analyze how an enhanced version
of the USUS evaluation framework should be developed and designed in order to
include UX goals. The intended end-users of the new UX evaluation framework,
called USUS Goals, are both robot developers and HRI researchers who intend to
develop and design social robots with a positive UX, beyond technical functionality,
performance, and acceptance. De Graaf and Allouch [14] have shown that users’ sub-
jective experiences of the interaction quality with a humanoid robot have the same
impact on the robot’s acceptance and trust as more performance-related aspects. It
has been argued that when various kinds of robots, including social robots, become more
complex and the commercial markets become more competitive, the robotics indus-
try will see an increased demand for UX competence within HRI. As a consequence,
the need for relevant, useful, and improved HRI evaluation frameworks, methods, and
techniques grows as social robots become ubiquitous parts of our society.
When it comes to social interaction with robots, HRI research could be catego-
rized into three different approaches: robot-centered HRI, robot cognition-centered
HRI, and human-centered HRI [1]. While robot-centered HRI views the robot as an
autonomous entity and the human as the robot’s “caretaker” who should identify and
respond to the needs of the robot, robot-cognition HRI views the robot as an intelli-
gent system and the fundamental problem is to provide these robots with a cognitive
capacity. In human-centered HRI, the human perspective is emphasized and issues
related to the design of robot behavior that is comfortable for humans are included
in this approach. This involves acceptability and believability, as well as humans’
expectations of, attitudes towards, and perceptions of robots [1]. In order to get robots
to inhabit our social and material living environments, the three approaches need to be
synthesized to enhance social interaction [1]. However, historically human-centered
HRI has not received as much attention as the other two approaches [7–9, 15].

1.1 UX, the UX Wheel, and UX Goals

From the users’ point of view, a digital artifact that is suitable for its purpose, easy
to use, and fits into its intended context merely meets the basic requirements of a
technological artifact. Users have also started to expect and demand a positive,
good and great experience when interacting with technological artifacts, beyond
utility, usability, and acceptance [7–9]. Broadly speaking, UX addresses the feelings
created and shaped by the use of technology and how technology can be designed
to create a user experience that evolves the required feelings [e.g. 10–12, 16, 17].
Therefore, the intended users have to be identified and described, and focused upon
during the whole UX design (UXD) lifecycle process [10]. One central principle of
the UXD lifecycle process is the need to identify and characterize the user goals,
and these goals have to be connected to the business goals, and subsequently the
business goals to the user behaviors [16, 18]. Another central principle of the UXD
lifecycle process is its iterative and incremental nature. It is not possible to have
all the answers from the very beginning. Instead, the answers are being identified,
evolved, characterized, and refined during the whole iterative UXD lifecycle process
[10, 16, 18], which Hartson and Pyla [10] also denoted as the UX wheel.
The UX wheel is iterative and consists of the four key elements of UX activities:
analyze, design, implement, and evaluate [10]. Briefly stated, “analyze” refers to
understanding the users’ work and needs. “Design” refers to creating conceptual
design ideas and the fundamental “look and feel” of the interaction between the user
and the intended product. “Implementation” refers to the more detailed interaction
situations with the use of different kinds of product prototypes, which vary from low
fidelity to high fidelity of details. Finally, “evaluation” refers to the different methods
and techniques that can be used to investigate and analyze to what extent the proposed
design meets the users’ needs, requirements, and expectations. The whole “wheel”
(Fig. 1) corresponds to an iterative UXD lifecycle process that is accompanied by
the identified and characterized UX goals [7–10].
An important activity for the whole UXD wheel is to extract, identify, and charac-
terize UX goals [10]. UX goals are high-level objectives, which should be driven by
the representative use of an envisioned interactive artifact or system. The UX goals
should identify what is important to the users, stated in terms of anticipated UX of
an interaction design. The UX goals are expressed as desired effects, for example
the interactive artifact’s ease-of-use, perceived safety, quality-of-use, learnability,
acceptance, trust, and emotional arousal [10, 18]. UX goals are important because
they help and support the designers and developers to continuously focus on the
intended experience when interacting with the envisioned interactive artifact. For
that reason, they are referred to as “goals”, instead of “requirements”, since these
UX goals cannot be guaranteed to be fulfilled by all intended end-users [18]. These
UX goals should be aligned with the business goals of the company. Today, most
robot developers are researchers, but there is a growing shift towards social robots as
commercial products. UX goals are extracted and defined in the initial investigation
phase of UXD, in the first analysis phase of the UX wheel, and these UX goals may,

Fig. 1 The UX wheel, describing the iterative UX design lifecycle process (adapted from [10],
p. 54)

e.g., be extracted from corporate policy and product requirements. UX goals
can be stated in several ways; they can be expressed, e.g., as ease-of-use, powerful
performance to experts, avoiding errors for beginners, or high satisfaction among
users [10, 18]. Unfortunately, the extraction of UX goals is an activity that is often
overlooked, either because of lack of time or lack of knowledge [10, 18], which
may result in negative consequences for the final design of the interactive artifact. If
proper and relevant UX goals are extracted and characterized early on, continuously
being used during the UXD lifecycle process, they could increase the potential of the
UX evaluation performed later on. Once specific UX goals have been identified, these
goals should support and benefit the evaluation process in pointing out exactly what
should be investigated in order to enhance the positive UX of the interactive artifact.
UX goals offer support throughout the development lifecycle by defining quantita-
tive and qualitative metrics, which provides the basis for knowing when the required
interaction quality has been fulfilled. During the UXD cycle, it is possible to con-
duct both formative and summative evaluations. Briefly stated, formative evaluation
is typically performed during the early development of a system, while summative
evaluation is performed at the end of a design process. The characterized and defined
UX goals can also provide appropriate benchmarks for formative evaluations, which
in turn can help to point out exactly which adjustments will result in the most use-
ful outcome. By evaluating the UX goals continuously, the developers and designers
may also recognize when it is time to stop iterating the design, and when the design
is considered to be successful enough [10]. The UX goals also scaffold and support
the team in staying attuned to the UX focus throughout an interdisciplinary development process
[18]. As pointed out several times above, UX evaluation is a central activity when
designing for a positive UX in HRI, which is the main topic of the next subsection.
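As a purely illustrative aside (not part of the original USUS framework or of USUS Goals), a UX goal that is concrete enough to serve as an evaluation benchmark can be recorded together with its metric and target level; the minimal Python sketch below, with invented example goals, shows one such representation.

from dataclasses import dataclass

@dataclass
class UXGoal:
    description: str   # the desired effect, e.g. perceived safety or ease-of-use
    metric: str        # how the effect is measured during evaluation
    benchmark: str     # the level that counts as good enough for this iteration

goals = [
    UXGoal(
        description="Users feel safe when the robot approaches them",
        metric="Post-session perceived-safety rating on a 1-5 scale",
        benchmark="Median rating of at least 4 across participants",
    ),
    UXGoal(
        description="First-time users start an interaction without help",
        metric="Share of participants who begin the interaction unaided",
        benchmark="At least 8 of 10 participants in each formative round",
    ),
]

for goal in goals:
    print(f"{goal.description} -> {goal.metric} (target: {goal.benchmark})")

Keeping goals in such an explicit form makes it easier to revisit them at every formative evaluation and to recognize when the design is good enough to stop iterating.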

1.2 UX Evaluation in Social HRI

Lindblom and Andreasson [9] identified three major challenges which, if met, might
result in a better, broader understanding of UX evaluation in social HRI. Their list is
not exhaustive, but provides a useful starting point in order to narrow the gap between
the different approaches to social interaction in HRI identified by Dautenhahn [1].
It should be noted that these challenges can be met by drawing on the fields of HCI
and UX, for example with design processes, theories, models, methods, tools, and
evaluation approaches that may provide starting points for the design, analysis, and
evaluation of HRI studies [9]. The three major challenges are briefly presented as
follows [9]. The first challenge is the need to adopt an iterative UXD lifecycle process
in HRI. This poses a dilemma because of the high cost of rapid prototyping in robotics.
The second challenge is the need to incorporate UX goals to ensure positive UX, and
to perform several formative evaluations during the iterative design process, so it may
be possible to compare and contrast the evaluation results obtained during the whole
development process. The third challenge is the need for robot developers to acquire
knowledge about proper UX evaluation, in theory and in practice. Based on the
identified challenges of Lindblom and Andreasson [9], the rest of this chapter focuses
on the design and developmental process for achieving the current version of the
USUS Goals evaluation framework. The development was influenced by Blandford’s
and Green’s iterative method development process [13], which is presented in more
details in the next section.

2 A Method for Method Development

Literature on methods for evaluation method development is, for varying reasons, quite scarce in HCI, UX and HRI, although the results, that is, the evaluation methods themselves, are found in abundance. Blandford and Green's method development process [13] consists of five iterative steps: (1) identification of an opportunity or need, (2) development of more detailed requirements, (3) matching opportunities, needs and requirements, (4) development of the method, and (5) testing of the method. However, as they further pointed out, a method project need not necessarily cover all the phases of the method development process.

The first step deals with the identification of an opportunity or need. This could arise from a vast number of sources, such as the need for a new type of evaluation method, technological advancements, or simply the acquisition of new knowledge, giving rise to better models and methods for design and/or evaluation. The second step in Blandford and Green's method development process is the development of more detailed requirements. The third step in Blandford and Green's [13] method development process, matching opportunities, needs and requirements, concerns the exploration phase that follows the initial requirements set at the beginning of any project. This phase involves researching existing methods and theory within the focal area of the identified opportunity or need, with the aim of identifying relevant existing and neighboring methods for application, modification, or inspiration, as well as relevant theoretical frameworks that could be developed into a method. The fourth step, development of the method, is in itself an iterative and explorative process similar to the iterative processes occurring in all types of design. Blandford and Green [13] further argued that this phase, as such, does not allow for much detail in terms of structured processes, although they highlight that drawing inspiration from existing methods is a good place to start. The fifth and final step, testing of the method, is essential in all method development, as a method's usage is highly dependent on the person using it. As Blandford and Green [13] stressed, it is impossible to be predictive about methods for developing a method, because so many variables and motivations are involved. The remainder of this chapter presents the outcomes of steps 1–4, which are linked to the fifth and final step.

3 The Method Development Process for the USUS Goals Evaluation Framework

Inspired by Blandford and Green's iterative method development process [13], our own method development process unfolded as follows. The first step concerned the identification of an opportunity or need [13]. Our position is that there was an identified need for a methodological approach such as the envisioned USUS Goals evaluation framework, partly based on the challenges identified by Lindblom and Andreasson [9], to provide systematic guidance on how to evaluate UX for social robots from a human-centered perspective. Thus, the aim and intended benefits of the methodological approach were clearly defined (see the Introduction). The second step concerned the development of more detailed requirements. We emphasize that the motivation for a UX approach is based on the increasing attention UX has received in HRI, where it is recognized as vital for the proliferation of robots in society. When undertaking such an endeavor, a lot of inspiration can be gained from the fields of UX and HCI, which since the mid-1980s have focused on systematic evaluation of interactive technologies from a human-centered perspective, even before robots entered the scene [e.g. 7, 8, 10]. The third step concerned matching opportunities, needs and requirements, and a state-of-the-art literature review on UX evaluation in social HRI was conducted.

The main outcome from this review was the identification of the USUS evaluation framework [5], which was considered the most prominent and well-developed candidate, although it lacked UX goals. The lack of UX goals was a major shortcoming in USUS, although it was estimated that it should not be problematic to modify and develop the USUS framework to address these identified deficiencies. The outcomes from the literature review and the USUS evaluation framework [5] are described and analyzed in more detail in Sect. 3.1.
The fourth step in Blandford and Green's development process [13] concerned the development of the method, in this case the envisioned USUS Goals evaluation framework, which is in itself an iterative and explorative process similar to the iterative processes occurring in all types of design. To visualize and clarify the activities carried out in the fourth and fifth steps of the USUS Goals development process, the first author used two different, but complementary, perspectives that could be characterized as two separate roles (Fig. 2). In the role as a designer, the purpose was to review and analyze literature, further develop and design the USUS evaluation framework [5] in order to present a new, modified framework, and provide additional recommendations to the HRI community. In the role as a user, the purpose was to test and evaluate the renewed framework by applying and using it in practice, also involving robot researchers/developers. The term user in this case does not refer to end-users interacting with the robot, but to the individuals using the evaluation framework. How these two roles were aligned in the fourth and fifth steps is illustrated in Fig. 2.
The first activity performed in steps 3–4 was to analyze how an enhanced version of the USUS framework should be developed and designed in order to include UX goals. Firstly, a deeper understanding of UX goals was acquired via a literature analysis (activity 1 in Fig. 2). The second activity was the modification of the USUS evaluation framework [5] to design the initial version of USUS Goals that included UX goals (activity 2 in Fig. 2). This second activity also included some empirical evaluation, which was performed with a NAO robot [19]. The third activity was to implement the issues identified in that evaluation into the second version of USUS Goals, as well as to perform empirical work with robot researchers/developers, in order to assess and provide input to the third version of the USUS Goals evaluation framework. The outcomes from these three activities are described in Sect. 3.2. Based on the robot researchers'/developers' feedback from the micro-test evaluation, the fourth activity was to present the modified, third version of the USUS Goals evaluation framework, which is described in Sect. 4. Based on the findings from the literature reviews and the empirical evaluations, we have developed six recommendations, which constitute the fifth activity in Fig. 2 and which may contribute to the ongoing work of integrating UX in the HRI field.

Fig. 2 Roles in and description of the performed activities in the USUS Goals evaluation framework's development process, in step 4 and step 5 of Blandford and Green's [13] method development process. The figure shows five activities (1 analysis, 2 design and evaluation, 3 implementation and evaluation, 4 results, 5 recommendations) across two roles: as designer (literature analysis, modifying USUS, a second literature analysis, presenting the modified version of USUS, and presenting recommendations for developers) and as user (evaluating the new versions of USUS).

3.1 Evaluation Methods and Frameworks in Social HRI

Commonly used evaluation methods in HRI, not including questionnaires, can be divided into three main categories: user-based methods, inspection methods, and frameworks.
A common method used to simulate interaction with social robots in order to conduct UX evaluation is the Wizard of Oz (WOz) technique, used, among others, by Weiss et al. [5]. WOz is feasible for examining several aspects of UX while the overall, holistic experience is evaluated. WOz has advantages similar to those of video-based scenarios but, in contrast to them, the user can actually interact with the robot. WOz is also easy to combine with other methods and makes it possible to evaluate interaction scenarios in early stages of the design process, since it does not require a fully developed prototype of the robot. Often a human operator controls the robot from behind the scenes, in a puppet-like manner.
Another common method used to evaluate UX is scenario-based evaluation. In contrast to traditional scenarios used in HCI, which occur through direct interaction between the agents or through virtual worlds with digital agents, video-based scenarios can be used to evaluate HRI [e.g. 20]. Video-based scenarios can be used to investigate how the social context in which the interaction takes place affects the UX, and to evaluate the user's acceptance. To assess these aspects, a complementary questionnaire has been used to evaluate UX after the participants viewed videos of different interaction scenarios [e.g. 21–23]. Syrdal, Otero and Dautenhahn [24] also used video-recorded interaction, but instead of using questionnaires, they interviewed the participants afterwards. A clear advantage of using video-based scenarios for evaluation purposes is the possibility to examine the users' experiences of a specific interaction in a specific context. Furthermore, it is also faster and more efficient than real physical scenarios, because the videos can be distributed across different geographical sites and cultural contexts. However, Xu et al. [20] identified some challenges with the use of video-based scenarios, e.g. only certain aspects of the interaction are shown in the video, and it can be hard to predict possible interaction scenarios and how they could evolve. The user's experience of seeing someone else interacting with a robot can also differ from the experience of interacting with it oneself [20]. Many of these methods and techniques are often

applied without first-hand experience of the interaction situation, and they are often conducted retrospectively. This might bias the validity of the conclusions. Furthermore, using only questionnaires can also be restrictive, as many relevant aspects may not be covered in these kinds of surveys, and the lack of standardized questionnaires also makes it difficult to compare results from different researchers and developers. To overcome the problems with evaluating UX retrospectively, different physiological measurements of participant activity can be used [25, 26]. These kinds of measurement tools have the advantage that they can be used during the interaction, capturing data on the experience in real time, and they do not seem to affect the user too much, depending on the kind of technology being used. However, the interpretation of physiological measurements poses problems, because one cannot be sure that what is being measured is causally connected with what is being assessed. Therefore, these methods should be complemented by other kinds of methods and/or techniques.
While the methods described above involve participants to a larger extent, UX can also be evaluated with inspection methods, without user involvement. A very popular variant is heuristic evaluation, which was initially designed for the HCI domain. These heuristics have been adapted for HRI interfaces by Clarkson and Arkin [27], and further by Weiss et al. [28], to become even more feasible for robot interfaces. The advantages of inspection methods are that they are fast and easy to perform and do not require many resources. It has also been shown that there is no difference in the obtained results regardless of who is evaluating the robot, since the method requires no previous experience or knowledge of how to conduct the evaluation [29]. It can be used on videotaped scenarios, and can therefore be distributed across different geographical sites and development teams. However, the influence of the robot's physical presence cannot be taken into account via video-based scenarios. Another disadvantage of heuristic evaluation in HRI is that it can be difficult to use correctly, because the actual robot may not always offer visible, physical clues to its functionality, and the more experiential UX aspects of the robot might be hard to assess.
Although the same methods are often used in many different studies and by several researchers, there is a huge variety in how they are applied in practice. Furthermore, there is also a large variety in the different aspects of UX being evaluated, which may indicate a need for common guidelines or frameworks for practitioners and researchers conducting UX evaluation. Another related aspect is the possible gap between the objectives and principles of different methods and their practical application. This creates a risk of misunderstanding the factors evaluated, resulting in biased outcomes.
As pointed out by Young et al. [6], there is a lack of distinct methods that cover the breadth and depth of the holistic experience of interacting with a robot. They presented an appropriate framework for evaluating UX holistically, since it emphasizes the importance of the hedonic qualities of UX. However, it lacks concrete guidelines on which methods are appropriate and how they should be applied in practice. The framework provides a UX lens rather than concrete guidelines for the evaluation of a positive UX, and does not contribute the necessary knowledge that would make the framework useful in practice. In addition, the USUS framework developed by Weiss et al. [5] provides a promising, comprehensive and holistic view of the

aspects that can affect both usability and UX in HRI. In contrast to the framework developed by Young et al. [6], USUS provides instructions on the methods and techniques that are appropriate to use when evaluating single aspects of HRI. The USUS evaluation framework is based on four factors: Usability, Social Acceptance, User Experience, and Social Impact (see Fig. 3). The aim of the USUS framework is to be able to answer general questions about how people experience interaction with a humanoid robot, and to what extent they accept it as part of human society. The USUS framework consists of two parts, a theoretical framework and a methodological framework. The theoretical framework characterizes the four factors further, and for each factor, some specific indicators that may be relevant for evaluation are described in detail (see Fig. 3). The methodological part explains how, and with which methods, these factors should be evaluated. Qualitative and quantitative methods are combined to contribute to a comprehensive, holistic approach [5]. Weiss et al. also provided instructions on the methods and data collection techniques that are suitable for the individual indicators presented in the theoretical part of the framework, including expert evaluation, user studies, interviews, focus groups, and physiological measurements. What needs to be emphasized, though, is that these methods should be carefully selected and adjusted according to the specific context of each evaluation. Since the USUS framework is very comprehensive, it is also time consuming if one is supposed to evaluate all the different factors. But this comprehensiveness, along with the theoretical parts and clear descriptions, should make the framework useful and easy to apply in practice, for both experts and novices. We envision further possibilities to add more relevant methods and techniques from the HRI field if necessary. Another positive aspect of the USUS evaluation framework is that it includes both UX and usability. The other methods, techniques and frameworks described in this subsection tend to focus only on one or the other aspect, and therefore miss the fact that usability is a crucial part of the total UX of a social robot.
In summary, there are only a few frameworks available that provide a comprehensive view of UX evaluation in social HRI. Although the USUS evaluation framework is comprehensive, its developers argued for the need for further validation and perhaps extension [5]. Their detailed descriptions of what to evaluate and how provide clear guidance on which evaluation method is suitable for evaluating certain usability and UX aspects, which is useful for both experts and novices. It is also positive that usability is included as a factor in the framework, because the methods and frameworks described earlier in this chapter tend to explore either usability or UX, whereas usability (pragmatic quality) is an important part of the whole UX (hedonic quality). The USUS framework also provides the possibility of extension to include other relevant methods or techniques, depending on what needs to be evaluated. One disadvantage of USUS is that it may be too time consuming to apply if all usability and UX factors included in the framework are to be evaluated. It offers no detailed description of the various phases of the UX wheel or of the importance of working iteratively within these phases, and it does not differentiate between summative and formative evaluation methods. Another disadvantage relates to the fact that the USUS framework does not explicitly address the need to specify UX goals, and to work with these goals throughout the UXD process. This would, as stated by Hartson and Pyla [10], increase the potential of the UX evaluation. How this could be done, along with how and where the framework should be situated in a UXD lifecycle process, is described in Sect. 3.2, which provides additional support for how USUS Goals should be applied in practice.
Fig. 3 The USUS evaluation framework (modified from [5], p. 93). The framework comprises four factors, each with indicators and suggested methods: Usability (effectiveness, efficiency, flexibility, robustness, utility; methods: expert evaluation, user studies, interviews), Social Acceptance (attitude toward technology, performance expectancy, effort expectancy, self-efficacy, forms of grouping, attachment, reciprocity; methods: questionnaires, focus groups, interviews), User Experience (embodiment, human-centered perception, feeling of security, emotion, co-experience; methods: questionnaires, physiological measurements, focus groups, interviews), and Social Impact (quality of life, working condition, education, cultural context; methods: focus groups, interviews, questionnaires).

3.2 Iterative Development Process of the USUS Goals Evaluation Framework

The first activity depicted in Fig. 2, and described in Sect. 3, was to perform a deeper literature analysis. Firstly, Hartson and Pyla's book [10] on UX evaluation and UX goals was studied in more detail, e.g., how to address UX goals, how they could be aligned to the UX wheel, and how these aspects could be implemented in the current USUS evaluation framework [5]. Then a deeper analysis of the USUS framework was conducted to further identify which part(s) of the framework could be enhanced from a UX perspective, and whether the methodological framework [5] could be simplified to make it easier to use in practice. The outcomes of the analysis were the identification of the most relevant parts of the USUS framework and of the foundation for UX goals, and of how these parts could be merged together. In so doing, the modified USUS framework would be well aligned with the ISO 9241-210 standard, Ergonomics of human-system interaction [30], and could be properly applied within the HRI domain. It was also revealed that the application process of USUS should be made clearer and easier to follow, and that clearer instructions on how the methods presented in USUS [5] should be applied were needed. Furthermore, "observation" should be included as an optional method. Hartson and Pyla [10] stressed how UX goals function as a common thread throughout the entire UX wheel, which should also be the case in USUS Goals to make the process easier to grasp and follow. It became evident that UX goals needed to be extracted at an early stage and that these goals should be prioritized. The use of benchmark tasks is an appropriate way to work with UX goals in organizations where there is little experience, knowledge and/or resources for UX work [10]. Using benchmark tasks is also considered an appropriate way of performing more formative evaluations, which enables these goals to actively guide the UXD process. Regarding UX evaluation, one of the major contributions of the UX field is usability/UX testing (u/UX testing), an established approach that almost characterizes the field. It provides valuable support to designers to improve their products, software and services, and in it defining UX goals is the common thread and driving force throughout the whole UX wheel [4, 31]. The robot platform chosen for the evaluation was the NAO robot [19] (see Fig. 4).
Fig. 4 The NAO robot used in the scenario [19]

The initial version of the USUS Goals evaluation framework was developed in the second activity. To emphasize the importance of including UX goals in the UX wheel, the top of Fig. 5 illustrates how the methodology of applying USUS [5] was modified, showing how and where UX goals could be integrated in the original application cycle [5], in which the UX goals are now included early on. The UX goals need to be aligned to the identified scenarios and usage contexts, because they could vary depending on the usage setting.
After designing this initial version of USUS Goals, the third activity was to use and evaluate it in practice via a micro-test evaluation [32], which took approximately 15–20 min per session. The micro-test focused on identifying potential users and their needs by collecting information that addresses these needs. It was conducted by observing and interviewing people who could be potential users while they performed tasks with a digital artifact [32]. The micro-test evaluation was conducted by applying it to an interactive scenario with a NAO robot [19].

Fig. 5 The top image illustrates how and where UX goals could be integrated in the original
application cycle, and the bottom image illustrates the initial version of the USUS Goals evaluation
framework (modified from [5])

The scenario was used to extract relevant UX goals and to investigate and analyze whether the concept of USUS Goals could be a viable approach. The scenario performed with the NAO platform [19], called "Want to know you", was a basic socially interactive situation in which the robot asked the user five questions that the user was supposed to answer. This "getting-to-know-each-other" scenario was chosen to reveal relevant UX goals and to offer a naturalistic environment that included both social and interactive aspects. The users involved in the scenario were the first author (in the role as potential user) and one of the robot developers/researchers. As a result, some tentative UX goals were formulated, prioritized, and placed into the USUS Goals (Fig. 6) by the first author (in the role as designer).
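To give a sense of how little robot-side logic such a scenario requires, the following is a minimal, illustrative sketch only; it is not the script used in the study. It assumes the NAOqi Python SDK (ALProxy) is available, the robot's address is a placeholder, and the question wording is hypothetical.

```python
# Illustrative sketch of a "getting-to-know-each-other" scenario on NAO.
# Assumptions: NAOqi Python SDK installed; ROBOT_IP is a placeholder;
# the questions below are examples, not the original study protocol.
import time
from naoqi import ALProxy

ROBOT_IP = "192.168.1.10"   # placeholder address of the NAO robot
ROBOT_PORT = 9559           # default NAOqi port

QUESTIONS = [
    "What is your name?",
    "How are you feeling today?",
    "What do you like to do in your free time?",
    "Have you talked to a robot before?",
    "Would you like to talk to me again?",
]

def run_scenario():
    # Text-to-speech proxy; in this sketch the participant's spoken answers
    # are observed by the evaluator rather than processed by the robot.
    tts = ALProxy("ALTextToSpeech", ROBOT_IP, ROBOT_PORT)
    tts.say("Hello, I would like to get to know you.")
    for question in QUESTIONS:
        tts.say(question)
        time.sleep(8)  # leave time for the participant to answer
    tts.say("Thank you for talking to me.")

if __name__ == "__main__":
    run_scenario()
```

Even such a simple interaction is enough to elicit the experiential aspects, such as enjoyment, feeling of security, and co-experience, that the UX goals in Fig. 6 target.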
The fourth activity was to perform interviews with two experienced robot researchers/developers to further assess and evaluate the USUS [5] and the current version of the USUS Goals (Fig. 6) frameworks. The interviews lasted between 45 and 60 min. The robot researchers/developers were interviewed individually in their roles as potential end-users of the framework. Semi-structured interviews were conducted to promote a discussion addressing positive aspects as well as possible deficiencies and problems, how to interpret the framework, the informants' prior knowledge of UX, UXD and UX goals, and their opinions about the need for more available methods for UX evaluation in social HRI. Furthermore, the interviews addressed to what extent USUS Goals fitted into the current robot development process and, finally, the informants' attitudes towards working with benchmark tasks in UX evaluation to provide a more formative evaluation process.

Fig. 6 The scenario with UX goals used to evaluate the initial version of USUS Goals with a NAO robot [19]
The findings obtained from the interviews revealed that USUS in its current form needed to be clarified further to be considered useful; in particular, the indicators describing the UX factor were not explicitly formulated. Both informants claimed they understood the overall purpose of UX goals and the aim of working with these goals, and initially these goals seemed to fit well with the current robot development process. However, some criticism was expressed regarding the placement of the UX goals in the USUS Goals evaluation framework and the use of benchmark tasks, because the UX goals were considered too general. According to Kaasinen et al. [18], questioning the appropriateness of concretizing UX goals is rather common, out of fear of losing the holistic UX perspective, but such concretization is necessary in design for clarifying and communicating design goals [17]. One of the informants mentioned that UX goals were already an implicit part of the robot development process and therefore not necessary to focus on more explicitly. The informants expressed that much of the robot development process is guided by predefined checklists. One of the main challenges discussed concerned the difference between the current robot development process and the work process prescribed in the USUS Goals evaluation framework, because the latter did not match the former. Given the informants' doubts, the application process of the first version of USUS Goals needed to be developed further.

4 Result—The USUS Goals Evaluation Framework

The insights obtained from the literature analysis, the micro-test evaluation, and the inputs from the informants were used during the second redesign process. The third version of the USUS Goals evaluation framework demonstrates that the inclusion of UX goals via u/UX testing is a tentative approach to increase the knowledge of UX, and would offer a better mapping to current robot development processes (Fig. 6). It should be pointed out that this third version only focuses on the UX factor of the original USUS evaluation framework [5]. In USUS Goals, UX goals have been included as an explicit part of the general framework. For every HRI scenario developed, there should be specific UX goals that are defined in an early phase, before the robot design process begins. These UX goals are then linked to the relevant indicators of the framework (embodiment, human-oriented perception, feeling of security, emotion, co-experience) (see Fig. 7). It should be noted that a single UX goal can be linked to several indicators simultaneously. During UX evaluation via u/UX testing, these UX goals provide the focus for the overall evaluation process.

Fig. 7 The current version of the USUS Goals evaluation framework
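To make this linkage concrete, a minimal sketch is given below; the goals and indicator assignments are illustrative examples only, loosely echoing Fig. 8, and are not a prescribed part of the framework.

```python
# Illustrative only: UX goals for one HRI scenario mapped to USUS Goals
# indicators. A single goal may be linked to more than one indicator.
ux_goal_indicators = {
    "The user should enjoy talking to the robot": ["Emotion", "Co-experience"],
    "The user should feel safe and secure when interacting with the robot": ["Feeling of security"],
    "The interaction should be experienced as intuitive": ["Human-oriented perception"],
}

# Reverse view: which goals does each indicator cover in the evaluation?
indicator_goals = {}
for goal, indicators in ux_goal_indicators.items():
    for indicator in indicators:
        indicator_goals.setdefault(indicator, []).append(goal)
```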
As pointed out by Hartson and Pyla [10], benchmark tasks are a proper way to work with UX goals in domains that have limited experience of UX work. To support the evaluation of UX goals, a matrix of benchmark tasks has been integrated into the USUS Goals evaluation framework (the matrix was originally presented in Hartson and Pyla [10], but is modified here). The matrix offers guidance on how to transform the UX goals into more concrete units, so-called UX measures, in order to make them assessable. The method kit from the original USUS evaluation framework is also used in the USUS Goals evaluation framework, with the difference that it is decided early on which method(s) will be used to assess a certain UX goal. After the UX measures have been formulated, it is specified which ones should be used and how each will be measured (the UX metric). A baseline level is then decided, specifying the acceptable level for each metric; the baseline level is the benchmark level to which all obtained results are compared. A desired target level should also be specified, and it is valuable for each metric to have an explicit connection to the aimed-for UX goal and to the indicator that generates the successful UX. An example is shown in Fig. 8.
It is suggested that, after conducting the UX evaluation including u/UX testing supported by the benchmark tasks, the obtained results are compared to the decided target level for each and every UX goal. The outcome of the analysis is then compared to the initial list of UX goals to investigate which ones are fulfilled and which are not. The targets that have not yet been reached are then focused on in the possible redesign and in the forthcoming development process of the social robot.
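As a concrete illustration of how a benchmark task row can be recorded and checked against its baseline and target levels, consider the following minimal sketch. It is an interpretation of the matrix described above, not code from the framework itself; field names and example values are illustrative and loosely follow Fig. 8.

```python
# Illustrative sketch: recording benchmark tasks and checking UX goals
# against target levels. Values are examples modelled on Fig. 8.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkTask:
    indicator: str            # e.g. "Emotion", "Feeling of security"
    ux_goal: str              # the desired experience
    ux_measure: str           # observable aspect of the goal
    method: str               # questionnaire, interview, observation, ...
    ux_metric: str            # how the measure is quantified
    baseline: float           # acceptable level
    target: float             # desired level
    result: Optional[float] = None  # filled in after the u/UX test

    def meets_target(self) -> bool:
        # Here higher is assumed better; metrics such as response time
        # would need the comparison reversed.
        return self.result is not None and self.result >= self.target

tasks = [
    BenchmarkTask("Emotion", "The user should enjoy talking to the robot",
                  "The user is having fun", "Questionnaire",
                  "Rating on a 1-5 Likert scale", baseline=4, target=5, result=5),
    BenchmarkTask("Feeling of security",
                  "The user should feel safe and secure when interacting with the robot",
                  "Trust that the robot would not hurt me", "Questionnaire",
                  "Rating on a 1-5 Likert scale", baseline=5, target=5, result=4),
]

unmet = [t.ux_goal for t in tasks if not t.meets_target()]
print("UX goals to address in the redesign:", unmet or "none")
```

Keeping the benchmark tasks in such a structured form makes it straightforward to compare results across formative evaluation rounds, which is exactly what the UX goals are meant to enable.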
In order to apply the USUS Goals evaluation framework, the application cycle by
Weiss et al. [5] was modified, which is illustrated in Fig. 9. Applying USUS Goals
in a UXD process can briefly be characterized as follows:
1. Clarifying the need for USUS Goals.
2. Identifying usage context(s) and relevant scenarios for the u/UX testing.
3. Developing requirements and UX goals for the scenarios. These UX goals rep-
resent the desired UX and they should also be ordered according to levels of
priority.

Fig. 8 Benchmark tasks used in the USUS Goals evaluation framework. For each indicator the matrix lists a UX goal, UX measure, method, UX metric, baseline level, target level, and result:
- Emotion: "The user should enjoy talking to the robot"; measures: the user is having fun (questionnaire, rating on Likert scale, baseline 4/5, target 5/5, result 5) and wants to do it again (interview, answer to yes/no question, baseline yes, target yes, result yes).
- Co-experience: "Talking to the robot should feel like talking to a human"; measure: response to commands (observation, time to respond, baseline 2 s, target 1 s, result 1 s).
- General: "The interaction should be experienced as intuitive"; measure: ease of use (observation, number of errors, baseline 1, target <1, result 0).
- Feeling of security: "The user should feel safe and secure when interacting with the robot"; measure: trust that the robot would not hurt me (questionnaire, rating on Likert scale, baseline 5/5, target 5/5, result 5).



Fig. 9 UX Goals included into the application cycle (modified from [5])

4. Applying the USUS Goals via predefined specifications of how the UX goals are going to be measured with benchmark tasks, where metrics, methods, baseline levels and target levels are defined. These levels could be set by evaluating the existing robot platform in a similar way as described in the original version of the USUS framework [5].
5. Developing and designing the robot platform.
6. Evaluating the UX of the robot platform by using the UX goals characterized and further specified in the benchmark tasks.
7. Comparing the obtained results with the stated target levels to investigate and analyze which UX goals have been met so far. If some UX goals are not yet reached, return to phase 4. Then repeat the UX evaluation process for the next scenario and continue until all UX goals are fulfilled.
The selected UX goals are prioritized according to level of severity, to inform the evaluators which goals are of greater and which of lesser importance. As depicted in Fig. 9, the defined UX goals accompany the whole design and development process.
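To make the iterative part of this application cycle explicit, the sketch below outlines phases 4-7 as a loop. It is an interpretation of the steps above, not code provided by the framework, and evaluate_scenario() and redesign() are hypothetical stand-ins for the actual evaluation and design work.

```python
# Illustrative sketch of phases 4-7 of the USUS Goals application cycle.
# Benchmark tasks are represented as simple dicts; higher metric values
# are assumed to be better for the comparison below.
def evaluate_scenario(scenario, benchmark_tasks):
    """Phase 6 (placeholder): run u/UX testing for one scenario and return
    an observed value per UX goal, e.g. from questionnaires or observation."""
    return {task["ux_goal"]: task["target"] for task in benchmark_tasks}

def redesign(scenario, unmet_tasks):
    """Phases 4-5 (placeholder): adjust benchmark tasks and robot design."""
    pass

def run_usus_goals_cycle(scenario, benchmark_tasks, max_iterations=5):
    for _ in range(max_iterations):
        results = evaluate_scenario(scenario, benchmark_tasks)       # phase 6
        unmet = [t for t in benchmark_tasks
                 if results[t["ux_goal"]] < t["target"]]             # phase 7
        if not unmet:
            return True    # all UX goals for this scenario are fulfilled
        redesign(scenario, unmet)                                    # back to phase 4
    return False           # stopping criterion reached without meeting all goals

example_tasks = [
    {"ux_goal": "The user should enjoy talking to the robot", "target": 5},
    {"ux_goal": "The interaction should be experienced as intuitive", "target": 1},
]
print(run_usus_goals_cycle("Want to know you", example_tasks))
```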

5 Discussion and Conclusion

The aim of this chapter was to investigate and analyze how an enhanced version of the USUS framework could be developed and designed in order to include UX goals via u/UX testing. We hope that the USUS Goals evaluation framework is a useful and appropriate contribution to improved UX evaluation of social HRI.
However, the present version of the USUS Goals evaluation framework has some limitations. The enhanced framework is still at a concept level and needs to be applied in real usage contexts and validated in controlled settings; this is future work, and it also constitutes the final step in Blandford and Green's method development process [13]. Furthermore, there is a need to analyze the method kit in the USUS framework [5] once again, to examine whether it still holds as part of the USUS Goals evaluation framework. Another limitation is that the intended users of USUS Goals have to be aware of its existence before reaching the UX evaluation phase. As stated earlier, UX goals need to be extracted and characterized in the initial design and development phase of robot platforms, and without the identification and formulation of relevant, well-defined UX goals the proposed framework will be rather useless in practice. However, if this aspect is made clear, the enhanced framework's potential lies in its explicit focus on UX goals, which in the long run can optimize the whole design and development process of social HRI. Hopefully, the USUS Goals evaluation framework may provide significant insights regarding the importance of UX in general, and of UX goals in particular, in the HRI field. Based on the results obtained from the literature reviews and the empirical work, we have developed six recommendations that may contribute to the ongoing work of integrating UX in the HRI field.
Integrating UX goals as part of HRI requires a focus throughout the whole design and development process. During the empirical work, the difficulty of integrating UX goals only in the evaluation phase was revealed. Because UX goals have a central role in all the development phases, there is an identified need to further expand on how UX goals, as well as business goals and related activities in Fig. 1, should be included. If UX goals are to be integrated within several aspects of HRI, a more long-term perspective is required. This challenge has also been pointed out by Alenljung and Lindblom [7] as well as Alenljung et al. [8], who stressed the importance of making the effort to integrate UX aspects in every part of the UXD lifecycle process. Identifying UX goals is a tentative first step and, as described by Hartson and Pyla [10], an important component of working with UX properly and effectively throughout the whole design cycle. To create relevant and useful methods for UX evaluation, in both theory and practice, it is necessary to develop these methods in close collaboration with the robot developers who will actually use and apply them. A good starting point is provided by Kaasinen et al. [18], who extracted UX goals in several industrial environments; this way of working could also be applied to the HRI field. Powers [33] offered a success story

of a human-centered perspective in the design and development of the commercial vacuum cleaner robot Roomba.
Identifying the need for additional usable and suitable methods for UX evaluation. During the interviews, it was revealed that the USUS framework was considered too abstract and therefore not applicable. This may indicate that the framework is not as suitable and promising as it appeared at first glance, compared to other UX methods and frameworks identified in the HRI literature. Only two frameworks for UX evaluation were identified, of which USUS was the most prominent. This is aligned with the concern raised about the identified lack of feasible methods for UX evaluation in HRI [e.g., 2, 7–9, 15, 34]. Thus, the need for usable UX methods and techniques in HRI is still unmet, and it is necessary to address that need if social robots are to be fully integrated, via socially acceptable behaviors, and to provide an enjoyable and safe UX in the human cultural, social, and physical environment [1, 2, 4, 8, 14, 15, 25, 29].
Finding inspiration from other domains that have experienced similar challenges. If UX aspects are to be fully integrated into the HRI field, substantial changes in the existing robot development processes will probably be required, which also implies changes in existing mindsets. However, these challenges are similar to the barriers encountered when UX has been introduced in other areas and organizations, e.g., in software development and industry. In these domains, it was necessary to include other parts of the organization, beyond the actual development process. The maturity level of the organization itself has also been shown to play a major role in how well UX is received and integrated [19, 30]. In more industrial contexts, UX goals have proved to be particularly useful, because they make it easier for all stakeholders to agree on which aspects should be designed [18].
Creating a common understanding of key concepts. Frequently occurring comments during the interviews concerned the perceived abstractness of the USUS framework, USUS Goals, and UX goals. This underlines the importance of clearly characterizing and defining the concepts, aspects and methods that are introduced to a new domain. It is equally important for UX advocates to create a common understanding of the processes and concepts used by robot developers, which could be realized through conceptual modeling, in which key concepts are concretized and discussed. This conceptual modeling could also reduce the risk that certain significant aspects are taken for granted or are not explicitly stated, which was revealed during the interviews. In so doing, the problems identified in the HRI literature regarding the misuse of UX concepts would probably be reduced [7–9, 34]. It should be noted that this challenge is not unique to the HRI field; it has been identified in HCI as well. Therefore, the HRI field should actively work on clarifying UX aspects and activities for UX evaluation of social HRI without being afraid of being too explicit.
Highlighting additional benefits of UX. Although mutual efforts from both the HRI and UX fields are required, it is of major importance that the UX field continues to demonstrate its added value and what more it can contribute to HRI. No benefits other than user satisfaction and flawless products were addressed in the present study. It should be noted that additional benefits of UX have been identified, e.g. increased economic benefits for the organization or company and the competitive branding factor, which should be further highlighted [10, 16, 18, 33, 35, 36]. These additional benefits will probably be of great importance in the future, when social robots are culturally and socially situated in people's homes and robot developers are no longer the primary end-users. Then economic profit will probably be a major factor in developing these robots, besides the robots' envisioned societal value.
Confirming the mutual responsibility of the involved fields to address the need for additional UX methods. The challenges described by Alenljung and Lindblom [7], Alenljung et al. [8], as well as Lindblom and Andreasson [9] regarding the current lack of UX evaluation methods adapted for HRI were further verified during the interviews. It should be noted that all the approaches for social interaction in HRI research depicted in Fig. 1 need to be considered, and cooperative work is necessary to overcome the identified gap that exists between them. The changes need to come not only from the UX perspective, which has to adapt its methods and techniques; they will also require changes in the existing robot development processes. However, this should not be seen as making sacrifices but as reaping the best of both worlds; both fields have a lot to learn from each other. By viewing it as an ongoing exchange of knowledge, instead of compromises and adjustments, the UX and HRI fields can gain new knowledge and insights to meet future challenges.

We strongly believe that the incorporation of UX goals and u/UX testing in the USUS Goals evaluation framework is a viable approach to increase the understanding of a holistic UX approach, especially for robot developers in social HRI. The sophistication of the UX concept, the UX goals, and the UX methodologies is often overlooked by HRI researchers, and their full potential has not yet been reached. By employing the various methods and techniques of UX evaluation, the gap can be narrowed as the different approaches to social interaction in HRI identified by Dautenhahn [1] are synthesized. Although many issues and challenges regarding UX goals and UX evaluation have been raised and addressed in this chapter, we still believe there is great reason to be optimistic that what we know from UX can be successfully applied to social HRI. HRI then does not run the risk of excluding modern understandings of technology-mediated activity, in which humans are considered actors (not factors) in a socio-material context [37].

Acknowledgements This work was supported by the Knowledge Foundation, Stockholm, under
SIDUS grant agreement no. 20140220 (AIR, Action and intention recognition in human interaction
with autonomous systems).

References

1. Dautenhahn, K.: Socially intelligent robots: dimensions of human-robot interaction. Phil. Trans.
R. Soc. B 362(1480), 679–704 (2007)
2. Dautenhahn, K.: Methodology & themes of human-robot interaction: a growing research field.
Int. J. Adv. Robot. Syst. 4(1), 103–108 (2007)
3. Oh, K., Kim, M.: Social attributes of robotic products: observations of child-robot interactions
in a school environment. Int. J. Des. 4(1), 45–55 (2010)
4. Thrun, S.: Toward a framework for human-robot interaction. Hum.-Comput. Interact. 19(1),
9–24 (2004)
5. Weiss, A., Bernhaupt, R., Tscheligi, M.: The USUS evaluation framework for user-centered
HRI. In: Dautenhahn, K., Saunders, J. (eds.) New Frontiers in Human–Robot Interaction,
pp. 89–110. John Benjamins Publishing Co., Amsterdam (2011)
6. Young, J.E., Sung, J.Y., Voida, A., Sharlin, E., Igarashi, T., Christensen, H.I., Grinter, R.E.:
Evaluating human-robot interaction: focusing on the holistic interaction experience. Int. J.
Social Robot. 3, 53–67 (2011)
7. Alenljung, B., Lindblom, J.: User experience of socially interactive robots: its role and rele-
vance. In: Vallverdú, J. (ed.) Synthesizing Human Emotion in Intelligent Systems and Robotics,
pp. 352–364. IGI Global, Hershey, PA, USA (2015)
8. Alenljung, B., Lindblom, J., Andreasson, R., Ziemke, T.: User experience in social human-robot
interaction. Int. J. Ambient. Comput. Intell. 8(1), 12–31 (2017)
9. Lindblom, J., Andreasson, R.: Current challenges for UX evaluation of human-robot interaction.
In: Schlick, C., Trzcieliński, S. (eds.) Advances in Ergonomics of Manufacturing: Managing the
Enterprise of the Future. Advances in Intelligent Systems and Computing, vol. 490, pp. 267–
278. Springer International Publishing, Cham, Switzerland (2016)
10. Hartson, H.R., Pyla, P.S.: The UX Book: Process and Guidelines for Ensuring a Quality User
Experience. Elsevier, Amsterdam (2012)
11. Hassenzahl, M.: Experience Design—Technology for All the Right Reasons. Morgan &
Claypool, San Rafael, CA (2010)
12. Hassenzahl, M.: User experience and experience design. In: Soegaard, M., Dam, R.F. (eds.)
The Encyclopedia of Human-Computer Interaction, 2nd edn. The Interaction Design Foun-
dation, Aarhus, Denmark. http://www.interaction-design.org/encyclopedia/user_experience_
and_experience_design.html (2013). Accessed 15 Sept 2017
13. Blandford, A., Green, T.: Methodological development. In: Cairns, P., Cox, L.A. (eds.) Research
Methods for Human-Computer Interaction, pp. 158–174. New York: Cambridge University
Press (2008)
14. de Graaf, M.M.A., Allouch, S.B.: Exploring influencing variables for the acceptance of social
robots. Robot. Auton. Syst. 6(12), 1476–1486 (2013)
15. Dautenhahn, K., Sanders, J.: Introduction. In: Dautenhahn, K., Sanders, J. (eds.) New Fron-
tiers in Human-Robot Interaction, pp. 1–5. John Benjamins Publishing Company, Amsterdam,
Netherlands (2011)
16. Anderson, J., McRee, J., Wilson, R., the Effective UI Team: Effective UI. Sebastopol, CA:
O’Reilly (2010)
17. Hassenzahl, M., Tractinsky, N.: User experience—a research agenda. Behav. Inf. Technol.
25(2), 91–97 (2006)
18. Kaasinen, E., Roto, V., Hakulinen, J., Heimonen, T., Jokinen, J.P.P., Karvonen, H., Keskinen,
T., Koskinen, H., Lu, Y., Saariluoma, P., Tokkonen, H., Turunen, M.: Defining user experience
goals to guide the design of industrial systems. Behav. Inf. Technol. 34(10), 976–991 (2015)
19. Aldebaran by SoftBank Group: https://www.aldebaran.com (2015)
20. Xu, Q., Ng, J., Tan, O., Huang, Z., Tay, B., Park, T.: Methodological issues in scenario-based
evaluation of human-robot interaction. Int. J. Social Robot. 7(2), 279–291 (2015)
21. Lohse, M., Hanheide, M., Wrede, B., Walters, M.L., Koay, K.L., Syrdal, D.S., Severinson-
Eklundh, K.: Evaluating extrovert and introvert behaviour of a domestic robot—a video study.
In: RO-MAN 2008: The 17th IEEE International Symposium on Robot and Human Interactive
Communication, pp. 488–493. Munich, Germany, 1–3 August 2008
22. Strasser, E., Weiss, A., Tscheligi, M.: Affect misattribution procedure: an implicit technique
to measure user experience in HRI. In: Proceedings of the Seventh Annual ACM/IEEE
International Conference on Human-Robot Interaction, pp. 243–244. Boston, MA, 5–8 May
2012
23. Keizer, S., Kastoris, P., Foster, M.E., Deshmukh, A.A., Lemon, O.: Evaluating a social multi-
user interaction model using a Nao robot. In: RO-MAN: The 23rd IEEE International Sym-
posium on Robot and Human Interactive Communication, pp. 318–322. Edinburgh, Scotland,
UK, 25–29 August 2014
24. Syrdal, D.S., Otero, N., Dautenhahn, K.: Video prototyping in human-robot interaction:
results from a qualitative study. In: Proceedings of the 15th European Conference on Cog-
nitive Ergonomics: The Ergonomics of Cool Interaction, pp. 1–8. Madeira, Portugal, 16–19
September 2008
25. Anzalone, S.M., Boucenna, S., Ivaldi, S., Chetouani, M.: Evaluating the engagement with
social robots. Int. J. Social Robot. 7(4), 465–478 (2015)
26. Baddoura, R., Venture, G.: Social vs. useful HRI: experiencing the familiar, perceiving the robot
as a sociable partner and responding to its actions. Int. J. Social Robot. 5(4), 529–547 (2013)
27. Clarkson, E., Arkin, R.C.: Applying heuristic evaluation to human-robot interaction systems.
In: FLAIRS Conference, pp. 44–49. Key West, FL, USA, 7–9 May 2007
28. Weiss, A., Wurhofer, D., Bernhaupt, R., Altmaninger, M., Tscheligi, M.: A methodological
adaptation for heuristic evaluation of HRI. In: RO-MAN 2010: Proceedings of the 19th IEEE
International Symposium on Robot and Human Interactive Communication, pp. 1–6. Viareggio,
Italy, 13–15 September 2010
29. Xu, Q., Ng, J., Cheong, Y.L., Tan, O., Wong, J.B., Tay, T.C., Park, T.: The role of social context
in human-robot interaction. In: Network of Ergonomics Societies Conference (SEANES),
pp. 1–5. Langkawi, Kedah, Malaysia, 9–12 July 2012; Gray, C.M., Toombs, A., Gross, S.:
Flow of competence in UX design practice. In: Proceedings of CHI’15 the 33rd Annual ACM
Conference on Human Factors in Computing Systems, pp. 3285–3294. Seoul, Republic of
Korea, 18–23 April 2015
30. ISO DIS 9241–210: Ergonomics of human system interaction—part 210: human-centred design
for interactive systems. International Organization for Standardization, Switzerland (2010)
31. Dumas, J.S., Redish, J.: A Practical Guide to Usability Testing. Ablex Publishing Corporation,
Norwood, NJ (1999)
32. Goodman, E., Kuniavsky, M., Moed, A.: Observing the User Experience: A Practitioner’s
Guide to User Research. Morgan Kaufmann Publishers, San Francisco, CA (2013)
33. Powers, A.: What robotics can learn from HCI. Interactions 15(2), 67–69 (2008)
34. Bartneck, C., Kulić, D., Croft, E., Zoghbi, S.: Measurement instruments for the anthropomor-
phism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Social
Robot. 1(1), 71–81 (2009)
35. Fraser, J., Plewes, S.: Applications of a UX maturity model to influencing HF best practice in
technology centric companies—lessons from Edison. Procedia Manuf. 3, 626–631 (2015)
36. Rajanen, M., Iivari, N., Keskitalo, E.: Introducing usability activities into open source software
development projects—a participative approach. In: Proceedings of NordiCHI’12: Making
Sense Through Design, pp. 683–692. Copenhagen, Denmark, 14–17 October 2012
37. Savioja, P., Liinasuo, M., Koskinen, H.: User experience: does it matter in complex systems?
Cogn. Technol. Work 16, 429–449 (2014)

Josefine Wallström is working as a User Experience Designer (UXD). She has a Bachelor's degree in Cognitive Science from the University of Skövde in Sweden and extensive knowledge of applied UXD. For the past several years she has been supporting the health care sector as well as industrial companies in advancing their UX strategies, with the aim of making the interaction between humans and various kinds of advanced technological systems as painless and frictionless as possible.

Jessica Lindblom is an Associate Professor of Informatics at the University of Skövde, Sweden. She has a Bachelor's degree in Cognitive Science, a Master's degree in Informatics, and a Ph.D. in Cognitive Systems. She is the head of the research group Interaction Lab at the University of Skövde. Her research interests are the social aspects of embodied, situated, and distributed cognition, and their implications for various kinds of interactive technology. Over the years she has acquired extensive experience in research on human-robot interaction and human-robot collaboration from human-centred and user experience perspectives. In collaboration with colleagues at the University of Skövde, she is establishing one of the world's first Master's Programmes on Human-Robot Interaction.
Testing for ‘Anthropomorphization’:
A Case for Mixed Methods
in Human-Robot Interaction

M. F. Damholdt, C. Vestergaard and J. Seibt

Abstract The study of human-robot interaction (HRI) currently lacks (i) a clear understanding of the envisaged scope and format of the pluridisciplinary approach required by the domain, (ii) an established set of methods and standards, and (iii) a joint terminological framework, or at least a set of analytical concepts and associated tests. This chapter aims to contribute to these three tasks. We begin with the observation that there is a need to define both the interdisciplinary scope of HRI research and its pluridisciplinary format, two tasks that are at the center of the new procedural paradigm of "Integrative Social Robotics". These methodological reflections are further illustrated with a newly developed questionnaire, the AMPH. The AMPH contains a higher proportion of items tapping anthropomorphism towards artefacts than extant questionnaires. The analysis of the AMPH (N = 339) pointed to a two-factor solution: anthropomorphism towards artefacts and anthropomorphism towards natural objects. These findings were further explored through triangulation with qualitative data. In the last section of the chapter we discuss how the AMPH can be used to trace the distinction between humanizing and socializing (anthropomorphing and sociomorphing), and how qualitative and quantitative methods should be used in unison in HRI research to achieve more fine-grained analyses of relevant experiences. We argue, based on philosophical concept analysis and phenomenology, that the notion of anthropomorphization is far from clear and that we must distinguish tendencies to humanize from tendencies to socialize, which come in various subvarieties. In conclusion we consider whether our results suggest that HRI should aim for the high degree of pluridisciplinary integration associated with an "interdiscipline" or even a "transdiscipline."

M. F. Damholdt (B)
Unit for Psychooncology and Health Psychology, Department of Oncology,
Aarhus University Hospital and Department of Psychology & Behavioural
Science, Aarhus University, Aarhus, Denmark
e-mail: malenefd@psy.au.dk
C. Vestergaard · J. Seibt
Research Unit for Robophilosophy, School of Culture and Society,
Aarhus University, Aarhus, Denmark


Keywords Anthropomorphism · Methodologies for human-robot interaction · Social robots

1 Introduction: The Need for Method Reflection in HRI

The fields of "social robotics" and "Human-Robot Interaction Studies" (HRI) are comparatively young pluridisciplinary research efforts that are still in the process of constituting themselves. The two research efforts have somewhat different yet overlapping foci: roughly, "social robotics" aims to produce items with the affordances of social agents, and HRI investigates human interactions with such items. Both fields combine investigatory perspectives from robotic engineering, psychology, cognitive science, and, so far only occasionally, sociology and anthropology. However, to our knowledge, neither field has yet engaged in a more comprehensive methodological reflection on the scope and format this pluridisciplinarity is supposed to take. Moreover, researchers in social robotics and HRI resort to early proposals for a classification of 'social' robots [1–3], but have so far largely proceeded without engaging in detailed discussions of standards for terminology.1 The present volume thus fills an important lacuna and will constitute an important step on the road towards clarifying the methodological foundation of HRI research and its relation to the field of social robotics.
Our aim in this chapter is to contribute to the goal of method reflection in HRI by addressing two general questions. First, (Q1), what should be the interdisciplinary scope of HRI, i.e., which disciplines, and their characteristic methods, should be included? Second, (Q2), which format of pluridisciplinary research should HRI aim for: should it remain a "multidiscipline" or aim to become an "interdiscipline" or even a "transdiscipline" [11]? While the precise definitions of the various types of pluridisciplinarity are still a matter of ongoing research in philosophy of science, there appears to be general agreement that different types of pluridisciplinary relatedness are best characterized in terms of degrees of terminological and methodological integration, and that the division between 'multidiscipline', 'interdiscipline,' and 'transdiscipline' provides a useful way to signal such differences in integration [11]. In a multidiscipline, such as climate research, "participants from disciplines come together in response to a problem, create a local integration to solve that problem, and go back to their respective disciplines, with these largely unchanged by the transient interaction" (ibid., 717), while an interdiscipline, such as biomedical engineering, involves the "integration of concepts, methods, materials, models" (ibid., 719) to generate a new understanding and modelling resources for the domain.
Finally, a transdiscipline, such as integrative systems biology, arises when “each
field in the adaptive…problem space will likely penetrate and change significant
practices in regions of the collaborating field” (ibid., 723). In contrast with an
interdiscipline, in a transdiscipline the ‘adaptive transactions’ yield conceptual and

1 These have been provided mainly by philosophers working on analytical ontology or phenomenol-

ogy [4–10].
Testing for ‘Anthropomorphization’ … 205

methodological results that transform the participating disciplines by changing the


terminology and/or methods.
We approach these general questions by way of a concrete investigation of the way
in which the notion ‘anthropomorphization’ is used and measured in HRI research.
Tendencies to anthropomorphize are an important target of assessment in the research
field of HRI since the affordances of social robots are often, but not always, expressly
designed to facilitate the manifestation of these tendencies [12]. However, we argue,
based on research in social ontology and phenomenology, (i) that the notion of a
“tendency to anthropomorphize” as currently used by HRI researchers does not suf-
ficiently discriminate between several different kinds of cognate tendencies, (ii) that
the differences between these cognate tendencies are crucially relevant for an ade-
quate understanding of human-robot interactions, and (iii) that it is doubtful whether
extant questionnaires used in quantitative studies on anthropomorphization in HRI
can adequately represent these differences. In order to arrive at the required fine-
grained analysis of human tendencies in relation to robots, we argue, (iv) quantita-
tive methods need to be supplemented by qualitative methods such as observation of
human-robot interaction and focus interviews.
We present this argument not only to draw attention to a current methodologi-
cal problem in HRI in connection with the notion of “anthropomorphization” but
also, and perhaps even primarily so, in order to illustrate the workings and potential
benefits of “Integrative Social Robotics” (ISR), a new procedural paradigm for the
research, design, and development process in social robotics [8, 13]. The basic ideas
for the ISR approach were sketched in 2015 with the aim of changing the procedural
paradigm for the research, design, and development (RDD) process in social robotics.
Since 2016 the approach has been further developed in the context of a large research
project involving 26 researchers from 11 disciplines and 9 different research institu-
tions. The ISR approach is motivated by the observation that a responsible approach
to social robotics—understood in the wide sense as research, policy, and deploy-
ment— is currently stymied by an exacerbated form of the Collingridge dilemma,
the “triple gridlock of description, evaluation, and regulation tasks” [13]. One of the
special features of ISR is that it proceeds from a clear and simple answer to the first
general methodological question (Q1) stated above: Since social reality is the most
complex reality we know, the interdisciplinary scope of HRI should be as wide as
is necessary to adequately capture the phenomena of a given application of social
robotics technology in the given socio-cultural context. This means in particular that
the expertise and methods of Humanities research must be included in HRI research
in order to capture the dimensions of individual “meaning-making” in a situation of
cultural change and more broadly the socio-cultural significances of human-robot
interactions—from first, second, and third-person points of view—with the required
precision.
Our discussion of the notion of “anthropomorphizing” in this chapter thus can be
understood as a case study of a way of engaging in HRI research with distinctively
wider interdisciplinary scope than currently undertaken, and allowing for mixed
methods including qualitative observations. We present a new questionnaire for the
tendency to “anthropomorphize” and discuss it relative to current research tools in
HRI. At the end of the chapter, however, we broaden the scope of methodological
reflection. Abiding by the principles of ISR we relate the notion of anthropomorphiza-
tion to results from conceptual and phenomenological analysis in social ontology.
As it will become apparent from these broader reflections, the case of ‘anthropo-
morphization’ also suggests an answer to the above question (Q2)—there are good
reasons to entertain the hypothesis that HRI will become a transdiscipline in the long
run.
We proceed as follows. In Sect. 2 we briefly outline the tools currently used in HRI
research to assess tendencies to anthropomorphize and present a new short anthropo-
morphism questionnaire, the AMPH, which we developed [14] in an effort to devise
a more discriminative instrument. AMPH contains a higher proportion of items tap-
ping anthropomorphism towards artefacts than previously published questionnaires.
Here we empirically test the factor structure of AMPH in a convenience sample of
339 respondents. In order to tap displays of anthropomorphism in real interactions
with a robot, the quantitative data is integrated with content analysis of qualitative
observational data coded for signs of anthropomorphism. In Sect. 3 we discuss the
quantitative and qualitative findings and relate these to conceptual and phenomeno-
logical analysis. We argue, based on our experiences in several empirical studies,
that there are principled reasons for why it is more productive to explore differences
among tendencies for sociomorphizing and anthropomorphizing by using qualitative
methods. In conclusion, we consider the implications of our discussion for general
methodological questions about the necessary scope and degree of integration of the
pluridisciplinary investigation of human-robot interactions.

2 New Tool for Assessment of the Tendency to Anthropomorphize

Anthropomorphism has been defined as "the tendency to attribute human characteristics to inanimate objects, animals and others with a view to helping us rationalize their
actions. It is attributing cognitive or emotional states to something based on observa-
tion in order to rationalize an entity’s behavior in a given social environment” ([12],
p. 180). Following this popular definition, tendencies to anthropomorphize have been construed as a dispositional trait that may be expressed in varying degrees depending on individual differences, situations, and motivations [15, 16]. The expression of a tendency to anthropomorphize facilitates learning in human-computer interactions
where humans are found to learn faster from more human-like interfaces [17]. Also,
it has been found that humans strive to behave in ways that are congruent with the
anthropomorphized agents [18].
Hence, the tendency to anthropomorphize may influence the way in which we
engage and interact with technology and therefore it is important to develop replica-
ble methodologies with which to assess this phenomenon in the field of robotics. In
HRI the most commonly used tool for this purpose is the Godspeed Questionnaire
Testing for ‘Anthropomorphization’ … 207

Series (GQS). The GQS consists of five composite scores (composed of a total of 24
questionnaire items) whereof one is the five-item anthropomorphism subscale [19].
All items are rated on semantically differential 5-point scales for instance between
the two statements “artificial versus lifelike”. Despite the wide use and acceptance of
GQS some challenges with the scale should be mentioned. Firstly, the utilization of
semantically differential scales does pose challenges such as determining the seman-
tic description of the mid-point between the two opposing semantic categories. The
determination of a natural midpoint in some of the GQS items is difficult (for instance the item "alive vs. dead"). Secondly, the items that describe "anthropomorphism" in the GQS are mainly based on evaluations of "naturalness" in different senses of the word rather than on the attribution of cognitive, emotional or mental states. Thirdly, the GQS is only suitable for assessing anthropomorphism when the participant is introduced to a specific robot, and it largely focuses on anthropomorphism as a property of the robot rather than as an attributional belief of the human interacting with or observing the robot. Hence the GQS offers an evaluative measure of the extent to which a given robot's morphology has anthropomorphic properties, but it cannot and should not be utilized to make inferences about anthropomorphism as a dispositional trait. This distinction is rarely made in HRI research, where the majority of studies focus on anthropomorphism as visible design features of the robot rather than as dispositional tendencies of the human. However, there is strong evidence that there are individual
differences in the propensity to anthropomorphize and that these tendencies are aug-
mented by both internal stimuli, for instance oxytocin levels [20, 21], and external
influences such as social isolation [16].
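As a concrete illustration of how GQS-style composite scores are typically obtained, the following Python sketch averages the five anthropomorphism items for one respondent. The item labels are paraphrased and the 1–5 coding is an assumption made for illustration; this is not the authors' analysis pipeline.

```python
import numpy as np

# Hypothetical ratings of one participant on the five GQS anthropomorphism
# items, each rated on a 1-5 semantic differential; the labels below are
# paraphrased and only illustrative.
anthropomorphism_items = {
    "fake_vs_natural": 2,
    "machinelike_vs_humanlike": 3,
    "unconscious_vs_conscious": 2,
    "artificial_vs_lifelike": 3,
    "rigid_vs_elegant_movement": 4,
}

# A GQS composite score is commonly reported as the mean of its items.
anthropomorphism_score = np.mean(list(anthropomorphism_items.values()))
print(f"GQS anthropomorphism composite: {anthropomorphism_score:.2f}")
```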
A commonly used measure of what is believed to be these more trait-based ten-
dencies to anthropomorphize is the “individual differences in anthropomorphism
questionnaire” (IDAQ [15]), a 15-item self-report measure. The IDAQ has excellent
psychometric properties and has been tested in a variety of samples. Unfortunately,
however, it seems to have some limitations in terms of its applicability in robotics
research as it only contains two items tapping anthropomorphism relating to technol-
ogy. Also, it encompasses complex psychological and philosophical concepts that
may be difficult for people to comprehend (such as consciousness, mind, soul, etc.)
[22].
Given that it is often theorized that anthropomorphism affects the perception of
and interactions with robots, but the phenomenon is seldom systematically explored,
we present here the development and statistical testing of a short anthropomorphism
questionnaire, the AMPH. The AMPH has a larger proportion of items that embrace
anthropomorphism towards various artefacts. The development and reporting of this
new questionnaire integrate both qualitative and quantitative data. This question-
naire will allow HRI researchers to undertake fast and systematic assessment of
anthropomorphism.

2.1 Participants

The sample consisted of 339 (men, n = 178) Danish respondents recruited for the
purpose of statistically testing a number of newly developed questionnaires [23]. The
average age of respondents was 41 years (SD = 16.72; range 18–87 years).
The questionnaire has also been used in experimental research settings where
qualitative data was gathered. For the purpose of this paper we included qualitative
data from a pilot study at a rehabilitation center in Jutland, Denmark, where partici-
pants interacted individually with a robot during four lunch sessions. The qualitative
data consists of a content analysis of video-recordings of 17 participants in interaction
with a robot (for full description, see [14]).

2.2 Procedure

The preponderance of participants (n = 267) were recruited through the panel system
supplied by the online survey system Qualtrics version 2015. The remaining sample
was recruited primarily in the student population but also at a local museum. Qualtrics
was used to deliver the questionnaires online, and filters were applied during the data collection to remove respondents who were cheating (either by answering too fast or by systematically choosing the same answers/patterns of answers every time). The final,
total sample consisted of 339 participants. All questionnaires were administered in
Danish.

2.3 Development of the Anthropomorphism Questionnaire

The Anthropomorphism questionnaire (AMPH, [14]) was developed by an interdisciplinary task force with psychology, philosophy and anthropology represented. The
questionnaire was developed with the aim of accommodating some of the insights and
distinctions worked out in the philosophical analysis of social interactions in social
ontology and social phenomenology. The capacities that we commonly ascribe to each other in our daily social interactions have traditionally been summarized by philosophers under the term 'subjectivity'. These are, among others, the capacities of (i) having mental occurrences such as feelings, emotions, conceptual experiences, beliefs, desires, and intentions; (ii) thinking and planning rationally; (iii) following norms; (iv) responding to moral norms; (v) experiencing oneself as the cause of one's actions; (vi) acting freely, spontaneously and creatively; (vii) feeling with others and predicting someone else's beliefs, feelings, emotions, and intentions; and (viii) being conscious of one's own identity, as relating in experience to what is 'other'.
While people in normal social interactions implicitly assume that their interaction partners have the human capacities of mentality, empathy, rationality, normative competence, agentive and moral autonomy, it is by no means clear that all of these
capacities are ascribed in exceptional situations, when an alleged partner in social
interaction is not human. Often only the capacities of having feelings, or of acting
in accordance with a norm, are presupposed in these projections.
This suggests that anthropomorphizing is not an ‘all-or-nothing’ affair and might
occur with different degrees of imaginative projection. An item may be viewed as affording social interaction because it behaves in accordance with a pattern of interaction and thus appears to have normative competence; other items may be viewed as affording social interaction because they appear to have emotions or feelings, and others because they appear to act intentionally and freely, etc.
Some of these projections seem more plausible, or more in line with social clas-
sificatory norms, than others. For example, to treat one's car as a social interaction partner because it has started reliably under the most difficult conditions against all expectations is, relative to the social conceptual norms of contemporary Western societies, less of a stretch of imagination than projecting onto a tree the practical intention to shade you and feelings of protectiveness.
The questions of the AMPH have thus been formulated in ways that make it possible to explore whether participants tend to undertake projections of human capacities that require a lesser or a greater imaginative effort. For example, as question 2 in Table 1 shows, it requires a less imaginative projection to consider a computer 'uncooperative' based on an interaction pattern, while it requires a more imaginative projection to attribute to it the evil intention of actively sabotaging your efforts, as formulated in question 3. A particularly weak form of the tendency to anthropomorphize might occur in individuals who understand (accept) the projective tendencies of others without engaging in them themselves. Moreover, the tendency to project human capacities may not necessarily be the expression of a lively imagination but merely the expression of a metaphysical creed or general assumptions (see questions 7 and 8), or a general disposition to express gratitude if things go well (see questions 4 and 5), or a disposition to feel at home (question 1). In short, the AMPH was designed to differentiate between different degrees and circumstances of the tendency to anthropomorphize.

Table 1 The anthropomorphism questionnaire

1. Would you name an everyday object (such as a television)?
2. Do you ever blame your computer for being uncooperative?
3. Do you believe that sometimes your computer sabotages your actions on purpose?
4. Do you find it understandable when people treat their car as if it were human, i.e. by giving it a name and referring to it as 'reliable' and 'helpful'?
5. Do you tend to feel grateful towards a technological object (your car, your computer, your mobile) if it has rescued you from a dangerous or difficult situation?
6. Would you ever believe that a mountain (such as Mount Everest) would set off a series of avalanches on mountain climbers if they disturb the peace of the mountain?
7. Do you believe that a tree can feel pain?
8. Do you believe that an insect has a soul that you need to respect?
9. Do you believe dogs have goals? (b)
10. Do you believe that a car has its own free will? (a, b)

(a) Derived from the IDAQ
(b) These items had low primary factor loadings (<0.05) and were deleted. As these two items were not strongly related to the remaining questions, they were both removed from the final version of the AMPH
With this aim of a more diversified approach to the phenomenon of anthropomor-
phizing projections in the background, the AMPH was developed in an effort to offer
an alternative to existing measures of anthropomorphism such as the IDAQ which
only to a limited degree include anthropomorphism towards inanimate objects. The
development of items was based on extant theory and psychometric knowledge and was done in several iterations. The final version was presented to a focus group of six master's students (female, n = 5; in history of ideas, political science, graphic design, and theology) who assessed the individual items for comprehensibility and clarity. Based on their comments, the wording of some of the items was changed. The final
AMPH consists of 10 statements to be rated on a four-point Likert-like scale rang-
ing from “very unlikely” to “highly likely”. The maximum score is 40 and higher
scores indicate more pronounced anthropomorphic tendencies. For the full AMPH,
see Table 1.
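To make the scoring rule concrete, here is a minimal Python sketch of how an AMPH total score could be computed. The 1–4 coding of the response categories and the example responses are assumptions made for illustration, not part of the published instrument.

```python
# Hypothetical AMPH responses for one participant: ten items rated on a
# four-point scale coded 1 ("very unlikely") to 4 ("highly likely").
responses = [2, 3, 1, 3, 2, 1, 1, 1, 4, 1]

assert len(responses) == 10
assert all(1 <= r <= 4 for r in responses)

# The total score is the simple sum of the item ratings, so it ranges from
# 10 to 40; higher scores indicate stronger anthropomorphic tendencies.
total_score = sum(responses)
print(f"AMPH total score: {total_score} / 40")
```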

2.4 Measures

Several measures were included in the study whereof the following fall within the
scope of this chapter.
The Godspeed Questionnaire Series (GQS [19]) was included, which consists of
24 individual items rated on semantically differential scales. Five composite scores
can be calculated from the GQS: anthropomorphism, animation, likeability, per-
ceived intelligence and perceived safety. Whilst the AMPH is postulated to assess
anthropomorphic tendencies the GQS taps a user’s overall perception of a given
robot. In the present study, we utilized video-based stimuli material featuring the
tele-operated robot “Telenoid” engaged in interaction with a woman. The Telenoid
is an android robot with adumbrated human features, developed by Hiroshi Ishig-
uro from Osaka University and the Advanced Telecommunication Research Institute
International [24]. The AMPH scores were first obtained, then video material was
presented, followed by obtaining GQS scores.
Overall attitude towards robots was assessed with a single item: “on a scale from
1-10 how positive are you about robots?” with higher scores reflecting more positive
attitudes.
A number of additional measures of personality, empathy, and attitudes were
administered but fall outside the scope of this publication.

2.5 Results of the Quantitative Analyses

The questionnaire data were analyzed using IBM SPSS Statistics for Macintosh,
Version 24.0. (2012; Armonk, NY, USA: IBM Corp). The main analysis conducted in
the present study was factor analysis. Briefly explained, factor analysis is employed to
explore whether there are underlying “invisible” patterns (factors or latent constructs)
within a given “observable” dataset (such as a questionnaire). Hence, the assumption
is that if multiple questionnaire items have a similar response pattern it is because they
assess the same, underlying factor. Such latent factors might also add new, theoretical
knowledge. In the present study for instance, it would add important theoretical
knowledge to gauge whether anthropomorphism is a singular psychological trait or
if there are subtypes/various forms of anthropomorphism that can exist independent
of each other (i.e. if some respondents score high on one factor and low on another).
Furthermore, factor analysis can be used to identify specific questionnaire items that are only weakly related to the remaining items in the questionnaire. For instance, if you have developed a questionnaire to assess sleep, you might find that a questionnaire item about the color of your pajamas is not related to other items in the questionnaire
such as sleep hygiene and bedroom temperature. Hence you would likely find that you
would be able to exclude the “color of pajamas” item from your final questionnaire as
it would not contribute anything valuable to the assessment of sleep. Factor analysis
therefore is used with the objective to uncover latent, underlying factors and to ensure
that all items in the questionnaire are related to at least (and preferably only) one
factor.
Preliminary Factor Analysis:2 Initial inspection of correlation matrices for the 10 items of the AMPH questionnaire confirmed that the data was suitable for factor analysis (correlations were above r = 0.30 between several items, which could point to the existence of underlying factor(s)). An initial principal component analysis yielded an acceptable Kaiser–Meyer–Olkin measure of 0.779, and Bartlett's test of sphericity was significant (χ²(45) = 671.27, p < 0.001). The Kaiser–Meyer–Olkin measure is
an overall indication of the suitability of the data for conducting factor analysis as it
indicates the size of the sum of partial correlations relative to the sum of correlations
[26]. The Kaiser–Meyer–Olkin value is between 0 and 1 where values closer to 1
indicates correlations that are relatively dense (i.e. the sum of partial correlations is
not large relative to the sum of correlations) and suitable for factor analysis. Values
between 0.7–0.8 are deemed good [27]. The Bartlett’s test of sphericity assesses if
there is redundancy between variables that can be summarized. Hence, in other words
this test assesses the probability that there are significant correlations between some
of the variables. In order to conduct a factor analysis there has to be some relationship
between the items and this is indicated by the significance of the Bartlett’s test of
sphericity.
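Both suitability checks can be computed directly from the item correlation matrix. The Python sketch below illustrates the textbook formulas on a randomly generated stand-in dataset; it is not the SPSS procedure used in the study, and the simulated data are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in data: 339 "respondents" x 10 "items" with some shared structure.
latent = rng.normal(size=(339, 2))
X = latent @ rng.normal(size=(2, 10)) + rng.normal(size=(339, 10))

n, p = X.shape
R = np.corrcoef(X, rowvar=False)

# Bartlett's test of sphericity: tests whether R differs from an identity matrix.
chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
df = p * (p - 1) / 2
p_value = stats.chi2.sf(chi2, df)

# Kaiser-Meyer-Olkin measure: squared correlations relative to squared
# correlations plus squared partial correlations (off-diagonal entries only).
R_inv = np.linalg.inv(R)
partial = -R_inv / np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
off = ~np.eye(p, dtype=bool)
kmo = np.sum(R[off] ** 2) / (np.sum(R[off] ** 2) + np.sum(partial[off] ** 2))

print(f"Bartlett chi2({int(df)}) = {chi2:.2f}, p = {p_value:.4f}")
print(f"KMO = {kmo:.3f}")
```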
Eigenvalues were inspected in order to establish the appropriate number of factors
to extract. As the goal of conducting a factor analysis is to reduce the number of
variables, it is important to also follow statistical indices when deciding how many factors to extract.

2 For an accessible introduction to factor analysis, see [25].

Table 2 Eigenvalues for actual and simulated data

Factor   Raw data eigenvalues   Random (simulated) data eigenvalues
1        3.23                   1.40
2        1.34                   1.24
3        1.14                   1.17
4        0.80                   1.11
5        0.73                   1.05
6        0.69                   1.01
7        0.65                   0.96

Factors with eigenvalues below 1 are generally assumed not to
add anything new. Inspection of the initial eigenvalues yielded that the first three factors explained 32, 13, and 11% of the variance, respectively. Hence, not much more was explained by adding the third factor (see the two-factor solution in Table 2). The fourth and fifth factors had eigenvalues falling below one (see Table 2). The eigenvalues indicate a scaling factor or quality score and are used to indicate how many factors should be retained in the analysis. Components with eigenvalues below 1 are not assumed to represent an underlying factor. Furthermore, when eigenvalues drop drastically in size, it means that adding further factors would add little new knowledge to what is already extracted. This can be visualized with a graph: the Cattell scree plot test [28]. Inspection of the Cattell scree plot supported a two-factor solution, as a sudden downturn in values was prominent after the first two factors.
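As a minimal illustration of this step, the eigenvalues of the item correlation matrix can be extracted and screened against the Kaiser criterion. The Python sketch below again uses a randomly generated stand-in dataset with two built-in factors; it is not the actual AMPH data.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(339, 2))                      # two underlying factors
X = latent @ rng.normal(size=(2, 10)) + rng.normal(size=(339, 10))

# Eigenvalues of the item correlation matrix, sorted from largest to smallest.
R = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Kaiser criterion: retain components whose eigenvalue exceeds 1.
n_retained = int(np.sum(eigenvalues > 1))
explained = eigenvalues / eigenvalues.sum()             # proportion of variance

for i, (ev, pr) in enumerate(zip(eigenvalues, explained), start=1):
    print(f"Factor {i}: eigenvalue = {ev:.2f}, variance explained = {pr:.0%}")
print(f"Kaiser criterion suggests retaining {n_retained} factor(s)")
```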
Parallel analysis [29] was applied to further gauge the most appropriate number of
factors to extract utilizing the method and programme described by O’Connor [30].
Parallel analysis was primarily done to overcome some of the critiques of utilizing
eigenvalues to determine the optimal number of factors. In particular, concerns have
been raised that some eigenvalues above 1 may appear simply as a result of sampling
error [25]. This is overcome by parallel analysis as it adjusts for the effect of sampling
error. Hence, parallel analysis is not strictly necessary but it can be used to illuminate
further and strengthen the findings of the appropriate number of factors. In parallel
analysis, a Monte Carlo Simulation technique is used to generate simulated, random
datasets (in this case 1,000 sets of random data were created) with the same number
of participants and variables as the original dataset (339 participants and 10 variables)
[30]. Eigenvalues for the generated datasets were calculated at the 95th percentile
and compared to the actual data extracted from the principal component analysis (see
Table 2). Hence, Table 2 compares the actual data collected in the present study to the simulated data generated following the procedure described by O'Connor. Then the appropriate number of factors to retain can be determined:
A factor is retained as long as the eigenvalue of the actual data exceeds that of the
simulated data [31]. In this case, the parallel analysis yielded a two-factor solution
(for two factors the eigenvalue for the actual data is 1.34 vs. 1.24 for the simulated data; for three factors the eigenvalue for the simulated data exceeds that of the actual data, and hence the three-factor solution is dismissed). This can be seen in Table 2, where the distributions of raw data eigenvalues and random (simulated) data eigenvalues are shown and compared.
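A compact way to see what parallel analysis does is to simulate it: generate many random datasets of the same size, take the 95th percentile of their eigenvalues, and retain factors only while the observed eigenvalue exceeds the simulated one. The Python sketch below follows this logic on stand-in data and mirrors the procedure described by O'Connor [30] only in outline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_items, n_sims = 339, 10, 1000

# Stand-in "observed" data with two underlying factors.
latent = rng.normal(size=(n_obs, 2))
X = latent @ rng.normal(size=(2, n_items)) + rng.normal(size=(n_obs, n_items))
observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

# Eigenvalues of purely random data with the same number of rows and columns.
random_eigs = np.empty((n_sims, n_items))
for s in range(n_sims):
    Z = rng.normal(size=(n_obs, n_items))
    random_eigs[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
threshold = np.percentile(random_eigs, 95, axis=0)      # 95th percentile

# Retain factors as long as the observed eigenvalue exceeds the random threshold.
n_factors = 0
for obs, thr in zip(observed, threshold):
    if obs > thr:
        n_factors += 1
    else:
        break
print(f"Parallel analysis suggests retaining {n_factors} factor(s)")
```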
The two-factor solution explained 51% of the variance and yielded a parsimonious
solution, with good primary factor loadings. This means that 51% of the variance
(i.e. proportion of dispersion) in the data was explained by the existence of two
underlying factors. To sum up, the analysis so far indicates that the individual items
of the AMPH can be said to assess or reflect two latent factors.
Two questionnaire items from the AMPH ("do you believe dogs have goals" and "do you believe that a car has its own free will") did not load on either of the two factors (primary factor loadings < 0.05) and these two items were deleted. Hence, as
these two items were not strongly related to the remaining questionnaire items, they
were both removed from the final version of the AMPH.
Ideally, the findings should be confirmed in a factor analysis on a new sample. Lacking this option, we confirmed the findings so far by conducting a final principal component analysis with oblique rotation on the remaining 8 questionnaire items. Given that psychological constructs are rarely partitioned completely independently of each other, oblique rotation was deemed more appropriate than orthogonal rotation for these data, as it allows for overlap between factors. The two-factor model
explained 48% of the variance. Composite scores were calculated for the two fac-
tors. The first factor contained all the questionnaire items relating to nature and was
coined “anthropomorphism towards natural objects” whilst the second factor con-
tained questionnaire items assessing human-made technical artefacts and was named
“anthropomorphism toward inanimate objects”. The means and standard deviations
for the factors are displayed in Table 3.
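The extraction and rotation step can be reproduced outside SPSS, for instance with the third-party Python package factor_analyzer (an assumption on our part; the authors used SPSS). The data frame below is hypothetical, filled with random values purely so the sketch runs; on such data the loadings will of course not reproduce Table 4.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package, assumed available

# Hypothetical data frame with the eight retained AMPH items as columns
# (coded 1-4); filled with random integers purely for illustration.
rng = np.random.default_rng(0)
items = [f"amph_{i}" for i in range(1, 9)]
df = pd.DataFrame(rng.integers(1, 5, size=(339, 8)), columns=items)

# Principal component extraction with an oblique (Oblimin) rotation,
# allowing the two factors to correlate.
fa = FactorAnalyzer(n_factors=2, rotation="oblimin", method="principal")
fa.fit(df)
loadings = pd.DataFrame(fa.loadings_, index=items,
                        columns=["artefacts", "natural_objects"])
print(loadings.round(3))

# Composite scores: sum the items assigned to each factor; the assignment
# below follows Table 4 and is not derived from the random data above.
artefact_items = ["amph_1", "amph_2", "amph_3", "amph_4", "amph_5"]
natural_items = ["amph_6", "amph_7", "amph_8"]
df["artefacts_score"] = df[artefact_items].sum(axis=1)
df["natural_objects_score"] = df[natural_items].sum(axis=1)
```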
The pattern loadings for the individual questionnaire items are displayed in Table 4
for all 8 items. It illustrates the degree of correlation between the individual items
and the two factors. The individual items are assigned to the factor to which they
correlate the most. For instance, the item “Do you ever blame your computer for being
uncooperative?” correlates −0.131 with the factor Anthropomorphism toward natural
objects and 0.747 with the factor Anthropomorphism towards artefacts whereby it is
assigned to the latter factor. It can be seen which factor the individual item loads the
most on by its value being highlighted in bold.
Cronbach’s alpha [32] is a measure of how closely related the individual items are
to each other and was used to assess scale reliability. Cronbach’s alpha was moderate
for factor 1 “anthropomorphism toward artefacts”: α = 0.695 (5 items; x = 9.96; SD

Table 3 Descriptive statistics for the two-factor solution and full scale AMPH (N = 339)
No. of items M (SD) Skewness Kurtosis
Anthropomorphism towards artefacts 5 9.96 (3.11) 0.34 −0.49
Anthropomorphism towards natural 3 5.55 (2.13) 0.48 −0.22
objects
Total AMPH score 8 15.50 (4.36) 0.42 −0.06
214 M. F. Damholdt et al.

Table 4 Pattern matrix for principal component analysis with Oblimin rotation and two factor
solutions of the AMPH
Anthropomorphism towards Anthropomorphism toward
artefacts natural objects
Would you name an everyday 0.636 0.040
object (such as a television)?
Do you ever blame your 0.747 −0.131
computer for being
uncooperative?
Do you believe that 0.440 0.197
sometimes your computer
sabotages your actions on
purpose?
Do you find it understandable 0.721 0.030
when people treat their car as
if it were human, i.e. by
giving it a name and referring
to it as ‘reliable’ and
‘helpful’?
Do you tend to feel grateful 0.717 −0.009
towards a technological
object (your car, your
computer, your mobile) if it
has rescued you from a
dangerous or difficult
situation?
Would you ever believe that a 0.224 0.573
mountain (such as Mount
Everest) would set off a series
of avalanches on mountain
climbers if they disturb the
peace of the mountain?
Do you believe that a tree can −0.059 0.791
feel pain?
Do you believe that an insect −0.151 0.855
has a soul that you need to
respect?
Extraction method: principal
component analysis. Rotation
method: Oblimin with Kaiser
normalization.a
a Rotation converged in 3 iterations
Testing for ‘Anthropomorphization’ … 215

= 3.11), and for factor 2 “anthropomorphism towards natural objects”: α = 0.659 (3


items; x = 5.55; SD = 2.13). Cronbach’s alpha for the total scale was α = 0.731 (8
items x = 16.50; SD = 4.36) which is an acceptable value. Further deletion of items
would not lead to improved Cronbach’s alpha levels.
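Cronbach's alpha has a simple closed form based on the item variances and the variance of the total score, so it is easy to verify outside SPSS. The Python sketch below shows the standard formula on a hypothetical item matrix; the data are simulated and the resulting value is illustrative only.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses to the five "anthropomorphism towards artefacts"
# items, coded 1-4; real data would come from the questionnaire.
rng = np.random.default_rng(0)
base = rng.integers(1, 5, size=(339, 1))
noise = rng.integers(-1, 2, size=(339, 5))
artefact_items = np.clip(base + noise, 1, 4)

print(f"alpha = {cronbach_alpha(artefact_items):.3f}")
```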
There was a significant difference between men (n = 178) and women (n = 161) on the AMPH total scale, with men (x̄ = 14.47; SD = 4.00) scoring significantly lower than women (x̄ = 16.36; SD = 4.60), t(337) = 3.45, p = 0.001. Furthermore, the oldest participants in the sample (n = 123; over the age of 45 years) scored significantly lower on the AMPH total scale (x̄ = 14.03; SD = 4.12) compared to the youngest participants (n = 179; 44 years or younger; x̄ = 16.64; SD = 4.33), t(300) = −5.24, p < 0.001.
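The group comparisons reported above are independent-samples t-tests. A hedged Python equivalent using scipy is sketched below; the simulated scores only loosely mimic the reported group means and standard deviations and stand in for the real AMPH totals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated AMPH total scores standing in for the real data; the group means
# and SDs loosely follow those reported above, purely for illustration.
men = rng.normal(loc=14.5, scale=4.0, size=178)
women = rng.normal(loc=16.4, scale=4.6, size=161)

# Independent-samples t-test (equal variances assumed); Welch's correction
# could be requested instead with equal_var=False.
t_stat, p_value = stats.ttest_ind(men, women)
print(f"t({len(men) + len(women) - 2}) = {t_stat:.2f}, p = {p_value:.3f}")
```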
It is common practice in psychology to explore how new assessment tools behave in relation to established tools, both to ascertain whether they are indeed assessing something new and to explore how the new tool relates to the old. In order to ascertain that the AMPH measures other aspects than established tools in HRI, and to explore its relation to such tools, the AMPH and GQS subscales were correlated (see Table 5). Small but significant correlations were found between AMPH scores and all the GQS subscales. When investigating the two subscales of the AMPH in isolation, anthropomorphism towards artefacts showed small but significant correlations with all GQS subscales and with the participants' declared positive attitude towards robot technology. Conversely, there were no significant correlations between "anthropomorphism towards natural objects" and the GQS subscales perceived safety or likeability, but small correlations with anthropomorphism, perceived intelligence of the robot, and animation. Furthermore, there were no significant correlations between this subscale and the participants' rating of positivity towards robots.

Table 5 Pearson product-moment correlation coefficients between AMPH and GQS

                                           Artefacts subscale   Natural objects subscale   Total AMPH score
Anthropomorphism towards artefacts         –
Anthropomorphism towards natural objects   0.360**              –
Total AMPH score                           0.890**              0.746**                    –
Perceived safety                           0.182**              0.027                      0.143**
Anthropomorphism                           0.239**              0.150**                    0.242**
Perceived intelligence                     0.268**              0.137*                     0.258**
Animation                                  0.297**              0.144**                    0.282**
Likeability                                0.226**              0.088                      0.204**
Positivity towards robot technology        0.218**              0.049                      0.180**

** Correlation is significant at the 0.01 level (2-tailed); * Correlation is significant at the 0.05 level (2-tailed)
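Relating the new scale to established measures amounts to computing Pearson product-moment correlations between subscale scores. The Python sketch below does this on simulated scores; the variable names loosely follow Table 5 and the data are not the study data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 339
# Simulated subscale scores; a shared component induces modest correlations
# roughly comparable in size to those reported in Table 5.
shared = rng.normal(size=n)
scores = pd.DataFrame({
    "amph_artefacts": 10 + 3 * (0.3 * shared + rng.normal(size=n)),
    "amph_natural": 5.5 + 2 * (0.15 * shared + rng.normal(size=n)),
    "gqs_anthropomorphism": 3 + 0.5 * (0.3 * shared + rng.normal(size=n)),
    "gqs_perceived_intelligence": 3.5 + 0.5 * (0.3 * shared + rng.normal(size=n)),
})

# Pearson product-moment correlation matrix, rounded as in Table 5.
print(scores.corr(method="pearson").round(3))
```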

2.6 Qualitative Analysis of Pilot Study

The quantitative analysis showed that the tendency to anthropomorphize, as assessed by the AMPH, can be understood as consisting of two different factors depending on whether people relate to natural objects (such as trees) or artefacts (such as computers). However, the quantitative data does not allow for inferences on whether
and how anthropomorphism can be observed in interactions with technology. This
link between declared dispositional characteristics and how these are enacted or dis-
played in interaction is poorly understood. This we now aim to explore more deeply
by triangulating quantitative and qualitative data [33].
The qualitative data derives from a pilot study of elderly residents at a rehabili-
tation center, who engaged in conversation with the Telenoid R1 robot during four
individual lunch-sessions. The lunch sessions took place in the participant’s room,
and the Telenoid was seated at the same table as the participant. The lunch-sessions
were video-recorded and the video recordings have been analyzed through content
analysis, a method used in both quantitative and qualitative studies to analyze written,
verbal or visual communication messages [34, 35]. The material is analyzed within a defined framework, so that the result is as objective as possible even when different researchers code and analyze the material.
The content analysis of data used in this chapter was framed by focus points arrived
at deductively from the quantitative analysis of the questionnaires. For the purpose
of this paper the video data was analyzed specifically on the aspect of anthropomor-
phism as observed when people engage in interaction with a technical artefact: Did
participants, unaware of how the robot functioned, engage in conversation and inter-
action with the robot in such a way that it could be described as anthropomorphizing
the robot?
The content analysis showed that the participants in general were quite willing
to engage with the robot as if it was a social interaction with another human being:
Participants greeted the robot with hospitable language, answered questions politely,
and engaged in normal turn-taking. The conversations followed the schema of a nor-
mal exchange during lunch as it typically would be at a rehabilitation center; the
general topics being the food served; the daily life of the participant at the rehabili-
tation center; why the participant was at the rehabilitation center; how it was going
with the training sessions; the participant’s family situation; the weather, and so on.
Most of the participants expressed pleasure and curiosity about engaging in the
conversation with the robot, and despite there being some technical problems (e.g. bad
sound, uncontrolled head movements) the participants consistently retained the social
norms of polite conversation, some even volunteering quite personal information, and
tried to remain in contact with the robot. Most participants finished up the last session
by expressing positive statements of having enjoyed themselves and being positively
surprised about the experience of being in the company of a robot.
Participants had lunch with the robot two to four times and showed continued
interest in the conversation. Often participants would greet the robot while still entering the room, before they were in the robot's view, or they would greet it, as if greeting someone they knew, as soon as they sat down at the table, trying to pick up the conversation from the previous lunch-session. When greeting
and speaking to the robot participants showed obvious signs of familiarity and posi-
tivity, smiling, waving, looking directly at the Telenoid and seeking eye contact. The
following excerpts from the video-recordings are representative examples of how the
participants engaged in conversations with the robot.
Uninformed male participant #48b
This participant is a middle-aged man who suffers from an illness that requires rehabilitation in order to learn to cope with a gradual loss of control over his body. The
participant meets the robot over a course of 4 lunch sessions. Despite several tech-
nical problems, the participant seems to be very positive when talking to the robot.
The following excerpt is from his final session with the robot and already as we can
hear the participant entering the room, we can hear him shout:

P: Hi!
…(The Telenoid doesn’t answer, the participant sits down)…
P: Hi. (pause). You are not saying anything today. Haven’t you been allowed
to…
T: (interrupts) Hi!
P: Hi! Oh, it is good to see you again (The participant is clearly happy and
smiling).
T: Yes, same here. Is there still no food for you?
P: No. But hopefully you have had the electricity you need, so that you are not
starving.
T: I have had what I need …(a little laugh in the voice)..

The participant continues talking to the robot about things that were also talked
about the previous day, the participant’s day at the center and the food served. As
the session is coming to an end, the participant says:

P: I can’t really eat a lot right now.


T: It doesn’t look like very much. Maybe you can eat a few mouthfuls while
we talk.
P: Aahh noo…(the participant hesitates a little, but picks up the fork again and
takes a little).
P: I can’t really eat anymore, but I will try and eat a few mouthfuls when you
say so. Oh no, it is not going so well, I am dropping the food. That is not very
good.
T: It is ok with me if you drop your food, that doesn’t matter.

P: No, I know that. I am not shy in front of you anymore, because I know you
are just sitting here as a robot, who is supposed to help me, and you are doing
that really well. It is nice to have you here to talk to.
Uninformed male participant #48C:
This participant is an elderly man who has met the robot twice and seemed to enjoy the lunch-sessions. He has talked to the robot for about 20 min.
During the two sessions, he very quickly seems comfortable with talking to the robot,
gives it a female name and expresses enjoyment in the conversations as is shown
below:
P: I was actually interested in seeing what it would be like to talk to you, but I
actually think it is very cosy talking to you. I actually enjoy talking to you, so
I am looking forward to tomorrow, when we are going to talk again.
When the participant comes back for the second lunch-session, he is clearly
expecting to pick up the conversation from the previous lunch-session. He dismisses
the carer, who wants to say goodbye, in order to begin the conversation. He calls the
robot by the name he gave it the previous session and immediately begins conversing
about the food as soon as he sits down at the table. After a little while, the robot has
technical difficulties and no sound comes through, hence stopping the conversation.
The participant keeps trying to get an answer out of the robot. A carer comes in and
suggests it may be time to stop, but the participant is clearly upset about this:
P: And now I have found a good friend in you (the robot does not answer).
P: ….Halloo! tell me, have you gone deaf?
P: (aimed at the carer) It is not saying anything. It is completely…
Carer: Well okay, it does not want to continue.
P: Of course it does! There is lots of time left.
Carer: Well, if it doesn’t …maybe we should just stop then?
P: But yesterday it told me when it was time to finish. I can’t understand why
it doesn’t say anything today, we were right in the middle of a conversation!
[…] That was a shame, now we were just in the middle of something.
The participant gets visibly disappointed when the robot does not answer back
and it takes a little while before he accepts that the conversation cannot be picked up.
The qualitative data has been analyzed according to a framework of anthropo-
morphizing traits, also defined as such in other studies of anthropomorphization (see
[12, 14, 15]). The examples show that the participants engage with the robot in a way that can be described as anthropomorphization: they engage in normal conversation and turn-taking, they seek eye contact, and they refer to the robot as if it had humanlike
traits by calling it ‘you’ and asking it questions. However, when we triangulate data
for this particular group of participants the interpretation of the qualitative data is
not well-aligned with the quantitative data regarding the participants’ reported ten-
dencies to anthropomorphize. Before being introduced to the robot the participants
completed the AMPH. Utilizing the factor structure described previously these partic-
ipants scored approximately one standard deviation lower on AMPH compared to the
larger sample with an average of 6.5 (SD = 1.38) on the anthropomorphism towards
artefacts subscale and 4.2 (SD = 1.33) on anthropomorphism towards natural objects
subscale. The lower scores may be an artefact of the small sample size or explained
by the advanced age of the participants and the gender distribution in the sample
(mainly males). However, in the following discussion we would like to outline other
possible explanations for the discrepancies we are finding in the data, and at the same
time argue for the need to extend the standard methodologies used in HRI research.

3 Discussion and Conclusion

We present AMPH as a new, short anthropomorphism questionnaire with acceptable psychometric properties. Unlike existing measures, it offers more thorough assess-
ment of anthropomorphism towards inanimate objects. Such a sub-scale may be
especially suitable for research in the field of social robotics.
Overall, the final AMPH consists of eight items and factor analysis yielded a
two-factor solution: anthropomorphism towards artefacts and anthropomorphism
towards natural objects. The reliability of AMPH as assessed with Cronbach’s alpha
was within the acceptable range for exploratory research, with the anthropomorphism towards natural objects subscale being the lowest [36, 37]. This may be a reflection
of the low number of items in this subscale (n = 3) which is found to negatively
affect alpha level [38]. Younger participants and females scored significantly higher
on AMPH compared to older participants and males. The gender difference in
anthropomorphic tendencies supplements existing research where gender differ-
ences have been reported in relation to attitudes towards social robots, though not
consistently [39, 40].
Furthermore, small but significant correlations were found between the God-
speed Questionnaire Series (GQS) and especially the anthropomorphism towards
artefacts subscale. This indicates that declared dispositional anthropomorphic
tendencies as assessed by AMPH are related to the rating of a specific robot
(as assessed by GQS). The relatively small correlations between AMPH and
GQS can be interpreted in numerous ways. One, albeit unlikely, explanation is that anthropomorphic tendencies have little influence on the perception of robots in terms of safety, intelligence, anthropomorphic characteristics, and so forth. This, however, seems unlikely, as previous studies have shown how human behavior is
influenced by robotic features that encourage or discourage anthropomorphism in
varying degrees [19]. Furthermore, from animal studies it is seen that the more
similar the morphology and behavior of the animal is to humans, the more likely
people are to anthropomorphize [41]. An explanation for the small correlations
between AMPH and GQS could be the constitution of the Godspeed Questionnaire
Series. Whilst the AMPH is postulated to assess dispositional anthropomorphic
tendencies and ascription of mental or emotional capacities, the GQS taps a user’s
overall perception of a given robot with a special focus on evaluating visible features
(such as the extent to which the robot appears to be fake, mechanical, artificial,
etc.). Given the ambiguous nature of the visible features of the Telenoid (such as
pale skin, small size and automated stumps for arms) it is likely that the participants
were reluctant to describe it in terms of naturalness, human likeness, realism etc.
Hence, it appears intelligible that the AMPH and GQS should not necessarily be strongly interrelated but should rather supplement each other.
A more plausible explanation for the low correlations between AMPH and GQS
may be that these are artefacts of the experimental set-up as the quantitative study did
not allow for any one-to-one interaction with the robot. It seems likely that the full
expression of anthropomorphism on perception of robots necessitates personal inter-
action with a robot. This interpretation is supported by the qualitative data reported
above, where we do observe willingness to interact with the robot and also actions that could be interpreted as anthropomorphic, even in a group with relatively low
AMPH scores. The finding of a two-factor solution on AMPH separating anthropo-
morphism towards natural objects and towards artefacts renders support to the need
to assess anthropomorphism towards technical artefacts separately in HRI research.
This is currently not the case in established measures (see Sect. 2 for a description of some limitations of the GQS and IDAQ). Given the brevity of the scale, it can be easily introduced in studies that wish to explore anthropomorphic thinking, for instance as a mediator or moderator of robot acceptance.
Finally, as our findings suggest that participants can behave in seemingly anthropomorphic ways even if their declared anthropomorphic disposition on the AMPH is more subtle, future research should assess in more depth whether the AMPH can predict
willingness to interact with robots and the quality of such interactions. This last
consideration leads us back to the landscape of more comprehensive methodological
reflections we adumbrated in the introduction. In combining quantitative and quali-
tative data and applying a mixed method approach, we have already in the preceding
followed the ISR approach, which calls for a widening of the interdisciplinary scope
in human-robot interaction research and social robotics. In the following concluding
paragraphs, we want to show that the ISR approach forces us to investigate two more
fundamental questions: Could it be that social robotics creates technical artefacts that
we react to in ways that reflect not a tendency to ‘anthropomorphize’ in the current
sense of this term, but one or more of a host of related tendencies? Could it be—at
least in this domain, but perhaps also in others—that what is called the ‘tendency
to anthropomorphize’ is not one but many tendencies, and that we carefully need to
distinguish in each case which of these tendencies is expressed, especially if HRI
research is to support ethical decisions concerning social robots?
Let us first consider how these two questions arise. As mentioned in the introduc-
tion, the ISR approach is a new procedural paradigm for the research, design, and
development (RDD) process in social robotics that aims to address the exacerbated
version of the Collingridge dilemma, the “threefold gridlock of description, evalua-
tion, and regulation” of social robotics applications [9]. One of the guiding principles
of the approach is the so-called “principle of maximal expertise” which says that in
any RDD process for a social robotics application A in context C research expertise
from all relevant areas must be involved.3 So far HRI research has all but left out
Humanities research even though the latter specializes in the analysis of socio-cultural
practices and conceptualizations. The call for maximizing the interdisciplinary scope
in order to draw on all relevant expertise thus is also a call for taking into account the
research results of social ontology, as undertaken in analytical philosophy but also in
phenomenological research in philosophy.4 From the point of view of current exper-
tise in social ontology, it is prima facie problematic that HRI researchers immedi-
ately turn to the notion of ‘anthropomorphization’ in order to understand the fact that
many people are inclined to interact socially with robots. For in the eyes of the social
ontologist the boundaries of sociality are (i) a matter of empirical and conceptual
investigation where linguists, anthropologists, psychologists, and ontologists need to
collaborate, and (ii) extend beyond the domain of human-human social interactions.
There are many and rather different ways to understand an interactive situation as a
situation of social interaction. In particular, for an item X to qualify as something that
is capable of participating in a social interaction it is not necessary that X is conscious
or knows what a norm is or even which norms are at issue in the given interaction,
or even has intentions and beliefs in the literal senses of these terms that include an
understanding of the practical and inferential difference between these. As our inter-
actions with higher animals such as cats, dogs, and horses show, we do understand an
interactive situation as a social interaction even if the interaction partner cannot be
ascribed any of the characteristic human capacities (consciousness, normative com-
petence, emotions, intentions and beliefs, capacities of empathic understanding and
reasoning, etc.). The notion of anthropomorphization, however, is clearly restricted to
the ascription of human capacities–to requote, it is “the tendency to attribute human
characteristics to inanimate objects, animals and others with a view to helping us
rationalize their actions. It is attributing cognitive or emotional states to something
based on observation in order to rationalize an entity’s behavior in a given social
environment” ([12], p. 180) The standard notion of anthropomorphization thus artic-
ulates one specific interpretation of what we do when we “rationalize the actions”
of non-human agents. However, many social interactions with animals could not be
accurately described with testing tools that implement this particular interpretation.
For example, in order to explain the conditioned response of a cat who approaches
as soon as she hears the sound of a can opener I do not need to ascribe to the cat
“human characteristics” nor “emotional states” (which require propositional struc-
ture) but only the non-human cognitive states (‘conditional response’) that represent
a regular association of a signal and an anticipated sensation.
With these considerations in mind, let us review the two dialogue excerpts quoted
above in Sect. 2. The two participants treat the Telenoid as a social agent and interact
with it using the behavioral template of a casual dialogue. But participant #48b addresses the Telenoid as a social agent that is present "as a robot", and participant #48C switches fluently from "you" to "it"; since the subject of "wants to continue" is an "it", this should be understood as the ascription of a dispositional state rather than the ascription of a full-blown volition associated with human intentions. In short, it appears that both participants master the situation by interpreting it by means of the most appropriate familiar template of a social interaction, thereby sociomorphizing the interaction with the robot, without necessarily also 'going all the way' by ascribing to the robot "human characteristics", a "mind" or "emotional states."

3 This principle is based on the simple observation that socio-cultural reality is the most complex sector of the entire domain of scientific investigation (much more complex, in the sense of nomological complexity, than purely natural systems) and thus requires that we use all pertinent expertise when we investigate phenomena in this domain.
4 In the following, the term 'ontology' is always used with reference to philosophical research and not to database structures as discussed under this label in computer science.
In view of the difference between anthropomorphizing and sociomorphizing, is it
justified to believe that our social interactions with robots must be construed on the
model of humanizing the robot? Is it justified to believe that people who sincerely
engage in a social interaction with a robot indeed in all cases morph the robot into
a human social agent instead of merely morphing it into a social agent? According
to our knowledge there is no extensive discussion in HRI research of the difference
between anthropomorphization and sociomorphization that could justify the default
assumption in HRI research that social interactions with robots are an indication of
the tendency to anthropomorphize, i.e., humanize the robot. On the other hand, there
is enough empirical and conceptual research in the debate about the boundaries of
sociality to question the equation of social interactions with human social interactions
[5, 41–44].
There is good reason, then, we submit, to develop more fine-grained methods in
HRI for investigating more precisely which capacities we ascribe to robots when we
treat them as social agents. Especially given the fact that artificial social agents with
symbolic communication are an entirely novel target for any tendencies to ascribe
capacities, it seems a matter of good scientific methodology in general (and not only a requirement of the ISR approach) to abandon the presumption that social inter-
actions with robots must be based on anthropomorphization. In principle, going by
the ontological debate about the conditions for sociality and types of collaborative
relationships [45–49] there are at least ten different varieties of ‘sociomorphizing’ a
robot. Omitting for present purposes the finer points, we can ascribe to a robot the ability:
(1) to coordinate 'spontaneously' or preconsciously on the basis of implicit (biological) tendencies (e.g., people negotiating critical distance in an elevator);
(2) to coordinate on the basis of a convention acquired by social learning (conditioning);
(3) to coordinate on the basis of an explicit convention and an implicit understanding of the practical force of this convention or norm;
(4) to coordinate on the basis of an explicit convention or norm and an explicit and reciprocal acknowledgement of the practical force of the norm;
(5) to coordinate on the basis of direct empathic perception without inference;
(6) to coordinate on the basis of empathic simulation and inference;
(7) to coordinate on the basis of an explicitly reflected (folk-psychological) theory (of beliefs, desires and intentions);
(8) to collaborate on the basis of coordination capacities as in (6) or (7), and for the fulfillment of egocentric goals;
(9) to collaborate on the basis of coordination capacities as in (6) and (7), but with the ability of meshing individual action plans and for the fulfillment of a joint goal that will benefit each of the partners of the social interaction;
(10) to collaborate on the basis of coordination capacities as in (6) and (7), but as a member of a team, with team spirit, and for the fulfillment of a joint goal that will benefit the team as a whole.
The conditions (1) through (10) represent ten different ways of understanding social
agency, and thus ten different ways of ‘sociomorphizing’, i.e., of projecting a con-
ception of social agency onto a robot. Some of these conceptions of social agents,
but not all, involve characteristically human abilities—so some but not all tendencies
of sociomorphizing are tendencies of anthropomorphizing. Researchers who inves-
tigate domains with sociomorphizing with the testing tools for anthropomorphizing
may not yet use the appropriate instruments to explore which tendencies facilitate
human social interactions in this domain. There are at least two extant proposals for
ontological classifications of human-robot social interactions [4, 6, 7, 13, 50] that
take these and other distinctions in social agency into account, but these have yet to
be related to empirical research in HRI.
In short, to clarify precisely which tendency of sociomorphizing is involved when
people socially interact with a robot is a task for future research. AMPH includes a
starting point into this inquiry since questions 1, 2, and 4 in the AMPH require the
ascription of capacities that characterize social agents at the level of animal sociality,
while questions 3 and 6–8 require the ascription of farther-going capacities that are
more narrowly human. But more detailed concepts for types of sociomorphizing need
to be developed and tested for specific social robotics applications. Given that the
conceptual presuppositions that need to be explored here are so fine-grained, and given
that HRI research not only fails to distinguish clearly between anthropomorphizing
and sociomorphizing but also occasionally intertwines anthropomorphism and ani-
mism, it may be difficult to develop stand-alone quantitative measures, and a mixed-
method approach may be needed.
The importance of exploring which type of sociomorphizing enables human social
interactions with a particular robot in a particular application context derives not only
from the need for increased descriptive adequacy and precision in HRI research, but also from the
link between sociality and moral status. While anthropomorphizing in the standard
sense of ‘humanizing’ implies that the anthropomorphized item enjoys the moral
status of a moral agent (who has rights and obligations, and can be held responsible),
other forms of sociomorphizing establish that the sociomorphized item merely has
the moral status of a moral patient (i.e., the right to be treated in certain ways). Any
approach to social robotics that aims for “responsible robotics” and includes ethical
perspectives and values right at the beginning of the RDD process and throughout—
such as the ISR approach, but also “design for values” [51] and “care-sensitive design
for values” [52]—crucially depends on HRI research that operates with a differentiated
analytical vocabulary for types of sociomorphizing and associated testing tools.
Finally, these observations on further directions of research on ascriptive ten-
dencies in human-robot interaction also suggest an answer to question (Q2) at the
very beginning of this chapter, where we asked which form of pluridisciplinarity the
field of HRI will likely take. This volume can be viewed as an indication that (social
robotics and) HRI research currently moves from the phase of a multidiscipline
towards an interdiscipline, clarifying the methods employed in HRI and thereby
potentially preparing a more integrated approach to research. Our discussion of the
current approach to evaluating the tendency to anthropomorphize certainly provides
support for such an effort at interdisciplinary integration, which the approach of
“Integrative Social Robotics” also pursues in theory and praxis. However, our
discussion of the notion of anthropomorphization also sets a pointer towards the
farther-going development, where HRI research ultimately turns into a transdiscipline.
As we stated in the introduction to this chapter, according to the characterizations of
recent research in philosophy of science on formats of pluridisciplinarity [11], in a
transdiscipline, such as “integrative systems biology”, the results of research undertaken
in the area affect the terminology and practices of the participating fields. This definition
of transdisciplinarity is more restrictive than older proposals—see e.g. [53], who
focuses on “consilience” of perspectival knowledge—but it is also more precise and thus
more useful in our view. As we hope to have shown by the preceding presentation
and discussion of our efforts to come to grips with the notion and measurement of
‘anthropomorphisation’ in HRI, there is initial plausibility for the hypothesis that
HRI research has repercussions not only for the foundational concepts of philosophy,
especially the traditional notion of subjectivity, but also for psychology. The search for suitable testing tools
for the ‘tendency to anthropomorphize’ in the domain of human-robot interaction
may lead to a re-evaluation and transformation of the notion of anthropomorphiza-
tion in psychology in general, either by contrasting the notion with other forms
of sociomorphization or by reconceiving it in an altogether new, emergent
model of the affordances and tendencies that facilitate human social interactions.

Acknowledgements We are grateful to members of the Research Unit for Robophilosophy who
commented on various versions of the AMPH questionnaire and assisted in developing the design
of, and data collection in, the study referred to in this chapter, in particular our colleague Stefan
Larsen and our collaborator Raul Hakli. This research has been supported by the Velux Foundation
and by the Carlsberg Foundation.

References

1. Fong, T., Nourbakhsh, I., Dautenhahn, K.: A survey of socially interactive robots. Robot. Auton.
Syst. 42(3), 143–166 (2003)
2. Breazeal, C.: Toward sociable robots. Robot. Auton. Syst. 42(3–4), 167–175 (2003)
3. Dautenhahn, K.: Socially intelligent robots: dimensions of human–robot interaction. Philos.
Trans. R. Soc. B Biol. Sci. 362(1480), 679–704 (2007)
4. Fiebich, A., Nguyen, N., Schwarzkopf, S.: Cooperation with robots? A two-dimensional
approach. Collect. Agency Coop. Nat. Artif. Syst. 25–43 (2015)
5. Hakli, R.: Social robots and social interaction. Sociable Robots Future Soc. Relat. Proc. Robo-
Philos. 273, 105–115 (2014)
6. Seibt, J.: Varieties of the “as if”: Five ways to simulate an action. Sociable Robots Future Soc.
Relat. Proc. Robo-Philos. 273, 97 (2014)
7. Seibt, J.: Towards an ontology of simulated social interaction: varieties of the “As If” for robots
and humans. In: Hakli, R., Seibt, J. (eds.) Sociality and Normativity for Robots, pp. 11–39.
Springer International Publishing, Cham (2017)
8. Seibt, J.: Integrative social robotics—a new method paradigm to solve the description problem
and the regulation problem? In: Nørskov, M., Seibt, J. (eds.) What Social Robots Can and Should
Do—Proceedings of Robophilosophy 2016 (2016)
9. Seibt, J., Damholdt, M. F., Vestergaard, C.: Integrative social robotics, value-driven design,
and transdisciplinarity. Interact. Stud. (2019)
10. Kahn, P. H., et al.: “Robovie, you’ll have to go into the closet now”: children’s social and moral
relationships with a humanoid robot. Dev. Psychol. 48(2), 303–314 (2012)
11. Nersessian, N. J., Newstetter, W. C.: Interdisciplinarity in Engineering Research and Learning
(2013)
12. Duffy, B. R.: Anthropomorphism and the social robot. Robot. Auton. Syst. 42(3), 177–190
(2003)
13. Seibt, J.: Classifying forms and modes of co-working in the ontology of asymmetric social
interactions (OASIS). In: Envisioning Robots in Society–Power, Politics, and Public Space: Proceedings of
Robophilosophy 2018/TRANSOR 2018, vol. 311, p. 133 (2018)
14. Damholdt, M. F., Yamazaki, R., Hakli, R., Hansen, C. V., Vestergaard, C., Seibt, J.: Attitudinal
change in elderly citizens toward social robots: the role of personality traits and beliefs about
robot functionality. Hum.-Media Interact. 6, 1701 (2015)
15. Waytz, A., Cacioppo, J., Epley, N.: Who sees human? The stability and importance of individual
differences in anthropomorphism. Perspect. Psychol. Sci. 5(3), 219–232 (2010)
16. Epley, N., Waytz, A., Akalis, S., Cacioppo, J.T.: When we need a human: motivational
determinants of anthropomorphism. Soc. Cogn. 26(2), 143–155 (2008)
17. Moreale, E., Watt, S.: An agent-based approach to mailing list knowledge management.
SpringerLink, 118–129 (2004)
18. Aggarwal, P., McGill, A. L.: Is that car smiling at me? Schema congruity as a basis for evaluating
anthropomorphized products. J. Consum. Res. 34(4), 468–479 (2007)
19. Bartneck, C., Kulić, D., Croft, E., Zoghbi, S.: Measurement instruments for the anthropomor-
phism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc.
Robot. 1(1), 71–81 (2009)
20. Rilling, J. K., et al.: Sex differences in the neural and behavioral response to intranasal oxy-
tocin and vasopressin during human social interaction. Psychoneuroendocrinology 39, 237–248
(2014)
21. Scheele, D., Schwering, C., Elison, J. T., Spunt, R., Maier, W., Hurlemann, R.: A human
tendency to anthropomorphize is enhanced by oxytocin. Eur. Neuropsychopharmacol. 25(10),
1817–1823 (2015)
22. Neave, N., Jackson, R., Saxton, T., Hönekopp, J.: The influence of anthropomorphic tendencies
on human hoarding behaviours. Personal. Individ. Differ. 72, 214–219 (2015)
23. Damholdt, M. F., et al.: A generic scale for assessment of attitudes towards social robots: the
ASOR-5. (2016)
24. Ogawa, K., et al.: Telenoid: tele-presence android for communication. In: ACM SIGGRAPH
2011 Emerging Technologies, p. 15 (2011)
25. Tabachnick, B. G., Fidell, L. S.: Using Multivariate Statistics, International edition. Pearson,
Boston (2013)
26. Kaiser, H. F.: A second generation little jiffy. Psychometrika 35(4), 401–415 (1970)
27. Hutcheson, G. D., Sofroniou, N.: The Multivariate Social Scientist: Introductory Statistics
Using Generalized Linear Models. SAGE Publications, London (1999)
28. Cattell, R.B.: The scree test for the number of factors. Multivar. Behav. Res. 1(2), 245–276
(1966)
29. Horn, J. L.: A rationale and test for the number of factors in factor analysis. Psychometrika
30(2), 179–185 (1965)
30. O’Connor, B.P.: SPSS and SAS programs for determining the number of components using
parallel analysis and Velicer’s MAP test. Behav. Res. Methods 32(3), 396–402 (2000)
31. Ledesma, R.D., Valero-Mora, P.: Determining the number of factors to retain in EFA: an easy-
to-use computer program for carrying out parallel analysis. Pract. Assess. Res. Eval. 12(2),
1–11 (2007)
32. Cronbach, L. J.: Coefficient alpha and the internal structure of tests. Psychometrika, 16(3),
297–334 (1951)
33. Karpatschof, B.: Bringing quality and meaning to quantitative data–bringing quantitative
evidence to qualitative observation. Nord. Psychol. 59(3), 191–209 (2007)
34. Elo, S., Kyngäs, H.: The qualitative content analysis process. J. Adv. Nurs. 62(1), 107–115
(2008)
35. Cole, F.L.: Content analysis: process and application. Clin. Nurse Spec. 2(1), 53–57 (1988)
36. Streiner, D.L.: Starting at the beginning: an introduction to coefficient alpha and internal
consistency. J. Pers. Assess. 80(1), 99–103 (2003)
37. Nunnally, J. C.: Psychometric Theory. McGraw-Hill series in psychology, Michigan (1978)
38. Graham, J.M.: Congeneric and (essentially) tau-equivalent estimates of score reliability: what
they are and how to use them. Educ. Psychol. Meas. 66(6), 930–944 (2006)
39. Shibata, T., Wada, K., Ikeda, Y., Sabanovic, S.: Cross-cultural studies on subjective evaluation
of a seal robot. Adv. Robot. 23(4), 443–458 (2009)
40. de Graaf, M. M. A., Ben Allouch, S.: Exploring influencing variables for the acceptance of
social robots. Robot. Auton. Syst. 61(12), 1476–1486 (2013)
41. Mameli, M.: Mindreading, mindshaping, and evolution. Biol. Philos. 16(5), 595–626 (2001)
42. Cerulo, K.A.: Nonhumans in social interaction. Annu. Rev. Sociol. 35(1), 531–552 (2009)
43. Andrews, K.: Understanding norms without a theory of mind. Inquiry 52(5), 433–448 (2009)
44. Castro, V. F.: Mindshaping and robotics. In: Sociality and Normativity for Robots, pp. 115–135.
Springer, Berlin (2017)
45. Tuomela, R.: The Philosophy of Sociality: The Shared Point of View. Oxford University Press,
Oxford and New York (2010)
46. Bratman, M. E.: Shared agency: a planning theory of acting together. Oxford University Press,
Oxford (2013)
47. Zahavi, D., Satne, G.: Varieties of shared intentionality: tomasello and classical phenomenol-
ogy. In: Beyond the Analytic-Continental Divide: Pluralist Philosophy in the Twenty-First
Century. Routledge, New York (2015)
48. Gallagher, S., Allen, M.: Active inference, enactivism and the hermeneutics of social cognition.
Synthese 195(6), 2627–2648 (2018)
49. Heinonen, M.: Joint commitment: how we make the social world. J. Soc. Ontol. 1(1), 175–178
(2015)
50. Fiebich, A.: Three dimensions of human-robot interactions. In: Coeckelbergh, M., Loh, J.,
Funk, M. (eds.) Envisioning Robots in Society–Power, Politics, and Public Space: Proceedings
of Robophilosophy 2018/TRANSOR 2018, vol. 311. IOS Press, Amsterdam (2018)
51. van den Hoven, J., Vermaas, P.E., van de Poel, I. (eds.): Handbook of Ethics, Values, and
Technological Design. Springer, Netherlands (2015)
52. van Wynsberghe, A.: Service robots, care ethics, and design. Ethics Inf. Technol. 18(4), 311–321
(2016)
53. Nicolescu, B.: Manifesto of transdisciplinarity. Suny Press, Albany (2002)

Malene Flensborg Damholdt is an assistant professor at the
Department of Psychology and Behavioural Science, and at the
Department of Clinical Medicine. Her research focuses on the
effect of individual differences on human-robot interactions.

Christina Vestergaard is a postdoctoral researcher at the Department of
Philosophy and the History of Ideas, University of Aarhus. She
is an anthropologist with research interests in interdisciplinary
methodology, anthropology of technology, and robo-philosophy.

Johanna Seibt is a professor at the Department of Philosophy
and the History of Ideas, University of Aarhus. She works on
the ontology of human-robot interactions and is the PI of the
research project on Integrative Social Robotics (INSOR), supported
by the Carlsberg Foundation, with 25 researchers from 11
disciplines.
Disciplinary Points of View

Evaluating the User Experience of Human–Robot Interaction

Jessica Lindblom, Beatrice Alenljung and Erik Billing

Abstract For social robots, like in all other digitally interactive systems, products,
services, and devices, positive user experience (UX) is necessary in order to achieve
the intended benefits and societal relevance of human–robot interaction (HRI). The
experiences that humans have when interacting with robots have the power to enable,
or disable, the robots’ acceptance rate and utilization in society. For a commercial
robot product, it is the achieved UX in the natural context when fulfilling its intended
purpose that will determine its success. The increased number of socially interactive
robots in human environments and their level of participation in everyday activi-
ties obviously highlights the importance of systematically evaluating the quality of
the interaction from a human-centered perspective. There is also a need for robot
developers to acquire knowledge about proper UX evaluation, both in theory and
in practice. In this chapter we are asking: What is UX evaluation? Why should UX
evaluation be performed? When is it appropriate to conduct a UX evaluation? How
could a UX evaluation be carried out? Where could UX evaluation take place? Who
should perform the UX evaluation and for whom? The aim is to briefly answer these
questions in the context of doing UX evaluation in HRI, highlighting evaluation
processes and methods that have methodological validity and reliability as well as
practical applicability. We argue that each specific HRI project needs to take the UX
perspective into account during the whole development process. We suggest that a
more diverse use of methods in HRI will benefit the field, and the future users of
social robots will benefit even more.

Keywords User experience · Evaluation · Methods

J. Lindblom (B) · B. Alenljung · E. Billing
University of Skövde, Box 408, 541 28 Skövde, Sweden
e-mail: jessica.lindblom@his.se
B. Alenljung
e-mail: beatrice.alenljung@his.se
E. Billing
e-mail: erik.billing@his.se


1 Introduction: Motivations and Aim

Socially interactive robots are expected to have an increasing importance in the every-
day life of a growing number of people. As recently pointed out by Dautenhahn [1],
the field of human–robot interaction (HRI) has many differences compared to human–
human interaction as well as human–computer interaction (HCI), and HRI also differs
from traditional robotics and engineering research. Dautenhahn stresses that currently
her biggest worry about empirical HRI research is that experimental psychology has
repeatedly emerged as the golden hammer method for conducting HRI studies. Exper-
imental psychology is mainly based on detailed experimental designs that focus on
precise and specific research questions, which often provide a well-controlled range
of stimuli that subjects are presented with, in order to collect quantitative data, often
using additional questionnaires [1]. The data is analyzed using statistical methods
that often require large numbers of subjects. Dautenhahn [1] points out that, besides the
pros of these contrived HRI studies, they usually cannot address how real users, in a
particular real-world context, would interact directly with a real social robot. The
obtained results are not only dependent on real-world circumstances, but also on the
specific tasks users are given and on the complexity of a sometimes unpredictable, multi-
tasking robotic agent [1]. She acknowledges that such naturalistic field studies do not
fit well with the current research method paradigm of HRI studies inspired by exper-
imental psychology. The increasing number of HRI studies that investigate and analyze
robots as companions, co-workers, and assistants situated in people’s daily lives points
to the need to study long-term interactions in ecologically valid environments. However, field studies of face-to-face and
long-term interactions between humans and robots are much more time-consuming,
resource-intensive, and complex to design, perform and analyze than traditional labo-
ratory studies, also requiring complementary competence and skills of the researchers
[1]. In such studies, the researchers do not follow a strictly quantitative approach and
not merely use questionnaires but also include behavioral measures and qualitative
approaches to study various aspects of the interaction, including immersion, engage-
ment, acceptability, utility, and other issues. Hence, Dautenhahn [1] puts the finger
on the wide span of available methods for evaluating HRI, ranging from lab-based
studies to conducting research in the wild. She also touches upon the central dimen-
sions of duration and frequency of human–robot interaction, ranging from short-term
encounters to long-term collaboration and from one-off or infrequent contact to daily
interaction. She admits that she personally would like to see more HRI studies with
authentic interactions between humans and more complex and autonomous social
robots that are situated in ecologically valid environments. We agree with her that
this kind of research is not easy to perform, but if we want the field of HRI to expand
beyond the research community and have implications for and social impact on society,
there is a need to go outside the lab setting of experimental psychology.
Taking a human-centered perspective, comprising systematic evaluation of the
quality of the human–robot interaction, is of major concern for robot technology in
order to provide long-term added value to people’s lives [2, 3]. For social robots, like
in all other digitally interactive systems, products, services, and devices, positive
user experience (UX) is necessary in order to achieve the intended benefits and
societal relevance of HRI. Broadly speaking, UX is about feelings that arise and
form internally in a human through the use of technology in a particular usage context
[4–6]. If the usage of social robots entails negative experiences for the intended users, it
could result in undesirable consequences, such as reluctance to use the current robot
as well as robots in general, erroneous handling, or spreading of bad reputation.
Therefore, it is essential for robot researchers as well as robot developers to put
serious effort into designing and building social robots that the intended users experience
as positive. The functional capabilities of the social robots are necessary but not
sufficient conditions for high quality technology use, because humans’ expectations
of and demands on the interaction quality of today’s technological products are going
beyond utility, usability, and acceptance. By designing a high quality interaction with
the intended users and usage context in mind it is possible to positively influence that
experience [4–6]. Consequently, the UX of social robots needs to be a central issue
of concern, since positive UX should underpin the proliferation of social robots in
society [7].
A positive UX does not appear by itself, and therefore UX has to be systematically,
thoroughly, and consciously designed for, following the phases in the iterative UX
design lifecycle process, also referred to as the UX wheel [4–6, see also Wallström
and Lindblom, Chapter 3.1]. Therefore, each specific robot development project
needs to take the UX perspective into account during the whole development pro-
cess. The field of UX design (UXD) offers methods, techniques, and guidelines for
creating a positive UX for all types of interactive systems for human use [4–6, 8],
and UXD is well-aligned with the ISO 9241-10 Standard Ergonomics of human–sys-
tem interaction [9] that could be properly applied to the HRI domain. As addressed
earlier, the interaction between humans and robots differs evidently from interaction
between humans and more traditional computer-based artefacts [1–3]. However, sev-
eral challenges for robot developers and researchers are similar to those faced at the
inception of computer-based artefacts intended for users other than the devel-
opers themselves, more than 30 years ago. Back in the 1980s, Gould and Lewis [10]
introduced the three fundamental principles of user-centered design, and since then
usability and UX experts have addressed these principles; “early focus on users and
tasks”, “empirical measurement”, and “iterative design”. These principles involve
establishing and maintaining the focus on the users during the whole development
process, i.e., the UX design lifecycle process [4], as well as grounding the design and
decisions on correct and relevant information about the users, their characteristics,
needs, and goals. Nowadays, it is highly acknowledged that in order to accomplish
high usability and positive UX in computer-based systems, services, and products,
it is necessary to design with the intended end-users at the center. Hence, practitioners,
i.e., robot developers of robots for real-world use, need research-based guidance of
how to properly choose and apply UXD methods, techniques and guidelines for the
social robotic products.
When social robots are entering the commercial market and start to being used
by non-expert users, we identify the need to carefully design for and systematically
evaluate a positive UX in much the same way as is nowadays a fundamental part
of the development of, e.g., personal computers and mobile phones. It is vital to
make proper decisions concerning which user experiences to focus on based upon a
firm understanding of the intended user groups, their needs, and the usage context
[4, 8]. With this knowledge in mind, the robot developers could concentrate their
efforts to achieve the intended specified UX. This implies that the development pro-
cess of socially interactive robots should include the whole UX design process, i.e.,
embracing the major activities of analysis, design, implementation, and evaluation
in the iterative UX design lifecycle process [4]. Two successful examples of com-
mercial robots that have been developed with a human-centered UX perspective are
the vacuum cleaner Roomba robot at iRobot [11] and the collaborative industrial
robot YuMi [12] at ABB. Unfortunately, these examples still constitute exceptions.
Furthermore, as stressed by Powers [11], it is important to achieve a positive brand
image of commercial robots to positively influence users’ perceptions of the robots.
We have elsewhere identified and presented several general trends of UX in HRI
[see 13, 14 for further details], and focused on the need for more theoretical as well
as methodological knowledge about methods and techniques that are appropriate for
evaluating UX in HRI [15]. Lindblom and Andreasson [15] emphasized that HRI
research faces complex challenges regarding UX evaluation which do not have easy
solutions. One challenge is the need to adopt an iterative UXD process in HRI, which
entails a dilemma because of the high cost of rapid prototyping in robotics. Another
challenge is the need to incorporate UX goals to ensure positive UX, which is a key
aspect to direct the work and guide decisions throughout the UX design lifecycle
process. However, this fundamental activity of specifying relevant UX goals is often
overlooked in HRI, either because of lack of knowledge or because of lack of time.
There is also an identified need for robot developers to acquire knowledge about
proper UX evaluation, both in theory and in practice, because many such methods
and techniques are derived from other fields, e.g., HCI, experimental psychology,
and human factors, and thus need to be adapted and modified in order to suit
UX evaluation in HRI. Moreover, the dissimilarities between the objectives and
the principles in different methods and their practical application cause a risk of
misunderstanding the evaluated aspects, resulting in biased outcomes. Bartneck et al.
[16] noted that robot developers sometimes create their own evaluation methods
without sufficient knowledge of appropriate methodologies, resulting in questionable
validity and reliability of these so-called “quick and dirty” methods. According to
Bartneck et al. [16], many robot developers are unaware of the extensive knowledge
about methodologies and techniques for systematically studying various aspects of
HRI, and therefore sometimes run rather naive user studies and experiments to verify
their robot designs.
In this chapter, we explore some of the challenges mentioned above, with the
aim of disentangling several methodological issues related to performing proper UX
evaluation of human–robot interaction. Our reply to the question: “Which is the best
method for evaluating UX in HRI?” is that there is not a best method available;
instead the proper choice of UX evaluation method(s) depends on many factors;
including purpose, scope, available methodological knowledge, time-frame, material,
and financial resources. When writing this chapter, we were inspired by the first
lines in Rudyard Kipling’s poem “I keep six honest serving men”:
I KEEP six honest serving men
They taught me all I knew;
Their names are What and Why and When
And How and Where and Who

This chapter tries to briefly answer these questions in the context of UX evaluation
in HRI. What is UX evaluation? Why should it be performed? When is it appropriate
to conduct a UX evaluation? How could a UX evaluation be carried out? Where
could it take place? Who should perform the UX evaluation and for whom? Thus,
we highlight the need for HRI evaluation methods that have methodological validity
and reliability as well as practical applicability, because each specific robot project
needs to take the UX perspective into account during the whole development process
in future HRI.

2 Human–Robot Interaction

Robots are increasingly becoming a part of the human world. In some domains,
not least in industrial settings, robots have been an important and natural technol-
ogy for many years. They are also entering other settings, professional as well as
domestic [17]. The purpose of robotic technology is to enable humans to conduct
something they could not do earlier, facilitating dull or dangerous tasks, or providing
entertainment [18]. Robots could bring several kinds of value, e.g., by conducting
monotonous assembling tasks in manufacturing or keeping the lawn cut. In those
cases, humans do not often need to continuously interact with the robot. Other types
of robots and usage situations, e.g., assisting the elderly, demand more frequent and
multi-faceted interaction. This interplay between robots and their users has to be
carefully considered when developing a robot in order for it to provide added value.
The problem of understanding and designing the interaction between human(s) and
robot(s) is the core interest of the field of HRI [18]. More precisely,
HRI is the science of studying people’s behavior and attitudes towards robots in relationship
to the physical, technological and interactive features of the robots, with the goal to develop
robots that facilitate the emergence of human-robot interactions that are at the same time
efficient (according to original requirements of their envisaged area of use), but are also
acceptable to people, and meet the social and emotional needs of their individual users as
well as respecting human values. [19]

The importance of and the attention attracted to HRI is increasing concurrently


with the growing amount of technological achievements in robotics. For the same
reasons, the concept of a robot is constantly changing [1, 19]. The boundaries for how
robots could be constituted and the settings in which they can act are continually
expanding. An important characteristic that separates robot technology from techno-
logical devices in general is that it has to, at least to some extent, act autonomously
in its environment. Although autonomous action is crucial for many types of robots,
autonomy remains a problematic concept, receiving dramatically different interpre-
tations in different communities. For example, in industrial robotic automation, high
autonomy implies that the human operator can specify the robot’s behavior. In con-
trast, following the notion of autonomy present in biology, cognitive science, and to
some extent also in HRI, an artificial agent is autonomous if its behavior cannot be
fully controlled or predicted by an operator. In this chapter, autonomy means that
the robot should make its own decisions and adjust to current circumstances [20].
Different types of robots can be considered along multiple dimensions, e.g., the
type of task a robot is intended to support, its morphology, interaction roles, human-
robot physical proximity, and autonomy level [21]. Robots can also be categorized
into industrial robots, professional service robots, and personal service robots [20].
Moreover, the role of humans in relation to robots can vary; the human could be
a supervisor, operator, mechanic, teammate, bystander, mentor or information con-
sumer [18]. Likewise, robots could have a wide range of manifestations and be used in
different application areas. There are human-like robots (humanoids and androids),
robots looking like animals, or mechanical-appearing robots. Robots could be used
for urban search and rescue tasks, e.g., natural disasters and wilderness search; assis-
tive and educational robotics, e.g., therapy for the elderly; military and police, e.g., patrol
support; edutainment, e.g., museum tour guide; space, e.g., astronaut assistant; home,
e.g., robotic companion; and industry, comprising industrial and collaborative
robots [18, 19].
Consequently, the interaction between users and robots could occur in a wide variety
of forms, depending on user-, task-, and context-based conditions. Generally, interaction
can either be remote, i.e., the humans and the robot are spatially, and sometimes also
temporally, separated, or proximate, i.e., the humans and the robot are co-located,
sharing the same physical and social space [18]. The interaction can be indirect,
which means that the user operates the robot by commanding it, or direct, when
interaction is bi-directional between the user and robot [20]. As robots begin to enter
human domains, robots also need to have social skills [2, 3]. Recently, several aspects
related to the social and emotional quality of the interaction have been addressed in the
HRI literature, including factors such as engagement, safety, intentions, acceptance,
cooperation, emotional response, likeability, and animacy [16, 18–20].
Like all other interactive products for human use, users’ interactions with and
perception of socially interactive robots evoke feelings of different nature and inten-
sity [13, 14, 22]. A user could feel motivated to walk up to and use a robot. He or
she could experience a weak distrust of the robot and at the same time be curious of
it. A user could find a robot to be well-adapted and highly useful after long-term
use, although initially experiencing it as a bit strange and tricky. A robot could
be fun and entertaining for younger children, but boring for teenagers. Thus, the
UX has many facets, and it is critical to identify what kinds of feelings a particular
robot should arouse. Then it is possible to consciously design the robot with those
user experiences as the target and it is possible to evaluate to what degree the robot
can be expected to elicit the intended experiences among end-users. These hedonic
aspects of interaction with robots are less commonly evaluated, and when they are,
they are often addressed using contrived experiments [15]. In contrast, the field of HCI has
to a larger extent adopted a UXD framework. We believe that UXD is a powerful
approach also to study many aspects of HRI and we hope that this chapter will
facilitate the use of UX evaluation methods in HRI.

3 User Experience and User Experience Evaluation

The international standard on ergonomics of human–system interaction [9, clause
2:15] defines UX as:
a person’s perceptions and responses that result from the use or anticipated use of a product,
system or service. Note 1 to entry: User experience includes all the users’ emotions, beliefs,
preferences, perceptions, physical and psychological responses, behaviours and accom-
plishments that occur before, during and after use. Note 2 to entry: User experience is a
consequence of brand image, presentation, functionality, system performance, interactive
behaviour and assistive capabilities of the interactive system, the user’s internal and physical
state resulting from prior experiences, attitudes, skills and personality, and the context of
use. Note 3 to entry: Usability, when interpreted from the perspective of the users’ personal
goals, can include the kind of perceptual and emotional aspects typically associated with
user experience. Usability criteria can be used to assess aspects of user experience.

This means that it is not possible to guarantee a certain UX, since it is the subjective
inner state of a human being. Still, by designing a high quality interaction with the
intended users and the usage context in mind, it is possible to positively impact users’
experiences. The concept of UX embraces pragmatic as well as hedonic qualities
[23]. On the one hand, pragmatic quality is related to fulfilling the do-goals of the
users, which means that the interactive product makes it possible for the users to
reach the task-related goals in an effective, efficient, and secure way. In other words,
pragmatic quality is concerned with the usability and usefulness of the product.
Hedonic quality, on the other hand, is about the be-goals of the users. Humans have
psychological and emotional needs, which should be addressed by the interactive
product. The users could, for instance, find the product cool, awesome, beautiful,
trustworthy, satisfying, or fun. The product could, for example, evoke feelings of
autonomy, competence, and relatedness to others [4–6, 23]. Thus, the UX perspective
includes not only functional and usability aspects, but also experiential and emotional
issues. It focuses on the positive, beyond the mere striving for the absence of problems,
i.e., pragmatic quality. Additionally, a main objective of the UX field should be to
contribute to the quality of life for humans [4–6]. The UX design lifecycle process
consists of four major iterative activities: analyze, design, implement,
and evaluate [4]. The purpose of the analysis phase is to understand the users’ work
and needs as well as the business domain. The design phase involves creation of
the concept, the interaction behavior, and the look and feel of the product. In the
implementation phase, the focus is on prototyping and thereby realizing different
design alternatives. In this chapter, we focus mainly on the evaluation phase, where
verification and refinement of the interaction design take place [4].

3.1 UX Evaluation

The concept of UX evaluation represents a wide range of methods, techniques,
and skills that are used to identify how users perceive an interactive system, product,
device or service, e.g., a social robot in an educational setting, before, during and
after interacting with it. However, several misleading terms are often found in the
HRI literature [24–26], e.g., “user evaluation” instead of the more correct “usability
evaluation” [27] or “UX evaluation” [4]. The wording “user evaluation” [24–26] is
misplaced because it is not the users that are evaluated; it is the users’ experience of
interacting with the robot that is the focus, which is a significant difference. Usabil-
ity evaluation is the predecessor of UX evaluation that mainly focuses on pragmatic
qualities, whereas UX evaluation usually spans both pragmatic and hedonic qual-
ities. It is not an easy task to assess or measure the perceived UX given that it is
rather subjective (depending on prior experience, preconceptions, attitudes, skills
and competence), dependent on the context, and dynamically changing over time
[4–6].
In order for a UX evaluation to be successful, the investigator (who could be a UX
practitioner, UX researcher, HRI researcher, or robot developer) needs to start by asking
some general questions about which dimensions, UX aspects, and methods (some of
which are sometimes referred to as data collection techniques) to focus
on and use, and to target the evaluation for the specific area of interest, e.g. certain
aspects of HRI. The field of UX offers a wide span of available methods, ranging
from empirical methods such as experiments and lab-based UX testing and analytic
methods to naturalistic field studies [4–6, 23, 27]. It is impossible to use all of the
available methods in every HRI project, but almost all robot projects would
benefit from using several methods to further strengthen the derived insights from UX
evaluation and not only perform contrived experiments combined with questionnaires
that mainly use Likert scales, which is the current tendency [1]. This type of evaluation
design is sometimes a result of investigators being afraid of doing a naturalistic
inquiry by collecting qualitative data or lacking competence to properly collect and
analyze this kind of data. Unfortunately and wrongly, collecting qualitative data via
naturalistic inquiry has been considered as having less scientific rigor [e.g. 28] than
performing more contrived studies that gather quantitative data. These approaches
have different aims and should not be contrasted and compared on the same criteria
[29].
The over-arching goal of the actual UX evaluation, whether a formative evaluation
(during the development process) or a summative evaluation (on the final robot),
provides initial answers as to how to proceed [4–6, 23, 27]. The major purpose of
formative evaluation is to receive feedback on design ideas in the earlier phases in
the UX design lifecycle process, and this could be performed via rough sketches of
the robot’s design, interaction flows, and physical mock-ups of the envisioned robot.
The initial feedback received on these conceptual ideas and low-fidelity prototypes
from intended users provides necessary hints about the interaction quality, supporting
selection of multiple design alternatives, and identifying UX problems and negative
UX. It is easier and less expensive to change a robot design and interaction flow in
the earlier phases of the development project than in the later phases. Summative
evaluation is used to assess the UX of a high-fidelity prototype or the final robot
as well as to gain an understanding of its usage in practice, i.e., in an ecologically
valid environment [4–6, 23, 27]. The feedback received from the users in summative
evaluation could be more precise and detailed, and several kinds of assessments and
measurements could be performed. However, changes of the robot’s interface and
the interaction design in the later phases of the development process are usually more
complex, time-consuming and expensive to realize. It is recommended to include
formative and summative evaluations during the whole design lifecycle process.
To better understand “what to do” and “when one should use which method”, the
following two-dimensional chart (Fig. 1) provides an overview [30]. The chart spans
the following axes: (1) attitudinal vs. behavioral, and (2) quantitative vs. qualitative
dimensions. Each dimension provides guidance to distinguish among different ways
of doing UX evaluation in terms of the raised questions they provide answers to and
the purposes they are most suited for.

Fig. 1 The chart illustrating how the two dimensions affect the types of questions that can be asked
in UX evaluation (modified from [30])

The vertical axis distinguishes between “what users say” and “what users do”. It
is a well-acknowledged fact that there often is a contrast between these two perspec-
tives [4–6, 23, 27, 29, 31]. The motivation for addressing the attitudinal perspective
is usually to gain a better understanding of or assess users’ stated or experienced
beliefs, expectations, and thoughts, which they are consciously aware of. Surveys
and questionnaires (like Likert-Scales), e.g., measure, assess and categorize attitudes
or collect self-reported data that could help track or discover important aspects to
address in the UX evaluation. Interviews and focus groups with more open-ended
questions than mainly used in surveys or questionnaires provide broader views and
perspectives of what users think and experience about interacting with a robot. On
the other end of this dimension is the behavioral perspective, providing methods
that focus on “what users do” when interacting with the robot. It ranges from
eye-tracking and physiological measurements to experiments, as well as several kinds
of observational studies of interactions between users and the robot. The users are
sometimes not consciously aware of what they are doing, and observations could
sometimes identify users’ tacit knowledge of how to conduct a certain task. These
observations could then be followed up by various kinds of interviews. Two com-
monly used UX methods are situated along this dimension: UX testing and natural-
istic field studies, which provide a combination of self-reported and behavioral data.
Although these methods move toward either end of this dimension, leaning toward
the behavioral side is generally highly recommended [4–6, 8, 23, 27, 29, 30].
The horizontal axis, quantitative vs. qualitative, offers another important distinc-
tion. In UX evaluation methods that are quantitative in nature, behavioral data or atti-
tudes are mainly gathered indirectly, through measurements used in experiments
or pre-designed instruments like observational protocols, questionnaires and Likert
scales, or any other kind of analytic tool [4, 27]. The collected data are then math-
ematically analyzed, usually by inferential or descriptive statistical analyses. In UX
evaluation methods that are qualitative in nature, behavioral or attitudinal data are
generated directly when the investigators observe how users interact with the robot
to meet their needs [4, 5, 8, 27, 29, 30]. By observing users directly, as usually done
in UX testing and field studies, it is possible to raise follow-up questions, to gain a
better understanding of what is going on or what motivates the user’s behaviour in a
certain situation or task with the robot. The analysis of the qualitative data is not usu-
ally done mathematically; instead, the gathered data are analyzed by the investigators
to identify different themes or patterns that emerge from the collected data [for more
details, see 4, 31, 32]. Due to the nature of the differences between them, qualitative
methods are much better suited for answering questions about “why” or “how” to
fix an identified UX problem or a negative UX in HRI, whereas quantitative meth-
ods are better suited to answering “how many and how much” kinds of questions when
users interact with a robot. Laboratory-based experiments and UX testing usually
are better suited for studying specific aspects of UX in more detail, but the narrow
perspective misses the more holistic UX [27, 33] that is optimally studied over a
longer period of time with real users interacting with a robot in a natural environ-
ment. We denote this as the granularity dimension. This dimension is aligned with
the duration and frequency dimensions, ranging from very short-term encounters with a social
robot that humans would encounter briefly and only once or a few times, to more
long-term encounters with a certain robot, usually situated in our human society.
The question of where to conduct UX evaluation, in the laboratory vs. in the field, spans
the setting dimension. Thus, based on the general purpose of the UX evaluation,
there are at least six dimensions to consider initially; (1) attitudinal vs. behavioral,
(2) quantitative vs. qualitative, (3) formative vs. summative, (4) narrow vs. holistic,
(5) short-term vs. long-term, and (6) laboratory vs. field.
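As a minimal illustration (not an established instrument; the class and field names below are our own), a planned evaluation could be characterized along these six dimensions before any method is chosen:

from dataclasses import dataclass

@dataclass
class UXEvaluationPlan:
    # Illustrative characterization of a planned UX evaluation
    # along the six dimensions discussed above.
    focus: str        # "attitudinal" or "behavioral"
    data: str         # "quantitative", "qualitative", or "mixed"
    stage: str        # "formative" or "summative"
    granularity: str  # "narrow" or "holistic"
    duration: str     # "short-term" or "long-term"
    setting: str      # "laboratory" or "field"

# Example: a summative, mixed-data field study of long-term use of a social robot.
plan = UXEvaluationPlan(
    focus="behavioral", data="mixed", stage="summative",
    granularity="holistic", duration="long-term", setting="field",
)
print(plan)

Making such a characterization explicit at the outset can help investigators see, for example, that a narrow, short-term laboratory study cannot by itself answer questions about the holistic, long-term UX in the field.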

3.2 Several Kinds of UX Evaluations

There exist several UX evaluation methods. Each method can typically be categorized
as either analytical or empirical [34]. Analytical evaluation is usually carried out
without any user involvement [35], while empirical evaluation typically involves
users to a large extent. These kinds of evaluations can be carried out during different
parts of the design lifecycle process, where quick and dirty analytical methods often
are used initially, providing insights and identifications of relevant UX problems [27,
36, 37], which then later could be evaluated more rigorously using more time- and
resource-consuming empirical evaluation [4, 27, 37].
Analytical evaluation could also be referred to as predictive evaluation, having
roots in HCI, particularly in so-called inspection methods [37]. The major advan-
tages of inspection methods are that they are fast and easy to perform and require
relatively small resources. Another advantage is that they do not require a working
prototype, thus suitable early in the developmental process. Two influential examples
are heuristic evaluation and cognitive walkthrough. In heuristic evaluation, which is
a very popular and widespread inspection method in HCI and UX, the investigator
analyzes the interface design based on predefined guidelines that address general
issues to consider when designing interactive systems, e.g., “visibility of system
status”, “user control and freedom” and “error prevention” [38]. This makes it
possible to identify problems or negative UX that are likely to appear in the inter-
action between the user and the system. The outcome of the analysis is influenced
by the skills and competence of the investigator. These heuristics have been adapted
for robot interfaces by Clarkson and Arkin [39], and further by Weiss et al. [40],
to become even more feasible for HRI. In cognitive walkthrough, the investigator
walks through the interface, step by step in a specific task, and for each step answers
some method-specific questions in order to identify potential problems for the users,
focusing on the first-time encounter [36, 41, 42]. There is currently no cognitive walk-
through specifically directed towards HRI. By using inspection methods, it is possible
to identify potential problem areas, which could guide decisions on focus areas for
further investigations or design efforts within an ongoing project. The outcome of
predictive evaluation is often a list of problems, but there also exist methods that
intend to predictively measure, e.g., time to execute a task [4, 27, 43]. Another way
of evaluating HRI predictively is to use theories or models of human cognition or
activity [4, 27, 43]. An example regarding emotional, tactile interaction would be to
use research on how humans convey emotions by touch to evaluate if the sensors on
the robot are located on the parts that users naturally touch and if the sensors could
register and interpret the users’ touch behaviour [22, 44–46].
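To make the outcome of such an inspection more concrete, the following minimal sketch (our own illustration, not prescribed by the cited methods) shows how findings from a heuristic evaluation of a robot interface might be recorded as a prioritized problem list; the heuristic names echo those in [38], while the severity scale and example problems are hypothetical:

from dataclasses import dataclass

@dataclass
class HeuristicFinding:
    heuristic: str   # e.g., "visibility of system status"
    location: str    # where in the interaction the problem appears
    severity: int    # e.g., 0 (cosmetic) to 4 (catastrophic)
    note: str

findings = [
    HeuristicFinding("visibility of system status",
                     "robot gives no cue that a spoken command was heard",
                     3, "users repeat commands and become frustrated"),
    HeuristicFinding("error prevention",
                     "gesture menu allows contradictory selections",
                     2, "may lead to erroneous handling"),
]

# Sort the problem list by severity so the most critical issues are addressed first.
for f in sorted(findings, key=lambda f: f.severity, reverse=True):
    print(f.severity, f.heuristic, "-", f.note)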
Empirical evaluation comprises methods that include UX testing, infor-
mal evaluation, and field studies [47, 48]. All empirical methods involve users to a
larger degree than analytical ones. UX testing is a well-established way of evaluating
specific aspects of the interaction between human and robot [4, 27], e.g., evaluating
the efficiency of doing a certain task, assessing the experience of trustworthiness after
a first short-term interaction, or identifying specific interaction problems. This kind
of UX testing is often conducted in a laboratory setting, in which some aspects and
context can be controlled. A regular setup is that the participating users interact with
the robot based on specific scenarios, within which certain tasks are performed. Pre-,
during-, and post-test inquiries could be made in terms of questionnaires and/or inter-
views. Often the primary collected data is quantitative in order to be able to compare
the results to predefined UX goals, to benchmark against other robot platforms, or to
determine the effects of robot design improvements. However, qualitative data could also
be collected with the purpose of facilitating the interpretation and comprehension of
the quantitative results [4, 27].
It should be pointed out that many researchers confuse UX testing with user
research in the form of contrived behavioral experiments derived from experimental psy-
chology. It is argued that both methods are empirical by nature; focusing on actual
behavioral observations conducted in laboratory facilities, sampling of participants,
collecting and analyzing quantitative and qualitative data, but they significantly differ
in other respects [4, 27]. A major difference between them is in their goals, where UX
testing seeks to identify problems in the interaction and obtain insights on how
to solve or deal with them, rather than to demonstrate or investigate a certain phenomenon
and/or aspect [4, 27]. The selection of participants differs, where valid UX testing
selects participants from the population that is intended to use the system, i.e., purposeful
sampling, and does not strive for the random sampling of subjects that is preferred
in experiments. Using UX testing, most major usability problems can be covered
with rather few participants (approx. 5–7 persons) [10, 15] while larger samples are
normally needed to render significant results from behavioral experiments. Results
from UX testing are typically analyzed using descriptive statistics, compared to
inferential statistics commonly used when analyzing data collected in experiments.
Furthermore, the design of UX testing differs, where in experiments much skill and
effort go into isolating dependent and independent variables and controlling for con-
founding variables. In UX testing, the intention is to exert some degree of control over
confounding variables (e.g. prior experience with the task or similar systems) but without
striving for the level of control that is necessary for obtaining scientific rigor in
experiments [4, 27]. It is seldom possible to isolate specific variables in UX testing,
and the independent variable of study is commonly the interface of the system, and
efforts to isolate specific variables in the interface do not provide the information
needed to investigate how the interaction and its quality unfolds for the users. This
means that it is seldom easy to identify what causes the uncovered usability prob-
lems or negative UX, and these issues could originate from several causes, e.g. poor
physical design of the system, poor instructions, participants’ prior experience with
similar systems, or a mixture of these causes [4, 27]. Therefore, in UX testing
several kinds of collected data are triangulated, where the observations of and
comments from participants provide valuable insights of what causes the problems
in performance of the tasks carried out in the UX testing. The structured analysis of
quantitative and qualitative data combined with UX expertise is necessary to diagnose
valid and credible causes of the identified UX problems and negative UX in order
to make improvements and redesign tentative solutions for the identified problems
[4, 27]. While this chapter focuses on UX evaluation and argues for a more diverse
use of different evaluation methods in HRI research, we must also emphasize that user
research as contrived experiments, although sometimes confused and overused, still
is (of course) very important in cases where the primary focus of the study lies on
user behavior. One example is the work done by Bisio et al. [49], using a classical
contrived experiment to investigate to what degree motor resonance between humans
also applies in an HRI setting. Another example is our own work on tactile interaction
in HRI [44, 45], where we used a contrived experiment to study where and how
people touch a humanoid robot. During the experiment session, we also collected
data about the UX of interacting with the social robot [46]. The obtained results
from the study of where and how people touch a humanoid robot could then be used as
a foundation and inspiration for robot interface design and interaction patterns, which
later on could be evaluated from a UX perspective where the interaction quality as
such is the major focus of study.
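The rule of thumb that a handful of participants uncovers most major problems is often motivated by a simple problem-discovery model, in which each participant encounters a given problem with some probability p; the calculation below is only illustrative and assumes p = 0.31, a value frequently cited in the usability literature, rather than a figure taken from [10, 15]:

# Illustrative problem-discovery model: proportion of problems found
# by n participants when each problem is encountered with probability p.
def proportion_found(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

for n in (1, 3, 5, 7, 15):
    # With p = 0.31, five participants already uncover roughly 84% of the
    # problems, and seven participants about 93%.
    print(n, round(proportion_found(0.31, n), 2))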
Informal evaluations are sometimes called “quick and dirty” evaluations [47],
providing a hint of the character of this type of evaluation. It should not be used
when there is a need for rigorous UX evaluation, but it is suitable under circumstances
where the time and resources are not available to perform a thorough investigation [4,
8, 27]. It is better to receive quick feedback from users on for example concepts and
ideas than not involving any users at all in the design and lifecycle process. Depending
on the current status of the project, e.g., sketching alternative appearances of a new
robot or planning to introduce an existing robot in a new context, it is possible
to receive hints of potential possibilities and problems that can guide decisions on
further efforts. In informal evaluations, simpler presentation material can be used
such as drawings, storyboards, mock-ups, and scenario descriptions to be discussed
with users, individually or in groups. The collected data is qualitative and descriptive
[4, 8, 27].
Naturalistic field studies are, unlike UX testing and contrived experiments, con-
ducted in a natural environment in which the evaluators have fewer possibilities to con-
trol the context, i.e. conducting “research in the wild” [4, 8, 29, 31, 32, 43, 50–54].
It is not always possible to transfer results from a controlled and delimited situation
to a natural, and thereby messier, one [53, 55, 56]. Thus, to be able to determine the
actual UX, it is necessary to conduct naturalistic field studies. The primary
data is often qualitative, collected via, e.g., interviews, diary notes, and observations,
since the possibilities to measure with accuracy are limited due to the uncontrolled
environment [4, 8, 29, 31, 32]. Nonetheless, some quantitative data could be gath-
ered, e.g., via questionnaires, counting positive and negative expressions, the number of
errors that are made, or time on tasks, which can add perspectives and nuances to
what users are saying and doing [36, 57, 58]. One way of doing a field study is to
study HRI in the wild [11, 12], for example, how robots could support elderly per-
sons in their homes, providing insights into users’ experiences of robots and into whether
and to what extent a social robot becomes embedded and used in the everyday life of
humans [51]. In the long run, all robots are supposed to be working in natural settings,
being situated in either a highly specified or generic context, which means that in
the end it is in this kind of environment that the UX of the human–robot interaction
should be positive or great.

3.3 Deciding on and Defining UX Goals

For a practitioner, i.e., a robotic product developer, it is not possible to evaluate
“everything”, so to speak, due to time and resource limitations [4, 8, 27]. Instead, it
has to be decided which particular experiences are most vital to awaken, and the
robot has to be carefully designed and evaluated with those experiences in mind. Asking yourself questions
could be a viable approach. For example, is it more important for this particular robot
to evoke feelings of curiosity and fascination than a sense of competence in the user?
Is it more central for this robot to make the user feel related to others and find the
robot elegant than experiencing it smooth and transparent? Hence, the nuances of
UX need to be recognized and carefully considered. This means that investigators
should be able to determine which user UX-affecting aspects are appropriate for a
certain purpose. For example, which UX-affecting aspects that should be avoided
and which ones have to be included for enhancing the users’ motivation of walk up
to and interact with the robot? If the UX goals have not been considered initially, and
consequently, the kind of UX the evaluation is intended to shape and which feelings
it is supposed to elicit actually to awaken have not been specified or reported, then the
actual influence of the UX-effecting aspect remains unclear. Hence, empirical HRI
research including UX studies and UX evaluation, should go beyond basic valence
feelings, i.e., beyond stating that the UX is more or less positive or negative. In
addition, in order to determine whether the actual experiences are positive or strong
enough there should be some specified acceptance levels set in advance. Hence,
there is a need to define prioritized system requirements, UX goals and validation
criteria [4, 27, 59]. However, establishing requirements and UX goals, determine their
priority, as well as defining validation criteria are not easy tasks to perform properly
[for further details, see 59–61]. Weiss et al. [26], for example, have selected the following factors in their framework for evaluating HRI of social robots: usability, social acceptance, user experience, and societal impact, each of which has several indicators. The usability factor, for instance, consists of the indicators effectiveness, efficiency, learnability, flexibility, robustness, and utility. These indicators then have individual definitions or descriptions that characterize the actual indicator in more detail. De Graaf and Ben Allouch [62], for example, have used both pragmatic and hedonic factors in their study of user acceptance of social robots. As illustrative examples, the indicator usefulness is described as “the user’s belief that using the robot would enhance their daily activity”, the indicator enjoyment is described as “feelings of joy or pleasure associated by the user with the use of the robot”, and the indicator companionship is described as “the user’s perceived possibility to build a relationship with the robot”. However, neither Weiss et al. [26] nor de Graaf and Ben Allouch [62] have defined these factors as UX goals, as described in the usability and UX literature [4, 27, 36].
Hence, an important activity for the whole UX design lifecycle process is to extract, identify, and characterize UX goals [4, 34, 63]. UX goals are high-level objectives, which should be driven by the representative use of an envisioned or current robot. The UX goals should identify what is important to the users, stated in terms of the anticipated UX of the interaction between human(s) and robot(s). UX goals are important because they help the robot developers focus on the intended experience of interacting with the envisioned or final robot. For that reason, they are referred to as “goals” instead of “requirements”, since these UX goals cannot be guaranteed to be fulfilled for all intended end-users [4]. Specified UX goals support and benefit the evaluation process by pointing out exactly what should be investigated in order to enhance certain aspects of the UX of the robot. UX goals offer support throughout the design lifecycle process by defining quantitative and qualitative metrics, which provide the basis for knowing when the required quality of interaction has been achieved. The characterized and defined UX goals can also provide appropriate benchmarks during formative evaluations, which in turn can help to point out exactly which adjustments will result in the most useful outcome. By evaluating the UX goals continuously, the developers may also recognize when it is time to stop iterating the robot design and when the design is considered successful enough [4]. The UX goals also scaffold and support staying attuned to the UX focus throughout an interdisciplinary robot development process.
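To make this more concrete, the following is a minimal sketch (in Python) of how such UX goals could be recorded together with quantitative benchmark levels. The goal names, metrics, and threshold values are hypothetical illustrations, not goals prescribed by the frameworks discussed here.

from dataclasses import dataclass

@dataclass
class UXGoal:
    """One UX goal with a quantitative benchmark (hypothetical values)."""
    name: str               # short label for the goal
    metric: str             # what is measured, e.g. a questionnaire rating
    accepted_level: float   # minimum level regarded as acceptable
    desired_level: float    # level regarded as a successful design

# Example goals for an imagined companion robot for elderly users.
ux_goals = [
    UXGoal("confidence", "self-rated confidence (1-10)", accepted_level=6, desired_level=8),
    UXGoal("willingness", "self-rated willingness to continue (1-10)", accepted_level=6, desired_level=8),
]

for goal in ux_goals:
    print(f"{goal.name}: accept >= {goal.accepted_level}, aim for >= {goal.desired_level} ({goal.metric})")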
Furthermore, as a UX-focused robot developer it is not enough to address the UX goals, user needs, and usage context. For commercial development of socially interactive robots there is also a business case to take into account when establishing and prioritizing UX requirements and UX goals [4, 8, 11, 12]. In accordance with previous claims, robot developers would likely benefit from research-based guidance on how to fruitfully consider the business case of socially interactive robots when making decisions about, e.g., business goals, requirements, and design elements. However, to the best of our knowledge, there is a lack of HRI research focusing on the intersection of these aspects and UX. Thus, more research is needed in order to provide research-based guidance to robot developers concerning these matters. In order to make this process more concrete, we provide a tentative example of a typical UX evaluation process.

3.4 A Tentative UX Evaluation Process

A typical UX evaluation process consists of four general phases: planning, conducting, analyzing the data, and considering the obtained findings. The process can also be iterative, where a single evaluation round is part of a larger UX evaluation cycle.

3.4.1 Planning and Conducting

The first aspect to be decided is which overall goals should be addressed in the evaluation. When developing robots for real-world use in a commercial setting, there are probably explicit or implicit goals for the robot as a product, for instance regarding the market (e.g., the elderly care sector) and the general purpose of this kind of robot, e.g., being a social companion, which define the frame and scope within which the UX evaluation should be conducted. The product and business goals guide the formulation of more specific UX goals [4, 8] (or, in the best of cases, the goals can be retrieved from the requirements specification, so that the UX perspective is integrated in the development process as a whole). An example of such a goal: an elderly person with no previous experience of social robots should be able to successfully communicate with the robot on their own after a short guided introduction, and should experience a high degree of confidence and willingness to continue communicating with the robot after the first encounter. This way, the UX goals create the focal point of the evaluation and direct the following planning activities [4–8, 27].
Based on the UX goals, a relevant type of evaluation is chosen; following the example case above, UX testing in a laboratory setting would be adequate. Next, the user profile of the participants needs to be decided, in terms of both common and varying characteristics [4, 8, 27]. The characteristics should be defined and quantified. In this example case, the joint attribute is users above 75 years old with no experience of social robots. The differing aspect can be a positive or negative attitude towards social robots in general. Two subgroups are then established, which in this case makes it possible to compare two groups with different prior attitudes and still be able to draw conclusions in relation to the stated overall UX goals. From a practitioner perspective, the important outcome of the evaluation is information on whether the current robot appearance and/or way of interacting fulfills the selected UX goals to a sufficient extent. If the selected UX goals are not reached, one can receive input on what the identified UX problems are and how to solve them. The outcome is not supposed to support generalizable claims about social robot design. In the commercial robot development context, the number of participants does not need to be large; even as few as five persons in each subgroup can be enough to make relevant design changes [4, 27, 64, 65], particularly if the changes are incremental rather than extensive and fundamental. For more widespread and expensive changes, more participants or more evaluation cycles could be a good investment [4, 8, 27].
For UX testing, the tasks that the participants are going to carry out during the test session need to be carefully prepared so that they are in line with the UX goals and are relevant for the intended users [4, 8, 23, 27, 36, 66]. In the case example, tasks for both a short initial training and the actual session need to be prepared. To make it more natural for the participants, the tasks can be presented in the form of scenarios, which should consist of short and unambiguous stories that provide the participants with goals to achieve and the information necessary to fulfil them [4, 8, 27]. The scenarios can be delivered to the participants in different ways, such as in writing, verbally by the investigator, in a role play, or by the robot itself. In the case example, where the robot is intended to be a social companion, it could be suitable for the robot to provide the scenario to the user. If the robot is not at a development stage where real interaction is possible, the Wizard of Oz technique could be an alternative for running the scenario [26, 36, 67, 68]. This technique means that a human, puppet-like, plays the robot’s part of the interaction.
It should be possible to measure and assess the UX with objective as well as subjective measurements [4, 27, 34, 69]. There is a wide range of aspects that can be quantified. Objective measures could be time on task, degree of completion, and correctness [69]. Subjective measures are, e.g., gradings of experiences (e.g., perceived safety, trust, smoothness) or preferences between different alternatives [69]. There should also be reference values for the measures and assessments, such as levels of acceptance and desirable levels [4, or Wallström and Lindblom, Chapter 3.1 this volume, for more details about how to set these levels in practice]. In the case example, relevant objective measures could be completion time or number of errors, with reference points for the latter being that one error per user and task is accepted and no errors is desirable. A subjective measure could be a grading of confidence in communicating with the robot on a scale from 1 to 10, where 1 is low and 10 is high, with an accepted average level of 6 and a desirable level of 8. In addition to the quantitative parts, it is advantageous to collect qualitative data in order to better understand why the interaction unfolds as it does [4, 5, 8, 27, 29–32], why problems occur, and why the measurements end up as they do. In the case example, video recordings of the human–robot interactions from multiple angles and a post-test interview could provide valuable input.
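The sketch below illustrates, for the same hypothetical case, how the collected measures could be compared against the stated acceptance and desirable levels; the participant data are invented for illustration only.

# A minimal sketch of comparing collected measures against the reference
# levels from the case example. The participant data below are invented
# for illustration only.
confidence_ratings = [7, 5, 8, 6, 9]   # 1-10 scale, one rating per participant
errors_per_task = [0, 2, 1, 0, 1]      # errors per participant on one task

def mean(values):
    return sum(values) / len(values)

avg_confidence = mean(confidence_ratings)
avg_errors = mean(errors_per_task)

# Reference levels stated in the text: confidence accepted at 6, desired at 8;
# one error per user and task accepted, zero errors desired.
print(f"Mean confidence: {avg_confidence:.1f} "
      f"(accepted >= 6: {avg_confidence >= 6}, desired >= 8: {avg_confidence >= 8})")
print(f"Mean errors/task: {avg_errors:.1f} "
      f"(accepted <= 1: {avg_errors <= 1}, desired == 0: {avg_errors == 0})")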

3.4.2 Analyzing the Data and Considering the Findings

When the test sessions have been conducted, the collected data are put together and analyzed, focusing on identifying UX problems that need to be solved [4, 27]. As described in Sect. 3, UX comprises hedonic and pragmatic qualities, the latter better known as usability [23, 70]; and “a usability problem is an aspect of the system and/or a demand on the user which makes it unpleasant, inefficient, onerous or impossible for the user to achieve their goals in typical usage situations” [71, p. 254]. To verify whether a specific UX problem is a real one, and whether there is a genuine need to change aspects of the robot being evaluated, triangulation is a good way to make the findings more reliable [4, 7, 31].

Fig. 2 Triangulation of
usability and UX problems
(adapted from [27])

Triangulation means that multiple data sources are used to compare and deepen the understanding of the obtained findings (see Fig. 2) [4, 7, 31]. Several findings pointing in the same direction imply that there is a severe UX problem that needs to be taken care of. For example, an evaluation could consist of a problem list generated from a predictive evaluation, such as a heuristic evaluation, together with quantitative data, e.g., time to perform a task, and qualitative data, e.g., interview answers regarding perceived efficiency, from UX testing. If there is an identified potential UX problem that can cause inefficiency, the measured mean time to execute a task does not meet the set acceptance level, and the users describe the interaction as frustrating, then there is strong evidence for a real difficulty that can be assumed to cause real-world struggles and negative UX. If one of the identified findings points in another direction, e.g., the time to execute the task reached the set desirable level, then it is advisable to investigate the tentative problem further before taking design action to address it.
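As a rough illustration of this triangulation logic, the following sketch counts how many independent data sources support a suspected problem; the evidence values and decision thresholds are invented assumptions, not a prescribed procedure.

# A minimal sketch of the triangulation logic described above: a suspected
# UX problem is treated as confirmed only when several independent data
# sources point in the same direction. All evidence values are invented.
evidence = {
    "predicted by heuristic evaluation": True,          # problem list from inspection
    "task time misses acceptance level": True,          # quantitative UX-testing measure
    "users describe interaction as frustrating": True,  # qualitative interview finding
}

agreeing = sum(evidence.values())
if agreeing == len(evidence):
    verdict = "strong evidence for a real UX problem - plan a design change"
elif agreeing >= 2:
    verdict = "partial agreement - investigate further before re-design"
else:
    verdict = "weak evidence - probably not a real problem"

for source, supports in evidence.items():
    print(f"{source}: {supports}")
print(verdict)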
The next step in the analysis is to organize the identified UX problems by scope and severity [4, 27, 71, 72]. The scope of a UX problem can be global or local, where the former applies when the problem spans the robot as a whole, and the latter when it only appears in a specific situation. For example, if the only way to communicate with the robot is by gesturing and this interaction mode does not work as intended, then there is a global problem, because it will have a negative effect on all interactions between user and robot. If the users cannot understand a particular signal from the robot, but the interaction as a whole runs smoothly, then the problem is local. Global UX problems are often more severe than local ones, and a global problem could also affect a whole family of robot products, e.g., if the problematic feature is integrated in several robots from a certain company. However, it is not always the case that global problems are more critical than local ones. For instance, if a user cannot interpret a specific signal that occurs seldom but indicates a high safety risk, then the local problem is of high gravity.
The identified UX problems should also be analyzed in terms of severity, which provides an important basis for deciding which re-design actions should be prioritized, together with aspects such as how resource-demanding the actions are and the availability of resources [4, 27]. The degree of severity ranges from high to low, where the highest includes, e.g., safety risks and problems that obstruct the completion of central tasks. Low-severity problems are, e.g., those with a minor negative effect on the interaction, those where it is easy for the user to find an effortless workaround, or problems of a more cosmetic nature [27, 38, 73]. Other aspects that should be considered when determining the degree of severity are, e.g., the number of users that is expected to be affected by the problem, i.e., users with certain characteristics, or the frequency with which the problem occurs.
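The following sketch illustrates one possible way of recording and ordering identified UX problems by scope, severity, and safety risk; the problem descriptions, the three-level severity scale, and the prioritization rule are hypothetical choices made for illustration.

# A minimal sketch of organizing identified UX problems by scope and severity,
# as described above. Problem descriptions and ratings are invented examples.
problems = [
    {"description": "gesture input often misrecognised", "scope": "global",
     "severity": 3, "safety_risk": False},
    {"description": "warning signal not understood",     "scope": "local",
     "severity": 2, "safety_risk": True},
    {"description": "greeting phrase sounds abrupt",     "scope": "local",
     "severity": 1, "safety_risk": False},
]

def priority(problem):
    # Safety-related problems are prioritized regardless of scope; otherwise
    # higher severity and global scope come first.
    return (problem["safety_risk"], problem["severity"], problem["scope"] == "global")

for p in sorted(problems, key=priority, reverse=True):
    print(f"{p['description']}: scope={p['scope']}, severity={p['severity']}, "
          f"safety risk={p['safety_risk']}")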
Finally, when the identified UX problems in this example have been handled properly, and some of the changes have been re-evaluated via an additional follow-up session and then implemented in the social robot, it is time to move on. The next step could be to perform a field study in an elderly care setting. The field study should be carried out with the intended user groups, and the insights from that study should be valuable before launching the robot on the market.

3.5 Some Tentative Data Collection Techniques

There are several data collection techniques (sometimes referred to as methods) that
could be used for human-centered empirical UX evaluation [4, 27, 31]. See also the
dimensions of attitudinal vs. behavioral and quantitative vs. qualitative, described in
Sect. 3.
The basic observational techniques are direct observation and recorded observation [4, 27, 31]. In direct observation, the investigator is present during the session where the user and robot interact, either in the same room or, as is possible in certain laboratory settings, behind a one-way mirror. The investigator directly monitors what happens and documents the aspects that are the focal point of the evaluation, e.g., by making field notes, marking on a prepared observation protocol, or using dictation. In recorded observation, the investigator is not active during the session; instead, video recording (with one or many cameras) captures the activities that take place in the session(s). Both techniques have their pros and cons, and suit different kinds of evaluation goals and circumstances. Some examples: if it is clear in advance exactly what kinds of behaviour and interaction aspects are in focus, it is efficient to use direct observation, since the amount of data to analyze will be more manageable. Recorded observation is time-consuming, because it is not only the session in itself that takes time; all collected material also has to be gone through again afterwards. If there is an expressed need to measure certain aspects during the evaluation, e.g., time on task or the frequency of user actions while carrying out a certain task, it is necessary to conduct recorded observation in order to attain precision. In so doing, it is possible to mark exactly when a task begins and ends in the recordings, e.g., using a video annotation tool such as ELAN [74]. Direct observation is not suitable for that kind of detailed measurement, although rougher measures can be used, such as counting how many times the user does a certain action. A combination of both observation techniques is also a viable approach.
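As an illustration of how such precise measures can be derived, the sketch below sums time on task from annotated intervals, assuming the annotations have already been exported from the annotation tool as simple (task, start, end) records; the labels and timestamps are invented.

# A minimal sketch of deriving time-on-task measures from video annotations.
# It assumes the annotations have already been exported as
# (task label, start second, end second) tuples; the values below
# are invented for illustration.
annotations = [
    ("greet robot",      12.4,  31.0),
    ("instruct robot",   35.2,  98.7),
    ("instruct robot",  140.0, 171.5),   # task was resumed after an interruption
    ("end conversation", 180.3, 196.8),
]

time_on_task = {}
for task, start, end in annotations:
    time_on_task[task] = time_on_task.get(task, 0.0) + (end - start)

for task, seconds in time_on_task.items():
    print(f"{task}: {seconds:.1f} s")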
Additional techniques that could be used together with the basic observational
techniques are, e.g., think-aloud and cooperative evaluation [4, 27, 34]. Think-aloud
means that the user is requested to verbalize all thoughts while conducting the activities in the evaluation session. By doing so, it is possible to gain some insight into the participants’ thought processes and thereby a better understanding of why they behave as they actually do, e.g., identifying the reasons behind misunderstandings of the robot’s interaction. However, saying out loud what you are simultaneously thinking is not easy, in particular when the task becomes complicated, and, therefore, some initial training in thinking out loud before the actual session starts could be helpful. In cooperative evaluation [34], which includes inquiring aspects, the investigator and participant are both active during the session: the user carries out the tasks while discussing with the investigator, as a cooperative team, what unfolds in the interaction. The investigator is allowed to ask questions during the participant’s execution of the task, such as: “why did you choose to do X to instruct the robot?” or “how did you interpret that response from the robot?”. Hence, the cooperative evaluation technique is not applicable when quantitative measures are central issues in the evaluation, but rather when it is valuable to gain a better understanding of the reasons why, e.g., a certain interaction problem occurs. There are also more indirect observational possibilities, for example, in cases where the robot can log interactive activities that can be analyzed afterwards. Another example is asking users to make diary notes [4, 31], which does not require the presence of investigators. This is usually done in more prolonged interactions with robots in various kinds of field studies [11, 51].
The basic inquiring techniques are interviews, questionnaires, and focus groups [4, 27, 31]. Interviews offer the possibility of obtaining more in-depth information about, e.g., expectations, ways of thinking, and understandings. Questionnaires are advantageous for collecting more quantifiable information, like ratings of different dimensions of UX. For example, several questionnaires have been developed within HRI, such as GODSPEED [75] and NARS [76], although these do not have an explicit UX perspective. Instead, it could be suitable to develop a specific UX questionnaire that better suits the robot, tasks, and UX goals. Both interviews and questionnaires can be used before an evaluation session (e.g., asking about anticipations), during it, as well as after it (e.g., asking about the fulfilment of anticipations), depending on what aspects are of interest. It is of major importance to formulate relevant and (sometimes open) interview questions, which may be more difficult than imagined [4, 31]. In focus groups, the investigator meets 5–8 persons per occasion, and one or more themes are discussed in evaluative terms [4, 27, 31]. The participants have often not had individual sessions in which they have interacted with a robot, which means that the technique is not suitable for evaluation goals that presuppose an actual encounter between users and robots. Instead, focus groups can be used for evaluating UX aspects at the conceptual and idea level, in which case the investigators present pictures, animations, contextual scenarios, et cetera, of the intended interaction to the participants. The participants can then provide feedback on different aspects of the envisioned interaction to the evaluators. An example could be to evaluate the future use of a certain social robot for elderly care in a domestic setting [e.g., 51], in which participants recruited from the intended user group, caregivers, and relatives can be provided with pictures of different robots, descriptions of a variety of robot characteristics, or possible functions in the setting. They could then, facilitated by the investigator, jointly discuss preferences, worries, hopes, and so on, which could provide valuable input for the continuing design and development of the robot.
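As a small illustration of how questionnaire data can be summarised, the sketch below averages ratings per dimension, using the five GODSPEED dimensions named in [75] only as labels; the item ratings are invented 5-point answers from three participants and do not reproduce the actual questionnaire items.

# A minimal sketch of summarising questionnaire data per dimension, using the
# five GODSPEED dimensions named in [75] as labels. The item ratings below are
# invented 5-point answers from three participants, not actual GODSPEED items.
responses = {
    "anthropomorphism":       [[3, 2, 4], [2, 2, 3], [4, 3, 3]],
    "animacy":                [[4, 4, 3], [3, 3, 4], [5, 4, 4]],
    "likeability":            [[5, 4, 5], [4, 4, 4], [5, 5, 4]],
    "perceived intelligence": [[3, 3, 4], [4, 3, 3], [4, 4, 5]],
    "perceived safety":       [[4, 5, 4], [5, 4, 5], [4, 4, 4]],
}

for dimension, participants in responses.items():
    # Average the items per participant, then average across participants.
    participant_means = [sum(items) / len(items) for items in participants]
    dimension_mean = sum(participant_means) / len(participant_means)
    print(f"{dimension}: mean rating {dimension_mean:.2f}")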

4 Concluding Remarks

The primary aim of this chapter was to disentangle several methodological questions related to performing proper UX evaluation of human–robot interaction. Hopefully, we have provided some tentative answers to the questions raised earlier in the Introduction, motivating why UX evaluation is beneficial in the interdisciplinary field of HRI. Furthermore, we aimed at clarifying the what, how, when, where, and who aspects of UX evaluation.
As advocates of a truly interdisciplinary perspective on HRI, we believe that this will require robot researchers and robot developers to adopt a wider set of concepts, theories, and methods in their own work, which implies the need to read a broader spectrum of UX literature and to correctly apply UX evaluation in practice. However, we are aware of the tentative challenges in combining different research areas and perspectives. Consequently, there is a risk of misinterpretation of underlying epistemological, theoretical, and methodological foundations that we were not able to cover in detail in this chapter. This could result in incorrect assumptions, misconceptions about obtained results, and thus misleading conclusions. Another tentative risk lies in the significantly different definitions and meanings of the same terms used in the several areas mentioned in this chapter. Nevertheless, we hope that this chapter will contribute to an increased use of UX evaluation in HRI, weighing its pros and cons. We are optimistic that the HRI community is, and will continue to be, open-minded about adopting reliable, valid, and practically applicable evaluation methodologies that are appropriate for the particular research challenges that the field of HRI will encounter. The experiences that humans perceive when interacting with robotic products have the power to enable, or hinder, the robots’ acceptance in society. For a commercial robot product, it is the UX achieved in the natural context, when fulfilling its intended purpose, that actually matters. As a result, the HRI field will benefit, and the future users of social robots will benefit even more.

Acknowledgements Lindblom and Alenljung especially wish to thank all students that have par-
ticipated in our usability and UX evaluation courses during the years, and all authors also wish to
thank the participants in their conducted HRI studies. This work was supported by the Knowledge
Foundation, Stockholm, under SIDUS grant agreement no. 20140220 (AIR, Action and intention
recognition in human interaction with autonomous systems).

References

1. Dautenhahn, K.: Some brief thoughts on the past and future of human–robot interaction. ACM
Trans. Hum. Robot. Interact. 7(1, Article 4), 3 (2018)
2. Dautenhahn, K.: Socially intelligent robots: dimensions of human–robot interaction. Phil.
Trans. R. Soc. B 362(1480), 679–704 (2007)
3. Dautenhahn, K.: Methodology & themes of human–robot interaction: a growing research field.
Int. J. Adv. Robot. Syst. 4(1), 103–108 (2007)
4. Hartson, H.R., Pyla, P.S.: The UX book: Agile UX design for quality user experience. Morgan
Kaufmann, Amsterdam (2018)
5. Hassenzahl, M.: User experience and experience design. In: Soegaard, M., Dam, R.F. (eds.) The
Encyclopedia of Human–Computer Interaction, 2nd edn. The Interaction Design Foundation,
Aarhus, Denmark (2013). Accessed from: http://www.interaction-design.org/encyclopedia/
user_experience_and_experience_design.html
6. Hassenzahl, M., Tractinsky, N.: User experience—a research agenda. Behav. Inf. Technol.
25(2), 91–97 (2006)
7. Weiss, A., Bernhaupt, R., Yoshida, E.: Addressing user experience and societal impact in a
user study with a humanoid robot. In: Proceedings of the Symposium on New Frontiers in
Human–Robot Interaction, AISB2009, pp. 150–157 (2009)
8. Anderson, J., McRee, J., Wilson, R., The Effective UI Team: Effective UI. O’Reilly, Sebastopol,
CA (2010)
9. ISO DIS 9241–210: Ergonomics of human system interaction—part 210: human-centred design
for interactive systems. International Organization for Standardization, Switzerland (2019).
Accessed from: https://www.iso.org/obp/ui/#iso:std:iso:9241:-210:ed-2:v1:en
10. Gould, J.D., Lewis, C.: Designing for usability: key principles and what designers think.
Commun. ACM 28(3), 300–311 (1985)
11. Powers, A.: What robotics can learn from HCI. Interactions 15(2), 67–69 (2008)
12. About YuMi at ABB. Accessed from: http://www.abb.se/cawp/seitp202/
f1347b3f51420722c1257ec2003dd739.aspx?_ga=2.214128350.817155711.1528981398-
1202336802.1528981398
13. Alenljung, B., Lindblom, J.: User experience of socially interactive robots: its role and rele-
vance. In: Vallverdú, J. (ed.) Synthesizing Human Emotion in Intelligent Systems and Robotics,
pp. 352–364. IGI Global, Hershey, Pennsylvania, USA (2015)
14. Alenljung, B., Lindblom, J., Andreasson, R., Ziemke, T.: User experience in social human–
robot interaction. Int. J. Ambient Comput. Intell. 8(1), 13–32 (2017)
15. Lindblom, J., Andreasson, R.: Current challenges for UX evaluation of human–robot inter-
action. In: Schlick, C., Trzcieliński, S. (eds.) Advances in Ergonomics of Manufacturing:
Managing the Enterprise of the Future. Advances in Intelligent Systems and Computing, vol.
490, pp. 267–278. Springer International Publishing, Switzerland (2016)
16. Bartneck, C., Kulić, D., Croft, E., Zoghbi, S.: Measurement instruments for the anthropomor-
phism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc.
Rob. 1(1), 71–81 (2009)
17. Boden, M., Bryson, J., Caldwell, D., Dautenhahn, K., Edwards, L., Kember, S., Newman, P.,
Parry, V., Pegman, G., Rodden, T., Sorrell, T., Wallis, M., Whitby, B., Winfield, A.: Principles
of robotics: regulating robots in the real world. Connection Sci. 29(2), 124–129 (2017)
18. Goodrich, M.A., Schultz, A.C.: Human–robot interaction: a survey. Found. Trends Hum.
Comput. Interact. 1(3), 203–275 (2007)
19. Dautenhahn, K.: Human–Robot Interaction. In: Soegaard, M., Dam, R.F. (eds.) The Encyclopedia of Human–Computer Interaction, 2nd edn. The Interaction Design Foundation, Aarhus, Denmark (2013). Accessed from: http://www.interaction-design.org/encyclopedia/human-robot_interaction.html
20. Thrun, S.: Toward a framework for human–robot interaction. Hum. Comput. Interact. 19(1),
9–24 (2004)
21. Yanco, H.A., Drury, J.: Classifying human–robot interaction: an updated taxonomy. In: IEEE
International Conference on Systems, Man and Cybernetics 2004, vol. 3, pp. 2841–2846 (2004)
22. Alenljung, B., Andreasson, R., Billing, E.A., Lindblom, J., Lowe, R.: User experience of
conveying emotions by touch. In: Proceedings of the 26th IEEE International Symposium on
Robot and Human Interactive Communication (RO-MAN), pp. 1240–1247, Lisbon, Portugal
(2017)
23. Hassenzahl, M., Roto, V.: Being and doing: a perspective on user experience and its
measurement. Interfaces 72, 10–12 (2007)
24. Keizer, S., Kastoris, P., Foster, M.E., Deshmukh, A.A., Lemon, O.: Evaluating a social multi-
user interaction model using a Nao robot. In: RO-MAN: The 23rd IEEE International Sym-
posium on Robot and Human Interactive Communication, pp. 318–322, Edinburgh, Scotland,
UK, 25–29 Aug 2014
25. Xu, Q., Ng, J., Tan, O., Huang, Z., Tay, B., Park, T.: Methodological issues in scenario-based
evaluation of human–robot interaction. Int. J. Soc. Robot. 7(2), 279–291 (2015)
26. Weiss, A., Bernhaupt, R., Tscheligi, M.: The USUS evaluation framework for user-centered
HRI. In: Dautenhahn, K., Saunders, J. (eds.) New Frontiers in Human–Robot Interaction,
pp. 89–110. John Benjamins Publishing, Amsterdam (2011)
27. Dumas, J.S., Redish, J.: A Practical Guide to Usability Testing. Ablex Publishing Corporation,
Norwood, NJ (1999)
28. Sim, D.Y.Y., Loo, C.K.: Extensive assessment and evaluation methodologies on assistive social
robots for modelling human–robot interaction–a review. Inf. Sci. 301, 305–344 (2015)
29. Lincoln, Y.S., Guba, E.G.: Naturalistic inquiry. Sage, Newbury Park (1985)
30. Rohrer, C.: When to use which user-experience research methods. Nielsen Norman Group (2014). https://www.nngroup.com/articles/which-ux-research-methods/. Accessed 2 Sept 2019
31. Patton, M.Q.: Qualitative Research and Evaluation Methods, 3rd edn. Sage, London (2002)
32. Braun, V., Clarke, V.: Using thematic analysis in psychology. Qual. Res. Psychol. 3(2), 77–101
(2006)
33. Young, J.E., Sung, J.Y., Voida, A., Sharlin, E., Igarashi, T., Christensen, H.I., Grinter, R.E.:
Evaluating human–robot interaction: focusing on the holistic interaction experience. Int. J.
Soc. Robot. 3, 53–67 (2011)
34. Shackel, B.: Usability—context, framework, definition, design and evaluation. Interact.
Comput. 21, 339–346 (2009)
35. Blandford, A.E., Hyde, J.K., Green, T.R.G., Connell, I.: Scoping analytical usability evaluation
methods: a case study. Hum. Comput. Interact. 23, 278–327 (2008)
36. Benyon, D.: Designing User Experience: A Guide to HCI, UX and Interaction Design, 4th edn.
Pearson, Harlow, England (2019)
37. Nielsen, J., Mack, R.L. (eds.): Usability Inspection Methods. Wiley, New York (1994)
38. Nielsen, J.: Heuristic evaluation. In: Nielsen, J., Mack, R.L. (eds.) Usability Inspection
Methods, pp. 25–62. Wiley, New York (1994)
39. Clarkson, E., Arkin, R.C.: Applying heuristic evaluation to human–robot interaction systems.
In: FLAIRS Conference, pp. 44–49, Key West, Florida, USA (2007)
40. Weiss, A., Wurhofer, D., Bernhaupt, R., Altmaninger, M., Tscheligi, M.: A methodological
adaptation for heuristic evaluation of HRI. In: RO-MAN 2010: Proceedings of the 19th IEEE
International Symposium on Robot and Human Interactive Communication, pp. 1–6, Viareggio,
Italy (2010)
41. Lewis, C., Polson, P., Wharton, C., Rieman, J.: Testing a walkthrough methodology for theory-based design of walk-up-and-use interfaces. In: Proceedings ACM CHI’90 Conference, pp. 235–242, Seattle, WA, USA, 1–5 April 1990
42. Wharton, C., Rieman, J., Lewis, C., Polson, P.: The cognitive walkthrough method: a practi-
tioner’s guide. In: Nielsen, J., Mack, R.L. (eds.) Usability Inspection Methods, pp. 105–140.
Wiley, New York (1994)
43. Rogers, Y.: HCI Theory: Classical, Modern, and Contemporary. Morgan & Claypool Publishers,
San Rafael, CA (2012)
44. Andreasson, R., Alenljung, B., Billing, E., Lowe, R.: Affective touch in human–robot
interaction: conveying emotion to the nao robot. Int. J. Soc. Rob. 10(4), 473–491 (2018)
45. Lowe, R., Andreasson, R., Alenljung, B., Lund, A., Billing, E.: Designing for a wearable
affective interface for the NAO robot: a study of emotion conveyance by touch. Multimodal
Technol. Interact. 2(1), 2 (2018)
46. Alenljung, B., Lowe, R., Andreasson, R., Billing, E., Lindblom, J.: Conveying emotions by
touch to the Nao robot: a user experience perspective. Multimodal Technol. Inter. 2(4), Article
no. 82 (2018)
47. Thomas, B.: ‘Quick and dirty’ usability tests. In: Jordan, P.W., Thomas, B., Weerdmeester,
B.A., McClelland, I.L. (eds.) Usability Evaluation in Industry, pp. 107–114. Taylor & Francis,
London (1996)
48. Vermeeren, A.P.O.S, Law, E.L.-C., Roto, V., Obrist, M., Hoonhout, J., Väänänen-Vainio-
Mattila, K.: User experience evaluation methods: current state and development needs. In:
Proceedings of the 6th Nordic Conference on Human–Computer Interaction: Extending
Boundaries (NordiCHI ‘10), pp. 521–530, Reykjavik, Iceland, 16–20 Oct 2010
49. Bisio, A., Sciutti, A., Nori, F., Metta, G., Fadiga, L., Sandini, G., Pozzo, T.: Motor contagion
during human–human and human–robot interaction. PLoS ONE 9(8) (2014). https://doi.org/
10.1371/journal.pone.0106172
50. Rogers, Y., Marshall, P.: Research in the Wild. Morgan & Claypool Publishers, San Rafael,
CA (2017)
51. Frennert, S., Eftring, H., Östlund, B.: Case report: implications of doing research on socially
assistive robots in real homes. Int. J. Soc. Robot. 9(3), 401–415 (2017)
52. Beagley, N.I.: Field-based prototyping. In: Jordan, P.W., Thomas, B., Weerdmeester, B.A.,
McClelland, I.L. (eds.) Usability Evaluation in Industry, pp. 95–104. Taylor & Francis, London
(1996)
53. Kujala, S., Roto, V., Väänänen-Vainio-Mattila, K., Karapanos, E., Sinnelä, A.: UX curve: a
method for evaluating long-term user experience. Interact. Comput. 23, 473–483 (2011)
54. Nielsen, J., Lyngbæk, U.: Two field studies of hypermedia usability. In: McAleese, R., Green,
C. (eds.) Hypertext: State of the Art, pp. 64–72. Intellect, Oxford, England (1990)
55. Duh, H.B.-L., Tan, G.C.B., Chen, V.H.: Usability evaluation for mobile device: a comparison
of laboratory and field test. In: MobileHCI’06, pp. 181–186. Helsinki, Finland, 12–15 Sept
2006
56. Kaikkonen, A., Kekäläinen, A., Cankar, M., Kallio, T., Kankainen, A.: Usability testing of
mobile applications: a comparison between laboratory and field testing. J. Usability Stud. 1(1),
4–16 (2005)
57. Brooke, J.: SUS: a quick and dirty usability scale. In: Jordan, P.W., Thomas, B., Weerdmeester,
B.A., McClelland, I.L. (eds.) Usability Evaluation in Industry, pp. 189–194. Taylor & Francis,
London (1996)
58. Hassenzahl, M., Burmester, M., Koller, F.: AttrakDiff: Ein Fragebogen zur Messung wahrgenommener hedonischer und pragmatischer Qualität (AttrakDiff: a questionnaire for the measurement of perceived hedonic and pragmatic quality). In: Proceedings of the Mensch &
Computer 2003, Interaktion in Bewegung, Stuttgart (2003)
59. Pressman, R.S.: Software Engineering: A Practitioner’s Approach, 5th edn. McGraw Hill,
London (2000)
60. Sutcliffe, A.: User-Centred Requirements Engineering: Theory and Practice. Springer, London
(2002)
61. Zowghi, D., Coulin, C.: Requirements elicitation: a survey of techniques, approaches, and
tools. In: Aurum, A., Wohlin, C. (eds.) Engineering and Managing Software Requirements,
pp. 21–46. Springer, Berlin, Germany (2005)
62. de Graaf, M.M.A., Allouch, S.B.: Exploring influencing variables for the acceptance of social
robots. Robot. Auton. Syst. 61(12), 1476–1486 (2013)
63. Whiteside, J.A., Bennett, J., Holtzblatt, K.: Usability engineering: our experience and evolution.
In: Helander, M. (ed.) Handbook of Human–Computer Interaction, pp. 791–817. Elsevier
Science, Amsterdam, The Netherlands (1988)
64. Lewis, J.R.: Sample sizes for usability studies: additional considerations. Hum. Factors 36(2),
368–378 (1994)
65. Virzi, R.A.: Refining the test phase of usability evaluation: how many subjects is enough? Hum.
Factors 34(4), 457–468 (1992)
66. Rosson, M.B., Carroll, J.M.: Scenario-based design. In: Jacko, J., Sears, A. (eds.) The Human–
Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Appli-
cations, pp. 1032–1050. Lawrence Erlbaum Associates, Mahwah (2002)
67. Good, M.D., Whiteside, J.A., Wixon, D.R., Jones, S.J.: Building a user-derived interface.
Commun. ACM 27(10), 1032–1043 (1984)
68. Riek, L.D.: Wizard of Oz studies in HRI: a systematic review and new reporting guidelines. J.
Hum. Rob. Interact. 1(1), 119–136 (2012)
69. Hornbæk, K.: Current practice in measuring usability: challenges to usability studies and
research. Int. J. Hum. Comput. Stud. 64(2), 79–102 (2006)
70. Bevan, N.: What is the difference between the purpose of usability and user experience eval-
uation methods. In: Proceedings of the Workshop UXEM 2009 (INTERACT 2009), pp. 1–4,
Uppsala, Sweden (2009)
71. Lavery, D., Cockton, G., Atkinson, M.P.: Comparison of evaluation methods using structured
usability problem reports. Behav. Inf. Technol. 16(4–5), 246–266 (1997)
72. Andre, T.S., Hartson, H.R., Belz, S.M., McCreary, F.A.: The user action framework: a reliable
foundation for usability engineering support tools. Int. J. Hum. Comput. Stud. 54, 107–136
(2001)
73. Barnum, C.M.: Usability testing essentials: ready, set… test!. Morgan Kaufmann, Amsterdam
(2011)
74. https://tla.mpi.nl/tools/tla-tools/elan/
75. Bartneck, C., Kulic, D., Croft, E., Zoghbi, S.: Measurement instruments for the anthropomor-
phism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc.
Rob. 1, 71–81 (2009)
76. Nomura, T., Kanda, T., Suzuki, T.: Experimental investigation into influence of negative
attitudes toward robots on human–robot interaction. AI & Soc. 20(2), 138–150 (2006)

Jessica Lindblom is an Associate Professor of Informatics at
the School of Informatics, University of Skövde, Sweden. She
has a Bachelor’s degree in Cognitive Science, a Master’s degree
in Informatics, and a Ph.D. in Cognitive Systems. She is the
head of the research group Interaction Lab at the University of
Skövde. Her research interests are social aspects of embodied,
situated, and distributed cognition, and their implications for various kinds of socially interactive technology. Over the years
she has acquired extensive experience from research on human-
robot interaction and human-robot collaboration from human-
centred and user experience perspectives. In collaboration with
colleagues at the University of Skövde, she is establishing one
of the world’s first Master’s Programs on Human-Robot Inter-
action.

Beatrice Alenljung is a senior lecturer in Informatics at the
School of Informatics, University of Skövde, Sweden. She grad-
uated with the degree of Doctor of Philosophy from Linköping
University, Sweden, in 2008 within Informatics. Dr. Alenljung
also holds a Ph.Lic (2005) in Informatics from Linköping Uni-
versity, an M.Sc. in Informatics and a B.Sc. in Systems Analy-
ses, both from the University of Skövde. Her research interests
include user experience (UX), human-computer interaction
(HCI), human-robot interaction (HRI), requirements engineer-
ing (RE), decision support systems (DSS), and simulation-
enhanced learning environments. The human-centered
perspective is the main thread of her research.

Erik Billing is a Senior Lecturer at the University of Skövde,
Sweden. With a Master in Cognitive Science and a Ph.D. in
Computing Science, he is engaged with research on human–
robot interaction, user experience, and robot learning from
human demonstration. He has extensive experience from
research on robot-assisted therapy for children with autism,
and is in collaboration with the team in Skövde building up
one of the world’s first Master’s Programs on Human-Robot
Interaction.
Evaluating Human-Robot Interaction
with Ethology

Marine Grandgeorge

Abstract Evaluating human-robot interactions in order to improve them is a major challenge. Several scientific approaches are commonly used. Here, we propose that ethology, the science of behavior, could be a suitable discipline for studying this question. After some explanation, examples are given to illustrate these possibilities.

Keywords Interaction · Relationships · Ethology · Methods

The emergence of social robotics highlights new challenges for HRI (Human-Robot Interaction) evaluations. It is not only about the evaluation of robots, which is widely covered by ergonomics and psychology. Our concern here is mainly the evaluation of interactions between humans and robots, which can be centred on the humans or on the interaction itself. For example, how is it possible to evaluate the influence of robots and their behaviour on humans, the nature of these effects (positive or negative) on humans’ cognitive functions, emotions or social skills, and which behaviour is relevant and acceptable, and in which context [1]? Until a few years ago, the researchers who created robots and programmed their behaviour (robotic and computer scientists) were also those who carried out the evaluations. For that, they used methods from ergonomics and psychology, as the HRI discipline was born from psychology and robotics. Here, we propose a new type of approach, using ethology.

1 Ethology, a Behavioral Science

Ethology comes from the Greek words «ethos» and «logos», which together mean the study of behavior. It is the scientific and objective study of animal behaviour, usually with a focus on behaviour under natural conditions and considering behaviour as an evolutionarily adaptive trait [2]. Based on observation, this science comes from
zoology and was recognized by the Nobel Prize awarded to three ethologists, von Frisch, Lorenz and Tinbergen, in 1973.
Numerous biologists contributed to the emergence of ethology, for example Charles Darwin, especially with his theory of evolution by natural selection. Geoffroy St. Hilaire used the word «ethology» for the first time in 1855 to describe an area of zoology. Later, Ivan Pavlov worked on associative learning in animal behavior. He showed a learning process in which a new response becomes associated with a particular stimulus, for example, the excitement of a dog whenever it sees a collar as a prelude to a walk.
Some research in human ethology involves the study of nonverbal behaviors, especially when one is interested in young children or people with disabilities. Thus, some of the behaviors involved in human communication and social interaction can be considered analogous to those observed in other animals. Much research has linked animal and human behaviors. Lorenz was interested in child characteristics [3]: they may be stimuli acting on the mechanisms triggering parental behaviors. For example, a large head, big eyes and so on are more strongly related to maternal behavior than a small head and small eyes. Another example is the attachment theory of Bowlby [4]. He defined attachment as a ‘lasting psychological connectedness between human beings.’ He highlighted the importance of the child’s relationship with their mother for their social, emotional and cognitive development. For example, he observed the link between early infant separation from the mother and later maladjustment. His theory was inspired by ethologists and by animal studies on imprinting in birds and the effects of maternal withdrawal in apes.
Ethology is a science with numerous links to other disciplines, such as sociology, anthropology, ethnology, psychology, and so on. We propose that the science of ethology be applied to the study of human-robot interaction.

2 Ethology and HRI: Can Ethology Form the Basis of HRI Evaluation?

In ethology, as in all biological sciences, humans are considered an animal species. The ethologists’ point of view that robots can also be considered as another entity with which we could communicate is not contradicted by HRI researchers, who consider that “robots can actively respond to people’s affections as a physical, social actors similar to a living entity directly embedded in people’s real-world physical environments” [5].
We could apply to HRI the concept of Umwelt or “self-centered world” developed by Jakob von Uexküll [6] and mainly used in ethology. Organisms have different Umwelten, even though they share the same environment. An Umwelt is “constituted by a more or less broad series of elements [called] ‘carriers of significance’ or ‘marks’ which are the only things” that the organism can perceive. The most famous example is the tick’s Umwelt, which is reduced to three elements: the odor of butyric acid, the temperature of mammals’ blood, and their hairy features. By analogy, the Umwelt of robots is created by humans. For example, concerning the auditory channel, some robots are able to decode a human’s voice [7] as well as to produce speech [8] or music [9]; concerning the visual channel, some robots can decode human posture [10] and human facial expressions [11], as well as display facial expressions themselves [12].
Ethology studies behaviors related to human-human, human-animal and animal-
animal interactions. Robots are now seen as companions that also display behaviors.
Thus, ethological concepts, methods, and analyses can be applied to human-robot
interactions. Ethology has many advantages for the HRI field. First, observations can be made in either natural or experimental settings and are not invasive for participants. Second, ethological methods can be combined with other methods (e.g. questionnaires, interviews, performance or psychophysiological measures of both human and robot), giving several points of view on the same situation. Indeed, ethological data are about behavior, whereas interview data are about attitudes and beliefs. Combining ethological methods with other methods leads to robust and valid evaluations, as previously suggested [13]. Third, ethologists are also experts in evaluation: starting from a research question, they can design and conduct an evaluation, and analyze and interpret the results with well-established statistical methods.

3 Methodology Used in Ethology

The starting point is to establish a research question, a fundamental step. Research expectations should be clearly established, because the sole goal of the evaluation is to answer this precise question. To answer it, four main steps are identified: choosing the study context, describing behavior, quantifying the observed behavior, and analyzing and interpreting the data (this last step is not detailed in this paper because it is specific to each research project and its scientific literature).

3.1 Choosing the Study Context

Bethel and Murphy’s general description explains all the necessary steps to design a study [13] (Fig. 1). One important step is the study location and environment: in the field, in the laboratory, in a virtual environment, or online [13]. Zabel and Zabel [14] confirm that it is possible to use laboratory experimentation (artificially creating situations) in combination with observations made in natural contexts, or to use only observational techniques. Choices are open, but it is important to remember that different approaches provide answers to different kinds of questions. For example, observing in a natural setting allows the researcher to understand “how it works” without modification, whereas an experimental approach allows the researcher to explore specific factors (e.g. male versus female) or to have greater control of the situation. Intermediary experimental situations are also possible (e.g. giving only some instructions while participants otherwise remain free in the setting). Combining several approaches makes the resolution of a specific problem easier, because of the different points of view available, even though the experiment then becomes more complex to design due to the different methods involved.

Fig. 1 The different steps required for planning, designing, and executing human studies in HRI (extracted from [13])

3.2 Describing Behavior

Before describing behavior, it is important to understand what behavior is [15]. A behavior is composed of several behavioral units, for example “X raises left arm” and “X is smiling”. Behavioral units must be clearly defined and mutually exclusive so that they can be understood without ambiguity by others. The complete inventory of behavioral units (for a specific species) is called an ethogram, which is the basis for observations. Because establishing an ethogram takes a very long time (sometimes more than one year), ethologists prefer to use a behavioral repertoire containing the behavioral units specific to a research question. Notice that it may include behavioral units that are common to several species.
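As an illustration, a behavioral repertoire can be written down as a small set of explicitly defined, mutually exclusive units, as in the sketch below; the unit names and definitions are invented examples for an HRI observation, not an established ethogram.

# A minimal sketch of a behavioral repertoire for an HRI observation, written
# as explicitly defined behavioral units. The unit names and definitions are
# invented examples, not an established ethogram.
behavioral_repertoire = {
    "look at robot":         "participant's gaze is directed at any part of the robot",
    "touch robot":           "participant's hand is in contact with the robot",
    "talk to robot":         "participant produces speech addressed to the robot",
    "self-centered gesture": "participant touches own face, hair or clothes",
    "move away":             "participant increases distance to the robot by at least one step",
}

for unit, definition in behavioral_repertoire.items():
    print(f"{unit}: {definition}")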

3.3 Quantifying Behavior

Quantifying behavior means observing a situation and recording what happens in terms of the occurrence of behavioral units. Altmann [15] defined sampling methods to record behavior. The main ones are:
• Ad libitum sampling consists of recording a behavioral unit “as much as the observer can”. This sampling method is a good way to decide which behaviors are important for the subject you are studying. For example, if you watch a human and a robot, it might look like this:
13:30 Human put his hand on the robot
13:31 Robot proposed to play
13:35 Human stopped playing after 4 min
And so on
• Focal-individual sampling consists of recording all behavioral unit occurrences and durations of an individual during a sampling period. This is a good way to record precise data in a group of individuals. For example, if you watch a group of children playing with a robot, it might look like this:
13:30 Charlie (C = focal individual) looked at the robot
13:31 C proposed to A to play with the robot
13:32 C looked at A and B
13:33 C played with the robot
And so on
• Focal-behavior sampling consists of recording all occurrences and durations of chosen behavioral units for all individuals during a sampling period. This is a good way to record precise data on a rare behavior. For example, if you watch the same group of children playing with a robot and “talking to the robot” is the focal behavior, it might look like this:
13:30 C talked to robot 2 s
13:31 C talked to robot 5 s
13:42 A talked to robot 13 s
13:44 B talked to robot 3 s
And so on
• Instantaneous sampling consists of recording an individual’s current activity at preselected moments, for example once every minute. This is a good way to obtain the frequency of behaviors (or a time-budget) for each individual. For example, if you watch the same group of children playing with a robot, it might look like this:
13:30 A looked at B, B walked to the robot, C touched the robot
13:31 A left the room, B turned on the robot, C looked at B
13:32 A talked to the robot, B talked to C, C looked at B
And so on
• One-zero sampling consists of recording the presence (noted 1) or the absence (noted 0) of a behavioral unit during a sampling period, for example one minute. This is a good way to record intermittent behaviors. For example, if you watch the same group of children playing with a robot and focus on talking to the robot, it might look like this:
13:30 to 13:31 A and C talked at least once to the robot, B did not
13:31 to 13:32 A, B and C talked at least once to the robot
13:32 to 13:33 C talked at least once to the robot, A and B did not
And so on
Sometimes, other data are added such as performance (e.g. reaction time, response
time, success or failure) and vocalizations.
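As an illustration of how such records can be processed, the sketch below applies one-zero scoring to a timestamped event log and reproduces the pattern of the one-zero example above; the event log and interval length are invented assumptions.

# A minimal sketch of one-zero sampling applied to a timestamped event log,
# following the example above: for each one-minute interval, each individual
# is scored 1 if they talked to the robot at least once, otherwise 0. The
# event log entries (minute offset, individual) are invented.
events = [(0.1, "A"), (0.4, "C"), (1.2, "A"), (1.5, "B"), (1.8, "C"), (2.6, "C")]
individuals = ["A", "B", "C"]
n_intervals = 3   # three one-minute sampling periods

scores = {ind: [0] * n_intervals for ind in individuals}
for minute, individual in events:
    interval = int(minute)            # which one-minute period the event falls in
    if interval < n_intervals:
        scores[individual][interval] = 1

for individual, row in scores.items():
    print(f"{individual}: {row}")    # e.g. A: [1, 1, 0]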

3.4 Ethology’s Contribution

Describing and classifying behaviors is a major effort for ethologists, but the questions proposed by Tinbergen [2] determine the contribution of the discipline. He argues that ethology always needs to include four kinds of explanation in any investigation of behavior.
• Function: how does the behavior affect the animal’s (always including humans) chances of survival and reproduction? Why does the animal respond that way instead of some other way?
• Causation: what are the stimuli that elicit the response, and how has it been
modified by recent learning?
• Development: how does the behavior change with age, and what early experience
is necessary for an animal to display that behavior?
• Evolutionary history: how does that behavior compare to related species’ similar
behavior, and how might it have begun through the process of evolution?
To give an example, it could be interesting to focus on a human behavior, i.e. human infant crying, as Zeifman did [16]. Here, we give only a summary of this work. The function of infant crying is to elicit caring behavior from the parent. On the parents’ side, infant crying provides information that, for example, determines the type of care to be provided. Causes are various; one of them is that most infant crying is caused by hunger, physical pain or discomfort, or being left alone. The development, or ontogeny, question tells us that in the first 3 months of life infant crying mainly reflects internal states (e.g. hunger); subsequently, it also reflects external events, such as the fear of strangers. Finally, regarding evolutionary history, it is hypothesized that infant crying evolved as an additional signal to maintain mother-child contact when olfaction and vision were not adequate (e.g. during the night).

4 Some Examples of Research Using the Ethological Approach

One of the studies using the ethological approach received an international interdisciplinary research award at the RO-MAN conference in 2014. The research team was composed of an ethologist, two computer scientists and a roboticist, who studied Nao’s impact on a memory game [17]. As different parameters influence human-machine interactions (e.g. robot vs virtual character [18], realism, proximity, size [19]), one could argue that the type of robot, or merely its presence, may impact human behavior. To explore this question, the researchers measured differences between a game played with or without a robot in order to deduce the added value of a robot in a memory game. For that, an adaptation of the Simon game was used. To better understand the user’s behaviors in this memory game, three conditions were compared: (1) one group played with a robot and a tablet, (2) another group played only with a robot, and (3) the last one played only with the tablet. Different types of data were collected, after establishing a precise behavioral repertoire including, for example, body posture, spatial distance and facial expressions. Data about performance and feelings were also recorded. The results were surprising. For example, the researchers showed that positive facial behaviors (e.g. smiling) were less frequent (in duration and in occurrence) when participants played with the tablet only, whereas negative facial behaviors were mostly observed in the condition with robot and tablet. Moreover, spatial distances between the human and the device differed according to the experimental condition: with the tablet only, participants were in contact most of the time, whereas with the robot (with or without the tablet) they were more distant. Thus, participants increased their spatial distance in the presence of the robot and displayed more emotional expressions, suggesting that the robot could be considered as a social partner [20]. Could a questionnaire or interview give such information? Are participants aware of that?
Another experiment analyzed how 3-to-5-year-old children interacted with the Emi robot, a robotic plush bear, when alone with it [21]. Typical ethological items for studying interaction were recorded: vocal and verbal behaviors directed to Emi (e.g. talking, exclaiming, pointing), visual behaviors (e.g. looking at), tactile behaviors (e.g. touching the robot, a friend, or self-centered gestures), locomotion (e.g. walking, staying motionless), as well as the distance between the child and the robot (using scan sampling), measured in child's arm lengths (i.e. contact, 0 to 1/2, 1/2 to 1, 1 to 2, >2, out of the room). This last item is a common measurement and provides an estimate of the child's interest, here in the robot. When alone with the robot, children mostly remained aloof from Emi, stayed closer to the experimenter (who could be a source of comfort?) and frequently displayed self-centered gestures, an indicator of stress [22]. Thus, these observations allowed the researchers to show that encountering a robot when alone may be stressful for young children, even though some other experiments showed the opposite [23]. Such differences may be explained by several factors linked to the children (e.g. age), the robot (e.g. appearance, movement) or the experimental context [19, 23, 24].
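As a purely illustrative sketch (not taken from [21]), scan-sampling records of this kind, with distance expressed in arm-length categories, can be summarized into a simple time budget; the category labels follow the list above, while the sample data and code are hypothetical:

```python
from collections import Counter

# Hypothetical scan-sampling record: one distance category per scan
# (e.g. one scan every 10 s), using the arm-length categories listed above.
CATEGORIES = ["contact", "0-1/2", "1/2-1", "1-2", ">2", "out of room"]

scans = ["1-2", "1-2", ">2", ">2", ">2", "1/2-1", "contact", ">2"]

counts = Counter(scans)
total = len(scans)

# Time budget: proportion of scans spent in each distance category.
for category in CATEGORIES:
    share = counts.get(category, 0) / total
    print(f"{category:>12}: {share:.0%}")
```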
More and more robots are used in therapeutic settings [25, 26], but observational data remain scarce concerning motivation and the impact of a device's appearance on a participant's cognitive skills. Here again, an interdisciplinary team of ethologists, computer scientists and roboticists worked together to answer the question "can animated objects influence children's behavior in cognitive tasks?" [27]. For that, the researchers compared the impact of the presence of different animated objects (i.e. a computer alone, or the same computer assisted by an animated object: a virtual human character, an animal-shaped object covered with fur, or a humanoid metallic robot) on the behavior and performance of 51 primary school children during a mental arithmetic task. Interestingly, children did not display the same behaviors depending on their task "partner": for example, the robots elicited more positive behaviors as well as fewer negative behaviors than the virtual character, and such behaviors were linked to success in the arithmetic task.
Still on therapeutic settings, Kerstin Dautenhahn's work is dedicated to people with autism spectrum disorders (ASD) [28, 29]. For example, Robins and colleagues [28] analyzed the behavior of children with ASD during a longitudinal study focused on robotic assistants in their therapy and education. They used one-second instantaneous scan sampling to record each child's eye gaze towards the robot, touching of any part of the robot, imitation of the robot and approach of the robot into close proximity. Through this ethological analysis, the evolution of behavior frequencies over 100 days, reflecting the children's skill development, was more informative than a pre-post questionnaire (administered either to the child with ASD, when possible, or to the therapist) would have been. Indeed, this longitudinal observational study showed that the expression of behaviors was not stable: sometimes children lost interest in the robot for several days, after which the robot became attractive again, with many imitation moments.
Temporal aspects have also been studied in human-animal-robot interactions, for instance by Adam Miklosi's team, which tries to better understand human-animal interaction through HRI. Kerepesi et al. [30] compared children's and adults' play behavior when interacting either with an AIBO, a robotic dog, or with a living puppy, a 5-month-old female Cairn terrier of similar size to the robot. The experiment took place in a familiar room (e.g. at school) for 5 min in a spontaneous situation. All sessions were video-recorded and later analyzed. The researchers proposed a behavioral repertoire based on broad categories, such as play behavior (e.g. looking at the toy), activity (e.g. standing) and interest in the partner (e.g. stroking). During the coding procedure, the frequency and duration of behaviors, as well as the latency of the first human tactile contact with the dog/AIBO, were recorded. Interestingly, the latency of humans, both children and adults, touching the dog or the AIBO robot was similar. Nevertheless, from the behavioral analysis, the authors showed that the AIBO robot has a limited ability to engage in temporally structured behavioral interactions with humans, suggesting that "more attention should be paid to the temporal aspects of behavioural pattern when comparing human–animal versus human–robot interaction" [30].
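To make these measures concrete, the following minimal sketch (our own illustration, not the actual coding scheme of [30]) shows how the frequency, total duration and latency of a coded behavior could be computed from hypothetical behavior intervals:

```python
# Hypothetical coded intervals for one session: (behavior, start_s, end_s).
coded = [
    ("look toy", 2.0, 6.5),
    ("stroke", 14.0, 20.0),   # first tactile contact with the dog/AIBO
    ("look toy", 25.0, 28.0),
    ("stroke", 40.0, 43.5),
]

def frequency(behavior):
    """Number of occurrences (bouts) of the behavior."""
    return sum(1 for b, _, _ in coded if b == behavior)

def duration(behavior):
    """Total time (in seconds) spent performing the behavior."""
    return sum(end - start for b, start, end in coded if b == behavior)

def latency(behavior):
    """Time from session start to the first occurrence, or None if absent."""
    starts = [start for b, start, _ in coded if b == behavior]
    return min(starts) if starts else None

print(frequency("stroke"), duration("stroke"), latency("stroke"))
# -> 2 bouts, 9.5 s in total, first contact at 14.0 s
```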
Interestingly, researchers in HRI sometimes use methods very close to those used in ethology without using the corresponding vocabulary. For example, in their article entitled "Robots in the wild: observing human-robot social interaction outside the lab", Sabanovic and colleagues [31] explain that "videotaped data was coded using behavioral analysis software" and that they focused on behaviors such as "utterance, spatial movement, gesture, and gaze as performed by the robots, the people who interacted with them directly, and those who were in close proximity but did not interact directly". Although the sampling method is not named, we can infer that they used focal-individual sampling.
Others have also proposed simple behavioral analyses to complement a questionnaire approach and help answer their hypotheses. For example, Breazeal and colleagues [32] focused on the effects of nonverbal communication on efficiency and robustness in human-robot teamwork in a lab context. They coded "the total number of errors during the interaction; the time from when an error occurred to being detected by the human; the length of the interaction as measured by time and by the number of utterances required to complete the task". Interestingly, they explained that without such observations they could not have confirmed or refuted some of their hypotheses, such as the fact that implicit non-verbal communication positively impacts human-robot task performance with respect to understandability of the robot, efficiency of task performance, and robustness to errors that arise from miscommunication.
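As an illustration only (the event names and values are invented, not taken from [32]), such simple teamwork measures could be derived from a time-stamped event log along these lines:

```python
# Hypothetical time-stamped event log of one human-robot teamwork session.
# Event names and timings are invented for illustration only.
log = [
    (0.0, "task_start"),
    (3.1, "utterance"),
    (8.4, "error_occurred"),
    (11.9, "error_detected_by_human"),
    (15.0, "utterance"),
    (31.7, "utterance"),
    (42.0, "task_complete"),
]

errors = [t for t, e in log if e == "error_occurred"]
detections = [t for t, e in log if e == "error_detected_by_human"]
utterances = sum(1 for _, e in log if e == "utterance")

interaction_length = log[-1][0] - log[0][0]            # total task time (s)
detection_delays = [d - o for o, d in zip(errors, detections)]

print(f"errors: {len(errors)}")
print(f"mean time from error to detection: {sum(detection_delays) / len(detection_delays):.1f} s")
print(f"interaction length: {interaction_length:.1f} s, utterances: {utterances}")
```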
These studies are some examples of the growing literature on human-robot interaction in which an ethological approach has been adopted, illustrating the value of ethology for HRI.

5 Conclusion

Although ethology is a science in itself, it is also a science with numerous links to other fields. Ethologists often use methodologies from other fields to complement their observations, in line with HRI researchers' suggestions [13]. Moreover, the evaluation of human-robot interaction is a multidisciplinary field at the interface between ethology and robotics. These two sciences share common interests and complement each other in their evaluation methodologies. Several questions currently raised in robotics have been the subject of ethological research for several decades concerning human-animal, animal-animal or human-human interaction (e.g. respect for individual distance). Both disciplines also question uni- and multimodal communication and the importance of being similar, and up to what point, in order to establish a relationship.
Thus, ethology and robotics mutually enhance each other. Robotics becomes a tool for ethology, for example when robots are used to understand animal behaviors. Ethology is also a tool for robotics, for example thanks to the behavioral repertoires and sampling methods it offers, and because it makes it possible to record information that is not available when using questionnaires (attitude versus behavior). Finally, some significant issues are worth noting. As robots become more and more present in our society, the question of evaluating human-robot interaction becomes essential and raises many ethical questions, especially if we broaden the perspective by highlighting Hinde's work [33]: ethologists consider that an interaction is a brief event involving two or more individuals and one or more types of behavior (A does X to B and B responds with Y). Each interaction is influenced by the previous one(s) in the process of developing a relationship: therefore, according to the possible 'positive or negative memory' related to it, each partner has expectations concerning the other's subsequent behavior, which can also be modulated by previous experience. Ethological methods are thus well suited to evaluating interactions and relationships over the long term, a capability that HRI researchers genuinely need.

References

1. Jost, C., Le Pevedic, B., Belpaeme, T., Grandgeorge, M.: Evaluating Human-Robot interaction
with ethology. Presented at the 25th IEEE RO-MAN, New York, United States (2016)
2. Tinbergen, N.: On aims and methods of ethology. Zeitschrift für Tierpsychologie 20, 410–433
(1963)
3. Lorenz, K.: Trois essais sur le comportement animal et humain, p. 240. Seuil (1970)
4. Bowlby, J.: Attachment and loss, Vol. 1: Attachment. New York (1969)
5. Young, J.E., et al.: Evaluating human-robot interaction: focusing on the holistic interaction
experience. Int. J. Soc. Robot. 3(1), 53–67 (2011)
6. von Uexküll, J.: Mondes Animaux et Monde Humain, suivi de la Théorie de la Signification.
Gonthier, Paris (1965)
7. Roy, N., Pineau, J., Thrun, S.: Spoken dialogue management using probabilistic reasoning.
Presented at the 38th Annual Meeting on Association for Computational Linguistics (2000)
8. Bruce, A., Nourbakhsh, I., Simmons, R.: The role of expressiveness and attention in Human-
Robot Interaction (2001)
9. Tapus, A., Tapus, C., Matari, M.: The role of physical embodiment of a therapist robot
for individuals with cognitive impairments. Presented at the Robot and Human interactive
communication, Toyama, Japan (2009)
10. Thanh Nguyen, D., Li, W., Ogunbona, P.: A local intensity distribution descriptor for object
detection. Electron. Lett. 47(5), 322–324 (2011)
11. Wimmer, M., MacDonald, B.A., Jayamuni, D., Yadav, A.: Facial expression recognition for
human-robot interaction: a prototype. In: Proceedings 2nd International Conference on Robot
Vision, pp. 139–152 (2008)
12. Adams, A., Robinson, P.: An android head for social-emotional intervention for children with
autism spectrum conditions. In: Affective Computing and Intelligent Interaction, pp. 183–190
(2011)
13. Bethel, C.L., Murphy, R.R.: Review of human studies methods in HRI and recommendations.
Int. J. Social Robot. 2, 347–359 (2010)
14. Zabel, R.H., Zabel, M.K.: Ethological approaches with autistic and other abnormal populations.
J. Autism Dev. Disord. 12(1), 71–83 (1982)
15. Altmann, J.: Observational study of behaviour: sampling methods. Behaviour 49, 227–267
(1974)
16. Zeifman, D.M.: An ethological analysis of human infant crying: answering Tinbergen’s four
questions. Dev. Psychobiol. 39, 265–285 (2001)
17. Jost, C., Grandgeorge, M., Le Pévédic, B., Duhaut, D.: Robot or tablet: users’ behaviors on
a memory game. In: 23rd IEEE International Symposium on Robot and Human Interactive
Communication, pp. 1050–1055. Edinburgh, Scotland, UK (2014)
18. Kidd, C.D., Breazeal, C.: Human-Robot interaction experiments: lessons learned. In: AISB’05:
Social Intelligence and Interaction in Animals, Robots and Agents, pp. 141–142. University of
Hertfordshire, Hatfield, UK (2005)
19. Powers, A., Kiesler, S., Fusell, S.R., Torrey, C.: Comparing a computer agent with a
humanoid robot. Presented at the Second ACM SIGCHI/SIGART Conference on Human-Robot
Interaction (2007)
20. Ohara, K., Negi, S., Takubo, T., Mae, Y., Arai, T.: Evaluation of virtual and real robot based
on human impression. Presented at the IEEE RO-MAN (2009)
21. Saint-Aimé, S., Grandgeorge, M., Le Pévédic, B., Duhaut, D.: Evaluation of Emi interaction
with non-disabled children in nursery school using wizard of Oz technique. Presented at the
Robotics and Biomimetics IEEE-Robio (2011)
22. Grandgeorge, M., Deleau, M., Lemonnier, E., Hausberger, M.: The strange animal situation
test. Anthrozoos 24(4), 393–408 (2011)
23. Melson, G.F., et al.: Children’s behavior toward and understanding of robotic and living dogs.
J. Appl. Dev. Psychol. 30(2), 92–102 (2009)
24. Saint-Aimé, S., Le-Pévédic, B., Letellier-Zarshenas, S., Duhaut, D.: EmI—my emotional cud-
dly companion. Presented at the IEEE RO-MAN 2009, 18th International Symposium on Robot
and Human Interactive Communication, Toyama, Japan (2009)
25. Shibata, T., Wada, K., Ikeda, Y., Sabanovic, S.: Tabulation and analysis of questionnaire results
of subjective evaluation of seal robot in seven countries. In: 17th IEEE International Symposium
on Robot and Human Interactive Communication, pp. 689–694 (2008)
26. Hansen, S.T., Andersen, H.J., Bak, T.: Practical evaluation of robots for elderly in Denmark.
An overview. In: Proceedings 5th IEEE-ACM International Conference on Human-Robot
Interaction (HRI), pp. 149–150 (2010)
27. André, V., et al.: Ethorobotics applied to human behaviour: can animated objects influence
children’s behaviour in cognitive tasks? Anim. Behav. 96, 69–77 (2014)
28. Robins, B., Dautenhahn, K., Boekhorst, R.T., Billard, A.: Robotic assistants in therapy and
education of children with autism: can a small humanoid robot help encourage social interaction
skills? Univ. Access Inf. Soc. 4, 105–120 (2005)
29. Wainer, J., Ferrari, E., Dautenhahn, K., Robins, B.: The effectiveness of using a robotics class
to foster collaboration among groups of children with autism in an exploratory study. Pers.
Ubiquit. Comput. 14(5), 445–455 (2010)
30. Kerepesi, A., Kubinyi, E., Jonsson, G.K., Magnusson, M.S., Miklosi, A.: Behavioural compar-
ison of human-animal (dog) and human-robot (AIBO) interactions. Behav. Proc. 73(1), 92–99
(2006)
31. Sabanovic, S., Michalowski, M.P., Simmons, R.: Robots in the wild: observing human-robot
social interaction outside the lab. Presented at the Advanced Motion Control, 9th IEEE congress,
Turkey (2006)
32. Breazeal, C., Kidd, C.D., Thomaz, A.L., Hoffman, G., Berlin, M.: Effects of nonverbal com-
munication on efficiency and robustness in human-robot teamwork. In: IEEE/RSJ International
Conference on Intelligent Robots and Systems, Vols. 1–4, pp. 383–388 (2005)
33. Hinde, R.: On describing relationships. J. Child Psychol. Psychiatry 17, 1–19 (1976)
Marine Grandgeorge is a lecturer in ethology at the Human and Animal Ethology lab at the University of Rennes 1. She belongs to the Pegase team, which focuses on cognitive processes and social factors associated with scientific and societal issues including communication, brain plasticity, perception and understanding of conspecific and heterospecific signals, remediation and welfare. Her research mainly focuses on heterospecific communication, such as human-robot interactions as well as human-pet interactions and relationships, especially in animal-assisted interventions (e.g. dog, horse).
Evaluating Human-Robot Interaction
with Ethnography

An Jacobs, Shirley A. Elprama and Charlotte I. C. Jewell

Abstract In the field of Human-Robot Interaction (HRI) the concept of ethnography is not unheard of. It is, however, often misunderstood, seen as a methodology with no scientific validity, or misused in practice. The aim of this chapter is to offer researchers who are unfamiliar with ethnography a brief overview of its most important aspects and to guide HRI researchers towards a better understanding and use of ethnography in HRI research. This is done to show that ethnography has a role to play within the HRI field and that it is a scientific methodology with its own quality criteria. A small meta-review of the current status and use of ethnography in the field of HRI is used to guide the chapter and to illustrate, with HRI-based examples, what is being done and what is needed.

Keywords Ethnography · Human–robot interaction · Qualitative research methods · Observation · Interview

1 Introduction

Today's robots are becoming mature enough to leave the confined lab environment, due to their more social attributes, increased mobility and semi-autonomous features. This maturity also allows for longer periods of evaluation. All these advancements allow interactions between people and robots which are more socially situated, varied and complex. However, it is hard to explore how people make sense of their interaction with a robot in different contexts with frequently used HRI evaluation methods such as experimental designs. Such questions ask for a more holistic evaluation approach [1]. The social sciences, and more specifically anthropology and sociology, have a long-standing tradition of methods enabling that kind of evaluation: interpretative research traditions and, more specifically, the ethnographic approach.
In line with the call for scientific rigor in experimental design in HRI by Bethel and Murphy [2], we want to explain the core of ethnography and the quality
criteria to adhere to when performing or evaluating ethnographic HRI studies. The goal of this chapter is to give some guidance to HRI researchers who are unfamiliar with conducting ethnographic research. We want to show the contributions ethnography can bring to the field, not only when the robot is deployed and sold as a finished product, but also during the system creation process. To realize the potential of this methodology, its scientific foundations (ontological choices: what is the "reality" to study? and epistemological choices: what is the relationship between the researcher and the object of study?) have to be understood and accepted.
We will first review how ethnography has been used in the HRI community by reviewing the papers published at previous HRI conferences (Sect. 2). Then, in Sect. 3, we explain what ethnography is, where it comes from and how quality in ethnography can be assessed; we also explain which data collection methods can be used and what the implications of choosing a particular data collection method are for your research. In Sect. 4, we explain how ethnographic research should be reported. Finally, we end by summarizing the main conclusions that can be drawn from this chapter.

2 Current Use of “Ethnography” in the HRI Community

To assess the current status and use of ethnography in the HRI community, we did a small meta-review of the available publications from the core conference, the annual international conference on Human-Robot Interaction, since 2006. We searched the ACM Full-Text Collection (557,828 records, 6 June 2019), which contained 2150 records labeled "HRI" "Proceedings". We got only 13 results when adding the search term "ethnograp*" (see Table 1).
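The search itself was run in the ACM Digital Library interface; purely as an illustration, the wildcard term "ethnograp*" corresponds to a prefix match that could be reproduced over an exported set of records, for example:

```python
import re

# Hypothetical exported records; the wildcard "ethnograp*" is approximated
# here with a regular expression prefix match (case-insensitive).
records = [
    {"title": "Service Robots in the Domestic Environment",
     "abstract": "An ethnographic study of the Roomba in the home ..."},
    {"title": "Robots in Organizations",
     "abstract": "Workflow, social and environmental factors ..."},
    {"title": "Gesture Recognition for HRI",
     "abstract": "A machine learning pipeline ..."},
]

pattern = re.compile(r"\bethnograp\w*", re.IGNORECASE)

hits = [r for r in records if pattern.search(r["title"] + " " + r["abstract"])]
print(len(hits), "matching record(s)")   # -> 1 matching record(s)
```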
Not all of these are examples of ethnographic research. Some merely refer to other or previous own ethnographic studies in their introduction or review of the state of the art [3, 4]. In some, the term "ethnography" does not even appear, which makes it surprising that they were returned by the search [5, 6]. The contribution of Hasse et al. [7] refers to a workshop on the use of ethnography in the field. The remaining 8 cases are examples of the current state of applying ethnography to HRI research through empirical work.
In conclusion, although ethnographic contributions have been made to the HRI
conference from the start in 2006, this topic is not frequently reported on within the
HRI community. We will come back to the current practices of reporting ethnogra-
phy for HRI in a later section, but let us first take a step back and reflect on what
ethnography actually is beyond the field of HRI.
Table 1 References from past HRI proceedings found using "ethnograph*"
1. Forlizzi, J., & DiSalvo, C. [23] (2006): Service Robots in the Domestic Environment: A Study of the Roomba Vacuum in the Home
2. Stubbs, K., Hinds, P., & Wettergreen, D. [35] (2006): Challenges to Grounding in Human-robot Interaction
3. Forlizzi, J. [24] (2007): How Robotic Products Become Social Products: An Ethnographic Study of Cleaning in the Home
4. Mutlu, B., & Forlizzi, J. [36] (2008): Robots in Organizations: The Role of Workflow, Social, and Environmental Factors in Human-robot Interaction
5. Sabelli, A. M., Kanda, T., & Hagita, N. [34] (2011): A Conversational Robot in an Elderly Care Center: An Ethnographic Study
6. Leite, I., Castellano, G., Pereira, A., Martinho, C., & Paiva, A. [37] (2012): Modelling Empathic Behaviour in a Robotic Game Companion for Children: An Ethnographic Study in Real-world Settings
7. Ohyama, T., Maeda, Y., Mori, C., Kobayashi, Y., Kuno, Y., Fujita, R., … Ikeda, K. [3] (2012): Implementing Human Questioning Strategies into Quizzing-robot
8. Yamazaki, A., Yamazaki, K., Ohyama, T., Kobayashi, Y., & Kuno, Y. [33] (2012): A Techno-sociological Solution for Designing a Museum Guide Robot: Regarding Choosing an Appropriate Visitor
9. Kraft, K. [5] (2016): Robots Against Infectious Diseases
10. Kraft, K., & Smart, W. D. [4] (2016): Seeing is Comforting: Effects of Teleoperator Visibility in Robot-Mediated Health Care
11. Wiles, J., Worthy, P., Hensby, K., Boden, M., Heath, S., Pounds, P., … Weigel, J. [31] (2016): Social Cardboard: Pretotyping a Social Ethnodroid in the Wild
12. Eguchi, A., & Okada, H. [6] (2018): If You Give Students a Social Robot? - World Robot Summit Pilot Study
13. Hasse, C., Trentemøller, S., & Sorenson, J. [7] (2018): The Use of Ethnography to Identify and Address Ethical, Legal, and Societal (ELS) Issues

3 What Is Ethnography?

What ethnography is remains a topic of discussion among ethnographers. There is a corpus of courses and books describing how to learn the skill of performing an ethnographic study (e.g. [8]), but anthropologists and sociologists are the ones receiving
formal training in this approach and are, thus, well equipped to perform these types
of studies within the HRI field.
Describing ethnography as a method is too reductionist to describe its complexity.
The result of ethnography (the research process) is called an ethnography (the written
text). The important thing to know about ethnography is not what it is, but what it
does as Heath and Street [9] explain. They state that the written text is a narrative,
but should not be confused with fiction. It is the result of systematic, in-depth involvement with real people, participating in and observing their real life (the field), and of writing down the understanding of social and cultural phenomena in a narrative to transfer this knowledge to others. This understanding comes via direct individual
experience by a researcher or team of researchers participating in the selected field
of study. The ethnographer starts as an outsider, being introduced to a new lifeworld
and sees what is the same or different from his or her own experience. After a
while the researcher becomes a kind of an insider, an “immigrant” taking their own
cultural and social capital into a new field. This field is more than an environment,
it is a community built on practices, ideas, beliefs and thoughts. To learn them
one has to listen, practice and participate in order to understand the lived reality and the "taken for granted" rules and knowledge. This is a time-intensive process.
Charmaz [10, p. 22] describes ethnography as we understand it in this chapter with the
following quote, “Ethnographers seek detailed knowledge of the multiple dimensions
of life within the studied milieu and aim to understand members’ taken-for-granted
assumptions and rules”.

3.1 Retracing Ethnography

As with any methodological approach, ethnography is embedded in the timeframe from which it emerged. It is to a certain degree a product of its historical origin.
Therefore, to be able to fully comprehend the thought processes, we need to briefly
touch upon the history of its emergence. Ethnography is commonly accepted to have
emerged in the anthropological field in the late nineteenth and early twentieth cen-
tury. The first development of ethnography happened in Britain in the field of social
anthropology, and this is largely associated with two key figures, Bronislaw K. Malinowski [11] and Alfred R. Radcliffe-Brown [12]. Traditionally, ethnography studied
the new and exotic places abroad to understand the more “primitive beings” through
a western gaze. There was a need to better understand the different cultural groups
under their rule and it is, therefore, not surprising that the emergence of ethnography
coincides with colonialism [13]. The use of urban ethnography, ethnography that is
done in one's own backyard, has been attributed within the sociological discipline to
‘The Chicago School’ and more specifically Robert E. Park and Ernest W. Burgess.
Another prominent figure in ethnography is Margaret Mead (1901–1978), an
American cultural anthropologist. The question she set out to answer in Samoa was
“Are the disturbances which vex our adolescents due to the nature of adolescence
itself or to the civilization? Under different conditions does adolescence present a
different picture?” [14, pp. 6–7]. Mead was the first to use her ethnographic research
in Samoa as a way to analyze the way in which context and culture play an important
role in the socialization of a person or society [15]. Interestingly, all kinds of data were collected: not only written text, but also visual material such as pictures and film in the format of a documentary.

3.2 Thinking Ethnography

What is often overlooked or not known is that ethnography functions in an epistemological framework that is significantly different from a (post)-positivist epistemol-
ogy. To appreciate ethnography a different manner of thinking is necessary. Within
ethnography there are differences in the epistemological streams (e.g. constructivism,
critical theory), but there is not one that is right or wrong.
Ethnography has its origins in an era (see the earlier part on retracing ethnography)
where the dominant scientific paradigm was positivism. This paradigm assumes
that there is a “real” reality (answering the ontological question what is there to be
known?) that can be discovered by scientists. This can be done by keeping their dis-
tance from their study object to come to “truth” (dualistic epistemology, answering
the question relation between the knower and what is there to be known) by making
use of experimental set-ups or strict hypothesis-testing research designs (method-
ology, answering the question how something can be known), in which quantitative
methods are the core.
As Guba and Lincoln [16] explained, you need to position yourself as a
researcher within a scientific paradigm to understand your choice of methodology
better. The positivistic paradigm was guiding ethnography in the nineteenth century.
However, sciences evolve over time. As part of the internal critique of its eurocentrism and paternalistically steered results, the anthropological and sociological fields went through a history of internal debate and adaptations of their ethnographic approach.
Due to these debates, a post-positivist paradigm was formulated. In this post-positivist paradigm, as a researcher you are aware of the limitations of your research instruments, and a more probabilistic and partial approach to discovering reality (ontology) is put forward. This leads to researchers who still try to keep their distance from their research subject, but are aware of the mutual influence they have on each other. Results are made explicit in terms of the probability of the truth (epistemology). On a methodological level, this led to an approach in which qualitative methods and critical multiplism (the thoughtful use of multiple methods, see for example [17]) stand next to experimental and deductive hypothesis testing, to shed light on different facets of the reality that is there to be discovered. Kvale [18] has a nice
metaphor for this ontological vision. The (post)-positivist researchers are like miners
looking for a reality that is already waiting to be discovered. In the counterpart paradigms, labeled by Guba and Lincoln [16] as critical theories and constructivism, reality can only be discovered by researchers in the way a traveler learns to understand the visited reality, in dialogue with the people and surroundings they encounter. In
the critical theories there is a historical/virtual "reality", a crystallization of societal values, while for people working from the constructivist paradigm the reality is
highly locally constructed. Both paradigms thus take the (epistemological) stance
that the researcher is creating knowledge in a transactional way and that results are
value bounded/subjective. Therefore, methodologically they opt for methods which
are dialogic and dialectic within critical theory, and hermeneutic and dialectic within
constructivism. The dialectic in these approaches is the circular movement between one part and the other. For example, in the hermeneutic approach one goes from the interpretation of
part of the text to the interpretation of the whole text and back. The dialogic in the
critical approach consists of the consideration that is given to the historical back-
ground of both the structure and the researcher for interpreting the data. Meanwhile
they are integrated in a sibling approach called critical hermeneutics, which also
found its application in the field of information system design [19]. Both paradigms,
critical and constructivist—as well as their sub-variations—are also referred to as
interpretivist approaches.
In this chapter we elaborate on the interpretivist approach, because it is one that is
widely known within the ethnography community. To recapitulate, in an interpretivist
approach reality is socially influenced by people and can, therefore, be different in
each specific context. People are considered complex and ever-changing. The people
are the center of understanding and creating the reality in which they live [20].

3.2.1 Different Quality Criteria Are Needed to Check the Value of Scientific Contributions

This view of the researcher as a traveler in constant dialogue with their surroundings
during the journey, results in the need for different quality criteria to check the
value of scientific contributions following this tradition. In Table 2, we illustrate
how criteria in a post-positivist paradigm are adapted to fit comparable criteria in an
interpretivist paradigm. Within the interpretative paradigm there are different visions
on the appropriateness of these criteria, we recommend readers to further develop
their knowledge of different nuances and discussions that fall outside of the scope
of this chapter (e.g. [16]).

Table 2 Quality criteria for research in the post-positivist and interpretative paradigms (based on [16])
Basic question | Post-positivist criteria | Interpretative criteria
Neutrality     | Objectivity              | Confirmability
Truth          | Internal validity        | Credibility
Consistency    | Reliability              | Dependability
Applicability  | External validity        | Transferability
3.3 Quality Criteria in the Interpretive Research Tradition

In the interpretative research tradition, in which current ethnographic work situates itself, not everybody agrees that the post-positivistic criteria have a one-to-one counterpart (for more on this discussion read [21]). However, for a novice researcher or a reviewer not acquainted with the approach it is helpful to have some practical guidance. In their article, Treharne and Riggs [21] give four quality criteria: (1) transparency, (2) personal reflexivity and end-user involvement, (3) the transferability of findings and (4) the triangulation of data sources. We will use these criteria as a starting point in this chapter, but would like to emphasize that they do not function as a checklist; rather, they should be interpreted as a guideline to which other facets that are not discussed here can be added.
1. Transparency
It is a general scientific practice to be as transparent as possible throughout the
whole of a research study from the very beginning to the written or oral end result.
2. Personal Reflexivity and End-User Involvement
Researchers in all fields should take the time to see how their personal circum-
stances shape their research. It is a perpetual questioning of oneself that needs to take
place from the beginning until the end of the research. One way by which this can
be tracked is through regular recording of reflections and memos, leaving a trail for oneself and for others to go back to and understand the choices made (or not made) at certain points. On the other hand, end-user involvement involves including the members of
the community that is being researched. This needs to be done to ensure that the data
represents what the community feels and experiences. Personal reflexivity and end-
user involvement are strongly intertwined with the previous point on transparency
as it is necessary for this to be transmitted in order to understand the meaning of the
results.
3. The Transferability of Findings
There are three different types of generalizability according to Lewis and Ritchie
[22]: representational generalization, inferential generalization and theoretical gen-
eralization. Representational generalization (as often used in quantitative research)
refers to the extent to which research results found in a sample can be generalized
to the population from which it was sampled. Inferential generalization refers to
whether the findings can be generalized or inferred to other contexts or settings than
the one studied. Theoretical generalization refers to the theory or theories that are
created based on the collected data and transcends a single case or observation. Such
theory could be used in other research and is therefore generalizable.
4. Triangulation of Data Sources
Triangulation stands for the use of different sources of data to explore whether there
are any convergences, complementarities and differences. This is often done with
a mix of quantitative data and qualitative data, but can also be done with different
sources of qualitative data: interview data, observational data, collections of letters, pictures and other types of data.

3.4 Data Collection Methods in Ethnography

Ethnography as a methodological approach uses interpretive or so-called "qualitative" methods (e.g. contextual interviews, interpretative document or object analysis, observations, …), with a priority for observation as a source of data, to study social interactions, behaviors and perceptions of people. Ethnography was born as a technique based upon direct observation. This is not to say that listening, asking and reading do not have a place within ethnography, but rather that these are ancillary sources of information. From our review (see Table 1), we labelled 8 HRI studies as empirical ethnographic studies. In all 8 studies, a combination of observational and interview techniques is used. However, the emphasis placed on the data collection methods differs. For example, the studies by Forlizzi (et al.) [23, 24] emphasize the interview technique, with observation (called home tours) as an additional data collection technique. There are different ways to organize data collected via observation and interviews. We briefly describe them in the following paragraphs.

3.4.1 Observation Techniques and Choices

Observation, like other research techniques, demands some preparation before going
into the field to collect data. There are different aspects that need to be carefully
thought about and decided upon in order to uphold scientific rigor.
1. Level of Visibility and Transparency of the Ethnographer’s Role
First, there is a choice to be made as to whether the observing will be done overtly
(the people being observed know they are being observed) or covertly (the people
being observed do not know they are being observed) [25]. In covert observations
the researcher will have less of an influence on ‘daily reality’ as people will not
feel watched or judged. However, this does mean that the researcher is in a certain
way ‘deceiving the people in the field’ and this can be an issue in terms of research
ethics in general, but also depends on the fragility of the population the researcher
is working with (e.g. children, people in an uneven power relationship for example
at work) [26].
2. Level of Participation of the Ethnographer
Another important methodological choice that has to be made is how much you
are participating in the activities under study. When opting for a non-participant
observation, the researcher is not in direct contact with the subjects under observation; rather, they are observed 'from a distance'. When opting for participant observation, on the other hand, the practices and interactions of people are observed 'from within': the researcher interacts with the participants and takes part in everyday life [25]. Between non-participation and full participation, one can pick one's own position. When opting for total participation there is the risk of "going native", becoming a
full member of the observed group, and forgetting the reflective role as a researcher.
3. Observation Process: Phases and Duration
Besides the role a researcher will take during the observation, it is also good to
be aware of the different phases of observations that a researcher will go through
while observing. Spradley [27] describes three steps in an observation process: (1)
descriptive observations, (2) focused observations and (3) selective observations.
The first type consists of more general, descriptive observations: for instance, asking a nurse to guide you around a nursing home if that is where your robot will be used. During the second step, the observations are more focused, for example observing the activities that are organized for older adults living in the nursing home.
The final step consists of selective observations. In our research, these usually
take place after we have already analysed most of the gathered data and we realize
more detailed in-depth information is missing. In the nursing home example, these
observations could be focusing on the way nurses communicate with older adults
during activities.
At the start of the observation process, it is often not known how long a researcher
will be in the field gathering data, but this sampling process ideally comes to an end
once data saturation is reached. Data saturation is reached in qualitative research
when the researcher obtains no new information when new data is collected and
analysed.
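As a hypothetical sketch rather than a prescribed procedure, one simple way to make data saturation tangible is to track how many previously unseen codes each newly analysed transcript or field visit contributes:

```python
# Hypothetical codes assigned to each successive transcript or field visit.
coded_sessions = [
    {"workload", "trust", "privacy"},
    {"trust", "maintenance", "privacy"},
    {"workload", "maintenance"},
    {"trust", "privacy"},
]

seen = set()
for i, codes in enumerate(coded_sessions, start=1):
    new = codes - seen        # codes not encountered in earlier sessions
    seen |= codes
    print(f"session {i}: {len(new)} new code(s) {sorted(new)}")

# Saturation is suggested once successive sessions stop contributing new codes.
```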
4. What to Observe?
It is always your research question that guides your decisions about what you want or need to observe. Nevertheless, nine points of attention exist to facilitate data collection: the physical space, the actors present/involved, the activities, the objects present, actions done by people, events, time, aim and feeling [28]. These are the more standard points of attention and should serve as a guideline, since it is not always necessary to focus on or capture all of these aspects. Alternatively, one could choose to focus on a couple of points during one observation and on other points during a following observation. Furthermore, other frameworks exist,
such as Forlizzi’s product ecology framework [29], with which the focus points can
be prepared. Additionally, observing is as much about what you see as what you do
not see. For instance, during robot-assisted surgery not a lot of verbal communication
between the surgeon and assistant was observed, which (after conducting in-depth
interviews) turned out to be an indicator of how often they have worked together
[30].
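Purely as an illustration (this template is ours, not part of [28]), the nine points of attention can be turned into a structured field-note template, for example:

```python
from dataclasses import dataclass

@dataclass
class FieldNote:
    """One observation entry structured around the nine points of attention [28]."""
    space: str          # the physical space
    actors: list        # the people (and robots) present/involved
    activities: str     # the activities going on
    objects: list       # the objects present
    acts: str           # single actions done by people
    events: str         # related events
    time: str           # timing and sequencing
    aim: str            # what people are trying to accomplish
    feelings: str       # emotions observed (or felt by the observer)
    extra: str = ""     # anything that does not fit the scheme

# Hypothetical example entry.
note = FieldNote(
    space="day room of a nursing home",
    actors=["two nurses", "eight residents", "telepresence robot"],
    activities="afternoon quiz led via the robot",
    objects=["tablet", "coffee cups"],
    acts="a resident waves at the robot screen",
    events="weekly activity afternoon",
    time="14:00-14:45, robot joins after 10 min",
    aim="keep residents engaged",
    feelings="initial hesitation, then amusement",
)
print(note.space, "|", note.activities)
```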
5. Capturing Data
During observations, data collected can include pictures, videos, written field
notes and drawings (for instance, a schematic drawing of a robot interacting with a
person). A useful function to convert your field notes to full sentences is the Dictation
and Speech function, available on most computers.
This allows you to dictate sentences to your computer and extend the field notes to full written sentences after you have left your field site. Of course, sensor and camera data collected by the robot used in a study can complement the observations made by the ethnographer. In one of the selected cases [31], the aim was to develop an "ethnodroid" to enable the study of child-robot interaction. While such a robot may be a viable complement to the human ethnographer, the study itself shows that there are a lot of unexpected actions and reactions that still have to be detected by the researchers themselves.
When doing observations in different contexts (e.g. nursing homes, operating rooms), going to a new context for the first time can feel very overwhelming. Ideally, observations start with a general tour (cf. Spradley's descriptive observations [27]) of the context in which robots are or will be used, which gives you a general impression of the environment, and are then followed by more focused observations.
In some contexts, observation is more suitable than interviews. For instance, observation was the more appropriate method in the operating room, since verbal communication could disturb the surgery. However, you should always be prepared for unexpected opportunities for contextual interviews: during some observations in the operating room, members of the operating staff were very willing to talk about their work.

3.4.2 “Qualitative” Interview Techniques and Choices

A qualitative interview distinguishes itself from other types of interviews in four ways: a flexible approach, being interactive, non-directive interviewing and face-to-face communication [26].
Similar to observations, doing interviews demands some preparation. The first
step is to decide what type of interview to do, an unstructured, semi-structured or
structured interview, as the preparation that needs to be done will heavily depend
on the type of interview. Table 3 shows the different types of interviews one can
choose from and also that semi-structured and unstructured interviews were mostly
used by the studies in our review. The structured ones are more in line with the
post-positivistic tradition, focusing on comparability instead of exploration from the
point of view of the participants.
There is a flow of question types that is often adhered to in qualitative interviews. The researcher will often start the conversation with some opening questions; these act as ice-breakers for the interviewer and interviewee to get more comfortable with each other. Then come introductory questions, which aim to understand
Table 3 The different types of interviews, with semi-structured interviews being the most used in the proceedings of the HRI conference (based on [26])
• Structured: a structured set of questions from which the researcher does not deviate. Preparation: a structured list of questions. Articles: –
• Semi-structured: the researcher has an idea of themes or topics to explore, but is flexible and will explore new themes/topics brought up in interviews. Preparation: interview guide, question protocol. Articles: [23, 24, 34, 36, 37]
• Unstructured: the researcher lets the conversation flow as naturally as possible. Preparation: concise interview guide or nothing. Articles: [34, 36]

the interviewee's opinion of the research topic; these are followed by transition questions, which lead the discussion more specifically towards the research topic. The key questions, which come after the transition questions, are the ones the interviewer will use to answer the research questions; they are supposed to make the interviewee discuss the core of the research topic. Final questions allow the interviewer to wrap up the interview and to thank the interviewee for their participation [26].
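As an illustrative sketch only (the topics and questions are invented, not taken from the reviewed studies), this question flow can be captured in a simple semi-structured interview guide:

```python
# Hypothetical semi-structured interview guide following the question flow
# described above; the topics and questions are invented for illustration.
interview_guide = [
    ("opening", "Could you tell me a bit about your role on this ward?"),
    ("introductory", "What comes to mind when you hear 'care robot'?"),
    ("transition", "How did the robot end up on your ward?"),
    ("key", "Can you walk me through the last time you worked alongside it?"),
    ("key", "What has changed in how you work with your colleagues?"),
    ("final", "Is there anything we did not discuss that you find important?"),
]

for phase, question in interview_guide:
    print(f"[{phase:>12}] {question}")
```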
There are more facets to doing qualitative observations and interviews (e.g. data analysis, …) that could not be discussed here due to space limits; we therefore strongly recommend that researchers new to these methods read the qualitative research literature, such as Denzin and Lincoln [8], before doing data collection. This should be done to enhance the scientific value of the data collected and of the results thereof.

4 Reporting Ethnographic HRI Research

In this section, we want to reflect on the current practices of reporting ethnographic research in HRI, based on quality criteria for ethnographic reporting that originate from anthropology and sociology.

4.1 Format

The format of an HRI conference paper is restrictive in length compared to more classical ethnographic contributions. The cases at hand are rather short and follow
the classical buildup of a conference proceeding: title, abstract, keywords, intro-
duction, related work/theoretical framework, method/study design, results/findings,
implications for HRI design/lessons learned/discussion and/or conclusion and references. The video format is also an accepted format at the HRI conference, but it was
excluded from our analysis. This documentary-oriented format is a classic way to
report ethnographic fieldwork, but in sociology and anthropology the videos are in
general much longer than at the HRI conference.

4.2 Content

Table 4 shows the data collection methods used in the selected 8 HRI cases. There is a clear variation in the level of detail reported across the cases. To increase the credibility of a study, one should transparently report the choices made in the sampling of audience and place, the data and the data analysis. Quotes from conversations or fragments of field observations are standard ways of transferring ethnographic insights. Also, like White et al. [32], we recommend explicitly explaining to the reader what qualitative research can bring to the field of HRI. The scarcity of ethnographic research published in the HRI conference proceedings might also suggest that the HRI audience is less familiar with qualitative research and might be unaware of its merits and quality criteria.

4.3 Types of Contributions Resulting from the HRI Studies with Ethnographic Empirical Work

In the abstract, introduction and methods sections of a conference paper, authors typically describe what kind of contribution their study makes to the field of HRI. Some of the cases describe a study aimed at understanding the current behaviors and reasoning of a certain group in a certain context, without a robot. The results of this type of study help to reflect on and design a robot solution that fits the current practices of the user(s). These studies also result in an improved understanding of the next step in the design space for a certain context, and a list of requirements and potential bottlenecks. This can be a study in itself, but in the analyzed cases it was always a first phase in a process of subsequent studies in which a robot is introduced later: a commercially available robot (for example [23]) or a newly designed prototype (for example [33]).
Table 4 Overview of the data collection methods used in the retrieved HRI studies with ethnographic empirical work (2006–2018)
1. Forlizzi and DiSalvo [23]. Observation: home tours with family members (overt observation); auto-collection via diary. Interviews: semi-structured, face to face (14) and by telephone. Artefacts: current cleaning products. Robot introduced? Yes, partially. Robot(s): Roomba, vacuum cleaning. Duration: 4 months.
2. Stubbs et al. [35]. Observation: two sites, overt observers (researchers). Documents: 63 document artifacts. Robot introduced? No, the robot was already introduced. Robot(s): Zoë, astrobiologist. Duration: 2 weeks.
3. Forlizzi [24]. Observation: home tours with family members (overt observation); visual story diaries. Interviews: semi-structured, face to face (21). Artefacts: current cleaning products. Robot introduced? Yes, partially. Robot(s): Roomba, vacuum cleaning. Duration: 4 steps, duration not made explicit.
4. Mutlu and Forlizzi [36]. Observation: participant and fly-on-the-wall observations at two sites (manufacturer & hospital). Interviews: open-ended and semi-structured. Robot introduced? Yes, partially. Robot(s): Aethon's TUG, delivery. Duration: 15 months.
5. Sabelli et al. [34]. Observation: overt and covert observations. Interviews: open-ended and semi-structured. Robot introduced? Yes, with operator. Robot(s): Robovie, companionship. Duration: 3.5 months.
6. Leite et al. [37]. Observation: overt by researchers in the room; covert by video cameras. Interviews: semi-structured, with yes/no and open questions. Robot introduced? Yes, with operator. Robot(s): iCat. Duration: sessions with 40 children.
7. Yamazaki et al. [33]. Observation: overt, by video cameras. Robot introduced? One phase without and one with the robot. Robot(s): guide robot. Duration: 4-month phase without the robot, 1-day evaluation with the robot.
8. Wiles et al. [31]. Observation: overt by facilitator and by the robot, covert by puppeteer. Interviews: questions to the parents, open-ended interview. Robot introduced? Yes. Robot(s): Ethnodroid. Duration: at the lab, 1 month with 6 children; at a fair, 5 h with 17 children and their parents.
4.4 Rationale for Using Ethnography Within Selected HRI Studies

Although not every case mentions an explicit reason why ethnography was chosen, one can deduce from the intended contributions that the authors want to contribute to the interdisciplinary challenge of creating robotic solutions that go beyond the current state of the art: solutions that are desirable to humans and that fit into their everyday lives in the future:
Ethnography has the potential to elucidate people’s interaction in a real context with in-depth
observation of their behavior as well as reasoning. [34, p. 38]

Ethnographic research provides a close look at real-life experiences of human engagement with robotic technologies, in use and in design processes; and opens up how we may study
the human needs and societal concerns that are emerging in response to these technolo-
gies. Ethnographic methods provide data that, through interdisciplinary collaboration, can
help identify and address new ethical, legal, and societal (ELS) issues in robot design and
implementation. [7, p. 393]

Hasse et al. [7] argue that the HRI literature mainly focuses on how efficient
human-robot interactions are and whether robots are accepted. In contrast, less
emphasis is placed on concerns in the daily life of humans and society. Examples they
give include how robots could have an effect on existing human-human interactions
and the difference between how robot designers think about users and how these
users interact with robots in their daily life. Studying such themes can be a reason to choose ethnographic research, particularly if you are interested in actual behavior
and interaction, rather than self-reported behavior or attitudes.

5 Conclusion

With this chapter, we have given a brief introduction to ethnography and what it can
bring to the field of HRI. We explained different types of data collection methods
that can be used in ethnography and what should be reported. One advantage of
using interviews and observations is that you can observe actions and behaviors of
people in their context and not only collect data on people’s opinions or attitudes as
is done with survey research. Observation as a method also makes it possible to observe hidden
practices and see what people actually do versus what they might say in an interview
or how they might perform in an experimental laboratory setting.
Also, we summarized which guidelines can be used to judge qualitative research.
The outcomes of ethnographic research can be theoretical as well as practical. It is
easier to propose changes or adjustments to robot technology when you are familiar
with the lived social situation and actions. It is within this mindset that ethnography
can be of much added value within HRI and generate holistic social accounts. Finally,
we encourage other HRI researchers to learn more about qualitative research, to
enrich their own research, as well as to collaborate more with people trained in ethnographic research.

Additional Reading
Denzin, N.K., Lincoln, Y.S. (eds.): The Sage Handbook of Qualitative Research, 5th
edn. Sage, Thousand Oaks, CA (2017).
Daynes, S., Williams, T. On Ethnography. Polity Press, Cambridge (2018).

References

1. Young, J.E., Sung, J., Voida, A., Sharlin, E., Igarashi, T., Christensen, H.I., Grinter, R.E.:
Evaluating Human-Robot Interaction. Int. J. Social Robot. 3(1), 53–67 (2011)
2. Bethel, C.L., Murphy, R.R.: Review of human studies methods in HRI and recommendations.
Int. J. Social Robot. 2(4), 347–359 (2010). https://doi.org/10.1007/s12369-010-0064-9
3. Ohyama, T., Maeda, Y., Mori, C., Kobayashi, Y., Kuno, Y., Fujita, R., Yamazaki, K., Miyazawa,
S., Yamazaki, A., Ikeda, K.: Implementing human questioning strategies into quizzing-robot.
In: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot
Interaction, pp. 423–424. ACM, New York, NY, USA (2012). https://doi.org/10.1145/2157689.
2157829
4. Kraft, K., Smart, W.D. Seeing is comforting: effects of teleoperator visibility in robot-mediated
health care. In: The Eleventh ACM/IEEE International Conference on Human Robot Interac-
tion, pp. 11–18. IEEE Press, Piscataway, NJ, USA (2016). Retrieved from http://dl.acm.org/
citation.cfm?id=2906831.2906836
5. Kraft, K.: Robots against infectious diseases. In: The Eleventh ACM/IEEE International Con-
ference on Human Robot Interaction, pp. 627–628. IEEE Press, Piscataway, NJ, USA (2016).
Retrieved from http://dl.acm.org/citation.cfm?id=2906831.2907014
6. Eguchi, A., Okada, H.: If you give students a social robot?—world robot summit pilot study.
In: Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction,
pp. 103–104. ACM, New York, NY, USA (2018). https://doi.org/10.1145/3173386.3177038
7. Hasse, C., Trentemøller, S., Sorenson, J.: The use of ethnography to identify and address
ethical, legal, and societal (ELS) issues. In: Companion of the 2018 ACM/IEEE International
Conference on Human-Robot Interaction, pp. 393–394. ACM, New York, NY, USA (2018).
https://doi.org/10.1145/3173386.3173560
8. Denzin, N.K., Lincoln, Y.S. (eds.): The Sage Handbook of Qualitative Research, 5th edn.
SAGE, Thousand Oaks, CA (2017)
9. Heath, S.B., Street, B.V.: Ethnography. Approaches to Language and Literacy Research.
Teachers College Press, New York (2008)
10. Charmaz, K.: Constructing Grounded Theory. A Practical Guide Through Qualitative Analysis.
Sage, London (2006)
11. Malinowski, B.: Argonauts of the Western Pacific. Dutton, New York (1961)
12. Radcliffe-Brown, A.R.: The Andaman Islanders. At the University Press, Cambridge (1933)
13. Brewer, J.D.: Ethnography. Open University Press, Buckingham (2000)
14. Mead, M.: Coming of Age in Samoa: A Psychological Study of Primitive Youth for Western
Civilisation. HarperCollins Publishers, New York, NY (1928)
15. Langness, L.L.: Margaret mead and the study of socialization. Ethos 3(2), 97–112 (1975)
16. Guba, E.G., Lincoln, Y.S.: Competing paradigms in qualitative research. In: Denzin, N.K.,
Lincoln, Y.S. (eds.) Handbook of Qualitative Research, pp. 105–117. Sage, Thousand Oaks,
CA (1994)
17. Patry, J.: Beyond multiple methods: critical multiplism on all levels. Int. J. Mult. Res.
Approaches 7(1), 50–65 (2013)
18. Kvale, S.: Interviews—An Introduction to Qualitative Research Interviewing. Sage, Thousand
Oaks, CA (1996)
19. Myers, M.D.: Dialectical hermeneutics: a theoretical framework for the implementation of
information systems. Inf. Syst. J. 5(1), 51–70 (1995)
20. Willis, J.W.: Foundations of Qualitative Research: Interpretive and Critical Approaches. Sage,
London (2007)
21. Treharne, G.J., Riggs, D.W.: Ensuring quality in qualitative research. In: Rohleder, P., Lyons,
A.C. (eds.) Qualitative Research in Clinical and Health Psychology, pp. 57–73. Palgrave
Macmillan, Basingstoke
22. Lewis, J., Ritchie, J.: Generalising from qualitative research. In: Ritchie, J., Lewis, J. (eds.)
Qualitative Research Practice, pp. 263–286. Sage, Londen (2003)
23. Forlizzi, J., DiSalvo, C.: Service robots in the domestic environment: a study of the roomba
vacuum in the home. In: Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-
robot Interaction, pp. 258–265. ACM, New York, NY, USA (2006). https://doi.org/10.1145/
1121241.1121286
24. Forlizzi, J.: How Robotic products become social products: an ethnographic study of cleaning
in the home. In: Proceedings of the ACM/IEEE International Conference on Human-Robot
Interaction, pp. 129–136. ACM, New York, NY, USA (2007). https://doi.org/10.1145/1228716.
1228734
25. Schwartz, M., Schwartz, C.: Problems in participant observation. Am. J. Sociol. 60(4), 343–353
(1955)
26. Mortelmans, D.: Handboek Kwalitatieve Onderzoeksmethoden. Acco, Leuven (2007)
27. Spradley, J.P.: Participant Observation. Waveland Press, Long Grove, IL (1980)
28. Merriam, S.B.: Qualitative Research and Case Study Applications in Education. Jossey-Bass
Publishers, San Francisco (1998)
29. Forlizzi, J.: The product ecology: understanding social product use and supporting design
culture. Internal Journal of Design 2(1), 35 (2008)
30. Duysburgh, P., Elprama, S.A., Jacobs, A.: Exploring the social-technological gap in telesurgery:
collaboration within distributed OR teams. In: Proceedings of the 17th ACM Conference on
Computer Supported Cooperative Work & Social Computing, pp. 1537–1548
31. Wiles, J., Worthy, P., Hensby, K., Boden, M., Heath, S., Pounds, P., Rybak, N., Smith, M.,
Taufotofua, J. Weigel, J.: Social cardboard: pretotyping a social ethnodroid in the wild. In: The
Eleventh ACM/IEEE International Conference on Human Robot Interaction, pp. 531–532.
IEEE Press, Piscataway, NJ, USA (2016). Retrieved from http://dl.acm.org/citation.cfm?id=
2906831.2906962
32. White, C., Woodfield, K., Ritchie, J.: Reporting and presenting qualitative data. In: Ritchie, J.,
Lewis, J. (eds.) Qualitative Research Practice, pp. 263–286. Sage, Londen (2003)
33. Yamazaki, A., Yamazaki, K., Ohyama, T., Kobayashi, Y., & Kuno, Y.: A techno-sociological
solution for designing a museum guide robot: regarding choosing an appropriate visitor. In:
Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot
Interaction, pp. 309–316. ACM, New York, NY, USA (2012). https://doi.org/10.1145/2157689.
2157800
34. Sabelli, A.M., Kanda, T., Hagita, N.: A conversational robot in an elderly care center: an
ethnographic study. In: Proceedings of the 6th International Conference on Human-Robot
Interaction, pp. 37–44. ACM, New York, NY, USA (2011). https://doi.org/10.1145/1957656.
1957669
35. Stubbs, K., Hinds, P., Wettergreen, D.: Challenges to grounding in human–robot interaction.
In: Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction,
pp. 357–358. ACM, New York, NY, USA. https://doi.org/10.1145/1121241.1121314
36. Mutlu, B., Forlizzi, J.: Robots in organizations: The role of workflow, social, and environ-
mental factors in human-robot interaction. In: Proceedings of the 3rd ACM/IEEE International
Conference on Human Robot Interaction, pp. 287–294. ACM, New York, NY, USA (2008).
https://doi.org/10.1145/1349822.1349860
286 A. Jacobs et al.

37. Leite, I., Castellano, G., Pereira, A., Martinho, C., Paiva, A.: Modelling empathic behaviour in a
robotic game companion for children: an ethnographic study in real-world settings. In: Proceed-
ings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction,
pp. 367–374. ACM, New York, NY, USA. https://doi.org/10.1145/2157689.2157811

An Jacobs holds a PhD in Sociology and is a part-time lecturer in Qualitative Research Methods (Vrije Universiteit Brussel). She is the manager of the Program Data and Society within the imec-SMIT-VUB research group. She is a founding member of BruBotics, a collective of multiple research groups at Vrije Universiteit Brussel that together conduct research on robots. In her research, she focuses on future human-robot interaction in healthcare and production environments in various research projects.

Shirley A. Elprama has been a senior researcher at imec-SMIT-VUB since 2011. In her research, she investigates social robots, collaborative robots and exoskeletons at work, and particularly under which circumstances these different technologies are accepted by end users. In her PhD, she focuses on the acceptance of different types of robots (healthcare robots, collaborative robots, exoskeletons) in different user contexts (car factories, hospitals, nursing homes) by different users (workers, nurses, surgeons).

Charlotte I. C. Jewell is a researcher at imec-SMIT-VUB, in the Smart Health and Work unit. Her research explores human-robot interaction in healthcare. She also does research in the field of Sex and Love with Robots, where she studies people's perception of different types of relationships with robots.
Designing Evaluations: Researchers’
Insights Interview of Five Experts

Céline Jost and Brigitte Le Pévédic

Abstract The objective of this chapter is to understand the approaches experts follow to design an evaluation. It investigates the evaluation methods used in five disciplines (cognitive science, cognitive psychology, sociology, ethology, and ergonomics), in order to identify the similarities and differences between these disciplines. In our qualitative survey, we asked five experts to reason on three research questions. The experts received instructions asking them to explain how they would proceed to answer these questions. We then qualitatively analyzed the experts' explanations, which means that this chapter is the authors' interpretation and cannot be generalized, even if it points to several interesting avenues worth exploring. In their responses, we mainly observed three different systems of thought, which all led to the same conclusion, apparently a strong consensus: every research question can be answered with a comparative study.

Keywords Evaluations · Meta-evaluation · Qualitative survey · Disciplines ·


Methods · Standardization · Personal view

1 Introduction

All the previous chapters showed that the question of standardization is not trivial. In fact, each researcher seems to have his or her own definition, even perception, of what a standardization should be. This book is edited by 8 researchers who themselves do not share the same opinion about standardization. That highlights the complexity of the problem and the richness of each contribution. This final chapter is more a testimony than a classical scientific paper. It does not contain references: it is its own reference, giving the floor to the experts without censorship.

C. Jost (B)
Laboratory EA 4004 CHArt, Paris 8 University, 2 rue de la Liberté, 93526 Saint-Denis, France
e-mail: celine.jost@gmail.com
B. Le Pévédic
Laboratory UMR 6285 Lab-STICC, South Brittany University, 8 rue Michel de Montaigne,
56000 Vannes, France
e-mail: brigitte.le-pevedic@univ-ubs.fr


Our objective here is to dissect some evaluation methods in order to give a different point of view and to initiate a debate. To this end, we interviewed five researchers belonging to five different disciplines and qualitatively analyzed their answers. This approach investigates the differences and similarities between disciplines. Can we draw a common process? Is it really possible to standardize evaluation methods for HRI? Is there already an implicit standardization? To explore these questions, we proceeded as follows. First, we prepared a precise protocol describing the survey to be conducted (see Sect. 2). Then, we selected five experts, from cognitive science, cognitive psychology, sociology, ethology, and ergonomics, who agreed to take part in the survey (Sect. 3). Afterward, we analyzed the answers received (Sect. 4). Based on our observations and interpretations (Sect. 5), we decided to investigate one specific point further: the need to design a comparative study (Sect. 6). We then opened a discussion to highlight what we learned from this survey (Sect. 7) and drew conclusions from it (Sect. 8). The complete answers provided by the experts can be read at the end of this chapter (Sect. 9).

2 Survey Protocol

2.1 Context

A scientific evaluation aims at answering a research question or at confirming or refuting one or several hypotheses. It is a very delicate process in which each design decision may have unpleasant consequences. For example, a minor modification may totally change the exploration and lead to the resolution of an unforeseen problem. At worst, it may also introduce biases that completely invalidate the results.
With this in mind, we designed a survey in order to understand how each expert in evaluations proceeds to answer a research question. What are the important steps? What do we have to think about? The experts were not expected to design experimental protocols, but rather to divulge their secrets about the design process itself.
Our objective was therefore to understand their intellectual approach. Our investigation was exploratory, and it was important to avoid steering the experts towards specific answers. Thus, to obtain the best possible results, we formulated the most open-ended instructions possible, leaving the answers completely free.
For this purpose, we asked the experts to reason on three research questions and to explain how they would proceed to answer them. We provided nothing else, in order to avoid influencing them and to capture their own reflection.

2.2 Research Questions

It was quite difficult to select research questions. First of all, how many questions? A single question does not allow any generalization. Two questions cannot tip the balance in case of a tie. Three questions are therefore the minimum. And with four or more questions, experts might find the task too time-consuming and decline our request. We discussed for weeks to choose the three questions. How could we ensure that we formulated three representative questions? On which criteria? Research on the standardization of evaluation methods is in its infancy, and we do not yet know which type of question calls for which type of evaluation, nor what the different types of evaluation are. How, then, could we choose several questions leading to different kinds of evaluations? It was clear to us that it was impossible to obtain the perfect questions. We therefore took a decision which is, of course, debatable, but we had to start somewhere. We decided to focus the research questions on the relationship between humans and robots, and we tried to distinguish three levels of abstraction: a question on a very specific point between a robot and a human (bystander effect), a more ambiguous and less precise question involving several humans and robots (influence of group size), and a vague question about the relationship (without further precision) between humans and robots. The order of the questions was then rearranged as follows.
The three questions were:
1. Is the bystander/audience effect the same whether the individual present is a human
or a robot?
2. Is the relationship that humans develop with a robot the same as with an animal?
3. Group size influences the type of relationship between individuals, whether of the same
species or of different species: would it be the same between human(s) and
robot(s)?

2.3 First Step: Recruiting Experts

The first step of this survey was to contact researchers who were experts in evaluations, in order to recruit them. We decided to recruit five experts from five different disciplines, for example anthropology, ergonomics, ethnography, ethology, philosophy, psychology, sociology, and so on. Five experts from different disciplines are enough to detect a standardization if one exists. More than five experts would have been too complex to analyze, given that our study was exploratory and would have produced too much data. We contacted only researchers who had not written a chapter in the book and were not co-editors. We contacted researchers one after the other, taking care to select different disciplines, until we had five positive answers. We did not have any strong expectations about the disciplines a priori, except that they had to be different in order to obtain several points of view.
To recruit participants, we sent the following email.
Subject: Your participation to an international book about evaluations in Human-Robot
Interaction
Content:

[First name surname],

I am pleased to contact you to invite you to participate in a collective book that I am co-editing.
This book will be published by Springer in the series “Series in Bio- and Neurosystems”
(http://www.springer.com/series/10088). We are planning to publish this book in January
2020.
This book follows a series of three workshops which aimed at discussing the standardization
of evaluation methods in Human-Robot Interaction.
Here are the links to the different EMSHRI workshops websites:
https://sites.google.com/site/emshri2015/
https://sites.google.com/site/emshri2016/
https://sites.google.com/view/emshri2017/
One of the chapters is dedicated to interviewing specialists (ergonomist, psychologist, ethologist,
sociologist, ethnologist, …) in order to learn about the practices and methodologies they use
to design an evaluation that answers predefined research questions.
Following the answers obtained, we will compare the different practices and methodologies
in order to deduce common bases and note differences.
Would you be interested in being part of our survey and appearing as a [discipline] specialist in
this book? If you are interested, we will ask you to answer a few questions, through an
exchange of 2 or 3 emails between you and us between today and the end of June.

Best regards,
[Editor name]

2.4 Second Step: Interviewing Experts

As soon as we obtained a positive answer, we started the interview, which, in fact, consisted of a single email.
Subject: EMSHRI international book: survey
Content:
[First name Surname],
We thank you for your interest in our survey and are really pleased to count you as a specialist
in our book.
For this first step, we ask you the following question: “Can you explain to us the work process
that would allow you to answer each of the three research questions below?”
The length and format of your answer are completely free. You do not have to actually do the
work; we only ask you to explain your approach.
Research questions are:
1. Is the bystander/audience effect the same whether the individual present is a human or a robot?
2. Is the relationship that humans develop with a robot the same as with an animal?
3. Group size influences the type of relationship between individuals, whether of the same
species or of different species: would it be the same between human(s) and robot(s)?

Please, can you give us your answer before [today + 3 weeks]? Don’t hesitate to contact us
if you need further information.
As a result of your answer, we may need to contact you for further clarification with more
focused questions.
We really thank you for your participation.
All the best,
[Editor name]

3 The Five Experts

The five experts who collaborated in our survey are presented in Table 1.
The five experts do not represent all existing disciplines, but they offer a very interesting level of representativeness. Two disciplines come from the social sciences (cognitive psychology and sociology), one comes from biology (ethology), and the two others are interdisciplinary (Human-Technology Interaction and ergonomics). On the one hand, Human-Technology Interaction is a mix of behavioral and technical sciences, for example human-computer interaction, cognitive science, psychology, and engineering; cognitive science itself draws on linguistics, psychology, artificial intelligence, philosophy, neuroscience, and anthropology. On the other hand, ergonomics draws on anatomy and physiology, psychology, engineering, and statistics.
Moreover, the five experts practice their profession in very different domains: human factors and user experience with robots and extended reality technologies (Iina Aaltonen), ergonomics and social psychology for user experience (Sophie Lemonnier), methods, practices and communication of science and technology (Jérôme Michalon), ecology, physiology and ethology (Cédric Sueur), and technology, disabilities, interfaces and multimodality (Gérard Uzan).

Table 1  Presentation of the five experts

Name               Discipline                      Position              Institution                                    Country
Iina Aaltonen      Human-Technology Interaction    Senior scientist      VTT Technical Research Centre of Finland Ltd   Finland
Sophie Lemonnier   Cognitive psychology            Associate professor   Lorraine University                            France
Jérôme Michalon    Sociology                       Researcher            Lyon University                                France
Cédric Sueur       Ethology (Animal Behavior)      Associate professor   Strasbourg University                          France
Gérard Uzan        Ergonomics                      Researcher            Paris 8 University                             France

Thus, the five experts come from diverse backgrounds, offer a good representativeness of the diversity of evaluation methods, and bring different and complementary points of view on Human-Robot Interaction.

4 First Analysis

In this analysis, we decided to focus on the methodological approach followed by each expert and on the evaluation proposed. We extracted from their answers all elements that could belong to one of these two categories (methodological approach and proposed evaluation). This section presents only our analysis; the complete answers can be read in Sect. 9. For each question, we filled in three tables (Tables 2, 3 and 4 for Q1; Tables 5, 6 and 7 for Q2; Tables 8, 9 and 10 for Q3).
The three tables contain the following information:
• Methodological approach: these tables summarize the topics discussed about the evaluation design (Tables 2, 5 and 8). What did the experts think about when designing the evaluation?
• Generalization of the methodological approach: these tables present the general categories of questioning raised by each expert (Tables 3, 6 and 9). The labels were deduced from the answers to question 1 and are the same for each question, in order to compare the level of detail provided.
• Evaluations proposed by the experts: these tables summarize all indications given about the evaluation to be conducted (Tables 4, 7 and 10). Note that the instructions did not ask the experts to design evaluations.
In each table, the experts are labeled E1, E2, E3, E4 and E5 to save space, corresponding to the experts' names in alphabetical order: Iina Aaltonen, Sophie Lemonnier, Jérôme Michalon, Cédric Sueur, and Gérard Uzan.

4.1 Question 1

As a reminder, the first question was: “Is the bystander/audience effect the same whether
the individual present is a human or a robot?”
Table 2 shows that the experts needed more precision, as they discussed several points such as the nature of the robot, the type of participants, the evaluation methods, the context or task of the evaluation, the metrics to produce, the expected results, and the biases. As a first observation, we can notice that it is necessary to have a thorough knowledge of the topic to design an evaluation. Indeed, the experts who were unfamiliar with the audience effect first tried to understand what this concept meant. Here, we observed two main strategies: investigating the literature to find a key evaluation that already addresses the problem, or discussing it with an interdisciplinary team.

Table 2 Methodological approach for question 1


E1 What kind of robots?
Verification of the meaning of question keywords
Discussing with colleagues to design the evaluation
What about practicalities and ethics/privacy aspects of the research?
What type of participants (number, knowledge)?
Reformulation of the research question (hypothesis)
Which effects/biases?
Which HRI metrics?
Which evaluation methods?
Which context/task?
Comparison between conditions
Which study design?
E2 Identifying an exemplar experimentation in the literature
Reproducing this experimentation in our context
Comparison between conditions
What about biases?
What kind of robots?
Reformulation of the research question (hypothesis)
Which context/task?
What expected results?
E3 What is the meaning of question keywords?
What kind of robots?
What kind of participants?
Which context/task?
Reformulation of the research question (hypothesis)
Which evaluation methods?
Comparison between conditions
What expected results?
What type of participants?
E4 What expected results?
What context/task?
Comparison between conditions
What kind of robots?
Which evaluation methods?
Which HRI metrics?
E5 Which evaluation methods?
Which context/task?
What expected results?
Comparison between conditions

Table 3 Generalization of the methodological approach for question 1


E1 E2 E3 E4 E5
Familiarization with the topic/Verification of good knowledge X X
about the topic of the question
Discussion about the robot to resolve ambiguities X X X X
Discussion about participants (number, knowledge of robots, age, X X
social origin…)
Defining the context/task of evaluation X X X X X
Making hypotheses to refine/reformulate the research question in order to resolve ambiguities X X X
Looking for information about the evaluation to design in order to X X
answer the question (colleagues, literature)
Proposition to reproduce and adapt an existing evaluation X
Discussion about effects and biases X X
Defining evaluation methods X X X X
Defining metrics X X X
Discussion about the expected results X X X

Last, we observe that all the experts proposed to compare a situation with humans and a situation with robots in order to answer the research question.
As we tried to generalize the points discussed by the experts, we identified 10 categories, presented in Table 3. This table reveals two main stages of reflection: familiarization with the topic of the question and the resolution of all ambiguities. The order of the items simply follows their order of appearance in the answers we analyzed; it carries no particular meaning.
It is interesting to note that designing an evaluation is quite a subjective process, as shown in Table 4. Indeed, the five experts imagined different evaluations: studying sports performance according to the robot's capabilities, the audience effect according to the robot's level of anthropomorphism, the relationship to authority according to the social environment, or human attention according to the nature of the individual present, or designing an activity analysis protocol.

4.2 Question 2

As a reminder, the second question was: “Is the relationship that humans develop
with a robot the same as with an animal?”
Table 5 shows that the experts needed more precision than the research question provided and discussed the same topics as for question 1. It can be observed that only two discussions were new: “What kind of animals?” and “What are the measurable criteria of a relationship?”. These new concerns were directly related to the topic of the second question. The number of discussions raised by the experts was lower than for

Table 4 Evaluations proposed by experts for question 1


E1 Experiment in the field
Environment where the same people gather frequently: a sports club
Participants briefed beforehand on the robot's properties
Mixed methods: ethnography, questionnaires, interviews
Between subjects/groups
Comparison between two kinds of beliefs: intelligent robot vs less capable robot
Comparison between a human coach and a robot coach
Hypothesis: the person’s perception of the robot’s capabilities would have an effect on
their performance
E2 The objective is to set up a situation where the audience effect is measured, to modify the
context according to the research question, and to test whether the audience effect is still
measured
Reproducing an existing evaluation: signal detection task (push a button when a luminous
signal appears, 12 times per hour)
Three conditions: supervisor often comes vs robot often comes vs neither supervisor nor
robot comes
Comparison between a Nao robot and a vacuum cleaner, or between an intelligent Nao and
a less intelligent Nao
Hypothesis: the level of anthropomorphism influences the audience effect; the more
anthropomorphic a robot is, the stronger the audience effect
E3 Ethnographic observation of a classroom with a human teacher vs a robot teacher
Qualitative survey: interviews with human teachers, interviews with robots’ programmer,
and interviews with pupils
Comparison between high-income classes and low-income classes
Hypothesis: the relationship to authority (represented by a teacher) and the relationship to
technology are not the same depending on the social environment
E4 Experimentation where individuals are presented by turns in a showcase. Comparison
between an animate animaloid robot, an inanimate animaloid robot, an animal such as a dog
or a cat, and an inanimate dog or cat
Observation of attendance time about “less than” or gaze in order to evaluate humans’
attention
E5 Exploratory analysis: establishment of an activity analysis protocol about dialog,
collaboration, cooperation, assistance, negotiation, adjustment, conflict with a human
collaborator…
Experimentation in lab then in controlled ecology, using a Wizard of Oz, with the same
tasks

question 1 (39 discussions for question 1 versus 31 for question 2, i.e. 79.5% of the
question 1 count).
Last, we observe that almost all the experts proposed to compare a situation with
animals and a situation with robots in order to answer the research question.

Table 5 Methodological approach for question 2


E1 What kind of robots?
What type of participants?
Discussing with colleagues to design the evaluation
What kind of animals?
Which HRI metrics?
Comparison between conditions
Defining what are the measurable criteria of a relation
E2 What kind of evaluation?
What type of participants?
Defining what are the measurable criteria of a relation (literature)
What kind of robots?
Comparison between conditions
Reformulation of the research question (hypothesis)
E3 What kind of evaluation?
What type of participants?
What kind of animals?
Which context/task?
Which methods?
What expected results?
Defining what are the measurable criteria of a relation
Comparison between conditions
E4 What kind of robots?
What type of participants?
Which context/task?
Which methods?
Comparison between conditions
E5 Which methods?
Which context/task?
Defining what are the measurable criteria of a relation
What is the meaning of question keywords?
Discussion about the space of robots in cosmogonies

Table 6 shows that the experts had fewer discussions than for the first question. No expert mentioned the need to familiarize themselves with the topic, nor the possibility of reproducing or adapting an existing evaluation.
As shown by Table 7, it seems that it was more difficult for the experts to imagine an evaluation that could answer the research question. They gave less precision than for question 1. For example, they indicated the need to measure the relationship, or to compare “bond elements” with metrics from questionnaires such as Godspeed, without

Table 6 Generalization of the methodological approach for question 2


E1 E2 E3 E4 E5
Familiarization with the topic/Verification of good knowledge X X X X
about the topic of the question
Discussion about the robot to resolve ambiguities X X X X
Discussion about participants (number, knowledge of robots, age, X X X
social origin…)
Defining the context/task of evaluation X X
Making hypotheses to refine/reformulate the research question in order to resolve ambiguities X X
Looking for information about the evaluation to design in order to
answer the question (colleagues, literature)
Proposition to reproduce and adapt an existing evaluation X X
Discussion about effects and biases X X X
Defining evaluation methods X
Defining metrics X
Discussion about the expected results

giving an idea of how to do it. Moreover, the majority of the experts did not detail the context of the evaluation.
We can observe two main concerns for this question: measuring the relationship, and identifying an existing situation in which a situation with animals and a situation with robots can already be compared.

4.3 Question 3

As a reminder, the third question was: “Group size influences the type of relationship
between individuals, whether of the same species or of different species: would it be
the same between human(s) and robot(s)?”
Table 8 shows that the experts again needed more precision than the research question provided and discussed the same topics as for questions 1 and 2. It can be observed that no topic was new, but the number of discussions raised by the experts was lower than for question 2 (39 for Q1, 31 for Q2 and 21 for Q3; that is, 100%, 79.5% and 53.8% of the Q1 count, respectively).
Last, we observe that only two experts proposed a comparison of conditions in order to answer the research question.
Table 9 shows that the experts had fewer discussions than for the first question but almost as many as for the second question. No expert mentioned the need to familiarize themselves with the topic, nor effects and biases.

Table 7 Evaluations proposed by experts for question 2


E1 HRI metrics on animacy
Comparison between “bond elements” and metrics in questionnaire like Godspeed
Social robot with a recognition system and memory for identifying humans
E2 Exploration on the field
Need to define how to measure relationship
Need to find participants who may create a relationship with robots
What about creating a profile of an ideal robot using the literature
Comparison between participants who own a robot and participants who own an animal
Ideally over years; in practice, over a month
Either with participants who have just adopted an animal and participants who have just
received a loaned robot, or with participants who already have an animal or a robot
Semi-structured interview and questionnaire after a month
Reformulation: can an affective relationship be established with a robot, as is the case with an
animal?
E3 Comparison between two situations
Need to identify a situation where human-animal interactions have already been replaced
by human-robot interactions
Context of milking animals
Observation and interviews with robots’ builders, with breeders who have robots and with
others
The first objective is to understand how breeders qualify their relationships with animals
and robots
The second objective is to point out commonalities and differences between human-animal
and human-robot relationships
E4 Robots do not have enough empathy to replace dogs or cats; maybe an iguana or a tarantula
Medium- or long-term evaluation
Retirement home?
Study relationships that participants develop with a dog vs a robotic dog vs an invertebrate
vs a robotic invertebrate
E5 Two approaches: anthropological and socio-functional
Discussion about the context of guide dog
Need a systemic analysis
First, we need to understand the spaces of robots in our society

As shown by Table 10, it seems that it was difficult for the experts to imagine an evaluation that could answer the research question. They had fewer discussions than for questions 1 and 2. The majority of the experts decided to propose an evaluation without real robots, which appear too limited to be used to answer this research question. They proposed alternatives such as conversational agents or virtual reality, a choice that may be dictated by their disciplines or influences. The other experts opted for a different strategy, which consisted in using existing robots,

Table 8 Methodological approach for question 3

E1 Which kind of relations?
Which methods?
Which robots?
What is the meaning of question keywords?
E2 Identifying an exemplar experimentation in the literature
Reproducing this experimentation in our context
Comparison between conditions
Which context/task?
Which robots?
E3 Which robots?
Which type of participants?
Which context/task?
Comparison between conditions
Which HRI metrics?
Which methods?
Need to reformulate the research question (hypothesis)
E4 Identifying an exemplar experimentation in the literature
Reproducing this experimentation in our context
E5 Which robots?
Which methods?
Which context/task?

Table 9 Generalization of the methodological approach for question 3


E1 E2 E3 E4 E5
Familiarization with the topic/Verification of good knowledge
about the topic of the question
Discussion about the robot to resolve ambiguities X X X X
Discussion about participants (number, knowledge of robots, age, X
social origin…)
Defining the context/task of evaluation X X X
Making hypotheses to refine/reformulate the research question in order to resolve ambiguities X X
Looking for information about the evaluation to design in order to X
answer the question (colleagues, literature)
Proposition to reproduce and adapt an existing evaluation X X
Discussion about effects and biases
Defining evaluation methods X X X
Defining metrics X
Discussion about the expected results X X

Table 10 Evaluations proposed by experts for question 3


E1 Long-lasting relationships or brief situations?
Current robots are too limited
Evaluation with conversational AI Agents
Question: how well does the robot pay attention to individual group members?
Objective: understanding robot’s future role in the participant’s life (and also social
influence)
Need more knowledge on group interactions and relationships
E2 Current robots are too limited
Evaluation with virtual reality and serious game (existing evaluation)
Replacing virtual humans by robots
E3 Look for a situation where humans and robots interact together
Current robots are too limited
Vending machines (cash machine, ticket machine, automatic pay station…)
Socio-demographic data
Observations and interviews
Objective: to statistically explore links between a certain behavior, the settlement of
territory and machines distribution
E4 We can adapt two existing evaluations in our context (references)
E5 Need to classify existing robots and/or to create a categorization of robots
Systemic approach
First, we need to understand the spaces of robots in our society (communication,
collaboration, cooperation)
Evaluation with real situations (controlled ecology)

even if the condition of “social robots” was not respected (vending machines, robotic
fish or cockroach).

5 First Observations

5.1 A Progressive Decrease in Discussed Topics

In the previous section, we observed that the experts discussed less for question 3 than for question 2, and less for question 2 than for question 1 (see the “Methodological approach” tables, Tables 2, 5 and 8). The numbers of discussions were respectively 39, 31 and 21 (the sum of the observations reported in the tables for each question). Table 11 shows this decrease for each expert (E1: 34.8%, E2: 15.8%, E3: 8.33%, E4: 30.77%, E5: 8.33%; mean decrease: 19.8%).

Table 11  Differences in number of discussions per question

        Total discussions   Q2 − Q1   Q3 − Q2   Q3 − Q1   Q3 − Q1 (%)
E1      23                  -5        -3        -8        -34.80%
E2      19                  -2        -1        -3        -15.80%
E3      24                  -1        -1        -2        -8.30%
E4      13                  -1        -3        -4        -30.80%
E5      12                  +1        -2        -1        -8.30%
Total   91                  -8        -10       -18       -19.80%
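As an illustration, the figures in Table 11 can be recomputed from the per-expert counts as we read them off Tables 2, 5 and 8; the minimal Python sketch below assumes, consistently with the reported values, that the percentage column expresses the Q3 − Q1 decrease relative to each expert's total number of discussions.

```python
# Per-expert discussion counts read off Tables 2, 5 and 8 (Q1, Q2, Q3)
counts = {"E1": (12, 7, 4), "E2": (8, 6, 5), "E3": (9, 8, 7),
          "E4": (6, 5, 2), "E5": (4, 5, 3)}

for expert, (q1, q2, q3) in counts.items():
    total = q1 + q2 + q3
    # Decrease between Q3 and Q1, expressed relative to the expert's total count
    pct = 100 * (q3 - q1) / total
    print(f"{expert}: total={total}, Q2-Q1={q2 - q1}, "
          f"Q3-Q2={q3 - q2}, Q3-Q1={q3 - q1}, {pct:.1f}%")
```

Running this sketch reproduces the rows of Table 11 (up to rounding), including the totals of 23, 19, 24, 13 and 12 discussions per expert.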

Fig. 1  Number of words written in experts’ answers (y-axis: number of words, 0–1000; x-axis: questions Q1–Q3; one series per expert, E1–E5)

This decrease can also be observed in the generalized topics (see the “Generalization of the methodological approach” tables, Tables 3, 6 and 9). The numbers of discussions were respectively 31, 22 and 19 (number of crosses in the tables).
To understand whether this decrease is due to the nature of the questions or to the repetitive nature of the task, we counted the number of words used by each expert for each question. As shown by Fig. 1, the experts overall used fewer words for question 3 than for question 2, and fewer for question 2 than for question 1. We can speculate that this decrease is due to the repetitive task, since all experts except E4 show it. Note that E4 used the fewest words across all answers.
To investigate the nature of this decrease, we also decided to compare the number
of discussions and the number of words per answer. Table 12 shows both indicators for each question.

Table 12  Comparison between the number of discussions and the number of words per answer

              Discussions   Words
Question 1    31            3003
Question 2    22            1682
Question 3    19            980

Table 13  Topics of discussion ordered by number of occurrences

Discussion about the robot to resolve ambiguities                 12
Defining the context/task of evaluation                           11
Defining evaluation methods                                       10
Discussion about participants                                      7
Making hypotheses to refine/reformulate the research question      7
Discussion about the expected results                              6
Looking for information with colleagues or in the literature       5
Defining metrics                                                    5
Discussion about effects and biases                                 4
Proposition to reproduce and adapt an existing evaluation           3
Familiarization with the topic/Verification of good knowledge       2

We then computed the correlation coefficient between the “discussions” and “words” values with the corresponding Excel function. We obtained a value of 0.99, which indicates a strong correlation between the number of discussions raised by each expert and the number of words he/she wrote.
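A minimal sketch of this check is given below; it uses numpy's corrcoef as a stand-in for the Excel correlation function and takes the per-question totals from Table 12.

```python
import numpy as np

# Per-question totals from Table 12
discussions = np.array([31, 22, 19])   # generalized discussion topics for Q1, Q2, Q3
words = np.array([3003, 1682, 980])    # number of words written for Q1, Q2, Q3

# Pearson correlation coefficient (equivalent to Excel's CORREL function)
r = np.corrcoef(discussions, words)[0, 1]
print(round(r, 2))  # prints 0.99
```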
Thus, we hypothesize that the categories identified and presented in these three tables are equally important for the three questions, and that the experts simply did not repeat the same discussions three times because of the repetitive nature of the task, even though these aspects mattered for each answer.
Even if we hypothesize that all these categories are important for designing an evaluation (which is in line with Cindy Bethel's recommendations), we can still gauge their relative importance by counting the number of times they appear in the answers. Table 13 shows the discussion topics ordered by their number of occurrences. We can notice that the most important topics seem to be related to the robot, to the context of the evaluation and to the evaluation methods: which robot, in order to do what? And indeed, it is important to answer these questions before choosing the most suitable evaluation methods. This seems to indicate that the nature of the evaluation is strongly linked to the nature of the robot and to the context of the evaluation. Moreover, we observe that the type of participants is also very important. Numerous discussions in this book concern participant recruitment or the nature of the participants; they explain that, on the one hand, the choice of participants can influence or invalidate evaluation results and that, on the other hand, the nature of the participants influences the design of the evaluation, as one cannot question children in the same way as the elderly, for example.
Regarding Tables 2, 5 and 8, we counted the number of times that the same expert raised the same topic of discussion for each of the three questions, and noted some differences between the experts: 2 for E1, 3 for E2, 6 for E3, 0 for E4, and 2 for E5. We noticed

a correlation of 0.71 (computed with the Excel function) between these numbers and the total number of discussions for each expert. This may indicate the degree of importance of these topics for each expert. E1 discussed methods and robots every time; E2, robots, comparison between conditions, and searching for similar research in the literature or with colleagues; E3, methods, robots, comparison between conditions, participants, context/task, and reformulation; and E5, methods and context/task.
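The 0.71 coefficient can be checked in the same way; the sketch below assumes the repeated-topic counts map to E1 through E5 in that order (consistent with the per-expert lists above) and takes the per-expert totals from Table 11.

```python
import numpy as np

# Number of topics each expert repeated across the three questions (E1..E5)
repeated_topics = np.array([2, 3, 6, 0, 2])
# Total number of discussions per expert, from Table 11
totals = np.array([23, 19, 24, 13, 12])

r = np.corrcoef(repeated_topics, totals)[0, 1]
print(round(r, 2))  # prints 0.71, matching the value reported above
```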

5.2 An Interesting Difference Between Experts

The observation of the evaluations proposed by the experts (Tables 4, 7 and 10) highlights some differences that require further investigation. Table 14 summarizes the main ideas, which are discussed below.
Regarding question 1, experts 1, 2 and 3 took exactly the same approach. They proposed an evaluation based on a comparison between a human and a robot having the same role in the same situation (sports coach for E1, supervisor of soldiers for E2, teacher for E3). Note that E2 proposed to adapt an existing evaluation. The three experts formulated a new hypothesis, different from the research question: performance according to the robot's capabilities for E1, audience effect according to the level of anthropomorphism for E2, and relationship to authority according to the social environment for E3. E4 proposed a more complex comparison, with 4 conditions (robotic vs real, inanimate vs animate), in order to measure the level of human attention. Unlike E1, E2 and E3, this expert did not formulate a new hypothesis. Last, E5 took a totally different approach: he placed his reflection at a higher level of reasoning and did not try to answer the question directly, but proposed an evaluation to understand the general problem that includes the research question.

Table 14  Differences in evaluations proposed by experts

Q1  E1–E3: comparison between a situation with a human and a situation with a robot
    (sport coach for E1, supervisor for E2, teacher for E3); hypothesis different from the
    research question
    E4: no role; no hypothesis
    E5: high-level reasoning to study more than the question
Q2  E1–E3: measuring the relationship (no evaluation design for E1, observation at home
    for E2, observation at work for E3)
    E4: comparison between 4 conditions
    E5: high-level reasoning to study more than the question
Q3  E1–E3: robots are too limited to design an evaluation; proposition of alternative
    solutions (conversational agents for E1, virtual reality for E2, vending machines for E3)
    E4: adapting existing evaluations
    E5: high-level reasoning to study more than the question

Regarding question 2, E1, E2 and E3 again took the same approach. They focused their reflection on the measurement of the relationship on the one hand, and discussed the impossibility of designing this evaluation with current robots, which are not intelligent enough, on the other hand. Nevertheless, E2 proposed to design an evaluation comparing people who adopted a pedagogic robot with people who adopted an animal, and E3 proposed to compare with a situation where the human is partly replaced by an industrial robot. In the latter case, however, instead of comparing human-animal with human-robot interaction, his approach measures the difference between human-animal and human-robot-animal interaction, which is not exactly the research question. As for E4, he did not discuss how to measure the relationship. He also noted that existing robots are not social enough to design an ideal evaluation and proposed an evaluation with existing robots and 4 conditions (robotic vs real, dog vs invertebrate). And once again, E5 placed his reflection at a high level of reasoning and discussed an evaluation to understand the place of robots in our society, which goes beyond the scope of the research question.
Regarding question 3, we observe the same differences between the five experts: E1, E2 and E3 followed the same approach, while E4 directly proposed an evaluation without discussion, and E5 placed his reflection at a high level of reasoning. For this question, E1, E2 and E3 emphasized that existing robots are too limited to design this evaluation and proposed an alternative solution (conversational agents for E1, a serious game in virtual reality for E2, and vending machines for E3). E4 did not discuss the limits of robots and proposed to adapt two existing evaluations of collective animal behavior, one with a robotic fish and one with a robot for cockroaches. Last, E5 proposed to classify robots and to understand their place in our society.
These substantial differences between the five experts are well worth discussing. Are they due to the experts' disciplines, influences, history, experience…?
Moreover, these differences are also highlighted by the number of discussions about the methodological approach per expert. Figure 2 summarizes these numbers, and it can be observed that two groups emerge with approximately the same number of discussions (around 20 for E1, E2 and E3, and around 7 for E4 and E5).

Fig. 2  Number of discussions about the methodological approach by expert (y-axis: number of discussions, 0–25; x-axis: Question 1, Question 2, Question 3, Total; one series per expert, E1–E5)



Consequently, this raises the question of the reason for these differences. It is not unreasonable to point to the discipline itself, for the following reasons. On the one hand, E5 is an ergonomist and, because of his discipline, takes a more general view of the proposed problems. On the other hand, E4 is an ethologist and a specialist in observing the behavior of animals (including humans).
This raises a very interesting question, as ethologists are experts in studying interactions under natural conditions. Have researchers in ethology already investigated, for animals, the questions that are currently emerging and that will emerge for robots? One can observe that the answers of E4 contain neither hesitation nor subsidiary questions, that he always adapted his answers to things as they are, and that he always gave the impression of knowing how to answer the three questions. And, to the best of our knowledge, E4 does not conduct research on Human-Robot Interaction. But if robots can be considered a new species to study, like an animal, an ethologist does not need to be a specialist in Human-Robot Interaction to study them.
It may also be a coincidence due to our research questions, or to another factor that remains to be determined (for example, the fact that the robot is built by humans). The question is open.

5.3 General Observations

5.3.1 Different Processes

Despite the apparent similarity of the approaches deduced from the answers of E1, E2 and E3, we think that each expert may, in fact, follow his or her own distinct process. For example, E1 seems to pay special attention to the interdisciplinary aspect of HRI and seems to need to discuss with colleagues from other disciplines in order to fully master the problem. E2 seems to systematically refer to the literature, either to adapt an existing evaluation or to find elements of an answer. E3 seems to look for real situations that fit the problem, in order to observe and analyze them. E4 seems to answer the question directly, without discussion or explanation, as if it were obvious. Finally, E5 seems to emphasize the studies that would have to be conducted in order to be able to answer the question; most of the time, he seems to propose carrying out a needs analysis.

5.3.2 Importance of Comparison

Except for E5, all the experts proposed to design evaluations based on a comparison of conditions. In all cases, each condition is exactly the same situation, either with a human or with a robot. Only E4 proposed some comparisons with animals, maybe because robots are not advanced enough to be compared with humans but can more easily be compared with animals, or because it is a reflex for him, being used to studying non-human animals. Even E3, who started the discussion intending to base evaluations on observations and interviews, indicated that he was forced to design a comparative evaluation: “here again, it is a comparative question, which leads to looking for two different contexts, or a particular context, to which we would add an additional dimension”.

5.3.3 Need for Precision

We noticed that a single research question can inspire five totally different answers, sometimes even beyond what we had imagined given our own background and experience. What emerges from the experts' approach is the need to make the research question precise until no ambiguity remains. This led the experts to learn about the question's topic, to refer to existing studies, to make hypotheses on unknown elements in order to answer the question, or to reformulate the research question. We particularly noticed the importance of indicating which type of robot is involved, who the target audience is, and what task both robots and humans have to perform.
Without maximal precision, any remaining ambiguity may force the designer of the evaluation to formulate hypotheses or make decisions that distort the question to be studied. The question, hypotheses and context have to be described with a high level of precision. Ideal instructions given to five different people should lead to five identical, or quasi-identical, evaluations.

5.3.4 Choice of Evaluation Methods

Evaluation methods, as defined in “Conducting Studies in Human-Robot Interaction”, belong to a finite set: self-assessments (e.g. questionnaires), interviews, psychophysiology, observations (e.g. ethnography), and task performance. However, in spite of this known finite set, no single method emerged from our survey. Each expert looked for the most suitable method for each question according to the nature of the question, of the robot, of the target audience, and of the tasks to be performed. The chosen evaluation methods would differ depending on whether the robot is anthropomorphic or industrial, endowed with smart capabilities or a mere automaton, interacting with children or with the elderly, and whether the interaction is contactless or requires physical collaboration, and so on.

5.3.5 Complexity of Relationships

For all experts except E4, it seemed impossible to measure relationships with a holistic approach, and they investigated the questions with a reductionist approach instead, without entering the “holism vs reductionism” debate, which is not the point of this book. In any case, it seems that the experts who used a reductionist-like approach (E1, E2, E3, E5) had to make hypotheses and to reformulate the research question as a subset of the original research question, or to decompose the problem into a set of criteria, functions, categories… That is why we consider it a reductionist approach. E4 always proposed an evaluation intended to answer the whole question, which is why we consider his a holistic approach.
Our objective in this discussion is not to determine who is right or wrong, or which evaluation is relevant or not. But, as observers, we notice that the main obstacle to designing evaluations that answer the three research questions, and more generally evaluations for Human-Robot Interaction, is understanding how to measure an interaction or a relationship. Note that we define a relationship as a series of interactions over time, as defined in “Communication between humans: towards an interdisciplinary model of intercomprehension”.

6 Further Investigations About Comparisons

At this point in our analysis, we were rather convinced that the problem of standardization was like an iceberg and that we were only discovering its visible part. The question of standardization appeared to us as a set of free electrons: broad and disordered. Where should we begin?
In this general confusion, we decided to focus on similarities, and the only similarity that stood out at that moment was the comparison. We therefore decided to investigate comparisons further, especially because E3 seemed to explain that the formulation of our research questions had forced him to propose a comparative evaluation instead of a survey. Was that an unintended bias that we had to correct? To evaluate human-robot interaction, must we rely on comparison only, or did all the experts propose comparative evaluations because of the nature of the three research questions?
We sent the following email to delve deeper into the question:
Subject: [EMSHRI interview] - one last question
Content:
Dear all,
We are coming back to you concerning your answer to our survey. We thank you for your
answers which are very interesting.
You can find the summary table containing your answers at the end of the mail.
It is very interesting to note that all of you have followed a general approach which is:

• all experts have made comparisons between literature or studies they conducted.
• all experts tried to reformulate questions to define a precise context (which robot, which
task, which criteria to be measured, …)
• all experts determined some comparison criteria to answer each question.
• all experts looked for the methodologies the most suitable for each question.

We have a question following these observations. We note that our three questions led the
experts to think about evaluations in which they proposed to compare several situations. We
therefore wonder whether all possible research questions in our context lead to evaluations
with a comparison, or whether this is a consequence of the questions we chose.
Taking into account that the robots we want to study are anthropomorphic (not necessarily
humanoid, but at least classified in the category of companion robots) and that they all interact
with a human (the task being a pretext to put a robot in interaction with a human in order to
evaluate this interaction), is there a category of research questions that leads to something
other than a comparative study?
Does such a research question come spontaneously to mind, or would it require further
reflection?
Thanks again for your help.
All the best,
[Editor name]

Thus, our email contained, in fact, two questions: is there a category of research questions that leads to something other than a comparative study? Does such a research question come spontaneously to mind, or does it require further reflection?
A summary of the answers is presented per expert in Table 15; the complete answers can be read in Sect. 9.4.
These answers are wonderful lessons. Two opposite points of view emerge. On the one hand, we can consider that there are in fact two possibilities for HRI evaluations: observation and experimental comparison. On the other hand, we can consider that there is in fact only one possibility: experimental comparison.
In the former case, the experts highlight the fact that the formulation of the three research questions implies designing a comparative study, and they put forward observation as an alternative. E1 gives an example of a new formulation of question 1 which leads to an evaluation based on observation: “What characteristics of the bystander effect can be observed with robots?” instead of “Is the bystander/audience effect the same whether the individual present is a human or a robot?”. But these two questions do not have the same objective, nor the same prerequisites. The first

Table 15 Answers given to the final question


E1 Comparison because of the formulation of the questions we chose
Question that leads to observations: What characteristics of the bystander effect can be
observed with robots?
E2 We always need to compare at least two conditions
Need to totally master the topic to think of other research questions
E3 Comparison because of the formulation of the questions
There is also a descriptive approach (ethnology, anthropology)
But, there is always a comparison (even implicit)
Conversation analysis may be interesting to study
E4 Experimental comparison is always better to reproduce evaluations and confirm
hypotheses, which is the base of scientific rigor
E5 Possible to use observations (direct analysis)

question focuses on the bystander effect and supposes that one is capable of listing all the characteristics of the bystander effect and of measuring each of them. The second question focuses on the outcome of the interaction itself, whether or not we understand exactly what a bystander effect is. In that question, we are rather interested in the consequence of this effect on the human being. Thus, if we want to favor an approach based on observation, it may be less restrictive to use a more general question leading to a qualitative analysis, as described in “Evaluating Human-Robot Interaction with ethnography”, or to an observation based on a behavior repertoire, as described in “Evaluating Human-Robot Interaction with ethology”. In both cases, we would learn a lot about humans (reactions, emotions, mood, attitudes…) and would be able to measure the consequences of the situation, here represented by “a bystander effect”. But all of that depends on one thing: what exactly are we looking for?
In the latter case, the experts support only comparison (E2, E3, E4). E4 explains
that comparison “is always better because it allows replicating the evaluation in order
to confirm again and again our hypotheses, which is the base of scientific rigor.”
And E3 gives another explanation which opens the door to an important debate. He
explains that “there is a kind of implicit comparison in the approach of the disciplines
which have descriptions at the very heart of their epistemology. It is implicit in the
sense that it supposed that the investigator knows herself/himself, knows her/his
‘culture’, her/his ‘social context’ and that she/he will establish differences between
these elements that she/he thinks to know about herself/himself and what she/he
discovers on ‘others’.”
Thus, do we want to implicitly compare the robot with the human in all HRI
evaluations? Do we want to compare the impact of robots on people with the impact
of people on people? The question is open.

7 Discussion

7.1 Wording of the Questions

The five experts revealed all the ambiguities contained in the three research questions. We understand that our questions were not sufficient on their own and that they should have been accompanied by a complete explanation.
But we actually think that this problem of ambiguity exists for all research questions in our context: "a human being and a social robot are interacting with each other."
Let us look at a concrete example with the following five questions:
1. Is the Nao robot efficient in helping children review their lessons?
2. Does the Pepper robot have a place in entertaining the elderly in retirement homes?
3. Is the Pepper robot relevant as an advertising reception robot in the hall of a mall?
4. In which cases can the robot be an efficient/useful/interesting coach?
5. Is it acceptable to use a robot to guide a child in her/his learning?
About question 1, what does "efficient" exactly mean? How do we measure the efficiency of helping children review their lessons? What kind of lessons? Are the children concerned struggling at school?
About question 2, what exactly is "having a place"? A place in activity time? A place in the heart? A place in the mind? What kind of entertainment, and to do what? What kind of seniors? People are not defined by their age: how do we deal with those who are reluctant to use technology? How do we take their personal history into account?
About question 3, what is "relevant" for a robot? How can we measure the effects of advertising delivered by a robot on customers?
About question 4, what does it mean for a coach to be efficient, useful or interesting? How can we know that the robot is useful? Interesting according to what? And what kind of coach? Coaching a person who wants to develop her/his muscles does not require the same strategies as coaching a person who wants to change her/his life after a burnout, for example.
And about question 5, what does "acceptable" exactly mean? How can we measure this acceptability? And acceptability according to what? What kind of robot? What kind of guidance? What kind of children (age, level, social context)? And what learning task?
Let us be clear: these are only a few examples. This may look tedious, but it is not. The five research questions are indeed too vague to allow anybody to design an evaluation. The words have to be chosen with high precision, without any ambiguity.

7.2 Comparison vs Observation?

This is certainly one of the most important contributions of this survey. Two types of evaluation have emerged as well suited to HRI: comparative studies and observations. But neither one is preferable, as they are not comparable. Observation produces data; it is a measuring tool. Comparison is a method to analyze and interpret these data, to acquire knowledge. Comparisons are useful when one has an idea of the expected results, when one is looking for something precise. But, as E3 explained, we are always looking for something precise, even if it is implicit. And all experts agreed that we always need to compare at least two conditions.
Giving meaning to the data produced by observation requires comparing these data with something. The example given by E1 is a good illustration. The initial research question leads to a comparative study: "Is bystander/audience effect the same whether the individual present is a human or a robot?". The objective of this question is to understand the effect of robots on humans. The alternative question, proposed by E1, leads to an observation: "What characteristics of the bystander effect can be observed with robots?". First of all, designing this evaluation requires us to be able to list all the characteristics of the bystander effect and to know how to measure them. After obtaining data about these characteristics, we will have to compare them with reference values supposed to be representative of the bystander effect. Thus it is also a comparison. And what about a research question that does not require any prerequisites, such as: "What do we learn about the human in this situation?". In this case, the data produced are about the human and, as E3 explained, we will implicitly compare them with what we think we know about ourselves.
But we would like to emphasize once again that these three research questions have different objectives and do not give the same results. Thus, the question we have to investigate now is: what exactly do we want to measure? What do we want standardization for?
The question is open.
In any case, there seems to be a consensus that an evaluation should always compare several conditions.

7.3 Toward a Common Vocabulary

The objective of this survey was to understand the experts' approach to designing an evaluation. As it was implicit for us that we were referring to the methodological approach, we are no longer sure that everyone shares our definition. Indeed, we drew up a list of the vocabulary used by the experts to describe their approaches, which can be grouped into six categories:
1. Nature of the evaluation: survey, experimentation, exploration
2. Location of the evaluation: in the field, in the lab, in controlled ecology
3. Nature of the analysis: qualitative, quantitative
4. Measuring tools: observation, questionnaire, psychophysiology, functional analysis, socio-functional analysis
5. Disciplines: anthropology, sociology of techniques, ethnography, social psychology
6. Indicators or metrics, which are generated by the evaluation and are currently discussed as the basis of standardization
This is only an example to illustrate our idea. But when we work in a multidisciplinary context, we can notice that a word can have several definitions and that the same definition can be related to several words, which themselves have different definitions in other disciplines. That makes communication complicated.
But if HRI is destined to become a transdiscipline, as defended in "Testing for 'Anthropomorphisation' – A Case for Mixed Methods in HRI", it will be essential to agree on a common vocabulary, which will be the starting point of standardization.

7.4 Ethology: A Promising Discipline?

From our observations within the context of the survey, we are questioning whether ethology might be the discipline best adapted to evaluating Human-Robot Interaction. Indeed, our ethologist expert always answered our questions quickly. He is a specialist in animal behavior (including human behavior). He always adapted to things as they are (robots, situations, context…). And he always gave the impression of knowing how to answer each question without hesitation. In the previous discussion (Sect. 5.2), we wondered: "Have the researchers in ethology already investigated for animals the questions which are currently emerging and which will emerge for robots?"
But maybe E4 simply has a concise writing style. Thus, to understand whether these concise answers are due to ethology or to our expert, we tried to find clues in the number of words each expert used in their answers. Figure 3 compares, for each expert, the mean number of words used to answer the three research questions with the number of words used to answer the final question (Q4 in the figure). Q4 is a different type of answer, as it invited the experts to give their opinion, and we hoped to be able to deduce from it the writing style of our experts. We can notice that all experts used fewer words for Q4 than for each of the other answers, which is logical because Q4 required less explanation. We can also observe that E4 gave a short answer in all cases and that there is no contradiction between the three research questions and Q4. This suggests that E4 has a concise writing style, and we cannot conclude anything about ethology. Further investigation is therefore required to confirm or refute our intuition that ethology could naturally be adapted to HRI.
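As an aside, the word count behind Fig. 3 can be reproduced with a short script such as the minimal sketch below, which simply splits each transcribed answer on whitespace. The expert labels and the answer texts used here are placeholders, not the actual interview data.

# Minimal sketch of the word-count comparison shown in Fig. 3 (Python).
# The answer texts below are placeholders, not the actual interview data.
answers = {
    # expert label -> answer text per question
    "E1": {"Q1": "...", "Q2": "...", "Q3": "...", "Q4": "..."},
    "E2": {"Q1": "...", "Q2": "...", "Q3": "...", "Q4": "..."},
    # E3, E4 and E5 would be filled in the same way
}

def word_count(text):
    # Rough word count: number of whitespace-separated tokens.
    return len(text.split())

for expert, per_question in answers.items():
    counts_q1_q3 = [word_count(per_question[q]) for q in ("Q1", "Q2", "Q3")]
    mean_q1_q3 = sum(counts_q1_q3) / len(counts_q1_q3)
    q4 = word_count(per_question["Q4"])
    print(expert, "mean(Q1-Q3):", round(mean_q1_q3), "words; Q4:", q4, "words")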
The main argument we can put forward now is that ethology combines the advantage of the disciplines that qualitatively observe real, existing situations (sociology, ethnography, ethnology…) with expertise in studying animal behaviors, relationships and interactions (including human ones).
But this requires further investigation. We should interview experts from all the existing disciplines to be able to decide whether one of them fits our problem.

Fig. 3 Comparison between the number of words used in the answers of the experts (x-axis: experts E1–E5; y-axis: number of words, 0–700; series: Q4 vs. the mean of Q1–Q3)

8 Conclusion

The objective of this chapter was to investigate the evaluation methods of several disciplines in order to understand the differences and similarities between them, with the idea of taking a first step towards the standardization of evaluation methods for Human-Robot Interaction. To this end, we interviewed five experts belonging to five different disciplines: cognitive science, cognitive psychology, sociology, ethology, and ergonomics. We asked them to explain how they usually proceed to design evaluations.
This chapter was written after all the other ones. Thus, at the time of writing, we had a complete vision of the content of the book and of the thinking of all the contributors.
Last, it is important to understand that this chapter reflects the interpretation of its two authors and to consider it for what it is: a testimony. First, our qualitative analysis of the experts' explanations is inevitably subjective. Second, this chapter is based on a limited sample of researchers and definitely cannot represent the views of the whole community.
The first recommendation deduced from this work is that each discipline seems to have its own course of action, its own beliefs and its own vocabulary. Let us recall that a word can have several different definitions, sometimes even contradictory ones. It is really important to take this point into account in the reflection on creating a standardization, where everyone must adopt the same "culture".
Regarding the lessons learned on the formalization of standardization, we highlighted three essential steps. It seems obvious that the first step toward standardization is a common vocabulary, or even a common vision. The second step seems to be the need to know how to measure relationships; it is actually the starting point, even the basis, of every evaluation. For this step, it may be interesting to take a look at ethology, which has been studying interactions and relations for a long time. The third step toward standardization is to understand what we want to measure. What exactly do we want to standardize? What common vision do we want to adopt as the foundation of our discipline, or even transdiscipline?
Regarding the lessons learned in designing an evaluation, we highlighted two important points. The first point is the need for precision. We are working in a multidisciplinary, soon transdisciplinary, environment where we do not all interpret words in the same manner. Thus, in order to design an evaluation, it is important to detail the research question so as to remove all ambiguities: which topic, which robot, which target audience, which tasks to perform, which context, which objective, which biases, which metrics to produce, which expected results. The second point is the main contribution of this work: the comparison. All the research questions can be answered with a comparative study. To acquire new knowledge on a topic, it seems necessary to compare it with current knowledge.

To conclude, this survey showed that the question of standardization is very complex and still in its infancy. We were looking for an answer and we are realizing that we have barely begun to formalize the question.
This work opens numerous perspectives and requires further investigation. For example, what would the results be if we had interviewed five other experts belonging to the same disciplines? And what would they be if we had interviewed five experts from other disciplines? We are aware that this chapter is only a point of view, a testimony. But it opens up really interesting debates and reflections. This chapter opens a door.

9 Complete Answers

9.1 Question 1

9.1.1 Iina Aaltonen

General Comments

My concern when approaching the research questions is the lack of knowledge on the type of robots (and their level of intellect) we are talking about. Considering
the questions, I assume that the robots in question would be social, humanoid(ish)
robots, with eyes (cameras) and ears (microphones), possibly “skin” (some pressure
sensors on the outer surface), feet or wheels to enable mobility. Especially regarding
Q1, the possibilities of the robot to be connected online to transmit information (cf.
a phone call or a social media post a human could make) would have an effect.
Regarding my approaches in general, in reality I would not design any stud-
ies alone, but rather gather a group of my colleagues to discuss the best ways to
approach the questions. In addition to discussing the questions, I would also want to
discuss the practicalities and the ethics/privacy aspects of the research. The number
of participants needed would also need to be discussed.
Q2 and Q3 fall out of my comfort zone, because I feel those questions are more
philosophical than Q1, and I haven’t done research on relationships and emotional
aspects.

Answer

My approach would be to do an experiment in the field (another option would be to show videos of staged situations, but I think that does not capture the true intentions
of humans as well). It would be easiest to control for variables in an environment
where the same people gather frequently, such as a sports club, and where people
could be prepped beforehand on the properties of the robots they might encounter.

1. As preparatory work, I checked if I remembered correctly the meaning of the bystander and audience effects (very unscientific approach—I used Wikipedia).
Regarding the bystander effect and considering the current autonomy and object
manipulation capability of robots, I would exclude emergency situations that
would demand physical manipulation (e.g., stopping a human bleeding) from
my scenarios/analysis. For the audience effect, my first thought was that the
person’s perception of the robot’s capabilities (videoing, online connectivity,
robot’s ability to relay to others what it observed) would have an effect on their
performance. The audience effect would probably be easier to test experimentally
because subjects could be recruited more concisely.
2. At this stage, I would rethink what it is we really want to find out. I would
look into the literature on the effects, and also consider what HRI metrics would
be applicable to this case. Most probably, a mixed methods approach would be
selected, with a lot of emphasis on ethnography, and some on questionnaires and
interviews.
3. Considering the above line of thinking, I would find it important to control the
knowledge of the robot the participants have about the robot. This could be
done by making the experiment in two different places (sports clubs in different
cities). For example, in club A, the people would be told the robots are sort of like
combinations of personal trainers and security guards, the robots are connected to
other robots and the sports club’s audio, reservation and surveillance systems, and
they are equipped with cameras with motion detection and enough intelligence to
deduce what a person is doing and make an alert on abnormal behaviour. Those
robots are able to speak English. In club B, the people would be told the robots
are “less capable”, for example, they lack some of the qualities listed for club A.
(between subjects/groups design)
4. Another way of approaching Q1 would be to vary the way the robot acts (within
subjects design). On different days, the robot could be more interactive (e.g.,
making beeps, sounds, move about, speak aloud what is happening) or passive.
I would probably use at least 3 levels: passive observant (move from A to B,
move head/camera), semi-interactive (move with clear intentions, speak in a
monotone voice, make eye contact), highly interactive (move about repeatedly
[cf. Kobayashi and Yamada, 2005, Nakata et al., 1998], cheering/varying volume
and intonation of speech, engage with people).
5. Depending on the points made in 2–4, I would do experiments with human-
human situations first to get a baseline, and then decide what kind of a robot
behaviour we need and do the final touches there. Especially point 2 is important,
because the analysis of (video)ethnography is slow and reaching a suitable n is
difficult, and therefore we should avoid collecting too much data and varying too
many parameters. Regardless of the details of the experiments, I would aim at
comparing the human-human situations with the robot-present situations.

9.1.2 Sophie Lemonnier

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
To test an effect which is well-known for the literature, that is the audience effect,
in a different situation where the human has been replaced by a robot, the first thing
to do is to identify the standard experience that is used in the literature to highlight
this effect. Indeed, the objective is to get oneself into a situation where one knows
that one should observe the audience effect, to modify what is related to the research
question, and to test if one can observe this effect again.
The most cited experiment is the one conducted by Bergum and Lehr (1963). A
researcher with the audience effect as the object of study, and who is, therefore, more
aware of the literature on this subject than I am, will probably have other more recent
studies to base the conception of her/his own experience. For our part, we will use
this study that is classic by now.
This study consisted to ask participants (to be precise soldiers) to perform a tedious
signal detection task, that is, participants simply had to press a button when a light
signal appeared, knowing that was only occurring 12 times during the time of the
experiment, which was 1 h. The participants were divided into two groups: one in
which their direct supervisor was sometimes coming (the group with the audience),
and one in which their direct supervisor was never coming (the group without an
audience). The time of the experiment was divided into 5 periods, and the results show
a decrease of the correct detections in the course of the periods in both conditions,
with a more important decrease in the group without an audience, which highlights
the audience effect.
Therefore, we could reproduce this study by simply replacing the direct supervisor
by a robot, and notice if the results produced are the same as those presented above.
However, two problems arise:
• We cannot observe our results and conclude without statistical analysis whether
we obtained the same result. If we observe a difference between the condition
with the robot and the condition without an audience, we will have an effect, but
we won’t be able to conclude that this effect is the same as the effect observed
with a human. Thus, we will also need to add an audience condition with a human
to be able to compare it to the audience condition with the robot. But we cannot
pose a hypothesis of equality. If results seem to show that there is no significant
difference between the audience condition with the human and with the robot,
it will be difficult to conclude that the effect is then the same, because we do
not know what we have manipulated by replacing the human by the robot, and a
lack of difference can always be explained by a too important experimental noise,
masking a potential difference. Furthermore, if the results seem to show that there
is a difference, we could also hardly conclude, but simply indicate that the effect
is not the same; the interpretation of our results is limited by the fact that when
we replaced the human by a robot, in the same way, we did not really have a
hypothesis about what we were actually changing.

Fig. 4 Predicted percentage of correct detections according to time period and four audience conditions (with human audience, without audience, with very anthropomorphic robot audience, with little anthropomorphic robot audience; y-axis: correct detection rate (%), x-axis: periods)

• The second problem we face can be found in what we mean by “robot”. This term
covers a very wide range of objects ranging from food processors or lambda termi-
nals to humanoid robots equipped with a very developed AI. We could postulate in
a first study that the level of robots’ general anthropomorphism impacts the audi-
ence effect, and therefore the more anthropomorphic a robot is, the more important
the audience effect will be. This would solve the first problem by proposing an
additional statistical comparison between a low anthropomorphic robot audience
group and a very anthropomorphic robot audience group. We could then make the
predictions illustrated in Fig. 4; the curves with the human audience and without
an audience representing the results obtained by Bergum and Lehr (1963).
It would then remain to decide how the level of anthropomorphism can be modu-
lated. For example, we can change the appearance, but experimental limits may arise
because there is not necessarily an available humanoid robot. We could imagine mak-
ing a vacuum robot versus a Nao robot (possessed by a number of laboratories) pass
in the experimental room. We could also modify the robot at the behavioral level. For
example, we could use a Nao-type robot under both conditions: in a condition the
robot could say sentences implying undeveloped intellectual abilities (e.g. "I have to go around the room."), and in the other condition, it
could say sentences closer to a human (e.g. “I hope the task is not too difficult, I’ll
come back later.”). To decide, it would be necessary to read the appropriate literature
and to be based on explanatory theories of the audience effect and then choose.
This study would probably pave the way for many others, and would, in addition
to providing a beginning answer to the question asked, to better understand the
audience effect (in the case we chose to explore: how much it depends on the level of
anthropomorphism), and therefore to better define the type of robot that would allow
us to approach an audience effect close to that observed with a human audience.

9.1.3 Jérôme Michalon

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
My first reflex was to find out about “audience effect”, which was a concept
I did not know given that I’m not a social psychologist. Once done, I wondered
how I could answer this research question. It seems to me that the question is not
precise enough: what robots are we talking about? Are we talking about automated
visual devices (CCTV cameras)? Or anthropomorphic automata? And if so, to what
degree of physical resemblance to humans are we dealing? What are these robots
doing? Since it’s about audience effect, do these robots settle for listening (recording
sounds)? To watch (to record pictures)? To transmit images? Sounds? To whom?
In the same way: what are individuals? Men? Women? From which social origin?
What age? What level of income? Finally, what situation are we talking about? We
talk about co-presence between an individual and a robot, but in which context?
Experimental context? If so, what are they doing together? What is the instruction
that has been given?
I am not criticizing the framework and its imprecision, but on the contrary, I am
responding to the exercise: the first reflex of a sociologist, I think, is to wonder about
the question which is asked. To rephrase a question in terms that can lead to a survey
is the first step for any sociological work. Furthermore, I have just used the term
“survey”: it is also a reflex of sociologist, I think, to favor the survey (that is to say
the study of the behaviors and the humans speeches in an existing context) to an
experimental approach. As a consequence, one cannot skip the preliminary definition (even a minimal one) of the situation that we will seek to observe and of the people who populate it.
That being said, how could I answer this question as a sociologist? Since the
ideas of “surveillance” and “social control” seem to be at the heart of the concept of
audience effect, I would tend to look for situations in which we can observe humans
“at work”, under the eye of other humans, and other situations, comparable, where
the eye of humans is replaced by “robots” one. I would tend to favor situations
where the observer (human or robot) is in an explicitly hierarchical position. In
this respect, educational situations seem to me to be very appropriate. Especially
that experiments of “virtual teacher” (“robots” being in classrooms, equipped with a
screen, supposed to occupy the role of the human teacher) have been tried for several
years. (N.B.: you can see that despite an initial orientation towards the survey, the
question asked, and the concept behind it, tend to make me look for contexts that
are closer to experimentation and that allow comparison). First, I would, therefore,
seek to observe a class with a human teacher, and a class with a “robot” teacher,
to understand how the audience effect manifests under these both situations. The
observation that I would lead would be clearly ethnographic: no video recording,
accepting the position of an observer. This would aim to very summarily characterize
differences in terms of sound environment, and of physical behaviors, in these two
classes, to identify the moments where the silence imposes itself, where the attention
seems more floating, etc. To also identify the spatial distribution of these behaviors
in the classroom, while paying attention to the teacher’s eye on the class (Where
does he/she look at? Where does not he/she look at? When?). Afterward, I would
reproduce the exercise, by selecting more precisely my classes, according to ages,
and especially to social background. The idea would be to compare “low-income”
and “high-income” classes, assuming that the report to authority (represented by the
teacher) and the report to technology are not the same in these social backgrounds.
Once these observations made, I would pass onto a qualitative survey: interviews
with the human teachers, and interviews with the designers of the “robot” teachers,
and interviews with the pupils. The objective would be to understand how they think
of teaching tasks in general, the place of the teacher in the classroom, the role of
educator and supervisor, etc., and especially how they experienced the difference
between the human teacher and the “robot” teacher. Each interview would include
a section dedicated to socio-demographic data collection (sex, parental occupation,
places of residence, household income), with the idea of associating representations
of human and robot teaching, and these social characteristics.

9.1.4 Cédric Sueur

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
I do not think so, there must be an attraction to the new technology on one side,
but the lack of facial signals and of spontaneous reactions in the robot must not lead
to the same attention as in a human.
I would make an experiment here where I would present: 1. an animated animaloid
robot, 2. an inanimate animaloid robot, 3. an animal such as a dog or a cat, 4. an
inanimate animal such as dog or cat. I would present in turns these animals in the
window of a store for example and would note by the time of presence “less than”
or the gaze, the attention of the humans.

9.1.5 Gérard Uzan

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
To analyze this effect and answer the question, I would develop 3 methodological
approaches:
1. The implementation of an activity analysis protocol about dialogue-
collaboration-cooperation-assistance-negotiation-adjustement/conflict of a
human collaborator in a set of tasks to be carried out in pairs. Different
categories of tasks are considered depending on the degree of autonomy of the character whose analysis focuses on. The analysis is about the duo but
with a focus on the human character who collaborates with a main charac-
ter: cooperation without collaboration, cognitive and manual cooperation,
enslavement-control-feedback, delegation by step, groups of steps or total, tasks
which require dialogues in a more or less operational or natural language or
applied to data with closed and predictable contents/formats, or open contents,
tasks followed by adaptive procedures, or which require production of heuristic.
Every task models such as mad* and models of perceptual-cognitive resources,
competences or tools can be used. Exploratory type analysis consists in extract-
ing knowledge about the conditions, resources or skills that can facilitate or
defeat the pair’s homeostasis until the expectation of a goal or the achievement
of the expected result by the accomplishment of the tasks.
2. What are the differences (in nature and in measurement) between the behavior of
Human-Machine Interactions and the behavior of Human-Human Interactions?
Hypotheses can easily be formulated on these aspects: for example, we exper-
imented more than 25 years ago that, orally, it is better, if we want to seem
more “natural human”, do not systematically say “OK” or “all right” but to
randomly choose confirmation feedback among at least 7 different items. The
methodological approach may be “laboratory experiment and then controlled
ecology” type with the same type of tasks as in the previous point 1. The use
of simulation or “wizard of Oz” seem to be particularly appropriate. It is thus a
question of testing the more or less machine-human or human-human character
of each interaction and of extracting robot design recommendations to give them
intentionally, according to their use, a “machine” or “human behavior” in their
interactions with humans.
3. Specialized and applied to the assistance to disabled persons, the limits of the
machines (robots or not) appear very quickly to have a very reduced “territory
of utility and usability” in usage scenarios. The “unhooking of utility” or the
poverty of the service rendered and the mass of the cases where it is not or can
not be is manifest in real situation. This unhooking testifies to the difficulty to
respond to the causal variability of the situations (context, goal, result, tasks).
Robotization promises, in the public imaginary representations, to “an adaptive
capacity” which is really too much difficult to realize in the laboratory. To cre-
ate a model allowing to beforehand evaluate the “utility perimeter and causal
variability of upholding of these utilities”.

9.2 Question 2

9.2.1 Iina Aaltonen

I would first approach this topic through the robots available today. I believe there
should be some literature on how people want to give names to their Roombas, and
how demented people pet Paros. If we were interested in children, then we could look
into Tamagotchi research. I would talk with researchers doing psychology research
on how people react to animals and what are the elements related to forming bonds.
Q2 did not distinguish what kind of animals (pets as in cat and dogs, or aquarium
animals, or birds in the wild life) but that would also need to be considered.
To do an experimental study on this, I would expect we would need slightly more
social robots which had at least some kind of recognition system and memory for
identifying humans, and whatever the elements for bonding are. Possibly some HRI
metrics on animacy might be relevant (I would probably compare “bonding elements”
with metrics in questionnaires like GODSPEED). Based on my current knowledge,
I am not able to consider this question further.

9.2.2 Sophie Lemonnier

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
Two approaches would seem to me to answer this question, based on two different
methods.
The first one would consist to test the question in a very general way, a field
exploration study. We could do not have too much preconceptions, that is to say, do
not really have underlying hypothesis, but simply to want to observe what happens
if people find themselves to live with a robot as they would do with a pet. For
example, we could imagine an ideal experiment which consists of splitting a number
of participants with no particular characteristics into two groups. It would be asked
to the first group to “welcome” a pet at home, and to the second one to “welcome” a
robot. We could then regularly give participants a questionnaire to identify the nature
of the relationship established between them and the animal or robot, and this for
years.
As this experiment is not feasible, we could first give these same questionnaires
to specific communities already having a robot, and therefore being in a situation
where such a link could have developed. For example, there is a community of people
building by their own their robots in order to make them participate to competitions,
others who own robots that can be coded on their own (e.g. GoGiGo), others yet who
have toy robots (e.g. Cozmo), especially children who can develop strong bonds with
some of their toys and who would represent an issue that may deserve to be treated
separately.

The other way to approach the question, and that we could start in parallel, would
be to list the different ways of defining the human/animal relationship in the literature,
and to test the human/robot relationship for each of the aspects.
For example, many studies over the last 10 years tend to show that a robot under the
form of an animal can bring a lot, especially to the elderly or seriously ill people. New
forms of therapy are emerging and are based on the use of such robots to decrease
some behavioral and physiological troubles, and to facilitate communication and
social interactions in general (e.g. Paro). We know in addition that a pet can bring the
same benefits. This relationship seems to have beneficial effects on health whether
with a robot or an animal, which makes these two relationships comparable.
We could use the existing literature on human/animal relationship to draw the
profile of the robot that is most likely to allow the existence of a similar relationship.
We could especially study the way to communicate. Humans develop a particular
way of communicating with their pet, the voice being able to be enriched by contact
and gaze. We could test on a given robot, in the laboratory, whether the participants
are able to use the same communication repertoire, and whether the robot is able
to understand it. This would allow to know if it is theoretically possible that such a
relationship exists.
In the same way, learning is probably central, and the fact that the robot is able to
learn in contact with humans as an animal does is probably indispensable.
The fact that the human is also involved to taking care of the pet is a point
that means in their relationship, and this can be in accordance with the fact that a
community involved with its robot, because they have created it, coded, or at least,
is able to understand its functioning and to repair it if necessary, or even to make it
evolve, may be more inclined to see such a relationship appear.
However, in my opinion, to the underlying question which is “can an affective
relationship establish with a robot as it is the case with an animal”, only the first
approach seems to be able to answer it. This field experience can become realistic
by using for example a single robot left to participants for example for a month.
It would be difficult to have a large number of participants, but at the end of each
experiment period a semi-directive interview could be conducted and would probably
be very rich. It would compare this group of participants to, for example, a group
of participants who have just adopted an animal (it is easy to get in touch with pet
shops, even with the society for the prevention of cruelty to animals, and to find such
participants), and to conduct the same interview.

9.2.3 Jérôme Michalon

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
Here again, it is a comparative question, which leads to looking for two different
contexts, or a particular context, to which we would add an additional dimension:
the experimental approach returns at a gallop! Another way to answer this question
is to identify a situation where interactions between humans and animals have been
replaced by interactions between robots and animals. I think, for example, of intro-
duction of milking robots in dairy farms (a situation documented several times by
agronomist students whose work I had followed). The approach here would be to be
in a sociology of techniques perspective. First, it would be a question of being inter-
ested in the design of these robots: which tasks are they supposed to replace? With
what representation of the human/animal relationship are they conceived? How do
these representations of the relationship have translation in technical devices? Here
I would practice qualitative interviews with the designers, and I would try, with
them, to unfold the “script” of milking robots, their anticipated use, and the antic-
ipated users (humans and animals). Second, I would practice interviews with the
farmers. First farmers using these robots, to grasp what had changed or not in their
work following the introduction of the robot. Then with “classic” farmers. In both
cases, it will be to understand how they describe their report to cows, their report
to robots (whether they own them or not), how the robot comes to configure their
relation to animals differently, and how they perceive the report between the cows
and the robot. In parallel, I would make observations of the daily activity of farms
equipped with a milking robot, according to the farmers in priority, to then be able
to debrief with them. The approach here will be essentially qualitative and will serve
to unfold the differences and the common points observed by the farmers, between
the anthropozoological relationships, and the anthropo-robotic ones (?).

9.2.4 Cédric Sueur

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
I do not think that today a robot can have the same reactions and the same empathy towards a human as a dog, nor the same complexity as a cat. However, humans today keep reptiles and even invertebrates as pets, and I do not think, in this sense, that the relationship that can be developed with a robot is lesser than the relationship that can be developed with an iguana or a tarantula.
I think one should simply study here in the medium or long term the relationship
that people (in a retirement home for example) develop with 1. a dog, 2. a dog robot,
3. an invertebrate, 4. an invertebrate robot.

9.2.5 Gérard Uzan

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
Two approaches seem interesting to me to deploy: anthropological and socio-
functional:

• Anthropological: Gods, humans, animals, and trees belong to cosmogonies among populations and particular times in which the status of each category can be considered equivalent. In all cases, the characteristics of life, the responsibility of acts,
the irreversibility of death and the existence or not of a soul during life and beyond,
questions the place of the robot, presented as a new being category to be among
the others in these cosmogonies. A specific work on the analysis of these “repre-
sentations” and of their impacts on an insertion of robots or “robot” categories in
these representation spaces (in the Middle Ages, in Europe, there are trials where
the accused is a domestic animal -pig for example- and as a moral-physical person
in the same way as a human person).
• Socio-functional: having worked on the mobility of blind people and on the role
of the guide dog, the white cane and many devices (digital and/or electrotechnical)
(research demonstrator and devices or application of products), it appears that all
assistance systems by animals constitutes a “complex system” which cannot be
reduced to the visible priority function (and apparently unique in view of a too
hasty external observation) of obstacle avoidance, finding objects, arrangement
or memorization of courses or stages. The knowledge protocol about the habits
of blind people, of her/his human environment and her/his contexts of life, of
the dog selection-education, the learning of the guide-dog trainer/guide-dog/blind
person triad constitutes a group whose dog is a member whose primary role,
which constitutes its gratifying contribution, is “guiding”, whether it is a pet,
assistance, laboratory, farming or mystical veneration animal. Functional analysis
should have a “systemic” line of attack to prevent that any approach of partial or
total substitution, or companionship or assistance to the animal, produce reductions
in functions or affects or relationships, or dissatisfaction, or else no longer respond
to functions of “second-rate in terms of a primo-analysis” but priority for the human
in relation.

9.3 Question 3

9.3.1 Iina Aaltonen

I’m not sure if I understood this question correctly. Is the question about long-lasting
relationships, such as what could be considered as families? Or is it about brief
situations where we interact with each other, such as a group conversing casually
about holiday plans? In either case, I cannot think of a good approach to this question,
at least not one that would involve physical robots available today. Conversational
AI agents could maybe be used. From the research perspective, I can think of a
couple of aspects that might make a difference: social interaction (how well the
robot pays attention to individual group members) and the robot’s future role in the
participant’s life (and also social influence). Overall, I would need more knowledge
on group interactions and relationships to consider this further.

9.3.2 Sophie Lemonnier

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
I would begin by studying in the literature the highlighted different influences of
the group size. The objective here would be, as for question 1, to identify each time
a study that can be reproduced by replacing humans with one or more robots. For all
experiments in the laboratory heavily relying on suggestion, this can be quite easy.
The instruction can play a lot, and modifying this one by specifying the presence of
robot and their nature can in some studies be enough to test the same effect with this
new condition.
However, for most studies in social psychology conducted on these issues, par-
ticipants are indeed upon contact with other people and can interact with them. The
main difficulty run the risk to be then the experimental limits of this project: access
to one or more robots, the nature of these robots (we spoke for question 1 of the level
of anthropomorphism, it is highly likely that it also plays on this type of influence),
and the fact that these robots are able to interact with the participant and vice versa.
To overcome this major problem and to be able to keep a wide range of possible
experiments, I would turn to virtual reality. I would probably start with serious games,
as some of which have been precisely designed to induce a maximum of interactions
with the game characters (e.g. games in the operating room where the participant
has only part of the information and must communicate as much as possible with
the rest of the nursing staff to perform a correct task). It would then be possible to
replace humans, or a part of humans, with robots, and to test in situations equivalent
to existing studies on this subject what are the influences of the group size.

9.3.3 Jérôme Michalon

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
As with question 1, it would be to describe a little more about which humans and
which robots it is. Here again, I am looking for an interactional situation between
humans and robots, that could be studied on a large scale and a small scale. I cannot
see yet a situation where humans are in daily and routinized relationships with totally
anthropomorphic robots (it may exist in Japan, but I do not know well). The type of
robots (automaton) that comes to my mind is the sales automatons (ATMs, tickets,
automatic cash boxes, etc.). Perhaps here to address the dimension of “group size” one
might consider using second-hand socio-demographic data. One could, for example,
use territories, as an observation scale, chosen according to the distribution of sale
automatons and according to the density of the population. It would require a territory
densely populated and well endowed with automatons; a less densely populated
territory but still well equipped with automatons; a sparsely populated territory with
few automatons, etc. Then, it would be to think about a “behavioral” indicator (related
to transport or consumption habits). The aim would be to statistically explore the links
between a given behavior, the territory population, and the distribution of automatons.
This would require formulating hypotheses. For this, it would be necessary to conduct
upstream exploratory interviews with users of these automatons to understand what
changes they brought in daily practices.

9.3.4 Cédric Sueur

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
Yes, I invite you to see this experience on cockroaches to confirm this:
Halloy, J., Sempo, G., Caprari, G., Rivault, C., Asadpour, M., Tâche, F., … &
Detrain, C. (2007). Social integration of robots into groups of cockroaches to control
self-organized choices. Science, 318(5853), 1155–1158.
Or
Faria, J. J., Dyer, J. R., Clément, R. O., Couzin, I. D., Holt, N., Ward, A. J., … &
Krause, J. (2010). A novel method for investigating the collective behaviour of fish:
introducing ‘Robofish’. Behavioral Ecology and Sociobiology, 64(8), 1211–1218.
Similar experiments could be done with humans.

9.3.5 Gérard Uzan

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
The less a robot is hypertelic, the more we can think that it is the fusion into one of
several robots; we must thus assume in order to consider a group of robots in commu-
nity that either they are designed to have a minimum of collaboration (coordinated or
common tasks) and/or cooperation (having a common goal) or that they can choose or,
even more, define one or several common goal(s). A first approach would be to clas-
sify existing robots and/or to create a categorization of robots which allows highlight-
ing their potential for collaboration, cooperation, convergence of goals or behavior
regarding theme-concept, and cognitive and/or manual operations. Always with a sys-
temic approach in the background, one must analyze the effects of complementarity,
substitutability, or barriers to communication-collaboration-cooperation.
Approaches specific to a dynamic analysis of small groups (collaboration, compe-
tition, leader effect, marginalization, majority dynamics construction, etc.) or situa-
tionist ethnomethodology may be relevant if the analyzed situations are “ecological”
that are situations which are real or really projectable to the innovation process.
The impossibility or weakness (still actual) of robots’ heuristic production sug-
gests that the study of “genes”, clumsiness, failures, lapses, and mistakes must be
particularly analyzed. Modeling, simulation and/or experimentation in controlled
ecology could be usefully considered.
*Editor’s note: what Simondon calls “hypertelic” is the over-adaptation of the
object to its purpose.

9.4 Further Investigations on Comparisons

9.4.1 Iina Aaltonen

I think that the reason for choosing a comparison type of setup is because of the
formulation of the questions you chose. A comparison would allow control over the
parameters (e.g., human or robot present) and the results would be stronger when
the comparison baseline is known in the same research settings. Naturally, we could
(and would) use literature as general background, but the context might have a big
effect on the results, and therefore a generalized baseline would probably be deemed
useless by the research community.
If I formulated your research questions slightly differently, then I would come up
with more ethnographic studies.
E.g. the first research question:
Is bystander/audience effect the same whether the individual present is a human
or a robot?
->
What characteristics of the bystander effect can be observed with robots? …or
in the form of hypothesis,…“we hypothesize that characteristics a, b, and c of the
bystander effect are also applicable for robots”.

9.4.2 Sophie Lemonnier

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
Is there a category of research questions that leads to something other than a comparative study?
I’m not really sure how to answer your question. As you know, we must in the end
always compare at least two conditions, so there will indeed always be a notion of
comparison at one level or another. So I suspect you mean something more specific
when you refer to the presence of comparison.
Does a research question come to your mind spontaneously, or does it require further reflection?
This comes to me spontaneously when I start to master a subject, and when I read
a lot and already thought a lot about it upstream to close themes, otherwise I need
further reflection (and especially read more articles).

9.4.3 Jérôme Michalon

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)

I reread your original questions, and it is sure they were intrinsically comparative
(the term “same” is present in each one). To answer your question, I think that a
descriptive approach can easily, at first, do without a comparative dimension. Eth-
nology and anthropology are full of descriptions of situations, of interactions, most
of which have been realized without any other preliminary research question than
that of the description of a still unknown reality (“non-Western” peoples and cultures
first, then unknown “Western” contexts). The approach supposes the establishment
of otherness between the investigator and what she/he chooses to explore to precisely
evaluate the degree of this otherness (“what do” we “have in common with nomads
of the Sahara?”). There is thus a kind of implicit comparison in the approach of
the disciplines which have descriptions at the very heart of their epistemology. It
is implicit in the sense that it supposed that the investigator knows herself/himself,
knows her/his “culture”, her/his “social context” and that she/he will establish dif-
ferences between these elements that she/he thinks to know about her-self/himself
and what she/he discovers on “others”. Admittedly, this approach is partly “blind”
to its comparative scope, but it allows engaging in the description of the situation
“in itself” and thus to produce material to embark upon an explicitly comparative
approach (examining different descriptions of different situations). In a nutshell, to
my point of view, there are interesting questions only if they are comparative, even
if they take shape based on data that are not collected in an initially comparative
purpose.
Another thing that comes to me, I recently participated in a workshop about
conversational analysis, which is a discipline I am not familiar with, but which
highlights the methodological dimension of interactions description, almost as an
end in itself (“blind” “in sum to the comparative dimension”). I told myself that
there maybe was in this discipline things to get for you: https://fr.wikipedia.org/wiki/Analyse_conversationnelle.

9.4.4 Cédric Sueur

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)
I think that the experimental comparison is always better because it allows repli-
cating the evaluation in order to confirm again and again our hypotheses, which is
the base of scientific rigor.
See for example, but in another domain: Reproducibility of preclinical animal
research improves with heterogeneity of study samples, B Voelkl, L Vogt, ES Sena,
H Würbel, PLoS biology 16 (2), e2003693.

9.4.5 Gérard Uzan

(Translated by the authors from French. Sentences are kept as close as possible to the original text, even where the English is imperfect.)

What we can observe, what we can feel, is that most issues related to age refer
to the robot's functions. For example, at my university, our students want to make a
robot that will test diabetes, that will bring a sweet product, etc. They made the
mistake of wanting to bring sweet iced coca-cola into EHPADs (French residential care
homes for dependent elderly people) whereas they should use drinks that are still,
sugary, and hot. Regardless of that, the problem when you put a robot in contact with
older people is that the human touch, the voice, the touch of skin on skin, the feeling
of a human presence is more fundamental for them than the service provided. In fact,
they do not care about having a robot that brings them food, because what they want,
through being brought something to eat, is to be able to talk with a human being next
to them, and it is this presence of a human being which is fundamental. At that moment,
one really must dissociate services that are seemingly rendered by a machine, as
sophisticated as it is, from the necessary human relationship, and that explains, from
a psychological/psychosocial point of view, why a senior who does not have contact
with humans or robots will commit suicide much more quickly.
From the methodological point of view, can we operate differently than by comparison?
Yes we can. We can make evaluations of another nature, given that when we say we
want to compare, it is not that we want to compare the relative efficiency of humans
with regard to a robot or any other machine, or even animals, compared to the robot.
That is not really the problem. The problem is that if one wants to evaluate the robot,
one can take the problem in reverse and say "I do not compare, but I look at how a
certain type of robot is accepted or not, in addition or not to the human presence".
That is another issue. Maybe we will realize that, in the other human beings who come,
the human presence may finally be a service rendered in itself, one that is not
negotiable and not substitutable by the presence of a robot. However, quite a few
warnings, alerts, communication, basically technological features, can in the end be
more easily provided by a robot. Therefore, one must especially ask oneself whether it
is not another methodology to start from a robot that renders a service and to make it
a Swiss army knife, that is to say, a sort of multi-service pole, not in parallel with
the human but in parallel with the services provided today by the electronic pillbox
that is put on the table, the anti-fall safeguard that the person wears on the arm or
around the neck, or the ability to order, for example, a meal or products from a
hypermarket via the Internet. Maybe the robot will be able to replace all these
technical services but not the human, and at this point, we can drop the comparison
and make a direct analysis, by grid, of the robot's utilities or functionalities.

Céline Jost is an Associate Professor in Computer Science at Paris 8 University in
France, working in the CHArt laboratory for her research. She obtained her Ph.D. in
Computer Science from South Brittany University (France). She was a Postdoctoral
Researcher at the National Engineering School of Brest (ENIB) in France, working in
the Lab-STICC laboratory.
She mostly conducts multidisciplinary research with differ-
ent disciplines, for which she received the “RJS/KROS Distin-
guished Interdisciplinary Research Award” in RO-MAN 2014.
She has co-organized various conferences and workshops on
Human-Robot Interaction and Assistive technology for disabili-
ties, and is actively involved in the IFRATH Society (Federative
Institute for Research on Assistive Technology for People with
Disabilities).
She has also been involved in many research projects funded
by the French Research Agency and is currently leading the
EMSHRI project. She is also leading the MemoRob project,
which aims at studying the distractor effect of a robot during
learning tasks, and co-leading the StimSense project, which aims
at studying the importance of multisensoriality in learning tasks,
especially during cognitive stimulation exercises.
Her research interests include natural interaction, individualized
interaction, multisensory interaction, human-machine interaction,
interaction paradigms, evaluation methods, cognitive ergonomics,
serious games, mulsemedia, artificial companions, disabilities,
education, and cognitive stimulation.

Brigitte Le Pévédic is an assistant professor at the University of South Brittany. She
obtained her Ph.D. in Natural Language Processing from the University of Nantes and
defended her Habilitation in November 2012 at the University of South Brittany. Her
research interests include Human-Computer Interaction, cognitive assistive
technologies and multisensory interaction.
Recommendations and Conclusions
Experimental Research Methodology
and Statistics Insights

Renato Paredes Venero and Alex Davila

Abstract Methodological and statistical misunderstandings are common within
empirical studies performed in the field of Human Robot Interaction (HRI). The
current chapter aims to briefly introduce basic research methods concepts required
for running robust HRI experimental studies. In addition, it is oriented to provide a
conceptual perspective on the discussion regarding normality assumption violation,
and describes a nonparametric alternative for complex experimental designs when
such an assumption cannot be fulfilled. It is concluded that HRI researchers should hold
internal validity of studies as a priority and foster the use of within-subjects designs.
Furthermore, the described statistical procedure is an alternative for analyzing
experimental data in multifactorial designs when normality assumptions are not
fulfilled and may be held as a suggested practice within the field of HRI.

Keywords Statistics · Research methodology · Non-parametric tests ·
Experimental designs · Human-Robot Interaction

1 Introduction

Experimental research studies within the field of Human-Robot Interaction (HRI)
are being progressively accepted and have proved their benefits and potential. Nevertheless,
methodological misunderstandings are common in the process of including
empirical examinations of recently developed robots and their applications. This
occurs because practitioners of STEM (Science, Technology, Engineering, and Mathematics)
related disciplines usually lack mandatory training in research methods,
experimental design or applied statistics during their academic majors and subsequent
professional practice [1].

R. Paredes Venero (B) · A. Davila


Department of Psychology, Pontifical Catholic University of Peru, Lima, Peru
e-mail: renato.paredes@pucp.edu.pe
A. Davila
e-mail: adavila@pucp.edu.pe


As a consequence, several empirical studies within the field are labelled as experiments
but “lack clear control conditions or even an experimental manipulation that
would allow for disentangling cause-and-effect relationships between key variables
of interest” [1]. Such studies are commonly “pre-experiments” based on a single
observation that may be auspicious for hypothesis generation and the design of
properly called experiments. However, by definition these studies can only provide
descriptive data as insights into assumed psychological phenomena related to an HRI
setting: such data do not provide a basis for causation and must not be used to
attribute robust effects to the supposedly causal variables.
Furthermore, conceptual misunderstandings may extensively occur when deciding
whether to employ parametric or nonparametric tests to analyze data. Briefly,
parametric tests are based on assumptions about how data are distributed whereas
nonparametric tests are not. Literature suggests that parametric tests (e.g. T-test or
ANOVA) are preferred in the vast majority of cases if their main underlying assumption
is met: normality of the data gathered in each experimental condition [2]. If the
normality assumption is not verified successfully, nonparametric alternatives (e.g. the
Mann-Whitney U or the Kruskal-Wallis H) are commonly preferred [3]. However, it has
been proposed that such assumptions may be violated and parametric tests may still be
employed while obtaining valid results [4, 5].
There are other cases in which the robustness of employing parametric statistics
when the required assumptions are not verified is under discussion. On one hand,
we may mention the case of Likert items and scales which are widely employed
within the fields of psychology, medicine and the social sciences [6]. Particularly,
within HRI literature the use of Likert-type responses as a dependent variable has
become quite extensive within experimental studies involving humans: A recent
analysis of HRI’15 conference proceedings suggests that 58% of the papers that
involved experimentation with human subjects employed such response type to
solicit perceptions of human participants as a function of one or more independent
variables1 [7].
Even though Likert-type responses are commonly accepted within a wide range of
fields involving human subjects, an unresolved debate remains regarding the proper
approach to analyze this type of data and define if there are significant differences
among experimental conditions or not through hypothesis testing [8, 9]. Specifically,
the discussion is focused on whether parametric or nonparametric statistical tests
should be employed given the ordinal nature of these responses.2 Particularly, a
review of the proceedings for HRI’16 showed that 21 of 23 papers that use a Likert
response format applied parametric tests [7]. These tests assume that the data are
interval and normally distributed, assumptions that ordinal data by nature cannot
fulfill.

1 A variable is held as independent when it is considered as a supposed cause in a relationship
between variables. Conversely, the supposed effect elicited by this so-called independent variable
is named the dependent variable.
2 When responses within a set of data are ordinal, they provide sufficient information to pick any
pair of them and determine in each comparison which is the lowest and which is the highest, but
they do not provide any information about their magnitudes.

On the other hand, it might be the case that parametric analyses are selected
under the argument that nonparametric alternatives are not available for complex
experimental designs, even though normality assumptions were not met. Specifically,
it is commonly thought that there is a lack of reliable within-subjects nonparametric
alternatives to compare more than two experimental conditions to which human
participants may be exposed: this is the case of the Friedman test, which has been
signalled as an inconvenient parallel to the within-subjects ANOVA [10]. However,
nonparametric alternatives for multifactorial designs3 have been developed recently,
but they are not commonly introduced in current research methods courses and books.
Consequently, the current chapter aims to briefly review basic research methods
concepts that, if taken into account, may help increase the methodological
quality of HRI experimental studies. In addition, it is oriented to provide a conceptual
perspective on the discussion regarding the practice of taking as normally distributed
data which are not: the so-called normality assumption violation. To end the chapter,
the authors review an available nonparametric alternative for complex experimental
designs when the normality assumption cannot be fulfilled and provide computational
evidence to encourage its adoption.

2 Conceptual Notes

2.1 Experiment Requisites

An experimental study is properly called as such when it fulfills three fundamental
requisites [11]:
• it involves intentional manipulation of one or more independent variables
• it measures the effect of the independent variable on the dependent variable
• it achieves internal validity through sophisticated control of the experimental setting.
Regarding the first requisite, a study can be considered experimental only when the
so-called independent variable displays at least two levels of manipulation or action
by the experimenter. This means that within an experimental setting the researcher
deliberately and artificially builds at least two conditions and exposes participants to
them to elicit and register responses. For example, in a study regarding the effect
of movement type on the perception of emotional intensity, a researcher may shape
the movement characteristics of a single original source to produce and display two
animation types: emotional vs neutral movement. Participants will respond to these
animations either by filling in a questionnaire, pressing a button so that experimental
software records their answers, or expressing their emotions by contraction and
relaxation of their facial muscles.

3 Multifactorial designs are those that analyze the statistical effects of two or more independent
variables on a dependent variable.

Moreover, it has been mentioned that an experiment must necessarily measure
the effect that the manipulated independent variable exerts on the dependent variable.
In order to do so, a convenient quantitative measurement technique that answers the
research goal or question should be chosen for use in the experiment: questionnaires
and psychological scales to obtain self-reports of relevant mental states such as
perceptions or emotions, observation sheets of behavioral responses towards a humanoid
robot, performance records such as speed and accuracy in detecting whether a humanoid
robot is “angry” or “polite”, or physiological measures such as EEG records of mirror
neurons, among others. For a general review of these techniques see [12]. Nevertheless,
such measures may not be indicative of the effect if internal validity, the third
requisite, is not carefully met.
Internal validity of an experiment refers to the degree of confidence with which
a cause-effect relationship between two (or more) variables can be inferred in the
direction indicated by the researcher [13]. In other words, if an experiment has internal
validity, we may state that what we are registering as changes in the dependent
variable is actually related to the intentional manipulation of the independent
variable. This cause-effect relationship is ensured by employing experimental techniques
designed to avoid external influences on the dependent variable. Such strategies are
referred to as control techniques and have been extensively described within the
psychological literature [14]. For example, we may mention the use of control groups,
counterbalancing, pre-test measurements, single and double blinding, and matching,
among others.

2.2 Experiment Validity in HRI

As outlined above, internal validity is a fundamental requisite to properly conduct
an experimental study. However, there is another form of validity that should be
carefully considered whenever an experiment is designed: external validity. Such
form of validity refers to the degree of confidence or credibility with which a cause-effect
relationship found between two (or more) variables can be concluded to be
representative or not; that is, whether it can be generalized to contexts different from
that used by the researcher [13]. While the internal validity requisite points to warranting
the strength of a study, external validity points to its replicability across different
contexts.
In this regard, it must be stressed that even though external validity fulfillment
might be an important methodological outcome when conducting an experiment, it
is not a must to warrant the quality of an experimental study. Furthermore, the lack
of external validity does not say anything regarding the cause-effect relationship
examined within the study, but only about the extent to which such a relationship may
or may not occur outside of the experimental environment (e.g. in another context,
population or time). In this regard, it depends on the purpose of the study whether
external validity is critical in an experiment, that is, whether the study is oriented to
understand the nature
of a given phenomenon or to generalize the occurrence of such a phenomenon outside
the experimental setting.
In the case of HRI research, external validity stands as critical, provided that most
research within this field is oriented towards practical applications of a given robot
and not towards answering basic science questions driven by a theoretical framework.
However, within the field there are methodological efforts that, while still a minority,
may lead towards the consolidation of a solid theoretical framework for HRI
research [1, 12]. Hence, we think that it is decisive that researchers in the field
start holding internal validity of experiments as a priority. This would allow building
empirical and theoretical foundations on which further external validation efforts of
generalization of results may evolve. In a sentence, genuine controlled experimental
basic research must be fostered in HRI.

2.3 Experimental Designs

A key understanding necessary to conduct robust experimental studies is to carefully
choose the most appropriate design for a given research question. In broad terms,
there are two possibilities (for details see [15]):

• between-subjects design (where each participant is exposed to a single experimental
condition to which she/he will be randomly assigned)
• within-subjects design (where participants are exposed to all experimental conditions)

The preferred design for basic research is the within-subjects design because of the
greater robustness it attains by controlling individual differences. Hence, this design
is characterized by a strong internal validity. Additionally, the adoption of such design
requires smaller sample sizes with the corresponding savings in budgeting and time
for HRI research [16].
Conversely, between-subjects designs present the advantage of an absence of
contamination among treatments because each participant is exposed to only a single
experimental condition. Such an advantage is of particular interest in HRI studies
because of the experimental settings and tasks that are usually employed within the
field: the use of robots in social situations, although in a controlled environment,
represents a challenge in terms of avoiding contamination among conditions, particularly
when the focus is on preserving ecological external validity. As a consequence,
between-subjects designs are widely employed in HRI experiments.
Nevertheless, we think that HRI researchers should consider employing within-
subjects designs as their preferred alternative to strengthen basic research within the
field.

2.4 Symmetrical Distributions

In a given experiment, the normality assumption required for parametric tests may
be rejected after a check with a normality test (e.g. Shapiro-Wilk or Kolmogorov-Smirnov).
As a consequence of normality rejection, T-tests or ANOVA tests cannot
be applied in principle, because the normality assumption is needed to fit or approximate
data sets to Student's t or F statistical distributions in order to make inferences
regarding the similarity of means or their difference.
Nevertheless, such experimental data may still be reasonably analyzed through a
parametric approach using those tests whenever the distributions to be compared are
symmetrical [17]. This is plausible given that it is the symmetry of distributions which
ensures that it is reasonable to rely on their means as central tendency measures and
their variances as dispersion measures. Hence, symmetry is required for parametric
statistical comparisons: The use of T-tests or ANOVA with caution seems reasonable
or, even better, the use of customized tests based on tailored probability density
functions which are symmetrical but not normal would fix the issue.
In contrast, a proper analysis of non-symmetrical data through a parametric
approach would be unlikely. In fact, it has been proposed that when dealing with non-
symmetrical (i.e. skewed) distributions there is “more danger of failing to detect a
faulty hypothesis when the long tail of the true population distribution points towards
the position of the supposed mean than when the steep tail does”4 [18]. This suggests
that the mean and variance cannot be held as valid measures to describe a skewed
distribution. Hence, these measures should not be used for statistical inference or
descriptive statistics of non-symmetrical data.
A numerical method to verify the symmetry of a distribution is the skewness coefficient.
Such a coefficient is available in most statistical packages and is interpreted
as follows: when the coefficient is 0 or close to 0, the distribution is symmetrical;
when it is different from 0, the distribution is skewed to the right (long tail towards
higher values) when positive and to the left when negative. According to Kline [19],
the lack of symmetry is severe when the absolute value of the coefficient is larger
than 3. More conservatively, Bulmer suggests that the distribution is highly skewed
when the coefficient is larger than 1 in absolute value [20]. In this regard, we suggest
that the value proposed by Bulmer may be held as a threshold for determining the
suitability of employing parametric procedures for the analysis of a given distribution.
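As an editorial illustration (not part of the original text), such a check can be scripted in R. The sketch below assumes the e1071 package for the skewness function; the moments and psych packages offer equivalents.

# Minimal sketch: skewness of simulated, right-skewed Likert-like responses.
# Assumes the e1071 package (install.packages("e1071")).
library(e1071)

set.seed(42)
responses <- sample(1:5, size = 100, replace = TRUE,
                    prob = c(.40, .30, .15, .10, .05))

skewness(responses)
# |skewness| > 1 (Bulmer) already discourages mean/variance-based analysis;
# |skewness| > 3 (Kline) indicates severe asymmetry.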

2.5 Likert Items and Scales

The advocates of the use of parametric procedures for the analysis of Likert-type
data argue that authors who hold the “ordinalist” view “rarely mention or address
the empirical findings and facts that support the advantages of employing parametric
statistics to analyze such data” [21]. Indeed, there are several studies that provide
evidence of the usefulness of employing parametric statistics to analyze Likert-type
responses: among them, the contribution made by Gombolay and Shah can be
highlighted [7].

4 The long tail is the region of the distribution where data are less concentrated, whereas the steep
tail is the region where data are more concentrated.
In the case of ordinal data such as those generated with an N-point Likert scale, a fixed
distance between any pair of sequential points across the scale cannot be assumed:
their underlying magnitudes may vary [22]. For example, on a 5-point scale, if a
difference of one ordinal unit from the middle point (i.e. 3) to the point 4 corresponds
to a magnitude A and a difference of one ordinal unit from the point 4 to the end of the
scale (i.e. 5) to a magnitude B, A and B do not necessarily coincide. This disagreement
is thought to be dependent on the individual perception of respondents, among other
factors.
In order to address this problem, monotonic transformations are suggested and
employed. Monotonic rank transformations map the original scores into the trans-
formed scores keeping the order among the former in the new set of scores. A
monotonic transformation could be achieved, for example, through logarithmic or
sigmoidal transformations or by simply assuming that the points registered are con-
tinuous values from 1 to N.
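To make the issue concrete, the short base-R sketch below (ours, for illustration only) applies three monotonic transformations to the same 5-point responses: the order of the points is preserved in every case, but the spacing between adjacent points is not.

# Minimal sketch: three monotonic transformations of 5-point Likert responses.
x <- c(1, 2, 3, 4, 5)

as_continuous <- as.numeric(x)      # treat the points as continuous values 1..5
log_scores    <- log(x)             # logarithmic transformation
sig_scores    <- 1 / (1 + exp(-x))  # sigmoidal (logistic) transformation

rank(as_continuous); rank(log_scores); rank(sig_scores)  # identical ranks
diff(as_continuous); diff(log_scores); diff(sig_scores)  # different spacings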
Remarkably, a study within the field of Human Computer Interaction (HCI) [23]
stresses that differing results are obtained when parametric methods are applied to
different monotonic transformations of the same original data. This raises questions
about the validity and reliability of the parametric approach for the analysis of Likert-
type data. Therefore, we consider that monotonic rank transformations of ordinal data
constitute a fundamental issue that should be adequately discussed before accepting
the possibility of employing parametric statistics for such a data type.
In the meantime, we encourage the use of nonparametric approaches to analyze
Likert-type data.

2.6 Nonparametric Statistics for Factorial Designs

An additional methodological concern arises when a factorial design is employed
and nonparametric alternatives for statistical analyses are unknown or unavailable. A
factorial design is an experiment that comprises two or more factors or independent
variables that are controlled by a researcher to explore or elicit foreseen effects in
different groups or a single group of participants. In those experiments where the
normality assumption cannot be reasonably violated (i.e. severe skewness or ordinal
data type, such as Likert data) a nonparametric approach is strongly recommended.
Fortunately, a nonparametric analytic approach for factorial designs has been
developed [24], and has been made available in R 3.5.2 [25]. In fact, this approach
named nonparametric analysis of longitudinal data in factorial experiments has
already been introduced within HCI studies [23]. In broad terms, the referred non-
parametric analysis for factorial experiments is based on the ranks of the samples
that are being compared rather than relying on the mean and variance, as the para-
metric approach does. This feature allows the possibility to reliably analyze ordered
categorical, dichotomous and skewed data. Moreover, it has been proposed that this
approach is robust to outliers and performs well with small sample sizes [24].
Particularly, within this framework there is a nonparametric alternative to
ANOVA, named the ANOVA-type statistic (ATS), which may be used to interpret mul-
tifactorial experimental data that do not fulfill the previously mentioned statistical
assumptions required to apply parametric statistics. For a formal introduction to this
analysis refer to [26] and for a practical approach and software implementation refer
to [25]. In Sect. 3, for illustration purposes, we report the results of simulated experi-
ments that register Likert scales responses which are analyzed using both parametric
and nonparametric approaches.

3 A Simulation Study

For this section, a hypothetical HRI experiment is given as an example to illustrate
the analysis of several effects on a self-reported mental state: likeability of robots. We
consider as a between-subjects factor ‘Sex’ with two levels, (1) Female and (2) Male
participants, and as a within-subjects factor ‘Design’ with three levels: (1) Anthropomorphic,
(2) Zoomorphic, and (3) Machinelike. In this example, the effects on likeability are
measured by a 5-point Likert scale that female and male participants fill in after the
display of each type of robot.
We simulated the responses of 104 participants for two scenarios: (1) Effects are
present: when there is at least a difference in likeability between two types of robot
design display and a difference between female and male participants, and (2) Effects
are absent: when there were no differences at all, neither in likeability towards
the three robot design displays nor in likeability between women and men. We assigned
probability values to each possible response. Such values were different according to
Design and Sex when the effects were present (see Table 1) and identical (all = 0.2)
when the effects were absent.

Table 1 Assigned probabilities to each value of a 5-point Likert scale to simulate an experiment
in which principal effects are present
1 2 3 4 5
Sex
Female 0.25 0.30 0.25 0.1 0.1
Male 0.1 0.1 0.25 0.30 0.25
Design
Anthropomorphic 0.1 0.25 0.3 0.25 0.1
Zoomorphic 0.1 0.1 0.25 0.3 0.25
Machinelike 0.25 0.3 0.25 0.1 0.1

The probability values in Table 1 indicate that female participants are more likely
to select lower values of the scale (i.e. from 1 to 3), whereas male participants are more
likely to select higher values (i.e. from 3 to 5). Analogously, participants exposed to
anthropomorphic designs would select central values of the scale (i.e. from 2 to 4),
whereas when exposed to zoomorphic and machinelike designs they would more likely
select higher and lower values of the scale, respectively.
The responses of participants were simulated by sampling values from 1 to 5
according to the joint probabilities corresponding to a particular combination of
sex of the participant and type of robot. As in this example there is no interaction
between Sex and Design, the probability of such responses is obtained by
simply multiplying the probabilities for a given particular sex and a given particular
robot design. For example, if the probability of a female participant responding 5
were 0.1 and the probability of a participant (woman or man) responding 5 after the
display of a zoomorphic robot were 0.25, the probability of a female participant
responding 5 after the display of a zoomorphic robot would be 0.1 × 0.25 = 0.025.
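For instance, the joint probabilities used for sampling are simply element-wise products of the declared probability vectors; the two lines below (our illustration) reproduce the numbers of this example, and sample() rescales such weights internally so that they sum to 1.

Female     <- c(.25, .30, .25, .10, .10)
Zoomorphic <- c(.10, .10, .25, .30, .25)
Female * Zoomorphic  # 0.0250 0.0300 0.0625 0.0300 0.0250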

3.1 Analyzing Under the Nonparametric Approach

The experiment described above had two scenarios for simulation: with and without
principal effects. For both cases, no interaction between factors was assumed. In
an ANOVA-like analysis considering two factors, a principal effect is attributed to a
factor or independent variable when there is statistical evidence of differences between
at least a pair of means that correspond to the levels stated for the analyzed factor.
In our example there are two possible principal effects: (1) one for the between-subjects
factor Sex, and (2) another for the within-subjects factor Design. For example, if
we had a principal effect for Sex, there would be differences in likeability towards
robots between female and male participants. When there is an interaction between
factors, there is a differential effect between means attributable to the combinations
of the levels that have been defined for these factors: an illustration of a differential
effect in our example would be that the difference in likeability towards
anthropomorphic and zoomorphic robots is not the same for female participants
as for male participants.
The simulation of data and their statistical analysis were performed using R version
3.5.2 [27] and the nparLD package [25]. The code employed to run the simulations
is presented in Listing 1. The first three sets of statements declare the possible
values of Likert-type responses and the probabilities of likeability for the levels of the
between- and within-subjects factors. Statements from AF to MM simulate samples
of size 52 for the probabilities corresponding to each possible combination between
the type of robot design and the sex of a participant. Statements from DAF to DMM
link the sampled responses with the names of the six possible combinations of the
analyzed factors. The final line of the code concatenates the previous results.

Corresponding descriptive statistics of the Likert-type responses, going from 1 to 5,
for likeability towards robots can be inspected in Table 2. A response of 1 means strongly
dislike, 2 means dislike, 3 means no opinion, 4 means like, and 5 means strongly
like. Levels of Sex, Design and their combinations are named across the rows of the
first column, and values for the median (Me) and the interquartile range between
percentiles 75 and 25 (IQR) are depicted for the two scenarios: with and without
principal effects.
Listing 1 R code employed for the simulation
# Initial setup
likert <- c(1, 2, 3, 4, 5)
set.seed(104)

# Between subjects probabilities
Female <- c(.25, .30, .25, .1, .1)
Male <- c(.1, .1, .25, .30, .25)

# Within subjects probabilities
Anthropomorphic <- c(.1, .25, .3, .25, .1)
Zoomorphic <- c(.1, .1, .25, .30, .25)
Machinelike <- c(.25, .30, .25, .1, .1)

# Simulation run
AF <- sample(likert, size = 52, replace = T, prob = Anthropomorphic * Female)
ZF <- sample(likert, size = 52, replace = T, prob = Zoomorphic * Female)
MF <- sample(likert, size = 52, replace = T, prob = Machinelike * Female)
AM <- sample(likert, size = 52, replace = T, prob = Anthropomorphic * Male)
ZM <- sample(likert, size = 52, replace = T, prob = Zoomorphic * Male)
MM <- sample(likert, size = 52, replace = T, prob = Machinelike * Male)

DAF <- data.frame(id = 1:52, Group = "Female", Condition = "Anthropomorphic", response = AF)
DZF <- data.frame(id = 1:52, Group = "Female", Condition = "Zoomorphic", response = ZF)
DMF <- data.frame(id = 1:52, Group = "Female", Condition = "Machinelike", response = MF)
DAM <- data.frame(id = 53:104, Group = "Male", Condition = "Anthropomorphic", response = AM)
DZM <- data.frame(id = 53:104, Group = "Male", Condition = "Zoomorphic", response = ZM)
DMM <- data.frame(id = 53:104, Group = "Male", Condition = "Machinelike", response = MM)

experiment <- rbind(DAF, DZF, DMF, DAM, DZM, DMM)
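Given the experiment data frame built in Listing 1, the descriptive statistics reported in Table 2 can be reproduced along the following lines (our illustrative sketch, not part of the original code).

# Medians and interquartile ranges by factor and by factor combination.
aggregate(response ~ Group,             data = experiment, FUN = median)
aggregate(response ~ Condition,         data = experiment, FUN = median)
aggregate(response ~ Group + Condition, data = experiment, FUN = median)
aggregate(response ~ Group + Condition, data = experiment, FUN = IQR)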

The simulated experimental data fit an F1 LD F1 model. This model considers
one between-subjects factor and one within-subjects factor and constitutes the
nonparametric alternative to a mixed ANOVA. Such a model is called automatically by
employing the function nparLD, as shown in Listing 2. The code performs
the analysis after declaring the factors Sex (which will be called Group from here onwards)

Table 2 Descriptive statistics of the simulated experiment
Principal effects No principal effects
Me IQR Me IQR
Sex
Female 3 1 3 2
Male 3 1 3 2
Design
Anthropomorphic 3 2 3 2
Zoomorphic 4 1 3 2
Machinelike 3 1 3 2
Sex*Design
Female*Anthropomorphic 2 1 3 2
Female*Zoomorphic 3 3 3 2
Female*Machinelike 2 1.25 3 2
Male*Anthropomorphic 3 1 3 2
Male*Zoomorphic 4 1 3 2
Male*Machinelike 3 2 2 3

and Design (which will be called Condition from here onwards), and calling the
previously calculated data. Additionally, it produces a summary and a plot to display
the results of the analysis.
The nonparametric analysis results are displayed in Table 3, where the key column
to be inspected is the p (probability) column across the rows corresponding to the factors
and their interaction. Results show that the ATS test accurately found principal effects
(p < 0.05) for the Group and Condition factors in the simulated scenario where such
effects were expected. In contrast, no effects were found in the alternative scenario.
A graphical illustration of the analysis can be found in Fig. 1. This analysis permits
the detection of the presence or absence of principal effects.
Listing 2 R code employed for the nonparametric analysis.
library(nparLD)

# Nonparametric analysis
ats <- nparLD(response ~ Condition * Group, data = experiment,
              subject = 'id', description = TRUE)

# Show results
summary(ats)

# Generate plot
plot(ats)

Up to this point, we have simulated two experiments: one with principal effects
and another without them. To achieve these effects in the simulation, we tailored our
data by manually assigning probabilities to each value of a 5-point Likert scale

Table 3 ANOVA-type results
Principal effects No principal effects
ATS df p ATS df p
Sex 45.14 1 0.000 0 1 0.936
Design 21.28 2 0.000 0.12 2 0.886
Sex*Design 0.11 2 0.900 0.35 2 0.708

before running the simulations. Afterwards, we examined the presence or absence
of such effects in the simulated data with the ATS test.
In the following section, we will simulate those experiments repeatedly to compare
the effectiveness of ANOVA and ATS to detect differences among means of interest.
In line with the conceptual discussion presented previously, we expect that the ATS
would outperform ANOVA because it does not assume normally distributed interval
data.

3.2 Comparison Between Nonparametric and Parametric Approaches

In order to compare the performance of the nonparametric and parametric procedures
already described, 2000 simulations of the same experiment were computed. Half
of the simulations corresponded to the scenario where principal effects exist and half
to the scenario where such effects are absent. No interaction between factors was
assumed. The nonparametric analysis was computed using the ATS test, whereas the
parametric analysis was computed employing a mixed ANOVA using the function
‘aov’, as shown in Listing 3 (see footnote 5). The nonparametric and parametric analysis statements
call the functions to run their respective analyses on the experimental data we simu-
lated and described in Sect. 3.1.
Listing 3 R code employed for parametric and nonparametric analysis.
# Nonparametric analysis
ats <- nparLD(response ~ Condition * Group, data = experiment,
              subject = 'id', description = TRUE)

# Parametric analysis
anova <- aov(response ~ Group * Condition + Error(id / Condition), data = experiment)

Sensitivity and specificity are two concepts that permit the assessment of the effectiveness
of statistical tests in estimating parameters of interest (i.e. means): sensitivity may be
defined as the property that a test has to successfully detect probability values that
5 The complete code employed for the simulations run in this subsection can be found at https://
github.com/renatoparedes/ATS-vs-ANOVA.

Fig. 1 The graphics illustrate the results of likeability towards robots as probabilities of the ATS
test for both scenarios. Each graphic illustrates in separate lines for female and male participants
the plot of means of probabilities for the relative treatment effects (RTE) across the three types of
robots: anthropomorphic, zoomorphic, and machinelike. Pointwise 95% confidence intervals bars
centered on the means are depicted. The top plot corresponds to the scenario where principal effects
come out and the bottom plot to the scenario where principal effects do not come out

correspond to a statement on estimable parameters (e.g. two means are equal) and
are below a threshold p (e.g. p < 0.05). Specificity may be defined as the property
to successfully detect probability values that are above the aforementioned threshold
(e.g. p > 0.05). When a simultaneous analysis of the sensitivity and specificity of a
statistical test takes place, these measures permit tests to be described as binary classifiers
of observations of interest into two categories: below and above a probability threshold.
We proceeded to calculate sensitivity and specificity measures for our 2000 sim-
ulations. In this context sensitivity may be defined precisely as the proportion of
successful classifications of probability values below the threshold and specificity as
the proportion of successful classifications of probability values above the threshold.
The simultaneous plotting of sensitivity and specificity results is known as a receiver
operating characteristic (ROC) curve that helps to visualize the effectiveness of sta-
tistical tests as binary classifiers. ROC curves for our simulations are depicted in
Fig. 2.
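For readers who wish to reproduce this kind of evaluation, ROC curves and their areas can be computed from the collected p-values. The sketch below is our own illustration with toy p-values standing in for the simulation output, and it assumes the pROC package (install.packages("pROC")).

library(pROC)

set.seed(7)
truth <- rep(c(1, 0), each = 1000)                  # 1 = effect simulated, 0 = absent
p_ats <- c(runif(1000, 0, .05), runif(1000, 0, 1))  # toy p-values standing in for ATS
p_aov <- c(runif(1000, 0, .30), runif(1000, 0, 1))  # toy p-values standing in for ANOVA

roc_ats <- roc(truth, 1 - p_ats, quiet = TRUE)      # smaller p, higher classifier score
roc_aov <- roc(truth, 1 - p_aov, quiet = TRUE)
auc(roc_ats); auc(roc_aov)                          # areas under the ROC curves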
As both tests' performances are above chance for either the between-subjects (Group)
or within-subjects (Condition) factor, they are well suited to minimize the statistical errors
that arise when estimating the parameters of interest: differences among means
of likeability towards the three different types of robots that have been displayed, or
differences between means of likeability by the sex of participants. However, the ATS
test works as a better classifier than the ANOVA, particularly in terms of sensitivity
(i.e. the true positive rate, or how well the test detected differences when they existed)
for the within-subjects factor Condition. The performances of both tests, expressed as
the area under the ROC curve (AUROC), are shown in Table 4: while the performance
of ANOVA is a bit smaller than that of ATS for Group, the performance of
ANOVA is lower than that of ATS for Design by about 0.200.
Analysis of AUROC results suggests that ATS works as a better classifier of sig-
nificant differences than ANOVA for between-subjects comparisons of participants
by sex and for within-subjects comparisons of the three experimental levels of type of
robot design that were displayed. Considering that a random classifier would obtain
AUROC values of 0.5, these results indicate that both statistical tests perform bet-
ter than chance, but the ATS is more sensitive and specific in finding differences in
the simulated experiment. In particular, it performs considerably better than ANOVA
when dealing with differences among conditions.

3.2.1 The Non Independent Factors Scenario

The same analysis was performed for the scenario where an interaction between fac-
tors exists (i.e. factors are not independent). This was implemented in the simulation
by conditioning group responses (Group) over the experimental levels (Condition)
by multiplying the probability of the responses for the three types of robot design
and the probability of the responses either for female or male participants given a
particular type of robot display: anthropomorphic, zoomorphic or machinelike.

Fig. 2 The graphic illustrates ROC curves for both ATS and ANOVA statistical tests corresponding
to the principal effects of our example: Group and Condition. In both figures, the gray diagonal
represents a test performance that would correspond to chance. At the top we depict the classifier
results obtained for the between-subjects factor (Group) and at the bottom those obtained for the
within-subjects factor (Condition). The ROC curves reveal that the ATS considerably outperforms
ANOVA as a binary classifier, especially for the within-subjects factor

Table 4 AUROC results for ATS and ANOVA tests
Test Group Condition
ATS 0.973 0.971
ANOVA 0.918 0.785

Fig. 3 The graphic illustrates the results of likeability towards robots as probabilities of the ATS
test for the scenario where there is an interaction between Group and Condition. The graphic
illustrates in separate lines for female and male participants the plot of means of probabilities for
the relative treatment effects (RTE) across the three types of robots: anthropomorphic, zoomorphic,
and machinelike. Pointwise 95% confidence intervals bars centered on the means are depicted

The code we wrote for this simulation is shown in Listing 4: the first two sets of
statements declare the same probabilities of likeability for the levels of the between-
and within-subjects factors as in Listing 1, and the statements from AF to MM simulate
samples of size 52 for the probabilities corresponding to each possible combination
between the type of robot design and the sex of a participant, in the same way as in
Listing 1. The probabilities declared for female and male participants in Listing 1
were diminished by 0.1 in the zoomorphic condition to introduce the aforementioned
interaction. The results of a single simulation are shown in Fig. 3.

Listing 4 R code employed for the simulation with an interaction between factors.
# Between subjects probabilities
Female <- c(.25, .30, .25, .1, .1)
Male <- c(.1, .1, .25, .30, .25)

# Within subjects probabilities
Anthropomorphic <- c(.1, .25, .3, .25, .1)
Zoomorphic <- c(.1, .1, .25, .30, .25)
Machinelike <- c(.25, .30, .25, .1, .1)

# Simulation run
AF <- sample(likert, size = 52, replace = T, prob = Anthropomorphic * Female)
ZF <- sample(likert, size = 52, replace = T, prob = Zoomorphic * (Female - .1))
MF <- sample(likert, size = 52, replace = T, prob = Machinelike * Female)
AM <- sample(likert, size = 52, replace = T, prob = Anthropomorphic * Male)
ZM <- sample(likert, size = 52, replace = T, prob = Zoomorphic * (Male - .1))
MM <- sample(likert, size = 52, replace = T, prob = Machinelike * Male)

As for the scenario where the factors were independent, 2000 simulations were
computed to examine the performance of both tests as binary classifiers. The sensitivity
and specificity results, plotted as ROC curves, are shown in Fig. 4.
The ROC curves reveal that the ATS considerably outperforms ANOVA for the
within-subjects factor (Condition) and the interaction (Condition by Group), whereas
both tests are equivalent for the between-subjects factor (Group). In particular, the
difference in performance between the two tests is larger for the within-subjects factor
than in the independent factors scenario. Moreover, the performances of the ANOVA
for the within-subjects factor and the interaction are only moderately better than chance
(the red ROC curves and the gray diagonal lines depicted in the graphs are close), which
indicates that this parametric test would perform poorly in finding differences across
experimental conditions and interactions when the factors are not independent.
The performance results are displayed numerically in terms of the AUROC in
Table 5: the performance of ANOVA is lower than that of ATS for Condition
and for the interaction Condition by Group by about 0.270.

3.3 Discussion

The simulations that have been presented were performed as an illustration of the suit-
ability of a nonparametric procedure for a factorial design with Likert-type responses
as a dependent measure. Results indicate that such procedure may be reliable to deter-
mine differences when comparing experimental conditions or groups. Our procedure
generates results that are in agreement with those by Kaptein, Nass and Markopoulos
[23]. A complete review of the reliability of the ATS has been reported by Brunner,
Domhof and Langer [24] and a generalized guide to employ it for different experi-
mental designs can be found in [25].

Fig. 4 The graphic illustrates ROC curves for both ATS and ANOVA statistical tests corresponding
to the principal effects and the interaction of our example: Group, Condition and Group by Condition.
In all figures, the gray diagonal represents a test performance that would correspond to chance. At
the top left we depict the classifier results obtained for the between-subjects factor (Group), at the top
right those obtained for the within-subjects factor (Condition) and at the bottom those obtained for
the interaction (Condition by Group). The ROC curves reveal that the ATS considerably outperforms
ANOVA as a binary classifier when factors are not independent, especially for the within-subjects
factor and the interaction

Table 5 AUROC results for ATS and ANOVA tests
Test Group Condition Interaction
ATS 0.973 0.964 0.953
ANOVA 0.971 0.695 0.682

In addition, the finding that such a nonparametric procedure would considerably
outperform the parametric alternative in determining within-subjects differences and
interactions has relevant implications. This opposes previous claims of no practical
difference between employing parametric and nonparametric tests to analyze Likert-type
responses [7, 21] and provides computational evidence to support the “ordinalist”
view in this debate. Furthermore, the results presented here provide arguments to
disregard the use of monotonic rank transformations to pre-process Likert-type data
to make them suitable for a parametric analysis. Instead, the results suggest the possibility
of obtaining robust results by employing a rank-based nonparametric analysis such as
the ATS.

Overall, the use of the ATS is advisable when Likert-type data such as those illustrated
in our simulation are generated in the process of measuring mental states by self-
reporting. In our example the Likert-type data corresponded to the hypothetical case
of likeability towards robots but the spectrum of possible responses that can be
measured and analyzed in a similar way is broad: responses like willingness to
interact with robots, customer satisfaction with assistant robots, perception of their
utility or friendship, among others.

4 Conclusion

This chapter has introduced conceptual insights into experimental research methodology
and statistical data analysis. It started with a review of the requisites for a study
to be properly called an experiment. By doing so, it has argued for the necessity
of holding internal validity as a priority in HRI experimental research and for the use
of within-subjects designs to increase basic-science-oriented research in the field. In
addition, the chapter has introduced arguments for employing nonparametric statistics
in HRI experiments when the normality assumption required for parametric procedures
cannot be reasonably violated. Furthermore, a nonparametric alternative for
factorial designs has been discussed and illustrated through a simulated experiment holding
Likert-type responses as a dependent measure. After the current conceptual review
and simulation, it seems feasible to employ the described nonparametric approach
as an alternative to analyze the results of experiments when the distributions are
asymmetrical or the data are ordinal in nature, as discussed in this chapter. Given
that such scenarios are frequent in HRI research, the reviewed statistical analysis
alternative is proposed as a recommended methodological practice for the field.

References

1. Eyssel, F.: An experimental psychological perspective on social robotics. Robot. Auton. Syst.
87, 363–371 (2017)
2. Razali, N.M., Wah, Y.B., et al.: Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov,
Lilliefors and Anderson-Darling tests. J. Stat. Model. Anal. 2(1), 21–33 (2011)
3. Siegel, S.: Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill (1956)
4. Boneau, C.A.: The effects of violations of assumptions underlying the t test. Psychol. Bull.
57(1), 49–64 (1960)
5. Bradley, J.V.: Robustness? Br. J. Math. Stat. Psychol. 31(2), 144–152 (1978)
6. Likert, R.: A technique for the measurement of attitudes. Arch. Psychol. 140, 1–55 (1932)
7. Gombolay, M., Shah, A.: Appraisal of statistical practices in HRI vis-à-vis the t-test for Likert
items/scales. In: 2016 AAAI Fall Symposium Series (2016)
8. Jamieson, S., et al.: Likert scales: how to (ab)use them. Med. Educ. 38(12), 1217–1218 (2004)
9. Carifio, J., Perla, R.J.: Ten common misunderstandings, misconceptions, persistent myths and
urban legends about Likert scales and Likert response formats and their antidotes. J. Soc. Sci.
3(3), 106–116 (2007)

10. Baguley, T.: Serious stats: a guide to advanced statistics for the behavioral sciences. Macmillan
International Higher Education (2012)
11. Hernández Sampieri, R., Fernández Collado, C., Baptista Lucio, P., et al.: Metodología de la
investigación, vol. 3. McGraw-Hill, México (2010)
12. Bethel, C.L., Murphy, R.R.: Review of human studies methods in HRI and recommendations.
Int. J. Soc. Robot. 2(4), 347–359 (2010)
13. García, M.A., Seco, G.V.: Diseños experimentales en psicología. Pirámide (2007)
14. Kantowitz, B.H., Roediger III, H.L., Elmes, D.G.: Experimental Psychology. Nelson Education
(2014)
15. Coolican, H.: Research Methods and Statistics in Psychology. Psychology Press (2017)
16. Smith, P.L., Little, D.R.: Small is beautiful: in defense of the small-n design. Psychon. Bull.
Rev. 25(6), 2083–2101 (2018)
17. Efron, B.: Student’s t-test under non-normal conditions, Technical report. Harvard Univ Cam-
bridge Ma Dept of Statistics (1968)
18. Pearson, E.S., Adyanthāya, N.: The distribution of frequency constants in small samples from
non-normal symmetrical and skew populations. Biometrika 21(1/4), 259–286 (1929)
19. Kline, R.B.: Principles and Practice of Structural Equation Modeling. Guilford Publications
(2015)
20. Bulmer, M.G.: Principles of Statistics. Courier Corporation (1979)
21. Carifio, J., Perla, R.: Resolving the 50-year debate around using and misusing Likert scales.
Med. Educ. 42(12), 1150–1152 (2008)
22. Schunn, C.D., Wallach, D., et al.: Evaluating goodness-of-fit in comparison of models to data.
In: Psychologie der Kognition: Reden and vorträge anlässlich der emeritierung von Werner
Tack, pp. 115–154 (2005)
23. Kaptein, M.C., Nass, C., Markopoulos, P.: Powerful and consistent analysis of Likert-type
rating scales. In: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, pp. 2391–2394. ACM (2010)
24. Brunner, E., Domhof, S., Langer, F., Brunner, E.: Nonparametric Analysis of Longitudinal
Data in Factorial Experiments. Wiley, New York (2002)
25. Noguchi, K., Gel, Y.R., Brunner, E., Konietschke, F.: nparLD: an R software package for the
nonparametric analysis of longitudinal data in factorial experiments. J. Stat. Softw. 50(12)
(2012)
26. Brunner, E., Puri, M.L.: Nonparametric methods in factorial designs. Stat. Pap. 42(1), 1–52
(2001)
27. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria (2019)

Renato Paredes Venero is an early career researcher concerned
with Computational Cognitive Neuroscience and Human-Technology
Interaction. He has experience as a research assistant in the
Interaction. He has experience as a research assistant in the
fields of experimental research methods, cognitive neuroscience
and human-robot interaction. Currently, he is pursuing post-
graduate studies in Cognitive Science with the School of Infor-
matics at The University of Edinburgh. His aim is to special-
ize in computational modeling of the neural basis of cognition
and behaviour. His topics of interest are the neural mechanisms
of self-perception, motor cognition and social interaction. He
earned his undergraduate degree in Psychology from Pontifical
Catholic University of Peru (PUCP) and worked in an inter-
disciplinary research project named “Neurocognitive study of
perception of actions and emotions in the interaction between
human beings and humanoid robots” funded by the same insti-
tution. He also worked as a teaching assistant of experimental research methods and statistics in
the Faculty of Psychology at PUCP.

Alex Davila has been an associate professor in the Department of Psychology
at the Pontifical Catholic University of Peru (PUCP) since 2005
and a regular member of the American Physiological Society since 2018.
He holds a doctorate in Psychology from KU Leuven and is a candidate
for a Master's degree in Physics at the PUCP. During his doctoral studies,
he made 3D captures and applied simulation techniques of human biological
motion. He also made contributions to the psychophysical analysis of this kind
of motion. Between April 2016 and February 2018, he led at
the PUCP the interdisciplinary project “Neurocognitive study
of the perception of actions and emotions in the interaction of
the human being with humanoid robots”, contributing with his
expertise to the design and making of experiments on perception
and the adoption of techniques for acquisition, processing, and
analysis of EEG signals. He also has a specialization in CAD-
CAE-CNC-CAM (2005) and studies in Mechatronics Engineering at a diploma level (2007), both
by the PUCP. In March 2018, he migrated to the field of Materials Sciences by joining the Mate-
rials Sciences and Renewable Energies Group (MatER-PUCP) where he is working to simulate
electronic and optical properties of amorphous materials.
Advice to New Human-Robot Interaction
Researchers

Tony Belpaeme

Abstract As the field of Human-Robot Interaction is relatively young and highly
interdisciplinary, it often happens that we as researchers make mistakes which could
have been avoided had we known more about good and bad research practices in fields
other than our own. Bad practices such as convenience sampling, p-hacking, or the
Hawthorne effect will be known to some, but are too often unfamiliar to others. HRI
is lucky to mature during one of the biggest revolutions in experimental psychology:
the “replication crisis” was one of the most seminal moments in psychology and its
repercussions are felt far and wide, including in HRI. This chapter lists some of the
important mistakes often made in HRI studies and suggests practical recommendations
for better research studies.

Keywords HRI evaluations · Common mistakes · Bad practices ·
Good practices · Recommendations

1 Remember Your First HRI Study?

Imagine your first day as an HRI researcher. You are probably a computer scientist, psychologist, engineer, sociologist, or designer. Or you might have a background in another discipline relevant to HRI work; HRI draws on many of the science and
engineering disciplines. You will perhaps start by reading up on the vast and diverse
literature in HRI, but will soon move on to getting your hands dirty. You might want
to know what people’s attitudes are towards robots. You might want to design a
machine learning solution to read social cues from a range of sensors mounted on a
robot. You might want to know how effective a social robot is in supporting children
with a long-term medical condition. You could be working on a collaborative robot
which applies glue to plastic parts that a human worker then fits onto a car. Or you
might design the interaction of a companion robot for elderly users. Unless you are

T. Belpaeme (B)
Ghent University, IDLab – imec, Technologiepark 126, 9052 Ghent, Belgium
e-mail: tony.belpaeme@ugent.be
University of Plymouth, Drake Circus, Plymouth PL4 8AA, UK

a philosopher, there will probably come a time when you want to know how well
your robot works and when you want data to guide your efforts. In a nutshell: it is
time for you to run a study.
You probably already have a good idea of what your colleagues would consider
to be good work. If you are an engineer, you might value a live demonstration of
your system. You demonstrate how the robot recognises you and responds to your
spoken commands, and you use a video recording of your robot system to show
the world what technical feats you’ve accomplished. If you are an experimental
psychologist, you know your colleagues value carefully controlled experiments in
which dozens of balanced participants interact with one of two or more conditions
and where you carefully measure a number of outcomes. If you’re a sociologist,
you might have designed a survey which you send out online to capture data on how
people’s attitudes to robots differ across age groups. Each discipline and industry has
its preferred ways of collecting and communicating research and design efforts, and
during our education at university we have been trained to use these effectively within
our discipline. Designers will be skilled at presenting, psychologists at measuring,
sociologists at interviewing. However, it is unlikely that, as an inexperienced HRI researcher without formal training in HRI (there are few training programmes in HRI), you will be versed in a wide range of study methods bearing on the field. And, just like the author of this chapter, you will probably make some beginner's mistakes.
Some easy to avoid, but some serious enough to get your six-month research study
rejected by your peers or to irreparably damage a roll-out of a commercial product.
This chapter contains some of the most prevalent and fundamental errors committed
in HRI studies, and suggests potential solutions.

2 Current Practice in HRI Studies

While many of the mistakes made in HRI studies are not unique to HRI (other
disciplines are not perfect either), their prevalence is often higher due to the interdis-
ciplinary nature of HRI [2]. This section will draw upon an analysis of three years
of HRI studies published between 2013 and 2015 in the IEEE/ACM Conference on
Human-Robot Interaction, the main conference in the field [2].

2.1 Lab-Based or in the Wild?

To help us understand the lay of the land in HRI research, it will be helpful to first
define the different types of studies. A lab study requires participants to come to
a specific location, a lab at a university or company, to take part in the study. A
non-lab study, also known as an in-the-wild study, is one where the study is run
on location, for example in a factory or a hospital. Almost three quarters of HRI
studies are lab-based [2]. While there certainly are good reasons for conducting
studies in a carefully controlled lab environment, there are questions about their
ecological validity [5]. Admittedly, experimental psychology often uses lab-based
experiments to better understand human cognition, and the results obtained from lab
experiments do indeed shed light on the functioning of cognition and the interplay
of cognition with the external world. But in HRI the purpose is often not to elucidate
human cognition, but rather to assess whether one approach to HRI differs from
another. Is a robot that offers breaks to young learners more effective as a tutor [23]?
Do robots have a role to play in Autism Spectrum Disorder therapy [25]? These
questions are often better answered in ecologically valid settings, as what might appear to be effective in a controlled lab environment is likely to be washed out by the buzz and noise of the real world. Still, there are good reasons for running
lab-based studies. Sometimes the technical setup just does not travel well. It might
contain an expensive robot which cannot easily be moved out of the lab [6] or the
sensory rig is too cumbersome to dismantle, build and calibrate [11]. An attractive
middle ground is the living lab, a semi-naturalistic environment in which conditions
of natural environments are replicated. This can range from a single room, such as a
living room, to an entire house. These environments allow for complex technological
setups, while offering a perhaps more relaxed environment in which the behaviour
of the user can be more natural. At any rate, the decision between a lab, living lab or
non-lab evaluation environment should be carefully considered, and when resources
and technology allow, preference should be given to the latter.

2.2 Wizard of Oz or Full Autonomy?

A second dimension along which studies differ is the level of autonomy that the
robot has. Sometimes the interaction with the robot will run autonomously, apart
from perhaps the robot being started or stopped by the experimenter. Sometimes,
the robot will to some extent be controlled by another human, a method known as
Wizard of Oz (WoZ) [24]. Key here is that the participant is unaware of this: to the
user it appears as if the robot is fully autonomous, while in reality a remote operator,
called the wizard, takes over some aspects of the robot’s functionality. The wizard
can take over perceptual and cognitive aspects of the control. When taking over
perceptual aspects, the wizard fills in for the lack in perceptual abilities of the robot,
such as speech or vision, due to the technology not being sufficiently robust or due to
time and resource constraints in implementing autonomous perception on the robot.
If the wizard handles cognitive aspects, it means they make decisions, for example
on how the robot should respond to the user. Wizarding can even just serve as a
“stub” for functionality that is not yet sufficiently mature, or for aspects of the robot's functionality that you wish to trial before moving to an expensive implementation. About 40% of studies where people interact with a robot use autonomous robots; all others to some extent require the assistance of a human operator. When reporting your research, it is important to mention how the robot was controlled. While it might of course be necessary to use a smoke-and-mirrors approach during the experiment, the
eventual report should disclose fully to what extent the robot was wizarded. Further
reporting guidelines for WoZ experiments are available [24].

2.3 On-Screen or Real Robot?

The type of exposure to the robot is also key in a study. In some HRI studies people
will interact with a real robot, while in other studies they will see a robot on-screen,
either as a still or as a video. In some of these the robot will be shown on its own, in
others people will see an interaction unfold between one or more users and the robot.
The temptations of using on-screen presentations of robots are many. A photo or video does not crash midway through the experiment. You often don't need to program or even own a robot, but can instead use a photo [22] or an illustration [14] of one.
And finally, and perhaps most importantly, using an on-screen presentation of a
robot allows for setting up the study online, thereby giving you access to a potential
participant pool of hundreds of thousands of geographically spread respondents.
There are cases where the reasons for using an on-screen presentation outweigh the effort of using an actual robot. But interacting with a real, embodied robot is a more visceral experience, in which the user is more invested [18], so it is reasonable to expect that a study involving a real robot will in most cases produce more useful, and perhaps even different, results than one relying on pictures of robots.

2.4 Convenience Sampling or Representative Sampling?

When running HRI studies at a university there is a steady supply of willing (or readily
coerced) participants at hand: the students. Participants that are easily recruited –
for example students or visitors to an exhibition– make up a convenience sample.
Convenience samples are rife in HRI: between 2013 and 2015, when research was not conducted with children (aged under 18) or the elderly (aged over 65), 87% of studies reported at the HRI conference used samples drawn from university populations, see Fig. 1. It might be that some intended to study how well-educated,
technology-savvy people in their late teens and early twenties respond to robots, but
it is unlikely that all wanted to do so. Collecting data using a convenience sample
introduces a sample bias, and the results obtained in this way are unlikely to translate
to the real world. Convenience samples do have a place in experimental work, but
given that HRI studies often require a subjective response to the robot, it is unlikely
that a biased sample of students or colleagues will provide good data on which to
build your Human-Robot Interaction. Therefore, efforts should be made to go out
and collect data from the range of users who eventually are expected to interact with
your robot. Creating an unbiased and balanced participant pool is difficult, and is
Fig. 1 The age of participants (if at all reported) in three years of studies presented at the HRI conference, shown as the number of studies per subject average age group. A convenience sample of students is over-represented

further compounded by the fact that robots are new to most people and this first time
exposure to a robot might influence the results (see novelty effect).
Another issue to take into consideration is the size of the sample. Collecting data
from real participants can be resource intensive and time consuming, so there is a
temptation to go for a low number of participants. The exception to this is when
participants are plentiful, such as with crowd sourced studies. In an analysis of
three years of HRI studies shows that sample sizes in HRI usually are very small
when judged by standards of other fields, see Fig. 2. Small sample sizes lead to a
lack of statistical power, which in turn lead to incoherent results, a concern often
voiced in psychology [19]. The recommendation would be to go for larger and more
representative samples, which unfortunately is easier said than done, as running
studies requires considerable technical, logistic and staffing effort. In some cases,
larger samples might not even be available, for example when working in a clinical
context. The lack of quantity when it comes to data is not necessarily a problem,
and single case studies can be informative, especially in early stages of research.
However, the ambition should be to use large sample sizes in ecologically valid
studies. HRI evaluations are not for the lazy.
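A quick a priori power analysis can make the cost of small samples concrete before data collection starts. The sketch below is not taken from [2] or from any study discussed here; it is a minimal illustration using the Python statsmodels library, and the assumed effect size (Cohen's d = 0.5, a "medium" effect) is a placeholder you would replace with an estimate from pilot data or prior literature.

# A minimal power-analysis sketch (illustrative only): how many participants per
# condition a simple two-group comparison needs for a given expected effect.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,           # assumed standardised difference between conditions
    alpha=0.05,                # significance threshold
    power=0.8,                 # desired probability of detecting the effect
    alternative='two-sided')
print(f"Participants needed per condition: {n_per_group:.0f}")   # roughly 64

Even under this fairly optimistic assumption, the required sample is larger than what most of the studies summarised in Fig. 2 used.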

2.5 Single or Long-Term Interaction?

In most HRI studies participants only interact once with the robot (of 96 studies
reported in [2] only 5 consisted of more than one interaction). This of course has
rather profound implications for the field. As participants often have never interacted
with a robot before and are typically naïve to the interaction under study, there is a
Fig. 2 The sample size in three years of studies presented at the HRI conference, shown as the quantity of studies per average number of subjects per condition. Most studies use a very low number of participants. Data from [2]

novelty effect. The robot, the study setting or the interaction is new and unfamiliar
to the participants, and it is likely that this will colour their interaction with the
robot. Some outcomes might be stronger due to the novelty effect, while in some
case the novelty effect might have an adverse effect on the outcomes. For example,
young users might be intimidated by the robot, an effect which might wear off with
repeated exposure. Long-term studies, in which the interaction last longer than a brief
session or in which people interact with the robot over repeated sessions, are still
relatively rare in HRI. Still, the value of doing long-term interaction research cannot
be understated. While sometimes the novelty effect is actively sought (for example in
entertainment applications), HRI is generally concerned with how people will interact
with robots in day-to-day life and what the technical, societal and psychological
consequences and applications are. As such there is a strong desire to know how
the user’s behaviour will evolve over time once the novelty effect wears off. It is
difficult to say exactly what a long-term interaction is, but suffice it to say that the novelty effect with regard to the robot, the interaction and the environment should
have disappeared [17].

3 Setting Up a Study

Translating an appealing research idea into a well-executed study almost always involves more than you think. Luckily, many have gone before you and excellent introductions exist on how to set up experimental studies, with perhaps the best and most accessible advice coming from experimental and social psychology [10, 12].
However, even when following these introductory guides, it is still very much possible
to slip up, especially if you did not receive any formal training in experimental
methods. In this section we look at new trends and a selection of problems and
questionable practices that occur from time to time in HRI research. These are not
necessarily unique to HRI research, and there are many exemplary HRI studies which
manage to avoid all of these issues. But often, either through a lack of knowledge or
experience, or through the context in which the study is set, the study inadvertently
falls foul of one or several of these problems.

3.1 Null-Hypothesis Significance Testing

If quantitative data is collected during a study, there will be a need to compare data. Most HRI studies, just as in other experimental sciences, adopt the practice
of using Null-Hypothesis Significance Testing (NHST). NHST needs a minimum
of two conditions (the intervention and control, in analogy to the medical sciences
where patients are given a treatment which is compared against no treatment or an
established treatment, called the control). In each condition, one or more outcomes
are measured. NHST then tests the hypothesis that the data distribution from the intervention condition (characterised by sample size, mean and standard deviation when the data has a normal distribution) does not differ from the distribution of the control condition. This is called the null hypothesis. If the null hypothesis can be “rejected”, then the result is called “significant”. Rejection is based on a statistical test, such as Student's t-test, which returns a p-value. This value is the probability of observing a difference at least as large as the one you see between the two conditions if there were in fact no underlying difference. If the p-value is less than or equal to some threshold, for example p ≤ 0.05, the difference is, by convention, called significant. The large majority of HRI studies will use NHST and will
report p-values to support their conclusions.
However, in recent years NHST has come under fire [26]. A first problem is
the term “significant”, as it gives the uninitiated the impression that the results are
important. Unfortunately, significant in a statistical context really does not mean that
the results are important, substantial or major. It just means that it is highly unlikely
that the difference we see between the two conditions is a coincidence. Even if the
difference is very small, the results could still be called statistically significant if the statistical test says so. Imagine you are testing two versions of a robot –a polite robot and a direct robot– that delivers parcels in an office building. You do 100 runs of each condition. The polite robot has a mean time to finish its delivery round of 82 min (standard deviation = 8.0 min). The direct robot has a mean of 84 min (standard deviation = 4.0 min). If you calculate the two-tailed t-test, the p-value is 0.0265, so the difference in mean time to complete a delivery round is
statistically significant. But does it really matter? Honestly? Too often statistically
significant results are presented as scientifically significant, while they are not. While
this is a problem of how we communicate results, the other problems with NHST
are more profound.
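To make the arithmetic of this example explicit, the following sketch (not part of the original text) recomputes it in Python from the summary statistics alone, and adds the effect size that the next section argues should always accompany the p-value. The numbers are the hypothetical ones from the delivery-robot example above.

from math import sqrt
from scipy.stats import ttest_ind_from_stats

# Summary statistics of the hypothetical delivery-robot example: mean, SD, runs
m1, s1, n1 = 82.0, 8.0, 100   # polite robot
m2, s2, n2 = 84.0, 4.0, 100   # direct robot

t, p = ttest_ind_from_stats(m1, s1, n1, m2, s2, n2)   # two-sided, pooled-variance t-test
pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
cohens_d = abs(m1 - m2) / pooled_sd                    # standardised effect size

print(f"p = {p:.4f}")                 # about 0.026: "statistically significant"
print(f"Cohen's d = {cohens_d:.2f}")  # about 0.32: a modest effect of two minutes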
The first problem is the accepted probability of wrongly rejecting the null hypothesis. Typically, we assume that the p-value needs to be less than 0.05 before a result is called significant. This number is however arbitrary, and different scientific fields will use different threshold values (physics often uses 0.01 or 0.001). Determining whether a result is significant based on an arbitrary threshold seems at odds with the precision we usually expect from our scientific methods. The second problem is that
p-values tend to fluctuate between repeated experiments. You would expect that when
an experiment is repeated, the p-value would be similar. Unfortunately simulations
have shown that p-values tend to be very unstable [9]. An experiment that first shows
significant results is very likely to not be significant when run a second time!
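This instability is easy to demonstrate. The short simulation below is only a sketch in the spirit of [9], not code from that paper: it repeatedly draws fresh samples for two conditions that differ by a fixed, modest amount and records the p-value of each "experiment"; the assumed effect size and sample size are illustrative.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
true_effect = 0.4      # assumed standardised difference between the two conditions
n_per_group = 30       # a typical small HRI sample
p_values = []

for _ in range(1000):  # run the "same" experiment 1000 times
    control = rng.normal(0.0, 1.0, n_per_group)
    intervention = rng.normal(true_effect, 1.0, n_per_group)
    p_values.append(ttest_ind(control, intervention).pvalue)

p_values = np.array(p_values)
print(f"smallest p: {p_values.min():.4f}, largest p: {p_values.max():.2f}")
print(f"share of runs with p < 0.05: {np.mean(p_values < 0.05):.0%}")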

3.2 The New Statistics

As NHST and the reporting of p-values have been shown to be problematic [9, 26],
what alternatives are there? A first recommendation is to report descriptive statistics
of any data, a practice too often ignored in HRI studies. Descriptive statistics should
at a minimum include the size of the sample, the mean, the standard deviation and
details of the population (mean, range and standard deviation of age, and gender
distribution) and details on how participants were recruited.
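With the data organised as one row per participant, most of these descriptives take only a few lines; the sketch below is illustrative and its file and column names (condition, rating, age, gender) are hypothetical.

import pandas as pd

df = pd.read_csv("study_results.csv")   # hypothetical file: one row per participant

# Outcome per condition: sample size, mean and standard deviation
print(df.groupby("condition")["rating"].agg(["count", "mean", "std"]))

# Sample description: age and gender distribution
print(df["age"].agg(["mean", "std", "min", "max"]))
print(df["gender"].value_counts())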
When comparing results, we're tempted to go for a t-test or, when comparing more than two groups, for an ANOVA and post-hoc tests. But given the problems with p-values described above, it might be worth considering alternatives. Kaptein and Robertson [15] propose a number of solutions. One is to use Bayesian statistics, which compute the probability that a hypothesis is true. A Bayesian t-test computes the ratio of the likelihoods of two competing hypotheses, for example the null hypothesis and an alternative hypothesis. Various tools exist, such as the BayesFactor package for R or the free JASP tool, making the use of Bayesian statistics
rather less of a pain than a few years ago. Still, despite the ease with which Bayesian
statistics can now be computed, traditional approaches to statistics are here to stay.
When using traditional statistics, next to reporting p-values, it is very much worth
reporting effect sizes as well. All too often there is a lot of enthusiasm for a p-value
less than 0.05, without asking critical questions about the effect size of the result.
If there is a difference between two conditions, how big is that difference and does
it actually matter? Significance, as described above, can be a coincidence, but can
equally be the result of large sample sizes. If, for example, responses are collected
using crowd sourcing, where hundreds of responses can be collected in a matter of
hours, it is likely that a significant effect will be found through the sheer size of N
(i.e. the sample size). Therefore it is important to critically look at the effect sizes,
and always mention these together with p-values.
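As a rough illustration of what reporting an effect size involves, the sketch below computes Cohen's d, one widely used standardised effect size, for two small made-up samples; it is not tied to any study in this chapter.

import numpy as np

def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

control = np.array([3.1, 4.0, 3.5, 4.2, 3.8, 3.6])        # hypothetical ratings
intervention = np.array([4.1, 4.6, 3.9, 4.8, 4.4, 4.3])   # hypothetical ratings
print(f"Cohen's d = {cohens_d(intervention, control):.2f}")

Conventionally, values around 0.2, 0.5 and 0.8 are read as small, medium and large effects, although what matters in the end is whether the difference is meaningful for the application.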
Perhaps the most important advice, apart from using the most appropriate report-
ing of statistics, is to make use of the now available options to plot data. Gone are the days when the printer could only handle a simple black-and-white bar chart with error bars and an asterisk to mark significance. Modern tools, such as the Python
seaborn library, can be used to create plots capturing all the richness of experimental
data. Observations can be plotted alongside bar charts, violin plots show the density
of the data, evolution of participants can be traced with hairlines. A plot can now
capture the richness and complexity of data which is all too often lost through only
using descriptive statistics. And with online papers now the norm, there is no reason
to hold back on using colour to convey information (while keeping in mind that some
of your readers might be visually impaired).
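As a small illustration (not taken from the chapter), the sketch below uses seaborn to overlay every individual observation on a violin plot, so the reader sees the raw data and its distribution rather than just a bar with an error bar; the data are synthetic and the column names hypothetical.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "condition": ["control"] * 40 + ["robot tutor"] * 40,
    "rating": np.concatenate([rng.normal(3.5, 0.8, 40), rng.normal(4.0, 0.8, 40)]),
})

sns.violinplot(data=df, x="condition", y="rating", inner=None, color="lightgrey")
sns.stripplot(data=df, x="condition", y="rating", color="black", size=3)  # raw observations
plt.ylabel("Self-reported enjoyment")
plt.tight_layout()
plt.savefig("ratings.png", dpi=300)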

3.3 Selective Publication of Data

The temptation to only report the data that supports your agenda has always existed.
Imagine, you set up a study to show how a robot can make a positive contribution when
used as a tutor in an education context [3]. After building the setup and painstakingly
programming the robot to teach children mathematics, you take the robot into a classroom and, when comparing the results of the pre-test and post-test, the children turn out to have learnt nothing. Now, the temptation might be to look for a silver lining in your study. If the children didn't learn, might they still have enjoyed the robot? So you reach for your questionnaire data, in which children reported how much they liked the robot, next to answering a number of other questions on children's attitudes towards robots. And indeed, the children self-report to have enjoyed the robot very much. You decide to report this as your main result, and ignore the fact that the learning effect was not significant.
Why is this selective use of evidence a problem? Surely, the fact that the children
enjoy working with the robot is worth reporting? However, there are two problems in
our example. One is that the results pertaining to the hypothesis on which the study
was based are not reported, pretending that the study was intended to test something
else. The second is that your colleagues might set up a similar experiment, and
not knowing about your negative result, they cannot build upon your work. Luckily
cherry picking and selective use of data are relatively innocent in HRI, and often
occur through ignorance or a desire to put forward the most interesting results, rather
than a conscious manipulation of data. Unlike other fields, such as climate studies
or medicine, it is very unlikely that biased or selective reporting of data in HRI will
cause ecological or personal harm. Still, in days of online sharing of data and results,
the page limit of a paper really shouldn’t be a reason to withhold data. Open data
initiatives abound and a paper can be accompanied by an online data repository,
sharing not only data, but extended information on methods and perhaps even code
related to the experiment and data analysis.
3.4 The Hawthorne Effect

The Hawthorne effect is named after the Hawthorne Works, a telephone equipment
factory outside Chicago, where in 1932 the management brought in psychologists
to see whether workers could be made more productive by making changes to the
working environment [16]. Different aspects were varied; for example, the amount of light was changed under the assumption that good lighting should increase productivity. What the psychologists found was that no matter the changes to the environment,
productivity increased. Whether lighting was increased or decreased, productivity
improved. In the end, it turned out that it was the presence of the psychologists that
increased productivity in the workforce, not the changes to the working environ-
ment. In its broadest sense, the Hawthorne effect refers to a change in the behaviour
of people because they feel observed. Whether they are actually observed or not is
irrelevant, the mere belief of being observed is often sufficient to change people’s
behaviour. Although the original study has been criticised [1], the concept of the
Hawthorne effect is still very relevant.
In the context of HRI research, the Hawthorne effect becomes acutely relevant.
In research studies participants are often aware they are taking part in a study, for
example through being recruited or through signing a consent form. Even if no
experimenters or video equipment is visibly present during the study, the mere fact
of taking part in an experiment will already change the participant’s natural behaviour
and responses. This often leads to unexpected results, or the changed behaviour of the participants washes out small effects [13].
One way to manage the Hawthorne effect is to not let people know they are being
observed. This is sometimes possible in natural settings. Nomura et al. [20] observed
how children “bullied” a robot operating in a shopping mall. The bullying behaviour
would likely not have occurred had the children known they were being watched. In
a lab environment, one can try to use deception to let the participant believe the study
is about something else, while the relevant aspect of the study is only revealed at the
end. Care needs to be taken with deception or with not informing people they are part
of a research study, and many ethics committees, often known as the Institutional
Review Board (IRB) in US institutions, will frown upon the use of deception where
it can at all be avoided. Still, the persistence of the Hawthorne effect does justify the
judicious use of deception or naturalistic observation.
In HRI there is an additional complication when using social robots. It is not only
the belief that one is being observed by an experimenter that impacts behaviour, but
the social presence of the robot itself might also change people’s behaviour. This is
known as social facilitation, or audience effect, and states that people’s performance
will change in the presence of others as compared to their performance when alone.
Typically, people tend to perform better at simple tasks or tasks they are skilled at,
and worse at new or hard tasks. In HRI the social facilitation is not caused by other
people being present, but could be caused by an artificial social agent, i.e. the robot,
being present [27].
Overall, it is safe to assume that the presence of a social robot or the belief that
one is being observed will change people's behaviour. This is not necessarily a bad thing; in some applications you intend for people to feel watched. If a robot is used to encourage people to choose healthy snacks over chocolate or to use the stairs instead of the lift, then feeling watched is exactly what promotes the healthy behaviour, an effect which might only be strengthened by feeling observed by the experimenters as well.

3.5 Crowdsourcing Data

Collecting a large number of responses in HRI is both time intensive and expensive,
and getting access to participants typically unavailable in or near your institution, such
as participants from different cultures or geographical regions, is difficult. Crowdsourcing provides a convenient alternative for collecting responses quickly and cheaply. Once the study is set up and running, it often takes a matter of hours for
hundreds of responses to come in at the price of just a few hundred US dollars. Some
HRI studies will rely on online crowdsourcing platforms to collect experimental data,
the best known of which is Amazon Mechanical Turk (often abbreviated to AMT or
MTurk). Others exist as well, such as Figure Eight, Clickworker, or Microworker,
but crowdsourcing platforms tend to pop up and disappear all the time.
Ever since crowdsourcing became available in 2005, the quality of crowdsourced
research data has been discussed. Overall, the consensus seems to be that crowd-
sourcing allows you to reach more diverse participants, that the quality of the data
is relatively good as long as an appropriate financial reward is given, and that data overall is as reliable as data collected using traditional methods [4, 7]. Results from experi-
mental psychology obtained by using traditional methods are often replicated using
crowdsourced data, convincingly showing the robustness of the method.
However, careful screening of data and participants is needed, and all too often in the excitement of setting up a study this aspect is glossed over. Several methods exist to identify participants and responses that do not meet quality criteria. Metadata of the participant's work, such as completion time and depth of responses, can be used to filter participants. Crowdsourcing platforms also allow you to set a threshold on who can take part, giving you the option to only admit participants who have been shown to provide high-quality data in the past. Checker questions and gold questions are also an important tool for filtering out bad data: these are questions to which there is an indisputably correct answer. What is the colour of the robot's shirt in the video? The robot mentioned a number, what was that number? If a worker does not answer these questions correctly, their data should be discarded. Finally, the remuneration and incentives are important. If the task is enjoyable, the interface is easy to understand and the pay attractive, you can expect to get better data. Still, expect to have to throw away a significant amount of crowdsourced data. In some HRI studies, up to 50% of collected data had to be thrown out.
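A concrete, and purely illustrative, way to apply such screening is a handful of pandas filters; everything in the sketch below (the file, the column names and the "correct" gold answers) is hypothetical and would be tailored to your own task.

import pandas as pd

df = pd.read_csv("crowd_responses.csv")   # hypothetical export from the platform
n_before = len(df)

# Drop workers who rushed through the task (completion time in seconds)
df = df[df["completion_time"] >= 60]

# Drop workers who failed the checker/gold questions
df = df[df["shirt_colour"].str.lower() == "blue"]
df = df[df["robot_number"] == 7]

print(f"Kept {len(df)} of {n_before} responses ({len(df) / n_before:.0%})")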
With HRI there is another issue which needs to be carefully considered. Crowd-
sourcing only allows for on-screen presentation of stimuli, and often the interactive
character is severely limited. While lab or field experiments usually involve a real, live encounter with a robot, this is all but impossible to replicate on a crowdsourcing platform. Nevertheless, HRI studies do frequently use crowdsourcing platforms to collect responses about interactions and design aspects of robots, but one should realise that an on-screen presentation of robots is a poor second to interacting with or seeing an actual robot.

3.6 The Replication Crisis

The replication crisis refers to a key moment in science, starting in social and experi-
mental psychology but spreading to other fields of science, including medicine, where
it was realised that a worryingly large number of positive research results were dif-
ficult to replicate or reproduce. Results which seemed solid enough to be part of
student textbooks and popular science books, suddenly could not be reproduced
when the original experiments were repeated by independent researchers or even by
the original researchers. In 2011 a string of scandals tore through the psychology
world, when fraudulent scientists and questionable results were exposed in quick
succession [21]. The case of social psychologist Diederik Stapel was particularly
alarming. Stapel’s research had seemed solid and often caused quite a sensation,
such as the study that showed that meat eaters are more selfish than vegetarians,
but was revealed to be based on fabricated or manipulated data. 58 publications by
Stapel and his co-workers were withdrawn as they were based on suspected or proven
fraudulent data.
This sparked a critical appraisal of psychology research, and research in general,
leading to the establishment of the Open Science Framework (OSF). The OSF set itself the task of promoting good research practice and, to get a view on how widespread the lack of reproducibility was, it asked teams across the world to replicate 100 high-profile psychology studies. Of these 100 replications, only about half managed to reproduce the results of the original studies [8].

4 Move Away from the Low Hanging Fruit of Short-Term Studies

In HRI, as in other scientific fields, we have a tendency to first go for the low hanging
fruit. These are often studies that are short-term and that can go from conception to completion with relatively few resources. Most studies use a single short exposure of the user to a robot. In some cases there is not even an actual interaction; instead, participants are shown images or short videos of robots. The value of visual
presentations of robots as a substitute for actual interaction is very much contested in the field. While many of these studies have exploratory value, it is unclear whether their results will still hold once the interaction with robots moves from the short term to the long term.
A study requires at least a robot and a researcher, participants and time. Moving
from short-term studies to long-term studies means that more of these resources will
be needed. When running long-term studies additional goodwill will be needed from
participants: it is relatively easy to convince people to come in for a quick 20 min
interaction study, but you will need to be very persuasive to get participants to return
for 10 sessions or to have a robot in a real-world environment for several weeks. Still, try we must, as the pursuit of HRI is to build the technical systems that support interaction and to take them into the real world, where they will hopefully be used for longer than 20 min.
My team has experienced this first hand when we ran repeat interaction studies
and had too many participants drop out before we reached the third interaction. The
solution to getting everyone to return for repeat interactions turned out to be surpris-
ingly simple: pay participants well, but only hand out the payment after they have completed all sessions. We made the mistake of rewarding participants after each session by
offering a hot drink, clearly not enough of an incentive to come back for more.

5 Conclusion

This chapter covered a selection of contemporary contentious issues in HRI and in experimental work at large. Many fields, including HRI, are realigning to a new
reality. A reality where we desire more rigour from experimental science and more
honesty in our reporting. While many solutions have been proposed to these prob-
lems, especially in experimental psychology, the consensus on what good practice
is in HRI is still emerging. Our field would benefit from adopting agreed upon
practices from psychology and medical sciences, including new practices such as
pre-registering studies, where the hypotheses and protocol are made public before
the start of the actual study, taking away the opportunity to change the hypotheses
to fit the data. HRI is fortunate to grow at a time when the global research culture
is changing, moving towards a high-quality and transparent culture. HRI, as a field,
can only benefit from these developments.

Acknowledgements The author would like to thank Paul Baxter, Bahar Irfan, James Kennedy, Fotios Papadopoulos, Séverin Lemaignan, and Emmanuel Senft for the discussions and insights used in this chapter.
References

1. Adair, J.G.: The Hawthorne effect: a reconsideration of the methodological artifact. J. Appl.
Psychol. 69(2), 334 (1984)
2. Baxter, P., Kennedy, J., Senft, E., Lemaignan, S., Belpaeme, T.: From characterising three
years of HRI to methodology and reporting recommendations. In: The Eleventh ACM/IEEE
International Conference on Human Robot Interaction, pp. 391–398. IEEE Press (2016)
3. Belpaeme, T., Kennedy, J., Ramachandran, A., Scassellati, B., Tanaka, F.: Social robots for
education: a review. Sci. Robot. 3(21), eaat5954 (2018)
4. Berinsky, A.J., Huber, G.A., Lenz, G.S.: Evaluating online labor markets for experimental
research: Amazon.com’s mechanical turk. Polit. Anal. 20(3), 351–368 (2012)
5. Berkowitz, L., Donnerstein, E.: External validity is more than skin deep: some answers to
criticisms of laboratory experiments. Am. Psychol. 37(3), 245 (1982)
6. Boucher, J.D., Pattacini, U., Lelong, A., Bailly, G., Elisei, F., Fagel, S., Dominey, P.F., Ventre-
Dominey, J.: I reach faster when i see you look: gaze effects in human-human and human-robot
face-to-face cooperation. Front. Neurorobotics 6, 3 (2012)
7. Buhrmester, M., Kwang, T., Gosling, S.D.: Amazon’s mechanical turk: a new source of inex-
pensive, yet high-quality, data? Perspect. Psychol. Sci. 6(1), 3–5 (2011)
8. Collaboration, O.S., et al.: Estimating the reproducibility of psychological science. Science
349(6251), aac4716 (2015)
9. Cumming, G.: Replication and p intervals: p values predict the future only vaguely, but confi-
dence intervals do much better. Perspect. Psychol. Sci. 3(4), 286–300 (2008)
10. Dawson, C.: Introduction to research methods. In: A Practical Guide for Anyone Undertaking
a Research Project, 5th edn. Robinson (2019)
11. Esteban, P.G., Baxter, P., Belpaeme, T., Billing, E., Cai, H., Cao, H.L., Coeckelbergh, M.,
Costescu, C., David, D., De Beir, A., et al.: How to build a supervised autonomous system for
robot-enhanced therapy for children with autism spectrum disorder. Paladyn J. Behav. Robot.
8(1), 18–38 (2017)
12. Field, A., Hole, G.: How to Design and Report Experiments. Sage (2002)
13. Irfan, B., Kennedy, J., Lemaignan, S., Papadopoulos, F., Senft, E., Belpaeme, T.: Social psychol-
ogy and human-robot interaction: an uneasy marriage. In: Companion of the 2018 ACM/IEEE
International Conference on Human-Robot Interaction, pp. 13–20. ACM (2018)
14. Kalegina, A., Schroeder, G., Allchin, A., Berlin, K., Cakmak, M.: Characterizing the design
space of rendered robot faces. In: Proceedings of the 2018 ACM/IEEE International Conference
on Human-Robot Interaction, pp. 96–104. ACM (2018)
15. Kaptein, M., Robertson, J.: Rethinking statistical analysis methods for CHI. In: Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1105–1114. ACM
(2012)
16. Landsberger, H.A.: Hawthorne revisited: management and the worker, its critics, and develop-
ments in human relations in industry (1958)
17. Leite, I., Martinho, C., Paiva, A.: Social robots for long-term interaction: a survey. Int. J. Soc.
Robot. 5(2), 291–308 (2013)
18. Li, J.: The benefit of being physically present: a survey of experimental works comparing
copresent robots, telepresent robots and virtual agents. Int. J. Hum.-Comput. Stud. 77, 23–37
(2015)
19. Maxwell, S.E.: The persistence of underpowered studies in psychological research: causes,
consequences, and remedies. Psychol. Methods 9(2), 147 (2004)
20. Nomura, T., Kanda, T., Kidokoro, H., Suehiro, Y., Yamada, S.: Why do children abuse robots?
Interact. Stud. 17(3), 347–369 (2017)
21. Pashler, H., Wagenmakers, E.J.: Editors’ introduction to the special section on replicability in
psychological science: a crisis of confidence? Perspect. Psychol. Sci. 7(6), 528–530 (2012)
22. Phillips, E., Zhao, X., Ullman, D., Malle, B.F.: What is human-like?: decomposing robots’
human-like appearance using the anthropomorphic robot (ABOT) database. In: Proceedings
of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pp. 105–113.
ACM (2018)
23. Ramachandran, A., Huang, C.M., Scassellati, B.: Give me a break!: personalized timing strate-
gies to promote learning in robot-child tutoring. In: Proceedings of the 2017 ACM/IEEE Inter-
national Conference on Human-Robot Interaction, pp. 146–155. ACM (2017)
24. Riek, L.D.: Wizard of oz studies in HRI: a systematic review and new reporting guidelines. J.
Hum.-Robot. Interact. 1(1), 119–136 (2012)
25. Robins, B., Dautenhahn, K., Te Boekhorst, R., Billard, A.: Effects of repeated exposure to a
humanoid robot on children with autism. In: Designing a More Inclusive World, pp. 225–236.
Springer (2004)
26. Vidgen, B., Yasseri, T.: P-values: misunderstood and misused. Front. Phys. 4, 6 (2016)
27. Woods, S., Dautenhahn, K., Kaouri, C.: Is someone watching me?-consideration of social
facilitation effects in human-robot interaction experiments. In: Proceedings. 2005 IEEE Inter-
national Symposium on Computational Intelligence in Robotics and Automation. CIRA 2005,
pp. 53–60. IEEE (2005)

Tony Belpaeme is Professor at Ghent University and Professor in Robotics and Cognitive Systems at the University of Ply-
mouth, UK. He received his PhD in Computer Science from
the Vrije Universiteit Brussel (VUB) and currently leads a team
studying cognitive robotics and human-robot interaction. He
coordinated the H2020 L2TOR project, studying how robots
can be used to support children with learning a second lan-
guage, and coordinated the FP7 ALIZ-E project, which stud-
ied long-term human-robot interaction and its use in paediatric
applications. He worked on the FP7 DREAM project, studying robot-assisted therapy for Autism Spectrum Disorder. Starting
from the premise that intelligence is rooted in social interaction,
Belpaeme and his research team try to further the science and
technology behind artificial intelligence and social human-robot
interaction. This results in a spectrum of results, from theoretical
insights to practical applications.
Editors’ Personal Conclusions

Tony Belpaeme
My team and I have probably made all the beginner's mistakes that could have been
made. Luckily all of them were caught by us, by an eagle-eyed colleague or by an
observant reviewer, before the results made it to publication. When we started our
forays into social robotics back in 2005, we all were engineers fascinated by the
uncharted territory that was HRI. We built robots and software, something we were
quite good at, and then observed how people interacted with our systems, for which
we were ridiculously unequipped. We wanted to see how well people interpreted the
eye fixations of our new robot [2] and so started measuring (engineers know how to measure), but we didn't know which statistics to use to show that our measurements mattered. So, we just picked a statistical test that looked sort of alright. Of course, we used the wrong test (we should have used an ANOVA instead of several t-tests); luckily the reviewers kindly pointed out the flaw and gave us the chance to fix our statistics. When we built interfaces for social robots for eldercare, our software was
sent to Italy for a field trial [3]. Dozens of over 80-year-olds were asked to input their
weekly shopping into the robot. When the results came back several weeks later,
things seemed perfect, almost too good to be true. All participants in our study really
got on with the interface and were able to order their shopping through the robot with
only minor hitches. But when we looked at the video logs, we noticed that our helpful
Italian research assistant was standing behind the elderly participants, pushing the
buttons for them. We hadn't measured how good Italian nonnas and nonnos were at HRI, but instead measured how compassionate Italian research assistants are. When in 2014 we ran a huge long-term study using robots in the classroom [1], we got
some fishy results back. Our robots stayed in the school for several weeks, placed at
the back of two classrooms. They were always on and whenever a child was allowed
by the teacher, the child would identify themselves to the robot by tapping their name
on a screen and would start a tutoring session with the robot.
One of the boys in our cohort was showing odd results: he suddenly seemed rather
good at prehistory. The striking thing was that he seemed to be learning after the
end of the school day, when the robot was in the classroom and he was supposed
to be at home. Inspection of the video logs solved the mystery. At the end of the
day, one of the cleaning ladies sat herself down in front of the robot, and using the
boy’s identity, treated herself to a tutoring session on prehistory. This showed that
having a video recording of your experiment is always a good idea, and that not only
children but also cleaning staff are keen on tutor robots. While it is unlikely that
reading this book will help you avoid all mistakes in experimental HRI, perhaps it
will help you avoid some of the booboos, slips and blunders that the authors of this
volume made. Perhaps the best way to do high quality HRI studies is to just jump
in, do the experiment and be critical of every decision you make. Good luck and go
build that future of robotics.
Cindy Bethel
In order for data and results to be compared across human–robot interaction (HRI)
studies, it is important to attempt to establish standardized measures of evaluation.
This will help not only in the interpretation of the data collected but also with the even-
tual integration and acceptance of robots into our daily lives. The establishment of
standardized metrics is a significant challenge because, to date, each human–robot
interaction study has often used different approaches, variables, and methods of
evaluation. The area of human–robot interaction is relatively new and therefore each
study is typically unique. Researchers will often develop their own metrics for
evaluating the results of the interactions they are studying. Although this approach
overall has worked well, as the field grows it becomes more important for reliability
and the replicability of results that some standard metrics are developed and used by
researchers in the field. Having validated measures, similar to those used in the fields of psychology and sociology, would be helpful for HRI studies. As the field matures, researchers are now more concerned with validating their surveys and designing them so that they can be applied to different scenarios for human studies
in HRI. This will be helpful to the field as a whole as it continues to evolve. The
development of standardized measures of evaluation helps to provide credibility to
the research being performed and also helps make it easier to perform consistent and
reliable studies. Progress in this area is important; even though it may not be possible to completely standardize all measures, having at least some measures available can be beneficial to the human–robot interaction research community.
Dimitrios Chrysostomou
In this book, we attempt to shed light on the challenges that arise in the accurate evaluation and standardisation of Human–Robot Interaction experiments in several scientific domains.
Initially, we looked comprehensively into the different ways that humans communicate, and we examined the different characterisations of social robots. It was also essential to examine the several ways of conducting studies for human–robot interaction experiments before looking into ways of evaluating them accurately. As most of the studies use the same suite of statistical analysis tools and methods, it was useful to highlight the lessons learned from statistical analyses and to share insightful advice for new researchers in the field.
Additionally, we addressed the current practices in user evaluation of companion robots and proposed a process for finding and using the appropriate scales in
a questionnaire for a study. We presented the USUS Goals evaluation framework,
which is widely used in HRI studies, and we also attempted to propose a new stan-
dardised questionnaire to evaluate the tendency for anthropomorphisation. In addition, we evaluated the general methodologies used for user experience in HRI and, more specifically, the study of Human–Robot Interaction through ethology and ethnography. Lastly, for every study that will require an interview, we propose a process for designing and conducting a semi-structured interview.
After looking into all these aspects, we can conclude that evaluation of human–
robot interaction experiments is not an easy task. The variety of aspects that have to
be taken into consideration and the nature of the experiments themselves allow only
minimal room for standardisation. Since every experiment is distinctively different, we can only agree on introducing standardized processes for the tools used in HRI experiments, such as statistical analyses, interviews and questionnaires. Naturally, as standardisation limits innovation, we still have a long way to go before we unanimously agree on the specific metrics and indicators that everyone across the HRI scientific field should use. Hopefully, this book closes the gap by summarizing the experiences and lessons learned by several distinguished researchers working with robots in an interdisciplinary context and by offering recommendations on best practices.
Nigel Crook
Human–robot interaction (HRI) has become a well-established and internationally
recognised multidisciplinary field of research and development in its own right. In
recent years, the advent of so-called ‘social robots’ (i.e. robots that are explicitly
designed for social interaction) has intensified the need for rigorous HRI studies
that give us a deeper understanding of the complex social interactions that can now
occur between people and machines. Although HRI is a mature field of research,
there still is a notable lack of standardisation and consistency in the methods used to
scientifically evaluate interactions between humans and machines. To this end, this
book is very timely and the work that it presents speaks directly to the heart of the
problem of the standardization of methods in HRI studies.
The particular strength of the work presented here is that it brings some unique
and helpful perspectives on HRI together with clear guidance on the application of
evaluation methods that both the experienced and the novice HRI researchers will
find useful. The book also offers healthy cross-disciplinary perspectives on HRI
research, drawing extensively on computer science, robotics and psychology, but also
giving insights from ethology, anthropology, sociology, ergonomics, philosophy and
user experience design.
Another strength of the book is that it helps the reader to develop a deeper
understanding of both humans and robots outside the HRI context. Human to human
communication is clearly an important analogue for investigating HRI. Similarly,
being able to reliably assess the tendency of humans to anthropomorphize is essen-
tial to evaluating the impact a robot may have on an individual. Furthermore, it is
important for researchers to understand the impact that various attributes of robots
may have on an HRI study; a robot's appearance, intelligence, degree of autonomy and the
extent of its social competence can all have a significant influence on how people
interact with a robot. The book adds to this mix an extremely helpful introduction
to, and a proposed standardisation for, human experience design which will enable
HRI researchers to design their experiments more from the human’s perspective.
The core contribution of this book, however, is a very well informed discussion
on HRI experimental research methodology, which includes a helpful introduction
to HRI and to the (re)use of questionnaires for new researchers, and a comprehensive
survey of current HRI practice. Overall, this book offers many valuable insights and
useful guides to support HRI and to help the field reach full maturity in terms of
scientific rigour.
Marine Grandgeorge
Working on such a book was a great opportunity to consider a real interdisciplinary approach to human–robot interactions, highlighting that such research is really challenging!
A robot could be considered as a full entity, with its own characteristics (as explained in Chapter 2); sometimes one could even argue that robots will become a new species. Whether that is right or wrong, we should conduct robust and reproducible research on our interactions and relationships with them (as supported and explained in this book), to better understand them and, maybe, to develop a positive daily life with them. For that, as I explain in my chapter, ethology appears to be one of the most suitable approaches, combined with other sciences (e.g. ethnology, psychology…).
Moreover, in April 2019 in Nature, a consortium of researchers proposed to go fur-
ther (https://www.nature.com/articles/s41586-019-1138-y). They proposed to “study
machine behaviour [by] incorporat[ing] and expand[ing] upon the discipline of com-
puter science”. As we suggested here in this book with ethology, they argue that
“understanding the behaviour of artificial intelligence systems is essential to our
ability to control their actions, reap their benefits and minimize their harms.” This new movement claims that we need to create a new scientific discipline to understand the behaviour of machines, as was previously done for the study of animal behaviour (i.e. ethology).
Whatever choices or paths are taken, we need to keep in mind that a systemic and interdisciplinary approach remains essential for all research that touches on major societal issues, as our relationship to robots does! And I hope that this book will contribute a new step to this story.
Céline Jost
This book is composed of high-quality contributions of which all of us can be proud. Each of us gave the best of her/himself. And I'm really glad about that. Together, as a multidisciplinary group, we were able to highlight numerous issues existing in too many evaluations presented in the literature. To be clear, I do not blame anyone. My first evaluations contained a lot of mistakes. And I did not know that. I would have liked this book to have existed to guide me, and I am delighted that it exists now to guide new researchers. I am also delighted with the team who built this book,
half composed of experts on robots, and half composed of experts on humans. It is a simplified view, because precise words are difficult to find, but one can understand the idea. This book represents plenty of points of view. Last, I am delighted too because this book not only points to existing problems but also proposes concrete solutions through its numerous contributions. It contains a road map for designing evaluations from A to Z, an explanation of how to create or reuse questionnaires, an explanation of how to create a semi-structured interview, an easy and pedagogical explanation of how to use statistics in our contexts, some invaluable testimonies on how different disciplines evaluate Human–Robot Interaction, and much advice on avoiding current mistakes. Thus one can learn that it is important to correctly formalize research questions, to choose a suitable robot, to carefully recruit participants, to choose a suitable context and appropriate tasks, to avoid biases that would distort results, to design an evaluation that can be reproduced, to compare experimental conditions with each other in order to interpret results, to correctly use statistics, and so on.
In all cases, as far as I am concerned, I had already learned that. I learned that
when I worked with colleagues who are ethologists. I discovered this science almost
a decade ago, and it was a revelation. Ethology observes existing situations, observes animals (including humans) interacting with their natural environment, and makes a lot of discoveries. The starting point is exploratory and without preconception. And I think that it is exactly the science suited to evaluating Human–Robot Interaction. Ethologists use the same methods to evaluate animal–animal, human–human, human–animal, animal–robot, human–robot, and even robot–robot interaction. Robots can be considered as a new species among others. Of course, evaluating humans, animals or robots does not serve the same objectives. For example, we can evaluate Human–Robot Interaction to verify whether the robot's behavior meets our objectives and, if it does not, to change the robot's behavior. We aim at modulating robots according to our needs, whereas when observing living beings, we observe in order to learn from them (and normally not to try to change them, but this is another debate). In all cases, even if the reason for evaluating is different, the methods are similar. And ethology is a mature multidisciplinary science that also borrows suitable methods from other disciplines. From my point of view, ethology is an adequate science for evaluating Human–Robot Interaction in our context.
Brigitte Le Pévédic
Why do we evaluate and compare? For me, it is not a question but something obvious. I am a scientist! We must position our work in relation to the work of others. Therefore, at the beginning of this work, the need for standardization in Human–Robot Interaction was obvious to me. To compare studies, we must standardize, because we cannot compare things that are different. But the further the work advanced, the weaker my conviction became.
Cognitive science, cognitive psychology, sociology, ethology, ergonomics, and so on are different disciplines that use different methodologies, but all are readily available for Human–Robot Interaction studies and research. Are there intersections, common points, or similar methods among these different disciplines?
The sharing of experiences and methodologies about HRI evaluations is both a necessity and a source of richness.
Evaluating an application or an interaction on a robot is complex because goals differ and there are no common criteria for evaluating HRI: it is difficult to compare experiments.
Every science can contribute to the field of HRI, but why should Human–Robot Interaction not become a new science or discipline of its own?
I expected to find a method of standardization for Human–Robot Interaction, but I found only many questions. What an adventure, though!
Nicole Mirnig
Although Human–Robot Interaction is still a somewhat young research field, it has been around long enough to require more than standalone research projects, some of which more or less start from scratch. The community needs comparability across different research projects, experiments, platforms, etc.
While robot research and development produce a great amount of scientific output on HRI, the results are scattered across a myriad of different approaches and ways of performing and assessing the interaction. The metrics deployed are likewise manifold, with the consequence that results are not comparable and benchmarking of the various proposed approaches is not possible. Consensus tools to benchmark robot platforms and applications are required, as are modes for their standardized assessment. These modes and tools need to cover both a robot’s hard skills, such as performance and safety, and its soft skills, such as user experience. Hard skills are somewhat covered in current norms and standards, which, however, share the perspective of robots being machinery and part of factory equipment. None of these standards consider robots from an HRI perspective yet. In order to launch standards for aspects of HRI (i.e. robot soft skills), the research community needs to acquire normative data by means of wide consultation in an open and transparent manner. In this way, the results become widely acceptable and can be exploited for the creation of international quality norms and standards, which in turn would make robot performance measurable in terms of HRI.
Reproducible and comparable results and interoperable systems should be a long-term goal, which will ultimately improve social robots. Norms and standards will make it possible to recognize a well-working robot. Reliable means for assessing robots on an interactional level will allow the community to create social robots that foster a positive user experience.
The book at hand provides an access point to the complex topic of standardization in HRI. It takes a wide perspective, ranging from theoretical definitions and a framework to more hands-on contributions such as methods for evaluation, examples of good and bad practices, and recommendations on statistics. The book is one important step towards comparability in HRI.
Book Overview: Towards New Perspectives

Céline Jost and Brigitte Le Pévédic

Endless Discussion

Céline: “Well, what about this conclusion? From reading the book, we can notice that everyone proposes a solution according to her/his discipline. And, in fact, everyone knows how to make evaluations for Human-Robot Interaction. Consequently, there is no need for standardization!”
Brigitte: “Of course there is, because we cannot compare results with one another.”
Céline: “Oh, okay, thus there is a need for standardization.”
Brigitte: “No, there is a need for common metrics.”
Céline: “But, what’s the difference between standardization and common met-
rics?”
Brigitte: “To the best of my knowledge, standardization means having a common method of evaluation. But it is entirely possible to use several evaluation methods that produce the same indicators to be studied. For example, there are several tools to measure wind speed, but in the end they all give values in the same unit, which can be compared.”
Céline: “It is interesting because, if we take the example of question 1 in our chapter, it could be possible to study the effect of culture (are Finnish people less sensitive to the audience effect than Peruvian people?) or the effect of context (are participants less sensitive to the audience effect in a sports context or in a teaching context?). But, if I understand correctly, given that the metrics are not the same, it is not possible to perform statistical tests to make this kind of comparison.”
Brigitte: “Thus the question is: do we need to compare them? Does the audience
effect in sports need to be compared to the audience effect in teaching?
Right now, what we can observe in the HRI community is the rigor of individual scientists. In our chapter, each expert specified the context, designed an
experimental protocol aiming to reduce edge effects, and tried to answer the research questions using the disciplines they knew. But each expert chose a unique use case.
Thus the question is rather: “Do we need to standardize?” Yes, of course, we are scientists and we want to standardize everything. We need that. But is it really a necessity?”
Céline: “And there is more! Some authors focus on the User Experience and others on the tendency to anthropomorphize. It seems clear that the community is currently focusing on “What do we have to evaluate?”. This is just the beginning of the research activities that will eventually allow us to think about evaluation standardization for Human-Robot Interaction.”
Brigitte: “That is indeed a very relevant comment.”
Céline: “It’s really difficult to conclude this book because we raised more questions
than we found solutions. As discussed in our chapter conclusion, we had
thought we would find an answer and we’ve barely formalized a question.
And on top of that, we don’t even know if standardization is really required
or what has to be standardized.”
Brigitte: “Yes, I’m rather disappointed. As far as I am concerned, when we started
our reflection in 2014, and even more when we started the book writing
process, I was certain that standardization was needed.”
Céline: “Oh, we really had different motivations. In my case, I was sure that ethology was sufficient, given the work we were doing with the ethologists. I thought we would show that ethology was the discipline suited to HRI.”
Brigitte: “I’ve never understood why you’ve been so focused on ethology and not on the other sciences. As far as I’m concerned, ethology is not sufficient, because animals are independent, have their own autonomy, and evolve slowly along phylogenetic lines, while robots cannot exist without humans and evolve really fast. I think that time is an important parameter, because animals (to which humans belong) can adapt to each other thanks to their slow evolution, while robots are expected to evolve too fast for humans to adapt.”
Céline: “And that’s precisely why ethology is interesting: we don’t need to have previous knowledge of robots. And ethology has a Darwinian inspiration. We just have to study humans in interaction, which is what ethology does, in order to deduce how to modify the robot to reach our objective, such as bringing comfort, avoiding bad experiences, and so on. In fact, we always analyze humans, since we measure the effects (on humans) of what we implement on robots. So ethology is adaptable.”
Brigitte: “Robots can have adaptive behavior, thus we have to study both humans and robots, in addition to the relationship between them.”
Céline: “Yes, of course. And this is exactly the specialty of ethology. And that’s interesting because, in her chapter, Marine Grandgeorge said: “Interestingly, sometimes researchers in HRI use methods very close
to the ones used in ethology without mentioning the correct vocabulary.” Well, it shows that you were right when you said we need a common and exhaustive toolbox. Such a toolbox could prevent us from using existing methods without really knowing them, and thus from making mistakes due to this lack of knowledge. We need to look deeper into the question.
Anyway, let’s get back to the point. The question of standardization is really complicated because we have to know what type of standardization we are discussing. When talking about standardization, we have to take bias into consideration. Even if we build a very precise experimental protocol and several researchers conduct the same evaluation in order to compare results, there will always be some bias. Rooms can be different, and experimenters and the external environment may also differ. It is not possible to ensure an exact reproduction unless participants travel to the same place. But that’s not realistic. We can’t ask people to fly 10,000 km in order to participate in an evaluation. And such a huge difference between participants’ journeys would itself induce bias. Therefore, what type of standardization should we discuss and develop? Which need do we expect to satisfy?”
Brigitte: “In itself, I think that people’s culture represents a bias, hence we can’t have a single standardization. I wonder whether we should speak, in the plural, of “some standardizations.” However, we could define a protocol adapted to these cultural biases, for example.”
Céline: “But it is still not enough. For example, Franz Werner’s chapter focuses on the elderly, while the second chapter written by Cindy Bethel focuses on children. That illustrates the fact that even age is an important factor in evaluation design and that different ages do not lead to the same evaluation. On this point, Marine Grandgeorge told me: “The protocol above all depends on the research question but also on the people. You can’t compare bipedal walking between babies, children, teenagers, and adults knowing that your babies can’t walk (taking into account individual ontogenesis).” That means we must list all the possible biases, correct?”
Brigitte: “I wonder whether we are not actually trying to create norms for each target population. We don’t want a generalized interaction; therefore, how do we standardize?”
Céline: “…”
Céline: “Well, that is the most relevant observation in our discussion! That’s an excellent question. Where to begin? If we go back to the recommendations presented in Cindy Bethel’s first chapter, one would need a broad sample of participants who are related to the topic studied, but not students. One would need evaluations out of the lab, in the wild. Indeed, how do we standardize in this case? One may need long-term evaluations. In this case, one could imagine home evaluations conducted by the robots themselves about their collaboration with humans. But then one would need common databases rather than standardization. And this goes back to
the beginning of the conversation, when you said we need common metrics. We have to know what to do with these metrics!”
Brigitte: “It’s a vicious circle!”
Céline: “Yes, it is… One of the problems mentioned in the book is the misuse of statistics. Maybe we should start by writing a white paper to standardize the use of statistics according to the research question?”
Brigitte: “The advantage of statistical science is that it helps in choosing good method(s); there, standardization is already a given. We would do better to define the indicators that have to be measured. We should establish rules of good practice to make studies comparable.”
Céline: “As suggested by Tony Belpaeme in his chapter, the first step toward standardization is to agree on good evaluation practice. That is in fact the purpose and content of this book.”
Brigitte: “Yes, this is obvious!”
Céline: “Starting from the obvious, that’s great! From reading the book, standardization seems to mean rules and recommendations for designing rigorous evaluations that give reliable results.
For example, a number of rules have been formulated in the book:
• Questions about ecological validity (lab-based vs. in-the-wild). It seems that a good compromise is to replicate real conditions.
• The need for comparing conditions. It seems that we acquire knowledge when comparing results from different conditions with one another and with reference knowledge.
• Choosing appropriate participants. Numerous results are invalidated because of an inappropriate sample. One should make a strong effort in recruiting participants, because they produce the data that are analyzed.
• The misuse of statistics. The statistical tools used are not always relevant or appropriate. There is a lot to be done about the good and appropriate use of statistics.”
Brigitte: “We clearly see that we need rules to allow comparison of results. If our objective is to prove that our robot is the “best”, we need the ability to conduct identical evaluations, but not to standardize!”
Céline: “Creating standardization may mean writing a book that contains the process of creating an evaluation from start to finish, as we started to do in this book. Maybe we should not try to create a new science or a method that works for all evaluations. As you said, it may be enough to create a toolbox and to learn to use it. We should, therefore, have a white paper that guides new researchers in using statistics, and consequently agree on which statistical tools should be used according to the types of data being analyzed.
In all chapters, we clearly see that each discipline uses tools, of course. In fact, researchers emphasize the evaluation methods used more than the disciplines themselves. For example, ethology and ethnography both use observations. In brief, there is a limited number of methods that each
discipline can use according to the context. We may consider HRI as a context for ethology, ethnography, sociology, and so on. I agree with you that we simply need to provide toolboxes and methods to researchers. In fact, standardization may already exist, but we don’t know it.”
Brigitte: “After reading the chapters of this book, I notice that tools differ according to discipline and that there seem to be “habits” associated with these tools. I wonder whether getting to know other disciplines and exchanging evaluations and methods among ourselves would not lead to a harmonization of methods, and consequently to de facto standardization.”
Céline: “So, concretely, we can start with the standardization of the information provided in research papers. We could have a question sheet to complete. If everyone provides the same sort of information about evaluations, we may be able to reproduce them.
But is it necessary to make the metrics produced compatible? Do we really need this, knowing that we don’t have the ability to account for some biases? There would be no scientific validity in comparing the results of different evaluations conducted with different participants in different contexts and locations. As far as I’m concerned, standardization should concern the process of designing evaluations, not common metrics.”
Brigitte: “We should, therefore, establish a standardized experimental protocol,
shouldn’t we?”
Céline: “Don’t you think that’s precisely what Cindy Bethel did in her 2010 paper? We can notice that this paper is cited several times in the book and seems to be the starting point. In any case, as far as we are concerned, it is indeed her paper that generated all this work and allowed this book to be born.
The question is still open.”

Concluding Remarks

Franz Werner, in his chapter, selected 49 papers dealing with evaluation methodologies for home care robots intended for the elderly. He ruled out 29 papers, 59.18% of those selected, because they did not contain basic information about the evaluation methods used and were thus not exploitable. He also ruled out 14 of 24 projects, that is, 58.33% of the projects. This seems to mean that almost 60% of the evaluations presented in the literature would not be replicable, because the papers do not describe the evaluation in enough detail to perform the experiments or studies again in a reliable manner.
Emerging from this book are two different visions: on the one hand, ensuring good practices to improve the reliability of results; on the other hand, defining common metrics to be able to compare results. Thus, some of us think that standardization has to happen upstream of evaluations, and others think that it has to happen downstream of them.
But from reading this book, we are not convinced that we need to compare results with one another. Thus we may not need to create common metrics. There is an important issue to solve now: we are looking for individualized interactions, yet we want to standardize. Isn't that contradictory?
It seems that there is a consensus on the need for evaluation replication. Replication seems to be more important than common metrics and seems to be the basis of a standardization that can be adapted to HRI. However, HRI is a context in which it seems impossible to obtain the same interaction more than once. Reproducing evaluations is very important to confirm or refute the validity of results already obtained. In this book, no chapter defended common metrics, while five chapters are based on the idea that standardization is a process:
1. “Conducting Studies in Human–Robot Interaction” by Cindy Bethel,
2. “Introduction to (Re)using Questionnaires in HRI Research” by Matthew
Rueben, Shirley A. Elprama, Dimitrios Chrysostomou and An Jacobs,
3. “Qualitative Interview Techniques for Human–Robot Interaction”, by Cindy L.
Bethel, Jessie E. Cossitt, Zachary Henkel and Kenna Baugus,
4. “Design and Development of the USUS Goals Evaluation Framework” by
Josefine Wallström and Jessica Lindblom, and
5. “Testing for ‘Anthropomorphization’—A Case for Mixed Methods in Human–
Robot Interaction” by Malene Damholdt, Christina Vestergaard and Johanna
Seibt.
In addition to this discussion of standardization, this book highlights an important point: the relationship is central in our context, that is, “a human being and a social robot are interacting with each other.” We noticed that, while we really need to measure the relationship because it is the result of successful interactions, we do not know what a relationship exactly is. And it is clearly understudied. There is almost no research on long-term relationships, for example, except in ethology. However, there are numerous research studies associated with different parts of relationships. But is it really necessary to dissect relationships in order to study each criterion one after the other? Do we really need that to achieve our objectives? Do we have any objective other than maximizing users' well-being? Indeed, we are interested in the results of different interactions, not in the interaction itself. It is possible to measure this result. In the case of a home care robot, one can question the user or have them undergo a medical examination. In the case of a teaching robot, one can evaluate a learner's knowledge. In the case of a welcome robot, one can observe customers and/or question the staff who interact with the robot. In brief, this lack of knowledge about relationships is a major concern. This issue has to be solved if we want to standardize methods of assessment in HRI.
Finally, the third important point deals with HRI itself. It seems to us that the HRI discipline is still evolving, and indeed still under construction. Damholdt et al. pointed to the “need to define both the interdisciplinary scope of HRI research and its pluridisciplinary format”. Grandgeorge highlighted the fact that “sometimes researchers in HRI used methods very close to ones used in ethology without mentioning the correct vocabulary”. And she added that “ethology and robotics mutually enhance each other.” Throughout this book, one can find numerous reflections showing that the HRI discipline needs to find its own identity and is learning from other disciplines. Damholdt et al. argued that “HRI will become a transdiscipline field in the long run.”

∗ ∗ ∗

This book is, in fact, an introduction to a general reflection on global standardization in the Human–Robot Interaction disciplines. Is our discipline becoming mature? In any case, it currently seems to be self-organizing.
