
Social Signal Processing

Social Signal Processing is the first book to cover all aspects of the modeling, automated
detection, analysis, and synthesis of nonverbal behavior in human–human and human–
machine interactions.
Authoritative surveys address conceptual foundations, machine analysis and synthesis of social signals, and applications. Foundational topics include affect perception
and interpersonal coordination in communication; later chapters cover technologies for
automatic detection and understanding, such as computational paralinguistics and facial
expression analysis, and for the generation of artificial social signals, such as social robots
and artificial agents. The final section covers a broad spectrum of applications based on
social signal processing in healthcare, deception detection, and digital cities, including
detection of developmental disorders and analysis of small groups.
Each chapter offers a basic introduction to its topic, accessible to students and other
newcomers, and then outlines challenges and future perspectives for the benefit of experi-
enced researchers and practitioners in the field.

Judee K. Burgoon is Professor of Communication, Family Studies and Human Development at the University of Arizona, where she is Director of Research for the Center for
the Management of Information. She has authored or edited 14 books and monographs
and more than 300 published articles, chapters, and reviews related to nonverbal and ver-
bal communication, interpersonal deception, and computer-mediated communication. The
recipient of the highest honors from the International Communication Association and
National Communication Association, she has been named the most published woman in
the field of communication in the twentieth century.

Nadia Magnenat-Thalmann has pioneered research into virtual humans over the last
30 years. In 1989, she founded the interdisciplinary research group MIRALab at the
University of Geneva. She has published more than 500 works on virtual humans and
social robots and she has given more than 300 keynote lectures in various institutions and
organizations. She has received more than 30 awards and, besides directing her research
group MIRALab in Geneva, is presently Visiting Professor and Director of the Institute
for Media Innovation (IMI) at Nanyang Technological University, Singapore.

Maja Pantic is Professor of Affective and Behavioral Computing and leader of the
iBUG group at Imperial College, London, working on machine analysis of human non-
verbal behavior and its applications to human–computer, human–robot, and computer-
mediated human–human interaction. She has published more than 250 technical papers in
machine analysis of facial expressions, machine analysis of human body gestures, audio-
visual analysis of emotions and social signals, and human-centered machine interfaces.

Alessandro Vinciarelli is Senior Lecturer (Associate Professor) in the School of Computing Science and Associate Academic of the Institute of Neuroscience and Psychology at the University of Glasgow. He has published more than 100 scientific works, has been principal or co-principal investigator on 15 national and international projects (including the European Network of Excellence on Social Signal Processing), has organized more than 25 scientific events, and has co-founded a webcasting company, Klewel.
Social Signal Processing
JUDEE K. BURGOON

NADIA MAGNENAT-THALMANN

MAJA PANTIC
ALESSANDRO VINCIARELLI
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi - 110002, India
79 Anson Road, #06-04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge.


It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107161269
DOI: 10.1017/9781316676202
© Judee K. Burgoon, Nadia Magnenat-Thalmann, Maja Pantic and Alessandro Vinciarelli 2017
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2017
Printed in the United States of America by Sheridan Books, Inc.
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Burgoon, Judee K., editor. | Magnenat-Thalmann, Nadia, 1946– editor. | Pantic, Maja, 1970– editor. |
Vinciarelli, Alessandro, editor.
Title: Social signal processing / edited by Judee K. Burgoon (University of Arizona), Nadia
Magnenat-Thalmann (University of Geneva), Maja Pantic (Imperial College London),
Alessandro Vinciarelli (University of Glasgow).
Description: Cambridge, United Kingdom ; New York, NY : Cambridge University Press, 2017. |
Includes bibliographical references and index.
Identifiers: LCCN 2016041635| ISBN 9781107161269 (hardback ; alk. paper) |
ISBN 1107161266 (hardback ; alk. paper) | ISBN 9781316613832 (pbk. ; alk. paper) |
ISBN 1316613836 (pbk. ; alk. paper)
Subjects: LCSH: Human-computer interaction. | Signal processing. | Human face recognition
(Computer science) | Nonverbal communication. | Facial expression. | Pattern recognition systems. |
Multimodal user interfaces (Computer systems)
Classification: LCC QA76.9.H85 S633 2017 | DDC 621.382/2 – dc23 LC record available at
https://lccn.loc.gov/2016041635
ISBN 978-1-107-16126-9 Hardback
ISBN 978-1-316-61383-2 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy
of URLs for external or third-party Internet Web sites referred to in this publication
and does not guarantee that any content on such Web sites is, or will remain,
accurate or appropriate.
Contents

Contributors page ix

1 Introduction: Social Signal Processing 1


Alessandro Vinciarelli

Part I Conceptual Models of Social Signals

2 Biological and Social Signaling Systems 11


Kory Floyd and Valerie Manusov

3 Universal Dimensions of Social Signals: Warmth and Competence 23


Cydney H. Dupree and Susan T. Fiske

4 The Vertical Dimension of Social Signaling 34


Marianne Schmid Mast and Judith A. Hall

5 Measuring Responses to Nonverbal Social Signals: Research on Affect Receiving Ability 46
Ross Buck, Mike Miller, and Stacie Renfro Powers

6 Computational Analysis of Vocal Expression of Affect: Trends and Challenges 56


Klaus Scherer, Björn Schüller, and Aaron Elkins

7 Self-presentation: Signaling Personal and Social Characteristics 69


Mark R. Leary and Katrina P. Jongman-Sereno

8 Interaction Coordination and Adaptation 78


Judee K. Burgoon, Norah E. Dunbar, and Howard Giles

9 Social Signals and Persuasion 97


William D. Crano and Jason T. Siegel

10 Social Presence in CMC and VR 110


Christine Rosakranse, Clifford Nass, and Soo Youn Oh

Part II Machine Analysis of Social Signals

11 Facial Actions as Social Signals 123


Michel Valstar, Stefanos Zafeiriou, and Maja Pantic

12 Automatic Analysis of Bodily Social Signals 155


Ronald Poppe

13 Computational Approaches for Personality Prediction 168


Bruno Lepri and Fabio Pianesi

14 Automatic Analysis of Aesthetics: Human Beauty, Attractiveness, and Likability 183


Hatice Gunes and Björn Schüller

15 Interpersonal Synchrony: From Social Perception to Social Interaction 202


Mohamed Chetouani, Emilie Delaherche, Guillaume Dumas, and David Cohen

16 Automatic Analysis of Social Emotions 213


Hatice Gunes and Björn Schüller

17 Social Signal Processing for Automatic Role Recognition 225


Alessandro Vinciarelli

18 Machine Learning Methods for Social Signal Processing 234


Ognjen Rudovic, Mihalis A. Nicolaou, and Vladimir Pavlovic

Part III Machine Synthesis of Social Signals

19 Speech Synthesis: State of the Art and Challenges for the Future 257
Kallirroi Georgila

20 Body Movements Generation for Virtual Characters and Social Robots 273
Aryel Beck, Zerrin Yumak, and Nadia Magnenat-Thalmann

21 Approach and Dominance as Social Signals for Affective Interfaces 287


Marc Cavazza

22 Virtual Reality and Prosocial Behavior 304


Ketaki Shriram, Soon Youn Oh, and Jeremy Bailenson

23 Social Signal Processing in Social Robotics 317


Maha Salem and Kerstin Dautenhahn

Part IV Applications of Social Signal Processing

24 Social Signal Processing for Surveillance 331


Dong Seon Cheng and Marco Cristani

25 Analysis of Small Groups 349


Daniel Gatica-Perez, Oya Aran, and Dinesh Jayagopi

26 Multimedia Implicit Tagging 368


Mohammad Soleymani and Maja Pantic

27 Social Signal Processing for Conflict Analysis and Measurement 379


Alessandro Vinciarelli

28 Social Signal Processing and Socially Assistive Robotics in Developmental Disorders 389
Mohamed Chetouani, Sofiane Boucenna, Laurence Chaby, Monique Plaza, and David Cohen

29 Social Signals of Deception and Dishonesty 404


Judee K. Burgoon, Dimitris Metaxas, Thirimachos Bourlai, and Aaron Elkins
Contributors

Oya Aran
Idiap Research Institute

Jeremy Bailenson
Stanford University

Aryel Beck
Nanyang Technological University

Sofiane Boucenna
University Pierre and Marie Curie

Thirimachos Bourlai
West Virginia University

Ross Buck
University of Connecticut

Judee K. Burgoon
University of Arizona

Marc Cavazza
Teesside University

Laurence Chaby
University Pierre and Marie Curie, University Paris Descartes

Dong Seon Cheng
Hankuk University of Foreign Studies

Mohamed Chetouani
University Pierre et Marie Curie

David Cohen
University Pierre et Marie Curie

William D. Crano
Claremont Graduate University

Marco Cristani
University of Verona

Kerstin Dautenhahn
University of Hertfordshire

Emilie Delaherche
University Pierre et Marie Curie

Guillaume Dumas
University Pierre et Marie Curie

Norah E. Dunbar
University of California Santa Barbara

Cydney H. Dupree
Princeton University

Aaron Elkins
San Diego State University

Susan Fiske
Princeton University

Kory Floyd
Arizona State University

Daniel Gatica-Perez
Idiap Research Institute and EPFL

Kallirroi Georgila
Institute for Creative Technologies

Howard Giles
University of California Santa Barbara

Hatice Gunes
University of Cambridge

Judith A. Hall
Northeastern University

Dinesh Jayagopi
IIIT Bangalore

Katrina P. Jongman-Sereno
Duke University

Mark Leary
Duke University

Bruno Lepri
Bruno Kessler Foundation

Nadia Magnenat-Thalmann
University of Geneva and Nanyang Technological University

Valerie Manusov
University of Washington

Dimitris Metaxas
Rutgers University

Mike Miller
Massachusetts College of Pharmacy and Health Sciences

Clifford Nass
Stanford University

Mihalis A. Nicolaou
Imperial College London

Soo Youn Oh
Stanford University

Maja Pantic
Imperial College London and University of Twente

Vladimir Pavlovic
Rutgers University

Fabio Pianesi
Bruno Kessler Foundation

Monique Plaza
University Pierre and Marie Curie, University Paris Descartes

Ronald Poppe
University of Twente

Stacie Renfro Powers
Philliber Research Associates

Christine Rosakranse
Stanford University

Ognjen Rudovic
Imperial College London

Maha Salem
University of Hertfordshire

Klaus Scherer
University of Geneva

Marianne Schmid Mast
University of Lausanne

Björn Schüller
Imperial College London and Technical University Munich

Ketaki Shriram
Stanford University

Jason T. Siegel
Claremont Graduate University

Mohammad Soleymani
University of Geneva

Michel Valstar
University of Nottingham

Alessandro Vinciarelli
University of Glasgow

Zerrin Yumak
Nanyang Technological University

Stefanos Zafeiriou
Imperial College London
1 Introduction: Social Signal Processing
Alessandro Vinciarelli

Introduction

Social signal processing (SSP) is the computing domain aimed at modeling, analy-
sis, and synthesis of social signals in human–human and human–machine interactions
(Pentland, 2007; Vinciarelli et al., 2008, 2012; Vinciarelli, Pantic, & Bourlard, 2009).
According to different theoretic orientations, social signals can be defined in different
ways, for example, “acts or structures that influence the behavior or internal state of
other individuals” (Mehu & Scherer, 2012; italics in original), “communicative or infor-
mative signals which . . . provide information about social facts” (Poggi & D’Errico,
2012; italics in original), or “actions whose function is to bring about some reac-
tion or to engage in some process” (Brunet & Cowie, 2012; italics in original). The
definitions might appear different, but there seems to be consensus on at least three
points.

• Social signals are observable behaviors that people display during social interactions.
• The social signals of an individual A produce changes in others (e.g., the others develop an impression or a belief about A, react to A with appropriate social signals, or coordinate their social signals with those of A).
• The changes produced by the social signals of A in others are not random, but follow principles and laws.

From a computing perspective, the observations above lead to the key idea that shapes the field of Social Signal Processing, namely that social signals are the physical, machine-detectable traces of social and psychological phenomena not otherwise accessible to
direct observation. In fact, SSP addresses the following three main problems.

• Modeling: identification of principles and laws that govern the use of social signals.
• Analysis: automatic detection and interpretation of social signals in terms of the principles and laws above.
• Synthesis: automatic generation of artificial social signals following the principles and laws above.

Correspondingly, this book is organized into four main sections of which the first three
focus on the three problems outlined above while the fourth one introduces current
applications of SSP technologies.

• Part I Conceptual models of social signals: this section covers definitions and models of social behaviour and social signals – the core concepts of SSP as researched in social psychology, cognitive sciences, evolutionary psychology, and anthropology.
• Part II Machine analysis of social signals: this section covers the technologies aimed at automatic detection of social signals apparent from face and facial behaviour, vocal expressions, gestures and body postures, proxemics, etc.
• Part III Machine synthesis of social signals: this section covers the technologies aimed at empowering artificial agents with the ability to display social signals, including expressive speech synthesis, facial animation, and dialogue management.
• Part IV Applications of SSP: this section covers the most important SSP application domains, including socially intelligent surveillance, deception detection, healthcare, and multimedia indexing.

Every chapter is a survey aimed at beginners and experienced researchers in the field.
For the former, the surveys will be a fundamental source of references and a starting
point in the research on the topic. For the latter, the chapters will be a compendium of
the large body of knowledge accumulated in SSP, informed by the critical views of some
of the most influential researchers in the domain.

Part I Conceptual Models of Social Signals

Part I introduces social science perspectives on social signaling. Covered are theories
and models related to the etiologies, form, and functions of social signals. The first
chapter, “Biological and Social Signaling Systems” (Kory Floyd and Valerie Manusov),
addresses the fundamental issue of nurture versus nature influences on social signals,
focusing in particular on the interplay between innate biological processes and acquired
components resulting from sociocultural processes. The next two chapters concern the
horizontal versus vertical dimensions along which social messages are expressed and
interpreted. The chapter, “Universal Dimensions of Social Signals: Warmth and Com-
petence” (Susan Fiske and Cydney Dupree), surveys recent results on the perception of
warmth and competence, the two dimensions along which people tend to assess unac-
quainted others in the earliest stages of an interaction. In particular, the chapter high-
lights that the two dimensions are universal, that is, they tend to appear in all situations
and cultures. Judith Hall and Marianne Schmid Mast survey the use of social signals
as a means to express social verticality – status and power differences between people
belonging to the same social system – in the chapter entitled “The Vertical Dimension
of Social Signaling.”
The two chapters that follow concern the relationship between emotions and social
signals. The fourth chapter, “Measuring Responses to Nonverbal Social Signals:
Research on Affect Receiving Ability” (Ross Buck, Mike Miller and Stacie Renfro Pow-
ers), addresses the perception of emotions and affect that others display. In particular,
the chapter focuses on pickup and processing of facial and bodily displays. It is com-
plemented by the chapter authored by Klaus Scherer, Björn Schüller and Aaron Elkins,
“Computational Analysis of Vocal Expression of Affect: Trends and Challenges,” which focuses on the vocal expression of emotions. Furthermore, the chapter addresses the
role that signal processing technologies can have in the investigation of social signals.
The role of social signals as a means to display identity and personality is the focus
of “Self-presentation: Signaling Personal and Social Characteristics” (Mark R. Leary
and Katrina P. Jongman-Sereno). In particular, this chapter analyses the considerable
efforts that people make in order to lead others to treat them in desired ways. Finally,
the last three chapters of Part I address phenomena that take place during the interaction
between people. The chapter, “Interaction Coordination and Adaptation,” by Judee Bur-
goon, Norah Dunbar, and Howard Giles focuses on the tendency of interacting people
to mutually adapt their interaction styles or to adopt similar behavior patterns. Persua-
sion is at the core of the chapter authored by William Crano and Jason Siegel, “Social
Signals and Persuasion,” with particular attention to the effect of social signals on the
credibility of a source. Finally, the last chapter of Part I, “Social Presence in CMC and
VR” by Christine Rosakranse, Clifford Nass, and Soo Youn Oh, focuses on technology
mediated interaction contexts and, in particular, on how to convey social presence when
interaction is not face-to-face.
These Part I chapters supply essential context for conducting machine analysis of
social signals. They identify the multitude of functions that given signals may perform
and draw attention to the fact that many signals arise not from meanings that senders
are attempting to convey but rather are a response to the displays of interlocutors and
the jointly created exchange.

Part II Machine Analysis of Social Signals

The second part of the book deals with machine analysis of social signals. It represents
a collection of surveys covering the state of the art in research and technology aimed at
automatic detection of social signals.
The first two chapters deal with two of the most important sources of social signals,
namely face and body. In “Facial Actions as Social Signals,” Michel Valstar, Stefanos
Zafeiriou, and Maja Pantic survey the past work in machine analysis of facial gestures
(i.e., facial action units), which are the building blocks of all facial expressions, includ-
ing the facial expressions typical of displays of social signals such as interest, mimicry,
empathy, envy, and so on. Particular attention is paid to discussing automatic facial ges-
ture recognition in unconstrained conditions and real-life situations. Ronald Poppe, the
author of “Automatic Analysis of Bodily Social Signals,” surveys state-of-the-art approaches and technologies for automatic recognition of social signals apparent from
a human body’s posture and movement. This includes interest detection in interactions
with robot companions, detection of phenomena such as mimicry and turn taking, and
deception detection.
The chapters following those mentioned above address the problem of using social
signals as a means to infer people’s characteristics. Personality traits profoundly influ-
ence one’s displays of social signals and one’s social interactions. For instance, it is
commonly known that extroverted people establish social interactions more easily, and have more pleasant ones, than more introverted people do. In “Computational Approaches
for Personality Prediction,” Bruno Lepri and Fabio Pianesi discuss two approaches to
automatic prediction of one’s personality. The first relies on automatic recognition of so-
called distal cues (e.g., voice pitch) and learning which distal cues underlie which per-
sonality trait (extraversion, neuroticism, agreeableness, conscientiousness, openness). The second approach
to automatic personality prediction relies on one’s profile and interactions in a social
network such as Facebook. Attractiveness and likability affect social exchanges in very
predictable ways. It is widely known, for example, that attractive people establish social
interaction more easily than less attractive people. In “Automatic Analysis of Aesthet-
ics: Human Beauty, Attractiveness, and Likability,” Hatice Gunes and Björn Schüller
survey the past work on automatic analysis of human attractiveness and likability based
on audio and visual cues shown by the judged person.
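
To make the distal-cue idea above concrete, the following minimal sketch (not the method described in the chapter) maps a few hypothetical acoustic cues, such as mean pitch, pitch variability, speaking rate, and vocal energy, to a binary extraversion judgment with a standard supervised classifier; the feature set, the synthetic data, and the choice of a support vector machine are all assumptions made purely for illustration.

# Illustrative sketch only: hypothetical distal cues -> binary extraversion label.
# Features, synthetic data, and model choice are assumptions for demonstration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic corpus: one row per speaker, columns are distal cues
# [mean pitch (Hz), pitch variability (Hz), speaking rate (syll/s), energy (dB)].
n = 200
X = np.column_stack([
    rng.normal(180, 40, n),   # mean pitch
    rng.normal(30, 10, n),    # pitch variability
    rng.normal(4.0, 1.0, n),  # speaking rate
    rng.normal(60, 8, n),     # vocal energy
])

# Toy labeling rule: faster, louder speech is tagged "extraverted" (1).
y = ((X[:, 2] + 0.05 * X[:, 3] + rng.normal(0, 1, n)) > 7.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Standardize the cues, then learn the cue-to-trait mapping with an SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")

In a real system the cues would come from a paralinguistic feature extractor and the labels from validated personality questionnaires; the sketch is meant only to convey the overall structure of such a pipeline (feature extraction, normalization, supervised learning).
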
The remaining chapters of Part II focus on phenomena that take place during social
interactions. A large body of research in psychology points out that the temporal coordination between interacting individuals strongly influences the outcome of the
interaction (e.g., whether one will feel liked or not, whether the outcome of negotiation
will be positive or not, etc.). In “Interpersonal Synchrony: From Social Perception to
Social Interaction,” Mohamed Chetouani, Emilie Delaherche, Guillaume Dumas, and
David Cohen focus on computational models of interpersonal synchrony and survey
the automatic approaches to interpersonal synchrony assessment. Social emotions are
defined as emotions that relate to interpersonal interactions, rather than to individual
feelings (e.g., empathy, envy, shame, etc.). In “Automatic Analysis of Social Emotions,”
Hatice Gunes and Björn Schüller provide an overview of the past research on automatic
recognition of social emotions from visual and audio cues. In “Social Signal Processing
for Automatic Role Recognition,” Alessandro Vinciarelli surveys the past work on one of the earliest research topics addressed by the SSP community – recognition of social roles
(i.e., the position that someone holds in a given social context, such as “moderator”
versus “discussion participant”). Particular attention is paid to open issues and chal-
lenges in this research field.
All previously mentioned approaches to automatic analysis of social signals build
upon machine learning techniques to model, from available data (i.e., audio, visual, and multimodal observations), the latent and complex behavioral patterns underpinning target social signals. In “Machine Learning Methods for Social Sig-
nal Processing,” Ognjen Rudovic, Mihalis Nicolaou, and Vladimir Pavlovic focus on
systematization, analysis, and discussion of recent trends in machine learning methods
employed typically in SSP research.

Part III Machine Synthesis of Social Signals

Part III includes surveys on some of the most important aspects of social signals syn-
thesis, from the generation of artificial nonverbal cues, to the use of artificial cues to
convey socially relevant information, to social robots.

The first two chapters address, respectively, speech synthesis and the generation of
gestures and bodily movements. Kallirroi Georgila – author of “Speech Synthesis: State-
of-the-art and Challenges for the Future” – describes state-of-the-art techniques for the
generation of artificial speech and emphasizes in particular the synthesis of emotional
and expressive speech through the use of paralanguage and nonverbal cues. Similarly,
the authors of “Body Movements Generation for Virtual Characters and Social Robots”
(Aryel Beck, Zerrin Yumak, and Nadia Magnenat-Thalmann) survey not only the tech-
nologies to synthesize nonverbal cues such as body posture, gestures, and gaze, but also
the use of these cues when it comes to the communication of emotion and affect.
In the two chapters that follow those mentioned above, the authors address the prob-
lem of how to artificially generate social phenomena and, in particular, how to convey
emotion and prosocial behavior. Marc Cavazza (author of “Approach and Dominance
as Social Signals for Affective Interfaces”) surveys the adoption of affective interfaces
as a principled approach toward the improvement of the interactions between users and
machines. Ketaki Shriram, Soon Youn Oh, and Jeremy Bailenson (authors of “Virtual
Reality and Prosocial Behavior”) survey the efforts aimed at promoting positive changes
in behavior (e.g., increasing environmental awareness or adopting healthier lifestyles)
through the adoption of virtual spaces where it is possible to interact in a controlled
setting, possibly including artificial characters.
The concluding chapter of Part III, “Social Signal Processing in Social Robotics” by Maha
Salem and Kerstin Dautenhahn, focuses on social robots, one of the most important
forms of embodiment where the synthesis of social signals can play a crucial role in
ensuring smooth, enjoyable, and effective interactions between humans and machines.

Part IV Applications of Social Signal Processing

The last part of the book deals with the applications of social signal processing. Although the domain is relatively young (the very expression social signal processing was coined less than ten years ago), the methodologies produced in the field have been shown
to be promising in a wide spectrum of application areas.
The first two chapters of this part show applications where the very analysis of social
signals can serve practical purposes, namely surveillance and automatic understand-
ing of group behavior. Dong Seon Cheng and Marco Cristani (“Social Signal Process-
ing for Surveillance”) show how the automatic analysis of social signals can improve
current surveillance approaches that, typically, analyze human behavior without tak-
ing into account the peculiarities of social behavior. Daniel Gatica-Perez, Oya Aran,
and Dinesh Jayagopi (“Analysis of Small Groups”) survey efforts aimed at inferring
the social phenomena taking place in small groups, such as social verticality, personal-
ity, group cohesion, and characterization. These efforts are beneficial in particular for
applications aimed at making meetings effective and productive.
Another two chapters show the use of social signal processing methodologies as a support for multimedia indexing. The chapter “Multimedia Implicit Tag-
ging” (Mohammad Soleymani and Maja Pantic) shows that capturing the reaction of a
user (e.g., laughter or sobbing) in front of a multimedia item (e.g., a video) provides
information about the content of the item itself that can then be tagged with categories
such as funny or sad. In a similar vein, Alessandro Vinciarelli (“Social Signal Processing for Conflict Analysis and Measurement”) shows that the detection of conflict can help to extract the
most important moments in large repositories of political debates.
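
As a toy illustration of the implicit-tagging idea (again an assumption-laden sketch, not the system described in the chapter), the snippet below converts automatically detected viewer reactions, assumed to arrive as timestamped labels from some upstream detector, into candidate content tags.

from collections import Counter

# Hypothetical reaction-to-tag mapping; a real system would learn or refine this.
REACTION_TO_TAG = {"laughter": "funny", "sobbing": "sad", "gasp": "surprising"}

def implicit_tags(detected_reactions, min_count=2):
    """detected_reactions: list of (timestamp_s, reaction_label) pairs
    produced by an upstream reaction detector (assumed to exist)."""
    counts = Counter(REACTION_TO_TAG[r] for _, r in detected_reactions
                     if r in REACTION_TO_TAG)
    # Keep only tags supported by enough independent reactions.
    return [tag for tag, c in counts.most_common() if c >= min_count]

print(implicit_tags([(12.4, "laughter"), (58.0, "laughter"), (90.2, "gasp")]))
# -> ['funny']
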
The last two chapters of this part target the adoption of social signal processing
methodologies in two major application areas, that is, healthcare and deception detec-
tion. Mohamed Chetouani, Sofiane Boucenna, Laurence Chaby, Monique Plaza, and
David Cohen (“Social Signal Processing and Socially Assistive Robotics in Develop-
mental Disorders”) show in particular that the analysis of social signals can help the
detection of developmental problems in children that, in many cases, cannot even speak.
Judee K. Burgoon, Dimitris Metaxas, Thirimachos Bourlai and Aaron Elkins (“Social
Signals of Deception and Dishonesty”) survey the progress on the possibility of devel-
oping technologies capable to identify people who lie.

Conclusions

This chapter provides a description of the book’s organization and content. The goal is
to allow the readers to identify chapters of interest quickly and easily and, at the same
time, to develop awareness of the main problems and areas covered in social signal
processing. The many authors involved in the book have made major efforts to combine
rigour and depth with clarity and ease of access. This will hopefully make this book a
valuable instrument for a wide spectrum of readers.

• SSP beginners: researchers starting their investigations in SSP will benefit from the surveys because these provide an overview of state-of-the-art perspectives, identify the most important challenges in the field, include rich bibliographies, and introduce the right terminology.
• SSP experts: researchers knowledgeable in SSP can benefit from the surveys because these condense, in a compact and concise form, a large body of knowledge typically scattered across multiple disciplines. The critical views of the authors can provide fertile ground for discussion and, in turn, be an effective tool in pushing the limits of innovation in the field.
• SSP teachers: teachers will benefit from the material because it provides an introduction to the field and can be used as didactic material for students with different backgrounds and/or at different stages of their education. Furthermore, the material is organized in parts that correspond to the most natural structure of an SSP course.
• SSP-interested readers: researchers and practitioners who are not active in the field, but are interested in the domain and in research in related areas (e.g., human behavior analysis), can benefit from the book because it provides a clear account of state-of-the-art challenges and opportunities in the field and a clear positioning of SSP research with respect to related areas. Furthermore, the book can be an excellent entry point to the SSP domain.
• Graduate and undergraduate students: students at all levels will benefit from the book because the material is introductory and provides a clear explanation of what the SSP domain is about. In this respect, the book can help students decide whether SSP actually fits their interests or not.
• Industry experts: industry practitioners (or observers) can benefit from the book because they can find in it an extensive overview of state-of-the-art applications in a wide spectrum of topics of potential interest, as well as an indication of the most important actors in the domain.
Like any vibrant research field, social signal processing keeps developing in both
depth and breadth. New conceptual and methodological issues emerge continually,
often inspired by new application domains. Correspondingly, the editors hope that the
chapters of this book will not be considered as a static body of knowledge, but as a
starting point toward new research and application avenues. The goal of this book is not
to provide the final word on social signal processing, but to allow any reader to
quickly engage with novelties and progress that will hopefully come in the years after
the publication of the volume.

References

Brunet, P. & Cowie, R. (2012). Towards a conceptual framework of research on social signal
processing. Journal of Multimodal User Interfaces, 6(3–4), 101–115.
Mehu, M. & Scherer, K. (2012). A psycho-ethological approach to social signal processing.
Cognitive Processing, 13(2), 397–414.
Pentland, A. (2007). Social signal processing. IEEE Signal Processing Magazine, 24(4), 108–111.
Poggi, I. & D’Errico, F. (2012). Social signals: A framework in terms of goals and beliefs. Cog-
nitive Processing, 13(2), 427–445.
Vinciarelli, A., Pantic, M., & Bourlard, H. (2009). Social signal processing: Survey of an emerg-
ing domain. Image and Vision Computing Journal, 27(12), 1743–1759.
Vinciarelli, A., Pantic, M., Bourlard, H., & Pentland, A. (2008). Social signal processing: State-
of-the-art and future perspectives of an emerging domain. Proceedings of the ACM Interna-
tional Conference on Multimedia (pp. 1061–1070). New York: Association for Computing
Machinery.
Vinciarelli, A., Pantic, M., Heylen, D., Pelachaud, C., Poggi, I., D’Errico, F., & Schroeder, M.
(2012). Bridging the gap between social animal and unsocial machine: A survey of social
signal processing. IEEE Transactions on Affective Computing, 3(1), 69–87.
Part I
Conceptual Models of Social Signals
2 Biological and Social Signaling Systems
Kory Floyd and Valerie Manusov

As complex beings, humans communicate in complex ways, relying on a range of faculties to encode and decode social messages. Some aptitudes are innate, based on one's
biological characteristics, whereas others are acquired, varying according to one’s social
and cultural experiences. As we explain in this chapter, each of us uses a combination of
biological and sociocultural processes to produce and interpret social signals. Our goal
is to introduce some of the forms that these processes can take.
We begin this chapter with an overview of social signals and a comparison between
the biological and sociocultural processes underlying their production and interpreta-
tion. Next, we explore three examples of biologically processed social signals, and then
examine sociocultural processing of the same signals. We conclude the chapter by dis-
cussing some ways in which biological and sociocultural processes interact.

The Nature of Social Signals

Communicators depend on a wide variety of social signals to make sense of the world
around them. Poggi and D’Errico (2011) define a signal as “any perceivable stimu-
lus from which a system can draw some meaning” and a social signal as “a commu-
nicative or informative signal which, either directly or indirectly, provides information
about ‘social facts,’ that is, about social interactions, social attitudes, social relations
and social emotions” (Poggi & D’Errico, 2011: 189). Social interactions are situations
in which people perform reciprocal social actions, such as a game, a surgical proce-
dure, an orchestral performance, or a conflict. Social attitudes are people’s tendencies
to behave in a particular way toward another person or group and include elements
such as beliefs, opinions, evaluations, and emotions. Social relations are relationships
of interdependent goals between two or more people. Finally, social emotions include
those emotions that (1) we feel toward someone else, such as admiration and envy;
(2) are easily transmitted from one person to another, such as enthusiasm and panic;
and/or (3) are self-conscious, such as pride and shame.
As noted, humans use both biological and sociocultural processes to produce and
interpret social signals. At least four distinctions differentiate these processes from one
another: (1) their connection to physical versus social traits, (2) their cultural variation,
(3) their uniqueness to the human species, and (4) the advantages or values they embody.
We discuss each of these briefly to help ground our chapter.
First, a biologically processed social signal is connected to an organic anatomical trait
or physiological process and derives its meaning from that trait or process. In humans,
some social signals regarding age meet this criterion, insofar as height and body size,
proportion of facial features, condition of skin and hair, and other visual markers of
age are products of the organic aging process. In contrast, a socioculturally processed
social signal is connected to traits or processes whose meaning is culturally constructed.
For example, human social signals of political affiliation – such as style of dress or
the variety of bumper stickers on one’s car – reflect culturally constructed ideas about
politics, such as the idea that conservative attire denotes conservative ideology.
Second, the meaning of biologically processed social signals is largely culturally
invariant. To the extent that basic anatomy and physiology are consistent across humans
around the world, the first criterion gives rise to the second criterion, that cultures should
show little variation in how they interpret a biologically processed social signal. For
some social signals, such as emotion displays, there is compelling evidence of cultural
invariance. No such evidence exists for some other social signals, yet cultural consis-
tency would be expected. The meaning of socioculturally processed social signals, how-
ever, often varies substantially across cultural and social groups, and there is little reason
to expect otherwise. For example, a personal distance of twelve inches (30 cm) may be
seen as intimate in some cultures and distant in others.
Third, biologically processed social signals are processed similarly in similar species.
Many species have muscular and central and peripheral nervous systems similar to those
of humans. When a social signal is rooted in an organic anatomic trait or physiological
process in humans, it should be similar in species with similar anatomies or physi-
ologies. Better evidence exists for this consistency in some signals (such as emotion
displays) than in others. This consistency depends on relevant anatomical or physio-
logical similarity, so primates with similar facial muscles would be expected to display
emotions similarly to humans, but not to grow facial hair as a secondary sexual charac-
teristic if their faces are already covered with hair. By contrast, there is no reason
to expect socioculturally processed social signals to be processed similarly – if at all –
by other species. Indeed, many such signals express meanings that have no correspon-
dence in nonhuman species, such as religious affiliation or the ability to switch between
languages.
Finally, biological processes often confer advantages for survival and/or reproduc-
tion of the organism, but they are neutral with respect to their social value. Biologically
driven signals of sexual attraction, such as pupil dilation and erection (discussed below),
occur because sexual interaction promotes procreation but are largely indifferent to cul-
tural practices or social mores. Learned behaviors, however, are embedded firmly within
the beliefs, morals, and norms of a particular social system, such that certain ways of
being become better or worse within the cultural or social frame. So, for instance, par-
ticular body sizes are thought to be beautiful in some cultures and are stigmatized in
others based on the values of the particular social system.

Biological Processes Underlying Social Signals

Having shown some of the ways that biological processes differ from sociocultural pro-
cesses relevant to social signals, we offer more background on each system separately
before we suggest ways in which they are integrated. Humans are biological beings
who use their nervous systems and sensory abilities to navigate their social world. Con-
sequently, they biologically process a range of communicative and informative social
signals. Three examples, discussed here, are secondary sexual characteristics, emotion
displays, and signals of attraction and sexual receptivity. They are used to illustrate the
nature and reach of biological processes. Similar examples are used when we discuss
sociocultural processes.

Secondary Sexual Characteristics


Sexual ontogeny is characterized by the development of secondary sexual characteris-
tics, those physical features that distinguish the males and females of a species but are
not directly related to the functions of the reproductive system (Sherar, Baxter-Jones,
& Mirwald, 2004). Androgens, estrogens, and progesterone in humans promote sec-
ondary sexual characteristics such as growth of facial and body hair, enlargement of
the larynx and deepening of the voice, and increased muscle mass and strength in men,
and enlargement of breasts, widening of hips, and rounding of the jawline in women.
The development of secondary sexual characteristics in humans begins around age nine
(Susman et al., 2010), although there is a documented trend toward earlier development
among children in the United States (Herman-Giddens, 2006).
In principle, these and other phenotypic markers (the observable physical character-
istics of an organism) provide sufficient information for people to differentiate between
women and men in social interaction with high levels of accuracy. Indeed, research
shows that observers distinguish the sexes at above-chance levels based on differences
in secondary sexual characteristics such as waist-to-hip ratio (Johnson & Tassinary,
2005), jawline shape (Brown & Perrett, 1993), and vocal pitch (Bennett & Montero-
Diaz, 1982). Secondary sexual characteristics therefore serve as biological social sig-
nals, insofar as they are produced biologically (hormonally, in this instance) and provide
information that can shape social interactions, attitudes, relations, and/or emotions.

Emotion Displays
Emotion displays are perceivable kinesic (body) and vocalic behaviors that convey emo-
tional states. Many emotion displays are more socially than biologically processed, as
we will discuss. Nonetheless, some displays arise from organic physiological processes
and are sufficiently similar across cultures and species to qualify as biological social
signals.
An anger display provides an illustrative example. The experience of anger ini-
tiates sympathetic arousal, prompting a variety of physical changes that are often
perceivable by others, such as increased muscle tension in the face and body,
flared nostrils, increased perspiration, and flushing in the face (Levenson, 2003; Tucker,
Derryberry, & Luu, 2000). Muscle tension is observed in the furrowed brow and
clenched jaw that accompany the prototypical facial display of anger, whereas flush-
ing results from increased vascular blood flow. Flared nostrils allow for increased oxy-
gen intake, providing extra energy to fuel a potential attack, and increased perspiration
serves to prevent hyperthermia. Galati, Scherer, and Ricci-Bitti (1997) demonstrated
that this configuration does not differ significantly between sighted and congenitally
blind individuals, suggesting a primarily biological (rather than learned) basis.
Because facial anatomy and sympathetic nervous system physiology are culturally invariant (see, e.g., Gray & Goss, 1966), to the extent that anger displays are biologically
processed, a high degree of correspondence would be expected across cultures in (1)
the way anger is encoded and (2) the expression that is interpreted to convey anger.
Matsumoto et al. (2008) reviewed evidence from multiple cross-cultural studies docu-
menting that anger (and other basic emotions) are both encoded and decoded in highly
consistent ways across cultures (although we discuss the limits to this in our next sec-
tion). Similarly, to the extent that anger displays are biologically processed, the human
display of anger should be similar to that of species with similar facial structure and
musculature. Parr, Waller, and Fugate (2005) review evidence from nonhuman primates
documenting displays of aggression analogous to human facial displays of anger, sup-
porting their biological origins.
These observations are not unique to anger displays. As Darwin (1873) observed,
humans and other animals express many emotions in ways that serve the survival func-
tions of those emotions. For instance, the emotion of surprise aids survival by focusing
attention on an unexpected and potentially threatening occurrence. The prototypical
look of surprise serves that function with wide eyes (for increased visual acuity), an
open mouth (for increased oxygen intake, fueling a potential response to the threat),
and a hand over the mouth (for protection against unwanted ingestion). Similarly, the
emotion of disgust aids survival by prompting the expulsion of a toxic substance from
the body, and the expression of disgust configures the face to spew such a substance
from the mouth.

Signals of Attraction and Sexual Receptivity


Some species are less than subtle when signaling their sexual interest and availability
to conspecifics (others of the same species). The hindquarters of the female savannah
baboon, for instance, swell and turn bright red, an unmistakable biological signal of
her sexual receptivity (Altmann, Hausfater, & Altmann, 1988). Although human social
signals of attraction and sexual receptivity may be more discreet, some are similarly
biologically processed.
Like the baboon, male and female humans experience vasocongestion secondary to
the process of sexual arousal and reproduction. Vasocongestion occurs when increased
vascular blood flow and localized blood pressure cause body tissues to swell. One read-
ily observable effect is the reddening of the skin during sexual excitement, plateau,
orgasm, and/or resolution known colloquially as “sex flush” (Mah & Binik, 2001). Vaso-
congestion also produces penile erection, hardening of the clitoris, and swelling of the
nipples during sexual arousal (Janssen & Everaerd, 1993; Laan, Everaerd, & Evers,
1995). To those who observe them, these physical responses signal sexual attraction
and receptivity among humans.
Another social signal of attraction (and perhaps also of receptivity) that is biologi-
cally processed is pupil dilation. In many species, including humans, pupils dilate auto-
matically in response to sympathetic nervous system arousal (see Bradley et al., 2008).
Having dilated pupils therefore signals arousal. Although sympathetic arousal can result
from both positively and negatively valenced emotions, pupil dilation increases physi-
cal attractiveness in humans and may therefore signal romantic and/or sexual receptiv-
ity. Early research with adolescents suggested a sex difference in this effect (see Bull
& Shead, 1979), but Tombs and Silverman (2004) demonstrated that both women and
men are more attracted by larger pupils than by smaller pupils in opposite-sex partners.
Secondary sexual characteristics, emotion displays, and signals of attraction and
receptivity are not the only social signals that humans process biologically. It is likely
that signals related to age, ethnicity, sexual orientation, intelligence, dominance, empathy, and many other characteristics also have biologically processed components. Conversely,
many social signals are processed in fundamentally sociocultural ways, as we examine
next.

Sociocultural Processing of Social Signals

In addition to being biological beings, humans are also social and cultural beings,
brought up in and affected by the people around them. Following others, Philipsen
(2009) refers to groups of people who share the same set of social rules and meanings as
speech communities. As people grow up in a certain community, they learn the norms,
values, beliefs, and patterns of engagement of that group. These cultural ways of being shape
the ways in which people come to understand many of the social signals others send to
them and those they send to others. Philipsen, as do many other scholars (e.g., S. Hall,
1997; Schegloff, 1984), argues that social signals and the rules that govern them come
to be understood within a particular context; only those who share a particular cultural
code can fully understand the social signals and the rules that govern them. Socially
determined behaviors also reflect and affect the values and ideologies of those who use
the codes. To help show how these processes work, we use the same primary areas dis-
cussed in the section on biological signals, albeit in very different ways, to provide three
examples of how being a cultural being can shape our signaling processes.

Being Gendered
Whereas people are born with and develop secondary sexual characteristics naturally as
part of their ontogenetic development, they also learn what it means to be male or female within a particular society. When scholars talk about “gender” rather than “biological sex,” they typically reference how people are brought up to act, think, and feel by virtue
of being male or female. In many cultures, for instance, women and girls are encouraged
to be “pleasant,” and they are significantly more likely to smile (and to be expected to
smile) in social interactions and in photographs than are males (J. Hall, 2006). That
there is no such difference when people are not in social situations suggests that the
pattern is learned and not innate.
Women are also taught to be the subject of males’ gaze in some cultures. Roy (2005)
argued that women are often portrayed by mediated sources in India as “the province and
property” of men, in that they are positioned most commonly in advertisements so as to
be gazed upon by men. Men are not gazed at in the same way by women. Roy argued
that the positioning, along with camera angle, lighting, and other elements, suggested that
women were there to be looked at, and in some cases “owned” by the gazing men. This
suggests an array of rules presented to consumers of what it means to be a male or
female in that culture.
The differences in actual behaviors (biological or learned) between males and females
across cultures are quite small (J. Hall, 2006), but the perception that the two groups
differ significantly is enhanced by stereotypes developed within a culture or set of
cultures. In a recent study, Koppensteiner and Grammer (2011) found that their Aus-
trian research participants made different judgments of the social characteristics of
stick figures, with “male” figures seen to be more extraverted and emotionally stable
and “females” described as agreeable. Whereas stereotyping is a common biological
process, the concepts held within the stereotypes, and the behaviors people engage in
because of their stereotypes, are learned within a cultural or social group (S. Hall, 1997).

Emotional Expression
As noted earlier, there is evidence for universal emotional expressions, such as anger.
But emotional expressions and the rules for their use are also shaped by our speech
communities. Ekman (1977) discussed cultural display rules to reveal the ways in
which a particular group defines “appropriate” and normative emotional expression,
including whether or not to show an experienced emotion (see also, Aune, 2005). In a
project testing an inventory of display rules (the Display Rule Assessment Inventory),
Matsumoto et al. (2005) found that, of the groups they studied, the Japanese were least
likely to show anger and contempt, with Americans showing the most happiness. Relat-
edly, Matsumoto, Yoo, and Fontaine (2008) learned that, compared to individualistic
cultures, collectivistic cultures enforce a display rule of less emotional expressivity
overall. Within-group differences are also learned. For instance, norms of politeness
proscribe displaying specific emotions in particular social contexts, such as the expres-
sion of anger toward a customer in a customer–service encounter (Goldberg & Grandey,
2007).
People learn display rules as part of their socialization or enculturation. In some
cultures, the media play an important role in affective (emotional) learning, and the
greater people’s exposure to the media, the more they are “programmed” by what they
see. Emotions displayed on television tend to be different from those that occur in real life
(Houle & Feldman, 1991) in that they appear more commonly, tend to be only of three
types (happiness, sadness, anger), and are also simple rather than complex emotions.
Thus, those who learn affective social signals largely from television have a different,
and generally incorrect, view about the nature of such cues than do others.
Because emotional expressiveness has a learned quality, people can also become bet-
ter at it over time. This capacity is variously named affective skill, emotional expressivity, and expressiveness control, among other similar terms (see Riggio & Riggio, 2005), and researchers have created systems for teaching people to attend better to the socially appropriate expression of emotions within their speech community (Duke & Nowicki, 2005). Given
the problems people face when they are ineffective at emotional signaling, the ability to
learn how to do so more effectively is promising.

Signals of Attraction
As part of our enculturation, we come to see certain characteristics as more or less
attractive, and certain ways of acting as more or less likely to attract. Within Western
cultures, attractiveness has come to be defined over time as tied to youthfulness. This
is a relatively recent phenomenon, and this “ageist ideology” is not one shared by all
cultures (Jaworski, 2003). In order to attract others, people in many Western cultures
do a great deal to suggest more youthfulness than they may have. This has been more
prominent for women than for men, and for girls than for boys, but the emphasis on
youthfulness as an attractor has been increasing for males as well (Coupland, 2003).
Whereas courtship and mating behaviors occur across species in order to attract
another, the nature of those behaviors and the patterning of them differ significantly
across cultures and are done differently by men and women. Within the United States
and Canada, studies of flirting or courtship behaviors between heterosexuals in bars
show that such behaviors often follow a particular sequence linked with learned gen-
der roles (e.g., Perper & Weis, 1987). Initial signaling tends to be done by women, for
example, and includes three types of gaze, smiling, and caressing objects. Such behav-
iors are typically learned covertly (by watching others, with no formal discussion about
how to engage in them) and, as such, can be seen as “natural” attraction cues, even
though they are a part of the speech community’s signaling code.
Even within the same speech community, however, different groups are socialized
to see the same social attraction signals in different ways. Across several studies in
the United States and England, for instance, men tend to interpret more attraction and
sexual intent in cues that women see instead as “friendly” behavior (Abbey & Melby,
1986; Egland, Spitzberg, & Zormeier, 1996). Thus, there are at times competing codes,
learned sometimes by one part of the group in a different way than by other parts.

Interactions between Biological and Sociocultural Processes

Although we have discussed them independently, biological and sociocultural processes of producing and interpreting social signals often behave interdependently. In this
section, we reference some of the means through which this occurs. To begin, some bio-
logically processed social signals are modified by sociocultural influences. For example,
individuals can intentionally manipulate many secondary sexual characteristics to alter
the signal being sent (i.e., the data regarding their biological sex). Even without inter-
vening hormonally (e.g., by taking androgen therapy), for instance, transgender individ-
uals can modify their vocal qualities to sound more like their desired than their biologi-
cal sex (Hancock & Garabedian, 2013). Men with gender dysphoria can undergo breast
augmentation and facial feminizing surgery (Monstrey et al., 2014), and male cross-
dressers often use electrolysis to remove facial hair (Ekins, 1997). By altering the look
or sound of secondary sexual characteristics, these strategies modify their meaning and
significance as social signals. They may, for example, change the information conveyed
about (1) which biological sex an individual was born with and/or (2) which biological
sex, if any, the individual identifies with, either of which can alter social interactions,
attitudes, relations, or emotions (see e.g., Pusch, 2005).
Such characteristics are augmented in other ways. Goffman (1976) referred to the
ways that people exaggerate their biological sex traits through gender advertisements.
In his review of print advertisements, Goffman revealed a tendency for women to be
shown largely as shy, dreamy, gentle, helpless, and likely to be manipulated, whereas
males were “advertised” as powerful, controlling, and dominant. Advertising, or displaying with some purpose, that we are male or female is only one way in which we use inherited cues socially, but it is a very powerful one.
In gender advertisements, biology is exaggerated by cultural demands, but social
rules may also affect the ways in which we respond physiologically to another. Buck
(1989), for instance, described a social biofeedback process that occurs in relationships.
Partners in relationships develop rules over time for how to approach emotion and its
expression between them. As the relationship continues, the rules the couple share, and
the constraints that the rules provide, affect the ways in which the couples experience
those emotions subsequently. When, for instance, couples come to enjoy arguing, the
emotion they experience automatically when conflict arises will be positive, compared
to the fear, sadness, or anger that others might feel. Thus, the existence of the social
or cultural patterns of the relationship changes how the couple experiences some of the
emotion-invoking events that occur between them.
Similarly, the social environment sometimes plays a role in activating biological pro-
cessing. Some biological means of processing social signals are, in other words, inert without the influence of specific inputs from the social or cultural environment. For
instance, Panksepp, Knutson, and Pruitt (1998) first described the epigenesis of emo-
tional behavior, the process by which particular environmental influences are necessary
to activate genetic predispositions for emotional experience (and, thus, for expression).
In an example of empirical work aimed at identifying specific social/genetic interac-
tions that influence emotion, Sugden et al. (2010) found that a variant on the serotonin
transporter (5-HTT) gene predisposes children to a broad spectrum of emotional prob-
lems but that such problems emerge only among children living in threatening social
environments.
These are just a few of the many ways in which biological and sociocultural processes
interact as we use social signals to engage with others. They begin, however, to speak
to the complexity of determining in any given social encounter which cues are purely
biological, determined by the social or cultural surround, or are a unique combination
of biological and sociocultural processing. Our hope is that this chapter provides an
opportunity to begin to appreciate the intricate ways in which our innate and learned
capabilities allow us to interact and relate with one another.

References

Abbey, A. & Melby, C. (1986). The effect of nonverbal cues on gender differences in perceptions
of sexual intent. Sex Roles, 15, 283–298.
Altmann, J., Hausfater, G., & Altmann, S. A. (1988). Determinants of reproductive success in
savannah baboons, Papio cynocephalus. In T. H. Clutton-Brock (Ed.), Reproductive Success:
Studies of Individual Variation in Contrasting Breeding Systems (pp. 403–418). Chicago: Uni-
versity of Chicago Press.
Aune, K. S. (2005). Assessing display rules in relationships. In V. Manusov (Ed.), The Sourcebook
of Nonverbal Measures: Going Beyond Words (pp. 151–161). Mahwah, NJ: Lawrence Erlbaum.
Bennett, S. & Montero-Diaz, L. (1982). Children’s perceptions of speaker sex. Journal of Pho-
netics, 10, 113–121.
Bradley, M. M., Miccoli, L., Escrig, M. A., & Lang, P. J. (2008). The pupil as a measure of emo-
tional arousal and autonomic activation. Psychophysiology, 45, 602–607. doi: 10.1111/j.1469-
8986.2008.00654.x.
Brown, D. E. & Perrett, D. I. (1993). What gives a face its gender? Perception, 22, 829–840. doi:
10.1068/p220829.
Buck, R. (1989). Emotional communication in personal relationships: A developmental-
interactionist view. In C. D. Hendrick (Ed.), Close Relationships: Review of Personality and
Social Psychology (vol. 10, pp. 144–163). Newbury Park, CA: SAGE.
Bull, R. & Shead, G. (1979). Pupil dilation, sex of stimulus, and age and sex of observer. Percep-
tual and Motor Skills, 49, 27–30. doi: 10.2466/pms.1979.49.1.27.
Coupland, J. (2003). Ageist ideology and discourses of control in skincare product marketing. In
J. Coupland & R. Gwyn (Eds), Discourse, the Body and Identity (pp. 127–150). Basingstoke,
England: Palgrave Macmillan.
Darwin, C. R. (1873). The Expression of the Emotions in Man and Animals. London: John Murray.
Duke, M. & Nowicki, S. (2005). The Emory Dissemia Index. In V. Manusov (Ed.), The Source-
book of Nonverbal Measures: Going Beyond Words (pp. 25–46). Mahwah, NJ: Lawrence
Erlbaum.
Egland, K. L., Spitzberg, B. H., & Zormeier, M. M. (1996). Flirtation and conversational
competence in cross-sex platonic and romantic relationships. Communication Reports, 9,
105–117.
Ekins, R. (1997). Male Femaling: A Grounded Theory Approach to Cross-dressing and Sex
Changing. New York: Routledge.
Ekman, P. (1977). Biological and cultural contributions to body and facial movement. In J. Black-
ing (Ed.), Anthropology of the Body (pp. 34–84). London: Academic Press.
Galati, D., Scherer, K. R., & Ricci-Bitti, P. E. (1997). Voluntary facial expression of emotion:
Comparing congenitally blind with normally sighted encoders. Journal of Personality and
Social Psychology, 73, 1363–1379.
Goffman, E. (1976). Gender Advertisements. Cambridge, MA: Harvard University Press.
Goldberg, L. S. & Grandey, A. A. (2007). Display rules versus display autonomy: Emotion regu-
lation, emotional exhaustion, and task performance in a call center situation. Journal of Occu-
pational Health Psychology, 12, 301–318. doi: 10.1037/1076-8998.12.3.301.
Gray, H. & Goss, C. M. (1966). Anatomy of the Human Body (28th edn). Philadelphia, PA: Lea
& Febiger.
Hall, J. A. (2006). Women’s and men’s nonverbal communication: Similarities, differences, stereo-
types, and origins. In V. Manusov & M. L. Patterson (Eds), The SAGE Handbook of Nonverbal
Communication (pp. 201–218). Thousand Oaks, CA: SAGE.
Hall, S. (1997). Representation: Cultural Representations and Signifying Practices. London:
SAGE.
Hancock, A. B. & Garabedian, L. M. (2013). Transgender voice and communication treatment: A
retrospective chart review of 25 cases. International Journal of Language & Communication
Disorders, 48, 54–65. doi: 10.1111/j.1460-6984.2012.00185.x.
Herman-Giddens, M. E. (2006). Recent data on pubertal milestones in United States children:
The secular trend toward earlier development. International Journal of Andrology, 29, 241–
246. doi: 10.1111/j.1365-2605.2005.00575.x.
Houle, R. & Feldman, R. S. (1991). Emotional displays in children’s television programming.
Journal of Nonverbal Behavior, 15, 261–271.
Janssen, E. & Everaerd, W. (1993). Determinants of male sexual arousal. Annual Review of Sex
Research, 4, 211–245. doi: 10.1080/10532528.1993.10559888.
Jaworski, A. (2003). Talking bodies: Representations of norm and deviance in the BBC Naked
programme. In J. Coupland & R. Gwyn (Eds), Discourse, the Body and Identity (pp. 151–176).
Basingstoke, England: Palgrave Macmillan.
Johnson, K. L. & Tassinary, L. G. (2005). Perceiving sex directly and indirectly: Mean-
ing in motion and morphology. Psychological Science, 16, 890–897. doi: 10.1111/j.1467-
9280.2005.01633.x.
Koppensteiner, M. & Grammer, K. (2011). Body movements of male and female speakers and
their influence on perceptions of personality. Personality and Individual Differences, 51, 743–
747. doi: 10.1016/j.paid.2011.06.014.
Laan, E., Everaerd, W., & Evers, A. (1995). Assessment of female sexual arousal: Response
specificity and construct validity. Psychophysiology, 32, 476–485. doi: 10.1111/j.1469-
8986.1995.tb02099.x.
Levenson, R. W. (2003). Autonomic specificity and emotion. In R. J. Davidson, K. R. Scherer,
& H. H. Goldsmith (Eds), Handbook of Affective Sciences (pp. 212–224). New York: Oxford
University Press.
Mah, K. & Binik, Y. M. (2001). The nature of human orgasm: A critical review of major trends.
Clinical Psychology Review, 21, 823–856. doi: 10.1016/S0272-7358(00)00069-6.
Matsumoto, D., Keltner, D., Shiota, M., Frank, M., & O’Sullivan, M. (2008). Facial expressions
of emotion. In M. Lewis, J. Haviland, & L. Feldman-Barrett (Eds), Handbook of Emotion
(pp. 211–234). New York: Guilford Press.
Matsumoto, D., Yoo, S. H., & Fontaine, J. (2008). Mapping expressive differences around the
world: The relationship between emotional display rules and individualism versus collectivism.
Journal of Cross-Cultural Psychology, 39, 55–74. doi: 10.1177/0022022107311854.
Matsumoto, D., Yoo, S. H., Hirayama, S., & Petrova, G. (2005). Validation of an individual-
level measure of display rules: The display rule assessment inventory (DRAI). Emotion, 5,
23–40.
Monstrey, S. J., Buncamper, M., Bouman, M.-B., & Hoebeke, P. (2014). Surgical interventions
for gender dysphoria. In B. P. C. Kreukels, T. D. Steensma, & A. L. C. de Vries (Eds), Gender
Dysphoria and Disorders of Sex Development (pp. 299–318). New York: Springer.
Panksepp, J., Knutson, B., & Pruitt, D. L. (1998). Toward a neuroscience of emotion: The epi-
genetic foundations of emotional development. In M. F. Mascolo & S. Griffin (Eds), What
Develops in Emotional Development? Emotions, Personality, and Psychotherapy (pp. 53–84).
New York: Plenum Press.
Parr, L. A., Waller, B. M., & Fugate, J. (2005). Emotional communication in primates:
Implications for neurobiology. Current Opinion in Neurobiology, 15, 716–720. doi:
10.1016/j.conb.2005.10.017.
Perper, T. & Weis, D. L. (1987). Proceptive and rejective strategies of US and Canadian college
women. Journal of Sex Research, 23, 455–480.
Philipsen, G. (2009). Researching culture in contexts of social interaction: An ethnographic
approach, a network of scholars, illustrative moves. In D. Carbaugh & P. M. Buz-
zanell (Eds), Distinctive Qualities in Communication Research (pp. 87–105). New York:
Routledge.
Poggi, I. & D’Errico, F. (2011). Social signals: A psychological perspective. In A. A. Salah & T.
Gevers (Eds), Computer Analysis of Human Behavior (pp. 185–225). London: Springer.
Pusch, R. S. (2005). Objects of curiosity: Transgender college students’ perceptions of
the reactions of others. Journal of Gay & Lesbian Issues in Education, 3, 45–61. doi:
10.1300/J367v03n01_06.
Riggio, R. E. & Riggio, H. R. (2005). Self-report measures of emotional and nonverbal expres-
siveness. In V. Manusov (Ed.), The Sourcebook of Nonverbal Measures: Going Beyond Words
(pp. 105–111). Mahwah, NJ: Lawrence Erlbaum.
Roy, A. (2005). The “male gaze” in Indian television commercials: A rhetorical analysis. In
T. Carilli & J. Campbell (Eds), Women and the Media: National and Global Perspectives
(pp. 3–18). Lanham, MD: University Press of America.
Schegloff, E. (1984). On some gestures’ relation to talk. In J. M. Atkinson & J. Heritage (Eds),
Structures of Social Action (pp. 266–296). Cambridge: Cambridge University Press.
Sherar, L. B., Baxter-Jones, A. D. G., & Mirwald, R. L. (2004). Limitations to the use of sec-
ondary sex characteristics for gender comparisons. Annals of Human Biology, 31, 586–593.
doi: 10.1080/03014460400001222.
Sugden, K., Arseneault, L., Harrington, et al. (2010). The serotonin transporter gene moder-
ates the development of emotional problems among children following bullying victimiza-
tion. Journal of the American Academy of Child and Adolescent Psychiatry, 49, 830–840. doi:
10.1016/j.jaac.2010.01.024.
Susman, E. J., Houts, R. M., Steinberg, L., et al. (2010). Longitudinal development of secondary
sexual characteristics in girls and boys between ages 9½ and 15½ years. JAMA Pediatrics,
164, 166–173. doi: 10.1001/archpediatrics.2009.261.
Tombs, S. & Silverman, I. (2004). Pupillometry: A sexual selection approach. Evolution &
Human Behavior, 25, 221–228. doi: 10.1016/j.evolhumbehav.2004.05.001.
Tucker, D. M., Derryberry, D., & Luu, P. (2000). Anatomy and physiology of human emotions:
Vertical integration of brainstem, limbic, and cortical systems. In J. C. Borod (Ed.), The Neu-
ropsychology of Emotions (pp. 80–105). Oxford: Oxford University Press.

Further Reading

Matsumoto, D., Yoo, S. H., & Chung, J. (2010). The expression of anger across cultures. In M.
Potegal, G. Stemmler, & C. Spielberger (Eds), International Handbook of Anger (pp. 125–137).
New York, NY: Springer.
Simpson, B. S. (1997). Canine communication. The Veterinary Clinics of North America, Small
Animal Practice, 27, 445–464.
3 Universal Dimensions of Social
Signals: Warmth and Competence
Cydney H. Dupree and Susan T. Fiske

Humans long ago developed the automatic ability to prioritize social percep-
tion. Whether traveling ancient, dusty roads thousands of years past or meandering
metropolitan blocks long after midnight, people must immediately answer two critical
questions in a sudden encounter with a stranger. First, one must determine if the stranger
is a friend or foe (i.e., harbors good or ill intent), and second, one must ask how capable
the other is of carrying out those intentions. Since ancestral times, these two questions
have been crucial for the survival of humans as social animals. The ability to quickly
and accurately categorize others as friend or foe would have profoundly influenced the
production and perception of social signals exchanged between agents. In developing
computational analyses of human behavior, researchers and technicians alike can ben-
efit from a thorough understanding of social categorization – the automatic process by
which humans perceive others as friend or foe. This chapter will describe over a decade
of research emerging from social psychological laboratories, cross-cultural research,
and surveys that confirm two universal dimensions of social cognition: warmth (friend-
liness, trustworthiness) and competence (ability, efficacy) (see Fiske, Cuddy, & Glick,
2007, for an earlier review).

Foundational Research

Although appearing under different labels, the warmth and competence dimensions have
consistently emerged in classical and contemporary studies of person perception (Asch,
1946; Rosenberg, Nelson, & Vivekananthan, 1968; Wojciszke, Bazinska, & Jaworski,
1998), construal of others’ past actions (Wojciszke, 1994), and voters’ approval of
political candidates in both the United States (Abelson et al., 1982; Kinder & Sears,
1981) and Poland (Wojciszke & Klusek, 1996). Developing impressions of leaders also
involves the warmth and competence dimensions, including image management (build-
ing trust), relationship development (warmth), and resource deployment (competence
and efficiency) (Chemers, 1997).
Further examination of past and present research reveals the extent to which humans
use warmth and competence dimensions in navigating the social world. Peeters (1983)
was one of the first to describe two independent dimensions at the trait level by
defining self-profitability (competence, advantageous to the self) and other-profitability
(warmth and morality, advantageous to others) in perceivers’ social domain. This work
set the precedent for Wojciszke’s impression formation research, which suggests that
approach-avoidance tendencies are primarily based on appraisals of morality and competence. Basic dimensions of morality and competence account for 82 percent of vari-
ance in perceptions of well-known others (Wojciszke et al., 1998). Similar patterns
emerge in impressions of work supervisors (Wojciszke & Abele, 2008). In addition,
self-perception shows similar patterns, for three-quarters of over 1,000 personally expe-
rienced behaviors are framed in terms of morality or competence (Wojciszke, 1994).
Taken together, these findings suggest that our spontaneous judgments of self and oth-
ers are almost entirely accounted for with the two basic dimensions of warmth and
competence (for a review, see Wojciszke, 2005).
Regarding terminology, although one could take issue with the combination or sep-
aration of “warmth” and “trust” (Leach, Ellemers, & Barreto, 2007), the two features
are strongly linked, consistently correlating (Kervyn, Fiske, & Yzerbyt, 2015). Though
Wojciszke and colleagues use terms translated as “morality” and “competence,” these
moral traits include terms such as fair, generous, honest, and righteous – all of which
overlap with the warmth-trustworthiness dimensions. The “competence” term used by
Wojciszke’s lab clearly refers to traits like clever, efficient, knowledgeable, foresighted,
and creative. Therefore, regardless of the terms used, the core dimensions consistently
emerge.
Outside of psychological research, both ancient and contemporary rhetorical scholars
have long emphasized expertise (competence) and trustworthiness for social perception
of credibility in communication. For decades, rhetorical scholars have considered source
credibility, the receiver’s attitude toward a communicator, to be one of the most impor-
tant aspects of communication, persuasive or otherwise (Andersen & Clevenger, 1963;
McCroskey & Teven, 1999; McCroskey & Young, 1981; Sattler, 1947). Evidence of
this construct extends back to ancient times, when Aristotle described the image of the
communicator as a source’s ethos.
The multidimensionality of a source’s image has been without question among rhetor-
ical scholars since ancient times, and the warmth and competence dimensions have
consistently emerged within this literature. Classic studies measure source credibility
along the dimensions of reputation and competence (Haiman, 1948), which correspond
to Aristotle’s ethos components of character and intelligence. Though multiple compo-
nents have emerged as the field has focused its attention on measuring ethos/credibility,
theorists have generally agreed on two dimensions: “competence” (qualification, expert-
ness, intelligence, authoritativeness) and “trustworthiness” (character, sagacity, hon-
esty) (McCroskey & Teven, 1999).
An abundance of research on relational messages further demonstrates the role of
person perception in the field of communication. As expressed by Hawes (1973) and
many other communication researchers, “Communication functions not only to trans-
mit information, but to define the nature of the relationship binding the symbol users”
(p. 15). Hawes cautioned against viewing communication as a series of distinct, easily
delineated segments, for each segment is subject to previous influence, and such rela-
tional influence shapes future segments of communication. Accordingly, many com-
munication theorists have focused on relational communication, the verbal and non-
verbal themes present in communication that define interpersonal relationships (Burgoon & Hale, 1984, 1987; Burgoon & LePoire, 1993). Empirical investigations, including
factor analysis and content coding, suggest up to twelve distinct themes along which
relational communication may vary. However, communication researchers again tend to
agree on two primary dimensions that underlie relational communication: affiliation (or
intimacy) and dominance (status/competence) (Burgoon & Hale, 1984, 1987; Dillard &
Solomon, 2005).
Leary’s (1957) theory of personality provides indirect support for the warmth and
competence dimensions in person perception. Leary proposed a two-factor theory of
interpersonal behavior, suggesting that judgment of others’ behavior and personal-
ity centers on two orthogonal dimensions: dominance-submission and love-hate
(affection-hostility). (For further examination of the dominance-submission dimension,
see Chapter 4, this volume.) Leary laid the foundation for decades of research support-
ing his two-factor theory and contributed to a variety of work supporting the centrality
of warmth and competence in person perception.
Theoretical and empirical evidence from multiple fields supports the essential role of
both warmth and competence in person perception. However, examining these dimen-
sions from an evolutionary perspective could suggest that judging whether another’s
intentions are good or ill (warmth) may have priority over judging another’s abilities
(competence). Considerable evidence shows that warmth is in fact primary: warmth
is judged before competence, and warmth judgments carry more weight. This has
been shown with the examination of approach-avoidance tendencies. Morality (warmth)
judgments precede competence-efficacy judgments to predict these approach–avoid ten-
dencies, making them the fundamental aspect of evaluation (Cacioppo, Gardner, &
Berntson, 1997; Peeters, 2001; Willis & Todorov, 2006). The moral-social dimension
is more cognitively accessible, in greater demand, more predictive, and more heavily
weighted in evaluative judgments toward others (Wojciszke et al., 1998). People tend to
use the warmth dimension when determining valence of impressions toward others (i.e.,
whether the impression is positive or negative); in contrast, people use the competence
dimension when determining strength of impressions (i.e., how positive or how nega-
tive the impression is) (Wojciszke et al., 1998; see also Wojciszke, Brycz, & Borkenau,
1993).

From Interpersonal to Intergroup Perception


Though warmth and competence dimensions have been shown to guide impressions and
reactions toward others on an interpersonal level, these two dimensions also emerge in
judgments of different social groups. Examining stereotypes applied to groups reveals
warmth and competence as central dimensions of intergroup perceptions. As with inter-
personal evaluations, people spontaneously home in on the traits associated with warmth
and competence when evaluating ingroups and outgroups.
Historically, social psychologists had largely ignored questions about the content of
stereotypes that are applied to social groups. Researchers have instead preferred to
study the process of group stereotyping (how stereotypes develop, are used, and change)
rather than the content and social function of these stereotypes. However, in the past few
decades, researchers have begun to go beyond investigating process to also examine the variety of social stereotype content and what factors predict perceptions of various ingroups
and outgroups.
Although the earliest studies of stereotypes emphasized their content (Katz &
Braly, 1933), and some recognized stereotypes derogating intelligence versus sociality
(Allport, 1954), the dominant view was uniform negativity toward outgroups and pos-
itivity toward ingroups. At the end of the twentieth century, research began systemat-
ically differentiating attitudes toward different social groups. For example, by exper-
imentally manipulating intergroup contexts, Alexander, Brewer, and Hermann (1999)
showed divergent images associated with different outgroups. Their intergroup image
theory predicts negative stereotypes toward outgroups that are perceived as having goals
incompatible with the ingroup. Goal incompatibility leads to negative perceptions along
the warmth dimension: untrustworthiness, hostility, and ruthlessness. Outgroups that are
perceived as having low status and low power are stereotyped negatively along the com-
petence dimension. In creating this taxonomy of enemy images in political psychology,
Alexander and colleagues proposed that behavioral orientations toward outgroups vary
based on factors such as power, status, and goal compatibility (Alexander et al., 1999;
Alexander, Brewer, & Livingston, 2005). The types of biases toward various outgroups
can differ depending on perceptions of that group’s willingness and ability to help – or
hinder – the social standing of one’s ingroup.
Despite the similar two-dimensional nature of stereotype content, the way that people
judge warmth and competence of individuals differs from judging these dimensions in
groups. At the interpersonal level, the two dimensions tend to correlate positively. Peo-
ple expect other individuals to be more-or-less evaluatively consistent (Fiske & Taylor,
2013; Rosenberg et al., 1968). However, when people judge social groups, warmth and
competence evaluations tend to correlate negatively. Many groups are simultaneously
judged as high in one dimension and low in the other, which has implications for pre-
dicting people’s behavioral and emotional reactions to members of other social groups
(Fiske, 1998; Fiske et al., 1999, 2002; Yzerbyt et al., 2005).

Stereotype Content Model

The stereotype content model’s (SCM) warmth-competence framework allows for four unique combinations of social judgment: two unambivalent warmth-competence
combinations (high warmth/high competence, low warmth/low competence) and two
ambivalent warmth-competence combinations (high warmth/low competence, low
warmth/high competence). This two-dimensional warmth-by-competence space cate-
gorizes social ingroups and outgroups.
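For readers approaching this framework computationally, the sketch below shows one minimal way the four warmth-by-competence combinations could be represented in code. The rating scale, midpoint threshold, group names, and example values are illustrative assumptions, not part of the SCM literature itself.

```python
# A minimal, illustrative sketch of the SCM's four warmth-by-competence combinations.
# The midpoint threshold and the example ratings are assumptions for illustration only.

from dataclasses import dataclass

MIDPOINT = 3.0  # assumed midpoint of a hypothetical 1-5 rating scale


@dataclass
class GroupJudgment:
    group: str
    warmth: float       # mean perceived warmth (friendliness, trustworthiness)
    competence: float   # mean perceived competence (ability, efficacy)

    def quadrant(self) -> str:
        """Map a warmth/competence judgment onto one of the four SCM combinations."""
        high_w = self.warmth >= MIDPOINT
        high_c = self.competence >= MIDPOINT
        if high_w and high_c:
            return "high warmth / high competence (pride, admiration)"
        if not high_w and not high_c:
            return "low warmth / low competence (contempt, disgust)"
        if high_w:
            return "high warmth / low competence (pity, sympathy)"
        return "low warmth / high competence (envy, resentment)"


# Hypothetical ratings, used only to show the mapping.
for judgment in [GroupJudgment("ingroup ally", 4.2, 4.1),
                 GroupJudgment("elderly (stereotype)", 4.0, 2.3),
                 GroupJudgment("rich professionals (stereotype)", 2.1, 4.3)]:
    print(judgment.group, "->", judgment.quadrant())
```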
The SCM shows that people depict the ingroup, the societal prototype group and its
allies, as high in both warmth and competence. At present, in the United States, those
identified as middle class, heterosexual, Christian, and citizens are viewed as societal
ingroups. People express pride and admiration for them (Cuddy, Fiske, & Glick, 2007;
Fiske et al., 1999; Fiske et al., 2002).
In contrast, the other unambivalent space in the SCM’s framework is occupied by the most extreme of social outcasts: drug addicts and the homeless (Harris & Fiske, 2007), welfare recipients, and undocumented immigrants (Fiske & Lee, 2012; Lee & Fiske, 2006). These groups are viewed with extreme antipathy and are actively scorned, eliciting feelings of contempt and even disgust (Harris & Fiske, 2006). Indeed, low-warmth/low-competence social groups elicit automatic neural reactions that reflect scorn, contempt, and even disgust (Harris & Fiske, 2006, 2007).
Although some outgroups fall into the low/low space of the SCM framework, ambiva-
lence more often is involved in intergroup perception. Groups stereotyped as high in one
dimension are often seen as low in the other (Kervyn, Yzerbyt, & Judd, 2010; Kervyn
et al., 2009). One of these ambivalent quadrants of the SCM space includes groups that
are seen as warm but incompetent. In US data, these include handicapped groups, such
as the physically or mentally disabled, and the elderly. These groups are seen as harm-
less and trustworthy, but incapable of acting on their well-meaning intentions. The high
warmth/low competence associated with these groups elicits feelings of pity and sym-
pathy (Cuddy, Norton, & Fiske, 2005; Cuddy et al., 2007; Fiske et al., 2002). These
inherently ambivalent emotions communicate paternalistic positivity (“harmless”) but
subordinate status (“beneath me”) (Glick & Fiske, 2001).
Recent research suggests, however, that these groups are pitied only as long as they
follow the prescriptions laid out for them, adhering to their stereotypic roles as high-
warmth and low-competence group members. For example, people with physical or
mental disabilities are seen as deserving pity only if the fault for their disability does
not lie with them. If disabled people somehow caused the condition (e.g., recklessly
ignored warnings) or neglected treatment, then they quickly become ineligible for the
pity that is granted to members of this ambivalent group (Wu et al., in press). As for
the elderly, the “dear but doddering” stereotypes applied to this group hold only if they
cooperate with young people to reduce intergenerational tensions. This includes identity
boundaries (acting one’s age), the appropriate succession (moving out of the way to pass
along jobs and resources), and sharing consumption (not using too many resources shared with the younger generation, such as Social Security) (North & Fiske, 2012, 2013).
The second ambivalent group includes those who are seen as competent but cold
(untrustworthy). In the United States, groups such as the rich, female professionals,
Asian people, and Jewish people are evaluated as cold and unfriendly, but also high-
status and able (Cuddy et al., 2007; Fiske et al., 1999, 2002). These groups possess
resources and abilities that elicit feelings of resentment, such as envy and jealousy.
These feelings are inherently ambivalent because they suggest that the outgroup pos-
sesses something of value but that their intentions are dubious.
The evidence for the social perception of groups using the warmth × competence
space appears in representative samples both nationally and internationally. Worldwide,
these combinations of warmth-competence have been shown to fit in more than thirty
nations on five continents (Cuddy et al., 2009; Durante et al., 2013; Fiske & Cuddy,
2006). These four types of outgroups match ethnic stereotypes that have been stud-
ied since the 1930s (Bergsieker et al., 2012; Durante, Volpato, & Fiske, 2010). Peo-
ple also apply the SCM’s warmth-competence dimensions to many subgroups of larger
societal categories. For example, when African Americans are broken into subgroups identified by African Americans themselves, the resulting images spread across the quadrants of the SCM space (Fiske et al., 2009). The warmth-by-competence space describes other subtypes
of social groups, including subgroups of women and men, gay men, and the mentally ill
(respectively: Eckes, 2002; Clausell & Fiske, 2005; Fiske, 2011). Even animal species
and corporations are categorized according to the SCM’s warmth-competence dimen-
sions (Kervyn, Fiske, & Malone, 2012; Malone & Fiske, 2013; Sevillano & Fiske,
2016), simply because animals and brands can be perceived as having intent and agency.

Future Research

Warmth and competence dimensions are universal dimensions of social perception that
have endured across stimuli, time, and place. These dimensions predict distinct emo-
tional and behavioral reactions to distinct types of outgroup members. Recent research in
social cognitive neuroscience has begun to reveal neural reflections of the stereotypical
warmth and competence, giving insight into how – and even whether – we think about
the minds of outgroup members. When thinking of groups that elicit feelings of disgust
(e.g., scorned outgroups), people may even fail to recognize the other person’s mind
(i.e., dehumanized perception; Harris & Fiske, 2006). An area of the brain that reliably
activates when one thinks of other people’s thoughts and feelings (i.e., the medial pre-
frontal cortex, mPFC) does not come online when people view pictures of homeless
people and drug addicts, the most scorned of outgroups. The mPFC activates to some
extent when people consider groups that fall into all other SCM quadrants, suggest-
ing that the scorn toward low-warmth/low-competence group members hinders people’s
ability to connect with them on a human level, to read their social signals beyond con-
temptible group membership. However, merely asking participants to consider one of these allegedly disgusting outgroup members’ individual preferences (for example, “What vegetable would he eat?”) reactivates the mPFC.
Another recent line of research has revealed neural responses to another negatively-
regarded outgroup, those seen as high in competence but low in warmth. Having shown
that disgusting, scorned outgroups deactivate the mPFC, one might guess that respected
but envied outgroups elicit the opposite reaction. This is indeed what social neuroscien-
tists have found; however, this increased mPFC activation does not mean that these groups are seen as more human, or more mindful. More likely, envied groups prime their pos-
session of social rewards; while parts of the mPFC come online when thinking of other
minds, other parts also come online in the pursuit of social or human-caused reward
(Harris et al., 2007; Van den Bos et al., 2007).
The thought of envied outgroups enjoying their own rewards may contribute to acti-
vating the mPFC. However, envy can also cause people to react quite differently when they see enviable outgroups in vulnerable positions: Schadenfreude – malicious glee at outgroup members’ misfortune – arises when people witness envied groups experience misfortune. Physiological methods reveal that people reliably show hints of a smile
when seeing a rich person or investment banker splashed by a taxi or sitting on chewing gum; this response does not occur when the same events are experienced by members of other
SCM quadrants (Cikara & Fiske, 2011). On a neural level, social neuroscientists have
found signs of reward processing when envied groups are lowered to a position of scorn,
even if only momentarily (Cikara, Botvinick, & Fiske, 2011; Cikara & Fiske, 2012).
These distinct neural activations differentiate disgust, envy, and even Schadenfreude
when encountering outgroups that fall into distinct quadrants of the warmth-and-
competence space. These and other developing lines of research provide a foundation,
bridging psychological science and neuroscience. However, ongoing research is still examining the neural and physiological responses to ingroups and to the pitied groups who are relegated to the role of harmless subordinates.
Social cognitive researchers have spent over a decade uncovering the universality of
warmth and competence in person perception. This and other social cognitive research
can greatly inform the field of social signal processing. As social signal processing
works to bridge the gap between human and computer, conceptual frameworks explain-
ing the way humans perceive and react to others can inform those machine models. The
study of social signals enhances human–computer interactions and computer-mediated
interaction between humans, benefiting a wide variety of domains (for review, see Salah,
Pantic, & Vinciarelli, 2011). However, attempts to improve the social intelligence of
machines should incorporate theory on how people spontaneously perceive other enti-
ties and how such perceptions influence emotions and behaviors. Conversely, research
on social signaling can inform psychologists, providing theoretical and methodological
insight to examine people’s behavior and mental states in an increasingly computer-
based world.

References

Abelson, R. P., Kinder, D. R., Peters, M. D., & Fiske, S. T. (1982). Affective and semantic com-
ponents in political person perception. Journal of Personality and Social Psychology, 42, 619–
630.
Alexander, M. G., Brewer, M. B., & Hermann, R. K. (1999). Images and affect: A functional
analysis of outgroup stereotypes. Journal of Personality and Social Psychology, 77(1), 78–93.
Alexander, M. G., Brewer, M. B., & Livingston, R. W. (2005). Putting stereotype content in con-
text: Image theory and interethnic stereotypes. Personality and Social Psychology Bulletin,
31(6), 781–794.
Allport, G. W. (1954). The Nature of Prejudice. Reading, MA: Addison-Wesley.
Andersen, K. & Clevenger, T., Jr. (1963). A summary of experimental research in ethos. Speech
Monographs, 30, 59–78.
Asch, S. E. (1946). Forming impressions of personality. Journal of Abnormal and Social Psychol-
ogy, 41, 258–290.
Bergsieker, H. B., Leslie, L. M., Constantine, V. S., & Fiske, S. T. (2012). Stereotyping by omis-
sion: Eliminate the negative, accentuate the positive. Journal of Personality and Social Psy-
chology, 102(6), 1214–1238.
Burgoon, J. K. & Hale, J. L. (1984). The fundamental topoi of relational communication. Com-
munication Monographs, 51, 193–214.
Burgoon, J. K. & Hale, J. L. (1987). Validation and measurement of the fundamental themes of
relational communication. Communication Monographs, 54, 19–41.
Burgoon, J. K. & LePoire, B. A. (1993). Effects of communication expectancies, actual commu-
nication, and expectancy disconfirmation on evaluations of communicators and their commu-
nication behavior. Human Communication Research, 20(1), 67–96.
Cacioppo, J. T., Gardner, W. L., & Berntson, G. G. (1997). Beyond bipolar conceptualizations
and measures: The case of attitudes and evaluative space. Personality and Social Psychology
Review, 1, 3–25.
Chemers, M. M. (1997). An Integrative Theory of Leadership. Mahwah, NJ: Lawrence Erlbaum.
Cikara, M., Botvinick, M. M., & Fiske, S. T. (2011). Us versus them: Social identity shapes neural
responses to intergroup competition and harm. Psychological Science, 22(3), 306–313.
Cikara, M. & Fiske, S. T. (2011). Bounded empathy: Neural responses to outgroup targets’
(mis)fortunes. Journal of Cognitive Neuroscience, 23(12), 3791–3803.
Cikara, M. & Fiske, S. T. (2012). Stereotypes and Schadenfreude: Affective and physiological
markers of pleasure at outgroup misfortunes. Social Psychological and Personality Science,
3(1), 63–71.
Clausell, E. & Fiske, S. T. (2005). When do subgroup parts add up to the stereotypic whole?
Mixed stereotype content for gay male subgroups explains overall ratings. Social Cognition,
23(2), 161–181.
Cuddy, A. J. C., Fiske, S. T., & Glick, P. (2007). The BIAS map: Behaviors from intergroup affect
and stereotypes. Journal of Personality and Social Psychology, 92, 631–648.
Cuddy, A. J. C., Fiske, S. T., Kwan, V. S. Y., et al. (2009). Stereotype content model across
cultures: Towards universal similarities and some differences. British Journal of Social Psy-
chology, 48(1), 1–33.
Cuddy, A. J. C., Norton, M. I., & Fiske, S. T. (2005). This old stereotype: The pervasiveness and persistence of the elderly stereotype. Journal of Social Issues, 61(2), 267–285.
Dillard, J. P. & Solomon, D. H. (2005). Measuring the relevance of relational frames: A relational
framing theory perspective. In V. Manusov (Ed.), The Sourcebook of Nonverbal Measures:
Going Beyond Words (pp. 325–334). Mahwah, NJ: Lawrence Erlbaum.
Durante, F., Fiske, S. T., Kervyn, N., et al. (2013). Nations’ income inequality predicts ambiva-
lence in stereotype content: How societies mind the gap. British Journal of Social Psychology,
52(4), 726–746.
Durante, F., Volpato, C., & Fiske, S. (2010). Using the stereotype content model to examine
group depictions in fascism: An archival approach. European Journal of Social Psychology,
40(3), 465–483.
Eckes, T. (2002). Paternalistic and envious gender stereotypes: Testing predictions from the
stereotype content model. Sex Roles, 47(3–4), 99–114.
Fiske, S. T. (1998). Stereotyping, prejudice, and discrimination. In D. T. Gilbert, S. T. Fiske, &
G. Lindzey (Eds), Handbook of Social Psychology (4th edn, vol. 2, pp. 357–411). New York:
McGraw-Hill.
Fiske, S. T. (2011). Envy Up, Scorn Down: How Status Divides Us. New York: Russell Sage
Foundation.
Fiske, S. T., Bergsieker, H. B., Russell, A. M., & Williams, L. (2009). Images of black Americans:
Then, “them” and now, “Obama!” DuBois Review: Social Science Research on Race, 6, 83–
101.
Fiske, S. T. & Cuddy, A. J. C. (2006). Stereotype content across cultures as a function of
group status. In S. Guimond (Ed.), Social Comparison and Social Psychology: Understanding
Cognition, Intergroup Relations, and Culture (pp. 249–263). New York: Cambridge University
Press.
Fiske, S. T., Cuddy, A. J. C., & Glick, P. (2007). Universal dimensions of social cognition: Warmth
and competence. Trends in Cognitive Sciences, 11, 77–83.
Fiske, S. T., Cuddy, A. J. C., Glick, P., & Xu, J. (2002). A model of (often mixed) stereotype
content: Competence and warmth respectively follow from perceived status and competition.
Journal of Personality and Social Psychology, 82(6), 878–902.
Fiske, S. T. & Lee, T. L. (2012). Xenophobia and how to fight it: Immigrants as the quintessential
“other”. In S. Wiley, G. Philogène, & T. A. Revenson (Eds), Social Categories in Everyday
Experience (pp. 151–163). Washington, DC: American Psychological Association.
Fiske, S. T. & Taylor, S. E. (2013). Social Cognition: From Brains to Culture (2nd edn). London:
SAGE.
Fiske, S. T., Xu, J., Cuddy, A. C., & Glick, P. (1999). (Dis)respecting versus (dis)liking: Status and
interdependence predict ambivalent stereotypes of competence and warmth. Journal of Social
Issues, 55(3), 473–489.
Glick, P. & Fiske, S. T. (2001). Ambivalent sexism. In M. P. Zanna (Ed.), Advances in Experi-
mental Social Psychology (vol. 33, pp. 115–188). Thousand Oaks, CA: Academic Press.
Haiman, F. (1948). An experimental study of the effects of ethos in public speaking. Unpublished
Doctoral Dissertation, Northwestern University.
Harris, L. T. & Fiske, S. T. (2006). Dehumanizing the lowest of the low: Neuroimaging responses
to extreme outgroups. Psychological Science, 17(10), 847–853.
Harris, L. T. & Fiske, S. T. (2007). Social groups that elicit disgust are differentially processed in
mPFC. Social Cognitive and Affective Neuroscience, 2, 45–51.
Harris, L. T., McClure, S. M., Van den Bos, W., Cohen, J. D., & Fiske, S. T. (2007). Regions of
the MPFC differentially tuned to social and nonsocial affective evaluation. Cognitive, Affective
& Behavioral Neuroscience, 7(4), 309–316.
Hawes, L. C. (1973). Elements of a model for communication processes. Quarterly Journal of
Speech, 59(1), 11–21.
Katz, D. & Braly, K. (1933). Racial stereotypes of one hundred college students. Journal of Abnor-
mal and Social Psychology, 28(3), 280–290.
Kervyn, N., Fiske, S. T., & Malone, C. (2012). Brands as intentional agents framework: How
perceived intentions and ability can map brand perception. Journal of Consumer Psychology,
22(2), 166–176.
Kervyn, N., Fiske, S., & Yzerbyt, V. (2015). Forecasting the primary dimension of social percep-
tion: Symbolic and realistic threats together predict warmth in the stereotype content model.
Social Psychology, 46(1), 36–45.
Kervyn, N., Yzerbyt, V., & Judd, C. M. (2010). Compensation between warmth and competence:
Antecedents and consequences of a negative relation between the two fundamental dimensions
of social perception. European Review of Social Psychology, 21(1), 155–187.
Kervyn, N., Yzerbyt, V. Y., Judd, C. M., & Nunes, A. (2009). A question of compensation: The
social life of the fundamental dimensions of social perception. Journal of Personality and
Social Psychology, 96(4), 828–842.
Kinder, D. R. & Sears, D. O. (1981). Prejudice and politics: Symbolic racism versus racial threats
to the good life. Journal of Personality and Social Psychology, 40, 414–431.
Leach, C., Ellemers, N., & Barreto, M. (2007). Group virtue: The importance of morality (vs.
competence and sociability) in the positive evaluation of in-groups. Journal of Personality and
Social Psychology, 93(2), 234–249.
Leary, T. (1957). Interpersonal Diagnosis of Personality: A Functional Theory and Methodology for Personality Evaluation. New York: Ronald Press.
Lee, T. L. & Fiske, S. T. (2006). Not an outgroup, not yet an ingroup: Immigrants in the Stereotype
Content Model. International Journal of Intercultural Relations, 30(6), 751–768.
Malone, C. & Fiske, S. T. (2013). The Human Brand: How We Relate to People, Products, and
Companies. San Francisco: Wiley/Jossey Bass.
McCroskey, J. C. & Teven, J. J. (1999). Goodwill: A reexamination of the construct and its mea-
surement. Communication Monographs, 66, 90–103.
McCroskey, J. C. & Young, T. J. (1981). Ethos and credibility: The construct and its measurement
after three decades. Central States Speech Journal, 32, 24–34.
North, M. S. & Fiske, S. T. (2012). An inconvenienced youth? Ageism and its potential intergen-
erational roots. Psychological Bulletin, 138(5), 982–997.
North, M. S. & Fiske, S. T. (2013). A prescriptive intergenerational-tension ageism scale:
Succession, identity, and consumption (SIC). Psychological Assessment, 25(3), 706–
713.
Peeters, G. (1983). Relational and informational pattern in social cognition. In W. Doise &
S. Moscovici (Eds), Current Issues in European Social Psychology (pp. 201–237). Cambridge:
Cambridge University Press.
Peeters, G. (2001). From good and bad to can and must: Subjective necessity of acts associated
with positively and negatively valued stimuli. European Journal of Social Psychology, 31, 125–
136.
Rosenberg, S., Nelson, C., & Vivekananthan, P. S. (1968). A multidimensional approach to the
structure of personality impressions. Journal of Personality and Social Psychology, 9, 283–294.
Salah, A. A., Pantic, M., & Vinciarelli, A. (2011). Recent developments in social signal pro-
cessing. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics,
380–385.
Sattler, W. M. (1947). Conceptions of ethos in ancient rhetoric. Communication Monographs, 14,
55–65.
Sevillano, V. & Fiske, S. T. (2016). Warmth and competence in animals. Journal of Applied Social
Psychology, 46(5), 276–293.
Van den Bos, W., McClure, S. M., Harris, L. T., Fiske, S. T., & Cohen, J. D. (2007). Dissociat-
ing affective evaluation and social cognitive processes in the ventral medial prefrontal cortex.
Cognitive, Affective & Behavioral Neuroscience, 7(4), 337–346.
Willis, J. & Todorov, A. (2006). First impressions: Making up your mind after a 100-ms exposure
to a face. Psychological Science, 17(7), 592–598.
Wojciszke, B. (1994). Multiple meanings of behavior: Construing actions in terms of competence
or morality. Journal of Personality and Social Psychology, 67, 222–232.
Wojciszke, B. (2005). Morality and competence in person- and self-perception. European Review
of Social Psychology, 16, 155–188.
Wojciszke, B. & Abele, A. E. (2008). The primacy of communion over agency and its reversals
in evaluations. European Journal of Social Psychology, 38(7), 1139–1147.
Wojciszke, B., Bazinska, R., & Jaworski, M. (1998). On the dominance of moral categories in
impression formation. Personality and Social Psychology Bulletin, 24, 1245–1257.
Wojciszke, B., Brycz, H., & Borkenau, P. (1993). Effects of information content and evaluative
extremity on positivity and negativity biases. Journal of Personality and Social Psychology, 64,
327–336.
Wojciszke, B. & Klusek, B. (1996). Moral and competence-related traits in political perception.
Polish Psychological Bulletin, 27, 319–324.
Wu, J., Ames, D. L., Swencionis, J. K., & Fiske, S. T. (in press). Blaming the victim: An fMRI
study on how perceptions of fault influence empathy for people with disabilities.
Yzerbyt, V., Provost, V., & Corneille, O. (2005). Not competent but warm . . . really? Compen-
satory stereotypes in the French-speaking world. Group Processes & Intergroup Relations, 8,
291–308.
4 The Vertical Dimension of Social
Signaling
Marianne Schmid Mast and Judith A. Hall

Interpersonal interactions and relationships can be described as unfolding along two perpendicular dimensions: verticality (power, dominance, control; Burgoon & Hoobler,
2002; Hall, Coats, & LeBeau, 2005) and horizontality (affiliativeness, warmth, friendli-
ness; Kiesler, 1983; Wiggins, 1979). The vertical dimension refers to how much control
or influence people can exert, or believe they can exert, over others, as well as the sta-
tus relations created by social class, celebrity, respect, or expertise. Numerous earlier
authors have discussed variations and differences within the verticality concept (e.g.,
Burgoon & Dunbar, 2006; Burgoon, Johnson, & Koch, 1998; Ellyson & Dovidio, 1985;
Keltner, Gruenfeld, & Anderson, 2003).
Social control aspects are prevalent in many social relationships and interactions, not
only in formal hierarchies such as in the military or in organizations; there is also a
difference in social control between parents and their children, and husbands and wives
can have different degrees of power in their relationships. Even within groups of friends
or peers, a hierarchy emerges regularly.
Verticality encompasses terms such as power, status, dominance, authority, or lead-
ership. Although different concepts connote different aspects of the vertical dimension,
their common denominator is that they are all indicative of the amount of social control
or influence and thus of the vertical dimension. Structural power or formal authority
describes the difference in social control or influence with respect to social or occupa-
tional functions or positions (Ellyson & Dovidio, 1985) (e.g., first officer). Status refers
to the standing on the verticality dimension stemming from being a member of a specific
social group (e.g., being a man versus a woman) (Pratto et al., 1994). Status also means
being awarded a high position on the verticality dimension by others (e.g., emergent
leader) (Berger, Conner, & Fisek, 1974). The term dominance (also authority) is used
to describe a personality trait of striving for or of having high social control (Ellyson
& Dovidio, 1985). Dominance is also used to denote behavior that is aimed at social
control (Schmid Mast, 2010). Leadership is the influence on group members to achieve
a common goal (Bass, 1960). In a given social situation, different verticality aspects can
either converge or diverge. A company leader has high structural power but his or her
interaction or leadership style can express more or less dominance.
Verticality is an interpersonal concept and as such it cannot exist for one person
alone without a reference to other people; a person’s level of social control and influ-
ence is always relative to another’s (or several others’). As an example, in an organiza-
tional hierarchy, a middle manager has more position power and status than the sales
representative working for him/her, but at the same time less power and status than the
CEO of the company. Moreover, a person can have high social control in one situation or
domain and low verticality in another. The CEO of the company might control everyone
at work but still take orders from her husband at home.
Verticality and its effects are important to study not only because the vertical dimension is omnipresent in different types of social interactions but also because a person’s position within the vertical dimension affects how social interactions unfold. As examples, think-
ing about having power leads to better performance in a self-presentation task because
power reduces feelings of stress and social signals of nervousness (Schmid & Schmid
Mast, 2013). Power can also lead to overconfidence in decision making (Fast et al.,
2012) and power without status can make a person choose more demeaning activities
for their partners to perform (Fast, Halevy, & Galinsky, 2012).
In many social encounters, the vertical position of the interaction partner is not known
but needs to be inferred, and even if the position is known, the way a person exerts
his or her power differs from one person to another. The behavioral manifestations
of verticality, that is, the social signals that are linked to verticality, therefore become
important information that guides our social interactions and their outcomes.

Social Signaling

Social signals are nonverbal behavior cues produced and conveyed during social interac-
tions via face, eyes, body, and voice (other than words). More specifically, they encom-
pass vocal nonverbal behavior such as speaking time, interruptions, speech fluency, and
the like and visual nonverbal behavior such as gazing, nodding, facial expressions, body
posture, gesturing, and interpersonal distance among others (Knapp, Hall, & Horgan,
2014). They are used, explicitly or implicitly, by a person to express his or her states and
traits and can be used by social interaction partners to read those states and traits. As an
example, an intern attending a team meeting of a work unit who is unfamiliar with the
team members will not have much difficulty identifying the team leader because high
status people behave differently than low status people in such gatherings. Most likely,
the team leader will be looked at by the others more often and others will seek his or
her confirmation for ideas presented. The team leader might also take up more room
by showing expansive gestures and talking more and in a louder voice than the rest of
the team members. Social signals therefore convey important information useful for the
smoothness of the unfolding social interactions and relationships; reading these social
signals correctly is an important skill.
Nonverbal behaviors per se are not unequivocal in meaning. We can, for instance,
smile at others because we like them, because we want to ingratiate, or because we
are happy and the other just happened to be nearby. Some nonverbal behaviors carry
meanings that are more specific to the vertical dimension than others. Also, whether
verbal or nonverbal behavior matters more as a source of information depends on the
situation. People often turn to the nonverbal channel for information when the nonverbal
cues contradict the words being spoken or when people doubt the honesty of a verbal
communication. This is indeed a good strategy because lie detection seems to be more successful when people rely on nonverbal cues (and especially paralinguistic cues such as laughing or vocal pitch) as opposed to verbal cues (Anderson et al., 1999).
Given the omnipresence of the vertical dimension in our daily lives, uncovering the
social signals indicative of the vertical dimension becomes important. If we know the
signals high status or high power individuals emit, we can use this knowledge to infer
status or power in social interactions in which we do not know the vertical position
of the social interaction partners. Although research shows that there are indeed some
signals that are typically indicative of the vertical dimension (which we will review in
detail in the following section), the link between verticality and nonverbal cues is not
always clear cut. One reason for this is that high (and low) power individuals can be in
different motivational or emotional states that might be more important for determining
their interpersonal behavior than their power position per se. For example, how much
the high or low power person smiles may depend on who feels happier. Or, the person
with the louder voice could be the boss who commands or the subordinate who wants to
compete. If such differences were merely a matter of random variation, one might not
be so concerned about their influence when groups of people are compared. However,
it is possible for a given situation, whether in real life or in the research laboratory, to
systematically influence the motives or states of everyone in a given vertical position.
For example, the situation might be one in which all of the bosses are feeling happier
than the subordinates, or the reverse. Then separating the influence of vertical position
from these correlated states becomes a problem (Hall et al., 2005). Moreover, nonverbal
correlates that exist for one definition of power or one situation may not hold for another
(e.g., personality dominance versus rank in an organization). As one example, a preoc-
cupied boss might not feel much need to engage in eye contact with subordinates, while
an emergent leader in a group might engage in much eye contact because his leadership
status rests on group members’ conviction that he is interested in them and the group’s
goals.
Despite those challenges, there seems to be an array of nonverbal social signals that
show a rather consistent link with verticality in the sense that people high on verticality
tend to show these behaviors more frequently than people low on verticality. We will
review these social signals of verticality in the following section.

Signals of Verticality

People high on the vertical dimension possess a number of characteristics that differ-
entiate them from people low on the dimension. For instance, high power individuals
cognitively process information in a more abstract and global way (Smith & Trope,
2006) and experience less stress in stressful situations (Schmid & Schmid Mast, 2013).
Although nonverbal behavior depends greatly on motivational and emotional influences, as discussed above, individuals high on verticality also show relatively robust dif-
ferences in some of their nonverbal social signals compared to individuals low on
verticality.
The meta-analysis by Hall et al. (2005) investigated how different definitions of ver-
ticality (personality dominance, power roles or rank, as well as socioeconomic status),
either experimentally manipulated or pre-existing, were associated with different non-
verbal behavior. Results showed that people high in verticality used more open body
positions, had closer interpersonal distances to others, were more facially expressive,
spoke more loudly, engaged in more successful interruptions, and had less vocal vari-
ability compared to lower power people. For many other behaviors, there was no net
effect in one or the other direction; however, results showed pronounced heterogene-
ity, meaning that there was considerable variation in the effects found. For instance, for
smiling and gazing, some studies found individuals high in verticality to show more
smiling and gazing while other studies found individuals high in verticality to show less
smiling and gazing.
The amount of time a person speaks during a social interaction is also a valid cue to a
high position on the vertical dimension and is, indeed, a more consistent and strong cue
than most of the cues mentioned above. Meta-analytic evidence shows that superiors
talk more than their subordinates, people in high power roles talk more than people in
low power roles, and the more a person is dominant as a personality trait, the more he
or she talks during an interaction (Schmid Mast, 2002).
Despite gazing not being related overall to verticality, the gaze pattern called the
visual dominance ratio (VDR) has consistently been shown to be indicative of high vertical
positions (Dovidio et al., 1988). The VDR is defined as the percentage of gazing at an
interaction partner while speaking divided by the percentage of gazing while listening;
a high VDR gives the impression of less conversational attentiveness because one gazes
relatively less at the other person while that person is speaking compared to when one
has the floor oneself. Research has clearly demonstrated that being higher on the vertical
dimension is associated with a higher VDR for both men and women and for a variety
of definitions of power, such as personal expertise on a topic (Dovidio et al., 1988),
objectively measured rank (Exline, Ellyson, & Long, 1975), experimentally ascribed
status (Ellyson et al., 1980), and personality dominance (Ellyson et al., 1980).
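As a worked illustration of the definition above (a minimal sketch; the frame-level gaze and floor-holding annotations are hypothetical, not taken from the cited studies), the VDR can be computed as follows:

```python
def visual_dominance_ratio(gaze, speaking):
    """Compute the visual dominance ratio (VDR) from per-frame codes.

    gaze     -- list of booleans: is the person gazing at the partner in this frame?
    speaking -- list of booleans: does the person hold the floor in this frame?
    Listening frames are approximated here as the frames in which the person
    is not speaking. Returns (% gaze while speaking) / (% gaze while listening).
    """
    speak_frames = [g for g, s in zip(gaze, speaking) if s]
    listen_frames = [g for g, s in zip(gaze, speaking) if not s]
    if not speak_frames or not listen_frames:
        return float("nan")  # VDR is undefined without both kinds of frames
    look_while_speaking = sum(speak_frames) / len(speak_frames)
    look_while_listening = sum(listen_frames) / len(listen_frames)
    if look_while_listening == 0:
        return float("inf")
    return look_while_speaking / look_while_listening

# Hypothetical example: gazing in 6 of 10 speaking frames and 4 of 10 listening
# frames gives a VDR of 0.6 / 0.4 = 1.5, i.e., relatively high visual dominance.
```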
Also, the “prolonged gaze pattern” is a behavior used by both emergent and appointed
leaders in three-person groups to choose the next speaker by engaging in prolonged gaz-
ing at that person as the leader nears the moment of yielding the floor (Kalma, 1992).
As these examples show, relatively subtle cues and cue combinations (e.g., gazing com-
bined with speaking time) might be more informative of verticality than certain behav-
iors taken alone.
Many of these findings fit into the classification suggested by Burgoon and Dunbar
(2006) as indicative of dominance and power in both human and nonhuman species:
(1) physical potency, (2) resource control, and (3) interaction control. Physical potency
is evident by social signals expressing threat (e.g., staring, giving somebody the silent
treatment), indicators of size or strength (e.g., erect posture, mature faces), and expres-
sivity (e.g., animated face, loud voice). Resource control is evident in having com-
mand of the space (e.g., expansive and open body postures), displaying precedence,
which means “who gets to go first” (e.g., walking ahead, entering a space first), exer-
cising the prerogative to deviate from social norms and expectations (e.g., adopting
close interpersonal distance, leaving more crumbs when eating), and possessing valued
commodities, meaning possession of luxury goods and other status signals. Interaction
control affects the where, when, and how of the social interaction and is characterized
by behaviors indicative of centrality (e.g., being in the center of attention measured by
the visual dominance ratio or a central position in a group of people), of elevation (e.g.,
looking up to someone), of initiation (e.g., interruptions), and of nonreciprocation (e.g.,
resisting mimicking the social interaction behaviors of another).
If high levels of verticality are associated with certain social signals, expressing those
social signals might elevate a person’s felt level of verticality. In the next section, we
review how the embodiment of social signals indicative of high vertical positions can
make a person feel higher on verticality.

Inferring Verticality

When interaction partners or observers infer the vertical position of a person, which
social signals are used for those judgments? In research investigating this question, per-
ceivers typically rate the degree of power or status of a target person. Then, the nonver-
bal behaviors of the targets are assessed by neutral coders. Those coded behaviors are
then correlated with the perceivers’ judgments of power to reveal the cues that predict
their judgments of power. This means that the perceiver does not necessarily need to
be conscious of the cues he or she is using when inferring another person’s vertical
position. The meta-analysis by Hall et al. (2005) showed that many nonverbal behav-
iors were used by perceivers to infer the vertical position of a person. Perceivers rated
targets higher on verticality if they showed more gazing, lowered eyebrows, a more
expressive face, more nodding, less self-touch, more other-touch, more arm and hand
gestures, more bodily openness, more erect or tense posture, more body or leg shifts,
closer interpersonal distance, a more variable voice, a louder voice, more interruptions,
less pausing, a faster speech rate, a lower voice pitch, more vocal relaxation, shorter
time latencies before speaking, and more filled pauses (such as umm and ahh). Smiling
was also negatively related to power (with more smiling being associated with lower
ratings of power), but when the results for a large group of studies that all used the
same facial stimuli were combined into an average effect size, this result disappeared.
Moreover, there is a strong positive relation between speaking time and perceived high
verticality (Schmid Mast, 2002), and observers use the visual dominance ratio (defined
above) as an indicator of high vertical positions (Dovidio & Ellyson, 1982).
When people are asked what they explicitly expect in terms of social signals from
people high as opposed to low in social influence, the results largely converge with the ones we just reported. When participants are asked to report the behavior of people high or low in hierarchical rank in a work setting or high or low in personality dominance, it
becomes apparent that they have consistent beliefs with significant effects occurring for
thirty-five of seventy expressed nonverbal behaviors (Carney, Hall, & LeBeau, 2005).
Among other behaviors, individuals high on the vertical dimension are believed to hand-
shake more, stand closer, touch others more, have more expressive faces and overall
animation, gesture more, gaze more, show less self-touch, have a more erect posture,
lean forward more, and use more open body positions.
Not much research has investigated whether the social signals people use to infer
verticality are the same across different cultures. Although power relations are more
clearly displayed through nonverbal behavior in some countries (e.g., Germany) than
others (e.g., United States, United Arab Emirates), there is evidence of cultural univer-
sality in the processing of dominance cues (Bente et al., 2010).
There are clearly fewer social signals that are characteristic of people with an actual
high standing on the vertical dimension than there are nonverbal behaviors perceived
as indicators of high verticality. All signals indicative of actual vertical position are
also used by observers to assess verticality, and Hall et al. (2005) even found positive
correlations between the effect sizes of perceived and actual verticality cues. However,
the list of verticality indicators assumed by observers is much longer than the data can
support. Thus, perceivers seem to use social signals that are not necessarily diagnostic of
the verticality dimension. If this is the case, are people still accurate in judging another
person’s vertical position? For instance, although people believe that gazing is indicative of high verticality, this belief can only result in an accurate verticality assessment of the target if actual vertical position is conveyed by high levels of gazing (which it is not) (Hall et al., 2005).

Accurate Perception of Verticality

Accurate perception of another person’s standing in terms of verticality is an important skill. Knowing who the boss is makes it easier to communicate efficiently in order to achieve one’s goals (e.g., addressing those who have the resources and avoiding social gaffes). Such knowledge also helps maintain the existing social order.
Research shows that people’s vertical position can be assessed accurately at better
than chance level. For instance, judges were accurate at deciding which of two target
people in a photograph was the other’s boss (Barnes & Sternberg, 1989). People can
accurately assess the status of university employees based on photographs (Schmid Mast
& Hall, 2004). In another study, perceivers’ ratings of CEOs’ dominance based on their
photographs significantly predicted the CEOs’ company earnings (Rule & Ambady,
2008). This result may be an indirect indicator of accuracy in judging dominance if the
CEOs’ dominance was responsible for the performance of the company. The ability to
accurately assess the vertical position of a target seems to develop early in life. Children
who were asked to select a leader out of pairs of photographs depicting real politicians
reliably chose the politicians who actually won the election (Antonakis & Dalgas, 2009).
It is surprising that, although people seem to use a number of non-diagnostic cues
to infer verticality, they are still able to correctly infer the vertical position of a person.
Maybe the researchers have not measured the cues the observers actually use to infer
verticality. Although this certainly remains an option, we do not think that this is the case
given the long list of social signals that researchers have tested to date. More likely, the
perceiver might rely on a combination of specific social signals, such as the visual dom-
inance ratio mentioned before, to infer verticality. Judging the vertical position may be
more of a gestalt-like impression formation process. For example, a nonverbal behavior
pattern involving touching, pointing at the other, invading space, and standing over the
other has been related to perceived dominance (Henley & Harmon, 1985). Alternatively,
people might change their strategy when assessing a person’s verticality depending on
the nonverbal cues that seem most salient in a given situation. For example, in a work
setting, perceivers might rely more on how formally somebody is dressed to assess his
or her status, whereas in a peer group discussion, indicators such as speaking time or
loud voice might be used to find out who is the most influential person in the group.
There is clearly more research needed to understand how observers use social signals
to infer verticality of their social interaction partners correctly.

Verticality and Accurate Social Perception

Another question in the realm of interpersonal accuracy (defined as correctly assessing another’s state or trait) and verticality is whether high or low power people are more
accurate at person perception (in general and not necessarily with respect to detect-
ing interpersonal power). Both positions have been argued and have received empirical
support.
Powerless people are said to be more accurate than the powerful at inferring others’
states (Fiske & Dépret, 1996; Goodwin et al., 2000), primarily because it is likely to be
adaptive for them to be accurate. Subordinates may be motivated to learn their superiors’
intentions, moods, and desires so that subordinates can adjust their own behavior in
order to achieve their desired goals. If one assumes that people high in verticality do not depend on others and that they control the relevant resources, powerful people may not be motivated to know their subordinates’ feelings, thoughts, or expectations.
It is also possible that because of high cognitive demands that come with high power
positions, high power people may not have the cognitive capacity to attend to the feelings
and behaviors of others. This, too, would result in individuals high on the vertical dimension being less interpersonally accurate than individuals low on the vertical dimension.
The hypothesis that high levels of verticality result in less accuracy than low levels was
supported in some studies (e.g., Galinsky et al., 2006; Moeller, Lee, & Robinson, 2011).
However, the opposite hypothesis that high levels of verticality are correlated with
better interpersonal accuracy has also obtained empirical support (Rosenthal, 1979;
Schmid Mast, Jonas, & Hall, 2009). Powerful people may be motivated to know oth-
ers who depend on them to secure respect and support and thus maintain their power
position. Indeed, felt pride and felt respect partially mediated the effect of power on
interpersonal accuracy (Schmid Mast et al., 2009). Also, it is possible that people who
are particularly interpersonally sensitive are more likely to become leaders (Riggio,
2001). Alternatively, people high in verticality might be more interpersonally accurate
because they use a more global cognitive processing style (Smith & Trope, 2006) which
can favor interpersonal accuracy in certain circumstances (e.g., facial emotion recogni-
tion) (Bombari et al., 2013).
We conducted a meta-analysis on the question of how power relates to interpersonal
accuracy (Hall, Schmid Mast, & Latu, 2015). The meta-analysis consisted of 104 studies
encompassing two definitions of accuracy (accurate inference about others and accurate
recall of others’ behavior or attributes) and four kinds of verticality (pre-existing ver-
tical position, personality dominance, socioeconomic status [SES], and experimentally
manipulated vertical position). Most of the studies in the literature measure interper-
sonal accuracy by giving people a test of cue judgments that is then scored for accuracy.
For these studies, there was a significant but small and heterogeneous effect showing
that people higher on verticality were better at accurately assessing others than were
people low on verticality. Given the high heterogeneity of the results, we broke down
the analyses separately for the different definitions of accuracy and verticality. Results
showed that people higher in SES had better accuracy at inferring others’ states and
higher experimentally manipulated vertical position predicted higher accuracy defined
as recall of others’ words.
In a smaller number of studies, accuracy was measured based on people’s judgments
of another live person in a dyadic interaction. For studies of this type where verticality
was defined as experimentally assigned positions, there was evidence that the lower
vertical person was more accurate than the higher vertical person. However, one cannot
interpret this result with confidence because of the possibility that it is due to failures
of expressive clarity on the part of the lower vertical partners, and not on failures of
perceptivity on the part of the higher vertical perceivers (Hall et al., 2006; Snodgrass,
Hecht, & Ploutz-Snyder, 1998).
This meta-analysis confirmed that verticality per se might not be enough to explain interpersonal accuracy: as with verticality and social signaling, the outcomes are affected by the different definitions and operationalizations of power as well as by the different emotional and motivational states that high and low power people can be in.

Future Directions

One challenge for future research is to consider the different types (definitions and
operationalizations) of verticality as a moderator of the link between verticality and
social signals. For instance, low power individuals who strove for a high power position talked more in a social interaction than low power individuals who were content with their relatively powerless position (Schmid Mast & Hall, 2003). The study of the
interplay between different types of verticality (e.g., power and status) and its effect on
social signals and how social signals are interpreted in terms of, say, power or status is
only beginning to emerge (Dunbar & Burgoon, 2005).
Another avenue to pursue is the inclusion of the specific motivational or emotional
states the powerful or the powerless individual is in when investigating social signals of
verticality. These states can strongly influence the social signals emitted. As an example, powerful people tend to show more aggressive behavior when their ego is threatened than when it is not (Fast & Chen, 2009).
Research on social signals has typically looked at single social cues and how these
relate to verticality. We therefore know very little about how different combinations or
different timing of single cues indicate different levels of verticality and how they can
affect the perception of verticality. In order to be able to advance more in this direction,
the tedious coding of nonverbal cues needs to be facilitated and automated. This
becomes more and more possible when researchers from the field of nonverbal behavior
collaborate with computer scientists whose skills can help tackle questions of cue com-
bination and cue timing. As an example, nonverbal cues of dominance have successfully
been modeled by computer scientists (Jayagopi et al., 2009). Moreover, computer algo-
rithms were developed to identify the emergent leader in a group of people working
on a problem-solving task based on the group members’ nonverbal (vocal and visual)
signals (Sanchez-Cortes et al., 2010). Also, efficiency is gained even without the help
of computer algorithms if researchers use excerpts of behavior for coding instead of the
entirety of the behavioral episodes at hand. Research increasingly points to the validity
of this “thin slice of behavior” approach (Murphy, 2005; Murphy et al., 2013).
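To illustrate the kind of automated processing mentioned above (a minimal sketch under assumed inputs, not the actual models of Jayagopi et al., 2009, or Sanchez-Cortes et al., 2010), one of the most robust verticality cues reviewed in this chapter, speaking time, can be extracted from a hypothetical per-second speaker-activity annotation and used to rank group members:

```python
from collections import Counter

def speaking_time_shares(active_speaker_per_second):
    """active_speaker_per_second: list of speaker IDs (or None for silence),
    e.g. from a speaker-diarization step. Returns each member's share of talk time."""
    talk = Counter(s for s in active_speaker_per_second if s is not None)
    total = sum(talk.values())
    return {speaker: count / total for speaker, count in talk.items()}

def rank_by_speaking_time(active_speaker_per_second):
    """Rank members by speaking-time share; the top-ranked member is a simple
    (and by itself crude) candidate for the dominant or emergent leader."""
    shares = speaking_time_shares(active_speaker_per_second)
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical 10-second group excerpt with members "A", "B", "C":
timeline = ["A", "A", "B", "A", None, "C", "A", "B", "A", "A"]
print(rank_by_speaking_time(timeline))  # [('A', ~0.67), ('B', ~0.22), ('C', ~0.11)]
```

In practice, published systems combine several such vocal and visual cues rather than relying on speaking time alone, which is exactly the kind of cue-combination question raised above.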

Summary

The vertical dimension of social interactions is central to many domains of our lives, and
knowing which social signals indicate a high or a low standing on this dimension and
how people use social signals to infer others’ verticality is important for smooth social
interactions. People high on the vertical dimension express this mostly through their
voice and behavior that regulates interpersonal distance; they speak more, more loudly,
and with less vocal variability and interrupt their interaction partners more. Also, they
have more open body postures, approach their interaction partners more, and look at
them more while talking as compared to looking at them while listening (visual domi-
nance). Importantly, smiling and gazing do not show a clear link to the actual vertical
position of a person.
When inferring verticality by observing the nonverbal cues of people engaged in a
social interaction, observers use many more cues as an indicator of high verticality than
are indicative of the actual vertical position. People showing the following nonverbal
behaviors are perceived as higher on the verticality dimension: more gazing, lowered
eyebrows, a more expressive face, more nodding, less self-touch, more other-touch,
more arm and hand gestures, more bodily openness, more erect or tense posture, more
body or leg shifts, closer interpersonal distance, a more variable voice, a louder voice,
more interruptions, less pausing, a faster speech rate, a lower voice pitch, more vocal
relaxation, shorter time latencies before speaking, more filled pauses, less smiling, more
speaking time, and more visual dominance. Research shows that even though observers
use many cues that are not diagnostic of actual verticality, a person’s verticality can still
be inferred correctly.

References

Anderson, D. E., DePaulo, B., Ansfield, M., Tickle, J., & Green, E. (1999). Beliefs about cues to
deception: Mindless stereotypes or untapped wisdom? Journal of Nonverbal Behavior, 23(1),
67–89. doi: 10.1023/A:1021387326192.
Antonakis, J. & Dalgas, O. (2009). Predicting elections: Child’s play! Science, 323(5918), 1183.
doi: 10.1126/science.1167748.
Barnes, M. L. & Sternberg, R. J. (1989). Social intelligence and decoding of nonverbal cues.
Intelligence, 13(3), 263–287. doi: http://dx.doi.org/10.1016/0160-2896(89)90022-6.
Bass, B. M. (1960). Leadership, Psychology, and Organizational Behavior. Oxford: Harper.
Bente, G., Leuschner, H., Issa, A. A., & Blascovich, J. J. (2010). The others: Universals and
cultural specificities in the perception of status and dominance from nonverbal behavior.
Consciousness and Cognition, 19(3), 762–777. doi: http://dx.doi.org/10.1016/j.concog.2010
.06.006.
Berger, J., Conner, T. L., & Fisek, H. (1974). Expectation States Theory: A Theoretical Research
Program. Cambridge: Winthrop.
Bombari, D., Schmid, P. C., Schmid Mast, M., et al. (2013). Emotion recognition: The role of
featural and configural face information. The Quarterly Journal of Experimental Psychology,
1–17. doi: 10.1080/17470218.2013.789065.
Burgoon, J. K. & Dunbar, N. E. (2006). Nonverbal expressions of dominance and power in human
relationships. In V. P. Manusov & M. L. Patterson (Eds), The SAGE Handbook of Nonverbal
Communication (pp. 279–297). Thousand Oaks: SAGE.
Burgoon, J. K. & Hoobler, G. D. (2002). Nonverbal signals. In M. L. Knapp & J. A. Daly (Eds),
Handbook of Interpersonal Communication (pp. 240–299). Thousand Oaks, CA: SAGE.
Burgoon, J. K., Johnson, M. L., & Koch, P. T. (1998). The nature and measurement
of interpersonal dominance. Communication Monographs, 65, 308–335. doi: 10.1080/
03637759809376456.
Carney, D. R., Hall, J. A., & LeBeau, L. S. (2005). Beliefs about the nonverbal expression of social
power. Journal of Nonverbal Behavior, 29(2), 105–123. doi: 10.1007/s10919-005-2743-z.
Dovidio, J. F., Brown, C. E., Heltman, K., Ellyson, S. L., & Keating, C. F. (1988). Power displays
between women and men in discussions of gender-linked tasks: A multichannel study. Journal
of Personality and Social Psychology, 55(4), 580–587. doi: 10.1037/0022-3514.55.4.580.
Dovidio, J. F. & Ellyson, S. L. (1982). Decoding visual dominance behavior: Attributions of power
based on the relative percentages of looking while speaking and looking while listening. Social
Psychology Quarterly, 45(2), 106–113.
Dunbar, N. E. & Burgoon, J. K. (2005). Perceptions of power and interactional dominance in
interpersonal relationships. Journal of Social and Personal Relationships, 22(2), 207–233. doi:
10.1177/0265407505050944.
Ellyson, S. L. & Dovidio, J. F. (1985). Power, dominance, and nonverbal behavior: Basic concepts
and issues. In S. L. Ellyson & J. F. Dovidio (Eds), Power, Dominance, and Nonverbal Behavior
(pp. 1–27). New York: Springer.
Ellyson, S. L., Dovidio, J. F., Corson, R. L., & Vinicur, D. L. (1980). Visual dominance behavior
in female dyads: Situational and personality factors. Social Psychology Quarterly, 43(3), 328–
336.
Exline, R. V., Ellyson, S. L., & Long, B. D. (1975). Visual behavior as an aspect of power role rela-
tionships. In P. Pliner, L. Krames & T. Alloway (Eds), Advances in the Study of Communication
and Affect (vol. 2, pp. 21–52). New York: Plenum.
Fast, N. J. & Chen, S. (2009). When the boss feels inadequate: Power, incompetence, and aggres-
sion. Psychological Science, 20(11), 1406–1413. doi: 10.1111/j.1467-9280.2009.02452.x.
Fast, N. J., Halevy, N., & Galinsky, A. D. (2012). The destructive nature of power without status.
Journal of Experimental Social Psychology, 48(1), 391–394. doi: http://dx.doi.org/10.1016/j
.jesp.2011.07.013.
Fast, N. J., Sivanathan, N., Mayer, N. D., & Galinsky, A. D. (2012). Power and overconfident
decision-making. Organizational Behavior and Human Decision Processes, 117(2), 249–260.
doi: http://dx.doi.org/10.1016/j.obhdp.2011.11.009.
Fiske, S. T. & Dépret, E. (1996). Control, interdependence and power: Understanding social
cognition in its social context. European Review of Social Psychology, 7(1), 31–61. doi:
10.1080/14792779443000094.
Galinsky, A. D., Magee, J. C., Inesi, M. E., & Gruenfeld, D. H. (2006). Power and perspectives not
taken. Psychological Science, 17(12), 1068–1074. doi: 10.1111/j.1467-9280.2006.01824.x.
Goodwin, S. A., Gubin, A., Fiske, S. T., & Yzerbyt, V. Y. (2000). Power can bias impression
processes: Stereotyping subordinates by default and by design. Group Processes & Intergroup
Relations, 3(3), 227–256. doi: 10.1177/1368430200003003001.
Hall, J. A., Coats, E. J., & LeBeau, L. S. (2005). Nonverbal behavior and the vertical dimension of
social relations: A meta-analysis. Psychological Bulletin, 131(6), 898–924. doi: 10.1037/0033-
2909.131.6.898.
Hall, J. A., Rosip, J. C., LeBeau, L. S., Horgan, T. G., & Carter, J. D. (2006). Attributing the
sources of accuracy in unequal-power dyadic communication: Who is better and why? Journal
of Experimental Social Psychology, 42(1), 18–27. doi: http://dx.doi.org/10.1016/j.jesp.2005.01
.005.
Hall, J. A., Schmid Mast, M., & Latu, I. M. (2015). The vertical dimension of social relations
and accurate interpersonal perception: A meta-analysis. Journal of Nonverbal Behavior, 39(2),
131–163.
Henley, N. & Harmon, S. (1985). The nonverbal semantics of power and gender: A perceptual
study. In S. Ellyson & J. Dovidio (Eds), Power, Dominance, and Nonverbal Behavior (pp. 151–
164). New York: Springer.
Jayagopi, D. B., Hung, H., Chuohao, Y., & Gatica-Perez, D. (2009). Modeling dominance in
group conversations using nonverbal activity cues. IEEE Transactions on Audio, Speech, and
Language Processing, 17(3), 501–513. doi: 10.1109/TASL.2008.2008238.
Kalma, A. (1992). Gazing in triads: A powerful signal in floor apportionment. British Journal of
Social Psychology, 31(1), 21–39. doi: 10.1111/j.2044-8309.1992.tb00953.x.
Keltner, D., Gruenfeld, D. H., & Anderson, C. (2003). Power, approach, and inhibition. Psycho-
logical Review, 110, 265–284. doi: 10.1037/0033-295X.110.2.265.
Kiesler, D. J. (1983). The 1982 interpersonal circle: A taxonomy for complementarity in human
transactions. Psychological Review, 90(3), 185–214. doi: 10.1037/0033-295X.90.3.185.
Knapp, M. L., Hall, J. A., & Horgan, T. G. (2014). Nonverbal Communication in Human Interac-
tion (8th edn). Boston: Wadsworth.
Moeller, S. K., Lee, E. A. E., & Robinson, M. D. (2011). You never think about my feelings:
Interpersonal dominance as a predictor of emotion decoding accuracy. Emotion, 11(4), 816–
824. doi: 10.1037/a0022761.
Murphy, N. A. (2005). Using thin slices for behavioral coding. Journal of Nonverbal Behavior,
29(4), 235–246. doi: 10.1007/s10919-005-7722-x.
Murphy, N. A., Hall, J. A., Schmid Mast, M., et al. (2013). Reliability and validity of non-
verbal thin slices in social interactions. Personality and Social Psychology Bulletin, 41(2),
199–213.
Pratto, F., Sidanius, J., Stallworth, L. M., & Malle, B. F. (1994). Social dominance orientation: A
personality variable predicting social and political attitudes. Journal of Personality and Social
Psychology, 67(4), 741–763. doi: 10.1037/0022-3514.67.4.741.
Riggio, R. E. (2001). Interpersonal sensitivity research and organizational psychology: Theoretical and methodological applications. In J. A. Hall & F. J. Bernieri (Eds), Interpersonal Sensitivity: Theory and Measurement (pp. 305–317). Mahwah, NJ: Lawrence Erlbaum.
Rosenthal, R. (Ed.) (1979). Skill in Nonverbal Communication: Individual Differences. Cam-
bridge, MA: Oelgeschlager, Gunn, & Hain.
Rule, N. O. & Ambady, N. (2008). The face of success: Inferences from chief executive
officers’ appearance predict company profits. Psychological Science, 19(2), 109–111. doi:
10.1111/j.1467-9280.2008.02054.x.
Sanchez-Cortes, D., Aran, O., Schmid Mast, M., & Gatica-Perez, D. (2010). Identifying emer-
gent leadership in small groups using nonverbal communicative cues. Paper presented at the
International Conference on Multimodal Interfaces and the Workshop on Machine Learning
for Multimodal Interaction, Beijing, China.
Schmid, P. C. & Schmid Mast, M. (2013). Power increases performance in a social evaluation
situation as a result of decreased stress responses. European Journal of Social Psychology,
43(3), 201–211. doi: 10.1002/ejsp.1937.
Schmid Mast, M. (2002). Dominance as expressed and inferred through speaking time. Human
Communication Research, 28(3), 420–450. doi: 10.1111/j.1468-2958.2002.tb00814.x.
Schmid Mast, M. (2010). Interpersonal behaviour and social perception in a hierarchy: The inter-
personal power and behaviour model. European Review of Social Psychology, 21(1), 1–33. doi:
10.1080/10463283.2010.486942.
Schmid Mast, M. & Hall, J. A. (2003). Anybody can be a boss but only certain people make good
subordinates: Behavioral impacts of striving for dominance and dominance aversion. Journal
of Personality, 71(5), 871–892. doi: 10.1111/1467-6494.7105007.
Schmid Mast, M., & Hall, J. A. (2004). Who is the boss and who is not? Accuracy of judging
status. Journal of Nonverbal Behavior, 28(3), 145–165. doi: 10.1023/B:JONB.0000039647.
94190.21.
Schmid Mast, M., Jonas, K., & Hall, J. A. (2009). Give a person power and he or she will show
interpersonal sensitivity: The phenomenon and its why and when. Journal of Personality and
Social Psychology, 97(5), 835–850. doi: 10.1037/a0016234.
Smith, P. K. & Trope, Y. (2006). You focus on the forest when you’re in charge of the trees: Power
priming and abstract information processing. Journal of Personality and Social Psychology,
90(4), 578–596. doi: 10.1037/0022-3514.90.4.578.
Snodgrass, S. E., Hecht, M. A., & Ploutz-Snyder, R. (1998). Interpersonal sensitivity:
Expressivity or perceptivity? Journal of Personality and Social Psychology, 74, 238–249.
doi:10.1037/0022-3514.74.1.238.
Wiggins, J. S. (1979). A psychological taxonomy of trait-descriptive terms: The interpersonal
domain. Journal of Personality and Social Psychology, 37(3), 395–412. doi: 10.1037/0022-
3514.37.3.395.
5 Measuring Responses to Nonverbal Social Signals: Research on Affect Receiving Ability
Ross Buck, Mike Miller, and Stacie Renfro Powers

Facial and bodily expressions function as social signals: communicative displays of affect that regulate social interaction. It has long been recognized that the ability to read such signals accurately is a kind of social intelligence, distinct from the traditional IQ. An understanding of such abilities, together with valid and reliable measures for assessing them, would be very useful. In recent years a number of techniques have been developed for the automatic
analysis of the stream of affect display across time, including facial expressions, body
movements and postures, and vocalic analyses. Such techniques enable the efficient and
objective recording of the dynamic stream of display and are of immense value, permit-
ting the analysis of the detailed structure of nonverbal “body language” as never before.
Potential exists for applications that help to assess the detailed structure of nonverbal
receiving abilities: for example, the nature of specific cues that underlie accurate or
inaccurate judgment on the part of different receivers.
This chapter considers the conceptual foundations and assumptions underlying mea-
sures of social signal pickup and processing, and the current state of the art, including
specific measures that have been proposed. A major challenge is that current approaches
are almost exclusively based upon posed or enacted facial and bodily displays, many of
them static rather than dynamic. There is much evidence that static and/or posed dis-
plays differ from dynamic spontaneous displays involving the authentic experience of
emotion on the part of the sender. Evidence suggests that the processing of spontaneous
versus posed displays differs as well. A second concern of this chapter involves the con-
cept of emotion sonar: that in interactive situations the tone is set by the display behavior
of the sender more than the interpretive skills of the receiver. Given attention, displays
are “picked up” automatically, affording mutual contingent responsiveness and enabling
primary intersubjectivity vis-à-vis sender and receiver in which each is constantly
attuned to the subjective state displayed by the other. Finally, we will consider evidence
of the role of the neurohormone oxytocin (OT) in responsiveness to social signals.

Measuring Abilities to “Read” Social Signals

Person Perception Accuracy


Attempts to measure abilities at social recognition, also termed person perception accu-
racy, date from the 1920s. However, early attempts were frustrated by methodological
problems. One was the issue of assumed similarity: when judging others, people often
assume others are similar to the self, so that it has appeared, for example, that extraverts
are more accurate at judging the personalities of extraverts, simply because with little
or no evidence to the contrary, extraverts assume that others are extraverts. If the other
happens to be so, this appears to be an accurate judgment. Another problem with the
early approaches was the reliability of the criterion measure: error associated with any
unreliability in measuring the extraversion of the target person would be compounded
by error in the judgment process (Buck, 1984).
Many of the difficulties encountered in judging personality were finessed when inter-
est turned to the measurement of emotion receiving ability, or empathy, defined as the
ability to “read” emotion accurately from displays. Assumed similarity was no longer
relevant and the criterion problem was minimized because it is easier objectively to
measure a stimulus person’s emotional display than that person’s personality. Also,
the scales of judgment that could be employed were easier to understand and more
straightforward.

Measures of Receiving Ability


A number of instruments of emotion receiving ability have been developed since the
late 1960s. One of the first examples was the Brief Affect Recognition Test (BART),
which used as stimuli seventy photographs of posed facial expressions from the Pic-
tures of Facial Affect (Ekman & Friesen, 1975). Another early attempt was the Profile
of Nonverbal Sensitivity (PONS), developed by Rosenthal and colleagues (1979). A
single actor portrayed a series of twenty affective scripts rated for positivity and dom-
inance. The performance was videotaped by three cameras, focused on the face, the
body, or including both face and body. The vocal response was also included, with ver-
bal content disguised by removing the higher frequencies of the voice with a band-
pass filter (content filtering) or randomly splicing the audiotape (random spliced voice).
Items combined four video and three audio channels, including no video and no audio.
Respondents were asked a series of directed questions about the target segment ranging
from a direct assessment of the sender’s emotional state to identifying the social context
in which the sender was embedded.
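To make the content-filtering manipulation concrete, the sketch below removes the higher speech frequencies with a simple low-pass filter so that the words become unintelligible while gross prosodic cues (pitch contour, loudness, timing) survive. It is an illustrative approximation only; the cutoff frequency and file names are assumptions, not the actual PONS settings.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def content_filter(in_path, out_path, cutoff_hz=400.0):
    """Low-pass filter a mono WAV file so verbal content is masked
    while gross prosodic cues remain audible."""
    rate, audio = wavfile.read(in_path)
    audio = audio.astype(np.float64)
    # 4th-order Butterworth low-pass, applied forward and backward (zero phase)
    sos = butter(4, cutoff_hz, btype="low", fs=rate, output="sos")
    filtered = sosfiltfilt(sos, audio)
    filtered = np.clip(filtered, -32768, 32767).astype(np.int16)
    wavfile.write(out_path, rate, filtered)

# Illustrative call; "speech.wav" is a placeholder file name.
# content_filter("speech.wav", "speech_filtered.wav", cutoff_hz=400.0)
```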
The Interpersonal Perception Task (IPT), developed by Costanzo and Archer (1989),
moved away from a pure focus on emotion receiving ability. The IPT shows a series of
videotaped vignettes and asks the participant to make a series of social judgments, such
as to identify which of two individuals is lying, which of two individuals is the natural
parent of children shown in the vignette, which of two people being interviewed has just
won a tennis match, and so on (Archer, Costanzo, & Akert, 2001). Another measure
is the Diagnostic Analysis of Nonverbal Accuracy (DANVA), developed by Nowicki and Duke (1994), which consists of twenty-four photographs of posed facial expressions
and twenty-four vocal expressions (paralanguage). A large number of photographs were
taken of people who were asked to portray specific emotions; a small subset was selected
to generate a standard set of responses: the faces represent happiness, sadness, anger,
and fear.
Emotional Intelligence (EI)


Recent work on nonverbal receiving ability has been informed by the theoretical work
of Peter Salovey and Jack Mayer on the construct of Emotional Intelligence (EI:
Mayer et al., 2001, 2003). The Mayer, Salovey, and Caruso Emotional Intelligence Test
(MSCEIT) is the third generation of instruments developed to assess EI. The subsection
of the MSCEIT corresponding to emotion receiving ability – perceiving emotions –
comprises four faces and six pictures/abstract art representations for which the respon-
dent makes separate ratings of emotion: five per face or picture. The faces were gen-
erated much as the DANVA faces. Respondents rate the pictures along five emotion
dimensions. Two scoring options are available: one based upon general consensus and
one based upon the consensus of experts.
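As a rough illustration of consensus scoring (a sketch of the general idea only, not the proprietary MSCEIT scoring procedure), a respondent’s answer can be credited in proportion to how many members of a norm or expert sample gave the same answer:

```python
from collections import Counter

def consensus_scores(norm_sample_answers):
    """norm_sample_answers: answers given by the norm (or expert) sample for one
    item. Returns the proportion of the sample endorsing each answer."""
    counts = Counter(norm_sample_answers)
    total = len(norm_sample_answers)
    return {answer: count / total for answer, count in counts.items()}

def score_response(response, norm_sample_answers):
    """Credit the respondent with the proportion of the sample that agreed."""
    return consensus_scores(norm_sample_answers).get(response, 0.0)

# Hypothetical item: 70% of the norm sample rated the face as showing "happiness".
norm = ["happiness"] * 70 + ["surprise"] * 20 + ["fear"] * 10
print(score_response("happiness", norm))  # 0.7
```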

Spontaneous versus Posed Expression

All of the instruments designed to measure nonverbal receiving ability involve posed
expression or behaviors filmed with the knowledge of the subject, and most (all save the
PONS and SIT) employ static facial expressions. There is, however, much evidence that
posed expressions differ from spontaneous displays in significant respects. Ekman and
Friesen (1982) distinguished between spontaneous or “felt” smiles and “unfelt” smiles
on the basis of the involvement in the former of the orbicularis oculi causing “crow’s
feet” at the outer edges of the eyes, in addition to the zygomaticus major that pulls the
corners of the mouth up; and also by the timing of the onset, apex, and offset of the
smile. Cohn and Schmidt (2004) found posed smiles to have faster onsets and to be less
symmetric than spontaneous smiles. Schmidt et al. (2006) found greater onset and offset
speed, offset duration, and amplitude of movement in deliberate smiles, although they did not replicate the earlier finding of greater asymmetry. There is also evidence that
spontaneous and posed smiles are processed differently. Hess et al. (1989) filmed video
segments of smiles of persons posing or experiencing happiness and found that they
could be discriminated on the basis of the EMG responses of persons viewing them.
Given the differences in spontaneous versus posed smiling, it is concerning that
most research using automatic emotion detection software (AEDS) has employed posed
expression. In a recent survey of the innovative features of AEDS, Bettadapura (2012)
noted the need for a shift from the analysis of posed to spontaneous expression. He noted
in this regard the need for a standardized spontaneous expression database. He sug-
gested that such a database should (a) contain video sequences in which the participant is unaware of being filmed; (b) be recorded in conditions where spontaneous expressions are encouraged; (c) label sequences with information about the participant’s emotional response, from self-ratings, observer ratings, or both; and (d) include sequences that show a complete temporal pattern including the onset, apex, and offset of the
emotional response. All of these characteristics are found in video sequences taken in
the slide-viewing technique (SVT), which are used in the Communication of Affect
Receiving Ability Test (CARAT).
The Slide-Viewing Technique and Spontaneous Expression


We have used the slide-viewing technique (SVT) to study spontaneous emotional
expression, experience, and communication in a variety of samples including brain-
damaged persons and psychiatric groups (Buck, 1976, 2005; Buck & Powers, 2013). In
the SVT, a sender’s spontaneous facial/gestural expressions to emotionally-loaded pic-
tures are filmed by an unobtrusive camera as the sender sits alone in a room, watches a
series of emotionally-loaded slides, and describes and rates his/her emotional responses
to each. Senders are told that the study involves their emotional response to a series of
slides in the following categories: familiar people, unfamiliar people, sexual, scenic,
unpleasant, and unusual. Senders view the slide for ten seconds and on signal verbally
describe their emotional response to the slide. The slide is then removed, and senders
rate their emotional response (e.g., happy, sad, afraid, angry, surprised, disgusted, pleas-
ant, unpleasant, strong, weak). Viewing the sender on each sequence, receivers judge the
type of slide viewed and rate the sender’s emotional response. These are compared to
the actual slide viewed and the sender’s self-ratings to yield two measures of emotional
communication: the percent of slides correctly categorized (percent correct measure),
and the correlation between the sender’s rating and the receiver’s rating of each emotion
across the sequences (emotion correlation measure, separate for each emotion rated).
This procedure yields video clips of spontaneous expressions filling all of the desiderata
cited by Bettadapura (2012): the sender is alone and unaware of being filmed, encour-
aged to be expressive, responses are labeled both by self-ratings and observer ratings,
and a complete temporal pattern including onset, apex, and offset is presented.
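The two SVT communication measures can be computed directly from paired records of slide categories and emotion ratings. The sketch below is a hypothetical illustration of that scoring logic (the data and function names are illustrative assumptions, not the authors’ materials): the percent correct measure over the receiver’s slide-category guesses, and an emotion correlation measure as the Pearson correlation between sender and receiver ratings of a given emotion across sequences.

```python
from statistics import mean

def percent_correct(actual_categories, judged_categories):
    """Proportion of sequences for which the receiver named the correct slide category."""
    hits = sum(a == j for a, j in zip(actual_categories, judged_categories))
    return hits / len(actual_categories)

def emotion_correlation(sender_ratings, receiver_ratings):
    """Pearson correlation between sender and receiver ratings of one emotion
    (e.g., 'happy') across the slide sequences."""
    mx, my = mean(sender_ratings), mean(receiver_ratings)
    cov = sum((x - mx) * (y - my) for x, y in zip(sender_ratings, receiver_ratings))
    var_x = sum((x - mx) ** 2 for x in sender_ratings)
    var_y = sum((y - my) ** 2 for y in receiver_ratings)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x ** 0.5 * var_y ** 0.5)

# Hypothetical data for four sequences:
actual = ["scenic", "unpleasant", "sexual", "familiar"]
judged = ["scenic", "unusual", "sexual", "familiar"]
print(percent_correct(actual, judged))                  # 0.75
print(emotion_correlation([7, 2, 5, 8], [6, 3, 5, 7]))  # close to 1.0
```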

The Communication of Affect Receiving Ability Test (CARAT)


Brief (20 second) video clips of the facial-gestural expressions of senders to the slides
were collected in the Communication of Affect Receiving Ability Test (CARAT), which
is the only test of nonverbal receiving ability which uses dynamic, spontaneous, and
ecologically valid nonverbal expressions as stimuli (Boone & Buck, 2004). On CARAT,
receivers attempt to guess the kind of slide being viewed by the sender, and accuracy
is determined by comparing the judgment with the actual slide viewed (Buck, 1976).
Essentially, CARAT presents brief “thin slices” of spontaneous facial-gestural expres-
sion, which have been demonstrated to carry powerful, albeit often unconscious, non-
verbal messages (Ambady & Rosenthal, 1993). CARAT has been used in a variety of
studies involving the segmentation of spontaneous nonverbal expression (Buck et al.,
1980; Buck, Baron, & Barrette, 1982), the analysis of differences in emotional commu-
nication between strangers versus familiar partners (Buck & Lerman, 1979; Sabatelli,
Buck, & Kenny, 1986), and the study of clinical groups (Buck et al., 1998). A new ver-
sion termed CARAT-05 was created using a collection of high-quality stimuli filmed via
S-VHS video and converted to digital format (Buck, Renfro, & Sheehan, 2005). Pow-
ers and colleagues presented CARAT-05 to receivers in the fMRI with instructions to
guess the kind of slide presented, marking the first time that patterns of brain responses
were recorded to spontaneous facial displays (Powers et al., 2007). These ecologically
valid expressions activated more and different brain areas in comparison to static and
posed facial expressions (Powers, 2009). This is of potential importance in understand-
ing empathy and clinical phenomena involving the processing of social signals.
Powers and Buck also filmed a new collection of SVT sequences using digital record-
ing from the outset. For the CARAT-S, forty spontaneous sequences were chosen from
more than 1300 sequences to be clearly judged by any clinically normal individual
(90%+ accuracy: Buck, Powers, & Kapp, 2011). After the spontaneous sequences were
filmed, senders were informed about the camera and asked to pose a series of expres-
sions: showing “how their face would look” if a slide in a certain category were pre-
sented (familiar, unpleasant, unusual, or neutral). In fact, no slide was presented on
the posed trials. Stimuli in the posed category correspond to what Ekman and Friesen
(1975) termed simulation: displaying an emotion when none is felt. After this, senders
were asked to do the same thing but in the presence of an actual slide. In some cases they
were asked to pose a positive expression (e.g., response to a picture of a good friend) in
the presence of a negative slide, in other cases they posed a negative expression (e.g.,
a picture of a wounded child) in the presence of a positive slide. We term these reg-
ulated expressions, and they correspond to Ekman and Friesen’s masking: showing an
expression different from that felt. The resulting instrument is termed the CARAT-SPR
(spontaneous-posed-regulated).
The CARAT-S and CARAT-SPR differ from previous versions in that they were not
intended to measure only receiving ability. The sequences are shorter in time, lasting
twelve seconds or less. They have also been digitally edited to standardize the size of
the sender’s face on the screen: only the head and upper shoulders are visible, with a
uniform blue background. At the same time, the CARAT-SPR added posed and reg-
ulation sequences, which were designed to assess the response to spontaneous versus
posed versus regulated expressions, such as a sender reacting to a slide showing a familiar person. The CARAT-S and CARAT-SPR also differ from most previous measures of
nonverbal receiving ability in that each sender is presented only once in the test to avoid
familiarity effects. They also were developed with the explicit permission of partici-
pants that their videotaped images could be used for future research including studies
of the brain’s responses to these types of expressions.

Interactional Context: The Role of Sender Expressiveness

Interpersonal Synchrony: Mutual Contingent Responsiveness and Primary Intersubjectivity
All of the techniques used for measuring emotion receiving abilities, including CARAT,
present prerecorded expressions to the receiver; receivers have no opportunity to influ-
ence the expressions on which they base their judgments. However, in interpersonal
face-to-face interactions, the expressions of each partner influence those of the other.
This is termed mutual contingent responsiveness: both partners respond “on line” to
the flow of the communicative behavior of the other and the responsiveness of each
individual is, to an extent, influenced by, or contingent upon, the responsiveness of the
other. For example, Murray and Trevarthen (1986) demonstrated that when a mother and
infant communicate over a live video link, each responds in synchrony with the flow of
the display behavior of the other. Trevarthen (1979) suggested that this affords primary
intersubjectivity vis-à-vis infant and mother: that is, each is naturally, directly, and auto-
matically attuned to the subjective state displayed by the other – presumably mediated
by displays and preattunement systems involving mirror neurons.
The pattern of smooth communicative flow changed, however, when either mother
or infant unexpectedly viewed a playback of the other’s behavior. Although the behavior was physically identical to that displayed live at another time, synchrony with the partner, and hence mutual contingent responsiveness, was impossible, and communication was disrupted (Trevarthen & Aitken, 2001). Similarly, the still face phenomenon occurs in infants
happily interacting face-to-face with a responsive partner. When the partner suddenly
stops all facial expression and looks past the infant, the infant shows an immediate and
wrenching response (Tronick, 1978).

Emotion Sonar
This dynamic dyadic relationship of interaction partners is displayed by their dyad-level
nonverbal behaviors, including mirroring, imitation, equilibrium, reciprocity, and inter-
personal synchrony. This implies that an individual’s receiving ability is more than an
ability to read nonverbal emotion cues: in interactive situations it also involves the ten-
dency of a receiver to increase or decrease the expressiveness of the interaction partner.
Boone and Buck (2004) termed this emotion sonar by analogy with systems used to
locate submarines by emitting a loud ping and reading the reflection of the ping from the
hull of the submarine. Similarly, in interactive situations, an individual actively emits
displays to which the partner can respond or not; the richer the display, the more the
partner is encouraged to respond in kind.
In effect, everyone carries around a “bubble of expressiveness” by which they influ-
ence the expressiveness of others. More expressive persons carry a bubble of enriched
expression and communication; inexpressive persons carry a bubble of impoverished
expression and communication. In this way, expressive and inexpressive persons live in
emotionally enriched or impoverished environments, respectively.

Oxytocin and Interaction


There is evidence that the bubble of expressiveness can be manipulated by the neurohor-
mone oxytocin (OT). Notably, these effects can be assessed in double-blind studies in
humans by administering OT or a placebo in nasal spray. Effects of OT have been exam-
ined in interactional contexts in three prototypical human relationships: those between
parent and infant, friends, and sexual partners (Feldman, 2012). Feldman suggested that
these prototypes share common brain mechanisms underpinned by OT in the promo-
tion of temporal concordance of behavior or interpersonal synchrony. This was assessed
by the observation and micro-coding of interaction behaviors including touching, eye
contact, emotion display, and soft vocalization in parent–infant, friend, and sexual
dyads. Feldman (2012) reported a number of studies in which OT was associated with
positive communication sequences and interpersonal synchrony. In one study, for exam-
ple, fathers inhaling OT showed more engagement and more frequent touch with their
infant. Intriguingly, levels of OT in the infant were dramatically raised when the father
had inhaled OT, despite the fact that OT was not administered to the infant. Feldman
concluded, “OT administrations to a parent can lead to alterations in the physiology and
behavior of an infant in ways that induce greater readiness for social contact” (2012: 7).
On the other hand, evidence suggests that OT effects are not always positive. For
example, Rockliff et al. (2011) reported participants low in self-reassurance, social safe-
ness, and attachment security and high in self-criticism showed less positive responses
to OT. Bartz et al. (2011) found that, while OT in secure men produced recollections
that their mothers were more close and caring, OT in anxious attached men produced
recollections that their mothers were less close and caring. Also, OT increased reflexive
tendencies to attend to the eye region of others (Gamer, Zurowski, & Büchel, 2010)
and the ability to infer emotions expressed by the eyes (Domes et al., 2007). This is
significant because the eyes are more likely than the lower face to produce spontaneous
displays as opposed to intentional expressions (Buck, 1984). Moreover, in an economic
choice game OT increased envy when an opponent was relatively more successful and gloating when the opponent failed (Shamay-Tsoory et al., 2009), and also enhanced the categorization of others into in-groups and out-groups (De Dreu, 2012). The latter finding suggests that
OT may foster xenophobia: the rejection and ostracism of those deemed to be not within
the group.
These findings suggest that OT functions to increase accurate emotional communi-
cation and social engagement, whether “positive” or “negative.” This would increase
positive social behaviors among secure persons interacting with kin and comrade, and
at the same time increase negative social behaviors in insecure persons interacting with
potential adversaries. Such effects are consistent with a corollary to the emotion sonar
hypothesis suggested by Boone and Buck (2004): that such sonar can function in IFF
(Identification of Friend or Foe).

Conclusions

Nonverbal receiving ability, or the ability to respond accurately to nonverbal social sig-
nals, can be considered an individual-level “ability” that crosses situations and rela-
tionships. There are three aspects of emotion communication that are missing from
most current measures. One is that most measures employ posed or intentionally
enacted expressions; another is the neglect of the analysis of emotional expressive-
ness or sending accuracy as determining the “bubble of expressiveness” carried every-
where; and the third is the neglect of investigating these processes in interactional
contexts. In all of these cases, there is great potential for the machine analysis of
expressive behavior to improve the reliability of the measurement of expressive behav-
iors in both individual and interactive settings, provided that machine analysis systems are developed with spontaneous stimuli. In particular, machine analysis could be used to assess interpersonal synchrony, which is critical as a sign of mutual contingent responsiveness and primary intersubjectivity, whether between cherished kin, friends, and lovers, or between bitter foes.
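As one example of how machine analysis might quantify interpersonal synchrony (a minimal sketch under assumed inputs, not an established measure from this literature), a lagged cross-correlation between two partners’ automatically extracted expressiveness time series (e.g., per-frame smile intensity) indicates how strongly, and at what delay, one partner’s display tracks the other’s:

```python
import numpy as np

def lagged_synchrony(series_a, series_b, max_lag=25):
    """Cross-correlate two equal-length expressiveness time series (e.g., per-frame
    smile intensities for partners A and B). Returns (best_lag, correlation):
    a positive best_lag means B tends to follow A by that many frames."""
    a = np.asarray(series_a, dtype=float)
    b = np.asarray(series_b, dtype=float)
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    best_lag, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[: len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[: len(b) + lag]
        r = np.corrcoef(x, y)[0, 1]
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

# Hypothetical example: partner B mirrors partner A's smile with a 10-frame delay.
t = np.linspace(0, 10, 300)
a = np.sin(t)
b = np.roll(a, 10) + 0.1 * np.random.randn(300)
print(lagged_synchrony(a, b))  # best lag near +10, correlation near 1.0
```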

References

Ambady, N. & Rosenthal, R. (1993). Half a minute: Predicting teacher evaluations from
thin slices of nonverbal behavior and physical attractiveness. Journal of Personality and Social
Psychology, 64, 431–441.
Archer, D., Costanzo, M., & Akert, R. (2001). The Interpersonal Perception Task (IPT): Alterna-
tive approaches to problems of theory and design. In J. Hall and R. Bernieri (Eds), Interpersonal
Sensitivity (pp. 161–182). Mahwah, NJ: Lawrence Erlbaum.
Bartz, J. A., Zaki, J., Bolger, N., & Ochsner, K. N. (2011). Social effects of oxytocin in
humans: Context and person matter. Trends in Cognitive Sciences, 15(7), 301–309. doi:
10.1016/j.tics.2011.05.002.
Bettadapura, V. (2012). Face Expression Recognition and Analysis: The State of the Art. Tech
Report, arXiv:1203.6722, April.
Boone, R. T. & Buck, R. (2004). Emotion Receiving Ability: A new view of measuring individual
differences in the ability to accurately judge others’ emotions. In G. Geher (Ed.), Measuring
Emotional Intelligence: Common Ground and Controversy (pp. 73–89). Hauppauge, NY: Nova
Science.
Buck, R. (1976). A test of nonverbal receiving ability: Preliminary studies. Human Communica-
tion Research, 2, 162–171.
Buck, R. (1984). The Communication of Emotion. New York, NY: Guilford Press.
Buck, R. (2005). Measuring emotional experience, expression, and communication: The slide-
viewing technique. In V. Manusov (Ed.), Beyond Words: A Sourcebook of Methods for Measur-
ing Nonverbal Cues (pp. 457–470). Mahwah, NJ: Lawrence Erlbaum.
Buck, R., Baron, R., & Barrette, D. (1982). Temporal organization of spontaneous emo-
tional expression: A segmentation analysis. Journal of Personality and Social Psychology, 42,
506–517.
Buck, R., Baron, R., Goodman, N., & Shapiro, B. (1980). The unitization of spontaneous non-
verbal behavior in the study of emotion communication. Journal of Personality and Social
Psychology, 39, 522–529.
Buck, R., Goldman, C. K., Easton, C. J., & Norelli Smith, N. (1998). Social learning and emo-
tional education: Emotional expression and communication in behaviorally disordered children
and schizophrenic patients. In W. F. Flack & J. D. Laird (Eds.), Emotions in Psychopathology
(pp. 298–314). New York: Oxford University Press.
Buck, R. & Lerman, J. (1979). General vs. specific nonverbal sensitivity and clinical training.
Human Communication, Summer, 267–274.
Buck, R. & Powers, S. R. (2013). Encoding and display: A developmental-Interactionist model of
nonverbal sending accuracy. In J. Hall & M. Knapp (Eds.), Nonverbal Communication. (Vol. 2,
pp. 403–440). Berlin: Walter de Gruyter.
Buck, R., Powers, S. R., & Kapp, W. (2011). Developing the communication of affect receiving
ability test-spontaneous-posed-regulated. International Communication Association Conven-
tion, Boston, May 2011.
Buck, R., Renfro, S., & Sheehan, M. (2005). CARAT-05: A new version of the Communication
of Affect Receiving Ability Test. Unpublished paper, Department of Communication Sciences,
University of Connecticut.
Cohn, D. F. & Schmidt, K. L. (2004). The timing of facial motion in posed and spontaneous
smiles. International Journal of Wavelets, Multiresolution, and Information Processing, 2, 1–
12.
Costanzo, M. & Archer, D. (1989). Interpreting the expressive behavior of others: The interper-
sonal perception task. Journal of Nonverbal Behavior, 13, 225–245.
De Dreu, C. K. W. (2012). Oxytocin modulates cooperation within and competition between
groups: An integrative review and research agenda. Hormones and Behavior, 61(3), 419–428.
doi: 10.1016/j.yhbeh.2011.12.009.
Domes, G., Heinrichs, M., Michel, A., Berger, C., & Herpertz, S. C. (2007). Oxy-
tocin improves “mind-reading” in humans. Biological Psychiatry, 61(6), 731–733. doi:
10.1016/j.biopsych.2006.07.015.
Ekman, P. & Friesen, W. (1975). Pictures of Facial Affect. Palo Alto, CA: Consulting Psycholo-
gists Press.
Ekman, P. & Friesen, W. V. (1982). Felt, false, and miserable smiles. Journal of Nonverbal Behav-
ior, 6, 238–252.
Feldman, R. (2012). Oxytocin and social affiliation in humans. Hormones and Behavior, 61(3),
380–391. doi: 10.1016/j.yhbeh.20.
Gamer, M., Zurowski, B., & Büchel, C. (2010). Different amygdala subregions mediate valence
related and attentional effects of oxytocin in humans. Proceedings of the National Academy of
Sciences of the United States of America, 108, 9400–9405.
Hess, U., Kappas, A., McHugo, G., Kleck, R., & Lanzetta, J. T. (1989). An analysis of the encod-
ing and decoding of spontaneous and posed smiles: The use of facial electromyography. Journal
of Nonverbal Behavior, 13(2), 121–137.
Mayer, J. D., Salovey, P., Caruso, D., & Sitarenios, G. (2001). Emotional intelligence as a standard
intelligence. Emotion, 1(3), 232–242.
Mayer, J. D., Salovey, P., Caruso, D. R., & Sitarenios, G. (2003). Modeling and measuring emo-
tional intelligence with the MSCEIT V2.0. Emotion, 3, 97–105.
Murray, L. & Trevarthen, C. (1986). The infant’s role in mother–infant communications. Journal
of Child Language, 13, 15–29.
Nowicki, S., Jr. & Duke, M. P. (1994). Individual difference in nonverbal communication of affect:
The diagnostic analysis of nonverbal accuracy scale. Journal of Nonverbal Behavior, 18, 9–35.
Powers, S. R. (2009). Toward more ecologically valid emotion displays in brain research: A func-
tional neuroimaging study of the communication of affect receiving ability test. Unpublished
doctoral dissertation, University of Connecticut. Thesis C66 2009. Theses 16629.
Powers, S. R., Buck, R., Kiehl, K., & Schaich-Borg, J. (2007). An fMRI study of neural responses
to spontaneous emotional expressions: Evidence for a communicative theory of empathy.
Paper presented at the 93rd Annual Convention of the National Communication Association.
Chicago.
Rockliff, H., Karl, A., McEwan, K. et al. (2011). Effect of oxytocin on compassion-focused
imagery. Emotion, 11, 1388–1396.
Rosenthal, R., Hall, J., Archer, D., DiMatteo, M. R., & Rogers, P. L. (1979). The PONS
test: Measuring sensitivity to nonverbal cues. In S. Weitz (Ed.), Nonverbal Communication
(2nd edn, pp. 449–511). New York, NY: Oxford University Press.
Sabatelli, R. M., Buck, R., & Kenny, D. A. (1986). A social relations analysis of nonverbal com-
munication accuracy in married couples. Journal of Personality, 54(3), 513–527.
Schmidt, K., Ambadar, Z., Cohn, J., & Reed, L. I. (2006). Movement differences between delib-
erate and spontaneous facial expressions: Zygomaticus major action in smiling. Journal of
Nonverbal Behavior, 30, 37–52.
Shamay-Tsoory, S. G., Fischer, M., Dvash, J., et al. (2009). Intranasal administration of oxy-
tocin increases envy and Schadenfreude (gloating). Biological Psychiatry, 66(9), 864–870. doi:
10.1016/j.biopsych.2009.06.009.
Trevarthen, C. (1979). Communication and cooperation in early infancy: A description of primary
intersubjectivity. In M. Bullowa (Ed.), Before Speech: The Beginning of Human Communica-
tion (pp. 321–347). Cambridge: Cambridge University Press.
Trevarthen, C. & Aitken, K. J. (2001). Infant intersubjectivity: Research, theory, and clinical
applications. Journal of Child Psychology and Psychiatry, 42(1), 3–48. doi: 10.1111/1469-
7610.00701.
Tronick, E. (1978). The infant’s response to entrapment between contradictory messages in a
face-to-face interaction. Journal of the American Academy of Child Psychiatry, 17, 1–13.

Further Reading

Hall, J. (2001). The PONS test and the psychometric approach to measuring interpersonal sensi-
tivity. In J. Hall and R. Bernieri (Eds.), Interpersonal Sensitivity (pp. 143–160). Mahwah, NJ:
Lawrence Erlbaum.
6 Computational Analysis of Vocal
Expression of Affect: Trends and
Challenges
Klaus Scherer, Björn Schüller, and Aaron Elkins

In this chapter we first want to provide a short introduction to the “classic” audio
features used in this field and to the methods leading to the automatic recognition of
human emotion as reflected in the voice. From there, we focus on the main trends
leading up to the main challenges for future research. It has to be stated that the line is
difficult to draw here – what counts as a contemporary trend, and where does the “future” start?
Further, several of the named trends and challenges are not limited to the analysis of
speech, but hold for many, if not all, modalities. We focus on examples and references from
the speech analysis domain.

“Classic Features”: Perceptual and Acoustic Measures

Systematic treatises on the importance of emotional expression in speech communication and its powerful impact on the listener can be found throughout history. Early Greek
and Roman manuals on rhetoric (e.g., by Aristotle, Cicero, Quintilian) suggested con-
crete strategies for making speech emotionally expressive. Evolutionary theorists, such
as Spencer, Bell, and Darwin, highlighted the social functions of emotional expression
in speech and music. The empirical investigation of the effect of emotion on the voice
started with psychiatrists trying to diagnose emotional disturbances and early radio
researchers concerned with the communication of speaker attributes and states, using
the newly developed methods of electroacoustic analysis via vocal cues in speech. Sys-
tematic research programs started in the 1960s when psychiatrists renewed their inter-
est in diagnosing affective states, nonverbal communication researchers explored the
capacity of different bodily channels to carry signals of emotion, emotion psychologists
charted the expression of emotion in different modalities, linguists and particularly pho-
neticians discovered the importance of pragmatic information, all making use of ever
more sophisticated technology to study the effects of emotion on the voice (see Scherer,
2003, for further details).
While much of the relevant research has exclusively focused on the recognition of
vocally expressed emotions by naive listeners, research on the production of emo-
tional speech has used the extraction of acoustic parameters from the speech signal
as a method to understand the patterning of the vocal expression of different emotions.
The underlying theoretical assumption is that emotions differentially change autonomic
arousal and the tension of the striate musculature and thereby affect voice and speech
production on the phonatory and articulatory level and that these changes can be esti-
mated by different parameters of the acoustic waveform (Scherer, 1986), an assumption
that has been recently confirmed by an empirical demonstration of the measurement
of emotion-differentiating parameters related to subglottal pressure, transglottal airflow,
and vocal fold vibration (Sundberg et al., 2011).
Researchers have used a large number of acoustic parameters (see Juslin & Laukka, 2003; Patel & Scherer, 2013), the most commonly used being the following (a brief extraction sketch follows the list).
Time domain: Total duration of an utterance, of the voiced and unvoiced parts, and of the silent periods, as well as the speech rate (based on duration or number of syllables).
Frequency domain: Fundamental frequency (F0), either dynamically as the F0 contour and its derivatives (e.g., rising vs falling) or as distribution measures over an utterance (e.g., mean, standard deviation, percentiles, and range parameters).
Amplitude domain: Intensity or energy (generally in dB or equivalent continuous sound level [Leq]), either dynamically as the intensity contour and its derivatives (e.g., attack and decay) or as distribution measures over an utterance (e.g., mean, standard deviation, percentiles, and range parameters).
Spectral domain: Voice quality measures such as energy in different frequency bands (e.g., third octave), spectral balance (proportion of energy below and above a certain threshold such as 0.5 or 1 kHz), the Hammarberg index (difference between the energy maxima in the 0–2 kHz and 2–5 kHz range), spectral slope (slope of the regression line through the long-term average spectrum), spectral flatness (quotient of the geometric and arithmetic power spectrum means), spectral skewness (differences in spectral shape above and below the spectral mean), the harmonics-to-noise ratio (HNR, degree of acoustic periodicity expressed in dB), the autocorrelation of the signal, jitter (mean absolute difference between consecutive periods, divided by the mean period), and shimmer (mean absolute difference between the amplitudes of consecutive periods, divided by the mean amplitude).
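To make these categories concrete, here is a minimal sketch of extracting a few of the descriptors named above – an F0 contour with one distribution measure, frame-wise energy, spectral flatness, and rough voiced/unvoiced durations. The use of the librosa toolkit, the file name, and all parameter values are illustrative assumptions rather than choices prescribed by this chapter.

# Minimal extraction sketch (illustrative; librosa, file name, and parameters assumed).
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)              # waveform and sampling rate

# Frequency domain: F0 contour via the pYIN tracker, plus one distribution measure.
f0, voiced_flag, _ = librosa.pyin(y, fmin=60.0, fmax=500.0, sr=sr)
mean_f0 = np.nanmean(f0)                                    # mean F0 over voiced frames

# Amplitude domain: frame-wise RMS energy converted to dB.
energy_db = librosa.amplitude_to_db(librosa.feature.rms(y=y)[0], ref=np.max)

# Spectral domain: spectral flatness per frame.
flatness = librosa.feature.spectral_flatness(y=y)[0]

# Time domain: rough voiced/unvoiced durations from the pYIN voicing decision.
hop_seconds = 512 / sr                                      # pYIN's default hop of 512 samples
voiced_duration = np.sum(voiced_flag) * hop_seconds
unvoiced_duration = np.sum(~voiced_flag) * hop_seconds

print(f"mean F0 {mean_f0:.1f} Hz, voiced {voiced_duration:.2f} s, unvoiced {unvoiced_duration:.2f} s")
print(f"median energy {np.median(energy_db):.1f} dB, mean flatness {flatness.mean():.3f}")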

Main Trends of the Current State of the Art

In recent years, speech scientists and engineers, who had tended to disregard pragmatic
and paralinguistic aspects of speech, have started to pay more attention to speaker atti-
tudes and emotions in the interest of increasing the acceptability of speech technology
for human users.
Based on the features described above, systems can recognize human emotion by
training a suitable machine learning algorithm on labeled examples of emotional
speech. We will outline the current main trends observable in this context in the remain-
der of this chapter.

Understanding Features
After more than a decade of research on automatic emotion recognition from acoustic
speech features, there is still no agreement on what the optimal features are. In fact,
not even the temporal unit of analysis is “fixed” or “settled” – as opposed to automatic
speech recognition, where “frame-based” acoustic features are usually observed at a rate of
roughly 100 Hz, supra-segmental features such as those named above (e.g., mean, standard
deviation, percentiles, and range parameters) prevail when it comes to the analysis of
affective cues.
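As a concrete illustration of this supra-segmental view, the following sketch collapses a frame-level contour (here a synthetic F0 track at a roughly 100 Hz frame rate) into a single fixed-length vector of functionals per utterance; the choice of functionals and of NumPy is illustrative only.

# Supra-segmental functionals over a frame-level contour (illustrative sketch).
import numpy as np

def functionals(contour):
    """Mean, standard deviation, selected percentiles, and range of a contour."""
    contour = contour[~np.isnan(contour)]         # drop unvoiced/undefined frames
    return np.array([
        contour.mean(),
        contour.std(),
        np.percentile(contour, 25),
        np.percentile(contour, 50),
        np.percentile(contour, 75),
        contour.max() - contour.min(),
    ])

# Example: a synthetic one-second F0 contour at a 100 Hz frame rate.
f0_contour = 200.0 + 20.0 * np.sin(np.linspace(0, 2 * np.pi, 100))
utterance_vector = functionals(f0_contour)        # six values describing the whole utterance
print(utterance_vector)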

Alternative Modeling and Representation


Similarly, there is also no “universal” operationalization of affect (Gunes et al., 2011).
At first, discrete classes such as the “Ekman big six” emotions (anger, disgust, fear,
happiness, sadness, and surprise) prevailed. More recently, more diverse inventories have
gained popularity, often including a broader range of affective states and more subtle
“everyday” emotions (e.g., interest, pleasure, relief, or stress) or social emotions (e.g.,
embarrassment, guilt, pride, or shame), as well as continuous models (e.g., with dimensions
such as arousal, valence, dominance, novelty, or intensity) and other schemes (e.g., tagging
approaches that do not require a single label but allow several labels to be attached to a
speech sample) beyond “closed set” (single) discrete classes. Other promising options,
such as appraisal-based approaches, are also slowly finding their way into technical
application. Overall, this diversity in affect modeling and representation can also be
seen as one of the major challenges, as it makes “re-usage” and comparison across
speech resources difficult, even though ways of “translating” from one scheme to another par-
tially exist, such as from dimensional to categorical models and vice versa.
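As a toy illustration of such “translation”, the sketch below maps a dimensional (valence–arousal) description onto a coarse categorical quadrant label; it is a deliberate simplification and not the mapping used in any of the works cited above.

# Coarse, assumed mapping from (valence, arousal) in [-1, 1] to a quadrant label.
def quadrant_label(valence: float, arousal: float) -> str:
    if valence >= 0 and arousal >= 0:
        return "happy/excited"       # positive valence, high arousal
    if valence >= 0 and arousal < 0:
        return "content/relaxed"     # positive valence, low arousal
    if valence < 0 and arousal >= 0:
        return "angry/afraid"        # negative valence, high arousal
    return "sad/bored"               # negative valence, low arousal

print(quadrant_label(valence=-0.4, arousal=0.7))   # -> "angry/afraid"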

In the Wild
After more than a decade of research on machine analysis of human emotion as reflected
in vocal parameters, the focus has shifted increasingly toward “more realism” in a num-
ber of senses. The emotions considered are becoming less prototypical, that is, non-
basic emotional states are also taken into account, and intensity of emotion is often more
fine-grained, including at the subtle low-intensity end of the scale. At the same time,
challenging acoustic environments and everyday situations, as well as genuine media material
without “cherry-picking”, are taken for analysis (e.g., Dhall et al., 2013; Wöllmer et al.,
2013). In particular, this also includes the further ambition to cope with the presence of noise
(e.g., Tawari & Trivedi, 2010b) or with reverberation.

Semi-Autonomous Learning
A recent trend targets reducing the effort of human labeling in particular, which usually
is quite labor intensive. Crowd-sourcing has facilitated this to a certain extent, as large
numbers of labelers can easily be accessed. Examples of emotional speech are plentiful
on the Internet, in other archives, or via broadcast; however, we currently have very sparse
labels for these examples. Four types of approach prevail for reducing the human effort in
labeling. These variants are discussed below in more detail.
Active learning still requires a human to label, yet, fewer labeled instances are usually
needed by this method to reach equivalent performance (Wu & Parsons, 2011). The
principal idea is to have the machine “preselect” the speech data that appears to be most
informative from a large pool of unlabeled speech data. An example of how to identify
such data points of apparent interest is by “sparse instance tracking” (Zhang & Schuller,
2012) – in speech recordings, usually only a comparatively low percentage of the speech
is actually “emotional.” The machine can be trained to identify such non-neutral speech
samples and have the human annotate only these samples. As a positive side effect, this
method typically leads to relatively balanced training material across emotion classes.
Another variant is to have the machine look for instances with a medium confidence level,
that is, to identify the ambiguous instances, in the expectation that it can learn the most
from knowing how a human would label them. In fact, predicting the uncertainty of
labels, that is, how much humans would be likely to (dis)agree on the emotion label of a
specific speech instance, is a related, highly efficient approach (Zhang et al., 2013a).
Such methods have also been created for dimensional, that is, non-categorical emotion
models (Han et al., 2013).
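The following sketch shows generic uncertainty sampling in this spirit – asking a human to label only the unlabeled utterances on which a classifier is least confident. The classifier, feature dimensionality, and random stand-in data are illustrative assumptions and do not reproduce the cited methods.

# Generic uncertainty-sampling sketch (illustrative; not the cited algorithms).
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_for_labeling(X_labeled, y_labeled, X_unlabeled, budget=10):
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_unlabeled)          # class posteriors per instance
    confidence = proba.max(axis=1)                  # confidence = top posterior
    return np.argsort(confidence)[:budget]          # least confident instances first

# Usage with random stand-in features (real features would be acoustic functionals).
rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(40, 6)), rng.integers(0, 2, size=40)
X_unlab = rng.normal(size=(500, 6))
to_annotate = select_for_labeling(X_lab, y_lab, X_unlab)
print(to_annotate)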
Semi-supervised learning requires some initial human labeled instances, but the
machine then “takes it from there” by labeling further speech data by itself if suffi-
ciently confident it knows “what’s inside.” This means that again confidence values play
a decisive role. Semi-supervised learning has been shown to be effective for emotion in text
first (Davidov, Tsur, & Rappoport, 2010), but also when based upon acoustic feature
observation (Mahdhaoui & Chetouani, 2009; Zhang et al., 2011). In fact, using different
“views”, such as acoustic and linguistic features separately, or different acoustic feature
groups, such as prosodic and spectral, can further improve effectiveness through so-called
co-training (Zhang, Deng, & Schuller, 2013b).
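A bare-bones self-training loop in this spirit might look as follows; the confidence threshold, the classifier, and the use of scikit-learn are assumptions for illustration rather than the setups of the cited studies.

# Self-training sketch: the classifier labels further data itself, keeping only
# sufficiently confident guesses (illustrative assumptions throughout).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.9, rounds=3):
    X_train, y_train = X_labeled.copy(), y_labeled.copy()
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for _ in range(rounds):
        if len(X_unlabeled) == 0:
            break
        proba = clf.predict_proba(X_unlabeled)
        confident = proba.max(axis=1) >= threshold            # "knows what's inside"
        if not confident.any():
            break
        X_train = np.vstack([X_train, X_unlabeled[confident]])
        y_train = np.concatenate([y_train, clf.classes_[proba[confident].argmax(axis=1)]])
        X_unlabeled = X_unlabeled[~confident]                 # adopted instances are removed
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf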
In fact, active and semi-supervised learning can also be combined efficiently by dis-
carding instances believed to be “non-interesting” in case of low confidence, keeping
those with high confidence, and asking for human help in case they appear to be of
interest when the confidence is not sufficiently high.
Unsupervised learning differs from semi-supervised learning (Wöllmer et al., 2008)
as it does not require an initial basis of human labeled instances. Rather, it “blindly”
clusters the instances, for example, based on acoustic features such as those described
at the beginning of this chapter. However, this comes at the risk that the initial clustering
by the machine is not purely owing to differences in emotion, but may also be influenced by
other speaker characteristics or even by the spoken words.
Finally, transfer learning is based on the idea that machines learn from a similar
problem or domain and “transfer” this knowledge to the target domain. Dissimilar-
ities between source and target domain may include the emotion representation, the
features used, or the kind of data. The methods are highly different – one method
shown to be successful in this field is the learning of a compact feature representa-
tion of the target problem followed by transferring the features of the further “source”
data to the target domain. Roughly speaking, this can be achieved by “auto-encoder”
neural networks characterized by a feature output dimensionality equal to the feature
input dimensionality, yet an intentionally reduced dimensionality of the hidden layer
in between. If all instances available from the target domain are run through this auto-
encoder network during its training phase, one accordingly learns a compact target-
domain feature representation. Then, the instances of the additional “source” data are
run through this trained network to transfer them to the target-domain characteristics.
For improved effectiveness, additional sparsity constraints can be imposed. Success is
reported in this field, for example, to transfer “source” adult emotional speech to the
“target-domain” child emotional speech by Deng et al. (2013, 2014) or when transfer-
ring from affect in music to affect in speech as shown by Coutinho, Deng, and Schuller
(2014). Interestingly, even without additional transfer efforts, a learning and classifica-
tion across different audio types (speech, music, and general “sound”) is reported to
provide results significantly above chance levels by Weninger et al. (2013).
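The following compact sketch illustrates the auto-encoder idea just described: equal input and output dimensionality, a narrower hidden layer, training on target-domain feature vectors, and a final pass of source-domain vectors through the trained network. PyTorch, the dimensionalities, and the random stand-in data are assumptions for illustration; the optional sparsity constraints mentioned above are omitted.

# Bottleneck auto-encoder for feature transfer (illustrative sketch).
import torch
from torch import nn

feat_dim, hidden_dim = 88, 16                      # e.g., 88 functionals -> 16-d code

autoencoder = nn.Sequential(
    nn.Linear(feat_dim, hidden_dim), nn.Tanh(),    # encoder (narrow hidden layer)
    nn.Linear(hidden_dim, feat_dim),               # decoder back to the input size
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

target_feats = torch.randn(1000, feat_dim)         # stand-in for target-domain data
for epoch in range(50):                            # reconstruction training on the target domain
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(target_feats), target_feats)
    loss.backward()
    optimizer.step()

source_feats = torch.randn(5000, feat_dim)         # stand-in for source-domain data
with torch.no_grad():
    transferred = autoencoder(source_feats)        # source data mapped through the learned
                                                   # target-domain representation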

Confidence Measures
In addition to having a machine trained to decide between certain affective categories
or continuous values, it can be very useful to have it provide additional information on
its confidence or “certainty” in its decision. Such “confidence measures” can often be
taken directly from a learning algorithm’s output, such as in k-nearest neighbor decisions,
where one sees how many instances k_i belonging to class i lie among the k instances
closest to the speech sample under investigation in the feature space, thus providing an
immediate confidence estimate (namely, k_i/k). However, an informative confidence
measure should be independent of the actual decision process, ideally being based on
other types of information, in order to provide alternative views on the problem.
Deng, Han, and Schuller (2012) introduced an approach where other emotional
speech databases are used to train several additional classifiers that have as a learn-
ing target whether the actual emotion recognizer is likely to be right or wrong. Then,
semi-supervised adaptation takes place during the system’s life cycle. Alternatively, the
approach presented by Deng and Schuller (2012) trains an additional classifier to predict
human labeler agreement in addition to the one trained to recognize the emotion. This is
possible, as naturalistic emotional speech data is usually labeled by several labelers and
the number of those agreeing is known for the learning examples. In fact, it could be
shown that a surprisingly accurate prediction of the number of labelers agreeing can be
reached. Overall, both confidence measurement approaches were observed to provide
useful estimates, but more research in this direction is yet to be seen.
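For the k-nearest-neighbor case mentioned above, such a confidence value (k_i/k) can be read directly off a standard implementation, as in the following sketch with stand-in data and an arbitrary choice of k.

# k-NN confidence as the fraction of neighbors per class (illustrative data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 6))                 # stand-in acoustic feature vectors
y_train = rng.integers(0, 3, size=200)              # three emotion classes

knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
x_new = rng.normal(size=(1, 6))
proba = knn.predict_proba(x_new)[0]                 # proba[i] equals k_i / k for class i
decision = knn.classes_[proba.argmax()]
print(f"decision: class {decision}, confidence {proba.max():.2f}")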

Distribution
For the analysis of emotional speech “on the go,” that is, in mobile settings, it can
be beneficial to distribute the process in a client–server manner. As an advantage, this
allows for centralized updates of the models on the server’s end, benefiting from “big data”
coming in from thousands of users rather than relying on local calculation exclusively on
a user’s own device. Distribution can take place in different ways, such as immediately
transmitting the original audio to a server. However, this would come at a high band-
width requirement and would not preserve the users’ privacy, as the full original voice
sample would be sent. More efficiently, one can thus decide on feature extraction on the
end-user’s device and transmit only these features. Even more efficiently, vector quan-
tization of the feature vector by a code-book “look-up” of the nearest reference vector
in feature space can be used, so that only the index of the reference feature vector
closest to the current observation is transmitted, reducing bandwidth demands to a very
low level. If this is not sufficient in performance, one can partition the feature vector
into several subvectors and quantize these individually, for example, by feature groups
such as prosodic or spectral. In Han et al. (2012) and Schuller et al. (2013a), compara-
tively high bandwidth reduction rates could be reached at comparatively low loss in emotion
recognition accuracy.
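A minimal sketch of the code-book idea – the client transmits only the index of the nearest reference vector – could look as follows; the codebook size, the feature dimensionality, and the use of k-means clustering are illustrative assumptions.

# Vector-quantization codebook sketch (illustrative assumptions throughout).
import numpy as np
from sklearn.cluster import KMeans

feat_dim, codebook_size = 88, 256                    # 256 entries -> one byte per feature vector

training_feats = np.random.default_rng(2).normal(size=(10000, feat_dim))
codebook = KMeans(n_clusters=codebook_size, n_init=10).fit(training_feats)

# Client side: quantize the current observation to a single index.
observation = np.random.default_rng(3).normal(size=(1, feat_dim))
index = int(codebook.predict(observation)[0])        # this index is all that is transmitted

# Server side: look up the reference vector and feed it to the recognizer.
reconstructed = codebook.cluster_centers_[index]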

Standardization
For prosody in general, some standards have been introduced in the past, for exam-
ple, by Silverman et al. (1992). However, these are comparatively sparse, in particular
for affective speech analysis, and the diversity of approaches is high. For example,
there is no standard measure of reliability available (Hayes & Krippendorff, 2007),
let alone standards for many of the above-named issues, such as modeling, confidence
measures, or distribution. However, there are initial standards on how to encode emo-
tion, for example, as given by the W3C Emotion Markup Language (EmotionML)
(Schröder et al., 2007) or even a feature set for the analysis of vocal expression of affect
(Schuller & Batliner, 2013). Further, there is currently an increased effort to standardize
data and evaluation methods made by competitive research challenges, such as a series
held annually at Interspeech since 2009 (see Schuller et al., 2013b), the Audio/Visual
Emotion Challenge (see Schuller, 2012; Schuller et al., 2013b), and the recent Emotion
in the Wild Challenge by Dhall et al. (2013). Notably, these challenges partially also
provide standard feature sets in addition to standardized test-beds.

Main Challenges for Future Research

Let us now move into “white spot”, that is, sparsely researched, areas which are crucial
to master before the recognition of vocal expression of affect by computers finds its way
into our everyday lives.

Multicultural and Multilingual Aspects


There is a great debate on the extent to which psychological universals such as emotion
exist across cultures. While some studies have touched upon the influence of cultural
diversity, there is a need for increased research to prepare real-life emotion recognition
systems for a potentially broad user range. In particular, such systems should “know”
about different cultural manifestations of emotion, but must also be able to identify
culture-specific affective behavior in order to aid the actual emotion recognition process.
The existing literature in this area has mainly investigated the human recognition
and expression of emotional voices across cultures (e.g., Riviello et al., 2010; Kövec-
ses, 2000) or acoustic factors (Sauter, 2006; Sauter et al., 2010) demonstrating exist-
ing differences, owing, for example, to culture-specific appraisal biases (Scherer &
Brosch, 2009). However, systems taking these differences explicitly into account are yet to
come.
While the existing research on the expression of vocal affect across cultures is min-
imal, there is mounting evidence of overlap in how vocalizations are recognized
between maximally different cultures. For example, when Westerners and culturally
isolated Namibians listened to the same emotional vocalizations of basic emotions,
they recognized them equally well (Sauter et al., 2010). Recognition across cultures
was best for negatively valenced (e.g., anger, disgust) vocalizations, while positive
emotions contained more culture-specific vocal signals. It is speculated that positive
emotions that facilitate social cohesion might be encoded idiosyncratically for
in-group members and may require the most cultural adaptation for recognition by a
computer.
The universality of emotion recognition is still under question. While the results to
date have been encouraging, cultural psychologists argue that most of the cross-cultural
research examining the recognition of emotions between cultures has employed a flawed
methodology that encourages participants to identify behaviors that were primed by the
instructions, narrative, or words preceding the recognition task. When emotion recogni-
tion tasks were replicated without priming or making specific emotions accessible, par-
ticipants were much less able to identify the emotions underlying the observed behaviors
(Lindquist et al., 2006). The implications are that not only are emotions not recognized
universally between cultures, but that there may be entirely different emotions infused
with relative cultural experience and context. This means that some emotions detectable
in one culture may not exist in another, or their vocal profiles may mean something entirely
different (e.g., positive instead of negative valence).
Research on the recognition of emotion is important for determining the universality
of emotions, but the largest gap in current research is on the actual behaviors exhib-
ited during emotional experiences. It is entirely possible that context-free recognition
or perception of emotions is not an accurate reflection of the emotional voice. The
robustness of future emotion recognition systems will rely heavily on greater diversity in
vocal examples of emotion, particularly from non-Western speakers.
Similarly, and characteristically for the vocal analysis of human affect, multilinguality
further has to be taken into account. At the word level, experience and approaches exist
(e.g., Banea, Mihalcea, & Wiebe, 2011; Havasi, Speer, & Alonso, 2007). However, at
the acoustic level, technological solutions and even practical experience are broadly lacking.
Different languages can strongly influence recognition accuracy, as analysis is often
attempted independently of the spoken content. In addition, significantly different prosody
may exist between tonal (e.g., Mandarin Chinese) and nontonal languages. Interestingly,
further effects may arise from individuals speaking in foreign languages. For personality
traits, differences have been found in bilingual speakers (Chen & Bond, 2010; Ramírez-
Esparza et al., 2006).

Context Exploitation
It seems plausible that exploitation of contextual knowledge, such as multimodal, sit-
uational, local, dialogue, or interaction context, can enrich the analysis of emotion. So
far, this has mostly been investigated in the context of spoken dialogue systems (e.g.,
Forbes-Riley & Litman, 2004; Liscombe, Riccardi, & Hakkani-Tür, 2005; Callejas &
López-Cózar, 2008). However, there are a number of further application cases where
context can be assumed to be beneficial (e.g., Tawari & Trivedi, 2010a).

Packet Loss
A topic broadly ignored so far is the effect of lossy transmission, that is, how does, for
example, the loss of speech data packets or a change of packet order caused by delay and jitter
influence the accuracy of affective speech analyzers? Some studies lead in this direction, such
as the impact on feature-specific characteristics of speech (e.g., Kajackas, Anskaitis, &
Gursnys, 2008) or the effect of “gating,” that is, cutting off the speech after a given time
from the speech onset (Schuller & Devillers, 2010).
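One simple, assumed way of probing this question empirically is to drop random blocks of frames from a feature contour, mimicking lost packets, and to observe how the utterance-level functionals drift; the loss rate, packet size, and synthetic contour below are arbitrary illustrations.

# Packet-loss simulation on a frame-level contour (illustrative sketch).
import numpy as np

rng = np.random.default_rng(4)
f0 = 200.0 + 20.0 * np.sin(np.linspace(0, 4 * np.pi, 400))   # synthetic F0 contour

def simulate_packet_loss(contour, packet_frames=10, loss_rate=0.2):
    packets = contour.reshape(-1, packet_frames)              # group frames into "packets"
    keep = rng.random(len(packets)) >= loss_rate              # True = packet arrived
    return packets[keep].ravel()

clean_stats = np.array([f0.mean(), f0.std(), f0.max() - f0.min()])
lossy = simulate_packet_loss(f0)
lossy_stats = np.array([lossy.mean(), lossy.std(), lossy.max() - lossy.min()])
print("absolute drift of [mean, std, range]:", np.abs(clean_stats - lossy_stats))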

Overlapping Speakers
While research on emotion in speech has touched upon various types of noisy conditions
already, the most challenging “noise” case, that is, overlapping speakers, has not yet
been investigated down to the level of speaker separation and its impact on the analysis
of each individual’s affect. Related to this, a number of use cases exist where one will be
interested in knowing the emotion of a group rather than that of an individual.

Atypical Affect Display


Affect in the voice has so far mostly been computationally analyzed in somewhat
“typical” cases. A whole range of less typical cases will need to be dealt with to enter
broad application. For example, whispered speech plays a decisive role in human com-
munication (Cirillo, 2004) or in “silent” speech interfaces. So far only a few works tar-
get this topic – mostly for vocalizations rather than actual emotion analysis (Obin, 2012;
Cirillo & Todt, 2002). Another type of atypical emotion expression is given by “patho-
logical” cases. Up to now, mostly the autism spectrum condition has been investigated in
this respect, such as changes of spectral and prosodic features (Bonneh et al., 2011),
also in relation to affect (Grossman et al., 2010). In particular, the works of Marchi et al.
(2012a, 2012b) compare the impact on recognition systems, also including child voices. A
whole range of further, less typical cases has received little or no attention in
the literature, including the regulation and masking of emotion, affect under speaker cog-
nitive or physical load, eating disorders, intoxication, sleepiness, voice pathologies, and
so on, as well as different age ranges or special environments, such as underwater or
space acoustics.

Zero Resource
For the problems discussed in the last section, namely atypical affect, there exists very
little to no data to train a recognizer. In this case, one can either exploit methods of
transfer learning or consider so-called zero-resource approaches. Ultimately, this may
mean that rule sets reported in the literature are used for analysis, such as “pitch range
is higher in angry speech than in neutral speech.” In combination with online speaker
adaptation that normalizes the system to a speaker’s neutral speech, this can be
surprisingly efficient. A challenge does, however, lie not only in the partially nonlinear
behavior of some features, but also in modeling the interplay of acoustic features.
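The toy sketch below illustrates the zero-resource idea: the system adapts online to a speaker’s neutral speech and then applies a literature-style rule such as the one quoted above; the threshold factor and synthetic contours are assumptions for illustration only.

# Zero-resource rule with online speaker normalization (toy illustration).
import numpy as np

class ZeroResourceArousalRule:
    def __init__(self):
        self.neutral_ranges = []                      # running neutral baseline per speaker

    def update_neutral(self, f0_contour):
        """Feed presumed-neutral utterances to adapt to the speaker."""
        self.neutral_ranges.append(np.ptp(f0_contour))

    def is_aroused(self, f0_contour, factor=1.5):
        """Flag the utterance if its pitch range clearly exceeds the neutral baseline."""
        baseline = np.mean(self.neutral_ranges)
        return np.ptp(f0_contour) > factor * baseline

rule = ZeroResourceArousalRule()
rule.update_neutral(200 + 10 * np.sin(np.linspace(0, 6, 300)))     # calm speech sample
print(rule.is_aroused(210 + 40 * np.sin(np.linspace(0, 6, 300))))  # -> True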

Application
While some first real-world applications, including consumer-targeted ones, use com-
putational methods for emotion recognition from vocal properties, the broad public
has likely remained unaware of this technology in their everyday lives up to now. Some mass market
products include a video-console game detecting deception (“Truth or Lies”) or phone-
voice analyzers aiming to detect stress, deception (“HandyTruster”), or affection (“Love
Detector”) as well as a vocal baby-emotion analyzer intended for young parents (“Why-
Cry”). However, general experience on user acceptance on a broad scale is missing, and
it seems timely to approach the market given the promising improvements in recognition
performance obtained over the last years by the community.

Ethics
With affective speech analyzers starting to be applied in real-world products, a number
of new ethical challenges will need to be highlighted. Such considerations have accom-
panied research in the field already (for example, as made by Cowie, 2011; Döring,
Goldie, & McGuinness, 2011; Goldie, Döring, & Cowie, 2011; Sneddon, Goldie, &
Petta, 2011). However, in particular for speech analysis, ethical questions may need to be revisited
due to the change in performance and the broadening of the tasks taken into account (Batliner
& Schuller, 2014).
Additionally, recognition systems in modern computing applications are building per-
sonalized speaker models to improve accuracy. This requires the collection and storage
of vocal samples in remote databases. In isolation this could be considered benign;
however, vocal databases are becoming more prevalent in tandem with biometrics (e.g.,
fingerprint, face) and additional behavioral measurements (e.g., location tracking, eye
movement). The protection and ownership of these data are not well defined, nor is it clear
what the unintended consequences may be in the future when large databases of personal
behavior data exist.

Conclusion

We discussed recent trends and future challenges in the field of computational analy-
sis of the vocal expression of affect. We had to limit our choice to some of the most
relevant – many more exist, and certainly also deserve attention.
Overall, we feel confident that the field has reached a new level of maturity, now await-
ing broader use in products and solutions intended for daily life and ready for
“big speech data” analysis. In the longer term, this is expected to lead to a whole new
range of experiences, approaches, and interesting research, but also to new system engineering
questions.

References

Banea, C., Mihalcea, R., & Wiebe, J. (2011). Multilingual sentiment and subjectivity. In I. Zitouni
& D. Bikel (Eds), Multilingual Natural Language Processing. Prentice Hall.
Batliner, A. & Schuller, B. (2014). More than fifty years of speech processing – the rise of compu-
tational paralinguistics and ethical demands. In Proceedings ETHICOMP 2014. Paris, France:
CERNA, for Commission de réflexion sur l’Ethique de la Recherche en sciences et technolo-
gies du Numérique d’Allistene.
Bonneh, Y. S., Levanon, Y., Dean-Pardo, O., Lossos, L., & Adini, Y. (2011). Abnormal speech
spectrum and increased pitch variability in young autistic children. Frontiers in Human Neuro-
science, 4.
Callejas, Z. & López-Cózar, R. (2008). Influence of contextual information in emotion annotation
for spoken dialogue systems. Speech Communication, 50(5), 416–433.
Chen, S. X. & Bond, M. H. (2010). Two languages, two personalities? Examining language effects
on the expression of personality in a bilingual context. Personality and Social Psychology Bul-
letin, 36(11), 1514–1528.
Cirillo, J. (2004). Communication by unvoiced speech: The role of whispering. Annals of the
Brazilian Academy of Sciences, 76(2), 1–11.
Cirillo, J. & Todt, D. (2002). Decoding whispered vocalizations: relationships between social and
emotional variables. Proceedings IX International Conference on Neural Information Process-
ing (ICONIP) (pp. 1559–1563).
Coutinho, E., Deng, J., & Schuller, B. (2014). Transfer learning emotion manifestation across
music and speech. In Proceedings 2014 International Joint Conference on Neural Networks
(IJCNN) as part of the IEEE World Congress on Computational Intelligence (IEEE WCCI).
Beijing: IEEE.
Cowie, R. (2011). Editorial: “Ethics and good practice” – computers and forbidden places: Where
machines may and may not go. In P. Petta, C. Pelachaud, & R. Cowie (Eds), Emotion-Oriented
Systems: The Humaine Handbook (pp. 707–712). Berlin: Springer.
Davidov, D., Tsur, O., & Rappoport, A. (2010). Semi-supervised recognition of sarcastic sentences
in Twitter and Amazon. In Proceedings 14th Conference on Computational Natural Language
Learning (pp. 107–116).
Deng, J. & Schuller, B. (2012). Confidence measures in speech emotion recognition based on
semi-supervised learning. In Proceedings Interspeech 2012. Portland, OR.
Deng, J., Han, W., & Schuller, B. (2012). Confidence measures for speech emotion recognition: A
start. In T. Fingscheidt & W. Kellermann (Eds), Proceedings 10th ITG Conference on Speech
Communication (pp. 1–4). Braunschweig, Germany: IEEE.
Deng, J., Zhang, Z., Marchi, E., & Schuller, B. (2013). Sparse autoencoder-based feature transfer
learning for speech emotion recognition. In Proceedings 5th biannual Humaine Association
Conference on Affective Computing and Intelligent Interaction (ACII 2013) (pp. 511–516).
Geneva: IEEE.
Deng, J., Xia, R., Zhang, Z., Liu, Y., & Schuller, B. (2014). Introducing shared-hidden-layer
autoencoders for transfer learning and their application in acoustic emotion recognition. In
Proceedings 39th IEEE International Conference on Acoustics, Speech, and Signal Processing,
ICASSP 2014. Florence, Italy: IEEE.
Dhall, A., Goecke, R., Joshi, J., Wagner, M., & Gedeon, T. (Eds) (2013). Proceedings of the 2013
Emotion Recognition in the Wild Challenge and Workshop. Sydney: ACM.
Döring, S., Goldie, P., & McGuinness, S. (2011). Principalism: A method for the ethics of
emotion-oriented machines. In P. Petta, C. Pelachaud, & R. Cowie (Eds), Emotion-Oriented
Systems: The Humaine Handbook (pp. 713–724). Berlin: Springer.
Forbes-Riley, K. & Litman, D. (2004). Predicting emotion in spoken dialogue from multiple
knowledge sources. In Proceedings HLT/NAACL (pp. 201–208).
Goldie, P., Döring, S., & Cowie, R. (2011). The ethical distinctiveness of emotion-oriented tech-
nology: Four long-term issues. In P. Petta, C. Pelachaud, & R. Cowie (Eds), Emotion-Oriented
Systems: The Humaine Handbook (pp. 725–734). Berlin: Springer.
Grossman, R. B., Bemis, R. H., Skwerer, D. P., & Tager-Flusberg, H. (2010). Lexical and affective
prosody in children with high-functioning autism. Journal of Speech, Language, and Hearing
Research, 53, 778–793.
Gunes, H., Schuller, B., Pantic, M., & Cowie, R. (2011). Emotion representation, analysis and
synthesis in continuous space: A survey. In Proceedings International Workshop on Emotion
Synthesis, Representation, and Analysis in Continuous space, EmoSPACE 2011, held in con-
junction with the 9th IEEE International Conference on Automatic Face & Gesture Recognition
and Workshops (FG 2011) (pp. 827–834). Santa Barbara, CA: IEEE.
Han, W., Zhang, Z., Deng, J., et al. (2012). Towards distributed recognition of emotion in speech.
In Proceedings 5th International Symposium on Communications, Control, and Signal Pro-
cessing, ISCCSP 2012 (pp. 1–4). Rome, Italy: IEEE.
Han, W., Li, H., Ruan, H., et al. (2013). Active learning for dimensional speech emotion recogni-
tion. In Proceedings Interspeech 2013 (pp. 2856–2859). Lyon, France: ISCA.
Havasi, C., Speer, R., & Alonso, J. (2007). ConceptNet 3: A flexible, multilingual semantic net-
work for common sense knowledge. In Recent Advances in Natural Language Processing,
September.
Hayes, A. F. & Krippendorff, K. (2007). Answering the call for a standard reliability measure for
coding data. Communication Methods and Measures, 1(1), 77–89.
Juslin, P. N. & Laukka, P. (2003). Communication of emotions in vocal expression and music
performance: Different channels, same code? Psychological Bulletin, 129, 770–814.
Kajackas, A., Anskaitis, A., & Gursnys, D. (2008). Peculiarities of testing the impact of packet
loss on voice quality. Electronics and Electrical Engineering, 82(2), 35–40.
Kövecses, Z. (2000). The concept of anger: Universal or culture specific? Psychopathology, 33,
159–170.
Lindquist, K., Feldman Barrett, L., Bliss-Moreau, E., & Russell, J. (2006). Language and the
perception of emotion. Emotion, 6(1), 125–138.
Liscombe, J., Riccardi, G., & Hakkani-Tür, D. (2005). Using context to improve emotion detection
in spoken dialog systems. In Proceedings of INTERSPEECH (pp. 1845–1848).
Mahdhaoui, A. & Chetouani, M. (2009). A new approach for motherese detection using a semi-
supervised algorithm. Machine Learning for Signal Processing XIX – Proceedings of the 2009
IEEE Signal Processing Society Workshop, MLSP (pp. 1–6).
Marchi, E., Schuller, B., Batliner, A., et al. (2012a). Emotion in the speech of children with
autism spectrum conditions: Prosody and everything else. In Proceedings 3rd Workshop on
Child, Computer and Interaction (WOCCI 2012), Satellite Event of Interspeech 2012. Portland,
OR: ISCA.
Marchi, E., Batliner, A., Schuller, B., et al. (2012b). Speech, emotion, age, language, task, and
typicality: Trying to disentangle performance and feature relevance. In Proceedings 1st Inter-
national Workshop on Wide Spectrum Social Signal Processing (WS3P 2012), held in conjunc-
tion with the ASE/IEEE International Conference on Social Computing (SocialCom 2012).
Amsterdam, The Netherlands: IEEE.
Obin, N. (2012). Cries and whispers – classification of vocal effort in expressive speech. In Pro-
ceedings Interspeech. Portland, OR: ISCA.
Patel, S. & Scherer, K. R. (2013). Vocal behaviour. In J. A. Hall & M. L. Knapp (Eds), Handbook
of Nonverbal Communication. Berlin: Mouton-DeGruyter.
Ramírez-Esparza, N., Gosling, S. D., Benet-Martínez, V., Potter, J. P., & Pennebaker, J. W. (2006).
Do bilinguals have two personalities? A special case of cultural frame switching. Journal of
Research in Personality, 40, 99–120.
Riviello, M. T., Chetouani, M., Cohen, D., & Esposito, A. (2010). On the perception of emotional
“voices”: a cross-cultural comparison among American, French and Italian subjects. In Analy-
sis of Verbal and Nonverbal Communication and Enactment: The Processing Issues (vol. 6800,
pp. 368–377). Springer LNCS.
Sauter, D., Eisner, F., Ekman, P., & Scott, S. K. (2010). Cross-cultural recognition of basic emo-
tions through nonverbal emotional vocalizations. Proceedings of the National Academy of Sci-
ences of the United States of America, 107(6), 2408–2412.
Sauter, D. A. (2006). An investigation into vocal expressions of emotions: the roles of valence,
culture, and acoustic factors. PhD thesis, University College London.
Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research. Psycho-
logical Bulletin, 99, 143–165.
Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech
Communication, 40, 227–256.
Scherer, K. R. & Brosch, T. (2009). Culture-specific appraisal biases contribute to emotion dis-
positions. European Journal of Personality, 23, 265–288.
Schröder, M., Devillers, L., Karpouzis, K., et al. (2007). What should a generic emotion markup
language be able to represent? In A. Paiva, R. W. Picard, & R. Prada (Eds), Affective Computing
and Intelligent Interaction: Second International Conference, ACII 2007, Lisbon, Portugal,
September 12-14, 2007, Proceedings. Lecture Notes on Computer Science (LNCS) (vol. 4738,
pp. 440–451). Berlin: Springer.
Schuller, B. (2012). The computational paralinguistics challenge. IEEE Signal Processing Maga-
zine, 29(4), 97–101.
Schuller, B. & Batliner, A. (2013). Computational Paralinguistics: Emotion, Affect and Personal-
ity in Speech and Language Processing. Hoboken, NJ: Wiley.
Schuller, B. & Devillers, L. (2010). Incremental acoustic valence recognition: An inter-corpus
perspective on features, matching, and performance in a gating paradigm. In Proceedings Inter-
speech (pp. 2794–2797). Makuhari, Japan: ISCA.
Schuller, B., Dunwell, I., Weninger, F., & Paletta, L. (2013a). Serious gaming for behavior
change – the state of play. IEEE Pervasive Computing Magazine, Special Issue on Under-
standing and Changing Behavior, 12(3), 48–55.
Schuller, B., Steidl, S., Batliner, A., et al. (2013b). The INTERSPEECH 2013 computational
paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings Interspeech
2013 (pp. 148–152). Lyon, France: ISCA.
Silverman, K., Beckman, M., Pitrelli, J., et al. (1992). ToBI: A standard for labeling English
prosody. In Proceedings ICSLP (vol. 2, pp. 867–870).
Sneddon, I., Goldie, P., & Petta, P. (2011). Ethics in emotion-oriented systems: The challenges for
an ethics committee. In P. Petta, C. Pelachaud, & R. Cowie (Eds), Emotion-Oriented Systems:
The Humaine Handbook. Berlin: Springer.
Sundberg, J., Patel, S., Björkner, E., & Scherer, K. R. (2011). Interdependencies among voice
source parameters in emotional speech. IEEE Transactions on Affective Computing, 99, 2423–
2426.
Tawari, A. & Trivedi, M. M. (2010a). Speech emotion analysis: Exploring the role of context.
IEEE Transactions on Multimedia, 12(6), 502–509.
Tawari, A. & Trivedi, M. M. (2010b). Speech emotion analysis in noisy real world environment.
In Proceedings 20th International Conference on Pattern Recognition (ICPR) (pp. 4605–4608).
Istanbul, Turkey: IAPR.
Weninger, F., Eyben, F., Schuller, B., Mortillaro, M., & Scherer, K. R. (2013). On the acoustics
of emotion in audio: What speech, music and sound have in common. Frontiers in Psychology,
Emotion Science, Special Issue on Expression of emotion in music and vocal communication,
4(292), 1–12.
Wöllmer, M., Eyben, F., Reiter, S., et al. (2008). Abandoning emotion classes – towards continu-
ous emotion recognition with modelling of long-range dependencies. Proceedings Interspeech
2008 (pp. 597–600). Brisbane, Australia: ISCA.
Wöllmer, M., Weninger, F., Knaup, T., et al. (2013). YouTube movie reviews: Sentiment analysis
in an audiovisual context. IEEE Intelligent Systems Magazine, Special Issue on Statistical
Approaches to Concept-Level Sentiment Analysis, 28(3), 46–53.
Wu, D. & Parsons, T. (2011). Active class selection for arousal classification. Proceedings Affec-
tive Computing and Intelligent Interaction (ACII) (pp. 132–141).
Zhang, Z. & Schuller, B. (2012). Active learning by sparse instance tracking and classifier confi-
dence in acoustic emotion recognition. In Proceedings Interspeech 2012. Portland, OR: ISCA.
Zhang, Z., Weninger, F., Wöllmer, M., & Schuller, B. (2011). Unsupervised learning in cross-
corpus acoustic emotion recognition. Proceedings 12th Biannual IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU 2011) (pp. 523–528). Big Island, HI: IEEE.
Zhang, Z., Deng, J., Marchi, E., & Schuller, B. (2013a). Active learning by label uncertainty for
acoustic emotion recognition. Proceedings Interspeech 2013 (pp. 2841–2845). Lyon, France:
ISCA.
Zhang, Z., Deng, J., & Schuller, B. (2013b). Co-training succeeds in computational paralinguis-
tics. In Proceedings 38th IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP 2013) (pp. 8505–8509). Vancouver: IEEE.
7 Self-presentation: Signaling Personal
and Social Characteristics
Mark R. Leary and Katrina P. Jongman-Sereno

When people interact, their behaviors are greatly influenced by the impressions they
have of one another’s personalities, abilities, attitudes, intentions, identities, roles, and
other characteristics. In fact, many important outcomes in life – outcomes as diverse as
friendships, professional success, income, romantic relationships, influence over others,
and social support – depend to a significant extent on the impressions that people make
on others. Knowing that others respond to them on the basis of their public impressions,
people devote considerable thought and energy to conveying impressions that will lead
others to treat them in desired ways. In many instances, the impressions people project
of themselves are reasonably accurate attempts to let other people know who they are
and what they are like (Murphy, 2007). At other times, people may convey impressions
of themselves that they know are not entirely accurate, if not blatantly deceptive, when
they believe that fostering such images will result in desired outcomes (Hancock &
Toma, 2009).
Social and behavioral scientists refer to people’s efforts to manage their public images
as self-presentation or impression management (Goffman, 1959; Schlenker, 2012).
Some researchers use different terms for the process of controlling one’s public image
depending on whether the efforts are honest or deceitful and whether they involve
impressions of one’s personal characteristics or information about one’s social roles
and identity. But we will use the terms interchangeably to refer to any intentional effort
to convey a particular impression of oneself to another person without respect to the
accuracy or content of the effort.

Tactics of Self-presentation

Nearly every aspect of people’s behavior provides information from which others can
draw inferences about them, but actions are considered self-presentational only if they
are enacted, at least in part, with the goal of leading other people to perceive the indi-
vidual in a particular way. People convey information about their personal and social
characteristics using a wide array of tactics.

Verbal Claims
The most direct self-presentational tactics involve verbal statements that make a particu-
lar claim regarding one’s personal or social characteristics. By telling others about their
personalities, abilities, backgrounds, likes and dislikes, accomplishments, education, occupations, roles, and so on, people can convey desired impressions of themselves.
Although verbal self-presentations often occur in face-to-face encounters, people also
present themselves in writing, such as through letters, emails, resumes, personal ads,
and postings to social media sites.
A less direct verbal tactic is to convey information that implies that one possesses
a particular attribute without making an explicit claim. People can convey information
about themselves by talking about their experiences, attitudes, and reactions to events,
and by explaining their behavior to others. For example, recounting a personal expe-
rience can convey an impression (“That reminds me of the time I was hiking alone in
Montana”) as does expressing attitudes (“I am against capital punishment”) and explain-
ing one’s behavior (“I guess I was just too tired to do well”). Of course, such statements
are often merely nontactical fodder for conversation, but they can also be enacted strate-
gically to make particular impressions or to evoke desired reactions from other people.
Just as important as what people say about themselves is what they do not say. People
sometimes manage their impressions by withholding information – by not mentioning
that they possess a particular trait, had a particular experience, or hold an attitude that
might lead others to form an undesired impression of them (Ellemers & Barreto, 2006).

Nonverbal Behavior
People also convey information about themselves nonverbally. Again, people cannot
help but express self-relevant information through nonverbal channels, but they some-
times intentionally and deliberately manage their public impressions through their facial
expressions, body position, direction of attention, gestures and other movements, or by
controlling spontaneous nonverbal behaviors that might convey an undesired impres-
sion. For example, people sometimes express, conceal, exaggerate, or fake their emo-
tional reactions nonverbally to signal information about their characteristics and states.
Photographs have been used as a nonverbal self-presentational tactic since the inven-
tion of the camera, but their importance has increased with the spread of Internet social
media such as Facebook. The photos and videos that people post online are selected in
part to convey desired impressions of their appearance, activities, and personal charac-
teristics. Granted, people sometimes appear not to have considered how certain audi-
ences might react to their posted photos, but presumably they do not intentionally try
to make undesired impressions. Research suggests that photographs posted online may
influence targets’ impressions of an individual more strongly than verbal descriptions
(Van der Heide, D’Angelo, & Schumaker, 2012).

Props
Props are physical objects that can be used to convey personal or social information
about a person (Schlenker, 1980). For example, clothing and jewelry affect others’
impressions of the individual, and people sometimes choose clothes and bodily adorn-
ment to convey a particular impression in a particular context. In recent years, body
art (e.g., tattoos) has increasingly been used in Western countries to signal particular
identities to other people.
How people decorate their homes and offices – and the props that they display – are
partly selected for their self-presentational impact. People hide possessions from public
view that might convey undesired impressions and exhibit those that are consonant with
the image that they wish others to have of them (Baumeister, 1982; Goffman, 1959;
Gosling et al., 2002).

Social Associations
As the saying goes, people are known by the company they keep, so they may tout con-
nections with certain people and groups while hiding their associations with others as a
way of managing their impressions. For example, people sometimes try to enhance their
public image by advertising their connections with those who are known to be success-
ful, powerful, attractive, popular, or simply interesting. People both “bask in reflected
glory” (i.e., alert others of their connections with desirable others; Cialdini et al., 1976)
and “cut off reflected failure” (i.e., distance themselves from undesirable others; Snyder,
Lassegard, & Ford, 1986). They also increase the perceived positive attributes of people,
groups, institutions, and places with which they are already associated (burnishing), and
minimize the unfavorable features of entities with which they are connected (boosting)
(Cialdini, 1989).

Interpersonal Behaviors
A great deal of human behavior is enacted primarily for tangibly instrumental reasons
with little or no attention paid to its self-presentational implications. Yet, people some-
times behave as they do with the central goal of conveying a desired impression, and
the instrumental function of the behavior is secondary. For example, people sometimes
behave in helpful ways because they want to be seen as a nice person, respond aggres-
sively in order to be viewed as an aggressive person who should not be trifled with, and
cooperate even when they prefer not to do so in order to be seen as a cooperative person.

Determinants of Self-presentation

A great deal of research has examined factors that determine the kinds of impressions
that people try to create (for reviews, see Baumeister, 1982; Leary, 1995; Schlenker,
1980, 2012). Many such determinants involve features of the social context, such as the
person’s goals in the situation, the identity of others who are present, the person’s role,
and social norms. Other antecedents of self-presentation can be traced to the person’s
self-image and personality. Here we describe five primary determinants of the specific
images that people try to convey.

Goals
Fundamentally, people engage in self-presentation as a means of achieving their goals
by leading other people to respond to them in desired ways. As a result, the images that
people try to project are strongly determined by their goals in a particular situation. For
example, when people believe that being liked will help them to achieve their goals,
they present images of being agreeable or approachable, whereas when people believe
that appearing task-oriented will be more beneficial they describe themselves as more
task-focused (Leary et al., 1986).
People’s goals in life are more often facilitated by being viewed as possessing socially
desirable attributes rather than undesirable ones. For example, being seen as likable,
competent, or moral generally leads to better outcomes than being regarded as unlik-
able, incompetent, or immoral. For this reason, people generally desire to make positive,
socially desirable impressions on other people. However, depending on the situation,
people who wish to enhance the desirability of their social image may focus on either
claiming positive characteristics (attributive self-presentation) or denying negative char-
acteristics (repudiative self-presentation) (Hermann & Arkin, 2013).
Furthermore, in some encounters people may believe that making a socially undesir-
able impression will help them achieve desired goals and in such situations people may
present unfavorable images of themselves. For example, people may foster impressions
of being irresponsible, helpless, or even mentally unstable when such images lead others
to respond to them in desired ways (Braginsky, Braginsky, & Ring, 1969; Kowalski &
Leary, 1990), and people who belong to deviant social groups may foster impressions
that are deemed undesirable by outgroups (Schütz, 1998).

Target
To be successful, people’s self-presentations must be tailored to the preferences and
values of the target audience (Leary, 1995). This does not necessarily mean that people
deceptively change themselves as they move from one interaction to another as if they
were some kind of social chameleon. Rather, people have an immense warehouse of
personal and social characteristics that they can honestly convey without lying about
themselves, and they often select which ones to emphasize to a particular target without
dissimulating. However, when people believe that presenting themselves accurately to
a particular audience will have undesired consequences, they may misrepresent them-
selves (Ellemers & Barreto, 2006).
Research also shows that people present themselves differently to friends than to
strangers. In general, people present themselves more modestly to friends than strangers
(Tice et al., 1995) and are also less motivated to manage their impressions when with
people they know well, presumably because others already have well-formed impres-
sions of them (Leary et al., 1994). Other research shows that the mere presence of a
friend can lead people to try to make more positive impressions and to disclose more
about themselves to a stranger (Pontari & Glenn, 2012).

Roles and Norms


People are under social pressure to convey impressions that are consistent with their
current social role and with the norms that are operating in a particular situation. For
example, when a woman is speaking to her board as CEO of a company, her role requires
that she convey a quite different public persona than when she is drinking with close
friends after work. Conveying her “fun-loving friend” image would compromise her
success in the boardroom just as conveying her “CEO” image would create problems
while socializing with friends. Similarly, situational norms dictate how people should
appear to others. At a lively party, norms dictate that people should appear to be enjoying
themselves no matter how they might actually feel, whereas norms at a funeral would
caution against appearing to enjoy oneself.

Self-concept
As noted earlier, people often manage their impressions to convey what they view as
an accurate impression of themselves. Unless they are motivated to hide, exaggerate, or
distort information about themselves, people are usually comfortable presenting images
that are consistent with how they see themselves, so their public images are guided by
their self-concepts. Although the notion that people sometimes make a special effort to
convey an honest impression of themselves may initially seem odd, others often cannot
form an accurate impression of another person unless he or she proactively fosters that
impression. People often find it difficult, if not impossible, to infer what others are like
without the person’s deliberate efforts to convey information about him- or herself. Fur-
thermore, evidence suggests that trying to make an impression can increase the accuracy
of targets’ views of a person (Human et al., 2012).
Even when people might wish to present themselves differently than they really are,
they often elect to be honest because they are concerned that they will be unable to
sustain impressions that are contrary to how they really are (Baumeister, Tice, & Hut-
ton, 1989; Schlenker, 1975). People are expected to be who and what they claim to
be (Goffman, 1959), and those who misrepresent themselves are negatively sanctioned
(Baumeister et al., 1989; Goffman, 1959; Schlenker, 1980). As a result, people tend to
present themselves consistently with what others know or are likely to find out about
them, even if it is negative (Schlenker, 1975). However, in instances in which reality
forces people to convey undesired impressions of themselves, they may compensate by
also presenting positive images of themselves on unrelated dimensions (Baumeister &
Jones, 1978). For example, people who appear incompetent may work to promote an
image of being particularly nice.
Although many people espouse the belief that people should always present them-
selves authentically – that is, consistently with their true characteristics and inclinations
(Kernis & Goldman, 2006) – we are all fortunate that almost no one does. Inauthen-
tic self-presentation is often needed in the service of decorum, politeness, and concern
for other people, and even highly authentic people must tactically decide when, where,
and to whom to present honest impressions of themselves. Indeed, occasions arise in
which presenting an honest impression of one’s characteristics, attitudes, and emotions
would indicate a serious psychological disorder. Few people would unnecessarily inform
a despised acquaintance how much they hated him or tell a grieving family member
that they were glad that the deceased person was dead. People who disclose informa-
tion about their characteristics, attitudes, and experiences at places and times that are
socially inappropriate in order to be “authentic” can disrupt social encounters, evoke
negative reactions from other people, and cause negative social, professional, and legal
consequences for themselves (Swider et al., 2011).

Trait-specific Self-presentations
People’s personalities can impel certain self-presentations because people with partic-
ular personality characteristics often want to be viewed in particular ways (Leary &
Allen, 2011). For example, people who score high on the trait of agreeableness want to
be perceived as pleasant and approachable, people who score high in power motivation
foster public images of being dominant, powerful, high status individuals (Fodor, 2009),
very hostile people foster images that they are intolerant and intimidating (Bassett, Cate,
& Dabbs, 2002), and dependent people want others to see them as helpless and in need
of support (Mongrain & Zuroff, 1995). In addition, people’s self-presentations are often
constrained by their personalities. For example, a person may be so low in conscien-
tiousness that he or she cannot believably present an image of being careful and depend-
able. Thus, people’s self-presentations are influenced by their personalities.

Self-presentational Predicaments

As much as they might try to sustain desired impressions that facilitate their goals,
people’s public images are sometimes threatened or blatantly damaged by the turn of
events. Showing oneself to be incompetent, unethical, inconsiderate, or not what
one has claimed induces strong motives to restore one’s damaged image. When people
realize that others have formed an impression of them that they did not want to convey,
they typically experience embarrassment and take steps to repair their damaged image
(Miller, 1996).
Most remedial tactics focus on convincing others that one’s actions should not be
taken as a reflection on one’s personal or social characteristics. For example, people may
claim that they were not entirely responsible for the offending behavior (excuses), that
the negative consequences of the questionable behavior were minimal (justifications),
or that no matter how badly they might have behaved on this particular occasion, they
usually behave much better, and their current actions do not reflect their true personality,
ability, or character (exceptions) (Schlenker, 1980).
People sometimes anticipate that an upcoming situation will potentially undermine
an image that they desire to convey. In such instances, they may engage in preemptive
self-presentations to lower others’ expectations (Kolditz & Arkin, 1982) or provide
information that may help to compensate for a negative impression that they might later
convey (Tyler, Burns, & Fedesco, 2011).

Conclusion

No matter what else they may be doing, people are rarely unconcerned with how they
are being perceived and evaluated by others. Much of the time, their self-presentational
concerns lurk just beneath the surface of their social interactions, perhaps constrain-
ing their behaviors but not necessarily dictating that they behave in any particular way.
However, in some situations, people are motivated to convey particular impressions of
their personal or social characteristics to other people and sometimes engage in particu-
lar behaviors in order to be viewed by others in a particular fashion. Virtually any verbal
or nonverbal signal can be used self-presentationally, and people are creative and ver-
satile in managing their impressions. Of course, people’s public images are sometimes
damaged, which prompts them to turn their attention to repairing their image.
One challenge in behavioral research on self-presentation has been capturing the
complexities of real-life self-presentation under controlled conditions (Leary, Allen, &
Terry, 2011). Despite the myriad tactics that people use in everyday self-presentation,
most research has focused on verbal claims (often conveyed via ratings on question-
naires that others will ostensibly see), with a few studies of nonverbal behavior and
posts on social media. In addition, many of the variables that affect self-presentation in
everyday life are difficult, if not impossible to recreate in controlled experiments, and
the consequences of people’s impressions in real life are obviously much greater than
those in research studies. Nonetheless, since Goffman’s (1959) seminal introduction to
self-presentation, a great deal has been learned about this exceptionally important fea-
ture of human behavior.

References

Bassett, J. F., Cate, K. L., & Dabbs, J. M., Jr. (2002). Individual differences in self-
presentation style: Driving an automobile and meeting a stranger. Self and Identity, 1, 281–288.
doi:10.1080/152988602760124892.
Baumeister, R. F. (1982). A self-presentational view of social phenomena. Psychological Bulletin,
91, 3–26. doi:10.1037/0033-2909.91.1.3.
Baumeister, R. F. & Jones, E. E. (1978). When self-presentation is constrained by the target’s
knowledge: Consistency and compensation. Journal of Personality and Social Psychology, 36,
608–618. doi:10.1037/0022-3514.36.6.608.
Baumeister, R. F., Tice, D. M., & Hutton, D. G. (1989). Self-presentational motivations and per-
sonality differences in self-esteem. Journal of Personality, 57, 547–579. doi: 10.1111/j.1467-
6494.1989.tb02384.x.
Braginsky, B. M., Braginsky, D. D., & Ring, K. (1969). Methods of Madness: The Mental Hospital
as a Last Resort. Washington, DC: University Press of America.
Cialdini, R. B. (1989). Indirect tactics of image management: Beyond basking. In R. A. Giacalone
& P. Rosenfeld (Eds), Impression Management in the Organization (pp. 45–56). Hillsdale, NJ:
Lawrence Erlbaum.
Cialdini, R. B., Borden, R. J., Thorne, A. et al. (1976). Basking in reflected glory: Three (football)
field studies. Journal of Personality and Social Psychology, 34, 366–375. doi:10.1037/0022-
3514.34.3.366.
Ellemers, N. & Barreto, M. (2006). Social identity and self-presentation at work: How attempts
to hide a stigmatised identity affect emotional well-being, social inclusion and performance.
Netherlands Journal of Psychology, 62, 51–57. doi: 10.1007/BF03061051.
Fodor, E. M. (2009). Power motivation. In M. R. Leary & R. H. Hoyle (Eds), Handbook of Indi-
vidual Differences in Social Behavior (pp. 426–440). New York: Guilford Press.
Goffman, E. (1959). The Presentation of Self in Everyday Life. New York: Doubleday.
Gosling, S. D., Ko, S. J., Mannarelli, T., & Morris, M. E. (2002). A room with a cue: Personality
judgments based on offices and bedrooms. Journal of Personality and Social Psychology, 82,
379–398.
Hancock, J. T. & Toma, C. L. (2009). Putting your best face forward: The accuracy of
online dating photographs. Journal of Communication, 59, 367–386. doi:10.1111/j.1460-
2466.2009.01420.x.
Hermann, A. D. & Arkin, R. M. (2013). On claiming the good and denying the bad: Self-
presentation styles and self-esteem. Individual Differences Research, 11, 31–43.
Human, L. J., Biesanz, J. C., Parisotto, K. L., & Dunn, E. W. (2012). Your best self helps reveal
your true self: Positive self-presentation leads to more accurate personality impressions. Social
Psychological and Personality Science, 3, 23–30. doi:10.1177/1948550611407689.
Kernis, M. H. & Goldman, B. M. (2006). A multicomponent conceptualization of authen-
ticity: Theory and research. Advances in Experimental Social Psychology, 38, 283–357.
doi:10.1016/S0065-2601(06)38006-9.
Kolditz, T. A. & Arkin, R. M. (1982). An impression management interpretation of the
self-handicapping strategy. Journal of Personality and Social Psychology, 43, 492–502.
doi:10.1037/0022-3514.43.3.492.
Kowalski, R. M. & Leary, M. R. (1990). Strategic self-presentation and the avoidance of aversive
events: Antecedents and consequences of self-enhancement and self-depreciation. Journal of
Experimental Social Psychology, 26, 322–336. doi:10.1016/0022-1031(90)90042-K.
Leary, M. R. (1995). Self-presentation: Impression Management and Interpersonal Behavior.
Boulder, CO: Westview.
Leary, M. R. & Allen, A. B. (2011). Self-presentational persona: Simultaneous management
of multiple impressions. Journal of Personality and Social Psychology, 101, 1033–1049.
doi:10.1037/a0023884.
Leary, M. R., Allen, A. B., & Terry, M. L. (2011). Managing social images in naturalistic versus
laboratory settings: Implications for understanding and studying self-presentation. European
Journal of Social Psychology, 41, 411–421. doi:10.1002/ejsp.813.
Leary, M. R., Nezlek, J. B., Downs, D. L., et al. (1994). Self-presentation in everyday interactions.
Journal of Personality and Social Psychology, 67, 664–673.
Leary, M. R., Robertson, R. B., Barnes, B. D., & Miller, R. S. (1986). Self-presentations of small
group leaders as a function of role requirements and leadership orientation. Journal of Person-
ality and Social Psychology, 51, 742–748. doi:10.1037/0022-3514.51.4.742.
Miller, R. S. (1996). Embarrassment: Poise and Peril in Everyday Life. New York: Guilford
Press.
Mongrain, M. & Zuroff, D. C. (1995). Motivational and affective correlates of dependency
and self-criticism. Personality and Individual Differences, 18, 347–354. doi:10.1016/0191-
8869(94)00139-J.
Murphy, N. A. (2007). Appearing smart: The impression management of intelligence, person
perception accuracy, and behavior in social interaction. Personality and Social Psychology
Bulletin, 33, 325–339. doi:10.1177/0146167206294871.
Pontari, B. A. & Glenn, E. J. (2012). Engaging in less protective self-presentation: The effects of
a friend’s presence on the socially anxious. Basic and Applied Social Psychology, 34, 516–526.
doi:10.1080/01973533.2012.728112.
Schlenker, B. R. (1975). Self-presentation: Managing the impression of consistency when reality
interferes with self-enhancement. Journal of Personality and Social Psychology, 32, 1030–
1037. doi:10.1037/0022-3514.32.6.1030.
Schlenker, B. R. (1980). Impression management: The Self-concept, Social Identity, and Interper-
sonal Relations. Monterey, CA: Brooks/Cole.
Schlenker, B. R. (2012). Self-presentation. In M. R. Leary & J. P. Tangney (Eds), Handbook of
Self and Identity (pp. 542–570). New York: Guilford Press.
Schütz, A. (1998). Coping with threats to self-esteem: The differing patterns of subjects with
high versus low trait self-esteem in first-person accounts. European Journal of Personality, 12,
169–186.
Snyder, C. R., Lassegard, M., & Ford, C. E. (1986). Distancing after group success and failure:
Basking in reflected glory and cutting off reflected failure. Journal of Personality and Social
Psychology, 51, 382–388. doi:10.1037/0022-3514.51.2.382.
Swider, B. W., Barrick, M. R., Harris, T. B., & Stoverink, A. C. (2011). Managing and creat-
ing an image in the interview: The role of interviewee initial impressions. Journal of Applied
Psychology, 96, 1275–1288. doi:10.1037/a0024005.
Tice, D. M., Butler, J. L., Muraven, M. B., & Stillwell, A. M. (1995). When modesty prevails:
Differential favorability of self-presentation to friends and strangers. Journal of Personality
and Social Psychology, 69, 1120–1138. doi:10.1037/0022-3514.69.6.1120.
Tyler, J. M., Burns, K. C., & Fedesco, H. N. (2011). Pre-emptive self-presentations for future
identity goals. Social Influence, 6, 259–273. doi:10.1080/15534510.2011.630240.
Van der Heide, B., D’Angelo, J. D., & Schumaker, E. M. (2012). The effects of verbal versus pho-
tographic self-presentation on impression formation in Facebook. Journal of Communication,
62, 98–116. doi:10.1111/j.1460-2466.2011.01617.x.
8 Interaction Coordination and Adaptation
Judee K. Burgoon, Norah E. Dunbar, and Howard Giles

A Biological and Social Imperative

Adaptation is a biological and social imperative – biologically, for the survival of a
species; socially, for the survival of a society. Vertebrates and invertebrates alike come
equipped with reflexes that produce involuntary survival-related forms of adaptation
in the form of fight or flight responses. In the face of a threat, a frightened organism
may sound an alarm call, emit an odor, or display a visual signal that is recognized by
species mates as fear. The fear triggers behavioral mimicry that leads the entire flock,
herd, swarm, or school to take flight en masse. Or, rage by a single individual may fuel a
contagion of aggression that turns into mob violence. These reciprocal actions may not
be easily suppressed or controlled.
Other forms of adaptation are volitional, intentional, and socially oriented. Humans
may copy the speech patterns of their social “superiors” in hopes of being viewed as
belonging to the same ingroup. Or one person’s antagonistic demeanor toward a target
may elicit a docile, calming response by the victim.
Both forms of adaptation – involuntary and voluntary – undergird social organization.
As Martin Luther King Jr. observed in his Letter from a Birmingham Jail (1963), “we
are caught in an inescapable network of mutuality.” By means of verbal and nonverbal
communication, civilized societies negotiate access to scarce resources, work out their
interpersonal relationships, and create their social organizations. Thus, communication
is fundamentally an adaptive enterprise that reflects and channels these biological and
social imperatives. How, when, and why such adaptation takes place is the topic of this
chapter.

Forms of Coordination and Adaptation

It is perhaps unsurprising that given its fundamental role in social interaction, terms
describing various forms of adaptation have proliferated, leading to conceptual and
operational disarray. The same terms have been applied to different phenomena and
different terms have been applied to the same phenomenon. Here we introduce the most
common usage from scholars of communication, psychology and linguistics who over
the course of forty years have largely converged on these definitions. These concep-
tual and operational definitions are summarized in Table 8.1. The reader is directed to
Burgoon, Dillman, and Stern (1993) and Burgoon et al. (1998) for more elaboration of
definitions.

Table 8.1 Conceptual and operational definitions of forms of adaptation, with examples.

Matching
  Description: Verbal or nonverbal behavioral similarity between actors.
  Example: A and B both whisper in a theater at Times 1 and 2.
  Operational definition: A1 = B1 and A2 = B2, or A1 − B1 = 0 and A2 − B2 = 0.

Mirroring
  Description: Visual nonverbal similarity between two or more actors.
  Example: A and B both sit with left leg crossed over right knee at Times 1 and 2.
  Operational definition: A1 = B1 and A2 = B2, or A1 − B1 = 0 and A2 − B2 = 0.

Complementarity
  Description: One actor’s verbal or nonverbal behavior is opposite the other(s).
  Example: A yells and B speaks softly at Times 1 and 2.
  Operational definition: A1 = −B1 and A2 = −B2.

Reciprocity
  Description: Changes in one actor’s verbal or nonverbal behaviors are met with similar changes of comparable functional value by the other(s).
  Example: A shows liking by increasing gaze; B reciprocates by smiling.
  Operational definition: (A2 − A1) ≅ (B2 − B1), or ΔA ≅ ΔB if and only if ΔA ≠ 0 and ΔB ≠ 0.

Compensation
  Description: Changes in one actor’s verbal or nonverbal behaviors are met with opposite behaviors of comparable functional value by the other(s).
  Example: A shows liking by increasing gaze; B shows dislike by compensating with backward lean and a frown.
  Operational definition: (A2 − A1) ≅ −(B2 − B1), or ΔA ≅ −ΔB if and only if ΔA ≠ 0 and ΔB ≠ 0.

Approach/Convergence
  Description: One actor’s verbal or nonverbal behavior becomes more like another(s) over time.
  Example: A and B begin with indirect body orientations and limited gaze at Time 1; A and B face each other more directly and increase eye contact by Time 2.
  Operational definition: abs(A1 − B1) > abs(A2 − B2), where abs = absolute difference.

Avoidance/Divergence
  Description: One actor’s verbal or nonverbal behavior becomes less like another(s) over time.
  Example: A and B smile a lot at Time 1; A becomes increasingly inexpressive and stops smiling by Time 2.
  Operational definition: abs(A1 − B1) < abs(A2 − B2), where abs = absolute difference.

Synchrony
  Description: Degree to which behaviors in an interaction are nonrandom, patterned, or synchronized in both timing and form.
  Example: A’s and B’s head nods beat in time with A’s verbal-vocal stream.
  Operational definition: (YA = XA1 + X²A1 + XA2 + X²A2) ≈ (YB = XB1 + X²B1 + XB2 + X²B2), where YA and YB are nonlinear time series regression lines for A and B.

Maintenance/Nonaccommodation/Nonmatching
  Description: An actor makes no change in his or her communication behavior in response to changes by another.
  Example: Person A shifts from dialect-free speech to using a Southern accent; Person B maintains dialect-free speech.
  Operational definition: A1 = A2 or B1 = B2, that is, ΔA = 0 or ΔB = 0.
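The operational definitions above lend themselves to straightforward computation once behaviors have been coded numerically. The following is a minimal sketch (in Python), assuming each actor’s behavior (e.g., gaze or smile intensity) has been scored at two time points; the function names, the tolerance value, and the reduction of “comparable functional value” to same-signed change are illustrative simplifications rather than part of the definitions in Table 8.1.

```python
# A minimal sketch of the Table 8.1 operational definitions, assuming each
# actor's behavior has been coded as a single number at Times 1 and 2.
# Function names and the tolerance value are illustrative assumptions.

def classify_adaptation(a1, a2, b1, b2, tol=1e-6):
    """Label the dyadic change pattern for actors A and B across Times 1 and 2."""
    delta_a, delta_b = a2 - a1, b2 - b1
    if abs(delta_a) < tol or abs(delta_b) < tol:
        # At least one actor does not change: maintenance/nonaccommodation.
        return "maintenance/nonaccommodation"
    if delta_a * delta_b > 0:
        # Changes in the same direction: reciprocity (simplified criterion).
        return "reciprocity"
    # Changes in opposite directions: compensation.
    return "compensation"

def converges(a1, a2, b1, b2):
    """Approach/convergence: A and B are more similar at Time 2 than at Time 1."""
    return abs(a1 - b1) > abs(a2 - b2)

# Example: A increases gaze (2 -> 5); B responds by smiling more (1 -> 3).
print(classify_adaptation(2, 5, 1, 3))  # "reciprocity"
print(converges(2, 5, 1, 3))            # False: the gap grows from 1 to 2
```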

Interaction Coordination
At a global level, all the terms we will introduce relate to interpersonal coordina-
tion, which Bernieri and Rosenthal (1991) defined as “the degree to which the behav-
iors in an interaction are nonrandom, patterned, or synchronized in both timing and
form” (pp. 402–403). Though an apt descriptor, this label can connote surplus mean-
ings beyond communication itself (e.g., marching in stride in a parade or avoiding
other pedestrians when crossing a street). Our focus here is communicative forms of
adaptation.

Matching, Mirroring, and Complementarity


Matching refers to behavioral patterns that are similar between two or more actors,
regardless of their cause. These need not be cases of parties coordinating with one
another. Extremely cold temperature in a room may cause everyone to shiver but putting
on a coat is a case of the actors independently adapting to the environment, not to each
other. If the cause of behavior is unknown, the most objective label for it is matching.
Mirroring is the more specific case of two actors displaying identical visual signals, such
as both resting their head on their hand. It is more likely to represent one actor’s behav-
ior being contingent on the other’s but that is not guaranteed: both could be mirroring
a third unseen party rather than each other. Finally, complementarity describes patterns
that are opposite one another, such as one person leaning forward and the other leaning
backward. These terms describe a static relationship without any temporal ordering or
change.

Compensation and Reciprocity


These next patterns are ones in which actors are actually adjusting to one another. There
is an observable change over time, one actor’s behavior is directed toward and contingent
upon what the other does, and their joint dyadic pattern can be described as interdepen-
dent (Burgoon, Stern, & Dillman, 1995). Although such patterns imply intent, they need
not imply a high degree of awareness because they are so deeply ingrained that they can
be executed easily and automatically. Compensation refers to adaptations in the oppo-
site direction, such as returning shows of rudeness with shows of kindness. Behaviors
need not be identical but should convey the same functional meaning, such as expres-
sions of liking or expressions of dominance. Nonverbally, many behaviors can substitute
for one another without loss of meaning. For example, psychological closeness (called
immediacy) can be signaled through close proximity, forward lean, direct facing, and
eye contact, or a touch could be substituted for direct facing.

Reciprocity is itself a fundamental sociological principle in which members of a given
society are expected to return good for good and to avoid harm in return for avoid-
ance of harm (Gouldner, 1960). This “eye for an eye” or “tit for tat” philosophy is the
essence of social exchange and foundational to social organization and communication.
Communicators are expected to reciprocate one another’s behavior and to do so rather
automatically. Theories of intimacy development and escalation rely on this norm: One
person’s displays of relational intimacy are expected to beget displays in kind, just as
aggression is expected to elicit reciprocal aggression. Burgoon et al. (1995) go so far
as to declare reciprocity the default pattern in human interaction, noting that when one
communicator deviates from the norm, a partner may display the expected and desired
behavior as an attempt to elicit a reciprocal response. Relevant to observation of human
interaction is that observers must be aware that what is often being witnessed is not
an actor’s deliberate and self-initiated communication pattern but rather a reciproca-
tion of behavior initiated by the partner. As an example, interviewee behavior may be
more a reflection of the interviewer’s demeanor than the interviewee’s own emotions and
attitudes.

Approach/Convergence
Approach is a form of adaptation that can be exercised by one or both participants in an
interaction. One person can stand still while the other moves closer, or both people can
move closer to each other. Convergence has the same meaning but is often used in the
context of describing speech patterns. One person’s speech converges toward another to
the extent that they become more similar.

Avoidance/Divergence
Avoidance and divergence are, as one might expect, the opposites of approach and con-
vergence. Moving farther away from someone who has moved in very close would be
both a compensatory move and avoidance. Changing from one’s usual Cockney accent
to a more standard dialect to avoid strangers with a Cockney accent striking up a con-
versation in a pub would be a case of divergence.

Interactional Synchrony
The concept of interactional synchrony was developed in the 1960s, in part, from study-
ing psychotherapy sessions, and Kendon (1970) is often regarded as a foundational con-
tributor to understanding the mechanics and consequences of what he called this “com-
plex dance.” According to sociolinguists and anthropologists who first investigated it
(e.g., Bullowa, 1975; Chapple, 1982; Condon & Ogston, 1971), interactional synchrony
adds a rhythmic component to adaptation patterns. The most common form, simulta-
neous synchrony, is when a listener entrains his or her movements to the verbal-vocal
stream of a speaker; the speaker’s tempo becomes a metronome for the two of them. The
behaviors need not be the same as long as the junctures at which change points occur
are the same. A second kind of synchrony, concatenous synchrony, refers to a serial
form of coordination from speaker to speaker in which one interactant’s actions while
speaking are mimicked by the next speaker (Burgoon & Saine, 1978). This kind of
synchrony can register when successive speakers “pick up” prior speakers’ demeanor
and language.
For conversation and dialogues to be effective, those involved need to coordinate not
only their own personal channels of communication – both verbal and nonverbal – but to
successfully achieve this interdependence together. Certainly, automatically and sponta-
neously matching, as well as calculatedly reciprocating the reward value, substance,
and rhythm of another’s communicative behaviors and actions (e.g., facial express-
ions and word choices) can involve split-second timing, and be a process that meshes
more and more as an interaction unfolds. Studies show that this can happen very early
in life between parents and their children, some claiming this process to be innate.
Interactional synchrony, if seemingly effortlessly enacted, can foster the experience
of being experientially on the same wavelength and can be, accordingly, an enabler of
rapport. That said, it has been found that moderate or intermediate levels of coordina-
tion can be the most relationally beneficial and satisfying. Indeed, in situations where
interpersonal goals are uncertain or ambiguous, intense levels of synchrony can reflect
strain, discomfort, and anxiety where communicators may, with all good intent, be try-
ing overly hard to coordinate their efforts. Various conceptualizations of empathy, rap-
port, and emotional contagion and theories of interactional coordination rely on syn-
chrony as one component of the process.
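Interactional synchrony is typically quantified from continuous behavioral time series rather than from two time points. As one common computational proxy (not the regression-based formulation given in Table 8.1), the sketch below computes the peak lagged correlation between two actors’ movement signals; the ±25-frame window and the synthetic data are assumptions made only for illustration.

```python
# A sketch of one common proxy for interactional synchrony: the peak lagged
# correlation between two actors' continuous movement signals (e.g.,
# frame-by-frame head-motion energy). Window size and data are assumptions.
import numpy as np

def max_lagged_correlation(x, y, max_lag=25):
    """Peak absolute Pearson correlation between x and y over lags of +/- max_lag samples."""
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            xs, ys = x[:lag], y[-lag:]
        elif lag > 0:
            xs, ys = x[lag:], y[:-lag]
        else:
            xs, ys = x, y
        r = np.corrcoef(xs, ys)[0, 1]
        best = max(best, abs(r))
    return best

# Synthetic example: B's motion echoes A's with a 10-frame delay plus noise,
# so the peak lagged correlation is high even though the zero-lag match is weak.
rng = np.random.default_rng(0)
sig_a = rng.normal(size=500)
sig_b = np.roll(sig_a, 10) + 0.5 * rng.normal(size=500)
print(max_lagged_correlation(sig_a, sig_b))
```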

Maintenance/Nonaccommodation/Nonmatching
Of course, humans do not always adapt to one another. Cappella (1984), among
others, has noted that humans can be quite consistent behaviorally over time, maintain-
ing their own characteristic communication style because they lack the communication
skills to adjust. Or they can actively resist matching another’s speech style, opting to
maintain their own dialect, accent, tempo, and the like “to make a statement,” as in the
case of gang members refusing to speak grammatical English around authority figures.
Whether displayed passively and inadvertently or actively and strategically, these pat-
terns reflect nonaccommodation to the communication of others (Giles, Coupland, &
Coupland, 1991).

Models and Theories of Adaptation

Having described the various patterns of coordination and adaptation that populate
human interaction, we turn now to the theories and models that have been advanced
to account for their existence, causes, and effects. Our brief journey through these
models is organized according to their causal mechanisms. We begin with the earliest
models that featured reflexive reactions and arousal- or stress-driven factors under the
governance of the oldest part of the brain. We next move to models that add affect and
valence, which are under the control of the limbic system and paleomammalian brain.
Next are models that add higher-order cognitive elements under the control of the neo-
mammalian brain. We conclude with the most complex communication-based models
that incorporate all of the foregoing.

Biologically Based Models


At a biological level, humans, like other organisms, are equipped with reflexive forms
of adaptation directed toward coping with threats and risks. These include reflexes that
orient the organism to novel stimuli or trigger defensive reflexes preparing the organism
for fight or flight (Sokolov, 1963). Thus, the earliest forms of interaction in the evolution
of a species are forms of approach (fight) and avoidance (flight) that enable survival in
the face of threats. Approach and avoidance are shorthand terms for the cognitive and
emotional activity that is oriented either toward or away from threat (Roth & Cohen,
1986).
Recent theorizing by Woody and Szechtman (2011) proposes that humans and other
species have evolved a complex neurobiological circuit, dubbed the security motivation
system, that can detect subtle indicators of threat and activate precautionary behaviors in
response. This responsivity may be rooted in each organism’s attunement to the presence
of species mates (referred to as compresence or audience effects) that creates arousal
and drive states known as social facilitation. Described by Zajonc (1965) as one of the
most fundamental forms of inter-individual influence, social facilitation effects were
observed as far back as 1898 by Triplett and 1925 by Travis, among others. These exper-
imental psychologists found that organisms are responsive to the actual (or imagined)
physical presence of species mates and that the mere presence of them can facilitate the
performance of well-learned responses but impair performance of newly learned ones.
For humans, this powerful effect on performance underscores how attuned humans are
to one another and thus likely to modify their behavior when others are co-present. The
complex reactions humans exhibit in interactional contexts can be attributed in part to
this basic social facilitation effect.
In one of the first formalized theories of interaction drawing upon approach-
avoidance forces, Argyle and Dean’s (1965) equilibrium theory (also called affiliative
conflict theory) posited that humans and other animals are biologically programmed to
seek a state of homeostasis between competing cross-pressures for distance and prox-
imity. Distance accords territory and privacy whereas proximity satisfies needs for affil-
iation and the safety of the group. Therefore, if one person approaches, the other should
compensate and respond with avoidance so as to restore equilibrium.
One other biologically based form of adaptation has been called the chameleon effect
(Chartrand & Bargh, 1999), which refers to humans’ apparently unconscious tendency
to mimic the behaviors of others. Although often thought to be an innate reaction,
Bavelas et al. (1988) showed that this phenomenon has been adapted in a uniquely com-
municative way to display empathy by showing what a listener perceives a speaker is
feeling, as in wincing when hearing a person tell of running into a low-hanging
tree limb. In these cases, the mimicry is not a direct match with what the speaker is
displaying at the moment but, rather, what the narrator of an incident is perceived to
have experienced.

Affect Based Models


Emotions are fundamental to all human experience and represent a basic level of com-
munication about our well-being, internal states, and behavioral intentions (Nyklíček,
Vingerhoets, & Zeelenberg, 2011). The affect-based theories reviewed here all concern
how human responses to affective signals can influence the outcomes of interactions
with others and lead to compensatory or reciprocal behavioral responses. Many of these
theories are reviewed in more depth by Burgoon et al. (1995), but a brief overview of
each is provided here.

Affiliative Conflict Theory


Researchers have emphasized intimacy and immediacy behaviors, both of which
enhance closeness, as the primary methods of communicating affective messages
(Andersen & Andersen, 1984; Coutts & Schneider, 1976). Argyle and Dean’s (1965)
affiliative conflict theory (ACT, also called equilibrium theory) suggests that approach
and avoidance forces underlie the reciprocity of nonverbal social behaviors as a signal
of the intimacy of the relationship in that approach forces emphasize the gratification
of affiliative needs, while avoidance forces can be interpreted as the fear of being open
to scrutiny and rejection. If an actor’s nonverbal expressions of intimacy or immediacy
in the form of conversational distance, eye contact, body lean, and body orientation dis-
rupt equilibrium, the partner is predicted to compensate on one of the same behaviors to
restore equilibrium (Coutts & Schneider, 1976). The theory does not explain instances
in which interactants would reciprocate intimacy, nor do subsequent theories specify which
conditions will cause equilibrium levels to increase, decrease, or be unaffected. It also
avoids discussing the causal mechanisms that would explain the relationship between
approach and avoidance tendencies. Thus, although ACT was a formative theory in the
history of the study of nonverbal communication, it has been subsumed by other theo-
ries and largely abandoned by researchers who perhaps heeded a request for “respectful
internment” of the theory (Cappella & Greene, 1982: 93).

Discrepancy Arousal Theory


Social signaling is bi-directional, as captured by the term “mutual influence”
(Cappella & Greene, 1984). Offered as an alternative to ACT is discrepancy arousal
theory (DAT), which proposes that arousal is a key mediator of whether changes in one
interactant’s behavior elicit compensatory or reciprocal responses (Cappella & Greene,
1982). Changes in cognitive arousal are proposed to have an inverted-U relationship
such that small discrepancies are accompanied by small changes in arousal that are
experienced as rewarding (or perhaps neutral or unnoticed), with positive affect. Large
discrepancies are accompanied by large changes in arousal, which are experienced
as aversive and prompt negative affect. Positive affect leads to an approach response
whereas negative affect leads to withdrawal and a reduction of behavioral involvement.
Through DAT, it is postulated that these reciprocal and compensatory responses result
from two sources: (1) the degree of discrepancy between a partner’s behavioral involve-
ment and the expectations derived from situational norms, individual preferences, and
experience with the particular partner and (2) the degree of arousal caused by the dis-
crepant behaviors (Cappella & Greene, 1984).
Although the theory attempts to predict when reciprocity and compensation will
occur, the two publications on the theory (Cappella & Greene, 1982, 1984) are incon-
sistent in equating reciprocity with approach or with small deviations (approach or
avoidance), and also give examples in which certain situational and contextual vari-
ables reverse the predictions. For example, contextual or individual difference variables
could cause one to compensate for the affect displayed by one’s partner even though
the discrepancy appears to be slight. Granting that DAT is an improvement over ACT,
Patterson (1983) nonetheless noted that it suffered from structural limitations and inde-
terminacies and proposed the sequential functional model as an alternative.

Sequential Functional Model


The sequential functional model (SFM; Patterson, 1982, 1983) divides interaction into
pre-interaction and interaction phases. Antecedents comprised of personal characteris-
tics, past experiences, and relational-situational constraints govern the interaction prior
to it beginning. Certain mediating processes influence the differential involvement of
interactants both at the pre-interaction stage and during the interaction itself by (a)
determining behavioral predispositions for differential involvement, (b) precipitating
arousal change, and (c) developing cognitive and affective expectancies (Patterson,
1982). These three mediators limit the range of involvement that is initiated and deter-
mine when behavior adjustments are required or when they are not needed to maintain
the stability of the cognitive-arousal processes. The resultant outcomes of the mediating
processes affect whether or not a certain function such as expressing intimacy, social
control, or performing a service or task is served (Edinger & Patterson, 1983). Cappella
and Greene (1984) argue in a critique of Patterson’s work that absent direct assessment
of cognitions, a number of rival explanations for outcomes could be entertained; Giles
and Street (1994) offered similar critiques of the indeterminacies in both DAT and SFM.
Though the model did not generate much traction empirically, it stands as an excellent
depiction of key variables that must be taken into account when predicting and explain-
ing adaptation processes.

Attachment Theory and Emotional Regulation


Whereas the SFM addresses what happens during specific encounters, the next
theory – attachment theory (AT) – concerns more enduring and innately driven ori-
entations that can permeate all of an individual’s interactions during an extended time
period. According to the theory, emotional reactions innately govern human interac-
tions through infancy forward in the form of the attachments that are formed between
infant and caregiver. AT holds that people are born with an innate tendency to seek
closeness to others and that the physical, emotional, and social support they receive
from caregivers affects their ability to form secure relationships in adulthood. Chil-
dren who receive consistent parental support develop secure attachment styles, whereas
those who receive inconsistent support develop anxious attachment styles, and those
who lack parental support develop avoidant attachment styles (Bowlby, 1980, 1982).
The conceptualization of adult attachment evolved from crossing a mental model of self
with a mental model of others. People with a negative model of self experience anxiety,
whereas those with a favorable view of self are characterized by optimism and confi-
dence in times of distress. Those with a negative model of others avoid attachment and
are characterized by hyper-vigilance to the social and emotional cues of others (Richards
& Hackett, 2012), whereas those with a positive model of others seek rather than avoid
connection. The resultant typology includes the “secure style,” in which both the anxi-
ety and avoidance dimensions are low; the “anxious style,” in which anxiety is high and
avoidance is low; the “avoidant style,” which is characterized by high anxiety and high
avoidance (Mikulincer, Shaver, & Pereg, 2003); and the “detached style,” character-
ized by low anxiety and high avoidance. Others have distinguished between “dismissing
avoidants” and “fearful avoidants” in which both experience high avoidance, but fear-
ful avoidants experience high anxiety as well (Bartholomew & Horowitz, 1991). These
different attachment styles, which are fairly stable, have been shown to have powerful
effects on the social signals a person sends and the interpretations assigned to others’
signals.
For example, several scholars have linked attachment styles to emotion regulation
(ER) which “refers to the process by which individuals attempt to influence their
own emotions; when they experience them, and how they express them behaviorally”
(Richards & Hackett, 2012: 686). ER includes the regulation of affective states covering
dimensions such as: overt to covert (how perceivable to others it is), explicit to implicit
(whether it is conscious or unconscious), and voluntary to automatic (whether there is
intent behind the display or not) (Nyklíček et al., 2011). Secure individuals are better
able to regulate emotions than are either anxiously or avoidantly attached individuals,
but even individuals with an anxious attachment orientation will form higher quality
relationships when they use emotion regulation strategies such as suppression (altering
emotional responses to felt emotions) and reappraisal (rethinking a situation to control
the emotional response) (Richards & Hackett, 2012).

Cognitive Theories
These next theories give cognitions preeminence over emotion or reflexive actions.
Andersen’s cognitive valence theory (CVT; Andersen, 1998; Andersen et al., 1998)
is focused on the intimacy or immediacy expressed by either party in a dyad and the
resultant outcomes in three areas: the degree to which people change their cognitive
appraisals of their partner, the degree to which they reciprocate or compensate for
their partners’ behavior, and the changes in relational closeness that result from the
intimacy expressed (Andersen, 1998). Any increase in intimacy by one partner that is
perceived and processed by the other partner activates what Andersen calls six “cog-
nitive schemata”: (1) the appropriateness of the behavior according to cultural norms,
(2) personality traits, (3) interpersonal valence or reward of the communicator, (4) rela-
tional or (5) situational appropriateness according to the context, and (6) transitory
psychological or physical states. If any of the cognitive schemata are evaluated neg-
atively (i.e., the behavior is deemed culturally or relationally inappropriate), then the
result would be negative appraisals, compensation, and/or diminished relational close-
ness (Andersen, 1999). Only if all six schemata are evaluated positively would positive
relational or behavioral outcomes occur. In a study of opposite-sex friendship dyads
designed to test competing hypotheses from CVT, EVT, and DAT, one of the friends
was instructed to display high immediacy or a more moderate level of immediacy. The
results did not comport fully with CVT predictions because the high immediacy condi-
tion produced a mix of compensatory and reciprocal responses (Andersen et al., 1998).

Communication Theories
These last theories include many of the foregoing principles and constructs. Whereas
preceding theories originating from communication scholars had factors such as arousal
or cognition as their centerpiece, however, these last two theories accord centrality to
communication.

Communication Accommodation Theory


Communication accommodation theory (CAT) is a framework for describing and
explaining why people do or do not adapt their communication with each other, together
with the personal and social consequences of these practices; see Giles (2016) and
McGlone and Giles (2011) for histories of CAT’s development. An important element
of the theory is that speakers and writers accommodate to where they believe the oth-
ers “stand” communicatively and, consequently, sometimes this can be miscarried and,
thereby, be a source of contention and/or conflict. In this way, the theory has some
highly subjective twists to it.
CAT devoted a significant proportion of its early attention to examining how and
why we converge to or diverge from each other to various degrees (mutually or asym-
metrically). The former occurs when interactants’ communication styles become more
similar to another by choice of slang, jargon, accent, pitch, hand movements, and so on.
When the features involved connote social value (e.g., a fast speech rate is associated
with competence, while a slow rate with incompetence), convergence can be termed
“upward” or “downward.” The former occurs, for example, when an individual approx-
imates another’s more formal, prestigious communicative style, while the latter refers
to matching another’s more colloquial, informal, and/or nonstandard-accented message.
For example, a speaker of standard British English who adopts the Cockney accent of
his taxi driver is using downward convergence.
The convergent process is considered to be a barometer of an individual’s desire to
signal attraction to, identification with, and/or glean social approval from another. Such
moves convey respect (and sometimes effort when consciously crafted) which in turn
engenders appreciative responses from those accommodated (e.g., liking and altruism).
An important element in the approval-seeking process is social power: for instance,
interviewees will be inclined to converge more toward their interviewers than vice versa;
newly arrived immigrants more toward the host community than the converse, and sales-
persons more than clients. These accommodations – whether they be matching another’s
utterance length, smiling, or language choice – can be regarded as an attempt on the part
of communicators to modify, conjure up, or disguise their personae in order to make them
more acceptable to the listeners and readers so addressed. Furthermore, cross-cultural
studies show that accommodation from both younger as well as same-aged peers can
enhance older adults’ reported life satisfaction.
Speech convergence may also be a mechanism whereby speakers make themselves
better understood and can be an important component of the influential construct “com-
municative competence” and other related social skills. The more a sender reflects
the receiver’s own mode of communication, the more easily their message should be
understood. In addition, interactants can take into account their partner’s knowledge of,
or sophistication about, a particular topic being discussed (called the “interpretability
strategy”) as well as attuning to their emotive states and conversational needs. Hence,
accommodating one’s partner’s desire to talk about certain topics or themes rather
than others (called the “discourse management strategy”) can increase comprehension,
coherence, as well as communication and relational satisfaction.
CAT proposes that people do not resonate to nonaccommodating others. This can
signal, other things being equal, that the non-converger does not need the other’s
approval or respect, a perception that does not easily enhance self-esteem for a recipi-
ent. Indeed, this often results in negative attributions about, and personal derogation of,
the nonaccommodator. Attributions, however, can play an important role in the evalu-
ative and interpretive process of judging accommodators and nonaccommodators. For
instance, should the nonaccommodator be known for not having the language repertoire
to effect convergence, then the lack of it can be explicable, discounted, and perhaps even
forgiven.
CAT also sheds light on why interactants may sometimes choose to accentuate com-
municative differences between themselves and others. This may occur through so-
called speech maintenance where people deliberately avoid using another’s communica-
tive style and, instead, retain their own idiosyncratic stance or that of their social group’s;
for instance, by not switching languages when they easily have the capability of doing
so. Moving along a social differentiation continuum, people can diverge from others by
adopting a contrasting language, dialect, jargon, speech rate, or gestural style. Draw-
ing upon social identity theory (see Giles, Bourhis, & Taylor, 1977), CAT has argued
that the more a person psychologically invests in or affiliates with a valued ingroup (be
it ethnic, gay, religious, political, or whatever), the more they will want to accentuate
positively their identity by communicatively divergent means when confronting con-
trastive (and especially threatening) outgroup members. This will be evident when the
dimensions diverged are salient components of their social identity, or when the rele-
vant outgroup has threatened some aspect of their social livelihood, and particularly by
illegitimate means. In this way, CAT acknowledges evolving and dynamic historical,
cultural, socio-structural, and political forces (see Giles & Giles, 2012) and, thereby, is
able to theorize about both interpersonal and intergroup encounters. Such a stance can
explain why people can simultaneously or sequentially converge on some communica-
tive features, while diverging on others.
All in all, it appears that satisfying communication requires a delicate balance
between convergence – to demonstrate willingness to communicate – and divergence –
to maintain a healthy sense of group identity. A final distinction introduced here is that
CAT distinguishes between objective and psychological accommodation; the former is
that which can be measured, and the latter that which is subjectively construed. For
instance, sometimes objective divergence can fulfill positive psychological functions as
in the case of a speaker slowing down an overly fast talking other by adopting a very
slow, measured rate or in the case of a male diverging from a romantic female acquain-
tance’s elevated pitch and expressed femininity, with a deeper pitch and in the process
accentuating their mutual attraction through a phenomenon known as “speech comple-
mentarity.” Thus, divergence or compensation need not be negatively valenced. Further-
more, calibrating the amount of perceived non-, under-, and over-accommodations one
receives can be an important ingredient in continuing or withdrawing from an interac-
tion and making decisions about anticipated future ones.
CAT now has a forty-year history, has been revised and elaborated frequently (see Drago-
jevic, Gasiorek, & Giles, in press), and many of its propositions have received empirical
support across an array of diverse languages and cultures, electronic media (for a statis-
tical meta-analysis of CAT studies, see Soliz & Giles, 2014), and even among different
nonhuman species (e.g., Candiotti, Zuberbühler, & Lemasson, 2012). There has been
a recent focus on unpacking different processes of nonaccommodation (see Giles &
Gasiorek, 2013) as well as the neural and biochemical underpinnings of accommodative
practices (Giles & Soliz, 2015). For instance, given recent work on the neuropsychol-
ogy of intergroup behavior (e.g., Fiske, 2012), would interpersonal accommodations
and adjustments lend themselves to a neural signature of medial prefrontal cortex activ-
ity, while seeing valued peers? Conversely, would divergence away from members of
disdained groups lead to neural activity in areas of the brain associated with reward pro-
cessing, such as in the ventral striatum? Relatedly, would the adverse affective reactions
to being a recipient of nonaccommodation be associated with, or be the precursor to,
neural activity in the anterior cingulate cortex, a region associated with pain and pun-
ishment? Finally, from a more evolutionary perspective (see Reid et al., 2012), could
divergence be predicated in part on individuals’ levels of pathogen-disgust or the sur-
vival value trait of avoiding disease and infection risks (see Reid et al., 2012)?
Needless to say and as ever, much still needs to be achieved. Although the theory’s
capacity to pay homage to linguistic specifications is of course limited – it emerged after
all from social psychology – its prospects for helping us understand, both theoretically
and pragmatically, communicative phenomena and processes in a wide range of applied
contexts is exciting (Giles et al., 1991).

Interaction Adaptation Theory


Interaction adaptation theory (IAT; Burgoon et al., 1995) grew out of a desire to rec-
oncile these various models and theories of interaction adaptation while also produc-
ing a theory with broader communication scope than its predecessors. The theory
incorporates both biological principles (e.g., compensatory arousal-driven reactions)
and social principles (e.g., reciprocity) and builds upon the scaffolding of expectancy
violations theory (EVT; Burgoon, 1983).
The theory, like EVT, SFM, and CAT, recognizes a number of pre-interactional fac-
tors that set the stage for interaction. The three central classes of features are require-
ments (R), expectations (E), and desires (D). Requirements refer to biologically based
factors such as protection and sustenance that must be satisfied and override other con-
siderations. A person who is hungry, tired, or fearful will behave according to those
needs rather than adapting to a partner’s communication. Expectations are the antici-
pated communication displays by self and partner given the characteristics of the actors,
their relationship, and the context. Female friends in an informal setting will expect
a moderately intimate interaction pattern (e.g., close proximity, frequent eye contact,
smiling). Desires refer to what the actors want out of the interaction. Friends may desire
a friendly chat; a patient may want respectful and empathic listening from a physi-
cian. These classes of RED factors combine to determine the projected starting point
or interaction position (IP) that people take vis-à-vis one another. Whether their ensu-
ing interaction is reciprocal or compensatory depends on the actual (A) communication
the partner adopts. If the A is more desirable than the IP, an actor is predicted to reciprocate
the A; if the A is less desirable than the IP, the actor is predicted to compensate. To use
a concrete example, if a friend is expected to engage in a warm and friendly interaction
but is instead stand-off-ish, the A is less desirable than the IP and the actor is predicted
to compensate by becoming even warmer and friendlier. Alternatively, if the friend is
even more expressive and happy than expected, the actor is predicted to reciprocate the
good mood.
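IAT itself is stated verbally, but its central comparison can be illustrated with a toy sketch that assumes the interaction position (IP) and the partner’s actual (A) behavior have been scored on a single hypothetical desirability scale; the numeric scale and the function name are illustrative, not part of the theory’s formal statement.

```python
# A toy sketch of IAT's central comparison, assuming IP and A have been
# scored on a single hypothetical desirability scale. Names and values
# are illustrative; the theory itself is stated verbally.

def predicted_response(ip_desirability, actual_desirability):
    """Predict reciprocity or compensation from the IP versus A comparison."""
    if actual_desirability > ip_desirability:
        # A is more desirable than the interaction position: match it.
        return "reciprocate"
    if actual_desirability < ip_desirability:
        # A falls short of the interaction position: push back toward it.
        return "compensate"
    return "maintain"

# Example from the text: a friend is expected to be warm (IP = 0.7) but is
# stand-off-ish (A = 0.2), so the actor is predicted to compensate by
# becoming even warmer and friendlier.
print(predicted_response(0.7, 0.2))  # "compensate"
```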
There are many additional elements to the theory, including the hierarchy of the RED
factors, and factors such as social skills of actors that can alter patterns (see Burgoon,
Dunbar, & White, 2014; Burgoon & White, 1997, for further elaborations), but the
overriding points of the theory are that interaction adaptation is a complex process and
that both compensatory and reciprocal patterns can occur simultaneously or serially on
different behaviors. Any attempt to analyze adaptation processes must take into account
the actor, relationship, and contextual forces in play at the point of observation and
recognize that interaction adaptation is a necessarily dynamic process that will show
changes across the time scape.

Current State-of-the-Art and Main Trends

Culture and Communication


Social groups, such as adolescents, police officers, and ethnic groups, often have their
own distinctive cultures that include specialized foods, customs and rituals, literature,
dance and music, while other intergroup situations (e.g., artificially constructed groups)
constitute social categories that cannot claim such rich cultural artifacts. Importantly,
communication practices of the ilk caricatured above are the basis of what is meant by
a “culture” (Conway & Schaller, 2007).
Intercultural communication has been studied for well over fifty years and has devel-
oped to focus on how different cultures are distinguished from each other through
their management of communicative behaviors, such as personal space and gestures.
Particular attention has been devoted to articulating the cultural values that underpin
these different communicative practices, including individualism-collectivism and low–
high contexts (Gallois et al., 1995; Watson, 2012), and what ingredients of intercul-
tural communication competence are involved. Wiseman (2002) detailed these in a way
that embraced a skills training perspective. Its underlying premises are that individuals must have knowledge of the culture with which they engage, the motivation to communicate effectively (including intercultural sensitivity and empathy), and the appropriate communication skills. A mainstream concern in this literature is
how immigrants adapt to the dialectical pulls and pushes of preserving their heritage
communicative habits while acquiring those of a host community.
One challenge for the future is that intercultural communication theory does not really explain when misunderstandings and mis-coordinations may, in some cases, be inevitable regardless of individuals' skills and cultural knowledge.
Socio-psychological theories that emphasize the intergroup nature of intercultural com-
munication (with its focus on stereotypes, prejudice, ingroups, and outgroups), rather
than rely only on its interpersonal parameters, may be fruitfully applied to understand
when such misattributions, and even conflict, arise (Giles, 2012; Giles & Watson, 2008).
The challenge is to bring together the disparate theoretical viewpoints of intercultural and intergroup communication (whose scholars, in turn, neglect the important dynamics of culture). The further value in coalescing these approaches lies in going well beyond the typically studied national and ethnic groups to embrace an array
of different cultural categories, including older people, homosexuals, bisexuals, aca-
demics from different disciplines, and so forth, as well as those embedded in different
religious and organizational cultures.

Deception and Synchrony


One possible application of behavioral synchrony is the detection of deception. Burgoon et al. (1995) argue that we are naturally inclined toward synchrony or mutual adaptation, but we posit that this process will be hindered somewhat when one person introduces deceit. Guilty suspects have an incentive to coop-
erate and try to point the interviewer toward another suspect and might attempt to main-
tain the rapport that has been established by the interviewer (Vrij et al., 2007). Truth-
tellers may not maintain synchrony if they are surprised or offended by the accusation,
so there might be a greater detriment to nonverbal synchrony for truth-tellers than liars,
especially if deceivers are highly skilled and can use the rapport established to appear
innocent (Dunbar et al., 2014). Research of this genre suggests that highly skilled liars
are in fact quite different from unskilled liars because they both report and display less
cognitive effort than the less skilled (Dunbar et al., 2013). In two separate analyses com-
paring liars that were either sanctioned by the experimenter or chose to lie on their own,
those who chose to lie (and were presumably more skilled) were more difficult to detect
than those who were told to lie by the experimenter, both using automated detection of
synchrony (Yu et al., 2015) and manual coding of behavioral cues (Dunbar et al., 2014).

Automated Tools for Detecting Adaptation


This chapter should make abundantly clear that interpersonal interaction is fraught with
various patterns of adaptation and that analyzing any social signal or collection of sig-
nals in its midst poses significant challenges. Until recently, manual systems for behav-
ioral observation were the primary tools for detecting and tracking individual behav-
iors, and dyadic interaction often defied analysis. However, the explosion of research
into automated identification and tracking of nonverbal behaviors now makes possi-
ble the discovery of very subtle and transitory patterns of adaptation. An illustration
is the analysis conducted by Dunbar et al. (2014) using computer vision analysis to
analyze interactional synchrony between interviewers and their truthful or deceptive
interviewees. Using techniques that create bounding boxes and ellipses around each
person’s head and hands, gross postural and gestural movements can be identified and
changes and velocities can be tracked frame-by-frame. Separate techniques that locate
landmarks on the face can track temporal changes and combine features to identify
specific expressions. Time series analyses can then find points of synchrony between
each person’s behaviors and calculate the degree of interactional synchrony that exists
within each dyad. Similar kinds of analyses can be applied to other nonverbal sig-
nals such as the voice. These techniques, which are the focus of the remaining sec-
tions of this volume, promise to revolutionize the analysis of nonverbal behavior and
to uncover heretofore undetected interrelationships between interactants during social
exchanges.
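As a rough illustration of the time-series step described above, the sketch below computes a windowed, lagged cross-correlation between two per-frame movement signals (e.g., head displacement derived from bounding boxes). It is a generic, hypothetical example in Python, not the specific pipeline of Dunbar et al. (2014) or Yu et al. (2015); the window and lag parameters are assumptions.

```python
# Minimal sketch of quantifying interactional synchrony from two per-frame
# movement signals (one per interactant). Illustrative only; window and lag
# values are assumptions, not values reported in the studies cited above.
import numpy as np

def windowed_synchrony(sig_a: np.ndarray,
                       sig_b: np.ndarray,
                       window: int = 150,      # e.g., ~5 s at 30 fps
                       max_lag: int = 30) -> float:
    """Mean of the peak lagged cross-correlations over successive windows."""
    peaks = []
    for start in range(0, min(len(sig_a), len(sig_b)) - window, window):
        a = sig_a[start:start + window]
        b = sig_b[start:start + window]
        a = (a - a.mean()) / (a.std() + 1e-9)   # z-score each window
        b = (b - b.mean()) / (b.std() + 1e-9)
        # correlation of a against b shifted by each lag in [-max_lag, max_lag]
        corrs = [np.mean(a[max(0, -lag):window - max(0, lag)] *
                         b[max(0, lag):window - max(0, -lag)])
                 for lag in range(-max_lag, max_lag + 1)]
        peaks.append(max(corrs))
    return float(np.mean(peaks)) if peaks else float("nan")

# Example with synthetic signals: partner B loosely mirrors A after a short delay.
rng = np.random.default_rng(0)
a = rng.normal(size=3000)
b = np.roll(a, 15) + 0.5 * rng.normal(size=3000)
print(windowed_synchrony(a, b))
```

In an actual study, the two signals would come from the vision pipeline just described, and the resulting synchrony scores could then be compared across, for example, truthful and deceptive dyads.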

Acknowledgments

Preparation of this chapter was supported in part by funding from the National Science
Foundation (Grants #0725895 and #1068026). The views, opinions, and/or findings in
this chapter are those of the authors and should not be construed as an official US
government position, policy, or decision.

References

Andersen, P. A. (1998). The cognitive valence theory of intimate communication. In M. T. Palmer & G. A. Barnett (Eds), Mutual Influence in Interpersonal Communication: Theory and
Research in Cognition, Affect, and Behavior (pp. 39–72). Stamford, CT: Greenwood.
Andersen, P. A. (1999). Building and sustaining personal relationships: A cognitive valence expla-
nation. In L. K. Guerrero, J. A. DeVito, & M. L. Hecht (Eds), The Nonverbal Communication
Reader (pp. 511–520). Lone Grove, IL: Waveland Press.
Andersen, P. A. & Andersen, J. F. (1984). The exchange of nonverbal intimacy: A critical review
of dyadic models. Journal of Nonverbal Behavior, 8(4), 327–349.
Andersen, P. A., Guerrero, L. K., Buller, D. B., & Jorgensen, P. F. (1998). An empirical comparison
of three theories of nonverbal immediacy exchange. Human Communication Research, 24(4),
501–535.
Argyle, M. & Dean, J. (1965). Eye-contact, distance and affiliation. Sociometry, 28, 289–304.
Bartholomew, K. & Horowitz, L. M. (1991). Attachment styles among young adults: A test of a
four-category model. Journal of Personality and Social Psychology, 61(2), 226–244.
Bavelas, J. B., Black, A., Chovil, N., Lemery, C. R., & Mullett, J. (1988). Form and function
in motor mimicry: Topographic evidence that the primary function is communicative. Human
Communication Research, 14, 275–299.
Bernieri, F. J. & Rosenthal, R. (1991). Interpersonal coordination: Behavior matching and inter-
actional synchrony. In R. S. Feldman & B. Rimé (Eds), Fundamentals of Nonverbal Behavior
(pp. 401–432). Cambridge: Cambridge University Press.
Bowlby, J. (1980). Attachment and Loss. New York: Basic Books.
Bowlby, J. (1982). Attachment and loss: Retrospect and prospect. American Journal of Orthopsy-
chiatry, 52(4), 664–678.
Bullowa, M. (1975). When infant and adult communicate: How do they synchronize their behav-
iors? In A. Kendon, R. M. Harris, & M. R. Key (Eds), Organization of Behavior in Face-to-
Face Interaction (pp. 95–129). The Hague: Mouton.
Burgoon, J. K. (1983). Nonverbal violations of expectations. In J. Wiemann & R. Harrison
(Eds), Nonverbal Interaction. Volume 11: Sage Annual Reviews of Communication (pp. 11–77).
Beverly Hills, CA: SAGE.
Burgoon, J. K., Dillman, L., & Stern, L. A. (1993). Adaptation in dyadic interaction: Defining and
operationalizing patterns of reciprocity and compensation. Communication Theory, 3, 196–
215.
Burgoon, J. K., Dunbar, N. E., & White, C. (2014). Interpersonal adaptation. In C. R. Berger
(Ed.), Interpersonal Communication (pp. 225–248). Berlin: De Gruyter Mouton.
Burgoon, J. K., Ebesu, A., White, C., et al. (1998). The many faces of interaction adaptation. In
M. T. Palmer & G. A. Barnett (Eds), Progress in Communication Sciences (vol. 14, pp. 191–
220). Stamford, CT: Ablex.
Burgoon, J. K. & Saine, T. J. (1978). The Unspoken Dialogue. Boston: Houghton-Mifflin.
Burgoon, J. K., Stern, L. A., & Dillman, L. (1995). Interpersonal Adaptation: Dyadic Interaction
Patterns. New York: Cambridge University Press.
Burgoon, J. K. & White, C. H. (1997). Researching nonverbal message production: A view from
interaction adaptation theory. In J. O. Greene (Ed.), Message Production: Advances in Commu-
nication Theory (pp. 279–312). Mahwah, NJ: Lawrence Erlbaum.
Candiotti, A., Zuberbühler, K., & Lemasson, A. (2012). Convergence and divergence in Diana
monkey vocalizations. Biology Letters, 8, 282–285.
Cappella, J. N. (1984). The relevance of microstructure of interaction to relationship change.
Journal of Social and Personal Relationships, 1, 239–264.
Cappella, J. N. & Greene, J. O. (1982). A discrepancy-arousal explanation of mutual influence in
expressive behavior for adult and infant–adult interaction. Communication Monographs, 49(2),
89–114.
Cappella, J. N. & Greene, J. O. (1984). The effects of distance and individual differences in arous-
ability on nonverbal involvement: A test of discrepancy-arousal theory. Journal of Nonverbal
Behavior, 8(4), 259–286.
Chapple, E. (1982). Movement and sound: The musical language of body rhythms in interaction.
In M. Davis (Ed.), Interaction Rhythms: Periodicity in Communicative Behavior (pp. 31–51).
New York: Human Sciences.
Chartrand, T. L. & Bargh, J. A. (1999). The chameleon effect: The perception–behavior link and
social interaction. Journal of Personality and Social Psychology, 76, 893–910.
Condon, W. S. & Ogston, W. D. (1971). Speech and body motion synchrony of the speaker-hearer.
In D. L. Horton & J. J. Jenkins (Eds), Perception of Language (pp. 150–173). Columbus, OH:
Merrell.
Conway, L. G., III & Schaller, M. (2007). How communication shapes culture. In K. Fielder (Ed.),
Social Communication (pp. 104–127). New York: Psychology Press.
Coutts, L. M. & Schneider, F. W. (1976). Affiliative conflict theory: An investigation of the inti-
macy equilibrium and compensation hypothesis. Journal of Personality and Social Psychology,
34(6), 1135–1142.
Dragojevic, M. & Giles, H. (2014). Language and interpersonal communication: Their intergroup
dynamics. In C. R. Berger (Ed.), Interpersonal Communication (pp. 29–51). Berlin: De Gruyter
Mouton.
Dunbar, N. E., Altieri, N., Jensen, M. L., & Wenger, M. J. (2013). The viability of EEG as a
method of deception detection. Paper presented at The 46th Hawaiian International Conference
on System Sciences, Maui, HI.
Dunbar, N. E., Jensen, M. L., Tower, D. C., & Burgoon, J. K. (2014). Synchronization of nonverbal
behaviors in detecting mediated and non-mediated deception. Journal of Nonverbal Behavior,
38(3), 355–376.
Edinger, J. A. & Patterson, M. L. (1983). Nonverbal involvement and social control. Psychological
Bulletin, 93(1), 30–56.
Fiske, S. T. (2012). Journey to the edges: Social structures and neural maps of inter-group pro-
cesses. British Journal of Social Psychology, 51, 1–12.
Gallois, C., Giles, H., Jones, E., Cargile, A., & Ota, H. (1995). Accommodating intercultural
encounters: Elaborations and extensions. In R. L. Wiseman (Ed.), Intercultural Communication
Theory (vol. 19, pp. 115–147). Thousand Oaks, CA: SAGE.
Giles, H. (Ed.) (2012). The Handbook of Intergroup Communication. New York: Routledge.
Giles, H. (Ed.) (2016). Communication Accommodation Theory: Negotiating Personal
Relationships and Social Identities across Contexts. Cambridge: Cambridge University
Press.
Giles, H., Bourhis, R. Y., & Taylor, D. M. (1977). Towards a theory of language in ethnic
group relations. In H. Giles (Ed.), Language, Ethnicity and Intergroup Relations (pp. 307–
348). London: Academic Press.
Giles, H., Coupland, N., & Coupland, J. (1991). Accommodation theory: Communication, con-
text, and consequence. In H. Giles, J. Coupland & N. Coupland (Eds), The Contexts of Accom-
modation: Developments in Applied Sociolinguistics (pp. 1–68). Cambridge: Cambridge Uni-
versity Press.
Giles, H. & Gasiorek, J. (2013). Parameters of non-accommodation: Refining and elaborating
communication accommodation theory. In J. Forgas, J. László, & O. Vincze (Eds),
Social Cognition and Communication (pp. 155–172). New York: Psychology Press.
Giles, H. & Giles, J. L. (2012). Ingroups and outgroups communicating. In A. Kurylo (Ed.),
Inter/Cultural Communication: Representation and Construction of Culture in Everyday Inter-
action (pp. 141–162). Thousand Oaks, CA: SAGE.
Giles, H. & Soliz, J. (2015). Communication accommodation theory. In D. Braithewaite &
P. Schrodt (Eds), Engaging Theories in Interpersonal Communication (2nd edn). Thousand
Oaks, CA: SAGE.
Giles, H. & Street, R. L., Jr. (1994). Communicator characteristics and behaviour: A review,
generalizations, and model. In M. Knapp & G. Miller (Eds), The Handbook of Interpersonal
Communication (2nd edn, pp. 103–161). Beverly Hills, CA: SAGE.
Giles, H. & Watson, B. M. (2008). Intercultural and intergroup parameters of communication. In
W. Donsbach (Ed.), International Encyclopedia of Communication (vol. VI, pp. 2337–2348).
New York: Blackwell.
Gouldner, A. W. (1960). The norm of reciprocity: A preliminary statement. American Sociologi-
cal Review, 25, 161–178.
Kendon, A. (1970). Movement coordination in social interaction: Some examples described. Acta
Psychologica, 32, 101–125.
King, M. L., Jr. (1963). The Negro is your brother. The Atlantic Monthly, 212(August), 78–88.
McGlone, M. S. & Giles, H. (2011). Language and interpersonal communication. In M. L. Knapp
& J. A. Daly (Eds), Handbook of Interpersonal Communication (4th edn, pp. 201–237). Thou-
sand Oaks, CA: SAGE.
Mikulincer, M., Shaver, P. R., & Pereg, D. (2003). Attachment theory and affect regulation: The
dynamics, development, and cognitive consequences of attachment-related strategies. Motiva-
tion and Emotion, 27(2), 77–102.
Nyklíček, I., Vingerhoets, A., & Zeelenberg, M. (2011). Emotion regulation and well-being: A
view from different angles. In I. Nyklíček, A. Vingerhoets, & M. Zeelenberg (Eds), Emotion
Regulation and Well-being (pp. 1–9). New York: Springer Science + Business Media.
Patterson, M. L. (1982). A sequential functional model of nonverbal exchange. Psychological
Review, 89(3), 231–249.
Patterson, M. L. (1983). Nonverbal Behavior: A Functional Perspective. New York: Springer.
Reid, S. A., Zhang, J., Anderson, G. L., et al. (2012). Parasite primes make foreign-accented
English sound more distant to people who are disgusted by pathogens (but not by sex or moral-
ity). Evolution and Human Behavior, 33, 471–478.
Richards, D. A. & Hackett, R. D. (2012). Attachment and emotion regulation: Compensatory
interactions and leader–member exchange. The Leadership Quarterly, 23(4), 686–701.
Roth, S. & Cohen, L. J. (1986). Approach, avoidance, and coping with stress. American Psychol-
ogist, 41(7), 813–819.
Sokolov, E. N. (1963). Higher nervous functions: The orienting reflex. Annual Review of Physiol-
ogy, 25, 545–580.
Soliz, J. & Giles, H. (2014). Relational and identity processes in communication: A contextual
and meta-analytical review of communication accommodation theory. In E. Cohen (Ed.), Com-
munication Yearbook, 38 (pp. 107–144). Thousand Oaks, CA: SAGE.
Vrij, A., Mann, S., Kristen, S., & Fisher, R. P. (2007). Cues to deception and ability to detect lies
as a function of police interview styles. Law and Human Behavior, 31(5), 499–518.
Wiseman, R. L. (2002). Intercultural communication competence. In W. B. Gudykunst & B. Mody
(Eds), Handbook of International and Intercultural Communication (2nd edn, pp. 207–224).
Thousand Oaks, CA: SAGE.
Watson, B. M. (2012). Intercultural and cross-cultural communication. In A. Kurylo (Ed.), Inter/Cultural Communication (pp. 25–46). Thousand Oaks, CA: SAGE.
Woody, E. Z. & Szechtman, H. (2011). Adaptation to potential threat: The evolution, neurobiol-
ogy, and psychopathology of the security motivation system. Neuroscience and Biobehavioral
Reviews, 35, 1019–1033.
Yu, X., Zhang, S., Yan, Z. et al. (2015). Is interactional dissynchrony a clue to deception? Insights
from automated analysis of nonverbal visual cues. IEEE Transactions on Systems, Man, and
Cybernetics, 45, 506–520.
Zajonc, R. B. (1965). Social facilitation. Science, 149, 269–274.
9 Social Signals and Persuasion
William D. Crano and Jason T. Siegel

The pace of research devoted to the study of social and emotional intelligence has esca-
lated exponentially (Bar-On & Parker, 2000; Goleman, 2006; Mayer & Salovey, 1997;
Salovey & Mayer, 1990), and the upsurge in interest has intensified the need to under-
stand social signals, whose production and deciphering may be strongly affected by
these various intelligences (Gardner, 1983). Social signals have been conceptualized in
a variety of ways, but as social psychologists, we define a social signal as any variable
associated with a communicative act, excluding its written or spoken content, that car-
ries meaning. The signal may be intentionally or unintentionally encoded by a source,
and mindfully or mindlessly decoded by a receiver. This definition acknowledges and
allows for the smile or smirk that accompanies the expression, “You look good today,”
or the falling intonation of the word “today,” to carry more meaning than the actual
content of the declaration. By this definition, research on communication and persua-
sion from its inception has focused on understanding social signal effects. Research
designed to investigate the credibility of a source of a persuasive message, for example,
often relied upon extra-communication features (age, sex, dress, apparent success, etc.)
to signal the extent to which a receiver should ponder or trust the information provided
by the source. The speed of speech, attractiveness of the source, and the animation of
the speaker all have strong effects on persuasion. This is true even when the content of
the verbal or written communication remains constant across experimental conditions.
In experimental research, the social signal is considered from the vantage point of
the encoder, or in other terms, as an independent variable. As such, failures to obtain
differences attributable to signal variations often were counted as manipulation break-
downs. Less frequently, researchers have focused on the receivers of social signals, and
sometimes the interactive behaviors of both encoders and decoders are examined in
tandem. In one such study, college-age men were led to believe that a young woman
with whom they were to converse via telephone was either beautiful or plain (Snyder,
Tanke, & Berscheid, 1977). This instructional manipulation (Crano, Brewer, & Lac,
2014) strongly affected the behaviors of both of the interacting parties. Analyses of the
men’s expectations revealed strong differences as a function of the pictorial descrip-
tions. Even before the conversation began, men who believed they were to interact with
an “attractive” woman expected her to be more self-assured, outgoing, humorous, and
socially adept than did men paired with a purportedly unattractive partner. The pre-
dicted behavioral effects of these induced expectations, prompted by a simple photo-
graph, were confirmed by judges unaware of the study’s experimental manipulation.
The judges found that the women whose partners believed them to be attractive acted
more “sociable, poised, sexually warm, and outgoing” than their peers who had been
cast, unbeknownst to them, into the unattractive role (Snyder et al., 1977: 661). As Sny-
der and associates observed, “What had initially been reality in the minds of the men
had now become reality in the behavior of the women with whom they had interacted –
a behavioral reality discernible even by naive observer judges who had access only to
tape recordings of the women’s contributions to the conversation” (1977: 661). Obvi-
ously, the pictorial signal affected the men, whose extralinguistic behaviors affected the
behaviors of the women.
We believe that a more focused concentration on social signals and the role they
have played in research on human communication and persuasion can expand the reach
of this area of study. This view counsels a broadband exploration, and suggests that
research practices that unduly constrain the domain and definition of social signals are
ill-advised. For example, limiting research on social signaling to include only the affec-
tive features of the signal, a commonly investigated feature, is unnecessarily restrictive.
This is not to diminish research in which the affective aspect of a social signal is the
focus of study, but rather to suggest that considerably more information is conveyed
via signals other than mere affect, and we should be attuned to this evidence as well.
In the pages that follow, we present a necessarily abbreviated review of a limited set of
social signals that have been studied in research on communication and persuasion. In
this exposition, social signals that are manipulated as independent variables are empha-
sized to provide examples of the phenomena that may be usefully studied within the
framework of this conceptualization, and that lend themselves to more technologically
oriented analysis by computer scientists and engineers attempting to automate and inter-
pret nonverbal or extralinguistic behavior beyond content variations.

Social Signals Manipulated in Research on Communication and Persuasion: A Partial List of Message Source Variations

Study of the effects of message source characteristics on persuasion has occupied researchers
in communication and persuasion from these fields’ earliest days. In this chapter, we
consider four source characteristics that have been investigated extensively: credibility
and attractiveness, gestures, speed of speech, and vocal tone. Technological innova-
tions have had varying degrees of success at interpreting these signal sources, which are
inherently human social signal behaviors.

Source Credibility and Attractiveness


Source credibility, typically viewed as a confluence of a source’s expertise (its capac-
ity to provide valid information) and trustworthiness (its truthfulness, without consid-
erations of gain), was a central feature of Hovland, Janis, and Kelley’s (1953) classic
work on communication and persuasion. Their research suggested that merely imply-
ing differences in these factors was sufficient to affect social influence, and later
research confirmed and extended these expectations (e.g., Kumkale, Albarracín, &
Seignourel, 2010; Mills & Jellison, 1967; Rhine & Severance, 1970). In a classic
and informative study, Hovland and Weiss (1951) found that identical communications
attributed to sources of different levels of credibility (as judged earlier by the research
participants) had strongly different effects. Those who read a communication regarding
“the feasibility of nuclear submarines” attributed to J. Robert Oppenheimer (a credible
source) were significantly more persuaded than were those who received an identical
communication attributed to a non-credible source, Pravda, the leading newspaper in
Russia at the time, and the central organ of the Communist Party (it is well to remember
that this experiment was conducted when the Cold War was at full boil). Although this
effect did not persist beyond four weeks, later research by Kelman and Hovland (1953)
found that merely reinstating the source before a three-week delayed measure was suf-
ficient to reestablish the initial differences found as a result of credibility variations.
Factors that influence respondents' inferences of credibility can vary widely, and even subtle features of a source can affect its perceived credibility and consequent persuasiveness.
Attractiveness, for example, has been shown to affect a source’s persuasive power. In
a study of attractiveness on persuasion, Chaiken (1979) recruited sixty-eight student-
communicators, equally divided between men and women. They delivered a common
persuasive message to individual students randomly selected on a college campus. The
communicators were not aware that they had been rated on physical attractiveness by
a group of independent judges. Chaiken’s results indicated that the communicators of
either sex who had been judged attractive elicited greater agreement with their persua-
sive message (eliminating meat from meals at the university’s dining halls) than those
who were rated as unattractive.1 Moreover, participants were more willing to sign a petition consistent with the persuasive message if it had been delivered by the attractive (vs unattractive) communicator.

1 The judged attractiveness differences of the communicators were replicated by the subjects in the experiment, who rated their communicators after each had delivered his or her pitch.
In an earlier study of the attractiveness signal, Lefkowitz, Blake, and Mouton (1955)
reported results consistent with Chaiken’s. They found that when a simple rule viola-
tion (e.g., crossing against a red light) was performed by a “high status person with
a freshly pressed suit, [and] shined shoes,” more imitative counternormative behavior
was induced than when “a person in well-worn scuffed shoes, soiled patched trousers”
performed the same rule violation (1955: 704). These effects may prove short-lived,
however, as attractiveness may not be enough to sustain attitude formation or change
(Amos, Holmes, & Strutton, 2008; Kelman, 1958). The persistence of change induced
by signal variations deserves considerable attention in future research.
Indirect effects attributable to attractiveness differences also have been reported. In
research by Hebl and Mannix (2003), naive subjects played the role of personnel man-
agers who were to make hiring decisions. Before being assigned this role, each subject
interacted with a male and female student (both confederates of the experimenter) in the
waiting room of the research laboratory. The female confederate was either of average
weight, or was made up to appear obese. Exchanging pleasantries in the waiting room,

she subtly informed the naive participant that she was either the girlfriend of the male
confederate or had no relationship with him. This manipulation of relationship close-
ness between the male and female confederates had no influence on naive participants’
hiring decisions; however, consistent with earlier research, the weight of the job can-
didate’s female associate was significantly and negatively related to participants’ hiring
decisions – even if the confederates’ simultaneous arrival at the laboratory was depicted
as a function of chance, as they had never met before their encounter in the waiting
room.
In the marketing literature, where research on attractiveness is common, Praxmarer
(2011: 839) observed that attractiveness “affects persuasion positively regardless of
whether the presenter and receiver are of the same or the opposite sex and regardless of
whether receivers are characterized by low or high product involvement.” A fair sum-
mary statement of the literature on source features in persuasion is that positively valued
source qualities (e.g., attractiveness, status, dress) affect compliance or message accep-
tance even if these qualities are irrelevant to the content of the message (Crano, 1970).

Gesture
Gesture, one of the most obvious social signals, is a uniquely human and fundamental
accompaniment to speech (Wagner, Malisz, & Kopp, 2014). Gestures emphasize and
sometimes reverse the meaning of verbal content and often reveal the attitudes and
values of the speaker (McNeill, 1992, 2000). Gesture has been investigated since the
early days of psychology by some of its major figures. George Herbert Mead spent more
than a year with Wilhelm Wundt studying gesture, and the results of their collaboration
are evident in both men’s work. Wundt (1911) considered gesture the exteriorization
of interior thoughts and emotions,2 and Mead (1934) found gestures an unavoidable
feature of the mind: “Mind arises through communication by a conversation of gestures
in a social process or context of experience – not communication through mind” (1934:
50). More recently, Streeck (2009) argued that the evolution of the hand, as an organ of
communication, was a necessary antecedent to the evolution of the mind.

2 The association of these great theorists' conceptions of a century ago and current research on psychological embeddedness is striking.
Categorizing gestures remains a herculean task. Most research on gestures has
focused on the hands (Wagner et al., 2014: p. 224, table 2) or on the functional com-
municative nature of the head and upper body, including arms and torso. Given the
range of possible sources of gestural information, the mere classification and categoriza-
tion of gestures remains challenging. Even so, there is abundant research on gestures,
admittedly constrained to specific features, but which nonetheless provides provoca-
tive insights into their nature and importance in communication and persuasion. In their
study of hand gestures and persuasion, for example, Maricchiolo et al. (2009) systemati-
cally scripted the hand gestures of a filmed model who delivered a communication argu-
ing for a tuition increase at her university. Four different gesture types were manipulated
and compared with a no-gesture control condition. In each case, the counterattitudinal
pro-tuition argument delivered with gestures was more persuasive than the no-gesture
control presentation.
Research also indicates that the enactment of a gesture associated with power or high
status may affect receivers’ (or audience) responses to pro- or counterattitudinal infor-
mation (Fischer et al., 2011). To manipulate the embodiment of power, the participants
in Fischer and colleagues’ study were instructed to maintain a clenched fist in their non-
writing hands throughout an experiment. Others, who served as control participants, did
not do so. All participants read a typical business school case study and made a decision
regarding its resolution. After making their individual decisions, those in the embodied
power condition proved more resistant to counterargument, and more likely to accept
confirmatory information than those in the control condition. These results were repli-
cated in experimental conditions in which open or closed postural positions were used
to embody power (see Carney, Cuddy, & Yap, 2010).3 Research discussed earlier sug-
gested the social signals emitted by a source of a communication could have a powerful
impact on a receiver or listener. Carney and associates’ results indicate that physically
embodying a postural variation associated with social or physical power may engen-
der resistance on the part of receivers. These results extend Patterson’s view (1982,
2001) that gestures affect listeners’ evaluations of speakers’ attitudes and intentions.
Apparently, even adopting the gestural features of a type of individual (e.g., young or
old, strong or weak, etc.) may affect the receiver’s acceptance or rejection of a persua-
sive communication (see Bargh, Chen, & Burrows, 1996). These responses, Patterson
argues, often occur below conscious awareness.

3 Open positions are associated with greater power.
Burgoon, Birk, and Pfau (1990) have shown that gestural and facial expressive differ-
ences can affect the perceived credibility of a message source and its consequent persua-
siveness. Research on dual process models of attitude change (e.g., Petty & Cacioppo,
1986) suggests that these differences should be most apparent in circumstances in which
the receivers’ vested interest in the issue is low rather than high (Crano, 1995). In situ-
ations involving high personal stake, the content of the source’s information would be
expected to weigh more heavily than the gestures accompanying the content on listen-
ers’ evaluations (Jackob, Roessing, & Petersen, 2011), though of course, gestures may
affect the credibility and acceptance of the message content.

Speed of Speech
The speed with which persuasive content is presented may affect its impact. With rare exceptions (e.g., Chattopadhyay et al., 2003), speed of speech (usually engineered through speech compression) is studied as a dichotomous variable (i.e., fast vs conversational), and the effects of variations in speech rate on persuasion, or on variables theoretically associated with persuasion (e.g., credibility, status), are the usual foci of investigation. Although not invariably supported (e.g., Vann, Rogers, & Penrod, 1970; Wheeless, 1971), the early consensus was that faster speech rates resulted in greater persuasion and more favorable speaker evaluations. Those using more rapid speech
were perceived as more knowledgeable, energetic, and enthusiastic (MacLachlan, 1982;
Miller et al., 1976). In their useful study, LaBarbera and MacLachlan (1979) randomly
assigned participants to listen to six different radio commercials for varying products
(e.g., McDonald’s hamburgers, Sports Illustrated, etc.). For some participants, the com-
mercials were played at normal speeds; for the others, speech was compressed so that the
adverts played 30 percent faster than normal. All participants rated the adverts immedi-
ately after exposure. Three of the six adverts in the compressed speech conditions were
judged significantly more interesting than the same adverts played at normal speed.
Interest rating of the remaining adverts favored the compressed adverts over normal
speech, but the differences were not statistically significant. Two hours after exposure,
participants were tested on brand recall. Analysis indicated that brand name recall was
higher for two of the six speeded adverts. Further, analysis of an open-ended advert
recall question (“Do you remember hearing a commercial for [brand name] product? If
so, please write down everything you can recall about that commercial”) revealed sig-
nificant differences for three of the six adverts, which favored the compressed speech
condition. In cases where differences were not statistically significant, results trended
toward the superiority of the compressed advert.
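The original radio adverts were compressed with the analog equipment of the time; purely as an illustration, the sketch below shows how a comparable "30 percent faster" stimulus might be generated today with the librosa and soundfile Python libraries (file names are placeholders), using phase-vocoder time stretching so that duration changes while pitch is preserved.

```python
# A hedged sketch (not the procedure used in the original studies) of producing
# a time-compressed speech stimulus. Assumes librosa and soundfile are available;
# file names are placeholders.
import librosa
import soundfile as sf

def compress_speech(in_path: str, out_path: str, rate: float = 1.3) -> None:
    """Time-compress a recording by `rate` (1.3 = 30% faster) without shifting pitch."""
    y, sr = librosa.load(in_path, sr=None)               # keep the original sample rate
    y_fast = librosa.effects.time_stretch(y, rate=rate)  # phase-vocoder time stretching
    sf.write(out_path, y_fast, sr)

compress_speech("advert_normal.wav", "advert_compressed.wav", rate=1.3)
```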
Early research findings of this type suggested a simple cause–effect relation in which
more rapid speech enhanced message persuasiveness. Consistent with this observa-
tion and based on their research findings, Miller and colleagues (1976: 621) observed
that “one might be inclined to assert confirmation of a new law: ‘Beware of the fast
talker.”’ Alas, this level of certainty was not to endure. Later multivariate investiga-
tions revealed that moderators had important effects on the relation between speech rate
and persuasion. For example, Smith and Shaffer (1991) found different effects for com-
pressed speech on attitude change and message elaboration depending on the pro- or
counterattitudinal nature of the communication. Counterattitudinal communications
were more persuasive under compressed speech conditions, whereas pro-attitudinal
communications had greater impact when presented at a normal speed. In a later study,
these same authors found that rate of speech mattered when the topic was of moderate
importance to participants, but not when the topic was of high importance (Smith &
Shaffer, 1995). As topic importance is an indicator of self-relevance or vested interest
(Crano, 1995), a variable that has been strongly linked to attitude strength, this result
suggests that speed of speech may affect attitudes that are not strongly held, but not
those that respondents deem as highly self-relevant (Johnson, Siegel, & Crano, 2014;
Lehman & Crano, 2002). Consistent with the implications of the vested-interest orienta-
tion, Herbst et al. (2011) discovered that fast speech was detrimental to persuasion when
the central content of the communication involved disclaimers. Compressed speech in
this case apparently sensitized listeners to the possibility of deceptive intent on the part
of the “fast talker.” This proposition suggests that the trustworthiness of the brand under
consideration would play a variable role in participants’ responses to compressed and
normal speech, and it did. For trustworthy brands, compressed speech had neither a
persuasive advantage nor disadvantage, but when the brand was deemed untrustworthy,
compressed speech resulted in decreased persuasiveness relative to normal speech. This
result implies that speed sometimes may incite perceptions of malicious or deceptive
intent when trust in the message source is not high, and hence, is not unequivocally
useful in inducing message acceptance.
Miller et al. (1976) suggested that increased speech rate might require more effort on
the part of listeners, who would then be prone to overvalue the message to (self-)justify
their efforts. It is sensible to work hard to understand an important communication, but
not an unimportant one. Thus, communications rendered more difficult by compressed
speech may be overvalued relative to identical messages conveyed at normal conversa-
tional levels. Outlays of effort appear to motivate receivers to justify their expenditure;
in this instance, the “effort justification” effect would have inclined Miller and col-
leagues’ participants to agree with the message (Festinger, 1957; Gerard & Mathewson,
1966; Norton, Mochon, & Ariely, 2012).
Consensus regarding the association of compressed speech with message persuasive-
ness has yet to emerge. However, a dual process orientation may hold the key to integrat-
ing this literature. In dual process models of persuasion, attitude change is a function
of the elaboration of messages. Central processing, the thoughtful consideration of the
content of a communication, is said to occur in contexts involving issues of high self-
relevance to the decoder. In this case, the strength of the message is critical (Petty &
Cacioppo, 1986; Petty, Wheeler, & Tormala, 2013). Thus, communications on issues
that are important and self-relevant are likely to be scrutinized by the listener more
closely than those that are not. Under conditions of high relevance, message quality mat-
ters, whereas speed of speech may not. When the issue is not relevant to the receiver’s
perceived well-being, speed of speech may be used as a cue to competence or decep-
tion, depending on the context. Factors that suggest malfeasance will retard acceptance
of rapid speech; factors that suggest acceptable motives on the part of the speaker may
enhance it.

Vocal Tone
Examination of the persuasive influence of a speaker or message would be incomplete
without consideration of vocal tone (e.g., tone of voice, pitch variation, nasality). As
suggested, falling intonation of one word may change the meaning of an entire state-
ment. Thus, ignoring the vocal quality of a persuasive message risks overlooking an influential communication feature. Fortunately, years of scholarship offers guidance for social
signal researchers seeking to understand this component of persuasion.
Van der Vaart et al.’s (2006) program of research investigated the role of varied vocal
components on potential respondents’ willingness to agree to take part in a telephone
survey. The researchers reasoned that decisions made about engagement in phone sur-
veys were made with such limited information that peripheral factors such as the inter-
viewer’s vocal tone might play a critical role in potential respondents’ decisions to
accept or refuse the invitation to participate. Results of this study indicated that flu-
ency and loudness, as rated by human judges, were positively and significantly associ-
ated with cooperation rates. However, when the interview introductions were measured
acoustically via computer (Boersma & Weenink, 2000), voice characteristics appeared
to have no influence over cooperation rates. This result suggests that human judges may
be sensitive to features of vocal tone that were beyond the measurement capacity of
popular acoustic analysis programs. This suggests further that acoustic analytics should
be improved and validated against the perceptions of human judges.
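To illustrate the kind of automated measurement at issue, the sketch below extracts a few simple vocal-tone features (mean pitch, pitch variation, mean loudness) using the praat-parselmouth Python bindings to Praat (the program cited as Boersma & Weenink, 2000). It is a generic illustration with a placeholder file name, not the measurement procedure of Van der Vaart et al. (2006).

```python
# Hedged sketch of basic acoustic feature extraction, assuming the
# praat-parselmouth bindings to Praat. The file name is a placeholder.
import numpy as np
import parselmouth

def vocal_tone_features(wav_path: str) -> dict:
    snd = parselmouth.Sound(wav_path)

    pitch = snd.to_pitch()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                      # drop unvoiced frames (F0 == 0)

    intensity = snd.to_intensity()
    loudness_db = intensity.values.flatten()

    return {
        "mean_pitch_hz": float(np.mean(f0)) if f0.size else float("nan"),
        "pitch_variation_hz": float(np.std(f0)) if f0.size else float("nan"),
        "mean_loudness_db": float(np.mean(loudness_db)),
    }

print(vocal_tone_features("interview_intro.wav"))
```

Features of this kind could then be correlated with cooperation rates or compared against the ratings of human judges.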
In other research, Sharf and Lehman (1984) conducted a post hoc analysis of three
telephone interviewers with high success rates, and three with low success rates. More
successful interviewers had shorter pauses and used falling intonation contours. Oksen-
berg, Coleman, and Cannell (1986) reported that low refusal rates were most preva-
lent when requests were characterized by greater pitch variation, higher pitch, increased
loudness, a faster rate of speech, and a clearer and more distinct conversation style.
Highlighting the likelihood that main effects could be masking interactions, Benkí
et al. (2011) conducted a secondary analysis of phone interviews, which indicated that
pitch variation increased cooperation rates for female interviewers, but reduced rates for
male interviewers.
Beyond telephone cooperation rates, researchers have also assessed the influence of
vocal tone on speaker characteristics. Apple, Streeter, and Krauss (1979) mechanically
manipulated pitch of speakers’ voices and found high-pitched voices associated with
lower trustworthiness ratings, lower perceptions of speaker empathy, and greater per-
ceived nervousness. In another study, greater pitch variety was associated with higher
ratings of perceived competence, sociability, and character (Burgoon et al., 1990). More-
over, greater pitch variety was linked to increased persuasion. Jackob et al. (2011)
assigned participants to view a speech on globalization and, in addition to varying the
use of gestures, manipulated the presence or absence of vocal emphasis. Outcome mea-
sures included the perceived performance of the speaker (e.g., vividness, liveliness), per-
ceived characteristics of the speaker (e.g., credibility, competence), and perceived char-
acteristics of the arguments used (e.g., factual accuracy, thoroughness). Across eighteen
outcome measures, differences as a function of vocal emphasis emerged for five out-
comes: speakers using vocal emphasis were rated significantly more comprehensible,
self-assured, powerful, interesting, and likable than when no emphasis was used. There
were no outcomes for which the lack of vocal emphasis resulted in more favorable per-
suasion outcomes.
Research has revealed other influential components of vocal tone. For example,
Addington (1971) found that “throatiness” was associated with reduced perception of
credibility, noting, “There appears to be a reduction in credibility when speakers employ
nasality, tenseness, and denasality, and the impression (of credibility) is reduced further
when throatiness is simulated” (p. 247). Pittam (2001) investigated the role of nasality
in message persuasiveness by randomly assigning participants to listen to a persuasive message read by speakers using a nasal voice, a non-nasal voice, or simply trying to be as persuasive as possible. Analysis indicated that
the nasal-voiced presentations were associated with reduced persuasion.
Not every study supports vocal variation. Pearce and Conklin (1971) considered dif-
ferences in persuasive strength based on the dynamic nature of the speaker. Dynamic
speakers were characterized as having a larger range of inflections, more variety of rate
and pitch, greater volume, and higher pitch. The researchers used the same individual to
give that same speech, but in one rendition, the speaker used a “conversational” mode of
delivery, and in the other, a “dynamic” mode. Then, the taped speech was run through
a filter that “eliminated virtually all verbal intelligibility but retained the major vocal
characteristics of the speaker in relatively undistorted form” (Pearce & Conklin, 1971:
237). When using the conversational style of delivery, the speaker was evaluated more
positively and judged more trustworthy than when using the more dynamic delivery
approach. No direct measures of persuasiveness were reported in the research, so it is
not possible to know if variations in participants’ evaluations of the speakers translated
into greater or lesser attitude influence.

Some Tentative Conclusions/Observations

The possible social signals that may affect the persuasiveness of a message source are too numerous even to list here, much less survey with any degree of completeness. A more useful approach, perhaps, is to link representative findings in this area of research with firmly established results and to suggest possible extensions that might
be used to guide future research and practice. Results on social signals reviewed in this
chapter suggest that the persuasiveness of a written or verbal message may be the result
of message content, the social signals accompanying the content, or their interaction.
The factors that regulate either the effects of the signal or its interaction with message
content are numerous and not yet fully defined, but available theory can be used to derive
reasonable hypotheses of potential outcomes in many research contexts.
It is clear from the studies reviewed to this point that the content of a persuasive mes-
sage contains only part of the information transmitted in a communication. Researchers
can better understand the persuasion process by considering both the content and the
signals that accompany it. This is not to suggest that a focus on message content or
social signals alone cannot produce interesting and useful results. However, both of
these complex features of human communication have been shown to be an influential
part of the overall process. To gain a firmer grasp on factors that affect communication
and persuasion, future studies should examine message content and the accompanying sig-
nals in tandem, and build upon the theoretical and empirical progress that has occurred
over many years. Earlier research was not privy to the technological developments that
populate many of our best research laboratories, yet it produced a literature that has
powerfully contributed to our understanding. Imagine the progress that may be made
by combining the hard-won theoretical insights of past research with today’s technolog-
ical capabilities.

References

Addington, D. W. (1971). The effect of vocal variations on ratings of source credibility. Speech
Monographs, 38, 492–503.
Amos, C., Holmes, G., & Strutton, D. (2008). Exploring the relationship between celebrity
endorser effects and advertising effectiveness: A quantitative synthesis of effect size.
International Journal of Advertising: The Quarterly Review of Marketing Communications,
27, 209–234.
Apple, W., Streeter, L. A., & Krauss, R. M. (1979). Effects of pitch and speech rate on personal
attributions. Journal of Personality and Social Psychology, 37, 715–727.
Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of
trait construct and stereotype activation on action. Journal of Personality and Social Psychol-
ogy, 71, 230–244.
Bar-On, R. & Parker, J. D. A. (2000). The Handbook of Emotional Intelligence: Theory, Devel-
opment, Assessment, and Application at Home, School, and in the Workplace (1st edn). San
Francisco: Jossey-Bass.
Benkí, J. R., Broome, J., Conrad, F., Groves, R., & Kreuter, F. (2011). Effects of speech rate,
pitch, and pausing on survey participation decisions. Paper presented at the May 2011 AAPOR
meeting, Phoenix.
Boersma, P. & Weenink, D. (2000). Praat: Doing phonetics by computer. www.praat.org.
Burgoon, J. K., Birk, T., & Pfau, M. (1990). Nonverbal behaviors, persuasion, and credibility.
Human Communication Research, 17, 140–169.
Carney, D. R., Cuddy, A. J. C., & Yap, A. J. (2010). Power posing: Brief nonverbal displays affect
neuroendocrine levels and risk tolerance. Psychological Science, 21, 1363–1368.
Chaiken, S. (1979). Communicator physical attractiveness and persuasion. Journal of Personality
and Social Psychology, 37, 1387–1397.
Chattopadhyay, A., Dahl, D. W., Ritchie, R. B., & Shahin, K. N. (2003). Hearing voices: The
impact of announcer speech characteristics on consumer response to broadcast advertising.
Journal of Consumer Psychology, 13, 198–204.
Crano, W. D. (1970). Effects of sex, response order, and expertise in conformity: A dispositional
approach. Sociometry, 33, 239–252.
Crano, W. D. (1995). Attitude strength and vested interest. In R. E. Petty & J. A. Krosnick (Eds),
Attitude Strength: Antecedents and Consequences. (pp. 131–157). Hillsdale, NJ: Erlbaum.
Crano, W. D., Brewer, M. B., & Lac, A. (2014). Principles and Methods of Social Research
(3rd edn). New York: Psychology Press.
Festinger, L. (1957). A Theory of Cognitive Dissonance. Stanford, CA: Stanford University Press.
Fischer, J., Fischer, P., Englich, B., Aydin, N., & Frey, D. (2011). Empower my decisions: The
effects of power gestures on confirmatory information processing. Journal of Experimental
Social Psychology, 47, 1146–1154.
Gardner, H. (1983). Frames of Mind. New York: Basic Books.
Gerard, H. B. & Mathewson, G. C. (1966). The effect of severity of initiation on liking for a
group: A replication. Journal of Experimental Social Psychology, 2, 278–287.
Goleman, D. (2006). Emotional Intelligence. New York: Bantam Books.
Hebl, M. R. & Mannix, L. M. (2003). The weight of obesity in evaluating others: A mere prox-
imity effect. Personality and Social Psychology Bulletin, 29, 28–38.
Herbst, K. C., Finkel, E. J., Allan, D., & Fitzsimons, G. M. (2011). On the dangers of pulling
a fast one: Advertisement disclaimer speed, brand trust, and purchase intentions. Journal of
Consumer Research, 38, 909–919.
Hovland, C. I., Janis, I. L., & Kelley, H. H. (1953). Communication and Persuasion: Psychological Studies of Opinion Change. New Haven, CT: Yale University Press.
Hovland, C. I. & Weiss, W. (1951). The influence of source credibility on communication effec-
tiveness. Public Opinion Quarterly, 15, 635–650.
Jackob, N., Roessing, T., & Petersen, T. (2011). The effects of verbal and nonverbal elements in
persuasive communication: Findings from two multi-method experiments. Communications:
The European Journal of Communication Research, 36, 245–271.
Johnson, I., Siegel, J. T., & Crano, W. D. (2014). Extending the reach of vested interest in predict-
ing attitude-consistent behavior. Social Influence, 9, 20–36.
Kelman, H. C. (1958). Compliance, identification, and internalization: Three processes of opinion
change. Journal of Conflict Resolution, 2, 51–60.
Kelman, H. C. & Hovland, C. I. (1953). “Reinstatement” of the communicator in delayed mea-
surement of opinion change. Journal of Abnormal and Social Psychology, 48(3), 327–335.
Kumkale, G. T., Albarracín, D., & Seignourel, P. J. (2010). The effects of source credibility in the
presence or absence of prior attitudes: Implications for the design of persuasive communication
campaigns. Journal of Applied Social Psychology, 40, 1325–1356.
LaBarbera, P. & MacLachlan, J. (1979). Time compressed speech in radio advertising. Journal of
Marketing, 43, 30–36.
Lefkowitz, M., Blake, R. R., & Mouton, J. S. (1955). Status factors in pedestrian violation of
traffic signals. Journal of Abnormal and Social Psychology, 51, 704–706.
Lehman, B. J. & Crano, W. D. (2002). The pervasive effects of vested interest on attitude-criterion
consistency in political judgment. Journal of Experimental Social Psychology, 38, 101–112.
MacLachlan, J. (1982). Listener perception of time compressed spokespersons. Journal of Adver-
tising Research, 2, 47–51.
Maricchiolo, F., Gnisci, A., Bonaiuto, M., & Ficca, G. (2009). Effects of different types of hand
gestures in persuasive speech on receivers’ evaluations. Language and Cognitive Processes,
24(2), 239–266.
Mayer, J. D. & Salovey, P. (1997). What is emotional intelligence? In P. Salovey & D. Sluyter
(Eds), Emotional Development and Emotional Intelligence: Implications for Educators (pp. 3–
31). New York: Basic Books.
McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Chicago: University
of Chicago Press.
McNeill, D. (2000). Language and Gesture. Cambridge: Cambridge University Press.
Mead, G. H. (1934). Mind, Self, and Society. Chicago: University of Chicago Press.
Miller, N., Maruyama, G., Beaber, R. J., & Valone, K. (1976). Speed of speech and persuasion.
Journal of Personality and Social Psychology, 34, 615–624.
Mills, J. & Jellison, J. M. (1967). Effect on opinion change of how desirable the communication
is to the audience the communicator addressed. Journal of Personality and Social Psychology,
6, 98–101.
Norton, M. I., Mochon, D., & Ariely, D. (2012). The IKEA effect: When labor leads to love.
Journal of Consumer Psychology, 22(3), 453–460.
Oksenberg, L., Coleman, L., & Cannell, C. F. (1986). Interviewers’ voices and refusal rates in
telephone surveys. Public Opinion Quarterly, 50, 97–111.
Patterson, M. L. (1982). A sequential functional model of nonverbal exchange. Psychological
Review, 89, 231–249.
Patterson, M. L. (2001). Toward a comprehensive model of non-verbal communication. In W. P.
Robinson & H. Giles (Eds.), The New Handbook of Language and Social Psychology (pp. 159–
176). Chichester, UK: John Wiley & Sons.
Pearce, W. & Conklin, F. (1971). Nonverbal vocalic communication and perceptions of a speaker.
Speech Monographs, 38, 235–241.
Petty, R. E. & Cacioppo, J. T. (1986). Communication and Persuasion: Central and Peripheral
Routes to Attitude Change. New York: Springer.
Petty, R. E., Wheeler, S. C., & Tormala, Z. L. (2013). Persuasion and attitude change. In H.
Tennen, J. Suls, & I. B. Weiner (Eds), Handbook of Psychology, Volume 5: Personality and
Social Psychology (2nd edn, pp. 369–389). Hoboken, NJ: John Wiley & Sons.
Pittam, J. (2001). The relationship between perceived persuasiveness of nasality and source char-
acteristics for Australian and American listeners. The Journal of Social Psychology, 130, 81–
87.
Praxmarer, S. (2011). How a presenter’s perceived attractiveness affects persuasion for
attractiveness-unrelated products. International Journal of Advertising, 30, 839–865.
Rhine, R. J. & Severance, L. J. (1970). Ego-involvement, discrepancy, source credibility, and
attitude change. Journal of Personality and Social Psychology, 16, 175–190.
Salovey, P. & Mayer, J. D. (1990). Emotional intelligence. Imagination, Cognition, and Personal-
ity, 9, 185–211.
Sharf, D. J. & Lehman, M. E. (1984). Relationship between the speech characteristics and effec-
tiveness of telephone interviewers. Journal of Phonetics, 12, 219–228.
Smith, S. M. & Shaffer, D. R. (1991). Celerity and cajolery: Rapid speech may promote or inhibit
persuasion through its impact on message elaboration. Personality and Social Psychology Bul-
letin, 17, 663–669.
Smith, S. M. & Shaffer, D. R. (1995). Speed of speech and persuasion: Evidence for multiple
effects. Personality and Social Psychology Bulletin, 21, 1051–1060.
Snyder, M., Tanke, E. D., & Berscheid, E. (1977). Social perception and interpersonal behavior:
On the self-fulfilling nature of social stereotypes. Journal of Personality and Social Psychology,
35, 656–666.
Streeck, J. (2009). Gesturecraft: The manu-facture of meaning. Amsterdam: John Benjamins.
Van der Vaart, W., Ongena, Y., Hoogendoorn, A., & Dijkstra, W. (2006). Do interviewers’ voice
characteristics influence cooperation rates in telephone surveys? International Journal of Pub-
lic Opinion Research, 18, 488–499.
Vann, J. W., Rogers, R. D., & Penrod, J. P. (1970). The cognitive effects of time-compressed
advertising. Journal of Advertising, 16, 10–19.
Wagner, P., Malisz, Z., & Kopp, S. (2014). Gesture and speech in interaction: An overview. Speech
Communication, 57, 209–232.
Wheeless, L. R. (1971). Some effects of time-compressed speech on persuasion. Journal of
Broadcasting, 15, 415–420.
Wundt, W. (1911). Völkerpsychologie: Eine Untersuchung der Entwicklungsgesetze von Sprache,
Mythus, und Sitte [Ethnocultural psychology: An investigation of the developmental laws of
language, myth, and customs]. Aalen, Germany: Scientia Verlag.

Further Reading

Crano, W. D. & Prislin, R. (2006). Attitudes and persuasion. Annual Review of Psychology, 57,
345–374.
Kelman, H. C. (1961). Processes of opinion change. Public Opinion Quarterly, 25, 57–78.
Petty, R. E. & Wegener, D. T. (1998). Attitude change: Multiple roles for persuasion variables. In
D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds), The Handbook of Social Psychology (vols 1 and
2, 4th edn, pp. 323–390). New York: McGraw-Hill.
Social Signals and Persuasion 109

Petty, R. E. & Wegener, D. T. (1999). The elaboration likelihood model: Current status and contro-
versies. In S. Chaiken & Y. Trope (Eds), Dual-process Theories in Social Psychology (pp. 37–
72). New York: Guilford Press.
Woodall, W. G. & Burgoon, J. K. (1984). Talking fast and changing attitudes: A critique and
clarification. Journal of Nonverbal Behavior, 8, 126–142.
10 Social Presence in CMC and VR
Christine Rosakranse, Clifford Nass, and Soo Youn Oh

Introduction

The first message by telegraph was sent by Samuel Morse and read, “What hath God
wrought?” The first words communicated via telephone were, “Mr. Watson – Come
here – I want to see you.” We can see that over a relatively short period of time what
was first considered a feat to reckon with the powers of the universe became somewhat
more mundane. When it comes to telecommunications and, later, computer-mediated
communication (CMC) and virtual reality (VR), what can be achieved in terms of social
presence is all a matter of one’s perspective and goals.
The original goal of CMC was to increase social presence by emulating face-to-
face communication, thereby increasing the feeling that you were actually with another
person. Short, Williams, and Christie (1976) first formally introduced the concept of
social presence as a distinguishing attribute of telecommunication events. The com-
mercial world abbreviated this concept quite well when telephone companies beckoned
us to “reach out and touch someone.” Increasing social presence meant transmitting
social signals and creating richer contexts for communication with the ultimate goal of
being indiscernibly like face-to-face communication. Face-to-face (FtF), one-on-one,
real-time interaction with all the verbal and nonverbal modes of communication occur-
ring without lag or filtration was to be the pinnacle of communication technologies.
To this end, we have created very rich social signal environments that strive for this
canon.
The current state of technology, however, allows us to go beyond this “mundane”
goal of strict veridicality and back to a more awe-inspiring realm of possibility. Given
the nature of CMC, namely the ease with which one can alter variables in code, and the
flexibility of mental models, new goals can be attained. Realness is now the mundane
goal. Augmentation and enhancement are the new frontier.

Early CMC

In the early days of CMC, communication was limited to asynchronous, text-based for-
mats. Listservs, e-mail, and other discussion forums were places where one could post
and read, but responses were not immediate. The process of having a dialogue was
stilted and slow. There were no visual or audio cues and misinterpretation or deceit in
messages undermined clear communication.
These early systems, however, also had benefits when looked at from another angle,
including the possibility to carefully compose and edit messages prior to sending them.
This allowed for selective self-presentation and identity play. You didn’t necessarily
know who was who on the early Internet, for better or worse.
The early theories of CMC reflected the constraints of the technology. The cues-
filtered-out approach introduced by Culnan and Markus in 1987 dominated the theo-
retical space into the early 1990s. CMC was considered to be inherently less social
and “leaner” than in-person interactions. “CMC, because of its lack of audio or
video cues, will be perceived as impersonal and lacking in sociability and norma-
tive reinforcement, so there will be less socioemotional (SE) content exchanged”
(Rice & Love, 1987: 88).
Walther responded to this cues-filtered-out approach in 1992 with his social infor-
mation processing theory (SIP). It was clear, even at this early stage, that relationships
were developing online and that people could form deep connections through CMC,
albeit at a slower rate than in person. Friendships and even romantic partnerships were
forming through online message threads. This trend continues today as hundreds of mar-
ried couples attribute their meeting to playing the same online role-playing game. From
this perspective, the positive outcomes related to social presence were not dependent
on veridical representations, but rather commonalities and complementarities that were
revealed even, or perhaps only, when certain facets of FtF communication were deleted
from the interaction.
Walther later introduced the hyperpersonal model in 1996. Sometimes the unique
affordances of CMC allow individuals to develop relationships that are “more socially
desirable than we tend to experience in parallel Face-to-Face interaction” (1996: 17).
The descriptor hyperpersonal refers to the possibility of creating ties that are more inti-
mate than those formed in FtF interaction due to characteristics of CMC. From the sender’s end, you
can selectively present and edit information about yourself to create a version of you
that is more ideal than your real-world self. For example, if you considered your exten-
sive knowledge of Latin American art to be one of your better traits, you could monitor
your messages to create an image of a connoisseur of arts, but successfully keep your
smelly feet and ignorance of current affairs a secret.
Another theory that applies to social presence in CMC was presented by Reicher,
Spears, and Postmes in 1995. The social identity model of deindividuation effects
(SIDE) states that in deindividuated (depersonalized) settings with a salient group
identity, individual identity is submerged into the group identity. Individuals natu-
rally develop preferences for their ingroup members. Conversely, they disassociate with
members of the outgroup. We can see these effects most acutely in certain CMC con-
texts that foster deindividuation. In the most extreme cases, where users are anonymous,
SIDE can predict negative social outcomes. It is easier to demonize or marginalize those
outgroup members when your ingroup membership is made salient. Together these early
theories represent the initial foray into describing the social effects possible through
CMC, at least those effects that were apparent given the technology in the 1990s. At
this time, FtF interaction was the goal that the researchers and the technology were
trying to reach.

Social Presence Now

Since the beginning of the new millennium, however, several improvements have been
made to CMC-related technologies, especially in the area of VR, and we may frame
this work as aspiring to a new goal. The drive for an increased sense of social presence
has reshaped once inadequate or deficient presentation modalities while at the same time
improving other aspects related to social presence. The goal now is not strict emulation of
FtF communication, but the adoption of truly creative ways to increase social presence
beyond the capabilities of nonmediated, real-life interaction.
Currently, synchronous CMC is ubiquitous, even to the detriment of FtF communi-
cation. Real-time interactions are the norm, with SMS, instant messaging, and video chat
readily available through a number of different platforms. We also have the ability to
communicate via multichannel input/output devices, providing realism to our sensory
experience of CMC. These new communication spaces can also provide hyperrealis-
tic interactions in enhanced-cues environments. This is especially true of immersive
environments.
In terms of social signal processing, we can think of CMC and VR from two per-
spectives. Interrogating the communication channel or device through which the com-
munication occurs represents a technologist view, while the psychological perspective
highlights the method of increasing social presence by looking at what social informa-
tion is transmitted across that channel. Both are valid perspectives and inextricable from
one another. Starting with the available channels, we can see which improvements have
been made through the technology. This will then lead us to the psychological implica-
tions for social presence research.
The introduction of surround sound in CMC and VR environments progressed the
technology into greater levels of realism and physical presence. However, the most
impressive improvements have been made in the visual representations available to
researchers. The early years of VR provided users with polygon-based avatars with lit-
tle smoothing and little photorealism. In the immersive virtual environments (IVEs) of
today, the issues of latency and photorealism have been mitigated with faster processors
and more efficient processing algorithms. However, a real-time display versus photore-
alism tradeoff still exists.
Similar to surround sound innovation in audio output, 3-D presentations of visual
output are seen as the next evolution for displays. Lenticular screens are now available
commercially with 3-D movies providing an everyday context for increasingly rich
visual representations. Rear-projection cave automatic virtual environments (CAVEs)
and head-mounted displays (HMDs) make IVEs increasingly realistic. One other
factor that has been found to affect realism is refresh rate, how often the visual repre-
sentation updates.

While audio and visual channels for engagement and feedback through surround
vision and sound devices (HMDs and special “CAVE” rooms) create a certain level of
immersion, additional immersion is achieved as more senses become involved, now that
people in VR have the option to “touch” virtual bodies with both tactile and kinesthetic
feedback through gloves and other specially designed haptic devices.
With real-time motion capture devices, such as the Kinect, the user’s entire body
becomes the control mechanism for the avatar. The psychological implications for the
augmentation of the sense of social presence are profound. Overall, improvements and
modifications are continually being made to the way that signals are communicated,
making physical distance less of an obstacle, as long as the proper digital infrastructure
is in place. However, not all the senses have been treated with equal attention. In terms
of CMC and VR, audio and video are the most widely researched sensory areas, and
therefore the most improved upon.
While devices have been created specifically to affect our olfactory sense, this sense
has not been researched to the extent of audio or visual outputs. Touch, on the other
hand, does have some research to support its use in increasing social presence. Move-
ment, vibration, temperature, and texture all represent facets of a haptic interface. As a
communication channel, touch can express emotions, as well as empathy and kindness.
For example, the Keltner Laboratory at the University of California, Berkeley, found that
humans can communicate emotions such as gratitude and compassion with touches to
a stranger’s forearm. IJsselsteijn and his colleagues have studied mediated social touch
in order to determine how it might be used to increase social presence (Haans & IJs-
selsteijn, 2006). Similarly, the Midas touch, where servers were given higher tips after
touching patrons on the arm, was proven successful in a mediated format (Haans &
IJsselsteijn, 2009).
Once again, from a goals perspective, we can see that although these sense-related
interfaces can seek to veridically reproduce sensations that may occur in the real world,
due to the nature of the interfaces, they can also represent heightened or augmented
situations and environments according to the desired goal. For example, in terms of
social presence, one might seek to leverage the emotions elicited to make an interac-
tion more arousing. The basic goal of creating a rich social signal environment can
itself be built upon to reach these ends. Now that CMC and IVEs permit multisen-
sory social encounters, the role of affect becomes more salient to the researcher, espe-
cially when emotion mediates the role of cognitive functions such as memory and
learning.
The use of agents, computer-controlled virtual entities, has proven to be a rich vein
for social science research. Looking at factors affecting social presence, certain well-
established psychological interventions initially proven effective in real-world applica-
tions have proven effective when enacted by agents in IVEs. Mirroring another’s non-
verbal behavior with a certain amount of latency to reduce its obviousness, for example,
increases “liking” of the agent (Bailenson & Yee, 2005).
In VR, when designing one’s avatar, at that moment of digital conception, every pixel,
every behavior, and any profile information related to the avatar can be defined exactly
to one’s specifications, to be controlled as one sees fit and to be manipulated at any later
point in time due to “digital plasticity” (Bailenson & Beall, 2006). For avatars engaged
with other agents and avatars in interpersonal communication, as in a collaborative vir-
tual environments (CVE) or inside of a massively multiplayer online role-playing game
(MMORPG), this can bring both great benefits to the user controlling the avatar, as well
as a greater sense of self-presence for the user in the system and an increased sense of
social presence when interacting with others. For social signal processing, this dynamic,
where we can present information that would not be apparent or even true in the real
world, means that programmers and engineers have an entirely new space to develop in.
What was once a detriment, the mediated nature of communication through computer-
based interfaces, can now become an asset.

Transformed Social Interaction (TSI)

Transformed social interaction (TSI) refers to the decoupling of representation from
behavior and form that “involves novel techniques that permit changing the nature of
social interaction” (Bailenson et al., 2004: 429) by enhancing or degrading interpersonal
communication. That is, the TSI paradigm entertains the possibility of altering certain
social signals (i.e., nonverbal information) to change the social interaction itself. TSI
leverages the concepts that constitute Walther’s hyperpersonal model. Social interaction
in VR can be more than “real.” With VR’s ability to present information that is modified
or personalized to the individual recipient, the ramifications for massive open online
courses (MOOCs) (Tu & McIsaac, 2002) and other interpersonal online spaces are also
profound.
TSI is composed of the following three categories, each of which offers a separate
dimension of transformable social signals: (1) self-representations (i.e., avatars), (2)
sensory capabilities, and (3) contextual situations (Bailenson et al., 2004: 430). One
of the strongest examples of this can be seen when we look at not only dyadic interac-
tions, but one-to-many interactions. Imagine yourself taking an online course in a virtual
environment. Which parameters of the teacher representation help you to learn better?
Proximity to the teacher, eye-gaze, and tone of speech have all been demonstrated as
factors of learning.
For example, in an IVE, it is no longer the case that teachers can only look at one
student at a time. Their teacher avatar can look at you, in your virtual environment,
50 percent of the time, while in others’ virtual environments, the teacher can also look
at each of them 50 percent of the time. This ability to allocate more than 100 percent of
an avatar’s gaze has been called nonzero-sum gaze (Beall et al., 2003).
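
A minimal sketch may make this concrete. Assuming each participant's client renders its own copy of the shared scene, the teacher avatar's gaze target can be chosen per viewer; the class, method, and timing parameter below are hypothetical and not taken from any cited system.

```python
# Illustrative sketch of per-viewer gaze retargeting, the idea behind
# nonzero-sum gaze. All names and parameters here are hypothetical.
import random

class TeacherAvatar:
    def __init__(self, local_gaze_share=0.5):
        # Fraction of time the teacher appears to look at *this* viewer.
        self.local_gaze_share = local_gaze_share

    def gaze_target_for_viewer(self, viewer, others):
        """Return the gaze target rendered on one viewer's client.

        Every client runs this independently, so each participant can
        receive (say) 50 percent of the teacher's gaze -- more than
        100 percent in total across the class.
        """
        if random.random() < self.local_gaze_share:
            return viewer                      # look at the local user
        return random.choice(others)           # otherwise glance elsewhere

# Example: three students, each rendering their own copy of the scene.
students = ["ann", "bob", "eve"]
teacher = TeacherAvatar(local_gaze_share=0.5)
for s in students:
    target = teacher.gaze_target_for_viewer(s, [o for o in students if o != s])
    print(f"On {s}'s client the teacher currently looks at: {target}")
```
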
Proxemics, or how close we physically get to another person, has also been studied
in social science research through IVEs. In terms of social presence, one of the biggest
cognitive constraints to believing that another person was in the space with you was
the fact that the other person was not “actually” there with you. For this reason, the
ability of a virtual environment to realistically present another person’s avatar as if they
were there with you has multifaceted repercussions. Similar to eye-gaze, how close
another avatar stands is something that can be manipulated algorithmically in the final
representation.
It has been shown that in classroom settings, those who sit closer to the front learn
more (Blascovich & Bailenson, 2011). This finding was tested in VR with some avatars
being placed close to a virtual teacher and some avatars being placed further back.
Those who were placed “virtually” closer did learn and retain more information. In
other words, everyone can sit at the front of the class. Therefore, IVEs may achieve a
more optimized outcome than FtF interactions by keeping an avatar an optimal distance
from a learning source in order to maximize learning potential, if the intended role is
that of being an effective teacher. This ability can be attributed to a combination of
human input and algorithm-based interventions when an avatar is represented in CMC
or in a virtual environment.
Perhaps your learning preferences are different than those of another student tak-
ing the same course. How can these differences in preferences be respected and even
leveraged? The way that the teacher interacts with you is no longer limited to veridical
representations. These components of social interaction change the perception of the
recipient, but if we ask ourselves “Which components of social presence are important
during an interaction?”, we can think again in terms of goals.
The bodily concept of the self can even be transformed to behave in ways that would
not be possible in FtF contexts. No longer bound to the rules of physics, you can have an
arm that stretches for miles; you may be granted a third arm or exchange the functions
of the arms and legs (homuncular flexibility; Lanier, 2006). Studies have already shown
that it is relatively easy to make people feel the illusion that they have grown an extra
hand or a longer nose – “the body image, despite all its appearance of durability and
permanence, is in fact a purely transitory internal construct” (Ramachandran & Rogers-
Ramachandran, 2000: 319–320).
Currently, there are even more subtle, but effective, cases of selective self-presentation
because they don’t lie in the realm of the blatantly impossible, such as being a centaur.
For example, Second Life is a VR environment where physically handicapped individu-
als, who may be bound to a wheelchair for movement in the real world, can walk. They
can fly. In fact, Second Life has a larger population percentage-wise of physically hand-
icapped individuals than the real, non-digital world (Blascovich & Bailenson, 2011: 4).
In terms of social presence, this situation necessitates differentiation between goals nor-
mally associated with FtF communication and those goals that may now be attained
through CMC. Namely, the goal of CMC in this case may be to increase social presence
by mitigating real-world constraints in order to permit closer interactions or ones that
are more intimate.
In this way, being handicapped is no longer central to their identity because they have
an occasion and a place where having that handicap is not always “true.” Interfaces have
also been developed where you can control an avatar with a head gesture. Video game
systems that register brainwaves through a head-mounted device and use levels of “con-
centration” as the input for moving objects are now in development (e.g., Neurosky).
This is a case where the technology itself acts to circumvent given physical limitations
so an individual can hyperpersonalize their avatar.

However, this is just the tip of the digital iceberg when it comes to what hyperpersonal
modifications mean for social presence. Avatars reintroduce the possibility for nonver-
bal communication, such as gaze and gestures, into the interaction (Walther, 2006).
TSI, as a research paradigm, has provided empirical support for ways in which one
can alter self-presentation to affect social presence through interpersonal interaction
in a virtual environment (Yee, Bailenson, & Ducheneaut, 2009). Social advantage can
be gained by manipulating one’s facial features, for example. “Liking” of the other
in an interpersonal situation is now one factor that can be modified through many
means, including algorithmic. We can not only affect factors that make us like the
other more, but also ones that let us present ourselves differently, according to one’s
motivation.
Research has already shown that changing visual representations can have psycholog-
ical implications for self-perception and other-perception, leading to differences in atti-
tudes and behaviors. Earlier studies that manipulated visual representations found that
individuals evaluated computer interfaces more positively when they interacted with
their own face (vs someone else’s face) (Nass, Kim, & Lee, 1998). Students who were
asked to perform a task with members from their outgroup as a team displayed stronger
liking for their virtual group when they were assigned identical avatars compared to
when they were given avatars that were different from their virtual group members (e.g.,
Lee, 2004).
Today, technological development allows us to move away from ostensible duplica-
tions of visual representations to more subtle methods of similarity enhancement. A
study by Bailenson and his colleagues (2006) found that morphing a presidential candi-
date’s facial features with a participant’s facial features influenced how much they liked
the candidate. One could also morph their avatar’s face with another person’s in order to
engender feelings of liking and trust. At 20 or 40 percent morphing, the effect is subtle
enough that few people would recognize the manipulation. Other factors such as height
and race of the avatar can also be altered in VR. Overall, Walther (2006) found that
people who “engage in strategic self-presentation online have better social interactions
in these spaces” (Ratan, 2010). They become the masters, in a way, of the fate of their
mediated interactions.
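
The blending step behind such similarity manipulations can be sketched as follows. This is an illustrative simplification only: production morphing systems also warp facial geometry before blending, and the file names below are hypothetical.

```python
# Minimal sketch of the blending step behind facial similarity manipulation.
# Assumes two pre-aligned face images of equal size (hypothetical files).
import numpy as np
from PIL import Image

def morph_faces(candidate_img, participant_img, alpha=0.2):
    """Blend `alpha` of the participant's face into the candidate's face.

    At alpha of roughly 0.2-0.4 the similarity cue is reported to be
    subtle enough that few people notice the manipulation.
    """
    a = np.asarray(candidate_img, dtype=np.float32)
    b = np.asarray(participant_img, dtype=np.float32)
    blended = (1.0 - alpha) * a + alpha * b
    return Image.fromarray(blended.astype(np.uint8))

# Hypothetical usage:
# candidate = Image.open("candidate_aligned.png")
# participant = Image.open("participant_aligned.png")
# morph_faces(candidate, participant, alpha=0.2).save("morph_20.png")
```
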
A simple thought experiment helps to illuminate some of the potentials of TSI. It was
already mentioned that algorithms or invisible others could act to augment the behav-
ior of your character, but imagine if your avatar could touch the other avatar. What
if, when another person’s facial expression showed a frown, a pop-up bubble appeared
next to them that said, “This person finds touch reassuring”? You might
control your avatar to walk over and touch their virtual hand, which they would then feel
through a haptic device placed in their glove that squeezed their palm. You would then
be even more “empathic” than you could be in real life. Other people might have bub-
bles that say, “Do not touch under any circumstance” or “They find direct eye-gaze to be
confrontational.”
The main challenges to the future of social presence research include developing
commensurability between what social presence means when using different tech-
nologies compared to real-world findings. Researchers should also seek to apply their
research to interface development for social well-being and online education. In doing
this, they must also realize the role of audience, social influence, and broader con-
structs, such as traditional publics versus networked publics. The nature of “social” is a
continually changing concept, especially when we have the blurring of public and pri-
vate information.

Conclusion

The previous canon for CMC was FtF communication, with two characteristics: a very
rich social signal environment and no deception or enhancement. Now, enhanced envi-
ronments give more opportunities to manifest the sensory truth, but also a greater possi-
bility for deception (e.g., nonzero-sum gaze). Experimental manipulations have already
shown that many cognitive heuristics that we follow in the real world also hold true
in VR. Making avatars into hyperpersonal self-representations can then leverage these
habitual mechanisms to bolster the positive aspects of a person or to give them advan-
tage in a virtual environment.
In some ways, the extent to which human beings are changed by the technology in
their environment is tremendous because we tend to interact with those technologies
using peripheral processing. We do not consciously reflect on the fact that the affective
capability of any technology has serious implications for the design and use
of that technology. For example, the positive aspects of a video game come in many
forms. “When my team wins, we all win.” This sense of social presence, community,
and teamwork is often unrivaled in most work places. In this sense, hyperpersonaliza-
tion of an avatar can benefit the entire team online. Research must study the affective
capability of video games, and other CMC platforms, and then perpetuate the benefits
through the user experience design.
Recent theoretical contributions to social presence research have acknowledged the
ever-increasing sophistication of CMC-related technologies. However, the commensu-
rability between social presence in nonmediated contexts and mediated communication
has not yet been fully explored. It may be the case that social presence can be augmented
via technology along certain dimensions, while other dimensions will prove to be less
compelling. The implications of social networking must also be factored into the general
social presence research, especially in regard to the longitudinal effects of long-term use
by millennials and “digital natives.” Will the evolving relationship to technology lead
to greater disparity between offline and online personae and what will that mean for
social presence research? Given the changing nature of social dynamics, future research
may find that technology is not linearly changing how we interact with others. Rather,
researchers may discover an iterative process whereby the technology is both changing,
and being influenced by, the culture itself.

References

Bailenson, J. N. & Beall, A. C. (2006). Transformed social interaction: Exploring the digital
plasticity of avatars. In R. Schroeder & A. Axelsson (Eds), Avatars at Work and Play:
Collaboration and Interaction in Shared Virtual Environments (pp. 1–16). Dordrecht, The
Netherlands: Springer-Verlag.
Bailenson, J. N., Beall, A. C., Loomis, J., Blascovich, J., & Turk, M. (2004). Transformed social
interaction: Decoupling representation from behavior and form in collaborative virtual envi-
ronments. PRESENCE: Teleoperators and Virtual Environments, 13(4), 428–441.
Bailenson, J. N., Garland, P., Iyengar, S., & Yee, N. (2006). Transformed facial similarity as a
political cue: A preliminary investigation. Political Psychology, 27(3), 373–385.
Bailenson, J. N. & Yee, N. (2005). Digital chameleons: Automatic assimilation of nonverbal
gestures in immersive virtual environments. Psychological Science, 16(10), 814–819.
Beall, A. C., Bailenson, J. N., Loomis, J., Blascovich, J., & Rex, C. (2003). Non-zero-sum mutual
gaze in collaborative virtual environments. Proceedings of HCI International, June 22–27,
Crete, Greece.
Blascovich, J. & Bailenson, J. (2011). Infinite Reality: Avatars, Eternal Life, New Worlds, and the
Dawn of the Virtual Revolution. New York: HarperCollins.
Culnan, M. J. & Markus, M. L. (1987). Information technologies. In F. M. Jablin & L. L. Put-
nam (Eds), Handbook of Organizational Communication: An Interdisciplinary Perspective
(pp. 420–443). Thousand Oaks, CA: SAGE.
Haans, A. & IJsselsteijn, W. A. (2006). Mediated social touch: A review of current research and
future directions. Virtual Reality, 9(2–3), 149–159.
Haans, A. & IJsselsteijn, W. A. (2009). The virtual Midas touch: Helping behavior after a medi-
ated social touch. IEEE Transactions on Haptics, 2(3), 136–140.
Lanier, J. (2006). Homuncular flexibility. Edge. www.edge.org/response-detail/11182print.html-lanier.
Lee, E.-J. (2004). Effects of visual representation on social influence in computer-mediated
communication. Human Communication Research, 30(2), 234–259.
Nass, C., Kim, E. Y., & Lee, E.-J. (1998). When my face is the interface: An experimental
comparison of interacting with one’s own face or someone else’s face. In Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems (pp. 148–154). New York: ACM
Press/Addison-Wesley.
Ramachandran, V. S. & Rogers-Ramachandran, D. (2000). Phantom limbs and neural plasticity.
Archives of Neurology, 57(3), 317–320.
Ratan, R. (2010). Self-presence, explicated. Paper presented at the 60th Annual Conference of
the International Communication Association, Singapore.
Reicher, S. D., Spears, R., & Postmes, T. (1995). A social identity model of deindividuation
phenomena. European Review of Social Psychology, 6(1), 161–198.
Rice, R. E. & Love, G. (1987). Electronic emotion: Socioemotional content in a computer-mediated
communication network. Communication Research, 14(1), 85–108.
Short, J., Williams, E., & Christie, B. (1976). The Social Psychology of Telecommunications.
London: John Wiley & Sons.
Tu, C. H. & McIsaac, M. (2002). The relationship of social presence and interaction in online
classes. The American Journal of Distance Education, 16(3), 131–150.
Walther, J. B. (1992). Interpersonal effects in computer-mediated interaction: A relational per-
spective. Communication Research, 19(1), 52–90.
Walther, J. B. (1996). Computer-mediated communication: Impersonal, interpersonal, and
hyperpersonal interaction. Communication Research, 23(1), 3–43.
Walther, J. B. (2006). Nonverbal dynamics in computer-mediated communication or :( and
the Net :(’s with you, :) and you :) alone. In V. Manusov & M. L. Patterson (Eds),
The SAGE Handbook of Nonverbal Communication (pp. 461–479). Thousand Oaks, CA:
SAGE.
Yee, N., Bailenson, J. N., & Ducheneaut, N. (2009). The Proteus effect: Implications of
transformed digital self-representation on online and offline behavior. Communication
Research, 36(2), 285–312.
Part II
Machine Analysis of Social Signals
11 Facial Actions as Social Signals
Michel Valstar, Stefanos Zafeiriou, and Maja Pantic

According to a recent survey on social signal processing (Vinciarelli, Pantic, &
Bourlard, 2009), next-generation computing needs to implement the essence of social
intelligence including the ability to recognize human social signals and social behav-
iors, such as turn taking, politeness, and disagreement, in order to become more effec-
tive and more efficient. Social signals and social behaviors are the expression of one’s
attitude towards social situation and interplay, and they are manifested through a mul-
tiplicity of nonverbal behavioral cues, including facial expressions, body postures and
gestures, and vocal outbursts like laughter. Of the many social signals, only face, eye,
and posture cues are capable of informing us about all identified social behaviors. Dur-
ing social interaction, it is a social norm to look one’s dyadic partner in the eyes,
clearly focusing one’s vision on the face. Facial expressions thus make for very pow-
erful social signals. As one of the most comprehensive and objective ways to describe
facial expressions, the facial action coding system (FACS) has recently received signifi-
cant attention. Automating FACS coding would greatly benefit social signal processing,
opening up new avenues to understanding how we communicate through facial expres-
sions. In this chapter we provide a comprehensive overview of research into machine
analysis of facial actions. We systematically review all components of such systems:
pre-processing, feature extraction, and machine coding of facial actions. In addition, the
existing FACS-coded facial expression databases are summarized. Finally, challenges
that have to be addressed to make automatic facial action analysis applicable in real-life
situations are extensively discussed.

Introduction

Scientific work on facial expressions can be traced back to at least 1872 when Charles
Darwin published The Expression of the Emotions in Man and Animals (1872). He
explored the importance of facial expressions for communication and described vari-
ations in facial expressions of emotions. Today, it is widely acknowledged that facial
expressions serve as the primary nonverbal social signal for human beings, and are
responsible in large part for regulating our interactions with each other (Ekman &
Rosenberg, 2005). They communicate emotions, clarify and emphasize what is being said, and
signal comprehension, disagreement, and intentions (Pantic, 2009).

Figure 11.1 Examples of upper and lower face AUs defined in the FACS (AU4 brow lowerer,
AU6 cheek raise, AU7 lids tight, AU9 nose wrinkle, AU12 lip corner puller, AU20 lip stretch,
AU25 lips part, AU26 jaw drop, AU43 eye closure).

The most common way to objectively distinguish between different facial expressions
is that specified by the facial action coding system (FACS). The FACS is a taxonomy of
human facial expressions. It was originally developed by Ekman and Friesen in 1978,
and revised in 2002 (Ekman, Friesen, & Hager, 2002). The revision specifies thirty-
two atomic facial muscle actions, named action units (AUs), and fourteen additional
action descriptors (ADs) that account for miscellaneous actions, such as jaw thrust,
blow, and bite. The FACS is comprehensive and objective in its description. Since any
facial expression results from the activation of a set of facial muscles, every possi-
ble facial social signal can be comprehensively described as a combination of AUs (as
shown in Figure 11.1; Ekman & Friesen, 1978).
Over the past thirty years, extensive research has been conducted by psychologists
and neuroscientists using the FACS on various aspects of facial social signal processing.
For example, the FACS has been used to demonstrate differences between polite and
amused smiles (Ambadar, Cohn, & Reed, 2009), to detect deception (Frank & Ekman,
1997), and to study facial signals between depressed patients and their counselors (Girard
et al., 2013).
Given the significant role of faces in our emotional and social lives, automating the
analysis of facial signals would be very beneficial (Pantic & Bartlett, 2007). This is
especially true for the analysis of AUs. A major impediment to the widespread use of
FACS is the time required both to train human experts and to manually score video.
It takes over 100 hours of training to achieve minimal competency as a FACS coder,
and each minute of video takes approximately one hour to score (Donato et al., 1999;
Ekman & Friesen, 1978). It has also been argued that automatic FACS coding can poten-
tially improve the reliability, precision, reproducibility, and temporal resolution of facial
measurements (Donato et al., 1999).
Historically, the first attempts to automatically encode AUs in images of faces were
reported by Bartlett et al. (1996), Lien et al. (1998), and Pantic, Rothkrantz, and
Koppelaar (1998). The focus was on automatic recognition of AUs in static images
picturing frontal-view faces, showing facial expressions that were posed on instruc-
tion. However, posed and spontaneous expressions differ significantly in terms of their
facial configuration and temporal dynamics (Pantic, 2009; Ambadar, Schooler, & Cohn,
2005). Recently the focus of the work in the field has shifted to automatic AU detec-
tion in image sequences displaying spontaneous facial expressions (e.g., Pantic, 2009;
Valstar et al., 2012; Zeng et al., 2009). As a result, new challenges such as head move-
ment (including both in-plane and out-of-plane rotations), speech, and subtle expres-
sions have to be considered. The analysis of other aspects of facial expressions such as
facial intensities and dynamics has also attracted increasing attention (e.g., Tong, Liao,
& Ji, 2007; Valstar & Pantic, 2012). Another trend in facial action detection is the use
of 3-D information (e.g., Savran, Sankur, & Bilge, 2012a; Tsalakanidou & Malassio-
tis, 2010). However, we limit the scope of this chapter to 2-D, and refer the reader to
Sandbach et al. (2012) for an overview of automatic facial expression analysis in 3-D.
In this work, we separately address three different steps involved in automatic facial
expression analysis: (1) image pre-processing including face and facial point detection
and tracking, (2) facial feature extraction, and (3) automatic facial action coding based
on the extracted features (see Figure 11.2).

Figure 11.2 Configuration of a generic facial action recognition system with hand-crafted features.

Facial Action Coding System (FACS)

The FACS defines thirty-two AUs: nine in the upper face, eighteen in the lower face (as
shown in Figure 11.3), and five that cannot be uniquely attributed to either. Additionally
it encodes a number of miscellaneous actions, such as eye gaze direction and head pose,
and fourteen descriptors for miscellaneous actions. With FACS, every possible facial
expression can be described as a combination of AUs. Table 11.1 shows a number of
expressions with their associated AUs.
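
As a concrete illustration of describing expressions as AU combinations, the sketch below matches a set of detected AUs against a few of the prototypes in Table 11.1; the overlap score is our own simplification rather than a FACS procedure.

```python
# Illustrative sketch: expressions as AU combinations, using a few of the
# prototypes listed in Table 11.1. The matching rule (Jaccard overlap) is a
# simplification for illustration only.
PROTOTYPES = {
    "happiness": {6, 12, 25},
    "surprise": {1, 2, 5, 26, 27},
    "disgust": {9, 10, 16, 17, 25, 26},
    "fear": {1, 2, 4, 5, 20, 25, 26, 27},
}

def closest_expression(active_aus):
    """Rank prototype expressions by Jaccard overlap with the detected AUs."""
    scores = {
        name: len(active_aus & aus) / len(active_aus | aus)
        for name, aus in PROTOTYPES.items()
    }
    return max(scores, key=scores.get), scores

best, scores = closest_expression({6, 12, 25, 26})
print(best, scores)   # 'happiness' scores highest for this AU set
```
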
Voluntary versus involuntary. The importance of distinguishing between involuntary
and deliberately displayed (often referred to as “posed”) facial expressions is justi-
fied by both the different semantic content of the facial expression, and the different
physical realization of the expressions (Ekman, 2003; Ekman & Rosenberg, 2005;
McLellan et al., 2010). Neuroanatomical evidence suggests that involuntary and delib-
erate facial expressions are controlled by different mechanisms, resulting in different
activation patterns of the facial muscles (Ekman, 2003; Ekman & Rosenberg, 2005).
Table 11.1 Lists of AUs involved in some expressions.

FACS: upper face: 1, 2, 3, 4, 5, 6, 7, 43, 45, 46; lower face: 9, 10, 11, 12, 13, 15, 16, 17, 18,
20, 21, 22, 23, 24, 25, 26, 27, 28; other: 31, 37, 38
anger: 4, 5, 7, 10, 17, 22, 23, 24, 25, 26
disgust: 9, 10, 16, 17, 25, 26
fear: 1, 2, 4, 5, 20, 25, 26, 27
happiness: 6, 12, 25
sadness: 1, 4, 6, 11, 15, 17
surprise: 1, 2, 5, 26, 27
pain: 4, 6, 7, 9, 10, 12, 20, 25, 26, 27, 43
cluelessness: 1, 2, 5, 15, 17, 22
speech: 10, 14, 16, 17, 18, 20, 22, 23, 24, 25, 26, 28

Figure 11.3 A list of upper and lower face AUs and their interpretation: AU1 inner brow raise,
AU2 outer brow raise, AU4 brow lowerer, AU5 upper lid raiser, AU6 cheek raise, AU7 lids tight,
AU43 eye closure, AU45 blink, AU46 wink, AU9 nose wrinkle, AU10 upper lip raiser, AU11
nasolabial furrow deepener, AU12 lip corner puller, AU13 sharp lip puller, AU14 dimpler,
AU15 lip corner depressor, AU16 lower lip depressor, AU17 chin raiser, AU18 lip pucker,
AU20 lip stretch, AU22 lip funneler, AU23 lip tightener, AU24 lip presser, AU25 lips part,
AU26 jaw drop, AU27 mouth stretch, AU28 lips suck.

Subcortically initiated facial expressions (involuntary ones) are characterized by synchronized,
smooth, symmetrical, and reflex-like muscle movements whereas cortically
initiated facial expressions (deliberate ones) are subject to volitional real-time control
and tend to be less smooth with more variable dynamics (Pantic, 2009).
Morphology and dynamics are two dual aspects of a facial display. Face morphol-
ogy refers to facial configuration, which can be observed from static frames. Dynamics
reflect the temporal evolution of one (possibly neutral) facial display to another and
can be observed in videos only. These dynamics can be described by duration, motion,
asymmetry of motion, relative intensity, and temporal correlation between AU occur-
rences. Regarding AU intensity, scoring is done on a five-point ordinal scale, A-B-C-D-
E, with E being the most intense score.
Facial dynamics (i.e., timing, duration, speed of activation and deactivation of various
AUs) can be better analyzed if the boundaries of the temporal segments (namely neutral,
onset, apex, offset) of each AU activation are known. These four temporal segments, or
phases, can be defined as follows (a minimal segmentation sketch is given after the list).
• Neutral phase: there is no manifestation of activation of the muscle corresponding to
the target AU.
• Onset phase (attack): the intensity of the muscle activation increases toward the apex
phase.
• Apex phase (sustain): the plateau when the intensity of the muscle activation
stabilizes.
• Offset phase (release): progressive muscular relaxation toward the neutral phase.
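
The following minimal sketch illustrates one way the four phases could be labeled from a per-frame AU intensity signal; the thresholds are hypothetical and not part of the FACS.

```python
# A simplified sketch of labelling the four temporal phases from a per-frame
# AU intensity signal. Threshold values are hypothetical.
import numpy as np

def label_phases(intensity, active_thr=0.1, slope_thr=0.02):
    """Label each frame as neutral, onset, apex, or offset."""
    intensity = np.asarray(intensity, dtype=float)
    slope = np.gradient(intensity)
    labels = []
    for value, d in zip(intensity, slope):
        if value < active_thr:
            labels.append("neutral")          # no visible muscle activation
        elif d > slope_thr:
            labels.append("onset")            # intensity rising towards apex
        elif d < -slope_thr:
            labels.append("offset")           # relaxation towards neutral
        else:
            labels.append("apex")             # activation has stabilized
    return labels

# The labels rise through onset to apex and fall back through offset.
print(label_phases([0.0, 0.05, 0.3, 0.7, 0.9, 0.9, 0.9, 0.6, 0.2, 0.0]))
```
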

Both the morphology and dynamics of facial expressions are crucial for the inter-
pretation of human facial behavior. Dynamics are essential for the categorization of
complex psychological states like various types of pain and mood (Williams, 2002),
and are thus indispensable for effective social signal processing. They improve the judgment
of observed facial behavior (e.g., affect) by enhancing the perception of change and
by facilitating the processing of facial configuration. They represent a critical factor
for interpretation of social behaviors, such as social inhibition, embarrassment, amuse-
ment, and shame (Costa et al., 2001; Ekman & Rosenberg, 2005). They have high
correlation with trustworthiness, dominance, and attractiveness in social interactions
(Gill et al., 2012). They are also a key parameter in differentiating between posed and
spontaneous facial displays (Cohn & Schmidt, 2004; Ekman, 2003; Frank & Ekman,
2004; Frank, Ekman, & Friesen, 1993; Valstar, Gunes, & Pantic, 2007).
More than 7,000 AU combinations have been observed in everyday life (Scherer &
Ekman, 1982). Co-occurring AUs can be additive, in which the appearance changes
of each separate AU are relatively independent, or nonadditive, in which one action masks
another or a new and distinctive set of appearances is created (Ekman et al., 2002).
When these co-occurring AUs affect different areas of the face, additive changes are
typical. By contrast, AUs affecting the same facial area are often nonadditive. As an
example of a nonadditive effect, AU4 (brow lowerer) appears differently depending on
whether it occurs alone or in combination with AU1 (inner brow raise). When AU4
occurs alone, the brows are drawn together and lowered. In AU1+4, the brows are drawn
together but are raised due to the action of AU1.

Pre-processing

The pre-processing step consists of all the processing steps that are required before
the extraction of meaningful features. We consider two aspects here: face localization
followed by facial point localization. Registering faces to a common reference frame
is the most important step in pre-processing, and localizing facial points is crucial to
that process. Face registration removes rigid head motion and to some extent shape
variations between different people. This allows features to be extracted from the same
physical locations in faces (e.g., the corner of the mouth).
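
As an illustration, a common registration strategy is to estimate a similarity transform that maps a few tracked landmarks onto a canonical template; the template coordinates below are hypothetical.

```python
# Sketch of face registration to a common reference frame: a similarity
# transform estimated from a few tracked landmarks (eye corners, nose tip,
# mouth corners). Template coordinates are hypothetical.
import cv2
import numpy as np

TEMPLATE = np.array([[60, 80], [140, 80], [100, 120],          # eyes, nose tip
                     [70, 160], [130, 160]], dtype=np.float32)  # mouth corners

def register_face(image, landmarks, size=(200, 200)):
    """Warp `image` so that `landmarks` (5x2 array) align with TEMPLATE."""
    src = np.float32(landmarks).reshape(-1, 1, 2)
    dst = TEMPLATE.reshape(-1, 1, 2)
    M, _ = cv2.estimateAffinePartial2D(src, dst)  # rotation + scale + shift
    return cv2.warpAffine(image, M, size)

# After registration, the same physical locations (e.g., the mouth corners)
# map to the same pixel coordinates across frames and subjects.
```
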

Face Detection and Tracking


The first step of any face analysis method is to detect the face in the scene. The Viola
and Jones (2004) face detector is the most widely employed face detector. The public
availability of optimized versions (e.g., OpenCV or Matlab have implementations) and
its reliability for frontal and near-frontal images under varying conditions makes it the
leading reference face detection algorithm. While current AU detection methods assume
that a frontal face detector is sufficiently accurate to localize the face, in a general sce-
nario a face detection algorithm capable of finding faces in images with an arbitrary
head pose is necessary.
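
For reference, a minimal detection sketch using the OpenCV implementation of the Viola–Jones detector might look as follows; the cascade file path assumes a standard opencv-python installation, and the input image is hypothetical.

```python
# Sketch of frontal face detection with OpenCV's Viola-Jones implementation.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("frame.png")                       # hypothetical input frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:                          # one box per detected face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```
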
Multi-view face detection is typically achieved by using multiple view-specific detec-
tors (Viola & Jones, 2003). Recently, Zhu and Ramanan (2012) proposed an algorithm
capable of performing reliable multi-view face detection, head pose estimation, and
facial point detection. The proposed method offers superior performance to that of Viola
and Jones (2004), and is capable of dealing with a wide range of head rota-
tions. A similar model was proposed for the specific task of face detection in Orozco,
Martinez, and Pantic (2013), resulting in better performance and faster execution at the
expense of the facial point detection.
Once the face is localized, employing a face tracking algorithm is an optional step,
and it can be bypassed by directly applying a facial point detection and tracking algo-
rithm. However, a face tracker (e.g., Khan, Valstar, & Pridmore, 2013; Liwicki et al.,
2012; Ross et al., 2008; Zhang & Van der Maaten, 2013) might be desired when dealing
with low resolution imagery or when a low computational cost is required.

Facial Point Detection and Tracking


Fiducial facial points are defined as distinctive facial landmarks, such as the corners
of the eyes, center of the bottom lip, or the tip of the nose. Together they fully define
the face shape. The localization of facial points, either by detection or tracking, allows
face registration to be carried out as well as the extraction of geometrical features (see
section on geometry-based approaches). Most detection and tracking algorithms rely on
separate models for the face appearance and the face shape, and the problem is thus
often posed as maximizing a loss function that depends on the appearance model while
being constrained to maintaining a valid face shape.
Face shapes are typically modeled using a statistical shape model (Cootes & Tay-
lor, 2004). The possible variations of the face shape depend on two different sets of
parameters: rigid shape transformations that relate to variations in head pose (i.e., rigid
head movements), and nonrigid transformations that model the relation between move-
ments of facial points to facial expressions. One could further divide shape variations
according to whether they can be modeled by a Procrustes analysis or not. Assuming
a 2-D representation of facial points, in-plane rotations, translation, and uniform scal-
ing of the head can all be modeled using Procrustes analysis, while facial expressions,
out-of-plane head rotations, and, to some extent, identity cannot.
Both the rigid and nonrigid transformations of facial points are important, as one can
be used to register the face and allow, for example, appearance analysis on a normal-
ized frontal face, while the other can be directly used to detect facial expressions. In a
statistical shape model, the space of all possible nonrigid transformations is typically
obtained from a training set of face shapes by first removing all rigid transformations
using generalized Procrustes analysis, and then applying principal component analysis
(PCA) over the resulting shapes.
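
A simplified sketch of this construction is given below; it uses a single-pass, mean-anchored alignment rather than full iterative generalized Procrustes analysis, and the number of retained modes is arbitrary.

```python
# Minimal sketch of building a statistical shape model: remove similarity
# transforms with a simplified Procrustes alignment, then apply PCA to the
# residual, nonrigid shape variation.
import numpy as np

def align_to(reference, shape):
    """Similarity-align `shape` (n x 2) to `reference` via Procrustes."""
    ref = reference - reference.mean(0)
    shp = shape - shape.mean(0)
    ref = ref / np.linalg.norm(ref)               # remove scale
    shp = shp / np.linalg.norm(shp)
    U, _, Vt = np.linalg.svd(ref.T @ shp)
    R = U @ Vt                                    # optimal 2-D rotation
    return shp @ R.T

def build_shape_model(shapes, n_modes=5):
    """`shapes`: list of (n_points x 2) arrays. Returns mean and PCA basis."""
    anchor = shapes[0] - shapes[0].mean(0)        # single-pass anchor shape
    aligned = np.array([align_to(anchor, s).ravel() for s in shapes])
    mean_vec = aligned.mean(0)
    U, S, Vt = np.linalg.svd(aligned - mean_vec, full_matrices=False)
    return mean_vec, Vt[:n_modes]                 # nonrigid modes of variation
```
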
A less common alternative is the use of morphable models. They use the same face
shape parameterization of PCA, however, the basis vectors that define the nonrigid face
shape transformations are heuristically defined (Ahlberg, 2001; Dornaika & Davoine,
2006). Shape variations due to identity and facial expressions are modeled separately,
and a 3-D shape model is used so head-pose information is part of the rigid transfor-
mation. One major benefit of this approach is that AUs are explicitly encoded in the
shape model and can thus be detected directly from the shape fitting. However, shapes
are not uniquely represented under this parameterization, as there may be different com-
binations of expression and identity parameters capable of expressing the same shape.
Intuitively, if the eyebrows are set particularly high on someone’s forehead, this might
be mistaken as the activation of AU1 and/or AU2, unless the particular physiognomy of
the subject is known in advance.
With graphical models, facial point detection is posed as a problem of minimizing the
graph energy. For example, Zhu and Ramanan (2012) use a tree to model the relative
position between connected points. Here convergence to the global maximum is guar-
anteed due to the absence of loops in the graph. Similarly, an MRF-based shape model
was proposed in Martinez et al. (2013) and Valstar et al. (2010), where the relative angle
and length ratio of the segments connecting pairs of points are modeled. The model is
therefore invariant to both scale and rotation. Graph-based shape models are usually
flexible, which can be beneficial, but sometimes leads to larger errors as the solution is
less constrained.
Facial point detection algorithms without an explicit shape model have recently been
proposed (Cao et al., 2012; Xiong & De la Torre, 2013). The predicted shape is always
a linear combination of training shapes, so that shape consistency is implicitly enforced.
A linear model might not be enough to approximate the space of all 2-D shapes in the
presence of all three modes of variation, that is, large head pose, identity, and facial
expressions. The use of 3-D shape models is a possible solution as it includes all head-
pose variation as part of the rigid motion.
Appearance models. The most common trends with respect to the way appearance
information is used include active appearance models (AAMs), active shape models
(ASM)/constrained local models (CLMs), and regression-based algorithms. (CLMs can
be considered a generalization of ASM [Saragih, Lucey, & Cohn, 2011].) AAMs try to
densely reconstruct the face appearance (Matthews & Baker, 2004). The facial points
are used to define a facial mesh, and the appearance variations of each triangle in the
mesh are modeled using PCA. The facial points are detected by finding the parameter
values that minimize the difference between the original image and the image recon-
structed by the AAM shape and appearance parameters. However, AAM appearance
models are often incapable of reconstructing generic (i.e., unknown) faces and have
traditionally reported lower precision than other methods under this setting. As a conse-
quence, it is common in practice to apply AAMs in person-specific scenarios (e.g., Zhu
et al., 2011).
In the ASM framework, the face appearance is represented as a constellation of
patches close to the facial points. An appearance representation (e.g., HOG [histogram
of oriented gradient] or LBP [local binary pattern] features) is extracted from patches
both centered at the target point and at locations in the point’s neighborhood. For each
point a classifier is trained that distinguishes between the true target location and its
surrounding locations.
Given an initial face shape estimate, each classifier is applied in a sliding window
manner in a region around the current point estimate, and the score of each evaluation
is used to build a response map. The aim is to find the valid shape that maximizes the
sum of individual responses. In order to apply an efficient gradient descent technique,
the response maps are approximated by a differentiable distribution. The construction
of the response maps and the shape fitting steps are alternated iteratively so that the
detection is refined at every iteration. An example of a well-optimized ASM is the work
by Milborrow and Nicolls (2008).
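
The response-map step can be sketched as follows; `patch_scorer` stands in for a trained per-landmark classifier (e.g., over HOG or LBP patches), and a complete system would additionally regularize the peaks with a shape model.

```python
# Sketch of the response-map step in ASM/CLM-style fitting. `patch_scorer`
# is a placeholder for a trained per-landmark classifier.
import numpy as np

def response_map(image, center, patch_scorer, radius=8, patch=11):
    """Score every candidate location in a (2*radius+1)^2 search window."""
    h = patch // 2
    cy, cx = center
    scores = np.full((2 * radius + 1, 2 * radius + 1), -np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = cy + dy, cx + dx
            if y - h < 0 or x - h < 0:
                continue                                   # outside the image
            window = image[y - h:y + h + 1, x - h:x + h + 1]
            if window.shape == (patch, patch):
                scores[dy + radius, dx + radius] = patch_scorer(window)
    return scores

def refine_point(image, center, patch_scorer, radius=8):
    """Move the landmark estimate to the peak of its response map."""
    scores = response_map(image, center, patch_scorer, radius)
    dy, dx = np.unravel_index(np.argmax(scores), scores.shape)
    return (center[0] + dy - radius, center[1] + dx - radius)
```
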
Alternatively, Saragih et al. (2011) proposed the constrained local model (CLM),
which uses a nonparametric distribution to approximate the response map. The result-
ing gradient descent shape fitting is substituted by a mean-shift algorithm. Although
the fitting is not very precise, it provides a good trade-off as it runs in real time and is
highly robust. An extension of the CLM was presented in
Asthana et al. (2013), where the authors proposed to substitute the mean-shift shape
fitting by a discriminative shape fitting strategy in order to avoid the convergence to
local maxima. This results in a much better performance in the presence of facial
expressions.
In direct displacement prediction based methods the appearance of local patches is
analyzed by a regressor instead of a classifier. More specifically, regressors are trained
to directly infer the displacement from the test location to the facial point location.
Although direct-displacement-based models are very recent, they are a dominating trend
and yield the best results to date (Cao et al., 2012; Cootes et al., 2012; Dantone et al.,
2012; Jaiswal, Almaev, & Valstar, 2013; Martinez et al., 2013; Xiong & De la Torre,
2013).
The use of random forest regression in combination with fern features is a com-
mon choice (e.g., Cao et al., 2012; Cootes et al., 2012; Dantone et al., 2012). This
results in very fast algorithms, ideal for low computational cost requirements. How-
ever, other regression methods, such as support vector regression, have been employed
(Jaiswal et al., 2013; Martinez et al., 2013; Valstar et al., 2010). Regression-based esti-
mates can be used to construct a response map as in classification-based CLM models
and then use a shape fitting strategy (Cootes et al., 2012; Jaiswal et al., 2013; Martinez
et al., 2013). Alternatively, a cascaded regression strategy has been proposed in the pop-
ular Supervised Descent Method (SDM), where regression is used to estimate the whole
shape at once, avoiding the use of an explicit shape model (Cao et al., 2012; Xiong &
De la Torre, 2013). Despite its simplicity, this results in an excellent and very robust
performance running in real time.
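
A bare-bones sketch of the cascaded regression test-time procedure is shown below; the regressors and the feature extractor are assumed to have been learned beforehand, and their exact form is hypothetical.

```python
# Sketch of cascaded regression in the spirit of the SDM: at each cascade
# level, features extracted around the current landmark estimate feed a
# linear regressor that predicts a shape update. Training is omitted.
import numpy as np

def cascaded_fit(image, initial_shape, cascade, extract_features):
    """`cascade`: list of (R, b) pairs; `initial_shape`: (n_points*2,) vector."""
    shape = initial_shape.copy()
    for R, b in cascade:                       # typically a handful of levels
        phi = extract_features(image, shape)   # e.g., SIFT/HOG around points
        shape = shape + R @ phi + b            # regress the shape increment
    return shape
```
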
All direct displacement prediction methods mentioned above suffer from the problem
that they have to sample a limited number of patches around the expected location of
the target facial point. This is sub-optimal in terms of accuracy but required to retain
sufficiently low latency. Continuous regression (Sánchez-Lozano et al., 2012) solves this
problem by locally describing the appearance of the face with a Taylor expansion, which
in turn allows one to analytically calculate the predictions from all locations in this area,
and integrate them into a single prediction. This was later extended to also work for
cascaded regression in a method called incremental Continuous Cascaded Regression
iCCR (Sánchez-Lozano et al., 2016).
iCCR also included an incremental learning step, which allows the tracker to deal
with non-frontal and expressive faces. It is as accurate or more accurate than the state of
the art (depending on the test set used), and its implementation in Matlab is very fast. It
is an order of magnitude faster than SDM, and its update process is 25 times faster than
other known methods for incrementally updating cascaded regressors (Asthana et al.,
2014; Xiong & De la Torre, 2013).
Remaining challenges include the robust handling of partial occlusions, tracking in
low resolution imagery, and being able to use information from multiple 2-D views.
Furthermore, a system is required that can efficiently detect tracking failures and recover
from them.

Feature Extraction

Feature extraction converts image pixel data into a higher-level representation of appear-
ance, motion, and/or the spatial arrangement of inner facial structures. It aims to reduce
the dimensionality of the input space, to minimize the variance in the data caused by
unwanted conditions, such as lighting, alignment errors, or (motion) blur, and to reduce
the sensitivity to contextual effects, such as identity and head pose. Here, we group the
feature extraction methods into four categories: geometry-based methods, appearance-
based methods, motion-based methods, and hybrid methods.

Table 11.2 Definition of basic geometric features.

LOC: Location of the facial fiducial landmarks
DIS: Euclidean distance between pairs of points
ANG: Angle defined by a set of points
DSP: Difference of LOC, DIS, and ANG relative to a neutral frame
RAT: Rate of change of static features in consecutive frames
POL: Polynomial approximation of point trajectory over time
Geometry-based Approaches
Most facial muscle activations result in the displacement of facial landmark points. For
example, facial actions can raise/lower the corner of the eyebrows or elongate/shorten
the mouth. Many early approaches were based on geometric features as they closely
match human intuition of face perception (Pantic & Rothkrantz, 2000).
Geometry-based features can either be computed from a set of facial fiducial land-
marks localized in a single frame, or can include trajectories of facial points over time.
Furthermore, a distinction between holistic and local features can be made. Holistic
geometric features are used, for example, in Kapoor, Qi, and Picard (2003), as the coef-
ficients of a shape represented using a statistical shape model are employed. Most other
works in the field use features derived from the fiducial landmark locations and, in par-
ticular, use a subset of the features described in Table 11.2.
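As a concrete illustration of the features in Table 11.2, the sketch below computes LOC, DIS, ANG, and DSP from an array of tracked landmark coordinates; the point pairs and triplets are arbitrary examples supplied by the caller, and RAT and POL, which need several frames, are omitted for brevity.

```python
import numpy as np

def geometric_features(points, neutral_points, pairs, triplets):
    """points, neutral_points: (num_points, 2) landmark arrays for the current and neutral frame."""
    loc = points.ravel()                                        # LOC
    dis = np.array([np.linalg.norm(points[i] - points[j])       # DIS
                    for i, j in pairs])
    ang = []
    for i, j, k in triplets:                                    # ANG, measured at vertex j
        v1, v2 = points[i] - points[j], points[k] - points[j]
        cos_a = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        ang.append(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    dsp = loc - neutral_points.ravel()                          # DSP relative to the neutral frame
    return np.concatenate([loc, dis, np.array(ang), dsp])
```

With a 68-point annotation scheme, for instance, pairs=[(48, 54)] would track mouth elongation; the indices are an assumption about the landmark ordering rather than a fixed convention.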
Geometric features are easily interpretable, allowing the definition of heuristics. This
is especially attractive for behavioral scientists who can use them to study the mean-
ing of expressions. Geometric features are also extremely computationally efficient,
once the facial landmarks have been tracked. It is in principle easier for geometry-
based approaches to deal with nonfrontal head poses in comparison to appearance-based
approaches, because there is no local appearance to nonlinearly warp to a frontal view.
Furthermore, geometry-based features are invariant to lighting conditions, provided that
the facial point tracking is successful. Some experiments have also shown that geometric features are especially well suited to certain AUs, particularly AU4 (brow lowerer) and
AU43 (eye closure) (Valstar & Pantic, 2012). Finally, the dynamics of facial expressions
can be easily captured by geometric features (Pantic & Patras, 2005, 2006; Valstar &
Pantic, 2012).
However, geometry-based features have a number of shortcomings. First of all, a
facial point tracker is required and the performance of the system depends on the
tracker’s accuracy and robustness. It is difficult to detect subtle AUs with geometry-
based features, as the magnitude of the tracking errors can be on a similar scale to the
displacements produced by a low-intensity AU activation. Most critically, only a sub-
set of AUs produce a discernible displacement of the facial points. For instance, AU6
(cheek raise), AU11 (nasolabial furrow deepener), AU14 (mouth corner dimpler), and
AU22 (lip funneler, as when pronouncing “flirt”) do not produce uniquely identifiable
face shapes in 2-D.

Appearance-based Approaches
Static Appearance-based Approaches
Static appearance features aim to capture texture patterns in a single image. We group
the different appearance features in the following categories: intensity, filter banks, bina-
rized local texture, gradient, and two-layer descriptors.
Image intensity. Once an image is properly registered, using raw pixel information is
a valid and arguably even the most appropriate appearance representation (e.g., Chew
et al., 2012; Lucey et al., 2011; Mahoor et al., 2009). Some experiments show that using
image intensity improves the performance of AU recognition compared to LBP features
if the inputs are head-pose-normalized face images (Chew et al., 2011).
It is important to note that the main weaknesses of using image intensities are their sensitivity to lighting conditions and registration errors. Therefore, image intensities
should only be used in scenarios with controlled lighting conditions, and they are not
expected to generalize well to less controlled scenarios.
Filter banks. Gabor wavelets are commonly used in the field of automatic AU analysis
as they can be sensitive to finer wave-like image structures, such as those corresponding
to wrinkles and bulges, provided that the frequency of the filters matches the size of the
image structures. If this is not the case (typically because the face image is too small),
Gabor filters will respond to coarser texture properties and miss valuable information.
Typically, only Gabor magnitudes are used as they are robust to misalignment (e.g.,
Bartlett et al., 2006; Mahoor et al., 2011; Savran, Sankur, & Bilge, 2012b).
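A minimal sketch of extracting Gabor-magnitude responses from a registered grey-level face patch is given below, assuming OpenCV is available; the kernel size, scales, wavelength, and number of orientations are illustrative choices, not those of the cited systems.

```python
import cv2
import numpy as np

def gabor_magnitudes(gray_face, ksize=17, sigmas=(2.0, 4.0), n_orient=4, lambd=8.0):
    face = gray_face.astype(np.float32)
    responses = []
    for sigma in sigmas:
        for t in range(n_orient):
            theta = t * np.pi / n_orient
            # Quadrature pair: even (psi=0) and odd (psi=pi/2) Gabor kernels
            k_even = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5, 0)
            k_odd = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5, np.pi / 2)
            even = cv2.filter2D(face, cv2.CV_32F, k_even)
            odd = cv2.filter2D(face, cv2.CV_32F, k_odd)
            responses.append(np.sqrt(even ** 2 + odd ** 2))     # magnitude: robust to small shifts
    return np.stack(responses)
```

As noted above, the wavelength lambd has to be matched to the size of the wrinkles and bulges in the registered face image for the responses to carry useful information.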
Less commonly used features within this group include Haar-like filters (Papageor-
giou, Oren, & Poggio, 1998; Whitehill & Omlin, 2006), which respond to coarser image
features, are robust to alignment errors, and are computationally very efficient. Haar fil-
ters are not responsive to the finer texture details, so their use should be limited to
detecting the most obvious AUs (e.g., AU12). The discrete cosine transform (DCT) fea-
tures encode texture frequency using predefined filters that depend on the patch size
(Ahmed, Natarajan, & Rao, 1974). DCTs are not sensitive to alignment errors and their
dimensionality is the same as that of the original image. However, higher-frequency coefficients are usually ignored, therefore potentially losing sensitivity to finer image structures such as wrinkles and bulges. DCTs have been used for automatic AU analysis in a holistic man-
ner in Gehrig and Ekenel (2011) and Kaltwang, Rudovic, & Pantic (2012), with the
former employing a block-based representation.
Binarized local texture. Local binary patterns (LBP) (Ojala, Pietikäinen, & Harwood,
1996) and local phase quantization (LPQ) (Ojansivu & Heikkilä, 2008) are rather pop-
ular in the field of machine analysis of AUs. They are usually applied in the following
manner: (1) real-valued measurements extracted from the image intensities are quan-
tized to increase robustness (especially to illumination conditions) and reduce intraclass
variability, (2) often histograms are used to increase the robustness to shifts at the cost
of some spatial information loss.
The LBP descriptor (Ojala et al., 1996) is constructed by considering, for each pixel,
an 8-bit vector that results from comparing its intensity against the intensity of each of
the neighboring pixels. A histogram is then computed, where each bin corresponds to
one of the different possible binary patterns, resulting in a 256-dimensional descriptor. However, most commonly the so-called uniform LBP is used. This results from elimi-
nating a number of pre-defined bins from the LBP histogram that do not encode strong
edges (Ojala, Pietikäinen, & Maenpaa, 2002). Many works successfully use LBP fea-
tures for automatic facial AU analysis. They are typically used in a block-based holistic
manner. Chen et al. (2013), Chew et al. (2011), Jiang et al. (2014), Jiang, Valstar, &
Pantic (2011), Smith & Windeatt (2011), and Wu et al. (2012) found 10 × 10 blocks to
be optimal in their case for uniform LBPs. The main advantages of LBP features are
their robustness to illumination changes, their computational simplicity, and their sen-
sitivity to local structures while remaining robust to shifts (Shan, Gong, & McOwan,
They are, however, not robust to rotations, and a correct normalization of the face
to an upright position is necessary. Many variants of the original LBP descriptor exist
and a review of LBP-based descriptors can be found in Huang, Shan, and Ardabilian
(2011).
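The sketch below illustrates the block-based uniform-LBP representation described above, using the local_binary_pattern routine from scikit-image; the 10 × 10 grid follows the setting reported in the works cited, while the radius and number of neighbours are common defaults assumed here.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def block_lbp_histograms(gray_face, grid=(10, 10), P=8, R=1):
    codes = local_binary_pattern(gray_face, P, R, method="uniform")
    n_bins = P + 2                          # P+1 uniform patterns plus one "non-uniform" bin
    h, w = codes.shape
    bh, bw = h // grid[0], w // grid[1]
    feats = []
    for by in range(grid[0]):
        for bx in range(grid[1]):
            block = codes[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist / max(hist.sum(), 1))             # per-block normalisation
    return np.concatenate(feats)
```

Concatenating per-block histograms keeps coarse spatial information (which block a pattern came from) while the histogramming inside each block provides the robustness to small shifts mentioned above.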
The LPQ descriptor (Ojansivu & Heikkilä, 2008) uses local phase information
extracted using 2-D short-term Fourier transform (STFT) computed over a rectangular
M-by-M neighborhood at each pixel position. It is robust to image blurring produced by
a point spread function. The phase information in the Fourier coefficient is quantized by
keeping the signs of the real and imaginary parts of each component. LPQs were used
for automatic AU analysis in Jiang et al. (2011, 2014).
Two-layer appearance descriptors. These features result from the application of two
feature descriptors, where the second descriptor is applied over the response of the first
one. For example, Senechal et al. (2012) and more recently Almaev and Valstar (2013)
used local Gabor binary pattern (LGBP), which results from first calculating Gabor
magnitudes over the image and then applying an LBP operator over the resulting multi-
ple Gabor pictures. Gabor features are applied first to capture less local structures (each
LBP pattern considers only a 3 × 3 patch), while the LBP operator increases the robust-
ness to misalignment and illumination changes and reduces the feature dimensionality.
Senechal et al. (2012) won the FERA2011 AU detection challenge with a combination
of LGBP and geometric features (Valstar et al., 2012). Similarly, Wu et al. (2012) used
two layers of Gabor features (G2) to encode image textures that go beyond edges and bars. They also compared single layer (LBP, Gabor) and dual layer (G2, LGBP) archi-
tectures for automatic AU detection and concluded that dual layer architectures provide
a small but consistent improvement.

Dynamic Appearance-based Approaches


A recent trend is the use of dynamic appearance descriptors, which encode both spatial
and temporal information. Therefore, dynamic appearance descriptors seem particu-
larly adequate to represent facial actions, as the very word “action” implies temporally
structured texture.
LBPs were extended to represent spatiotemporal volumes in Zhao and Pietikäinen
(2007). To make the approach computationally simple, a spatiotemporal volume is
described by computing LBP features only on three orthogonal planes (TOP): XY, XT,
and YT, to form the LBP-TOP descriptor (see Figure 11.4). The same extension was proposed for LPQ features (Jiang et al., 2011) and LGBP features (Almaev & Valstar, 2013).

Figure 11.4 Three planes in spatiotemporal domain to extract TOP features and the histogram concatenated from three planes.
In principle, dynamic features, being a generalization of their static counterparts,
result in more powerful representations. This has been shown in Almaev and Valstar
(2013) and Jiang et al. (2014), where the performance of LBP, LPQ, LGBP, and their
TOP extensions were evaluated for automatic AU detection. A significant and consis-
tent performance improvement has been shown when using spatiotemporal features:
compared to LBP, LBP-TOP attained a 9 percent increase in 2AFC score, LPQ-TOP
11 percent, and LGBP-TOP no less than 27 percent. While the contiguity of pixels in
the spatial plane is given by the image structure, temporal contiguity depends on the
face registration. Interestingly, TOP features have been shown to be less sensitive to
registration errors than their static counterparts (Almaev & Valstar, 2013).
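As a sketch of the TOP idea, the code below computes uniform-LBP histograms on the XY, XT, and YT planes of a small grey-level face volume and concatenates them (cf. Figure 11.4). Averaging per-slice histograms within each plane family is a simplification of the original LBP-TOP formulation, and the neighbourhood parameters are assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def _plane_histogram(planes, P=8, R=1):
    n_bins = P + 2
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane.astype(float), P, R, method="uniform")
        hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
        hists.append(hist / max(hist.sum(), 1))
    return np.mean(hists, axis=0)

def lbp_top(volume):
    """volume: (T, H, W) grey-level face volume, already registered frame to frame."""
    T, H, W = volume.shape
    xy = _plane_histogram([volume[t] for t in range(T)])        # spatial texture
    xt = _plane_histogram([volume[:, y, :] for y in range(H)])  # horizontal-temporal texture
    yt = _plane_histogram([volume[:, :, x] for x in range(W)])  # vertical-temporal texture
    return np.concatenate([xy, xt, yt])
```

The XT and YT histograms only carry meaningful information if the face registration keeps each pixel roughly on the same facial location across frames, which is the temporal-contiguity issue raised above.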

Motion-based Approaches
Motion features capture flexible deformations in the skin generated by the activation of
facial muscles. They are related to dense motion rather than to the motion of a discrete
set of facial landmarks. They are different from (dynamic) appearance features as they
do not capture texture but only its motion, so they would not respond to an active AU
if it is not undergoing any change (e.g., at the apex of an expression). We distinguish
two classes of motion-based features: those resulting from image subtraction and those
where a dense registration at the pixel level is required.
Image subtraction. A δ–image is defined as the difference between the current frame
and an expressionless-face frame of the same subject. This is usually combined with
linear manifold learning to eliminate the effect of noise; for example, Bartlett et al.
(1999), Bazzo and Lamar (2004), Donato et al. (1999), and Fasel and Luettin (2000)
combined the δ–images with techniques such as PCA or ICA. Alternatively, Bazzo and
Lamar (2004) and Donato et al. (1999) used Gabor features extracted over δ–images.
More recently, Kotsia, Zafeiriou, and Pitas (2008) and Savran et al. (2012a) combined
δ–images with variants of non-negative matrix factorization (NMF).
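A minimal sketch of the δ-image idea is shown below: the same subject's expressionless frame is subtracted and the differences are projected onto a PCA basis to suppress noise, in the spirit of the δ-image pipelines cited above; the number of retained components is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def delta_image_features(frames, neutral_frame, n_components=50):
    frames = np.asarray(frames, dtype=np.float32)               # (T, H, W) registered face frames
    deltas = frames.reshape(len(frames), -1) - np.asarray(neutral_frame, np.float32).ravel()
    k = min(n_components, len(deltas))                          # PCA cannot exceed the sample count
    pca = PCA(n_components=k)
    return pca.fit_transform(deltas)                            # low-dimensional δ-image codes
```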
Figure 11.5 Example of MHI and FFD techniques. (a) First frame. (b) Last frame. (c) MHI for the entire sequence. (d) The motion field sequence from the FFD method applied to a rectangular grid. (e) The motion field sequence from the FFD method applied to the first frame. (f) Difference between (b) and (e). Source: Koelstra, Pantic, and Patras (2010).

Motion history images (MHI) (Bobick and Davis, 2001) use image differences to summarize the motion over a number of frames. The motion at the current frame
is represented by bright pixels, while the pixels where motion was only detected in
past frames fade to black linearly with time. This was first applied to AU analysis in
Valstar, Pantic, and Patras (2004), where MHI summarized window-based chunks of
video. An extension of MHI-based representation was applied for automatic AU analy-
sis in Koelstra et al. (2010), where the authors approximate the motion field by finding
the closest nonstatic pixel. The authors claim that this results in a more dense and infor-
mative representation of the occurrence and the direction of motion. The main advan-
tage of MHI-based methods is that they are robust to the inter-sequence variations in
illumination and skin color.
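The sketch below implements this fading behaviour over a short window of grey-level frames: recently moving pixels are bright and older motion fades linearly towards black, as described above. The motion threshold is an illustrative assumption.

```python
import numpy as np

def motion_history_image(frames, threshold=15.0):
    frames = np.asarray(frames, dtype=np.float32)               # (T, H, W) grey-level frames
    T = len(frames)
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, T):
        moving = np.abs(frames[t] - frames[t - 1]) > threshold
        mhi[moving] = 1.0                                       # fresh motion -> brightest value
        mhi[~moving] = np.maximum(mhi[~moving] - 1.0 / T, 0.0)  # older motion fades linearly
    return mhi
```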
Nonrigid registration. Methods based on nonrigid image registration capture the
information in all image regions regarding the direction and intensity of the motion.
Motion estimates obtained by optical flow (OF) were considered as an alternative to
δ–images in earlier works (Donato et al., 1999; Lien et al., 2000). However, OF was
reportedly outperformed by δ–images.

Hybrid Approaches
Hybrid approaches are those that combine features of more than one type. Several
works investigate whether geometry-based features or appearance-based features are
more informative for automatic AU analysis (e.g., Valstar, Patras, and Pantic, 2005;
Zhang et al., 1998). However, both types convey complementary information and would
therefore be best used together. For example, the activation of AU11 (nasolabial furrow deepener), AU14 (dimpler), AU17 (chin raiser), and AU22 (lip funneler) is not apparent from movements of facial points but rather from changes in the face texture. Conversely, geometric features perform significantly better for other AUs. Experimental evidence consistently shows that combining geometry and appearance information is very beneficial (Hamm et al., 2011; Kotsia et al., 2008; Zhu et al., 2011); in particular, Senechal et al. (2011) won the FERA2011 AU detection challenge with hybrid features.
Combining appearance and geometric features is even more important when using head-
pose-normalized images (see section on appearance-based approaches).

Feature Learning
With the advent of Deep Learning, hand-crafted features such as those described above are being superseded by implicitly learned features in an increasing number of computer vision problems. In particular, the use of Convolutional Neural Networks (CNNs) and autoencoders has proven to beat the state of the art time and time again. Facial
expression recognition is no exception, and the top performance on facial expression
recognition challenges such as FERA 2015 is now reported by systems that learn fea-
tures. Jaiswal & Valstar claimed top performance by learning both static and dynamic
appearance and shape features using CNNs.
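To make the idea of learned features concrete, the sketch below defines a small CNN in PyTorch that maps a grey-level face crop to one activation logit per AU. The architecture, the 96 × 96 input size, and the multi-label sigmoid head are illustrative assumptions and not the FERA 2015 system of Jaiswal and Valstar.

```python
import torch.nn as nn

class AUNet(nn.Module):
    """Tiny CNN mapping a 1x96x96 grey face crop to one activation logit per AU."""
    def __init__(self, num_aus=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(6),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
            nn.Linear(256, num_aus),          # one logit per AU: multi-label, not multi-class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Each AU is an independent binary decision per frame, hence a per-AU sigmoid loss.
criterion = nn.BCEWithLogitsLoss()
```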

Machine Analysis of Facial Actions

In this section we review the different machine learning techniques used for automatic
AU analysis.

AU Activation Detection
AU activation detection aims to assign, for each AU, a binary label to each frame of
an unsegmented sequence indicating whether the AU is active or not. Therefore, frame-based AU detection is typically treated as a multiple binary classification problem where
a specific classifier is trained for each target AU. This reflects the fact that more than
one AU can be active at the same time, so AU combinations can be detected by simply
detecting the activation of each of the AUs involved. It is also important to take spe-
cial care when dealing with nonadditive AU combinations (see section on FACS); such
combinations need to be included in the training set for all of the AUs involved. An
alternative is to treat nonadditive combinations of AUs as independent classes (Tian, Kanade, & Cohn, 2001). That makes the patterns associated with each class more
homogeneous, boosting the classifier performance. However, more classifiers have to
be trained/evaluated, especially because the number of nonadditive AU combinations
is large. Finally, the problem can be treated as multiclass classification, where a single
multiclass classifier is used per AU. AU combinations (either additive or nonadditive)
are treated as separate classes, as only one class can be positive per frame, which makes
this approach only practical when a small set of AUs is targeted (Smith and Windeatt,
2011).
Common binary classifiers applied to the frame-based AU detection problem include
artificial neural networks (ANN), ensemble learning techniques, and support vector
machines (SVM). ANNs were the most popular method in earlier works (Bartlett et al.,
2006; Bazzo & Lamar, 2004; Donato et al., 1999; Fasel & Luettin, 2000; Smith &
Windeatt, 2011; Tian, Kanade, & Cohn, 2002). ANNs are hard to train: they typically involve many parameters, they are sensitive to initialization, the parameter optimization process can end up in local minima, and they are more prone to suffer from the curse of dimensionality, which is particularly problematic as data for AU analysis are scarce.
Some of the advantages of ANN, such as naturally handling multiclass problems or
multidimensional outputs, are of less importance in case of frame-based AU detection.
Ensemble learning algorithms, such as AdaBoost and GentleBoost, have been a com-
mon choice for AU activation detection (Hamm et al., 2011; Whitehill & Omlin, 2006;
Yang, Liu, & Metaxas, 2009; Zhu et al., 2011). Boosting algorithms are simple and
quick to train. They have fewer parameters than SVM or ANN, and are less prone to
overfitting. Furthermore, they implicitly perform feature selection, which is desirable
for handling high-dimensional data. However, they might not capture more complex
nonlinear patterns. SVMs are currently the most popular choice (e.g., Chew et al., 2012;
Gonzalez et al., 2011; Jiang et al., 2011; Mahoor et al., 2009; Wu et al., 2012; Yang,
Liu, & Metaxas, 2011), as they often outperform other algorithms for the target prob-
lem (Bartlett et al., 2006; Savran et al., 2012a). SVMs are nonlinear methods, parameter
optimization is relatively easy, efficient implementations are readily available, and the
choice of various kernel functions provides flexibility of design.
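The sketch below shows the multiple-binary-classification setting described above: one SVM per target AU, so that any combination of AUs can be reported simultaneously. Feature extraction is assumed to have been done already, and the linear kernel and class-balanced weighting are illustrative choices.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_per_au_svms(X, Y, C=1.0):
    """X: (frames, dims) features; Y: (frames, num_aus) binary activation matrix."""
    return [LinearSVC(C=C, class_weight="balanced").fit(X, Y[:, a])
            for a in range(Y.shape[1])]

def detect_aus(models, X):
    # One binary decision per AU and frame; several AUs may be active simultaneously.
    return np.column_stack([m.predict(X) for m in models])
```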
Temporal consistency. AU detection is by nature a structured problem as, for exam-
ple, the label of the current frame is more likely to be active if the preceding frame is
also labeled active. Considering the problem to be structured in the temporal domain
is often referred to as enforcing temporal consistency. Graphical models are the most
common approach to attain this. For example, in Valstar et al. (2007) the authors used a
modification of the classical hidden Markov models (see Figure 11.6). In particular, they
substituted the generative model that relates a hidden variable and an observation with a
discriminative classifier. In terms of graph topology, this consists of inverting the direc-
tion of the arrow relating the two nodes, and results in a model similar to a maximum
entropy Markov model (McCallum, Freitag, & Pereira, 2000; see Figure 11.6).

Figure 11.6 Graphical illustration of (a) hidden Markov model, (b) maximum entropy Markov model, (c) conditional random field, and (d) hidden conditional random field. X denotes the observation sequence, Z the hidden variables, and Y the class label.

Van der Maaten and Hendriks (2012) apply a conditional random field (CRF; its topology is shown in Figure 11.6). This model represents the relations between variables as undirected edges, and the associated potentials are discriminatively trained. In the simplest CRF formulation, the label assigned to a given frame depends on contiguous labels, that is, it is conditioned on the immediate future and past observations. Van der
Maaten and Hendriks (2012) trained one CRF per AU, and each frame was associated with
a node within the graph. The state of such nodes is a binary variable indicating AU acti-
vation. In Chang, Liu, and Lai (2009) the authors use a modified version of the hidden
conditional random field (HCRF) (see Figure 11.6), where the sequence is assumed to
start and end with known AU activation labels. The hidden variables represent the pos-
sible AU activations, while the labels to be inferred correspond to prototypical facial
expressions.
Structured-output SVM (Tsochantaridis et al., 2005) is an alternative to graphs
for structured prediction. Simon et al. (2010) proposed a segment-based classifica-
tion approach, coined kSeg-SVM, that incorporates temporal consistency through the
structured-output SVM framework. In consequence, the relations of temporal consis-
tency between the output labels are incorporated within the loss function used to train
the SVM classifier. The authors compare their method with standard SVM, showing a
moderate performance increase. They omit, however, a comparison with CRF. For the
case of binary problems, both methods seem equally suitable a priori, as they code the
same relations using similar models.
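The simplest way to make temporal consistency concrete is a two-state first-order Markov chain decoded with the Viterbi algorithm over per-frame activation probabilities, as sketched below. The self-transition probability is an illustrative assumption, and the cited works use richer MEMM, CRF, or structured-output SVM models rather than this plain smoother.

```python
import numpy as np

def viterbi_smooth(frame_probs, p_stay=0.95):
    """frame_probs: (T, 2) per-frame probabilities for [inactive, active] from any classifier."""
    T = len(frame_probs)
    log_trans = np.log(np.array([[p_stay, 1 - p_stay],
                                 [1 - p_stay, p_stay]]))
    log_obs = np.log(np.clip(frame_probs, 1e-8, 1.0))
    delta = np.zeros((T, 2))
    back = np.zeros((T, 2), dtype=int)
    delta[0] = log_obs[0]                              # uniform prior over the first frame
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans     # rows: previous state, columns: next state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                     # backtrack the most likely label sequence
        path[t] = back[t + 1, path[t + 1]]
    return path                                        # smoothed 0/1 activation labels
```

A high self-transition probability simply encodes the observation made above that a frame is more likely to be active when the preceding frame is also active.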
Unsupervised detection of facial events. In order to avoid the problem of lack of
training data, which impedes development of robust and highly effective approaches to
machine analysis of AUs, some recent efforts focus on unsupervised approaches to the
target problem. The aim is to segment a previously unsegmented input sequence into
relevant “facial events,” but without the use of labels during training (De la Torre et al.,
2007; Zhou, De la Torre, & Cohn, 2010). The facial events might not be coincident
with AUs, although some correlation with them is to be expected, as AUs are distinctive
spatiotemporal events. A clustering algorithm is used in these works to group spatiotem-
poral events of similar characteristics. Furthermore, a dynamic time alignment kernel is
used in De la Torre et al. (2010) to normalize the facial events in terms of the speed of
the facial action. Despite its interesting theoretical aspects, unsupervised learning traditionally trails behind supervised learning in performance, even when only small training
sets are available. A semi-supervised learning setting might offer much better perfor-
mance as it uses all the annotated data together with potentially useful unannotated data.
Transfer learning. Transfer learning methodologies are applied when there is a signif-
icant difference between the distribution of the learning data and the test data. In these
situations, the decision boundaries learned on the training data might be suboptimal for
the test data. Transfer learning encompasses a wide range of techniques designed to deal with these cases (Pan & Yang, 2010). They have only very recently been applied to automatic AU analysis. For example, Chu, De la Torre, and Cohn (2013) proposed a new trans-
ductive learning method, referred to as a selective transfer machine (STM). Because
of its transductive nature, no labels are required for the test subject. At test time, a
weight for each training example is computed that maximizes the match between the
weighted distribution of training examples and the test distribution. Inference is then
performed using the weighted distribution. The authors obtained a remarkable perfor-
mance increase, beating subject-specific models. This can be explained by the reduced
availability of subject-specific training examples.
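The sketch below illustrates the general transductive reweighting idea: training frames that look like the unlabelled test subject receive larger weights, and a weighted SVM is retrained. The domain-classifier weighting used here is a generic covariate-shift heuristic and not the actual STM optimisation of Chu, De la Torre, and Cohn (2013).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def personalised_au_classifier(X_train, y_train, X_test, C=1.0):
    # 1. A domain classifier separates training frames from (unlabelled) test-subject frames.
    X_dom = np.vstack([X_train, X_test])
    y_dom = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    dom = LogisticRegression(max_iter=1000).fit(X_dom, y_dom)
    # 2. Importance weights ~ p(test | x) / p(train | x): "test-like" training frames count more.
    p_test = dom.predict_proba(X_train)[:, 1]
    weights = p_test / np.clip(1.0 - p_test, 1e-3, None)
    # 3. Retrain the AU detector on the reweighted training distribution.
    return SVC(C=C, kernel="rbf", gamma="scale").fit(X_train, y_train, sample_weight=weights)
```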
Both transfer learning and unsupervised learning are promising approaches when
it comes to AU analysis. Appearance variations due to identity are often larger than
expression-related variations. This is aggravated by the high cost of AU annotation and
the low number of subjects present in the AU datasets. Therefore, techniques that can
capture subject-specific knowledge and transfer it at test time to unseen subjects without
the need for additional manual annotation are very suited for AU analysis. Similarly,
unsupervised learning can be used to capture appearance variations caused by facial
expressions without the need for arduous manual labeling of AUs. Both transfer learning and unsupervised learning thus have great potential to improve machine analysis of AUs with limited labeled data.

Analysis of AU Temporal Dynamics


As explained in the facial action coding system (FACS) section, the dynamics of facial actions
are crucial for distinguishing between various types of behavior (e.g., pain and mood).
The aim of AU temporal segment detection is to assign a per-frame label belonging to
one of four classes: neutral, onset, apex or offset (see the section on FACS for their def-
inition). It constitutes an analysis of the internal dynamics of an AU episode. Temporal
segments add important information for the detection of a full AU activation episode as
all labels should occur in a specific order. Furthermore, the AU temporal segments have
been shown to carry important semantic information, useful for a later interpretation of
the facial signals (Cohn & Schmidt, 2004; Ambadar et al., 2005).
Temporal segment detection is a multiclass problem, and it is typically addressed by
either using a multiclass classifier or by combining the output of several binary classi-
fiers. Some early works used a set of heuristic rules per AU based on facial point loca-
tions (Pantic & Patras, 2004, 2005), while further rules to improve the temporal consis-
tency of the label assigned were defined in Pantic & Patras (2006). In Valstar and Pantic
(2012), a set of one-versus-one binary SVMs (i.e., six classifiers) were trained, and a
majority vote was used to decide on the label. Similarly, Koelstra et al. (2010) trained
GentleBoost classifiers specialized for each AU and each temporal segment character-
ized by motion (i.e., onset and offset).
Graphical models (detailed in the section on AU activation detection) can be adapted
to this problem to impose temporal label consistency by setting the number of states
of the hidden variables to four. The practical difference with respect to the AU acti-
vation problem is that the transitions are more informative as, for example, an onset
frame should be followed by an apex frame and cannot be followed by a neutral frame.
Markov models were applied to this problem in Koelstra et al. (2010) and Valstar and
Pantic (2012). An extension of CRF, and in particular a kernelized version of condi-
tional ordinal random fields, was used instead in Rudovic, Pavlovic, and Pantic (2012).
In comparison to standard CRF, this model imposes ordinal constraints on the assigned
labels. It is important to note that distinguishing an apex frame from the end of an onset frame or the beginning of an offset frame by its texture alone is impossible. Apex frames are not characterized by a specific facial appearance or configuration, but rather by being the most intense activation within an episode, which is by nature an ordinal relation.
While traditional classification methodologies can be readily applied to this problem,
they produce suboptimal performance as it is often impossible to distinguish between
the patterns associated with the different temporal segments at a frame level. There-
fore, the use of temporal information, both at the feature level and through the use of
graphical models, is the most adequate design. In particular, the use of graphical mod-
els has been shown to produce a large performance improvement, even when simpler
methods like Markov chains are applied (Jiang et al., 2014; Koelstra et al., 2010). The
use of CRFs, however, allows the per-frame classifier and the temporal consistency to be optimized jointly, while the use of ordinal relationships within the graphical model adds information particularly suited to the analysis of the AU temporal segments.
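The ordering constraint discussed above can be written down as a Markov transition matrix over the four temporal segments, as in the sketch below; the zero entries forbid transitions such as onset directly to neutral, and the numeric values themselves are illustrative assumptions.

```python
import numpy as np

SEGMENTS = ["neutral", "onset", "apex", "offset"]

# Allowed transitions (rows: current segment, columns: next segment); zeros forbid a transition.
TRANSITIONS = np.array([
    [0.95, 0.05, 0.00, 0.00],   # neutral -> neutral or onset
    [0.00, 0.90, 0.10, 0.00],   # onset   -> onset or apex
    [0.00, 0.00, 0.90, 0.10],   # apex    -> apex or offset
    [0.10, 0.00, 0.00, 0.90],   # offset  -> offset or neutral
])
```

Decoding can then run the same Viterbi recursion as in the activation-smoothing sketch above, only over these four states and in log-space, where forbidden transitions receive an effectively infinite penalty.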
When it comes to automatic analysis of temporal co-occurrences of AUs, the relations
between AU episodes are studied, both in terms of co-occurrences and in terms of the
temporal correlation between the episodes. To this end, Tong et al. (2007) modeled the
relationships between different AUs at a given time frame by using a static Bayesian net-
work. The temporal modeling (when an AU precedes another) is incorporated through
the use of a dynamic Bayesian network (DBN). They further introduced a unified prob-
abilistic model for the interactions between AUs and other nonverbal cues such as head
pose (Tong, Chen, & Ji, 2010). The same group later argued that the use of prior knowl-
edge instead of relations learned from data helps to generalize to new datasets (Li
et al., 2013). Although traditionally unexploited, this is a natural and useful source of
information as it is well known that some AUs co-occur with more frequency (see the
section on FACS) due to latent variables such as, for example, prototypical facial expres-
sions. In particular, graph-based methodologies can readily incorporate these relations.
However, it is necessary to explore the generalization power of these models as they are
likely to have a strong dependency on the AU combinations present in the dataset used
to generate the networks.

AU Intensity Estimation
Annotations of intensity are typically quantized into A, B, C, D, and E levels as stipu-
lated in the FACS manual. Some approaches use the confidence of the classification to
estimate the AU intensity under the rationale that the lower the intensity is, the harder
the classification will be. For example, Bartlett et al. (2006) estimated the intensity of
action units by using the distance of a test example to the SVM separating hyperplane,
while Hamm et al. (2011) used the confidence of the decision obtained from AdaBoost.
Multiclass classifiers or regressors are more natural choices for this problem. It is
important to note, however, that the class overlap is very large for this problem. There-
fore, the direct application of a multiclass classifier is unlikely to perform well, and typically performs worse than a regressor. That is to say, for regression, predict-
ing B instead of A yields a lower error than predicting D, while for a classifier this
yields the same error. An attempt at using a multiclass classifier for this task is pre-
sented in Mahoor et al. (2009). The authors employed six one-versus-all binary SVM
classifiers, corresponding to either no activation or one of the five intensity levels. The
use of a regressor has been a more popular choice. For example, Jeni et al. (2013) and
Savran, Sankur, and Bilge (2012b) applied support vector regression (SVR) for predic-
tion, while Kaltwang et al. (2012) used relevance vector regression (RVR) instead. Both
methods, SVR and RVR, are extensions to regression of SVM, although RVR yields a
probabilistic output.
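The sketch below shows the regression formulation with support vector regression, in the spirit of the SVR-based works cited above. Mapping the ordinal FACS levels (with "0" standing for "not active") to the integers 0-5 and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVR

LEVELS = ["0", "A", "B", "C", "D", "E"]          # "0" stands for "not active"

def train_intensity_regressor(X, intensity_labels, C=1.0, epsilon=0.2):
    y = np.array([LEVELS.index(l) for l in intensity_labels], dtype=float)
    return SVR(kernel="rbf", C=C, epsilon=epsilon).fit(X, y)

def predict_intensity(model, X):
    # Round the continuous prediction back onto the ordinal scale; nearby errors cost less.
    y_hat = np.clip(np.rint(model.predict(X)), 0, len(LEVELS) - 1).astype(int)
    return [LEVELS[i] for i in y_hat]
```

Because the loss of a regressor grows with the distance between the predicted and true levels, it naturally penalizes predicting B instead of A less than predicting D, which is exactly the property argued for above.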
AU intensity estimation is a relatively recent problem within the field. It is of particu-
lar interest due to the semantic richness of the predictions. However, it is not possible to
objectively define rules for the annotation of AU intensities, and even experienced man-
ual coders will have some level of disagreement. Therefore, the large amount of overlap
between the classes should be taken into consideration. Regression methodologies are
particularly suited, as they penalize a close (but different) prediction less than distant
ones. Alternatively, ordinal relations can alleviate this problem by substituting the hard
label assignment for softer ones (e.g., greater than). There is also a large degree of data imbalance, as high-intensity AUs are much less common.

Data and Databases

The need for large, AU labeled, publicly available databases for training, evaluating, and
benchmarking has been widely acknowledged and a number of efforts to address this
need have been made. In principle, any facial expression database can be extended with
AU annotation. However, due to the annotation process being very time-consuming, only a limited number of facial expression databases are FACS annotated, and even fewer are publicly available. Table 11.3 summarizes some details of all freely available FACS-coded databases.

Table 11.3 FACS-annotated facial expression databases. Elicitation method: on command (C), acted (A), induced (I), or natural (N). Size: number of subjects. Camera view: frontal/profile/3-D. S/D: static (image) or dynamic (video) data. Act: AU activation annotation (number of AUs annotated). oao: onset/apex/offset annotation. Int: intensity annotation.

Database Elicit. Subjects 2/3-D S/D Act oao Int

Cohn-Kanade (Kanade, Cohn, & Tian, 2000) C 97 2-D D Full Y N
Cohn-Kanade+ (Lucey et al., 2010) N 26 2-D D 8 N N
MMI (Part I-III) (Pantic et al., 2005) C 210 2-D SD Full Y N
MMI (Part IV-V) (Valstar & Pantic, 2010) I 25 2-D D Full N N
ISL Frontal (Tong et al., 2007) C 10 2-D D 14 Y N
ISL Multi-view (Tong et al., 2010) C 8 2-D D 15 Y N
SAL (Douglas-Cowie et al., 2008) I 20 2-D D 10 Y N
SEMAINE (McKeown et al., 2012) I 150 2-D D Y N N
GEMEP-FERA (Valstar et al., 2011) A 10 2-D D 12 N N
UNBC-McMaster (Lucey et al., 2011) I 129 2-D D 10 N Y
DISFA (Mavadati et al., 2012) I 27 2-D D 12 N Y
AM-FED (McDuff et al., 2013) I N/A 2-D D 10 N N
Bosphorus (Savran et al., 2008) C 105 3-D S 25 N Y
ICT-3-DRFE (Stratou et al., 2011) C 23 3-D S Full N Y
D3-DFACS (Cosker, Krumhuber, & Hilton, 2011) C 10 3-D D Full N N
BU-4DSP (Zhang et al., 2013) I 41 3-D D 27 N N
Posed expressions databases are usually restricted to convey a single specific emo-
tion/AU per sequence, typically with exaggerated individual features. These expressions
are easier to collect and also easier to classify. In the early stages of research into auto-
matic facial expression analysis, most systems were developed and evaluated on posed
expressions, collected under homogeneous illumination and frontal still head pose, and
on a relatively small number of participants of fairly homogeneous groups with respect
to age and ethnicity.
In comparison to deliberately displayed facial expressions, spontaneous expressions
involve higher frequency and larger amplitude of out-of-plane head movements, sub-
tler expressions, and subtle transitions to and from the onset and offset phases. Taking
into account the differences in appearance and timing between spontaneous and posed
expressions, it is unsurprising that approaches trained on posed databases fail to gener-
alize to the complexity of real-world scenarios (Pantic, 2009).
A number of databases suitable for 3-D facial expression analysis have appeared since
2003, including BU-3-DFE, BU-4DFE, Bosphorus, ICT-3-DRFE, and the recently
introduced D3-DFACS. In addition, the first 3-D dynamic database containing spon-
taneous expressions was released, which for simplicity we will abbreviate as BU-4DSP.
To the best of our knowledge, of these databases only Bosphorus (Savran et al., 2008),
D3-DFACS (Cosker et al., 2011), ICT-3-DRFE (Stratou et al., 2011), and BU-4DSP
(Zhang et al., 2013) contain AU annotations.

Challenges and Opportunities

Although the main focus in machine analysis of AUs has shifted to the analysis of
spontaneous expressions, state-of-the-art methods cannot be used in the wild effec-
tively. Challenges preventing this include handling occlusions, nonfrontal head poses,
co-occurring AUs and speech, varying illumination conditions, and the detection of
low intensity AUs. Lack of data is another nagging factor impeding progress in the
field.
Nonfrontal head poses occur frequently in naturalistic settings. Due to the scarcity of annotated data, building view-specific appearance-based approaches for auto-
matic AU analysis is impractical. The existence of 3-D databases may ease this prob-
lem, although rendering examples of AU activations at multiple poses is challenging
as it involves simulating realistic photometric variance. Using head-pose-normalized
images for learning and inference is a more feasible alternative. However, many chal-
lenges are associated with this approach. For example, the learning algorithms should be
able to cope with partially corrupted data resulting from self-occlusions. More impor-
tantly, head-pose normalization while preserving facial expression changes is still an
open problem that needs to be addressed.
AUs rarely appear in isolation during spontaneous facial behavior, yet co-occurrences
of AUs become much harder to model in the presence of nonadditive AUs (see the
section on FACS). Treating these combinations as new independent classes (Mahoor
et al., 2011) is impractical given the number of such nonadditive AU combinations.
On the other hand, when treating each AU as a single class, the presence of nonaddi-
tive combinations of AUs increases the intraclass variability, potentially reducing per-
formance (Jiang et al., 2011). Also, the limited number of co-occurrence examples
in existing AU-coded databases makes this problem really difficult. Hence, there
are only two ways forward: either model the “semantics” of facial behavior, that is,
temporal co-occurrences of AUs, or use a combination of unsupervised learning and
supervised learning where unsupervised learning is applied to very large amounts of
unlabeled data to learn all possible appearance and shape exemplars and supervised
learning is used on top of this to identify which exemplars can be linked to specific
AUs.
At the time of writing, the Deep Learning revolution is starting to make a mark on
automatic FACS detection. While only a few reports of this have been made to date, it
is expected that this technique will also bring a significant performance boost to facial
expression recognition, and it is not unlikely that it will solve many of the outstanding
issues in FACS analysis.
While the importance of facial intensities and facial dynamics for the interpretation of
facial behavior has been stressed in the field of psychology, it has received limited atten-
tion from the computer science community. The detection of AU temporal segments and
the estimation of their intensities are unsolved problems. There is some degree of class
overlap due to unavoidable labeler noise and unclear specifications of the class bound-
aries. Clearer annotation criteria to label intensity in a continuous real-valued scale may
alleviate this issue. Building tools to improve performance in the presence of inter-
labeler disagreement would remain important.
All AU-coded databases suffer from various limitations, the most important being the
lack of realistic illumination conditions and naturalistic head movements. This might
mean that the field is driving itself into algorithmic local maxima (Whitehill and Omlin,
2006). Creating publicly available “in-the-wild” datasets would be a major contribution.
The absence of an adequate widely used benchmark has also been a detrimental fac-
tor for the evolution of the field. The facial expression and analysis challenge (FERA),
organized in 2011, was the very first such attempt (Valstar et al., 2011, 2012). A proto-
col was set in Valstar et al. (2011) where the training and testing sets were predefined
and a performance metric was defined. The extended CK+ database has a similar func-
tion (Lucey et al., 2010). Reporting performance of proposed methodologies on these
databases should be encouraged and other benchmarks with different properties (e.g.,
in the wild conditions) are needed. Furthermore, the inclusion of cross-database exper-
iments in the benchmarking protocol should be enabled.
Building personalized models using online and transfer learning methodologies
(Chen et al., 2013; Chu et al., 2013) is the way forward in our opinion. This is because
of several reasons, such as the lack of training data, the large subject differences, and the
dependency of the displayed expressions, on a large number of factors, such as the envi-
ronment, the task, or the mood; all aspects which would be hard to cover exhaustively
even if a much larger amount of training data was available.
Low intensity AUs might be of special importance for situations where the subject is
intentionally controlling their facial behavior. Scenarios such as deceit detection would
benefit greatly from the detection of subtle facial movements. The first research question
relates to finding features that capture such subtle changes (Pfister et al., 2011).
Existing work deals mostly with classification/processing of the currently observed
facial expressive behavior. Being able to model the behavior typical for an individual
and use this to predict the subject’s future behavior given the current observations would
be of major interest. This is a novel problem that can be seen as a long-term aim in the
field.
Another impediment to the progress of the field is that very few fully automatic real-
time systems for automatic AU analysis with state-of-the-art accuracy are publicly avail-
able. This is necessary both for the reproduction of the published results and to allow
social scientists to use the tools. The computer expression recognition toolbox (CERT)
(Littlewort et al., 2011), followed up by FACET, are the only fully automatic, real-time
software tools. Other publicly available tools are the LBP-based action unit detection
(LAUD) and LPQ-TOP-based action unit detection (TAUD) (Jiang et al., 2011). By
using these tools, fourteen AUs can be automatically detected from static images and
videos. However, these tools do not run in real time.
Overall, although major progress in machine recognition of AUs has been made over the past years, this field of research is still underdeveloped and many problems are still open and awaiting research. Attaining fully automatic AU recognition in the wild would open up tremendous potential for new applications in games, security, and
health industries and investing in this field is therefore worth all the effort. We hope that
this chapter will provide a set of helpful guidelines to all those carrying out the research
in the field now and in the future.

References

Ahlberg, J. (2001). Candide-3 – an updated parameterised face. Technical report, Linköping University, Sweden.
Ahmed, N., Natarajan, T., & Rao, K. R. (1974). Discrete cosine transform. IEEE Transactions on
Computers, 23, 90–93.
Almaev, T. & Valstar, M. (2013). Local gabor binary patterns from three orthogonal planes for
automatic facial expression recognition. In Humaine Association Conference on Affective Com-
puting and Intelligent Interaction (ACII), September 2–5, Geneva.
Ambadar, Z., Cohn, J. F., & Reed, L. I. (2009). All smiles are not created equal: Morphology and
timing of smiles perceived as amused, polite, and embarrassed/nervous. Journal of Nonverbal
Behavior, 33, 17–34.
Ambadar, Z., Schooler, J. W., & Cohn, J. F. (2005). Deciphering the enigmatic face: The impor-
tance of facial dynamics in interpreting subtle facial expressions. Psychological Science, 16(5),
403–410.
Asthana, A., Cheng, S., Zafeiriou, S., & Pantic, M. (2013). Robust discriminative response map
fitting with constrained local models. In Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, June 26–28, Portland, OR.
Asthana, A., Zafeiriou, S., Cheng, S., & Pantic, M. (2014). Incremental face alignment in the
wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(pp. 1859–1866).
Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Measuring facial expressions
by computer image analysis. Psychophysiology, 36(2), 253–263.
Bartlett, M. S., Littlewort, G., Frank, M., et al. (2006). Automatic recognition of facial actions in
spontaneous expressions. Journal of Multimedia, 1(6), 22–35.
Bartlett, M. S., Viola, P. A., Sejnowski, T. J., et al. (1996). Classifying facial actions. In D. S.
Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds), Advances in Neural Information Process-
ing Systems 8 (pp. 823–829). Cambridge, MA: MIT Press.
Bazzo, J. & Lamar, M. (2004). Recognizing facial actions using Gabor wavelets with neutral face
average difference. In Proceedings of Sixth IEEE International Conference on Automatic Face
and Gesture Recognition, May 19, Seoul (pp. 505–510).
Bobick, A. F. & Davis, J. W. (2001). The recognition of human movement using temporal
templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), 257–
267.
Cao, X., Wei, Y., Wen, F., & Sun, J. (2012). Face alignment by explicit shape regression. In Pro-
ceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recogni-
tion, June 16–21, Providence, RI (pp. 2887–2894).
Chang, C.-C. & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Trans-
actions on Intelligent Systems and Technology, 2(3), 27:1–27:27.
Chang, K., Liu, T., & Lai, S. (2009). Learning partially observed hidden conditional random fields
for facial expression recognition. In Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, June 20–25, Miami, FL (pp. 533–540).
Chen, J., Liu, X., Tu, P., & Aragones, A. (2013). Learning person-specific models for facial
expressions and action unit recognition. Pattern Recognition Letters, 34(15), 1964–1970.
Chew, S. W., Lucey, P., Lucey, S., et al. (2011). Person-independent facial expression detection
using constrained local models. In Proceedings of the IEEE International Conference on Auto-
matic Face and Gesture Recognition, March 21–25, Santa Barbara, CA (pp. 915–920).
Chew, S. W., Lucey, P., Saragih, S., Cohn, J. F., & Sridharan, S. (2012). In the pursuit of effective
affective computing: The relationship between features and registration. IEEE Transactions on
Systems, Man and Cybernetics, Part B: Cybernetics, 42(4), 1006–1016.
Chu, W., Torre, F. D. L., & Cohn, J. F. (2013). Selective transfer machine for personalized facial
action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, June 23–28, Portland, OR.
Cohn, J. F. & Schmidt, K. L. (2004). The timing of facial motion in posed and spontaneous smiles.
International Journal of Wavelets, Multiresolution and Information Processing, 2(2), 121–132.
Cootes, T., Ionita, M., Lindner, C., & Sauer, P. (2012). Robust and accurate shape model fit-
ting using random forest regression voting. In 12th European Conference on Computer Vision,
October 7–13, Florence, Italy.
Cootes, T. & Taylor, C. (2004). Statistical models of appearance for computer vision. Technical
report, University of Manchester.
Cosker, D., Krumhuber, E., & Hilton, A. (2011). A FACS valid 3-D dynamic action unit database
with applications to 3-D dynamic morphable facial modeling. In Proceedings of the IEEE
International Conference on Computer Vision, November 6–11, Barcelona (pp. 2296–2303).
Costa, M., Dinsbach, W., Manstead, A. S. R., & Bitti, P. E. R. (2001). Social presence, embarrass-
ment, and nonverbal behavior. Journal of Nonverbal Behavior, 25(4), 225–240.
Dantone, M., Gall, J., Fanelli, G., & Gool, L. J. V. (2012). Real-time facial feature detection using
conditional regression forests. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, June 16–21, Providence, RI (pp. 2578–2585).
Darwin, C. (1872). The Expression of the Emotions in Man and Animals. London: John Murray.
De la Torre, F., Campoy, J., Ambadar, Z., & Cohn, J. F. (2007). Temporal segmentation of facial
behavior. In Proceedings of the IEEE International Conference on Computer Vision, October
14–21, Rio de Janeiro (pp. 1–8).
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial
actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10), 974–989.
Dornaika, F. & Davoine, F. (2006). On appearance based face and facial action tracking. IEEE
Transactions on Circuits and Systems for Video Technology, 16(9), 1107–1124.
Douglas-Cowie, E., Cowie, R., Cox, C., Amier, N., & Heylen, D. (2008). The sensitive arti-
ficial listener: An induction technique for generating emotionally coloured conversation. In
LREC Workshop on Corpora for Research on Emotion and Affect, May 26, Marrakech, Morocco (pp. 1–4).
Ekman, P. (2003). Darwin, deception, and facial expression. Annals of the New York Academy of
Sciences, 1000, 205–221.
Ekman, P. & Friesen, W. V. (1978). Facial Action Coding System: A Technique for the Measure-
ment of Facial Movement. Palo Alto, CA: Consulting Psychologists Press.
Ekman, P., Friesen, W. V., & Hager, J. C. (2002). Facial Action Coding System. Salt Lake City,
UT: Human Face.
Ekman, P. & Rosenberg, E. L. (2005). What the Face Reveals: Basic and Applied Studies of
Spontaneous Expression Using the Facial Action Coding System. Oxford: Oxford University
Press.
Fasel, B. & Luettin, J. (2000). Recognition of asymmetric facial action unit activities and inten-
sities. In Proceedings of the 15th International Conference on Pattern Recognition, September
3–7, Barcelona (pp. 1100–1103).
Frank, M. G. & Ekman, P. (1997). The ability to detect deceit generalizes across different types
of high-stakes lies. Journal of Personality and Social Psychology, 72(6), 1429–1439.
Frank, M. G. & Ekman, P. (2004). Appearing truthful generalizes across different deception situ-
ations. Journal of Personality and Social Psychology, 86, 486–495.
Frank, M. G., Ekman, P., & Friesen, W. V. (1993). Behavioral markers and recognizability of the
smile of enjoyment. Journal of Personality and Social Psychology, 64(1), 83–93.
Gehrig, T. & Ekenel, H. K. (2011). Facial action unit detection using kernel partial least squares.
In Proceedings of the IEEE International Conference Computer Vision Workshops, November
6–13, Barcelona (pp. 2092–2099).
Gill, D., Garrod, O., Jack, R., & Schyns, P. (2012). From facial gesture to social judgment: A
psychophysical approach. Journal of Nonverbal Behavior, 3(6), 395.
Girard, J. M., Cohn, J. F., Mahoor, M. H., Mavadati, S. M., & Rosenwald, D. P. (2013). Social risk
and depression: Evidence from manual and automatic facial expression analysis. In Proceed-
ings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture
Recognition, April 22–26, Shanghai.
Gonzalez, I., Sahli, H., Enescu, V., & Verhelst, W. (2011). Context-independent facial action
unit recognition using shape and Gabor phase information. In Proceedings of the International
Conference on Affective Computing and Intelligent Interaction, October 9–12, Memphis, TN
(pp. 548–557).
Hamm, J., Kohler, C. G., Gur, R. C., & Verma, R. (2011). Automated facial action coding system
for dynamic analysis of facial expressions in neuropsychiatric disorders. Journal of Neuro-
science Methods, 200(2), 237–256.
Huang, D., Shan, C., & Ardabilian, M. (2011). Local binary pattern and its application to facial
image analysis: A survey. IEEE Transactions on Systems, Man and Cybernetics, Part C: Appli-
cations and Reviews, 41(6), 765–781.
Jaiswal, S., Almaev, T., & Valstar, M. F. (2013). Guided unsupervised learning of mode specific
models for facial point detection in the wild. In Proceedings of the IEEE International Confer-
ence on Computer Vision Workshops, December 1–8, Sydney (pp. 370–377).
Jeni, L. A., Girard, J. M., Cohn, J., & Torre, F. D. L. (2013). Continuous AU intensity esti-
mation using localized, sparse facial feature space. In Proceedings of the 10th IEEE Interna-
tional Conference and Workshops on Automatic Face and Gesture Recognition, April 22–26,
Shanghai.
Jiang, B., Valstar, M. F., Martinez, B., & Pantic, M. (2014). A dynamic appearance descriptor
approach to facial actions temporal modelling. IEEE Transactions on Cybernetics, 44(2), 161–
174.
Jiang, B., Valstar, M. F., & Pantic, M. (2011). Action unit detection using sparse appearance
descriptors in space-time video volumes. In Proceedings of the IEEE International Conference
on Automatic Face and Gesture Recognition, March 21–25, Santa Barbara, CA (pp. 314–321).
Kaltwang, S., Rudovic, O., & Pantic, M. (2012). Continuous pain intensity estimation from facial
expressions. In Proceedings of the 8th International Symposium on Visual Computing, July
16–18, Rethymnon, Crete (pp. 368–377).
Kanade, T., Cohn, J. F., & Tian, Y. (2000). Comprehensive database for facial expression analysis.
In Proceedings of the 4th International Conference on Automatic Face and Gesture Recogni-
tion, March 30, Grenoble, France (pp. 46–53).
Kapoor, A., Qi, Y., & Picard, R. W. (2003). Fully automatic upper facial action recognition. In
Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Ges-
tures, October 17, Nice, France (pp. 195–202).
Khademi, M., Manzuri-Shalmani, M. T., Kiapour, M. H., & Kiaei, A. A. (2010). Recognizing
combinations of facial action units with different intensity using a mixture of hidden Markov
models and neural network. In Proceedings of the 9th International Conference on Multiple
Classifier Systems, April 7–9, Cairo (pp. 304–313).
Khan, M. H., Valstar, M. F., & Pridmore, T. P. (2013). A multiple motion model tracker handling
occlusion and rapid motion variation. In Proceedings of the 5th UK Computer Vision Student
Workshop British Machine Vision Conference, September 9–13, Bristol.
Koelstra, S., Pantic, M., & Patras, I. (2010). A dynamic texture based approach to recognition of
facial actions and their temporal models. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 32(11), 1940–1954.
Kotsia, I., Zafeiriou, S., & Pitas, I. (2008). Texture and shape information fusion for facial expres-
sion and facial action unit recognition. Pattern Recognition, 41(3), 833–851.
Li, Y., Chen, J., Zhao, Y., & Ji, Q. (2013). Data-free prior model for facial action unit recognition.
IEEE Transactions on Affective Computing, 4(2), 127–141.
Lien, J. J., Kanade, T., Cohn, J. F., & Li, C. (1998). Automated facial expression recognition based
on FACS action units. In Proceedings of 3rd IEEE International Conference on Automatic Face
and Gesture Recognition, April 14–16, Nara, Japan (pp. 390–395).
Lien, J. J., Kanade, T., Cohn, J. F., & Li, C. (2000). Detection, tracking, and classification of action
units in facial expression. Robotics and Autonomous Systems, 31, 131–146.
Littlewort, G. C., Bartlett, M. S., & Lee, K. (2009). Automatic coding of facial expressions dis-
played during posed and genuine pain. Image and Vision Computing, 27, 1797–1803.
Littlewort, G. C., Whitehill, J., Wu, T., et al. (2011). The computer expression recognition toolbox
(CERT). In Proceedings of the IEEE International Conference on Automatic Face and Gesture
Recognition, March 21–25, Piscataway, NJ (pp. 298–305).
Liwicki, S., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2012). Efficient online subspace
learning with an indefinite kernel for visual tracking and recognition. IEEE Transactions on
Neural Networks and Learning Systems, 23, 1624–1636.
Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., & Ambadar, Z. (2010). The extended Cohn-Kanade
dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, June
13–18, San Francisco (pp. 94–101).
Lucey, P., Cohn, J. F., Matthews, I., et al. (2011). Automatically detecting pain in video through
facial action units. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics,
41(3), 664–674.
Lucey, P., Cohn, J. F., Prkachin, K. M., Solomon, P. E., & Matthews, I. (2011). Painful data:
The UNBC-McMaster shoulder pain expression archive database. In Proceedings of the IEEE
International Conference on Automatic Face and Gesture Recognition, March 21–25, Santa
Barbara, CA (pp. 57–64).
Mahoor, M. H., Cadavid, S., Messinger, D. S., & Cohn, J. F. (2009). A framework for automated
measurement of the intensity of non-posed facial action units. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, June 20–25, Miami, FL (pp. 74–80).
Mahoor, M. H., Zhou, M., Veon, K. L., Mavadati, M., & Cohn, J. F. (2011). Facial action unit
recognition with sparse representation. In Proceedings of IEEE International Conference on
Automatic Face and Gesture Recognition, March 21–25, Santa Barbara, CA (pp. 336–342).
Martinez, B., Valstar, M. F., Binefa, X., & Pantic, M. (2013). Local evidence aggregation for
regression based facial point detection. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(5), 1149–1163.
Matthews, I. & Baker, S. (2004). Active appearance models revisited. International Journal of
Computer Vision, 60(2), 135–164.
Mavadati, S. M., Mahoor, M. H., Bartlett, K., & Trinh, P. (2012). Automatic detection of non-
posed facial action units. In Proceedings of the 19th International Conference on Image Pro-
cessing, September 30–October 3, Lake Buena Vista, FL (pp. 1817–1820).
McCallum, A., Freitag, D., & Pereira, F. C. N. (2000). Maximum entropy Markov models for
information extraction and segmentation. In Proceedings of the 17th International Conference
on Machine Learning, June 29–July 2, Stanford University, CA (pp. 591–598).
McDuff, D., El Kaliouby, R., Senechal, T., et al. (2013). Affectiva-MIT facial expression dataset
(AM-FED): Naturalistic and spontaneous facial expressions collected “in-the-wild.” In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
June 23–28, Portland, OR (pp. 881–888).
McKeown, G., Valstar, M. F., Cowie, R., Pantic, M., & Schroder, M. (2012). The SEMAINE
database: Annotated multimodal records of emotionally colored conversations between a per-
son and a limited agent. IEEE Transactions on Affective Computing, 3, 5–17.
McLellan, T., Johnston, L., Dalrymple-Alford, J., & Porter, R. (2010). Sensitivity to genuine ver-
sus posed emotion specified in facial displays. Cognition and Emotion, 24, 1277–1292.
Milborrow, S. & Nicolls, F. (2008). Locating facial features with an extended active shape model.
In Proceedings of the 10th European Conference on Computer Vision, October 12–18, Mar-
seille, France (pp. 504–513).
Ojala, T., Pietikäinen, M., & Harwood, D. (1996). A comparative study of texture measures with
classification based on featured distribution. Pattern Recognition, 29(1), 51–59.
Ojala, T., Pietikäinen, M., & Maenpaa, T. (2002). Multiresolution grey-scale and rotation invariant
texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 24(7), 971–987.
Ojansivu, V. & Heikkilä, J. (2008). Blur insensitive texture classification using local phase quanti-
zation. In 3rd International Conference on Image and Signal Processing, July 1–3, Cherbourg-
Octeville, France (pp. 236–243).
Orozco, J., Martinez, B., & Pantic, M. (2013). Empirical analysis of cascade deformable models
for multi-view face detection. In IEEE International Conference on Image Processing, Septem-
ber 15–18, Melbourne, Australia (pp. 1–5).
Pan, S. J. & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge
and Data Engineering, 22(10), 1345–1359.
Pantic, M. (2009). Machine analysis of facial behaviour: Naturalistic and dynamic behaviour.
Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535), 3505–
3513.
Pantic, M. & Bartlett, M. S. (2007). Machine analysis of facial expressions. In K. Delac & M.
Grgic (Eds), Face Recognition (pp. 377–416). InTech.
Pantic, M. & Patras, I. (2004). Temporal modeling of facial actions from face profile image
sequences. In Proceedings of the IEEE International Conference Multimedia and Expo, June
27–30, Taipei, Taiwan (pp. 49–52).
Pantic, M. & Patras, I. (2005). Detecting facial actions and their temporal segments in nearly
frontal-view face image sequences. In Proceedings of the IEEE International Conference on
Systems, Man and Cybernetics, October 12, Waikoloa, HI (pp. 3358–3363).
Pantic, M. & Patras, I. (2006). Dynamics of facial expression: Recognition of facial actions and
their temporal segments from face profile image sequences. IEEE Transactions on Systems,
Man and Cybernetics, Part B: Cybernetics, 36, 433–449.
Pantic, M. & Rothkrantz, L. (2000). Automatic analysis of facial expressions: The state of the art.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1424–1445.
Pantic, M., Rothkrantz, L., & Koppelaar, H. (1998). Automation of non-verbal communication of
facial expressions. In Proceedings of the European Conference on Multimedia, January 5–7,
Leicester, UK (pp. 86–93).
Pantic, M., Valstar, M. F., Rademaker, R., & Maat, L. (2005). Web-based database for facial
expression analysis. In Proceedings of the IEEE International Conference on Multimedia and
Expo, July 6, Amsterdam (pp. 317–321).
Papageorgiou, C. P., Oren, M., & Poggio, T. (1998). A general framework for object detection. In
Proceedings of the IEEE International Conference on Computer Vision, January 7, Bombay,
India (pp. 555–562).
Pfister, T., Li, X., Zhao, G., & Pietikäinen, M. (2011). Recognising spontaneous facial micro-
expressions. In Proceedings of the IEEE International Conference on Computer Vision,
November 6–13, Barcelona (pp. 1449–1456).
Ross, D. A., Lim, J., Lin, R.-S., & Yang, M.-H. (2008). Incremental learning for robust visual
tracking. International Journal of Computer Vision, 77(1–3), 125–141.
Rudovic, O., Pavlovic, V., & Pantic, M. (2012). Kernel conditional ordinal random fields for tem-
poral segmentation of facial action units. In Proceedings of 12th European Conference on
Computer Vision Workshop, October 7–13, Florence, Italy.
Sánchez-Lozano, E., De la Torre, F., & González-Jiménez, D. (2012, October). Continuous regres-
sion for non-rigid image alignment. In European Conference on Computer Vision (pp. 250–
263). Springer Berlin Heidelberg.
Sánchez-Lozano, E., Martinez, B., Tzimiropoulos, G., & Valstar, M. (2016, October). Cascaded
continuous regression for real-time incremental face tracking. In European Conference on
Computer Vision (pp. 645–661). Springer International Publishing.
Sandbach, G., Zafeiriou, S., Pantic, M., & Yin, L. (2012). Static and dynamic 3-D facial
expression recognition: A comprehensive survey. Image and Vision Computing, 30(10),
683–697.
Saragih, J. M., Lucey, S., & Cohn, J. F. (2011). Deformable model fitting by regularized landmark
mean-shift. International Journal of Computer Vision, 91(2), 200–215.
Savran, A., Alyüz, N., Dibeklioğlu, H., et al. (2008). Bosphorus database for 3-D face analysis.
In COST Workshop on Biometrics and Identity Management, May 7–9, Roskilde, Denmark
(pp. 47–56).
Savran, A., Sankur, B., & Bilge, M. T. (2012a). Comparative evaluation of 3-D versus 2-D modal-
ity for automatic detection of facial action units. Pattern Recognition, 45(2), 767–782.
Savran, A., Sankur, B., & Bilge, M. T. (2012b). Regression-based intensity estimation of facial
action units. Image and Vision Computing, 30(10), 774–784.
Scherer, K. & Ekman, P. (1982). Handbook of Methods in Nonverbal Behavior Research. Cam-
bridge: Cambridge University Press.
Senechal, T., Rapp, V., Salam, H., et al. (2011). Combining AAM coefficients with LGBP
histograms in the multi-kernel SVM framework to detect facial action units. In IEEE
International Conference on Automatic Face and Gesture Recognition Workshop, March 21–
25, Santa Barbara, CA (pp. 860–865).
Senechal, T., Rapp, V., Salam, H., et al. (2012). Facial action recognition combining heteroge-
neous features via multi-kernel learning. IEEE Transactions on Systems, Man and Cybernetics,
Part B: Cybernetics, 42(4), 993–1005.
Shan, C., Gong, S., & McOwan, P. (2008). Facial expression recognition based on local binary
patterns: A comprehensive study. Image and Vision Computing, 27(6), 803–816.
Simon, T., Nguyen, M. H., Torre, F. D. L., & Cohn, J. (2010). Action unit detection with segment-
based SVMs. In IEEE Conference on Computer Vision and Pattern Recognition, June 13–18,
San Francisco (pp. 2737–2744).
Smith, R. S. & Windeatt, T. (2011). Facial action unit recognition using filtered local binary pat-
tern features with bootstrapped and weighted ECOC classifiers. Ensembles in Machine Learn-
ing Applications, 373, 1–20.
Stratou, G., Ghosh, A., Debevec, P., & Morency, L.-P. (2011). Effect of illumination on auto-
matic expression recognition: A novel 3-D relightable facial database. In IEEE International
Conference on Automatic Face and Gesture Recognition, March 21–25, Santa Barbara, CA
(pp. 611–618).
Tax, D. M. J., Hendriks, E., Valstar, M. F., & Pantic, M. (2010). The detection of concept frames
using clustering multi-instance learning. In Proceedings of the IEEE International Conference
on Pattern Recognition, August 23–26, Istanbul, Turkey (pp. 2917–2920).
Tian, Y., Kanade, T., & Cohn, J. (2001). Recognizing action units for facial expres-
sion analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2),
97–115.
Tian, Y., Kanade, T., & Cohn, J. F. (2002). Evaluation of Gabor-wavelet-based facial action unit
recognition in image sequences of increasing complexity. In Proceedings of the 5th IEEE Inter-
national Conference on Automatic Face and Gesture Recognition, May 21, Washington, DC
(pp. 229–234).
Tong, Y., Chen, J., & Ji, Q. (2010). A unified probabilistic framework for spontaneous facial action
modeling and understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence,
32(2), 258–273.
Tong, Y., Liao, W., & Ji, Q. (2007). Facial action unit recognition by exploiting their dynamic
and semantic relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence,
29(10), 1683–1699.
Tsalakanidou, F. & Malassiotis, S. (2010). Real-time 2-D+3-D facial action and expression recog-
nition. Pattern Recognition, 43(5), 1763–1775.
Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for
structured and interdependent output variables. Journal of Machine Learning Research, 6,
1453–1484.
Valstar, M. F., Gunes, H., & Pantic, M. (2007). How to distinguish posed from spontaneous smiles
using geometric features. In Proceedings of the 9th International Conference on Multimodal
Interfaces, November 12–15, Nagoya, Japan (pp. 38–45).
Valstar, M. F., Jiang, B., Mehu, M., Pantic, M., & Scherer, K. (2011). The first facial expression
recognition and analysis challenge. In IEEE International Conference on Automatic Face and
Gesture Recognition Workshop, March 21–25, Santa Barbara, CA.
Valstar, M. F., Martinez, B., Binefa, X., & Pantic, M. (2010). Facial point detection using boosted
regression and graph models. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, June 13–18, San Francisco (pp. 2729–2736).
Valstar, M. F., Mehu, M., Jiang, B., Pantic, M., & Scherer, K. (2012). Meta-analysis of the first
facial expression recognition challenge. IEEE Transactions on Systems, Man and Cybernetics,
Part B: Cybernetics, 42(4), 966–979.
Valstar, M. F. & Pantic, M. (2010). Induced disgust, happiness and surprise: an addition to the
MMI facial expression database. In Proceedings of the International Conference Language
Resources and Evaluation, Workshop on Emotion, May 17–23, Valletta, Malta (pp. 65–70).
Valstar, M. F. & Pantic, M. (2012). Fully automatic recognition of the temporal phases of facial
actions. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 42(1), 28–
43.
Valstar, M. F., Pantic, M., Ambadar, Z., & Cohn, J. F. (2006). Spontaneous vs. posed facial behav-
ior: Automatic analysis of brow actions. In Proceedings of the International Conference on
Multimodal Interfaces, November 2–4, Banff, Canada (pp. 162–170).
Valstar, M. F., Pantic, M., & Patras, I. (2004). Motion history for facial action detection in video.
In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Octo-
ber 10–13, The Hague, Netherlands (pp. 635–640).
Valstar, M. F., Patras, I., & Pantic, M. (2005). Facial action unit detection using probabilistic
actively learned support vector machines on tracked facial point data. In IEEE Conference
on Computer Vision and Pattern Recognition Workshops, September 21–23, San Diego, CA
(pp. 76–84).
Van der Maaten, L. & Hendriks, E. (2012). Action unit classification using active appearance
models and conditional random field. Cognitive Processing, 13, 507–518.
Vinciarelli, A., Pantic, M., & Bourlard, H. (2009). Social signal processing: Survey of an emerg-
ing domain. Image and Vision Computing, 27(12), 1743–1759.
Viola, P. & Jones, M. (2003). Fast multi-view face detection. Technical report MERLTR2003–96,
Mitsubishi Electric Research Laboratory.
Viola, P. & Jones, M. (2004). Robust real-time face detection. International Journal of Computer
Vision, 57(2), 137–154.
Whitehill, J. & Omlin, C. W. (2006). Haar features for FACS AU recognition. In Proceedings
of the 7th IEEE International Conference on Automatic Face and Gesture Recognition, April
10–12, Southampton, UK.
Williams, A. C. (2002). Facial expression of pain: An evolutionary account. Behavioral and Brain
Sciences, 25(4), 439–488.
Wu, T., Butko, N. J., Ruvolo, P., et al. (2012). Multilayer architectures of facial action unit recog-
nition. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 42(4), 1027–
1038.
Xiong, X. & De la Torre, F. (2013). Supervised descent method and its applications to face align-
ment. In IEEE Conference on Computer Vision and Pattern Recognition, June 23–28, Portland,
OR.
Yang, P., Liu, Q., & Metaxasa, D. N. (2009). Boosting encoded dynamic features for facial expres-
sion recognition. Pattern Recognition Letters, 30(2), 132–139.
Yang, P., Liu, Q., & Metaxasa, D. N. (2011). Dynamic soft encoded patterns for facial event
analysis. Computer Vision and Image Understanding, 115(3), 456–465.
Zeng, Z., Pantic, M., Roisman, G. I., & Huang, T. S. (2009). A survey of affect recognition
methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 31(1), 39–58.
Zhang, L., Tong, Y., & Ji, Q. (2008). Active image labeling and its application to facial action
labeling. In European Conference on Computer Vision, October 12–18, Marseille, France
(pp. 706–719).
Zhang, L. & Van der Maaten, L. (2013). Structure preserving object tracking. In IEEE Conference
on Computer Vision and Pattern Recognition, June 23–28, Portland, OR.
Zhang, X., Yin, L., Cohn, J. F., et al. (2013). A high resolution spontaneous 3-D dynamic facial
expression database. In IEEE International Conference on Automatic Face and Gesture Recog-
nition, April 22–26, Shanghai (pp. 22–26).
Zhang, Z., Lyons, M., Schuster, M., & Akamatsu, S. (1998). Comparison between geometry-
based and Gabor wavelets-based facial expression recognition using multi-layer perceptron. In
Proceedings of the 3rd IEEE International Conference on Automatic Face and Gesture Recog-
nition, April 14–16, Nara, Japan (pp. 454–459).
Zhao, G. Y. & Pietikäinen, M. (2007). Dynamic texture recognition using local binary pattern
with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 29(6), 915–928.
Zhou, F., De la Torre, F., & Cohn, J. F. (2010). Unsupervised discovery of facial events. In IEEE
Conference on Computer Vision and Pattern Recognition, June 13–18, San Francisco.
Zhu, X. & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the
wild. In IEEE Conference on Computer Vision and Pattern Recognition, June 16–21, Provi-
dence, RI (pp. 2879–2886).
Zhu, Y., De la Torre, F., Cohn, J. F., & Zhang, Y. (2011). Dynamic cascades with bidirectional
bootstrapping for action unit detection in spontaneous facial behavior. IEEE Transactions on
Affective Computing, 2(2), 79–91.
12 Automatic Analysis of Bodily Social Signals
Ronald Poppe

The human body plays an important role in face-to-face interactions (Knapp & Hall,
2010; McNeill, 1992). We use our bodies to regulate turns, to display attitudes and to
signal attention (Scheflen, 1964). Unconsciously, the body also reflects our affective
and mental states (Ekman & Friesen, 1969). There is a long history of research into the
bodily behaviors that correlate with the social and affective state of a person, in partic-
ular in interaction with others (Argyle, 2010; Dittmann, 1987; Mehrabian, 1968). We
will refer to these behaviors as bodily social signals. These social and affective cues
can be detected and interpreted by observing the human body’s posture and movement
(Harrigan, 2008; Kleinsmith & Bianchi-Berthouze, 2013). Automatic observation and
analysis has applications such as the detection of driver fatigue and deception, the analy-
sis of interest and mood in interactions with robot companions, and the interpretation
of higher-level phenomena such as mimicry and turn-taking.
In this chapter, we will discuss various bodily social signals, and how to analyze and
recognize them automatically. Human motion can be studied on many levels, from the
physical level involving muscles and joints, to the level of interpreting a person’s full-
body actions and intentions (Poppe, 2007, 2010; Jiang et al., 2013). We will focus on
automatically analyzing movements with a relatively short time scale, such as a gesture
or posture shift. In the first section, we will discuss the different ways of measurement
and coding, both from motion capture data and images and video. The recorded data can
subsequently be interpreted in terms of social signals. In the second section, we address
the automatic recognition of several bodily social signals. We will conclude the chapter
with a discussion of challenges and directions of future work.

Measurement of Body Motion

Body movement can be observed and described quantitatively, for example, in terms
of joint rotations, or qualitatively, with movement labels. While social signals are typi-
cally detected and identified as belonging to a certain category, body motion is typically
described quantitatively. Therefore, the detection of bodily social signals is often based
on a quantitative representation of the movement. From the perspective of computation,
body motion is most conveniently recorded and measured using motion capture (mocap)
devices. However, their obtrusive nature, cost, and the fact that they typically cannot be
used outside the laboratory has limited their employment. Therefore, many researchers
have turned to common, unobtrusive cameras for action recognition. Recently, the avail-
ability of cheap depth cameras provides opportunities as well. Bodily social signals can
be detected directly from videos and depth sequences or, indirectly, from recovered body
poses and movement.
We first discuss the manual and automatic measurement and common ways to rep-
resent human body movement. Next, we summarize the recording of motion capture,
video and depth images, and the processing needed to transform raw outputs into body
movement descriptions.

Manual and Automatic Measurement


The systematic analysis of body movement dates back to the early photography experi-
ments of Marey and Muybridge (see Klette & Tee, 2007 for a historical background). By
analyzing successive photos, they were able to analyze patterns of movement. Later, the
introduction of video recording and play-back equipment allowed researchers to ana-
lyze behavior on a finer time scale (Condon & Ogston, 1966; Eisler, Hersen, & Agras,
1973). Initially, such analyses were used to investigate patients with mental diseases,
but these methods soon found their way to the more general study into (communicative)
nonverbal behavior.
Together with the increasing sophistication of recording and play-back devices, the
opportunities for analysis developed. From videos, researchers coded specific behav-
iors that they were interested in. Evaluative coding relies on researchers that code
their recorded material for the occurrence of particular forms of nonverbal behavior
(Rozensky & Honor, 1982). These specific qualitative schemes have led to models of
turn-taking (Sacks, Schegloff, & Jefferson, 1974) and gesturing (Lausberg & Sloetjes,
2009), amongst others. While it has been found that many bodily behaviors can be
coded reliably (Baesler & Burgoon, 1987), evaluative schemes require interpretation of
the observed behavior. This is especially true for bodily social signals. The variation in
the performance of nonverbal behavior in magnitude, form, and direction requires that
boundaries on the labels are set, which is an arbitrary task (Scherer & Ekman, 2008).
To address this issue, researchers have been looking at ways to describe human
motion quantitatively. They developed schemes including the Bernese system for time
series notation (Frey & von Cranach, 1973) and the Laban movement analysis (von
Laban, 1975), which evolved into Labanotation (Hutchinson Guest, 2005). These sys-
tems describe body part positions and motion in terms of angles and velocities (Bente,
1989; Hirsbrunner, Frey, & Crawford, 1987) and have been found to be generally appli-
cable and sufficiently detailed to animate computer characters (Bente et al., 2001). The
recently introduced body and action posture (BAP; Dael, Mortillaro, & Scherer, 2012)
coding system includes both quantitative aspects such as orientation and magnitude of
body part movement, and functional descriptions, following Ekman & Friesen (1969).
The system differentiates between posture units and action units, of which the latter are
more subject to interpretation.
Both the qualitative and quantitative approaches have led to insights into bodily
behavior. However, manually coding data is time consuming, meaning that there is often
an inherent trade-off between the number of coded actions and the amount of coded
material (Poppe et al., 2014). With the increasing availability of technology to record
and analyze human motion, researchers have begun to address the automatic analysis of
recorded data (Poppe, 2007, 2010). We will discuss advances in this direction.

Human Body Representation


Body movement can be described in terms of body mass displacement, muscle activa-
tions, or joint positions, to name a few. Describing the movement at the skeleton level
is convenient, given that motion takes place at the joints (Poppe et al., 2014). The skele-
ton can be considered as a set of body parts (bones) connected by joints. Body poses
can be represented as instantiations of joint positions. All joints and body parts in the
human body together form a kinematic tree, a hierarchical model of interconnectivity.
Typically, joints in the spine, hands, and feet are omitted. The joint at the top of the
tree, usually the pelvis, forms a root to which all other joints are relative. When two
joints are connected to a body part, the one higher in the tree hierarchy is considered the
parent and the other the child. Movement in a parent joint affects the child joints. For
example, movement of the left shoulder affects the position of the left elbow and wrist
joints. Joint positions can be described globally with reference to a global axis system
and origin. Alternatively, they can be described relative to their parent in the tree. Global
and local representations each have their relative advantages. The former are most con-
venient when comparing full-body poses, as distances between pairs of joints can be
calculated in a straightforward manner. When analyzing the movement of a single body
part or joint, local representations enable the analysis of the motion in isolation.
The global or local positions of all joints form an adequate description of the body
pose, especially when normalized for global position, orientation, and differences in
body sizes (Poppe et al., 2014). Poses encode the positions of body parts, but do not
reveal anything about their motion. To this end, the velocity of the joints can be used.
Pose and motion information are often complementary and are both used in the analysis
of bodily social signals.
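To make the skeleton representation concrete, the sketch below is a minimal Python illustration of these ideas; the joint names, offsets, frame rate, and the two-frame toy sequence are assumptions made for this example, not data from any system discussed here. It composes parent-relative offsets along a kinematic tree into global joint positions and derives joint velocities by frame differencing.

```python
# A minimal kinematic-tree sketch: local offsets relative to the parent joint
# are accumulated along the tree to obtain global joint positions.
# Joint names, offsets, and frame rate are illustrative assumptions.
import numpy as np

PARENT = {            # child -> parent; the pelvis is the root
    "pelvis": None,
    "spine": "pelvis",
    "l_shoulder": "spine",
    "l_elbow": "l_shoulder",
    "l_wrist": "l_elbow",
}

def global_positions(local_offsets):
    """Convert local (parent-relative) offsets to global 3-D positions."""
    pos = {}
    def resolve(joint):
        if joint in pos:
            return pos[joint]
        parent = PARENT[joint]
        base = np.zeros(3) if parent is None else resolve(parent)
        pos[joint] = base + np.asarray(local_offsets[joint], dtype=float)
        return pos[joint]
    for joint in PARENT:
        resolve(joint)
    return pos

# Two consecutive frames of a toy pose (metres, parent-relative).
frame_t0 = {"pelvis": (0, 0, 1.0), "spine": (0, 0, 0.5),
            "l_shoulder": (-0.2, 0, 0.1), "l_elbow": (0, 0, -0.3),
            "l_wrist": (0, 0.05, -0.25)}
frame_t1 = {**frame_t0, "l_elbow": (0, 0.1, -0.28)}   # only the elbow moves

p0, p1 = global_positions(frame_t0), global_positions(frame_t1)
dt = 1.0 / 30.0                                        # assumed 30 fps
velocity = {j: (p1[j] - p0[j]) / dt for j in PARENT}
print(velocity["l_wrist"])   # non-zero: the wrist inherits its parent's motion
```

The printed wrist velocity is non-zero even though only the elbow's local offset changed, which is exactly the parent-child dependency described above.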

Motion Capture
Motion capture technology employs either markers or wearable sensors to determine
a subject’s body pose. Marker-based mocap setups record the positions of markers
attached to the body using many cameras. With proper calibration, these sensor posi-
tions can be translated to the positions of the joints. The advantage of such systems
is their high accuracy. However, the space in which the movement can take place is
limited, and marker occlusions, especially in the presence of other subjects, require
additional post-processing. Inertial devices eliminate the need for visible markers as the
sensors are worn on the body, possibly underneath clothing. This allows for their use
in larger spaces and they perform more robustly when recording interactions between
multiple subjects. Their acceleration measurements can be converted to 3-D positions
of the joints. See (Poppe et al., 2014) for an overview and discussion of motion cap-
ture approaches. Both global and local joint positions can be obtained from mocap
devices.
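As a toy illustration of how inertial measurements relate to position, the sketch below double-integrates a synthetic one-dimensional acceleration signal; the sampling rate and signal are assumptions for illustration, and real inertial mocap additionally fuses orientation estimates and corrects for drift.

```python
# Toy double integration of a 1-D acceleration signal into position.
# The sampling rate and signal are illustrative assumptions; real inertial
# systems also use orientation estimates and drift correction.
import numpy as np

fs = 100.0                           # assumed sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)
accel = np.sin(2 * np.pi * 1.0 * t)  # synthetic acceleration (m/s^2)

vel = np.cumsum(accel) / fs          # integrate acceleration -> velocity
pos = np.cumsum(vel) / fs            # integrate velocity -> position

# A small constant sensor bias grows quadratically in position, which is
# why inertial capture needs periodic drift correction.
biased = accel + 0.01                # assumed bias of 0.01 m/s^2
drift = np.cumsum(np.cumsum(biased) / fs) / fs - pos
print(f"position drift after 2 s: {drift[-1]:.3f} m")
```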

Video Recordings
The use of video for the study of nonverbal behavior is appealing as the recording is
unobtrusive, both inside and outside the lab. In contrast to mocap devices, video cam-
eras are cheap and widely available. Moreover, the abundance of available recordings
portraying human behavior motivates the research efforts aimed at automatically ana-
lyzing them.
The analysis of human motion from video is challenging because of several factors.
An image is a projection of a 3-D scene in which the depth information is lost. More-
over, determining which parts of the image represent the human figure is challenging,
especially in the presence of background clutter and partial occlusion of the body. Nui-
sance factors such as variations in lighting, clothing, body sizes, and viewpoint add
further to the challenge (Poppe, 2010).
In general, there are two main approaches to analyzing human movement from video.
First, a body movement representation in terms of joint positions can be extracted, as
described in the section on human body representation. Second, the characteristics of the
image or movement in the image can be used directly for analysis. The results of these
two approaches are pose-based and feature-based representations, respectively. We will
discuss them in the following sections.

Pose-based Representations
There is a large volume of published research on estimating human body poses from
video. A comprehensive discussion appears in Poppe (2007). Here, we will outline the
most common approaches: model-based and discriminative.
In the first approach, model-based human pose estimation algorithms match an articu-
lated model of a human to an image frame in a video. The model consists of a kinematic
structure (see the section on human body representation) and a function that projects the
model to the image. The image projection function determines how a pose of the model
appears in an image, for example, in terms of image edges, silhouette, or color. Given
that a body pose is a particular joint parameter instantiation, pose estimation becomes
the process of finding the parameters that result in the best match between the image
and the model projection. This match is evaluated in terms of image feature distance,
usually in an iterative manner. This process is computationally expensive, but allows for
the evaluation of a large number of parameters of the pose as well as the shape of the
person (Guan et al., 2009).
This estimation process can be top-down, starting with the torso and working down
the kinematic chain until the pose of the limbs is found. Deutscher and Reid (2005)
match the edges and silhouette information of a model with cylindrical body parts to
those extracted from an image. They gradually reduce the amount of change in the pose
to arrive at the final body pose estimate. Usually, the refinement of the pose is guided
by a priori information on how humans move, including typical poses (Vondrak, Sigal,
& Jenkins, 2013).
Alternatively, the process of estimating body poses can be bottom-up by first detect-
ing potential body part locations in the image. Detectors are templates of a body part,
often encoded as edge representations with additional cues such as color and motion
(Eichner et al., 2012). In recent years, deformable part models have become popular
due to their ability to simultaneously detect different parts of the body and reason which
body poses are physically feasible and plausible (Felzenszwalb et al., 2010). Their out-
put is a set of 2-D joint positions, which can be lifted to 3-D when sufficient assumptions
about the observed motion have been made.
The second approach is the discriminative approach. Rather than iteratively fitting
a human model to the data, one can learn a mapping from image to body poses from
training data. Such a mapping can be implemented by regression models (Bo & Smin-
chisescu, 2010). Typically, training data consists of image features and an associated
description of pose and viewpoint. Body poses can be recovered from test videos
by first extracting image features and then applying the mapping. These discrimina-
tive, or learning-based, approaches are computationally much faster than model-based
algorithms but can only reliably recover body poses if there is training data avail-
able with similar poses and viewpoints. This requires a lot of training data to suf-
ficiently cover the range of poses. Given the large number of possible body poses,
this has typically led researchers to concentrate their training data on common activ-
ities, although more recent approaches have targeted less constrained motion domains
(Shotton et al., 2011).
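The following sketch illustrates the discriminative idea in its simplest possible form: a linear least-squares mapping from image feature vectors to 2-D joint coordinates, trained and applied on synthetic data. The feature dimensionality, number of joints, and the data are assumptions made for illustration; published methods use far richer features and non-linear regressors.

```python
# Minimal discriminative (learning-based) pose estimation sketch:
# learn a linear mapping from image feature vectors to 2-D joint positions.
# Dimensions and the synthetic data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_feat, n_joints = 500, 64, 13        # assumed sizes
X = rng.normal(size=(n_train, n_feat))         # stand-in for image descriptors
W_true = rng.normal(size=(n_feat, 2 * n_joints))
Y = X @ W_true + 0.05 * rng.normal(size=(n_train, 2 * n_joints))  # (x, y) per joint

# Least-squares fit of the feature-to-pose mapping on the training data.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

x_test = rng.normal(size=(1, n_feat))          # features of a new frame
pose = (x_test @ W).reshape(n_joints, 2)       # recovered 2-D joint positions
print(pose.shape)                              # (13, 2)
```

As the text notes, such a mapping only generalizes to poses and viewpoints that resemble the training data, which is why large and varied training sets are needed in practice.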

Feature-based Representations
In contrast to pose-based representations, feature-based representations are less seman-
tically meaningful but can be extracted efficiently from video images. Comparing an
image of a scene with people to an image of the same scene without people will reveal
one or more regions of difference that correspond to the locations of the people.
The locations, sizes and movements of these regions are informative of their positions
in the scene and can be used to investigate proximity and interaction patterns of small
groups, such as from top-down views (Veenstra & Hung, 2011).
By analyzing differences between subsequent frames, one can analyze motion at a
finer scale. While such differences can be the basis for the estimation of the locations
of body parts (Fragkiadaki, Hu, & Shi, 2013), they can also be used directly. For exam-
ple, the amount of movement, the direction of the movement, or the relative location of
the movement (upper-body or lower-body) can be informative of the social signals that
a person produces. Moreover, when looking at the movement of several people simul-
taneously, one can analyze the degree of mimicry in their interaction (Paxton & Dale,
2013).
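A minimal frame-differencing sketch along these lines is shown below; the synthetic frames and the crude split of the image into person and body-half regions are assumptions for illustration, not the method of any cited work.

```python
# Frame-differencing sketch: per-frame motion energy for two people, split
# into upper- and lower-body halves, plus a simple synchrony measure.
# The synthetic frames and the halving of the image are illustrative assumptions.
import numpy as np

def motion_energy(frames, threshold=15):
    """Fraction of pixels whose grey value changes by more than `threshold`."""
    diffs = np.abs(np.diff(frames.astype(np.int16), axis=0))
    return (diffs > threshold).mean(axis=(1, 2))

T, H, W = 200, 120, 160
rng = np.random.default_rng(0)
video = rng.integers(0, 255, size=(T, H, W), dtype=np.uint8)  # stand-in frames

left, right = video[:, :, : W // 2], video[:, :, W // 2 :]    # one person per half
upper, lower = video[:, : H // 2, :], video[:, H // 2 :, :]   # coarse body split

energy_left = motion_energy(left)
energy_right = motion_energy(right)
print("upper-body motion:", motion_energy(upper).mean())

# A crude synchrony/mimicry indicator: correlation of the two motion series.
sync = np.corrcoef(energy_left, energy_right)[0, 1]
print(f"motion synchrony: {sync:.2f}")
```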
When analyzing bodily social signals, often there is a specific interest in the locations
of the hands and face. This is especially true for the analysis of gestures. Estimating the
2-D or 3-D positions of the hands and head is often less complex than estimating a full-
body pose, especially when relying on skin color detection. By detecting skin-colored
pixels and grouping them into connected regions, one can recover the location of the
hands and face.
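As a rough illustration of this idea, the sketch below thresholds pixels in a normalized-RGB space and returns the centroids of sufficiently large connected regions as candidate hand and face locations. The colour thresholds and the synthetic image are assumptions for illustration; practical systems rely on learned colour models and temporal tracking.

```python
# Skin-colour blob sketch: threshold pixels in normalized-RGB space and keep
# the centroids of the largest connected regions as candidate hand/face
# locations. Thresholds and the synthetic image are illustrative assumptions.
import numpy as np
from scipy import ndimage

def skin_mask(rgb):
    rgb = rgb.astype(float) + 1e-6
    chroma = rgb / rgb.sum(axis=2, keepdims=True)        # normalized r, g, b
    r, g = chroma[..., 0], chroma[..., 1]
    return (r > 0.36) & (r < 0.47) & (g > 0.28) & (g < 0.36)  # rough skin band

def blob_centroids(mask, min_pixels=50):
    labels, n = ndimage.label(mask)
    centroids = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        if len(ys) >= min_pixels:
            centroids.append((ys.mean(), xs.mean()))
    return centroids

image = np.zeros((120, 160, 3), dtype=np.uint8)
image[20:40, 70:90] = (200, 150, 120)    # a face-like skin patch (assumed colour)
image[80:95, 30:45] = (205, 155, 125)    # a hand-like patch
print(blob_centroids(skin_mask(image)))  # two centroids, roughly face and hand
```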

Depth Images
Time-of-flight (Ganapathi et al., 2010) and structured light cameras such as Microsoft’s
Kinect (Shotton et al., 2011), can estimate the distance between the camera and points
in the scene. The availability of cheap devices has sparked the interest to use them to
observe and analyze human movement. Nuisance factors that occur when using videos,
including cluttered backgrounds and variation in lighting, are significantly reduced and
the additional availability of depth information aids in labeling body parts and their
orientation.

Recognition of Bodily Social Signals

In this section, we will discuss the recognition of various bodily social signals from the
representations described in the first section. Recognizing, or classifying, social signals
is the process of assigning a (semantic) label to an observed sequence of bodily move-
ment. In general, the detection (in time) and recognition of bodily social signals are
challenging due to the variations in the temporal and spatial performance, both between
and within subjects. Social signals can have different bodily manifestations. Conversely,
one distinct bodily behavior can have different meanings. For example, raising a hand
can be a greeting or a sign to take the floor. The context in which the behavior is per-
formed is important to disambiguate between the different meanings. We will discuss
this in the next section.
Both the detection and recognition of social signals from body movement repre-
sentations are often implemented with machine learning techniques (Vinciarelli, Pan-
tic, & Bourlard, 2009). Given training data, which is a collection of body movement
instances with associated social signal labels, a mapping from the former to the latter
is learned. This mapping can take many forms, including state-space models such as
hidden Markov models (HMM), or discriminative classifiers, such as the support vector
machine (SVM). To deal with challenges, such as the diversity of the observed behav-
ior, the inherent ambiguity of the observed behavior, and the typically limited amount
of available training data, many different variants of machine learning algorithms have
been introduced. Other chapters address these techniques for the understanding of social
signals. In this section, we will focus on the potential and challenges in recognizing cer-
tain social signals from body movement. We will subsequently discuss the interpretation
of a person’s position relative to others (see Proxemics) and the analysis of social signals
from the body (see Kinesics).
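To give a concrete, if deliberately simplified, picture of this supervised mapping, the sketch below trains an SVM on synthetic windowed body movement features with two assumed labels. The feature dimensionality, the label names, and the data are illustrative assumptions rather than a reproduction of any published system.

```python
# Sketch of the supervised recognition step: map fixed-length body movement
# feature vectors (e.g., windowed joint positions and velocities) to social
# signal labels with an SVM. Sizes, labels, and data are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features = 400, 60            # e.g., 10 joints x 3-D pose and velocity
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)     # 0 = "no posture shift", 1 = "posture shift"
X[y == 1, :5] += 1.0                       # inject a weak class difference

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```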
Proxemics
The way people use the space around them in relation to others is referred to as prox-
emics. Hall (1966) defines four zones of interpersonal distance that differ in how people
interact, in terms of gesturing and body positioning.
Moreover, these zones correspond to the relation between the people, such as friend or
stranger. For small groups, people have been found to arrange themselves in so-called
F-formations in which each person has equal, direct, and exclusive access to the others
(Kendon, 1990). When analyzing groups of people, the notions of relative orientation
and proximity have been found to be good cues to determine who is part of a subgroup (Groh
et al., 2010) and to predict mutual interest (Veenstra & Hung, 2011). Most of the work
on automatic analysis of proxemics has been carried out in social surveillance settings in
which body movement representations typically are feature-based. The automatic analy-
sis of proxemics has also been studied at a closer distance by Mead, Atrash, and Matarić
(2013). They considered a range of body movement features, including (relative) body
position and elements of the pose. We will discuss the analysis of full-body movement
in social interaction in the next section.
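A minimal sketch of the distance-based part of such an analysis is given below; the metre thresholds are common approximations of Hall's (1966) zones, and the example positions are assumed for illustration.

```python
# Hall's interpersonal distance zones as a simple distance classifier.
# The metre thresholds are approximations; the example positions are assumed.
import math

ZONES = [(0.45, "intimate"), (1.2, "personal"), (3.7, "social")]  # approx. upper bounds (m)

def proxemic_zone(pos_a, pos_b):
    d = math.dist(pos_a, pos_b)            # ground-plane distance between two people
    for upper, name in ZONES:
        if d <= upper:
            return name
    return "public"

print(proxemic_zone((0.0, 0.0), (0.8, 0.3)))   # -> "personal"
```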

Kinesics
Kinesics refers to the study of body poses and movements as a mode of communi-
cation (Birdwhistell, 1952). The research on the automatic analysis of kinesics has
focused mainly on conversational settings, such as meetings, interviews, and other small
group interactions. The body has been found to communicate attitudes toward others in
the interaction (Ekman, 1965). Okwechime et al. (2011) have addressed the automatic
recognition of interest in interaction partners by analyzing gross body motion. Body
shifts can be easily detected from pose-based and feature-based body movement rep-
resentations, have been found to be indicative of disagreement (Bousmalis, Mehu, &
Pantic, 2013), and play a role in the turn-taking process (Scheflen, 1964), among other functions.
Moreover, mimicry in gross body motion can be a sign of rapport. It can be analyzed
from pose-based representations, from simple frame-differencing techniques (Paxton &
Dale, 2013) or from the detected position of the face in the image (Park et al., 2013).
Closer analysis of the body also allows for the analysis of respiration, which can be a
sign of anxiety. Burba et al. (2012) estimate the respiration rate using a depth camera. Laughing
can be considered a more discrete bodily signal, and different types of laughter can be
recognized from mocap data (Griffin et al., 2013).
The hands are particularly informative of a subject’s social and affective state, given
that hand movements are closely tied to a person’s speech (McNeill, 1985). Gestures
and their co-occurrence with speech have been studied in great detail (Ekman & Friesen,
1969). The amount of gesturing has been found indicative of a user’s attitude and mental
state (Bull, 1987). For example, fidgeting behaviors have been shown to correlate with
an increased experience of distress (Scherer et al., 2013) and can be extracted robustly
from mocap representations (Burba et al., 2012). Similarly, self-touching has been found
to be a sign of self-confidence as well as anxiety (McNeill, 1992). Marcos-Ramiro et al.
(2013) analyze self-touching in conversations from body pose representations obtained
from a depth camera.
Especially in conversational settings, the pose and movement of the head is indicative
of the subject’s attention and serves several functions in the turn-taking process (Heylen,
2006). The analysis of head pose over time from pose representations is straightforward.
When the camera view covers a larger area and the subjects in the view are smaller,
head orientation estimation based on both the subject's pose and head detection can be
used (Bazzani et al., 2013). This allows for investigating the role of head movement in the
process of group formation and the evolution of small group interactions.
One line of research has focused on estimating a subject's affective state from full-
body poses and movements. The relation between affective states and specific body part
positions and movements has been analyzed, for example, by Wallbott (1998). Recently,
automatic analysis has been attempted from pose-based representations, mainly recorded
with mocap equipment, and from feature-based representations. The reader is referred to
Kleinsmith, Bianchi-Berthouze, and Steed (2011) for an overview of research in this area.

Challenges and Opportunities

The research into automatic recognition of bodily social signals and the study of social,
nonverbal behavior are not isolated but rather benefit from each other. A better under-
standing of how humans behave informs the design and implementation of better recog-
nition algorithms and, in turn, these advances in the automatic recognition help to better
understand human behavior.
Apart from their use in understanding the principles of human behavior, automatic
analysis of human body motion will continue to provide opportunities for online appli-
cations. The analysis of body movement can be used to analyze the outcome of nego-
tiations and debates, to help practice public speaking and as a quick way to automate
border control surveillance, to name a few. While initial work along these lines has
already begun, there are some challenges that need to be addressed.

Measurement
Mocap equipment allows for the accurate measurement of body motion, but not unob-
trusively. As such, it is not suitable for many applications outside the lab. Advances in
computer vision algorithms and the recent introduction of depth cameras allow for the
measurement outside the lab without the need for markers or wearable sensors, but their
accuracy and robustness are still limited.
Given that many of the systematics of human nonverbal behavior are expressed in
qualitative terms, a key challenge is to convert quantitative body movement
measurements into these human-understandable, qualitative terms. This would allow for
the adoption of the large body of literature on bodily behavior. Velloso, Bulling, and
Gellersen (2013), among others, address this challenge by automatically estimating
BAP labels from mocap data. They demonstrate that this is not a straightforward task;
future work should investigate how such a mapping can be made.
Recognition
Researchers have begun to adopt machine learning techniques that take into account
individual differences in the display of bodily signals and the inherent ambiguity of
body movement. Learning such models typically requires large amounts of training data
for which obtaining ground truth labels is time-consuming. Researchers should look for
alternative ways to label their data, for example, using crowdsourcing, implicit tagging,
semi-supervised approaches, or by considering correlations between modalities. More-
over, when evaluating recognition algorithms, the optionality and ambiguity of social
signals should be taken into account. Detection in time is often not addressed, which
effectively sidesteps issues with the rare occurrence of social signals and the associated
problem of false positives. Future work should address the simultaneous
detection and recognition of social signals from body movement data.

Context
Current work targets the recognition of specific bodily social signals in relative isolation.
While the work in this direction progresses, there is an increasing need to understand
the behavior more thoroughly. To this end, researchers should look beyond just the body
and include other available knowledge, sometimes referred to as context. We distinguish
here between other subjects, the specific task and setting, and cues from modalities
other than body movement.
Other subjects often provide a strong cue of the type of interaction that takes place.
People respond to each other in more or less known patterns. Observing certain behav-
ior in one person might aid in automatically understanding that of another person. For
example, recognizing that one person sneezes helps in understanding why others turn
their heads.
Many social signals are being studied in a restricted domain, such as a negotiation or
tutoring setting. Knowledge of this setting helps in reducing the ambiguity in explain-
ing the occurrence of a bodily behavior. When moving to less constrained application
domains, it will be necessary to explicitly model the task and setting in order to perform
such disambiguation.
We have discussed the analysis of social signals from the body, but there are often
correlations between behavior of the body, the face, and voice. By taking a multimodal
approach, the ambiguity in a single modality can be reduced and the recognition can
accordingly be made more robust. Moreover, taking into account multiple modalities
will help in addressing individual differences in the display of social signals across
modalities (Romera-Paredes et al., 2013).

Conclusion

In this chapter, we have discussed the measurement and representation of human body
motion. We have presented the current state of recognizing several bodily social signals.
Finally, we have presented challenges in the automatic detection and recognition of bod-
ily social signals and ways to address these. Given the advances, both in measurement
technology and recognition algorithms, we foresee many interesting novel applications
that consider social signals from the body. Moreover, the increasing robustness of cur-
rent algorithms will allow for a wider embedding of such algorithms in multimedia
analysis, social surveillance, and in human–machine interfaces, including social robots.

Acknowledgment

This publication was supported by the Dutch national program COMMIT, and received
funding from the EU FP7 projects TERESA and SSPNet.

References

Argyle, Michael (2010). Bodily Communication (2nd rev. edn). New York: Routledge.
Baesler, E. James & Burgoon, Judee K. (1987). Measurement and reliability of nonverbal behav-
ior. Journal of Nonverbal Behavior, 11(4), 205–233.
Bazzani, Loris, Cristani, Marco, Tosato, Diego, et al. (2013). Social interactions by visual focus
of attention in a three-dimensional environment. Expert Systems, 30(2), 115–127.
Bente, Gary (1989). Facilities for the graphical computer simulation of head and body move-
ments. Behavior Research Methods, Instruments, & Computers, 21(4), 455–462.
Bente, Gary, Petersen, Anita, Krämer, Nicole C., & De Ruiter, Jan Peter (2001). Transcript-based
computer animation of movement: Evaluating a new tool for nonverbal behavior research.
Behavior Research Methods, Instruments, & Computers, 33(3), 303–310.
Birdwhistell, Ray L. (1952). Introduction to Kinesics: An Annotation System for Analysis of Body
Motion and Gesture. Louisville, KY: University of Louisville.
Bo, Liefeng & Sminchisescu, Cristian (2010). Twin Gaussian processes for structured prediction.
International Journal of Computer Vision, 87(1–2), 28–52.
Bousmalis, Konstantinos, Mehu, Marc, & Pantic, Maja. (2013). Towards the automatic detection
of spontaneous agreement and disagreement based on nonverbal behaviour: A survey of related
cues, databases, and tools. Image and Vision Computing, 31(2), 203–221.
Bull, Peter E. (1987). Posture and Gesture. Oxford: Pergamon Press.
Burba, Nathan, Bolas, Mark, Krum, David M., & Suma, Evan A. (2012). Unobtrusive measure-
ment of subtle nonverbal behaviors with the Microsoft Kinect. In Proceedings of IEEE Virtual
Reality Short Papers and Posters March 4–8, 2012, Costa Mesa, CA.
Condon, William S. & Ogston, William D. (1966). Sound film analysis of normal and pathological
behavior patterns. Journal of Nervous and Mental Disease, 143(4), 338–347.
Dael, Nele, Mortillaro, Marcello, & Scherer, Klaus R. (2012). The body action and posture coding
system (BAP): Development and reliability. Journal of Nonverbal Behavior, 36(2), 97–121.
Deutscher, Jonathan, & Reid, Ian (2005). Articulated body motion capture by stochastic search.
International Journal of Computer Vision, 61(2), 185–205.
Dittmann, Allen T. (1987). The role of body movement in communication. In A. W. Siegman & S.
Feldstein (Eds), Nonverbal Behavior and Communication (pp. 37–64). Hillsdale, NJ: Lawrence
Erlbaum.
Eichner, M., Marin-Jimenez, M., Zisserman, A., & Ferrari, V. (2012). 2D articulated human pose
estimation and retrieval in (almost) unconstrained still images. International Journal of Com-
puter Vision, 99(2), 190–214.
Eisler, Richard M., Hersen, Michel, & Agras, W. Stewart (1973). Videotape: A method for
the controlled observation of nonverbal interpersonal behavior. Behavior Therapy, 4(3),
420–425.
Ekman, Paul (1965). Communication through nonverbal behavior: A source of information about
an interpersonal relationship. In S. S. Tomkins & C. E. Izard (Eds), Affect, Cognition, and
Personality (pp. 390–442). New York: Springer.
Ekman, Paul & Friesen, Wallace V. (1969). The repertoire of nonverbal behavior: Categories,
origins, usage and coding. Semiotica, 1(1), 49–98.
Felzenszwalb, Pedro F., Girshick, Ross B., McAllester, David, & Ramanan, Deva (2010). Object
detection with discriminatively trained part-based models. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 32(9), 1627–1645.
Fragkiadaki, Katerina, Hu, Han, & Shi, Jianbo (2013). Pose from flow and flow from pose. In Pro-
ceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2059–
2066).
Frey, Siegfried & von Cranach, Mario (1973). A method for the assessment of body move-
ment variability. In M. von Cranach & I. Vine (Eds), Social Communication and Movement
(pp. 389–418). New York: Academic Press.
Ganapathi, Varun, Plagemann, Christian, Koller, Daphne, & Thrun, Sebastian (2010). Real time
motion capture using a single time-of-flight camera. In Proceedings of the Conference on Com-
puter Vision and Pattern Recognition (CVPR) (pp. 755–762).
Griffin, Harry J., Aung, Min S. H., Romera-Paredes, Bernardino, et al. (2013). Laughter type
recognition from whole body motion. In Proceedings of the International Conference on Affec-
tive Computing and Intelligent Interaction (ACII) (pp. 349–355).
Groh, Georg, Lehmann, Alexander, Reimers, Jonas, Friess, Marc Rene, & Schwarz, Loren (2010).
Detecting social situations from interaction geometry. In Proceedings of the International Con-
ference on Social Computing (SocialCom). (pp. 1–8).
Guan, Peng, Weiss, Alexander, Bălan, Alexandru O., & Black, Michael J. (2009). Estimating
human shape and pose from a single image. In Proceedings of the International Conference On
Computer Vision (ICCV).
Hall, Edward T. (1966). The Hidden Dimension. New York: Doubleday.
Harrigan, Jinni A. (2008). Proxemics, kinesics, and gaze. In J. A. Harrigan & R. Rosenthal (Eds),
New Handbook of Methods in Nonverbal Behavior Research (pp. 137–198). Oxford: Oxford
University Press.
Heylen, Dirk (2006). Head gestures, gaze and the principles of conversational structure. Interna-
tional Journal of Humanoid Robotics, 3(3), 241–267.
Hirsbrunner, Hans-Peter, Frey, Siegfried, & Crawford, Robert (1987). Movement in human inter-
action: Description, parameter formation, and analysis. In A. W. Siegman & S. Feldstein (Eds),
Nonverbal Behavior and Communication (pp. 99–140). Hillsdale, NJ: Lawrence Erlbaum.
Hutchinson Guest, Ann (2005). Labanotation: The System of Analyzing and Recording Movement
(4th edn). New York: Routledge.
Jiang, Yu-Gang, Bhattacharya, Subhabrata, Chang, Shih-Fu, & Shah, Mubarak (2013). High-level
event recognition in unconstrained videos. International Journal of Multimedia Information
Retrieval, 2(2), 73–101.
Kendon, Adam (1990). Conducting Interaction: Patterns of Behavior in Focused Encounters.
Cambridge: Cambridge University Press.
Kleinsmith, Andrea, & Bianchi-Berthouze, Nadia (2013). Affective body expression perception
and recognition: A survey. IEEE Transactions on Affective Computing, 4(1), 15–33.
Kleinsmith, Andrea, Bianchi-Berthouze, Nadia, & Steed, Anthony (2011). Automatic recognition
of non-acted affective postures. IEEE Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics, 41(4), 1027–1038.
Klette, Reinhard, & Tee, Garry (2007). Understanding human motion: A historic review. In
B. Rosenhahn, R. Klette, & D. Metaxas (Eds), Human Motion: Understanding, Modelling,
Capture and Animation (pp. 1–22). New York: Springer.
Knapp, Mark L., & Hall, Judith A. (2010). Nonverbal Communication in Human Interaction (7th
edn). Andover, UK: Cengage Learning.
Lausberg, Hedda & Sloetjes, Han (2009). Coding gestural behavior with the NEUROGES-ELAN
system. Behavior Research Methods, 41(3), 841–849.
Marcos-Ramiro, Alvaro, Pizarro-Perez, Daniel, Romera, Marta Marrón, Nguyen, Laurent, &
Gatica-Perez, Daniel (2013). Body communicative cue extraction for conversational analysis.
In Proceedings of the International Conference on Automatic Face and Gesture Recognition
(FG) (pp. 1–8).
McNeill, David (1985). So you think gestures are nonverbal? Psychological Review, 92(3), 350–
371.
McNeill, David (1992). Hand and Mind: What Gestures Reveal About Thought. Chicago: Uni-
versity of Chicago Press.
Mead, Ross, Atrash, Amin, & Matarić, Maja J. (2013). Automated proxemic feature extraction
and behavior recognition: Applications in human-robot interaction. International Journal of
Social Robotics, 5(3), 367–378.
Mehrabian, Albert (1968). Some referents and measures of nonverbal behavior. Behavior
Research Methods, 1(6), 203–207.
Okwechime, Dumebi, Ong, Eng-Jon, Gilbert, Andrew, & Bowden, Richard (2011). Visualisation
and prediction of conversation interest through mined social signals. In Proceedings of the
International Conference on Automatic Face and Gesture Recognition (FG) (pp. 951–956).
Park, Sunghyun, Scherer, Stefan, Gratch, Jonathan, Carnevale, Peter, & Morency, Louis-Philippe
(2013). Mutual behaviors during dyadic negotiation: Automatic prediction of respondent reac-
tions. In Proceedings of the International Conference on Affective Computing and Intelligent
Interaction (ACII) (pp. 423–428).
Paxton, Alexandra, & Dale, Rick (2013). Frame-differencing methods for measuring bodily syn-
chrony in conversation. Behavior Research Methods, 45(2), 329–343.
Poppe, Ronald (2007). Vision-based human motion analysis: An overview. Computer Vision and
Image Understanding, 108(1–2), 4–18.
Poppe, Ronald (2010). A survey on vision-based human action recognition. Image and Vision
Computing, 28(6), 976–990.
Poppe, Ronald, Van Der Zee, Sophie, Heylen, Dirk K. J., & Taylor, Paul J. (2014). AMAB: Auto-
mated measurement and analysis of body motion. Behavior Research Methods, 46(3), 625–633.
Romera-Paredes, Bernardino, Aung, Hane, Pontil, Massimiliano, et al. (2013). Transfer learning
to account for idiosyncrasy in face and body expressions. In Proceedings of the International
Conference on Automatic Face and Gesture Recognition (FG) (pp. 1–8).
Rozensky, Ronald H., & Honor, Laurie Feldman (1982). Notation systems for coding nonverbal
behavior: A review. Journal of Behavioral Assessment, 4(2), 119–132.
Sacks, Harvey, Schegloff, Emanuel A., & Jefferson, Gail (1974). A simplest systematics for the
organisation of turn-taking for conversation. Language, 50(4), 696–735.
Scheflen, Albert E. (1964). The significance of posture in communicational systems. Psychiatry,
27(4), 316–331.
Scherer, Klaus R., & Ekman, Paul (2008). Methodological issues in studying nonverbal behavior.
In J. A. Harrigan & R. Rosenthal (Eds), New Handbook of Methods in Nonverbal Behavior
Research (pp. 471–504). Oxford: Oxford University Press.
Scherer, Stefan, Stratou, Giota, Mahmoud, Marwa, et al. (2013). Automatic behavior descriptors
for psychological disorder analysis. In Proceedings of the International Conference on Automatic
Face and Gesture Recognition (FG) (pp. 1–8).
Shotton, Jamie, Fitzgibbon, Andrew, Cook, Mat, et al. (2011). Real-time human pose recognition
in parts from single depth images. In Proceedings of the Conference on Computer Vision and
Pattern Recognition (CVPR) (pp. 1297–1304).
Veenstra, Arno, & Hung, Hayley (2011). Do they like me? Using video cues to predict desires dur-
ing speed-dates. In Proceedings of the International Conference on Computer Vision (ICCV)
Workshops (pp. 838–845).
Velloso, Eduardo, Bulling, Andreas, & Gellersen, Hans (2013). AutoBAP: Automatic coding
of body action and posture units from wearable sensors. In Proceedings of the International
Conference on Affective Computing and Intelligent Interaction (ACII) (pp. 135–140).
Vinciarelli, Alessandro, Pantic, Maja, & Bourlard, Hervé (2009). Social signal processing: Survey
of an emerging domain. Image and Vision Computing, 27(12), 1743–1759.
Vondrak, Marek, Sigal, Leonid, & Jenkins, Odest Chadwicke (2013). Dynamical simulation priors
for human motion tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence,
35(1), 52–65.
von Laban, Rudolf (1975). Laban’s Principles of Dance and Movement Notation (2nd edn).
London: MacDonald and Evans.
Wallbott, Harald G. (1998). Bodily expression of emotion. European Journal of Social Psychol-
ogy, 28(6), 879–896.
13 Computational Approaches for Personality Prediction
Bruno Lepri and Fabio Pianesi

Introduction

In everyday life, people usually describe others as being more or less talkative or socia-
ble, more or less angry or vulnerable to stress, more or less planful or behaviorally
controlled. Moreover, people exploit these descriptors in their everyday life to explain
and/or predict others’ behavior, attaching them to well-known as well as to new acquain-
tances. In all generality, the attribution of stable personality characteristics to others
and their usage to predict and explain their behavior is a fundamental characteristics of
human naive psychology (Andrews, 2008).
As agents that in increasingly many and varied ways participate in and affect the lives
of humans, computers need to explain and predict their human parties’ behavior by,
for example, deploying some kind of naive folk-psychology in which the understanding
of people’s personality can reasonably be expected to play a role. In this chapter, we
address some of the issues raised by attempts to endow machines with the capability of
predicting people’s personality traits.
Scientific psychology has developed a view of personality as a higher-level abstrac-
tion encompassing traits, sets of stable dispositions toward action, belief, and attitude
formation. Personality traits differ across individuals, are relatively stable over time, and
influence behavior. Between-individual differences in behavior, belief, and attitude can
therefore be captured in terms of the dispositions/personality traits that are specific to
each individual, in this way providing a powerful descriptive and predictive tool that
has been widely exploited by, for example, clinical and social psychology, educational
psychology, and organizational studies.
The search for personality traits has been often pursued by means of factor-analytic
studies applied to lists of trait adjectives, an approach based on the lexical hypothe-
sis (Allport & Odbert, 1936), which maintains that the most relevant individual differ-
ences are encoded into the language, and the more important the difference, the more
likely it is to be expressed as a single word. A well-known and very influential example
of a multifactorial approach is the Big Five (Costa & McCrae, 1992; John & Srivas-
tava, 1999), which owes its name to the five traits it takes as constitutive of people’s
personality:

1 extraversion versus introversion (sociable, assertive, playful vs aloof, reserved, shy);
2 emotional stability versus neuroticism (calm, unemotional vs insecure, anxious);
3 agreeableness versus disagreeableness (friendly, cooperative vs antagonistic, faultfinding);
4 conscientiousness versus unconscientiousness (self-disciplined, organized vs inefficient, careless);
5 openness to experience (intellectual, insightful vs shallow, unimaginative).

Over the last fifty years the Big Five has become a standard in psychology. At
least three groups of researchers have worked independently on this problem and
have identified the same Big Five factors: Goldberg at the Oregon Research Institute
(Peabody & Goldberg, 1989), Cattell (1957) at the University of Illinois, and Costa and
McCrae (1992) at the National Institutes of Health. Despite the different methodolo-
gies exploited, the different names, and sometimes the different internal constitutions
of the five factors, the consensus is high on their meaning and on their breadth of cov-
erage (Grucza & Goldberg, 2007). Over the years, several studies using the Big
Five have repeatedly confirmed the influence of personality traits on many aspects of
individual behavior, including leadership (Hogan, Curphy, & Hogan, 1994; Judge et al.,
2002), general job performance (Hurtz & Donovan, 2000), sales ability (Furnham &
Fudge, 2008), teacher effectiveness (Murray, Rushton, & Paunonen, 1990), and so on.
For example, Judge, Heller, and Mount (2002) found that extraversion, conscientious-
ness, and neuroticism were significant predictors of job satisfaction. As far as leadership
was concerned, after an initial skepticism Judge et al. (2002) found that all the traits with
the exception of agreeableness have nonzero correlations with the leadership criteria
(leader emergence and leader effectiveness). Finally, in a comprehensive meta-analysis
of the personality team effectiveness literature, Bell (2007) found that each of the Big
Five traits significantly predicts team effectiveness. Additional studies have shown that
subjective well-being is related to the five factors of personality, especially neuroticism,
extraversion, and conscientiousness, and that, although subjective well-being is not sub-
sumed by personality, the two constructs are reliably correlated (DeNeve & Cooper,
1998; Vittersø, 2001).
Big Five traits have also been shown to influence the human–technology relationship,
affecting attitudes toward computers in general as well as toward specific technologies,
such as adaptive systems (Goren-Bar et al., 2006), conversational agents (André et al.,
1999), tutoring systems (Zhou & Conati, 2003), and assistive robots (Tapus, Tapus, &
Mataric, 2008). For all these reasons, most of the works concerned with the automatic
prediction of personality have addressed the Big Five. Personality is also an important
piece of knowledge that can be used to build effective persuasive systems: people, in
fact, may react differently to persuasive stimuli according to their personality (Janis,
1954).
In this chapter, we discuss two approaches to automatic personality prediction.
The first approach takes inspiration from human processes of personality attribution,
whereby humans deploy knowledge about personality to attribute traits to other people,
even those they have never met before and even on the basis of very short sequences (down to a
few seconds) of expressive behavior, so-called thin slices (Ambady & Rosenthal, 1992;
Ambady, Bernieri, & Richeson, 2000). The human attribution process can be described
by means of Brunswick’s lens model (Figure 13.1), as modified by Scherer (1978).

[Figure 13.1 Brunswick’s lens model: a trait is externalized in objectively measurable distal cues (D1, ..., Dn); the perceiver represents these as proximal percepts (P1, ..., Pn), whose inferential utilization leads to the attribution.]
Omitting details that are not relevant for our purposes, in this model personality
traits are externalized or made manifest in behavior by means of objectively measur-
able variables called distal cues, which the perceiver represents him/herself as subjec-
tive/proximal percepts; these percepts are then subjected to inferential processes leading
to attribution. For instance, a distal cue (an externalization) of the extraversion trait can
be the voice pitch which the observer represents as loudness (the proximal percept) to
use in the course of the inferential process.
The second approach to automatic personality prediction exploits personality’s role
in shaping the structure of the social networks that we are part of: the number and type
of contacts we have, the way they are mutually linked, and so on, all reflect to a vary-
ing degree our personality profile. Contrary to the detailed microlevel information often
exploited by the first approach, humans have limited access to network-type information
and are not specifically attuned to it. Computer systems, in turn, can access and exploit
the huge amounts of information about the networks in which people live that are con-
tained in the digital traces of individuals’ and groups’ behaviors provided by wearable
sensors, smart phones, e-mails, social networks (e.g., Facebook, Twitter, etc.), and the
like.
Finally, we describe a novel and alternative approach focusing on automatic clas-
sification of excerpts of social behavior into personality states (Fleeson, 2001) corre-
sponding to the Big Five traits, rather than dealing with the traditional goal of using
behaviors to infer personality traits. In our opinion, such an approach opens
interesting prospects for the task of automatically computing personalities. In the first
place, it provides the necessary flexibility to ground the relationship between behavior
and personality by emphasizing the situational characteristics together with personal-
ity as one of the key determinants of actual behavior. Second, this flexibility can be
expected to make easier not only the task of predicting personality from actual behav-
iors, but also the converse tasks of predicting and explaining behaviors from people’s
personality.
When Less is More: Thin Slice-based Personality Attribution

One way to endow computers with the ability to predict people’s personality is to adopt
some variants of the (modified) Brunswick model introduced above and apply it to zero-
acquaintance cases exploiting thin slices of expressive behavior. Several psychological
studies showed that personality traits can be judged rapidly from brief observations
(Borkenau & Liebler, 1992; Funder & Sneed, 1993). In one study, 148 participants
were recorded on video while they entered a room, walked over to a seated female
experimenter who greeted them, and then took their seat and began a brief interview
(Dabbs & Bernieri, 1999). From these tapes only the first thirty seconds were extracted,
so each slice contained little more than the entry, the meeting, the greeting,
and the seating. All the participants had previously been assessed by filling in a Big
Five personality questionnaire. Naive observers then judged each of the 148 participants on each
of the Big Five traits. The result was that judgments of extraversion,
agreeableness, conscientiousness, and openness correlated significantly with the targets’
traits, whereas neuroticism was the only trait that did not.
In computational approaches, a similar task can be modeled either as a classification
or a regression task. The behavioral cues used to form the thin slices would function
much as the distal cues of human attribution studies and inspiration can be taken from
these studies to find out which cues to employ. It should be noted that the process
only partially follows Brunswick’s lens model: no space is given to proximal
percepts, and the inferential part (usually realized through machine learning)
works directly on the distal cues.
The ground truth for the target variable – personality assessments – can be provided by
means of standard questionnaires that are either compiled by the subjects themselves
(self-assessment) or by other people (other-assessment) who are either well acquainted
with the target subjects (e.g., relatives) or, more frequently, strangers who see them for
the first time. The way personality is measured in the ground truth determines
to a considerable extent the scope and the nature of the attribution task, as we will see
in the following.
As an example of the first approach, Lepri, Subramanian et al. (2012) obtained self-
assessments of the extraversion trait from the members of small groups convened to
address the so-called mission survival task. In this task, participants were asked to dis-
cuss and reach a consensus on how to survive a disaster scenario – a plane crash in
the Canadian mountains – by ranking up to fifteen items according to their importance
for the crew members’ survival. Each meeting was audio and video recorded. Drawing
on theoretical insights that identify the core of the extraversion trait in the tendency to
engage, attract, and enjoy social attention, the authors exploited distal behavioral cues,
such as a subject’s speaking time, the amounts of visual attention (gaze) he/she received
from, and the amount of visual attention he/she gave to the other group members. For
each subject, thin slices were formed by taking sequences of those behavioral cues in
windows of varying size (1–6 minutes) covering the duration of the whole meeting.
Each thin slice was then summarized by a feature vector consisting of the means and
standard deviations of the behavioral cues. The system’s task was to classify each thin
slice/feature vector as being produced either by an introvert or an extrovert, where such
a distinction was built by quantizing the scores of the self-assessed extraversion trait. A
similar approach was also used for predicting personality in different settings such as
short (30–120 seconds) self-presentations (Batrinca et al., 2011), human–computer col-
laborative tasks (Batrinca et al., 2012), social scenarios where a group of people move
freely and interact naturally (e.g., an informal cocktail party) (Zen et al., 2010), and
so on.
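As a concrete illustration, the following minimal sketch (in Python, not the authors’ code) mirrors the thin-slice pipeline just described: behavioral cue streams are cut into fixed windows, each window is summarized by per-cue means and standard deviations, and the binary introvert/extrovert label is obtained by median-splitting self-assessed extraversion scores. The cue names, window length, classifier, and synthetic data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def thin_slice_features(cues, window):
    """cues: array of shape (n_samples, n_cues) holding per-second behavioral
    cues (e.g., speaking time, gaze received, gaze given); returns one feature
    vector (means and standard deviations of every cue) per non-overlapping window."""
    n_windows = cues.shape[0] // window
    slices = cues[: n_windows * window].reshape(n_windows, window, -1)
    return np.hstack([slices.mean(axis=1), slices.std(axis=1)])

rng = np.random.default_rng(0)
# Synthetic stand-in: 20 subjects, 30 minutes of 3 cues sampled once per second.
scores = rng.normal(size=20)                        # self-assessed extraversion
labels = (scores > np.median(scores)).astype(int)   # median split: introvert/extrovert
X, y = [], []
for label in labels:
    feats = thin_slice_features(rng.normal(size=(1800, 3)), window=180)  # 3-minute slices
    X.append(feats)
    y.extend([label] * len(feats))

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# A real evaluation would split by subject (e.g., leave-one-subject-out) so that
# slices of the same person never appear in both training and test folds.
print(cross_val_score(clf, np.vstack(X), np.array(y), cv=5).mean())
```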
Mohammadi and Vinciarelli (2012), in turn, exploited a set of eleven strangers to
assess the personality of the speakers of a number of audio clips (one speaker per clip)
by means of a standard Big Five questionnaire. The final trait scores for each speaker
were obtained by averaging the scores provided by the observers. A thin slice corre-
sponded to a clip and consisted of the sequence of values taken by the subject’s pitch,
the first two formants, the speech energy, and the length of the voiced and unvoiced seg-
ments, each measured from 40 ms long windows. The choice of those distal cues was
motivated by the extensive literature showing their importance in human attribution pro-
cesses. Each thin slice was then summarized by means of a feature vector consisting of
the following values for each distal cue: its minimum, its maximum, the mean, and the
relative entropy of the difference between the values of the cue in two consecutive win-
dows. Five classification tasks, one per Big Five trait, were set up, each targeting a binary
distinction such as introvert/extrovert, neurotic/emotionally stable, and so on, built on
top of the combined zero-acquaintance assessments made by the external observers.
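A rough sketch of the per-clip summarization is shown below. The “relative entropy” of consecutive-window differences is rendered here as the normalized entropy of a histogram of frame-to-frame differences, which is only one plausible reading of the description above; the cue names and data are placeholders rather than the exact feature set of the original study.

```python
import numpy as np

def clip_features(cue_track, n_bins=20):
    """Summarize one distal cue (e.g., a pitch contour measured over 40 ms
    windows) by its minimum, maximum, mean, and a normalized entropy of the
    frame-to-frame differences."""
    diffs = np.diff(cue_track)
    hist, _ = np.histogram(diffs, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    rel_entropy = float(-(p * np.log(p)).sum() / np.log(n_bins))  # in [0, 1]
    return np.array([cue_track.min(), cue_track.max(), cue_track.mean(), rel_entropy])

# One feature vector per clip: concatenate the summaries of the distal cues
# (names below are illustrative stand-ins for pitch, formants, energy, and
# voiced/unvoiced segment lengths).
rng = np.random.default_rng(0)
tracks = {name: rng.normal(size=250) for name in
          ["f0", "f1", "f2", "energy", "voiced_len", "unvoiced_len"]}
feature_vector = np.concatenate([clip_features(t) for t in tracks.values()])
print(feature_vector.shape)   # (24,) -> input to one binary classifier per trait
```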
These two works are similar in many respects: both implement a partial Brunswick
lens model, by going directly from distal cues to personality attribution without interme-
diate entities (proximal percepts) and via statistical/machine learning. Both exploit thin
slices in the form of short sequences of expressive behavior built from the selected dis-
tal cues; both summarize thin slices by feature vectors consisting of measures extracted
from the behavioral sequences; both use those feature vectors as training instances and,
ultimately, behavioral excerpts that the machine uses to provide personality attributions.
In all these respects, both works address an automatic, zero-acquaintance, personality
attribution task through thin slices. There are important differences though. Moham-
madi and Vinciarelli push the similarity to the human zero-acquaintance case further
by: (a) exploiting zero-acquainted (stranger) observers for providing the personality
ground truth and (b) exploiting as distal cues behavioral features that have been shown
to be operational in human attribution studies. In a way, these authors address human
zero-acquaintance attribution by modeling the attribution of an “average” unacquainted
layperson. Lepri, Subramanian et al., in turn, exploit self-provided personality scores
and select distal cues according to one of the current theories (the social attention the-
ory) about the addressed trait (extraversion). In a way, they are modeling personality
attribution as performed by an agent exploiting “objective” or “expert-like” information
about personality; we will refer to this variant of the attribution task as the psychologist
task.
The differences between the two tasks probably account for the different accuracies:
limiting the comparison to the sole extraversion, Mohammadi and Vinciarelli (2012)
report accuracy values starting from 73 percent on their whole data set and going up to
85 percent on the subset of data where nine or more of their eleven observers agreed.
Lepri, Subramanian et al., in turn, report a maximum of 69 percent on manually anno-
tated data and 61 percent on automatically annotated ones. Other reasons aside, the layper-
son task seems advantaged because it aligns personality scores and distal cues around
unacquainted observers. The psychologist task is at a disadvantage because the data it works with
are of different origins: the target variable comes from self-assessments while the distal
cues are indirectly suggested by a theory, which, in the end, might well be less than
perfect. There have been attempts at using distal cues motivated by studies on human
personality attribution to predict self-ratings but the results were somewhat inconclu-
sive: some works (Mairesse et al., 2007) reported accuracies not higher than 60 percent,
while other studies (Pianesi et al., 2008) reported values higher than 90 percent for
self-ratings.
In the end, self- and other-attributions are different tasks in humans – witness the
low, r = 0.20, correlations between scores from self- and stranger-assessments reported
in Borkenau and Liebler (1993) – and once modeled by computers provide diverging
results. The layperson task can model the performances of strangers by exploiting the
same distal cues humans are purported to use; to the extent that those cues are available,
the results seem promising. Its long-term target, however, is unclear: endowing comput-
ers with the attribution capabilities of an average observer can prove unfeasible because
of the high inter-observer variability and the resulting need to start considering addi-
tional aspects such as the influence of culture, the assessor’s personality, and so on. The
psychologist task, in turn, relies on two tacit assumptions: the first is that self-ratings are a
more objective form of personality assessment than other-assessments (a common assumption
in psychology practice); the second is that with computers it is possible to
extend the lens model of others’ attributions to self-attributions and make computers
capable of providing the same assessment as the self. For the execution of such an “artificial”
task, distal cues must be tried out that are suggested by current psychological
theories on the nature of the various traits, but the lack of an established empirical relation
between such cues and the self-assessments might make it difficult to obtain high accuracy
figures.
A final word about the sensitivity of the two versions of the attribution task to the
social context: despite the widely shared expectation that the (social) context modulates
the expression (the externalization) of personality traits, and despite the space the debate has
taken up in the psychological literature, the topic has received little attention in the
computational literature, with the exception of Pianesi et al. (2008) and Lepri, Subramanian
et al. (2012). In both cases, the authors adopted the same simple strategy of representing
the social context by expanding the feature vectors to include the same features for the
other group members as for the target. The results support the expectations: both papers
report significantly higher accuracy rates when the context is taken into account than
when it is not used (92% vs 84%, and 64% vs 57%, respectively). Those are initial and
very rough attempts to address the context in attribution tasks. We will return to the
importance of situational aspects later on.
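The context-expansion strategy used in both papers amounts to simple feature concatenation; a minimal sketch with invented feature values follows.

```python
import numpy as np

def with_context(features_by_member, target):
    """features_by_member: dict subject_id -> 1-D feature vector (equal lengths).
    Returns the target's features followed by the other group members' features,
    in a fixed (sorted) order so that vectors are comparable across thin slices."""
    others = [features_by_member[m] for m in sorted(features_by_member) if m != target]
    return np.concatenate([features_by_member[target]] + others)

group = {
    "A": np.array([0.6, 0.1]),   # e.g., mean speaking time, mean gaze received
    "B": np.array([0.2, 0.4]),
    "C": np.array([0.3, 0.2]),
}
print(with_context(group, target="A"))   # 6-dimensional instead of 2-dimensional
```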
Friends Don’t Lie: Exploiting the Structure of Social Networks

Traditionally, network science researchers devote their attention to the structure of the
network and to how the behavior of individuals depends on their position in the network;
individuals occupying central positions and having denser networks may gain faster
access to information and assistance (see Borgatti & Foster, 2003). More recently, a number
of works in social psychology and network science have started investigating the
role that individual psychological differences have in the structuring of social networks,
with an emphasis on ego-nets, the subnets consisting of a focal node, the “ego,” and
the nodes to which ego is directly connected, the “alters,” plus the ties among them,
if any (Kalish & Robins, 2006; Pollet, Roberts, & Dunbar, 2011). For instance, several
studies reported a positive correlation between extraversion and ego-network size. How-
ever, extraversion tends to decline with age and, after controlling for age, Roberts et al.
(2008) found no effect of extraversion on the size of the ego-network. Instead, Klein
et al. (2004) found that people who were low in neuroticism tended to have high degree
centrality scores in the advice and friendship networks. Unfortunately, their analysis
reports only in-degree centrality and hence it does not allow a complete investigation
of relationships between the local network structures and the personality traits of the
ego. In order to overcome the limitations of this work, Kalish and Robins (2006) pre-
sented a new method of examining strong and weak ties of ego-networks through a
census of nine triads of different types (e.g., WWW, SNS, SSS, where W means “weak
tie,” S means “strong tie,” and N means “no tie”). Their results suggest that the number
of strong triangles – configurations in which ego is connected through strong ties to
two alters who are in turn connected among themselves by a strong tie – is positively
correlated with extraversion and inversely correlated with neuroticism. In other words,
extroverts seem to apply the “friends of my friends are my friends” principle for strong
relationships, whereas neurotic people refrain from doing so. These and other results
can be leveraged for the task of automatically predicting personality traits by exploit-
ing the rich array of traces that the digitalization of human communication (e-mails,
phone calls, SMSs) makes available. Not a straightforward move, though, because the
networks built from digital traces are fundamentally different from those exploited in
the traditional social network literature. These usually resort to self-report surveys that
directly address the dimensions of interest (friendship, collaboration, information flow)
so that the ties between the nodes (the individuals) are directly interpretable in terms of
those dimensions – for example, a tie between subject A and subject B means that they
are friends. With digital traces, network ties reflect events, such as e-mail exchanges,
SMSs, presence in the same place (e.g., via Bluetooth or GPS), or combinations thereof;
we know that A and B have exchanged e-mails or that A calls B, but what this means (are
they friends?) is not determined by the event itself. As a first consequence, the results
from traditional social network theory cannot be directly transferred to networks built
from digital traces; the second consequence is that different digital traces – for exam-
ple, e-mails versus phone calls – give rise to different networks, with different meanings
and different structural properties even for the same population. Going back to the task
of automatically predicting personality traits, we can expect complex patterns to arise
whereby traits are differentially associated with specific combinations of network types
and structural properties.

Table 13.1 ANOVA results.

                             Agreeableness  Conscientiousness  Extraversion  Neuroticism  Openness
Network type                 11.422∗∗∗      17.113∗∗∗          44.254∗∗∗     4.082∗∗      7.199∗∗∗
Index Class                  3.633∗∗   2.124∗
Network type × Index Class   1.699∗   1.412!   2.269∗∗   1.529∗

Note: F values and their significance values (!: p < 0.1; ∗: p < 0.05; ∗∗: p < 0.01; ∗∗∗: p < 0.001).
Staiano et al. (2012) considered the role of a number of structural ego-network indices
in the prediction of the (self-assessed) Big Five personality traits on two different types
of undirected networks based on: (a) cell-phone Bluetooth hits and (b) cell-phone calls.
The exploited data set comprised the digital traces of fifty-three subjects for a period of
three months.
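Before any structural index can be computed, the raw traces must be turned into graphs. The sketch below (assumed data layout, using the networkx library) builds simple event-count networks from call and Bluetooth-proximity logs and then extracts an ego-network.

```python
import networkx as nx

def event_network(events):
    """Build an undirected graph from a list of (person, person) event pairs,
    weighting each edge by the number of observed events."""
    g = nx.Graph()
    for a, b in events:
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)
    return g

# Toy stand-ins for three months of traces.
call_log = [("ego", "anna"), ("ego", "bob"), ("anna", "bob"), ("ego", "anna")]
bt_hits = [("ego", "anna"), ("ego", "carl"), ("anna", "carl")]

call_net = event_network(call_log)
bt_net = event_network(bt_hits)

# Ego-network of a focal subject: the ego, its alters, and the ties among them.
ego_net = nx.ego_graph(call_net, "ego")
print(ego_net.number_of_nodes(), ego_net.number_of_edges())
```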
Structural indices were grouped into four basic classes: centrality, efficiency, Kalish
and Robins (2006) triads, and transitivity. Centrality indices attempt to assess the impor-
tance of a node in a network and differ according to the property they select to represent
importance; in Staiano et al.’s (2012) work the properties and the corresponding indices
were: (i) the number of ties of a node (degree centrality); (ii) the closeness of the node
to other nodes (closeness); (iii) the extent to which the node is an intermediary between
other nodes (betweenness centrality); and (iv) the node’s contribution to the cohesive-
ness of the whole network (information centrality).
The notion of efficiency can be used to characterize the flow of information in the
network, with higher efficiency being associated with highly clustered networks (cliques).
In particular, node efficiency computes how similar the ego-net is to a clique while local
efficiency targets the average path length in the ego-net.
We already encountered triadic measures when discussing Kalish and Robins
(2006) – they consist of nine indices distinguishing the triplets of ego-net nodes accord-
ing to the strength of their ties (if any) and to how close each triplet comes to forming a
triangle. Transitivity measures account for much the same intuitions as triads but in
a more compact way, computing ratios of the triangles available in the ego-net to the
number of possible triangles.
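The four classes of indices can be approximated with off-the-shelf graph measures, as in the sketch below; the efficiency and centrality functions used here are related to, but not necessarily identical with, the exact definitions adopted by Staiano et al. (2012), and the triad census is omitted.

```python
import networkx as nx

g = nx.ego_graph(nx.karate_club_graph(), 0)   # stand-in for a subject's ego-network
ego = 0

centrality = {
    "degree": nx.degree_centrality(g)[ego],
    "closeness": nx.closeness_centrality(g)[ego],
    "betweenness": nx.betweenness_centrality(g)[ego],
    "information": nx.information_centrality(g)[ego],   # needs scipy; current-flow closeness
}
efficiency = {
    "global": nx.global_efficiency(g),   # average inverse shortest-path length
    "local": nx.local_efficiency(g),     # average efficiency of each node's neighborhood
}
clustering = {
    "transitivity": nx.transitivity(g),          # closed triangles / possible triangles
    "avg_clustering": nx.average_clustering(g),
}
print(centrality, efficiency, clustering)
```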
Staiano et al. (2012) ran a classification experiment building one classifier, based
on “Random Forest” (Breiman, 2001), for each index-class–network-type–personality
trait combination; the resulting classification accuracies were then compared through a
number of modified analyses of variance, one for each trait. Table 13.1 summarizes the
results.
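The experimental grid can be pictured as one classifier per combination, as in the following sketch; the feature arrays, labels, and evaluation protocol are placeholders rather than the original setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects = 53
traits = ["agreeableness", "conscientiousness", "extraversion", "neuroticism", "openness"]
index_classes = ["centrality", "efficiency", "triads", "transitivity"]
networks = ["BT", "Call"]

# Placeholder inputs: features[network][index_class] -> (n_subjects, n_indices),
# labels[trait] -> binarized self-assessment for every subject.
features = {n: {c: rng.normal(size=(n_subjects, 4)) for c in index_classes} for n in networks}
labels = {t: rng.integers(0, 2, size=n_subjects) for t in traits}

accuracies = {}
for net in networks:
    for idx in index_classes:
        for trait in traits:
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            acc = cross_val_score(clf, features[net][idx], labels[trait], cv=5).mean()
            accuracies[(net, idx, trait)] = acc

# The resulting 2 x 4 x 5 grid of accuracies is what the per-trait ANOVAs compare.
print(max(accuracies, key=accuracies.get))
```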
As can be seen, the network type always influences the results. A detailed analysis of
the source of those effects reveals that BT (network based on cell phone Bluetooth hits)
is always superior to Call (network based on cell phone calls), but with neuroticism the
pattern was reversed. Table 13.2 reports the accuracy results obtained for each trait on
the corresponding best network type.

Table 13.2 Best accuracy results.

               Agreeableness  Conscientiousness  Extraversion  Neuroticism  Openness
               with BT        with BT            with BT       with Call    with BT
Centrality     74%            72%                73%           74%          71%
Efficiency     67%            67%                71%           67%          66%
Triads         70%            66%                70%           60%          70%
Transitivity   73%            62%                80%           57%          74%
Centrality indices always produce accuracy values higher than 70 percent, empha-
sizing the relevance of this class of indices for any of the considered personality
traits. Transitivity, in turn, yields quite a high accuracy value with extraversion, show-
ing the importance of the way ego-nets are clustered for this specific trait. Notably,
transitivity seems to be more effective here than triads, which, though addressing
similar structural properties, do so in a different manner and by means of a
higher number of indices. Finally, transitivity produces good performances with agree-
ableness (another “sociable” trait) and openness (the most elusive of the Big Five
traits).
In conclusion, as far as this study goes, for all traits but neuroticism, information
about simultaneous spatial co-location (the BT network) seems more effective than the
information about person-to-person contacts provided by the Call network. The inver-
sion of the pattern with neuroticism could be associated with specific properties of this
trait, notably the need to keep under control potentially anxiogenic social situations such
as those where many people are co-located in the same place. The generally good results
of centrality measures can be attributed to the different structural properties that the
chosen indices exploit (closeness, information flow, intermediary role, etc.), which allow
this class of indices to adapt to the different behavioral contexts (co-location
vs point-to-point communication) and the different trait properties. Finally, the associ-
ation of transitivity indices with the BT network for extraversion could be related to the
tendency of extroverted people to keep their close partners together, possibly by intro-
ducing them to one another at the social gatherings captured by the BT
network. Interestingly, using social data from Facebook, and more precisely the ego-
networks containing the list of ego’s friends, Friggeri et al. (2012) found a negative
correlation between extraversion and the partition ratio. The partition ratio quantifies
the extent to which the communities of an ego-network are disjoint from one another.
Hence, this result implies that there is a link between the compartmentalization of the
ego-network and the subjects’ extraversion. In simple words, more extroverted subjects
tend to be in groups that are linked to each other, while less extroverted subjects tend to
be in more distinct and separate social groups. This observation is compatible with the
results obtained by Staiano et al. (2012), which suggest a tendency of extroverts to introduce
friends belonging to different communities to one another.
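Friggeri et al.’s partition ratio is a specific metric; the sketch below computes only an illustrative compartmentalization score in the same spirit, by detecting communities among the alters and measuring the fraction of alter pairs that fall into different communities.

```python
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def compartmentalization(ego_net, ego):
    """Fraction of alter pairs that fall into different communities once the ego
    is removed; 0 means all alters share one community, higher values mean a
    more compartmentalized ego-network."""
    alters = ego_net.copy()
    alters.remove_node(ego)
    if alters.number_of_nodes() < 2:
        return 0.0
    communities = greedy_modularity_communities(alters)
    membership = {n: i for i, c in enumerate(communities) for n in c}
    pairs = list(itertools.combinations(alters.nodes(), 2))
    return sum(membership[a] != membership[b] for a, b in pairs) / len(pairs)

# Toy ego-network: one tight triad of friends plus a separate pair.
g = nx.Graph()
g.add_edges_from([("ego", "a1"), ("ego", "a2"), ("ego", "a3"), ("ego", "b1"), ("ego", "b2"),
                  ("a1", "a2"), ("a2", "a3"), ("a1", "a3"), ("b1", "b2")])
print(compartmentalization(g, "ego"))   # 0.6: most alter pairs sit in different groups
```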
Digital trace data can also be used in a more direct manner for personality prediction,
as in the works of Chittaranjan et al. (2011, 2013) and De Montjoye et al. (2013), who
exploited behavioral aggregations of digital trace data (e.g., number of calls made and received,
their duration, diversity of contacts, and so on) rather than network struc-
tural properties. The usage of network structural properties, however, has the advantage
of making the approach less sensitive to the behavioral data variability problem that we
will discuss in the final section.
In conclusion, it is indeed the case that the exploitation of behavioral data in the
form of digital traces can considerably change the picture in disciplines exploiting the
methods of social network analysis and greatly impact on the task of automatically
predicting personality. From the methodological point of view, dealing with networks
that are built from events rather than from survey items targeting specific dimensions
of interest requires the researcher to reconstruct the meaning of the networks and to
be ready to adapt the analysis to the data accordingly. From a substantive point of view, the
availability of different social behaviors in the form of digital traces opens the possibility
of finding the best combinations of social behavior and structural properties for specific
personality traits. Finally, the prospect of merging social behaviors into one and the
same network, for example, by using multidimensional networks containing multiple
connections between any pair of nodes that are differentially sensitive to specific trait
combinations (Berlingerio et al., 2013), is completely open to investigation.

Beyond the Personality Traits: Searching for a New Paradigm

With the exception of Staiano et al.’s (2012) usage of abstractions over concrete behav-
iors in the form of structural network properties, all the considered versions of the auto-
matic personality prediction task resort to excerpts of a person’s behavior to provide the
machine equivalent of judgments about his/her personality (Pianesi et al., 2008; Lepri
et al., 2010; Batrinca et al., 2011, 2012). A fundamental problem with this formulation
of the personality prediction task (and the related behavior prediction task) is that traits
are stable and enduring properties but people do not always behave the same way: an
extrovert might, on occasions, be less talkative or attempt less to attract social attention;
a neurotic person need not always react anxiously, and so on. Behavioral variability
has the effect that attributions based on, for example, thin slices will always exhibit a
certain amount of dependence on the thin slices used. There is, in other words, a ten-
sion between the invariance of personality traits and the natural variability of behavior
in concrete situations that risks seriously hampering current attempts at automatically
predicting personality traits.
In psychology studies, such a tension has often been resolved by considering behav-
ior variability as noise that has to be canceled out by, for example, employing larger
behavioral samples; an approach commonly employed both in psychological and com-
putational works on personality. Although this move is surely recommended in compu-
tational studies too and will improve results, it can be argued that it cannot itself solve
the problem because within-person variability is not just noise to be canceled out. On
the contrary, stemming from the interaction between enduring traits and variable situ-
ational properties, it can give a valuable contribution to personality prediction and to
understanding the personality–behavior relationship (Fleeson, 2001). If we accept the
idea that people routinely express all levels of a given trait depending on situational
characteristics, then (a) neglecting the informative value of within-individual variation
is going to remain a serious limitation to the development of automated personality
prediction and (b) we should investigate alternatives that exploit the interplay between
personal and situational characteristics.
One such alternative bites the bullet and shifts the attention to actual behaviors in
the form of personality states (Fleeson, 2001) and the associated situational character-
istics. Personality states are concrete behaviors (including ways of acting, feeling, and
thinking) that can be described as having the same contents as traits. A personality state
is, therefore, a specific behavioral episode wherein a person behaves more or less intro-
vertedly, more or less neurotically, and so on. Personality traits can be reconstructed
as distributions over personality states conditioned on situational characteristics. People
would differ because of their different personality state distributions, meaning that, for
example, an introvert does not differ from an extrovert because he/she never engages in
extrovert behaviors, but because he/she does so in a different manner. Such an approach
would reconcile the traditional focus on between-person variability with the meaningful-
ness of within-individual variability by turning actual behaviors into personality states
and sampling the corresponding space on the basis of circumstantial properties. Such
an approach could contribute to advance not only the task of personality prediction but
also the related task of predicting/explaining behavior from personality by matching
current situational properties to those effective for a given trait and then retrieving the
corresponding state distribution.
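Operationally, reconstructing a trait as a distribution over states can start from something as simple as grouping experience-sampling reports by person and by situation, as in the sketch below; the column names and values are invented.

```python
import pandas as pd

# Each row is one experience-sampling report: a state-level extraversion rating
# (e.g., on a 1-7 scale) plus a label for the situation the person was in.
reports = pd.DataFrame({
    "subject":   ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"],
    "situation": ["meeting", "meeting", "coffee", "coffee",
                  "meeting", "meeting", "coffee", "coffee"],
    "extraversion_state": [5, 6, 3, 4, 2, 3, 5, 6],
})

# A trait-level description as a distribution of states, conditioned on the situation:
summary = (reports
           .groupby(["subject", "situation"])["extraversion_state"]
           .agg(["mean", "std", "count"]))
print(summary)
```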
Recently, Lepri, Staiano et al. (2012) ran a six-week-long study in which they mon-
itored the activities of fifty-three people in a research institution during their working
days. In particular, during the study both stable and transient aspects of the individ-
uals were collected: (i) stable and enduring individual characteristics, that is, personality
traits, and (ii) transient personality-related states that the person goes through during
his/her daily life at work, that is, personality states. To keep track of the transient
personality states, an experience sampling methodology was employed: participants were
asked to fill in three short internet-based surveys each working day. The questions in the
experience sampling referred to the personality states experienced over the thirty minutes
prior to the survey.
The behavioral cues of the participants were collected by means of the SocioMetric
Badges, wearable sensors able to provide information about: (i) human movement, (ii)
prosodic speech features (rather than raw audio signals), (iii) indoor localization, (iv)
proximity to other individuals, and (v) face-to-face interactions (Olguin Olguin et al.,
2009).
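Because each survey refers to the thirty minutes preceding it, sensor-derived cues have to be aggregated over that window before they can be paired with the reported states. The following alignment sketch uses assumed column names and toy values.

```python
import pandas as pd

cues = pd.DataFrame({
    "subject": ["s1"] * 4,
    "time": pd.to_datetime(["2012-03-01 09:10", "2012-03-01 09:25",
                            "2012-03-01 09:40", "2012-03-01 09:55"]),
    "n_people_nearby": [2, 3, 1, 4],       # e.g., from infrared/Bluetooth hits
})
surveys = pd.DataFrame({
    "subject": ["s1"],
    "time": pd.to_datetime(["2012-03-01 10:00"]),
    "extraversion_state": [5],             # self-reported state for the last 30 minutes
})

window = pd.Timedelta(minutes=30)
rows = []
for _, s in surveys.iterrows():
    in_window = ((cues["subject"] == s["subject"]) &
                 (cues["time"] > s["time"] - window) &
                 (cues["time"] <= s["time"]))
    rows.append({"subject": s["subject"],
                 "mean_people_nearby": cues.loc[in_window, "n_people_nearby"].mean(),
                 "extraversion_state": s["extraversion_state"]})
print(pd.DataFrame(rows))   # one training instance per survey
```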
Using the data collected by Lepri, Staiano et al. (2012), Kalimeri et al. (2013) made a
first attempt at addressing the new perspective of automatically recognizing personality
states. More precisely, they focused on the classification of excerpts of social behav-
ior into personality states corresponding to the Big Five traits, rather than dealing with
the more traditional goal of using behaviors to infer personality traits. To these
ends, Kalimeri et al. exploited cues referring to acted social behaviors, for instance, the
number of interacting people and the number of people in close proximity as well as to
other situational characteristics, such as time spent in the canteen, in coffee breaks, in
meetings, and so on. In terms of accuracy, the results obtained by Kalimeri
et al. (2013) are quite promising. Compared to a baseline of 0.33, they obtained the fol-
lowing highest accuracy figures: 0.6 for extraversion states; 0.59 for conscientiousness
states; 0.55 for agreeableness states; 0.7 for emotional stability states; 0.57 for openness
states. In a number of cases (extraversion and conscientiousness) evidence was found
for a role of the social context, while in others (agreeableness, emotional stability, and
openness) such evidence was not conclusive. Other valuable results concern indica-
tions about the effectiveness of the features extracted, all of them built from signals or
information provided by widely available means (BT, wearable microphones,
e-mails, infrared sensors). Interestingly, for extraversion states the highest performance
was obtained by combining information from infrared sensors and e-mails from both the
target subject and the people she/he interacted with face-to-face. This is of some interest
for at least two reasons: in the first place, one could have expected speech features to
play some role given the relevance that both the psychosocial and the computational lit-
erature have assigned to them for the extraversion trait. If confirmed by further studies, this
datum could show that the behavioral fabric of states can be, at least partially, differ-
ent from that of traits. Second, this result emphasizes the role that the communicative
behavior (amount and quality of face-to-face interaction, amount and quality of elec-
tronically mediated communication, typology of interaction targets, etc.) of the people
around the target has for the prediction of extraversion states. From a more general point
of view, the results of these experiments show the feasibility of the proposed perspective
and will hopefully encourage further research.
It can be suggested, therefore, that the prospects for a well-founded theory of the
automatic prediction of personality rely on: (a) a qualitative characterization of actual
behaviors into personality states; (b) the reconstruction of personality traits as state dis-
tributions conditioned on situational properties; (c) a characterization of situations that,
to be useful for this project, must be defined in terms of psychological efficacy rather
than in more traditional space/physical terms (Fleeson, 2001). Recent theoretical and
practical advances in related fields, such as social psychology, social signal processing,
social computation, and ubiquitous computing, make the pursuance of this new perspec-
tive possible.

References

Allport, G. W. & Odbert, H. S. (1936). Trait-names: A psycho-lexical study. Psychological Monographs, 47, 1–171.
Ambady, N., Bernieri, F. J., & Richeson, J. A. (2000). Toward a histology of social behavior:
Judgmental accuracy from thin slices of the behavioral stream. In M. P. Zanna (Ed.), Advances
in Experimental Social Psychology (vol. 32, pp. 201–271). San Diego: Academic Press.
Ambady, N. & Rosenthal, R. (1992). Thin slices of expressive behavior as predictors of interper-
sonal consequences: A meta-analysis. Psychological Bulletin, 111(2), 256–274.
André, E., Klense, M., Gebhard, P., Allen, S., & Rist, T. (1999). Integrating models of personality
and emotions into lifelike characters. In Proceedings of the Workshop on Affect in Interaction
– Towards a New Generation of Interfaces (pp. 136–149).
Andrews, K. (2008). It’s in your nature: A pluralistic folk psychology. Synthese, 165(1), 13–29.
Batrinca, L. M., Lepri, B., Mana, N., & Pianesi, F. (2012). Multimodal recognition of person-
ality traits in human–computer collaborative tasks. In Proceedings of the 14th International
Conference on Multimodal Interaction (ICMI’12).
Batrinca, L. M., Mana, N., Lepri, B., Pianesi, F., & Sebe, N. (2011). Please, tell me about yourself:
Automatic personality assessment using short self-presentations. In Proceedings of the 13th
International Conference on Multimodal Interfaces (ICMI ’11), pp. 255–262.
Bell, S. T. (2007). Deep-level composition variables as predictors of team performance: A meta-
analysis. Journal of Applied Psychology, 92(3), 595–615.
Berlingerio, M., Coscia, M., Giannotti, F., Monreale, A., & Pedreschi, D. (2013). Multidimen-
sional networks: Foundations of structural analysis. World Wide Web, 16, 567.
Borgatti, S. P. & Foster, P. (2003). The network paradigm in organizational research: A review
and typology. Journal of Management, 29(6), 991–1013.
Borkenau, P. & Liebler, A. (1992). Traits inferences: Sources of validity at zero acquaintance.
Journal of Personality and Social Psychology, 62, 645–657.
Borkenau, P. & Liebler, A. (1993). Convergence of stranger ratings of personality and intelligence
with self-ratings, partner ratings and measured intelligence. Journal of Personality and Social
Psychology, 65, 546–553.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Cattell, R. B. (1957). Personality and Motivation: Structure and Measurement. New York:
Harcourt, Brace & World.
Chittaranjan, G., Blom, J., & Gatica-Perez, D. (2011). Who’s who with Big-Five: Analyzing and
classifying personality traits with smartphones. In Proceedings of International Symposium on
Wearable Computing (ISWC 2011).
Chittaranjan, G., Blom, J., & Gatica-Perez, D. (2013). Mining large-scale smartphone data for
personality studies. Personal and Ubiquitous Computing, 17(3), 433–450.
Costa, P. T. & McCrae, R. R. (1992). Four ways why five factors are basic. Personality and Indi-
vidual Differences, 13, 653–665.
Dabbs, J. M. & Bernieri, F. J. (1999). Judging personality from thin slices. Unpublished data.
University of Toledo.
De Montjoye, Y. A., Quoidbach, J., Robic, F., & Pentland, A. (2013). Predicting personality using
novel mobile phone-based metrics. In Proceedings of Social BP (pp. 48–55).
DeNeve, K. M. & Cooper, H. (1998). The happy personality: A meta-analysis of 137 personality
traits and subjective well-being. Psychological Bulletin, 124(2), 197–229.
Fleeson, W. (2001). Toward a structure- and process-integrated view of personality: Traits
as density distributions of states. Journal of Personality and Social Psychology, 80,
1011–1027.
Friggeri, A., Lambiotte, R., Kosinski, M., & Fleury, E. (2012). Psychological aspects of social
communities. In Proceedings of IEEE Social Computing (SocialCom 2012).
Funder, D. C. & Sneed, C. D. (1993). Behavioral manifestations of personality: An ecological
approach to judgmental accuracy. Journal of Personality and Social Psychology, 64, 479–490.
Furnham, A. & Fudge, C. (2008). The Five Factor model of personality and sales performance.
Journal of Individual Differences, 29(1), 11–16.
Goren-Bar, D., Graziola, I., Pianesi, F., & Zancanaro, M. (2006). Influence of personality factors
on visitors’ attitudes towards adaptivity dimensions for mobile museum guides. User Modeling
and User Adapted Interaction: The Journal of Personalization Research, 16(1), 31–62.
Grucza, R. A. & Goldberg, L. R. (2007). The comparative validity of 11 modern personality
inventories: Predictions of behavioral acts, informant reports, and clinical indicators. Journal
of Personality Assessment, 89, 167–187.
Hogan, R., Curphy, G. J., & Hogan, J. (1994). What we know about leadership: Effectiveness and
personality. American Psychologist, 49(6), 493–504.
Hurtz, G. M. & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited.
Journal of Applied Psychology, 85, 869–879.
Janis, L. (1954). Personality correlates of susceptibility to persuasion. Journal of Personality,
22(4), 504–518.
John, O. P. & Srivastava, S. (1999). The Big Five trait taxonomy: History, measurement, and
theoretical perspectives. In L. A. Pervin & O. P. John (Eds), Handbook of Personality: Theory
and Research (pp. 102–138). New York: Guilford Press.
Judge, T. A., Bono, J. E., Ilies, R., & Gerhardt, M. W. (2002). Personality and leadership: A
qualitative and quantitative review. Journal of Applied Psychology, 87(4), 765–780.
Judge, T. A., Heller, D., & Mount, M. K. (2002). Five-factor model of personality and job satis-
faction: A meta-analysis. Journal of Applied Psychology, 87(3), 530–541.
Kalimeri, K., Lepri, B., & Pianesi, F. (2013). Going beyond traits: Multimodal recognition of
personality states in the wild. In Proceedings of 15th International Conference on Multimodal
Interfaces (pp. 27–34).
Kalish, Y. & Robins, G. (2006). Psychological predisposition and network structure: The relation-
ship between individual predispositions, structural holes and network closure. Social Networks,
28, 56–84.
Klein, K. J., Lim, B. C., Saltz, J. L., & Mayer, D. M. (2004). How do they get there? An exam-
ination of the antecedents of network centrality in team networks. Academy of Management
Journal, 47, 952–963.
Lepri, B., Staiano, J., Rigato, G., et al. (2012). The SocioMetric Badges Corpus: A multilevel
behavioral dataset for social behavior in complex organizations. In Proceedings of IEEE Social
Computing (SocialCom 2012).
Lepri, B., Subramanian, R., Kalimeri, K., et al. (2010). Employing social gaze and speaking activ-
ity for automatic determination of the Extraversion trait. In Proceedings of International Con-
ference on Multimodal Interaction (ICMI 2010).
Lepri, B., Subramanian, R., Kalimeri, K., et al. (2012). Connecting meeting behavior with
Extraversion – A systematic study. IEEE Transactions on Affective Computing, 3(4), 443–455.
Mairesse, F., Walker, W. A., Mehl, M. R., & Moore, R. K. (2007). Using linguistic cues for the
automatic recognition of personality in conversation and text. Journal of Artificial Intelligence
Research, 30, 457–500.
Mohammadi, G. & Vinciarelli, A. (2012). Automatic personality perception: Prediction of trait
attribution based on prosodic features. IEEE Transactions on Affective Computing, 3(3), 273–
284.
Murray, H. G., Rushton, J. P., & Paunonen, S. V. (1990). Teacher personality traits and student
instructional ratings in six types of university courses. Journal of Educational Psychology,
82(2), 250–261.
Olguin Olguin, D., Waber, B., Kim, T., et al. (2009). Sensible organizations: Technology for auto-
matically measuring organizational behavior. IEEE Transactions on Systems, Man and Cyber-
netics, Part B: Cybernetics, 39(1), 43–55.
Peabody, D. & Goldberg, L. R. (1989). Some determinants of factor structures from personality-
trait descriptors. Journal of Personality and Social Psychology, 57, 552–567.
Pianesi, F., Mana, N., Cappelletti, A., Lepri, B., & Zancanaro, M. (2008). Multimodal recognition
of personality traits in social interactions. In Proceedings of ACM-ICMI ’08.
Pollet, T. V., Roberts, S. G. B., & Dunbar, R. I. M. (2011). Extraverts have larger social net-
work layers but do not feel emotionally closer to individuals at any layer. Journal of Individual
Differences, 32(3), 161–169.
Roberts, S. G. B., Wilson, R., Fedurek, P., & Dunbar, R. I. M. (2008). Individual differences and
personal social network size and structure. Personality and Individual Differences, 4, 954–964.
Scherer, K. R. (1978). Inference rules in personality attribution from voice quality: The loud voice
of extraversion. European Journal of Social Psychology, 8, 467–487.
Staiano, J., Lepri, B., Aharony, N., et al. (2012). Friends don’t lie – inferring personality traits
from social network structure. Proceedings of UbiComp 2012.
Tapus, A., Tapus, C., & Mataric, M. (2008). User-robot personality matching and robot behavior
adaptation for post-stroke rehabilitation therapy. Intelligent Service Robotics Journal (special
issue on multidisciplinary collaboration for socially assistive robotics), 1(2), 169–183.
Zen, G., Lepri, B., Ricci, E., & Lanz, O. (2010). Space speaks: Towards socially and personality
aware visual surveillance. In Proceedings of Workshop on Multimodal Pervasive Video Analysis
(MPVA 2010), in conjunction with ACM Multimedia.
Zhou, X. & Conati, C. (2003). Inferring user goals from personality and behavior in a causal
model of user affect. In Proceedings of the 8th International Conference on Intelligent User
Interfaces (IUI’03).

Further Reading

Asendorpf, J. B. & Wilpers, S. (1998). Personality effects on social relationships. Journal of Personality and Social Psychology, 74, 1531–1544.
Cassell, J. & Bickmore, T. (2003). Negotiated collusion: Modeling social language and its rela-
tionship effects in intelligent agents. User Modeling and User-Adapted Interaction, 13, 89–132.
Donnellan, M. B., Conger, R. D., & Bryant, C. M. (2004). The Big Five and enduring marriages.
Journal of Research in Personality, 38, 481–504.
Komarraju, M. & Karau, S. J. (2005). The relationship between the Big Five personality traits and
academic motivation. Personality and Individual Differences, 39, 557–567.
Reeves, B. & Nass, C. (1996). The Media Equation. Chicago: University of Chicago Press.
Rotter, J. B. (1965). Generalized expectancies for internal versus external control of reinforce-
ment. Psychological Monographs, 80(1), 1–28.
Sigurdsson, J. F. (1991). Computer experience, attitudes toward computers and personality char-
acteristics in psychology undergraduates. Personality and Individual Differences, 12(6), 617–
624.
Snyder, M. (1974). Self-monitoring of expressive behavior. Journal of Personality and Social
Psychology, 30, 526–537.
14 Automatic Analysis of Aesthetics:
Human Beauty, Attractiveness, and
Likability
Hatice Gunes and Björn Schüller

According to the Oxford English Dictionary the definition of aesthetics is “concerned
with beauty or the appreciation of beauty.” Despite the continuous interest and exten-
sive research in cognitive, evolutionary, and social sciences, the modeling and analysis of
aesthetic canons remain open problems.
Contemporary theories of aesthetics emphasize critical thinking about objects, things,
and people as well as experience, interaction, and value. In this regard, aesthetic norms
have become more relevant to the context of interaction between humans and objects,
human and computers (human–computer interaction or HCI), and between humans
themselves (human–human interaction or HHI) (Kelly, 2013).
When interested readers look up the phrases aesthetics and computing on the web,
they will likely encounter three main areas that appear to be related: aesthetic com-
puting (note the missing “s” at the end), aesthetics in human–computer interaction,
and computational aesthetics. Although there appears to be a close link between these
three, they refer to inherently different fields of research. Aesthetic computing can be
broadly defined as “applying the philosophical area of aesthetics to the field of com-
puting” linked principally to formal languages and design of programs or products
(Fishwick, 2013). Driven by design concerns, aesthetics in HCI focuses on the question
of how to make computational artifacts more aesthetically pleasing (Norman, 2004).
This concern has recently shifted toward aesthetics of interaction, moving the focus
from ease of use to enjoyable and emotionally rewarding experience (Ahmed, Mah-
mud, & Bergaust, 2009). Although this question has significant theoretical and practical
implications, there exists another relevant, yet largely unexplored question of whether
computational approaches can be useful in understanding aesthetic judgment and affect
in the context of HHI and HCI mainly given its highly subjective nature and often highly
different “taste” and perception. Computational aesthetics is the research of computa-
tional methods that can make applicable aesthetic decisions in a similar way to humans
(Hoeing, 2005). In other words, can human aesthetic perception and judgment be quan-
tified computationally, and can we make machines and systems aware of aesthetics sim-
ilarly to humans?
Having reviewed these broad definitions, we narrow down our interest in this chap-
ter to automatic analysis of aesthetics, beauty, attractiveness, and likability. To date, the
automatic analysis of human aesthetics has attracted the interest of computer vision,
computer audition, signal processing, and multimedia researchers in two forms, namely
as aesthetics in the input signal and aesthetics of the input signal. Human aesthetics
analysis in the input signal refers to people in images, video, or audio documents (e.g.,
Gunes & Piccardi, 2006; Gunes, 2011; Bottino & Laurentini, 2010; Nguyen et al., 2012)
where they appear, interact, and/or communicate by means of language, vocal intona-
tion, facial expression, body posture, and so on. Aesthetics analysis of the input signal
refers to the aesthetic quality evoked in human observers by sounds played and text and
images displayed (e.g., how appealing the image returned by an image retrieval system
is; Chan et al., 2012), or the experience evoked by artwork installations. In both cases,
the implication for the relevant research fields is that both the form and content of such
multimedia signals are heavily loaded with aesthetic qualities commonly known and
referred to as “beauty.”
In order to shed an interdisciplinary light on the issue, in this chapter we will pro-
vide a review of canons, norms, and models used for analyzing aesthetics with respect
to human beauty, attractiveness, and likability focusing on the visual and audio cues
measured and interpreted. We will describe the low and high level features extracted,
machine learning methods utilized, and data and media used in experiments carried out
to date. Much of the chapter is on facial and vocal attractiveness, implying that attrac-
tiveness is equivalent to beauty. We would like to remind the reader that this equivalence
is disputed by quite a few scholars: attractiveness focuses on liking and an approach
tendency and is primarily determined by the needs and urges of a person, which makes it
even more individually variable than beauty judgments.

Theories of Beauty, Attractiveness, and Likability

In this chapter, we use the terms beauty, attractiveness, and likability together, or at
times even interchangeably due to their different use in the theories presented, and the
applications and systems provided by the disciplines of video processing and audio
processing. Depending on the modality in question the terminology may vary slightly.
For instance, in speech analysis, the term “speaker likability” is found more frequently
than “vocal attractiveness”; the two clearly address different nuances of similar research
questions, although in practice the terms are often not used fully distinctively. We will
guide the reader through a collection of theories for each modality and cue.

Facial Attractiveness
Researchers have suggested that the frontoparallelness of the face, precisely controlled sym-
metry, height of the internal features, relative luminance of different facial features, and
the quality and the characteristics of the skin play an important role in the perception
and assessment of facial attractiveness.

The Composite Faces Theory


Studies of reactions to average (composite) faces show that the more faces added to the
composite, the greater the perceived beauty. Moreover, an average face (created from
a set of random faces) is perceived as more attractive than the original ones (Langlois
& Roggman, 1990). We illustrate this in Figure 14.1, where twelve images of famous
female faces were selected and cropped, and the composite facial image was obtained.

[Figure 14.1 Illustration of the composite faces theory: (a) images of 12 famous female faces and (b) the composite (average) face obtained.]
Morphing the facial shape of a face toward the mean facial shape of a set of images
appears to enhance attractiveness, whereas morphing the facial shape further from the
mean appears to reduce attractiveness (Rhodes & Tremewan, 1996). However, Alley and
Cunningham (1991) showed that, although averaged faces are perceived as attractive, a
very beautiful face is not close to this average.
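Computationally, a composite face such as the one in Figure 14.1 is simply a pixel-wise average of aligned images; the sketch below assumes that alignment (landmarking and warping) has already been carried out and uses random arrays as stand-ins for real photographs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for twelve aligned grayscale face images of size 128 x 128
# (real images would first be landmarked and warped to a common shape).
aligned_faces = [rng.random((128, 128)) for _ in range(12)]

composite = np.mean(np.stack(aligned_faces), axis=0)   # pixel-wise average face
print(composite.shape, round(float(composite.std()), 3))
```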

The Symmetry Theory


There are various hypotheses regarding the role of symmetry in perception of attractive-
ness. The fact that human faces exhibit significant amounts of both directional asymme-
try and antisymmetry in skeletal and soft-tissue structures is a well-accepted concept.
However, despite this fact, facial symmetry is the first criterion when assessing facial
attractiveness (Zimbler & Ham, 2010). Fink, Grammer, and Thornhill (2001) inves-
tigated symmetry and averageness of faces and concluded that symmetry was more
important than averageness in facial attractiveness. Other studies suggested that facial
symmetry is actually perceived as less attractive than asymmetry, because perfect sym-
metry appears abnormal in an environment where asymmetry is normal (Swaddle &
Cuthill, 1995). This may be due to the fact that reducing asymmetry causes the face
to appear unemotional (the human face is known to possess asymmetry in emotional
expression).

The Skin and Texture Theory


The appearance of the skin seems to have an effect on the perception of attractiveness.
Fink, Grammer, and Matts (2006) demonstrated that women’s facial skin texture affects
male judgment of facial attractiveness and found that homogeneous skin (i.e., an even
distribution of features relating to both skin color and skin surface topography) is most
attractive. This theory also has direct implications for the composite faces theory. More
specifically, the smooth complexion of the blurred and smoothed faces may underlie
the attractiveness of the averaged faces (Kagian et al., 2008a). Skin texture, thickness,
elasticity, and wrinkles, or rhytids, are also listed as critical factors contributing to one’s
overall facial appearance (Zimbler & Ham, 2010).

The (Geometric) Facial Feature Theory


When it comes to measuring attractiveness from facial cues, the most commonly used
features are soft-tissue reference points (e.g., the point of transition between lower eye-
lid and cheek skin) and geometric features based on (skeletal) anatomic landmarks (e.g.,
a line drawn from the superior aspect of the external auditory canal to the inferior border
of the infraorbital rim) (Zimbler & Ham, 2010). A facial representation is then obtained by
calculating a set of geometric features from facial landmarks, i.e., the major
facial points, including the facial outline, eyebrows, eyes, nose, and mouth (Zimbler &
Ham, 2010). It has also been shown that it is possible to modify the attractiveness per-
ception by changing the geometric features while keeping other factors constant (Chen
& Zhang, 2010). Compared to other facial features, the chin, the upper lip, and the
nose appear to have a great effect on the overall judgment of attractiveness (Michiels &
Sather, 1994).
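
To make the use of such geometric features concrete, the following minimal sketch (with hypothetical landmark coordinates and landmark indices chosen purely for illustration) computes pairwise distances between 2-D facial landmarks and derives distance ratios of the kind used in the studies cited above.

```python
import numpy as np

def pairwise_distances(landmarks):
    """Euclidean distances between all pairs of 2-D landmark points."""
    diffs = landmarks[:, None, :] - landmarks[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

def geometric_ratios(landmarks, numerator_pairs, denominator_pairs):
    """Ratios of selected landmark distances, a common facial representation."""
    d = pairwise_distances(landmarks)
    num = np.array([d[i, j] for i, j in numerator_pairs])
    den = np.array([d[i, j] for i, j in denominator_pairs])
    return num / den

# Hypothetical landmarks: left eye, right eye, nose tip, mouth centre (x, y in pixels)
landmarks = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 60.0], [50.0, 80.0]])
# Example ratio: inter-ocular distance over eye-to-mouth distance (illustrative only)
print(geometric_ratios(landmarks, [(0, 1)], [(0, 3)]))
```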

The Golden Ratio Theory


The golden ratio or proportion is approximately the ratio of 1 to 0.618 or the ratio of
1.618 to 1 (Borissavlievitch, 1958; Huntley, 1970) as shown in Figure 14.2(a). Accord-
ing to the golden ratio theory, for female facial beauty in the case of a perfect, vertically
aligned face, all the proportions must fit the golden ratio (Parris & Robinson, 1999; see
Figure 14.2(b)). In a recent cross-cultural beauty perception study, Mizumoto, Deguchi,
and Fong (2009) reported that there is no difference in golden proportions of the soft-
tissue facial balance between Japanese and white women in terms of facial height com-
ponents. Japanese women have well-balanced facial height proportions, except for a few
measurements.

The Facial Thirds Theory


This theory aims to assess the facial height. The theory states that a well-proportioned
face may be divided into roughly equal thirds by drawing horizontal lines through the
forehead hairline, the eyebrows, the base of the nose, and the edge of the chin (see
Figure 14.2(c)). Moreover, the distance between the lips and the chin should be double
the distance between the base of the nose and the lips (Farkas et al., 1985; Farkas &
Kolar, 1987; Jefferson, 1993; Ricketts, 1982).

The Facial Fifths Theory


This theory evaluates the facial width by dividing the face into equal fifths. In an aes-
thetically pleasing face the width of one eye should equal one fifth of the total facial
width, as should the intercanthal distance and the nasal base width.
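
As a rough illustration of how the facial thirds and facial fifths canons can be operationalized, the sketch below measures how far a set of landmark-derived vertical segments and horizontal widths deviate from the ideal equal thirds and fifths; the function names and pixel values are hypothetical.

```python
import numpy as np

def thirds_deviation(hairline_y, brow_y, nose_base_y, chin_y):
    """Relative deviation of the three vertical facial segments from equal thirds."""
    segments = np.array([brow_y - hairline_y,
                         nose_base_y - brow_y,
                         chin_y - nose_base_y], dtype=float)
    ideal = segments.sum() / 3.0
    return np.abs(segments - ideal) / ideal

def fifths_deviation(eye_width, intercanthal_width, nasal_base_width, face_width):
    """Relative deviation of eye, intercanthal, and nasal widths from one fifth of face width."""
    ideal = face_width / 5.0
    widths = np.array([eye_width, intercanthal_width, nasal_base_width], dtype=float)
    return np.abs(widths - ideal) / ideal

# Hypothetical pixel measurements taken from detected landmarks
print(thirds_deviation(hairline_y=20, brow_y=55, nose_base_y=92, chin_y=130))
print(fifths_deviation(eye_width=31, intercanthal_width=33, nasal_base_width=30, face_width=160))
```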

Figure 14.2 (a) The golden proportion ((a + b)/a = a/b = 1.618) and template images for
(b) golden proportions and (c) facial thirds.

The Juvenilized Face Theory


Ji, Kamachi, and Akamatsu (2004) investigated how feminized or juvenilized faces
are perceived in terms of attractiveness. Feminized or juvenilized Japanese faces were
created by morphing between average male and female adult faces or between aver-
age male (female) adult and boy (girl) faces. The results showed that moderately juve-
nilized faces are perceived as highly attractive. Most of the attractive juvenilized faces
conveyed impressions of elegance, mildness, and youthfulness.

The Frontal Versus Lateral View Theory


Valenzano et al. (2006) demonstrated that facial attractiveness in frontal and lateral
views is highly correlated. Assessing facial attractiveness from lateral view is gain-
ing interest because certain anthropometric landmarks (glabella, nasion, rhinion, pogo-
nion, etc.) can be located only in lateral view, and lateral view avoids the computational
problems associated with the analysis of landmarks with bilateral symmetry (Valenzano
et al., 2006).

Other Factors
In addition to facial features, shape and form, people judge human faces using various
other attributes such as pleasant expressions (e.g., a smile) and familiarity (Kagian et al.,
2008a). Supporting such claims is the multiple fitness model (Cunningham et al., 1995)
that suggests that there is no single feature or dimension that determines attractiveness.
Instead, various categories and combinations of features represent different aspects (or
desirable qualities) of the perceived person. However, this theory still agrees that some
facial qualities are perceived as universally (physically) attractive.

Bodily Attractiveness
The most dominant bodily cues that affect the perception of female attractiveness
(excluding the face) appear to be shape and weight. The shape cue is concerned with the
ratio of the width of the waist to the width of the hips (the waist-to-hip ratio or WHR)
(Tovee et al., 1999). A low WHR (i.e., a curvaceous body) is believed to corre-
spond to the optimal fat distribution for high fertility and therefore is perceived to be
highly attractive for women. Tovee et al. (1999) focused on the perception of silhouettes
of bodies in frontal view and proved that weight scaled for height (the body mass index
or BMI) is the primary determinant of sexual attractiveness rather than WHR. As a visual
proxy for BMI, they took the path length around the perimeter of a figure and divided it by the
area within the perimeter (the perimeter–area ratio, PAR). They also showed that visual cues, such as PAR, can
provide an accurate and reliable index of an individual’s BMI and could be used by an
observer to differentiate between potential partners. Bilateral symmetry is another cue
(in addition to BMI and WHR) that plays a significant role in female physical attrac-
tiveness. This is again because asymmetry is often caused by disease or
parasites and is therefore associated with poorer health.
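
As a minimal sketch of how these bodily cues could be quantified, the code below computes the WHR from two measured widths and approximates the perimeter-to-area ratio (PAR) from a binary silhouette mask; the 4-neighbour boundary estimate and the toy rectangular silhouette are our own simplifying assumptions, not the procedure of Tovee et al. (1999).

```python
import numpy as np

def waist_to_hip_ratio(waist_width, hip_width):
    """WHR: width of the waist divided by width of the hips."""
    return waist_width / hip_width

def perimeter_area_ratio(mask):
    """Rough PAR estimate from a binary silhouette mask (1 = body, 0 = background).

    The perimeter is approximated by counting body pixels with at least one
    4-neighbour outside the body; the area is the number of body pixels.
    """
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = mask & ~interior
    return boundary.sum() / mask.sum()

# Hypothetical example: a crude rectangular "silhouette"
silhouette = np.zeros((100, 40), dtype=int)
silhouette[10:90, 10:30] = 1
print(waist_to_hip_ratio(waist_width=28.0, hip_width=40.0))  # ~0.7, a "low" WHR
print(perimeter_area_ratio(silhouette))
```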

Vocal Attractiveness
Acoustic correlates of the voice – in particular prosodic cues, such as intonation, inten-
sity, and tempo (Chattopadhyay et al., 2003), but also vocal tract parameters (e.g.,
vocal tract lengths as reflected in formant frequencies) and voice quality (Zuckerman &
Miyake, 1993; Liu & Xu, 2011) – influence our perception of speakers’ attractiveness
and beauty. For example, a 20 percent lower pitch and a 30 percent slower talking speed in
male speech seem to lead listeners to perceive the speaker as more “potent” in terms of
tallness and thinness (Apple, Streeter, & Kraus, 1979). Similarly, Feinberg et al. (2005)
report that a low fundamental frequency (F0) can be considered as indicating “masculinity
and reproductive capability,” which was preferred by females in listening studies. This
is also confirmed by Collins (2000), Saxton, Caryl, and Roberts (2006), and Hodges-
Simeon, Gaulin, and Puts (2010). Additionally, men with voices with closely spaced,
low-frequency harmonics (Collins, 2000), a lower spatial distribution of formants, and
high intensity were judged as being more attractive (Hodges-Simeon et al., 2010). Also,
the second formant frequency seems to be influential (Jürgens, Johannsen, & Fellbaum,
1996).
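
As an illustration of how one of the prosodic cues discussed here can be extracted in practice, the sketch below estimates the fundamental frequency (F0) contour of a recording with the pYIN tracker available in the librosa library; the file name is hypothetical and any other pitch tracker could be substituted.

```python
import numpy as np
import librosa  # assumed available

# Hypothetical speech file; 16 kHz is a common rate for telephone-style analyses
y, sr = librosa.load("speaker_sample.wav", sr=16000)

# Probabilistic YIN pitch tracking over a typical adult speech range
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)

voiced_f0 = f0[voiced_flag & ~np.isnan(f0)]
print("mean F0 (Hz):", voiced_f0.mean())
print("F0 standard deviation (Hz):", voiced_f0.std())
```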
Riding, Lonsdale, and Brown (2006) confirm that a low or medium average male
speaking pitch is more attractive to female raters, whereas the amount of pitch vari-
ation seems to have a negligible effect. As for vocal tract lengths, interestingly, the
preference seems to be dependent on the female judge’s own body size, thus being rather
subjective (Feinberg et al., 2005).
For women, higher-frequency voices are perceived as more attractive and
younger (Collins & Missing, 2003). Lower voices belonged to larger women and were
rated as less attractive (Collins & Missing, 2003). The narrower formant dispersion
of taller women also appears to have an effect. Feinberg et al. (2008) observed that
male judges prefer exaggerated feminine characteristics, in particular “raised pitch for
all levels of starting pitch.”
A “clear,” “warm,” and “relaxed” voice, and constant voice “capacity” seem relevant
for likable voices (Ketzmerick, 2007; Weiss & Burkhardt, 2010). In terms of acous-
tic parameters this corresponds to less pressed, more breathy voice quality, and lower
spectral center of gravity (Weiss & Möller, 2011).
A further substantial body of literature, not covered here, deals with the vocal beauty
of singers; see, for example, Kenny and Mitchell (2004). It seems obvious that culturally
imposed perception differences may exist, though cross-cultural studies are broadly lacking,
and that there may be strong personal differences, with similarity attraction being an
important factor (Aronson, Wilson, & Akert, 2009). For example, Dahlbäck et al. (2007)
observed that users of a tourist information system preferred voices mirroring their own accent.

Audiovisual Attractiveness Dependencies


The abovementioned findings indicate that attractiveness measures mainly depend on
age and body size, and even on differences in shoulder-to-hip ratios (SHR, more
relevant for males) and waist-to-hip ratios (WHR, more relevant for females) (Hughes,
Dispenza, & Gallup, 2004). Even the hormonal profile (Pipitone & Gallup, 2008) seems
to have an influence. Overall, this indicates that vocal attractiveness is related to physical
appearance and “sexual maturity” (Feinberg et al., 2005), and that it is not permanent:
the attractiveness and sexiness of the voice can be intentionally altered over time by changing
the voice quality, typically by lowering the voice and making it more breathy
(Tuomi & Fisher, 1979).
Collins and Missing (2003) and Saxton (2005) confirm audiovisual dependencies,
that is, attractive visual and auditive perception seem to go hand in hand. Saxton et al.
(2006) add that human awareness of these cues fully develops only at the age at which
“mate choice judgments become relevant.” It seems noteworthy, however, that the above
observations are usually based on listening studies typically performed in an opposite-
sex setting. Consequently, less is known about within-sex ratings of vocal attractiveness or
likability.

Computational Approaches

A framework for computational analysis of aesthetics can be described as consisting
of: (1) explicit human evaluations (or labels) obtained from a number of evaluators that
view, listen, or watch the media at hand, and/or implicit evaluations (Pantic & Vin-
ciarelli, 2009) obtained by recording the evaluators’ visual, auditive, and physiological
reactions while they are performing the evaluation task; (2) a scoring model (fitness
function) developed based on a machine learning system trained using human evalua-
tions and features extracted; and (3) providing intelligent interpretation and appropriate
responses in various interaction settings. The most significant challenge is then testing
and evaluation of the validity of the aesthetics metric used (i.e., whose judgment the
model represents and to what extent).
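
The sketch below illustrates stages (1) and (2) of such a framework with synthetic placeholder data: extracted features and mean human ratings are fed to a simple scoring model (here ridge regression from scikit-learn), and validity is checked via the correlation between predicted and observed ratings on held-out stimuli.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Placeholder data: rows are stimuli, columns are extracted features
# (e.g., geometric ratios or acoustic descriptors); y holds mean human ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 0.8 + rng.normal(scale=0.5, size=200)  # synthetic ratings

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Stage (2): a simple scoring model (fitness function) trained on human evaluations
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Validity check: correlation between predicted and observed ratings on held-out data
r, _ = pearsonr(model.predict(X_test), y_test)
print("Pearson correlation on held-out stimuli:", round(r, 2))
```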

Data and Annotations


Data acquisition and annotation for attractiveness analysis and modeling have mostly been
done in an ad hoc manner. More specifically, each research group has used its own in-
house database (e.g., Kagian et al., 2008a), opted for obtaining data from the web (e.g.,
White, Eden, & Maire, 2004; Whitehill & Movellan, 2008), or used other databases
acquired for face or facial expression recognition purposes (e.g., Gunes & Piccardi,
2006).

Data
As representative examples of attractiveness data, we want to mention Kagian et al.
(2008a) who used a database composed of ninety-one frontal facial images of young
Caucasian American females (with a neutral expression), White et al. (2004) who com-
piled images and associated attractiveness scores from the website www.hotornot.com
(a website where users rate images of one another for attractiveness on a 1–10 scale),
and Davis and Lazebnik (2008) who created a heterogeneous dataset (images with vary-
ing viewpoints, facial expressions, lighting, and image quality) of over three thousand
images gathered from a website. The most noteworthy effort to date is the large-scale
benchmark database for facial beauty analysis introduced by Chen and Zhang (2010).
The database contains 15,393 female and 8,019 male photographs (in frontal view with
neutral expression, 441 × 358 pixels in size) of Chinese people, 875 of them labeled as
beautiful (587 female and 288 male). Similarly to other relatively new research fields
(e.g., affective computing; Gunes & Pantic, 2010), the field of attractiveness analysis
and modeling is in need of a so-called data acquisition protocol that specifies
the context (application domain), subjects (age, gender, and cultural background),
modalities, and type of data to be recorded. To date, recorded and used data fall into the
posed (with a neutral expression) and visual (static images) data categories. Acquiring
attractiveness data in a dynamic and multimodal setting (i.e., induced via clips or occur-
ring during an interaction, recorded in an audiovisual manner) will certainly advance
our understanding of the various factors that affect the perception and interpretation of
human attractiveness.
The best-known database for the audio modality is the SLD (speaker likability database).
This database is a subset of the German Agender database (Burkhardt et al., 2010),
which was originally recorded to study automatic age and gender recognition from
telephone speech. The Agender database contains about 940 speakers of mixed age
and gender recorded over both landline and mobile phones. The database contains eigh-
teen utterance types taken from a set listed in detail in Burkhardt et al. (2010). The age groups in the
database (children, 7–14; youth, 15–24; adults, 25–54; seniors, 55–80 years) are repre-
sented fairly equally.

Annotation
Unlike other relevant research fields (e.g., affective computing; Gunes & Schuller,
2013) there currently exists no publicly available annotation tool that can be used for
annotating attractiveness data. Until recently, visual attractiveness data annotation has
been done by asking a (diverse) set of human raters to view the facial/bodily images
and pick a level along the (discretized) scale provided (e.g., Gunes & Piccardi, 2006).
Researchers seem to use different attractiveness levels: a seven-point Likert scale (1 =
very unattractive, 7 = very attractive) (Kagian et al., 2008a); a ten-point Likert scale
(e.g., 1 = least attractive – minimum; 10 = most attractive – maximum) (Gunes & Pic-
cardi, 2006); or integers in an arbitrary range (e.g., −1 = definitely not interested in
meeting the person for a date; 0 = not interested in meeting the person; 1 = interested
in meeting the person; 2 = definitely interested in meeting the person) (Whitehill &
Movellan, 2008). Ratings are usually collected via the specific website’s interface (e.g.,
www.hotornot.com; White et al., 2004) or with a specifically designed web interface
(e.g., Gunes & Piccardi, 2006; Kagian et al., 2008a). The final attractiveness rating is
usually calculated as the mean rating across all raters. However, using only the mean
rating as ground truth might not be sufficiently descriptive, for example, two images
with similar mean ratings might have different variance values. Taking into account
such aspects of the ratings has been reported to be extremely important when training
and evaluating automatic attractiveness predictors (Gunes & Piccardi, 2006; Kalayci,
Ekenel, & Gunes, 2014).
A representative example for vocal attractiveness data annotation is the Agender
database that was annotated in terms of likability ratings by presenting the stimuli to
thirty-two participants (17 male, 15 female, aged 20–42; average age, 28.6 years; stan-
dard deviation, 5.4 years). No significant impact of raters’ age or gender was observed
on the ratings. This holds also for speakers’ gender. However, speaker age groups were
rated differently: younger speakers were “preferred,” which may stem from the raters’
age. To establish a consensus from the individual likability ratings (16 per instance),
the evaluator weighted estimator (EWE) by Grimm and Kroschel (2005) was used. The
EWE is a weighted mean with weights corresponding to the “reliability” of each rater,
which is the cross-correlation of a rater’s rating with the mean rating of all raters. In gen-
eral, the raters exhibited varying reliability ranging from a cross-correlation of 0.057 to
0.697. The EWE rating was discretized into the “likable” (L) and “non-likable” (NL)
classes based on the median EWE rating of all stimuli in the SLD. Even this binary
classification was considered challenging because the distribution of the EWE ratings
is roughly normal and symmetric.
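
A minimal sketch of the EWE as described above is given below; the rating matrix is hypothetical, and clipping negative rater weights to zero is our own implementation choice rather than part of Grimm and Kroschel’s (2005) definition.

```python
import numpy as np

def evaluator_weighted_estimator(ratings):
    """EWE over a (num_raters x num_stimuli) rating matrix.

    Each rater is weighted by the correlation of his/her ratings with the mean
    rating across all raters; negative weights are clipped to zero here, which
    is an implementation choice for this sketch.
    """
    ratings = np.asarray(ratings, dtype=float)
    mean_rating = ratings.mean(axis=0)
    weights = np.array([np.corrcoef(r, mean_rating)[0, 1] for r in ratings])
    weights = np.clip(weights, 0.0, None)
    return (weights[:, None] * ratings).sum(axis=0) / weights.sum()

# Hypothetical likability ratings from 3 raters on 5 stimuli (7-point scale)
ratings = [[5, 3, 6, 2, 4],
           [4, 3, 7, 1, 5],
           [2, 6, 3, 5, 4]]   # a less reliable rater
ewe = evaluator_weighted_estimator(ratings)
print(ewe)
# Binary "likable" vs "non-likable" labels via a median split, as for the SLD
print(ewe >= np.median(ewe))
```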

Preprocessing and Representation


Visual Cues
Experiments have shown that (geometric) features based on measured proportions, dis-
tances (as illustrated in Figure 14.2(b)), and angles of faces are most effective in captur-
ing the notion of facial attractiveness (Eisenthal, Dror, & Ruppin, 2006; Kagian et al.,
2008a). Therefore, a number of automatic attractiveness analyzers and predictors have
opted for using the geometric representation (e.g., Gunes & Piccardi, 2006; Kagian
et al., 2008a). The preprocessing step then comprises normalizing the image intensity
distribution, detecting the facial region, and localizing the facial feature points such as
eyes, eyebrows, nose, and lips (e.g., Gunes & Piccardi, 2006; Kagian et al., 2008a).
There also exist automatic analyzers that opt for an affine rectification that maps auto-
matically detected landmarks (eyes, nose, and corners of the mouth) onto canonical
locations (e.g., Davis & Lazebnik, 2008). Another common approach is to represent
a (whole) face as points in a face space where the geometric variation is reduced in
complexity and each face is represented by a tractable vector. Some well-known meth-
ods used in creating a face space include the eigenface projection (principal compo-
nent analysis) (e.g., Valenzano et al., 2006), Gabor decompositions (e.g., Whitehill &
Movellan, 2008), and manifolds (e.g., Davis & Lazebnik, 2008). For classifying faces
into attractive or unattractive, Eisenthal et al. (2006) reported that geometric features
based on pairwise distances between fiducial points were superior to textural features
based on eigenface projections. Moreover, from a human perspective, results obtained
from geometric feature representation are more amenable to interpretation compared
to the eigenface representation. However, as has been reported in Aarabi et al. (2001),
the recognition stage may be negatively affected if fiducial facial points are located
inaccurately. Kagian et al. (2008b) suggested that using a richer representation might
contribute to the overall success of an automatic beauty predictor. Accordingly, Sutic
et al. (2010) chose to combine the eigenface and the ratio-based features for face rep-
resentation. A number of researchers have started including other visual cues such as
(mean) hair color, skin color, and skin texture for automatic attractiveness prediction
(e.g., Kagian et al., 2008a). Overall, the preprocessing stage may become challeng-
ing if images contain resampling artifacts, uncontrolled lighting and pose, and external
objects such as eyeglasses, hands, and so on.
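
As an illustration of the face-space idea, the following sketch builds an eigenface-style representation by applying principal component analysis to (placeholder) aligned grayscale face crops; in a real system the crops would come from the detection and normalization steps described above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder stack of aligned grayscale face crops (n_faces x height x width)
rng = np.random.default_rng(0)
faces = rng.random((150, 64, 64))

# Flatten each face into a vector and project onto the main modes of variation
X = faces.reshape(len(faces), -1)
pca = PCA(n_components=50, whiten=True).fit(X)
face_space = pca.transform(X)            # each face as a 50-dimensional vector
print(face_space.shape)                  # (150, 50)
print(pca.explained_variance_ratio_[:5])

# The eigenfaces themselves can be inspected by reshaping the components
eigenfaces = pca.components_.reshape(-1, 64, 64)
```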

Vocal Cues
Unfortunately, no easy direct “measurement” of vocal beauty can be made, for exam-
ple, by looking at simple features such as the peak height and area of the long-term
average spectrum (Kenny & Mitchell, 2004), mean pitch, or voice quality type (Liu & Xu, 2011).
Rather, data-driven approaches are used, training machine-learning algorithms on large
sets of features (up to several thousands). However, not much is reported on explicit machine
recognition of vocal beauty (Burkhardt et al., 2011; Pinto-Coelho et al., 2011, 2013;
Nguyen et al., 2012). A number of works are dealing with physical speaker attributes
that may be directly or indirectly related to automatic analysis of vocal attractiveness,
for example, recognition of speakers’ (Schuller et al., 2011) or singers’ (Weninger,
Wöllmer, & Schuller, 2011) height. The most focused effort made so far was given by
the Interspeech 2012 Speaker Trait Challenge’s likability sub-challenge (Schuller et al.,
2012) where several research teams aimed at the best result for automatic recognition
of speakers’ likability (ranging from sexual attraction and trust and usually dominated
by appraisal). In the challenge’s speaker likability database (SLD), however, the defi-
nition of likability was left open to the annotators (Burkhardt et al., 2011). The SLD
contains commands embedded in German free speech (maximum 8, mean 4.4 words)
of 800 speakers over the phone. Likability was judged by thirty-two raters (17 male,
15 female; aged 20–42; average age of 28.6 years; standard deviation of 5.4 years) on a
seven-point Likert scale.
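
For readers unfamiliar with such data-driven acoustic front-ends, the sketch below computes a small, purely illustrative set of functionals (means and standard deviations of MFCCs and F0) per utterance with librosa; it is not the official feature set used in the challenge, and the file name is hypothetical.

```python
import numpy as np
import librosa

def utterance_features(path, sr=16000):
    """Mean/std functionals of MFCCs and F0 for one utterance (illustrative set)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    f0 = f0[voiced & ~np.isnan(f0)]
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [f0.mean() if f0.size else 0.0, f0.std() if f0.size else 0.0],
    ])  # 13 + 13 + 2 = 28 values per utterance

# Hypothetical usage on one SLD-style telephone recording:
# feats = utterance_features("sld_utterance_0001.wav")
```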

Analysis and Prediction


Overall, research on quantifying and computing beauty and attractiveness has predom-
inantly focused on analyzing the face. We will provide details of the earlier systems as
they pioneered the field and significantly influenced the recognition and prediction
systems that followed.
Aarabi et al. (2001) introduced an automatic beauty analyzer that extracts eight geo-
metric ratios of distances between a number of facial feature points (eyes, brows, and
mouth) and uses k-nearest neighbors (k-NN) to classify facial images into one of the
four beauty categories. When tested on a validation set of forty images, the system
achieved 91 percent correct classification. The beauty predictor of White et al. (2004)
uses textural features to predict the mean attractiveness scores assigned to 4,000 face
images (downloaded from www.hotornot.com) using ridge regression (with a Gaussian
RBF kernel). The best prediction results (a correlation coefficient of 0.37) were obtained
using kernel principal component analysis (PCA) on the face pixels. Gunes and Pic-
cardi (2006) presented an automatic system that analyzes frontal facial images in terms
of golden proportions and facial thirds in order to recognize their beauty by means of
supervised learning. Each face was represented in terms of distances between facial
features and a decision tree was then trained using the obtained ground truth and the
extracted ratios. The classifier error, standardized using the variance in the human ratings,
was found on average to be less than the within-class standard deviation. Eisenthal
et al. (2006) focused on classifying face images as either attractive or unattractive using
support vector machines (SVMs), k-NN, and standard linear regression. When tested
on two databases (each containing 92 images of young women from the United States
and Israel posing neutral facial expressions), best results were obtained using geometric
features based on pairwise distances between fiducial points (a correlation coefficient
of 0.6) using linear regression and SVMs (eigenface projections provided a correlation
coefficient of 0.45). The attractiveness predictor of Kagian et al. (2008a) uses ninety
principal components of 6,972 distance vectors (between 84 fiducial point locations)
and standard linear regression to predict mean attractiveness scores of female facial
images. Kagian et al. tested their system using the female Israeli database of Eisen-
thal et al. (2006) and achieved a correlation of 0.82 with mean attractiveness scores
provided by human raters (along a range of 1–7). Davis and Lazebnik (2008) focused
on representing the face via a shape model and using the manifold kernel regression
technique to explore the relationship between facial shape and attractiveness (on a het-
erogeneous dataset of over three thousand images gathered from the Web). Whitehill
and Movellan (2008) presented an automatic approach to learning the personal facial
attractiveness preferences of individual users from example images. The system uses a
variety of low level representations such as PCA, Gabor filter banks, and Gaussian RBFs
as well as image representations based on higher-level features (i.e., automated analysis
of facial expressions and SVMs for regression). When evaluated on a dataset of images
collected from an online dating site, the system achieves correlations of up to 0.45 on
the attractiveness predictions for individual users. When the system was fed with facial
action unit (AU) features, the prediction accuracy improved only marginally. Therefore,
how facial expressions contribute to the perception and prediction of facial attractive-
ness needs to be investigated further. Chen and Zhang (2010) introduced a benchmark
database for (female and male) facial beauty analysis. The extracted geometric features
were normalized and projected to tangent space (a linear space where the Euclidean
distance can be used to measure differences between shapes). After preprocessing, the
statistics of the geometric features were calculated. PCA was used for summarizing the
main modes of variation and dimensionality reduction. Their results indicated that the first
principal component (PC) captures the variation of face width, the second PC captures
the variations of eyebrow length and face shape, the third PC captures the variation
of the configuration of the facial organs, and so on. The shapes were then modeled as a multivari-
ate Gaussian distribution. Kullback-Leibler (KL) divergence was used for measuring
the difference between the distributions of attractive faces and the whole population. Their
results showed that the averageness and symmetry hypotheses reveal much less
beauty-related information than the multivariate Gaussian model. Sutic et al. (2010)
chose to combine eigenface and ratio-based feature representation and compared k-NN,
neural network and AdaBoost algorithms for a two-class (more vs less attractive) and
a four-class (with quartile class boundaries: 3.0, 7.9, and 9.0 of maximum 10) attrac-
tiveness classification problem on a dataset of 2,250 female images (extracted from the
website www.hotornot.com). For the two-class problem, 61 percent classification accu-
racy was obtained using k-NN and geometric features, and 67 percent classification
accuracy was obtained using k-NN and the distances in the eigenface space. Using ratio
features and AdaBoost provided a classification accuracy of 55 percent. The results also
indicated that facial symmetry is an important feature for machine analysis of facial
beauty, as is the use of a wide set of features.
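
To make the Kullback-Leibler step used by Chen and Zhang (2010) concrete, the sketch below computes the closed-form KL divergence between two multivariate Gaussians fitted to (synthetic) shape features of an “attractive” subset and of the whole population; the data are placeholders, not the benchmark database.

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """KL divergence KL(N0 || N1) between two multivariate Gaussians."""
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

# Placeholder shape features (e.g., tangent-space coordinates after PCA)
rng = np.random.default_rng(2)
population = rng.normal(size=(2000, 5))
attractive = rng.normal(loc=0.3, scale=0.8, size=(300, 5))  # synthetic subset

mu_p, cov_p = population.mean(axis=0), np.cov(population, rowvar=False)
mu_a, cov_a = attractive.mean(axis=0), np.cov(attractive, rowvar=False)
print("KL(attractive || population):", kl_gaussians(mu_a, cov_a, mu_p, cov_p))
```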
Examples of other approaches that investigated the relationship between geometric
features and attractiveness include Fan et al. (2012) and Schmid, Marx, and Samal
(2008). There have also been recent works approaching the problem as a personalized
relative beauty ranking problem (Altwaijry & Belongie, 2013) – given training data of
faces sorted based on a subject’s personal taste, the system learns how to rank novel
faces according to that person’s taste. The representation is obtained using a combination of
facial geometric relations, HOG, GIST, L*a*b* color histograms, and Dense-SIFT +
PCA features. The system obtains an average accuracy of 63 percent on pairwise com-
parisons of novel test faces (Altwaijry & Belongie, 2013).
Most of the studies in the literature attempt to model and predict facial attractive-
ness using a single static facial image. In a recent study, Kalayci et al. (2014) proposed
to use dynamic features obtained from video clips along with static features obtained
from static frames for automatic analysis of facial attractiveness. Support vector machines (SVMs)
and random forests (RFs) were utilized to create and train models of attractiveness using the extracted features. Their
experiments showed that combining static and dynamic features improves performance
over using either of these feature sets alone. Another recent study by Joshi, Gunes, and
Goecke (2014) used video clips and investigated how automatic prediction of perceived
traits, including facial attractiveness, might vary with the situational context. Their find-
ings suggest that changes in situational context cause changes in the perception and
automatic prediction of facial attractiveness. Such studies and findings indicate that in
order to fully understand the perception of facial attractiveness, the dynamics of facial
behavior need to be investigated further along with appearance features such as skin
texture and eye/lip colour.
We will summarize the research on quantifying and computing vocal attractiveness
in the context of the speaker trait challenge’s likability sub-challenge. For the challenge,
the Agender database was partitioned into a training, development, and test set based on
the subdivision for the Interspeech 2010 Paralinguistic Challenge (age and gender sub-
challenges). While the challenge task is classification, the EWE is provided for the train-
ing and development sets and participants were encouraged to present regression results
in their contributions. In the challenge, participants had to classify into binary classes
above or below average likability. Feature selection seems to have been crucial and was
the focus of some participants (Montacié & Caraty, 2012; Pohjalainen, Kadioglu, &
Räsänen, 2012; Wu, 2012). Roughly, spectral features were found to be more mean-
ingful than prosodic ones, which some authors did not even use (Buisman & Postma,
2012; Attabi & Dumouchel, 2012; Lu & Sha, 2012). Further, prosody and voice features
were compared (Montacié & Caraty, 2012; Hewlett Sanchez et al., 2012; Cummins
et al., 2012). For classification, the dominantly employed machine learning algorithms
were support vector machines, followed by Gaussian mixture models (Hewlett Sanchez
et al., 2012; Cummins et al., 2012), (deep) neural networks (Brueckner & Schuller,
2012), k-nearest neighbor (Pohjalainen et al., 2012), or more specific approaches, such
as the anchor model (Attabi & Dumouchel, 2012) or Gaussian processes (Lu & Sha,
2012). Gender separation has been shown to be beneficial (Lu & Sha, 2012; Buisman &
Postma, 2012) given the differences between female and male speakers, and gender
can be detected automatically almost perfectly. The winning contribution (Montacié &
Caraty, 2012) reached 65.8 percent unweighted accuracy – highly significantly above
chance level of 50 percent, clearly demonstrating the challenge of automatic likability
classification from vocal properties.
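
A minimal sketch in the spirit of this setup is shown below: an SVM on standardized acoustic functionals, evaluated with unweighted (balanced) accuracy; the features and labels are synthetic placeholders rather than the official SLD partitions or baseline.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Placeholder: acoustic functionals per utterance and binary L/NL labels
rng = np.random.default_rng(3)
X = rng.normal(size=(800, 28))
y = (X[:, 0] + 0.5 * rng.normal(size=800) > 0).astype(int)

train, test = slice(0, 600), slice(600, 800)
clf = make_pipeline(StandardScaler(), LinearSVC(C=0.1))
clf.fit(X[train], y[train])

# Unweighted accuracy (mean recall over the two classes), the challenge metric
ua = balanced_accuracy_score(y[test], clf.predict(X[test]))
print("unweighted accuracy:", round(ua, 3))
```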

Discussion and Conclusion

Despite the lack of a theory of human beauty and aesthetics that is generally accepted,
there is a growing body of research on automatic analysis of human attractiveness and
likability from human physical cues (facial cues, bodily cues, vocal cues, etc.). This is
possibly due to the recent emphasis on idealized physical looks and tremendous demand
for aesthetic surgery, as well as other application areas such as computer assisted search
of partners in online dating services (Whitehill & Movellan, 2008), animation, adver-
tising, computer games, video conferencing, and so on.
At times, aesthetics has been used as yet another dimension in user interface design
and evaluation, and has been linked to affect and emotions. A representative exam-
ple is Kim and Moon (1998) who defined the domain-specific emotion space using
seven dimensions, namely attractiveness, symmetry, sophistication, trustworthiness,
awkwardness, elegance, and simplicity. The most common way of linking aesthetics
and affect is the claim that an object’s aesthetic quality is perceived via the viewer’s
affective reaction to that object. In other words, if one is experiencing positive affect,
the perceived aesthetic quality is positive within the particular context and the limita-
tions imposed by one’s social, cultural, and historical background and standards (Zhang,
2009). Computational aesthetics focuses on stimuli and their affective impact on humans,
and affective computing is interested in people’s affective reactions toward stimuli. This
view of aesthetics and affect considers aesthetics as a means to deducing desirable affec-
tive states in humans (Zhang, 2009). Although a link between positive valence and
aesthetics has been established to some extent, whether significant links exist between aesthetics
and negative valence, or between aesthetics and other affect dimensions such as arousal, power,
and expectation, needs further investigation. Also, while beauty often
generates a positive affective reaction, one needs to guard against the tendency to consider
aesthetic emotions as indicative of generalized valence. Positive emotional reactions can
be elicited by a very large variety of stimuli and appraisal processes, many of them unre-
lated to an aesthetic dimension as commonly defined. The challenge for future theoret-
ical and empirical work is to determine what is special about reactions to qualities of per-
sons, objects, or static and dynamic relationships considered to have aesthetic qualities.
Overall, despite having common grounds with other multidisciplinary research fields
such as social signal processing, automatic human beauty and aesthetics prediction is in
its infancy. First, not all theories of attractiveness have been explored for computation
and prediction of human beauty. Second, researchers have not investigated the particu-
lar reason(s) for the observer ratings obtained. Utilizing the rationale for the observer
ratings could be extremely useful in obtaining a deeper insight into the data at hand and
designing better automatic attractiveness predictors and enhancers. Additionally, the
comparison of results attained by the different surveyed systems is difficult to conduct, as
the systems use different training/testing datasets (which differ in the way data were elicited
and annotated) and differ in the underlying representation model as well as in the uti-
lized classification (recognition vs regression) method and evaluation criterion.
Virtually all existing studies can be challenged in terms of the ecological validity of their
results because of the idealized and restricted settings used in their data (e.g.,
lack of motion, noise, etc.). As a consequence, many issues remain unclear: (i) how to
create benchmark databases (e.g., 2-D vs 3-D facial/bodily images, vocal and audiovi-
sual data, higher-level features like texture/color, hair style, etc.); (ii) how to analyze
the physical cues (single-cue vs multiple-cue and multimodal analysis); and (iii) how
including behavioral cues (e.g., smile, laughter) and contextual information will affect
the automatic analysis procedures (e.g., Kalayci et al., 2014; Joshi et al., 2014).
Solutions to these issues can be potentially sought in other relevant research fields,
such as affective computing and social signal processing (see Gunes & Schuller, 2013;
Vinciarelli, Pantic, & Bourlard, 2009), as well as new and upcoming works in human
perception of facial attractiveness from static versus dynamic stimuli. Creating research
and application pathways between aesthetics and affective and multimedia computing
is expected to have several benefits and could pave the way toward advancing the field.
Affective computing is a relatively more mature field, has clearer theoretical founda-
tions, and has been more extensively explored than aesthetics. For instance, dimen-
sional and continuous representation and analysis of affect has been an area of increased
interest in recent years (Gunes & Schuller, 2013) and could lend a number of mod-
els and structures to dimensional and continuous modeling and analysis of aesthetics.
Essentially, a major effort for bringing together the aesthetic constructs and affective
and multimedia computing, in the form of focused workshops and special sessions, is
needed.

Acknowledgment

The work of Hatice Gunes has been supported by the EPSRC MAPTRAITS Project
(Grant Ref: EP/K017500/1).

References

Aarabi, P., Hughes, D., Mohajer, K., & Emami, M. (2001). The automatic measurement of facial
beauty. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, 4,
2644–2647.
Ahmed, S., Al Mahmud, A., & Bergaust, K. (2009). Aesthetics in human-computer interaction:
Views and reviews. In Proceedings of the 13th International Human–Computer Interaction,
July 19–24, San Diego, CA.
Alley, T. R. & Cunningham, M. R. (1991). Averaged faces are attractive, but very attractive faces
are not average. Psychological Science, 2, 123–125.
Altwaijry, Hani & Belongie, Serge (2013). Relative ranking of facial attractiveness. In Workshop
on the Applications of Computer Vision (WACV), January 15–17, Clearwater Beach, FL.
Apple, W., Streeter, L. A., & Kraus, R. M. (1979). Effects of pitch and speech rate on personal
attributions. Journal of Personality and Social Psychology, 37(5), 715–727.
Aronson, E., Wilson, T., & Akert, R. M. (2009). Social Psychology (7th edn). Upper Saddle River,
NJ: Prentice Hall.
Attabi, Y. & Dumouchel, P. (2012). Anchor models and WCCN normalization for speaker trait
classification. In Proceedings of Interspeech 2012 (pp. 522–525).
Borissavlievitch, M. (1958). The Golden Number and the Scientific Aesthetics of Architecture.
London: A. Tiranti.
Bottino, A. & Laurentini, A. (2010). The analysis of facial beauty: an emerging area of research
in pattern analysis. In Proceedings of ICIAR 2010, 7th International Conference on Image
Analysis and Recognition (pp. 425–435), June 21–23, Povoa de Varzim, Portugal.
Brueckner, R. & Schuller, B. (2012). Likability classification – a not so deep neural network
approach. In Proceedings of Interspeech, September, Portland, OR.
Buisman, H. & Postma, E. (2012). The Log-Gabor method: Speech classification using spectro-
gram image analysis. In Proceedings of Interspeech, September, Portland, OR.
Burkhardt, F., Eckert, M., Johannsen, W., & Stegmann, J. (2010). A database of age and gen-
der annotated telephone speech. In LREC 2010, 7th International Conference of Language
Resources and Evaluation, May 19–21, 2010, Malta.
Burkhardt, F., Schuller, B., Weiss, B., & Weninger, F. (2011). “Would you buy a car from me?”
– on the likability of telephone voices. In Proceedings of the Annual Conference of INTER-
SPEECH (pp. 1557–1560), August 27–31, Florence, Italy.
Chan, Yin-Tzu, Hsu, Hao-Chen, Li, Po-Yi, & Yeh, Mei-Chen. (2012). Automatic cinemagraphs
for ranking beautiful scenes. In Proceedings of ACM Multimedia (pp 1361–1362).
Chattopadhyay, A., Dahl, D. W., Ritchie, R. J. B., & Shahin, K. N. (2003). Hearing voices: The
impact of announcer speech characteristics on consumer response to broadcast advertising.
Journal of Consumer Psychology, 13(3), 198–204.
Chen, F. & Zhang, D. (2010). A benchmark for geometric facial beauty study. Lecture Notes in
Computer Science, 6165, 21–32.
Collins, S. A. (2000). Men’s voices and women’s choices. Animal Behaviour, 60, 773–780.
Collins, S. A. & Missing, C. 2003. Vocal and visual attractiveness are related in women. Animal
Behaviour, 65(5), 997–1004.
Cummins, N., Epps, J., & Kua, J. M. K. (2012). A comparison of classification paradigms for
speaker likeability determination. In Proceedings of Interspeech, September, Portland, OR.
Cunningham, M. R., Roberts, A. R., Barbee, A. P., et al. (1995). Their ideas of beauty are, on the
whole, the same as ours. Journal of Personality and Social Psychology, 68, 261–279.
Dahlbäck, N., Wang, Q.-Y., Nass, C., & Alwin, J. (2007). Similarity is more important than exper-
tise: Accent effects in speech interfaces. In Proceedings of CHI 2007 – Conference on Human
Factors in Computing Systems (pp. 1553–1556), April 28–May 3, San José, CA.
Davis, B. C. & Lazebnik, S. (2008). Analysis of human attractiveness using manifold kernel
regression. In Proceedings of the International Conference on Image Processing (pp. 109–
112).
Eisenthal, Y., Dror, G., & Ruppin, E. (2006). Facial attractiveness: Beauty and the machine. Neu-
ral Computation, 18, 119–142.
Fan, J., Chau, K. P., Wan, X., Zhai, L., & Lau, E. (2012). Prediction of facial attractiveness from
facial proportions. Pattern Recognition, 45, 2326–2334.
Farkas, L. G., Hreczko, T. A., Kolar, J. C., & Munro, I. R. (1985). Vertical and horizontal pro-
portions of the face in young adult North American caucasians. Plastic and Reconstructive
Surgery, 75, 328–338.
Farkas, L. G. & Kolar, J. C. (1987). Anthropometrics and art in the aesthetics of women’s faces.
Clinics in Plastic Surgery, 14, 599–616.
Feinberg, D. R., DeBruine, L. M., Jones, B. C., & Perrett, D. I. (2008). The role of femininity
and averageness of voice pitch in aesthetic judgments of women’s voices. Perception, 37(4),
615–623.
Feinberg, D. R., Jones, B. C., Little, A. C., Burt, D. M., & Perrett, D. I. (2005). Manipulations
of fundamental and formant frequencies influence the attractiveness of human male voices.
Animal Behaviour, 69(3), 561–568.
Fink, B., Grammer, K., & Matts, P. J. (2006). Visible skin color distribution plays a role in the
perception of age, attractiveness, and health in female faces. Evolution and Human Behavior,
27, 433–442.
Fink, B., Grammer, K., & Thornhill, R. (2001). Human (Homo sapiens) facial attractiveness in
relation to skin texture and color. Journal of Comparative Psychology, 115, 92–99.
Fishwick, P. A. (2013). Aesthetic computing. In Mads, Soegaard, & Rikke Friis Dam (Eds),
The Encyclopedia of Human–Computer Interaction (2nd edn). Aarhus, Denmark: Interaction
Design Foundation.
Grimm, M. & Kroschel, K. (2005). Evaluation of natural emotions using self-assessment
manikins. In Proceedings of ASRU 2005 – Automatic Speech Recognition and Understanding
Workshop (pp. 381–385).
Gunes, H. (2011). A survey of perception and computation of human beauty. In Proceedings of
ACM Multimedia International Workshop on Social Signal Processing (pp. 19–24).
Gunes, H. & Pantic, M. (2010). Automatic, Dimensional and Continuous Emotion Recognition.
International Journal of Synthetic Emotions, 1(1), 68–99.
Gunes, H. & Piccardi, M. (2006). Assessing facial beauty through proportion analysis by
image processing and supervised learning. International Journal of Human–Computer Stud-
ies, 64(12), 1184–1199.
Gunes, H. & Schuller, B. (2013). Categorical and dimensional affect analysis in continuous input:
Current trends and future directions. Image & Vision Computing, 31(2), 120–136.
Hewlett Sanchez, M., Lawson, A., Vergyri, D., & Bratt, H. (2012). Multi-system fusion of
extended context prosodic and cepstral features. In Proceedings of Interspeech, September,
Portland, OR.
Hodges-Simeon, C. R., Gaulin, S. J. C., & Puts, D. A. (2010). Different vocal parameters predict
perceptions of dominance and attractiveness. Human Nature, 21(4), 406–427.
Hoeing, F. (2005). Defining computational aesthetics. In Proceedings of Computational Aesthet-
ics in Graphics, Visualization and Imaging.
Hughes, S. M., Dispenza, F., & Gallup, G. G. (2004). Ratings of voice attractiveness predict
sexual behavior and body configuration. Evolution and Human Behavior, 25, 295–304.
Huntley, H. E. (1970). The Divine Proportion: A Study in Mathematical Beauty. New York: Dover
Publications.
Jefferson, Y. (1993). Facial aesthetics-presentation of an ideal face. Journal of General Orthodon-
tics, 4, 18–23.
Ji, H. I., Kamachi, M., & Akamatsu, S. (2004). Analyses of facial attractiveness on feminised and
juvenilised faces. Perception, 33, 135–145.
Joshi, J., Gunes, H., & Goecke, R. (2014). Automatic prediction of perceived traits using visual
cues under varied situational context. In Proceedings of 22nd International Conference on
Pattern Recognition (ICPR).
Jürgens, C., Johannsen, W., & Fellbaum, K. (1996). Zur Eignung von Sprechern für die
Lautelemente-Bibliothek eines Sprachsynthesesystems. In: Proceedings of ITG Fachtagung
Sprachkommunikation, September 17–18, Frankfurt am Main, Germany.
Kagian, A., Dror, G., Leyvand, T., Cohen-Or, D., & Ruppin, E. (2008a). A Humanlike predictor
of facial attractiveness. Advances in Neural Information Processing Systems, 19, 674–683.
Kagian, A., Dror, G., Leyvand, T., et al. (2008b). A machine learning predictor of facial attrac-
tiveness revealing human-like psychophysical biases. Vision Research, 48, 235–243.
Kalayci, S., Ekenel, H. K., & Gunes, H. (2014). Automatic analysis of facial attractiveness from
video. In Proceedings of IEEE International Conference on Image Processing (ICIP).
Kelly, M. (2013). Commentary on: Fishwick, Paul A. (2013): Aesthetic Computing. In: Soe-
gaard, Mads, & Dam, Rikke Friis (eds), The Encyclopedia of Human-Computer Interaction,
2nd Ed.
Kenny, D. T. & Mitchell, H. F. (2004). Visual and auditory perception of vocal beauty: Conflict
or concurrence? In Proceedings of the 8th International Conference on Music Perception &
Cognition (pp. 171–174).
Ketzmerick, B. (2007). Zur auditiven und apparativen Charakterisierung von Stimmen. Dresden:
TUDpress.
Kim, J. & Moon, J. Y. (1998). Designing towards emotional usability in customer interfaces–
trustworthiness of cyber-banking system interfaces. Interacting with Computers, 10(1), 1–29.
Langlois, J. H. & Roggman, L. A. (1990). Attractive faces are only average. Psychological Sci-
ence, 1, 115–121.
Liu, X. & Xu, Y. (2011). What makes a female voice attractive? In Proceedings of ICPhS
(pp. 1274–1277).
Lu, D. & Sha, F. (2012). Predicting Likability of Speakers with Gaussian Processes. In Proceed-
ings of Interspeech, September, Portland, OR.
Michiels, G. & Sather, A. H. (1994). Determinants of facial attractiveness in a sample of white
women. International Journal of Adult Orthodontics and Orthognathic Surgery, 9, 95–103.
Mizumoto, Y., Deguchi, T., & Fong, K. W. C. (2009). Assessment of facial golden proportions
among young Japanese women. American Journal of Orthodontics and Dentofacial Orthope-
dics, 136, 168–174.
Montacié, C. & Caraty, M.-J. (2012). Pitch and intonation contribution to speakers’ traits classi-
fication. In Proceedings of Interspeech, September, Portland, OR.
Nguyen, T., Liu, S., Ni, B., et al. (2012). Sense beauty via face, dressing, and/or voice. In Pro-
ceedings of ACM Multimedia (pp. 239–248). Nara, Japan.
Norman, Donald A. (2004). Emotional Design: Why We Love (Or Hate) Everyday Things. New
York: Basic Books.
Pantic, M. & Vinciarelli, A. (2009). Implicit human-centered tagging. IEEE Signal Processing
Magazine, 26(6), 173–180.
Parris, C. & Robinson, J. (1999). The bold and the beautiful according to plastic surgeons, Tech-
nical report. Dallas, TX.
Pinto-Coelho, L., Braga, D., Sales-Dias, M., & Garcia-Mateo, C. (2011). An automatic voice
pleasantness classification system based on prosodic and acoustic patterns of voice preference.
In Proceedings of Interspeech (pp. 2457–2460).
Pinto-Coelho, L., Braga, D., Sales-Dias, M., & Garcia-Mateo, C. (2013). On the development
of an automatic voice pleasantness classification and intensity estimation system. Computer
Speech and Language, 27(1), 75–88.
Pipitone, R. Nathan & Gallup, G. G. (2008). Women’s voice attractiveness varies across the men-
strual cycle. Evolution and Human Behavior, 29(4), 268–274.
Pohjalainen, J., Kadioglu, S., & Räsänen, O. (2012). Feature selection for speaker traits. In Pro-
ceedings of Interspeech, September, Portland, OR.
Rhodes, G. & Tremewan, T. (1996). Averageness exaggeration and facial attractiveness. Psycho-
logical Science, 7, 105–115.
Ricketts, M. D. (1982). Divine proportions in facial aesthetics. Clinics in Plastic Surgery, 9, 401–
422.
Riding, D., Lonsdale, D., & Brown, B. (2006). The effects of average fundamental frequency
and variance of fundamental frequency on male vocal attractiveness to women. Journal of
Nonverbal Behaviour, 30, 55–61.
Saxton, T. (2005). Facial and vocal attractiveness: a developmental and cross-modality study. PhD
thesis, University of Edinburgh.
Saxton, T. K., Caryl, P. G., & Roberts, S. C. (2006). Vocal and facial attractiveness judgments of
children, adolescents and adults: The ontogeny of mate choice. Ethology, 112(12), 1179–1185.
Schmid, K., Marx, D., & Samal, A. (2008). Computation of a face attractiveness index based on
neoclassical canons, symmetry, and golden ratios. Pattern Recognition, 41, 2710–2717.
Schuller, B., Steidl, S., Batliner, A., et al. (2012). The INTERSPEECH 2012 Speaker Trait Chal-
lenge. In Proceedings of Interspeech 2012.
Schuller, B., Wöllmer, M., Eyben, F., Rigoll, G., & Arsić, D. (2011). Semantic speech tagging:
Towards combined analysis of speaker traits. In Proceedings of AES 42nd International Con-
ference (pp. 89–97). Ilmenau, Germany: Audio Engineering Society.
Sutic, D., Brekovic, I., Huic, R., & Jukic, I. (2010). Automatic evaluation of facial attractiveness.
In Proceedings of MIPRO.
Swaddle, J. P. & Cuthill, I. C. (1995). Asymmetry and human facial attractiveness: Symmetry
may not always be beautiful. Biological Sciences, 261, 111–116.
Tovee, M. J., Maisey, D. S., Emery, J. L., & Cornelissen, P. L. (1999). Visual cues to female
physical attractiveness. Proceedings: Biological Sciences, 266(1415), 211–218.
Tuomi, S. & Fisher, J. (1979). Characteristics of simulated sexy voice. Folia Phoniatrica, 31(4),
242–249.
Valenzano, D. R., Mennucci, A., Tartarelli, G., & Cellerino, A. (2006). Shape analysis of female
facial attractiveness. Vision Research, 46, 1282–1291.
Vinciarelli, A., Pantic, M., & Bourlard, H. (2009). Social signal processing: Survey of an emerg-
ing domain. Image Vision Computing, 27, 1743–1759.
Weiss, B. & Burkhardt, F. (2010). Voice attributes affecting likability perception. In Proceedings
of Interspeech (pp. 1485–1488).
Weiss, B. & Möller, S. (2011). Wahrnehmungsdimensionen von Stimme und Sprechweise. In
Proceedings of ESSV 2011 – 22. Konferenz Elektronische Sprachsignalverarbeitung (pp. 261–
268), September 28–30, Aachen, Germany.
Weninger, F., Wöllmer, M., & Schuller, B. (2011). Automatic assessment of singer traits in popular
music: Gender, age, height and race. In Proceedings of the 12th International Society for Music
Information Retrieval Conference, ISMIR 2011 (pp. 37–42). Miami, FL: ISMIR.
White, R., Eden, A., & Maire, M. (2004). Automatic prediction of human attractiveness. UC
Berkeley CS280A Project.
Whitehill, J. & Movellan, J. R. (2008). Personalized facial attractiveness prediction. In Proceed-
ings of IEEE FGR (pp. 1–7).
Wu, D. (2012). Genetic algorithm based feature selection for speaker trait classification. In Pro-
ceedings of Interspeech 2012.
Zhang, P. (2009). Theorizing the relationship between affect and aesthetics in the ICT design
and use context. In Proceedings of the International Conference on Information Resources
Management (pp. 1–15).
Zimbler, M. S. & Ham, J. (2010). Aesthetic facial analysis. In C. Cummings & P. Flint (Eds),
Cummings Otolaryngology Cummings Head and Neck Surgery. St Louis: Mosby Elsevier.
Zuckerman, M. & Miyake, K. (1993). The attractive voice: What makes it so? Journal of Nonver-
bal Behaviour, 17(2), 119–135.

Further Reading

Kocinski, Krzysztof (2013). Perception of facial attractiveness from static and dynamic stimuli.
Perception, 42, 163–175.
15 Interpersonal Synchrony: From Social
Perception to Social Interaction
Mohamed Chetouani, Emilie Delaherche, Guillaume Dumas, and David Cohen

Introduction

Synchrony refers to individuals’ temporal coordination during social interactions
(Cappella, 2005). The analysis of this phenomenon is complex, requiring the percep-
tion and integration of multimodal communicative signals. The evaluation of synchrony
has received multidisciplinary attention because of its role in early development (Feld-
man, 2003), language learning (Goldstein, King, & West, 2003), and social connection
(Harrist & Waugh, 2002). Initially, instances of synchrony were directly perceived in the
data by trained observers. Several methods have been proposed to evaluate interactional
synchrony, ranging from behavior microanalysis (Cappella, 1997) to global perception
of synchrony (Bernieri, Reznick, & Rosenthal, 1988). Behavioral synchrony has now
captured the interest of researchers in such fields as social signal processing, robotics,
and machine learning (Prepin & Pelachaud, 2011; Kozima, Michalowski, & Nakagawa,
2009).
In this chapter, we focus especially on description and definition of synchrony for
the development of computational models. The chapter begins with a review of evi-
dence of interpersonal synchrony from different research domains (psychology, clinical
research, neuroscience, and biology). Then, we introduce a working definition of interper-
sonal synchrony (see Proposed Definition). The chapter surveys evaluation models
and methods from the literature of psychology (see Non-computational Methods of
Synchrony Assessment) and social signal processing (see Fully Automatic Measures
of Synchrony). Finally, the chapter discusses a number of challenges that need to be
addressed (see Conclusions and Main Challenges).

Non-verbal Evidence of Interpersonal Synchrony

Among social signals, synchrony and coordination have received attention only recently
(Ramseyer & Tschacher, 2010; Delaherche et al., 2012). Condon and Ogston (1967)
initially proposed a microanalysis of human behavior (body motion and speech intona-
tion) and evidenced the existence of interactional synchrony, the coordination between
listener’s and speaker’s body movements, or between the listener’s body movement and
the speaker’s pitch and stress variations. Bernieri et al. (1988) define coordination as
the “degree to which the behaviors in an interaction are non-random, patterned or
synchronized in both form and timing.” Kendon (1970) raises fundamental questions
about the conditions under which interactional synchrony arises and its function in interaction.
By synchronizing with the speaker, the listener demonstrates an ability to antici-
pate what the speaker is going to say. In this way, the listener gives feedback to the speaker and
smooths the running of the encounter.
In the “double video setting”, several teams manipulated the timing of exchanges
between mother and baby by alternating live and pre-recorded exchanges (Nadel et al.,
1999). They showed that in the pre-recorded sessions the child showed more negative
signs (anger or distress manifestations, cries) and that when they came back to the live
exchanges, the positive signals (gazes toward the mother, smiles...) were restored. These
experiments demonstrated that infants expect synchronized and contingent
exchanges with the social partner (here the mother) from as early as two months of age. The key
role of synchrony was also found at an early age in more natural interactions such as
home breast feeding (Viaux-Savelon et al., 2012). In Saint-Georges et al. (2011), we
investigated early signs of autism by modeling the child’s development from an inter-
personal synchrony point of view. Regarding synchrony, the main results show that
(i) parents seemed to perceive weaker interactive responsiveness and, above all, weaker initia-
tive from their infants and (ii) parents increasingly tried to supply soliciting behaviours
and touching.
Among social signals, interpersonal coordination is of great importance for evaluating the degree of attention or engagement between two social partners. It is often related to the quality of interaction (Chartrand & Bargh, 1999), cooperation (Wiltermuth & Heath, 2009) or entitativity (Lakens, 2010). Finally, its assessment constitutes a first step toward equipping a social robot with the ability to anticipate a human partner’s reactions and enter into synchrony with him or her (Michalowski, Simmons, & Kozima, 2009; Prepin & Gaussier, 2010; Boucenna et al., 2014).

Biological Evidence of Interpersonal Synchrony

Concerning the development of social interaction, it is important to highlight the major role of synchrony of rhythms in bonding. Thus, Guedeney et al. (2011) emphasize
the importance of synchronization between infant and parental rhythms in very early
social interaction and socio-emotional development, from biological rhythms during
pregnancy to later exchange between caregiver and child.
Synchrony between partners has also been correlated with biological markers. In Feldman (2007), a biobehavioral synchrony
model is introduced on the basis of investigations of synchrony through physiological
signals (e.g. ECG, skin conductance) and behaviors during parent–infant interactions.
Naturally occurring variations in maternal behavior are associated with differences in
estrogen-inducible central oxytocin receptors, which are involved in pro-social behav-
iors (Champagne et al., 2001). Oxytocin appears to enhance both maternal/paternal
as well as affiliative behaviors in humans and is considered as the bonding hormone
(Weisman, Zagoory-Sharon, & Feldman, 2012).

Dumas et al. (2010) use hyper-scanning recordings to examine brain activity, includ-
ing measures of neural synchronization between distant brain regions of interacting indi-
viduals through a free exchange of roles between the imitator and the model. Their study
was the first to record dual EEG activity in dyads of subjects during spontaneous non-
verbal interaction. Five female-female pairs and six male-male pairs were scanned. They showed that interpersonal hand movements were correlated with the emergence of synchronization in the alpha–mu band (a frequency band implicated in social interaction; Perry, Troje, & Bentin, 2010) between the right centro-parietal regions of the two brains.
Rhythm, synchrony, and emotion are increasingly being viewed by developmental
psychologists as key aspects of appropriate early interaction (Feldman, 2007; Saint-
Georges et al., 2013; Weisman et al., 2013).

Proposed Definition

Synchrony is the dynamic and reciprocal adaptation of the temporal structure of behav-
iors between interactive partners. Unlike mirroring or mimicry, synchrony is dynamic
in the sense that the important element is the timing, rather than the nature of the behav-
iors. As noted in Ramseyer and Tschacher (2006), the distinction between synchrony
and mirroring can be unclear; these phenomena are not disjunctive and can often be
observed simultaneously.
As described in Harrist and Waugh (2002), synchrony requires (1) a maintained focus, (2) a shared focus of attention, (3) temporal coordination, and (4) contingency. Computational models of synchrony need most, if not all, of these ingredients. The main problem is that each ingredient is ambiguous and requires further investigation, taking into account advances in various fields such as computational linguistics, social signal processing, social robotics, and virtual agents.

Non-computational Methods of Synchrony Assessment

Several non-computational methods have been proposed to evaluate interpersonal synchrony, ranging from behavior microanalysis to global perception of synchrony. Behavioral coding methods evaluate the behavior of each interactional partner on
a local scale. These methods require the use of computer-based coding (e.g., Observer
or Anvil) (Kipp, 2008) and trained raters. Various category and time scales can be used
for coding. Generally, a measure of synchrony is deduced from the covariation of the
annotated behaviors. The codes can be either continuous (speed of a gesture) or cat-
egorical (type of gesture). Cappella (2005) synthesized the three crucial questions to
be addressed when conducting an interaction study: “what to observe (coding), how to
represent observations (data representations) and when and how frequently to make the
observations (time)”.
Behavioral coding methods are time-consuming and tedious with regard to the train-
ing of observers, the number of behaviors coded and the duration of the video files to be
coded, particularly for longitudinal studies. Cappella (1997) and Bernieri et al. (1988)
proposed an alternative to behavior microanalysis: the judgment method. In their studies, they investigated the use of human raters to evaluate video clips of infants inter-
acting with their mothers. Raters judge simultaneous movement, tempo similarity, coordination, and smoothness on a longer time scale using a Likert scale. Cappella
showed that untrained judges were consistent with one another and reliably judged the
synchrony between partners (Cappella, 1997).
Non-computational methods suffer from serious drawbacks. Beyond the tedium of the coding task, segmenting and annotating behaviors can be confusing: when does a behavior start, when does it end, and how should it be labeled? Often, the annotator makes trade-offs because no label accurately describes what he or she observes. The judges’ reliability in
assessing such a subjective and complex construct is also questionable, and no general
framework for synchrony assessment has been accepted to date. A method was recently
proposed to convert the judgments of multiple annotators in a study on dominance into
a machine learning framework (Chittaranjan, Aran, & Gatica-Perez, 2011). Finally, con-
versational partners are often studied individually when coding. Thus, it is particularly
difficult to recreate the dynamic and interpersonal aspects of social interaction manu-
ally and after coding. Nonetheless, annotation and judgment methods are essential in
proposing automatic systems for synchrony assessment and testing their performance.
Currently, no automatic system modeling synchrony from real interaction data is free from annotation.
Annotation is mainly used in two different manners. First, annotation is used to train
automatic systems to model and learn communication dynamics (see Machine under-
standing of interpersonal synchrony). These studies often rely on behavioral coded
databases. Second, another set of studies intends to measure the degree of synchrony
between dyadic partners with unsupervised methods. In these studies, the measure of
synchrony is not validated per se, but is judged by its ability to predict an outcome
variable that has been manually annotated, often using judgment methods. The outcome
variable can be friendship (Altmann, 2011), conflicting situations (Altmann, 2011), suc-
cess in psychotherapy (Ramseyer & Tschacher, 2011), etc.

Fully Automatic Measures of Synchrony

To exploit synchrony cues in human–machine interaction, automatic techniques can be used to capture pertinent social signals and assess movement synchrony in human–
human interactions. These studies aim at measuring the degree of similarity between the
dynamics of the non-verbal behaviors of dyadic partners. The goals of these studies are
generally divisible into two categories: (a) compare the degree of synchrony under dif-
ferent conditions (e.g., with or without visual feedback) (Shockley, Santana, & Fowler,
2003; Varni, Volpe, & Camurri, 2010) and (b) study the correlation between the degree
of synchrony and an outcome variable (e.g., friendship, relationship quality) (Altmann,
2011; Ramseyer & Tschacher, 2011).
The first step in computing synchrony is to extract the relevant features of the dyad’s
motion with motion-tracking devices (Ashenfelter et al., 2009), image-processing
techniques (tracking algorithms, image differencing) (Delaherche & Chetouani, 2010; Varni et al., 2010), or physiological sensors (Varni et al., 2010). After extracting the
motion features, a measure of similarity is applied. Correlation is the most commonly
used method to assess interactional synchrony (Altmann, 2011; Ramseyer & Tschacher,
2011). A time-lagged cross-correlation is applied between the movement time series of
the interactional partners using short windows of interaction. Another method to assess
the similarity of motion of two partners is recurrence analysis (Richardson, Dale, &
Shockley, 2008). Recurrence analysis assesses the points in time that two systems show
similar patterns of change or movement, called “recurrence points”. Spectral methods
constitute an interesting alternative to temporal methods when dealing with rhythmic
tasks. Spectral methods measure the evolution of the relative phase between the two
partners as an indication of a stable time-lag between them (Oullier et al., 2008; Richard-
son et al., 2007). Spectral methods also measure the overlap between the movement
frequencies of the partners, called cross-spectral coherence (Richardson & Dale, 2005;
Richardson et al., 2007; Delaherche & Chetouani, 2010) or power spectrum overlap
(Oullier et al., 2008).
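To make the correlation-based approach concrete, the following is a minimal Python/NumPy sketch of a sliding-window, time-lagged cross-correlation between two motion time series. The 25 Hz frame rate, the window and lag lengths, and the synthetic "motion energy" signals are illustrative assumptions rather than the settings of any of the studies cited above.

```python
import numpy as np

def windowed_lagged_xcorr(x, y, fs, win_s=10.0, max_lag_s=2.0, step_s=5.0):
    """Peak Pearson correlation between two 1-D movement time series, computed
    over short sliding windows and over a range of time lags (one simple
    synchrony score per window)."""
    win, max_lag, step = int(win_s * fs), int(max_lag_s * fs), int(step_s * fs)
    peaks, peak_lags = [], []
    for start in range(max_lag, len(x) - win - max_lag, step):
        a = x[start:start + win]
        best_r, best_lag = -np.inf, 0
        for lag in range(-max_lag, max_lag + 1):
            b = y[start + lag:start + lag + win]
            r = np.corrcoef(a, b)[0, 1]        # correlation at this lag
            if not np.isnan(r) and r > best_r:
                best_r, best_lag = r, lag
        peaks.append(best_r)
        peak_lags.append(best_lag / fs)        # lag in seconds
    return np.array(peaks), np.array(peak_lags)

# Illustrative data: partner B loosely follows partner A with a 0.5 s delay.
fs = 25.0                                      # hypothetical 25 Hz video frame rate
t = np.arange(0, 120, 1 / fs)
a = np.abs(np.sin(0.4 * t)) + 0.1 * np.random.randn(len(t))
b = np.roll(a, int(0.5 * fs)) + 0.2 * np.random.randn(len(t))
peaks, lags = windowed_lagged_xcorr(a, b, fs)
print(peaks.mean(), lags.mean())
```

The mean or distribution of the per-window peaks can then serve as a dyad-level synchrony score, and the lags at which the peaks occur indicate which partner tends to lead.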
A critical question when attempting to detect dependence relationships between fea-
tures is where the boundary should be between scores indicating significant and insignif-
icant synchrony. A widespread method consists of applying surrogate statistical test-
ing (Richardson & Dale, 2005; Ashenfelter et al., 2009; Sun, Truong et al., 2011;
Delaherche & Chetouani, 2010). Video images of dyadic partners are isolated and re-
combined in a random order to synthesize surrogate data (pseudo-interactions). Syn-
chrony scores are assessed using the original and surrogate datasets. The synchrony
scores on the surrogate dataset constitute a baseline for judging the dyad’s coordination. Fully automatic measures of movement synchrony are subject to several criticisms
in the context of studying naturalistic interaction data. First, the measures provided by
these methods are mostly global and do not shed light on what happened locally during
the interaction; they do not provide a local model of the communication dynamics. Sec-
ond, the importance of speech and multimodality is often concealed in these methods.
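The surrogate-testing idea can be sketched as follows, under the assumption that a corpus of dyads is available, that each dyad's two motion time series have been trimmed to a common length, and that score_fn is any dyad-level synchrony score (for instance, the mean peak of the windowed cross-correlation above). Randomly re-pairing partners across dyads is only one of several possible ways of constructing pseudo-interactions.

```python
import numpy as np

def surrogate_baseline(xs, ys, score_fn, n_surrogates=200, seed=0):
    """Compare real dyads against pseudo-interactions obtained by re-pairing
    partners across dyads at random.

    xs, ys : lists of per-dyad time series (person A and person B, same length)
    Returns the real scores, the surrogate score distribution, and an empirical
    p-value per dyad (fraction of surrogate pairings scoring at least as high).
    """
    rng = np.random.default_rng(seed)
    real = np.array([score_fn(x, y) for x, y in zip(xs, ys)])
    surrogate = []
    for _ in range(n_surrogates):
        perm = rng.permutation(len(ys))
        # Skip permutations that leave any genuine pairing intact.
        if np.any(perm == np.arange(len(ys))):
            continue
        surrogate.extend(score_fn(xs[i], ys[j]) for i, j in enumerate(perm))
    surrogate = np.array(surrogate)
    p_values = np.array([(surrogate >= r).mean() for r in real])
    return real, surrogate, p_values
```

A dyad whose real score exceeds most of the surrogate distribution (a small empirical p-value) can then be regarded as more coordinated than chance, which is one way of addressing the boundary question raised above.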

Machine Understanding of Interpersonal Synchrony


Given these criticisms, many in the field adopted the alternative practice of modeling
the timing and occurrence of higher-level behavioral events, such as smiles, head ges-
tures, gazes and speaker changes. These behavioral events can be either extracted from
a human-annotated database or predicted from low-level signals automatically extracted
from data. These methods arise from a great interest in identifying the dynamical pat-
terns of interaction and characterizing recurrent interpersonal behaviors.
Machine learning methods offer an interesting framework for the exploration of inter-
active behaviors. A key challenge is proposing models that capture the content and tempo-
ral structure of dyadic interactions. Various sequential learning models, such as Hid-
den Markov Models (HMMs) or Conditional Random Fields (CRFs), are usually used
to characterize the temporal structure of social interactions. Messinger et al. employ
related techniques for the understanding of communicative development, which is
characterized by mutual influences during interaction: infants and parents influence and respond to one another during communication (Messinger et al., 2010). In Mahd-
haoui and Chetouani (2011), an integrative approach is proposed to explicitly consider
the interaction synchrony of behaviors. The model is applied to the characterization of
parent–infant interactions for differential diagnosis: autism (AD), intellectual disability
(ID), and typical development (TD). The authors estimate transitions between behaviors
of the infant and the parent by analyzing behaviors co-occurring in a 3-second window.
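As a toy illustration of such window-based co-occurrence analysis (and not a reproduction of Mahdhaoui and Chetouani's non-negative matrix factorization model), the sketch below simply counts how often each annotated parent behaviour starts within three seconds of each annotated infant behaviour; the event labels and timestamps are invented.

```python
from collections import Counter

def co_occurrence_counts(infant_events, parent_events, window_s=3.0):
    """Count (infant_label, parent_label) pairs in which the parent behaviour
    starts within `window_s` seconds after the infant behaviour.
    Events are (onset_in_seconds, label) tuples."""
    counts = Counter()
    for t_i, lab_i in infant_events:
        for t_p, lab_p in parent_events:
            if 0.0 <= t_p - t_i <= window_s:
                counts[(lab_i, lab_p)] += 1
    return counts

# Hypothetical annotations (onsets in seconds from the start of the clip).
infant = [(1.0, "gaze_to_parent"), (4.2, "vocalisation"), (9.0, "smile")]
parent = [(2.1, "motherese"), (4.9, "touch"), (12.5, "motherese")]
print(co_occurrence_counts(infant, parent))
# -> counts of 1 for ('gaze_to_parent', 'motherese') and ('vocalisation', 'touch')
```

Normalizing such counts by the frequency of each behaviour turns them into estimates of how strongly one partner's behaviour tends to be followed by the other's, which is the kind of quantity that models of interaction dynamics operate on.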
Among interpersonal behaviors, the prediction of turn-taking and back-channels has
been widely studied with a view to building fluent dialog systems. The central
idea is to develop “predictive models of communication dynamics that integrate previ-
ous and current actions from all interlocutors to anticipate the most likely next actions
of one or all interlocutors” (Ozkan, Sagae, & Morency, 2010). The purpose of the turn-
taking prediction is to accurately predict the timing between speaker transitions and the
upcoming type of utterance (speaker holding the floor, speaker changes) as it occurs
in human–human interactions (Ward, Fuentes, & Vega, 2010). Back-channel behavior
assures the speaker that the listener is paying attention and is in the same state in the con-
versation (Thorisson, 2002). Several teams have investigated how speaker behavior triggers listeners’ back-channels (Morency, Kok, & Gratch, 2008; Huang, Morency,
& Gratch, 2011; Gravano & Hirschberg, 2009; Al Moubayed et al., 2009).
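In its most basic form, the prediction task can be reduced to a supervised classifier that maps speaker cues observed around a pause to the probability that the listener produces a back-channel. The sketch below is only a schematic stand-in for the much richer probabilistic and multimodal models cited above: the four hand-picked speaker features (falling pitch, pause length, gaze at the listener, low-energy ending) and the toy labels are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One row per speaker pause, with hypothetical binary/continuous cues:
# [falling_pitch, pause_length_s, gaze_at_listener, low_energy_ending]
X = np.array([
    [1, 0.6, 1, 1], [0, 0.1, 0, 0], [1, 0.4, 1, 0], [0, 0.2, 1, 0],
    [1, 0.8, 0, 1], [0, 0.1, 0, 1], [1, 0.5, 1, 1], [0, 0.3, 0, 0],
])
# 1 = an annotated listener back-channel followed within one second.
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

clf = LogisticRegression()
print(cross_val_score(clf, X, y, cv=4).mean())   # chance level is 0.5 here
```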

Conclusions and Main Challenges

Several questions regarding the dimension and perception of synchrony remain to be explored. These questions are fundamental to the development of an automatic model
to assess synchrony.
The first issue relates to the nature of synchrony: is synchrony an all-or-none con-
dition (synchronous vs. non-synchronous)? Is synchrony a continuous or a discrete
notion? Or can dyadic interaction approach or move away from synchrony (Harrist &
Waugh, 2002)? Most current sources suggest that synchrony varies over the course of
interaction, being stronger at the beginning and the ending of an exchange (Kendon,
1970) or at moments of particular engagement (Campbell, 2009). Feldman operational-
izes synchrony as the degree to which the partners change their affective behavior in
reference to one another and obtains a number ranging between zero and one (Feldman,
2003). When addressing the matter of movement synchrony and its relation to perceived
entitativity, Lakens observed that objective differences in movement rhythms were lin-
early related to ratings of perceived entitativity (Lakens, 2010). A recent study showed
that the perception of coordination was more unanimous when coordination was very
high or very low. However, judges were not reliable when judging dyads with “medium”
coordination (Delaherche & Chetouani, 2011).
The second issue relates to the multiple scales of interpersonal synchrony. As previ-
ously described, there is evidence of interpersonal synchrony at different levels: behav-
ioral, neural, and physiological. A major challenge is to propose frameworks dealing
with these different levels (Kelso, Dumas, & Tognoli, 2013; Chatel-Goldman et al.,
2013). This will require specific tools and protocols in order to acquire, process and
model various signals. In addition, interpersonal synchrony has been found at different
timescales, ranging from milliseconds to minutes. Social signal processing approaches should now deal with multi-scale situations using various sources of information. Weisman et al. (2013) describe a first approach to analyzing the effect of oxytocin during parent–infant interaction. Understanding these mechanisms will help in proposing objective evaluations of interpersonal synchrony and, more generally, will be of great benefit for social signal processing in terms of low-resolution brain scanning (Pentland et al., 2009).
The third issue is related to acquisition and annotation of databases. Indeed, the defi-
nition of coordination is wide and different dimensions of coordination can be analyzed.
Several works have shown that the similarity measures do not always predict the degree
of coordination perceived by annotators. This begs the question: which signals do annotators rely on when they perceive that partners are coordinated? These questions
relate to definitions and dimensions of interpersonal synchrony. In response, a collab-
oration with psychologists seems essential. The question of the corpus is also crucial.
As in other related domains (e.g. affective computing), the availability of real-life, annotated, and publicly distributed databases was a breakthrough that allowed researchers to propose new relevant models (e.g. continuous models of emotions). Indeed, defining a research protocol, collecting interaction data, and annotating them is a long process. In addition, such publicly shared baselines would allow the performance of different systems to be compared. Until the recent contribution of Sun, Lichtenhauer et al.’s (2011) mimicry database, no publicly available annotated corpus was dedicated to the detection of synchrony. We can hope that this effort will benefit the field, aiding engineers in developing new algorithms by allowing them to skip the data collection and annotation phases.
The fourth issue is related to machine understanding of interpersonal synchrony. Most
studies investigate interpersonal synchrony through similarity measures (ranging from
correlation to recurrence analysis) in relation to variables such as pathological groups or success of interaction. Very few studies propose predictive approaches evaluated against ground truth using traditional machine learning metrics (Petridis, Leveque, & Pantic, 2013; Michelet et al., 2012; Delaherche et al., 2013). The reasons are multiple and include the lack of databases. Clearer definitions may also help in proposing relevant models. For instance, Delaherche et al. (2013) consider imitation as an unsupervised action recognition problem, where the idea is to detect similar actions independently of the nature of the actions performed by the partner.
The last issue relates to the identification of applications. Automatic characterization
of interpersonal synchrony might be of great interest in psychology. Such methods could
provide automatic and objective tools to study interactive abilities in several psychiatric
conditions, such as depression and autism. Although few studies are currently available
in this specific field, those that exist appear promising, with studies on, for example, couple
therapy (Lee et al., 2011), success in psychotherapy (Ramseyer & Tschacher, 2011),
and mother–infant interaction (Cohn, 2010). Another great potential lies in the oppor-
tunity to build robots or virtual agents with interactive abilities (Gratch et al., 2007;
Al Moubayed et al., 2009; Prepin & Pelachaud, 2011; Boucenna et al., 2014).

Acknowledgments

This work was supported by the UPMC “Emergence 2009” program, the European
Union Seventh Framework Programme under grant agreement no 288241, and the
Agence Nationale de la Recherche (SAMENTA program: SYNED-PSY). This work
was performed within the Labex SMART supported by French state funds managed
by the ANR within the Investissements d’Avenir programme under reference ANR-11-
IDEX-0004-02.

References

Al Moubayed, S., Baklouti, M., Chetouani, M., et al. (2009). Generating robot/agent backchannels
during a storytelling experiment. In Proceedings of IEEE International Conference on Robotics
and Automation (pp. 3749–3754).
Altmann, U. (2011). Studying movement synchrony using time series and regression models. In
I. A. Esposito, R. Hoffmann, S. Hübler, & B. Wrann (Eds), Program and abstract of the COST
2102 Final Conference held in conjunction with the 4th COST 2102 International Training
School on Cognitive Behavioural Systems (p. 23).
Ashenfelter, K. T., Boker, S. M., Waddell, J. R., & Vitanov, N. (2009). Spatiotemporal symmetry
and multifractal structure of head movements during dyadic conversation. Journal of Experi-
mental Psychology: Human Perception and Performance, 35(4), 1072–1091.
Bernieri, F. J., Reznick, J. S., & Rosenthal, R. (1988). Synchrony, pseudosynchrony, and dissyn-
chrony: Measuring the entrainment process in mother–infant interactions. Journal of Personal-
ity and Social Psychology, 54(2), 243–253.
Boucenna, S., Anzalone, S., Tilmont, E., Cohen, D., & Chetouani, M. (2014). Learning of social
signatures through imitation game between a robot and a human partner. IEEE Transactions on
Autonomous Mental Development, 6(3), 213–225.
Campbell, N. (2009). An audio-visual approach to measuring discourse synchrony in multimodal
conversation data. In Interspeech (pp. 2159–2162), September, Brighton, UK.
Cappella, J. N. (1997). Behavioral and judged coordination in adult informal social interactions:
Vocal and kinesic indicators. Journal of Personality and Social Psychology, 72, 119–131.
Cappella, J. N. (2005). Coding mutual adaptation in dyadic nonverbal interaction. In V. Manusov
(Ed.), The Sourcebook of Nonverbal Measures: Going Beyond Words (pp. 383–392). Mahwah,
NJ: Lawrence Erlbaum.
Champagne, F., Diorio, J., Sharma, S., & Meaney, M. J. (2001). Naturally occurring variations
in maternal behavior in the rat are associated with differences in estrogen-inducible cen-
tral oxytocin receptors. Proceedings of the National Academy of Sciences, 98(22), 12736–
12741.
Chartrand, T. L. & Bargh, J. A. (1999). The chameleon effect: The perception-behavior link and
social interaction. Journal of Personality and Social Psychology, 76(6), 893–910.
Chatel-Goldman, J., Schwartz, J.-L., Jutten, C., & Congedo, M. (2013). Non-local mind from the
perspective of social cognition. Frontiers in Human Neuroscience, 7, 107.
Chittaranjan, G., Aran, O., & Gatica-Perez, D. (2011). Inferring truth from multiple annotators
for social interaction analysis. In Neural Information Processing Systems (NIPS) Workshop on
Modeling Human Communication Dynamics (HCD) (p. 4).
Cohn, J. F. (2010). Advances in behavioral science using automated facial image analysis and
synthesis. IEEE Signal Processing Magazine, 27(November), 128–133.
Condon, W. S. & Ogston, W. D. (1967). A segmentation of behavior. Journal of Psychiatric
Research, 5, 221–235.
Delaherche, E., Boucenna, S., Karp, K., et al. (2013). Social coordination assessment: Distin-
guishing between shape and timing. In Multimodal Pattern Recognition of Social Signals in
Human–Computer Interaction (vol. 7742, pp. 9–18). Berlin: Springer.
Delaherche, E. & Chetouani, M. (2010). Multimodal coordination: Exploring relevant features
and measures. In Second International Workshop on Social Signal Processing, ACM Multime-
dia 2010.
Delaherche, E. & Chetouani, M. (2011). Characterization of coordination in an imitation task:
Human evaluation and automatically computable cues. In 13th International Conference on
Multimodal Interaction.
Delaherche, E., Chetouani, M., Mahdhaoui, M., et al. (2012). Interpersonal synchrony: A survey
of evaluation methods across disciplines. IEEE Transactions on Affective Computing, 3(3),
349–365.
Dumas, G., Nadel, J., Soussignan, R., Martinerie, J., & Garnero, L. (2010). Inter-brain synchro-
nization during social interaction. PLoS ONE, 5(8), e12166.
Feldman, R. (2003). Infant–mother and infant–father synchrony: The coregulation of positive
arousal. Infant Mental Health Journal, 24(1), 1–23.
Feldman, R. (2007). Parent–infant synchrony and the construction of shared timing: Physiological
precursors, developmental outcomes, and risk conditions. Journal of Child Psychology and
Psychiatry and Allied Disciplines, 48(3–4), 329–354.
Goldstein, M. H, King, A. P., & West, M. J. (2003). Social interaction shapes babbling: Testing
parallels between birdsong and speech. Proceedings of the National Academy of Sciences of
the United States of America, 100(13), 8030–8035.
Gratch, J., Wang, N., Gerten, J., Fast, E., & Duffy, R. (2007). Creating rapport with virtual
agents. IVA ’07: Proceedings of the 7th International Conference on Intelligent Virtual Agents
(pp. 125–138). Berlin: Springer.
Gravano, A. & Hirschberg, J. (2009). Backchannel-inviting cues in task-oriented dialogue. In
Proceedings of InterSpeech (pp. 1019–1022).
Guedeney, A., Guedeney, N., Tereno, S., et al. (2011). Infant rhythms versus parental time: Pro-
moting parent–infant synchrony. Journal of Physiology-Paris, 105(4–6), 195–200.
Harrist, A. W. & Waugh, R. M. (2002). Dyadic synchrony: Its structure and function in children’s
development. Developmental Review, 22(4), 555–592.
Huang, L., Morency, L.-P., & Gratch, J. (2011). A multimodal end-of-turn prediction model:
Learning from parasocial consensus sampling. In The 10th International Conference on
Autonomous Agents and Multiagent Systems AAMAS ’11 (vol. 3, pp. 1289–1290).
Kelso, J. A. S., Dumas, G., & Tognoli, E. (2013). Outline of a general theory of behavior and
brain coordination. Neural Networks, 37(1), 120–131.
Kendon, A. (1970). Movement coordination in social interaction: Some examples described. Acta
Psychologica, 32, 100–125.
Kipp, M. (2008). Spatiotemporal coding in ANVIL. In Proceedings of the 6th International Con-
ference on Language Resources and Evaluation, LREC, Marrakech.
Kozima, H., Michalowski, M., & Nakagawa, C. (2009). Keepon. International Journal of Social
Robotics, 1, 3–18.
Lakens, D. (2010). Movement synchrony and perceived entitativity. Journal of Experimental
Social Psychology, 46(5), 701–708.
Lee, C., Katsamanis, A., Black, M. P., et al. (2011). An analysis of PCA-based vocal entrain-
ment measures in married couples, affective spoken interactions. In Proceedings of InterSpeech
(pp. 3101–3104).
Mahdhaoui, A. & Chetouani, M. (2011). Understanding parent–infant behaviors using non-
negative matrix factorization. In Proceedings of the Third COST 2102 International Training
School Conference on Toward Autonomous, Adaptive, and Context-Aware Multimodal Inter-
faces: Theoretical and Practical Issues (pp. 436–447). Berlin: Springer.
Messinger, D. M., Ruvolo, P., Ekas, N. V., & Fogel, A. (2010). Applying machine learn-
ing to infant interaction: The development is in the details. Neural Networks, 23(8–9),
1004–1016.
Michalowski, M. P., Simmons, R., & Kozima, H. (2009). Rhythmic attention in child–robot dance
play. In Proceedings of RO-MAN 2009, Toyama, Japan.
Michelet, S., Karp, K., Delaherche, E., Achard, C., & Chetouani, M. (2012). Automatic imitation
assessment in interaction. Human Behavior Understanding (vol. 7559, pp. 161–173). Berlin:
Springer.
Morency, L.-P., Kok, I., & Gratch, J. (2008). Predicting listener backchannels: A probabilistic
multimodal approach. In Proceedings of the 8th International Conference on Intelligent Virtual
Agents IVA ‘08 (pp. 176–190). Berlin: Springer.
Nadel, J., Carchon, I., Kervella, C., Marcelli, D., & Roserbat-Plantey, D. (1999). Expectancies for
social contingency in 2-month-olds. Developmental Science, 2(2), 164–173.
Oullier, O., De Guzman, G. C., Jantzen, K. J., Kelso, J. A. S., & Lagarde, J. (2008). Social coordi-
nation dynamics: Measuring human bonding. Social Neuroscience, 3(2), 178–192.
Ozkan, D., Sagae, K., & Morency, L.-P. (2010). Latent mixture of discriminative experts for mul-
timodal prediction modeling. Computational Linguistics, 2, 860–868.
Pentland, A., Lazer, D., Brewer, D., & Heibeck, T. (2009). Using reality mining to improve public
health and medicine. Studies in Health Technology and Informatics, 149, 93–102.
Perry, A., Troje, N. F., & Bentin, S. (2010). Exploring motor system contributions to the per-
ception of social information: Evidence from EEG activity in the mu/alpha frequency range.
Social Neuroscience, 5(3), 272–284.
Petridis, S., Leveque, M., & Pantic, M. (2013). Audiovisual detection of laughter in
human machine interaction. Affective Computing and Intelligent Interaction ACII 2013
(pp. 129–134).
Prepin, K. & Gaussier, P. (2010). How an agent can detect and use synchrony parameter of its own
interaction with a human? In A. Esposito, N. Campbell, C. Vogel, A. Hussain, & A. Nijholt
(Eds), Development of Multimodal Interfaces: Active Listening and Synchrony (pp. 50–65).
Berlin: Springer.
Prepin, K. & Pelachaud, C. (2011). Shared understanding and synchrony emergence: Synchrony
as an indice of the exchange of meaning between dialog partners. In ICAART2011 International
Conference on Agent and Artificial Intelligence (vol. 2, pp. 25–30).
Ramseyer, F. & Tschacher, W. (2006). Synchrony: A core concept for a constructivist approach to
psychotherapy. Constructivism: The Human Sciences, 11, 150–171.
Ramseyer, F. & Tschacher, W. (2010). Nonverbal synchrony or random coincidence? How to
tell the difference. In A. Esposito, N. Campbell, C. Vogel, A. Hussain, & A. Nijholt (Eds),
Development of Multimodal Interfaces: Active Listening and Synchrony (pp. 182–196). Berlin:
Springer.
Ramseyer, F. & Tschacher, W. (2011). Nonverbal synchrony in psychotherapy: Coordinated body
movement reflects relationship quality and outcome. Journal of Consulting and Clinical Psy-
chology, 79(3), 284–295.
Richardson, D. C. & Dale, R. (2005). Looking to understand: The coupling between speakers’ and
listeners’ eye movements and its relationship to discourse comprehension. Cognitive Science,
29(6), 1045–1060.
Richardson, D., Dale, R., & Shockley, K. (2008). Synchrony and Swing in Conversation: Coordi-
nation, Temporal Dynamics, and Communication. Oxford: Oxford University Press.
Richardson, M. J., Marsh, K. L., Isenhower, R. W., Goodman, J. R. L., & Schmidt, R. C.
(2007). Rocking together: Dynamics of intentional and unintentional interpersonal coordina-
tion. Human Movement Science, 26(6), 867–891.
Saint-Georges, C., Chetouani, M., Cassel, R., et al. (2013). Motherese in interaction: At the cross-
road of emotion and cognition? (A systematic review.) PLoS ONE, 8(10), e78103.
Saint-Georges, C., Mahdhaoui, A., Chetouani, M., et al. (2011). Do parents recognize autistic
deviant behavior long before diagnosis? Taking into account interaction using computational
methods. PLoS ONE, 6(7), e22393.
Shockley, K., Santana, M.-V., & Fowler, C. A. (2003). Mutual interpersonal postural constraints
are involved in cooperative conversation. Journal of Experimental Psychology: Human Percep-
tion and Performance, 29(2), 326–332.
Sun, X., Lichtenhauer, J., Valstar, M., Nijholt, A., & Pantic, M. (2011). A multimodal database
for mimicry analysis. In J. Luo (Ed.) Affective Computing and Intelligent Interaction (pp. 367–
376). Berlin: Springer.
Sun, X., Truong, K., Nijholt, A., & Pantic, M. (2011). Automatic visual mimicry expression anal-
ysis in interpersonal interaction. In Proceedings of IEEE International Conference on Com-
puter Vision and Pattern Recognition (CVPR-W’11), Workshop on CVPR for Human Behaviour
Analysis (pp. 40–46).
Thórisson, K. R. (2002). Natural turn-taking needs no manual: Computational theory and model,
from perception to action. In B. Granström, D. House, & I. Karlsson (Eds), Multimodality in
Language and Speech Systems (pp. 173–207). Dordrecht, Netherlands: Kluwer Academic.
Varni, G., Volpe, G., & Camurri, A. (2010). A system for real-time multi-modal analysis of non-
verbal affective social interaction in user-centric media. IEEE Transactions on Multimedia,
12(6), 576–590.
Viaux-Savelon, S., Dommergues, M., Rosenblum, O., et al. (2012). Prenatal ultrasound screening:
False positive soft markers may alter maternal representations and mother–infant interaction.
PLoS ONE, 7(1), e30935.
Ward, N. G., Fuentes, O., & Vega, A. (2010). Dialog prediction for a general model of turn-taking.
In Proceedings of InterSpeech (pp. 2662–2665).
Weisman, O., Delaherche, E., Rondeau, M., et al. (2013). Oxytocin shapes parental motion during
father–infant interaction. Biology Letters, 9(6).
Weisman, O., Zagoory-Sharon, O., & Feldman, R. (2012). Oxytocin administration to parent
enhances infant physiological and behavioral readiness for social engagement. Biological Psy-
chiatry, 72(12), 982–989.
Wiltermuth, S. S. & Heath, C. (2009). Synchrony and cooperation. Psychological Science, 20(1),
1–5.
16 Automatic Analysis of Social Emotions
Hatice Gunes and Björn Schüller

Automatic emotion recognition has widely focused on analysing and inferring the
expressions of six basic emotions – happiness, sadness, fear, anger, surprise, and dis-
gust. Little attention has been paid to social emotions such as kindness, unfriendliness,
jealousy, guilt, arrogance, and shame, or to understanding the consequent social behaviour. Social context plays an important role in labeling and recognizing social emotions, which are difficult to recognise out of context.
Social emotions are emotions that have a social component such as rage arising from
a perceived offense (Gratch, Mao, & Marsella, 2006), or embarrassment deflecting
undue attention from someone else (Keltner & Buswell, 1997). Such emotions are cru-
cial for what we call social intelligence and they appear to arise from social explanations
involving judgments of causality as well as intention and free will (Shaver, 1985).
To date, most of the automatic affect analysers in the literature have performed
one-sided analysis by looking only at one party irrespective of the other party with whom they interact (Gunes & Schuller, 2013). This one-sided approach is unrealistic for auto-
matic analysis of social emotions due to the inherent social aspect and bias that affect
the expressiveness of the emotions in a social context or group setting. Therefore,
the recent interest in analysing and understanding group expressions (e.g., Dhall &
Goecke, 2012) will potentially contribute to the progress in automatic analysis of social
emotions.
Recent developments in social media and social websites have opened up new
avenues for the use of user-driven and user-generated emotional and affective tones such as ‘amused’, ‘touched’, and ‘empathy’ in social interactions. Accordingly, a
number of researchers refer to automatic analysis of social emotions as ‘social affective
analysis’ (e.g., social affective text mining) (Bao et al., 2012). Such works have focused
on automatic prediction of social emotions from text content by attempting to establish
a connection between affective terms and social emotions (Bao et al., 2012).
As technology is widely becoming part of our social lives, analysing and understand-
ing human social emotions and making inferences about human socio-emotional states opens up new avenues in the affective computing field, with various applications, most notably for inducing behavioural change, assisting the decision-making process, and enhancing well-being to enable humans to cope with emotionally charged social situa-
tions (e.g., customized fitness and psychotherapy applications, stress management in
high-stress social settings, tutoring systems, etc.) (Gratch et al., 2006). In this chap-
ter we will provide a brief and non-exhaustive review of social emotion research
focusing on automatic recognition and a summary of representative works introduced in recent years on automatic analysis of social emotions from visual and audio
cues.

Conceptualization and Categorization of Social Emotions

Social emotions are defined as emotions that serve interpersonal or inter-group func-
tions mostly by affecting others’ reactions (Parkinson, Fischer, & Manstead, 2005). The
most common causes of emotions are social events; moreover, emotions are frequently communicated to other people, and social processes generally shape and are shaped by emotions (Hareli & Parkinson, 2008). Although in this sense all emotions are somewhat social, there are also researchers who distinguish a specific set of emotions, including shame, embarrassment, jealousy, admiration, and so on, as social emotions
(Hareli & Parkinson, 2008). What distinguishes this subset from the broader category
of emotions? This research question has been posed and simultaneously answered in
Hareli and Parkinson (2008):

shame, embarrassment, and jealousy are social emotions because they necessarily depend on
other people’s thoughts, feelings or actions, as experienced, recalled, anticipated or imagined at
first hand, or instantiated in more generalized consideration of social norms or conventions.

This definition points to the fact that emotions are closely linked with appraising sit-
uations relevant to specific concerns such as goals, projects, or orientations that a per-
son cares about. Some of these concerns are explicitly social because they are directly
associated with the demands of social life (affiliation, social status, etc.). Hence, social
emotions are based on exclusively social concerns and are associated with appraisals
that are social by nature because they evolved to cope with social problems (Barrett &
Campos, 1987).
In the relevant psychology literature there are various schemes introduced for cate-
gorizing social emotions. The most prevalent approach is to categorize social emotions
based on their associated appraisals. Social appraisals are grouped into two categories:
i) involvement of the self or the other, and ii) application of social or moral standards
(Hareli & Parkinson, 2008). Self-conscious emotions such as shame, guilt, pride, and
embarrassment arise when individuals become aware that a certain situation creates a
negative effect on their welfare. Moral emotions are defined as emotions that are linked
to the interest or welfare of society and include subcategories such as shame, guilt,
regret, embarrassment, contempt, anger, disgust, gratitude, envy, jealousy, schaden-
freude, admiration, sympathy, and empathy. Using these two social appraisal categories,
there appears to be consensus that admiration, anger (rage), contempt, envy, gratitude,
gloating, and jealousy are associated with social appraisals. Guilt, love, shame, pity
(compassion), and pride are also highly associated with social appraisals. On the other
hand, there is only an intermediate level of agreement for the categories of surprise, hate
(dislike), and sadness (sorrow) to be categorised as social emotions. Finally, there is
fairly general consensus that disappointment, disgust, frustration, happiness (joy), fear,
and hope have low relations to social appraisals (Hareli & Parkinson, 2008). For com-
prehensive studies on the conceptualization and the categorization of social emotions
the reader is referred to Barrett and Campos (1987) and Hareli and Parkinson (2008).
To conclude this section, it is important to note that social emotions have been found
to be determinants of social behaviour. For instance, pity determines giving, shame is
associated with the desire to disappear from others’ view, and love is associated with the
desire to approach the object of love (Hareli & Parkinson, 2008). Similarly, nonsocial
emotions appear to be determinants of nonsocial behaviour, e.g., running away is an
action tendency associated with fear.

Automatic Analysis

To ease the complexity and provide a framework for comparison, the problem of auto-
matic social emotion analysis can be formulated in terms of four categories: 1) analysing
the emotions of an individual in a social context, 2) analysing the emotions of two indi-
viduals involved in a dyadic interaction, 3) analysing the emotions of multiple people
constituting a group, and finally 4) analysing the emotions of co-located groups (i.e.,
multiple groups in the same space). Most of the current literature on automatic analysis
of emotions and/or social emotions deals with the first category. In this section we briefly present the current state of the art, focusing on vision- and audio-based
analysis.

Vision-based Analysis
Visual Cues
Visual cues from the face and body that can potentially be used for analysing social emotions are inherently similar to those used in other emotion recognition methodologies. These cues include facial expressions and facial actions (cf. Figure 16.1), bodily expressions,
and gait patterns. It is widely accepted that facial actions (e.g., pulling the eyebrows up) and facial expressions (e.g., producing a smile) (Pantic & Bartlett, 2007), and to a much lesser extent bodily postures (e.g., a backwards head bend with arms raised forward and upward) and gestures (e.g., a head nod) (Dael, Mortillaro, & Scherer, 2012), form the most widely known and used visual cues for automatic emotion analysis and synthesis.
Detection of emotions from bodily expressions is mainly based on categorical repre-
sentation of emotion (Gunes et al., 2015). The categories happy, sad, and angry appear
to be more distinctive in motion than categories such as pride and disgust. To date,
the bodily cues that have been more extensively considered for emotion recognition
are static postural configurations of head, arms, and legs (Coulson, 2004; Kleinsmith
& Bianchi-Berthouze, 2007), static configurations and temporal segments (Gunes &
Piccardi, 2009), dynamic hand and arm movements (Wallbott, 1998), head movements

1 The authors would like to thank Simon Baron-Cohen and Helen O’Reilly (University of Cambridge, UK)
for the permission to use these images.

Figure 16.1 Representative examples of facial expressions of social emotions: (a, b) pride
displayed by two different actors as well as (c) jealousy and (d) shame shown by the same actor.
The images have been selected based on very high agreement during validation of the emotions
in the ASC-Inclusion Still Images Set acquired in the context of playful education of children
with autism spectrum condition (Schuller, Marchi et al., 2013). The authors would like to thank
Simon Baron-Cohen and Helen O’Reilly (University of Cambridge, UK) for the permission to
use these images.

(e.g., position and rotation) (Cohn et al., 2004), and head gestures (e.g., head nods and
shakes) (Cowie et al., 2010; Gunes & Pantic, 2010). Studies have shown that there is
a relationship between the notion of approach/avoidance via the body movements and
emotional experiences (Chen & Bargh, 1999; Förster & Strack, 1996), e.g., as a feed-
back of positively and negatively valenced emotions (Carver, 2003), postural leaning
forward and backwards in response to affective pictures (Hillman, Rosengren, & Smith,
2004), etc. Emotions of a similar arousal–valence nature appear to have similar expression characteristics during their display. For instance, although sadness and shame are both characterized by slow, low-energy movements, shame differs by a ‘stepping
back movement’ that is not present in sadness (Kleinsmith & Bianchi-Berthouze, 2012).
This could be due to the fact that spontaneous nonverbal expressions associated with
shame and pride appear to be innate and cross-culturally recognized, but the shame dis-
play appears to be inhibited in accordance with cultural norms (Tracy & Matsumoto,
2008). Pride is recognized from features such as an expanded posture and the head tilted back, behaviors similar to the inflated display observed in dominant animals defeating a rival, as well as the facial action of pulling the lip corners up (AU 12) and the arms extended out (Tracy & Matsumoto, 2008). Shame is expressed by a simple head tilt downward, slumped shoulders, and a narrowed chest, behaviours similar to the ‘cringing’ and low-
ered posture associated with submission in a range of animal species. Head inclination
and face touching have also been found to be indicators of ‘self-conscious’ emotions
of shame and embarrassment (Costa et al., 2001). Gait is also a source of dynamic
information by definition and has been exploited for emotion perception and recogni-
tion (Janssen et al., 2008; Karg, Kühnlenz, & Buss, 2010). How people perceive the
expression of emotional states based on the observation of different styles of locomo-
tion has also been investigated in Inderbitzin, Väljamäe, and Calvo (2011) by generating
animation of a virtual character. Overall, combining information from multiple visual
cues (e.g., face and body) appears to be particularly important when recognizing social
emotions such as embarrassment (Costa et al., 2001). However, how temporal structures
of these expressions are perceived and decoded, and how temporal correlations between
different visual cues are processed and integrated, remain open areas of research.

Feature Extraction and Recognition


There exists an extensive literature for face and body feature extraction, tracking, and
gesture recognition from video sequences. The facial feature extraction techniques, used
for categorical and dimensional affect analysis from the visual modality, fall into two categories (Pantic & Bartlett, 2007): feature-based approaches and
appearance-based approaches. In the feature-based approach, specific facial features,
such as the pupils and inner/outer corners of the eyes/mouth are detected and tracked,
distances between these are measured or used, and prior knowledge about the facial
anatomy is utilized. In the appearance-based approach, certain regions are treated as a
whole, and motion and change in texture are measured. Hybrid approaches explore the
combination of these two. The existing approaches for hand or body gesture recognition
and analysis of human motion in general can be classified into three major categories:
model-based (i.e., modeling the body parts or recovering the three-dimensional configuration of articulated body parts), appearance-based (i.e., based on information such as colour/gray-scale images or body silhouettes and edges), and motion-based (i.e., using
directly the motion information without any structural information about the physical
body).
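As a minimal sketch of the appearance-based route, the snippet below computes a histogram-of-oriented-gradients descriptor over a whole (here randomly generated) face crop using scikit-image; HOG is used here as a stand-in for the related PHOG and LPQ descriptors mentioned later in this section, and the 64 x 64 crop size is an arbitrary choice.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

# A stand-in for an aligned, grayscale face crop normalised to 64 x 64 pixels.
face = resize(np.random.rand(120, 100), (64, 64))

# Appearance-based descriptor: the cropped region is treated as a whole and
# described by gradient-orientation histograms rather than by tracked landmarks.
descriptor = hog(face, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), feature_vector=True)
print(descriptor.shape)   # a fixed-length vector usable by any standard classifier
```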
Literature on automatic analysis of social emotions from visual cues is sparse. Relevant works have mostly focused on processing and analysing the images and videos
of individuals. A representative work is that of Meservy et al. (2005), who focused on extracting body cues for detecting truthful (innocent) and deceptive (guilty) behaviour in the context of national security, achieving good recognition accuracy for the two-class problem (i.e., guilty/innocent). Although the psychology literature reports that happiness (joy) is not necessarily categorised as a social emotion, due to its low relation to social appraisals (Hareli & Parkinson, 2008), the most widely analysed emotion in a social context has been happiness, and in particular smiles, with a focus on distinguishing between posed and spontaneous smiles (Valstar, Gunes, & Pantic, 2007) and between polite and friendly
smiles (Hoque, Morency, & Picard, 2012). What makes a smile a display of politeness,
irony, joy, or greeting has been reported to largely depend on the social context in which
it has been displayed. Social context involves the identity of the expresser (i.e., who the expresser is), location (e.g., whether the expresser is in the office or on the street), task
(e.g., whether the expresser is working), and the identity of the receiver (Zeng et al.,
2009). Due to the use of contextual information, analysis of smiles can be considered
as a pioneering step toward automatic analysis of social emotions. Other representative
examples of automatic contextual smile analysis research include Hernandez and Hoque
(2011) and Dhall and Goecke (2012). Hernandez and Hoque (2011) describe research
at MIT that used cameras at multiple locations on campus to predict the mood of people
looking into the camera and compute an overall mood map for the campus. The analysis
was based on detecting the smiles of individuals passing by the field of view of the cam-
eras. This work can be seen as a pioneering attempt for extending automatic analysis of
happiness and smiles of individuals to detecting the overall mood of a group of people.

Social context has recently been recognized as an important factor for automatic vision-
based analysis of people, their faces, identities, social relationships etc. (e.g., Gallagher
& Chen, 2009). Accordingly, there is a recent interest in analysing emotions of a group
of people assuming that the expressions displayed by their faces and bodies in images
and videos are not independent of each other. The work of Dhall and Goecke (2012)
extends automatic analysis of happiness and smiles to detecting the happiness expres-
sion of a group of people in a video based on facial expression intensity estimation using Gaussian process regression. This is further used for a weighted summation of happi-
ness intensity of multiple subjects in a video frame based on social context. The face is
processed using the Constrained Local Model of Saragih and Goecke (2009) by fitting a
parametrised shape model to the landmark points of the face. These landmark
points are used to crop and align the faces. For computing a descriptor of the face input,
Pyramid of Histograms of Oriented Gradients (PHOG) and Local Phase Quantization (LPQ) tech-
niques are utilized. These techniques are currently widely used for automatic emotion
recognition from face and body (e.g., Dhall et al., 2011). The parameter of social con-
text is modelled using the information about where each individual person is located
in a given scene (people standing close to the camera will have relatively larger faces).
This information is used for applying weights to the expression intensities of subjects
based on the size of their face in the image (Dhall & Goecke, 2012).
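A minimal sketch of this face-size weighting follows; it is not Dhall and Goecke's model, merely an illustration of how per-face happiness intensities (assumed to come from some regressor) can be combined into a group-level score using relative face size as a proximity proxy.

```python
import numpy as np

def group_happiness(face_boxes, happiness_intensities):
    """Weighted group-level happiness for one video frame.

    face_boxes            : list of (x, y, w, h) detected face rectangles
    happiness_intensities : per-face happiness intensity in [0, 1]
    Larger faces (people closer to the camera) receive larger weights.
    """
    areas = np.array([w * h for (_, _, w, h) in face_boxes], dtype=float)
    weights = areas / areas.sum()               # normalise weights to sum to one
    return float(np.dot(weights, happiness_intensities))

# Three hypothetical faces: the closest (largest) person is only mildly happy.
boxes = [(50, 40, 120, 120), (300, 60, 60, 60), (420, 80, 40, 40)]
print(group_happiness(boxes, [0.3, 0.9, 0.7]))  # dominated by the large face
```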
The new trend in automatic analysis of visual cues is the use of cameras and sensors based on depth information (e.g., Microsoft Kinect) (Han et al., 2013). Such sensors provide quick solutions to
problems pertaining to common vision-based analysis approaches (e.g., segmentation
of the human body). However, there are range- and calibration-related issues that
need to be solved prior to using them for a wider range of applications (e.g., analysis of
face and facial features).

Voice-based Analysis
The body of literature on recognition of emotion from speech has become rich since the first
attempts in this direction emerged fifteen years ago. A recent comprehensive overview
on the state of the art in this field is provided in Schuller et al. (2011). To date, there is
virtually no work in the literature dealing with the recognition of social emotions from
spoken language. One of the few exceptions is found in Marchi et al. (2012). There, the
authors compare recognition rates of nine emotions including ‘proud’ and ‘ashamed’
enacted either by children with autism spectrum condition or a control group. From
the results, it appears that the approaches used for basic emotion recognition can also be
applied to social emotion recognition, reaching similar accuracies. The GEMEP corpus,
recently featured in the Interspeech Computational Paralinguistics Challenge series,
contains social emotions of admiration, pride, shame, and tenderness alongside a dozen
other emotional states – all enacted by professionals (Schuller, Steidl et al., 2013).
Both the challenge and its participants targeted a recognition of all available emotional
categories, rather than having a specific focus on social emotions. No explicit differ-
entiation in processing was made between the social emotions and the basic emotions.
A huge number of emotion categories was considered in the Mind Reading database,
which contains affective states such as ‘impressed’ or other classes such as ‘opposed’.
Automatic recognition mostly focuses on a large set of emotions or ‘cover classes’ and
does not focus on social emotions in particular (e.g., Sobol-Shikler, 2007; Pfister, 2009).
Abelin and Allwood (2000) observed acoustic similarities between shyness and fear and
sadness. Yet, all these studies are based on (en-)acted social emotions, which makes the
urgent need for naturalistic and spontaneous data clear. Finally, it seems noteworthy that
in the analysis of written language, social emotions are also of interest (Bao et al., 2012),
and were targeted, for example, by Neviarouskaya, Prendinger, and Ishizuka (2007), where guilt
was recognised among other emotions in text messaging. Below we focus on spoken
language, and in particular on the acoustic manifestations of social emotions.

Vocal Cues
From the above, one can assume that social emotions are also manifested mainly by
prosodic (i.e., tempo or duration, intonation, and intensity), voice quality, and spectral
and cepstral descriptors. As social emotions per se comprise a group of different states
on the arousal, valence, and dominance continuum, their manifestation can be expected
to differ considerably. For example, Pittam and Scherer (1993) report that both shame
and guilt show increases in pitch mean and contour, high-frequency energy, formant precision, and first formant position, but decreases in second formant position and first formant bandwidth. As another example, Abelin and Allwood (2000)
report medium duration and low to medium intensity and pitch variation for shyness.
Apart from such acoustic correlates, non-verbal ‘social’ signals may be exploited for the
recognition of social emotions, such as laughter or sighs (Schuller, Steidl et al., 2013).
However, further studies in this direction are needed.

Feature Extraction and Recognition


In the works targeting automatic recognition of social emotions, usually low-level
descriptors (LLDs) are extracted at fixed-length intervals such as every 10 ms with a
window of around 20–30 ms. These are also known as frame-level features. From these,
higher level statistics are derived over a longer segment of interest such as words or
whole phrases. These include mean, standard deviation, extrema, or the more com-
plex ‘functionals’ such as Discrete Cosine Transform coefficients or regression error. A
standardised feature set (over six-thousand features), used in the abovementioned chal-
lenges (Schuller, Steidl et al., 2013), can be extracted using an open-source extractor
(Eyben et al., 2013) from arbitrary data. Besides the question of ‘which are the opti-
mal features’, another research question is the segment length used as the unit of analysis.
Not all the work in the more general field of recognition of emotion from speech is
based on ‘supra-segmental’ feature information. Some authors prefer the fixed-length
short frame-level information as input to the machine learning algorithms (Schuller


et al., 2011). The suitability certainly depends also on the type of feature. For exam-
ple, prosodic features usually benefit more from a larger time window than spectral or
cepstral features do (Schuller & Rigoll, 2009). Thus, a combination of different units, poten-
tially handled by different classification algorithms, might be ideal. When it comes to
which machine learning algorithm to use, there is hardly any agreement in the field.
Still, it is worth mentioning some of the preferred methods. For functional type fea-
tures, these include support vector machines, neural networks, and different variants of
decision trees, such as Random Forests. For LLD-type classification, hidden Markov
models and Gaussian mixture models prevail. Such preferences are certainly influenced
by the available tools and libraries in the ‘more traditional’ field of speaker recognition.
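A compact sketch of the LLD-to-functionals-to-classifier pipeline described above is given below, using NumPy and scikit-learn. The two frame-level descriptors (log energy and zero-crossing rate), the four functionals, the random "utterances", and the two hypothetical class labels ('ashamed' vs. 'proud') are all illustrative simplifications; a real system would use a far richer descriptor set such as the openSMILE features mentioned in the text.

```python
import numpy as np
from sklearn.svm import SVC

def frame_llds(signal, frame_len=320, hop=160):
    """Frame-level LLDs for 16 kHz audio: log energy and zero-crossing rate,
    computed on 20 ms frames every 10 ms."""
    feats = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        feats.append([log_energy, zcr])
    return np.array(feats)

def functionals(llds):
    """Map the variable-length LLD sequence to a fixed-length supra-segmental
    vector (mean, standard deviation, minimum, maximum of each LLD)."""
    return np.concatenate([llds.mean(0), llds.std(0), llds.min(0), llds.max(0)])

# Toy corpus: random one-second signals standing in for utterances of two
# hypothetical classes (0 = 'ashamed', 1 = 'proud'), the latter being louder.
rng = np.random.default_rng(0)
X = np.vstack([functionals(frame_llds(rng.standard_normal(16000) * (1 + lab)))
               for lab in (0, 1) for _ in range(10)])
y = np.array([0] * 10 + [1] * 10)

clf = SVC(kernel="linear").fit(X, y)
print(clf.score(X, y))                           # training accuracy on the toy data
```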

Discussion and Conclusion

In this chapter we provided a brief and non-exhaustive review of social emotion research
and a summary of representative works introduced in recent years on automatic analysis
of social emotions from visual and audio cues. Our review indicates that, despite its
relevance in many application scenarios, social emotions’ automatic recognition has not
received the attention it deserves compared to basic emotions. This is likely to the lack
of availability of social interaction data and, specifically, naturalistic data. To date, the
scarcely available data in this field has been of (en-)acted non-dyadic nature.
A major challenge in capturing and acquiring displays of social emotions, such as
jealousy and guilt, is the privacy concerns and ethical issues inherent in the situational
context and expression of these emotional states. Hence, unblocking the typical bottle-
neck of labelled data is rather urgent for automatic analysis of social emotional states.
Although labeled data is scarce, the automatic analysis methods needed may be readily
borrowed from the well-established field of automatic assessment of basic and dimen-
sional emotions. Needless to say, the main difference in the automatic assessment of
social emotions is the breadth and depth of contextual information that can potentially
be exploited. The typical contextual information of ‘who, what, where, how, and why’
now needs to extend from the individual toward including questions and answers about
the other senders and receivers of emotional information involved in the social interac-
tion and situation. Additionally, social signals and their contextual contribution toward
analysis and understanding of social emotions may need to be explored further using the
available social signal recognizers for attractiveness (e.g., Kalayci, Ekenel, & Gunes,
2014), personality (e.g., Joshi, Gunes, & Goecke, 2014; Celiktutan & Gunes, 2014),
and emotions (e.g., Gunes & Schuller, 2013; Schuller, Steidl et al., 2013; Eyben et al.,
2011) as a stepping stone.
Considering the availability of such automatic analysers, coupled with the ever-present
need for labelled data, semi-automatic annotation obtained by using semi-supervised or
active learning approaches – potentially in combination with crowd-sourcing – may be
a promising avenue to pursue for obtaining sufficient data in a reasonably short time.
With the new and upcoming multidisciplinary projects, such as ‘Being There: Humans
and Robots in Public Spaces’ (Bremner et al., 2013) and ‘Integrated Internet-based
Environment for Social Inclusion of Children with Autism Spectrum Conditions’
(Schuller, Marchi et al., 2013), both focusing on social emotion analysis and understanding
and both involving the authors as investigators, it remains to be seen to what
extent existing data acquisition protocols can be utilised, how the real-time processing
aspects will differ, and how and to what extent contextual information can be modeled
and used.

Acknowledgments

The work of Hatice Gunes is supported by the EPSRC under its IDEAS Factory Sandpits
call on Digital Personhood (Grant ref: EP/L00416X/1).

References

Abelin, A. & Allwood, J. (2000). Cross linguistic interpretation of emotional prosody. In Proceed-
ings of ISCA Workshop on Speech and Emotion, Belfast, UK.
Bao, S., Xu, S., Zhang, L., et al. (2012). Mining social emotions from affective text. IEEE Trans-
actions on Knowledge and Data Engineering, 24(9), 1658–1670.
Barrett, K. C. & Campos, J. J. (1987). Perspectives on emotional development II: A functional-
ist approach to emotion. In J. D. Osofsky (Ed.), Handbook of Infant Development (2nd edn,
pp. 555–578). New York: Wiley.
Bremner, P., Trigoni, N., Brown, I., et al. (2013). Being there: Humans and robots in public spaces.
In Proceedings of International Conference on Social Robotics, Bristol.
Carver, C. S. (2003). Pleasure as a sign you can attend to something else: Placing positive feelings
within a general model of affect. Cognition and Emotion, 17, 241–261.
Celiktutan, O. & Gunes, H. (2014). Continuous prediction of perceived traits and social dimen-
sions in space and time. In Proceedings of IEEE International Conference on Image Processing
(ICIP), Paris.
Chen, M. & Bargh, J. A. (1999). Consequences of automatic evaluation: Immediate behavioral
predispositions to approach or avoid the stimulus. Personality and Social Psychology Bulletin,
25, 215–224.
Cohn, J. F., Reed, L. I., Moriyama, T., et al. (2004). Multimodal coordination of facial action, head
rotation, and eye motion during spontaneous smiles. In Proceedings of the IEEE International
Conference on Automatic Face and Gesture Recognition (pp. 129–135), Seoul.
Costa, M., Dinsbach, W., Manstead, A. S. R., & Bitti, P. E. R. (2001). Social presence, embarrass-
ment, and nonverbal behavior. Journal of Nonverbal Behavior, 25(4), 225–240.
Coulson, M. (2004). Attributing emotion to static body postures: Recognition accuracy, confu-
sions, and viewpoint dependence. Journal of Nonverbal Behavior, 28(2), 117–139.
Cowie, R., Gunes, H., McKeown, G., et al. (2010). The emotional and communicative signifi-
cance of head nods and shakes in a naturalistic database. In Proceedings of LREC International
Workshop on Emotion (pp. 42–46), Valletta Malta.
Dael, N., Mortillaro, M., & Scherer, K. R. (2012). The body action and posture coding system
(BAP): Development and reliability. Journal of Nonverbal Behavior, 36(2), 97–121.
Dhall, A., Asthana, A., Goecke, R., & Gedeon, T. (2011). Emotion recognition using PHOG and
LPQ features. In Proceedings of the Workshop on Facial Expression Recognition and Analysis
Challenge (FERA) at IEEE International Conference on Automatic Face and Gesture Recog-
nition (pp. 878–883), Santa Barbara, CA.
Dhall, A. & Goecke, R. (2012). Group expression intensity estimation in videos via Gaussian pro-
cesses. In Proceedings of International Conference on Pattern Recognition (pp. 3525–3528),
Tsukuba, Japan.
Eyben, F., Weninger, F., Groß, F., & Schuller, B. (2013). Recent developments in openSMILE,
the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM Inter-
national Conference on Multimedia, MM 2013. Barcelona, Spain.
Eyben, F., Wöllmer, M., Valstar, M., et al. (2011). String-based audiovisual fusion of behavioural
events for the assessment of dimensional affect. Proceedings of the IEEE International Con-
ference on Automatic Face & Gesture Recognition (pp. 322–329), Santa Barbara, CA.
Förster, J. & Strack, F. (1996). Influence of overt head movements on memory for valenced words:
A case of conceptual–motor compatibility. Journal of Personality and Social Psychology, 71,
421–430.
Gallagher, A. & Chen, T. (2009). Understanding images of groups of people. In Proceedings of
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
(pp. 256–263), Miami.
Gratch, J., Mao, W., & Marsella, S. (2006). Modeling social emotions and social attributions.
In R. Sun (Ed.), Cognitive Modeling and Multi-agent Interactions (pp. 219–251). Cambridge:
Cambridge University Press.
Gunes, H. & Pantic, M. (2010). Dimensional emotion prediction from spontaneous head gestures
for interaction with sensitive artificial listeners. In Proceedings of International Conference on
Intelligent Virtual Agents (pp. 371–377), Philadelphia, PA.
Gunes, H. & Piccardi, M. (2009). Automatic temporal segment detection and affect recognition
from face and body display. IEEE Transactions on Systems, Man, and Cybernetics, Part B,
39(1), 64–84.
Gunes, H. & Schuller, B. (2013). Categorical and dimensional affect analysis in continuous input:
current trends and future directions. Image & Vision Computing, 31(2), 120–136.
Gunes, H., Shan, C., Chen, S., & Tian, Y. (2015) Bodily expression for automatic affect recogni-
tion. In A. Konar & A. Chakraborty (Eds), Emotion Recognition: A Pattern Analysis Approach
(pp. 343–378). Hoboken, NJ: John Wiley & Sons.
Han, J., Shao, L., Xu, D., & Shotton, J. (2013). Enhanced computer vision with Microsoft Kinect
Sensor: A review. IEEE Transactions on Cybernetics, 43, 1318–1334.
Hareli, S. & Parkinson, B. (2008). What is social about social emotions? Journal for the Theory
of Social Behaviour, 38(2), 131–156.
Hernandez, J. & Hoque, E. (2011). MIT Mood Meter. moodmeter.media.mit.edu.
Hillman, C. H., Rosengren, K. S., & Smith, D. P. (2004). Emotion and motivated behavior: pos-
tural adjustments to affective picture viewing. Biological Psychology, 66, 51–62.
Hoque, M., Morency, L.-P., & Picard, R. W. (2012). Are you friendly or just polite? Analysis of
smiles in spontaneous face-to-face interactions. In S. D’Mello, A. Graesser, B. Schuller, & B.
Martin (Eds.), Affective Computing and Intelligent Interaction (vol. 6974, pp. 135–144). New
York: Springer.
Inderbitzin, M., Väljamäe, A., & Calvo, J. M. B. (2011). Expression of emotional states during
locomotion based on canonical parameters. Proceedings of IEEE International Conference on
Automatic Face and Gesture Recognition (pp. 809–814), Santa Barbara, CA.
Janssen, D., Schöllhorn, W. I., Lubienetzki, J., et al. (2008). Recognition of emotions in gait
patterns by means of artificial neural nets. Journal of Nonverbal Behavior, 32, 79–92.
Joshi, J., Gunes, H., & Goecke, R. (2014). Automatic prediction of perceived traits using visual
cues under varied situational context. In Proceedings of 22nd International Conference on
Pattern Recognition (ICPR), Stockholm.
Kalayci, S., Ekenel, H. K., & Gunes, H. (2014). Automatic analysis of facial attractiveness from
video. In Proceedings of IEEE International Conference on Image Processing (ICIP), Paris.
Karg, M., Kühnlenz, K., & Buss, M. (2010). Recognition of Affect Based on Gait Patterns. IEEE
Trans. on Systems, Man and Cybernetics Part B, 40, 1050–1061.
Keltner, D. & Buswell, B. N. (1997). Embarrassment: Its distinct form and appeasement functions.
Psychological Bulletin, 122, 250–270.
Kleinsmith, A. & Bianchi-Berthouze, N. (2007). Recognizing affective dimensions from body
posture. In Proceedings of the International Conference on Affective Computing and Intelligent
Interaction (pp. 48–58), Lisbon.
Kleinsmith, A. & Bianchi-Berthouze, N. (2012). Affective body expression perception and recog-
nition: A survey. IEEE Transactions on Affective Computing, 4(1), 15–33.
Marchi, E., Schuller, B., Batliner, A., et al. (2012). Emotion in the speech of children with autism
spectrum conditions: Prosody and everything else. In Proceedings of the 3rd Workshop on
Child, Computer and Interaction (WOCCI 2012). Portland, OR.
Meservy, T. O., Jensen, M. L., Kruse, J., et al. (2005). Deception detection through automatic,
unobtrusive analysis of nonverbal behavior. IEEE Intelligent Systems, 20(5), 36–43.
Neviarouskaya, A., Prendinger, H., & Ishizuka, M. (2007). Textual affect sensing for sociable and
expressive online communication. In A. Paiva, R. Prada, & R. Picard (Eds), Affective Comput-
ing and Intelligent Interaction (vol. 4738, pp. 220–231). New York: Springer.
Pantic, M. & Bartlett, M. S. (2007). Machine analysis of facial expressions. In K. Delac, & M.
Grgic (Eds), Face Recognition (pp. 377–416). Vienna: I-Tech Education and Publishing.
Parkinson, B., Fischer, A. H., & Manstead, A. S. R. (2005). Emotion in Social Relations: Cultural,
Group, and Interpersonal Processes. New York: Psychology Press.
Pfister, T. (2009). Emotion detection from speech. PhD thesis, Cambridge University.
Pittam, J. & Scherer, K. (1993). Vocal expression and communication of emotion. In M. Lewis &
J. M. Haviland-Jones (Eds) Handbook of Emotions (pp. 185–197). New York: Guilford Press.
Saragih, J. & Goecke, R. (2009). Learning AAM fitting through simulation. Pattern Recognition,
42(November), 2628–2636.
Schuller, B., Batliner, A., Steidl, S., & Seppi, D. (2011). Recognising realistic emotions and affect
in speech: State of the art and lessons learnt from the first challenge. Speech Communication,
Special Issue on Sensing Emotion and Affect – Facing Realism in Speech Processing, 53(9/10),
1062–1087.
Schuller, B., Marchi, E., Baron-Cohen, S., et al. (2013). ASC-inclusion: Interactive emotion
games for social inclusion of children with autism spectrum conditions. In Proceedings of
the 1st International Workshop on Intelligent Digital Games for Empowerment and Inclusion,
Chania, Crete.
Schuller, B. & Rigoll, G. (2009). Recognising interest in conversational speech – comparing bag
of frames and supra-segmental features. In Proceedings of InterSpeech 2009, 10th Annual
Conference of the International Speech Communication Association (pp. 1999–2002), Brighton,
UK.
Schuller, B., Steidl, S., Batliner, A., et al. (2013). The InterSpeech 2013 computational paralin-
guistics challenge: Social signals, conflict, emotion, autism. In Proceedings InterSpeech 2013,
14th Annual Conference of the International Speech Communication Association (pp. 148–
152). Lyon, France.
Shaver, K. G. (1985). The Attribution of Blame: Causality, Responsibility, and Blameworthiness.
New York: Springer.
Sobol-Shikler, T. (2007). Analysis of affective expression in speech. PhD thesis, Cambridge
University.
Tracy, J. L. & Matsumoto, D. (2008). The spontaneous expression of pride and shame: Evidence
for biologically innate nonverbal displays. Proceedings of the National Academy of Sciences of
the United States of America, 105(33), 11655–11660.
Valstar, M. F., Gunes, H., & Pantic, M. (2007). How to distinguish posed from spontaneous smiles
using geometric features. In Proceedings of the ACM International Conference on Multimodal
Interfaces (pp. 38–45), Nagoya, Japan.
Wallbott, H. G. (1998). Bodily expression of emotion. European Journal of Social Psychology,
28, 879–896.
Zeng, Z., Pantic, M., Roisman, G. I., & Huang, T. S. (2009). A survey of affect recognition
methods: audio, visual, and spontaneous expressions. IEEE Transaction on Pattern Analysis
and Machine Intelligence, 31, 39–58.
17 Social Signal Processing for
Automatic Role Recognition
Alessandro Vinciarelli

Introduction

According to the Oxford Dictionary of Sociology, “Role is a key concept in sociological
theory. It highlights the social expectations attached to particular social positions and
analyses the workings of such expectations” (Scott & Marshall, 2005). Furthermore,
“Role theory concerns one of the most important features of social life, characteristic
behaviour patterns or roles” (Biddle, 1986). Besides stating that the notion of role is
crucial in sociological inquiry, the definitions introduce the two main elements of role
theory, namely expectations and characteristic behaviour patterns. In particular, the
definitions suggest that the expectations of others – typically associated with the position
someone holds in a given social context – shape roles in terms of stable and recognizable
behavioural patterns.
Social signal processing (SSP) relies on a similar key idea: that social and psy-
chological phenomena leave physical, machine-detectable traces in terms of both ver-
bal (e.g., lexical choices) and nonverbal (prosody, postures, facial expressions, etc.)
behavioural cues (Vinciarelli, Pantic, & Bourlard, 2009; Vinciarelli et al., 2012). In
particular, most SSP works aim at automatically inferring phenomena like conflict, per-
sonality, mimicry, effectiveness of delivery, etc. from verbal and nonverbal behaviour.
Hence, given the tight relationship between roles and behavioural patterns, SSP method-
ologies appear to be particularly suitable to map observable behaviour into roles, i.e. to
perform automatic role recognition (ARR). Not surprisingly, ARR was one of the ear-
liest problems addressed in the SSP community and the proposed approaches typically
include three main steps, namely person detection (segmentation of raw data streams
into segments corresponding to a given individual), behavioural cues extraction (detec-
tion and representation of relevant behavioural cues), and role recognition (mapping of
detected cues into roles). Most of the works presented in the literature propose experi-
ments over two main types of data, i.e. meeting recordings and broadcast material. The
probable reason is that these contexts are naturalistic, but sufficiently constrained to
allow effective automatic analysis.
The rest of this chapter is organized as follows: role recognition technology, which
introduces the main technological components of an ARR system; previous work, which
surveys the most important ARR approaches proposed in the literature; open issues,
which outlines the main open issues and challenges of the field; and the last section,
which draws some conclusions.
Figure 17.1 General scheme of a role recognition approach. Data portraying multiparty
interactions is first segmented into intervals displaying only one person (person detection). The
data corresponding to each individual is then used to detect behavioral patterns (behavioral cues
extraction) and these are then mapped into roles (role recognition).

Role Recognition Technology

Figure 17.1 shows the main stages of a generic role recognition approach. After the
acquisition of the data – typically performed with sensors like microphones, cam-
eras, smartphones, or wearable devices – the first problem is to isolate the data seg-
ments corresponding to a given individual, a task often called “person detection”. Such
a step is required in most SSP approaches dealing with multiparty data (Vinciarelli
et al., 2009, 2012), but it is particularly useful in the case of ARR because roles cor-
respond to individual behavioural patterns and, therefore, it is necessary to assign the
right behavioural cues to the right person. Technologies applied at this step include, for
example, speaker diarization (Tranter & Reynolds, 2006), i.e. the segmentation of audio
into single speaker intervals, face detection (Yang, Kriegman, & Ahuja, 2002), i.e. the
localization of faces in images, tracking (Forsyth et al., 2006), i.e. the detection of peo-
ple across consecutive frames of a video, etc. While remaining a challenging research
problem, person detection has been extensively investigated and available methodolo-
gies are often effective and robust to naturalistic settings.
The second step of the process is the detection of behavioural patterns with tech-
nologies like facial expression recognition, prosody, and voice quality analysis, gesture
and posture recognition, etc. (extensive surveys of the approaches proposed for these
tasks are available in other chapters of this book). The current state of the art shows that
most approaches rely on lexical choices and/or nonverbal behaviour in speech (see the
section on previous work). However, a few works propose the use of fidgeting-related
measurements, a proxy for motor activation (Pianesi et al., 2008; Zancanaro, Lepri, &
Pianesi, 2006). The last step of the process is the inference of roles from the nonverbal
behavioural cues detected at the previous stages. In general, cues are represented as vec-
tors of measurements that can be mapped into roles with machine learning and pattern
recognition approaches. Two main techniques have been proposed in the literature for
this task:

- To represent each individual i involved in an interaction with a feature vector y_i –
expected to account for the individual’s behaviour – and then to apply classification
techniques that map y_i into one of the roles belonging to R = {r_1, ..., r_N}, i.e. the set
of predefined roles relevant to the scenario under examination.
- To segment the data under analysis into meaningful units involving only one person
(e.g., conversation turns, i.e. time intervals during which only one person talks), to
extract a feature vector x_i from each unit, and then to map the resulting sequence of
vectors X = (x_1, ..., x_T) into a sequence of roles R = (r_1, ..., r_T) using statistical
sequential models.

The first approach can be applied only when a given person plays the same role during
the entire event being analyzed (e.g., the chairman in a meeting or the anchorman in
a television show). The second approach can be applied to cases where a person can
change role as interactions evolve (e.g., a team member who acts as a leader in certain
moments and as a follower in others).
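A minimal sketch of the first technique is given below, assuming that a per-person behavioural feature vector (speaking time, number of turns, mean turn length, and interruptions, all hypothetical choices) has already been produced by the earlier stages; a support vector machine then maps each vector into one of the predefined roles.

```python
# Minimal sketch of the first technique: one behavioural feature vector per
# person (speaking time in seconds, number of turns, mean turn length,
# interruptions - all hypothetical choices), mapped into a predefined role
# set with a standard classifier.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical training data: one row per person appearing in the recordings.
y_vectors = np.array([
    [620.0, 41, 15.1, 9],    # anchorman-like behaviour
    [180.0, 12, 15.0, 1],    # guest-like behaviour
    [ 95.0,  3, 31.6, 0],    # weather-man-like behaviour
])
roles = ["anchorman", "guest", "weather man"]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(y_vectors, roles)

# Each new person is assigned one role for the whole recording.
print(clf.predict(np.array([[540.0, 38, 14.2, 7]])))
```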

Previous Work

The expectations associated with roles are of different types (Biddle, 1986): the norms cor-
respond to explicit prescriptions about the behaviour to display when playing a role (e.g.,
call centre operators are expected to be polite with the customers). The beliefs corre-
spond to subjective choices on how a role should be performed (e.g., teachers believing
that hostility is counterproductive will be more friendly and approachable with their stu-
dents). The preferences correspond to spontaneous choices based on personality traits
or attitudes (e.g., extrovert workers will tend to collaborate more with their colleagues).
Role recognition approaches presented in the literature (Vinciarelli et al., 2009, 2012;
Gatica-Perez, 2009) can be grouped according to the types of roles addressed in the
experiments. The most common cases of roles driven by norms are functions to be
performed in a particular scenario (e.g., the chairman in a meeting). When it comes to
the other types of roles, the approaches often target positions in a given social system
(e.g., the manager in a company).

Recognition of Roles Driven by Norms


Two main techniques have been applied for the recognition of these roles: lexical anal-
ysis (Barzilay et al., 2000; Liu, 2006) and social network analysis (Vinciarelli, 2007;
Weng, Chu, & Wu, 2009). The upper part of Table 17.1 contains basic data and approach
descriptions for each work discussed in this section.
The work by Barzilay et al. (2000) describes the recognition of three roles in news
(anchor, journalist, and guest) with the goal of finding the structure of the data. The fea-
tures used as role evidence are the distribution of terms, the speaking time length, and
the introductions at the beginning of people’s interventions. Lexical features are selected
using the BoosTexter categorization approach, and the same algorithm is used to recog-
nize the roles. The ratio of an intervention length to the length of the previous interven-
tion is shown to be a good role predictor. A similar task is addressed by Liu (2006),
who proposes two methods for the recognition of three roles (anchor, reporter, and
other): the first is the application of a hidden Markov model (HMM) where the states
Table 17.1 Synopsis of role recognition approaches. The table reports the details of the main role recognition works
presented in the literature. The time is expressed in hours (h) and minutes (m), the expectations in terms of norms (N),
beliefs (B), and preferences (P).

Reference | Data | Time | Exp. | Evidence
Barzilay et al. (2000) | NIST TREC SDR Corpus (35 recordings, 3 roles) | 17h | N | Term distribution, speaking time
Liu (2006) | TDT4 Mandarin broadcast news (336 shows, 3 roles) | 170h | N | Distribution of bigrams and trigrams
Vinciarelli (2007) | Radio news bulletins (96 recordings, 6 roles) | 25h | N | Turn organization, social networks (centrality, node degree, etc.)
Salamin, Favre, and Vinciarelli (2009) | Radio news (96 recordings, 6 roles), talk shows (27 recordings, 6 roles), meetings (138 recordings, 4 roles) | 90h | NBP | Turn organization, social networks (centrality, node degree, etc.)
Weng et al. (2009) | Movies and TV shows (13 recordings, 2 roles) | 21h | N | Co-occurrence of faces, social networks
Bigot et al. (2010) | EPAC Corpus (broadcast data, 3 roles) | 100h | N | Turn organization, prosody

Banerjee and Rudnicky (2004) | Meetings (2 recordings, 5 roles) | 45m | BP | Turn organization
Zancanaro et al. (2006) | Mission Survival corpus (11 recordings, 5 roles) | 4h 30m | BP | Speaking activity, fidgeting
Pianesi et al. (2008) | Mission Survival corpus (11 recordings, 5 roles) | 4h 30m | BP | Speaking activity, fidgeting
Dong et al. (2007) | Mission Survival corpus (11 recordings, 5 roles) | 4h 30m | BP | Speaking activity, fidgeting
Laskowski et al. (2008) | AMI meeting corpus (138 recordings, 4 roles) | 45h | BP | Speaking activity, talkspurts
Garg et al. (2008) | AMI meeting corpus (138 recordings, 4 roles) | 45h | BP | Speaking activity, term distribution

correspond to the roles and the observations are the distributions of bigrams and tri-
grams of the words at the beginning and end of each intervention. The second method
uses a maximum entropy classifier taking as input the same features as in the first
method. Contextual information (roles of the people talking before and after an indi-
vidual under examination) is shown to improve the performance.
The work by Vinciarelli (2007) addresses the recognition of six different roles in
broadcast news, i.e. anchorman, second anchorman, guest, headline reader, weather
man, and interview participant. The approach automatically extracts a social network
from the data and then uses it to associate interaction features with each person. Further-
more, it models the intervention length associated to each role with Gaussians. Each
individual is then assigned the role corresponding to the highest a-posteriori probability.
The main limitation of this approach is that the number of individuals interacting must
be high enough (more than 8–10 persons) to build meaningful social networks. Fur-
thermore, the dependence among the roles is not modeled and each person is assigned
the most probable role independently of the role of the others. The approach proposed
by Weng et al. (2009) applies social networks to extract the leading roles (hero, hero-
ine) and their respective communities (hero’s friends and colleagues) from movies. The
approach uses the co-occurrence of individuals in the same scene as evidence of the
interaction between people and between roles.

Recognition of Roles Driven by Preferences and Beliefs


Basic information about data and approaches used in the works described in this sec-
tion is shown in the lower part of Table 17.1. The work by Zancanaro et al. (2006)
presents an approach for the recognition of task roles (neutral, orienteer, giver, seeker,
and recorder) and socioemotional roles (neutral, gate-keeper, supporter, protagonist,
and attacker) described in Benne and Sheats (1948). The approach uses sliding win-
dows to span the whole length of the recordings and extracts features accounting for
speech and fidgeting activity of the participants, as well as the number of simultaneous
speakers during each window. A support vector machine maps the features into roles.
The work is further extended by using features corresponding to all meeting participants
to predict the role of each individual participant (Pianesi et al., 2008). The performance
improves, but the approach suffers from the curse of dimensionality and overfitting.
These issues are addressed by Dong et al. (2007) with an influence model that significantly
reduces the number of model parameters.
The approach by Banerjee and Rudnicky (2004) focuses on meeting roles (presenter,
discussion participator, information provider, information consumer, and undefined).
The classifier is a decision tree and the features account for the activity in short win-
dows: number of speaker changes, number of meeting participants that have spoken,
number of overlapping speech segments, etc. The works by Laskowski, Ostendorf, and
Schultz (2008) and Garg et al. (2008) use the AMI meeting corpus (McCowan et al.,
2005) and try to recognize different sets of predefined roles. The features extracted by
Laskowski et al. (2008) are low-level speech activity features, namely the probability of
initiating a talk-spurt in silence, the probability of initiating a talk-spurt when someone
else is speaking, and the probability of initiating a talk-spurt when a participant in a
specific other role is speaking. The work by Garg et al. (2008) combines lexical fea-
tures and interaction features to perform the role recognition task. The lexical features
are extracted from the automatic speech transcriptions and mapped into roles using the
BoosTexter text categorization approach (Schapire & Singer, 2000). The interaction fea-
tures are extracted through affiliation networks and mapped into roles using a Bernoulli
distribution (Bishop, 2006).
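The speech-activity features used by Laskowski et al. (2008) can be approximated from a binary vocal-activity matrix, as in the rough sketch below; the frame rate, the data layout, and the restriction to two probabilities are assumptions made for illustration.

```python
# Sketch of Laskowski-style speech-activity features: for each participant,
# the probability of initiating a talk-spurt in silence and the probability of
# initiating one while someone else is already speaking. Input is an
# (n_participants, n_frames) binary vocal-activity matrix (an assumed layout).
import numpy as np

def talkspurt_features(activity):
    n_part, n_frames = activity.shape
    feats = []
    for p in range(n_part):
        others = np.delete(activity, p, axis=0).any(axis=0)
        # A talk-spurt is initiated at frame t if p is silent at t-1 and talks at t.
        starts = (activity[p, 1:] == 1) & (activity[p, :-1] == 0)
        idx = np.where(starts)[0] + 1
        if len(idx) == 0:
            feats.append([0.0, 0.0])
            continue
        in_silence = np.mean(others[idx - 1] == 0)   # others silent just before
        in_overlap = np.mean(others[idx - 1] == 1)   # someone else speaking
        feats.append([in_silence, in_overlap])
    return np.array(feats)   # one feature row per participant

# Toy example: 3 participants, 10 frames of binary speech activity.
activity = np.array([[0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
                     [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
                     [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]])
print(talkspurt_features(activity))
```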

Open Issues

One of the main limitations of the current state of the art is that the approaches tend
to be specific to a given setting or scenario. This is mainly due to the difficulty of
identifying roles that can be observed in every possible social exchange and account
for general aspects of interaction. The adoption of the task and socioemotional roles
described by Benne and Sheats (1948) is a potential solution, but so far it has been
applied only to acted meetings (Zancanaro et al., 2006; Pianesi et al., 2008) and there
is no evidence that it can work in other settings. Still, it is possible to identify important
application domains where the same set of roles can be used for a wide range of data. For
example, this is the case of broadcast data (news, talk-shows, etc.) that, while being dif-
ferent, tend to follow a limited number of formats and involve a limited set of roles (e.g.,
the anchorman, the guest, etc.). While perhaps of limited value from a sociological
point of view, such roles can be helpful in technological tasks such as indexing of large
archives of broadcast material, browsers for television and radio emission recordings,
role based summarization, etc.
Another major limitation of the state of the art is that the approaches proposed in the
literature deal only with roles that can be defined a priori. In other words, it is neces-
sary to know what the roles to be recognized are in order to develop a role recognition
approach. Given that roles correspond to behavioural patterns, it is probably possible to
overcome such a limitation by applying unsupervised learning techniques (Xu & Wun-
sch, 2005) to features and measurements extracted from interaction recordings. In fact,
this should allow one to identify the patterns and verify whether they can be perceived
as such by observers. On the other hand, such an approach would leave open the prob-
lem of guessing the correct number of roles actually taking place in a given setting and,
furthermore, whether the role set is stable or changes over time.
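A possible starting point for such an unsupervised treatment is sketched below: per-person behavioural feature vectors are clustered with k-means and the number of clusters (i.e. candidate roles) is chosen with a simple internal criterion. The features, the range of candidate role counts, and the use of the silhouette score are illustrative assumptions, and whether the resulting clusters are perceived as roles by observers would still need to be verified.

```python
# Sketch of unsupervised role discovery: cluster per-person behavioural
# feature vectors and pick the number of clusters (candidate roles) with the
# best silhouette score.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical behavioural features for 60 people (speaking time, turns, ...).
X = StandardScaler().fit_transform(rng.normal(size=(60, 4)))

best_k, best_score = None, -1.0
for k in range(2, 8):                      # assumed range of plausible role counts
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print("candidate number of roles:", best_k)
```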
This chapter focuses on automatic role recognition. However, roles can be addressed
under different perspectives that, to the best of our knowledge, have been explored
only to a limited extent. The first is the possibility of using roles to segment multi-
party recordings according to meaningful criteria. A few examples show that roles can
help to segment broadcast data into semantically coherent units depending on the roles
played in a given interval of time (Vinciarelli & Favre, 2007; Vinciarelli, Fernandez, &
Favre, 2007). Furthermore, in the case of meetings, it is possible to split a recording
in terms of topics or agenda items depending on the role of people who talk at a cer-
tain moment (Sapru & Bourlard, 2014). The second possibility is to use roles – and the
behavioural patterns they induce – as a priori information to improve the performance
of other tasks such as, e.g., speaker segmentation (Valente, Vijayasenan, & Motlicek,
2011). Last, but not least, roles can help to make sense of behavioural observations like,
e.g., asymmetries between callers and receivers in phone calls (Vinciarelli, Salamin, &
Polychroniou, 2014; Vinciarelli, Chatziioannou, & Esposito, 2015).

Conclusions

Role recognition is one of the first problems that were addressed in social signal
processing (Vinciarelli et al., 2009, 2012) and this chapter has overviewed the main
works proposed in the field. In particular, this chapter has shown that SSP approaches
are particularly suitable to recognize roles because these are, according to the
sociological literature, behavioural patterns that can be recognized as such by people
involved in interactions. Since most works in SSP actually aim at mapping behavioural
patterns into social and psychological phenomena, ARR appears to fit in the scope of
the domain.
Besides introducing the technological elements involved in the ARR problem and the
main works presented in the literature, the chapter has outlined some of the issues and
challenges that still need to be addressed to ensure further progress. The advancement
of methodologies aimed at the detection of behavioural cues can help to model the
patterns associated with roles in increasing detail. However, the impossibility of
working on roles that are not predefined appears to be the most important limitation
of the current state of the art. In fact, extracting the role set directly from the data at
hand, possibly via unsupervised approaches, would allow one to build approaches
capable of working on any type of interaction data and not only on scenarios where the role
set is known and available a priori.
The state-of-the-art limitations, outlined in the section discussing open issues, implic-
itly define an agenda for future work, but this should take into account potential appli-
cations as well. Given that roles are an integral part of any social interaction, ARR
technologies can enhance any application revolving around human–human and human–
machine interactions: embodied agents (e.g., artificial agents and social robots) can
generate artificial cues that account for the most appropriate role in a given situation,
interfaces can change configuration according to the role they automatically assign to
their users, learning analytics systems can monitor the roles students play in collective
learning processes, etc. Last, but not least, ARR technologies can contribute to better
analysis of the content of common multimedia data, such as movies, shows, news, etc.
Finally, while improving ARR technologies can help to enhance the applications men-
tioned above (and the many others that can benefit from role recognition), the use of
ARR in real-world problems can result in a better understanding of roles from both
sociological and technological points of view.

Acknowledgment

The author was supported by the European Commission via the Social Signal Process-
ing Network (GA 231287).

References

Banerjee, S. & Rudnicky, A. I. (2004). Using simple speech based features to detect the state of a
meeting and the roles of the meeting participants. In Proceedings of International Conference
on Spoken Language Processing (pp. 221–231).
Barzilay, R., Collins, M., Hirschberg, J., & Whittaker, S. (2000). The rules behind the roles:
Identifying speaker roles in radio broadcasts. In Proceedings of the 17th National Conference
on Artificial Intelligence (pp. 679–684).
Benne, K. D. & Sheats, P. (1948). Functional roles of group members. Journal of Social Issues,
3(2), 41–49.
Biddle, B. J. (1986). Recent developments in role theory. Annual Review of Sociology, 12,
67–92.
Bigot, B., Ferrané, I., Pinquier, J., & André-Obrecht, R. (2010). Speaker role recognition to help
spontaneous conversational speech detection. In Proceedings of International Workshop on
Searching Spontaneous Conversational Speech (pp. 5–10).
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
Dong, W., Lepri, B., Cappelletti, A., et al. (2007). Using the influence model to rec-
ognize functional roles in meetings. In Proceedings of the 9th International Conference on
Multimodal Interfaces (pp. 271–278).
Forsyth, D. A., Arikan, O., Ikemoto, L., O’Brien, J., & Ramanan, D. (2006). Computational studies
of human motion part 1: Tracking and motion synthesis. Foundations and Trends in Computer
Graphics and Vision, 1(2), 77–254.
Garg, N., Favre, S., Salamin, H., Hakkani-Tür, D., & Vinciarelli, A. (2008). Role recognition for
meeting participants: An approach based on lexical information and social network analysis.
In Proceedings of the ACM International Conference on Multimedia (pp. 693–696).
Gatica-Perez, D. (2009). Automatic nonverbal analysis of social interaction in small groups: A
review. Image and Vision Computing, 27(12), 1775–1787.
Laskowski, K., Ostendorf, M., & Schultz, T. (2008). Modeling vocal interaction for text-
independent participant characterization in multi-party conversation. In Proceedings of the 9th
ISCA/ACL SIGdial Workshop on Discourse and Dialogue (pp. 148–155), June.
Liu, Y. (2006). Initial study on automatic identification of speaker role in broadcast news
speech. In Proceedings of the Human Language Technology Conference of the NAACL, Com-
panion Volume: Short Papers (pp. 81–84), June.
McCowan, I., Carletta, J., Kraaij, W., et al. (2005). The AMI meeting corpus. In Proceedings of
the 5th International Conference on Methods and Techniques in Behavioral Research (pp. 137–
140), Wageningen, Netherlands.
Pianesi, F., Zancanaro, M., Lepri, B., & Cappelletti, A. (2008). A multimodal annotated cor-
pus of consensus decision making meetings. Language Resources and Evaluation, 41(3–4),
409–429.
Salamin, H., Favre, S., & Vinciarelli, A. (2009). Automatic role recognition in multiparty record-
ings: Using social affiliation networks for feature extraction. IEEE Transactions on Multimedia,
11(7), 1373–1380.
Sapru, A. & Bourlard, H. (2014). Detecting speaker roles and topic changes in multiparty conver-
sations using latent topic models. In Proceedings of InterSpeech (pp. 2882–2886).
Schapire, R. E. & Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization.
Machine Learning, 39(2/3), 135.
Scott, J. & Marshall, G. (Eds) (2005). Dictionary of Sociology. Oxford: Oxford University Press.
Tranter, S. E. & Reynolds, D. A. (2006). An overview of automatic speaker diarization systems.
IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1557–1565.
Valente, F., Vijayasenan, D., & Motlicek, P. (2011). Speaker diarization of meetings based on
speaker role n-gram models. In Proceedings of the IEEE International Conference on Acous-
tics, Speech and Signal Processing (pp. 4416–4419), Prague.
Vinciarelli, A. (2007). Speakers role recognition in multiparty audio recordings using social net-
work analysis and duration distribution modeling. IEEE Transactions on Multimedia, 9(6),
1215–1226.
Vinciarelli, A., Chatziioannou, P., & Esposito, A. (2015). When the words are not everything: The
use of laughter, fillers, back-channel, silence and overlapping speech in phone calls. Frontiers
in ICT, 2.
Vinciarelli, A. & Favre, S. (2007). Broadcast news story segmentation using social network anal-
ysis and hidden Markov models. In Proceedings of the ACM International Conference on Mul-
timedia (pp. 261–264).
Vinciarelli, A., Fernandez, F., & Favre, S. (2007). Semantic segmentation of radio programs using
social network analysis and duration distribution modeling. In Proceedings of the IEEE Inter-
national Conference on Multimedia and Expo (pp. 779–782).
Vinciarelli, A., Pantic, M., & Bourlard, H. (2009). Social signal processing: Survey of an emerg-
ing domain. Image and Vision Computing, 27(12), 1743–1759.
Vinciarelli, A., Pantic, M., Heylen, D., et al. (2012). Bridging the gap between social animal
and unsocial machine: A survey of social signal processing. IEEE Transactions on Affective
Computing, 3(1), 69–87.
Vinciarelli, A., Salamin, H., & Polychroniou, A. (2014). Negotiating over mobile phones: Calling
or being called can make the difference. Cognitive Computation, 6(4), 677–688.
Weng, C. Y., Chu, W. T., & Wu, J. L. (2009). RoleNet: Movie analysis from the perspective of
social networks. IEEE Transactions on Multimedia, 11(2), 256–271.
Xu, R. & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural
Networks, 16(3), 645–678.
Yang, M. H., Kriegman, D., & Ahuja, N. (2002). Detecting faces in images: A survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 24(1), 34–58.
Zancanaro, M., Lepri, B., & Pianesi, F. (2006). Automatic detection of group functional roles in
face to face interactions. In Proceedings of International Conference on Multimodal Interfaces
(pp. 47–54).
18 Machine Learning Methods for Social
Signal Processing
Ognjen Rudovic, Mihalis A. Nicolaou, and Vladimir Pavlovic

Introduction

In this chapter we focus on systematization, analysis, and discussion of recent trends in
machine learning methods for social signal processing (SSP) (Pentland, 2007). Because
social signaling is often of central importance to subconscious decision making that
affects everyday tasks (e.g., decisions about risks and rewards, resource utilization, or
interpersonal relationships), the need for automated understanding of social signals by
computers is a task of paramount importance. Machine learning has played a promi-
nent role in the advancement of SSP over the past decade. This is, in part, due to the
exponential increase of data availability that served as a catalyst for the adoption of a
new data-driven direction in affective computing. With the difficulty of exact modeling
of latent and complex physical processes that underpin social signals, the data has long
emerged as the means to circumvent or supplement expert- or physics-based models,
such as the deformable musculoskeletal models of the human body, face, or hands and
their movement, neuro-dynamical models of cognitive perception, or models of
human vocal production. This trend parallels the role and success of machine learn-
ing in related areas, such as computer vision (c.f., Poppe, 2010; Wright et al., 2010;
Grauman & Leibe, 2011) or audio, speech and language processing (c.f., Deng & Li,
2013), that serve as the core tools for analytic SSP tasks. Rather than emphasize the
exhaustive coverage of the many approaches to data-driven SSP, which can be found
in excellent surveys (Vinciarelli, Pantic, & Bourlard, 2009; Vinciarelli et al., 2012), we
seek to present the methods in the context of current modeling challenges. In particular,
we identify and discuss two major modeling directions:
- Simultaneous modeling of social signals and context, and
- Modeling of annotators and the data annotation process.

Context plays a crucial role in understanding the human behavioral signals that can oth-
erwise be easily misinterpreted. For instance, a smile can be a display of politeness,
contentedness, joy, irony, empathy, or a greeting, depending on the context. Yet, most
SSP methods to date focus on the simpler problem of detecting a smile as a prototypical
and self-contained signal. To identify the smile as a social signal one must simultane-
ously know the location of where the subject is (outside, at a reception, etc.), what his
or her current task is, when the signal was displayed (timing), and who the expresser is
(expresser’s identity, age, and expressiveness). Vinciarelli et al. (2009) identify this as
the W4 quadruplet (where, what, when, who) but quickly point out that comprehensive
human behavior understanding requires the W5+ sextuplet (where, what, when, who,
why, how), where the why and how factors identify both the stimulus that caused the
social signal (e.g., funny video) as well as how the information is passed on (e.g., by
means of facial expression intensity). However, most current SSP methods, including
the data-driven ones, are not able to provide a satisfactory answer to W4, let alone W5+.
Simultaneously answering the W5+ is a key challenge of data-driven SSP.
Another key factor in machine learning-based SSP is the curse of annotations. Unlike
in many traditional machine learning settings, social signals are frequently marked
by multiple annotators, be they experts or novices, with an unknown ground truth.
Because of the often subjective interpretation of social signals, annotations reflect both
the annotators’ bias and the potential temporal lag in marking the time-course of the
signal. Hence, modeling of the annotators themselves and deriving the gold standard, in
addition to modeling the expresser and its signal, is another crucial factor for full and
robust automated social signal understanding. We therefore analyze recent approaches
to the annotation modeling process in this context.
The two modeling challenges are universal across different signal modalities (e.g.,
visual or auditory). In the rest of this chapter we focus on one signal domain, that of
facial signals, that most ubiquitously illustrates the new data-driven modeling direc-
tions. Specifically, we consider the problems of facial expression measurements and
describe the state of the art in machine learning methods as they relate to modeling of
the signal and context and the annotators/annotations.

Facial Expression Analysis

There are two main streams in the current research on automatic analysis of facial
expressions. The first considers holistic facial expressions, such as facial expressions
of six basic emotions (fear, sadness, happiness, anger, disgust, surprise), proposed by
Ekman, Friesen, and Hager (2002), and facial expressions of pain. The second consid-
ers local facial expressions, described with a set of facial muscle actions named action
units (AUs), as defined in the facial action coding system (FACS) (Ekman et al., 2002).
In what follows, we review the existing machine learning approaches for automated
classification, temporal segmentation, and intensity estimation of facial expressions and
relate these approaches to the W5+ context design.

Classification of Facial Expressions


Different methods have been proposed for classification of facial expressions from
image sequences. Depending on how these methods perform classification of facial
expressions, they can be divided into frame-based and sequence-based methods.
The frame-based methods for classification of facial expressions of six basic emo-
tion categories (Ekman et al., 2002) typically employ static classifiers such as
rule-based classifiers (Pantic & Rothkrantz, 2004; Black & Yacoob, 1997), neural
networks (NN) (Padgett & Cottrell, 1996; Tian, 2004), support vector machine (SVM)
(Bartlett et al., 2005; Shan, Gong, & McOwan, 2009), and Bayesian networks (BN)
(Cohen et al., 2003). SVMs and its probabilistic counterpart, relevance vector machine
(RVM), have been used for classification of facial expressions of pain (Lucey et al.,
2011; Gholami, Haddad, & Tannenbaum, 2009). For instance, Lucey et al. (2011)
addressed the problem of pain detection by applying SVMs either directly to the image
features or by applying a two-step approach, where AUs were first detected using SVMs,
the outputs of which were then fused using the logistic regression model. Similarly, for
the static classification of AUs, where the goal is to assign to each AU a binary label
indicating the presence of an AU, the classifiers based on NN (Bazzo & Lamar, 2004;
Fasel & Luettin, 2000), Ensemble Learning techniques, such as AdaBoost (Yang, Liu, &
Metaxas, 2009a) and GentleBoost (Hamm et al., 2011), and SVM (Chew et al., 2012;
Bartlett et al., 2006; Kapoor, Qi, & Picard, 2003), are commonly employed. These static
approaches are deemed context-insensitive as they focus on answering only one context
question, i.e., how. Recently, Chu, De la Torre, and Cohn (2013) proposed a transduc-
tive learning method, named selective transfer machine (STM), where a SVM classifier
for AU detection is personalized by attenuating person-specific biases, thus, simultane-
ously answering the context questions who and how. This is accomplished by learning
the classifier and re-weighing the training samples that are most relevant to the test
subject during inference.
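The two-step strategy of Lucey et al. (2011) can be sketched roughly as follows; the image features, the number of AUs, and the data are placeholders, and the original work of course uses real AU annotations and appearance features rather than random values.

```python
# Rough sketch of a two-step pipeline in the spirit of Lucey et al. (2011):
# per-AU SVM detectors whose continuous outputs are fused by logistic
# regression into a frame-level pain decision. All data are placeholders.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_frames, n_feat = 200, 50
X = rng.normal(size=(n_frames, n_feat))               # per-frame image features
au_labels = rng.integers(0, 2, size=(n_frames, 4))    # binary labels for 4 AUs
pain = rng.integers(0, 2, size=n_frames)              # frame-level pain label

# Step 1: one SVM per AU, producing continuous decision scores.
au_clfs = [LinearSVC().fit(X, au_labels[:, j]) for j in range(au_labels.shape[1])]
au_scores = np.column_stack([clf.decision_function(X) for clf in au_clfs])

# Step 2: logistic regression fuses the AU scores into a pain decision.
fusion = LogisticRegression().fit(au_scores, pain)
print("training accuracy:", fusion.score(au_scores, pain))
```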
The common weakness of the frame-based classification methods is that they ignore
dynamics of target facial expressions or AUs. Although some of the frame-based meth-
ods use the features extracted from several frames in order to encode dynamics of facial
expressions, models for dynamic classification provide a more principled way of doing
so. With a few exceptions, most of the dynamic approaches to classification of facial
expressions are based on the variants of dynamic Bayesian networks (DBN) (e.g., Hid-
den Markov Models (HMM) and Conditional Random Fields (CRF)). For example,
Otsuka and Ohya (1997) and Shang and Chan (2009) trained independent HMMs for
each emotion category and then performed emotion categorization by comparing the
likelihoods of the HMMs. In Otsuka and Ohya (1997), the input features are based on
velocity vectors computed using the optical flow algorithm, while the observation prob-
ability, corresponding to the hidden states in the HMMs, is modeled using mixtures
of Gaussians in order to account better for variation in facial expressions of different
subjects. Likewise, Shang and Chan (2009) used geometric features (i.e. locations of
facial points) and a nonparametric estimate of the observation probability in the HMM
model. While these methods perform the expression classification of the pre-segmented
image sequences, corresponding to the target emotion category, Cohen et al. (2003) pre-
sented a two-level HMM classifier that performs expression classification by segment-
ing sequences of arbitrary length into the segments, corresponding to different emo-
tion categories. This is accomplished by learning first the expression-specific HMMs,
and then the transitions between the expression categories using another HMM, taking
as an input the predictions of the expression-specific HMMs. Simultaneous classifica-
tion of different AUs using HMMs was addressed in Khademi et al. (2010) using a
Hybrid HMM-ANN model. In this model, the temporal development of each AU is first
modeled using AU-specific HMMs. Subsequently, the outputs of different HMMs are
combined in the ANN to account for the AU dependencies.
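The likelihood-comparison scheme used by Otsuka and Ohya (1997) and Shang and Chan (2009) can be sketched with Gaussian HMMs, for instance with the hmmlearn package (an assumption; the original works use their own features and observation models): one HMM is trained per emotion category and a new sequence is assigned to the category whose model scores it highest.

```python
# Sketch of sequence-based classification with one HMM per emotion category:
# label a new sequence with the category whose HMM gives the highest
# log-likelihood. Features (e.g. facial-point coordinates) are placeholders.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

def make_sequences(n_seq, length, n_feat, offset):
    """Placeholder generator of per-frame feature sequences for one emotion."""
    return [rng.normal(loc=offset, size=(length, n_feat)) for _ in range(n_seq)]

train = {"happiness": make_sequences(10, 40, 6, 0.0),
         "anger":     make_sequences(10, 40, 6, 1.0)}

models = {}
for emotion, seqs in train.items():
    X = np.concatenate(seqs)                 # hmmlearn expects stacked frames
    lengths = [len(s) for s in seqs]
    models[emotion] = GaussianHMM(n_components=3, covariance_type="diag",
                                  n_iter=50).fit(X, lengths)

test_seq = rng.normal(loc=1.0, size=(40, 6))     # should look like "anger"
scores = {e: m.score(test_seq) for e, m in models.items()}
print(max(scores, key=scores.get))
```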
Discriminative models based on CRFs have also been proposed (Der Maaten & Hen-
driks, 2012; Jain, Hu, & Aggarwal, 2011; Chang, Liu, & Lai, 2009). In Der Maaten
and Hendriks (2012), the authors trained one linear-chain CRF per AU. The model’s
states are binary variables indicating the AU activations. Jain et al. (2011) proposed
a generalization of the linear-chain CRF model, a hidden conditional random field
(HCRF) (Wang et al., 2006), where an additional layer of hidden variables is used to
model temporal dynamics of facial expressions. The training of the model was per-
formed using image sequences, but classification of the expressions was done by select-
ing the most likely class (i.e. emotion category) at each time instance. The authors
showed that: (i) having the additional layer of hidden variables results in the model
being more discriminative than the standard linear-chain CRF, and (ii) that modeling
of the temporal unfolding of the facial shapes is more important for discrimination
between different facial expressions than their spatial variation (based on comparisons
with SVMs). Another modification of HCRF, named partially-observed HCRF, was pro-
posed in Chang et al. (2009). In this method, the appearance features based on the Gabor
wavelets were extracted from image sequences and linked to the facial expressions of
the target emotion category via hidden variables in the model. The hidden variables rep-
resent subsets of AU combinations, encoded using the binary information about the AU
activations in each image frame. In this way, classification of the emotion categories
(sequence-based), and the AU combinations (frame-based), was accomplished simul-
taneously. This method outperformed the standard HCRF, which does not use a prior
information about the AU combinations. Temporal consistency of AUs was also mod-
eled in Simon et al. (2010) using the structured-output SVM framework for detecting
the starting and ending frames of each AU.
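A rough illustration of the one-chain-per-AU idea is given below using the sklearn-crfsuite package (an assumption; Der Maaten and Hendriks use their own implementation and different features): each frame is described by a feature dictionary and the chain predicts a binary activation state per frame.

```python
# Rough sketch of one linear-chain CRF per AU: each frame of a sequence is
# described by a feature dict and labelled with a binary AU-activation state.
# Assumes the sklearn-crfsuite package; features and data are placeholders.
import numpy as np
import sklearn_crfsuite

rng = np.random.default_rng(0)

def frame_features(value):
    """Placeholder per-frame features (e.g. distances between facial points)."""
    return {"lip_corner_dist": float(value), "bias": 1.0}

# Two toy training sequences with per-frame AU-activation labels.
X_train = [[frame_features(v) for v in rng.normal(size=30)] for _ in range(2)]
y_train = [["active" if f["lip_corner_dist"] > 0 else "inactive" for f in seq]
           for seq in X_train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)

X_test = [[frame_features(v) for v in rng.normal(size=10)]]
print(crf.predict(X_test)[0])      # per-frame AU activation labels
```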
More complex graph structures within the DBN framework have been proposed in
Zhang and Ji (2005) and Tong, Liao, and Ji (2007) for dynamic classification of facial
expressions. In Zhang and Ji (2005), the DBN was constructed from interconnected
time slices of static Bayesian networks, where each static network was used to link the
geometric features (i.e. locations of characteristic facial points) to the target emotion
categories via a set of related AUs. Specifically, the relationships between the neigh-
boring time slices in the DBN were modeled using the first-order HMMs. Tong et al.
(2007) modeled relationships between different AUs using another variant of a DBN.
In this model, the AdaBoost classifiers were first used for independent classification of
AUs to select the AU-specific features. These features were then passed as inputs to the
DBN, used to model temporal unfolding of the AUs as well as their co-occurrences.
Finally, some authors attempted modeling of the facial expression dynamics on the
expression-specific manifold (Hu et al., 2004; Shan, Gong, & McOwan, 2006; Lee &
Elgammal, 2005). For instance, Hu et al. (2004) used a low dimensional Isomap embed-
ding to build a manifold of shape variation across different subjects, and then used the I-
condensation algorithm to simultaneously track and recognize target emotion categories
within a common probabilistic framework. Shan et al. (2006) used a Bayesian temporal
model (with Markov property) for the expression classification on the manifold derived
using a supervised version of the locality preserving projections (LPP) method (He &
Niyogi, 2004). As with the models mentioned above, these models account for the con-
text question how, and implicitly for the context question when, due to their modeling
of the temporal dynamics. Static modeling using the expression manifold can also be
attained using multi-linear decomposable generative models, as done in Lee and Elgam-
mal (2005). The authors used these models to separate the subject identity from the
facial expressions on a manifold, followed by the expression classification. In contrast
to the dynamic manifold-based models mentioned above, this approach accounts only
for the context question how. While it has potential for accounting for the context ques-
tion who, as well as the other context questions due to its decomposable nature, this has
not been explored so far.

Temporal Segmentation of Facial Expressions


Most of the works on facial expression analysis from image sequences implicitly answer
the context question when as they focus only on classification of target expressions
and/or AUs. For instance, in the HMM-based models for facial expression classifica-
tion (Shang & Chan, 2009; Cohen et al., 2003), the number of hidden states is set
so that they correspond to the temporal segments (neutral/onset/apex/offset) of facial
expressions. They do not, however, explicitly encode these dynamics (i.e. they do not
perform classification of the temporal segments). Yet, both the configuration, in terms
of AUs constituting the observed expressions, and their dynamics, in terms of timing
and duration of the temporal segments of facial expressions, are important for catego-
rization of, for example, complex psychological states, such as various types of pain and
mood (Pantic & Bartlett, 2007). They also represent a critical factor in interpretation of
social behaviors, such as social inhibition, embarrassment, amusement, and shame, and
are a key parameter in differentiation between posed and spontaneous facial displays
(Ekman et al., 2002).
The class of models that performs segmentation of the expression sequences into dif-
ferent temporal segments tries to answer the context questions how (e.g. the information
is passed on by the apex of a facial expression of emotion or AU) and when (i.e. when
did it occur in the expression sequence), thus accounting explicitly for these context ques-
tions. For instance, in Pantic and Patras (2005, 2006), a static rule-based classifier and the
geometric features (i.e. facial points) were used to encode temporal segments of AUs
in near-frontal and profile view faces, respectively. The works in Koelstra, Pantic, and
Patras (2010) and Valstar and Pantic (2012) proposed modifications of standard HMMs
to encode temporal evolution of the AU segments. Specifically, Koelstra et al. (2010)
proposed a combination of discriminative, frame-based GentleBoost ensemble learners
and HMMs for classification and temporal segmentation of AUs. Similarly, Valstar and
Pantic (2012) combined SVMs and HMMs in a hybrid SVM-HMM model based on the
geometric features for the same task. Classification and temporal segmentation of the
emotion categories was also attempted in Gunes and Piccardi (2009) using HMMs and
SVMs.
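The hybrid frame-classifier-plus-dynamic-model strategy can be sketched as per-frame segment scores smoothed by Viterbi decoding under a left-to-right transition structure, as below; the transition probabilities and the random frame scores are placeholders for what an SVM or GentleBoost stage would normally provide.

```python
# Sketch of hybrid frame-classifier + dynamic-model temporal segmentation:
# per-frame scores for the four temporal segments are smoothed by Viterbi
# decoding under a transition structure that only allows self-loops and the
# natural order neutral -> onset -> apex -> offset -> neutral.
import numpy as np

SEGMENTS = ["neutral", "onset", "apex", "offset"]

def viterbi(log_scores, log_trans):
    T, S = log_scores.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_scores[0]
    for t in range(1, T):
        for s in range(S):
            cand = delta[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(cand)
            delta[t, s] = cand[back[t, s]] + log_scores[t, s]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [SEGMENTS[s] for s in reversed(path)]

trans = np.array([[0.9, 0.1, 0.0, 0.0],    # neutral -> neutral/onset
                  [0.0, 0.8, 0.2, 0.0],    # onset   -> onset/apex
                  [0.0, 0.0, 0.8, 0.2],    # apex    -> apex/offset
                  [0.2, 0.0, 0.0, 0.8]])   # offset  -> offset/neutral
frame_scores = np.random.default_rng(0).random((30, 4))   # placeholder scores
print(viterbi(np.log(frame_scores + 1e-8), np.log(trans + 1e-8)))
```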
A variant of the linear-chain CRF, named the conditional ordinal random field
(CORF), was proposed in Kim and Pavlovic (2010) for temporal segmentation of the
emotion categories. In this model, the node features of the linear-chain CRF model
are set using the modeling strategy of standard ordinal regression models (e.g.,
Chu & Ghahramani, 2005) in order to enforce the ordering of the temporal segments
(neutral < onset < apex). The authors emphasize the importance of modeling the ordi-
nal constraints as well as the temporal constraints imposed by a transition model defined
on the segments. On the target task, the proposed CORF model outperforms the static
classifiers for nominal data, such as SVMs, and ordinal data, such as support vector ordi-
nal regression (SVOR) (Chu & Keerthi, 2005), as well as traditional dynamic models
for nominal data, such as HMMs and CRFs. An extension of this model was proposed in
Rudovic, Pavlovic, and Pantic (2012b), where the authors combined different emotion-
specific CORF models in the HCRF framework. In contrast to the CORF model, this
model performs simultaneous classification and temporal segmentation of the emotion
categories. More recently, Rudovic, Pavlovic, and Pantic (2012a) introduced a kernel
extension of the CORF model and applied it to the AU temporal segmentation. Com-
pared to the nominal temporal models such as hybrid SVM-HMM (Valstar & Pantic,
2012) and the linear CORF/CRF models, this model showed improved performance in
the target task on most of the AUs tested, which is mainly attributed to its nonlinear
feature functions.
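The ordinal node features used by these CORF models follow the standard ordinal (probit) likelihood of ordinal regression models such as Chu and Ghahramani (2005); a generic sketch of that likelihood, with notation chosen here for illustration rather than taken from the cited papers, is:

```latex
P(y = c \mid z) = \Phi\!\left(\frac{b_c - z}{\sigma}\right) - \Phi\!\left(\frac{b_{c-1} - z}{\sigma}\right),
\qquad -\infty = b_0 < b_1 < \dots < b_C = +\infty ,
```

where z is a latent score computed from the input features, Phi is the standard normal cumulative distribution function, and the shared cut-points b_c impose an ordering on the classes; with c ranging over the temporal segments, this is how a constraint such as neutral < onset < apex can be enforced at the node level.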

Intensity Estimation of Facial Expressions


Facial expression dynamics can also be described in terms of their intensity. Explicit
analysis of the expression intensity is important for accurate interpretation of facial
expressions, and is also essential for distinguishing between spontaneous and posed
facial expressions (Pantic & Bartlett, 2007). For example, a full-blown smile and a
smirk, both coded as AU12 but with different intensities, have very different mean-
ings (e.g., enjoyment vs sarcasm). However, discerning different intensities of facial
expressions is a far more challenging task than the expression classification. This
is mainly because the facial muscle contractions are combined with the individ-
ual’s physical characteristics, producing changes in appearance that can vary signif-
icantly between subjects (Ekman et al., 2002). As a consequence, the methods that
work for intense expressions may generalize poorly to subtle expressions with low
intensity.
While FACS (Ekman et al., 2002) provides a 5-point ordinal scale for coding the
intensity of AUs, there is no established standard for how to code the intensity of holis-
tic facial expressions (e.g., those of the six basic emotions). Primarily for this reason,
and because of the observation in Hess, Blairy, and Kleck (1997) that expression decoding
accuracy and the perceived intensity of the underlying affective state vary linearly
with the physical intensity of a facial display, existing works on intensity estimation
of facial expressions of the basic emotions resort to an unsupervised approach
to modeling the expression intensity (e.g., Amin et al., 2005; Shan, 2007; Kimura
& Yachida, 1997; Lee & Xu, 2003; Yang, Liu, & Metaxas, 2009b). The main idea in
these works is that the variation in facial images due to the facial expressions can be
represented on a manifold, where the image sequences are embedded as continuous
curves. The distances from the origin of the manifold (corresponding to the embed-
ding of the neutral faces) are then related to the intensity of the facial expressions. For
instance, Amin et al. (2005) used an unsupervised Fuzzy-K-Means algorithm to per-
form clustering of the Gabor wavelet features, extracted from expressive images, in a
2D eigenspace defined by the pairs of the features’ principal components chosen so that
the centroids of the clusters lie on a straight line. The cluster memberships are then
mapped to three levels of intensity of a facial expression (e.g. less happy, moderately
happy, and very happy). Similarly, Shan (2007) first applied a supervised LPP technique
(Shan, Gong, & McOwan, 2005) to learn a manifold of six basic expression categories.
Subsequently, Fuzzy K-Means was used to cluster the embeddings of each expression
category into three fuzzy clusters corresponding to a low, moderate, and high intensity
of target expressions. Kimura and Yachida (1997) used a potential net model to extract
the motion-flow-based features from images of facial expressions, which were used to
estimate a 2D eigenspace of the expression intensity. Lee and Xu (2003) and Yang et al.
(2009b) also performed the intensity estimation on a manifold of facial expressions.
Specifically, Lee and Xu (2003) used isometric feature mapping (Isomap) to learn a 1D
expression-specific manifold, and the distances on the manifold were then mapped into
the expression intensity. The mapping of the input features to the expression intensity
of three emotion categories (happiness, anger, and sadness) was then modeled using
either cascade NNs or support vector regression (SVR). On the other hand, Yang et al.
(2009b) treated the intensity estimation as a ranking problem. The authors proposed the
RankBoost algorithm for learning the expression-category-specific ranking functions
that assign different scores to each image frame, assumed to correspond to the expres-
sion intensity. These scores are based on the pair-wise comparisons of the changes in
the Haar-like features, extracted over time from facial images. The main criticism of
these works is that the expression intensity is obtained as a byproduct of the learning
method (and the features) used, which makes the comparison of the different methods
difficult.
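The following sketch illustrates the shared idea behind these unsupervised approaches: embed the frames of an expression sequence on a low-dimensional manifold and treat the distance from the embedding of the neutral face as an intensity proxy. It uses Isomap on synthetic features purely for illustration and does not reproduce any of the specific pipelines cited above.

```python
import numpy as np
from sklearn.manifold import Isomap

# Synthetic stand-in for per-frame features of one sequence that starts at a
# neutral face and gradually becomes more expressive.
rng = np.random.default_rng(0)
n_frames, n_features = 120, 40
direction = rng.standard_normal(n_features)
intensity_true = np.linspace(0, 1, n_frames) ** 2
frames = intensity_true[:, None] * direction + 0.05 * rng.standard_normal((n_frames, n_features))

# Embed the sequence on a 1-D manifold (cf. the expression-specific manifolds of Lee & Xu, 2003).
embedding = Isomap(n_neighbors=10, n_components=1).fit_transform(frames)

# Unsupervised intensity proxy: distance from the embedding of the neutral (first) frame.
intensity_proxy = np.abs(embedding - embedding[0]).ravel()
intensity_proxy /= intensity_proxy.max()
```

Such a proxy is only defined up to a monotonic transformation of the true intensity, which is precisely why comparing methods whose intensity estimate is a byproduct of the embedding is difficult.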
Recent release of the pain intensity coded data (Lucey et al., 2011) has motivated
research into automated estimation of the pain intensity levels (Hammal & Cohn, 2012;
Kaltwang, Rudovic, & Pantic, 2012; Rudovic, Pavlovic, & Pantic, 2013a). For example,
Hammal and Cohn (2012) performed estimation of 4 pain intensity levels, with the lev-
els greater than 3 on the 16-level scale being grouped together. The authors applied log-
normal filters to the normalized facial appearance to extract the image features, which
were then used to train binary SVM classifiers for each pain intensity level, on a frame-
by-frame basis. Instead of quantizing the intensity levels for the classification, Kaltwang
et al. (2012) treated the pain intensity estimation as a regression problem. To this end,
the authors proposed a feature-fusion approach based on the relevance vector regres-
sion (RVR) model. While these works focus on static modeling of the pain intensity,
Rudovic et al. (2013a) proposed the heteroscedastic CORF model for dynamic intensity
estimation of six intensity levels of pain. In this CRF-like model, the authors model the
temporal unfolding of the pain intensity levels in an image sequence, where the ordering
of the image frames with different intensity levels is enforced. The heteroscedastic vari-
ance in the model also allows it to adapt more easily to different subjects.
AU intensity estimation is a relatively recent problem within the field, and only
a few works have addressed it so far. Based on the modeling approach, these can
be divided into static methods (Mahoor et al., 2009; Mavadati et al., 2013; Savran,
Sankur, & Bilge, 2012; Kaltwang et al., 2012; Jeni et al., 2013) and dynamic methods
(Rudovic, Pavlovic, & Pantic, 2013b). The static methods can further be divided into
classification-based (e.g., Mahoor et al., 2009; Mavadati et al., 2013) and regression-
based methods (e.g., Savran et al., 2012; Kaltwang et al., 2012; Jeni et al., 2013).
The static classification-based methods (Mahoor et al., 2009; Mavadati et al., 2013)
perform multiclass classification of the intensity of AUs using the SVM classifier.
For example, Mahoor et al. (2009) performed the intensity estimation of AU6 (cheek
raiser) and AU12 (lip corner puller) from facial images of infants. The input fea-
tures were obtained by concatenation of the geometric and appearance features. Due
to the excessive number of the features, the spectral regression (SR) (Cai, He, & Han,
2007) was applied to select the most relevant features for the intensity estimation of
each AU. The intensity classification was performed using AU-specific SVMs. On
the other hand, the static regression-based methods model the intensity of AUs on a
continuous scale, using either logistic regression (Savran et al., 2012), RVM regres-
sion (Kaltwang et al., 2012), or support vector regression (SVR) (Jeni et al., 2013).
For instance, Savran et al. (2012) used logistic regression for AU intensity estima-
tion, where the input features were selected by applying an AdaBoost-based method
to the Gabor wavelet magnitudes of 2D luminance and 3D geometry extracted from
the target images. Kaltwang et al. (2012) used the RVM model for intensity estima-
tion of 11 AUs using image features such as local binary patterns (LBPs), discrete
cosine transform (DCT), and the geometric features (i.e. facial points) as well as
their fusion. Jeni et al. (2013) proposed a sparse representation of the facial appear-
ance obtained by applying non-negative matrix factorization (NMF) filters to gray-
scale image patches extracted around facial points from the AU-coded facial images,
thus indirectly answering the context question who, in addition to the context question
how, which the other models mentioned above also address. The image patches
were then processed by applying personal mean texture normalization and used as
input to the SVR model for the intensity estimation. SVMs were also used to analyze
the AU intensities in Bartlett et al. (2006), Reilly, Ghent, and McDonald (2006), and
Delannoy and McDonald (2008); however, these works did not report any quantitative
results.
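As a concrete, if simplified, illustration of the static regression-based setting, the sketch below fuses (synthetic) geometric and appearance features by concatenation and trains a frame-wise support vector regressor for one AU. It is a generic stand-in for this family of methods, not the actual systems of Kaltwang et al. (2012) or Jeni et al. (2013); all data and dimensions are invented.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_frames = 500
geometric = rng.standard_normal((n_frames, 20))    # e.g., facial-point coordinates
appearance = rng.standard_normal((n_frames, 60))   # e.g., LBP/DCT descriptors
X = np.hstack([geometric, appearance])             # simple feature-level fusion
# Toy continuous AU-intensity target, clipped at zero like a real intensity scale.
y = np.clip(X[:, 0] + 0.3 * X[:, 25] + 0.2 * rng.standard_normal(n_frames), 0, None)

train, test = slice(0, 400), slice(400, None)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X[train], y[train])
predicted_intensity = model.predict(X[test])
```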
So far, all the methods for intensity estimation of AUs, except that in Jeni et al.
(2013), account only for the context question how. Recently, Rudovic et al. (2013b)
proposed the context-sensitive conditional ordinal random field (cs-CORF) model for
dynamic estimation of intensity of AUs, and facial expressions of pain. This model is
a generalization of the CORF models (Kim & Pavlovic, 2010; Rudovic et al., 2012b)
proposed for expression classification and temporal segmentation. The cs-CORF pro-
vides means of accounting for all six context questions from the W5+ context model.
In Rudovic et al. (2013b), the authors demonstrate the influence of context on intensity

Figure 18.1 The cs-CORF model (Rudovic et al., 2013b) simultaneously accounts for the context questions who, how, and when: x denotes the feature measurements, and the latent variable z is non-linearly related to the ordinal labels y via the ordinal probit function, which is used to define the node features in the cs-CORF model. For more details, see Rudovic et al. (2013b).

estimation of facial expressions by modeling the context questions who (the observed
person), how (the AU intensity-related changes in facial expressions), and when (the
timing of the AU intensities). The context questions who and how are modeled by
means of the newly introduced context and context-free covariate effects, while the
context question when is modeled in terms of temporal correlation between the ordi-
nal outputs, i.e., the AU intensity levels. To deal with skewed distributions of the AU
intensity levels, the model parameters are adapted using a weighted softmax-margin
learning approach. All these effects are summarized in the graphical representation
of the cs-CORF model shown in Figure 18.1. In their experiments on spontaneously
displayed facial expressions, the authors show that modeling of the ordinal relation-
ships between the intensity levels and their temporal unfolding improves the estimation
compared to that attained by static classification/regression models as well as the tradi-
tional nominal models for sequence classification (i.e., CRFs). More importantly, they
show that modeling the context question who significantly improves the ability
of the model to discriminate between the expression intensity levels of different
subjects.

Annotations in Social Signal Processing

Obtaining meaningful annotations is crucial for any field that inter-
sects with machine learning. Usually, the labeling task is performed manually, incurring
the cost of manual labour, whether a set of experts or lay annotators is employed.
This cost has increased heavily in recent years, since the explosion of
information in the so-called Big Data era has led to the gathering of massive amounts of
data to be annotated.
As a descriptive example one can simply juxtapose Paul Ekman’s seminal work on
the six universal emotions (Pictures of Facial Affect) (Ekman, Friesen, & Press, 1975) to
one of the modern databases on affect, the SEMAINE database (McKeown et al., 2012).
Ekman’s work contained 110 black-and-white images, while approximately 2 seconds
from one of the 959 sessions in SEMAINE contain approximately 100 color frames,
accompanied by audio. It is clear that the task of annotating hours of audiovisual data
is much more demanding than merely annotating 100 images.
The exponential increase of data availability functioned as a catalyst for the adoption
of a new direction in social signal processing (SSP). Since a large amount of audiovisual
material was now available, instead of assigning one class label to a set of predefined
episodes, researchers started to adopt continuous annotations in terms of the temporal
dimension, i.e. instead of labeling a set of frames as “happy”, we now can have one
label per frame. Furthermore, if the label is a real number indicating the “magnitude”
of happiness, the labels are continuous in both space and time. Most related research is
based on the seminal work of Russell (Posner, Russell, & Peterson, 2005), where affect
is described via a set of latent dimensions that capture the emotional state of the subject
beyond the basic, discrete classes of emotion introduced by Ekman (anger, disgust, fear,
happiness, sadness, and surprise). The most commonly used dimensions are valence,
indicating the emotional state as positive or negative, and arousal, indicating the emo-
tion intensity, while continuous annotations have been employed for other social signals
such as pain and conflict. The shift from discrete classes of emotion to continuous anno-
tations is part of an ongoing change in the field of affective computing and SSP, where
the locus of attention has been shifting to real-world problems outside heavily con-
trolled laboratory conditions, focusing on spontaneous rather than posed emotion
expressions. By adopting a dimensional description of emotions, we are now able to repre-
sent emotional states that are commonly found in everyday life, e.g., being bored or
interested (Gunes, Piccardi, & Pantic, 2008).

Challenges
The challenges arising from the recent focus of SSP on spontaneous, naturalistic data,
along with the adoption of continuous annotations and the exponential increase in data
to be annotated are many. The first issue inherent to annotation tasks related to SSP
is label subjectivity. When measuring quantities such as subject interest or emotion
dimensions such as valence, it is natural for some ambiguity to arise, especially when
utilising spontaneous data in naturalistic, interactive scenarios (as in most state-of-the-
art databases such as SEMAINE). While this issue manifests regardless of the label
type, be it continuous, discrete, or ordinal, the trickiest scenario arises when dealing
with annotations that are continuous in space. This is mostly because, instead of

Figure 18.2 Example valence annotations from multiple annotators, illustrating spike noise, annotator lags, and bias.

predefined classes (e.g., happy, neutral, sad), the annotation is in terms of the magnitude
of, e.g., happiness, leading to an essentially infinite number of classes (up to machine/input
device accuracy). This is a trade-off, since capturing a larger spectrum of
expressions leads to increased label ambiguity.
As mentioned, many modern databases such as SEMAINE1 adopt continuous anno-
tations in time. This entails that the annotation task is performed on-line, i.e. while
each annotator is watching/listening to the audio/visual data, he or she is also moving
the input device, usually a mouse (Cowie et al., 2000) or a joystick, according to his
or her understanding of the emotional state of the subject. A prominent implication of
the latter is that each annotator will demonstrate a time-varying, person-specific lag.
Although one can claim that, due to the efficacy of the human brain, the realisation of
the emotional state of the subject can be near-instant, the lag can be due to the time it
takes for the annotator to actually perform the annotation (e.g., move the mouse), or can
even depend on the input device itself or on how alert the annotator is at the time (e.g.,
the annotator can become tired and less responsive when annotating large amounts of
data). Furthermore, the annotator is called upon to make an on-the-spot decision regarding the
annotation, i.e., the annotation is no longer per frame/per image, making the process
more prone to errors.
In an effort to minimize person-specific bias, databases such as SEMAINE are anno-
tated by multiple expert psychologists who were trained in annotating such behaviour.
Still, as one can easily verify by examining the provided annotations (Figure 18.2), the
subjectivity bias, annotator lag, and other issues are still prominent. Other issues, which
we do not comment on extensively here, can arise from weaknesses of the physical input
device that affect the accuracy of the annotation (e.g., moving the mouse can be highly
inaccurate and can cause the appearance of spikes and other artifacts in the annotation).
Some of the issues mentioned in this section are illustrated in Figure 18.2.

1 Besides SEMAINE, other examples of databases which incorporate continuous annotations include the
Belfast Naturalistic Database, the Sensitive Artificial Listener (Douglas-Cowie et al., 2003; Cowie,
Douglas-Cowie, & Cox, 2005), and the CreativeIT database (Metallinou et al., 2010).

The Sub-optimality of Majority Voting and Averaging


Due to the challenges discussed in the previous section, it is clear that obtaining a
“gold standard” (i.e., the “true” annotation, given a set of possibly noisy
annotations) is a tedious task, and researchers in the field have acknowledged this
in previous work (Metallinou et al., 2011; Nicolaou, Pavlovic, & Pantic,
2012). In the majority of past SSP research, though, the average annotation
is used as an estimate of the underlying true annotation, either in the form
of a weighted average, weighted by, e.g., the correlation of each annotator with the rest (Nicolaou,
Gunes, & Pantic, 2011), or a simple, unweighted average (Wöllmer et al., 2008).
Majority voting (for discrete labels) or averaging (for continuous in space annota-
tions) makes a set of explicit assumptions, namely that all annotators are equally good,
and that the majority of the annotators will identify the correct label, eliminating any
ambiguity/subjectivity. Nevertheless, in most real-world problems these assumptions
typically do not hold. So far in our discussion we have assumed that all annotators are
considered experts2, a common case for labels related to SSP. In many cases, though,
annotators can be inexperienced, naive, or even uninterested in the annotation task. This
phenomenon has been amplified by the recent trend of crowdsourcing annotations (via
services such as Mechanical Turk), which allows gathering labels from large groups of
people, who usually have no formal training in the task at hand, shifting the annotation
process from a small group of experts to a massive but weak annotator pool. In gen-
eral, besides experts, annotators can be assigned to classes such as
naive annotators, who commonly make mistakes; adversarial or malicious annotators, who pro-
vide erroneous annotations on purpose; or spammers, who do not even pay attention to
the sequence they are annotating. It should be clear that if, e.g., the majority of anno-
tators are adversarial, then majority voting will always obtain the wrong label. This is
also the case if the majority of annotators are naive and, on difficult/subjective data,
all make the same mistake. This phenomenon has led to particular interest in
modeling annotator performance (cf. Dai, Mausam, & Weld, 2010; Dai, Mausam, &
Weld, 2011; Raykar et al., 2010; Yan et al., 2012).
It is important to note that fusing annotations that are continuous in time comes
with particular difficulties since, as discussed in the previous section, there is increased
ambiguity and, most importantly, an annotator-specific lag, which in turn leads to the
misalignment of samples, as can be seen in Figure 18.2. By simply averaging, we are
essentially integrating these temporal discrepancies into the estimated ground truth, pos-
sibly giving rise to both phase and magnitude errors (e.g., false peaks). The idea of
shifting the annotations in time in order to attain maximal agreement has been touched
upon in Nicolaou, Gunes, and Pantic (2010) and Mariooryad and Busso (2013). Nev-
ertheless, these works apply a constant time-shift, i.e., they assume that the annotator
lag is constant. This does not appear to be the case, as the annotator lag depends on
time-varying conditions (see previous section). The work of Nicolaou et al. (2012) is

2 But not infallible when it comes to a subjective, online annotation process (see the section on Challenges).

the first approach in the field which formally introduces a time alignment component
into the ground truth estimation in order to tackle this issue. We will discuss the work
of Nicolaou et al. (2012) along with other works on fusing multiple annotations in what
follows.
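To illustrate what a constant time-shift correction amounts to, the sketch below estimates a single lag per annotator by maximizing the cross-correlation with a reference trace. This is only a schematic version of the idea behind Nicolaou et al. (2010) and Mariooryad and Busso (2013), not their exact procedures; all names are illustrative.

```python
import numpy as np

def constant_lag(reference, annotation, max_lag=200):
    """Return the constant lag (in frames) that maximizes the cross-correlation
    between an annotation and a reference trace (e.g., the mean annotation)."""
    r = reference - reference.mean()
    a = annotation - annotation.mean()
    lags = range(-max_lag, max_lag + 1)
    scores = [np.sum(r[max(0, -k): len(r) - max(0, k)] *
                     a[max(0, k): len(a) - max(0, -k)]) for k in lags]
    return list(lags)[int(np.argmax(scores))]

# Toy example: an annotator who reacts roughly 40 frames late.
t = np.linspace(0, 10, 1000)
reference = np.sin(t)
late_annotation = np.roll(reference, 40)
print(constant_lag(reference, late_annotation))   # approximately 40
```

The fact that a single number is rarely adequate, because the lag drifts with fatigue and content, is exactly what motivates the time-varying alignment discussed below.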

Beyond Majority Voting and Averaging: Fusing Multiple Annotations


As mentioned in the previous section, the sub-optimality of majority voting in the face of these
challenges has led to much interest in designing models that better fuse labels.
In Raykar et al. (2009), an attempt is made to model the performance of annotators
who assign a possibly noisy label. The latent “true” (binary) annotation is not known
and should be discovered in the estimation process. By assuming independence of all
annotators and, furthermore, assuming that annotator performance does not intrinsically
depend on the annotated sample, each annotator can be characterised by his/her sensi-
tivity and specificity. In this naive Bayes scenario, the annotator scores are essentially
used as weights for a weighted majority rule, where, if all annotators have the same
annotator characteristics, it collapses to the majority rule3. Note that the more general
approach of Raykar et al. (2009) indicates that, in the presence of data that is being
labeled, neither simple nor weighted majority voting is optimal. In fact majority voting
can be seen only as a first guess aimed at assigning an uncertain consensus “true” label,
which is then further refined using an iterative EM (expectation maximization) process,
where both the “true” label and the annotator performance are iteratively re-estimated.
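A minimal, featureless variant of this idea (essentially the classic sensitivity/specificity EM for binary labels, without the classifier that the full model of Raykar et al., 2009, couples to the data being labeled) can be sketched as follows; the function name and toy labels are illustrative only.

```python
import numpy as np

def em_fuse_binary(L, n_iter=50):
    """EM fusion of noisy binary labels: L is (n_items, n_annotators) with entries in {0, 1}.
    Returns the posterior P(true label = 1) per item and per-annotator sensitivity/specificity."""
    mu = L.mean(axis=1)                                  # initial soft labels: plain averaging
    for _ in range(n_iter):
        # M-step: re-estimate each annotator's sensitivity (alpha) and specificity (beta).
        alpha = (mu[:, None] * L).sum(0) / (mu.sum() + 1e-12)
        beta = ((1 - mu)[:, None] * (1 - L)).sum(0) / ((1 - mu).sum() + 1e-12)
        prior = mu.mean()
        # E-step: posterior over the latent "true" label of every item.
        like1 = prior * np.prod(alpha ** L * (1 - alpha) ** (1 - L), axis=1)
        like0 = (1 - prior) * np.prod((1 - beta) ** L * beta ** (1 - L), axis=1)
        mu = like1 / (like1 + like0 + 1e-12)
    return mu, alpha, beta

# Toy example: three broadly consistent annotators and one answering more or less at random.
L = np.array([[1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0],
              [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]])
posteriors, sensitivity, specificity = em_fuse_binary(L)
```

Annotators whose labels agree with the evolving consensus receive high sensitivity and specificity and therefore more weight, which is exactly the refinement over (weighted) majority voting described above.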

Spatiotemporal Fusion of Continuous Annotations


In general, canonical correlation analysis (CCA) is a fitting paradigm for fusing annota-
tions. CCA can find maximally correlating projections for the set of variables involved
and, in a way, this can translate to the goal of fusing multiple annotations: find max-
imally correlating projections for the fused annotations in order to minimise subject-
dependent bias. CCA has been extended to a probabilistic formulation in Bach and
Jordan (2005), while Klami and Kaski (2008)4 have extended probabilistic CCA
(PCCA) to a private-shared space model. In effect, by applying the model of Klami and
Kaski (2008) on a set of signals, we obtain an estimation of the common characteristics
of the signal (projected onto a maximally correlated space), while also isolating unin-
teresting factors which are signal-specific. Practically, this model is computationally
efficient as it can lead to a closed-form SVD-based solution for a simple Gaussian noise
model. Nevertheless, in order to apply this model to annotations, it is highly desirable
that (i) the model takes dynamics into account, since temporally continuous annotations
are rich in dynamics, and (ii) the model somehow alleviates temporal discrepancies, which appear
due to, e.g., annotator-specific lags. These extensions are proposed and implemented in

3 Detailed analysis of majority voting, including its weighted version, can be found in Lam and Suen (1997)
and Ruta and Gabrys (2005).
4 This formulation is closely related to Tucker (1958), while the model of Raykar et al. (2010) for fusing
continuous annotations can be considered a special case of Bach and Jordan (2005).

Figure 18.3 (a) Graphical model of Nicolaou et al. (2012). The shared space Z generates all annotations Xi, while also modelling the individual factors Zi, specific only to annotation i. The time-warping process Δi temporally aligns the shared space given each annotation in time. (b) Applying the model of Nicolaou et al. (2012) to a set of annotations; from top to bottom: original annotations, aligned shared space, derived annotation.

Nicolaou et al. (2012), where Markovian dependencies are imposed on both the shared
and private latent spaces, while annotations are temporally aligned in order to compensate
for lags by introducing a time-warping process, based on dynamic time warping (DTW),
on the sampled shared space of each annotation. Thus, the model is able to isolate unin-
teresting parts of the annotation (which are defined, in this context, as factors specific to
an annotation and not shared) and learn a latent representation of the common, underly-
ing signal which should express the “true annotation,” ideally being free of all nuisances,
such as annotator bias and spike noise. The graphical model of Nicolaou et al. (2012)
is illustrated in Figure 18.3, along with an example application. We note that both the
model of Nicolaou et al. (2012) and Raykar et al. (2010) are able to incorporate data
points (to which the annotations correspond) in the learning process. Furthermore, the
application of CCA-related models to discrete/categorical annotations is still an
open issue. This would require adopting methodologies similar to De Leeuw (2006)
and Niitsuma and Okada (2005), using the CCA model described in Hamid et al. (2011), or
modifying the generative model used in Klami and Kaski (2008) and Nicolaou et al.
(2012).
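The sketch below conveys the flavour of such temporal alignment in a deliberately simplified way: each annotation is warped onto a provisional reference with plain dynamic time warping and the aligned traces are then averaged. This is a crude stand-in for, not an implementation of, the dynamic probabilistic CCA model of Nicolaou et al. (2012); every name and number is illustrative.

```python
import numpy as np

def dtw_path(x, y):
    """Classic O(len(x)*len(y)) dynamic time warping between two 1-D signals;
    returns the warping path as a list of (i, j) index pairs."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    i, j, path = n, m, []                        # backtrack from (n, m)
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i = i - 1
        else:
            j = j - 1
    path.append((0, 0))
    return path[::-1]

def fuse_annotations(annotations):
    """Align each annotation to a provisional reference (their mean) with DTW,
    then average the aligned traces."""
    reference = np.mean(annotations, axis=0)
    aligned = []
    for a in annotations:
        warped = np.empty_like(reference)
        for i, j in dtw_path(reference, a):      # for each reference frame, keep the matched sample
            warped[i] = a[j]
        aligned.append(warped)
    return np.mean(aligned, axis=0)

# Toy example: one latent signal rated by three annotators with different lags, biases, and noise.
t = np.linspace(0, 4 * np.pi, 200)
truth = np.sin(t)
rng = np.random.default_rng(0)
anns = np.stack([np.roll(truth, lag) + bias + 0.05 * rng.standard_normal(len(t))
                 for lag, bias in [(0, 0.0), (8, 0.2), (15, -0.1)]])
fused = fuse_annotations(anns)                   # closer to `truth` than the raw mean of `anns`
```

In the actual model, the alignment operates on the sampled shared latent space and the private, annotation-specific factors are discarded rather than averaged in.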

Future Directions

In this chapter we identified two key challenges in data-driven SSP: joint signal–
context modeling and annotation–annotator modeling. While modeling of the signal context
and W5+ is crucial, few approaches to date have focused on this task and none have
solved it in a satisfactory manner. The key difficulty is the lack of models for W5+ and
the corresponding learning algorithms that are robust and scalable enough to produce
models that generalize from posed or even real-world training datasets to arbitrary real-
world, spontaneous query instances. Models that explicitly encode W5+ factors, such
as the cs-CORF (Rudovic et al., 2013b) have the potential to generalize beyond training
sets, but face difficulty in estimation. Related approaches based on tensor/multilinear
decomposition (Lu, Plataniotis, & Venetsanopoulos, 2011) provide one avenue but face
similar algorithmic and modeling (in particular, out-of-sample prediction) challenges.
One practical direction to address the generalization problem has been to use the so-
called domain-adaptation or transfer learning techniques (Pan & Yang, 2010). These
methods work well on simpler models but may face difficulty on full-blown W5+. How
to effectively integrate multifactor W5+ modeling, temporal information, and general-
ization ability remains a significant challenge.
Another related difficulty is the lack of sufficiently comprehensive labeled datasets of
spontaneous affect that can be used to estimate W5+ models. Databases such as MAH-
NOB (http://mahnob-db.eu) or SEMAINE are initial efforts in this direction. Never-
theless, providing comprehensive labeled data is challenging. Most current SSP mod-
els take into account neither the stimulus itself (a part of W5+) nor the annotators,
including the errors and bias they may be imposing in the annotation process. We have
described some initial approaches in the SSP domain that attempt to model the annota-
tion process, annotator performance, bias, and temporal lag. However, many challenges
continue to exist, including how to couple the predictive model estimation with the
annotator modeling, how to track changes in annotator performance over time, how to
select new or avoid underperforming experts, etc. Some of these and related problems
are already being addressed in the domain of crowdsourcing (Quinn & Bederson, 2011)
and data-driven SSP can leverage those efforts. Related efforts have ensued in the con-
text of multi-label learning (Tsoumakas, Katakis, & Vlahavas, 2010), which focuses on
learning a model that partitions the set of labels into relevant and irrelevant with respect
to a query instance, or orders the class labels according to their relevance to a query.
Multi-label learning approaches have not yet been directly applied to problems in SSP,
although they carry great potential.

References

Amin, M. A., Afzulpurkar, N. V., Dailey, M. N., Esichaikul, V. & Batanov, D. N. (2005). Fuzzy-C-
Mean determines the principle component pairs to estimate the degree of emotion from facial
expressions. In 2nd International Conference on Fuzzy Systems and Knowledge Discovery
(pp. 484–493), Changsa, China.
Bach, F. R. & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis.
Technical Report 688, Department of Statistics, University of California.
Bartlett, M., Littlewort, G., Frank, M., et al. (2005). Recognizing facial expression: Machine
learning and application to spontaneous behavior. In Proceedings of IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (pp. 568–573), San Diego, CA.

Bartlett, M., Littlewort, G., Frank, M., et al. (2006). Fully automatic facial action recognition in
spontaneous behavior. In Proceedings of IEEE International Conference on Automatic Face
and Gesture Recognition (pp. 223–230), Southampton, UK.
Bazzo, J. & Lamar, M. (2004). Recognizing facial actions using Gabor wavelets with neutral face
average difference. In Proceedings of IEEE International Conference on Automatic Face and
Gesture Recognition (pp. 505–510), Seoul.
Black, M. J. & Yacoob, Y. (1997). Recognizing facial expressions in image sequences using
local parameterized models of image motion. International Journal of Computer Vision, 25,
23–48.
Cai, D., He, X., & Han, J. (2007). Spectral regression for efficient regularized subspace learning.
In Proceedings of IEEE International Conference on Computer Vision (pp. 1–8), Brazil.
Chang, K.-Y., Liu, T.-L. & Lai, S.-H. (2009). Learning partially observed hidden conditional
random fields for facial expression recognition. In Proceedings of IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, pp. 533–540, Miami, FL.
Chew, S., Lucey, P., Lucey, S., et al. (2012). In the pursuit of effective affective computing: The
relationship between features and registration. IEEE Transactions on Systems, Man, and Cyber-
netics, Part B: Cybernetics, 42(4), 1006–1016.
Chu, W. & Ghahramani, Z. (2005). Gaussian processes for ordinal regression. Journal of Machine
Learning Research, 6, 1019–1041.
Chu, W. & Keerthi, S. S. (2005). New approaches to support vector ordinal regression. In Pro-
ceedings of the 22nd International Conference on Machine Learning (pp. 145–152), Bonn,
Germany.
Chu, W.-S., De la Torre, F., & Cohn, J. (2013). Selective transfer machine for personalized facial
action unit detection. In Proceedings of IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (pp. 3515–3522), Portland, OR.
Cohen, I., Sebe, N., Chen, L., Garg, A., & Huang, T. S. (2003). Facial expression recognition from
video sequences: Temporal and static modelling. Computer Vision and Image Understanding,
92(1–2), 160–187.
Cowie, R., Douglas-Cowie, E., & Cox, C. (2005). Beyond emotion archetypes: Databases for
emotion modelling using neural networks. Neural Networks, 18(4), 371–388.
Cowie, R., Douglas-Cowie, E., Savvidou, S., et al. (2000). “FEELTRACE”: An instrument for
recording perceived emotion in real time. In Proceedings of the ISCA Workshop on Speech and
Emotion (pp. 19–24), Belfast.
Dai, P., Mausam, & Weld, D. S. (2010). Decision-theoretic control of crowd-sourced workflows.
In Proceedings of the 24th National Conference on Artificial Intelligence (pp. 1168–1174),
Atlanta, GA.
Dai, P., Mausam, & Weld, D. S. (2011). Artificial intelligence for artificial artificial intelligence.
In Proceedings of 25th AAAI Conference on Artificial Intelligence (pp. 1153–1159), San Francisco.
Delannoy, J. & McDonald, J. (2008). Automatic estimation of the dynamics of facial expression
using a three-level model of intensity. In Proceedings of IEEE International Conference on
Automatic Face and Gesture Recognition (pp. 1–6), Amsterdam.
De Leeuw, J. (2006). Principal component analysis of binary data by iterated singular value
decomposition. Computational Statistics and Data Analysis, 50(1), 21–39.
Deng, L. & Li, X. (2013). Machine learning paradigms for speech recognition: An overview.
IEEE Transactions on Audio, Speech, and Language Processing, 21(5), 1060–1089.
Der Maaten, L. V. & Hendriks, E. (2012). Action unit classification using active appearance mod-
els and conditional random fields. Cognitive Processing, 13(2), 507–518.

Douglas-Cowie, E., Campbell, N., Cowie, R., & Roach, P. (2003). Emotional speech: Towards a
new generation of databases. Speech Communication, 40(1), 33–60.
Ekman, P., Friesen, W., & Hager, J. (2002). Facial Action Coding System (FACS): Manual. Salt
Lake City, UT: A Human Face.
Ekman, P., Friesen, W. V., & Press, C. P. (1975). Pictures of Facial Affect. Palo Alto, CA: Con-
sulting Psychologists Press.
Fasel, B. & Luettin, J. (2000). Recognition of asymmetric facial action unit activities and intensi-
ties. In Proceedings of 15th International Conference on Pattern Recognition (pp. 1100–1103),
Barcelona, Spain.
Gholami, B., Haddad, W. M., & Tannenbaum, A. R. (2009). Agitation and pain assessment
using digital imaging. In Proceedings of International Conference of the IEEE Engineering
in Medicine and Biology Society (pp. 2176–2179), Minneapolis, MN.
Grauman, K. & Leibe, B. (2011). Visual object recognition. Synthesis Lectures on Artificial Intel-
ligence and Machine Learning, 5(2), 1–181.
Gunes, H. & Piccardi, M. (2009). Automatic temporal segment detection and affect recognition
from face and body display. IEEE Transactions on Systems, Man, and Cybernetics, 39(1), 64–
84.
Gunes, H., Piccardi, M., & Pantic, M. (2008). From the lab to the real world: Affect recogni-
tion using multiple cues and modalities. In J. Or (Ed.), Affective Computing [e-book].
www.intechopen.com/books/affective_computing.
Hamid, J., Meaney, C., Crowcroft, N., et al. (2011). Potential risk factors associated with human
encephalitis: Application of canonical correlation analysis. BMC Medical Research Methodol-
ogy, 11(1), 1–10.
Hamm, J., Kohler, C. G., Gur, R. C., & Verma, R. (2011). Automated facial action coding system
for dynamic analysis of facial expressions in neuropsychiatric disorders. Journal of Neuro-
science Methods, 200(2), 237–256.
Hammal, Z. & Cohn, J. F. (2012). Automatic detection of pain intensity. In Proceedings of the
14th ACM International Conference on Multimodal Interaction (pp. 47–52), Santa Monica,
CA.
He, X. & Niyogi, P. (2004). Locality preserving projections. In Proceedings of Neural Information
Processing Systems (vol. 16) Vancouver, Canada.
Hess, U., Blairy, S., & Kleck, R. (1997). The intensity of emotional facial expressions and decod-
ing accuracy. Journal of Nonverbal Behavior, 21(4), 241–257.
Hu, C., Chang, Y., Feris, R., & Turk, M. (2004). Manifold based analysis of facial expression.
In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshop
(p. 81).
Jain, S., Hu, C., & Aggarwal, J. (2011). Facial expression recognition with temporal modeling
of shapes. In IEEE International Conference on Computer Vision Workshops (pp. 1642–1649),
Barcelona, Spain.
Jeni, L. A., Girard, J. M., Cohn, J. F., & Torre, F. D. L. (2013). Continuous AU intensity estimation
using localized, sparse facial feature space. IEEE International Conference on Automatic Face
and Gesture Recognition (pp. 1–7).
Kaltwang, S., Rudovic, O., & Pantic, M. (2012). Continuous pain intensity estimation from facial
expressions. Lecture Notes in Computer Science ISVC, 7432, 368–377.
Kapoor, A., Qi, Y. A., & Picard, R. W. (2003). Fully automatic upper facial action recognition. In
Proceedings of IEEE International Workshop on Analysis and Modeling of Faces and Gestures
(pp. 195–202).

Khademi, M., Manzuri-Shalmani, M. T., Kiapour, M. H., & Kiaei, A. A. (2010). Recognizing
combinations of facial action units with different intensity using a mixture of hidden Markov
models and neural network. In Proceedings of the 9th International Conference on Multiple
Classifier Systems (pp. 304–313).
Kim, M. & Pavlovic, V. (2010). Structured output ordinal regression for dynamic facial emo-
tion intensity prediction. In Proceedings of 11th European Conference on Computer Vision
(pp. 649–662), Heraklion, Crete.
Kimura, S. & Yachida, M. (1997). Facial expression recognition and its degree estimation. In Pro-
ceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(pp. 295–300), Puerto Rico.
Klami, A. & Kaski, S. (2008). Probabilistic approach to detecting dependencies between data
sets. Neurocomputing, 72(1), 39–46.
Koelstra, S., Pantic, M., & Patras, I. (2010). A dynamic texture based approach to recognition of
facial actions and their temporal models. IEEE Transactions on Pattern Analysis And Machine
Intelligence, 32, 1940–1954.
Lam, L. & Suen, S. (1997). Application of majority voting to pattern recognition: An analysis of
its behavior and performance. IEEE Transactions on Systems, Man, and Cybernetics, Part A:
Systems and Humans, 27(5), 553–568.
Lee, C. S. & Elgammal, A. (2005). Facial expression analysis using nonlinear decomposable
generative models. In Proceedings of IEEE International Workshops on Analysis and Modeling
of Faces and Gestures (pp. 17–31).
Lee, K. K. & Xu, Y. (2003). Real-time estimation of facial expression intensity. In Proceedings
of IEEE International Conference on Robotics and Automation (pp. 2567–2572), Taipei.
Lu, H., Plataniotis, K. N., & Venetsanopoulos, A. N. (2011). A survey of multilinear subspace
learning for tensor data. Pattern Recognition, 44(7), 1540–1551.
Lucey, P., Cohn, J., Prkachin, K., Solomon, P., & Matthews, I. (2011). Painful data: The UNBC-
McMaster shoulder pain expression archive database. In Proceedings of IEEE International
Conference on Automatic Face and Gesture Recognition (pp. 57–64), Santa Barbara, CA.
Mahoor, M., Cadavid, S., Messinger, D., & Cohn, J. (2009). A framework for automated mea-
surement of the intensity of non-posed facial action units. In Proceedings of IEEE Computer
Society Conference on Computer Vision and Pattern Recognition Workshop (pp. 74–8), Miami,
FL.
Mariooryad, S. & Busso, C. (2013). Analysis and compensation of the reaction lag of evaluators
in continuous emotional annotations. In Proceedings of Humaine Association Conference on
Affective Computing and Intelligent Interaction (pp. 97–108), Switzerland.
Mavadati, S., Mahoor, M., Bartlett, K., Trinh, P., & Cohn, J. (2013). DISFA: A spontaneous facial
action intensity database. IEEE Transactions on Affective Computing, 4(2), 151–160.
McKeown, G., Valstar, M., Cowie, R., Pantic, M., & Schroder, M. (2012). The SEMAINE
database: Annotated multimodal records of emotionally colored conversations between a per-
son and a limited agent. IEEE Transactions on Affective Computing, 3(1), 5–17.
Metallinou, A., Katsamanis, A., Wang, Y., & Narayanan, S. (2011). Tracking changes in contin-
uous emotion states using body language and prosodic cues. In Proceedings of IEEE Interna-
tional Conference Acoustics, Speech and Signal Processing (pp. 2288–2291), Prague.
Metallinou, A., Lee, C.-C., Busso, C., Carnicke, S., & Narayanan, S. (2010). The USC CreativeIT
database: A multimodal database of theatrical improvisation. In Proceedings of the Multimodal
Corpora Workshop: Advances in Capturing, Coding and Analyzing Multimodality (pp. 64–68),
Malta.

Nicolaou, M. A., Gunes, H., & Pantic, M. (2010). Automatic segmentation of spontaneous data
using dimensional labels from multiple coders. In Proceedings of LREC International Work-
shop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality,
Valletta, Malta.
Nicolaou, M. A., Gunes, H., & Pantic, M. (2011). Continuous prediction of spontaneous affect
from multiple cues and modalities in valence-arousal space. IEEE Transactions on Affective
Computing, 2(2), 92–105.
Nicolaou, M. A., Pavlovic, V., & Pantic, M. (2012). Dynamic probabilistic CCA for analysis
of affective behaviour. In Proceedings of the 12th European Conference on Computer Vision
(pp. 98–111), Florence, Italy.
Niitsuma, H. & Okada, T. (2005). Covariance and PCA for categorical variables. In T. Ho, D.
Cheung, & Liu, H. (Eds), Advances in Knowledge Discovery and Data Mining (pp. 523–528).
Berlin: Springer.
Otsuka, T. & Ohya, J. (1997). Recognizing multiple persons’ facial expressions using HMM based
on automatic extraction of significant frames from image sequences. In Proceedings of Inter-
national Conference on Image Processing (pp. 546–549), Santa Barbara, CA.
Padgett, C. & Cottrell, G. W. (1996). Representing face images for emotion classification. In
Proceedings 10th Annual Conference on Neural Information Processing Systems (pp. 894–
900), Denver, CO.
Pan, S. J. & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge
and Data Engineering, 22(10), 1345–1359.
Pantic, M. & Bartlett, M. (2007). Machine analysis of facial expressions. In K. Delac & M. Grgic
(Eds), Face Recognition [e-book]. http://www.intechopen.com/books/face_recognition.
Pantic, M. & Patras, I. (2005). Detecting facial actions and their temporal segments in nearly
frontal-view face image sequences. In Proceedings of IEEE International Conference on Sys-
tems, Man and Cybernetics (pp. 3358–3363), Waikoloa, HI.
Pantic, M. & Patras, I. (2006). Dynamics of facial expression: Recognition of facial actions and
their temporal segments from face profile image sequences. IEEE Transactions on Systems,
Man, and Cybernetics, Part B, 36(2), 433–449.
Pantic, M. & Rothkrantz, L. J. (2004). Facial action recognition for facial expression analysis
from static face images. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 34(3),
1449–1461.
Pentland, A. (2007). Social signal processing. IEEE Signal Processing Magazine, 24(4), 108–111.
Poppe, R. (2010). A survey on vision-based human action recognition. Image and Vision Com-
puting, 28(6), 976–990.
Posner, J., Russell, J. A., & Peterson, B. S. (2005). The circumplex model of affect: An integrative
approach to affective neuroscience, cognitive development, and psychopathology. Development
and Psychopathology, 17(3), 715–734.
Quinn, A. J. & Bederson, B. B. (2011). Human computation: a survey and taxonomy of a growing
field. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
ACM Request Permissions (pp. 1403–1412), Vancouver.
Raykar, V. C., Yu, S., Zhao, L. H., et al. (2009). Supervised learning from multiple experts: Whom
to trust when everyone lies a bit. In Proceedings of the 26th Annual International Conference
on Machine Learning (pp. 889–896), Montreal.
Raykar, V. C., Yu, S., Zhao, L. H., et al. (2010). Learning from crowds. Journal of Machine
Learning Research, 99, 1297–1322.

Reilly, J., Ghent, J., & McDonald, J. (2006). Investigating the dynamics of facial expression.
Lecture Notes in Computer Science, 4292, 334–343.
Rudovic, O., Pavlovic, V., & Pantic, M. (2012a). Kernel conditional ordinal random fields for
temporal segmentation of facial action units. Proceedings of 12th European Conference on
Computer Vision (pp. 260–269), Florence, Italy.
Rudovic, O., Pavlovic, V., & Pantic, M. (2012b). Multi-output Laplacian dynamic ordinal regres-
sion for facial expression recognition and intensity estimation. Proceedings of IEEE Confer-
ence on Computer Vision and Pattern Recognition (pp. 2634–2641), Providence, RI.
Rudovic, O., Pavlovic, V., & Pantic, M. (2013a). Automatic pain intensity estimation with het-
eroscedastic conditional ordinal random fields. In Proceedings of 9th International Symposium
on Advances in Visual Computing (pp. 234–243), Rethymnon, Crete.
Rudovic, O., Pavlovic, V., & Pantic, M. (2013b). Context-sensitive conditional ordinal random
fields for facial action intensity estimation. In Proceedings of IEEE International Conference
on Computer Vision Workshops (pp. 492–499), Sydney.
Ruta, D. & Gabrys, B. (2005). Classifier selection for majority voting. Information Fusion, 6(1),
63–81.
Savran, A., Sankur, B., & Bilge, M. (2012). Regression-based intensity estimation of facial
action units. Image and Vision Computing, 30(10), 774–784.
Shan, C. (2007). Inferring facial and body language. PhD thesis, University of London.
Shan, C., Gong, S., & McOwan, P. W. (2005). Appearance manifold of facial expression. Lecture
Notes in Computer Science, 3766, 221–230.
Shan, C., Gong, S., & McOwan, P. W. (2006). Dynamic facial expression recognition using a
Bayesian temporal manifold model. In Proceedings of the British Machine Vision Conference
(pp. 297–306), Edinburgh.
Shan, C., Gong, S., & McOwan, P. W. (2009). Facial expression recognition based on local binary
patterns: A comprehensive study. Image and Vision Computing, 27(6), 803–816.
Shang, L. & Chan, K.-P. (2009). Nonparametric discriminant HMM and application to facial
expression recognition. In Proceedings of IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, pp. 2090–2096.
Simon, T., Nguyen, M. H., De la Torre, F., & Cohn, J. F. (2010). Action unit detection with
segment-based SVMs. In Proceedings of IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (pp. 2737–2744), San Francisco.
Tian, Y.-L. (2004). Evaluation of face resolution for expression analysis. In Proceedings of
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops,
Washington, DC.
Tong, Y., Liao, W., & Ji, Q. (2007). Facial action unit recognition by exploiting their dynamic
and semantic relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence,
29(10), 1683–1699.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In O. Maimon &
L. Rokach (Eds), Data Mining and Knowledge Discovery Handbook (pp. 667–685). Boston:
Springer.
Tucker, L. R. (1958). An inter-battery method of factor analysis. Psychometrika, 23(2), 111–
136.
Valstar, M. F. & Pantic, M. (2012). Fully automatic recognition of the temporal phases of facial
actions. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 42, 28–43.
Vinciarelli, A., Pantic, M., & Bourlard, H. (2009). Social signal processing: Survey of an emerg-
ing domain. Image and Vision Computing, 27(12), 1743–1759.

Vinciarelli, A., Pantic, M., Heylen, D., et al. (2012). Bridging the gap between social animal
and unsocial machine: A survey of social signal processing. IEEE Transactions on Affective
Computing, 3(1), 69–87.
Wang, S., Quattoni, A., Morency, L.-P., Demirdjian, D., & Darrell, T. (2006). Hidden conditional
random fields for gesture recognition. In Proceedings of IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (pp. 1097–1104), New York.
Wöllmer, M., Eyben, F., Reiter, S., et al. (2008). Abandoning emotion classes – towards con-
tinuous emotion recognition with modelling of long-range dependencies. In Proceedings of
InterSpeech (pp. 597–600), Brisbane, Australia.
Wright, J., Ma, Y., Mairal, J., et al. (2010). Sparse representation for computer vision and pattern
recognition. Proceedings of the IEEE, 98(6), 1031–1044.
Yan, Y., Rosales, R., Fung, G., & Dy, J. (2012). Modeling multiple annotator expertise in the
semi-supervised learning scenario. In Proceedings of the 26th Conference on Uncertainty in
Artificial Intelligence, Catalina Island, CA.
Yang, P., Liu, Q., & Metaxas, D. N. (2009a). Boosting encoded dynamic features for facial expres-
sion recognition. Pattern Recognition Letters, 2, 132–139.
Yang, P., Liu, Q., & Metaxas, D. N. (2009b). Rankboost with L1 regularization for facial expres-
sion recognition and intensity estimation. In Proceedings of IEEE International Conference on
Computer Vision (pp. 1018–1025), Kyoto, Japan.
Zhang, Y. & Ji, Q. (2005). Active and dynamic information fusion for facial expression under-
standing from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 27(5), 699–714.
Part III
Machine Synthesis of Social Signals
19 Speech Synthesis: State of the Art and Challenges for the Future
Kallirroi Georgila

Introduction

Speech synthesis (or alternatively text-to-speech synthesis) means automatically con-
verting natural language text into speech. Speech synthesis has many potential applica-
tions. For example, it can be used as an aid to people with disabilities (see Challenges
for the Future), for generating the output of spoken dialogue systems (Lemon et al.,
2006; Georgila et al., 2010), for speech-to-speech translation (Schultz et al., 2006), for
computer games, etc.
Current state-of-the-art speech synthesizers can simulate neutral read-aloud speech
(i.e., speech that sounds like reading from a text) quite well, both in terms of nat-
uralness and intelligibility (Karaiskos et al., 2008). However, today, many commer-
cial applications that require speech output still rely on prerecorded system prompts
rather than use synthetic speech. The reason is that, despite much progress in speech
synthesis over the last twenty years, current state-of-the-art synthetic voices still lack
the expressiveness of human voices. On the other hand, using prerecorded speech has
several drawbacks. It is a very expensive process that often has to start from scratch
for each new application. Moreover, if an application needs to be enhanced with new
prompts, it is quite likely that the person (usually an actor) that recorded the ini-
tial prompts will not be available. Furthermore, human recordings cannot be used for
content generation on the fly, i.e., all the utterances that will be used in an appli-
cation need to be predetermined and recorded in advance. Predetermining all utter-
ances to be recorded is not always possible. For example, the number of names in the
database of an automatic directory assistance service can be huge. Not to mention the
fact that most databases are continuously being updated. In such cases, speech out-
put is generated by using a mixture of prerecorded speech (for prompts) and synthetic
speech (for names) (Georgila et al., 2003). The results of such a mixture can be quite
awkward.
The discussion above shows that there is great motivation for further advances in the
field of speech synthesis. Below we provide an overview of the current state of the art
in speech synthesis, and present challenges for future work.

How Does Speech Synthesis Work?

A speech synthesizer consists of two main components: a front end or pre-processing
module, which analyzes the input text and transforms it into a linguistic specification,
and a speech generation module, which converts this linguistic specification into a
speech waveform. The first module is language-specific, whereas the second module
can be largely language-independent (except for the language data used for training in
data-driven speech synthesis methods).

Pre-processing Module
To understand how the pre-processing module works, consider the example sentence
“the aunt decided to present the present”. In order to synthesize this sentence properly,
the pre-processing module first has to perform part-of-speech tagging (Toutanova et al.,
2003) and word sense disambiguation (Navigli, 2009). Thus it will determine that the
first instance of “present” is a verb and means “give” or “introduce”, whereas the second
instance of “present” is a noun and means “gift”. This distinction is very important
because “present” as a verb and “present” as a noun are pronounced differently.
The next step is to perform parsing (Socher et al., 2013), i.e., convert this sentence
into a syntactic tree, which will provide information about the structure of the sentence,
required for predicting prosody from text. Prosody is the part of human communication
that expresses the speaker’s emotion, makes certain words more prominent while de-
emphasizing others, determines the position of phrase-breaks and pauses, and controls
the rhythm, intonation, and pitch of the utterance (Taylor, 2009). As a result of parsing,
certain words are grouped together to form phrases, e.g., the article “the” and the noun
“present” form the noun phrase “the present”, the verb “present” and the noun phrase
“the present” form the verb phrase “present the present”, and so forth. Pauses are more
likely to occur between word groupings than within a word grouping. Thus it is less
likely that there will be a pause between the words “the” and “aunt”. Also, some words
are more prominent or stressed than others, e.g., the two instances of the word “present”
are more stressed than the word “to”.
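As a toy illustration of the homograph problem, a part-of-speech tagger already separates the two readings of “present” in the example sentence. The snippet below uses NLTK purely for demonstration (resource package names can differ slightly between NLTK versions), and is of course far from a complete synthesis front end.

```python
import nltk

# The tokenizer and tagger models are downloaded separately from the NLTK data repository.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The aunt decided to present the present."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Typically yields tags along the lines of:
# [('The', 'DT'), ('aunt', 'NN'), ('decided', 'VBD'), ('to', 'TO'),
#  ('present', 'VB'), ('the', 'DT'), ('present', 'NN'), ('.', '.')]
```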
The output of the pre-processing stage should contain all the factors that may affect
how the utterance will be realized in the speech generation stage. Each word in the
text to be synthesized is transformed into a series of phonemes based on a phonetic
dictionary or the output of a grapheme-to-phoneme converter (Taylor, 2009), e.g., the
word “decided” is transformed into the sequence of phonemes “d ih0 s ay1 d ih0 d”.
But for generating natural speech the linguistic specification needs to be much more
complex than that. For this reason, for each speech segment, e.g., phoneme, we need to
keep track of many context factors such as preceding and following phonemes, position
of segment in syllable, stress of the current syllable as well as of the preceding and
following syllables, length of utterance in syllables or words or phrases, etc. (King,
2010). Some of these factors are quite localized, others take into account larger context
dependencies and span several segments (suprasegmental). The bottom line is that the
output of the pre-processing stage, i.e., the linguistic specification of the input text, is
a sequence of phonemes augmented with contextual information. For a full discussion
on prosody prediction from text, and generally how the pre-processing stage works, see
Taylor (2009).
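A minimal sketch of turning words into a context-augmented phoneme sequence, using the CMU Pronouncing Dictionary shipped with NLTK, is shown below. Real front ends attach many more factors (syllable position, stress of neighbouring syllables, phrase lengths, and so on), and a dictionary lookup alone cannot choose between the two pronunciations of the homograph “present”; the data structure here is an assumption for illustration only.

```python
import nltk

nltk.download("cmudict", quiet=True)
pronunciations = nltk.corpus.cmudict.dict()

words = ["the", "aunt", "decided", "to", "present", "the", "present"]
phones = [p for w in words for p in pronunciations[w][0]]   # first listed pronunciation per word

# Attach a sliver of context to each phoneme: its left and right neighbours.
specification = [{"phone": p,
                  "prev": phones[i - 1] if i > 0 else "sil",
                  "next": phones[i + 1] if i + 1 < len(phones) else "sil"}
                 for i, p in enumerate(phones)]

print(pronunciations["decided"][0])   # e.g., ['D', 'IH0', 'S', 'AY1', 'D', 'IH0', 'D']
```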

Speech Generation Module


Over the years many methods have been proposed for converting linguistic specifica-
tions into speech waveforms. King (2010) classifies such methods into two categories:
exemplar-based and model-based speech synthesis. Exemplar-based approaches are
data-driven methods that memorize the data. During training they store labelled seg-
ments of speech and at runtime they retrieve the appropriate sequence of such segments.
Typical examples of exemplar-based speech synthesis are concatenative speech synthe-
sis, diphone-based speech synthesis, and unit selection speech synthesis (see section on
Unit Selection Speech Synthesis).
Model-based approaches use data in order to learn their properties or may not require
using any data at all. An example of a model-based approach is articulatory speech
synthesis, which aims to model the natural speech production process. More specifi-
cally, articulatory synthesis models the vocal tract and changes in its shape due to the
movement of the tongue, jaw, and lips in order to simulate the air flow through the
vocal tract. Articulatory speech synthesis has been around for a long time in the form
of talking machines (von Kempelen, 1791). More recent articulatory synthesis tech-
niques take advantage of progress in X-ray photography, magnetic resonance imaging
(MRI), or electromagnetic articulography (EMA) (Narayanan, Alwan, & Haker, 1997;
Taylor, 2009). For example, the Haskins configurable articulatory synthesizer (CASY)
(Iskarous et al., 2003) represents speech organs (articulators) using simple geometric
parameters such as angles, lines, and circles. These parameters are adjusted based on
MRI data.
Despite much progress due to technological advances in digital imaging methods,
developing an accurate model of the natural speech production process is very difficult,
and the best articulatory speech synthesizers generate speech of poor quality compared
to unit selection or HMM-based speech synthesis. Articulatory speech synthesis is also
related to audio-visual synthesis (or talking-head synthesis), which aims to create a
complete visual animated model of the head while talking (Taylor, 2009).
Another example of a model-based approach is formant synthesis, also called syn-
thesis by rule (Klatt, 1980). In this approach, the sound is generated from two sources
(periodic for voiced and noise-like for obstruent sounds) (Taylor, 2009). This source sig-
nal passes through a model that simulates the vocal tract as a series of digital resonators.
The resulting speech does not sound natural but it is intelligible.
Most formant synthesis systems rely on rules, but it is also possible to use data to adjust some of the model parameters. In fact, the most recent trend in speech synthesis is to use model-based approaches where the parameters of the model, such as spectrum, fundamental frequency (F0), and phoneme duration, are learned from data
(statistical parametric speech synthesis). The most popular example of statistical para-
metric speech synthesis is hidden Markov model (HMM)-based speech synthesis (see
section on HMM-based speech synthesis).

Evaluation of Synthetic Speech


The current practice in speech synthesis evaluation is to synthesize utterances using a
particular voice and then ask humans to rate these utterances in terms of a few aspects,
usually naturalness and intelligibility. An example question used for this kind of eval-
uation is: “Does this utterance sound natural?” (1 = very unnatural, 2 = somewhat
unnatural, 3 = neither natural nor unnatural, 4 = somewhat natural, 5 = very natural)
(Georgila et al., 2012). A large number of listeners are asked to respond to questions
similar to the one above for several utterances synthesized with a voice, and the average
of these ratings is the so-called mean opinion score (MOS) for that voice (Karaiskos
et al., 2008). To determine the intelligibility of a voice, listeners are asked to transcribe utterances that are usually semantically unpredictable, so that words cannot be guessed from context. These transcriptions are compared with
the correct transcriptions of the utterances (gold-standard) and the average word error
rate (WER) is calculated. The lower the WER the higher the intelligibility of the voice.
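Both measures are straightforward to compute: the MOS is the average of the listeners' ratings, and the WER is the word-level edit distance between a listener's transcription and the gold standard, divided by the length of the gold standard. A minimal sketch (the ratings and sentences below are invented):

# Minimal sketch of the two standard evaluation measures.
# Ratings and transcriptions are invented for illustration.

def mean_opinion_score(ratings):
    # MOS: average of the 1-5 naturalness ratings collected from listeners.
    return sum(ratings) / len(ratings)

def word_error_rate(reference, hypothesis):
    # WER: word-level edit distance (substitutions + insertions + deletions)
    # divided by the number of words in the gold-standard reference.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(mean_opinion_score([4, 3, 5, 4, 4]))          # 4.0
print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))      # about 0.17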
Georgila et al. (2012) performed a systematic evaluation of human and synthetic
voices with regard to naturalness, conversational aspect (whether an utterance sounds more like part of an everyday conversation than like someone reading from a script), and likability (whether one would like to have a conversation with a speaker who sounds like a particular voice). They also varied the type (in-domain vs. out-of-domain utterances),
length, and content of utterances, and took into account the age and native language of
raters as well as their familiarity with speech synthesis.

State of the Art in Speech Synthesis

Below we focus on unit selection and HMM-based synthesis because they are considered the current state of the art in speech synthesis. Unit selection is the dominant approach for commercial speech synthesis, whereas the focus of most researchers in the field is on HMM-based synthesis.

Unit Selection Speech Synthesis


Concatenative speech synthesis means concatenating or gluing together segments of
recorded speech. The larger the segments the higher the quality of the resulting output,
but also the larger the number of possible combinations of segments that need to be
recorded for full phonetic coverage. This issue makes concatenative speech synthesis
of large segments impractical and not cost-effective. In diphone-based speech synthesis
all the diphones of a language are recorded under highly controlled recording studio
conditions, forming a database of speech. Note that diphones are sound segments from
the middle of one phone to the middle of the next phone (although in practice there can
be deviations from this definition). At runtime a synthetic utterance is created by putting
together the best sequence of diphones retrieved from the database. The problem with
this approach is that the same diphone may sound different depending on the context,
which is not taken into account. Thus diphone-based speech synthesis does not produce
realistic speech.
Unit selection speech synthesis is an extension of diphone-based synthesis and is currently considered the state of the art for commercial speech synthesis. The difference between diphone-based and unit selection speech synthesis is that the latter uses many recordings of the same subword unit in different contexts. Note that a unit is not necessarily a diphone; other subword segments may be used (Kishore & Black, 2003). ATR
v-TALK was the first speech synthesis system to be based on unit selection (Sagisaka
et al., 1992). Then CHATR generalized unit selection to multiple languages and also
provided an automatic training method (Hunt & Black, 1996).
In unit selection each target utterance (utterance to be synthesized) is a sequence of
target units (e.g., diphones augmented with context) determined in the pre-processing
stage. The target cost $C_t(u_i, t_i)$ is an estimate of the difference between a database unit $u_i$ and the target unit $t_i$, while the concatenation (or join) cost $C_c(u_{i-1}, u_i)$ is an estimate of how well consecutive units $u_{i-1}$ and $u_i$ join. The goal is to select the sequence of database units that minimizes the total cost, i.e., the sum of all target and concatenation costs. There is also unit selection based on clustering of units of the same phoneme class using a decision tree (Black & Taylor, 1997).
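The search for the best unit sequence is typically carried out with dynamic programming (a Viterbi-style search). The sketch below illustrates the idea; the two cost functions passed in are placeholders for the spectral and prosodic distances used in real systems.

# Sketch of the unit selection search: pick one database unit per target unit
# so that the sum of target and concatenation costs is minimal.
# candidates[i] is the list of database units available for target unit i;
# target_cost and concat_cost are toy placeholders supplied by the caller.

def select_units(targets, candidates, target_cost, concat_cost):
    T = len(targets)
    # best[i][k]: minimal total cost of a path ending in candidate k at position i
    best = [[target_cost(u, targets[0]) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, T):
        row_cost, row_back = [], []
        for u in candidates[i]:
            options = [
                best[i - 1][j] + concat_cost(prev, u)
                for j, prev in enumerate(candidates[i - 1])
            ]
            j_min = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[j_min] + target_cost(u, targets[i]))
            row_back.append(j_min)
        best.append(row_cost)
        back.append(row_back)
    # Backtrack from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for i in range(T - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)]

For T target units and N candidates per unit this search is O(TN^2), which is one reason why practical systems typically prune the candidate lists beforehand.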
Unit selection simulates neutral read aloud speech quite well, but it is very sensitive
to the size and quality of the database of speech units. Designing the recording script
(Kominek & Black, 2004) is a very important part of the process of building a unit
selection synthetic voice. In the past, different greedy algorithms for selecting the opti-
mal script from a large text corpus were proposed (Bozkurt, Ozturk, & Dutoit, 2003).
Because in unit selection the number of possible contexts is huge, there will always be
cases where the required unit is not present in the database, and there will always be bad
joining of units. As discussed below in Challenges for the Future, this issue becomes
more prominent in conversational and emotional speech synthesis where the number of
possible contexts grows even larger. To deal with this problem researchers came up with
the idea of limited-domain speech synthesis (Black & Lenzo, 2000). In this approach,
a synthetic voice is trained using material from the domain where it will be deployed,
which means that only diphones, context, etc. that are relevant to this domain are con-
sidered. Limited-domain speech synthesis achieves high quality within the domain it is
trained on, but in a different domain it performs worse than standard general purpose
(not limited-domain) speech synthesis (Georgila et al., 2012).
Typically at least 4–5 hours of clean speech is required for building a high cover-
age unit selection voice for synthesizing neutral read aloud speech. This is to ensure
that all units are recorded clearly and with a neutral tone in every possible context,
which is a painstaking procedure. The larger the database of a unit selection voice is, the better the synthetic voice sounds, which means that good unit selection voices
have large footprints (about 170 MB of storage space is required for a standard neutral-
style voice). This could potentially cause problems for mobile platform applications
that require many voices, or in cases where many applications need to be installed on
the same mobile device (where memory can be limited). Another problem with unit
selection voices is that they are usually built using professional actor recordings over
the course of days or weeks, which could lead to inconsistencies in the resulting voices
(Stylianou, 1999). This is because human voices can be influenced by various factors,
such as whether the person is tired, has a cold, has been talking for a long time, etc. Note
that the problem of inconsistencies also applies to human recordings. Furthermore, unit
selection synthesis is not robust to noise and is highly dependent on the quality of the
recording conditions (Zen, Tokuda, & Black, 2009).

HMM-based Speech Synthesis


Hidden Markov models (HMMs) are statistical time series models that have been widely
used in speech recognition (Young et al., 2009). They consist of a weighted finite-state
network of states and transitions. Each state can generate observations based on a prob-
ability distribution, usually a mixture of multivariate Gaussians. The use of HMMs for
speech synthesis has been inspired by the success of HMMs for speech recognition.
HMM-based speech synthesis involves two major tasks. The first task is to find the
best parameters to represent the speech signal. Because these parameters are used to
reconstruct the speech waveform, they are called vocoder (short for voice encoder)
parameters, and should be chosen carefully so that the resulting speech sounds natu-
ral. The vocoder parameters are the observations of the HMM model. Speech recogni-
tion systems usually use mel-frequency cepstral coefficients (MFCCs) with energy, and
their delta and delta-delta coefficients (dynamic coefficients) for modeling their rate of
change (Young et al., 2009). MFCCs have been very successful in speech recognition
because of their good phone discrimination properties. However, the purpose of HMM-based speech synthesis is to generate high-quality speech; therefore more parameters are required, namely, parameters representing the spectral envelope, the fundamental frequency
(F0), and aperiodic (noise-like) components.
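As an illustration of the dynamic (delta and delta-delta) coefficients mentioned above, the sketch below computes them with a simple symmetric-difference formulation; real systems may use different window lengths and regression formulas (Young et al., 2009).

import numpy as np

def add_dynamic_features(static):
    # static: array of shape (T, D) holding per-frame vocoder parameters.
    # Delta features approximate the first derivative with a symmetric
    # difference; delta-delta features are the delta of the deltas.
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")
    delta = (padded[2:] - padded[:-2]) / 2.0
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta_delta = (padded_d[2:] - padded_d[:-2]) / 2.0
    return np.hstack([static, delta, delta_delta])

frames = np.random.randn(100, 40)              # e.g., 100 frames of 40 parameters
observations = add_dynamic_features(frames)    # shape (100, 120)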
The second task is selecting the right modeling unit. Typically, for speech recog-
nition, triphones are used, i.e., single phonemes in the context of the preceding and
the following phoneme. As discussed in the section on the pre-processing module, for
speech synthesis, in addition to phonetic context, we need to take into account other
types of context. For example, the HTS system for English (Zen, Nose et al., 2007)
uses the current phoneme, the preceding and following two phonemes, syllable struc-
ture, word structure, utterance structure, stress, intonation, etc. From now on we will
call these units full-context HMMs. These full-context HMMs are trained on labelled
data. Note also that unlike HMMs used for speech recognition, where duration is mod-
elled as state self-transitions, in HMM-based synthesis we have an explicit duration
model (Yoshimura et al., 1998). For modeling state durations we may have Gaussian or
Gamma distributions (Zen et al., 2009).
During synthesis the text to be synthesized is transformed into a sequence of full-context labels. Then each full-context label is replaced by the corresponding full-context
HMM. Thus now we have a sequence of full-context HMMs, and because each HMM
has a number of states, we end up with a long sequence of states. The next step is
to generate the most likely sequence of observations from the sequence of models.
Using the maximum likelihood criterion, each state will generate its mean observa-
tion. Thus the result will be the sequence of means of the Gaussians in the visited
states. Obviously with this approach we will end up with “jumps” at state boundaries, which is far from natural-sounding speech. The solution to this problem was
provided by Tokuda et al. (2000). The main idea is to put constraints on what observa-
tions can be generated, using dynamic features (delta and delta-delta coefficients). The
final step is to convert these observations (vocoder parameters) into a speech waveform
using excitation generation and a speech synthesis filter, e.g., mel-log spectrum approx-
imation (MLSA) filter (Imai, Sumita, & Furuichi, 1983; Zen et al., 2009). The most
popular vocoder for HMM-based synthesis is STRAIGHT (speech transformation and
representation using adaptive interpolation of weighted spectrum) (Kawahara, Masuda-
Katsuse, & de Cheveigné, 1999). Figure 19.1 shows a schematic comparison of unit
selection and HMM-based synthesis at runtime.
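The parameter generation step can be written as a linear problem: if $W$ is the matrix that maps a static trajectory $c$ to the stacked static-plus-dynamic observations, and the visited states provide means $\mu$ and diagonal variances $\Sigma$, the maximum likelihood trajectory solves $(W^\top \Sigma^{-1} W)\,c = W^\top \Sigma^{-1} \mu$. The sketch below, a simplification of the algorithm of Tokuda et al. (2000), solves this for a single one-dimensional stream with only a delta feature; real systems also include delta-delta features and process every dimension of the vocoder parameters.

import numpy as np

def mlpg_1d(means, variances):
    # means, variances: arrays of shape (T, 2) holding per-frame state means and
    # variances for the static feature and its delta (delta_t = (c_{t+1} - c_{t-1}) / 2).
    # Returns the smooth static trajectory c that maximizes the likelihood.
    T = means.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row: picks c_t
        if 0 < t < T - 1:                      # delta row: (c_{t+1} - c_{t-1}) / 2
            W[2 * t + 1, t - 1] = -0.5
            W[2 * t + 1, t + 1] = 0.5
    mu = means.reshape(-1)                     # interleaved static and delta means
    prec = 1.0 / variances.reshape(-1)         # diagonal of Sigma^{-1}
    A = W.T @ (prec[:, None] * W)              # W^T Sigma^{-1} W
    b = W.T @ (prec * mu)                      # W^T Sigma^{-1} mu
    return np.linalg.solve(A, b)

Because the delta rows couple neighbouring frames, the resulting trajectory no longer jumps at state boundaries.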
Note that the idea of using dynamic features to constrain the observations that can
be generated led to the development of trajectory HMMs (Zen, Tokuda, & Kitamura,
2007; Zhang & Renals, 2008). Trajectory HMMs are designed to alleviate a major weak-
ness of HMMs, the fact that the output probability of an observation depends only on
the current state (state conditional independence assumption) (Zen, Tokuda et al.,
2007).
Similarly to unit selection, it is very likely that the units required for modeling the
input utterance have not been seen in the training data. This is because the context that
we use is quite complex and it is not just localized but also spans the full utterance.
Therefore it is highly unlikely that our training data will be so large that all possible
units and their context will be fully covered. To deal with such data sparseness, generalization techniques are employed, such as sharing models across contexts; for example, commonalities between states can be exploited by “tying” similar states (as in speech recognition), and decision trees can be used to cluster similar models together (Young et al., 2009).
As explained above, with HMM-based synthesis there is no need to pre-record every
possible speech unit in every possible context, which in turn means that a HMM-based voice requires much less data and can thus be built at a much lower cost than a unit selection voice. Furthermore, storing the parameters of a HMM model
(from which speech is reconstructed) requires much less space than unit selection syn-
thesis (typically less than 2 MB of storage space), which makes HMM-based synthesis
ideal for mobile devices.
Another advantage of HMM-based speech synthesis is that we can easily change
voice characteristics and speaking style just by modifying its model parameters (Zen
et al., 2009). This is because it is easier to manipulate a statistical model to gen-
erate a different speaking style than perform signal processing on recorded subword speech units (Barra-Chicote et al., 2010).

Figure 19.1 A schematic comparison of unit selection and HMM-based speech synthesis at runtime. [The figure shows both pipelines starting from the input text: a pre-processing (text analysis) stage produces full-context labels (target units in the case of unit selection); unit selection then retrieves units from a database of units using target and concatenation costs to form the speech waveform, whereas HMM-based synthesis selects models from a database of HMMs, generates parameters, and produces the speech waveform through excitation generation and a synthesis filter.]

With HMM-based speech synthesis it is also
possible to generate new voices (that do not sound like any specific speaker) by mixing
existing voices, thus there is potential for an infinite number of voices. This is done by
interpolating the parameters of the HMM models (Yoshimura et al., 1997). Also, so far
HMM-based speech synthesis has been used for building voices in more than forty-five
different languages and dialects (Zen et al., 2009).
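In its simplest form, the interpolation of Yoshimura et al. (1997) mentioned above can be illustrated as a weighted combination of the output distributions of matched states from two speaker-dependent models; the sketch below interpolates only Gaussian means and variances and is a simplification of the published technique.

import numpy as np

def interpolate_states(state_a, state_b, weight):
    # state_a, state_b: dicts with 'mean' and 'var' arrays for matched HMM states.
    # weight: interpolation ratio in [0, 1]; 0 gives voice A, 1 gives voice B.
    # A simplified mean/variance interpolation; real systems interpolate the
    # full model sets (Yoshimura et al., 1997).
    mean = (1.0 - weight) * state_a["mean"] + weight * state_b["mean"]
    var = (1.0 - weight) * state_a["var"] + weight * state_b["var"]
    return {"mean": mean, "var": var}

voice_a = {"mean": np.array([1.0, 2.0]), "var": np.array([0.5, 0.5])}
voice_b = {"mean": np.array([3.0, 1.0]), "var": np.array([0.7, 0.3])}
print(interpolate_states(voice_a, voice_b, 0.25))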
HMM-based speech synthesis shares some of the benefits of HMM-based speech
recognition. HMMs can be adapted to a particular speaker to improve speech recog-
nition performance. Recently the same idea was used for speech synthesis, which is
called speaker-adaptive HMM-based speech synthesis, as opposed to standard HMM-
based speech synthesis (also called speaker-dependent HMM-based speech synthesis)
(Yamagishi et al., 2009, 2010). The process is as follows. First an average voice is built
using speech from multiple speakers. This average voice will serve as the starting point
for any new voice to be developed, i.e., every time a new voice that sounds like a particular target speaker is needed, the parameters of the average voice are adapted to capture the voice characteristics of this target speaker, using small amounts of speech from the target speaker (speaker-adaptive speech synthesis). So once we have
built an average voice, using this technique, we can build new speaker-specific voices
with very small amounts of data.
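The systems cited here use transform-based adaptation of the average-voice parameters; as a much simpler illustration of the same idea, the sketch below performs a MAP-style interpolation of an average-voice state mean towards the statistics of a small amount of target-speaker data (the function name and prior weight are illustrative, not those of the published systems).

import numpy as np

def adapt_state_mean(average_mean, adaptation_frames, tau=10.0):
    # average_mean: mean vector of one state of the average voice.
    # adaptation_frames: (N, D) array of target-speaker frames aligned to this state.
    # tau: prior weight; the fewer adaptation frames there are, the closer the
    # adapted mean stays to the average voice.
    n = adaptation_frames.shape[0]
    if n == 0:
        return average_mean
    data_mean = adaptation_frames.mean(axis=0)
    return (tau * average_mean + n * data_mean) / (tau + n)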
In an experiment presented by Yamagishi et al. (2009), it was found that in terms
of both naturalness and similarity to the target speaker’s real voice, a HMM voice
built using speaker-adaptive synthesis and one hour of speech performed better than
a state-of-the-art unit selection voice built using eight hours of speech, and similarly
to a HMM voice based on speaker-adaptive synthesis also built using eight hours of
speech. What is also interesting is that, in terms of naturalness, a HMM voice built using
speaker-adaptive HMM synthesis and only six minutes of speech performed better than
a unit selection voice built using one hour of speech. Speaker-adaptive HMM synthe-
sis performed consistently better than speaker-dependent HMM synthesis for the same
amount of training data (6 minutes and 1 hour) in terms of naturalness and similarity
to the target speaker’s real voice. But their performance was similar when eight hours
of training data were used. The performance of unit selection synthesis deteriorated
for out-of-domain sentences, whereas HMM-based synthesis was more robust in such
cases.
Of course a very carefully built high coverage unit selection voice most likely will
sound better than a HMM-based voice. But as we have seen, HMM voices have many advantages, and that is why they are currently considered the most promising technique for speech synthesis. Also, hybrids of HMM-based synthesis and unit selection synthe-
sis have been proposed, and shown to generate highly natural sounding speech when
clean speech data of the target speaker are available (Ling & Wang, 2006). As discussed
above, HMM-based synthesis uses a vocoder to generate a waveform from the speech
parameters (observations of the HMM model). Instead, in the hybrid HMM and unit
selection method, the output of the HMM model is used to predict the target units or
calculate costs. The units can be frame-sized, HMM-state sized, phone-sized, diphone-
sized, or nonuniform-sized (Zen et al., 2009). Finally, HMM-based synthesis has also
been combined with articulatory synthesis. Ling et al. (2008) proposed a model where
articulatory features were integrated into a HMM-based synthesis system. The method
was successful in changing the overall character of the synthetic speech and the quality
of specific phones.

Challenges for the Future

As discussed above, both unit selection and HMM-based synthesis have reached high
performance levels for neutral read aloud speech. For many applications such a neu-
tral style is sufficient. However, in some applications, such as spoken dialogue systems
(Lemon et al., 2006; Georgila et al., 2010), it is important that the system sounds as if
it is engaged in the conversation and thus we need to build synthetic voices for con-
versational speech (conversational speech synthesis). There are also dialogue system
applications where it is of great importance that the system is able to show empathy
by expressing appropriate emotions depending on the user’s input and the dialogue
context (DeVault et al., 2014), which means that there is a great need for emotional
speech synthesis.
As discussed in a previous section, with HMM-based speech synthesis we can easily
change voice characteristics and speaking style just by modifying its model parame-
ters. However, there are some constraints; for example, to build a child’s voice we need
recorded speech from a child (Ling et al., 2008). In an ideal world, we should be able to
build a child’s voice from adult speech data or vice versa based on phonetic rules about
the differences in the speech of a child and an adult. However, in practice this is very
difficult and constitutes a future challenge for the field of speech synthesis. Ling et al.
(2008) propose articulatory HMM-based synthesis as a solution to this problem. Their
rationale is that because articulatory features have physiological meanings, they can
explain speech characteristics in a simpler way, and can also be modified more easily
based on phonetic rules and linguistic knowledge.
Voice cloning means building a copy of a person’s voice. Voice cloning can be used for entertainment, educational, and medical purposes. With regard to entertainment or educational applications, imagine being able, 100 years from now, to engage in a dialogue with people of our time; a similar idea is presented by Artstein et al. (2014). In terms of
medical applications, a famous case of voice cloning is the development by CereProc
Ltd. (www.cereproc.com) of a synthetic version of the voice of film critic Roger Ebert,
who had lost the ability to speak due to health problems. In particular, with regard to
voice cloning as an aid to people who have lost their voice, there are many open research
questions. What if the amount or quality of the recordings of the person who needs a
synthetic voice is not adequate? What if there are only recordings of that person after the
impairment occurred? There should be a way to reconstruct the “healthy” voice from
“unhealthy” recordings. These are all major research problems for future work.
Another challenge for the future is sharing resources so that it is easier to perform
comparisons of different speech synthesis methods. For example, we need more corpora
designed for building synthetic voices, such as the CMU ARCTIC speech databases
(Kominek & Black, 2004). We also need to organize more challenges where differ-
ent speech synthesis systems could compete under the same conditions (same data for
training and same evaluation criteria). The Blizzard Challenge (Black & Tokuda, 2005),
which started in 2005 and takes place annually, is a great step in this direction.
Below we focus on conversational and emotional speech synthesis, two major chal-
lenges in the field of speech synthesis. These two areas may overlap, i.e., an emo-
tional voice can have conversational characteristics and vice versa. We provide a brief
overview of what has been achieved so far. Nevertheless, despite much progress in
recent years, both issues are far from being solved.

Conversational Speech Synthesis


Spontaneous conversational speech exhibits characteristics that are hard to model in
speech synthesis, e.g., pronunciation variation (Werner & Hoffmann, 2007), speech dis-
fluencies (repetitions, repairs, hesitations) (Georgila, 2009), paralinguistics (laughter,
breathing) (Campbell, 2006), etc. Although there has been some work on generating
synthetic speech with conversational characteristics, the amount of this work is limited
compared to the vast amount of research on synthesis of neutral speech.
Much work has focused on filled pauses (e.g., “uh”, “um”), in particular, on predicting
where to insert filled pauses in an utterance so that it sounds natural, and how to syn-
thesize such filled pauses. Adell, Bonafonte, and Escudero (2006) developed a prosodic
model of the hesitation before a filled pause. Data sparsity problems were overcome by
always synthesizing filled pauses surrounded by silent pauses.
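As a crude illustration of the general idea (not of the prosodic model of Adell et al. (2006) itself), the sketch below inserts a filled pause, flanked by silent pauses, at randomly chosen phrase boundaries before the text is handed to the synthesizer; the probability value and pause tokens are invented.

import random

def insert_filled_pauses(phrases, probability=0.3, seed=None):
    # phrases: list of phrase strings produced by the front end.
    # With the given probability, a filled pause surrounded by silent pauses
    # ("<sil> uh <sil>") is inserted at a phrase boundary.
    rng = random.Random(seed)
    out = []
    for i, phrase in enumerate(phrases):
        out.append(phrase)
        if i < len(phrases) - 1 and rng.random() < probability:
            out.append("<sil> uh <sil>")
    return " ".join(out)

print(insert_filled_pauses(["well I think", "we could go", "tomorrow morning"], seed=1))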
Andersson et al. (2010) were able to synthesize filled pauses even when they were
not surrounded by silent pauses. They also synthesized lexical fillers (e.g., “you know”,
“well”, “so”). This was accomplished by training a unit selection synthesizer on a com-
bination of neutral read aloud and spontaneous speech from the same speaker. The
result was synthetic speech with a more conversational character, compared to synthetic
speech trained only on neutral read aloud data. In terms of naturalness, both synthetic
voices performed similarly. Then Andersson, Yamagishi, and Clark (2012) applied the
same approach (combining neutral and conversational speech training data) to HMM-
based synthesis. Sundaram and Narayanan (2002) also used spontaneous speech data
for synthesizing spontaneous monologues. But unlike Andersson et al. (2010) who used
general purpose unit selection, they employed limited-domain unit selection.
Campbell (2007) recorded a corpus of spontaneous speech from an adult female
speaker in her everyday life over a period of five years. This corpus was used to build a concatena-
tive synthetic voice, but concatenation was only allowed at phrase boundaries. Werner
and Hoffmann (2007) modelled sequences of pronunciation variants in order to gener-
ate a more conversational style of speech. Finally, Székely et al. (2012) used audio book
data to build voices, which were then tested on conversational phrases, i.e., sentences
that commonly occur in conversations. They did not model disfluencies or pronunciation
variation.

Emotional Speech Synthesis


Both unit selection and HMM-based speech synthesis techniques have been used for
synthesizing emotions (Barra-Chicote et al., 2010). As discussed in a previous section,
unit selection is very sensitive to the size and quality of the database of speech units.
For emotional speech, in particular, this means that for each subword unit the number
of possible contexts can grow extremely large. Thus in addition to the units required
for neutral speech, we need to record units for a large variety of emotions. This in turn
means that using unit selection for emotional synthesis can be an extremely expensive
process (Barra-Chicote et al., 2010).
A solution to this problem is to use emotional speech corpora, extract rules for modifying the target F0, duration, and intonation contours, and incorporate these rules
into unit selection (Pitrelli et al., 2006). One problem with this approach is that it does
not always allow for synthesizing emotions for arbitrary speakers. Another issue is that
if some required units are not part of the speech database, signal processing manipula-
tion will be required, which usually negatively affects the quality of the resulting speech
(Barra-Chicote et al., 2010).
On the other hand, the problem with HMM-based synthesis is that the resulting
speech does not have the variability and richness in spectra and prosodic patterns that we
see in natural speech. However, the main advantage of HMM-based speech synthesis is
that we can easily change voice characteristics and speaking style just by modifying its
model parameters (Zen et al., 2009). For example, Qin et al. (2006) trained an average
emotion model on a multi-emotion speech database and then adapted this model to a
target emotion not included in the training data. The idea is similar to speaker-adaptive
speech synthesis presented in the section on HMM-based speech synthesis.
Barra-Chicote et al. (2010) performed a direct comparison of unit selection and
HMM-based synthesis for emotional speech. Their findings were that both methods
had similar performance in terms of the quality of the generated emotional speech. Unit
selection produced emotional speech of higher strength, whereas HMM-based synthesis
allowed for adjusting the strength of emotions. Furthermore, unit selection had issues
with prosodic modeling whereas HMM-based synthesis could benefit from improve-
ments to spectral modeling.
Most research on emotional speech synthesis is based on speech corpora that contain
acted emotions, i.e., actors are asked to simulate emotions such as happiness, sadness,
anger, etc. However, such simulated emotions differ significantly from emotions that we
experience in the real world (Douglas-Cowie, Cowie, & Schröder, 2000). Due to ethical
and privacy concerns, a major challenge in emotional speech synthesis is acquiring
speech that exhibits real emotions.

Conclusion

We presented an overview of speech synthesis research and briefly described how a speech synthesizer works. We placed particular emphasis on unit selection and HMM-based synthesis, the two most popular state-of-the-art speech synthesis methods.
Finally, we presented a number of challenges for the future of speech synthesis, focusing
on conversational and emotional speech synthesis.

Acknowledgment

Research for this chapter was supported by the US Army. Any opinion, content, or information presented does not necessarily reflect the position of the United States Government, and no official endorsement should be inferred.

References

Adell, J., Bonafonte, A., & Escudero, D. (2006). Disfluent speech analysis and synthesis: A pre-
liminary approach. In Proceedings of the International Conference on Speech Prosody.
Andersson, S., Georgila, K., Traum, D., Aylett, M., & Clark, R. A. J. (2010). Prediction and
realisation of conversational characteristics by utilising spontaneous speech for unit selection.
In Proceedings of the International Conference on Speech Prosody.
Andersson, S., Yamagishi, J., & Clark, R. A. J. (2012). Synthesis and evaluation of conversational
characteristics in HMM-based speech synthesis. Speech Communication, 54(2), 175–188.
Artstein, R., Traum, D., Alexander, O., et al. (2014). Time-offset interaction with a holocaust sur-
vivor. In Proceedings of the International Conference on Intelligent User Interfaces (pp. 163–
168).
Barra-Chicote, R., Yamagishi, J., King, S., Montero, J. M., & Macias-Guarasa, J. (2010). Anal-
ysis of statistical parametric and unit selection speech synthesis systems applied to emotional
speech. Speech Communication, 52(5), 394–404.
Black, A. W. & Lenzo, K. A. (2000). Limited domain synthesis. In Proceedings of the Interna-
tional Conference on Spoken Language Processing (vol. 2, pp. 411–414).
Black, A. W. & Taylor, P. (1997). Automatically clustering similar units for unit selection in
speech synthesis. In Proceedings of the European Conference on Speech Communication and
Technology (pp. 601–604).
Black, A. W. & Tokuda, K. (2005). The Blizzard challenge – 2005: Evaluating corpus-based
speech synthesis on common datasets. In Proceedings of the European Conference on Speech
Communication and Technology (pp. 77–80).
Bozkurt, B., Ozturk, O., & Dutoit, T. (2003). Text design for TTS speech corpus building using
a modified greedy selection. In Proceedings of the European Conference on Speech Communi-
cation and Technology (pp. 277–280).
Campbell, N. (2006). Conversational speech synthesis and the need for some laughter. IEEE
Transactions on Audio, Speech, and Language Processing, 14(4), 1171–1178.
Campbell, N. (2007). Towards conversational speech synthesis: Lessons learned from the expres-
sive speech processing project. In Proceedings of the ISCA Workshop on Speech Synthesis
(pp. 22–27).
DeVault, D., Artstein, R., Benn, G., et al. (2014). SimSensei kiosk: A virtual human interviewer
for healthcare decision support. In Proceedings of the International Conference on Autonomous
Agents and Multiagent Systems (pp. 1061–1068).
Douglas-Cowie, E., Cowie, R., & Schröder, M. (2000). A new emotion database: Considerations,
sources and scope. In Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on
Speech and Emotion (pp. 39–44).
Georgila, K. (2009). Using integer linear programming for detecting speech disfluencies. In Pro-
ceedings of the Human Language Technology Conference of the North American Chapter of
the Association for Computational Linguistics (HLT-NAACL). Companion volume: short papers
(pp. 109–112).
Georgila, K., Black, A. W., Sagae, K., & Traum, D. (2012). Practical evaluation of human and
synthesized speech for virtual human dialogue systems. In Proceedings of the International
Conference on Language Resources and Evaluation (pp. 3519–3526).
Georgila, K., Sgarbas, K., Tsopanoglou, A., Fakotakis, N., & Kokkinakis, G. (2003). A speech-
based human–computer interaction system for automating directory assistance services. Inter-
national Journal of Speech Technology (special issue on Speech and Human Computer Inter-
action), 6(2), 145–159.
Georgila, K., Wolters, M., Moore, J. D., & Logie, R. H. (2010). The MATCH corpus: A corpus of
older and younger users’ interactions with spoken dialogue systems. Language Resources and
Evaluation, 44(3), 221–261.
Hunt, A. J. & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using
a large speech database. In Proceedings of the IEEE International Conference on Acoustics,
Speech, and Signal Processing (pp. 373–376).
Imai, S., Sumita, K., & Furuichi, C. (1983). Mel log spectrum approximation (MLSA) filter for
speech synthesis. Electronics and Communications in Japan, 66(2), 10–18.
Iskarous, K., Goldstein, L. M., Whalen, D. H., Tiede, M. K., & Rubin, P. E. (2003). CASY: The
Haskins configurable articulatory synthesizer. In Proceedings of the International Congress of
Phonetic Sciences (pp. 185–188).
Karaiskos, V., King, S., Clark, R. A. J., & Mayo, C. (2008). The Blizzard challenge 2008. In
Proceedings of the Blizzard Challenge Workshop.
Kawahara, H., Masuda-Katsuse, I., & de Cheveigné, A. (1999). Restructuring speech representa-
tions using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based
F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3–
4), 187–207.
King, S. (2010). A tutorial on HMM speech synthesis. In Sadhana – Academy Proceedings in
Engineering Sciences, Indian Institute of Sciences.
Kishore, S. P. & Black, A. W. (2003). Unit size in unit selection speech synthesis. In Proceedings
of the European Conference on Speech Communication and Technology (pp. 1317–1320).
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. Journal of the Acoustical
Society of America, 67(3), 971–995.
Kominek, J. & Black, A. W. (2004). The CMU ARCTIC speech databases. In Proceedings of the
ISCA Workshop on Speech Synthesis (pp. 223–224).
Lemon, O., Georgila, K., Henderson, J., & Stuttle, M. (2006). An ISU dialogue system exhibiting
reinforcement learning of dialogue policies: Generic slot-filling in the TALK in-car system. In
Proceedings of the Conference of the European Chapter of the Association for Computational
Linguistics (EACL) – Demonstrations (pp. 119–122).
Ling, Z.-H., Richmond, K., Yamagishi, J., & Wang, R.-H. (2008). Articulatory control of HMM-
based parametric speech synthesis driven by phonetic knowledge. In Proceedings of the Annual
Conference of the International Speech Communication Association (pp. 573–576).
Ling, Z.-H. & Wang, R.-H. (2006). HMM-based unit-selection using frame sized speech seg-
ments. In Proceedings of the International Conference on Spoken Language Processing
(pp. 2034–2037).
Narayanan, S., Alwan, A., & Haker, K. (1997). Toward articulatory-acoustic models for liquid
approximants based on MRI and EPG data: Part I, The laterals. Journal of the Acoustical
Society of America, 101(2), 1064–1077.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), art.
10.
Pitrelli, J. F., Bakis, R., Eide, E. M., et al. (2006). The IBM expressive text-to-speech synthesis
system for American English. IEEE Transactions on Audio, Speech, and Language Processing,
14(4), 1099–1108.
Qin, L., Ling, Z.-H., Wu, Y.-J., Zhang, B.-F., & Wang, R.-H. (2006). HMM-based emotional
speech synthesis using average emotion model. Lecture Notes in Computer Science, 4274, 233–
240.
Sagisaka, Y., Kaiki, N., Iwahashi, N., & Mimura, K. (1992). ATR v-TALK speech synthesis sys-
tem. In Proceedings of the International Conference on Spoken Language Processing (pp. 483–
486).
Schultz, T., Black, A. W., Vogel, S., & Woszczyna, M. (2006). Flexible speech translation systems.
IEEE Transactions on Audio, Speech, and Language Processing, 14(2), 403–411.
Socher, R., Bauer, J., Manning, C. D., & Ng, A. Y. (2013). Parsing with compositional vector
grammars. In Proceedings of the Annual Meeting of the Association for Computational Lin-
guistics (pp. 455–465).
Stylianou, Y. (1999). Assessment and correction of voice quality variabilities in large speech
databases for concatenative speech synthesis. In Proceedings of the IEEE International Con-
ference on Acoustics, Speech, and Signal Processing (pp. 377–380).
Sundaram, S. & Narayanan, S. (2002). Spoken language synthesis: Experiments in synthesis of
spontaneous monologues. In Proceedings of the IEEE Speech Synthesis Workshop (pp. 203–
206).
Székely, É., Cabral, J. P., Abou-Zleikha, M., Cahill, P., & Carson-Berndsen, J. (2012). Eval-
uating expressive speech synthesis from audiobooks in conversational phrases. In Proceed-
ings of the International Conference on Language Resources and Evaluation (pp. 3335–
3339).
Taylor, P. (2009). Text-to-speech Synthesis. New York: Cambridge University Press.
Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter
generation algorithms for HMM-based speech synthesis. In Proceedings of the IEEE Interna-
tional Conference on Acoustics, Speech, and Signal Processing (pp. 1315–1318).
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tag-
ging with a cyclic dependency network. In Proceedings of the Human Language Technology
Conference of the North American Chapter of the Association for Computational Linguistics
(HLT-NAACL) (pp. 173–180).
Von Kempelen, W. (1791). Mechanismus der menschlichen Sprache nebst Beschreibung einer
sprechenden Maschine. Vienna: J. V. Degen.
Werner, S., & Hoffmann, R. (2007). Spontaneous speech synthesis by pronunciation variant selec-
tion: A comparison to natural speech. In Proceedings of the Annual Conference of the Interna-
tional Speech Communication Association (pp. 1781–1784).
Yamagishi, J., Nose, T., Zen, H., et al. (2009). Robust speaker-adaptive HMM-based text-to-
speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 17(6),
1208–1230.
Yamagishi, J., Usabaev, B., King, S., et al. (2010). Thousands of voices for HMM-based speech
synthesis-analysis and application of TTS systems built on various ASR corpora. IEEE Trans-
actions on Audio, Speech, and Language Processing, 18(5), 984–1004.
Yoshimura, T., Masuko, T., Tokuda, K., Kobayashi, T., & Kitamura, T. (1997). Speaker interpola-
tion in HMM-based speech synthesis system. In Proceedings of the European Conference on
Speech Communication and Technology (pp. 2523–2526).
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1998). Duration modeling
for HMM-based speech synthesis. In Proceedings of the International Conference on Spoken
Language Processing (pp. 29–32).
Young, S., Evermann, G., Gales, M., et al. (2009). The HTK Book (for HTK version 3.4). Cam-
bridge: Cambridge University Press.
Zen, H., Nose, T., Yamagishi, J., et al. (2007). The HMM-based speech synthesis system (HTS)
version 2.0. In Proceedings of the ISCA Workshop on Speech Synthesis (pp. 294–299).
Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Com-
munication, 51(11), 1039–1064.
Zen, H., Tokuda, K., & Kitamura, T. (2007). Reformulating the HMM as a trajectory model by
imposing explicit relationships between static and dynamic feature vector sequences. Computer Speech and Language, 21(1), 153–173.
Zhang, L., & Renals, S. (2008). Acoustic-articulatory modeling with the trajectory HMM. IEEE
Signal Processing Letters, 15, 245–248.
20 Body Movements Generation for
Virtual Characters and Social Robots
Aryel Beck, Zerrin Yumak, and Nadia Magnenat-Thalmann

Introduction

It has long been accepted in traditional animation that a character’s expressions must
be captured throughout the whole body as well as the face (Thomas & Johnston, 1995).
Existing artificial agents express themselves using facial expressions, vocal intonation,
body movements, and postures. Body language has been a focus of interest in research
on embodied agents (virtual humans and social robots). It can be separated into four
different areas that should be considered when animating virtual characters as well as
social robots. (1) Postures: postures are specific positions that the body takes during
a timeframe.Postures are an important modality during social interaction and play an
important role as they can signal liking and affiliation (Lakin et al., 2003). Moreover,
it has been established that postures are an effective medium to express emotion for
humans (De Silva & Bianchi-Berthouze, 2004). Thus, virtual humans and social robots
should be endowed with the capability to display adequate body postures. (2) Move-
ment or gestures: throughout most of our daily interactions, gestures are used along
with speech for effective communication (Cassell, 2000). For a review of the types of
gestures that occur during interactions the reader can refer to Cassell (2000). Move-
ments are also important for expressing emotions. Indeed, it has been shown that many
emotions are differentiated by characteristic body movements and that these are effec-
tive clues for judging the emotional state of other people in the absence of facial and
vocal clues (Atkinson et al., 2004). Body movements include the movements themselves
as well as the manner in which they are performed, i.e. speed of movements, dynamics,
and curvature – something captured by the traditional animation principles (Thomas &
Johnston, 1995; Beck, 2012). Moreover, it should be noted that body movements occur
in interaction with other elements, such as speech, facial expressions, gaze, all of which
need to be synchronised. (3) Proxemics: the distance between individuals during a social interaction. It is also indicative of emotional state. For example, angry people have a tendency to reduce the distance during social interaction, although this reduction would also be evident between intimate people. Modelling proxemics is therefore required for realistic behaviour (Walters et al., 2009). (4) Gaze: the way we look at each other
during an interaction is an important modality. It helps us manage speaking turns. It
can also express attention (or lack of it) and is therefore a very active topic of research
for embodied agents. The behaviours generated while interacting with humans should be believable; in other words, they should provide an illusion of life. They should also be responsive to what interactants are doing and to what is happening in the environment (Thiebaux et al., 2008). They should also be meaningful and interpretable; in other words, they should reflect the inner state of the artificial agent (Thiebaux et al., 2008). In order to be
successful, postures, gestures, gaze, and proxemics should be considered for animated
characters and social robots. This chapter describes the research conducted in order to
endow artificial agents with the capability to display body language as well as the issues
for synchronizing it with other expressive modalities.

Generating Body Postures

Postures are specific positions that the body takes during a time frame. Strictly speaking, humans do not remain static; however, the overall appearance and position of our bodies constitute an important part of nonverbal behaviour. While interacting, humans tend to unconsciously mimic each other’s postures. Moreover, it has been shown that mimicking body postures and movements facilitates the smoothness of interactions and increases liking between interaction partners. Postural mimicry can also be applied to
human–virtual human interaction and is promising for improving human–agent interac-
tions (Sun & Nijholt, 2011). Body postures are also used by humans in order to express
power (Huang et al., 2010). Interestingly, whether an artificial agent can use this to
express power has not been investigated. Indeed, most of the work on posture genera-
tion has focused on emotional expressions.

Postures and Emotion


Following the seminal work by Wallbott (1998), a body of research endeavours to define
distinctive features of postures that correspond to certain emotions (Coulson, 2004). An
important source of information regarding the expression of emotion through static pos-
tures comes from automatic recognition of emotion. For instance, existing studies in this
field show that collar joint angle and shoulder joint angle are elements that can be used
to automatically recognise emotions (Kleinsmith, Bianchi-Berthouze, & Steed, 2011;
De Silva & Bianchi-Berthouze, 2004). Moreover, Kleinsmith, De Silva, and Bianchi-
Berthouze (2006) investigated cross-cultural recognition of four emotions (anger, fear,
happiness, sadness) through interpretations of body postures. They built a set of emotional postures performed by actors and showed that it was possible for participants
to correctly identify the different emotions. Specific features of body posture have been
isolated, in particular collar and shoulder joint angles, which have been found to be
expressive for adults (Kleinsmith et al., 2011; Beck et al., 2012) as well as for chil-
dren (Beck et al., 2011; Beck, Cañamero et al., 2013). Roether et al. (2009) investigated
the portrayal of emotions through gait. They found that head inclination as well as the
amplitude of the elbow joint angles is particularly salient to the expression of fear and
anger. Thus, a robot displaying emotions has to take up postures appropriate to the emo-
tion. Previous results show that this is an effective medium to convey emotions as it was
found that people correctly identify emotions from postures displayed by a humanoid
robot (Beck et al., 2012, Beck, Cañamero et al., 2013). Moreover, work on emotional
behaviour generation has shown that by blending key poses, it is possible to generate a
continuous space of emotional expressions (Beck et al., 2010). Postures have also been
shown to affect the interpretation of facial expressions when displayed concurrently
(Clavel et al., 2009).

Idle Movements Added to Body Postures


People do not remain static during face-to-face interaction and this is why idle move-
ments play an important part in creating believable characters. For instance, Egges,
Molet, and Magnenat-Thalmann (2004) developed a virtual character with personalized
idle movements and emotion-enabled bodily communication skills using a data-driven method. Procedural methods to generate idle movements are also widely used. Indeed,
one of the most established methods to generate idle movements and behaviour is Perlin
noise (Perlin, 2002). In animation, Perlin noise, a coherent noise that is highly control-
lable, is a well-known tool used to procedurally generate movements and increase the
lifelikeness of animations. It can not only be used to modify movement but also to cre-
ate different types of nonrepetitive and “idle” behaviours. In robotics, Perlin noise and
similar methods have been used, added to joint angles, to increase the lifelikeness of
robot movements and to generate idle behaviours (Snibbe, Scheeff, & Rahardja, 1999;
Ishiguro, 2005). Idle movements contribute to create an illusion of life as well as to con-
vey emotions. Beck, Hiolle, and Cañamero (2013) investigated the use of Perlin noise
on a Nao robot for the generation of idle behaviours. It was found that velocity, jerki-
ness, and amplitude of the generated movements can significantly affect the perceived
emotional pose.
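A minimal illustration of this idea is given below: a simple one-dimensional value-noise generator (standing in for true Perlin noise) produces smooth, nonrepetitive offsets that can be added to a joint angle, with amplitude and speed as the kind of parameters found to affect the perceived emotion.

import math
import random

def make_idle_noise(amplitude=2.0, speed=0.5, seed=0):
    # Returns a function giving a smooth, nonrepetitive offset (in degrees)
    # to add to a joint angle at time t. The amplitude and speed values are
    # illustrative: larger/faster offsets tend to read as more aroused,
    # smaller/slower ones as calmer.
    rng = random.Random(seed)
    lattice = [rng.uniform(-1.0, 1.0) for _ in range(1024)]

    def noise(t):
        x = t * speed
        i = int(math.floor(x))
        frac = x - i
        a = lattice[i % len(lattice)]
        b = lattice[(i + 1) % len(lattice)]
        s = frac * frac * (3.0 - 2.0 * frac)   # smoothstep interpolation
        return amplitude * ((1.0 - s) * a + s * b)

    return noise

idle = make_idle_noise()
for t in [0.0, 0.1, 0.2, 0.3]:
    print(idle(t))                             # smooth offsets around zero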

Generating Movements

Definitions and categories of body movements vary greatly. For instance, Knapp (1972)
proposes five types of body movements: emblems which have specific meanings, illus-
trators which emphasize speech, affect displays that express emotional states, regula-
tors which influence turn-taking, and adaptors which convey implicit information. Con-
sequently, research in movement generation has also considered these categories. For
instance, the MIT Leonardo robot displays emblematic gestures (Breazeal et al., 2004).
Data-driven methods use motion capture data to produce realistic results. They can
produce on-the-fly facial expressions and body gestures. For instance, Cao et al.
(2005) used a method based on the well-known computer animation technique “motion
graphs”. Machine learning techniques have also been used for learning expressive head
movement from recorded data (Busso et al., 2007). For example, BEAT (behaviour
expression animation toolkit) (Cassell, Vilhjálmsson, & Bickmore, 2001) was capable
of taking text input and converting it to communicative gestures based on the linguistic
and contextual information in the text. Manual methods use markup languages, such as
behaviour markup language (BML). In contrast, movement generation for social robots
has been less explored (Salem et al., 2012). However, existing methods from the vir-
tual human research have been adapted for social robots (see Salem et al., 2012, for an
example).
There are three main motion generation approaches: manually creating motion,
motion capture, and online motion generation (also called motion planning). For man-
ual creation, professional animators set the value of each joint at selected time steps (called key frames); the intermediate points can then be generated through interpolation methods
(Pierris & Lagoudakis, 2009). This approach usually produces the best results (although
not the most realistic ones). However, it is time-consuming and, more importantly, it is
not adaptive to new situations (i.e. the agent is limited to the set of movements that
were previously created). Motion capture usually produces the most realistic results as
it records a human’s movements and maps these data to a humanoid robot or a virtual
human. As with manually creating motion, this method is difficult to adapt to new sit-
uations. A combination of these methods is often applied to take advantage of the pros
of both methods (Egges et al., 2004). Moreover, the kinematic structure of the artificial
character might be different from the human structure captured, so the movements need
to be adapted to the specifics of the virtual human and the robot. For instance, Koene-
mann and Bennewitz (2012) used an Xsens MVN system to capture human motion and
realize a stable complex pose for the Nao robot. Besides emotions and facial expres-
sions, if a character is gazing or shaking hands with people, these movements cannot
be achieved by pre-defined or recorded animations. In contrast with these two methods,
online motion generation is very adaptive. Online movement generation methods rely
on kinematics and/or dynamics equations to solve a geometric task. For instance, Nunez
et al. (2012) proposed an analytic solution for a humanoid robot with three degrees of
freedom (DOF) on the arm and six DOF on the legs. Typically, the challenge is how best to use these methods to generate believable movements and behaviours that allow an artificial companion to interact naturally.
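For the manual, key-frame approach mentioned above, intermediate poses are obtained by interpolating joint values between key frames; the sketch below uses simple linear interpolation, whereas animation tools usually offer smoother (e.g., spline or ease-in/ease-out) curves. The key-frame values are invented.

def interpolate_keyframes(keyframes, t):
    # keyframes: list of (time, {joint_name: angle}) pairs, sorted by time.
    # Returns the pose at time t by linear interpolation between the two
    # surrounding key frames (clamping outside the key-framed interval).
    if t <= keyframes[0][0]:
        return dict(keyframes[0][1])
    if t >= keyframes[-1][0]:
        return dict(keyframes[-1][1])
    for (t0, pose0), (t1, pose1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            alpha = (t - t0) / (t1 - t0)
            return {j: (1 - alpha) * pose0[j] + alpha * pose1[j] for j in pose0}

# Illustrative key frames for two joints of a waving arm (angles in degrees).
keys = [(0.0, {"shoulder_pitch": 0.0, "elbow": 10.0}),
        (0.5, {"shoulder_pitch": 45.0, "elbow": 90.0}),
        (1.0, {"shoulder_pitch": 0.0, "elbow": 10.0})]
print(interpolate_keyframes(keys, 0.25))   # halfway towards the raised pose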

Movement and Emotions


Research in psychology has shown that emotions affect the way movements are exe-
cuted. For instance, Coombes, Cauraugh, and Janelle (2006) have shown that exposure
to unpleasant stimuli magnifies the force production of a sustained voluntary movement.
Moreover, the quality of movements seems to be specific to emotion (Wallbott, 1998;
Laban & Ullmann, 1971). Movements are effective clues for judging the emotional state
of other people, in conjunction with or in the absence of facial and vocal clues (Atkinson et al.,
2004; Beck, 2012). Body movements include the motion as well as the manner in which
it is performed. The quality of movements has been successfully used for virtual agents,
such as Greta, to express emotions (Hartmann et al., 2005). Greta uses a set of five
attributes to describe expressivity: overall activation (amount of activity, e.g. static vs
animated), spatial extent (amplitude of movements, e.g. contracted vs expanded), flu-
idity (smoothness and continuity of movements), repetition (rhythmic repetition of the
same movement), and power (dynamic properties of the movements, e.g. weak vs strong). In this system, these parameters act as filters on the character animation, affecting the strength, fluidity, and tempo of the movements. Roether and colleagues (2009) sys-
tematically investigated features of gait performed in different emotional states. Their
findings highlight the importance of amplitude and speed of movements. These param-
eters were also successfully reused to modulate gait in order to make it expressive
(Roether et al., 2009). Changing the dynamics of movements to express emotions has
also been used in robotics. For instance, Barakova and colleagues have used Laban’s
movement theory to model a small set of emotions using an E-puck robot (Barakova &
Tourens, 2010). They found reliable recognition of most of the behaviours. However, it
is still necessary to build a library of expressive gestures that will be modified by this
set of parameters.
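To illustrate how such expressivity parameters can act as filters on an existing animation, the sketch below applies a spatial-extent (amplitude) scaling and a simple fluidity (smoothing) filter to a joint-angle trajectory; the parameterization is a simplification inspired by, not a reimplementation of, the Greta attributes (Hartmann et al., 2005).

import numpy as np

def apply_expressivity(trajectory, spatial_extent=1.0, fluidity=0.0):
    # trajectory: (T,) array of joint angles over time, starting from a rest angle.
    # spatial_extent > 1 expands movements, < 1 contracts them.
    # fluidity in [0, 1): higher values smooth the trajectory (a simple
    # exponential moving average), making the movement less jerky.
    rest = trajectory[0]
    scaled = rest + spatial_extent * (trajectory - rest)
    out = np.empty_like(scaled)
    out[0] = scaled[0]
    for t in range(1, len(scaled)):
        out[t] = fluidity * out[t - 1] + (1.0 - fluidity) * scaled[t]
    return out

gesture = np.array([0.0, 10.0, 30.0, 60.0, 30.0, 10.0, 0.0])
print(apply_expressivity(gesture, spatial_extent=1.5, fluidity=0.4))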

Generating Gaze

Gaze is an essential element of human–human interaction. As such, it is a very active topic of research in psychology, human–computer interaction, as well as social robotics.
In human–human interaction, gaze has a wide range of functions including signaling
speaker turns, the addressee, changes in topic, etc. For gaze movement generation, data-driven methods give promising results. These methods rely on observations from human–human interaction and aim at defining statistical models that generate behaviour similar to that observed. For instance, Mutlu et al. (2012) con-
ducted a series of studies looking at how a speaker gazes while speaking to an audi-
ence of two persons. They then used their observations to generate movements for a
Robovie robot. Although data-driven methods produce convincing results, one of the
difficulties is to generalize these approaches to different situations. Indeed, typically
these approaches are “shallow” (Dautenhahn, 2013) models and focus on presentation.
They do not explain the function of the gaze, just the output. Nevertheless, one of the
major challenges in research focusing on an embodied agent is to sustain long-term
interaction. It seems difficult for current methods to do so as they typically rely on data
captured in the same session. Similarly, using these methods, it is quite difficult to grasp
and model the differences due to relationship. Should an agent acting as a coach gaze
in a similar way as a receptionist while interacting?
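To illustrate the data-driven idea, the sketch below samples gaze targets and dwell times from distributions that would, in practice, be estimated from annotated human–human recordings. The target labels and the numbers used here are placeholders and do not reproduce the statistics reported by Mutlu et al. (2012).

```python
import random

# Hypothetical distributions, standing in for statistics estimated from
# annotated recordings of a speaker addressing two listeners.
GAZE_TARGET_PROBS = {"listener_1": 0.35, "listener_2": 0.35,
                     "own_gesture": 0.10, "elsewhere": 0.20}
MEAN_DWELL_S = {"listener_1": 1.8, "listener_2": 1.8,
                "own_gesture": 0.9, "elsewhere": 1.2}

def sample_gaze_sequence(duration_s, seed=7):
    """Generate (target, dwell) pairs until the utterance duration is filled."""
    rng = random.Random(seed)
    targets, weights = zip(*GAZE_TARGET_PROBS.items())
    sequence, t = [], 0.0
    while t < duration_s:
        target = rng.choices(targets, weights=weights)[0]
        dwell = rng.expovariate(1.0 / MEAN_DWELL_S[target])  # exponential dwell time
        sequence.append((target, round(dwell, 2)))
        t += dwell
    return sequence

print(sample_gaze_sequence(10.0))
```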

Gaze and Emotion


Gaze is also related to the agent's emotional state. Gaze direction systematically influences the perceived emotional disposition conveyed by neutral faces (Adams & Kleck, 2005). Direct gaze leads to more anger and joy dispositional attributions, whereas averted gaze leads to more fear and sadness dispositional attributions. Gaze can also increase the emotional expressivity of non-neutral faces (Adams & Kleck, 2005). This is consistent with Dovidio's findings, which noted that holding a dominant position affects the length and directness of gaze (Dovidio & Ellyson, 1985). Moreover, recent research
suggests that affective states high in motivational intensity narrow the scope of attention, while affective states low in motivational intensity broaden it (Harmon-Jones, Gable,
& Price, 2011). This relates to Fredrickson's broaden-and-build theory of positive emotions, which predicts that positive affect broadens and negative affect narrows the scope of attention (Fredrickson, 2004). In summary, for gaze generation, both the way the gaze is performed and what to look at are affected by the internal state. Research on emotional gaze generation has mostly focused on expressive gaze. For instance, Cig et al. (2010) proposed a model in which gaze movements can affect perceived dominance and arousal. This model considers saccade duration and interval, velocity, and gaze aversion, and how they affect these two dimensions. However, an important open question is how emotions affect the decision to look at a specific point.
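A toy mapping in the spirit of such models is sketched below: perceived dominance and arousal, both in [-1, 1], modulate gaze-aversion probability, saccade velocity, and fixation duration. The coefficients are invented for illustration and do not reproduce the Cig et al. (2010) model.

```python
def gaze_parameters(dominance, arousal):
    """Map dominance and arousal in [-1, 1] to illustrative gaze parameters."""
    aversion_prob = max(0.0, min(1.0, 0.4 - 0.3 * dominance))   # dominant: less aversion
    saccade_velocity = 300.0 * (1.0 + 0.5 * arousal)            # deg/s, faster when aroused
    fixation_duration = 1.5 * (1.0 - 0.3 * arousal)             # s, shorter when aroused
    return {"aversion_prob": aversion_prob,
            "saccade_velocity_deg_s": saccade_velocity,
            "fixation_duration_s": fixation_duration}

print(gaze_parameters(dominance=0.8, arousal=0.5))    # confident, engaged agent
print(gaze_parameters(dominance=-0.6, arousal=-0.2))  # submissive, calm agent
```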

Proxemics

Extensive research has been conducted in the social sciences looking at how we manage interpersonal space during interactions (Torta et al., 2011). In the social sciences, proxemics is usually considered along five zones: close intimate zone (0 to 0.15 m), intimate zone (0.15 to 0.45 m), personal zone (0.45 to 1.2 m), social zone (1.2 to 3.6 m), and public zone (more than 3.6 m). Depending on our relationships, the context of the interaction, and culture, we dynamically adjust our distances during social interactions. Moreover, in virtual reality, it has been shown that users respect these distances while interacting with virtual humans (Bailenson et al., 2003). Proxemics is also especially relevant for social robots that operate in the real world (Beck, Hiolle et al., 2013). Models to generate behaviour based on these zones have been proposed. For instance, Walters et al. (2009) proposed a framework for a robot to decide at what distance it should place itself. The framework considers the robot's task, appearance, and user preferences. Torta et al. (2011) proposed to integrate a model of proxemics into a robot's behaviour-based navigation architecture, using Bayesian filtering to dynamically infer a target location that respects personal space.
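A minimal sketch of proxemics-aware placement, using the zone boundaries listed above: classify the current interpersonal distance and pick a default approach distance by zone, with the idea that task, appearance, and user preference (as in Walters et al., 2009) would adjust the default. The task offsets below are purely illustrative.

```python
# Zone boundaries (metres) as listed in the text.
ZONES = [("close intimate", 0.15), ("intimate", 0.45),
         ("personal", 1.2), ("social", 3.6), ("public", float("inf"))]

def classify_distance(d):
    """Return the proxemic zone for an interpersonal distance in metres."""
    for name, upper in ZONES:
        if d <= upper:
            return name
    return "public"

def target_distance(task="conversation", user_tolerance=0.0):
    """Pick an approach distance: default to the personal/social boundary,
    shifted by an (illustrative) task offset and a user preference term."""
    task_offset = {"handover": -0.5, "conversation": 0.0, "patrol": 1.0}
    return max(0.45, 1.2 + task_offset.get(task, 0.0) + user_tolerance)

print(classify_distance(0.8))             # 'personal'
print(target_distance("handover", -0.1))  # closer for an object handover
```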

Proxemics and Emotion


Proxemics is also indicative of emotional state. For example, an angry person tends to reduce the interpersonal distance during social interaction, although this reduction would also be evident between intimate people. Proxemics cannot therefore be considered an emotional expression in itself, but it is required to complete a representation of realistic emotional behaviour.

Synchronization of Body Animations in Real Time

The synchronization among various body expressions, gestures, facial expressions, gaze, and head movements is an extremely challenging task. The SAIBA (situation, agent, intention, behaviour, animation) framework is an attempt to establish a unified framework for multimodal output generation (Kopp et al., 2006). It consists of three stages: planning of a communicative intent, planning of a multimodal realization of this intent, and realization of the planned behaviours.

Planning of a Communicative Intent


The planning of communicative intents is related to the state of the ongoing interac-
tion. This can include sociocultural and situational context, history of communication
between the interactants, history of the ongoing dialogue, intention, personality and
emotions, and so on (Krenn & Sieber, 2008). The modeling of communicative intents is based on computational models of social behaviour that are driven by psychological and social studies. Typically, existing work focuses on modeling a single aspect of social behaviour, such as turn-taking, attention behaviour models, or emotions. However, in reality these aspects are all highly related. Therefore, one of the main challenges is to combine these in a holistic model. Functional markup language (FML) is an attempt by
the community to standardize this process and provide an inventory of high-level repre-
sentations that can be shared among different components. In FML, the basic building
blocks of a communicative event are the communication partners (name, gender, per-
sonality, etc.) and communication acts (turn taking, verbal and nonverbal expressions related to the communicative goal, etc.) (Krenn & Sieber, 2008). Other definitions have also been suggested; for instance, Heylen et al. (2008) proposed to use person char-
acteristics (identifier, name, gender, type [human/agent], appearance, voice), commu-
nicative actions (turn-taking, grounding, speech act), content (what is being communi-
cated and emphasized), mental state (felt and expressed emotions, cognitive processes),
and social-relational goals (relationship to the communicative partner). In addition,
Bickmore (2008) introduced contextual tags including information exchange, social
(social chat, small talk), empathy (comforting interactions), and encourage (coaching,
motivating).
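As a sketch of how such building blocks might be represented before being serialised to an FML-style document, the structure below loosely follows the inventories listed above (Krenn & Sieber, 2008; Heylen et al., 2008). The field names are illustrative and do not follow any particular FML schema.

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    identifier: str
    name: str
    gender: str
    agent_type: str          # "human" or "agent"

@dataclass
class CommunicativeIntent:
    speaker: Person
    addressee: Person
    communicative_act: str   # e.g. "inform", "turn-take", "ground"
    content: str             # what is communicated and emphasised
    mental_state: dict = field(default_factory=dict)   # felt/expressed emotions
    social_goal: str = ""    # relation to the communicative partner

intent = CommunicativeIntent(
    speaker=Person("a1", "Eva", "female", "agent"),
    addressee=Person("u1", "Sam", "male", "human"),
    communicative_act="inform",
    content="The next session starts at 10 am.",
    mental_state={"expressed_emotion": "joy", "felt_emotion": "neutral"},
    social_goal="build rapport",
)
```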

Planning of Multimodal Realisation of This Intent


There are two main approaches to modeling nonverbal behaviours: literature-based and machine learning (Lee & Marsella, 2012). Literature-based approaches are based on findings from psychology. These are typically obtained through manual analysis of human behaviour. The disadvantage of such methods is that the existing research cannot yet explain the full complexity of the mapping between behaviours and communicative functions. Nonverbal behaviours are concurrently affected by several factors, such as emotion, personality, gender, social context, etc. Research on these topics is still in progress. On the other hand, machine learning approaches automate this process, finding regularities and dependencies between factors using statistics and learning from larger amounts of data to cover various cases. However, obtaining good annotated data is problematic. Moreover, these data typically apply to the specific conditions in which they were collected and do not necessarily generalize well. The most emblematic work using a literature-based approach is probably the behaviour expression animation toolkit (BEAT) (Cassell, Vilhjálmsson, & Bickmore, 2001). BEAT allows animators to input the text that will be spoken by an animated
character. The linguistic and contextual features of the text are then analyzed, and a rule-based model is used to generate appropriate behaviours. Clauses are divided into two parts called theme (the part of the clause that creates a coherent link with a preceding clause) and rheme (the part that contributes some new information). Other language tags are based on whether words are new or contrasting, or whether they are objects or actions. For example, if a rheme contains a new node, the system generates a beat gesture that coincides with the object phrase. Another work in the same direction is the nonverbal behaviour generator (NVBG) (Lee & Marsella, 2006). In addition to the literature, the authors analysed video recordings of people performing gestures. They used labels, such as affirmation, negation, intensification, contrast, obligation, and assumption, to tag the parts of speech and mapped them to behaviours together with some priority rules. NVBG takes functional markup language (FML) as input and produces behaviour markup language (BML) in turn.
BML is an XML-based language to coordinate speech, gesture, gaze, and body move-
ments. Every behaviour is divided into six animation phases bounded with seven syn-
chronization points: start, ready, stroke-start, stroke, stroke-end, relax, and end. Syn-
chrony is achieved by assigning the sync-point of one behaviour to the sync-point of
another. The behaviour planner that produces the BML also gets information back from the behaviour realizers about the success and failure of the behaviour requests. One of
the open challenges of using BML is the maintenance of behaviour (Vilhjálmsson et al.,
2007). For example, a character gazing at a certain target at the stroke point of a speci-
fied gesture is defined in BML. However, what will happen once the gaze is performed
is not clear. BML has been used in various embodied conversational agent projects as
well as in various behaviour planners, behaviour realizers, repositories, and tools. For a
survey of these, the reader can refer to Vilhjálmsson et al. (2007).
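The fragment below illustrates the sync-point mechanism: a gesture stroke and a gaze shift are aligned with synchronisation points inside a speech behaviour. It is a schematic, BML-style example; exact element names, namespaces, and attributes vary between BML versions and realizers, so it should not be read as a validated BML document.

```python
import xml.dom.minidom as minidom

bml_fragment = """
<bml id="bml1">
  <speech id="s1">
    <text>Welcome <sync id="tm1"/> to the lab.</text>
  </speech>
  <!-- The gesture stroke is bound to sync point tm1 inside the speech. -->
  <gesture id="g1" lexeme="BEAT" stroke="s1:tm1"/>
  <!-- The gaze shift starts with the speech and relaxes at its end. -->
  <gaze id="gz1" target="user" start="s1:start" relax="s1:end"/>
</bml>
"""

# Check well-formedness and pretty-print the fragment.
print(minidom.parseString(bml_fragment.strip()).toprettyxml(indent="  "))
```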
In contrast, Lee and Marsella (2012) use machine learning. They generate speaker head nods using linguistic and affective features. They used the AMI Meeting corpus
and manually annotated the dynamics of the nods (small, medium, big) and eyebrow
movements (inner brow raise, outer brow raise, brow lowerer). They also processed
the text input to obtain some features, such as syntactic features, dialogue acts, paralin-
guistic features, and semantic categories. Syntactic features include part-of-speech tags,
phrase boundaries, and key lexical entities, which are the words that are known to have
strong correlations with head nods, such as “yes” for affirmation and “very” for intensi-
fication. Dialogue acts are the communicative functions of each utterance as described
above. Paralinguistic features are, for example, gaps (between speaking turns), disflu-
encies (discontinuity in the middle of an utterance), and vocal sounds (laughing, throat
noises). These three constitute the basic feature set. They also define semantic categories
for each word such as psychological constructs (e.g. affect, cognition, biological pro-
cesses), personal concern categories (e.g. work, home, leisure), paralinguistic dimen-
sions (e.g. assents, fillers), and punctuation categories (periods, commas), and define
this as the extended feature set which is used to study the impact of word semantics.
In a previous study, Lee and Marsella (2010) compared machine learning-based, rule-based, and human-generated head nods applied to virtual characters and found that the machine learning approach was perceived to be more natural in terms of nod timing. The machine learning approach also outperformed the human-generated head nods
which were directly generated based on real human performance. This result indicates that the machine learning approach produced a better representation of head nods: an average model based on multiple people's data can be perceived as more natural than an individual who might not be a very good representative of the data.
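A schematic version of this machine learning route is sketched below: each utterance segment is represented with a few linguistic and paralinguistic features, and a classifier predicts whether a head nod co-occurs. The toy features and data are invented; they only mirror the kind of feature set described above and are not the Lee and Marsella (2012) pipeline or the AMI annotations.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Invented training examples: (features, nod label).
samples = [
    ({"pos": "UH", "lexical": "yes", "dialogue_act": "assess", "disfluency": 0}, 1),
    ({"pos": "RB", "lexical": "very", "dialogue_act": "inform", "disfluency": 0}, 1),
    ({"pos": "NN", "lexical": "meeting", "dialogue_act": "inform", "disfluency": 0}, 0),
    ({"pos": "UH", "lexical": "um", "dialogue_act": "stall", "disfluency": 1}, 0),
]
features, labels = zip(*samples)

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(features)          # one-hot for strings, pass-through for numbers
clf = LogisticRegression().fit(X, labels)

# Predict a nod for a new affirmation-like segment.
new_segment = {"pos": "UH", "lexical": "yes", "dialogue_act": "assess", "disfluency": 0}
print(clf.predict(vectorizer.transform([new_segment])))
```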
Kipp et al. (2007) developed a gesture generation system that takes text as input and
produces full-body animations based on the style of a particular performer. They define a gesture unit (g-unit), which is a sequence of contiguous gestures where the hands return to a rest pose at the end of the last gesture. At the low level, movements consist of g-phases (preparation, stroke, hold, etc.); at the middle level, g-phases form g-phrases, and g-phrases are grouped into g-units. G-units with only one gesture are defined as singletons. Kipp et al. (2007) proposed that more expressive people have longer g-units. The system uses videos of a person whose gesture style is to be animated. In the offline phase, the videos are manually annotated with g-phases, g-phrases, and g-units in three different tracks. The input text is also tagged with semantic tags, and conditional probabilities are computed based on the links between the meaning of gestures and semantic tags, also taking into account the gesture sequences that are most likely to follow each other. Based on this, gesture profiles are created for each specific user. In the online mode, the new given text is annotated manually with semantic tags and gestures are generated using the selected gesture profile. The selected gestures are merged into g-units to generate the final animation sequences, which result in a gesture script to be played by the animation controller. The gesture script contains detailed information about the start time, shape, g-phase durations, and the form of the gesture.
Further technical details of the system can be found in Neff et al. (2008).
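A sketch of what estimating such a gesture profile could look like: count how often each gesture co-occurs with a semantic tag and how often gestures follow one another, then normalise the counts into conditional probabilities. The annotations below are invented; the real system works from manually annotated video tracks and the g-unit structure described above (Kipp et al., 2007; Neff et al., 2008).

```python
from collections import defaultdict

# Invented annotation: (semantic_tag, gesture) pairs in performance order.
annotated = [("emphasis", "beat"), ("negation", "sweep"), ("emphasis", "beat"),
             ("deixis", "point"), ("emphasis", "raise"), ("negation", "sweep")]

def normalise(counts):
    """Turn nested counts into conditional probability tables."""
    result = {}
    for key, inner in counts.items():
        total = sum(inner.values())
        result[key] = {v: round(c / total, 2) for v, c in inner.items()}
    return result

tag_counts = defaultdict(lambda: defaultdict(int))     # P(gesture | semantic tag)
bigram_counts = defaultdict(lambda: defaultdict(int))  # P(next gesture | gesture)
for (tag, gesture), nxt in zip(annotated, annotated[1:] + [(None, None)]):
    tag_counts[tag][gesture] += 1
    if nxt[1] is not None:
        bigram_counts[gesture][nxt[1]] += 1

profile = {"p_gesture_given_tag": normalise(tag_counts),
           "p_next_given_gesture": normalise(bigram_counts)}
print(profile["p_gesture_given_tag"]["emphasis"])   # e.g. {'beat': 0.67, 'raise': 0.33}
```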
Recently, Huang and Mutlu (2014) used a dynamic Bayesian network (DBN) for
coordinating multimodal behaviours (speech, gaze, gestures) for a humanoid robot. The
video data was based on a narrative speech and was annotated with four typical ges-
ture types (deictics, iconics, metaphorics, and beats). Additionally, four clusters of gaze
targets were annotated (reference, recipient, narrator’s own gesture, and other places).
The speech is coded with lexical affiliates for all four types of gestures and a DBN is
constructed based on the relationships between speech, gaze, and gesture, including a
latent cognitive state variable. Their results show that the learning-based model is bet-
ter in comparison to “no behaviour” and “random behaviour” conditions. The results
are comparable on most scales with a "heuristically-generated" condition. However, the learning-based approach has the additional advantage of reducing the designer's effort, since the model allows for automatic generation of gestures.
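To give a flavour of this kind of model, the sketch below samples gesture and gaze jointly from conditional distributions attached to a latent state that evolves over the utterance. The states, labels, and probabilities are invented for illustration; they do not reproduce the structure or parameters learned by Huang and Mutlu (2014).

```python
import random

rng = random.Random(0)

STATES = ["referring", "narrating"]
TRANSITIONS = {"referring": [0.6, 0.4], "narrating": [0.3, 0.7]}
GESTURE_GIVEN_STATE = {"referring": (["deictic", "iconic", "beat"], [0.6, 0.3, 0.1]),
                       "narrating": (["beat", "metaphoric", "none"], [0.4, 0.3, 0.3])}
GAZE_GIVEN_STATE = {"referring": (["reference", "recipient"], [0.7, 0.3]),
                    "narrating": (["recipient", "elsewhere"], [0.6, 0.4])}

def generate(words):
    """Sample a coordinated (word, state, gesture, gaze) plan, one step per word."""
    state = "narrating"
    plan = []
    for word in words:
        state = rng.choices(STATES, weights=TRANSITIONS[state])[0]
        gestures, g_w = GESTURE_GIVEN_STATE[state]
        gazes, z_w = GAZE_GIVEN_STATE[state]
        plan.append((word, state, rng.choices(gestures, weights=g_w)[0],
                     rng.choices(gazes, weights=z_w)[0]))
    return plan

for row in generate("the red box goes on the shelf".split()):
    print(row)
```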

Realization of Planned Behaviours


Animations using the same joints can be triggered at the same time. Animations that would simultaneously move the same part of the body raise two problems: how to handle the synchronization of animations and how to blend them.
Kallmann and Marsella (2005) developed a real-time motion control system for
autonomous virtual humans. They use a hierarchy of individual controllers where the
leaf nodes are the actual motion controllers and the upper nodes are for blending and
interpolation. Each controller can override, modify, or ignore other controllers based on
the hierarchy. Integration of individual controllers is a challenging task if they affect the
same joints of the body. For example, combining walking animation with lip-sync might
not be an issue as they are not related, but combining manipulation animation with gaze
might create some problems as they are considered as distinct problem areas in anima-
tion research (Shapiro, 2011). Thus synchronization of animations in real time is still an
open research area as existing game engines cannot handle complex character anima-
tion, although they provide solutions for other real-time simulation problems, such as
lighting, mesh rendering, and particle effects (Shapiro, 2011). Some of the controllers used by Shapiro (2011) are world offset (to define global position and orientation), idle motion, locomotion, reaching, grabbing, gaze (looking with the eyes, head, shoulders, and waist), breathing, eye saccades, blinking, head movements, gestures, facial expressions, and other controllers, such as blushing and tears. Each of these can override the effects of the others when necessary.
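A stripped-down sketch of the hierarchical idea: leaf controllers each propose values for the joints they own, and a parent node resolves conflicts by letting higher-priority controllers override lower-priority ones on shared joints (a blending node would replace the override rule with a weighted mix). This illustrates the principle only and is not the Kallmann and Marsella (2005) or Shapiro (2011) systems.

```python
class Controller:
    """Leaf controller proposing joint values for the joints it owns."""
    def __init__(self, name, priority, pose):
        self.name, self.priority, self.pose = name, priority, pose

    def evaluate(self, t):
        return dict(self.pose)   # a real controller would depend on time t

class OverrideNode:
    """Parent node: higher-priority children win on shared joints."""
    def __init__(self, children):
        self.children = sorted(children, key=lambda c: c.priority)

    def evaluate(self, t):
        pose = {}
        for child in self.children:   # low priority first, overridden by later updates
            pose.update(child.evaluate(t))
        return pose

idle = Controller("idle", priority=0, pose={"spine": 0.0, "head_yaw": 0.0, "r_arm": 0.1})
gaze = Controller("gaze", priority=5, pose={"head_yaw": 0.6})
gesture = Controller("gesture", priority=3, pose={"r_arm": 1.2})

root = OverrideNode([idle, gaze, gesture])
print(root.evaluate(t=0.0))   # {'spine': 0.0, 'head_yaw': 0.6, 'r_arm': 1.2}
```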

Main Challenges for Future Research

Throughout this chapter, the main areas of research for nonverbal behaviour generation have been highlighted. The field has made significant progress since the appearance of the first virtual humans able to display nonverbal body language in the late 1980s (Magnenat-Thalmann & Thalmann, 2005). Nevertheless, there are still a number of research avenues that need to be addressed, the first being adaptivity. The nonverbal behaviour we display while interacting is highly variable. Indeed, the way we interact
depends on the topic of the conversation, the surrounding context, the person with whom
we interact etc. We also vary our nonverbal behaviours while interacting with the same
person in different contexts. For example, multiparty interaction is an active research
topic (Yumak et al., 2014). State-of-the-art virtual humans and social robots are not yet
able to display this kind of flexibility. Moreover, these behaviours are not fixed over time
and evolve along with our relationships. Schulman and Bickmore (2012) showed that
changes in behaviour occur over long-term interaction. Indeed, understanding and mod-
eling these changes are major challenges toward sustaining long-term interaction. These
research topics are being pushed forward (Belpaeme et al., 2012). These challenges cannot be addressed at the movement behaviour generation level alone. Most of the
fields involved in autonomous agents research are making significant progress in this
direction. These challenges will be best addressed with “deep” approaches (Cañamero,
2008) and only through an understanding of how these processes evolve in human–
human interactions.

References

Adams, R. & Kleck, R. (2005). Effects of direct and averted gaze on the perception of facially
communicated emotion. Emotion, 5, 3–11.
Atkinson, A. P., Dittrich, W. H., Gemmell, A. J., & Young, A. W. (2004). Emotion perception
from dynamic and static body expressions in point-light and full-light displays. Perception,
33(6), 717–746.
Bailenson, J. N., Blascovich, J., Beall, A. C., & Loomis, J. M. (2003). Interpersonal dis-
tance in immersive virtual environments. Personality and Social Psychology Bulletin, 29(7),
819–833.
Barakova, E. I. & Lourens, T. (2010). Expressing and interpreting emotional movements in social
games with robots. Personal and Ubiquitous Computing, 14, 457–467.
Beck, A. (2012). Perception of emotional body language displayed by animated characters. PhD
dissertation, University of Portsmouth.
Beck, A., Cañamero, L., Damiano, L., et al. (2011). Children interpretation of emotional body
language displayed by a robot. In Proceedings of International Conference on Social Robotics
(pp. 62–70), Amsterdam.
Beck, A., Cañamero, L., Hiolle, A., et al. (2013). Interpretation of emotional body language
displayed by a humanoid robot: A case study with children. International Journal of Social
Robotics, 5(3), 325–334.
Beck, A., Hiolle, A., & Cañamero, L. (2013). Using Perlin noise to generate emotional expres-
sions in a robot. In Proceedings of Annual Meeting of the Cognitive Science Society (pp. 1845–
1850).
Beck, A., Hiolle, A., Mazel, A., & Cañamero, L. (2010). Interpretation of emotional body lan-
guage displayed by robots. In Proceedings of the 3rd International Workshop on Affective Inter-
action in Natural Environments (pp. 37–42).
Beck, A., Stevens, B., Bard, K., & Cañamero, L. (2012). Emotional body language displayed by
artificial agents. Transactions on Interactive Intelligent Systems, 2(1), 2–1.
Belpaeme, T., Baxter, P., Read, R. et al. (2012). Multimodal child-robot interaction: Building
social bonds. Journal of Human–Robot Interaction, 1(2), 33–53.
Bickmore, T. (2008). Framing and interpersonal stance in relational agents. In Autonomous Agents
and Multi-Agent Systems. Workshop on Why Conversational Agents Do What They Do: Func-
tional Representations for Generating Conversational Agent Behavior, Estoril, Portugal.
Breazeal, C., Brooks, A., Gray, J., et al. (2004). Tutelage and collaboration for humanoid robots.
International Journal of Humanoid Robotics, 1(2), 315–348.
Busso, C., Deng, Z., Grimm, M., Neumann, U., & Narayanan, S. (2007). Spoken and multimodal
dialog systems and applications – rigid head motion in expressive speech animation: Analysis
and synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 1075.
Cañamero, L. (2008). Animating affective robots for social interaction, in L. Cañamero & R.
Aylett (Eds), Animating Expressive Characters for Social Interaction (pp. 103–121). Amster-
dam: John Benjamins.
Cao, Y., Tien, W. C., Faloutsos, P., & Pighin, F. (2005). Expressive speech-driven facial animation.
ACM Transactions on Graphics, 24(4), 1283–1302.
Cassell, J. (2000). Nudge nudge wink wink: Elements of face-to-face conversation for embodied
conversational agents. In J. Cassell, J. Sullivan, S. Prevost, & E. Churchill (Eds), Embodied
Conversational Agents (pp. 1–27). Cambridge, MA: MIT Press.
Cassell, J., Vilhjálmsson, H., & Bickmore, T. (2001). BEAT. In Proceedings of the 28th Annual
Conference on Computer Graphics and Interactive Techniques, Los Angeles.
Cig, C., Kasap, Z., Egges, A., & Magnenat-Thalmann, N. (2010). Realistic emotional gaze and
head behavior generation based on arousal and dominance factors. In R. Boulic, Y. Chrysan-
thou, & T. Komura (Eds), Motion in Games (vol. 6459, pp. 278–289). Berlin: Springer.
Clavel, C., Plessier, J., Martin, J.-C., Ach, L., & Morel, B. (2009). Combining facial and postu-
ral expressions of emotions in a virtual character. In Z. Ruttkay, M. Kipp, A. Nijholt, & H.
Vilhjálmsson (Eds), Intelligent Virtual Agents (vol. 5773, pp. 287–300). Berlin: Springer.
Coombes, S. A., Cauraugh, J. H., & Janelle, C. M. (2006). Emotion and movement: Activation
of defensive circuitry alters the magnitude of a sustained muscle contraction. Neuroscience
Letters, 396(3), 192–196.
Coulson, M. (2004). Attributing emotion to static body postures: Recognition accuracy, confu-
sions, and viewpoint dependence. Journal of Nonverbal Behavior, 28, 117–139.
Dautenhahn, K. (2013). Human–Robot Interaction. In M. Soegaard & R. F. Dam (Eds), The Ency-
clopedia of Human–Computer Interaction (2nd edn). Aarhus, Denmark: The Interaction Design
Foundation.
De Silva, P. R. & Bianchi-Berthouze, N. (2004). Modeling human affective postures: An infor-
mation theoretic characterization of posture features. Computer Animation and Virtual Worlds,
15(3–4), 269–276.
Dovidio, J. & Ellyson, S. (1985). Pattern of visual dominance behavior in humans. In S. Ellyson
& J. Dovidio (Eds), Power, Dominance, and Nonverbal Behavior (pp. 129–149). New York:
Springer.
Egges, A., Molet, T., & Magnenat-Thalmann, N. (2004). Personalised real-time idle motion syn-
thesis. In Proceedings of 12th Pacific Conference on Computer Graphics and Applications
(pp. 121–130).
Fredrickson, B. (2004). The broaden-and-build theory of positive emotions. Philosophical Trans-
actions: Biological Sciences, 359, 1367–1377.
Harmon-Jones, E., Gable, P., & Price, T. (2011). Toward an understanding of the influence of
affective states on attentional tuning: Comment on Friedman and Förster (2010). Psychology
Bulletin, 137, 508–512.
Hartmann, B., Mancini, M., Buisine, S., & Pelachaud, C. (2005). Design and evaluation of expres-
sive gesture synthesis for embodied conversational agents. In Proceedings of the 4th Interna-
tional Joint Conference on Autonomous Agents and Multiagent Systems (pp. 1095–1096), New
York.
Heylen, D., Kopp, S., Marsella, S., Pelachaud, C., & Vilhjálmsson, H. (2008). The next step
towards a function markup language. In H. Prendinger, J. Lester, & M. Ishizuka (Eds), Intelli-
gent Virtual Agents (vol. 5208, pp. 270–280). Berlin: Springer.
Huang, C.-M. & Mutlu, B. (2014). Learning-based modeling of multimodal behaviors for human-
like robots. In Proceedings of the 2014 ACM/IEEE International Conference on Human–Robot
Interaction (pp. 57–64), New York.
Huang, L., Galinsky, A. D., Gruenfeld, D. H., & Guillory, L. E. (2010). Powerful postures ver-
sus powerful roles which is the proximate correlate of thought and behavior? Psychological
Science, 22(1), 95–102.
Ishiguro, H. (2005). Android science: Toward a new cross-disciplinary framework. In Proceedings
of the 27th Annual Conference of the Cognitive Science Society: Toward Social Mechanisms of
Android Science (A CogSci 2005 Workshop) (pp. 1–6).
Kallmann, M. & Marsella, S. (2005). Hierarchical motion controllers for real-time autonomous
virtual humans. Lecture Notes in Computer Science, 3661, 253–265.
Kipp, M., Neff, M., Kipp, K., & Albrecht, I. (2007). Towards natural gesture synthesis: Evaluating
gesture units in a data-driven approach to gesture synthesis. Lecture Notes in Computer Science,
4722, 15–28.
Kleinsmith, A., Bianchi-Berthouze, N., & Steed, A. (2011). Automatic recognition of non-acted
affective postures. IEEE Transactions on Systems, Man, and Cybernetics Part B, 41(4), 1027–
1038.
Kleinsmith, A., De Silva, P. R., & Bianchi-Berthouze, N. (2006). Cross-cultural differences in
recognizing affect from body posture. Interacting with Computers, 18(6), 1371–1389.
Knapp, M. (1972). Nonverbal Communication in Human Interaction. New York: Holt, Rinehart
and Winston.
Koenemann, J. & Bennewitz, M. (2012). Whole-body imitation of human motions with a Nao
humanoid. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human–
Robot Interaction (pp. 425–426), New York.
Kopp, S., Krenn, B., Marsella, S., et al. (2006). Towards a common framework for multimodal
generation: The behavior markup language. In Proceedings of the 6th International Conference
on Intelligent Virtual Agents (pp. 205–217).
Krenn, B. & Sieber, G. (2008). Functional markup for behavior planning: Theory and practice.
In Proceedings of the AAMAS 2008 Workshop: Functional Markup Language. Why Conversa-
tional Agents Do What They Do.
Laban, R. & Ullmann, L. (1971). The Mastery of Movement. Boston: Plays.
Lakin, J., Jefferis, V., Cheng, C., & Chartrand, T. (2003). The chameleon effect as social glue:
Evidence for the evolutionary significance of nonconscious mimicry. Journal of Nonverbal
Behavior, 27(3), 145–162.
Lee, J. & Marsella, S. (2006). Nonverbal behavior generator for embodied conversational agents.
Lecture Notes in Computer Science, 4133, 243–255.
Lee, J. & Marsella, S. (2010). Predicting speaker head nods and the effects of affective informa-
tion. IEEE Transactions on Multimedia, 12(6), 552–562.
Lee, J. & Marsella, S. (2012). Modeling speaker behavior: A comparison of two approaches.
Lecture Notes in Computer Science, 7502, 161–174.
Magnenat-Thalmann, N. & Thalmann, D. (2005). Handbook of Virtual Humans. Hoboken, NJ:
John Wiley & Sons.
Mutlu, B., Kanda, T., Forlizzi, J., Hodgins, J., & Ishiguro, H. (2012). Conversational gaze mech-
anisms for humanlike robots. Transactions on Interactive Intelligent Systems, 1(2), art. 12.
Neff, M., Kipp, M., Albrecht, I., & Seidel, H.-P. (2008). Gesture modeling and animation based
on a probabilistic re-creation of speaker style, ACM Transactions on Graphics, 27(1), art. 5.
Nunez, J., Briseno, A., Rodriguez, D., Ibarra, J., & Rodriguez, V. (2012). Explicit analytic solution
for inverse kinematics of bioloid humanoid robot. In Brazilian Robotics Symposium and Latin
American Robotics Symposium (pp. 33–38).
Perlin, K. (2002). Improving noise. ACM Transactions on Graphics, 21(3), 681–682.
Pierris, G. & Lagoudakis, M. (2009). An interactive tool for designing complex robot motion
patterns. In Proceedings of IEEE International Conference on Robotics and Automation
(pp. 4013–4018).
Roether, C. L., Omlor, L., Christensen, A., & Giese, M. A. (2009). Critical features for the per-
ception of emotion from gait. Journal of Vision, 9(6), 15.
Salem, M., Kopp, S., Wachsmuth, I., Rohlfing, K., & Joublin, F. (2012). Generation and evaluation
of communicative robot gesture. International Journal of Social Robotics, 4(2), 201–217.
Schulman, D. & Bickmore, T. (2012). Changes in verbal and nonverbal conversational behavior
in long-term interaction. In Proceedings of the 14th ACM International Conference on Multi-
modal Interaction (pp. 11–18).
Shapiro, A. (2011). Building a character animation system. Lecture Notes in Computer Science,
7060, 98–109.
Snibbe, S., Scheeff, M., & Rahardja, K. (1999). A layered architecture for lifelike robotic motion.
In Proceedings of the 9th International Conference on Advanced Robotics, October.
Sun, X. & Nijholt, A. (2011). Multimodal embodied mimicry in interaction. Lecture Notes in
Computer Science, 6800, 147–153.
Thiebaux, M., Marsella, S., Marshall, A. N., & Kallmann, M. (2008). Smartbody: Behavior real-
ization for embodied conversational agents. In Proceedings of the 7th International Joint Con-
ference on Autonomous Agents and Multiagent Systems (pp. 151–158).
Thomas, F. & Johnston, O. (1995). Disney Animation: The Illusion of Life. New York: Abbeville
Press.
Torta, E., Cuijpers, R., Juola, J., & Van der Pol, D. (2011). Design of robust robotic proxemic
behaviour. Lecture Notes in Computer Science, 7072, 21–30.
Vilhjálmsson, H., Cantelmo, N., Cassell, J., et al. (2007). The behavior markup language: Recent
developments and challenges. In Proceedings of the 7th International Conference on Intelligent
Virtual Agents (pp. 99–111).
Wallbott, H. (1998). Bodily expression of emotion. European Journal of Social Psychology, 28(6),
879–896.
Walters, M. L., Dautenhahn, K., Te Boekhorst, R., et al. (2009). An empirical framework for
human–robot proxemics. In Proceedings of New Frontiers in Human–Robot Interaction: Sym-
posium at the AISB09 Convention (pp. 144–149).
Yumak, Z., Ren, J., Magnenat-Thalmann, N., & Yuan, J. (2014). Modelling multi-party interac-
tions among virtual characters, robots and humans. Presence: Teleoperators and Virtual Envi-
ronments, 23(2), 172–190.
21 Approach and Dominance as Social Signals for Affective Interfaces
Marc Cavazza

Introduction

Recent years have seen a growing interest in the development of affective interfaces.
At the heart of these systems is an ability to capture social signals and analyse them
in a way that meets the requirements and characteristics of the application. There has
been a concerted effort in devising a principled approach which could benefit from
the interdisciplinary nature of the affective computing endeavour. One common strat-
egy has been to seek conceptual models of emotion that could mediate between the
social signals to be captured and the knowledge content of the application. Early sys-
tems have largely relied on the intrinsic, natural properties of emotional communication
with, for instance, an emphasis on facial expressions both on the output and the input
side, together with some theoretical foundation for the acceptance of computers as inter-
action partners (Reeves & Nass, 1996). Without downplaying the importance of these
systems in the development of the field or the practical interest of the applications they
entailed (Prendinger & Ishizuka, 2005), it soon appeared that not all interactions could
be modelled after interhuman communication, in particular when considering interac-
tion with more complex applications. This complexity can be described at two principal
levels: the interaction with complex data, knowledge, or task elements and the nature
of the emotions themselves (and their departing from universal or primitive emotions to
reflect more sophisticated ones). On the other hand, part of the problem rests with var-
ious simplifications that were necessary to get early prototypes off the ground. A good
introduction to the full complexity of an affective response can be found in Sander and
Scherer (2014), in particular its description of the various levels and components to be
considered. These have not always been transposed into affective computing systems;
however, as is often the case, the original description frameworks may not be computa-
tional enough to support direct and complete transposition.
Dimensional models of emotions have been enthusiastically adopted in affective
interfaces for their flexibility in terms of representation as well as the possibility of
mapping input modalities onto their dimensions to provide a consistent representation of
input. In this chapter, we advocate for two lesser used affective dimensions, Dominance
and Approach, which have been identified in the literature, yet given comparatively less
attention than the traditional dimensions of Valence and Arousal. Our main standpoint
will be their relevance to affective interfaces and their ability to capture important social
signals. The first of these dimensions is Dominance, as used in the pleasure arousal
dominance (PAD) model of Mehrabian (1996). The second dimension discussed in this
chapter is known as Approach/Withdrawal (Davidson, 1992, 2003)1 . It is of particular
interest in relation to its electrophysiological correlates, in particular EEG frontal alpha
asymmetry, and the development of affective Brain-Computer Interfaces (BCI). We will
argue that each of these dimensions has a specific interest for social signal processing in
affective interfaces, in particular for interactive systems with a significant media content
or an underpinning real-world task that extends beyond ‘simple’ communication.
From a terminological perspective, it is worth noting that these dimensions are
often named through a pair of antonyms: Approach/Withdrawal, and Dominance/
Submissiveness. This contrasts with the straightforward naming of Arousal and Valence
whose evolution tended to be described qualitatively (high/low or positive/negative). In
this chapter we shall use only the main polarity to designate the dimension, and capi-
talise them, to avoid ambiguity.

From Arousal Valence (AV) to PAD Models: Introducing Dominance

Dominance was introduced by Mehrabian (e.g., 1996) as a third dimension added to Arousal and Valence in his pleasure arousal dominance (PAD2) framework. This was
the consequence of interpreting factor analytic results from emotion questionnaire data,
which suggested that a third factor was required to adequately characterise an affective
state (Russell & Mehrabian, 1977). The need for additional dimensions to differentiate
emotional states that may share Valence and Arousal parameters is now well accepted,
and Demaree et al. (2005) give examples of how Dominance can differentiate between
similar emotions both positively and negatively valenced.
Mehrabian’s original definition of Dominance was in relation to the control that the
subject would feel over her immediate environment. More specifically, Dominance has
been defined as “feelings of control and influence over everyday situations, events, and
relationships versus feelings of being controlled and influenced by circumstances and
others” (Mehrabian, 1996).
Since we are interested in affective dimensions in conjunction with social signals,
one terminological difficulty arises from the original naming of the dimension itself
which made reference to the antonym pair Dominance/Submissiveness and would suggest a relation of social dominance rather than of control over the environment. In this chapter, we consider Dominance only as a sense of control over the environment and suggest that it is primarily mediated by embodiment and the relationship to physical space; hence its relevance to the study and implementation of interactive systems. This
is why Dominance is most often associated with body gestures and attitudes. However,
it does not seem possible to completely disentangle Dominance from its social accep-
tation, especially when the latter is mediated by space appropriation. Less orthodox
suggestions have been made according to which Dominance could be interpreted as an element of control over other affective dimensions (Liu & Sourina, 2012), but this view is not shared by other authors. Our interpretation of Dominance will follow Mehrabian's original definition of control over the environment. This is consistent with the development of the PAD model for aesthetic judgments and reactions to physical artifacts rather than interaction in a social context, even though uses of PAD have been reported for social agents (Gebhard, 2005), gaming (Reuderink et al., 2013), and aesthetic computing (Gilroy, Cavazza, & Vervondel, 2011).

1 Avoidance is sometimes used as an equivalent of withdrawal.
2 Where pleasure is equivalent to valence provided the axis contains a negative section.
A notable advocacy of Dominance in affective computing was produced by Broekens
(2012). While following a strict definition of Dominance in the PAD model as the ability
to exert control over the environment, he extended the definition of environment to
accommodate the social environment, hence other subjects. The list of examples he gives is, however, slightly biased toward social aspects, somehow departing from the type of control that relates to the predictability of the environment, or a sense of physical agency. A comment that Broekens makes which is particularly relevant in the context of this chapter is to draw a surprising parallel between Dominance and Approach. This
may have been prompted by the use of Dominance to differentiate between anger and
fear in social contexts, which is not unlike the valence controversy for Approach when
associated with anger (see below). We will discuss this parallel further in this chapter.
Finally, the correlation studies performed by Broekens (2012) also confirm Mehra-
bian’s (1996) statement that the three PAD dimensions are not fully orthogonal. This
means that using the PAD representation as some kind of orthonormal basis for affec-
tive signal vectors should be considered nothing more than an approximation.
Several authors have incorporated Dominance into affective interfaces of various
sorts, most often via implementations and mappings onto the PAD model. However,
rather than reviewing previous use of the PAD model at large, it seems more relevant to
consider recent work that has specifically considered Dominance as an input dimension,
in particular through its physical, even embodied, realisation.
Kleinsmith and Bianchi-Berthouze (2007) have presented results for the automatic
recognition of four affective dimensions from body posture, among which Dominance3 ,
for which they achieved an error rate of only 10%. In a recent review, Kleinsmith and Bianchi-Berthouze (2013) offer a more extensive presentation of related work and of their own recent results, which confirm the promising nature of their early findings
and the spatial and physical anchoring of Dominance.
Gaffary et al. (2013) have conducted an innovative and interesting study into the
ability of haptics to physically convey various types of emotions, using a definition of
Dominance as the ability to control a situation. They use haptics to differentiate between
pairs of emotions in the PAD model (Russell & Mehrabian, 1977); as three of these pairs
are differentiated by their Dominance component, their haptic findings could potentially
inform the detection of Dominance.
The original definition of Dominance as control over the subject's environment has implications for affective interfaces but also for interactive systems at large. For instance, one dimension emerging from factor analysis of determinants of Presence (Witmer & Singer, 1998) has also been labelled Control, to reflect the importance for the feeling of Presence of a sense of agency. This has inspired some of Cavazza, Lugrin, and Buehner's (2007) previous work in suggesting that control factors could be related to basic psychological constructs such as causal perception.

3 Labelled as 'potency' with an explicit reference to it being equivalent to dominance.
Because of the definition they have adopted for Dominance, they have not attempted
to solve the boundary issue between Dominance over an environment and Dominance
within a social group (Gatica-Perez, 2009). Research by Glowinski et al. (2010) has
shown some potential to bridge this gap by considering a characterisation of complex
gestural patterns (playing a string instrument) through entropy measures. The ambigu-
ity between the social aspect of dominance and the sense of control needs to be further
addressed. It finds its roots in the original definition insofar as it leaves open the pos-
sibility that other subjects constitute the environment over which control is defined.
Proxemics tells us that social dominance may also be mediated by occupation of space,
so that the spatial realisation of Dominance may also link back to social relations. Obvi-
ously, the use of the Dominance/Submissiveness terminology also carries its share of
social implications. Despite the above, it could be suggested that Dominance should be
considered independently of social implications – an epistemological perspective would
be that Pleasure and Arousal are abstract dimensions that emancipate themselves from
the direct communicational role attributed to universal emotions.

Case Study I: PAD-based Affective Fusion


Gilroy et al. (2011) have explored the use of the PAD model as part of research in affec-
tive interfaces to interactive art installations. Their original intuition was based on the
use of PAD models in product design, which suggested that its three-dimensional space
could accommodate complex affective states reflecting the sort of aesthetic responses
expected when considering artworks. One additional incentive was that whenever inter-
action would include physical modalities such as body attitudes or gestures, Dominance
would provide an additional, hopefully more specific dimension onto which social sig-
nals could be mapped. The system is a multimodal installation accepting input from sev-
eral modalities, some typically affective in nature and others, like average body motion
or number of faces detected, specifically mapped onto an affective dimension (Gilroy,
Cavazza, & Benayoun, 2009). The interactive installation itself is a mixed-reality art-
work known as the “emotional tree” (eTree) (Gilroy et al., 2011) in which a virtual tree
would react to the perceived global emotional response of the audience through various
patterns of growth (the tree appearance being generated in real-time using an L-system).
The originality of our approach was that the same PAD space would serve both as a representation driving system behaviour and as a multimodal fusion model. Further,
this fusion mechanism can combine output from several users (we used pairs of users to
facilitate spontaneous speech).
In practice, each input signal is defined as a mapping from the modality to relevant PAD dimensions. Each modality4 may map onto several dimensions and, in return, each dimension receives contributions from multiple modalities. Arousal receives contributions from the greatest number of different modalities (emotional speech recognition, face detection, keyword spotting, body motion). On the other hand, Dominance is only supported by keyword spotting (using affective keyword mappings from Russell & Mehrabian, 1977) and interest, combining body motion and physical interaction events. Not surprisingly, the latter embodied component is the largest contribution to Dominance, accounting for 75% of its value.

4 There has been much discussion on the exact definition of "modality" in terms of combination of signal and channel.
Overall, each input modality is thus associated with a 4-dimensional vector of zero
origin rotating through time in the PAD space. Its magnitude is determined by a nor-
malised scaling of index vectors on each PAD dimension, associated with a decay coeffi-
cient characteristic of each modality (Gilroy, Cavazza, Niiranen et al., 2009). As a con-
sequence, the fusion process itself takes advantage of the vector representation in the
PAD space and is based on the linear combination of individual modalities’ vectors. The
affective response is thus represented by one single resulting vector of variable magni-
tude rotating through time in the PAD space. This 4-D motion generates a 3D surface in
PAD space which provides an original visualisation of user experience. Gilroy, Cavazza,
and Benayoun (2009) have also shown the potential correlations between the distribution
of this surface in PAD space and the concept of flow (Csikszentmihalyi, 1991). In this
context, the specific contribution of Dominance in terms of social signal input analysis,
is to capture affective information primarily from spatial modalities, which reflect user
involvement and interest. In terms of output and representation, the third dimension
brought by Dominance enhances the visual analytics of user experience.
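A simplified sketch of this fusion scheme is given below: each modality contributes a vector in PAD space whose magnitude decays over time, and the fused affective state is the linear sum of the active contributions. The modality-to-dimension mappings and decay constants are placeholders, not the calibrated values of the eTree system (Gilroy, Cavazza, Niiranen et al., 2009).

```python
import math

# Hypothetical unit contributions of each modality in (P, A, D) and decay (seconds).
MODALITY_MAP = {
    "emotional_speech": ((0.6, 0.8, 0.0), 4.0),
    "keyword_spotting": ((0.5, 0.3, 0.4), 6.0),
    "body_motion":      ((0.0, 0.7, 0.7), 3.0),
    "face_detection":   ((0.2, 0.4, 0.0), 5.0),
}

def fuse(events, now):
    """events: list of (modality, intensity in [0, 1], timestamp). Returns fused PAD."""
    fused = [0.0, 0.0, 0.0]
    for modality, intensity, t in events:
        direction, decay = MODALITY_MAP[modality]
        weight = intensity * math.exp(-(now - t) / decay)   # exponential decay over time
        for i in range(3):
            fused[i] += weight * direction[i]
    return tuple(round(x, 3) for x in fused)

events = [("emotional_speech", 0.9, 0.0), ("body_motion", 0.7, 1.5),
          ("keyword_spotting", 0.5, 2.0)]
print(fuse(events, now=3.0))   # fused (Pleasure, Arousal, Dominance) vector
```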
One limitation surfaced when trying to use physiological signals as ground truth to
evaluate the performance of the PAD-based multimodal fusion system. Although strong
correlation has been reported between GSR and Arousal (Andreassi, 2006) and between
facial EMG and Valence/Pleasure (Lang et al., 1993), no equivalent correlation could
be found between Dominance and a physiological signal (peripheral or even central,
despite anecdotal reports of the use of EEG). This led Gilroy, Cavazza, and Benayoun
(2009) to evaluate the fusion system with only two dimensions, meaning that despite

clear positive results the evaluation cannot be extrapolated as a complete validation, notwithstanding the recurring issue of non-orthogonality of the PAD vector basis (see Figure 21.1).

Figure 21.1 Affective trajectory in PAD space: the multimodal fusion vector's trajectories provide a representation of user experience with the interactive installation (Gilroy et al., 2011). [The figure shows a trajectory in the three-dimensional PAD space, with axes labelled +/–Pleasure, +/–Arousal, and +/–Dominance.]

Approach/Withdrawal as an Affective Dimension

According to a layered model of emotional responses (Sander & Scherer, 2014) dispo-
sition toward action is one of the five components of the emotional response. One way
of presenting this disposition is through the notion of Approach–Withdrawal5 which
has been extensively studied by Davidson (1992, 2003). Davidson has further argued
that Approach and Withdrawal are components of different emotions (Davidson et al.,
1990), conferring on them the status of higher-level dimensions, able, for instance, to differ-
entiate between emotional categories showing similar valence.
Davidson (2003) described Approach from a phylogenetic perspective as a natural
reaction to appetitive goals, through the self-contained definition of the Approach-
Avoidance pair. However, when considering that Approach could not be strictly equated
to the pursuit of positive stimuli Davidson et al. (1990) illustrated Approach by the ten-
dency of infants to reach out to their mothers (Fox & Davidson, 1988). Several authors
have provided more explicit, and generally consistent, definitions of Approach as the impetus to move toward a stimulus as well as the action itself (Berkman & Lieberman, 2010), or as the motivation to pursue desired goals and rewards (Sutton & David-
son, 1997). The default assumption has also been to consider Avoidance as the mirror
image of Approach, moving away from the stimulus instead of toward it. Harmon-Jones,
Gable, and Peterson (2010) have, however, reported differences in the neural correlates of Approach and Withdrawal despite both being associated with frontal asymmetry (with
Approach better correlated with left-frontal activity than Withdrawal with frontal asym-
metry). Other work has challenged this joint definition of Approach and Withdrawal,
suggesting that Approach was not necessarily the opposite of Avoidance (Amodio et al.,
(2008). For this reason, in the remainder of this chapter, we shall primarily use the term
Approach in isolation, unless quoting a researcher adopting a different convention.
Davidson’s original phylogenetic perspective has been discussed by several authors.
Huys et al. (2011) studied Pavlovian responses involving approach and withdrawal but
found these to depend critically on the intrinsic valence of behaviours as well as differ-
ing between approach and withdrawal. Demaree et al. (2005) draw similarities between
humans and animals in their Approach/Withdrawal behaviour, considering the main
difference to be the ability to approach or avoid more abstract situations, such as a
social context. Berkman and Lieberman (2010) suggested that, although in animals the
Approach tendency and Valence were always aligned, the situation was far more com-
plex for humans, who were able to deal with a contradiction between goal pursuit and valence in the determination of Approach or Withdrawal (some of their examples of the dissociation between Approach and the appetitive nature of the goal involved smoking cessation or healthy eating). Gable and Harmon-Jones (2008), when studying individual differences in frontal asymmetry responses, used appetitive rather than simply positive stimuli (these appetitive stimuli being actually food-related).

5 Other authors use avoidance in lieu of withdrawal. We will use both in the text depending on the authors we are citing.
Davidson (2003) remarked how short-term rewards inducing Approach could ham-
per the pursuit of a long-term goal. This suggests that phylogenetic considerations
have primarily an epistemological interest but should not lead us to underestimate the
actual complexity and level of integration of emotional regulatory mechanisms. Another
important implication of phylogenetic considerations leads us to discuss the relationship
between Approach and Empathy, which has received sustained interest in recent years
(Tullett, Harmon-Jones, & Inzlicht, 2012; Light et al., 2009).

Approach as a Precursor of Empathy

Several authors have related Approach to empathy: such a connection is remarkable, not
least because it maps a generic dimension to a very specific, albeit complex, affective
disposition of a social nature. Decety and Moriguchi (2007) posit a phylogenetic connection between approach and empathy, the former constituting a precursor of the latter. This is not unlike primitive forms of emotional contagion (Gutsell
& Inzlicht, 2010; De Waal, 2008).
Tullett et al. (2012) have suggested that Withdrawal rather than Approach could corre-
late with empathy. In their model, empathy is described from the perspective of negative
emotional contagion, even explicitly considering sadness as a potential mediator. This
even led them to consider empathy as a potentially unpleasant emotional state for those
who experience it, going against the traditionally positive adjectives associated with the
lay conceptions of empathy. However, this can be explained by the multiform nature
of empathy discussed in particular by Light et al. (2009), who have identified three
main sorts of empathy from the perspective of their relationship to Approach. The def-
inition of one of these concerning empathy [DEF] is consistent with a correlation to
Withdrawal.

Approach and Valence


The Approach dimension has initially been strongly associated with Valence from the phylogenetic perspective discussed above, as natural situations in which Approach behaviour is triggered often feature positively valenced stimuli. This has been sum-
marised through the “approach the good, avoid the bad” maxim (Briesemeister et al.,
2013) and has been the object of intense discussion based on a number of contrasting
findings. One of the most significant contributions to this debate has been the finding by
Harmon-Jones (2004) that Approach could also be associated with a negatively valenced
emotion, as in the specific case of anger. His argument is that anger projects a negative
emotion onto a target, and this projection matches the concept of Approach. Davidson’s
(2004) response has been to consider this finding as consistent with some of his previ-
ous work and that, from a trait perspective, it was not automatically a negative trait to
be prone to anger in certain situations if it could facilitate the rapid removal of obstacles
that are thwarting goals.
Other researchers have remarked that the case of Anger was somehow unique and
that the symmetrical situation, where Avoidance would be associated with a positively valenced stimulus, could not be reported (Berkman & Lieberman, 2010). Further, the same authors have emphasised that not all dissociations were similar: they have distinguished Approach associated with Anger from Approach of an unpleasant stimulus dic-
tated by goal pursuit, the latter taking part in a cycle of self-regulation. Their imaging
study has also established that prefrontal asymmetry (considered a marker of Approach)
is associated with action motivation rather than stimulus Valence.

Approach and Dominance


The characterisation of Approach faces similar issues to Dominance in terms of its relationship to its target, be it a stimulus, an affordance, or another subject. For instance, the case for a social component can be discussed both for Approach (not least through its connection to empathy) and for Dominance (on whether the environment over which control is felt can be extended to be the subject's social environment). Another commonality, this time operational, is that both Approach and Dominance have been reported to disambiguate between emotional categories sharing similar valence and arousal values.
Some authors have drawn parallels between Approach and Dominance, even suggest-
ing a close connection between them (Demaree et al., 2005). Because this position is
not universally supported, we will discuss it in the conclusion.

EEG Frontal Asymmetry as a Social Signal


Prefrontal EEG alpha asymmetry has been shown by Davidson (1992, 1998) to be
a reliable marker of Approach. Asymmetry scores comparing activation of left and
right hemispheres have been defined to capture real-time signals. Typical scores com-
pare alpha power from channels F3 and F4: for instance, the A2 score is defined by
(F4 – F3)/(F4 + F3). Under specific experimental conditions, it can constitute a candi-
date measure of Approach as a social signal.
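A minimal sketch of computing such an asymmetry score offline with standard tools: estimate alpha-band (8–13 Hz) power at F3 and F4 with Welch's method and form (F4 − F3)/(F4 + F3). Since alpha power is conventionally taken as inversely related to cortical activation, a higher score is read as relatively greater left-frontal activation, hence greater Approach. The sampling rate, channel handling, and artifact rejection are assumptions omitted here, and the synthetic signals are placeholders for real recordings.

```python
import numpy as np
from scipy.signal import welch

FS = 256  # sampling rate in Hz (assumed)

def alpha_power(signal, fs=FS, band=(8.0, 13.0)):
    """Alpha-band power of one EEG channel via Welch's PSD estimate."""
    freqs, psd = welch(signal, fs=fs, nperseg=fs * 2)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    df = freqs[1] - freqs[0]
    return float(psd[mask].sum() * df)

def asymmetry_score(f3, f4):
    """A2-style score on alpha power: (F4 - F3) / (F4 + F3)."""
    p3, p4 = alpha_power(f3), alpha_power(f4)
    return (p4 - p3) / (p4 + p3)

# Synthetic 10 s example: stronger alpha on F3 than F4 yields a negative score.
t = np.arange(0, 10, 1 / FS)
rng = np.random.default_rng(0)
f3 = 2.0 * np.sin(2 * np.pi * 10 * t) + rng.normal(0, 1, t.size)
f4 = 1.0 * np.sin(2 * np.pi * 10 * t) + rng.normal(0, 1, t.size)
print(round(asymmetry_score(f3, f4), 3))
```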
From a psychophysiological perspective, EEG frontal asymmetry has the property
of a stable trait, which characterises emotional tendencies, from response to emotional
stimuli to susceptibility to depression and mood disorders. In particular, individuals with
greater left anterior activity show a higher tendency to engage in Approach behaviour.
However, EEG frontal asymmetry can also behave as a state variable and change in
real-time as part of an Approach or Withdrawal response.
Coan and Allen (2004) have provided a useful classification of EEG frontal asym-
metry experiments based on whether they rely on its trait property (which more often evaluates emotional control) or indeed on state-related changes in asymmetry as a func-
tion of state changes in emotion. The most relevant types of experiments for affective
interfaces are those that explore frontal EEG activation asymmetry as a state measure
(unless the system developed is specifically concerned with psychological aspects, such
as emotional regulation or the treatment of mood disorders). Coan and Allen (2004)
insist on the distinction between activity (tonic recording of cortical processes via EEG)
and activation (change in EEG activity following an emotional stimulus).
While Davidson (2004) had originally identified as a limitation the fact that EEG
scalp signals reflected mostly the activity of the dorsolateral sector of the PFC, brain
imaging studies by Berkman and Lieberman (2010) have since confirmed that asym-
metry effects associated with approach-avoidance occurred primarily in the dorsolat-
eral PFC, a finding that would support the use of EEG to measure frontal asymmetry,
notwithstanding the negative contribution of motion artifacts.
Another important property of the EEG frontal asymmetry signal is that it is acces-
sible to neurofeedback. Previous work in frontal asymmetry neurofeedback has been
developed for the treatment of depression and mood disorders (Rosenfeld et al., 1995;
Baehr, Rosenfeld, & Baehr, 2001) although Allen, Harmon-Jones, and Cavender (2001)
have also successfully used neurofeedback to modify emotional response in a nonclini-
cal context. This amenability to neurofeedback is especially important from the perspec-
tive of affective interfaces due to the complexity of interpreting spontaneous variations
of EEG frontal asymmetry (see also Elgendi et al., 2014). For instance, Davidson et al.
(1990) have questioned the value of spontaneous variations of EEG frontal asymmetry
and Coan and Allen (2003) have highlighted the intertwining of individual trait differ-
ences and occasion-specific fluctuations in the variance of asymmetry scores. However,
Zotev et al. (2014) have reported successful neurofeedback despite a contribution of
neuronal signals of only 30% to the average β-band EEG power in channels F3 and
F4.6 Our own implementation of the concept to develop an affective interface has made
use of a neurofeedback paradigm (see below).

Other Uses of EEG Frontal Asymmetry (Outside Measurements of Approach)


EEG frontal asymmetry has been used in various contexts and, notwithstanding the
above controversy on valence independence, virtually all major affective dimensions
have seen an attempt to correlate them with EEG frontal asymmetry in at least one study.
This situation of multiple but partial correlation with several affective dimensions is not
unlike what has been observed for some physiological signals (for instance, correlation
of heart rate both with Valence and Arousal) and should be considered with caution
without drawing definite conclusions from isolated studies. There is no single explanation for these findings: they may reflect a complex interplay between partial correlations between dimensions, experimental conditions and, in the case of Valence, the partial overlap with Approach discussed above. For instance, Wehbe et al. (2013) have used EEG frontal alpha asymmetry (citing Coan & Allen, 2004) in a computer gaming context, though targeting Arousal rather than Approach. The same authors also
applied this measurement to explore the user experience of 3D displays (Wehbe et al.,

6 Explained in part by the experiment using a combined fMRI-EEG paradigm.


2014), again with a focus on Arousal. Finally, Valenzi et al. (2014) have reported the use of frontal asymmetry for emotion recognition with an apparent emphasis on Valence
(despite making occasional reference to the Approach/Withdrawal hypothesis).

Case Study II: a Neurofeedback Implementation of Approach

Virtual agents have always played an important role in the history of affective interfaces:
as applications became more complex, their modes of interaction progressively moved
away from expressive communication mediated by facial and non-verbal behaviour (see
also Figure 21.2). Another emerging context was the incorporation of virtual agents
into interactive narratives in which the emotional phenomena were an order of mag-
nitude more complex moving away from the perceived emotions of virtual characters
to incorporate sophisticated user experience corresponding to filmic emotions (Tan,
1995). In this context we were in search of high-level affective interfaces through which
the user could directly express her disposition toward virtual characters. This followed
various experiments with physiological sensing, which served to influence narrative ten-
sion rather than disposition toward feature characters. Cavazza and colleagues (Cavazza,
Charles et al., 2014; Cavazza, Aranyi et al., 2014a; Gilroy et al., 2013)7 have explored
the use of an approach-related measure (frontal alpha asymmetry) to support interaction
between the user and a character in a virtual narrative. Cavazza et al.’s initial hypothesis
was to unify Tan’s filmic theory of emotion (1995), which posits that empathy with fea-
ture characters is a key determinant of user response, with a social signal compatible with empathy that could support the implementation of a brain-computer interface (BCI). They
thus decided to explore Approach and alpha frontal asymmetry as a BCI signal. Because
of the individual variations in the FAA baseline and the previous observation about the
lack of significance of spontaneous variations of FAA, Cavazza et al. used a neurofeed-
back paradigm in which the user would control FAA through an empathic cognitive
strategy. More specifically, at key stages of the narrative in which the feature charac-
ter (a female doctor from a medical drama) faced dire situations, the NF option would be triggered, offering the user an opportunity to influence the narrative if s/he could men-
tally support the character. Cavazza et al. took great care to keep instructions generic,
in particular avoiding any explicit mention of empathy, so as not to influence the users’
cognitive strategies. The subjects were essentially told to mentally support the charac-
ter by expressing positive thoughts, an instruction compatible with both Approach and (positive) Valence, but one that should in principle limit the occurrence of empathic concern, which would not be detected by the BCI implementation.
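As a rough illustration of how such a neurofeedback loop might be assembled, the sketch below smooths successive asymmetry scores and maps them onto a single feedback value (for instance the colour saturation used in the installation of Figure 21.2); the window length, baseline statistics, and mapping range are illustrative assumptions, not the parameters of the published system.

    from collections import deque

    class AsymmetryNeurofeedback:
        # Toy neurofeedback loop: smooth successive A2 scores and map them
        # onto a 0-1 feedback value (e.g. colour saturation of the scene).

        def __init__(self, baseline_mean, baseline_std, window=8):
            self.baseline_mean = baseline_mean   # estimated during a resting calibration phase
            self.baseline_std = baseline_std
            self.scores = deque(maxlen=window)   # moving-average window, cf. MA2(t)

        def update(self, a2):
            # Add a new A2 score (e.g. one per epoch) and return feedback in [0, 1].
            self.scores.append(a2)
            ma2 = sum(self.scores) / len(self.scores)
            # Express the smoothed score relative to the user's baseline and
            # clamp a +/-2 SD range onto the [0, 1] feedback scale.
            z = (ma2 - self.baseline_mean) / self.baseline_std
            return min(max((z + 2.0) / 4.0, 0.0), 1.0)

    # Example: feedback rises as scores exceed the resting baseline.
    nf = AsymmetryNeurofeedback(baseline_mean=0.05, baseline_std=0.1)
    for score in (0.05, 0.12, 0.20, 0.25):
        print(round(nf.update(score), 2))

In an actual installation the feedback value would drive some property of the rendered scene in real time; here it is simply printed.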
Several successive implementations of this concept have been reported in Gilroy
et al. (2013), Cavazza, Charles et al. (2014), and Cavazza, Aranyi et al. (2014a), with
newer versions improving feedback mechanisms and calibration methods, but always

7 This research was carried out in collaboration with the Functional Brain Center of Tel Aviv: it is only
summarised here for its use of an Approach-related measure to interact with virtual characters. A detailed
description and a complete list of participants can be found in the references listed.
Figure 21.2 The frontal alpha asymmetry neurofeedback installation (Cavazza, Aranyi et al., 2014a). [Figure showing: (a) the calibration and experiment setting (rest and training phases of 120s, 15s, and 30s), with EEG data filtering of channels F3 and F4 and a colour-saturation mapping; (b) an in-story NF epoch (30s) driven by the moving-average asymmetry score MA2(t).]

maintaining the same strategy. The most recent implementations showed 73% of users
being able to successfully influence the story, despite lacking the extensive training traditionally required by neurofeedback paradigms8. Although empathy can provide a unified framework between filmic affective theory and user response, it would still be premature to characterise this BCI as an empathic one. On the other hand, it certainly suggests that Approach could be a promising dimension to be explored further in affective interfaces. The only remaining limitation is that, based on subjects' debriefings of their cognitive strategies, it is not always possible to dissociate Approach from (positive) Valence9. On average, 50% of successful subjects reported a clear empathic strategy, where the agent is the target of intentional thoughts, and another 50% expressed cognitive strategies primarily characterised by the positive nature of thought contents. In the latter sample,
Approach and Valence could not be formally distinguished.

Conclusions

In their review of the neural basis of emotions, Lindquist et al. (2012) have suggested
that mid-level affective categories would facilitate the unification of various perspec-
tives and a better integration of theory and practice, which could be especially relevant
for affective computing systems. Approach could be a candidate for such mid-level cat-
egories: although Davidson has repeatedly advised caution with the overinterpretation
of the neural substrate of Approach, he has also identified this problem as a worthy research direction.
We have discussed two affective dimensions, Approach and Dominance, which,
despite having attracted growing interest in recent years in various disciplines, have
not yet realised their potential in the context of affective computing. While both have
been presented as complementing the traditional dimensions of Arousal and Valence, it can be noted that Approach and Dominance each stand on their own, supporting the description of specific phenomena and being associated with specific measurements.
Throughout this discussion, it has appeared that several authors have established par-
allels between Approach and Dominance, sometimes going as far as to suggest a close
proximity of these dimensions. In the absence of conclusive evidence, I have not dedi-
cated a section to this hypothesis as I have decided not to embrace it. I will still briefly
review some of their arguments here, within the framework that I have outlined through-
out the chapter, trying to avoid any bias.
Demaree et al. (2005) have explicitly suggested, probably encouraged by the anecdo-
tal finding that clinical depression may be accompanied by a decrease in Dominance,
that the approach-withdrawal theory could fit properties of Dominance and that, further,
right anterior regions could mediate “submissive feelings” rather than Withdrawal. They
thus propose Dominance as a measure of the Approach-Withdrawal emotion intensity.

8 Our subjects (Cavazza, Aranyi et al., 2014b) went through a single 10-minute training session with the NF
system, as compared to multiple sessions totaling several hours of training in the FAA NF literature.
9 We had no reports of subjects getting angry at the character when trying to modify her situation.

In their review of action tendencies, Lowe and Ziemke (2011) have suggested that
Dominance (in the PAD model) could be considered a measure of “orientation”, which
they equate to Approach-Withdrawal.
Whether one is willing to bring Approach and Dominance closer or not, their poten-
tial use in interactive systems is more complementary than it seems. I have subscribed
to the view that Dominance is strongly related to spatial and physical interaction to the
point that it may map onto concrete, physical aspects such as causal effects, appropriation of space, or haptics. Harmon-Jones and Peterson (2009) have shown that body
position could influence left-frontal activation to anger-inducing insults (in practice
reducing Approach). They interpreted this finding through what they call an embodi-
ment hypothesis (Harmon-Jones et al., 2010) that “lying on one’s back is antithetical
to approach-oriented behaviour, particularly aggression". This would of course suggest more than a social signal: a non-aggression feedback signal to the subjects themselves.
Another intriguing commonality between Approach and Dominance resides in their
potential to capture affective (perhaps even aesthetic) judgments, in particular those related to product design. This has been reported, for Approach, by Briesemeister et al. (2013),
while it was embedded at an early stage into the PAD model (Mehrabian & Russell,
1974). However, as phenomena to be analysed grow more complex, the significance of
these shared properties becomes more difficult to establish in the absence of carefully
designed experiments or meta-analyses. Additional research into the application of each
dimension to affective interfaces may generate sufficient data to shed new light on the
actual relationships, if any, between Approach and Dominance.

Acknowledgments

Work on the use of alpha asymmetry for affective interfaces such as the one described
in Gilroy et al. (2013) and Cavazza, Aranyi et al. (2014a) has been undertaken in col-
laboration with the Functional Brain Research Center of the Tel Aviv Sourasky Medical
Center (Prof. Talma Hendler). I am indebted to her and her team for having introduced
me to the use of alpha asymmetry as a measure of approach/avoidance: however, any
remaining inaccuracies or misconceptions in the present article are the author’s sole
responsibility. Part of this work (dominance, PAD-based fusion) was originally funded by the European Commission through the FP6 CALLAS Project [IST-034800].
The PAD-fusion model has been developed by Steve Gilroy.

References

Allen, J. J., Harmon-Jones, E., & Cavender, J. H. (2001). Manipulation of frontal EEG asymmetry
through biofeedback alters self-reported emotional responses and facial EMG. Psychophysiol-
ogy, 38(04), 685–693.
Amodio, D. M., Master, S. L., Yee, C. M., & Taylor, S. E. (2008). Neurocognitive components
of the behavioral inhibition and activation systems: Implications for theories of self-regulation.
Psychophysiology, 45(1), 11–19.
Andreassi, J. (2006). Psychophysiology: Human Behavior and Physiological Response. Hoboken, NJ: Routledge.
Baehr, E., Rosenfeld, J. P., & Baehr, R. (2001). Clinical use of an alpha asymmetry neurofeedback
protocol in the treatment of mood disorders: Follow-up study one to five years post therapy.
Journal of Neurotherapy, 4(4), 11–18.
Berkman, E. T. & Lieberman, M. D. (2010). Approaching the bad and avoiding the good: Lateral
prefrontal cortical asymmetry distinguishes between action and valence. Journal of Cognitive
Neuroscience, 22(9), 1970–1979.
Briesemeister, B. B., Tamm, S., Heine, A., & Jacobs, A. M. (2013). Approach the good, with-
draw from the bad – a review on frontal alpha asymmetry measures in applied psychological
research. Psychology, 4(3), 261–267.
Broekens, J. (2012). In defense of dominance: PAD usage in computational representations of
affect. International Journal of Synthetic Emotions, 3(1), 33–42.
Cavazza, M., Aranyi, G., Charles, F., et al. (2014a). Towards empathic neurofeedback for interac-
tive storytelling. Open Access Series in Informatics, 41, 42–60.
Cavazza, M., Aranyi, G., Charles, F., et al. (2014b). Frontal alpha asymmetry neurofeedback for
brain–computer interfaces. In Proceedings of the 6th International Brain–Computer Interface
Conference, December. Graz, Austria.
Cavazza, M., Charles, F., Aranyi, G., et al. (2014). Towards emotional regulation through neuro-
feedback. In Proceedings of the 5th Augmented Human International Conference (p. 42).
Cavazza, M., Lugrin, J. L., & Buehner, M. (2007). Causal perception in virtual reality and its
implications for presence factors. Presence: Teleoperators and Virtual Environments, 16(6),
623–642.
Coan, J. A. & Allen, J. J. (2003). The state and trait nature of frontal EEG asymmetry in emotion.
In K. Hugdahl & R. J. Davidson (Eds), The Asymmetrical Brain (pp. 565–615), Cambridge,
MA: MIT Press.
Coan, J. A. & Allen, J. J. (2004). Frontal EEG asymmetry as a moderator and mediator of emotion.
Biological Psychology, 67(1), 7–50.
Csikszentmihalyi, M. (1991). Flow: The Psychology of Optimal Experience (vol. 41). New York:
Harper Perennial.
Davidson, R. J. (1992). Anterior cerebral asymmetry and the nature of emotion. Brain and Cog-
nition, 20(1), 125–151.
Davidson, R. J. (1998). Anterior electrophysiological asymmetries, emotion, and depression: Con-
ceptual and methodological conundrums. Psychophysiology, 35(5), 607–614.
Davidson, R. J. (2003). Darwin and the neural bases of emotion and affective style. Annals of the
New York Academy of Sciences, 1000(1), 316–336.
Davidson, R. J. (2004). What does the prefrontal cortex “do” in affect: Perspectives on frontal
EEG asymmetry research. Biological Psychology, 67(1), 219–234.
Davidson, R. J., Ekman, P., Saron, C. D., Senulis, J. A., & Friesen, W. V. (1990). Approach-
withdrawal and cerebral asymmetry: Emotional expression and brain physiology: I. Journal of
Personality and Social Psychology, 58(2), 330.
Decety, J. & Moriguchi, Y. (2007). The empathic brain and its dysfunction in psychiatric pop-
ulations: Implications for intervention across different clinical conditions. BioPsychoSocial
Medicine, 1(1), 22.
Demaree, H. A., Everhart, D. E., Youngstrom, E. A., & Harrison, D. W. (2005). Brain later-
alization of emotional processing: Historical roots and a future incorporating “dominance.”
Behavioral and Cognitive Neuroscience Reviews, 4(1), 3–20.
De Waal, F. B. (2008). Putting the altruism back into altruism: The evolution of empathy. Annual
Review of Psychology, 59, 279–300.
Elgendi, M., Dauwels, J., Rebsamen, B., et al. (2014). From auditory and visual to immersive
neurofeedback: Application to diagnosis of Alzheimer’s disease. In Z. Yang (Ed.), Neural Com-
putation, Neural Devices, and Neural Prosthesis (pp. 63–97). New York: Springer.
Fox, N. A. & Davidson, R. J. (1988). Patterns of brain electrical activity during facial signs of
emotion in 10-month-old infants. Developmental Psychology, 24(2), 230–236.
Gable, P. & Harmon-Jones, E. (2008). Relative left frontal activation to appetitive stimuli: Con-
sidering the role of individual differences. Psychophysiology, 45(2), 275–278.
Gaffary, Y., Eyharabide, V., Martin, J.-C., & Ammi, M. (2013). Clustering approach to char-
acterize haptic expressions of emotions. ACM Transactions on Applied Perception, 10(4),
1–20.
Gatica-Perez, D. (2009). Automatic nonverbal analysis of social interaction in small groups: A
review. Image and Vision Computing, 27(12), 1775–1787.
Gebhard, P. (2005). ALMA: a layered model of affect. In Proceedings of The Fourth International
Joint Conference on Autonomous Agents and Multiagent Systems (pp. 29–36).
Gilroy, S. W., Cavazza, M., & Benayoun, M. (2009). Using affective trajectories to describe states
of flow in interactive art. In Proceedings of the International Conference on Advances in Com-
puter Entertainment Technology (pp. 165–172).
Gilroy, S. W., Cavazza, M., Niiranen, M., et al. (2009). PAD-based multimodal affective fusion.
In 3rd International Conference on Affective Computing and Intelligent Interaction and Work-
shops (pp. 1–8).
Gilroy, S. W., Cavazza, M. O., & Vervondel, V. (2011). Evaluating multimodal affective fusion
using physiological signals. In Proceedings of the 16th International Conference on Intelligent
User Interfaces (pp. 53–62).
Gilroy, S. W., Porteous, J., Charles, F., et al. (2013). A brain-computer interface to a plan-based
narrative. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intel-
ligence (pp. 1997–2005).
Glowinski, D., Coletta, P., Volpe, G., et al. (2010). Multi-scale entropy analysis of dominance
in social creative activities. In Proceedings of the International Conference on Multimedia
(pp. 1035–1038).
Gutsell, J. N. & Inzlicht, M. (2010). Empathy constrained: Prejudice predicts reduced mental sim-
ulation of actions during observation of outgroups. Journal of Experimental Social Psychology,
46(5), 841–845.
Harmon-Jones, E. (2004). Contributions from research on anger and cognitive dissonance to
understanding the motivational functions of asymmetrical frontal brain activity. Biological Psy-
chology, 67(1), 51–76.
Harmon-Jones, E., Gable, P. A., & Peterson, C. K. (2010). The role of asymmetric frontal cortical
activity in emotion-related phenomena: A review and update. Biological Psychology, 84(3),
451–462.
Harmon-Jones, E. & Peterson, C. K. (2009). Supine body position reduces neural response to
anger evocation. Psychological Science, 20(10), 1209–1210.
Huys, Q. J., Cools, R., Gölzer, M., et al. (2011). Disentangling the roles of approach, activation
and valence in instrumental and Pavlovian responding. PLoS Computational Biology, 7(4),
e1002028.
Kleinsmith, A. & Bianchi-Berthouze, N. (2007). Recognizing affective dimensions from body
posture. Lecture Notes in Computer Science, 4738, 48–58.
Kleinsmith, A. & Bianchi-Berthouze, N. (2013). Affective body expression perception and recog-
nition: A survey. IEEE Transactions on Affective Computing, 4(1), 15–33.
Lang, P. J., Greenwald, M. K., Bradley, M. M., & Hamm, A. O. (1993). Looking at pictures:
Affective, facial, visceral, and behavioral reactions. Psychophysiology, 30(3), 261–273.
Light, S. N., Coan, J. A., Zahn-Waxler, C., et al. (2009). Empathy is associated with dynamic
change in prefrontal brain electrical activity during positive emotion in children. Child Devel-
opment, 80(4), 1210–1231.
Lindquist, K. A., Wager, T. D., Kober, H., Bliss-Moreau, E., & Barrett, L. F. (2012). The
brain basis of emotion: A meta-analytic review. Behavioral and Brain Sciences, 35(3),
121–143.
Liu, Y. & Sourina, O. (2012). EEG-based dominance level recognition for emotion-enabled inter-
action. In Proceedings of IEEE International Conference on Multimedia and Expo (pp. 1039–
1044).
Lowe, R. & Ziemke, T. (2011). The feeling of action tendencies: On the emotional regulation of
goal-directed behavior. Frontiers in Psychology, 2.
Mehrabian, A. (1996). Pleasure-arousal-dominance: A general framework for describing and
measuring individual differences in temperament. Current Psychology, 14, 261–292.
Mehrabian, A. & Russell, J. A. (1974). An Approach to Environmental Psychology. Cambridge,
MA: MIT Press.
Prendinger, H. & Ishizuka, M. (2005). The empathic companion: A character-based interface that
addresses users’ affective states. Applied Artificial Intelligence, 19(3–4), 267–285.
Reeves, B. & Nass, C. (1996). The Media Equation: How People Treat Computers, Television,
and New Media Like Real People and Places. New York: Cambridge University Press.
Reuderink, B., Mühl, C., & Poel, M. (2013). Valence, arousal and dominance in the EEG during
game play. International Journal of Autonomous and Adaptive Communications Systems, 6(1),
45–62.
Rosenfeld, J. P., Cha, G., Blair, T., & Gotlib, I. H. (1995). Operant (biofeedback) control of left-
right frontal alpha power differences: Potential neurotherapy for affective disorders. Biofeed-
back and Self-Regulation, 20(3), 241–258.
Russell, J. A. & Mehrabian, A. (1977). Evidence for a three-factor theory of emotions. Journal of
Research in Personality, 11(3), 273–294.
Sander, D. & Scherer, K. R. (2014). Traité de psychologie des émotions. Paris: Dunod.
Sutton, S. K. & Davidson, R. J. (1997). Prefrontal brain asymmetry: A biological substrate of the
behavioral approach and inhibition systems. Psychological Science, 8(3), 204–210.
Tan, E. S. H. (1995). Film-induced affect as a witness emotion. Poetics, 23(1), 7–32.
Tullett, A. M., Harmon-Jones, E., & Inzlicht, M. (2012). Right frontal cortical asymmetry pre-
dicts empathic reactions: Support for a link between withdrawal motivation and empathy. Psy-
chophysiology, 49(8), 1145–1153.
Valenzi, S., Islam, T., Jurica, P., & Cichocki, A. (2014). Individual classification of emotions using EEG. Journal of Biomedical Science and Engineering, 2014.
Wehbe, R. R., Kappen, D. L., Rojas, D., et al. (2013). EEG-based assessment of video and in-
game learning. In Proceedings of CHI’13 Extended Abstracts on Human Factors in Computing
Systems (pp. 667–672).
Wehbe, R. R., Zerebecki, C., Khattak, S., Hogue, A., & Nacke, L. E. (2014). User research for 3D display settings with EEG frontal alpha asymmetry. In CHI Play 2014 Games User Workshop, Toronto.
Witmer, B. G. & Singer, M. J. (1998). Measuring presence in virtual environments: A presence questionnaire. Presence: Teleoperators and Virtual Environments, 7(3), 225–240.
Zotev, V., Phillips, R., Yuan, H., Misaki, M., & Bodurka, J. (2014). Self-regulation of human
brain activity using simultaneous real-time fMRI and EEG neurofeedback. NeuroImage, 85,
985–995.
22 Virtual Reality and Prosocial Behavior
Ketaki Shriram, Soon Youn Oh, and Jeremy Bailenson

Introduction

People have long been intrigued by the notion of a virtual space that could offer an
escape from reality to new sensory experiences. As early as 1965, Sutherland envi-
sioned that the ‘ultimate display’ would enable users to actively interact with the virtual
space as if it were real, giving them “a chance to gain familiarity with concepts not
realizable in the physical world” (Sutherland, 1965, p. 506). William Gibson appears to
have shared this vision when coining the term ‘cyberspace’ in his 1984 novel Neuro-
mancer, defining it as “a consensual hallucination experienced daily by billions of legit-
imate operators, in every nation, by children being taught mathematical concepts …”
(p. 51). While the image may have seemed farfetched at the time, the mounting popu-
larity of home video game consoles, massively multiplayer online role playing games
(MMORPGs), and massive open online courses (MOOCs) all demonstrate that virtual
reality (VR) is an increasingly integral component of our everyday lives.
Despite some romanticized versions of VR, much of the previous literature focused
on its dangers. Early studies voiced concerns about how individuals would no longer
be able to receive true emotional and social support online (e.g., Kraut et al., 1998)
and more recent research focused on Internet addiction (e.g., Lam et al., 2009) as well
as the antisocial effects of playing violent games (e.g., Bartholow, Bushman, & Sestir,
2006). Overall, the results suggest that spending extensive amounts of time in VR can
result in apathetic or even violent attitudes and behavior toward others. Indeed, early
conversations between Jaron Lanier, one of the pioneers of the technology, and William
Gibson, who consulted Lanier while writing his manifesto, focused on this tension.
Lanier envisioned prosocial uses for the technology he championed, but Gibson felt
compelled to write about the less wholesome applications, saying, “Jaron, I tried. But
it’s coming out dark” (Lanier, 2013, p. 329).
In terms of academic research, there is a group of scholars who focus on a more posi-
tive outlook: when properly leveraged, the unique affordances of virtual environments can actually promote prosocial behavior. Recent developments show that even brief virtual interven-
tions can increase environmental awareness, reduce racial bias, and enhance general
altruistic behavior. These interventions have been found to be especially powerful when
the user feels fully immersed in the virtual world. With these findings in mind, the
present chapter will outline the characteristics and strengths of virtual environments.
It will then describe the malleable nature of prosocial behavior and explain how the
technological affordances of virtual environments can be properly employed to encourage prosocial behaviors.

Using Virtual Environments for Social Science Research

Virtual experiences provide users with the chance to experience realistic environments
without the need to be physically present and can be applied to a multitude of domains
such as education, business, and entertainment. Because immersive virtual environ-
ments (IVEs) – virtual environments that perceptually surround the user (Lanier, 2001;
Loomis, Blascovich, & Beall, 1999) – provide the opportunity to create realistic situ-
ations in a controlled manner, they can be employed as a useful methodology to study
human psychology and behavior. IVEs allow researchers to address three main issues in
social science research: (1) the experimental control-mundane realism tradeoff, which
addresses the challenge of conducting a controlled experiment that presents realistic
situations to participants, (2) difficulties in replicating studies, and (3) the use of non-
representative samples (Blascovich et al., 2002). By using IVEs, participants are able
to directly experience realistic and multisensory scenarios in a controlled environment,
which enhances the validity and the reliability of the study. For example, users can be
immersed in a virtual forest, where the movement of animals and other environmen-
tal minutiae can be controlled. With IVEs, researchers are presented with a novel per-
spective and capacity to study social interactions, which allow new insight into human
behavior. Further, because technological advances allow for changes in the nature of
interactions in VR (i.e., transformed social interaction; Bailenson et al., 2004), IVEs
can be used to explore research questions that cannot be pursued under the constraints
of physical reality.

What is Transformed Social Interaction?

People appear to have an inherent desire to alter their appearances. Individuals consult
self-help books for exercise regimens, go on diets, and in the extreme case, undergo
plastic surgery for self-enhancement. However, the extent to which one can change is
highly limited in the physical world. Relatively extreme changes, such as plastic surgery,
are not only dangerous, but also difficult to reverse. VR offers us a venue to go beyond
these physical limitations. Within VR, people are able to bend the laws of physics and
change their appearance, abilities, and surroundings.
In contrast to early assertions that computer-mediated communication (CMC) was
inherently inferior to face-to-face communication, Walther (1996) argued that CMC
could be even more intimate and salient than face-to-face communication (hyperper-
sonal model of communication). Perhaps because the model was conceptualized when
text-based CMC was the norm, Walther focused on the user’s increased ability to moni-
tor his or her verbal messages to support this argument. However, by taking technolog-
ical advances into account, studies show that the hyperpersonal model can be extended
to nonverbal elements (Bailenson et al., 2005). The technology of immersive virtual
environments allows “changing the nature of social interaction” (Bailenson et al., 2004,
p. 429) through the enhancement or degradation of interpersonal communication. Users
can decouple their actual behavior and form from their virtual representations, result-
ing in behavioral and social implications for both the virtual and physical world. This
research paradigm is known as transformed social interaction (TSI). Bailenson and
his colleagues (2004) outline three main dimensions of TSI: (1) self-representation,
(2) sensory capabilities, and (3) situational context. More specifically, by using com-
puter algorithms, it is possible to (1) transform the visual representation of the interac-
tants by changing their appearance or nonverbal behavior (e.g., increasing mutual eye
gaze with the interactant), (2) heighten sensory capabilities by displaying signals that
are available in face-to-face (FTF) contexts (e.g., displaying the names of one’s inter-
actants in virtual bubbles above their heads) or reduce sensory capabilities by hiding
signals that are available in FTF contexts (e.g., choosing not to render distracting hand
motions), and (3) alter the temporal or spatial context by adjusting the speed of the
interaction or the proximity/position of the components of the virtual environment. The
remainder of this section will examine these three categories in further detail.
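As a purely illustrative sketch of how the first two dimensions could be realised in software (the data structure, function names, and the "augmented gaze" rule below are hypothetical and not drawn from the systems cited above), TSI can be thought of as a per-viewer transformation applied between a participant's tracked state and the representation each interactant is shown:

    from dataclasses import dataclass, replace
    from typing import Optional

    @dataclass
    class AvatarState:
        position: tuple                 # tracked head position in the shared space
        gaze_target: tuple              # point the participant is actually looking at
        label: Optional[str] = None     # optional floating "information bubble"

    def transform_for_viewer(sender, viewer_position,
                             augment_gaze=True, show_label=True):
        # Return the version of `sender` that one particular viewer is shown;
        # each viewer can receive a different rendering of the same participant.
        rendered = sender
        if augment_gaze:
            # Augmented mutual gaze: every viewer sees the sender looking at them,
            # regardless of where the sender is actually looking.
            rendered = replace(rendered, gaze_target=viewer_position)
        if not show_label:
            # Degraded sensory capability: hide the floating information bubble.
            rendered = replace(rendered, label=None)
        return rendered

    # Example: the same speaker rendered differently for two viewers.
    speaker = AvatarState(position=(0, 0, 0), gaze_target=(1, 0, 2), label="visiting scholar")
    print(transform_for_viewer(speaker, viewer_position=(3, 0, 1)))
    print(transform_for_viewer(speaker, viewer_position=(-2, 0, 4), show_label=False))

The third dimension, situational context, would operate at the level of the shared scene rather than of individual avatars, for instance by giving every viewer the same front-row position.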

Altering Avatar Appearance and Behavior


Studies have documented that digitally altering others’ nonverbal behavior, such as eye
gaze and mimicry, has a significant influence on one’s perception of and attitude toward
them (e.g., Bailenson & Yee, 2005; Garau et al., 2003). For example, Bailenson and
Yee (2005) found that people liked a computerized agent better when it mimicked their
behavior compared to when it mimicked the prior participant, even though the partici-
pants were fully aware that the agent was not controlled by a real person.
In addition to transforming nonverbal behavior, new techniques also allow users to
alter their digital self-representations, which can subsequently change the manner in
which they perceive themselves (Yee & Bailenson, 2007). Dubbed the ‘Proteus effect,’
this well-researched phenomenon shows that “an individual’s behavior conforms to their
digital self-representation independent of how others perceive them” (Yee & Bailen-
son, 2007, p. 271). Specifically, Yee and Bailenson (2007) found that participants who
embodied attractive avatars exhibited more confidence and openness within social inter-
actions compared to those who embodied unattractive avatars, despite the fact that the
avatars were randomly assigned. Similarly, participants who embodied short avatars
were less confident and more likely to behave submissively in a negotiation for money
compared to their tall-avatar counterparts. The perceptual effects of embodying vari-
ous virtual representations can last well after the immersive experience. When adults
embodied a child in virtual reality, they were more likely to identify themselves with
child-like characteristics afterwards. In comparison, those who embodied an adult body
scaled to the size of a child still identified with adult traits (Banakou, Groten, & Slater,
2013).
The possibility of identifying an alien body as one’s own, also known as ‘body
transfer,’ has been documented outside of VR using the ‘rubber hand’ illusion, where
participants feel gentle strokes on their own hand when a detached rubber hand is being
synchronously touched (Botvinick & Cohen, 1998). More germane to the current con-
text, Slater et al. (2010) found that male participants identified with a female virtual
body when they embodied the avatar of a young girl, which led them to react physiolog-
ically to perceived threats toward the girl in the virtual environment (Slater et al., 2010).
Specifically, male participants who had been given a first (versus third) person perspec-
tive of the female avatar exhibited greater heart rate deceleration when they later saw
the same avatar being slapped by another female avatar, indicating identification with
the avatar, in spite of physical dissimilarities (Slater et al., 2010).

Transforming Sensory Capabilities


The critically acclaimed television series Lie to Me features a leading psychologist
(inspired by Paul Ekman) who interprets fleeting micro-expressions and body lan-
guage to detect whether someone is telling the truth or lying. The notion of accurately
monitoring others’ nonverbal behavior to ‘read’ their thoughts is extremely attractive
when considering the fact that a substantial amount of communication consists of non-
verbal, rather than verbal communication (Birdwhistell, 1970; Mehrabian, 1981). In
reality, however, it is very difficult to track others’ nonverbal cues, especially in real
time. Human senses typically are not capable of detecting and consciously processing
millisecond-by-millisecond changes. It is also challenging for people to monitor their
own nonverbal behavior, as most of it takes place subconsciously.
The technology of IVEs offers a novel method to process nonverbal cues by enhanc-
ing our sensory capabilities. The computer can track the information of interest and
display the data to the user. For example, an instructor can monitor the eye gaze of
his or her students to properly gauge their involvement in the lesson. Interactants can
place their basic information (e.g., name, employer, hometown, etc.) in floating bubbles
above their heads to reduce uncertainty and facilitate the communication (Bailenson
et al., 2004). It is also possible to grant communicators the sensory capability to hear
each other’s heartbeats (Janssen et al., 2010) or see a summary of each other’s facial
expressions or arousal levels during a conversation.

Modifying the Contextual Situation


Imagine a world that is no longer governed by spatial and temporal rules. You do not
have to compete for a front row seat. You can decide which direction your interactants
should face. You can replay a conversation to make sure that you understood all of the
details correctly and 'fast-forward' through the part where your friend suddenly decided to talk about
her pet lizard for 20 minutes. The final category of TSI, transformation of the contextual
situation, pertains to the capacity of VR to make such a world possible.
By using computer algorithms, interactants can easily bend the spatial and temporal
rules to make the virtual environment match their needs and preferences. For example,
every student can have a front row seat in a virtual classroom, an attractive option that
is not viable in physical classrooms (Bailenson et al., 2008). Recent studies suggest
that you can even modify the general ambience of the same environment, which can
subsequently influence your mood. Riva and his colleagues (2007) entertained this pos-
sibility by creating ‘anxious’ and ‘relaxing’ versions of the same virtual park through
manipulations of its auditory (e.g., sound, music) and visual (e.g., shadows, texture)
features.
The diverse range of work on TSI demonstrates that the technology of virtual reality
can be employed to modify social interactions in significant ways, producing attitudinal
and behavioral changes that extend to the physical world (Ahn, Le, & Bailenson, 2013).
This powerful capacity that enables alterations of self-representation, sensory capabili-
ties, and situational context provides the foundation to use virtual experiences as a tool
to both promote and accurately assess prosocial intentions and behavior.

The Plasticity of Prosocial Behavior

While it is common to explain prosocial behavior in terms of individual predispositions such as trait empathy or altruistic personality, prosocial behavior can be encouraged by
seemingly simple interventions (e.g., Galinsky & Ku, 2004; Weng et al., 2013). Writing
a short essay about the typical day of an older man from his perspective significantly
reduced prejudice toward the elderly (Galinsky & Ku, 2004) and receiving audio-based
compassion training for 30 minutes per day significantly increased altruistic tendencies
after only two weeks (Weng et al., 2013).
This suggests that prosocial behavior is highly malleable. Studies show that egocen-
tric motives, priming, and mood valence all influence an individual’s willingness to
engage in altruistic actions.

Using Egocentrism for Prosocial Behavior


Individuals have the intrinsic motivation to protect their sense of ‘self.’ While some
may believe that individuals must abandon their selfish motives in order to engage in
prosocial behavior, studies show precisely the opposite; properly leveraging egocentric
motives can actually promote prosocial actions. For instance, members of the dominant
group (Caucasians) were more likely to support affirmative action when they were told
it would not pose a threat to their ingroup; in contrast, the likelihood of supporting affirmative action was not influenced by how it was perceived to benefit minorities (Lowery
et al., 2006). Similarly, studies argue that the perceived overlap between the self and the
other and the relevance of the issue to the self are egocentric factors that can be used to
encourage prosocial behavior.
Some scholars argue that perspective taking promotes prosocial behavior through
self-referential neural processing, which enables people to empathize with a foreign sit-
uation. That is, perspective taking may lead to the “blurring of the distinction between
self and other” (Ames et al., 2008, p. 643), which triggers empathic concerns. Per-
spective taking, therefore, involves an egocentric process; people show more positive
behavior and attitudes toward the target because of an increase in perceived self–other
overlap (Galinsky & Moskowitz, 2000). The benefits of perspective taking typically
extend beyond a specific individual to members of his or her group. For example, Galin-
sky and Moskowitz (2000) found that taking the perspective of a stereotyped group
member (experiments 1 and 2) was effective in reducing stereotype expression and
stereotype accessibility of the elderly and African Americans.
The degree of psychological attachment to a given issue or person is another ego-
centric element that affects prosocial behavior; the individual’s desire to promote self-
relevant issues drives altruistic behavior. O’Reilly and Chatman (1986) argue that three
main factors predict the psychological attachment of an individual to a given situation:
compliance, identification of relevance, and internalization (the degree of involvement
of the individual). Identifying topic resonance and internalizing involvement are pos-
itively correlated with participants’ prosocial action on an issue. The more attached
people are to an issue, the more likely they are to act on it (O’Reilly & Chatman, 1986).

Priming Good, Priming Bad


Priming has also been fruitful in encouraging prosocial behavior. In their 2005 study,
Nelson and Norton found that participants were more likely to engage in helping behav-
ior when they were primed with a superhero (i.e., they gave a description of the traits
of Superman) compared to those who were not primed at all (i.e., they gave a descrip-
tion of a dorm room). More important, those who were primed with a superhero were
more likely to engage in actual volunteering behavior up to three months after the initial
prime. Similarly, Greitemeyer (2009: experiment 1) found that participants were more
likely to access prosocial thoughts after listening to songs with prosocial lyrics (e.g.,
“Love Generation” by Bob Sinclair) compared to songs with neutral lyrics (e.g., “Rock
This Party” by Bob Sinclair).
In contrast, priming individuals with violent content can discourage prosocial behav-
ior. Previous work found that children exposed to aggressive programs over time showed
less obedience to rules (Friedrich & Stein, 1973). Anderson and Dill (2000) argue that
exposure to violent content increases the accessibility of aggressive thoughts, leading
to antisocial behavior. In their meta-analytic review, Anderson and Bushman (2001)
conclude that experimental and nonexperimental studies support the argument that vio-
lent video games lead to increased aggression in children and young adults. Consid-
ering the pervasiveness of violence in everyday media, these results have implications
for content regulation, especially for children and teenagers still undergoing cognitive
development.

Mood: Feel Good–Do Good


People are more likely to help others when they are in a good mood. Research con-
sistently shows that positive moods lead to increased altruistic behavior through var-
ious mechanisms, including positive social outlook and desire for mood maintenance
(Carlson, Charlin, & Miller, 1988). Similarly, Whitaker and Bushman (2012) found that
playing relaxing video games led to a more positive mood, compared to playing a neutral
or violent game, which subsequently led to more helpful behavior.

Physical Limitations in Previous Work


While previous research has been successful in demonstrating the malleability of proso-
cial behavior, there are some limitations to conducting prosocial research in the physical
world. Most notably, it is extremely challenging to attain both experimental control and
everyday realism (Blascovich et al., 2002). Studies on bystanders’ responses to violent
situations demonstrate this tension; experimental methods require “abstractions from
the complexities of real life” (Rovira et al., 2009, p. 2), while field studies inevitably
include multiple confounds (Rovira et al., 2009), reducing their power to provide con-
crete results. In addition, the ease with which individuals construct vivid mental simu-
lations varies greatly, which poses difficulties for perspective taking and other vignette-
based studies. Because virtual reality can overcome the constraints of face-to-face
communication and allow TSI in a controlled yet realistic manner, these issues can be
at least partially addressed by utilizing virtual environments.

Virtual Reality and Prosocial Behavior

Many aspects of virtual environments render them ideal for conducting studies on proso-
cial behavior. Immersive worlds provide a mix of realism and control, and also offer
new methods by which to study nonverbal behavior (e.g., measuring eye gaze, interper-
sonal distance, etc.). By unobtrusively observing participants’ interactions with virtual
humans, research can determine factors that increase compassion and empathy toward
others (Rovira et al., 2009). Further, the measurement of physiological responses and
subtle nonverbal responses can supplement traditional self-report measures, providing
valuable insight into prosocial behavior. Gillath et al. (2008) found that individuals’
dispositional levels of compassion predicted their proxemic behavior (e.g., movement
paths, head orientation, interpersonal distance) when they were exposed to a visually
impaired man asking for help, demonstrating that virtual environments can be used to
measure prosocial responses in an unobtrusive manner. The plausibility of virtual envi-
ronments may also elicit more realistic responses from participants than an overly arti-
ficial experiment in the physical world (Rovira et al., 2009). Previous studies indicate
that virtual experiences can produce measurable positive behavior.
For example, when placed in an immersive virtual environment, participants were
more likely to help the victim of violence when he looked toward the participant for
help, although this was only the case for ingroup victims (Slater et al., 2013). Similarly,
Navarrete et al. (2012) explored a virtual representation of the “trolley problem” – a
decision that involved saving one life at the cost of others. When faced with this decision
in virtual reality, participants experienced high levels of arousal, allowing researchers
to investigate the link between moral judgment and prosocial action. These discoveries
would have been difficult without the realism and detailed tracking measures available
in virtual environments.
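As a simple illustration of the kind of unobtrusive behavioural measure such tracking makes possible (the log format and the two summary statistics below are illustrative assumptions rather than the measures used in the studies cited), interpersonal distance toward a virtual human can be computed directly from the head positions the system already records:

    import math

    def interpersonal_distances(head_positions, target_position):
        # Distance from the participant's tracked head to a virtual human,
        # one value per logged frame (positions given as (x, y, z) in metres).
        return [math.dist(p, target_position) for p in head_positions]

    def proxemic_summary(head_positions, target_position):
        # Minimum and mean distance kept from the virtual human: coarse
        # proxies for approach versus avoidance behaviour.
        d = interpersonal_distances(head_positions, target_position)
        return {"min_distance": min(d), "mean_distance": sum(d) / len(d)}

    # Example: a participant who drifts toward a virtual human standing at (2, 1.7, 2).
    track = [(0.0, 1.7, 0.0), (0.5, 1.7, 0.5), (1.0, 1.7, 1.0), (1.4, 1.7, 1.5)]
    print(proxemic_summary(track, (2.0, 1.7, 2.0)))

Head orientation and movement paths can be summarised in the same way, turning the tracking stream that drives the rendering into a behavioural record at no extra cost to the participant.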

Prosocial Health and Medical Applications


As virtual experiences yield perceptual changes through alterations in self-
representation and their surrounding environments, they can be used to positively influ-
ence health-related behaviors and medical conditions. For example, participants who
viewed self-resembling avatars losing weight based on their level of exercise in vir-
tual environments were more likely to display healthy behaviors than those who viewed
avatars that did not resemble the participant (Fox & Bailenson, 2009).
There are many potential applications of VR as a tool for positive change in the medical field. Researchers have explored the therapeutic possibilities of treating patients with eating disorders in virtual environments to promote positive body image (Perpiñá et al., 1999), as well as the potential of virtual reality to treat post-traumatic stress disorder using exposure therapy (Rizzo et al., 2009), to improve the driving performance of military personnel recovering from traumatic brain injury (Cox et al., 2010), and to alleviate the pain of adolescent burn patients (Hoffman et al., 2000).
Virtual interventions are also one of the few effective treatments for youth with high-
functioning autism. Jarrold and his colleagues (2013) demonstrated that virtual envi-
ronments could be used for a more nuanced understanding of children with higher
functioning autism spectrum disorder and could thus inform efforts to design proper
interventions. In one such attempt, participants were placed in several social situations
in virtual reality, using an avatar customized to look like themselves. Over a five-week
period, participants who received this treatment showed improvement in social function
and cognition (Kandalaft et al., 2013). In another study, researchers presented autistic
adolescents with a graphical display that quantified their emotion levels (‘emotion bub-
bles’) during a conversation with their peers. The study found that this enhanced sensory
capacity helped the participants understand and adjust their facial expressions (Madsen
et al., 2008). These results present a promising future for the use of IVEs in medical
treatment and therapy across a variety of conditions.

Prosocial Attitude and Behavior Change


Virtual experiences can also impact attitudes, generating prosocial behavior. Positive
effects have been found in reducing prejudice and increasing general altruistic behavior.
For example, Ahn and her colleagues (2013) examined the effect of embodied experi-
ences on helping behavior by randomly assigning participants to either a colorblind or
normal condition in an IVE. In the normal condition, participants were asked to imagine
they had red-green colorblindness. After the study, participants were offered the chance
to assist colorblind people. Those who had embodied the colorblind condition were
more likely to volunteer to help than those who had imagined being colorblind, under-
scoring how virtual embodiment can be a more effective method than mental simulation
for perspective taking.
However, the implications of embodying the avatar of an outgroup member are not
always so clear-cut. In their study, Groom, Bailenson, and Nass (2009) had participants
either embody or imagine themselves as a black or white model in a virtual environ-
ment. Instead of showing a reduction in prejudice, those who embodied black avatars
displayed stronger implicit racial bias in the physical world, suggesting increased stereo-
type activation. This difference was not found for the mental simulation condition.
In contrast to Groom and her colleagues (2009), a more recent study found that partic-
ipants who embodied dark skinned avatars exhibited decreased implicit racial bias com-
pared to those who embodied light or purple-skinned avatars (Peck et al., 2013). Even
when the implicit association test (IAT) was given to participants three days later, those
who had embodied dark skinned avatars still showed significantly less racial bias (Peck
et al., 2013). These disparate results suggest that the effects of embodiment on preju-
dice are sensitive to certain boundary conditions, which can lead to different results.
Peck and colleagues used a more advanced system to track and render avatar move-
ments; this increased immersion could explain why empathy trumped priming in their
study.
In addition to attitudinal changes, virtual experiences also lead to behavioral changes
in the physical world. For example, participants who were granted the “superpower” of
flight in virtual reality were more likely to display altruistic behavior in the physical
world (helping the researcher pick up a spilled cup of pens) than those who rode as a
passenger in a virtual helicopter (Rosenberg, Baughman, & Bailenson, 2013).

Prosocial Environmental Behavior


Immersion in virtual environments impacts and promotes prosocial environmental
behavior. In one series of studies, participants were asked to either cut down a vir-
tual tree (IVE condition) or imagine a tree being cut down (mental simulation condi-
tion). While participants in both conditions showed an increase in pro-environmental
self-efficacy (i.e., the belief that their actions could improve the environment), the par-
ticipants in the embodiment condition were more likely to engage in pro-environmental
behavior in physical reality (paper conservation) than those in the mental simulation
condition, suggesting that embodied experiences are crucial to behavior change (Ahn,
Bailenson, & Park, 2013).
Similarly, a virtual simulation of flooding evoked greater awareness and knowl-
edge of coping strategies for natural disasters than traditional images of flooded areas
(Zaalberg & Midden, 2010). This presents the potential application of IVE for future
disaster preparedness. With more knowledge and awareness of what the event may feel
like, people may respond to victims more quickly, enabling more efficient disaster man-
agement. Presence, or the degree to which users actually feel they are in the environ-
ment, is an important consideration in response; the vividness and intensity of the vir-
tual experience are both factors in promoting attitude change (Meijnders, Midden, &
McCalley, 2006).
Virtual nature can also induce anxiety or relaxation among users (Riva et al., 2007;
Valtchanov, Barton, & Ellard, 2010). The degree of immersion can impact these effects:
participants who saw a restorative environment (nature scenes) on a high immersive
screen were more likely to show stress-reduction than those who saw the environment
on a low immersive screen. Immersion was manipulated by the size of the screen (De
Kort et al., 2006). These results indicate the potential of virtual environments to be used
as a tool for social action in the environmental sphere. If higher levels of emotion can be
induced through high immersion, virtual environments that vividly depict the potential
devastating outcomes of global warming may produce attitude and behavior change.

Conclusion

Previous literature has explored the potential of using virtual experiences to promote
prosocial behavior and attitude changes. Promising results have been found for preju-
dice reduction, general altruistic behavior, positive health behaviors and medical treat-
ment, and knowledge and preparation for natural disasters. Future research should
explore different forms of embodiment; for example, animal embodiment may produce
significant effects on prosocial behavior. Experiment length is another consideration,
as greater time spent immersed may result in more pronounced effects. Social bias
presents another potential area for study in virtual spaces; embodying the ill or physi-
cally impaired may alter attitudes and behaviors toward such groups in the physical world.
Further research should also consider how TSI can be used to leverage psychological
factors that affect prosocial motivations, such as egocentrism, priming, and mood. From
our review, we conclude that virtual spaces provide new ways to change attitudes and
promote prosocial behavior, and that more work is required to determine the extent of
these transformative effects.

References

Ahn, S. J., Bailenson, J., & Park, D. (2013). Felling a tree to save paper: Short- and long-term
effects of immersive virtual environments on environmental self-efficacy, attitude, norm, and
behavior. Paper presented at the 63rd Annual International Communication Association Con-
ference, June 17–21, London.
Ahn, S. J., Le, A. M. T., & Bailenson, J. N. (2013). The effect of embodied experiences on self–
other merging, attitude, and helping behavior. Media Psychology, 16(1), 7–38.
Ames, D. L., Jenkins, A. C., Banaji, M. R., & Mitchell, J. P. (2008). Taking another person’s per-
spective increases self-referential neural processing. Psychological Science, 19(7), 642–644.
Anderson, C. A. & Bushman, B. J. (2001). Effects of violent video games on aggressive behavior,
aggressive cognition, aggressive affect, physiological arousal, and prosocial behavior: A meta-
analytic review of the scientific literature. Psychological Science, 12(5), 353–359.
Anderson, C. A. & Dill, K. E. (2000). Video games and aggressive thoughts, feelings, and behav-
ior in the laboratory and in life. Journal of Personality and Social Psychology, 78(4), 772–790.
Bailenson, J. N., Beall, A. C., Loomis, J., Blascovich, J., & Turk, M. (2004). Transformed social
interaction: Decoupling representation from behavior and form in collaborative virtual envi-
ronments. Presence: Teleoperators and Virtual Environments, 13(4), 428–441.
Bailenson, J. N., Beall, A. C., Loomis, J., Blascovich, J., & Turk, M. (2005). Transformed social
interaction, augmented gaze, and social influence in immersive virtual environments. Human
Communication Research, 31(4), 511–537.
Bailenson, J. N. & Yee, N. (2005). Digital chameleons: Automatic assimilation of nonverbal gestures in immersive virtual environments. Psychological Science, 16(10), 814–819.
Bailenson, J. N., Yee, N., Blascovich, J., et al. (2008). The use of immersive virtual reality in the
learning sciences: Digital transformations of teachers, students, and social context. Journal of
the Learning Sciences, 17(1), 102–141.
Banakou, D., Groten, R., & Slater, M. (2013). Illusory ownership of a virtual child body causes
overestimation of object sizes and implicit attitude changes. Proceedings of the National
Academy of Sciences, 110(31), 12846–12851.
Bartholow, B. D., Bushman, B. J., & Sestir, M. A. (2006). Chronic violent video game exposure
and desensitization to violence: Behavioral and event-related brain potential data. Journal of
Experimental Social Psychology, 42(4), 532–539.
Birdwhistell, R. L. (1970). Kinesics and Context: Essays On Body Motion Communication.
Philadelphia: University of Pennsylvania Press.
Blascovich, J., Loomis, J., Beall, A. C., et al. (2002). Immersive virtual environment technology
as a methodological tool for social psychology. Psychological Inquiry, 13(2), 103–124.
Botvinick, M. & Cohen, J. (1998). Rubber hands “feel” touch that eyes see. Nature, 391(6669),
756.
Carlson, M., Charlin, V., & Miller, N. (1988). Positive mood and helping behavior: A test of six
hypotheses. Journal of Personality and Social Psychology, 55(2), 211–229.
Cox, D. J., Davis, M., Singh, H., et al. (2010). Driving rehabilitation for military personnel recov-
ering from traumatic brain injury using virtual reality driving simulation: A feasibility study.
Military Medicine, 175(6), 411–416.
De Kort, Y. A. W., Meijnders, A. L., Sponselee, A. A. G., & IJsselsteijn, W. A. (2006). What’s
wrong with virtual trees? Restoring from stress in a mediated environment. Journal of Envi-
ronmental Psychology, 26(4), 309–320.
Fox, J. & Bailenson, J. N. (2009). Virtual self-modeling: The effects of vicarious reinforcement
and identification on exercise behaviors. Media Psychology, 12(1), 1–25.
Friedrich, L. K. & Stein, A. H. (1973). Aggressive and prosocial television programs and the nat-
ural behavior of pre-school children. Monographs of the Society for Research in Child Devel-
opment, 38(4), 1–64.
Galinsky, A. D. & Ku, G. (2004). The effects of perspective-taking on prejudice: The
moderating role of self-evaluation. Personality and Social Psychology Bulletin, 30(5),
594–604.
Galinsky, A. D. & Moskowitz, G. B. (2000). Perspective-taking: Decreasing stereotype expres-
sion, stereotype accessibility, and in-group favoritism. Journal of Personality and Social Psy-
chology, 78(4), 708–724.
Garau, M., Slater, M., Vinayagamoorthy, V., et al. (2003). The impact of avatar realism and eye
gaze control on perceived quality of communication in a shared immersive virtual environment.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 529–
536). New York: ACM Press.
Gibson, W. (2000). Neuromancer. New York: Penguin Books.
Gillath, O., McCall, C., Shaver, P. R., & Blascovich, J. (2008). What can virtual reality teach us
about prosocial tendencies in real and virtual environments? Media Psychology, 11(2), 259–
282.
Virtual Reality and Prosocial Behavior 315

Greitemeyer, T. (2009). Effects of songs with prosocial lyrics on prosocial behavior: Further evi-
dence and a mediating mechanism. Personality and Social Psychology Bulletin, 35(11), 1500–
1511.
Groom, V., Bailenson, J. N., & Nass, C. (2009). The influence of racial embodiment on racial bias
in immersive virtual environments. Social Influence, 4(3), 231–248.
Hoffman, H. G., Doctor, J. N., Patterson, D. R., Carrougher, G. J., & Furness III, T. A. (2000).
Virtual reality as an adjunctive pain control during burn wound care in adolescent patients.
Pain, 85(1), 305–309.
Janssen, J. H., Bailenson, J. N., IJsselsteijn, W. A., & Westerink, J. H. (2010). Intimate heart-
beats: Opportunities for affective communication technology. IEEE Transactions on Affective
Computing, 1(2), 72–80.
Jarrold, W., Mundy, P., Gwaltney, M., et al. (2013). Social attention in a virtual public speaking
task in higher functioning children with autism. Autism Research, 6(5), 393–410.
Kandalaft, M., Didehbana, N., Krawczyk, D., Allen, T., & Chapman, S. (2013). Virtual reality
social cognition training for young adults with high-functioning autism. Journal of Autism and
Developmental Disorders, 43(1), 34–44.
Kraut, R., Patterson, M., Lundmark, V., et al. (1998). Internet paradox: A social technology
that reduces social involvement and psychological well-being? American Psychologist, 53(9),
1017–1031.
Lam, L. T., Peng, Z. W., Mai, J. C., & Jing, J. (2009). Factors associated with Internet addiction
among adolescents. Cyberpsychology & Behavior, 12(5), 551–555.
Lanier, J. (2001). Virtually there. Scientific American, 284(4), 66–75.
Lanier, J. (2013). Who Owns the Future? New York: Simon & Schuster.
Loomis, J. M., Blascovich, J. J., & Beall, A. C. (1999). Immersive virtual environment technology
as a basic research tool in psychology. Behavior Research Methods, Instruments, & Computers,
31(4), 557–564.
Lowery, B. S., Unzueta, M. M., Knowles, E. D., & Goff, P. A. (2006). Concern for the in-group
and opposition to affirmative action. Journal of Personality and Social Psychology, 90(6), 961–
974.
Madsen, M., el Kaliouby, R., Goodwin, M., & Picard, R. W. (2008). Technology for just-in-time
in-situ learning of facial affect for persons diagnosed with an autism spectrum disorder. In
Proceedings of the 10th ACM Conference on Computers and Accessibility (pp. 19–26), Halifax,
Canada: ACM Press.
Mehrabian, A. (1981). Silent Messages: Implicit Communication of Emotions and Attitudes
(2nd edn). Belmont, CA: Wadsworth.
Meijnders, A. L., Midden, C. J. H., & McCalley, L. T. (2006). The persuasive power of mediated
risk experiences. In W. IJsselsteijn, Y. de Kort, C. Midden, B. Eggen, & E. van den Hoven
(Eds), Proceedings of Persuasive Technology: First International Conference on Persuasive
Technology for Human Well-Being (vol. 3962, pp. 50–54). Berlin: Springer.
Navarrete, C. D., McDonald, M. M., Mott, M. L., & Asher, B. (2012). Virtual morality: Emo-
tion and action in a simulated three-dimensional “trolley problem.” Emotion, 12(2), 364–
370.
Nelson, L. D. & Norton, M. I. (2005). From student to superhero: Situational primes shape future
helping. Journal of Experimental Social Psychology, 41(4), 423–430.
O’Reilly, C. A. & Chatman, J. (1986). Organizational commitment and psychological attachment:
The effects of compliance, identification, and internalization on prosocial behavior. Journal of
Applied Psychology, 71(3), 492.
316 Machine Synthesis of Social Signals

Peck, T. C., Seinfeld, S., Aglioti, S. M., & Slater, M. (2013). Putting yourself in the skin of a black
avatar reduces implicit racial bias. Consciousness and Cognition, 22(3), 779–787.
Perpiñá, C., Botella, C., Baños, R., et al. (1999). Body image and virtual reality in eating dis-
orders: Is exposure to virtual reality more effective than the classical body image treatment?
Cyberpsychology & Behavior, 2(2), 149–155.
Riva, G., Mantovani, F., Capideville, C. S., et al. (2007). Affective interactions using virtual real-
ity: The link between presence and emotions. Cyberpsychology & Behavior, 10(1), 45–56.
Rizzo, A., Reger, G., Gahm, G., Difede, J., & Rothbaum, B. O. (2009). Virtual reality expo-
sure therapy for combat-related PTSD. In P. Shiromani, T. Keane, & J. LeDoux (Eds), Post-
Traumatic Stress Disorder: Basic Science and Clinical Practice (pp. 375–399). New York:
Humana Press.
Rosenberg, R. S., Baughman, S. L., & Bailenson, J. N. (2013). Virtual superheroes: Using super-
powers in virtual reality to encourage prosocial behavior. PLoS One, 8(1). http://dx.plos.org/10
.1371/journal.pone.0055003.g004.
Rovira, A., Swapp, D., Spanlang, B., & Slater, M. (2009). The use of virtual reality in the study
of people’s responses to violent incidents. Frontiers in Behavioral Neuroscience, 3(59). http://
www.frontiersin.org/Behavioral_Neuroscience/10.3389/neuro.08.059.2009/full.
Slater, M., Rovira, A., Southern, R., et al. (2013). Bystander responses to a violent incident
in an immersive virtual environment. PloS One, 8(1). http://dx.plos.org/10.1371/journal.pone
.0052766.g004.
Slater, M., Spanlang, B., Sanchez-Vives, M. V., & Blanke, O. (2010). First person experience
of body transfer in virtual reality. PLoS One, 5(5). http://dx.plos.org/10.1371/journal.pone
.0010564.
Sutherland, I. E. (1965). The ultimate display. Proceedings of International Federation for Infor-
mation Processing Congress, 2, 506–508.
Valtchanov, D., Barton, K. R., & Ellard, C. (2010). Restorative effects of virtual nature settings.
Cyberpsychology, Behavior, and Social Networking, 13(5), 503–512.
Walther, J. B. (1996). Computer-mediated communication impersonal, interpersonal, and hyper-
personal interaction. Communication Research, 23(1), 3–43.
Weng, H. Y., Fox, A. S., Shackman, A. J., et al. (2013). Compassion training alters altruism and
neural responses to suffering. Psychological Science, 24(7), 1171–1180.
Whitaker, J. L. & Bushman, B. J. (2012). “Remain calm. Be kind.” Effects of relaxing video
games on aggressive and prosocial behavior. Social Psychological and Personality Science,
3(1), 88–92.
Yee, N. & Bailenson, J. N. (2007). The Proteus effect: The effect of transformed self-
representation on behavior. Human Communication Research, 33(3), 271–290.
Zaalberg, R. & Midden, C. (2010). Enhancing human responses to climate change risks through
simulated flooding experiences. In T. Ploug, P. Hasle, & H. Oinas-Kukkonen (Eds), Persuasive
Technology (pp. 205–210). Berlin: Springer.
23 Social Signal Processing in
Social Robotics
Maha Salem and Kerstin Dautenhahn

Introduction

In recent years, the roles of robots have become increasingly social, leading to a shift
from machines that are designed for traditional human–robot interaction (HRI), such as
industrial robots, to machines intended for social HRI. As a result, the wide range of
robotics applications today includes service and household robots, museum and recep-
tion attendants, toys and entertainment devices, educational robots, route guides, and
robots for elderly assistance, therapy, and rehabilitation. In light of this transformation
of application domain, many researchers have investigated improved designs and capa-
bilities for robots to engage in meaningful social interactions with humans (Breazeal,
2003).
The term social robots was defined by Fong, Nourbakhsh, and Dautenhahn (2003) to
describe “embodied agents that are part of a heterogeneous group: a society of robots
or humans. They are able to recognize each other and engage in social interactions,
they possess histories (perceive and interpret the world in terms of their own experi-
ence), and they explicitly communicate with and learn from each other” (p. 144). Other
terms that have been used widely are “socially interactive robots” (Fong et al., 2003)
with an emphasis on peer-to-peer multimodal interaction and communication between
robots and people, and “sociable robots” (Breazeal, 2002) that pro-actively engage with
people based on models of social cognition. A discussion of the different concepts of
social robots can be found in Dautenhahn (2007). Note that all the above definitions
consider social robots in the context of interactions with humans; this is in contrast
to approaches on collective and swarm robotics (Kube, 1993; Bonabeau, Dorigo, &
Theraulaz, 1999; Kernbach, 2013) which emphasise interactions among large groups of
(typically) identical robots that strongly rely on communication mediated by the envi-
ronment and afforded by the physical embodiment of the robots.
Together with the attempt to name and define this new category of robots, a whole
new research area – social robotics – has since emerged. Social robotics research is
dedicated to designing, developing, and evaluating robots that can engage in social envi-
ronments in a way that is appealing and familiar to human interaction partners (Salem
et al., 2013). However, interaction is often difficult as inexperienced users struggle to
understand the robot’s internal states, intentions, actions, and expectations. To facilitate
successful interaction, social robots should therefore provide communicative function-
ality that is intuitive and, to some extent, natural to humans. The appropriate level of
such communicative functionality strongly depends on the physical appearance of the
robot and attributions of ability thus made to it (Goetz, Kiesler, & Powers, 2003). These
initial attributions, in turn, influence human users’ expectations and social responses
with regard to the robot.
Various design approaches can be chosen depending on the social context and
intended application of the robot. Fong et al. (2003) define four broad categories of
social robots based on their appearance and level of embodiment: anthropomorphic,
zoomorphic, caricatured, and functionally designed robots. While the last three design
categories aim to establish a human–creature relationship which does not evoke as high
an expectation on the human’s side, anthropomorphic design, in contrast, is targeted at
supporting intuitive, humanlike interaction (Breazeal, 2002; Duffy, 2003). This chapter
will focus primarily on those social robots that are being used in the field of human–
robot interaction (Goodrich & Schultz, 2007; Dautenhahn, 2007). Equipping a robot
with humanlike body features, such as a head, two arms, and two legs, can help to
elicit the broad spectrum of responses that humans usually direct toward
one another. This phenomenon is referred to as anthropomorphism (Epley, Waytz, &
Cacioppo, 2007), i.e. the attribution of human qualities to non-living objects, and it
is increased when social signals and behaviours are generated as well as correctly
interpreted by the robot during interaction (Duffy, 2003). Besides the robot’s physi-
cal appearance, such social cues comprise vocal behaviours, facial expression and gaze,
body posture and gesture, and spatial distances between interaction partners, typically
referred to as proxemics. These aspects and their roles in social human–robot interaction
are discussed in more detail in the following sections.

Social Signal Processing in Human–Robot Interaction

Social robots that are intended to engage in natural and intuitive HRI or to fulfill social
roles require certain interactive skills and social behaviours. For example, a therapeutic
robot should be able to sense the user’s mental state in order to adjust its motivational
strategy to the given situation: if the patient is frustrated or unmotivated, the robot may
have to show empathic behaviours and try to motivate the person in different ways than
when dealing with a patient who is generally in good spirits.
Therefore, interactive robot companions need to be capable of perceiving, process-
ing, and interpreting the user’s social signals in order to subsequently generate adequate
responses. These responses, in turn, must be easily and correctly interpretable by the
human to successfully close the interaction loop. For this purpose, the robot needs to
be equipped with a conceptual model of human social behaviours based on which it
can both perceive and generate appropriate social signals. Besides processing human
behaviours that are perceived during interaction, the robot ideally exploits past interac-
tion experiences to constantly update and expand its initial conceptual model based on
learning in real-world scenarios.
Similarly, on the human’s side, repeated interaction with the robot affects the men-
tal model the user may initially have about the robot (Lee et al., 2005), ideally
resulting in enhanced communication as the human learns to perceive, process, and
interpret the robot's social behaviours more successfully over time. Such experience-
based co-learning and mutual adaptation between the two interaction partners is a cru-
cial aspect not only of human–robot interaction, but also of human–human interaction
(Kolb, 1984). Figure 23.1 provides a schematic model of social signal processing in
HRI. The different types of social signals that play a role in this processing loop are
described in the following sections, highlighting the main current trends in social
robotics.

Figure 23.1 Model of social signal processing in human–robot interaction.
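The loop in Figure 23.1 can be made concrete as a simple control cycle in which the robot repeatedly perceives human cues, interprets them against its conceptual model, learns from the outcome, and generates a response. The following Python sketch is purely illustrative; the names (SocialSignal, ConceptualModel, interaction_loop, sense, act) are hypothetical placeholders for whatever perception and behaviour-synthesis components a given platform provides, not an existing API.

from dataclasses import dataclass, field

@dataclass
class SocialSignal:
    modality: str      # e.g. "gaze", "gesture", "prosody", "proxemics"
    value: dict        # modality-specific measurements
    timestamp: float

@dataclass
class ConceptualModel:
    """The robot's model of human social behaviour, refined from experience."""
    history: list = field(default_factory=list)

    def interpret(self, signals):
        # Toy interpretation: the user counts as engaged if any cue was observed.
        return {"user_state": "engaged" if signals else "disengaged"}

    def learn(self, signals, interpretation):
        # Co-learning: keep past interactions to refine future interpretations.
        self.history.append((signals, interpretation))

def interaction_loop(sense, act, model, steps=100):
    """One perception -> processing/learning -> generation cycle per step."""
    for _ in range(steps):
        signals = sense()                  # perception of human social cues
        state = model.interpret(signals)   # processing
        model.learn(signals, state)        # learning from experience
        act(state)                         # generation of robot behaviour

In a real system, sense() would wrap the robot's camera and microphone pipelines and act() its speech, gaze, and gesture controllers.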
While some aspects of social signal processing discussed in this chapter are rele-
vant also for human interaction with computer interfaces and virtual embodied conver-
sational agents (ECAs) with social intelligence, it is important to note the additional
challenges that arise in potentially much richer and more complex dynamic HRI envi-
ronments: due to the robot’s embodiment and flexible mobility in the human interaction
space, technology dedicated to the perception and detection of human social signals
(e.g. the robot’s camera or microphone) has to handle frequently changing environmen-
tal settings such as varying lighting conditions, altered camera angles or background
noise.

Vocal Behaviours
Nonverbal vocal cues include all aspects of speech beyond the words themselves, e.g.
voice quality, silent pauses and fillers, vocalizations, and turn-taking patterns (Vincia-
relli, Pantic, & Bourlard, 2008). Despite the general efforts in the affective computing
domain to develop systems that can detect and analyse such vocal behaviours, social
robotics applications are currently very limited in their ability to detect and process
nonverbal vocal cues. At this stage, HRI research is still tackling the problem of reliably
detecting, processing, and interpreting the mere occurrence and the verbal content of
speech (e.g. Lee et al., 2012), as robots often operate in noisy environments in which
natural spoken interaction using the robot’s integrated microphones, rather than speech
processing devices attached to the user, is rarely possible.
For example, when facing a group of people, a service robot employed in a multiuser
setting like an airport or museum will have difficulty identifying the current interaction
partner and the respective spoken input directed toward the robot. This issue is amplified
in cases of speech overlap, e.g. when multiple individuals address the robot simultane-
ously or alternate between talking to each other and to the robot. In conclusion, as long
as natural language processing at the semantic level still poses a major challenge in HRI,
advances in the detection and processing of nonverbal vocal behaviours and cues, e.g.
with robots being able to recognise sarcasm or frustration in human speech, are likely
to remain a long way off.
In contrast, the generation of such behavioural cues is advancing more quickly
and recent empirical work has demonstrated their effectiveness in HRI: for example,
Chidambaram, Chiang, and Mutlu (2012) showed that humans comply with a robot’s
suggestions significantly more when it uses persuasive nonverbal vocal cues than when
it does not use such cues. Eyssel et al. (2012) investigated effects of vocal cues reflect-
ing the robot’s gender (male vs female voice) and voice type (humanlike vs robot-like
voice) to evaluate the impact of these vocal features on HRI acceptance; the results
suggest that a robot’s vocal cues significantly influence human judgment regarding the
robot.
Therefore, such findings and their implications should be taken into consideration
when designing social robots for specific application areas, as small variations in the
robot’s nonverbal vocal cues may significantly alter its perception and acceptance.

Gaze and Facial Behaviours


The human face provides a rich channel for nonverbal communication, including eye gaze,
head gestures, and facial expressions, which play important roles in regulating social
interaction, e.g. by nodding or gazing, and in displaying emotions. Such facial com-
municative signals (FCS) provide very valuable social cues which can be used by a robot
to infer the user’s mental or affective state (Castellano et al., 2010). Therefore, one major
objective of automatic facial signal processing is to enhance HRI by enabling the robot
to react appropriately to the human user’s nonverbal feedback which, in turn, is believed
to enhance the quality of interaction as perceived by the human (Lang et al., 2012).
The endeavour to endow social robots with so-called affect sensitivity (Castellano
et al., 2010) is not a straightforward one. Most affect recognition systems focus mainly
on the detection and recognition of basic emotions such as joy, sadness, anger, fear,
surprise and disgust (Zeng et al., 2009). However, in social HRI it can be useful for
the robot to also recognise more complex or subtle states such as boredom, irritation
or interest, in order to adjust its interaction behaviour accordingly. In recent years,
some attempts to recognise non-basic affective expressions have been presented in the
literature.
For example, El Kaliouby and Robinson (2005) proposed a computational real-time
model that detects complex mental states such as agreeing and disagreeing, unsure and
interested, thinking and concentrating based on head movement and facial expressions
in video data. Yeasin, Bullot, and Sharma (2006) presented an approach to recognise
six basic facial expressions which are then used to compute the user’s interest level.
In an attempt to analyse the temporal dynamics and correlations of affective expres-
sions, Pantic and Patras (2006) developed a method that handles a wide range of human
facial behaviour by recognising the relevant facial muscle actions based on so-called
action units (AUs). In contrast to other systems for facial muscle action detection, their
approach does not require frontal-view face images and handles temporal dynamics in
extended face image sequences.
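As a rough illustration of how detected AUs can be turned into affective labels, the rule-based sketch below matches a set of active AUs against a few commonly cited FACS prototype combinations (e.g., AU6 + AU12 for happiness). The prototype table is a simplification introduced here for illustration, AU detection itself is assumed to happen elsewhere, and real systems such as the one described above additionally model AU intensities and temporal dynamics.

# Toy mapping from detected facial action units (AUs) to basic emotions.
# The prototypes are simplified textbook combinations, not a validated model.

EMOTION_PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "anger":     {4, 5, 7, 23},  # brow lowerer + lid and lip tighteners
}

def classify_emotion(active_aus):
    """Return the emotion whose AU prototype is best covered by the detections."""
    scores = {
        emotion: len(prototype & active_aus) / len(prototype)
        for emotion, prototype in EMOTION_PROTOTYPES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= 0.75 else "neutral/other"

# Example: AUs 6 and 12 detected in the current frame.
print(classify_emotion({6, 12}))   # -> "happiness"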
Although these and several other approaches to automatic facial expression recog-
nition have been presented in the computer vision literature, there have been only a few
attempts to integrate such systems on robotic platforms in order to allow for recognition
of human emotion and subsequent robot feedback in real time. Representing one of the
few approaches in HRI, Castellano et al. (2013) successfully use a contextually rich mul-
timodal video corpus containing affective expressions of children playing chess with the
iCat, an expressive robot head, to train a context-sensitive affect recognition system for
a robotic game companion. However, like most existing solutions in HRI and human–
computer interaction, this solution relies on a static interaction setup in which the human
sits directly opposite the robot or computer interface, thus providing an optimal video
perspective for the system. Therefore, in dynamic HRI environments in which both
the robot and human may be moving around freely, such approaches would perform
poorly.
Despite the abundance of studies in the field of affect recognition for social robots,
performance results and their underlying methods cannot easily be compared, as the exper-
imental conditions and the databases and corpora used to train the systems typically differ
(Castellano et al., 2010). Thus, common guidelines are required for the design of affect
sensitive frameworks that can be used in real-world scenarios. In addition, the percep-
tion of spontaneous and more subtle affective cues that differ from basic emotions,
the analysis of multiple modalities of expression, as well as the personalisation and
adaptation over time to changes of the human’s attitude toward the robot remain major
challenges.
As with vocal behaviours, the generation of appropriate nonverbal gaze and facial
behaviours for social robots has been more widely addressed by the HRI community
than their recognition, e.g. by equipping robots with the ability to display emotions
based on facial expressions. Examples include the MIT robots, Kismet and Leonardo
(Thomaz, Berlin, & Breazeal, 2005), the iCub robot developed at IIT (Metta et al.,
2008), ATR’s RoboVie platform (Kanda et al., 2002), and Bielefeld University’s anthro-
pomorphic head, Flobi (Lütkebohle et al., 2010). Mutlu et al. (2012) modelled con-
versational gaze cues for robots based on key conversational gaze mechanisms used
by humans to manage conversational roles and turn-taking effectively. Experimental
evaluation subsequently showed that these social signals effectively help robots signal
different participant roles in conversations by managing speaking turns and that they
further shape how interlocutors perceive the robot and the interaction.

Body Posture and Gesture


Fong et al. (2003) identify the use of gestures as one crucial aspect when design-
ing robots that are intended to engage in meaningful social interactions with humans.
Gestures, as a dynamic interpolation of body postures, convey conceptual information
which distinguishes them from other – arbitrary or functional – motor movements per-
formed by the robot. Given the design of humanoid robots in particular, they are typ-
ically expected to exhibit humanlike communicative behaviors, using their bodies for
nonverbal expression just as humans do. Especially in cases where the robot’s design
allows for only limited or no facial expression at all (e.g. Honda’s ASIMO robot; Honda
Motor Co. Ltd, 2000), the use of body gesture offers a viable alternative to compensate
for the robot’s lack of nonverbal expressive capabilities.
Representing an integral component of human communicative behavior (McNeill,
1992), speech-accompanying hand and arm gestures are ideal candidates for extend-
ing the communicative expressiveness of social robots. Not only are gestures frequently
used by human speakers to express emotional states and to illustrate what they express
in speech (Cassell, McNeill, & McCullough, 1998); more crucially, they help to con-
vey information that speech alone sometimes cannot provide, such as referential, spatial,
or iconic information (Hostetter, 2011). At the same time, human listeners have been
shown to be attentive to information conveyed via such nonverbal behaviours (Goldin-
Meadow, 1999). Therefore, it appears reasonable to equip robots that are intended to
engage in natural and comprehensible HRI with the ability both to recognise and gener-
ate gestural behaviours.
With the advent of the low-cost depth sensor of the Microsoft Kinect and its built-in
skeleton tracker, HRI research focusing on new methods and algorithms for the recog-
nition and continuous tracking of human posture and gesture has been rapidly advanc-
ing (see Suarez & Murphy, 2012, for a review). Since behavioural studies have shown
that emotions and other affective dimensions can be communicated by means of human
body postures (Coulson, 2004), gestures (Pollick et al., 2001), and movements (Crane &
Gross, 2007), these nonverbal behaviours represent valuable communication channels
for social signal processing. However, in the field of HRI, work on human gesture or
posture recognition has been mostly dedicated to less subtle aspects of communication,
for example, to processing spatial information such as pointing gestures in direction-
giving scenarios (e.g. Droeschel, Stuckler, & Behnke, 2011).
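As an example of the kind of geometric reasoning involved, the sketch below resolves a pointing gesture by casting a ray from the elbow joint through the hand joint and intersecting it with the ground plane. It assumes the two joints are already available as 3D coordinates in a common, y-up metric frame (as a Kinect-style skeleton tracker would provide) and is meant only to illustrate the geometry, not any particular tracker's API.

import numpy as np

def pointing_target_on_floor(elbow, hand, floor_height=0.0):
    """Intersect the elbow->hand ray with the horizontal plane y = floor_height.

    elbow, hand: 3D joint positions (x, y, z) with y pointing up, e.g. from a
    skeleton tracker. Returns the pointed-at floor location, or None if the arm
    is not pointing downwards.
    """
    elbow, hand = np.asarray(elbow, float), np.asarray(hand, float)
    direction = hand - elbow
    if direction[1] >= -1e-6:          # arm level with or above the horizontal
        return None
    t = (floor_height - hand[1]) / direction[1]
    return hand + t * direction        # 3D point on the floor plane

# Example: arm pointing forwards and down from roughly shoulder height.
print(pointing_target_on_floor(elbow=(0.0, 1.3, 0.0), hand=(0.2, 1.2, 0.3)))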
In contrast, much recent work has addressed the generation of communicative robot
gestures and the evaluation of the social cues conveyed by these nonverbal behaviours
in HRI. For example, Koay et al. (2013) deployed the Sunflower robot (see Figure 23.2),
which is specifically designed to communicate solely based on nonverbal cues, in a
study to evaluate the effect of bodily communication signals inspired by how hearing
dogs interact with their owners and communicate intent. Their results suggest that even
untrained humans can correctly interpret the robot's intentions based on such nonverbal
behaviours.

Figure 23.2 Sunflower Robot. Image used with the permission of the University of Hertfordshire,
Adaptive Systems Research Group.

Kim, Kwak, and Kim (2008) controlled the size, speed and frequency of a robot’s
gestures to express different types of personalities while measuring human perception of
the robot in an experimental study. They found that personality can indeed be expressed
by means of gestural behaviour cues and that the control of such gesture design factors
can actually affect and shape the impression humans get of the robot.
Salem et al. (2013) showed that incongruent gestural behaviours, i.e. those that do
not semantically match accompanying speech, performed by a humanoid robot affect
human perception of the robot’s likability and perceived anthropomorphism. The find-
ings of their experiment suggest that the social signals conveyed by the robot’s incongru-
ent behaviours increase humans’ attribution of intentionality to the robot and therefore
make it appear even more humanlike to them than when it only displays congruent co-
verbal gestures.

Proxemics
A crucial aspect for the design of robots that are to interact with humans socially is prox-
emics, i.e. the dynamic process of interpersonal physical and psychological distancing
in social encounters (Hall, 1995). Humans use mostly subtle proxemic cues that follow
specific societal and cultural norms, such as physical distance, stance, gaze, or body
orientation, to communicate implicit messages, e.g. about the individual’s availability
for or interest in social interaction with another person (Deutsch, 1977). Depending on
a number of factors such as interpersonal liking, gender, age and ethnic group (Bax-
ter, 1970), people may choose a mutual distance within one of four broadly categorised
zones: intimate, casual-personal, socio-consultive, and public zone (Hall, 1995).
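A deliberately crude way to operationalise these zones computationally is to threshold the measured interpersonal distance, as in the sketch below. The boundaries are approximate values commonly attributed to Hall for Western adults and would need to be adapted to culture, context, and the factors listed above.

# Approximate Hall-style proxemic zones (metres); boundaries are indicative only.
PROXEMIC_ZONES = [
    (0.45, "intimate"),
    (1.20, "casual-personal"),
    (3.60, "socio-consultive"),
    (float("inf"), "public"),
]

def proxemic_zone(distance_m):
    """Map an interpersonal distance in metres to a coarse proxemic zone."""
    for upper_bound, zone in PROXEMIC_ZONES:
        if distance_m <= upper_bound:
            return zone

print(proxemic_zone(0.8))   # -> "casual-personal"

A robot could use such a mapping, together with the distancing behaviours discussed next, to decide how closely it should approach a person in a given context.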
Robots that do not exhibit appropriate distancing behaviours may be perceived as
threatening or as less acceptable by their human users and social environments. There-
fore, a substantial body of HRI research is dedicated to establishing which ‘distance
zone’ a robot belongs to – and which factors (e.g. robot size or appearance) influence
this categorisation.
For example, Walters et al. (2009) proposed an empirical framework which shows
how the measurement and control of interpersonal distances between a human and
a robot can be employed by the robot to interpret, predict, and manipulate proxemic
behaviours for HRI. Their human–robot proxemic framework allows for the incorpora-
tion of interfactor effects and can be extended based on new empirical results.
In another technical approach, Mead, Atrash, and Matarić (2013) present a system
that builds on metrics used in the social sciences to automate the analysis of human
proxemics behaviour in HRI. Specifically, they extract features based on individual,
physical, and psychophysical factors to recognise spatiotemporal behaviours that signal
the initiation and termination of a social interaction.
Mumm and Mutlu (2011) conducted a study in which they manipulated a robot’s
likeability and gaze behavior (mutual vs averted gaze), showing that human participants
who disliked the robot compensated for an increase in the robot’s gaze by increasing
their physical distance from the robot; in contrast, participants who liked the robot did
not differ in their distancing from the robot across different gaze conditions. Their study
results on psychological distancing further suggest that, when asked to disclose personal
information to the robot, individuals who disliked the robot were less willing to share
information with the robot than those who liked it.
These and other empirical findings regarding the issue of human–robot proxemics
suggest that appropriate proxemic behaviours for robots, and the social signals con-
veyed in the process, may have a facilitating effect on human–robot interaction. As this
specific subdomain in HRI research is still young, however, more empirical studies need
to substantiate these observations in the future.

Main Challenges for Future Research

Despite the advances made in social signal processing in the field of social robotics
in recent years, many challenges remain to be tackled in the future. Importantly, the
social cues and behaviours described in the previous section should not be viewed and
addressed in isolation as they typically occur in combination with each other in natu-
ral human communication. Most research in HRI currently focuses on the detection or
generation of a single modality only, with very few approaches (e.g. Lee et al., 2012)
presently trying to fuse more than one communication channel – but certainly not all
relevant modalities. However, providing and taking into consideration multiple modal-
ities can help to resolve the ambiguity that is typical of unimodal communication and
thereby increase the robustness of communication. Therefore, future work in the field of
HRI will need to address the challenges of sensor fusion more extensively.
Another challenge is precision: in a robot-assisted therapy context, for example, the
robot, if used autonomously, needs to perceive and judge the patient’s behaviour as
reliably as an experienced human therapist could. Moreover, the robot needs to be
able to adapt to changes in the social and non-social environment (François, Dauten-
hahn, & Polani, 2009). Changes in the social environment include changes in people’s
behaviours, preferences, lifestyles, or changes in how they behave due to aging, illness
etc. Detecting these changes accurately poses a major challenge and recent research has
been focusing on activity recognition within such contexts (e.g. Duque et al., 2013).
One of the drawbacks of social signal processing research in HRI is posed by the
problem of generalisation of results: since the design space and thus the plenitude of
appearances, technical abilities, and behaviours of social robots is so vast, it is very
difficult to transfer solutions from one technical system to another, from one application
domain to another (e.g. medical vs entertainment), or to compare findings from different
studies. For example, different robot embodiments elicit different user expectations and,
as a result, social cues generated by one robot may be unsuitable for another. In fact,
although robot appearance plays an important role with regard to social acceptance,
researchers (e.g. Mori, 1970; Walters et al., 2008) have argued that it is more important
that the robot's appearance be consistent with its behaviour. Future work should
therefore aim to establish common guidelines for the design of social signal processing
frameworks that can be used in a variety of real-world HRI scenarios.
Finally, while it may be important and useful for a robot to measure and keep track of
the user’s engagement level during interaction, e.g. by means of analysing social cues
such as human emotions and adapting its own behaviour accordingly, it is advisable
not to lose sight of the question of how much ‘sociality’ a robot should be equipped
with. For example, one may question whether there really is a need for service or assis-
tive robots to detect and comment on their users’ emotional states, e.g. by saying “you
look sad today”. Ultimately, such philosophical and ethical questions will challenge
social roboticists when deciding what is necessary and what is sufficient for the appli-
cation area at hand, and to what extent even social robots should be intended as tools as
opposed to social companions.

References

Baxter, J. (1970). Interpersonal spacing in natural settings. Sociometry, 33(4), 444–456.
Bonabeau, E., Dorigo, M., & Theraulaz, G. (1999). Swarm Intelligence: From Natural to Artificial
Systems. New York: Oxford University Press.
Breazeal, C. (2002). Designing Sociable Robots. Cambridge, MA: MIT Press.
Breazeal, C. (2003). Toward sociable robots. Robotics and Autonomous Systems, 42(3–4), 167–
175.
Cassell, J., McNeill, D., & McCullough, K.-E. (1998). Speech-gesture mismatches: Evidence
for one underlying representation of linguistic and nonlinguistic information. Pragmatics &
Cognition, 6(2), 1–34.
Castellano, G., Leite, I., Pereira, A., et al. (2010). Affect recognition for interactive companions:
Challenges and design in real world scenarios. Journal on Multimodal User Interfaces, 3(1–2),
89–98.
Castellano, G., Leite, I., Pereira, A., et al. (2013). Multimodal affect modeling and recognition for
empathic robot companions. International Journal of Humanoid Robotics, 10(1).
Chidambaram, V., Chiang, Y.-H., & Mutlu, B. (2012). Designing persuasive robots: How robots
might persuade people using vocal and nonverbal cues. In Proceedings of 7th ACM/IEEE Inter-
national Conference on Human–Robot Interaction (HRI) (pp. 293–300), Boston, MA.
Coulson, M. (2004). Attributing emotion to static body postures: Recognition accuracy, confu-
sions, and viewpoint dependence. Journal of Nonverbal Behavior, 28(2), 117–139.
Crane, E. & Gross, M. (2007). Motion capture and emotion: Affect detection in whole body
movement. In A. Paiva, R. Prada, & R. W. Picard (Eds), Affective Computing and Intelligent
Interaction (pp. 95–101). Berlin: Springer.
Dautenhahn, K. (2007). Socially intelligent robots: Dimensions of human–robot interaction.
Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1480), 679–704.
Deutsch, R. D. (1977). Spatial Structurings in Everyday Face-to-face Behavior. Orangeburg, NY:
Association for the Study of Man–Environment Relations.
Droeschel, D., Stuckler, J., & Behnke, S. (2011). Learning to interpret pointing gestures with
a time-of-flight camera. In Proceedings of the 6th ACM/IEEE International Conference on
Human–Robot Interaction (HRI) (pp. 481–488), Lausanne, Switzerland.
Duffy, B. R. (2003). Anthropomorphism and the social robot. Robotics and Autonomous Systems,
42(3–4), 177–190.
Duque, I., Dautenhahn, K., Koay, K. L., Willcock, L., & Christianson, B. (2013). A different
approach of using personas in human–robot interaction: Integrating personas as computational
models to modify robot companions’ behaviour. In Proceedings of IEEE International Sympo-
sium on Robot and Human Interactive Communication (RO-MAN) (pp. 424–429), Gyeongju,
South Korea.
El Kaliouby, R. & Robinson, P. (2005). Generalization of a vision-based computational model
of mind-reading. In J. Tao, T. Tan, & R. Picard (Eds), Affective Computing and Intelligent
Interaction (vol. 3784, pp. 582–589). Berlin: Springer.
Epley, N., Waytz, A., & Cacioppo, J. (2007). On seeing human: A three-factor theory of anthro-
pomorphism. Psychological Review, 114(4), 864–886.
Eyssel, F., Kuchenbrandt, D., Hegel, F., & De Ruiter, L. (2012). Activating elicited agent knowl-
edge: How robot and user features shape the perception of social robots. In Proceedings of
IEEE International Symposium on Robot and Human Interactive Communication (pp. 851–
857), Paris.
Fong, T., Nourbakhsh, I. R., & Dautenhahn, K. (2003). A survey of socially interactive robots.
Robotics and Autonomous Systems, 42(3–4), 143–166.
François, D., Dautenhahn, K., & Polani, D. (2009). Using real-time recognition of human–robot
interaction styles for creating adaptive robot behaviour in robot-assisted play. In Proceedings
of 2nd IEEE Symposium on Artificial Life (pp. 45–52), Nashville, TN.
Goetz, J., Kiesler, S., & Powers, A. (2003). Matching robot appearance and behavior to tasks to
improve human–robot cooperation. In Proceedings of the 12th IEEE International Symposium
on Robot and Human Interactive Communication (pp. 55–60), Millbrae, CA.
Goldin-Meadow, S. (1999). The role of gesture in communication and thinking. Trends in Cogni-
tive Science, 3, 419–429.
Goodrich, M. A. & Schultz, A. C. (2007). Human–robot interaction: A survey. Foundation and
Trends in Human–Computer Interaction, 1(3), 203–275.
Hall, E. (1995). Handbook for proxemic research. Anthropology News, 36(2), 40.
Honda Motor Co. Ltd (2000). The Honda Humanoid Robot Asimo, year 2000 model. http://world
.honda.com/ASIMO/technology/2000/.
Hostetter, A. B. (2011). When do gestures communicate? A meta-analysis. Psychological Bul-
letin, 137(2), 297–315.
Kanda, T., Ishiguro, H., Ono, T., Imai, M., & Nakatsu, R. (2002). Development and evaluation of
an interactive humanoid robot “Robovie.” In Proceedings IEEE International Conference on
Robotics and Automation (pp. 1848–1855), Washington, DC.
Kernbach, S. (2013). Handbook of Collective Robotics – Fundamentals and Challenges. Boca
Raton, FL: Pan Stanford.
Kim, H., Kwak, S., & Kim, M. (2008). Personality design of sociable robots by control of gesture
design factors. In Proceedings of the 17th IEEE International Symposium on Robot and Human
Interactive Communication (pp. 494–499), Munich.
Koay, K. L., Lakatos, G., Syrdal, D. S., et al. (2013). Hey! There is someone at your door. A
hearing robot using visual communication signals of hearing dogs to communicate intent. In
Proceeding of the 2013 IEEE Symposium on Artificial Life (pp. 90–97).
Kolb, D. (1984). Experiential Learning: Experience as the Source of Learning and Development.
Englewood Cliffs, NJ: Prentice Hall.
Kube, C. R. (1993). Collective robotics: From social insects to robots. Adaptive Behavior, 2(2),
189–218.
Lang, C., Wachsmuth, S., Hanheide, M., & Wersing, H. (2012). Facial communicative signals.
International Journal of Social Robotics, 4(3), 249–262.
Lee, J., Chao, C., Bobick, A., & Thomaz, A. (2012). Multi-cue contingency detection. Interna-
tional Journal of Social Robotics, 4(2), 147–161.
Lee, S.-I., Kiesler, S., Lau, Y.-m., & Chiu, C.-Y. (2005). Human mental models of humanoid
robots. In Proceedings of 2005 IEEE International Conference on Robotics and Automation
(pp. 2767–2772).
Lütkebohle, I., Hegel, F., Schulz, S., et al. (2010). The Bielefeld anthropomorphic robot head
“Flobi.” In Proceedings of the IEEE International Conference on Robotics and Automation
(pp. 3384–3391), Anchorage, AK.
McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Chicago: University
of Chicago Press.
Mead, R., Atrash, A., & Matarić, M. (2013). Automated proxemic feature extraction and behavior
recognition: Applications in human–robot interaction. International Journal of Social Robotics,
5(3), 367–378.
Metta, G., Sandini, G., Vernon, D., Natale, L., & Nori, F. (2008). The iCub humanoid robot: An
open platform for research in embodied cognition. In Proceedings of the 8th workshop on
Performance Metrics for Intelligent Systems (pp. 50–56).
Mori, M. (1970). The uncanny valley (trans., K. F. MacDorman & T. Minato). Energy, 7(4), 33–
35.
Mumm, J. & Mutlu, B. (2011). Human–robot proxemics: Physical and psychological distancing in
human–robot interaction. In Proceedings of the 6th International Conference on Human–Robot
Interaction (pp. 331–338), Lausanne, Switzerland.
Mutlu, B., Kanda, T., Forlizzi, J., Hodgins, J., & Ishiguro, H. (2012). Conversational gaze mecha-
nisms for humanlike robots. ACM Transactions on Interactive Intelligent Systems (TiiS), 1(2).
Pantic, M. & Patras, I. (2006). Dynamics of facial expression: Recognition of facial actions and
their temporal segments from face profile image sequences. IEEE Transactions on Systems,
Man, and Cybernetics, Part B: Cybernetics, 36(2), 433–449.
Pollick, F., Paterson, H., Bruderlin, A., & Sanford, A. (2001). Perceiving affect from arm move-
ment. Cognition, 82(2), 51–61.
Salem, M., Eyssel, F., Rohlfing, K., Kopp, S., & Joublin, F. (2013). To err is human(-like): Effects
of robot gesture on perceived anthropomorphism and likability. International Journal of Social
Robotics, 5(3), 313–323.
Suarez, J. & Murphy, R. R. (2012). Hand gesture recognition with depth images: A review. In
Proceedings of IEEE International Workshop on Robot and Human Interactive Communication
(pp. 411–417), Paris.
Thomaz, A. L., Berlin, M., & Breazeal, C. (2005). An embodied computational model of social
referencing. In Proceedings of IEEE International Workshop on Robot and Human Interactive
Communication (pp. 591–598).
Vinciarelli, A., Pantic, M., & Bourlard, H. (2008). Social signal processing: Survey of an emerg-
ing domain. Image and Vision Computing, 27, 1743–1759.
Walters, M. L., Dautenhahn, K., Te Boekhorst, R., et al. (2009). An empirical framework for
human–robot proxemics. Proceedings of New Frontiers in Human–Robot Interaction (pp. 144–
149).
Walters, M. L., Syrdal, D. S., Dautenhahn, K., Te Boekhorst, R., & Koay, K. L. (2008). Avoiding
the uncanny valley: Robot appearance, personality and consistency of behavior in an attention-
seeking home scenario for a robot companion. Autonomous Robots, 24(2), 159–178.
Yeasin, M., Bullot, B., & Sharma, R. (2006). Recognition of facial expressions and measurement
of levels of interest from video. IEEE Transactions on Multimedia, 8(3), 500–508.
Zeng, Z., Pantic, M., Roisman, G., & Huang, T. (2009). A survey of affect recognition meth-
ods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 31(1), 39–58.
Part IV
Applications of Social Signal
Processing
24 Social Signal Processing for
Surveillance
Dong Seon Cheng and Marco Cristani

Automated surveillance of human activities has traditionally been a computer vision
field interested in the recognition of motion patterns and in the production of high-level
descriptions for actions and interactions among entities of interest (Cedras & Shah,
1995; Aggarwal & Cai, 1999; Gavrila, 1999; Moeslund, Hilton, & Krüger, 2006; Bux-
ton, 2003; Hu et al., 2004; Turaga et al., 2008; Dee & Velastin, 2008; Aggarwal & Ryoo,
2011; Borges, Conci, & Cavallaro, 2013). The study on human activities has been revi-
talized in the last five years by addressing the so-called social signals (Pentland, 2007).
In fact, these nonverbal cues inspired by the social, affective, and psychological litera-
ture (Vinciarelli, Pantic, & Bourlard, 2009) have allowed a more principled understand-
ing of how humans act and react to other people and to their environment.
Social Signal Processing (SSP) is the scientific field devoted to the systematic, algorith-
mic, and computational analysis of social signals, drawing significant concepts from
anthropology and social psychology (Vinciarelli et al., 2009). In particular, SSP does
not stop at just modeling human activities, but aims at coding and decoding human
behavior. In other words, it focuses on unveiling the underlying hidden states that drive
one to act in a distinct way with particular actions. This challenge is supported by
decades of investigation in human sciences (psychology, anthropology, sociology, etc.)
that showed how humans use nonverbal behavioral cues, like facial expressions, vocal-
izations (laughter, fillers, back-channel, etc.), gestures, or postures to convey, often out-
side conscious awareness, their attitude toward other people and social environments,
as well as emotions (Richmond & McCroskey, 1995). The understanding of these cues
is thus paramount in order to understand the social meaning of human activities.
The formal marriage of automated video surveillance with Social Signal Pro-
cessing had its programmatic start during SISM 2010 (the International Workshop
on Socially Intelligent Surveillance and Monitoring; http://profs.sci.univr.it/~cristanm/
SISM2010/), associated with the IEEE Computer Vision and Pattern Recognition con-
ference. At that venue, the discussion was focused on what kind of social signals can
be captured in a generic surveillance scenario, detailing the specific scenarios where the
modeling of social aspects could be the most beneficial.
After 2010, hybridizations of SSP with surveillance applications have grown rapidly in
number, and systematic essays on the topic have started to appear in the computer vision
literature (Cristani et al., 2013).
In this chapter, after giving a short overview of those surveillance approaches which
adopt SSP methodologies, we examine a recent application where the connection
between the two worlds promises to give intriguing results, namely the modeling of
interactions via instant messaging platforms. Here, the environment to be monitored is
not “real” anymore: instead, we move into another realm, that of the social web. On
instant messaging platforms, one of the most important challenges is the identification
of the people involved in conversations. This task has gained importance in the wake of the
penetration of social media into everyday life, together with the possibility of interacting with persons
hiding their identity behind nicknames or potentially fake profiles.
Under scenarios like these, classification approaches (from the classical surveillance
literature) can be improved with social signals by importing behavioral cues from con-
versation analysis. In practice, sets of features are designed to encode effectively how
a person converses: since chats are crossbreeds of written text and face-to-face verbal
communication, the features inherit equally from textual authorship attribution and con-
versational analysis of speech. Importantly, the cues completely ignore the semantics of
the chat, relying solely on the nonverbal aspects typical of SSP, which helps to address possible
privacy and ethical issues: users' identities can be safeguarded within this modeling.
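To make the idea concrete, the sketch below computes a handful of purely nonverbal, content-free features from a user's sequence of chat turns (message-length statistics, emoticon use, inter-message timing). Both the turn format and the feature set are illustrative assumptions rather than the exact feature vector used in the literature, but they respect the constraint of never inspecting what is actually being said.

import re
import statistics

EMOTICON_RE = re.compile(r"[:;=8][-^']?[)(DPpOo3]")

def chat_style_features(turns):
    """Content-free stylometric features from one user's chat turns.

    turns: list of (timestamp_seconds, text) pairs in chronological order.
    Only surface statistics are used; message semantics are never inspected.
    """
    lengths = [len(text) for _, text in turns]
    gaps = [t2 - t1 for (t1, _), (t2, _) in zip(turns, turns[1:])]
    total_chars = max(1, sum(lengths))
    return {
        "mean_msg_length": statistics.mean(lengths),
        "msg_length_std": statistics.pstdev(lengths),
        "emoticons_per_msg": sum(len(EMOTICON_RE.findall(t)) for _, t in turns) / len(turns),
        "uppercase_ratio": sum(c.isupper() for _, t in turns for c in t) / total_chars,
        "mean_inter_message_gap": statistics.mean(gaps) if gaps else 0.0,
    }

turns = [(0.0, "hey :)"), (4.2, "how are you??"), (9.8, "ok, talk later :-D")]
print(chat_style_features(turns))

Such a feature vector could then be fed to any standard classifier to compare a claimed identity against previously observed conversational styles.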
Finally, some concluding considerations summarize what has been achieved so far
in surveillance from a social and psychological perspective. Future perspectives are
then given, identifying how and where social signals and surveillance methods could
be combined most effectively.

State of the Art

At the heart of social signal processing is the study of social signals, viewable as tempo-
ral co-occurrences of social or behavioral cues (Ambady & Rosenthal, 1992), i.e., sets of
temporally sequenced changes in neuromuscular, neurocognitive, and neurophysiologi-
cal activity. Vinciarelli et al. (2009) have organized behavioral cues into five categories
that are heterogeneous, multimodal aspects of a social interplay: 1) physical appear-
ance, 2) gesture and posture, 3) face and gaze behavior, 4) vocal behavior and 5) space
and environment. In practice, the first category has little relevance in surveillance while
the other four are consistently utilized in surveillance approaches.

Gesture and Posture


Monitoring gestures in a social signaling-driven surveillance sense is hard: the goal is
not only to capture intentional gestures chosen voluntarily to communicate something,
but also to capture unintentional movements, such as subtle and/or rapid oscilla-
tions of the limbs, shoulder shrugging, casual touching of the nose/ear, hair twisting,
and self-protection gestures like closing the arms. This is a worthwhile effort as the
essential nature of gestures in interactive scenarios is confirmed by the extreme rarity
of “gestural errors,” i.e., gestures portray with high fidelity the speaker’s communica-
tive intention (Cassell, 1998). Analyzing fine gesturing activity for the extraction of
social signals seems to be a growing trend as witnessed by a recent workshop associ-
ated with the International Conference on Computer Vision (ICCV), that is, the IEEE
Workshop on Decoding Subtle Cues from Social Interactions (http://www.cbi.gatech
.edu/CBSICCV2013/).
Oikonomopoulos, Patras, & Pantic (2011) introduce a framework that represents the
“shape” of an activity through the localization of ensembles of spatiotemporal features.
Originally tested on standard action benchmarks (KTH, Hollywood human actions), it
was later employed to detect hand-raising in image sequences of political debates.
An orthogonal perspective extracts gestures through structured models, such as pic-
torial structures (Andriluka, Roth, & Schiele, 2009) or flexible mixtures of parts (Yang
& Ramanan, 2011). However, these models do not appear capable of capturing fine sig-
nals like the ones described above and their usage is limited in surveillance and SSP.
Nonetheless, this trend could drastically change by considering additional modalities
beyond visual information: for example, exploiting depth from RGBD sensors
like the Kinect promises to be very effective in capturing the human shape (Popa et al.,
2012).
In Cristani, Pesarin et al. (2011), gesturing is used to infer who is talking when in a
surveillance scenario, performing a simple form of diarization (detection of who speaks
when; see Hung et al., 2008) through statistical analysis.
Posture is an aspect of human behavior which is unconsciously regulated
and can thus be considered the most reliable nonverbal social cue. In general, pos-
ture conveys social signals in three different ways (Vinciarelli et al., 2009): inclusive
versus noninclusive, face-to-face versus parallel body orientation, and congruent versus
noncongruent. These cues may help to distinguish extrovert and introvert individuals,
suggesting a way to detect threatening behaviors. Only a few, very recent surveillance
approaches deal with posture information (Robertson & Reid, 2011; Cristani, Bazzani
et al., 2011; Hung & Kröse, 2011) and they will be described in more detail below, since
they exploit cues mostly coming from the other behavioral categories.
An application of surveillance is the monitoring of abnormal behaviors in patients
within domestic scenarios. Rajagopalan, Dhall, and Goecke (2013) released a new pub-
lic dataset of videos with children exhibiting self-stimulatory behaviours1, commonly
used for autism diagnosis. In the dataset, three kinds of “in the wild” behaviors are ana-
lyzed (that is, no laboratory environments are taken into account): arm flapping, head
banging, and spinning. Classification tests over these three classes have been performed
using standard video descriptors (e.g., spatiotemporal interest points in Laptev, 2005).
This highlights another shortcoming of social signal-based surveillance, namely that few
video datasets in which social signals are accurately labelled are currently available.

Face and Gaze Behavior


In surveillance, capturing fine visual cues from faces is quite challenging because of
two factors: most of the scenarios are non-collaborative in nature (people do not inten-
tionally look toward the sensors) and the faces are usually captured in low resolution.

1 Self-stimulatory behaviours refer to stereotyped, repetitive movements of body parts or objects.


As for gaze orientation, since objects are foveated for visual acuity, gaze direction
generally provides precise information regarding the spatial localization of one’s atten-
tional focus (Ba & Odobez, 2006), also called visual focus of attention (VFOA). How-
ever, given that the reasons above make measuring the VFOA by using eye gaze often
difficult or impossible in standard surveillance scenarios, the viewing direction can be
reasonably approximated by detecting the head pose (Stiefelhagen et al., 1999; Stiefel-
hagen, Yang, & Waibel, 2002).
In such a scenario, there are two kinds of approaches: the ones that exploit temporal
information (tracking approaches) and those that rely on a single frame for cue
extraction (classification approaches). In the former, it is customary to exploit the nat-
ural temporal smoothness (subsequent frames cannot portray head poses dramatically
different) and the influence of the human motion on the head pose (while running for-
ward we are not looking backward), usually arising from structural constraints between
the body pose and the head pose. All these elements are elegantly joined in a single
filtering framework in Chen and Odobez (2012). In classification approaches, there are
many works describing which features to use. One promising approach uses covariance
matrices of image features as descriptors to perform head and body orientation classification
and head direction regression (Tosato et al., 2013).
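To give an idea of what such a representation looks like, the sketch below computes a generic region covariance descriptor for a grayscale head crop, stacking per-pixel position, intensity, and gradient features and taking their covariance. The particular feature set is a common textbook choice assumed here for illustration and is not necessarily the one used in the cited work.

import numpy as np

def region_covariance(patch):
    """Covariance descriptor of a grayscale image patch (H x W array in [0, 1]).

    Per-pixel features: x, y, intensity, |dI/dx|, |dI/dy|.
    Returns a 5 x 5 covariance matrix describing the patch.
    """
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gy, gx = np.gradient(patch.astype(float))
    features = np.stack(
        [xs.ravel(), ys.ravel(), patch.ravel(), np.abs(gx).ravel(), np.abs(gy).ravel()]
    )
    return np.cov(features)

# Example with a random "head crop"; a classifier would compare such matrices
# (e.g. with a Riemannian metric) to predict head or body orientation.
descriptor = region_covariance(np.random.rand(32, 32))
print(descriptor.shape)   # -> (5, 5)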
Regarding the use of these cues in surveillance, Smith et al. (2008) estimate the pan and
tilt parameters of the head and, subsequently, represent the VFOA as a vector normal
to the person's face, with the goal of understanding whether a person is looking at an
advertisement located on a vertical glass surface. A similar analysis was performed in Liu et al.
(2007), where an active appearance model captures the face and pose of a person in order
to discover which portion of a mall shelf is being observed.
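The underlying geometric test can be illustrated in a few lines: approximate the gaze by a ray along the estimated head orientation and check whether that ray hits the target surface. The sketch below assumes the head position and pan/tilt angles are already available in a common metric frame and treats the advertisement as an axis-aligned vertical rectangle; it is only meant to convey the geometry, not the cited systems' actual pipelines.

import numpy as np

def looks_at_poster(head_pos, pan, tilt, poster_z, x_range, y_range):
    """True if the head-pose ray intersects a vertical rectangle at z = poster_z.

    head_pos: (x, y, z) head centre; pan/tilt in radians (0, 0 = facing +z).
    x_range, y_range: horizontal and vertical extent of the poster.
    """
    head = np.asarray(head_pos, float)
    direction = np.array([
        np.sin(pan) * np.cos(tilt),   # x
        np.sin(tilt),                 # y (up)
        np.cos(pan) * np.cos(tilt),   # z (towards the poster)
    ])
    if abs(direction[2]) < 1e-9:
        return False
    t = (poster_z - head[2]) / direction[2]
    if t <= 0:                        # the poster is behind the person
        return False
    hit = head + t * direction
    return x_range[0] <= hit[0] <= x_range[1] and y_range[0] <= hit[1] <= y_range[1]

# Example: person 2 m in front of a 1 m x 1 m poster, looking straight ahead.
print(looks_at_poster((0.0, 1.6, -2.0), pan=0.0, tilt=0.0,
                      poster_z=0.0, x_range=(-0.5, 0.5), y_range=(1.2, 2.2)))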
According to biological evidence (Panero & Zelnik, 1979), the VFOA can be
described as a 3D polyhedron delimiting the portion of the scene at which a subject
is looking. This is very informative in a general, unrestricted scenario where people can
enter, leave, and move freely. The idea of moving from objective (surveillance cameras)
toward subjective individual points of view offers a radical change of perspective for
behavior analysis. Detecting where a person is directing his gaze allows us to build a
set of high-level inferences, and this is the subject of studies in the field of egocentric
vision (Li, Fathi, & Rehg, 2013) where people wear ad hoc sensors for recording daily
activities. Unfortunately, this is not our case, as we are in a non-collaborative scenario.
Related to video surveillance, Benfold and Reid (2009) infer which parts of the scene are seen more frequently by people, thus creating interest maps, which they use to identify individuals who focus on particular portions of their environment for a long time: a threatening behavior can be inferred if the observed target is critical (e.g., an ATM).
Moreover, a "subjective" perspective has been proposed in Bazzani et al. (2011), where group interactions are discovered by estimating the VFOA using a head orientation detector and by employing proxemic cues, under the assumption that nearby people whose VFOAs intersect are also interacting.
Similarly, in Robertson and Reid (2011), a set of two- and one-person activities is formed by sequences of actions and then modeled by HMMs whose parameters are manually set.

Vocal Behavior
The vocal behavior class comprises all the spoken cues that define the verbal message and influence its actual meaning. This class includes five major components (Vinciarelli et al., 2009): prosody, which can communicate competence; linguistic vocalizations, which can communicate hesitation; non-linguistic vocalizations, which can reveal strong emotional states or tight social bonds; silence, which can express hesitation; and turn-taking patterns, which are the most investigated cues in this category since they appear the most reliable for recognizing people's personalities (Pianesi et al., 2008), predicting the outcome of negotiations (Curhan & Pentland, 2007), recognizing the roles interaction participants play (Salamin, Favre, & Vinciarelli, 2009), or modeling the type of interaction (e.g., conflict).
In surveillance, monitoring of vocal behavior cues is essentially absent, because it is difficult to capture audio in large areas and, most importantly, because it is usually forbidden for privacy reasons. Another issue is that audio processing is usually associated with speech recognition, whereas in SSP the content of a conversation is ignored.
An interesting topic for surveillance is the modeling of conflicts, as they may degenerate into threatening events. Conflicts have been studied extensively in a wide spectrum of disciplines, including Sociology (see Oberschall, 1978 for social conflicts) and Social Psychology (see Tajfel, 1982 for intergroup conflicts).
A viable approach in a surveillance context is that of Pesarin et al. (2012), who propose a semi-automatic generative model for the detection of conflicts in conversations. Their approach is based on the fact that, during conflictual conversations, overlapping speech becomes both longer and more frequent (Schegloff, 2000), a consequence of the competition for holding the floor and preventing others from speaking.
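As an illustration of how such a cue could be quantified, the sketch below computes simple overlapping-speech statistics from diarized speaker turns. It only implements the overlap measurement suggested by Schegloff's observation, not the semi-automatic generative model of Pesarin et al. (2012), and the tuple-based turn format is an assumption.

```python
from itertools import combinations

def overlap_stats(turns):
    """Simple overlap statistics from diarized speaker turns.

    `turns` is a list of (speaker_id, start_sec, end_sec) tuples.
    Returns the total overlapped time and the number of overlap events,
    two quantities that, according to Schegloff (2000), grow during
    conflictual conversations.
    """
    total_overlap, n_events = 0.0, 0
    for (sa, a0, a1), (sb, b0, b1) in combinations(turns, 2):
        if sa == sb:
            continue
        ov = min(a1, b1) - max(a0, b0)
        if ov > 0:
            total_overlap += ov
            n_events += 1
    return total_overlap, n_events

# Example: three turns by two speakers, with one short overlap.
turns = [("A", 0.0, 4.2), ("B", 3.8, 7.0), ("A", 7.5, 9.0)]
print(overlap_stats(turns))   # approximately (0.4, 1)
```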
In summary, vocal behavior appears to be a very expressive category of social cues
that should be exploited in the surveillance realm since it can be handled in a manner
respectful of privacy.

Space and Environment


The study of space and environment cues is tightly connected with the concept of proxemics, which can be defined as "the study of man's transactions as he perceives and uses intimate, personal, social and public space in various settings," quoting Hall (1966), the anthropologist who first introduced the term in 1966. In other words, proxemics investigates how people use and organize the space they share with others to communicate. This typically happens outside conscious awareness and conveys socially relevant information such as personality traits (e.g., dominant people tend to use more space than others in shared environments; Lott & Sommer, 1967) and attitudes (e.g., people in discussion tend to sit in front of one another, whereas people who collaborate tend to sit side by side; Russo, 1967). From a social point of view, two aspects of proxemic behavior appear to be particularly important, namely interpersonal distances and the spatial arrangement of interactants.
Interpersonal distances have been the subject of the earliest investigations on proxemics, and one of the main and seminal findings is that people tend to organize the space around them in terms of four concentric zones with decreasing degrees of intimacy: the Intimate Zone, the Casual-Personal Zone, the Socio-Consultive Zone, and the Public Zone. The more intimate a relationship, the less space there is among interactants.

Figure 24.1 F-formations: (a) in orange, graphical depiction of the most important part of an F-formation – the o-space; (b) a poster session in a conference, where different group formations are visible; (c) circular F-formation; (d) a typical surveillance setting, where the camera is located 2–2.5 meters above the floor (detecting groups here is challenging); (e) components of an F-formation: o-space, p-space, r-space – in this case, a face-to-face F-formation is sketched; (f) L-shape F-formation; (g) side-by-side F-formation; and (h) circular F-formation.
One of the first attempts to model proxemics in a potential monitoring scenario was presented in Groh et al. (2010), where nine subjects were left free to move in a 3m × 3m area for 30 minutes. The subjects had to speak to each other about specific themes, and an analysis of mutual distances in terms of the above zones made it possible to discriminate between people who did interact and people who did not.
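A minimal sketch of this kind of analysis is given below: it maps an interpersonal distance to one of Hall's zones and flags a pair as likely interacting when their mutual distance stays out of the public zone for most of the observation. The zone boundaries are the values commonly quoted in the proxemics literature, and the 70% threshold is an arbitrary illustrative choice, not the setting used by Groh et al. (2010).

```python
def hall_zone(distance_m):
    """Map an interpersonal distance (meters) to one of Hall's zones.

    The boundaries below are the values commonly quoted in the
    proxemics literature and should be treated as indicative.
    """
    if distance_m < 0.45:
        return "intimate"
    elif distance_m < 1.2:
        return "casual-personal"
    elif distance_m < 3.6:
        return "socio-consultive"
    return "public"

def likely_interacting(distances, threshold=0.7):
    """Crude interaction test: two tracked people are candidate
    interactants if their mutual distance stays out of the public zone
    for at least `threshold` of the observed frames."""
    close = [d for d in distances if hall_zone(d) != "public"]
    return len(close) / len(distances) >= threshold
```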
Also crucial is the spatial arrangement during social interactions. It addresses two main issues: the first is to give all people involved the possibility of interacting, the second is to separate the group of interactants from other individuals (if any). One key concept is that of F-formations, the stable patterns that people tend to form during social interactions (including, in particular, standing conversations): "an F-formation arises whenever two or more people sustain a spatial and orientational relationship in which the space between them is one to which they have equal, direct, and exclusive access" (Kendon, 1990). See Figure 24.1(a)–(d) for some examples of F-formations.
F-formations.
The most important part of an F-formation is the o-space (see Fig. 24.1), a convex
empty space surrounded by the people involved in a social interaction in which every
participant looks inward and no external people are allowed. The p-space is a narrow
stripe that surrounds the o-space, and that contains the bodies of the talking people,
while the r-space is the area beyond the p-space.
The use of space appears to be the behavioral cue best suited to the surveillance field: people detection and tracking are applications which provide information about the layout of the people in the space, that is, how they use it. Therefore, post-processing this information with social models that exploit proxemics is a natural step. The recent literature confirms this claim: many surveillance approaches presented in top-tier computer vision conferences try to include the social facet in their workflow. In particular, two applications have emerged in recent years: the tracking of moving people or groups and the detection of standing conversational groups.
In tracking, the keystone methodology for "socially" modeling moving people is the social force model (SFM) of Helbing and Molnár (1995), which applies a gas-kinetic analogy to the dynamics of pedestrians. It is a physical model for simulating interactions while pedestrians are moving, assuming that they react to energy potentials caused by other pedestrians and static obstacles. This happens through repulsive or attractive forces, while each pedestrian tries to keep a desired speed and motion direction. The model can be thought of as explaining group formations and obstacle avoidance strategies, i.e., basic and generic forms of human interaction. Pellegrini et al. (2009) and Scovanner and Tappen (2009) have modified the SFM by embedding it within a tracking framework, substituting the actual position of the pedestrian in the SFM with a prediction of the location made by a constant velocity model, which is then revised considering repulsive effects due to pedestrians or static obstacles. Attractive factors are not considered in these papers.
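The following sketch shows one Euler step of a simplified social force update in the spirit of Helbing and Molnár (1995), as adapted for tracking by the works above: a driving term relaxes the velocity toward a desired one, and each neighbour contributes an exponentially decaying repulsion. The exponential potential and its parameters are illustrative assumptions; attractive terms and static obstacles are omitted, as in the tracking adaptations discussed above.

```python
import numpy as np

def social_force_step(pos, vel, desired_vel, others, dt=0.1,
                      tau=0.5, a=2.0, b=0.3):
    """One Euler step of a simplified social force model.

    pos, vel, desired_vel : (2,) arrays for the tracked pedestrian.
    others : list of (2,) position arrays of nearby pedestrians.
    The driving term relaxes the velocity toward the desired one
    (time constant tau); each neighbour adds an exponentially decaying
    repulsive force (strength a, range b). Parameter values are
    arbitrary illustrative choices.
    """
    force = (desired_vel - vel) / tau
    for q in others:
        diff = pos - q
        dist = np.linalg.norm(diff) + 1e-9
        force += a * np.exp(-dist / b) * diff / dist
    vel_new = vel + dt * force
    pos_new = pos + dt * vel_new
    return pos_new, vel_new
```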
Park and Trivedi (2007) present a versatile synergistic framework for the analysis of multi-person interactions and activities in heterogeneous situations. They design an adaptive context switching mechanism to mediate between two stages, one where the body of an individual can be segmented into parts and the other where persons are treated as simple points. The concept of spatiotemporal personal space is also introduced to explain the grouping behavior of people. They extend the notion of personal space, the region surrounding each person that is considered personal domain or territory, to that of spatiotemporal personal space, which takes into account the motion of each person, modifying the geometry of the personal space into a sort of cone. The cone narrows in proportion to the subject's speed: the faster the subject, the narrower the area. An interaction is then defined by the intersection of such volumes.
Zen et al. (2010) use mutual distances to infer personality traits of people left free to
move in a room. The results show that it is possible to predict extraversion and neuroti-
cism ratings based on velocity and number of intimate/personal/social contacts (in the
sense of Hall) between pairs of individuals looking at one another.
Concerning the tracking of groups, the recent literature can be partitioned into three categories: 1) group-based techniques, where groups are treated as atomic entities without the support of individual track statistics (Lin & Liu, 2007); 2) individual-based methods, where group descriptions are built by associating individuals' tracklets that have been calculated beforehand, typically with a time lag of a few seconds (Pellegrini, Ess, & Van Gool, 2010; Yamaguchi et al., 2011; Qin & Shelton, 2012); and 3) joint individual-group approaches, where group tracking and individual tracking are performed simultaneously (Bazzani, Cristani, & Murino, 2012; Pang, Li, & Godsill, 2007; Mauthner, Donoser, & Bischof, 2008).

Essentially, the first class does not include social theories in the modeling, while the last two classes assume that people who are close enough and proceeding in the same direction represent a group with high probability. However, this assumption is crude and fails in many situations, for example in crowded scenes.
The second application concerns standing conversational groups, that is, groups of people who spontaneously decide to be in each other's immediate presence to converse with each and every member of that group (e.g., at a party, during a coffee break at the office, or at a picnic). These events have to be situated (Goffman, 1966), i.e., occurring within fixed physical boundaries: this means that people should be stationary rather than wandering. In this scenario, we look for focused interactions (Goffman, 1966), which occur when persons gather close together and openly cooperate to sustain a single focus of social attention; this is precisely the case where F-formations can be employed.
Cristani, Bazzani et al. (2011) find F-formations by exploiting a Hough voting strategy. The main requirements are that people be reasonably close to each other, that they be oriented toward the o-space, and that the o-space be empty, so that the individuals can look at each other.
Another approach to F-formation detection is that of Hung and Kröse (2011), who proposed to consider an F-formation as a maximal clique in an edge-weighted graph, where each node is a person and the edge weights measure the affinity between pairs. Such maximal cliques were defined by Pavan and Pelillo (2007) to be dominant sets, for which a game-theoretic approach was designed to solve the clustering problem under these constraints.
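A toy sketch of this graph-based view is given below: pairwise affinities combine proximity and mutual facing, and groups are obtained from the thresholded affinity graph. A simple connected-components grouping stands in for the dominant-sets optimization of Pavan and Pelillo (2007), and both the Gaussian distance kernel and the facing term are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def affinity(p_i, theta_i, p_j, theta_j, sigma=1.0):
    """Pairwise affinity from 2D positions and head orientations (radians).

    Combines closeness (Gaussian kernel on distance) with how much each
    person faces the other; both terms are illustrative assumptions.
    """
    d = np.linalg.norm(p_i - p_j)
    angle_ij = np.arctan2(p_j[1] - p_i[1], p_j[0] - p_i[0])   # direction i -> j
    angle_ji = np.arctan2(p_i[1] - p_j[1], p_i[0] - p_j[0])   # direction j -> i
    facing = max(0.0, np.cos(theta_i - angle_ij)) * max(0.0, np.cos(theta_j - angle_ji))
    return np.exp(-d ** 2 / (2 * sigma ** 2)) * facing

def find_groups(positions, thetas, thr=0.2):
    """Candidate conversational groups as connected components of the
    thresholded affinity graph (a crude stand-in for the dominant-sets
    clustering of Pavan and Pelillo, 2007)."""
    n = len(positions)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i, j in combinations(range(n), 2):
        if affinity(positions[i], thetas[i], positions[j], thetas[j]) > thr:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Example: two people facing each other, one bystander looking away.
pos = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([4.0, 0.0])]
head = [0.0, np.pi, np.pi / 2]          # radians
print(find_groups(pos, head))           # [[0, 1], [2]]
```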
Given an F-formation, other aspects of the interaction can be considered: for example, Cristani, Paggetti et al. (2011) analyze the kind of social relationships between people in an F-formation from a computational perspective. In their approach, they calculate pair-wise distances between people lying in the p-space and cluster the resulting distances into different classes, the number of which is chosen automatically by the algorithm following an information-theoretic principle. Their main finding is that each class represents a well-defined social bond. In addition, the approach adapts to different environmental conditions, namely the size of the space where people can move.

A New Challenge for Surveillance: From the Real World to the Social Web

So far, social signal processing and surveillance have cooperated on the analysis of data coming from observations of the real world. However, surveillance has started to operate in the virtual dimension as well, that is, the social web (Fuchs, 2012). Many observers claim that the Internet has been transformed in the past years from a system primarily oriented to information provision into a system more oriented to communication and community building. In this scenario, criminal phenomena that affect the "ordinary" community carry over into the social web: see, for example, bullying and cyber-bullying (Livingstone & Brake, 2010), and stalking and cyber-stalking (Ellison, Steinfield, & Lampe, 2007). These aspects are similarly dangerous and damaging both in the real world and in the social web. Other crimes are peculiar to the social web, the most prominent being identity violation, which occurs when somebody enters the social web with the identity of someone else. Essentially, there are three ways in which an identity can be violated: by identity theft (Newman, 2006), where an impostor gains access to the victim's account, mostly through Trojan-horse keystroke-logging programs (such as Dorkbots; Deng et al., 2012); by social engineering, i.e., tricking individuals into disclosing login details or changing user passwords (Anderson, 2001); or by creating a fake identity, that is, an identity which describes an invented person or emulates another person (Harman et al., 2005).
With crime typologies inherited from the real world, and with crimes exclusive to the social web sphere, a crucial aspect must be noted: on the Internet, traces and evidence are more abundant than in the real world. Stalking in the real world may happen when a person continuously shows up in one's proximity, and detecting this with video cameras is difficult and cumbersome. On the Internet, stalking is manifested through emails and messages, which can be kept and analyzed. The same holds for cyber-bullying, where intimidations and threats are tangible.
Therefore, our claim is that, when surveillance is performed in the social web, approaching it with the methods of social signal processing may be highly promising. The study presented here, published in Cristani et al. (2012), is a first demonstration of this claim.

Conversationally-inspired Stylometric Features for Authorship Attribution in Instant Messaging
Authorship attribution (AA) is the research domain aimed at automatically recogniz-
ing the author of a given text sample, based on the analysis of stylometric cues that
can be split into five major groups: lexical, syntactic, structural, content-specific and
idiosyncratic (Abbasi & Chen, 2008).
Nowadays, one of the most important AA challenges is the identification of people involved in chat (or chat-like) conversations. The task has become important since social media have penetrated the everyday life of many people and offered the possibility of interacting with persons who hide their identity behind nicknames or potentially fake profiles. So far, standard stylometric features have been employed to categorize the content of a chat (Orebaugh & Allnutt, 2009) or the behavior of the participants (Zhou & Zhang, 2004), but attempts at identifying chat participants are still few and preliminary. Furthermore, the similarity between spoken conversations and chat interactions has been largely neglected, even though it is probably what distinguishes chat data from any other type of written information.
Hence, we investigated possible technologies aimed at revealing the identity of a person involved in instant messaging activities. In practice, we simply require that the user under analysis (from now on, the probe individual) engages in a conversation for a limited number of turns, with any interlocutor: after that, novel hybrid cues can be extracted, providing statistical measures which can be matched against a gallery of signatures, looking for possible correspondences. Subsequently, the matches can be employed for performing user recognition.
In our work, we propose cues that take into account the conversational nature of chat interactions. Some of them fit in the taxonomy quoted above, but others require the definition of a new group of conversational features. The reason is that they are based on turn-taking, probably the most salient aspect of spoken conversations, which applies to chat interactions as well. In conversations, turns are intervals of time during which only one person talks. In chat interactions, a turn is a block of text written by one participant during an interval of time in which none of the other participants writes anything. As in the case of automatic analysis of spoken conversations, the AA features are extracted from individual turns and not from the entire conversation.

Feature Extraction
In our study, we focused on a data set of N = 77 subjects, each involved in a dyadic chat
conversation with an interlocutor. The conversations can be modeled as sequences of
turns, where “turn” means a stream of symbols and words (possibly including “return”
characters) typed consecutively by one subject without being interrupted by the inter-
locutor. The feature extraction process was applied to T consecutive turns that a subject
produces during the conversation.
Privacy and ethical issues limit the choice of features to those that do not involve the content of the conversation, namely the number of words, characters, punctuation marks, and emoticons.
In standard AA approaches, these features are counted over entire conversations, obtaining a single quantity. In our case, we considered the turn as the basic analysis unit, so we extracted such features for each turn, obtaining T numbers. After that, we calculated statistical descriptors over them, either mean values or histograms; in the latter case, since turns are usually short, the values collapse toward small numbers. Modeling them as uniformly binned histograms over the whole range of assumed values would produce ineffective quantizations, so we opted for exponential histograms, where narrow bins are located near zero and bin widths increase toward higher values. This intuition has been validated experimentally, as discussed in the following.
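A minimal sketch of such an exponential-bin histogram is shown below; the geometric progression of bin edges and the normalization are assumptions made for illustration, since the exact bin layout of the original work is not reproduced here.

```python
import numpy as np

def exponential_histogram(values, n_bins=32, v_max=None):
    """Histogram with bin widths that grow toward higher values.

    Per-turn counts (e.g., number of words per turn) concentrate near
    zero, so narrow bins are placed near zero and wider bins toward the
    maximum; bin edges grow geometrically. The histogram is normalized
    to sum to 1 so that signatures of different subjects are comparable.
    """
    values = np.asarray(values, dtype=float)
    v_max = v_max or max(values.max(), 1.0)
    # one extra edge at 0, then a geometric progression up to v_max
    edges = np.concatenate(([0.0], np.geomspace(1e-3, v_max, n_bins)))
    hist, _ = np.histogram(values, bins=edges)
    return hist / max(hist.sum(), 1)

# e.g., number of words typed in each of T turns by one subject:
words_per_turn = [3, 1, 7, 2, 15, 4, 2, 1, 30, 5]
signature = exponential_histogram(words_per_turn, v_max=260)
```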
The introduction of turns as a basic analysis unit allows one to introduce features
that explicitly take into account the conversational nature of the data and mirror behav-
ioral measurements typically applied in automatic understanding of social interactions
(see Vinciarelli, Pantic, & Bourlard, 2009 for an extensive survey):

• Turn duration: the time spent to complete a turn (in hundredths of a second); this feature accounts for the rhythm of the conversation, with faster exchanges typically corresponding to higher engagement.
• Writing speed (two features): number of typed characters – or words – per second (typing rate); these two features indicate whether the duration of a turn is simply due to the amount of information typed (high typing rate) or to cognitive load (low typing rate), i.e., to the need to think about what to write.

Table 24.1 Stylometric features used in the experiments. The symbol "#" stands for "number of." In bold, the conversational features.

No. Feature Range

1 # words [0–260]
2 # emoticons [0–40]
3 # emoticons per word [0–1]
4 # emoticons per characters [0–0.5]
5 # exclamation marks [0–12]
6 # question marks [0–406]
7 # characters [0–1318]
8 average word length [0–20]
9 # three points [0–34]
10 # uppercase letters [0–94]
11 # uppercase letters/#words [0–290]
12 turn duration [0–1800 (sec.)]
13 # return chars [1–20]
14 # chars per second [0–20 (ch./sec.)]
15 # words per second [0–260]
16 mimicry degree [0–1115]

• Number of "return" characters: since "return" characters tend to give interlocutors an opportunity to start a new turn, high values of this feature are likely to measure the tendency to hold the floor and prevent others from "speaking" (an indirect measure of dominance).
• Mimicry: ratio between the number of words in the current turn and the number of words in the previous turn; this feature models the tendency of a subject to follow the conversational style of the interlocutor (at least as far as the length of the turns is concerned). Mimicry accounts for the social attitude of the subjects.

We call these features conversational features; a minimal sketch of their extraction is given below.
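The sketch below illustrates how such conversational features could be extracted from a sequence of turns; the turn data structure, the strict subject/interlocutor alternation, and the field names are assumptions made for illustration and do not reflect the format of the original corpus.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    text: str          # full text typed in the turn
    start: float       # seconds, when the turn starts
    end: float         # seconds, when the turn ends

def conversational_features(turns):
    """Per-turn conversational cues for one subject.

    `turns` is assumed to alternate interlocutor / subject, starting
    with the interlocutor, so mimicry can compare each subject turn
    with the preceding interlocutor turn.
    """
    feats = []
    for prev, cur in zip(turns[:-1:2], turns[1::2]):   # (interlocutor, subject)
        duration = cur.end - cur.start
        n_words = len(cur.text.split())
        n_chars = len(cur.text)
        feats.append({
            "turn_duration": duration,
            "chars_per_sec": n_chars / max(duration, 1e-3),
            "words_per_sec": n_words / max(duration, 1e-3),
            "n_returns": cur.text.count("\n"),
            "mimicry": n_words / max(len(prev.text.split()), 1),
        })
    return feats
```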


Table 24.1 provides basic facts about the features used in our approach. Features 1–13 and 16 are represented as exponential histograms (32 bins) computed over the T turns, while features 14 and 15 are averages estimated over the T turns. This architectural choice maximizes the AA accuracy.

Experiments
The experiments have been performed over a corpus of dyadic chat conversations col-
lected with Skype (in Italian language). The conversations are spontaneous, i.e., they
have been held by the subjects in their real life and not for the purpose of data collec-
tion. This ensures that the behavior of the subjects is natural and no attempt has been
made to modify the style in any sense. The number of turns per subject ranges between
60 and 100. Hence, the experiments were performed over sixty turns of each person.
In this way, any bias due to differences in the amount of available material should
be avoided. When possible, we picked different turns selections (maintaining their
342 Applications of Social Signal Processing

Figure 24.2 CMCs of the proposed features. The numbers on the right indicate the nAUC.
Conversational features are in bold (best viewed in colors).

chronological order) in order to generate different AA trials. The average number of


words per subject is 615. The sixty turns of each subject are split into probe and gallery
set, each including thirty samples.
The first part of the experiments aimed at assessing each feature independently, as a simple ID signature (later on, we will see how to create more informative ID signatures). A particular feature of a single subject was selected from the probe set and matched against the corresponding gallery feature of all subjects, employing a given metric: the Bhattacharyya distance (Duda, Hart, & Stork, 2001) for the histograms and the Euclidean distance for the mean values. This was done for all the probe subjects, resulting in an N × N distance matrix. Ranking the N distances for each probe element in ascending order allows one to compute the Cumulative Match Characteristic (CMC) curve, i.e., the expectation of finding the correct match in the top n positions of the ranking. The CMC is an effective performance measure for AA approaches (Bolle et al., 2003). In particular, the value of the CMC curve at position one is the probability that the probe ID signature of a subject is closer to the gallery ID signature of the same subject than to any other gallery ID signature; the value of the CMC curve at position n is the probability of finding the correct match in the first n ranked positions.
Given the CMC curve for each feature (obtained by averaging over all the available trials), the normalized Area Under the Curve (nAUC) is calculated as a measure of accuracy. Figure 24.2 shows that the individual performance of each feature is low (less than 10% at rank 1 of the CMC curve). In addition, the first dynamic feature has the seventh highest nAUC, while the other ones are in positions 10, 14, 15, and 16, respectively.
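For concreteness, the following sketch computes a CMC curve and its normalized AUC from an N × N probe–gallery distance matrix, assuming that probe i corresponds to gallery identity i; the discrete mean-over-ranks approximation of the nAUC is an assumption, not necessarily the exact computation used in the original experiments.

```python
import numpy as np

def cmc_curve(dist):
    """CMC curve from an N x N probe-gallery distance matrix.

    Assumes probe i corresponds to gallery identity i. cmc[k-1] is the
    fraction of probes whose correct match appears within the top-k
    ranked gallery entries.
    """
    n = dist.shape[0]
    # rank of the correct match for each probe (1 = best)
    ranks = np.array([1 + np.sum(dist[i] < dist[i, i]) for i in range(n)])
    return np.array([np.mean(ranks <= k) for k in range(1, n + 1)])

def nauc(cmc):
    """Normalized area under the CMC curve (1.0 = perfect)."""
    return cmc.mean()

# Example with random distances for N = 77 subjects:
rng = np.random.default_rng(0)
d = rng.random((77, 77))
curve = cmc_curve(d)
print(curve[0], nauc(curve))   # rank-1 accuracy ~1/77, nAUC ~0.5
```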
The experiments above served as the basis for applying the forward feature selection (FFS) strategy (Liu & Motoda, 2008) to select the best pool of features that can compose an ID signature. At the first iteration, FFS retains the feature with the highest nAUC; at the second, it selects the feature that, in combination with the previous one, gives the highest nAUC, and so on until all features have been processed. Combining features means averaging their related distance matrices to form a composite one. The pool of selected features is the one which gives the highest nAUC.

Figure 24.3 Comparison among different pools of features.
Since FFS is a greedy strategy, fifty runs of the feature selection are used, each time selecting a partially different pool of thirty turns for building the probe set. In this way, fifty different ranked subsets of features are obtained. To distill a single subset, the Kuncheva stability index (Kuncheva, 2007) is adopted, which essentially keeps the most informative features (those with high ranking in the FFS) that occur most often.
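A minimal sketch of this selection procedure is given below: per-feature distance matrices are greedily combined by averaging, scored by nAUC, and the best-scoring pool found along the way is returned. The stability-index step across multiple runs is not reproduced, and the helper that computes the nAUC repeats the assumptions of the previous sketch.

```python
import numpy as np

def _nauc(dist):
    """Normalized AUC of the CMC curve of an N x N probe-gallery
    distance matrix (probe i is assumed to match gallery i)."""
    n = dist.shape[0]
    ranks = np.array([1 + np.sum(dist[i] < dist[i, i]) for i in range(n)])
    return np.mean([(ranks <= k).mean() for k in range(1, n + 1)])

def forward_feature_selection(dist_mats):
    """Greedy FFS over per-feature distance matrices.

    `dist_mats` is a list of N x N matrices, one per feature; combining
    a pool of features means averaging their matrices. At each step the
    feature whose addition yields the highest nAUC is added, and the
    best-scoring pool found along the way is returned.
    """
    remaining = list(range(len(dist_mats)))
    selected, history = [], []
    while remaining:
        scored = [(_nauc(np.mean([dist_mats[i] for i in selected + [f]], axis=0)), f)
                  for f in remaining]
        best_score, best_f = max(scored)
        selected.append(best_f)
        remaining.remove(best_f)
        history.append((best_score, list(selected)))
    return max(history)   # (best nAUC, selected feature indices)
```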
The FFS process results in twelve features, ranked according to their contribution to the overall CMC curve. The set includes features 5, 2, 9, 10, 12 (turn duration), 13 (# "return" characters), 8, 14 (characters per second), 6, 7, 16 (mimicry degree), and 15 (words per second). The conversational features, reported in bold, rank higher here than when used individually. This suggests that, even if their individual nAUC was relatively low, they encode information complementary to the traditional AA features.
The final CMC curve, obtained using the pool of selected features, is reported in Figure 24.3, curve (a). In this case, the rank-1 accuracy is 29.2%. For comparison, other CMC curves are reported, considering (b) the whole pool of features (without feature selection); (c) the same as (b), but adopting linear histograms instead of exponential ones; (d) the selected features with exponential histograms, without the conversational ones; (e) the conversational features alone; and (f) the selected features, calculating the mean statistics over the whole thirty turns, as is usually done in the literature with stylometric features.

Table 24.2 Relationship between performance and number of turns used to extract the ID signatures.

# Turns      5     10    15    20    25    30
nAUC         68.6  76.6  80.6  85.0  88.4  89.5
rank-1 acc.  7.1   14.0  15.1  21.9  30.6  29.2

Several facts can be inferred: our approach has the highest nAUC; feature selection improves the performance; exponential histograms work better than linear ones; conversational features increase the matching probability by around 10% in the first ten ranks; and conversational features alone give higher performance than standard stylometric features calculated over the whole set of turns rather than over each turn.
The last experiment shows how the AA system behaves when the number of turns employed for creating the probe and gallery signatures is reduced. The results (averaged over 50 runs) are shown in Table 24.2. Increasing the number of turns increases the nAUC score, although the improvement levels off at around thirty turns.

Conclusions

The technical quality of the classical modules that compose a surveillance system nowadays makes it possible to face very complex scenarios. The goal of this review is to support the argument that a social perspective is fundamental for dealing with the highest-level module, i.e., the analysis of human activities, in a principled and fruitful way. We discussed how the use of social signals may be valuable toward a robust encoding of social events that otherwise could not be captured. In addition, we reported a study where social signal processing is applied to a recent kind of surveillance, namely the surveillance of the social web. We claim that this form of surveillance may access data not available in the real world, for example conversations; as a consequence, instruments of social signal processing for conversational analysis, scarcely employed in surveillance so far, may be applied in this context. In this chapter we showed how it is possible to recognize the identity of a person by examining the way she chats. Future work in this novel direction may aim to recognize conflicts during a chat or, more generally, to categorize the type of conversation taking place, in a real-time fashion.

References

Abbasi, A. & Chen, H. (2008). Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems, 26(2), 1–29.
Aggarwal, J. K. & Cai, Q. (1999). Human motion analysis: A review. Computer Vision and Image Understanding, 73(3), 428–440.

Aggarwal, J. K. & Ryoo, M. S. (2011). Human activity analysis: A review. ACM Computing
Surveys, 43, 1–43.
Ambady, N. & Rosenthal, R. (1992). Thin slices of expressive behavior as predictors of interper-
sonal consequences: A meta-analysis. Psychological Bulletin, 111(2), 256–274.
Anderson, R. J. (2001). Security Engineering: A Guide to Building Dependable Distributed Sys-
tems. New York: John Wiley & Sons.
Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and
articulated pose estimation. In Proceedings of IEEE Computer Society Conference on Com-
puter Vision and Pattern Recognition Workshops (pp. 1014–1021).
Ba, S. O. & Odobez, J. M. (2006). A study on visual focus of attention recognition from head
pose in a meeting room. Lecture Notes in Computer Science, 4299, 75–87.
Bazzani, L., Cristani, M., & Murino, V. (2012). Decentralized particle filter for joint individual-
group tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recogni-
tion (pp. 1888–1893).
Bazzani, L., Cristani, M., Tosato, D., et al. (2011). Social interactions by visual focus of attention
in a three-dimensional environment. Expert Systems, 30(2), 115–127.
Benfold, B. & Reid, I. (2009). Guiding visual surveillance by tracking human attention. In Pro-
ceedings of the 20th British Machine Vision Conference, September.
Bolle, R., Connell, J., Pankanti, S., Ratha, N., & Senior, A. (2003). Guide to Biometrics. New
York: Springer.
Borges, P. V. K., Conci, N., & Cavallaro, A. (2013). Video-based human behavior understanding:
A survey. IEEE Transactions on Circuits and Systems for Video Technology, 23(11), 1993–
2008.
Buxton, H. (2003). Learning and understanding dynamic scene activity: A review. Image and
Vision Computing, 21(1), 125–136.
Cassell, J. (1998). A framework for gesture generation and interpretation. In R. Cipolla & A.
Pentland (Eds), Computer Vision in Human–Machine Interaction (pp. 191–215). New York:
Cambridge University Press.
Cedras, C. & Shah, M. (1995). Motion-based recognition: A survey. Image and Vision Computing,
13(2), 129–155.
Chen, C. & Odobez, J. (2012). We are not contortionists: Coupled adaptive learning for head
and body orientation estimation in surveillance video. In Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition (pp. 1544–1551).
Cristani, M., Bazzani, L., Paggetti, G., et al. (2011). Social interaction discovery by statistical
analysis of F-formations. In J. Hoey, S. McKenna, & E. Trucco (Eds), Proceedings of British
Machine Vision Conference (pp. 23.1–23.12). Guildford, UK: BMVA Press.
Cristani, M., Paggetti, G., Vinciarelli, A., et al. (2011). Towards computational proxemics: Infer-
ring social relations from interpersonal distances. In Proceedings of Third IEEE International
Conference on Social Computing (pp. 290–297).
Cristani, M., Pesarin, A., Vinciarelli, A., Crocco, M., & Murino, V. (2011). Look at who’s talking:
Voice activity detection by automated gesture analysis. In Proceedings of the Workshop on
Interactive Human Behavior Analysis in Open or Public Spaces (InterHub 2011).
Cristani, M., Raghavendra, R., Del Bue, A., & Murino, V. (2013). Human behavior analysis in
video surveillance: A social signal processing perspective. Neurocomputing, 100(2), 86–97.
Cristani, M., Roffo, G., Segalin, C., et al. (2012). Conversationally inspired stylometric features
for authorship attribution in instant messaging. In Proceedings of the 20th ACM International
Conference on Multimedia (pp. 1121–1124).

Curhan, J. R. & Pentland, A. (2007). Thin slices of negotiation: Predicting outcomes from conver-
sational dynamics within the first five minutes. Journal of Applied Psychology, 92(3), 802–811.
Dee, H. M. & Velastin, S. A. (2008). How close are we to solving the problem of automated visual
surveillance. Machine Vision and Application, 19(2), 329–343.
Deng, Z., Xu, D., Zhang, X., & Jiang, X. (2012). IntroLib: Efficient and transparent library call
introspection for malware forensics. In 12th Annual Digital Forensics Research Conference
(pp. 13–23).
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification. New York: John Wiley &
Sons.
Ellison, N. B, Steinfield, C., & Lampe, C. (2007). The benefits of Facebook “friends”: Social
capital and college students’ use of online social network sites. Journal of Computer-Mediated
Communication, 12(4), 1143–1168.
Fuchs, C. (2012). Internet and Surveillance: The Challenges of Web 2.0 and Social Media. New
York: Routledge.
Gavrila, D. M. (1999). The visual analysis of human movement: A survey. Computer Vision and
Image Understanding, 73(1), 82–98.
Goffman, E. (1966). Behavior in Public Places: Notes on the Social Organization of Gatherings.
New York: Free Press.
Groh, G., Lehmann, A., Reimers, J., Friess, M. R., & Schwarz, L. (2010). Detecting social situa-
tions from interaction geometry. In Proceedings of the 2010 IEEE Second International Con-
ference on Social Computing (pp. 1–8).
Hall, E. T. (1966). The Hidden Dimension. Garden City, NY: Doubleday.
Harman, J. P., Hansen, C. E., Cochran, M. E., & Lindsey, C. R. (2005). Liar, liar: Internet fak-
ing but not frequency of use affects social skills, self-esteem, social anxiety, and aggression.
Cyberpsychology & Behavior, 8(1), 1–6.
Helbing, D., & Molnár, P. (1995). Social force model for pedestrian dynamics. Physical Review
E, 51(5), 4282–4287.
Hu, W., Tan, T., Wang, L., & Maybank, S. (2004). A survey on visual surveillance of
object motion and behaviors. IEEE Transactions on Systems, Man and Cybernetics, 34,
334–352.
Hung, H., Huang, Y., Yeo, C., & Gatica-Perez, D. (2008). Associating audio-visual activity cues
in a dominance estimation framework. In Proceedings of IEEE Computer Society Conference
on Computer Vision and Pattern Recognition Workshops, June 23–28, Anchorage, AK.
Hung, H., & Kröse, B. (2011). Detecting F-formations as dominant sets. In Proceedings of the
International Conference on Multimodal Interaction (pp. 231–238).
Kendon, A. (1990). Conducting Interaction: Patterns of Behavior in Focused Encounters. New
York: Cambridge University Press.
Kuncheva, L. I. (2007). A stability index for feature selection. In Proceedings of IASTED Inter-
national Multi-Conference Artificial Intelligence and Applications (pp. 390–395).
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2–
3), 107–123.
Li, Y., Fathi, A., & Rehg, J. M. (2013). Learning to predict gaze in egocentric video. In Proceed-
ings of 14th IEEE International Conference on Computer Vision (pp. 3216–3223).
Lin, W.-C. & Liu, Y. (2007). A lattice-based MRF model for dynamic near-regular texture track-
ing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 777–792.
Liu, H. & Motoda, H. (2008). Computational Methods of Feature Selection. Boca Raton, FL:
Chapman & Hall/CRC.

Liu, X., Krahnstoever, N., Yu, T., & Tu, P. (2007). What are customers looking at? In Proceedings
of IEEE Conference on Advanced Video and Signal Based Surveillance (pp. 405–410).
Livingstone, S. & Brake, D. R. (2010). On the rapid rise of social networking sites: New findings
and policy implications. Children & Society, 24(1), 75–83.
Lott, D. F. & Sommer, R. (1967). Seating arrangements and status. Journal of Personality and
Social Psychology, 7(1), 90–95.
Mauthner, T., Donoser, M., & Bischof, H. (2008). Robust tracking of spatial related components.
Proceedings of the International Conference on Pattern Recognition (pp. 1–4).
Moeslund, T. B., Hilton, A., & Krüger, V. (2006). A survey of advances in vision-based human
motion capture and analysis. Computer Vision and Image Understanding, 104(2), 90–126.
Newman, R. C. (2006). Cybercrime, identity theft, and fraud: Practicing safe Internet – network
security threats and vulnerabilities. In Proceedings of the 3rd Annual Conference on Informa-
tion Security Curriculum Development (pp. 68–78).
Oberschall, A. (1978). Theories of social conflict. Annual Review of Sociology, 4, 291–315.
Oikonomopoulos, A., Patras, I., & Pantic, M. (2011). Spatiotemporal localization and categoriza-
tion of human actions in unsegmented image sequences. IEEE Transactions on Image Process-
ing, 20(4), 1126–1140.
Orebaugh, A. & Allnutt, J. (2009). Classification of Instant Messaging Communications for
Forensics Analysis. International Journal of Forensic Computer Science, 1, 22–28.
Panero, J. & Zelnik, M. (1979). Human Dimension and Interior Space: A Source Book of Design.
New York: Whitney Library of Design.
Pang, S. K., Li, J., & Godsill, S. (2007). Models and algorithms for detection and tracking of coor-
dinated groups. In Proceedings of International Symposium on Image and Signal Processing
and Analysis (pp. 504–509).
Park, S. & Trivedi, M. M. (2007). Multi-person interaction and activity analysis: A syn-
ergistic track- and body-level analysis framework. Machine Vision and Application, 18,
151–166.
Pavan, M. & Pelillo, M. (2007). Dominant sets and pairwise clustering. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 29(1): 167–172.
Pellegrini, S., Ess, A., Schindler, K., & Van Gool, L. (2009). You’ll never walk alone: Modeling
social behavior for multi-target tracking. In Proceedings of 12th International Conference on
Computer Vision, Kyoto, Japan (pp. 261–268).
Pellegrini, S., Ess, A., & Van Gool, L. (2010). Improving data association by joint modeling of
pedestrian trajectories and groupings. In Proceedings of European Conference on Computer
Vision (pp. 452–465).
Pentland, A. (2007). Social signal processing. IEEE Signal Processing Magazine, 24(4), 108–111.
Pesarin, A., Cristani, M., Murino, V., & Vinciarelli, A. (2012). Conversation analysis at work:
Detection of conflict in competitive discussions through semi-automatic turn-organization anal-
ysis. Cognitive Processing, 13(2), 533–540.
Pianesi, F., Mana, N., Ceppelletti, A., Lepri, B., & Zancanaro, M. (2008). Multimodal recog-
nition of personality traits in social interactions. Proceedings of International Conference on
Multimodal Interfaces (pp. 53–60).
Popa, M., Koc, A. K., Rothkrantz, L. J. M., Shan, C., & Wiggers, P. (2012). Kinect sensing of
shopping related actions. In R. Wichert, K. van Laerhoven, & J. Gelissen (Eds), Constructing
Ambient Intelligence (vol. 277, pp. 91–100). Berlin: Springer.
Qin, Z. & Shelton, C. R. (2012). Improving multi-target tracking via social grouping. In Proceed-
ings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 1972–1978).

Rajagopalan, S. S., Dhall, A., & Goecke, R. (2013). Self-stimulatory behaviours in the wild for
autism diagnosis. In Proceedings of IEEE Workshop on Decoding Subtle Cues from Social
Interactions (associated with ICCV 2013) (pp. 755–761).
Richmond, V. & McCroskey, J. (1995). Nonverbal Behaviors in Interpersonal Relations. Boston:
Allyn and Bacon.
Robertson, N. M., & Reid, I. D. (2011). Automatic reasoning about causal events in surveillance
video. EURASIP Journal on Image and Video Processing, 1, 1–19.
Russo, N. (1967). Connotation of seating arrangements. The Cornell Journal of Social Relations,
2(1), 37–44.
Salamin, H., Favre, S., & Vinciarelli, A. (2009). Automatic role recognition in multiparty record-
ings: Using social affiliation networks for feature extraction. IEEE Transactions on Multimedia,
11(7), 1373–1380.
Schegloff, E. (2000). Overlapping talk and the organisation of turn-taking for conversation. Lan-
guage in Society, 29(1), 1–63.
Scovanner, P. & Tappen, M. F. (2009). Learning pedestrian dynamics from the real world. In
Proceedings International Conference on Computer Vision (pp. 381–388).
Smith, K., Ba, S., Odobez, J., & Gatica-Perez, D. (2008). Tracking the visual focus of attention for
a varying number of wandering people. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 30(7), 1–18.
Stiefelhagen, R., Finke, M., Yang, J., & Waibel, A. (1999). From gaze to focus of attention. Lec-
ture Notes in Computer Science, 1614, 761–768.
Stiefelhagen, R., Yang, J., & Waibel, A. (2002). Modeling focus of attention for meeting indexing
based on multiple cues. IEEE Transactions on Neural Networks, 13, 928–938.
Tajfel, H. (1982). Social psychology of intergroup relations. Annual Review of Psychology, 33,
1–39.
Tosato, D., Spera, M., Cristani, M., & Murino, V. (2013). Characterizing humans on Riemannian
manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2–15.
Turaga, P., Chellappa, R., Subrahmanian, V. S., & Udrea, O. (2008). Machine recognition of
human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology,
18(11), 1473–1488.
Vinciarelli, A., Pantic, M., & Bourlard, H. (2009). Social signal processing: Survey of an emerg-
ing domain. Image and Vision Computing Journal, 27(12), 1743–1759.
Yamaguchi, K., Berg, A. C., Ortiz, L. E., & Berg, T. L. (2011). Who are you with and where are
you going? In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
(pp. 1345–1352).
Yang, Y. & Ramanan, D. (2011). Articulated pose estimation with flexible mixtures-of-parts. Pro-
ceedings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 1385–1392).
Zen, G., Lepri, B., Ricci, E., & Lanz, O. (2010). Space speaks: Towards socially and personality
aware visual surveillance. Proceedings of the 1st ACM International Workshop on Multimodal
Pervasive Video Analysis (pp. 37–42).
Zhou, L. & Zhang, D. (2004). Can online behavior unveil deceivers? An exploratory investigation
of deception in instant messaging. In Proceedings of the Hawaii International Conference on
System Sciences (no. 37, p. 22).
25 Analysis of Small Groups
Daniel Gatica-Perez, Oya Aran, and Dinesh Jayagopi

Introduction

Teams are key components of organizations and, although complexity and scale are typical features of large institutions worldwide, much of the work is still carried out by small groups. The small-group meeting, where people discuss around a table, is a pervasive and quintessential form of collaborative work. For many years now, this setting has been studied in computing with the goal of developing methods that automatically analyze the interaction using both the spoken words and the nonverbal channels as information sources. The current literature offers the possibility of inferring key aspects of the interaction, ranging from personal traits to hierarchies and other relational constructs, which in turn can be used in a number of applications. Overall, this domain is rapidly evolving and is studied in multiple subdisciplines of computing and engineering as well as the cognitive sciences.
We present a concise review of recent literature on computational analysis of face-to-
face small-group interaction. Our goal is to provide the reader with a quick pointer to
work on analysis of conversational dynamics, verticality in groups, personality of group
members, and characterization of groups as a whole, with a focus on nonverbal behavior
as information source. The value of the nonverbal channel (including voice, face, and
body) to infer high-level information about individuals has been documented at length
in psychology and communication (Knapp & Hall, 2009) and is one of the main themes
of this volume.
In the chapter, we include pointers to 100 publications appearing in a variety of venues between 2009 and 2013 (discussions of earlier work can be found, e.g., in Gatica-Perez, 2009). After a description of our methodology (see section on Methodology) and a basic quantitative analysis of this body of literature (see section on the Analysis of Main Trends), we select, due to the limited space, a few works in each of the four aforementioned trends to illustrate the kind of research questions, computational approaches, and current performance available in the literature (see sections on Conversational Dynamics, Verticality, Personality, and Group Characterization). Taken together, the existing research on small-group analysis is diverse in terms of goals and studied scenarios, relies on state-of-the-art techniques for behavioral feature extraction to characterize group members from audio, visual, and other sensor sources, and still largely uses standard machine learning techniques as tools for computational inference of interaction-related variables of interest. In the Conclusions and Outlook section, we conclude the chapter with a few words about what the future may bring in this domain.

Methodology

For this review, we limited the search for literature on the topic by the conditions listed
below.
1. Publications written in English from 2009 to 2013 (previous surveys cover older
literature; Gatica-Perez, 2009).
2. Papers strictly covering small groups, i.e., involving between three and six conver-
sational partners where all of them are human. This condition therefore excludes
literature using robots and agents interacting with people, and literature involving
only individuals (e.g. lectures or self-presentations), dyads, and large groups.
3. Papers where strictly co-located, face-to-face interactions are studied. This restriction
thus leaves aside literature on computer-mediated communication.
4. Papers where some form of sensor processing is done (e.g. audio, video, or motion). This condition thus excludes papers that focus on analysis using only transcribed speech.
5. Original research work, rather than review papers or papers that summarize or revisit existing work.
With the above restrictions, a wide but non-exhaustive search of the literature (using
a combination of web searches for terms, such as “small group” and “multiparty”
and publication venue-specific searches) was conducted in the summer of 2013 and
resulted in 100 papers, including twenty-five journal papers and seventy-five confer-
ence/workshop papers. We then defined seven classification areas that span most of
the publication venues where work in computational analysis of small groups with the
above restrictions can be found. The areas include audio, speech, and language (ASL,
including venues such as IEEE T-ASLP, ICASSP, InterSpeech), computer vision (CV,
with venues such as IVC, CVIU, CVPR, ICCV), multimodal and multimedia processing
(MM, including venues such as IEEE T-MM, ICMI, MM), human–computer interaction
(HCI, with venues like CHI), pattern recognition and machine learning (PR, including
venues such as IEEE T-PAMI, PR, PRL, ICPR), behavioral, affective, and social (BAS,
with venues such as IEEE T-AC, ACII, SocialCom, SSPW, HBU), and other (catching publications that could not be clearly associated with any of the previous categories).1

Analysis of Main Trends

We analyze the trends based on the 100 technical references on small-group analysis found using the methodology described in the Methodology section. Figure 25.1(b) shows the distribution of the publications over time. The number of publications on small-group analysis seems to be stable between 2009 and 2012, with around twenty publications per year. The figures for 2013 are incomplete due to the date at which this review was done. In comparison to the period between 2001 and 2009, reported in Gatica-Perez (2009), we see an increase in the number of publications of around ten per year since 2009.

Figure 25.1 Statistics of the 100 technical references on small-group analysis reviewed in this paper: (a) distribution of papers over time and (b) distribution of papers over research field in journals, conferences, and workshops: audio, speech, language (ASL); computer vision (CV); multimodal and multimedia processing (MM); human–computer interaction (HCI); pattern recognition and machine learning (PR); behavioral, affective, social (BAS); other.
In Figure 25.1(a), we show the distribution of the papers per research field. Almost half of the papers appeared in venues related to multimodal and multimedia processing (column labeled MM in Figure 25.1(a)).

Table 25.1 List of references for small group analysis in four main categories.

Conversational dynamics Ba and Odobez (2009); Baldwin et al. (2009); Bohus and Horvitz (2009);
Bousmalis et al. (2009); Chen and Harper (2009); Germesin and Wilson
(2009); Ishizuka et al. (2009); De Kok and Heylen (2009); Kumano et al.
(2009); Lepri, Mana, Cappelletti, and Pianesi (2009); Otsuka et al. (2009);
Vinciarelli (2009); Bachour et al. (2010); Gorga and Otsuka (2010);
Subramanian et al. (2010); Sumi et al. (2010); Valente and Vinciarelli
(2010); Voit and Stiefelhagen (2010); Ba and Odobez (2011a, 2011b);
Bohus and Horvitz (2011); Bousmalis et al. (2011); Campbell et al. (2011);
Cristani et al. (2011); Kumano et al. (2011); Wang et al. (2011); Angus
et al. (2012); Bruning et al. (2012); Debras and Cienki (2012); Kim,
Valente, and Vinciarelli (2012); Kim, Filippone et al. (2012); Noulas et al.
(2012); Otsuka and Inoue (2012); Pesarin et al. (2012); Prabhakar and
Rehg (2012); Rehg et al. (2012); Song et al. (2012); Vinyals et al. (2012);
Bousmalis, Mehu, and Pantic (2013); Bousmalis, Zafeiriou et al. (2013)
Verticality and roles Favre et al. (2009); Raducanu and Gatica-Perez (2009); Salamin et al. (2009);
Aran and Gatica-Perez (2010); Aran et al. (2010); Charfuelan et al. (2010);
Escalera et al. (2010); Glowinski et al. (2010); Hung and Chittaranjan
(2010); Poggi and D’Errico (2010); Salamin et al. (2010); Sanchez-Cortes
et al. (2010); Valente and Vinciarelli (2010); Varni et al. (2010);
Charfuelan and Schroder (2011); Hung et al. (2011); Kalimeri et al.
(2011); Raiman et al. (2011); Sanchez-Cortes et al. (2011); Schoenenberg
et al. (2011); Vinciarelli, Salamin et al. (2011); Vinciarelli, Valente et al.
(2011); Wilson and Hofer (2011); Feese et al. (2012); Hadsell et al. (2012);
Kalimeri et al. (2012); Nakano and Fukuhara (2012); Raducanu and
Gatica-Perez (2012); Salamin and Vinciarelli (2012); Sanchez-Cortes,
Aran, Schmid Mast et al. (2012); Sanchez-Cortes, Aran, Jayagopi et al.
(2012); Wöllmer et al. (2012); Wang et al. (2012); Dong et al. (2013);
Ramanathan et al. (2013); Sapru and Bourlard (2013); Suzuki et al. (2013)
Personality Lepri, Mana, Cappelletti, Pianesi, and Zancanaro (2009); Lepri, Subramanian
et al. (2010); Lepri, Kalimeri et al. (2010); Staiano, Lepri, Kalimeri et al.
(2011); Staiano, Lepri, Ramanathan et al. (2011); Lepri et al. (2012); Aran
and Gatica-Perez (2013a, 2013b); Pianesi (2013)
Group level analysis Camurri et al. (2009); Dai et al. (2009); Jayagopi and Gatica-Perez (2009);
Jayagopi, Raducanu, and Gatica-Perez (2009); Kim and Pentland (2009);
Dong and Pentland (2010); Hung and Gatica-Perez (2010); Jayagopi and
Gatica-Perez (2010); Subramanian et al. (2010); Woolley et al. (2010);
Bonin et al. (2012); Dong, Lepri, and Pentland (2012); Dong, Lepri, Kim
et al. (2012); Jayagopi et al. (2012); La Fond et al. (2012)

This effect might be partly biased by the active
participation of the authors’ institution in these specific communities, but in general it
should be seen as a community effect. Roughly tied in second place are ASL and
BAS. It is interesting that, while ASL is a classic domain, BAS corresponds to publi-
cation venues that did not exist before 2009. In comparison to the research disciplines
covered for older work (e.g., reviewed in Gatica-Perez, 2006, 2009), we see that more
papers are published in multimodal/multimedia venues, and that new venues emerge in parallel to the growing interest in the analysis of social behavior in general and of small groups in particular.
The collected papers investigate small-group interaction based on audio and/or visual recordings, through both verbal and nonverbal cues, with a majority of them focusing on nonverbal cues only. As the review did not include venues on natural language processing (NLP), the analysis of text-based interaction in small groups may be underrepresented.
We discuss the analysis of small groups in four categories of social constructs, i.e., conversational dynamics, personality, roles-dominance-leadership, and group-level analysis. The first three categories look at the social constructs of individuals in a small-group setting. In the fourth category, we review papers that focus on the group as a whole rather than on the individuals in the group. In Table 25.1 we list the technical references considered in this paper, grouped into these four categories.

Conversational Dynamics

Conversations in groups involve multiple channels of communication and complex coordination between the interacting parties. The communication process involves taking turns, addressing someone, yielding the floor, and gesturing with the head and hands to communicate or acknowledge. Over the last decade or so, several works have appeared that extract these basic conversational signals and analyze them further to study turn-taking, gazing, and gesturing behavior in small groups.
Cristani et al. (2011) present a novel way of analyzing turn-taking patterns by fitting a GMM to the durations of steady conversational periods (SCPs) and then using the discrete cluster classes as observed states of an influence model (Basu et al., 2001). These low-level features are shown to be useful for capturing conversational dynamics: using them improves the state of the art for classifying roles in group meetings. The authors also argue that SCPs are better than the prosodic or phonetic features used in state-of-the-art algorithms for speech analysis. In this paper, results on role classification on the AMI dataset (Carletta et al., 2005) improve over an existing baseline, with a final accuracy of 90%, using ninety-eight meetings to train, twenty to validate, and twenty to test. The generative approach proposed in the paper has applications in turn-taking decisions for multiparty embodied conversational agents.
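As a rough illustration of the first step of such a pipeline, the sketch below quantizes SCP durations with a GMM and returns the discrete cluster labels that would serve as observations for a downstream influence model (not shown); the number of components and the example durations are arbitrary assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Durations (seconds) of steady conversational periods (SCPs), i.e.,
# intervals during which the turn-taking configuration does not change.
scp_durations = np.array([0.4, 1.2, 0.3, 5.0, 2.1, 0.8, 7.5, 0.2,
                          3.3, 0.9, 4.1, 0.5]).reshape(-1, 1)

# Quantize durations into a small number of classes; 3 components is
# an arbitrary choice made for illustration.
gmm = GaussianMixture(n_components=3, random_state=0).fit(scp_durations)
states = gmm.predict(scp_durations)

# `states` is the discrete observation sequence that, in the original
# work, feeds an influence model over the participants; that model is
# not reproduced here.
print(states)
```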
Angus, Smith, and Wiles (2012), on the other hand, approach the problem of modeling the coupling in human–human communication by quantifying multiparticipant recurrence. The words spoken in each utterance are used to estimate the coupling between utterances, both from the same participant and from other participants. The work proposes a set of multiparticipant recurrence metrics to quantify topic usage patterns in human communication data. The technique can be used to monitor the level of topic consistency between participants, the timing of state changes for the participants as a result of changes in topic focus, and patterns of topic proposal, reflection, and repetition. Finally, as an interesting test case, the work analyzes a dataset consisting of a conversation in an aircraft involved in an emergency situation. Some of the studied metrics include short-term and long-term topic introduction, repetition, and consistency. The participants in this dataset included the captain, first officer, jumpseat captain, ground staff, and others.
Baldwin, Chai, and Kirchhoff (2009) study communicative hand gestures for coref-
erence identification, for example, when someone says “you want this” and gestures at
a certain speaker, to automatically infer the intention of the speaker and to understand
whom ‘you’ refers to in this multiparty context. They approach this problem by first
formulating a binary classification task to determine if a gesture is communicative or
not. Then, every communicative gesture is used to identify whether two different linguistic
referring expressions actually refer to the same person or object. A diverse set of
features that included text, dialogue, and gesture information was used for this task. For
this study, a total of six meetings from the Augmented Multi-party Interaction (AMI)
data were used with 242 annotated gestures and 1,790 referring expressions. The results
show that the best accuracy for classifying whether a gesture is communicative is close
to 87%, and that features such as the duration of the gesture are useful. Also, gestures are
shown to improve the performance of coreference identification.
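As a schematic example of the first step only, with invented feature values and labels rather than the AMI annotations, the communicative-versus-noncommunicative decision could be sketched as follows:

    # Minimal sketch: binary classification of gestures as communicative or
    # not, from simple features such as gesture duration (all values invented).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # columns: duration (s), hand speed (arbitrary units), speaker-is-talking flag
    X = np.array([[1.2, 0.8, 1], [0.3, 0.2, 0], [2.0, 1.1, 1],
                  [0.4, 0.3, 1], [1.6, 0.9, 0], [0.2, 0.1, 0]])
    y = np.array([1, 0, 1, 0, 1, 0])   # 1 = communicative gesture

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[1.0, 0.7, 1]]))
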
Taken together, the recent literature on conversational analysis shows that this area is
active, and many open issues remain toward a holistic understanding of the conversa-
tional processes. Extracting and analyzing nonverbal and verbal behavior is the basis for
all subsequent inferences about individuals and groups.

Verticality

A second trend in the literature relates to aspects of structure in groups, whether
vertical – which in the social psychology literature (Hall, Coats, & Smith, 2005)
includes aspects, such as dominance, status, and leadership – or not (e.g. structure
defined by specific roles played by the group members). In this section, we discuss a
few representative works focused on the vertical dimension of interaction, more specifi-
cally dominance and leadership (for space reasons, we omit discussions on other aspects
of structure like roles mentioned in Table 25.1). Dominance can be seen as a mani-
fest act to control others, as a personality trait that elicits such behavior, or as a
behavioral outcome of control (Dunbar & Burgoon, 2005). Leadership, on the other hand,
includes both emerging phenomena and styles related to managing and directing a team
(Stein, 1975).
Dominance in small groups was originally studied in computing in the mid-2000s in
works, such as Rienks and Heylen (2005) and Hung et al. (2007). In the last five years,
this line of research has been expanded, among others, by Charfuelan, Schroder, and
Steiner (2010). This particular work used the popular Augmented Multi-party Inter-
action (AMI) scenario meeting data. The AMI data corresponds to five-minute slices
of four-person meetings involving people playing a role-based design scenario. A sub-
corpus was originally annotated for perceived dominance rankings (from most to least
dominant) in Jayagopi, Hung et al. (2009). The goal in Charfuelan et al. (2010) was
to investigate whether certain prosody and voice quality signals would be characteris-
tic of most and least dominant individuals. Using a variety of acoustic cues extracted
from close-talk microphones and using principal component analysis, the study found
that most dominant people tend to speak with “louder-than-average voice quality” and,
conversely, least dominant people speak with “softer-than-average voice quality”. It is
important to notice that rather than trying to automatically classify most and least dom-
inant people, this study was interested in identifying acoustic cues useful to synthesize
expressive speech corresponding to such social situations. In a subsequent work, Char-
fuelan and Schroder (2011) applied a similar methodology to two constructs other than
dominance, namely speaker roles and dialogue acts.
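Stripped to its bare bones and run on synthetic data rather than the AMI recordings, this style of analysis might look like the following sketch, in which the PCA loadings hint at which acoustic cues co-vary across speakers:

    # Minimal sketch (synthetic data): PCA over per-speaker acoustic cues.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # rows: speakers; columns: hypothetical cues (energy, F0 mean, spectral tilt)
    acoustic_cues = rng.normal(size=(20, 3))

    Z = StandardScaler().fit_transform(acoustic_cues)
    pca = PCA(n_components=2).fit(Z)
    print(pca.explained_variance_ratio_)
    print(pca.components_)   # loadings indicate which cues vary together
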
A second construct of interest is leadership. We discuss two variations of this theme
found in the recent literature. The first one is emergent leadership, a phenomenon occurring
among people who are not acquainted prior to an interaction, in which one of
the interactors emerges as a leader from among the others through the interaction itself. One of
the first published works is Sanchez-Cortes, Aran, Schmid Mast et al. (2012), who pro-
posed to identify the emergent leader in a three- to four-person group using a variety
of nonverbal cues, including prosody, speaking activity (extracted from a commercial
microphone array), and visual activity (estimated from webcam video). The setting is
the Winter Survival task – a well-known design in psychology to study small group
phenomena. Furthermore, group members were asked to rate how they perceived them-
selves and the others with respect to leadership and dominance. Using standard machine
learning techniques, this work reported between 72% and 85% correct identification
of the emergent leader on a corpus of 40 group meetings (148 subjects) for various
modalities and classification techniques. Through the analysis of the questionnaires,
this work also found a correlation between the perception of emergent leadership and
dominance.
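As a simplified stand-in for the trained classifiers used in that work, with invented cue values and a plain ranking rule, the following sketch picks as emergent leader the group member with the largest combined nonverbal activity score:

    # Minimal sketch: rank group members by normalized nonverbal activity.
    import numpy as np

    # rows: members; columns: speaking time (s), speaking turns, visual activity
    cues = np.array([[310.0, 42, 0.61],
                     [120.0, 18, 0.35],
                     [205.0, 30, 0.48],
                     [ 90.0, 11, 0.22]])

    z = (cues - cues.mean(axis=0)) / cues.std(axis=0)   # normalize each cue
    scores = z.sum(axis=1)
    print("predicted emergent leader: member", int(scores.argmax()))
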
The other variation is that of leadership styles, which was studied in Feese et al.
(2012). Specifically, two contrasting styles in terms of how the leader interacts with the
team (authoritarian or considerate) were elicited in a simulated staff selection scenario
involving three-person groups, with one of them being the designated leader. A corpus
of forty-four group discussions was recorded with sensor-shirts, i.e., shirts equipped
with inertial measurement units (IMUs) containing accelerometer, gyroscope, and mag-
netic field sensors. A number of nonverbal body cues were manually annotated and
extracted from the IMU sensors, including some measures of behavioral mimicry. This
work did not attempt to classify leadership styles, but rather to identify nonverbal fea-
tures that were significantly different between the two types of leaders. As main results
it was found that authoritarian leaders tend to move their arms more often than consid-
erate ones, and that considerate leaders imitate posture changes and head nods of team
members more often than authoritarian ones.
The three examples discussed above show the active interest in understanding and
discriminating social constructs related to verticality. They also show that, while current
results are promising, additional work is needed to replicate and validate these findings
in other settings. A variable closely related to social verticality is the personality of team
members, which is discussed in the next section.

Personality

The automatic analysis of personality has been addressed in a number of works in
the social computing literature in the last decade. While most works have looked at self-presentations,
where the individual is the only interacting person, a few works have also looked
at predicting the personality of individuals when they interact with others in small groups.
The Big-Five model has been the most commonly used model, which factors person-
ality into five different traits (extraversion, agreeableness, conscientiousness, emotional
stability, and openness to experience). Among these traits, extraversion has been the
one relatively easier to predict, especially in conversational settings. Several audiovi-
sual nonverbal cues have been used and shown to be relatively effective in inferring
extraversion. The inference problem can be either formulated as a regression task based
on the personality trait scores or as a classification task by quantizing the scores into
two or more classes. For the ground truth annotation of personality, current works
either use self-reported personality (i.e., the personality of an individual as seen by
the self) or externally observed personality (i.e., how the individual is seen by others, also known as
impressions).
Lepri et al. (2012) investigated the automatic classification of the extraversion trait
based on meeting behavior, such as speaking time and social attention. They used self-
reported personality annotations and investigated the effect of speaking time and social
attention as indicators of extraversion based on a thin-slice analysis. Their approach
achieved a maximum accuracy of 69% using manually extracted features and 61% using
automatically extracted features with a support vector machine classifier. Their results
show that for predicting extraversion, in addition to the target’s behavior, the behavior
of others in the group should be taken into account. The speaking time or the attention
of the target alone did not yield significant accuracies. Besides studying social context
in the form of others’ behavior, the authors also investigated whether the group compo-
sition had any effect on the classification accuracy. They found no significant
between-group variance and thus concluded that accuracy variability is entirely due to
differences among subjects.
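A toy version of such a classification setup, using synthetic thin-slice features rather than the corpus used by Lepri et al. (2012), could look like this:

    # Minimal sketch (synthetic data): binary extraversion classification from
    # speaking time and attention received from others, with an SVM.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n = 60
    speaking_time = rng.uniform(0, 1, n)      # target's share of speaking time
    others_attention = rng.uniform(0, 1, n)   # attention received from others
    X = np.column_stack([speaking_time, others_attention])
    y = (0.6 * speaking_time + 0.4 * others_attention
         + rng.normal(0, 0.1, n) > 0.5).astype(int)   # 1 = high extraversion

    print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())
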
Recently, Aran and Gatica-Perez (2013a) studied the inference of personality in cross-
domain settings. While collecting data that contains natural face-to-face interaction in
everyday life is challenging, social media sites provide a vast amount of human behav-
ioral data. In this study, the authors investigate a cross-domain setting where they used
conversational vlogs downloaded from YouTube as the source domain and video record-
ings of individuals taken from a small group meeting as the target domain, with person-
ality annotations obtained from external observers. Their results show that, for predict-
ing the extraversion trait, a model of body activity cues on conversational vlog data can
be useful in a transfer learning setting with face-to-face interaction in small groups as
the target domain. The method achieves up to 70% of accuracy in a binary extraversion
classification task, by using the source domain data and as few as ten examples from the
target data with visual-only cues.
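A bare-bones sketch of the transfer setting, on synthetic data and using the simplest possible strategy of pooling the labeled source domain with ten labeled target examples (only one of many possible transfer-learning schemes), is shown below:

    # Minimal sketch: train on source-domain data plus ten target examples,
    # then evaluate on the remaining target examples.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    Xs = rng.normal(0.0, 1.0, (400, 4)); ys = (Xs[:, 0] > 0).astype(int)    # source
    Xt = rng.normal(0.3, 1.0, (50, 4));  yt = (Xt[:, 0] > 0.3).astype(int)  # target

    X_train = np.vstack([Xs, Xt[:10]])          # source + 10 target examples
    y_train = np.concatenate([ys, yt[:10]])
    clf = LogisticRegression().fit(X_train, y_train)
    print("target accuracy:", clf.score(Xt[10:], yt[10:]))
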
While personality is considered to be a stable characteristic of a person, the behav-
ior of people is variable. Although one approach is to consider this variability as noise,
another approach would be to use this information to better understand the relationship
between personality and behavior. Pianesi (2013) discusses this fact and suggests the
characterization of behavior of people in the form of personality states, representing
each personality dimension as a distribution over these states. In a similar vein,
Aran and Gatica-Perez (2013b) recently investigated whether thin-slice personality impres-
sions of external observers generalize to the full-meeting behavior of individuals, using
a computational setting to predict trait impressions.
In summary, many recent works on the automatic analysis of personality in small
groups have focused on the inference of personality of an individual interacting in a
group of people, and investigated links between the personality of an individual and the
behavior of the other group members. Another research problem is how the character-
istics of individuals can affect group formation and interaction. In the next section, we
review works that conceptualize groups as units and characterize a group based on the
collective behavior of its members.

Group Characterization

The last thread of work discussed in the chapter is the modeling of collective aspects
of groups. A seminal work in this direction is the work on collective intelligence by
Woolley et al. (2010), which showed that group intelligence, as an emergent property,
is quite different from the intelligence of the individual group members. Collective intelligence is
a factor that explains why some groups that do well on a certain task are good at many
other tasks (similar to the general intelligence factor of individuals). The authors show
that the collective intelligence of a group is uncorrelated with the average or maximum
intelligence of the group members. Instead, it is shown to be correlated with
the communication patterns of the group (particularly egalitarian turn-taking) and the
composition of the group (specifically, groups with more socially sensitive individuals and more
females). This study was conducted with 107 groups, involving 699 people. Wear-
able badges were used for sensing on a subset of the full dataset (46 groups), partic-
ularly to compute the speaking turn distribution. The group tasks were selected from
the McGrath task circumplex, which included brainstorming, planning, and reasoning
tasks. This work establishes the role of group communication in group performance.
With a different goal, Jayagopi et al. (2012) explored the relationship between sev-
eral automatically extracted group communication patterns and group constructs such as
group composition, group interpersonal perception, and group performance. The work
proposed a way of characterizing groups by clustering the extracted looking and turn-
taking patterns of a group as a whole. The work defined a bag of nonverbal patterns
(bag-of-NVPs) to discretize the group looking and speaking cues. The clusters learned
using the Latent Dirichlet Allocation (LDA) topic model (Blei, Ng, & Jordan, 2003)
were then interpreted by studying the correlations with the group constructs. Data from
eighteen four-person groups were used in this study (a subset of the Emergent Leadership
[ELEA] corpus; Sanchez-Cortes et al., 2010). The groups were unacquainted
and performed the Winter Survival task. Big-Five personality traits were used to char-
acterize group composition. Group interpersonal perception questionnaires measured
dominance, leadership, liking, and competence. The survival task also generates a mea-
sure of performance for each group. Several insights about groups were found in this
study. The work showed that groups with a top-two-person hierarchy participated less, while
groups without this hierarchy participated more. Introverted groups looked at the meet-
ing table more often. Finally, groups which were known to perform better on the task
had a competent person as part of their team, and also had more converging gaze on this
person during their interaction.
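Reduced to a toy example with synthetic counts rather than the ELEA features, the bag-of-NVPs plus topic-model step could be sketched as follows:

    # Minimal sketch: LDA over bag-of-nonverbal-pattern counts, one row per group.
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(3)
    # rows: groups; columns: counts of discretized looking/speaking patterns
    bag_of_nvps = rng.integers(0, 20, size=(18, 12))

    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    group_topics = lda.fit_transform(bag_of_nvps)
    print(group_topics.round(2))   # per-group mixture over interaction "topics"
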
La Fond et al. (2012) approached the problem of group characterization by analyzing
who-replies-to-whom patterns, which were manually transcribed. Groups were classi-
fied as hub, outlier, and equal types. Similarly, individuals were assigned hub, spoke,
outlier, and equal roles. Interestingly, those individuals identified as hubs were more
assertive, while outliers were not. The groups consisted of three to four individuals solv-
ing logic problems. They participated in two phases. The first phase was a distributed
session (an online chat session) and the second phase was a face-to-face interaction.
In the distributed phase, there were seventy-nine groups of size three and forty-eight
groups of size four, while the face-to-face phase had twenty-seven groups of size three
and thirty-five groups of size four. After the session, the participants evaluated the traits
and performance of each member (including themselves), as well as the performance of
the group as a whole. The group evaluation included ratings on group cohesion,
effectiveness, productivity, trustworthiness, and satisfaction.
Models to predict these group evaluation measures using linear regression and
decision trees were learned and tested. The results showed that group effectiveness and
trust could be predicted with above 80% accuracy using a decision tree classifier.
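As an illustrative sketch only, with synthetic communication-structure features instead of the actual who-replies-to-whom data, such a prediction setup might look like this:

    # Minimal sketch: predict a binary group-effectiveness label with a decision tree.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(4)
    n_groups = 80
    hub_centrality = rng.uniform(0, 1, n_groups)   # reply-structure imbalance
    reply_rate = rng.uniform(0, 1, n_groups)
    X = np.column_stack([hub_centrality, reply_rate])
    y = (reply_rate - 0.3 * hub_centrality
         + rng.normal(0, 0.1, n_groups) > 0.4).astype(int)

    clf = DecisionTreeClassifier(max_depth=3)
    print(cross_val_score(clf, X, y, cv=5).mean())
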
As a final example, Hung and Gatica-Perez (2010) focused on estimating group cohe-
sion using turn-taking and motion analysis. The work defined several turn-taking fea-
tures. Compressed-domain motion activity features, which are computationally lighter
than pixel-domain features, were used to define analogous “motion turn-
taking” features. Group cohesion, unlike the work by La Fond et al. (2012), was defined
through external perception or impressions. For the study, 120 two-minute slices of
four-person group interaction from the AMI corpus were used. Three annotators answered
twenty-seven questions about social and task cohesion. After an analysis of inter-
annotator agreement, sixty-one two-minute slices with sufficient agreement (fifty with a
high cohesion score and eleven with a low cohesion score) were used for classification exper-
iments. Accuracies of the order of 90% were achieved on this cohesion classification
task.
Overall, the automatic characterization of groups as units is an area for which we
anticipate more work in the future, as many open issues need to be addressed. One of
them is the need to significantly increase the size of the data sets used for analysis in
order to reach more significant conclusions. A second issue is the need to generalize the
initial results discussed here across conversational contexts or even cultures. Another
direction is in studying nonverbal and verbal behavioral differences between collocated
and distributed groups (as in La Fond et al., 2012), as remote group interactions have
become commonplace. This direction would obviously have links to the literature on
computer-supported collaborative work (CSCW).

Conclusions and Outlook

In this chapter, we presented a succinct review of the literature on face-to-face small
group interaction analysis. From an initial pool of a hundred papers published in the
2009–2013 period, we selected a number of works that illustrate four of the main
research trends (conversational dynamics, verticality, personality, and group character-
ization). We then briefly discussed the kind of research tasks and approaches that have
been proposed using a few illustrative examples for each trend. The review shows that
the body of research has grown in numbers in comparison to the previous decade, that
it has diversified in terms of goals, and that approaches have gained sophistication in
terms of methods to extract behavioral features. In contrast, recent research has made
relatively less progress with respect to new computational modeling tools for recogni-
tion and discovery tasks: most of the existing work still uses relatively standard machine
learning methodologies for automatic inference.
We have argued elsewhere (see Gatica-Perez, Op den Akken, & Heylen, 2012) that
the future of this area will be shaped by progress along two axes: sensing and model-
ing. Sensing, literally and metaphorically speaking, is in front of our eyes: smartphones,
wearable devices, such as Google Glass, Android Wear, and Samsung Galaxy Gear,
and gaming platforms like Microsoft’s Xbox One all give the possibility of sensing
interaction quasi-continuously and with a higher degree of accuracy than currently possi-
ble. While the sensing functionalities will continue to advance, a fundamental point for
practical applications is acceptability, both individual and social. There are (and there
should be) ethical and legal bounds to recording interaction data. These limits, however,
are often not consistent across countries or often not respected; the many stories in the
media about privacy intrusion certainly point in the wrong direction. We anticipate pri-
vacy to become a much larger research issue in group interaction analysis in the near
future.
The second axis is modeling. The possibility of recording interaction in real situa-
tions, as enabled by new sensing platforms, will call for methods that integrate both
the temporal dimension and the new data scales that will be generated. Regarding time,
essentially all of the work discussed in this chapter has examined short-lived interac-
tions, although we know that teams in the real world do not work that way. Methods
that are capable of discovering how teams in organizations perform and evolve over
weeks, months, or years are needed and likely to appear in the future (existing exam-
ples include Olguin Olguin et al., 2009; Do & Gatica-Perez, 2011). As a second issue,
data scale should also boost new ways of thinking about small-group research, mov-
ing beyond the current small-data-for-small-groups research trend. It is not hard to
anticipate that a big data version of small-group research will emerge given the combi-
nation of new sensing and modeling methodologies.

Acknowledgments

We gratefully acknowledge the support of the Swiss National Science Foundation (SNSF) through the
NCCR IM2, the Sinergia SONVB project, and the Ambizione SOBE (PZ00P2-136811)
project.

References

Angus, D., Smith, A. E., & Wiles, J. (2012). Human communication as coupled time series:
Quantifying multi-participant recurrence. IEEE Transactions on Audio, Speech, and Language
Processing, 20(6), 1795–1807.
Aran, O. & Gatica-Perez, D. (2010). Fusing audio-visual nonverbal cues to detect dominant people
in small group conversations. In Proceedings of 20th International Conference on Pattern
Recognition (pp. 3687–3690).
Aran, O. & Gatica-Perez, D. (2013a). Cross-domain personality prediction: From video blogs to
small group meetings. In Proceedings of the 15th ACM International Conference on Multi-
modal Interaction (pp. 127–130).
Aran, O. & Gatica-Perez, D. (2013b). One of a kind: Inferring personality impressions in meet-
ings. In Proceedings of the 15th ACM International Conference on Multimodal Interaction
(pp. 11–18).
Aran, O., Hung, H., & Gatica-Perez, D. (2010). A multimodal corpus for studying dominance in
small group conversations. In Proceedings of LREC workshop on Multimodal Corpora Malta.
Ba, S. O. & Odobez, J. M. (2009). Recognizing visual focus of attention from head pose in natural
meetings. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1),
16–33.
Ba, S. O. & Odobez, J.-M. (2011a). Multiperson visual focus of attention from head pose and
meeting contextual cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33,
101–116.
Ba, S. O. & Odobez, J. M. (2011b). Multi-person visual focus of attention from head pose and
meeting contextual cues. IEEE Transactions on Pattern Analysis and Machine Intelligence,
33(1), 101–116.
Bachour, K., Kaplan, F., & Dillenbourg, P. (2010). An interactive table for supporting participation
balance in face-to-face collaborative learning. IEEE Transactions on Learning Technologies,
3(3), 203–213.
Baldwin, T., Chai, J. Y., & Kirchhoff, K. (2009). Communicative gestures in coreference identifi-
cation in multiparty meetings. In Proceedings of the 2009 International Conference on Multi-
modal Interfaces (pp. 211–218).
Basu, S., Choudhury, T., Clarkson, B., & Pentland, A. (2001). Learning human interactions with
the influence model. MIT Media Lab Vision and Modeling, Technical Report 539, June.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine
Learning Research, 3, 993–1022.
Bohus, D. & Horvitz, E. (2009). Dialog in the open world: Platform and applications. In Proceed-
ings of the 2009 International Conference on Multimodal Interfaces (pp. 31–38).
Bohus, D. & Horvitz, E. (2011). Decisions about turns in multiparty conversation: From percep-
tion to action. In Proceedings of the 13th International Conference on Multimodal Interfaces
(pp. 153–160).
Bonin, F., Bock, R., & Campbell, N. (2012). How do we react to context? Annotation of individual
and group engagement in a video corpus. In Proceedings of Privacy, Security, Risk and Trust
(PASSAT) and International Conference on Social Computing (pp. 899–903).
Bousmalis, K., Mehu, M., & Pantic, M. (2009). Spotting agreement and disagreement: A survey
of nonverbal audiovisual cues and tools. In Proceedings of 3rd International Conference on
Affective Computing and Intelligent Interaction and Workshops (pp. 1–9).
Bousmalis, K., Mehu, M., & Pantic, M. (2013). Towards the automatic detection of sponta-
neous agreement and disagreement based on nonverbal behaviour: A survey of related cues,
databases, and tools. Image and Vision Computing, 31(2), 203–221.
Bousmalis, K., Morency, L., & Pantic, M. (2011). Modeling hidden dynamics of multimodal cues
for spontaneous agreement and disagreement recognition. In Proceedings of IEEE Interna-
tional Conference on Automatic Face Gesture Recognition and Workshops (pp. 746–752).
Bousmalis, K., Zafeiriou, S., Morency, L.-P., & Pantic, M. (2013). Infinite hidden conditional
random fields for human behavior analysis. IEEE Transactions on Neural Networks Learning
Systems, 24(1), 170–177.
Bruning, B., Schnier, C., Pitsch, K., & Wachsmuth, S. (2012). Integrating PAMOCAT in the
research cycle: Linking motion capturing and conversation analysis. In Proceedings of the 14th
ACM International Conference on Multimodal Interaction (pp. 201–208).
Campbell, N., Kane, J., & Moniz, H. (2011). Processing YUP! and other short utterances in
interactive speech. In Proceedings of IEEE International Conference on Acoustics, Speech and
Signal Processing (pp. 5832–5835).
Camurri, A., Varni, G., & Volpe, G. (2009). Measuring entrainment in small groups of musi-
cians. In Proceedings of 3rd International Conference on Affective Computing and Intelligent
Interaction and Workshops (pp. 1–4).
Carletta, J., Ashby, S., Bourban, S., et al. (2005). The AMI meeting corpus: A pre-announcement.
In Proceedings of the Second International Conference on Machine Learning for Multimodal
Interaction (pp. 28–39).
Charfuelan, M. & Schroder, M. (2011). Investigating the prosody and voice quality of social
signals in scenario meetings. In S. Da Mello, A. Graesser, B. Schuller, & J.-C. Martin (Eds),
Affective Computing and Intelligent Interaction (vol. 6974, pp. 46–56). Berlin: Springer.
Charfuelan, M., Schroder, M., & Steiner, I. (2010). Prosody and voice quality of vocal social
signals: The case of dominance in scenario meetings. In Proceedings of Interspeech 2010,
September, Makuhari, Japan.
Chen, L. & Harper, M. P. (2009). Multimodal floor control shift detection. In Proceedings of the
2009 International Conference on Multimodal Interfaces (pp. 15–22).
Cristani, M., Pesarin, A., Drioli, C., et al. (2011). Generative modeling and classification of
dialogs by a low-level turn-taking feature. Pattern Recognition, 44(8), 1785–1800.
Dai, P., Di, H., Dong, L., Tao, L., & Xu, G. (2009). Group interaction analysis in dynamic context.
IEEE Transactions on Systems, Man, and Cybernetics Part B: Cybernetics, 39(1), 34–42.
Debras, C. & Cienki, A. (2012). Some uses of head tilts and shoulder shrugs during human inter-
action, and their relation to stancetaking. In Proceedings of Privacy, Security, Risk and Trust
(PASSAT), International Conference on Social Computing (pp. 932–937).
De Kok, I. & Heylen, D. (2009). Multimodal end-of-turn prediction in multi-party meetings. In
Proceedings of the 2009 International Conference on Multimodal Interfaces (pp. 91–98).
Do, T. & Gatica-Perez, D. (2011). GroupUs: Smartphone proximity data and human interac-
tion type mining. In Proceedings of IEEE International Symposium on Wearable Computers
(pp. 21–28).
Dong, W., Lepri, B., Kim, T., Pianesi, F., & Pentland, A. S. (2012). Modeling conversational
dynamics and performance in a social dilemma task. In Proceedings of the 5th International
Symposium on Communications Control and Signal Processing (pp. 1–4).
Dong, W., Lepri, B., & Pentland, A. (2012). Automatic prediction of small group performance
in information sharing tasks. In Proceedings of Collective Intelligence Conference (CoRR
abs/1204.3698).
Dong, W., Lepri, B., Pianesi, F., & Pentland, A. (2013). Modeling functional roles dynamics in
small group interactions. IEEE Transactions on Multimedia, 15(1), 83–95.
Dong, W. & Pentland, A. (2010). Quantifying group problem solving with stochastic analysis.
In Proceedings of International Conference on Multimodal Interfaces and the Workshop on
Machine Learning for Multimodal Interaction (pp. 40:1–40:4).
Dunbar, N. E. & Burgoon, J. K. (2005). Perceptions of power and interactional dominance in
interpersonal relationships. Journal of Social and Personal Relationships, 22(2), 207–233.
Escalera, S., Pujol, O., Radeva, P., Vitrià, J., & Anguera, M. T. (2010). Automatic detection of
dominance and expected interest. EURASIP Journal on Advances in Signal Processing, 1.
Favre, S., Dielmann, A., & Vinciarelli, A. (2009). Automatic role recognition in multiparty
recordings using social networks and probabilistic sequential models. In Proceedings of the
17th ACM International Conference on Multimedia (pp. 585–588).
Feese, S., Arnrich, B., Troster, G., Meyer, B., & Jonas, K. (2012). Quantifying behavioral
mimicry by automatic detection of nonverbal cues from body motion. In Proceedings of
Privacy, Security, Risk and Trust (PASSAT), International Conference on Social Computing
(pp. 520–525).
Gatica-Perez, D. (2006). Analyzing group interactions in conversations: A review. In Proceed-
ings of IEEE International Conference on Multisensor Fusion and Integration for Intelligent
Systems (pp. 41–46).
Gatica-Perez, D. (2009). Automatic nonverbal analysis of social interaction in small groups: A
review. Image and Vision Computing (special issue on Human Behavior), 27(12), 1775–1787.
Gatica-Perez, D., Op den Akken, R., & Heylen, D. (2012). Multimodal analysis of small-group
conversational dynamics. In S. Renals, H. Bourlard, J. Carletta, & A. Popescu-Belis (Eds), Mul-
timodal Signal Processing: Human Interactions in Meetings. New York: Cambridge University
Press.
Germesin, S. & Wilson, T. (2009). Agreement detection in multiparty conversation. In Proceed-
ings of the 2009 International Conference on Multimodal Interfaces (pp. 7–14).
Glowinski, D., Coletta, P., Volpe, G., et al. (2010). Multi-scale entropy analysis of dominance
in social creative activities. In Proceedings of the International Conference on Multimedia
(pp. 1035–1038).
Gorga, S. & Otsuka, K. (2010). Conversation scene analysis based on dynamic Bayesian network
and image-based gaze detection. In Proceedings of International Conference on Multimodal
Interfaces and the Workshop on Machine Learning for Multimodal Interaction (art. 54).
Hadsell, R., Kira, Z., Wang, W., & Precoda, K. (2012). Unsupervised topic modeling for leader
detection in spoken discourse. In IEEE International Conference on Acoustics, Speech and
Signal Processing (pp. 5113–5116).
Hall, J. A., Coats, E. J., & Smith, L. (2005). Nonverbal behavior and the vertical dimension of
social relations: A meta-analysis. Psychological Bulletin, 131(6), 898–924.
Hung, H. & Chittaranjan, G. (2010). The IDIAP wolf corpus: Exploring group behaviour in a
competitive role-playing game. In Proceedings of the International Conference on Multimedia
(pp. 879–882).
Hung, H. & Gatica-Perez, D. (2010). Estimating cohesion in small groups using audio-visual
nonverbal behavior. IEEE Transactions on Multimedia, 12(6), 563–575.
Hung, H., Huang, Y., Friedland, G., & Gatica-Perez, D. (2011). Estimating dominance in multi-
party meetings using speaker diarization. IEEE Transactions on Audio, Speech & Language
Processing, 19(4), 847–860.
Hung, H., Jayagopi, D., Yeo, C., et al. (2007). Using audio and video features to classify the most
dominant person in a group meeting. In Proceedings of the 15th ACM International Conference
on Multimedia (pp. 835–838).
Ishizuka, K., Araki, S., Otsuka, K., Nakatani, T., & Fujimoto, M. (2009). A speaker diarization
method based on the probabilistic fusion of audio-visual location information. In Proceedings
of the 2009 International Conference on Multimodal Interfaces (pp. 55–62).
Jayagopi, D. B. & Gatica-Perez, D. (2009). Discovering group nonverbal conversational patterns
with topics. In Proceedings of the International Conference on Multimodal Interfaces (pp. 3–
6).
Jayagopi, D. B. & Gatica-Perez, D. (2010). Mining group nonverbal conversational patterns using
probabilistic topic models. IEEE Transactions on Multimedia, 12(8), 790–802.
Jayagopi, D. B., Hung, H., Yeo, C., & Gatica-Perez, D. (2009). Modeling dominance in group
conversations from nonverbal activity cues. IEEE Transactions on Audio, Speech, and Lan-
guage Processing (special issue on Multimodal Processing for Speech-based Interactions),
17(3), 501–513.
Jayagopi, D., Raducanu, B., & Gatica-Perez, D. (2009). Characterizing conversational group
dynamics using nonverbal behavior. In Proceedings of the International Conference on Multi-
media (pp. 370–373).
Jayagopi, D., Sanchez-Cortes, D., Otsuka, K., Yamato, J., & Gatica-Perez, D. (2012). Linking
speaking and looking behavior patterns with group composition, perception, and performance.
In Proceedings of the 14th ACM International Conference on Multimodal Interaction (pp. 433–
440).
Kalimeri, K., Lepri, B., Aran, O., et al. (2012). Modeling dominance effects on nonverbal behav-
iors using granger causality. In Proceedings of the 14th ACM International Conference on
Multimodal Interaction (pp. 23–26).
Kalimeri, K., Lepri, B., Kim, T., Pianesi, F., & Pentland, A. (2011). Automatic modeling of dom-
inance effects using granger causality. In A. A. Salah, & B. Lepri (Eds), Human Behavior
Understanding (vol. 7065, pp. 124–133). Berlin: Springer.
Kim, S., Filippone, M., Valente, F., & Vinciarelli, A. (2012). Predicting the conflict level in tele-
vision political debates: An approach based on crowdsourcing, nonverbal communication and
Gaussian processes. In Proceedings of the 20th ACM International Conference on Multimedia
(pp. 793–796).
Kim, S., Valente, F., & Vinciarelli, A. (2012). Automatic detection of conflicts in spoken conversa-
tions: Ratings and analysis of broadcast political debates. In Proceedings of IEEE International
Conference on Acoustics, Speech and Signal Processing (pp. 5089–5092).
Kim, T. & Pentland, A. (2009). Understanding effects of feedback on group collaboration. Asso-
ciation for the Advancement of Artificial Intelligence, Spring Symposium (pp. 25–30).
Knapp, M. L. & Hall, J. A. (2009). Nonverbal Communication in Human Interaction (7th edn).
Boston: Wadsworth Publishing.
Kumano, S., Otsuka, K., Mikami, D., & Yamato, J. (2009). Recognizing communicative facial
expressions for discovering interpersonal emotions in group meetings. In Proceedings of the
2009 International Conference on Multimodal Interfaces (pp. 99–106).
Kumano, S., Otsuka, K., Mikami, D., & Yamato, J. (2011). Analysing empathetic interactions
based on the probabilistic modeling of the co-occurrence patterns of facial expressions in group
meetings. In Proceedings of IEEE International Conference on Automatic Face Gesture Recog-
nition and Workshops (pp. 43–50).
La Fond, T., Roberts, D., Neville, J., Tyler, J., & Connaughton, S. (2012). The impact of com-
munication structure and interpersonal dependencies on distributed teams. In Proceedings of
Privacy, Security, Risk and Trust (PASSAT), International Conference on Social Computing
(pp. 558–565).
Lepri, B., Kalimeri, K., & Pianesi, F. (2010). Honest signals and their contribution to the auto-
matic analysis of personality traits – a comparative study. In A. A. Salah, T. Gevers, N. Sebe,
& A. Vinciarelli (Eds), Human Behavior Understanding (vol. 6219, pp. 140–150). Berlin:
Springer.
Lepri, B., Mana, N., Cappelletti, A., & Pianesi, F. (2009). Automatic prediction of individual per-
formance from “thin slices” of social behavior. In Proceedings of the 17th ACM International
Conference on Multimedia (pp. 733–736).
Lepri, B., Mana, N., Cappelletti, A., Pianesi, F., & Zancanaro, M. (2009). Modeling the personal-
ity of participants during group interactions. In Proceedings of Adaptation and Personalization
UMAP 2009, 17th International Conference on User Modeling (pp. 114–125).
Lepri, B., Ramanathan, S., Kalimeri, K., et al. (2012). Connecting meeting behavior with extraver-
sion – a systematic study. IEEE Transactions on Affective Computing, 3(4), 443–455.
Lepri, B., Subramanian, R., Kalimeri, K., et al. (2010). Employing social gaze and speaking activ-
ity for automatic determination of the extraversion trait. In Proceedings of the International
Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal
Interaction (pp. 7:1–7:8).
Nakano, Y. & Fukuhara, Y. (2012). Estimating conversational dominance in multiparty inter-
action. In Proceedings of the 14th ACM International Conference on Multimodal Interaction
(pp. 77–84).
Noulas, A., Englebienne, G., & Krose, B. J. A. (2012). Multimodal speaker diarization. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 34(1), 79–93.
Olguin Olguin, D., Waber, B. N., Kim, T., et al. (2009). Sensible organizations: Technology and
methodology for automatically measuring organizational behavior. IEEE Transactions on Sys-
tems, Man, and Cybernetics Part B: Cybernetics, 39(1), 43–55.
Otsuka, K., Araki, S., Mikami, D., et al. (2009). Realtime meeting analysis and 3D meeting viewer
based on omnidirectional multimodal sensors. In Proceedings of the 2009 International Con-
ference on Multimodal Interfaces (pp. 219–220).
Otsuka, Y. & Inoue, T. (2012). Designing a conversation support system in dining together based
on the investigation of actual party. In Proceedings of IEEE International Conference on Sys-
tems, Man, and Cybernetics (pp. 1467–1472).
Pesarin, A., Cristani, M., Murino, V., & Vinciarelli, A. (2012). Conversation analysis at work:
Detection of conflict in competitive discussions through automatic turn-organization analysis.
Cognitive Processing, 13(2), 533–540.
Pianesi, F. (2013). Searching for personality. IEEE Signal Processing Magazine, 30(1), 146–158.
Poggi, I. & D’Errico, F. (2010). Dominance signals in debates. In A. A. Salah, T. Gevers, N. Sebe,
& A. Vinciarelli (Eds), Human Behavior Understanding (vol. 6219, pp. 163–174). Berlin:
Springer.
Prabhakar, K. & Rehg, J. M. (2012). Categorizing turn-taking interactions. In A. Fitzgibbon, S.
Lazebnik, P. Perona, Y. Sato, & C. Schmid (Eds), European Conference on Computer Vision
(vol. 7576, pp. 383–396). Berlin: Springer.
Raducanu, B. & Gatica-Perez, D. (2009). You are fired! nonverbal role analysis in competitive
meetings. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal
Processing (pp. 1949–1952).
Raducanu, B. & Gatica-Perez, D. (2012). Inferring competitive role patterns in reality TV show
through nonverbal analysis. Multimedia Tools and Applications, 56(1), 207–226.
Raiman, N., Hung, H., & Englebienne, G. (2011). Move, and I will tell you who you are: Detecting
deceptive roles in low-quality data. In Proceedings of the 13th International Conference on
Multimodal Interfaces (pp. 201–204).
Ramanathan, V., Yao, B., & Fei-Fei, L. (2013). Social role discovery in human events. In Proceed-
ings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 2475–2482).
Rehg, J. M., Fathi, A., & Hodgins, J. K. (2012). Social interactions: A first-person perspective.
In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 1226–
1233).
Rienks, R. J. & Heylen, D. (2005). Automatic dominance detection in meetings using easily
detectable features. In Proceedings of Workshop Machine Learning for Multimodal Interac-
tion, Edinburgh.
Salamin, H., Favre, S., & Vinciarelli, A. (2009). Automatic role recognition in multiparty record-
ings: Using social affiliation networks for feature extraction. IEEE Transactions on Multimedia,
11(7), 1373–1380.
Salamin, H. & Vinciarelli, A. (2012). Automatic role recognition in multi-party conversations:
An approach based on turn organization, prosody and conditional random fields. IEEE Trans-
actions on Multimedia, 13(2), 338–345.
Salamin, H., Vinciarelli, A., Truong, K., & Mohammadi, G. (2010). Automatic role recognition
based on conversational and prosodic behaviour. In Proceedings of the International Confer-
ence on Multimedia (pp. 847–850).
Sanchez-Cortes, D., Aran, O., & Gatica-Perez, D. (2011). An audio visual corpus for emergent
leader analysis. In Proceedings of Workshop on Multimodal Corpora for Machine Learning:
Taking Stock and Road Mapping the Future, November.
Sanchez-Cortes, D., Aran, O., Jayagopi, D. B., Schmid Mast, M., & Gatica-Perez, D. (2012).
Emergent leaders through looking and speaking: From audio-visual data to multimodal recog-
nition. Journal on Multimodal User Interfaces, 7(1–2), 39–53.
Sanchez-Cortes, D., Aran, O., Schmid Mast, M., & Gatica-Perez, D. (2010). Identifying emergent
leadership in small groups using nonverbal communicative cues. In Proceedings of the 12th
International Conference on Multimodal Interfaces and 7th Workshop on Machine Learning
for Multimodal Interaction (art. 39).
Sanchez-Cortes, D., Aran, O., Schmid Mast, M., & Gatica-Perez, D. (2012). A nonverbal behav-
ior approach to identify emergent leaders in small groups. IEEE Transactions on Multimedia,
14(3), 816–832.
Sapru, A. & Bourlard, H. (2013). Automatic social role recognition in professional meetings using
conditional random fields. In Proceedings of 14th Annual Conference of the International
Speech Communication Association (pp. 1530–1534).
Schoenenberg, K., Raake, A., & Skowronek, J. (2011). A conversation analytic approach to the
prediction of leadership in two- to six-party audio conferences. In Proceedings of Third Inter-
national Workshop on Quality of Multimedia Experience (pp. 119–124).
Song, Y., Morency, L.-P., & Davis, R. (2012). Multimodal human behavior analysis: Learning
correlation and interaction across modalities. In Proceedings of the 14th ACM International
Conference on Multimodal Interaction (pp. 27–30).
Staiano, J., Lepri, B., Kalimeri, K., Sebe, N., & Pianesi, F. (2011). Contextual modeling of person-
ality states’ dynamics in face-to-face interactions. In Proceedings of Privacy, Security, Risk
and Trust (PASSAT), IEEE Third International Conference on Social Computing (pp. 896–899).
Staiano, J., Lepri, B., Ramanathan, S., Sebe, N., & Pianesi, F. (2011). Automatic modeling of
personality states in small group interactions. In Proceedings of the 19th ACM International
Conference on Multimedia (pp. 989–992).
Stein, R. T. (1975). Identifying emergent leaders from verbal and nonverbal communications.
Journal of Personality and Social Psychology, 32(1), 125–135.
Subramanian, R., Staiano, J., Kalimeri, K., Sebe, N., & Pianesi, F. (2010). Putting the pieces
together: Multimodal analysis of social attention in meetings. In Proceedings of the Interna-
tional Conference on Multimedia (pp. 659–662).
Sumi, Y., Yano, M., & Nishida, T. (2010). Analysis environment of conversational struc-
ture with nonverbal multimodal data. In Proceedings of the International Conference on
Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction
(pp. 44:1–44:4).
Suzuki, N., Kamiya, T., Umata, I., et al. (2013). Detection of division of labor in multiparty
collaboration. In Proceedings of the 15th International Conference on Human Interface and the
Management of Information: Information and Interaction for Learning, Culture, Collaboration
and Business (pp. 362–371).
Valente, F. & Vinciarelli, A. (2010). Improving Speech Processing through social signals: Auto-
matic speaker segmentation of political debates using role based turn-taking patterns. In Pro-
ceedings of the International Workshop on Social Signal Processing (pp. 29–34).
Varni, G., Volpe, G., & Camurri, A. (2010). A system for real-time multi-modal analysis of non-
verbal affective social interaction in user-centric media. IEEE Transactions on Multimedia,
12(6), 576–590.
Vinciarelli, A. (2009). Capturing order in social interactions. IEEE Signal Processing Magazine,
26, 133–152.
Vinciarelli, A., Salamin, H., Mohammadi, G., & Truong, K. (2011). More than words: Inference
of socially relevant information from nonverbal vocal cues in speech. Lecture Notes in Com-
puter Science, 6456, 24–33.
Vinciarelli, A., Valente, F., Yella, S. H., & Sapru, A. (2011). Understanding social signals in
multi-party conversations: Automatic recognition of socio-emotional roles in the AMI meeting
corpus. In Proceedings of IEEE International Conference on Systems, Man, and Cybernetics
(pp. 374–379).
Vinyals, O., Bohus, D., & Caruana, R. (2012). Learning speaker, addressee and overlap detection
models from multimodal streams. In Proceedings of the 14th ACM International Conference
on Multimodal Interaction (pp. 417–424).
Voit, M. & Stiefelhagen, R. (2010). 3D user-perspective, voxel-based estimation of visual focus of
attention in dynamic meeting scenarios. In Proceedings of International Conference on Multi-
modal Interfaces and the Workshop on Machine Learning for Multimodal Interaction (pp. 51:1–
51:8).
Wang, W., Precoda, K., Hadsell, R., et al. (2012). Detecting leadership and cohesion in spoken
interactions. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal
Processing (pp. 5105–5108).
Wang, W., Precoda, K., Richey, C., & Raymond, G. (2011). Identifying agreement/disagreement
in conversational speech: A cross-lingual study. In Proceedings of the Annual Conference of
the International Speech Communication Association (pp. 3093–3096).
Wilson, T. & Hofer, G. (2011). Using linguistic and vocal expressiveness in social role recogni-
tion. In Proceedings of the International Conference on Intelligent User Interfaces (pp. 419–
422).
Wöllmer, M., Eyben, F., Schuller, B. & Rigoll, G. (2012). Temporal and situational context mod-
eling for improved dominance recognition in meetings. In Proceedings of 13th Annual Confer-
ence of the International Speech Communication Association (pp. 350–353).
Woolley, A. W., Chabris, C. F., Pentland, A., Hashmi, N., & Malone, T. W. (2010). Evidence
for a collective intelligence factor in the performance of human groups. Science, 330(6004),
686–688.
26 Multimedia Implicit Tagging
Mohammad Soleymani and Maja Pantic

Introduction

Social and behavioral signals carry invaluable information regarding how audiences
perceive multimedia content. By assessing the responses of the audience, we can
generate tags, summaries, and other forms of metadata for multimedia representation
and indexing. Tags are a form of metadata which enables a retrieval system to find
and re-find the content of interest (Larson et al., 2011). Unlike classic tagging schemes
where users’ direct input is needed, implicit human-centered tagging (IHCT) was pro-
posed (Pantic & Vinciarelli, 2009) to generate tags without any specific input or effort
from users. Translating the behavioral responses into tags results in “implicit” tags since
there is no need for users’ direct input as reactions to multimedia are displayed sponta-
neously (Soleymani & Pantic, 2012).
User-generated explicit tags are not always assigned with the intention of describing
the content and might be given to promote the users themselves (Pantic & Vinciarelli,
2009). Implicit tags have the advantage of being detected for a certain goal relevant to a
given application. For example, an online radio interested in the mood of its songs can
assess listeners’ emotions; a marketing company is interested in assessing the success
of its video advertisements.
It is also worth mentioning that implicit tags can be a complementary source of
information in addition to the existing explicit tags. They can also be used to filter
out the tags which are not relevant to the content (Soleymani & Pantic, 2013; Soley-
mani, Kaltwang, & Pantic, 2013). A scheme of implicit tagging versus explicit tag-
ging is shown in Figure 26.1. Recently, we have been witnessing a growing inter-
est from industry in this topic (Klinghult, 2012; McDuff, El Kaliouby, & Picard,
2012; Fleureau, Guillotel, & Orlac, 2013; Silveira et al., 2013), which is a sign of its
significance.
Analyzing spontaneous reactions to multimedia content can assist multimedia index-
ing with the following scenarios: (i) direct translation to tags – users’ spontaneous reac-
tions will be translated into emotions or preference, e.g., interesting, funny, disgusting,
scary (Kierkels, Soleymani, & Pun, 2009; Soleymani, Pantic, & Pun, 2012; Petridis &
Pantic, 2009; Koelstra et al., 2010; Silveira et al., 2013; Kurdyukova, Hammer, & André,
2012); (ii) assessing the correctness of explicit tags or topic relevance, e.g., agreement
or disagreement over a displayed tag or the relevance of the retrieved result (Koelstra,
Figure 26.1 Implicit tagging vs. explicit tagging scenarios. The analysis of the bodily reactions to
multimedia content replaces the direct input of the tag by users. Thus, users do not have to put
any effort into tagging the content.

Muhl, & Patras, 2009; Soleymani, Lichtenauer et al., 2012; Arapakis, Konstas, & Jose,
2009; Jiao & Pantic, 2010; Moshfeghi & Jose, 2013); (iii) user profiling – a user’s per-
sonal preferences can be detected based on her reactions to retrieved data and be used
for re-ranking the results; (iv) content summarization – highlight detection is also pos-
sible using implicit feedback from the users (Fleureau et al., 2013; Joho et al., 2010;
Chênes et al., 2012).
Classic multimedia indexing relies on concepts that characterize its content in terms
of events, objects, locations, etc. The indexing that only relies on the concepts depicted
in the content is called cognitive indexing. An alternative, parallel approach to
indexing has emerged that takes affective aspects into account. Affect, in this context,
refers to the intensity and type of emotion that is evoked in the consumer of multimedia
content (Hanjalic & Xu, 2005; Soleymani et al., 2014). Multimedia affective content
can be represented by relevant emotional tags. Being directly related to the users’ reactions,
implicit tagging directly translates users’ emotions into an affective representation of
multimedia. Affective tags have been shown to help recommendation and retrieval systems to
improve their performance (Shan et al., 2009; Tkalčič, Burnik, & Košir, 2010; Kierkels
et al., 2009).
Other feedback from users, including clickthrough rate and dwell time, has been
used extensively for information retrieval and topic relevance applications (Shen, Tan,
& Zhai, 2005; Joachims et al., 2005). In this chapter, we only cover implicit feedback
from bodily responses that is measurable by sensors and cameras. The remain-
der of this chapter is organized as follows. The next section provides a background
on the recent developments in this field. Available public databases are introduced in
the Databases section. Current challenges and perspectives are discussed in the last
section.

Background

Implicit tagging has been applied to different problems, from emotional tagging and
preference detection to topic relevance assessment (Soleymani & Pantic, 2012). Currently,
there are three main research trends taking advantage of implicit tagging techniques.
The first deals with using emotional reactions to detect users’ emotions
and the content’s mood from the expressed emotion, e.g., laughter detection for hilarity
(Petridis & Pantic, 2009); the second is focused on detecting viewers’ interest
and video highlights; the third uses spontaneous
reactions for information retrieval or search-result re-ranking, e.g., eye gaze for relevance
feedback (Hardoon & Pasupa, 2010). In the following, we review the existing
work categorized by application.

Emotional Tagging
Emotional tags can be used for indexing content by its affect as well as for improving
content recommendation (Shan et al., 2009; Kierkels et al., 2009). Affective information
has been shown to improve image and music recommendation (Tkalčič, Burnik
et al., 2010; Shan et al., 2009). Tkalčič et al. used affect detected from facial expressions
in response to images shown by an image recommender. Their experimental results showed
that affective implicit tags could improve explicit tagging as a complementary
source of information (Tkalčič et al., 2013).
Physiological signals have also been used to detect emotions with the goal of implicit
emotional tagging. Soleymani et al. (2009) proposed an affective characterization for
movie scenes using peripheral physiological signals. Eight participants watched sixty-
four movie scenes and self-reported their emotions. A linear regression trained by rel-
evance vector machines (RVM) was utilized to estimate each clip’s affect from physi-
ological features. Kierkels et al. (2009) extended these results and analyzed the effec-
tiveness of tags detected by physiological signals for personalized affective tagging of
videos. Quantized arousal and valence levels for a clip were then mapped to emotion
labels. This mapping enabled the retrieval of video clips based on keyword queries. A
similar approach was taken using a linear ridge regression for emotional characteriza-
tion of music videos. Arousal, valence, dominance, and like/dislike ratings were detected
from the physiological signals and video content (Soleymani et al., 2011). Koelstra et al.
(2012) used electroencephalogram (EEG) and peripheral physiological signals for emo-
tional tagging of music videos. In a similar study (Soleymani, Pantic et al., 2012), a
multimodal emotional tagging was conducted using EEG signals and pupillary reflex.
Abadi, Kia et al. (2013) recorded and analyzed magnetoencephalogram (MEG) signals
as an alternative to EEG signals for monitoring brain activity.
Although they obtained results comparable to those obtained with EEG, the price
and apparatus complexity of MEG machines make them an unlikely candidate for
such applications.
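As an illustration of the regression formulation used in several of these studies, the following sketch maps synthetic physiological features (invented, not from any of the cited datasets) to an arousal score with a linear ridge regression:

    # Minimal sketch: ridge regression from physiological features to arousal.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(5)
    n = 100
    # columns: mean GSR, heart rate, skin temperature (hypothetical units)
    X = rng.normal(size=(n, 3))
    arousal = 0.7 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.2, n)

    print(cross_val_score(Ridge(alpha=1.0), X, arousal, cv=5, scoring="r2").mean())
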
In an approach taken for emotional tagging, emotional events, defined as arousing
events, were first detected in movies from peripheral physiological responses and then
their valence was detected using Gaussian process classifiers (Fleureau, Guillotel,
& Huynh-Thu, 2012). Such a strategy can also be justified based on the heart-shaped
distribution of emotions on an arousal and valence plane (Dietz & Lang, 1999) in which
emotions with higher arousal have more extreme pleasantness or unpleasantness.
Engagement of viewers with movie scenes was assessed by physiological signals and
facial expressions (Abadi, Staiano et al., 2013). By measuring engagement, a system
would be able to steer the story in a hyper-narrative movie, where different outcomes are
possible, based on the users’ reactions. Spontaneous audio responses can also be used for
implicit emotional tagging. Petridis and Pantic proposed a method for tagging videos
for the level of hilarity by analyzing users’ laughter (Petridis & Pantic, 2009). Differ-
ent types of laughter can be an indicator of the level of hilarity of multimedia con-
tent. Using audiovisual modalities, they could recognize speech, unvoiced laughter, and
voiced laughter.

Highlight and Interest Detection

Users’ interest in content can help recommender systems, content producers, and adver-
tisers to focus their efforts better toward higher user satisfaction. Kurdyukova et al.
(2012) set up a display that can detect the interest of the passers-by by detecting their
faces, facial expressions, and head pose. In addition, the social context, groups, conver-
sations, and gender were recognized, which can be used for profiling purposes for adver-
tisements. In a study on estimating movie ratings (Silveira et al., 2013), galvanic skin
response (GSR) was recorded and analyzed from a movie audience. Ratings were allo-
cated according to a five-point scale with low rating (1–3) and high rating (4–5) classes.
Their method could achieve better results when GSR responses were incorporated along
with demographic information for two out of the three studied movies.
Interest in the advertisements was shown to be detectable by analyzing the facial
expressions of viewers on the web. McDuff et al. (2012, 2013) measured the level of
smile from a video advertisement audience to assess their interest in the content. They
collected a large number of samples using crowdsourcing by recording the responses on
users’ webcams. They were able to detect fairly accurately if the viewers liked a video
and whether they had a desire to watch a video again.
Video highlight detection and summarization are important applications for index-
ing and visualization purposes. Joho et al. (2009, 2010) developed a video summariza-
tion tool using facial expressions. A probabilistic emotion recognition method based on facial
expressions was employed to detect the emotions of ten participants watching eight video
clips. The expression change rate between different emotional expressions and the
pronounce level of the expressed emotions were used as features to detect personal highlights
in the videos. The pronounce levels ranged from highly expressive emotions, such as
surprise and happiness, to no expression (neutral). Chênes et al. (2012)
used physiological linkage between different viewers to detect video highlights. Skin
temperature and GSR were found to be informative for detecting video highlights via
physiological linkage, and the proposed method achieved an accuracy of 78.2% in
detecting highlights. In a more recent study, Fleureau et al. (2013) used GSR responses
recorded simultaneously from an audience to create an emotional profile of movies.
The profiles generated from the physiological responses were shown
to match the user-reported highlights.
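For illustration, the following Python sketch approximates physiological linkage as the mean pairwise correlation of viewers' GSR signals computed over time windows; windows with unusually high linkage are flagged as candidate highlights. The window length, threshold, and synthetic signals are assumptions rather than the settings of Chênes et al. (2012) or Fleureau et al. (2013).

```python
# Highlight detection through a simple form of physiological linkage:
# mean pairwise Pearson correlation of viewers' GSR per time window.
# Window length, threshold, and signals are illustrative assumptions.
import numpy as np
from itertools import combinations

def linkage_profile(signals, fs, win_s=10.0):
    """Mean pairwise correlation across viewers, per non-overlapping window."""
    win = int(win_s * fs)
    n_win = signals.shape[1] // win
    profile = []
    for w in range(n_win):
        chunk = signals[:, w * win:(w + 1) * win]
        corrs = [np.corrcoef(chunk[i], chunk[j])[0, 1]
                 for i, j in combinations(range(len(chunk)), 2)]
        profile.append(np.mean(corrs))
    return np.array(profile)

fs = 8                                    # assumed GSR sampling rate (Hz)
rng = np.random.default_rng(1)
viewers = rng.normal(size=(6, fs * 600))  # 6 viewers, 10 minutes (synthetic)

profile = linkage_profile(viewers, fs)
threshold = profile.mean() + profile.std()
highlight_windows = np.where(profile > threshold)[0]
print(highlight_windows)                  # indices of candidate highlight windows
```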

Relevance Assessment
Users’ responses also carry pertinent information regarding the relevance of retrieved
content to a query. The relevance of content to user-generated tags, or to tags detected by
content-based indexing systems, can also be assessed from users’ responses (Soleymani
et al., 2013). Arapakis, Moshfeghi et al. (2009) introduced a method to assess the top-
ical relevance of videos with respect to a given query using facial expressions show-
ing users’ satisfaction or dissatisfaction. Based on facial expression recognition tech-
niques, basic emotions were detected and compared with the ground truth. They were
able to predict with 89% accuracy whether a video was indeed relevant to the query.
The same authors later studied the feasibility of using affective responses derived from
both facial expressions and physiological signals as implicit indicators of topical rel-
evance. Although the results were above chance level and support the feasibility of
the approach, there is still room for improvement, as the best classification accuracy
obtained for relevant versus nonrelevant content was 66% (Arapakis, Konstas
et al., 2009). Along the same lines, Arapakis, Athanasakos, and Jose (2010) compared the
performance of personal versus general affect recognition approaches for topical rel-
evance assessment and found that accounting for personal differences in their emo-
tion recognition method improved their performance. In a more recent study, Mosh-
feghi and Jose (2013) showed that physiological responses and facial expressions can
be used as complementary sources of information in addition to dwell time for rele-
vance assessment. Their approach was evaluated in an experiment on a video retrieval
platform.
In another information retrieval application, Kelly and Jones (2010) used physiologi-
cal responses to re-rank the content collected via a lifelogging application. The lifelog-
ging application collects pictures, text messages, GSR, skin temperature, and the user’s
energy expenditure measured with an accelerometer. Using skin temperature, they
improved the mean average precision (MAP) of the baseline retrieval system
by 36%.
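For illustration, the following Python sketch fuses a query-independent biometric score with baseline retrieval scores and compares mean average precision (MAP) before and after fusion, which is the general idea behind such re-ranking. The scores, relevance labels, and fusion weight are synthetic assumptions, not the data or weighting of Kelly and Jones (2010).

```python
# Query-independent biometric re-ranking with a MAP comparison.
# All scores, relevance judgements, and the fusion weight are placeholders.
import numpy as np

def average_precision(ranked_relevance):
    """AP of a ranked list of 0/1 relevance judgements."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return np.mean(precisions) if precisions else 0.0

def mean_average_precision(runs):
    return np.mean([average_precision(r) for r in runs])

rng = np.random.default_rng(2)
n_queries, n_items = 5, 50
baseline = rng.random((n_queries, n_items))           # baseline retrieval scores
biometric = rng.random(n_items)                       # query-independent scores
relevance = rng.integers(0, 2, (n_queries, n_items))  # placeholder ground truth

alpha = 0.3                                           # fusion weight (assumption)
fused = (1 - alpha) * baseline + alpha * biometric    # broadcast over queries

def rank_relevance(scores):
    """Relevance judgements reordered by descending score, per query."""
    return [relevance[q][np.argsort(-scores[q])] for q in range(n_queries)]

print("baseline MAP:", mean_average_precision(rank_relevance(baseline)))
print("fused MAP:   ", mean_average_precision(rank_relevance(fused)))
```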
Koelstra et al. (2009) investigated the use of electroencephalogram (EEG) signals for
implicit tagging of images and videos. They showed short video excerpts and images
first without tags and then with tags. They found significant differences in EEG signals
(N400 evoked potential) in responses to relevant and irrelevant tags. These differences
were nevertheless not always present, which precluded reliable classification. Facial expression
and eye gaze were used to detect users’ agreement or disagreement with the displayed
tags on twenty-eight images (Jiao & Pantic, 2010; Soleymani, Lichtenauer et al., 2012).
The results showed that not all the participants in the experiment expressed their
agreement or disagreement on their faces and that their eye gaze was more informa-
tive for agreement assessment. Soleymani and Pantic (2013) showed that EEG signals
and the N400 response, when aggregated over multiple participants, can reach a high accuracy for
detecting nonrelevant content. Soleymani et al. (2013) further studied the effective-
ness of different modalities for relevance assessment on the same dataset. They showed
that, in a user-independent approach, eye gaze performs much better than EEG signals
and facial expressions for detecting tag relevance. Eye gaze responses have also been used to
detect interest for image annotation (Haji Mirza, Proulx, & Izquierdo, 2012), relevance
judgment (Salojärvi, Puolamäki, & Kaski, 2005), interactive video search (Vrochidis,
Patras, & Kompatsiaris, 2011), and search personalization (Buscher, Van Elst, & Den-
gel, 2009).
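For illustration, the following Python sketch derives simple gaze features (dwell time and fixation count over the region where a tag is displayed) and feeds them to a classifier that predicts tag relevance. The tag region, feature set, and data are illustrative assumptions and do not correspond to the protocol of any of the studies cited above.

```python
# Toy gaze-based tag relevance assessment: per-trial fixation statistics over
# an assumed tag region, classified with logistic regression. The region,
# features, and data are placeholders, not any published protocol.
import numpy as np
from sklearn.linear_model import LogisticRegression

TAG_BOX = (0, 600, 1024, 680)   # (x1, y1, x2, y2) of the displayed tag (assumed)

def gaze_features(fixations):
    """fixations: array of rows (x, y, duration_ms)."""
    x, y, dur = fixations[:, 0], fixations[:, 1], fixations[:, 2]
    on_tag = (x >= TAG_BOX[0]) & (x <= TAG_BOX[2]) & \
             (y >= TAG_BOX[1]) & (y <= TAG_BOX[3])
    return [dur[on_tag].sum(),    # dwell time on the tag region
            on_tag.sum(),         # number of fixations on the tag region
            dur.sum()]            # total fixation time in the trial

rng = np.random.default_rng(3)
trials = [rng.random((20, 3)) * [1024, 768, 400] for _ in range(40)]  # synthetic
labels = rng.integers(0, 2, 40)   # 1 = participant judged the tag relevant

X = np.array([gaze_features(t) for t in trials])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:5]))
```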

Databases

In this section, we introduce the publicly available databases that were developed for
the sole purpose of implicit human-centered tagging studies.
The MAHNOB HCI database (Soleymani, Lichtenauer et al., 2012) was developed
for experimenting with implicit tagging approaches for two different scenarios, namely,
emotional tagging and tag relevance assessment. This database consists of two experi-
ments. The responses, including EEG, physiological signals, eye gaze, audio, and facial
expressions, of thirty people were recorded. In the first experiment, participants watched twenty
emotional video excerpts from movies and online repositories. The second experiment
was a tag agreement experiment in which images and short videos with human actions
were shown to the participants first without a tag and then with a displayed tag. The tags
were either correct or incorrect and participants’ agreement with the displayed tag was
assessed. An example of an eye gaze pattern and fixation points on an image with an
overlaid label is shown in Figure 26.2. This database is publicly available on the Internet
(http://mahnob-db.eu/hct-tagging/).
The Database for Emotion Analysis using Physiological Signals (DEAP) (Koelstra
et al., 2012) is a database developed for emotional tagging of music videos. It includes
peripheral and central nervous system physiological signals in addition to face videos
from thirty-two participants. The face videos were recorded from only twenty-two par-
ticipants. EEG signals were recorded from thirty-two active electrodes. Peripheral ner-
vous system physiological signals included EMG, electro-oculogram (EOG), blood volume
pulse (BVP) measured with a plethysmograph, skin temperature, and GSR. The spontaneous reac-
tions of participants were recorded in response to music video clips. This database is
publicly available on the Internet (www.eecs.qmul.ac.uk/mmv/datasets/deap/).
The Pinview database comprises eye gaze and interaction data collected in an image
retrieval scenario (Auer et al., 2010). The Pinview database includes explicit rele-
vance feedback interaction from the user, such as pointer clicks, and implicit relevance
feedback signals, such as eye movements and pointer traces. This database is available
online (www.pinview.eu/databases/).

Figure 26.2 An example of a displayed image with the eye gaze fixations and scan path
overlaid. The size of the circles represents the time spent staring at each fixation point.

Tkalčič et al. collected the LDOS-PerAff-1 corpus of face video clips together with
data on the participants’ personalities (Tkalčič, Tasič, & Košir, 2010). Participant personalities
were assessed by the International Personality Item Pool (IPIP) questionnaire (Goldberg
et al., 2006). Participants watched a subset of images extracted from the International
Affective Picture System (IAPS) (Lang, Bradley, & Cuthbert, 2005) and, on a five-point
Likert scale, rated their preference for choosing the picture for their desktop wallpa-
per. The LDOS-PerAff-1 database is available online (http://slavnik.fe.uni-lj.si/markot/
Main/LDOS-PerAff-1).

Challenges and Perspectives

Reading users’ minds and generating the ground truth for emotion and interest detection
is one of the main challenges of implicit tagging studies. It is often easier for the users
to compare or rank the content based on their emotion rather than assigning an exact
label or absolute ratings (Yannakakis & Hallam, 2011). However, comparing pairs or
groups of content items requires a larger number of trials and longer experiments, a cost
that should be taken into account in future studies.
The other challenge is to have nonintrusive, easy to use, and cheap sensors that can be
commercially produced. Owing to growing interest from industry, portable and
wearable sensors and cameras are becoming cheaper and more accessible, e.g., Microsoft
Kinect and Google Glass. In addition to the sensor-based methods, there is also a
trend in detecting physiological signals and facial expressions through users’ webcams
(McDuff et al., 2013). Because webcams are available on almost all devices,
there is huge potential for their use.
Emotional expressions in natural settings are mostly subtle and person dependent,
which makes them hard to detect. Therefore, large databases and specific machine learn-
ing techniques still have to be developed for bringing implicit tagging ideas into prac-
tice. So far, the emotional models are limited either to the discrete basic emotions or
the dimensional valence-arousal-dominance spaces. Developing new emotional mod-
els and dimensions specific to different applications, such as those proposed by
Eggink and Bland (2012) and Benini, Canini, and Leonardi (2011), should also be
explored.
There are also contextual factors, such as time, environment, cultural background,
mood, and personality, which are not necessarily easy to assess or consider (Soleymani
et al., 2014). The important contextual factors for each application need to be care-
fully identified and their effect has to be incorporated into the final tagging or retrieval
process.
Some people might also find such systems intrusive, and they have legitimate privacy
concerns. For example, such technologies can be used for surveillance and marketing
purposes without users’ consent. These concerns need to be addressed by researchers in
collaboration with ethics and law experts.
Implicit tagging is showing its potential by attracting interest from industrial enti-
ties. The proliferation of commercially produced sensors, such as handheld devices
equipped with RGB-D cameras, will foster the emergence of new techniques for
multimedia implicit tagging.

Acknowledgments

Mohammad Soleymani’s work is supported by the European Research Council under the
FP7 Marie Curie Intra-European Fellowship: Emotional continuous tagging using spon-
taneous behavior (EmoTag). Maja Pantic’s work is supported in part by the European
Community’s 7th Framework Programme (FP7/2007–2013) under the grant agreement
no 231287 (SSPNet) and ERC Starting Grant agreement no. ERC-2007-StG-203143
(MAHNOB).

References

Abadi, M. K., Kia, S. M., Subramanian, R., Avesani, P., & Sebe, N. (2013). User-centric affec-
tive video tagging from MEG and peripheral physiological responses. In Proceedings of 3rd
International Conference on Affective Computing and Intelligent Interaction and Workshops
(pp. 582–587).
Abadi, M. K., Staiano, J., Cappelletti, A., Zancanaro, M., & Sebe, N. (2013). Multimodal engage-
ment classification for affective cinema. In Proceedings of 3rd International Conference on
Affective Computing and Intelligent Interaction and Workshops (pp. 411–416).
Arapakis, I., Athanasakos, K., & Jose, J. M. (2010). A comparison of general vs personalised
affective models for the prediction of topical relevance. In Proceedings of the 33rd interna-
tional ACM SIGIR conference on Research and development in information retrieval (pp. 371–
378).
Arapakis, I., Konstas, I., & Jose, J. M. (2009). Using facial expressions and peripheral physiolog-
ical signals as implicit indicators of topical relevance. In Proceedings of the Seventeen ACM
International Conference on Multimedia (pp. 461–470).
Arapakis, I., Moshfeghi, Y., Joho, H., et al. (2009). Integrating facial expressions into user pro-
filing for the improvement of a multimodal recommender system. In Proceedings of IEEE
International Conference on Multimedia and Expo (pp. 1440–1443).
Auer, P., Hussain, Z., Kaski, S., et al. (2010). Pinview: Implicit feedback in content-based image
retrieval. In Proceedings of JMLR: Workshop on Applications of Pattern Analysis (pp. 51–57).
Benini, S., Canini, L., & Leonardi, R. (2011). A connotative space for supporting movie affective
recommendation. IEEE Transactions on Multimedia, 13(6), 1356–1370.
Buscher, G., Van Elst, L., & Dengel, A. (2009). Segment-level display time as implicit feedback:
A comparison to eye tracking. In Proceedings of the 32nd International ACM SIGIR Confer-
ence on Research and Development in Information Retrieval (pp. 67–74).
Chênes, C., Chanel, G., Soleymani, M., & Pun, T. (2012). Highlight detection in movie scenes
through inter-users, physiological linkage. In N. Ramzan, R. van Zwol, J.-S. Lee, K. Clüver, &
X.-S. Hua (Eds), Social Media Retrieval (pp. 217–238). Berlin: Springer.
Dietz, R. B. & Lang, A. (1999). Æffective agents: Effects of agent affect on arousal, attention, lik-
ing and learning. In Proceedings of the Third International Cognitive Technology Conference,
San Francisco.
Eggink, J. & Bland, D. (2012). A large scale experiment for mood-based classification of TV
programmes. In Proceedings of IEEE International Conference on Multimedia and Expo
(pp. 140–145).
Fleureau, J., Guillotel, P., & Huynh-Thu, Q. (2012). Physiological-based affect event detector for
entertainment video applications. IEEE Transactions on Affective Computing, 3(3), 379–385.
Fleureau, J., Guillotel, P., & Orlac, I. (2013). Affective benchmarking of movies based on the
physiological responses of a real audience. In Proceedings of 3rd International Conference on
Affective Computing and Intelligent Interaction and Workshops (pp. 73–77).
Goldberg, L. R., Johnson, J. A., Eber, H. W., et al. (2006). The international personality item
pool and the future of public-domain personality measures. Journal of Research in Personality,
40(1), 84–96.
Haji Mirza, S., Proulx, M., & Izquierdo, E. (2012). Reading users’ minds from their eyes: A
method for implicit image annotation. IEEE Transactions on Multimedia, 14(3), 805–815.
Hanjalic, A. & Xu, L.-Q. (2005). Affective video content representation and modeling. IEEE
Transactions on Multimedia, 7(1), 143–154.
Hardoon, D. R. & Pasupa, K. (2010). Image ranking with implicit feedback from eye movements.
In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications (pp. 291–
298).
Jiao, J. & Pantic, M. (2010). Implicit image tagging via facial information. In Proceedings of the
2nd International Workshop on Social Signal Processing (pp. 59–64).
Joachims, T., Granka, L., Pan, B., Hembrooke, H., & Gay, G. (2005). Accurately interpreting
clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 154–161).
Joho, H., Jose, J. M., Valenti, R., & Sebe, N. (2009). Exploiting facial expressions for affective
video summarisation. In Proceeding of the ACM International Conference on Image and Video
Retrieval, New York.
Joho, H., Staiano, J., Sebe, N., & Jose, J. (2010). Looking at the viewer: Analysing facial activity
to detect personal highlights of multimedia contents. Multimedia Tools and Applications, 51(2),
505–523.
Kelly, L. & Jones, G. (2010). Biometric response as a source of query independent scoring in
lifelog retrieval. In C. Gurrin, Y. He, G. Kazai, et al. (Eds), Advances in Information Retrieval
(vol. 5993, pp. 520–531). Berlin: Springer.
Kierkels, J. J. M., Soleymani, M., & Pun, T. (2009). Queries and tags in affect-based multimedia
retrieval. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo
(pp. 1436–1439).
Klinghult, G. (2012). Camera Button with Integrated Sensors. US Patent App. 13/677,517.
Koelstra, S., Muhl, C., & Patras, I. (2009). EEG analysis for implicit tagging of video data. In
Proceedings of 3rd International Conference on Affective Computing and Intelligent Interac-
tion and Workshops (pp. 1–6).
Koelstra, S., Mühl, C., Soleymani, M., et al. (2012). DEAP: A database for emotion analysis using
physiological signals. IEEE Transactions on Affective Computing, 3, 18–31.
Koelstra, S., Yazdani, A., Soleymani, M., et al. (2010). Single trial classification of EEG and
peripheral physiological signals for recognition of emotions induced by music videos. In
Y. Yao (Ed.), Brain Informatics (vol. 6334, pp. 89–100). Berlin: Springer.
Kurdyukova, E., Hammer, S., & André, E. (2012). Personalization of content on public displays
driven by the recognition of group context. In F. Paternò, B. Ruyter, P. Markopoulos, et al. (Eds),
Ambient Intelligence (vol. 7683, pp. 272–287). Berlin: Springer.
Lang, P., Bradley, M., & Cuthbert, B. (2005). International Affective Picture System (IAPS): Affec-
tive ratings of pictures and instruction manual. Technical report A-8. University of Florida,
Gainesville, FL.
Larson, M., Soleymani, M., Serdyukov, P., et al. (2011). Automatic tagging and geotagging in
video collections and communities. In Proceedings of the 1st ACM International Conference
on Multimedia Retrieval (pp. 51:1–51:8).
McDuff, D., El Kaliouby, R., Demirdjian, D., & Picard, R. (2013). Predicting online media effec-
tiveness based on smile responses gathered over the Internet. In Proceedings of 10th IEEE
International Conference and Workshops on Automatic Face and Gesture Recognition (pp. 1–
7).
McDuff, D., El Kaliouby, R., & Picard, R. W. (2012). Crowdsourcing facial responses to online
videos. IEEE Transactions on Affective Computing, 3(4), 456–468.
Moshfeghi, Y. & Jose, J. M. (2013). An effective implicit relevance feedback technique using
affective, physiological and behavioural features. In Proceedings of the 36th International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 133–142).
Pantic, M. & Vinciarelli, A. (2009). Implicit human-centered tagging. IEEE Signal Processing
Magazine, 26(6), 173–180.
Petridis, S. & Pantic, M. (2009). Is this joke really funny? Judging the mirth by audiovisual laugh-
ter analysis. In IEEE International Conference on Multimedia and Expo (pp. 1444–1447).
Salojärvi, J., Puolamäki, K., & Kaski, S. (2005). Implicit relevance feedback from eye move-
ments. In W. Duch, J. Kacprzyk, E. Oja, & S. Zadrozny (Eds), Artificial Neural Networks:
Biological Inspirations ICANN 2005 (vol. 3696, pp. 513–518). Berlin: Springer.
Shan, M. K., Kuo, F. F., Chiang, M. F., & Lee, S. Y. (2009). Emotion-based music recommendation
by affinity discovery from film music. Expert Systems with Applications, 36(4), 7666–7674.
Shen, X., Tan, B., & Zhai, C. (2005). Context-sensitive information retrieval using implicit feed-
back. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval (pp. 43–50).
Silveira, F., Eriksson, B., Sheth, A., & Sheppard, A. (2013). Predicting audience responses to
movie content from electro-dermal activity signals. In Proceedings of the 2013 ACM Confer-
ence on Ubiquitous Computing.
Soleymani, M., Chanel, G., Kierkels, J. J. M., & Pun, T. (2009). Affective characterization of
movie scenes based on content analysis and physiological changes. International Journal of
Semantic Computing, 3(2), 235–254.
Soleymani, M., Kaltwang, S., & Pantic, M. (2013). Human behavior sensing for tag relevance
assessment. In Proceedings of the 21st ACM International Conference on Multimedia.
Soleymani, M., Koelstra, S., Patras, I., & Pun, T. (2011). Continuous emotion detection in
response to music videos. In Proceedings of IEEE International Conference on Automatic
Face Gesture Recognition and Workshops (pp. 803–808).
Soleymani, M., Larson, M., Pun, T., & Hanjalic, A. (2014). Corpus development for affective
video indexing. IEEE Transactions on Multimedia, 16(4), 1075–1089.
Soleymani, M., Lichtenauer, J., Pun, T., & Pantic, M. (2012). A multimodal database for affect
recognition and implicit tagging. IEEE Transactions on Affective Computing, 3, 42–55.
Soleymani, M. & Pantic, M. (2012). Human-centered implicit tagging: Overview and perspec-
tives. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics
(pp. 3304–3309).
Soleymani, M. & Pantic, M. (2013). Multimedia implicit tagging using EEG signals. In Proceed-
ings of IEEE International Conference on Multimedia and Expo.
Soleymani, M., Pantic, M., & Pun, T. (2012). Multimodal emotion recognition in response to
videos. IEEE Transactions on Affective Computing, 3(2), 211–223.
Tkalčič, M., Burnik, U., & Košir, A. (2010). Using affective parameters in a content-based rec-
ommender system for images. User Modeling and User-Adapted Interaction, 20(4), 279–311.
Tkalčič, M., Odic, A., Košir, A., & Tasic, J. (2013). Affective labeling in a content-based recom-
mender system for images. IEEE Transactions on Multimedia, 15(2), 391–400.
Tkalčič, M., Tasič, J., & Košir, A. (2010). The LDOS-PerAff-1 corpus of face video clips with
affective and personality metadata. In Proceedings of Multimodal Corpora Advances in Cap-
turing Coding and Analysing Multimodality (pp. 111–115).
Vrochidis, S., Patras, I., & Kompatsiaris, I. (2011). An eye-tracking-based approach to facilitate
interactive video search. In Proceedings of the 1st ACM International Conference on Multime-
dia Retrieval (pp. 43:1–43:8).
Yannakakis, G. N., & Hallam, J. (2011). Ranking vs. preference: A comparative study of self-
reporting. In S. D’Mello, A. Graesser, B. Schuller, & J.-C. Martin (Eds), Affective Computing
and Intelligent Interaction (vol. 6974, pp. 437–446). Berlin: Springer.
27 Social Signal Processing for Conflict
Analysis and Measurement
Alessandro Vinciarelli

Introduction

The literature proposes several definitions of conflict: “a process in which one party per-
ceives that its interests are being opposed or negatively affected by another party” (Wall
& Roberts Callister, 1995); “[conflict takes place] to the extent that the attainment
of the goal by one party precludes its attainment by the other” (Judd, 1978); “[…]
the perceived incompatibilities by parties of the views, wishes, and desires that each
holds” (Bell & Song, 2005); and so on. While apparently different, all definitions share
a common point, that is, the incompatibility between goals and targets pursued by dif-
ferent individuals involved in the same interaction.
Following the definitions above, conflict is a phenomenon that cannot be observed
directly (goals and targets are not accessible to our senses), but only inferred from
observable behavioural cues. Therefore, the phenomenon appears to be a suitable sub-
ject for a domain like social signal processing that includes detection and interpretation
of observable social signals among its research focuses (Vinciarelli et al., 2008; Vin-
ciarelli, Pantic, & Bourlard, 2009; Vinciarelli, Pantic et al., 2012). Furthermore, the
literature shows that emotions are ambiguous conflict markers – people tend to display
both positive and negative emotions with widely different levels of arousal (Arsenio
& Killen, 1996) – while social signals are more reliable markers of conflict (Gottman,
Markman, & Notarius, 1977; Sillars et al., 1982; Cooper, 1986; Smith-Lovin & Brody,
1989; Schegloff, 2000).
One of the main challenges toward the development of automatic conflict analysis
approaches is the collection of ecologically valid data (Vinciarelli, Kim et al., 2012). The
most probable reason is that there is no conflict in the absence of real goals and motivations,
but these are difficult to produce in laboratory experiments. To the best of our knowl-
edge, the few corpora where the subjects are moved by real motivations and, hence,
actually experience conflict are collections of political debates (Vinciarelli, Kim et al.,
2012) and recordings of counseling sessions for couples in distress (Black et al., 2013).
However, while the former can be distributed publicly and have even been used in inter-
national benchmarking campaigns (Schuller et al., 2013), the latter are protected for
privacy reasons.
The difficulties above explain why many approaches address disagreement, a phe-
nomenon that is easier to observe and elicit in the laboratory and often precedes
or accompanies conflict. Agreement and disagreement are defined as relations of
congruence or opposition, respectively, between opinions expressed by multiple parties
involved in the same interaction (Poggi, D’Errico, & Vincze, 2011). Due to the close
relationship with conflict, this chapter surveys approaches for disagreement detection
as well.

Figure 27.1 General scheme of a conflict detection and analysis approach. Data portraying
multiparty interactions is first segmented into intervals displaying only one person (person
detection). The data corresponding to each individual is then used to detect behavioral patterns
(behavioral cues extraction) and these are then mapped onto conflict and its measure.

The rest of this chapter is organized as follows: the next section proposes a survey of
previous work in the literature, the following section describes open issues and challenges,
and the final section draws some conclusions.

Previous Work

Conflict and disagreement have been the subject of extensive efforts in computing
research. While being different, the two phenomena often co-occur and, in particular,
disagreement is often a precursor of conflict. For this reason, this section proposes a sur-
vey of previous work aimed at the detection of both conflict and disagreement. Overall,
the approaches follow the scheme depicted in Figure 27.1. The two main technological
components are the extraction of features – typically designed to capture verbal and/or
nonverbal behavioural cues expected to mark the presence of conflict or disagreement –
and the actual detection. The latter can be designed as a classification, meaning that the
approach simply tries to detect whether conflict is present or absent, or as a regression,
meaning that the approach tries not only to predict whether there is conflict, but also
to measure its intensity. The rest of this section shows in more detail how each of the
approaches presented in the literature deals with the two technological issues above.
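For illustration, the following Python sketch instantiates the generic scheme of Figure 27.1: per-clip behavioural features (here, crude turn-taking statistics) are mapped either onto a conflict class (classification) or onto a continuous conflict score (regression). The feature set, synthetic turns, and labels are assumptions rather than any published configuration.

```python
# Generic conflict detection/analysis pipeline: clip -> features -> label/score.
# Features, synthetic "turns", and labels are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC, SVR

def clip_features(turns):
    """turns: list of (speaker_id, start_s, end_s) for one clip."""
    durations = np.array([end - start for _, start, end in turns])
    speakers = {spk for spk, _, _ in turns}
    overlap = sum(max(0.0, turns[i][2] - turns[i + 1][1])
                  for i in range(len(turns) - 1))   # crude overlap estimate
    return [len(turns),        # number of turns
            durations.mean(),  # mean turn duration
            durations.std(),   # turn-duration variability
            len(speakers),     # number of active speakers
            overlap]           # overlapping speech (seconds)

rng = np.random.default_rng(4)
def random_clip():
    t, turns = 0.0, []
    for _ in range(rng.integers(5, 20)):
        d = rng.uniform(0.5, 6.0)
        turns.append((int(rng.integers(0, 4)), t, t + d))
        t += d - rng.uniform(0.0, 0.5)   # occasional overlap with the next turn
    return turns

clips = [random_clip() for _ in range(60)]
X = np.array([clip_features(c) for c in clips])
y_class = rng.integers(0, 2, 60)     # conflict present/absent (placeholder)
y_level = rng.uniform(-10, 10, 60)   # continuous conflict level (placeholder)

classifier = SVC().fit(X, y_class)   # detection as classification
regressor = SVR().fit(X, y_level)    # detection as regression
print(classifier.predict(X[:3]), regressor.predict(X[:3]))
```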

Disagreement Detection
Agreement can be defined as “a relation of identity, similarity or congruence between
the opinions of two or more persons” (Poggi et al., 2011). Hence, disagreement cor-
responds to a condition where people express opinions that are different and not con-
gruent, independently of any goal they are pursuing. In principle, disagreement can be
detected by analysing the content of what people say, the verbal component of the inter-
action. However, automatic transcription of data is still a challenging task in the case of
spontaneous conversations (Pieraccini, 2012) and, furthermore, even in the case of per-
fect transcriptions it can be difficult to spot a difference in opinions. For this reason, the
literature proposes different approaches to detect disagreement through both the words
that people utter and the social signals that people display when they defend opinions
different from those of their interlocutors (Bousmalis et al., 2011).
Several works focus on meetings that, while often being acted, give people the oppor-
tunity to display disagreement even when the scenario is cooperative (Germesin & Wil-
son, 2009; Galley et al., 2004; Hillard, Ostendorf, & Shriberg, 2003; Wrede & Shriberg
2003a, 2003b). In general, the approaches include three main stages, namely the seg-
mentation into short intervals according to a predefined criterion, the extraction of lex-
ical and nonverbal features from these intervals, and the application of classification
approaches to detect the possible presence of disagreement (see Figure 27.1).
The approach proposed by Germesin and Wilson (2009) splits meeting conversa-
tions into short segments and extracts several features from each of them, including
dialogue acts, lexical choices (part of speech tags and key words selected via an effec-
tiveness ratio), and prosody (pitch and speech rate). The feature vectors extracted from
consecutive segments can be concatenated to take into account possible context effects.
The segmentation of the conversations into agreement and disagreement intervals is per-
formed using decision trees and conditional random fields. The performance, assessed
in terms of the F1 measure (see Table 27.1), is close to 45%. Galley et al. (2004) propose
to segment conversations into spurts, that is, periods “of speech by one speaker that [have]
no pauses of length greater than one half second” (Hillard et al., 2003). The spurts can
then be represented in terms of speaker adjacency statistics (e.g., how many spurts are
observed between two speakers on average), duration modeling (e.g., distribution of
speaking time across speakers), and lexical measurements (e.g., distribution of number
of words over the spurts). A maximum entropy classifier is then proposed to classify spurts
into four possible categories (including disagreement) with an accuracy of 84%. A sim-
ilar approach is proposed by Hillard et al. (2003). In this case, the spurts are represented
by the number of words and their type (“positive” and “negative”) as well as by the
perplexity of language models trained over samples where disagreement is present or
absent. The features mentioned so far are based on the verbal content, but the approach
includes a nonverbal component as well, that is, spurt duration and statistics of the
fundamental frequency.
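The spurt definition quoted above can be made concrete in a few lines; the following Python sketch assumes that per-speaker word timings are available and simply applies the half-second pause criterion.

```python
# A minimal sketch of the "spurt" segmentation quoted above: a spurt is a
# period of speech by one speaker with no internal pause longer than 0.5 s.
# The input format (per-speaker word start/end times) is an assumption.
def to_spurts(word_times, max_pause=0.5):
    """word_times: sorted list of (start_s, end_s) for one speaker's words."""
    if not word_times:
        return []
    spurts = [[word_times[0][0], word_times[0][1]]]
    for start, end in word_times[1:]:
        if start - spurts[-1][1] <= max_pause:
            spurts[-1][1] = end          # extend the current spurt
        else:
            spurts.append([start, end])  # a pause > 0.5 s starts a new spurt
    return [tuple(s) for s in spurts]

# Example: three words, with a 1.2 s pause before the last one.
print(to_spurts([(0.0, 0.4), (0.5, 0.9), (2.1, 2.6)]))
# -> [(0.0, 0.9), (2.1, 2.6)]
```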
The main assumption behind the experiments proposed by Wrede and Shriberg
(2003a, 2003b) is that disagreement is a moment of higher engagement for the meeting
participants. Therefore, the approach proposed in these works is to detect the hot spots
(segments of high engagement) and then classify them into different categories, including
disagreement. The features used to represent a hot spot include the perplexity of lan-
guage models, dialogue acts, fundamental frequency, and energy of speech. The classi-
fication, performed with decision trees, achieves an accuracy of up to 40%. More recent
work (Bousmalis, Morency, & Pantic, 2011) focuses on political debates, which are expected to
be more ecologically valid (Vinciarelli, Kim et al., 2012), and tries to go beyond clas-
sification to reconstruct the temporal evolution of disagreement. The approach is
multimodal – it relies on speech cues (energy and pitch) and gesture detection – and
it adopts hidden state conditional random fields for an accuracy close to 65%. To the
best of our knowledge, the last work is the only one that does not take into account
verbal aspects. A synopsis of all approaches presented in this section is available in
Table 27.1.

Conflict Detection
Conflict is a phenomenon that has been addressed only recently in the social signal pro-
cessing literature (Vinciarelli, Pantic et al., 2009, 2012). This is not surprising because
most of the progress in SSP has been achieved using data based on cooperative sce-
narios, such as the AMI Meeting Corpus, in which conflict episodes are unlikely to be
observed (Germesin & Wilson, 2009). The situation has changed only recently (Bousmalis,
Mehu, & Pantic, 2013; Vinciarelli, Dielmann et al., 2009), when corpora
of political debates (Kim, Valente, & Vinciarelli, 2012; Kim, Filippone et al., 2012;
Pesarin et al., 2012; Grezes, Richards, & Rosenberg, 2013; Räsänen & Pohjalainen,
2013) and couple therapy sessions have become available (Georgiou et al., 2011; Black
et al., 2013). In these settings, people actually pursue incompatible goals and, hence,
conflict takes place frequently.
In most cases, the goal of the approaches is simply to detect whether conflict is present
or absent (Pesarin et al., 2012; Kim, Valente et al., 2012; Grezes et al., 2013; Räsänen &
Pohjalainen, 2013; Georgiou et al., 2011) – this applies to the international benchmark-
ing described by Schuller et al. (2013) as well – but some approaches try to measure
the intensity of the phenomenon in continuous terms (Kim, Filippone et al., 2012; Kim
et al., 2014). The approach proposed by Pesarin et al. (2012) is based on “steady con-
versational periods” (Cristani et al., 2011), that is, statistical representations of stable
conversational configurations (e.g., everybody talks, one person talks and the others lis-
ten, etc.). The authors adopt a hidden Markov model to represent sequences of steady
conversational periods and then use the parameters of the hidden Markov model as a
feature vector for a discriminative approach. By using such a methodology – originally
proposed by Perina et al. (2009) – the recordings used for the experiments can be seg-
mented into “conflict/nonconflict intervals” with accuracy up to 80%.
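The following Python sketch illustrates, in a deliberately simplified form, the idea of using the parameters of a generative sequence model as a feature vector for a discriminative classifier: a first-order Markov transition matrix over four coarse conversational states stands in for the HMM over steady conversational periods, and the sequences and labels are synthetic placeholders rather than the data used by Pesarin et al. (2012).

```python
# Simplified analogue of "generative model parameters as features": per clip,
# a first-order Markov transition matrix over coarse conversational states is
# flattened and fed to an SVM. States, sequences, and labels are assumptions.
import numpy as np
from sklearn.svm import SVC

STATES = ["silence", "one_speaker", "overlap", "backchannel"]

def transition_features(state_seq, n_states=len(STATES)):
    """Row-normalised transition counts, flattened into a feature vector."""
    counts = np.ones((n_states, n_states))        # add-one smoothing
    for a, b in zip(state_seq[:-1], state_seq[1:]):
        counts[a, b] += 1
    return (counts / counts.sum(axis=1, keepdims=True)).ravel()

rng = np.random.default_rng(5)
sequences = [rng.integers(0, len(STATES), size=200) for _ in range(40)]
labels = rng.integers(0, 2, size=40)              # conflict / nonconflict

X = np.array([transition_features(s) for s in sequences])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X[:4]))
```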
Several works propose experiments on the SSPNet Conflict Corpus (Vinciarelli, Kim
et al., 2012), which, to the best of our knowledge, is the only publicly available cor-
pus focusing on conflict (see also Table 27.2). The approach proposed by Kim, Valente
et al. (2012) extracts prosodic and turn-taking features from the audio component of
the data and then adopts support vector machines to map the data into three conflict
classes, namely absent-to-low, middle, and high. The performance claimed in the work
is an F-measure of 76.1%. Other works, presented in the framework of the “Interspeech
2013 Computational Paralinguistics Challenge” (Schuller et al., 2013), propose similar
experiments over the same data, namely the classification of 30 seconds long segments
into high or low conflict classes. Grezes et al. (2013) achieve an unweighted average
recall (UAR) higher than 80% by simply using the ratio of overlapping to nonoverlap-
ping speech. In the case of Räsänen and Pohjalainen (2013), the core of the approach is
a feature selection method capable of filtering the set of 6,373 acoustic features provided by
the challenge organizers (Schuller et al., 2013). The resulting UAR is 83.9%. The last
classification approach (Georgiou et al., 2011) works on a large corpus of couple ther-
apy sessions. It adopts lexical features (frequency of appearance of words used by each
subject) to identify, among others, blaming or acceptance attitudes, possibly accounting
for the presence or absence of conflict, respectively. Accuracies higher than 70% are
achieved for both absence and presence of conflict.

Table 27.1 The most important works dedicated to disagreement.

Reference | Subjects | Behavioral cues | Phenomenon | Task | Data | Performance
Hillard et al. (2003) | 40–50 | Prosody, lexical | (dis)agreement | C | 9854 spurts, ICSI meetings | 61% accuracy
Wrede and Shriberg (2003a) | 20–30 | Prosody | hot spots | C | 13 ICSI meetings | significant correlation
Wrede and Shriberg (2003b) | 53 | Dialogue acts, lexical | hot spots | C | 32 ICSI meetings | 0.4 chance-normalized accuracy
Galley et al. (2004) | 40–50 | Duration, lexical, speaker adjacency | (dis)agreement | C | 9854 spurts, ICSI meetings | 84% accuracy
Germesin and Wilson (2009) | 16 | Prosody, lexical, dialogue acts | (dis)agreement | C | 20 AMI meetings | F1 ~ 45%
Bousmalis et al. (2011) | 44 | Prosody, gestures | (dis)agreement | C | 147 debate clips from Canal9 | 64.2% accuracy

Table 27.2 The most important works dedicated to conflict.

Reference | Subjects | Behavioral cues | Phenomenon | Annotation | Data | Performance
Kim, Valente et al. (2012) | 138 | Turn organization, prosody, speaker adjacency stats. | conflict | categorical | SSPNet Conflict Corpus | F1 = 76.1% clip accuracy (3 classes)
Kim, Filippone et al. (2012) | 138 | Turn organization, prosody, speaker adjacency stats. | conflict | dimensional | SSPNet Conflict Corpus | correlation 0.75 predicted/real conflict level
Pesarin et al. (2012) | 26 | Turn organization, steady conversational periods | conflict | categorical | 13 debates from Canal9 | 80.0% turn classification accuracy
Grezes et al. (2013) | 138 | Overlapping to nonoverlapping speech ratio | conflict | categorical | SSPNet Conflict Corpus | UAR = 83.1% clip accuracy (2 classes)
Räsänen and Pohjalainen (2013) (1) | 138 | Feature selection over openSMILE acoustic features | conflict | categorical | SSPNet Conflict Corpus | UAR = 83.9% clip accuracy (2 classes)
Räsänen and Pohjalainen (2013) (2) | 138 | Feature selection over openSMILE acoustic features | conflict | dimensional | SSPNet Conflict Corpus | correlation 0.82 predicted/real conflict level
Georgiou et al. (2011) | 26 | Lexical | blaming, acceptance | categorical | 130 couple therapy sessions | >70.0% classification accuracy
The SSPNet Conflict Corpus has been adopted in Kim, Filippone et al. (2012) and
Kim et al. (2014). In both cases, the experiments aimed not just at predicting whether
conflict is present or absent, but at predicting the actual conflict level available in the
data. Both approaches rely on audio features such as statistics of pitch and energy, statis-
tics of turn lengths, frequency of overlapping speech events, and so on. The prediction
of the conflict level is performed with Gaussian processes and the correlation between
actual and predicted conflict level goes up to 0.8 in both cases.
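For illustration, the following Python sketch performs continuous conflict-level prediction with Gaussian process regression on a handful of audio-style features. The feature set, the kernel, and the synthetic conflict scores (assumed here to lie roughly in [-10, +10]) are assumptions, not the exact configuration of Kim, Filippone et al. (2012) or Kim et al. (2014).

```python
# Continuous conflict-level prediction with Gaussian process regression.
# Features, kernel choice, and synthetic targets are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(6)
n_clips = 80
# Columns: mean pitch, pitch std, mean energy, overlap rate, turns per minute.
X = rng.random((n_clips, 5))
# Placeholder conflict scores, loosely correlated with overlap and turn rate.
y = 10 * (X[:, 3] + X[:, 4] - 1) + rng.normal(0, 1, n_clips)

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(X[:60], y[:60])                       # train on the first 60 clips
pred, std = gpr.predict(X[60:], return_std=True)
print(np.corrcoef(pred, y[60:])[0, 1])        # correlation predicted vs. target
```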

Open Issues and Challenges

So far, the approaches for analysis and detection of conflict have followed the general
indications of domains like social signal processing or computational paralinguistics,
that is, they detect behavioural cues in data showing interactions between people and
then apply machine learning approaches to infer the presence or absence of conflict
(see Figure 27.1). In a few cases, the approaches try to measure the intensity of conflict
as well, but not every available corpus allows one to perform such a task. However,
no attempts have been made so far, at least to the best of our knowledge, to develop
approaches that take into account specific aspects of conflict. In particular, no attempt
has been made to model and analyse conflict as it unfolds over time. This is a major issue
not only for technology – the application of statistical sequential models to behavioural
data, possibly multimodal, is still a challenge – but also for human sciences. In fact,
knowledge about how conflict starts and develops in time is still limited in social psy-
chology as well.
In most of the works presented in the previous section, the data is manually seg-
mented into samples labeled according to their “conflict content” (conflict absent or
present and, sometimes, conflict intensity). However, conflict in real interactions begins,
evolves, and ends in the middle of longer-term social exchanges that do not necessarily
involve conflict. In this respect, it is necessary to develop approaches capable of
analysing the stream of data captured during human–human interactions and of automatically
segmenting it. This is not a simple task because all approaches developed so far need suffi-
cient evidence to distinguish between different levels of conflict. Thus, it is unclear how
effective technologies can be at spotting the start of conflict and how much time would
be needed to do it.
Finally, no attempt has been made so far to take into account cultural differences,
whether these correspond to different nationalities and ethnic origins or to different
environments (e.g., job, family, etc.). From a technological point of view, culture can be
considered as a latent variable that conditions the display of behavioural cues. From a
psychological point of view, the study of cultural effects requires extensive analysis of
conflict in multiple contexts and environments. Similarly, it is important to consider the
effect of any other socially relevant variable, including status, hierarchical relationships,
personality, and so on. In this case as well, the various phenomena can probably be
included in computational models in the form of latent variables.

Conclusions

This chapter has shown how the social signal processing community has been dealing
with the problem of conflict detection and analysis in recent years. Although it has been
proposed only recently, the topic has attracted significant attention and several initiatives
have consolidated the latest developments on the subject, including the organisation of
an international benchmarking campaign (Schuller et al., 2013) and the publication of a
volume exploring conflict in all its aspects (D’Errico et al., 2015).
The chapter has focused in particular on automatic detection and analysis of conflict
because this is the only task that has been addressed in the literature. However, more
problems revolving around conflict can be interesting for social signal processing. In
recent years, research in human sciences has shown that conflict is not always a negative
aspect of human–human interactions (Joni & Beyer, 2009). If properly managed, con-
flict can help people to mobilise cognitive, affective, and psychological resources that
remain unused in most contexts. This can allow a group to perform better in achieving
a task or to reach a social configuration better than the one observed before conflict.
However, this requires a better understanding of conflict dynamics and, in particular, it
requires one to understand where the boundary lies that separates conflict between ideas,
which is typically fertile in terms of new insights and exchange of information, from conflict
between persons, which is typically dangerous for the stability of a group and always at risk of
leaving permanent negative effects. Better technologies for the understanding of conflict
can play a major role in making conflict a resource rather than a problem.

Acknowledgment

Supported by the European Commission via the Social Signal Processing Network (GA
231287).

References

Arsenio, W. F. & Killen, M. (1996). Conflict-related emotions during peer disputes. Early Educa-
tion and Development, 7(1), 43–57.
Bell, C. & Song, F. (2005). Emotions in the conflict process: An application of the cognitive
appraisal model of emotions to conflict management. International Journal of Conflict Man-
agement, 16(1), 30–54.
Black, M. P., Katsamanis, A., Baucom, B. R., et al. (2013). Toward automating a human behav-
ioral coding system for married couples’ interactions using speech acoustic features. Speech
Communication, 55(1), 1–21.
Bousmalis, K., Mehu, M., & Pantic, M. (2013). Towards the automatic detection of spontaneous
agreement and disagreement based on nonverbal behaviour: A survey of related cues, databases
and tools. Image and Vision Computing, 31(2), 203–221.
Bousmalis, K., Morency, L. P., & Pantic, M. (2011). Modeling hidden dynamics of multimodal
cues for spontaneous agreement and disagreement recognition. In Proceedings of IEEE
International Conference on Automatic Face and Gesture Recognition (pp. 746–752).
Cooper, V. W. (1986). Participant and observer attribution of affect in interpersonal conflict: An
examination of noncontent verbal behavior. Journal of Nonverbal Behavior, 10(2), 134–144.
Cristani, M., Pesarin, A., Drioli, C., et al. (2011). Generative modeling and classification of
dialogs by a low-level turn-taking feature. Pattern Recognition, 44(8), 1785–1800.
D’Errico, F., Poggi, I., Vinciarelli, A., & Vincze, L. (Eds). (2015). Conflict and Multimodal Com-
munication. Berlin: Springer.
Galley, M., McKeown, K., Hirschberg, J., & Shriberg, E. (2004). Identifying agreement and dis-
agreement in conversational speech: Use of Bayesian networks to model pragmatic dependen-
cies. In Proceedings of the Annual Meeting of the Association for Computational Linguistics
(pp. 669–676).
Georgiou, P. G., Black, M. P., Lammert, A. C., Baucom, B. R., & Narayanan, S. S. (2011). “That’s
aggravating, very aggravating”: Is it possible to classify behaviors in couple interactions
using automatically derived lexical features? In Proceedings of International Conference on
Affective Computing and Intelligent Interaction (pp. 87–96).
Germesin, S. & Wilson, T. (2009). Agreement detection in multiparty conversation. In Proceed-
ings of ACM International Conference on Multimodal Interfaces (pp. 7–14).
Gottman, J., Markman, H., & Notarius, C. (1977). The topography of marital conflict: A sequen-
tial analysis of verbal and nonverbal behavior. Journal of Marriage and the Family, 39(3),
461–477.
Grezes, F., Richards, J., & Rosenberg, A. (2013). Let me finish: Automatic conflict detection
using speaker overlap. In Proceedings of 14th Annual Conference of the International Speech
Communication Association (pp. 200–204).
Hillard, D., Ostendorf, M., & Shriberg, E. (2003). Detection of agreement vs. disagreement in
meetings: Training with unlabeled data. In Proceedings of the Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics: Human Language Technologies
(pp. 34–36).
Joni, S. N. & Beyer, D. (2009). How to pick a good fight. Harvard Business Review, 87(12),
48–57.
Judd, C. M. (1978). Cognitive effects of attitude conflict resolution. Journal of Conflict Resolu-
tion, 22(3), 483–498.
Kim, S., Filippone, M., Valente, F., & Vinciarelli, A. (2012). Predicting the conflict level in
television political debates: An approach based on crowdsourcing, nonverbal communication
and Gaussian processes. In Proceedings of the ACM International Conference on Multimedia
(pp. 793–796).
Kim, S., Valente, F., Filippone, M., & Vinciarelli, A. (2014). Predicting continuous conflict per-
ception with Bayesian Gaussian processes. IEEE Transactions on Affective Computing, 5(2),
187–200.
Kim, S., Valente, F., & Vinciarelli, A. (2012). Automatic detection of conflicts in spoken conversa-
tions: Ratings and analysis of broadcast political debates. In Proceedings of IEEE International
Conference on Acoustics, Speech and Signal Processing (pp. 5089–5092).
Perina, A., Cristani, M., Castellani, U., Murino, V., & Jojic, N. (2009). Free energy score space.
In Advances in Neural Information Processing Systems (pp. 1428–1436).
Pesarin, A., Cristani, M., Murino, V., & Vinciarelli, A. (2012). Conversation analysis at work:
Detection of conflict in competitive discussions through automatic turn-organization analysis.
Cognitive Processing, 13(2), 533–540.
Pieraccini, R. (2012). The Voice in the Machine: Building Computers that Understand Speech.
Cambridge, MA: MIT Press.
Poggi, I., D’Errico, F., & Vincze, L. (2011). Agreement and its multimodal communication in
debates: A qualitative analysis. Cognitive Computation, 3(3), 466–479.
Räsänen, O. & Pohjalainen, J. (2013). Random subset feature selection in automatic recognition
of developmental disorders, affective states, and level of conflict from speech. In Proceedings
of 14th Annual Conference of the International Speech Communication Association (pp. 210–
214).
Schegloff, E. (2000). Overlapping talk and the organization of turn-taking for conversation.
Language in Society, 29(1), 1–63.
Schuller, B., Steidl, S., Batliner, A., et al. (2013). The InterSpeech 2013 computational paralin-
guistics challenge: Social signals, conflict, emotion, autism. In Proceedings of 14th Annual
Conference of the International Speech Communication Association (pp. 148–152).
Sillars, A. L., Coletti, S. F., Parry, D., & Rogers, M. A. (1982). Coding verbal conflict tactics: Non-
verbal and perceptual correlates of the “avoidance-distributive-integrative” distinction. Human
Communication Research, 9(1), 83–95.
Smith-Lovin, L. & Brody, C. (1989). Interruptions in group discussions: The effects of gender
and group composition. American Sociological Review, 54(3), 424–435.
Vinciarelli, A., Dielmann, A., Favre, S., & Salamin, H. (2009). Canal9: A database of political
debates for analysis of social interactions. In Proceedings of the International Conference on
Affective Computing and Intelligent Interaction (vol. 2, pp. 96–99).
Vinciarelli, A., Kim, S., Valente, F., & Salamin, H. (2012). Collecting data for socially intelligent
surveillance and monitoring approaches: The case of conflict in competitive conversations. In
Proceedings of International Symposium on Communications, Control and Signal Processing
(pp. 1–4).
Vinciarelli, A., Pantic, M., & Bourlard, H. (2009). Social signal processing: Survey of an emerg-
ing domain. Image and Vision Computing, 27(12), 1743–1759.
Vinciarelli, A., Pantic, M., Bourlard, H., & Pentland, A. (2008). Social signal processing: State of
the art and future perspectives of an emerging domain. In Proceedings of the ACM International
Conference on Multimedia (pp. 1061–1070).
Vinciarelli, A., Pantic, M., Heylen, D., et al. (2012). Bridging the gap between social animal
and unsocial machine: A survey of social signal processing. IEEE Transactions on Affective
Computing, 3(1), 69–87.
Wall, J. A. & Roberts Callister, R. (1995). Conflict and its management. Journal of Management,
21(3), 515–558.
Wrede, B. & Shriberg, E. (2003a). Spotting “hotspots” in meetings: Human judgments and
prosodic cues. In Proceedings of Eurospeech (pp. 2805–2808).
Wrede, B. & Shriberg, E. (2003b). The relationship between dialogue acts and hot spots in meet-
ings. In Proceedings of the IEEE Speech Recognition and Understanding Workshop (pp. 180–
185).
28 Social Signal Processing and Socially
Assistive Robotics in Developmental
Disorders
Mohamed Chetouani, Sofiane Boucenna, Laurence Chaby,
Monique Plaza, and David Cohen

Introduction

Multimodal social-emotional interactions play a critical role in child development and
this role is emphasized in autism spectrum disorders (ASD). In typically develop-
ing children, the ability to correctly identify, interpret, and produce social behaviors
(Figure 28.1) is a key aspect of communication and is the basis of social cogni-
tion (Carpendale & Lewis, 2004). This process helps children to understand that other
people have intentions, thoughts, and emotions, and it acts as a trigger of empathy (Decety
& Jackson, 2004; Narzisi et al., 2012). Social cognition includes the child’s ability to
spontaneously and correctly interpret verbal and nonverbal social and emotional cues
(e.g., speech, facial and vocal expressions, posture and body movements, etc.); the abil-
ity to produce social and emotional information (e.g., initiating social contact or conver-
sation); the ability to continuously adjust and synchronize behavior to others (i.e., par-
ents, caregivers, peers); and the ability to make an adequate attribution about another’s
mental state (i.e., “theory of mind”).

Definitions and Treatments


ASDs are a group of behaviorally defined disorders with abnormalities or impaired
development in two areas: (1) persistent deficits in social communication and social
interaction and (2) restricted, repetitive patterns of behavior, interests, or activities. An
individual with ASD has difficulty interacting with other people due to an inability to
understand social cues as well as others’ behaviors and feelings. For example, children
with ASD often have difficulty with cooperative play with other peers; they prefer to
continue with their own repetitive activities (Baron-Cohen & Wheelwright, 1999). Per-
sons with ASD evaluate both the world and human behavior in a unique way: they react
atypically to input stimuli, show problematic engagement with other people, and have
difficulty generalizing across environments (Rajendran & Mitchell, 2000). Although
ASD remains a devastating disorder with a poor outcome in adult life, there have been
important improvements in treating ASD with the development of various therapeutic
approaches (Cohen, 2012).

Figure 28.1 Reception and production of social signals. Multimodal verbal (speech and prosody)
and nonverbal cues (facial expressions, vocal expressions, mutual gaze, posture, imitation,
synchrony, etc.) merge to produce social signals (Chaby et al., 2012).

Successful autism “treatments” using educational interventions have been reported
as recently as a decade ago (Murray, 1997). Since then, the literature devoted to the
description and evaluation of interventions in ASD has become substantial over the last
few years. From this literature, a number of conclusions can be drawn. First, there is
increasing convergence between behavioral and developmental methods (Ospina et al.,
2008). For both types of treatment, the focus of early intervention is directed toward
the development of skills that are considered “pivotal,” such as joint attention and imi-
tation, as well as communication, symbolic play, cognitive abilities, attention, sharing
emotion, and regulation. Second, the literature contains a number of guidelines for treat-
ments, such as 1) starting as early as possible, 2) minimizing the gap between diagnosis
and treatment, 3) providing no fewer than three to four hours of treatment each day, 4)
involving the family, 5) providing six-monthly developmental evaluations and updating
the goals of treatment, 6) choosing between behavioral and developmental treatments depend-
ing on the child’s response, 7) encouraging spontaneous communication, 8) promoting
the skills through play with peers, 9) gearing toward the acquisition of new skills and
to their generalization and maintenance in natural contexts, and 10) supporting positive
behaviors rather than tackling challenging behaviors.

Information Communication Technology and ASD


Computational models able to automatically analyze behaviors by making use of infor-
mation communication technology (ICT) may be beneficial in ASD therapy. Over the
last few years, there have been considerable advances in the research on innovative
ICT for the education of people with special needs, such as patients suffering from
ASD (Konstantinidis et al., 2009). Education is considered to be the most effective ther-
apeutic strategy (Mitchell, Parsons, & Leonard, 2006). More specifically, early stage
education has proven helpful in coping with difficulties in understanding the mental
states of other people (Howlin, Baron-Cohen, & Hadwin, 1999). In recent years, there
have been new developments in ICT-based approaches and methods for therapy and the
education of children with ASD. Individuals with autism have recently been included
as a main focus in the areas of social signal processing (SSP is the ICT domain that
aims at providing computers with the ability to sense and understand human social sig-
nals and communication) (Chaby et al., 2012) and Affective Computing (AC is the ICT
domain that aims at modeling, recognizing, processing, and simulating human affects,
or that relates to, arises from, or deliberately influences emotions) (Kaliouby, Picard, &
Baron-Cohen, 2006).
In this chapter, we review two important domains, namely social signal processing
(SSP) and socially assistive robotics (SAR) for investigations and treatments in the area
of developmental disorders. The chapter begins with a description of computational
methods for measuring and analyzing the behavior of autistic children with a special
focus on social interactions (section Computational Methods for Measuring and Ana-
lyzing the Behavior of Autistic Children During Social Interactions). The idea is not to
investigate autism only by looking at children but also at social environment (parent,
therapist, etc.). In section Robotics and ASD, we review robotics contributions applied
to autism and we show that different points of view are followed by the research com-
munity. Finally, the chapter discusses a number of challenges that need to be addressed
(section Conclusions and Main Challenges).

Computational Methods for Measuring and Analyzing the Behavior of
Autistic Children during Social Interactions

In this section, we focus more specifically on three domains of impairments: i) language,
ii) emotion, and iii) interpersonal synchrony in social interactions.

Language Impairment
Language impairment is a common feature in autism spectrum disorders that is char-
acterized by a core pragmatic disorder, abnormal prosody, and impairments in semantic skills (Kjelgaard & Tager-Flusberg, 2001; Tager-Flusberg, 1981). However, language functioning in ASD is variable. On the one hand, there are children with ASD whose vocabulary, grammatical knowledge, pragmatics, and prosody are within the normal range of functioning (e.g. Asperger syndrome); on the other hand, a significant proportion of the population remains essentially non-verbal (e.g. autistic disorder with intellectual disability).
In a recent clinical work, Demouy et al. (2011) sought differential language markers of pathology in autistic disorder without intellectual disability (AD) and pervasive developmental disorder not otherwise specified (PDD-NOS), compared with specific language impairment (SLI) and with typically developing (TD) children. Their findings suggest that expressive syntax, pragmatic skills, and some intonation features could be considered differential language markers of pathology. The AD group is the most
deficient, presenting difficulties at the lexical, syntactic, pragmatic, and prosodic levels;
the PDD-NOS group performed better than AD in pragmatic and prosodic skills but
was still impaired in lexical and syntactic skills.
Ringeval et al. (2011) designed a system that automatically assesses a child’s gram-
matical prosodic skills through an intonation contours imitation task. The key idea of the
system is to propose computational modeling of prosody by employing static (k-NN) and dynamic (HMM) classifiers. The intonation recognition scores of typically developing
(TD) children and language-impaired children (LIC) are compared. The results showed
that all LIC have difficulties in reproducing intonation contours because they achieved
significantly lower recognition scores than TD children on almost all studied intona-
tions (p < 0.05). The automatic approach used in this study to assess LIC’s prosodic
skills confirms the clinical descriptions of the subjects’ communication impairments
(Demouy et al., 2011). Combined with traditional clinical evaluations, the results also
suggest that expressive syntax, pragmatic skills, and some intonation features could be
considered as language differential markers of pathology (e.g. LIC vs. ASD), but also
within LIC (e.g. AD vs. PDD-NOS vs. SLI).
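As a concrete illustration of the static branch of such an approach, the following minimal sketch classifies F0 (pitch) contours with a k-NN classifier after resampling each contour to a fixed length. The synthetic rising/falling contours, the resampling length, and the choice of k are illustrative assumptions only; the sketch does not reproduce the system of Ringeval et al. (2011), which also relied on HMM-based dynamic modeling.

```python
# Hedged sketch of the static (k-NN) branch of intonation-contour classification,
# in the spirit of Ringeval et al. (2011) but not their implementation. The F0
# contours below are synthetic stand-ins for pitch tracks extracted from imitated
# utterances; the fixed resampling length and k = 5 are illustrative choices.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def contour_features(f0):
    """Resample an F0 contour to a fixed length and z-normalize it."""
    f0 = np.asarray(f0, dtype=float)
    resampled = np.interp(np.linspace(0, len(f0) - 1, 20), np.arange(len(f0)), f0)
    return (resampled - resampled.mean()) / (resampled.std() + 1e-8)

# Synthetic data: rising vs. falling contours of varying length, with jitter.
rng = np.random.default_rng(0)
contours, labels = [], []
for _ in range(100):
    n = int(rng.integers(30, 80))
    rising = int(rng.integers(0, 2))
    slope = 1.0 if rising else -1.0
    contours.append(200 + slope * np.linspace(0, 40, n) + rng.normal(0, 5, n))
    labels.append(rising)

X = np.vstack([contour_features(c) for c in contours])
y = np.asarray(labels)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("Intonation recognition accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```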

Emotion
Interpersonal communication involves the processing of multimodal emotional cues,
which could be perceived and expressed through visual, auditory, and bodily modali-
ties. Autism spectrum disorder is characterized by problems in recognizing emotions
that affect day-to-day life (Chamak et al., 2008). Research into emotion recognition abilities in ASD has been limited by an over-focus on the visual modality, specifically the recognition of facial expressions. In addition, emotion production remains a neglected
area. However, understanding emotional states in real life involves identifying, interpret-
ing, and producing a variety of cues that include nonverbal vocalizations (e.g. laughter,
crying), speech prosody, body movements, and posture. In a preliminary work, Vannetzel et al. (2011) recently studied neutral and emotional (facial, vocal) processing in children with PDD-NOS, which represents around two-thirds of autism spectrum disorders. Their results suggest that children with PDD-NOS present global difficulties in processing emotional human stimuli (in both facial and vocal conditions), which contrast dramatically with their ability to process neutral human stimuli. However, impairments in emotional processing are partially compensated by multimodal processing. Nevertheless, it is still not clear how children with ASD perceive and produce multimodal emotion, particularly across ASD subtypes (i.e., autism, PDD-NOS, high-functioning autism, etc.) and stimulus domains (e.g. visual, auditory).
Emotions play an important role in infants’ development. Specifically, motherese
(Saint-Georges et al., 2013; Mahdhaoui et al., 2011), also known as infant-directed
speech (IDS), is a typical social emotion produced by the mother toward the infant.
Saint-Georges et al. (2013) recently reviewed the role of motherese in interaction within
various dimensions, such as language acquisition and infants’ attention and learn-
ing. Two observations were notable: (1) IDS prosody reflects emotional charges and
meets infants’ preferences and (2) mother-infant contingency and synchrony are cru-
cial for IDS production and prolongation. Thus, IDS is part of an interactive loop that
may play an important role in infants’ cognitive and social development. Cohen et al.
(2013) investigated this interactive loop for the development of both typical and autistic
infants. They found that parentese was significantly associated with infant responses
to parental vocalizations involving orientation towards other people and with infant
receptive behaviours, that parents of infants developing autism displayed more intense
solicitations that were rich in parentese, that fathers of infants developing autism spoke
to their infants more than fathers of TD infants, and that fathers’ vocalizations were
significantly associated with intersubjective responses and active behaviours in infants
who subsequently developed autism.

Interpersonal Synchrony
Synchrony in social interaction is a complex phenomenon that requires the perception
and production of social and communicative signals (speech, linguistic cues, prosody,
emotion, gesture, etc.) and also continuous adaptation to the other. In adulthood, interactional synchrony has been shown to act as a facilitator of high-quality interpersonal relationships and smooth social interactions (Kendon, 1970). The role of synchrony during child development is not well known, but it seems to provide children with a secure base from which they can explore their environment, regulate their affective states, and
develop language and cognitive skills (Delaherche et al., 2012). In addition, synchrony
appears to be a key metric in human communication dynamics and interaction (Vincia-
relli, Pantic, & Bourlard, 2009) that can be employed to assess children (Delaherche
et al., 2013; Segalin et al., 2013) or detect early signs of disorders (Saint-Georges et al.,
2011).
Currently, few models have been proposed to capture mimicry in dyadic interac-
tions. Mimicry is usually considered within the larger framework of assessing inter-
actional synchrony, which is the coordination of movement between individuals, with
respect to both the timing and form during interpersonal communication (Bernieri,
Reznick, & Rosenthal, 1988). The first step in computing synchrony is to extract the
relevant features of the dyad’s motion. Some studies (Campbell, 2008; Ashenfelter
et al., 2009; Varni, Volpe, & Camurri, 2010; Weisman et al., 2013) have focused on
head motion, which can convey emotion, acknowledgment, or active participation in an
interaction. Other studies have captured the global movements of the participants with
motion energy imaging (Altmann, 2011; Ramseyer & Tschacher, 2011) or derivatives
(Delaherche & Chetouani, 2010; Sun et al., 2011). Then, a measure of similarity is
applied between the two time series. Several studies have also used a peak-picking
algorithm to estimate the time lag between partners (Ashenfelter et al., 2009; Boker
et al., 2002; Altmann, 2011). Michelet et al. (2012) recently proposed an unsupervised
approach to measuring immediate synchronous and asynchronous imitations between
two partners. The proposed model is based on the following two steps: detection of
interest points in images and evaluation of the similarity between actions. The current challenges in modeling mimicry involve the characterization of both temporal coordination (synchrony) and content coordination (behavior matching) in a dyadic interaction (Delaherche et al., 2012).
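A minimal sketch of the windowed cross-correlation and peak-picking strategy cited above (Boker et al., 2002) is given below. The motion-energy signals are synthetic, and the window size, step, and lag range are illustrative assumptions rather than values taken from any of the cited studies.

```python
# Hedged sketch of windowed cross-correlation with peak picking for dyadic
# synchrony, in the spirit of Boker et al. (2002). The two motion-energy series,
# window size, step, and lag range are illustrative assumptions.
import numpy as np

def windowed_sync(a, b, win=100, step=50, max_lag=25):
    """For each window, return the peak correlation and the lag (in frames)
    at which partner b's motion best matches partner a's motion."""
    peaks = []
    for start in range(0, len(a) - win - max_lag, step):
        ref = a[start:start + win]
        ref = (ref - ref.mean()) / (ref.std() + 1e-8)
        best_r, best_lag = -np.inf, 0
        for lag in range(-max_lag, max_lag + 1):
            if start + lag < 0:
                continue
            seg = b[start + lag:start + lag + win]
            seg = (seg - seg.mean()) / (seg.std() + 1e-8)
            r = float(np.dot(ref, seg)) / win
            if r > best_r:
                best_r, best_lag = r, lag      # peak picking within the window
        peaks.append((best_r, best_lag))
    return peaks

# Synthetic example: partner b imitates partner a with a 10-frame delay.
rng = np.random.default_rng(0)
a = rng.standard_normal(1000).cumsum()
b = np.roll(a, 10) + 0.1 * rng.standard_normal(1000)
print(windowed_sync(a, b)[:3])
```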

Robotics and ASD

In this section, we explore the contribution of robotics to children with ASD. The
use of robots in special education is an idea that has been studied for a number of
decades (Papert, 1980). We will specifically focus on robotics for children with ASD according to what is expected from the robotic system in the context of each experiment described. However, it is important to keep in mind that socially assistive robotics involves at least three distinct but connected phases: physical robot design, human–robot interaction design, and evaluation of robots in therapy-like settings (Scassellati, Admoni, & Matarić, 2012). Moreover, we focus on two abilities, imitation and joint attention, because they are important during the development of the child (Jones, 2009, 2007; Carpenter et al., 1998; Tomasello & Farrar, 1986) and are core deficits in ASD (Dawson et al., 2009). To address these abilities from the point of view of both developmental psychology and social signal processing, we review the available literature on robotics and ASD, differentiating between four lines of research: (1) exploring the response of children with ASD to robotic platforms; (2) using a robot to elicit behaviors; (3) modeling or teaching a skill; and (4) providing feedback to children with ASD.

Robotics and Children with Autism


There have been an increasing number of clinical studies since 2000 that have used
robots to treat individuals with ASD. The robot can have two roles in the interven-
tion, namely practice and reinforcement (Duquette, Michaud, & Mercier, 2008). At
least two reviews of the literature have been conducted recently (Scassellati et al., 2012;
Diehl et al., 2012). Here, we choose to follow the organization proposed by Diehl and colleagues because it fits the main focus of our study regarding imitation and joint attention. Diehl et al. (2012) distinguished four different categories of studies. The first com-
pares the responses of individuals with ASD to humans, robots, or robot-like behav-
ior. The second assesses the use of robots to elicit behaviors that should be promoted
with regard to ASD impairments. The third uses robotics systems or robots to model,
teach and practice a skill with the aim of enhancing this skill in the child. The last uses
robots to provide feedback on performance during therapeutic sessions or in natural
environments.

Response to Robots or Robot-like Characteristics


Although most of the research in this field has been based on short series or case reports, authors have emphasized the appealing effects of using robots to treat individuals with ASD. If we assume that individuals with ASD prefer robots or robot-like characteris-
tics to human characteristics or non-robotic objects, we may wonder why individuals
with ASD prefer robots and what, in particular, is appealing about these characteristics.
Pioggia et al. (2005) compared a child with ASD to a typically developing control child
for his/her behavioral and physiological responses to a robotic face. The child with ASD
did not have an increase in heart rate in response to the robotic face, which implies that
the robotic face did not alarm the child. In contrast, the control child spontaneously
observed the robot with attention and expressed positive reactions to it; however, when
the robot’s facial movements increased, the typical child became uncomfortable and
exhibited an increased heart rate. In a case series, the same authors (Pioggia et al., 2008)
compared the responses of ASD children to the robotic face versus human interaction;
most individuals with ASD showed an increase in social communication, some showed
no change, and one showed a decrease when he interacted with the robotic face.
Feil-Seifer and Matarić (2011) showed, in a group of eight children with ASD, that there was tremendous variability in the valence of the affective response toward a mobile robot, depending on whether the robot’s behavior was contingent on the participant or
random. In this study, the robot automatically distinguished between positive and neg-
ative reactions of children with ASD. Individual affective responses to the robots were
indeed highly variable. Some studies (Dautenhahn & Werry, 2004; Robins, Dautenhahn,
& Dubowski, 2006) have shown that, for some children with ASD, there is a preference
for interacting with robots compared to non-robotic toys or human partners. However,
Dautenhahn and Werry (2004) found individual differences in whether children with
ASD preferred robots to non-robotic toys. Two of the four participants exhibited more
eye gazes toward the robot and more physical contact with the robot than with a toy.
Other studies have investigated motion. Bird et al. (2007) found a speed advantage
in adults with ASD when imitating robotic hand movements compared to human hand
movements. In the same vein, Pierno et al. (2008) reported that children with ASD made
significantly faster movements to grasp a ball when they observed a robotic arm perform
the movement compared to a human arm. In contrast, typically developing children
showed the opposite effect. Therefore, these two studies suggest increased imitation
speed with robot models compared to human models (Bird et al., 2007; Pierno et al.,
2008).
Additionally, some studies have investigated the responses of children with ASD
when exposed to emotional stimuli. Nadel et al. (2006) and Simon et al. (2007) explored
the responses of 3- and 5-year-old children to emotional expressions produced by a robot
or a human actor. Two types of responses were considered, which were: automatic facial
movements produced by the children facing the emotional expressions (emotional reso-
nance) and verbal naming of the emotions expressed (emotion recognition). Both studies concluded that, after exposure to the robot, an overall increase in performance occurred
with age, as well as easier recognition of human expressions (Nadel et al., 2006; Simon
et al., 2007). This result is encouraging from a remediation perspective in which an
expressive robot could help children with autism express their emotions without human
face-to-face interaction. Finally, Chaminade et al. (2012) investigated the neural bases
of social interactions with a human or with a humanoid robot using fMRI and com-
pared male controls (N = 18, mean age = 21.5 years) to patients with high-functioning
autism (N = 12, mean age = 21 years). The results showed that in terms of activation,
interacting with a human was more engaging than interacting with an artificial agent.
Additionally, areas involved in social interactions in the posterior temporal sulcus were
activated when controls, but not subjects with high-functioning autism, interacted with a fellow human.

Robots can be Used to Elicit Behavior


Some theoretical works have highlighted several potential uses of a robot for diagnos-
tic purposes (Scassellati, 2007; Tapus, Matarić, & Scassellati, 2007). For example, a
robot could provide a set of social cues designed to elicit social responses for which the
presence, absence, or quality of response is helpful during diagnostic assessment.
In Feil-Seifer and Matarić (2009) the robot could be programmed to take the role of
a bubble gun. The robot produces bubbles to elicit an interaction between the child and
the examiner. Additionally, the robot can act as a sensor and provide measurements
of targeted behaviors (Scassellati, 2007; Tapus et al., 2007). These measurements may
be used to diagnose the disorder and to rate its severity on one or several dimensions. The robot could record behaviors and translate them into quantitative measurements. Additionally, interaction between a robot and a child has been used to
elicit and analyze perseverative speech in one individual with high-functioning ASD
(Stribling, Rae, & Dickerson, 2009). Interaction samples were collected from previous
studies in which the child interacted with a robot that imitated the child’s behavior.
Here, the robot–child interaction was used to collect samples of perseverative speech on which to conduct conversation analysis. This study suggested that robot–
child interactions might be useful to elicit characteristic behaviors such as perseverative
speech.
Finally, the robot can be used to elicit prosocial behaviors. Robots can provide inter-
esting visual displays or respond to a child’s behavior in the context of a therapeutic
interaction. Consequently, the robot could encourage a desirable or prosocial behav-
ior (Dautenhahn, 2003; Feil-Seifer and Matarić, 2009). For example, the robot’s behav-
ior could be used to elicit joint attention: the robot could be the object of shared attention (Dautenhahn, 2003), or it could provoke joint attention by looking at an object elsewhere in the same visual scene and “asking” the child with ASD to follow its gaze or head direction. In another study, Ravindra et al. (2009) showed that individ-
uals with ASD are able to follow social referencing behaviors performed by a robot.
This study shows that social referencing is possible, but the results are not quantitative.
Other studies (Robins et al., 2005; François, Powell, & Dautenhahn, 2009) have tried
to elicit prosocial behavior, such as joint attention and imitation. However, the results
were not robust because of the small sample size of children with ASD in these studies.
Finally, several studies aimed to assess whether interaction among a child with ASD, a robot, and a third interlocutor can elicit prosocial behaviors (Costa et al., 2010;
Kozima, Nakagawa, & Yasuda, 2007; Wainer et al., 2010). Unfortunately, no conclusion
could be drawn due to their small sample sizes and the significant individual variation
in the response to the robot.

Robots can be Used to Model, Teach, or Practice a Skill


Here, the theoretical point of view is to create an environment in which a robot can
model specific behaviors for a child (Dautenhahn, 2003) or the child can practice spe-
cific skills with the robot (Scassellati, 2007). The aim is to teach a skill that the child
can imitate or learn and eventually transfer to interactions with humans. In this case,
the robot is used to simplify and facilitate social interaction. The objective of Duquette
et al. (2008) was to explore whether a mobile robot toy could facilitate reciprocal social
interaction in cases where the robot was more predictable, attractive, and simple. The
exploratory experimental set-up involved two pairs of children with autism, one pair interacting with the robot and the other pair interacting with the experimenter. The results
showed that imitations of body movements and actions were more numerous in children
interacting with humans compared to children interacting with the robot. In contrast,
the two children interacting with the robot had better shared attention (eye contact and
physical proximity) and were better able to mimic facial expressions than the children
interacting with a human partner. Fujimoto et al. (2011) used techniques for mimicking
and evaluating human motions in real time using a therapeutic humanoid robot. Practi-
cal experiments were performed to test the interaction of children with ASD with robots and to evaluate the improvement of the children’s imitation skills.

Robots can be Used to Provide Feedback and Encouragement


Robots can also be used to provide feedback and encouragement during a skill learning
intervention because individuals with ASD might prefer the use of a robot rather than a
human as a teacher for skills. Robots can have human-like characteristics. For example,
they can mimic human sounds or more complex behaviors. The social capabilities of
robots could improve the behavior of individuals with ASD vis-à-vis the social world.
The robot could also take on the role of a social mediator in social exchanges between
children with ASD and partners because robots can provide feedback and encourage-
ment (Dautenhahn, 2003). In this approach, the robot would encourage a child with
ASD to interact with an interlocutor. The robot would provide instruction for the child
to interact with a human therapist and encourage the child to proceed with the interac-
tion. This approach remains theoretical, however, as no such studies have yet been conducted. Nevertheless, some attempts at using robots to reward behaviors have been made. Duquette et al. (2008) used the robot to reward the child’s behavior. For example,
if a child was successful in imitating a behavior, the robot provided positive reinforce-
ment by raising its arms and saying, ‘Happy’. Additionally, the robot could respond to internal signals from the child; for example, the signals generally used in biofeedback (e.g., pulse and respiratory frequency) could be used as indicators of the affective state
or arousal level of the child to increase the individualized nature of the treatment (Picard,
2010). This capability could be useful to provide children with feedback about their own
emotional states or to trigger an automatic redirection response when a child becomes
disinterested (Liu et al., 2008).
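As a purely illustrative sketch of such a biofeedback loop, the code below flags disengagement when recent heart rate falls well below a running baseline and then asks a hypothetical robot interface to issue a redirection prompt. The sensor call, the threshold values, and the robot API are all assumptions, not part of any system cited above.

```python
# Illustrative sketch of a biofeedback-driven redirection loop. The heart-rate
# stream, the disengagement heuristic (a sustained drop relative to a running
# baseline), and the robot interface are all hypothetical.
from collections import deque

def should_redirect(heart_rates, baseline_window=60, drop_ratio=0.85):
    """Flag disengagement when the recent heart rate falls well below baseline."""
    if len(heart_rates) < baseline_window + 10:
        return False
    values = list(heart_rates)
    baseline = sum(values[:baseline_window]) / baseline_window
    recent = sum(values[-10:]) / 10
    return recent < drop_ratio * baseline

def session_loop(sensor, robot, max_samples=600):
    """Poll a (hypothetical) sensor and trigger a (hypothetical) robot prompt."""
    history = deque(maxlen=300)
    for _ in range(max_samples):
        history.append(sensor.read_heart_rate())   # hypothetical sensor call
        if should_redirect(history):
            robot.prompt("Look at me!")            # hypothetical robot call
            history.clear()                        # restart the baseline afterwards

# Synthetic demo: a steady baseline followed by a sustained drop trips the flag.
samples = deque([80.0] * 70 + [65.0] * 10, maxlen=300)
print(should_redirect(samples))   # expected: True
```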

Conclusions and Main Challenges

In this chapter, we reviewed work on social signal processing and socially assistive robotics in developmental disorders. Through this review, we identified several issues that should be addressed by researchers in these domains.
The first issue, and surely the most important for the general public and families, relates to the treatment of these pathologies. Recent years have witnessed the emergence of ICT-based approaches and methods for the therapy and education of children with ASD. Individu-
als with autism have lately been included as the main focus in the area of affective com-
puting (Kaliouby et al., 2006). Technologies, algorithms, interfaces and sensors that can
sense emotions or express them and thereby influence the users’ behavior (here indi-
viduals with ASD) have been continuously developed. Working closely with persons
with ASD has led to the development of various significant methods, applications, and
technologies for emotion recognition and expression. However, many improvements are
needed to attain significant success in treating individuals with autism, which depends
on practical and clinical aspects. From the practical perspective, many of the existing
technologies have limited performance and thus limit the success of therapeutic approaches for children with ASD. This is especially significant for wear-
able hardware sensors that can provide feedback from the individuals with ASD during
the therapeutic session. More studies must be performed to generate a reliable emo-
tional, attentional, behavioral, or other type of feedback that is essential for tailoring special education methods to better suit people with autism. Clinically, most of the ICT
proposals have not been validated outside the context of proof of concept studies. More
studies should be performed to assess whether ICT architectures, devices, or robots are
clinically relevant over long periods of time.
The second issue is related to machine understanding of typical and autistic behaviors. Indeed, being able to provide insights into the underlying mechanisms of social situations will be of great benefit to various domains, including psychology and social science. In Segalin et al. (2013), an interesting feature selection framework is employed to identify features relevant for the characterization of children’s pragmatic skills. This framework not only enables automatic assessment but also makes it possible to identify micro-behaviors that are difficult for psychologists to perceive. In addition, computational models can explicitly take interaction into account during processing and modeling, as in Delaherche et al. (2013) for coordination assessment. In this particular case, a striking result was found: it was possible to predict the diagnosis and developmental age of children given only the behaviors of the therapists. Social signal processing is a promising tool for the study of communication and interaction in children with ASD if it proposes models that can be interpreted and shared with nonexperts in the field (Weisman et al., 2013; Pantic et al., 2006). Boucenna et al. (2014) have shown that socially aware robotics combined with machine learning techniques can provide useful insights into how children with ASD perform motor imitation. Metrics provided by these computational approaches are of great help in clinical investigations.
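As a toy illustration of the kind of feature-relevance analysis mentioned above, the sketch below ranks a handful of hypothetical interaction features by their mutual information with a group label. It is not a reproduction of the hybrid-classifier framework of Segalin et al. (2013); the feature names, labels, and data are synthetic.

```python
# Toy sketch of ranking dyadic-interaction features by relevance to a clinical
# label. Feature names, labels, and data are synthetic placeholders; this is not
# the hybrid-classifier framework of Segalin et al. (2013).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
feature_names = ["turn_duration", "overlap_rate", "pause_length", "pitch_range"]
y = rng.integers(0, 2, size=60)            # synthetic group label (0 = TD, 1 = ASD)
X = rng.normal(size=(60, len(feature_names)))
X[:, 1] += 0.8 * y                         # make one synthetic feature informative

scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(feature_names, scores), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```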
The third issue is related to databases, since very few databases are publicly available for research, for obvious ethical reasons. The USC CARE Corpus was recently
proposed to study children with autism in spontaneous and standardized interactions
and develop analytical tools to enhance the manual rating tools of psychologists (Black
et al., 2011). In Rehg et al. (2013), a corpus of children interacting with a parent and a
therapist is introduced. The focus of this work is to promote behavior imaging, which
can be easily related to SSP (Pentland et al., 2009). The research community should also
promote challenges dedicated to impaired situations (Schuller et al., 2013).

Acknowledgments

This work was supported by the UPMC “Emergence 2009” program, the European
Union Seventh Framework Programme under grant agreement no. 288241, and the Agence
Nationale de la Recherche (SAMENTA program: SYNED-PSY). This work was performed within the Labex SMART supported by French state funds managed by the ANR
within the Investissements d’Avenir program under reference ANR-11-IDEX-0004-02.

References

Altmann, U. (2011). Studying movement synchrony using time series and regression models. In I.
A. Esposito, R. Hoffmann, S. Hübler, & B. Wrann (Eds), Program and Abstracts of the COST
2012 Final Conference Held in Conjunction with the 4th COST 2012 International Training
School on Cognitive Behavioural Systems (p. 23).
Ashenfelter, K. T., Boker, S. M., Waddell, J. R., & Vitanov, N. (2009). Spatiotemporal symmetry
and multifractal structure of head movements during dyadic conversation. Journal of Experi-
mental Psychology: Human Perception and Performance, 35(4), 1072–1091.
Baron-Cohen, S. & Wheelwright, S. (1999). “Obsessions” in children with autism or Asperger
syndrome: Content analysis in terms of core domains of cognition. The British Journal of
Psychiatry, 175(5), 484–490.
Bernieri, F. J., Reznick, J. S., & Rosenthal, R. (1988). Synchrony, pseudo synchrony, and dissyn-
chrony: Measuring the entrainment process in mother–infant interactions. Journal of Personal-
ity and Social Psychology, 54(2), 243–253.
Bird, G., Leighton, J., Press, C., & Heyes, C. (2007). Intact automatic imitation of human and
robot actions in autism spectrum disorders. Proceedings of the Royal Society B: Biological
Sciences, 274(1628), 3027–3031.
Black, M. P., Bone, D., Williams, M. E., et al. (2011). The USC CARE Corpus: Child–
psychologist interactions of children with autism spectrum disorders. In: Proceedings of Inter-
Speech (pp. 1497–1500).
Boker, S. M., Xu, M., Rotondo, J. L., & King, K. (2002). Windowed cross-correlation and peak
picking for the analysis of variability in the association between behavioral time series. Psy-
chological Methods, 7(3), 338–355.
Boucenna, S., Anzalone, S., Tilmont, E., Cohen, D., & Chetouani, M. (2014). Learning of social
signatures through imitation game between a robot and a human partner. IEEE Transactions on
Autonomous Mental Development, 6(3), 213–225.
Campbell, N. (2008). Multimodal processing of discourse information: The effect of syn-
chrony. In Proceedings of 2008 Second International Symposium on Universal Communication
(pp. 12–15).
Carpendale, J. I. M. & Lewis, C. (2004). Constructing an understanding of the mind: The devel-
opment of children’s social understanding within social interaction. Behavioral and Brain Sci-
ences, 27, 79–151.
Carpenter, M., Nagell, K., Tomasello, M., Butterworth, G., & Moore, C. (1998). Social cognition,
joint attention, and communicative competence from 9 to 15 months of age. Monographs of the
Society for Research in Child Development, 63(4), 1–143.
Chaby, L., Chetouani, M., Plaza, M., & Cohen, D. (2012). Exploring multimodal social-
emotional behaviors in autism spectrum disorders. In Workshop on Wide Spectrum Social
Signal Processing, 2012 ASE/IEEE International Conference on Social Computing (pp. 950–
954).
Chamak, B., Bonniau, B., Jaunay, E., & Cohen, D. (2008). What can we learn about autism from
autistic persons? Psychotherapy and Psychosomatics, 77, 271–279.
Chaminade, T., Da Fonseca, D., Rosset, D., et al. (2012). FMRI study of young adults with autism
interacting with a humanoid robot. In Proceedings of the 21st IEEE International Symposium
on Robot and Human Interactive Communication (pp. 380–385).
Cohen, D. (2012). Traumatismes et traces: donnés expérimentales. Neuropsychiatrie de l’Enfance
et de l’Adolescence, 60, 315–323.
Cohen, D., Cassel, R. S., Saint-Georges, C., et al. (2013). Do parentese prosody and fathers’
involvement in interacting facilitate social interaction in infants who later develop autism?
PLoS ONE, 8(5), e61402.
Costa, S., Santos, C., Soares, F., Ferreira, M., & Moreira, F. (2010). Promoting interaction
amongst autistic adolescents using robots. In Proceedings of 2010 Annual International Con-
ference of the IEEE Engineering in Medicine and Biology (pp. 3856–3859).
Dautenhahn, K. (2003). Roles and functions of robots in human society: Implications from
research in autism therapy. Robotica, 21(4), 443–452.
Dautenhahn, K. & Werry, I. (2004). Towards interactive robots in autism therapy: Background,
motivation and challenges. Pragmatics & Cognition, 12(1), 1–35.
Dawson, G., Rogers, S., Munson, J., et al. (2009). Randomized, controlled trial of an intervention
for toddlers with autism: The Early Start Denver model. Pediatrics, 125(1), 17–23.
Decety, J. & Jackson, P. (2004). The functional architecture of human empathy. Behavioral and
Cognitive Neuroscience Reviews, 3(2), 71–100.
Delaherche, E. & Chetouani, M. (2010). Multimodal coordination: Exploring relevant features
and measures. In Proceedings of the 2nd International Workshop on Social Signal Processing
(pp. 47–52).
Delaherche, E., Chetouani, M., Bigouret, F., et al. (2013). Assessment of the communicative
and coordination skills of children with autism spectrum disorders and typically developing
children using social signal processing. Research in Autism Spectrum Disorders, 7(6), 741–
756.
Delaherche, E., Chetouani, M., Mahdhaoui, M., et al. (2012). Interpersonal synchrony: A survey
of evaluation methods across disciplines. IEEE Transactions on Affective Computing, 3(3),
349–365.
Demouy, J., Plaza, M., Xavier, J., et al. (2011). Differential language markers of pathology in
autism, pervasive developmental disorder not otherwise specified and specific language impair-
ment. Research in Autism Spectrum Disorders, 5(4), 1402–1412.
Diehl, J., Schmitt, L. M., Villano, M., & Crowell, C. R. (2012). The clinical use of robots for
individuals with autism spectrum disorders: A critical review. Research in Autism Spectrum
Disorders, 6(1), 249–262.
Duquette, A., Michaud, F., & Mercier, H. (2008). Exploring the use of a mobile robot as an
imitation agent with children with low-functioning autism. Autonomous Robots, 24(2), 147–
157.
Feil-Seifer, D. & Matarić, M. J. (2009). Toward socially assistive robotics for augmenting inter-
ventions for children with autism spectrum disorders. In O. Khatib, V. Kumar, & G. Pappas
(Eds), Experimental Robotics (vol. 54, pp. 201–210). Berlin: Springer.
Feil-Seifer, D. & Matarić, M. J. (2011). Automated detection and classification of positive vs.
negative robot interactions with children with autism using distance-based features. In 6th
ACM/IEEE International Conference on Human–Robot Interaction (pp. 323–330).
François, D., Powell, S., & Dautenhahn, K. (2009). A long-term study of children with autism
playing with a robotic pet: Taking inspirations from non-directive play therapy to encourage
children’s proactivity and initiative-taking. Interaction Studies, 10(3), 324–373.
Fujimoto, I., Matsumoto, T., De Silva, P. R. S., Kobayashi, M., & Higashi, M. (2011). Mimicking
and evaluating human motion to improve the imitation skill of children with autism through a
robot. International Journal of Social Robotics, 3(4), 349–357.
Howlin, P., Baron-Cohen, S., & Hadwin, J. (1999). Teaching Children with Autism to Mind-Read:
A Practical Guide for Teachers and Parents. New York: John Wiley & Sons.
Jones, S. (2007). Imitation in infancy the development of mimicry. Psychological Science, 18(7),
593–599.
Jones, S. (2009). The development of imitation in infancy. Philosophical Transactions of the
Royal Society B: Biological Sciences, 364(1528), 2325.
Kaliouby, R., Picard, R., & Baron-Cohen, S. (2006). Affective computing and autism. Annals of
the New York Academy of Sciences, 1093, 228–248.
Kendon, A. (1970). Movement coordination in social interaction: Some examples described. Acta
Psychologica, 32, 100–125.
Kjelgaard, M. & Tager-Flusberg, H. (2001). An investigation of language impairment in autism:
Implications for genetic subgroups. Language and Cognitive Processes, 16(2–3), 287–308.
Konstantinidis, E. I., Luneski, A., Frantzidis, C. A., Pappas, C., & Bamidis, P. D. (2009). A pro-
posed framework of an interactive semi-virtual environment for enhanced education of children
with autism spectrum disorders. In Proceedings of the 22nd IEEE International Symposium on
Computer-Based Medical Systems (pp. 1–6).
Kozima, H., Nakagawa, C., & Yasuda, Y. (2007). Children-robot interaction: A pilot study in
autism therapy. Progress in Brain Research, 164, 385–400.
Liu, C., Conn, K., Sarkar, N., & Stone, W. (2008). Physiology-based affect recognition for
computer-assisted intervention of children with autism spectrum disorder. International Jour-
nal of Human-Computer Studies, 66(9), 662–677.
Mahdhaoui, A., Chetouani, M., Cassel, R. S., et al. (2011). Computerized home video detection
for motherese may help to study impaired interaction between infants who become autistic and
their parents. International Journal of Methods in Psychiatric Research, 20(1), e6–e18.
Michelet, S., Karp, K., Delaherche, E., Achard, C., & Chetouani, M. (2012). Automatic imitation
assessment in interaction. Lecture Notes in Computer Science, 7559, 161–173.
Mitchell, P., Parsons, S., & Leonard, A. (2006). Using virtual environments for teaching social
understanding to 6 adolescents with autistic spectrum disorders. Journal of Autism and Devel-
opmental Disorders, 3(37), 589–600.
Murray, D. (1997). Autism and information technology: Therapy with computers. In S. Powell
& R. Jordan (Eds), Autism and Learning: A Guide to Good Practice (pp. 100–117). London:
David Fulton.
Nadel, J., Simon, M., Canet, P., et al. (2006). Human responses to an expressive robot. In Pro-
ceedings of the Sixth International Workshop on Epigenetic Robotics (pp. 79–86).
Narzisi, A., Muratori, F., Calderoni, S., Fabbro, F., & Urgesi, C. (2012). Neuropsychological profile in high functioning autism spectrum disorders. Journal of Autism and Developmental
Disorders, 43(8), 1895–1909.
Ospina, M. B., Seida, J. K., Clark, B., et al. (2008). Behavioural and developmental interventions
for autism spectrum disorder: a clinical systematic review. PLoS ONE, 3(11): e3755.
Pantic, M., Pentland, A., Nijholt, A., & Huang, T. (2006). Human computing and machine under-
standing of human behavior: A survey. In Proceedings of the 8th International Conference on
Multimodal Interfaces (pp. 239–248).
Papert, S. (1980). Mindstorms: Children, Computers, and Powerful Ideas. New York: Basic
Books.
Pentland, A., Lazer, D., Brewer, D., & Heibeck, T. (2009). Using reality mining to improve public
health and medicine. Studies in Health Technology and Informatics, 149, 93–102.
Picard, R. (2010). Emotion research by the people, for the people. Emotion Review, 2(3), 250–254.
Pierno, A., Mari, M., Lusher, D., & Castiello, U. (2008). Robotic movement elicits visuomotor
priming in children with autism. Neuropsychologia, 46(2), 448–454.
Pioggia, G., Igliozzi, R., Ferro, M., et al. (2005). An android for enhancing social skills and
emotion recognition in people with autism. IEEE Transactions on Neural Systems and Reha-
bilitation Engineering, 13(4), 507–515.
Pioggia, G., Igliozzi, R., Sica, M. L., et al. (2008). Exploring emotional and imitational android-
based interactions in autistic spectrum disorders. Journal of CyberTherapy & Rehabilitation,
1(1), 49–61.
Rajendran, G. & Mitchell, P. (2000). Computer mediated interaction in Asperger’s syndrome: The
Bubble Dialogue program. Computers and Education, 35, 187–207.
Ramseyer, F. & Tschacher, W. (2011). Nonverbal synchrony in psychotherapy: Coordinated body
movement reflects relationship quality and outcome. Journal of Consulting and Clinical Psy-
chology, 79(3), 284–295.
Ravindra, P., De Silva, S., Tadano, K., et al. (2009). Therapeutic-assisted robot for children with
autism. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and
Systems (pp. 3561–3567).
Rehg, J. M., Abowd, G. D., Rozga, A., et al. (2013). Decoding children’s social behavior. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3414–
3421).
Ringeval, F., Demouy, J., Szaszák, G., et al. (2011). Automatic intonation recognition for the
prosodic assessment of language impaired children. IEEE Transactions on Audio, Speech and
Language Processing, 19(5), 1328–1342.
Robins, B., Dautenhahn, K., & Dubowski, J. (2006). Does appearance matter in the interaction of
children with autism with a humanoid robot? Interaction Studies, 7(3), 509–542.
Robins, B., Dautenhahn, K., Te Boekhorst, R., & Billard, A. (2005). Robotic assistants in therapy
and education of children with autism: Can a small humanoid robot help encourage social
interaction skills? Universal Access in the Information Society, 4(2), 105–120.
Saint-Georges, C., Chetouani, M., Cassel, R., et al. (2013). Motherese in interaction: At the cross-
road of emotion and cognition? (A systematic review). PLoS ONE, 8(10), e78103.
Saint-Georges, C., Mahdhaoui, A., Chetouani, M., et al. (2011). Do parents recognize autistic
deviant behavior long before diagnosis? Taking into account interaction using computational
methods. PLoS ONE, 6(7), e22393.
Scassellati, B. (2007). How social robots will help us to diagnose, treat, and understand autism. In
S. Thrun, R. Brooks, & H. Durrant-Whyte (Eds), Robotics Research (pp. 552–563). London:
Springer.
Scassellati, B., Admoni, H., & Matarić, M. (2012). Robots for use in autism research. Annual
Review of Biomedical Engineering, 14, 275–294.
Schuller, B., Steidl, S., Batliner, A., et al. (2013). The InterSpeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings of the Annual Con-
ference of the International Speech Communication Association (pp. 148–152).
Segalin, C., Pesarin, A., Vinciarelli, A., Tait, M., & Cristani, M. (2013). The expressivity of turn-
taking: Understanding children pragmatics by hybrid classifiers. In Proceedings of the 14th
International Workshop on Image Analysis for Multimedia Interactive Services (pp. 1–4).
Simon, M., Canet, P., Soussignan, R., Gaussier, P., & Nadel, J. (2007). L’enfant face à des expres-
sions robotiques et humaines. Enfance, 59(1), 59–70.
Stribling, P., Rae, J., & Dickerson, P. (2009). Using conversation analysis to explore the recur-
rence of a topic in the talk of a boy with an autism spectrum disorder. Clinical Linguistics &
Phonetics, 23(8), 555–582.
Sun, X., Truong, K., Nijholt, A., & Pantic, M. (2011). Automatic visual mimicry expression anal-
ysis in interpersonal interaction. In Proceedings of IEEE International Conference on Com-
puter Vision and Pattern Recognition: Workshop on CVPR for Human Behaviour Analysis
(pp. 40–46).
Tager-Flusberg, H. (1981). On the nature of linguistic functioning in early infantile autism. Jour-
nal of Autism and Developmental Disorders, 11, 45–56.
Tapus, A., Matarić, M., & Scassellati, B. (2007). Socially assistive robotics. IEEE Robotics and
Automation Magazine, 14(1), 35–42.
Tomasello, M. & Farrar, M. (1986). Joint attention and early language. Child Development, 57(6),
1454–1463.
Vannetzel, L., Chaby, L., Cautru, F., Cohen, D., & Plaza, M. (2011). Neutral versus emotional
human stimuli processing in children with pervasive developmental disorders not otherwise
specified. Research in Autism Spectrum Disorders, 5(2), 775–783.
Varni, G., Volpe, G., & Camurri, A. (2010). A system for real-time multimodal analysis of nonver-
bal affective social interaction in user-centric media. IEEE Transactions on Multimedia, 12(6),
576–590.
Vinciarelli, A., Pantic, M., & Bourlard, H. (2009). Social signal processing: Survey of an emerg-
ing domain. Image and Vision Computing, 27(12), 1743–1759.
Wainer, J., Ferrari, E., Dautenhahn, K., & Robins, B. (2010). The effectiveness of using a robotics
class to foster collaboration among groups of children with autism in an exploratory study.
Personal and Ubiquitous Computing, 14(5), 445–455.
Weisman, O., Delaherche, E., Rondeau, M., et al. (2013). Oxytocin shapes parental motion during
father–infant interaction. Biology Letters, 9(6).
29 Social Signals of Deception and Dishonesty
Judee K. Burgoon, Dimitris Metaxas, Thirimachos Bourlai, and Aaron Elkins

Social life is constituted of interactions with others – others whom we must rapidly clas-
sify as friend or foe, as trustworthy or not. Gauging another’s trustworthiness relies on
successfully reading nonverbal signals – signals that have been selected through human
evolution to serve precisely such a communicative function. These deeply ingrained
signals – some part of our phylogenetic heritage and some part of our socially con-
structed communication system – form the unwritten “order” for cooperative encoun-
ters, enabling both individuals and societies to survive and thrive. Yet paradoxically,
the same course of evolution has also remunerated, with greater prospects for survival,
those who manipulate and falsely manufacture such signals; in short, those who cheat,
dissemble and deceive. Put differently, the course of human development has produced
a system of presumably reliable signals of veracity, authenticity, trust and trustworthi-
ness, while simultaneously conferring advantages on sham portrayals of those same
signals. The use of dishonest signals is not confined to humans; natural selection has
also rewarded sophisticated forms of cheating among all manner of living organisms
(Greenfield, 2006). Consequently, these nonverbal signals, many of which are universal
and have similarities among other species, are among the most important for humans to
produce and read and the most useful for computational methods to detect and track.
In what follows, we foreground those aspects of social signaling related to veracity
that have widespread use and recognition. These are the kinds of signals that Burgoon
and Newton (1991) identified as corresponding to a social meaning model in that they
are recurrent expressions that have consensually recognized meanings within a given
community. In this chapter, we first provide the reader background on the nature of
the aforementioned signals. Next, we discuss automated methods for human nonverbal
communication computing, i.e., methods we used for identifying and tracking such sig-
nals. Then, we discuss computer vision technologies using sensors operating in different wavelengths of the infrared band, rather than only the conventionally used visible band.
where the latest technologies can be applied to this elemental aspect of social signaling.

Deception Defined

Most scholars converge on a definition of deception as a deliberate act, whether successful or not, by which perpetrators knowingly transmit messages (verbal or nonverbal)
that mislead others by fostering impressions, beliefs, or understandings that the sender
believes to be false (Buller & Burgoon, 1994; Ekman, 1992; Knapp & Comadena, 1979;
Masip, Garrido, & Herrero, 2004). Deception may occur through acts of commission or
omission. It encompasses not just outright fabrications but also a variety of forms such
as evasions, exaggerations, equivocations, concealments, white lies, and the like. Thus
deception is a far broader phenomenon than just lies. Although the term “deception”
may conjure up words and language, it is not confined to the verbal side of communica-
tion; it can be accomplished through nonverbal signals alone or through nonverbal sig-
nals that accompany and transform verbal messages into duplicity. In the animal king-
dom, deceit may be accomplished through such diverse means as camouflage, mimesis,
lures, feints, bluffs, deimatic displays and distractions; humans may similarly use dis-
guises, mimicry, decoys, legerdemain, diversions and misdirection, among other means
of duplicity. The primacy that these signals command in intra-species encounters war-
rants a delineation of their forms and functions and the methods by which they can be
captured computationally.

Nonverbal Codes Involved in Deceptive and Dishonest Signaling

Beginning with the invention of the polygraph in 1921, practitioners and researchers
have pursued a variety of tools and methods to detect deception. Yet over a half century
of scientific research dedicated to investigating possible signals of deception has led
to the overwhelming conclusion that there is no single surefire measure of deceit and
that many of the possible signs are too weak, unreliable, or context-dependent to serve as
valid indicators (see, e.g., DePaulo et al., 2003; Hartwig & Bond, 2011, 2014). That said,
numerous indicators associated with deception continue to be investigated and, with
the advent of more sophisticated technologies and methods, are producing meaningful
results.
In this chapter we focus exclusively on those associated with visual signals that can
be detected and tracked with computer image and video analyses methods. These indi-
cators include those falling under the nonverbal codes of kinesics, oculometrics, prox-
emics, and physical appearance as well as physiologically based reactions that have
outward manifestations and are sometimes grouped with nonverbal behaviors. Kinesics
refers to body movements and includes head, face, trunk and limb movements such as
head nods, tilts and shakes; facial expressions, facial emotions and hand to face or body
touches; sitting and standing postures; gestures; and gait. Oculometrics is frequently
subsumed under kinesics and includes gaze patterns, gaze fixations, blinking and pupil
dilation. Proxemics refers to spatial and distancing patterns such as sitting and stand-
ing proximity, lean and body orientation. Physical appearance includes all aspects of
natural body features (e.g., facial structure, hair, body type, height, weight, skin color),
grooming and adornments (e.g., clothing, cosmetics, tattoos, hair ornaments, jewelry).
All of these nonverbal codes can be measured through computer imaging techniques.
Other nonverbal codes that are outside the scope of this chapter but can also be enlisted
to deceive others include vocalics (voice features), haptics (forms of touch), and use of
personal artifacts. The interested reader is directed to Elkins et al. (2015), Rockwell,
Buller, and Burgoon (1997), and Schuller (2013) for extensive research on the voice
of deceit. The use of disguises and other forms of artifacts that can facilitate imposter-
ship or hiding one’s true identity, although often considered forms of nonverbal com-
munication, have had far less systematic research dedicated to their detection; discus-
sions of these can be found in many basic textbooks on nonverbal communication (e.g.,
Burgoon, Guerrero, & Floyd, 2010).

Classes of Nonverbal Deception Indicators

Biological versus Social Signals


Buck (1988) and Buck and VanLear (2002) have differentiated nonverbal behaviors
according to whether they are biologically derived or socially derived signals. Biolog-
ical signals are ones that are naturally occurring, nonpropositional, spontaneous and
automatic. They may include emotional expressions, reflexes, involuntary reactions and
other physiologically based displays such as perspiration or pupil constriction/dilation.
Social signals are ones that are symbolic (artificially created for communicative pur-
poses), learned, intentional and socially shared. Pseudo-symbolic signals are ones that
are biological in origin but intentionally manipulated for strategic purposes. These
distinctions are useful in many respects when trying to infer intent and to determine
whether apparently spontaneous expressions are high-fidelity signs of internal states or
have been manipulated to feign alternative states. That is, they are relevant to the inter-
pretation of signals and inferring the motivations behind them. They also hold relevance
for distinguishing among similar visual cues that reflect felt versus faked states, such
as true smiles and false smiles. However, for our purposes we will include all under the
umbrella of social signals inasmuch as they are used among conspecifics to gauge one
another’s states and intentions.
Burgoon (2005) grouped nonverbal signals of deceit into five categories. We begin
with those same groupings here, which correspond to more traditional views of decep-
tion signals. We then introduce an alternative approach based on a communication per-
spective. Deception indicators and patterns are best understood as working in compos-
ites of uses and as probabilistic, with alternative predictions depending on context.

Arousal Signals
Deception has routinely been associated with physiological arousal, and the polygraph is
predicated on the assumption that lying gives rise to several forms of measurable cardiovascular,
respiratory, and electrodermal responses such as increased heart rate, faster breathing
and perspiration associated with anxiety or stress. Ekman and Friesen (1969) called
these changes “leakage” because they are unintended, uncontrolled, or uncontrollable
physiological reactions that “leak” out of the body as clues to deception or as telltale
signs of the true state of affairs (i.e., anxiety). Zuckerman, DePaulo, and Rosenthal
(1981) regarded arousal as one of four etiologies of deception displays. Buller and
Burgoon (1994) considered arousal one aspect of nonstrategic (unintended) displays
associated with decrements in communicative performance during deceit.
Although arousal can be measured with contact instruments, such as the polygraph
or electroencephalograph, it can also be detected through noncontact observation by
humans or computer imaging. Thermal imaging can detect changes in blood perfu-
sion in the periorbital and perinasal regions when people lie (Dcosta et al., 2015).
Other observable behavioral changes associated with deception include small hand fid-
gets; increased rigidity and frozen head and posture; impassivity of the face; reduced
illustrative hand gestures but more self-touch gestures including hands touching the
face and head (known as adaptor behaviors); and lip presses (Burgoon, Schuetzler,
& Wilson, 2014; Hartwig & Bond, 2014; Mullin et al., 2014; Pentland, Burgoon, &
Twyman, 2015; Twyman et al., 2014; Vrij, 2008).
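To make the thermal-imaging measurement concrete, the sketch below tracks the mean temperature of a periorbital region of interest across frames and reports its deviation from an initial baseline. The frame format, the region coordinates, and the synthetic data are assumptions for illustration only, not a published pipeline.

```python
# Illustrative sketch of tracking periorbital temperature in thermal video.
# Frames are assumed to be 2D arrays of calibrated temperatures and the region
# of interest (ROI) is assumed to be localized already (e.g., by a face tracker);
# the synthetic data and ROI coordinates are placeholders, not a real pipeline.
import numpy as np

def roi_temperature_series(frames, roi):
    """Mean temperature inside a fixed (top, bottom, left, right) ROI per frame."""
    top, bottom, left, right = roi
    return np.array([frame[top:bottom, left:right].mean() for frame in frames])

def perfusion_change(series, baseline_frames=30):
    """Deviation of each frame's ROI temperature from an initial baseline."""
    return series - series[:baseline_frames].mean()

# Synthetic example: 300 thermal frames that gradually warm over time.
frames = [20.0 + 0.01 * t + 0.1 * np.random.rand(120, 160) for t in range(300)]
series = roi_temperature_series(frames, roi=(30, 50, 60, 100))
print(perfusion_change(series)[-1])   # a positive value suggests increased perfusion
```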
There are problems associated with viewing arousal behaviors as the most valid and
reliable indicators of deception or dishonesty:

First, most of the research has focused specifically on lying, not other forms of deceit such as
equivocation, evasion, exaggeration, or concealment. Thus, some forms of deceit, such as omit-
ting truthful details or being ambiguous, may not be (as) physiologically arousing and result in the
same nonverbal displays. Second, it is unclear whether lying and other forms of deceit are in fact
highly arousing. Certainly the kinds of mundane white lies, polite evasions, well-selected omis-
sions and other low-stakes lies that populate daily discourse and may roll off the tongue without
hesitation are unlikely to generate a lot of physiological changes. (Burgoon, 2005, p. 239)

Additional problems are that much of the research intended to validate deception
cues has not been conducted under high-stakes circumstances, and much has collected
or utilized very brief samples of behavior that are less than 30 seconds in length, so that
it is unknown whether arousal dissipates over time or becomes even more elevated and
alters behavioral profiles. For example, in our own research (some unpublished) we have
seen that whereas deceivers committing a mock theft may exhibit more random trunk
movements and adaptor gestures, those in real-world high stakes circumstances, such as
being questioned about serious crimes, may instead exhibit the aforementioned freeze
pattern (Burgoon & Floyd, 2000; Mullin et al., 2014). That is, when stakes are low to
moderate, adaptor gestures may be more frequent among deceivers than truth tellers,
but when stakes are high, the reverse is true: Truth tellers may exhibit more movement
overall, including postural shifts, fidgeting and other adaptor gestures, than deceivers.

Negative-based Affect Signals


The second category of indicators includes not only discrete negative emotions (e.g.,
anger, contempt, fear) but also other more diffuse mood and feeling states (e.g., guilt,
shame, discomfort, uncertainty). Affective states are closely related to arousal inasmuch
as emotions entail some level of arousal or activation, and arousal has a valence asso-
ciated with it (Burgoon et al., 1989). Emotional states are typically measured by use of
the Facial Action Coding System (FACS) (Ekman, Friesen, & Hager, 2002), whereas other
more diffuse mood states may be measured by observation of the face or posture as well as the voice (Ekman et al., 1991).
Although deception is usually associated with negative affect states such as fear of
detection, Ekman and Friesen (1969) also suggested that some deceivers experience
duping delight – glee at conning someone. The empirical research on affective indicators
associated with deception is more mixed and inconsistent. Although many researchers
have hypothesized which emotions should be associated with actual deceit, few have
actually investigated or reported confirmatory findings. Frank and Ekman (1997, 2004)
measured emotional expressions with FACS in two scenarios (crime, opinion) in which
participants lied or told the truth. They found that fear and disgust accurately discrim-
inated truth tellers from liars with 80% accuracy, whereas contempt correlated with
perceived truthfulness. Smiling was also a discriminator. However, meta-analyses of
cues of deception did not find any specific emotions to predict veracity, only that less
facial pleasantness was associated with perceived or actual deception, and smiling was
not a significant discriminator (DePaulo et al., 2003; Hartwig & Bond, 2011). A more
recent study (Pentland, Twyman, & Burgoon, 2014) applied an advanced computer-
based tool for detecting and tracking facial expressions, the Computer Expression Recognition Toolbox, to video-recorded interview responses of participants in a mock crime
experiment about which guilty participants lied and innocent participants told the truth.
On two of three concealed information test questions, truth tellers showed more con-
tempt than deceivers, guilty participants (deceivers) averaged more neutral expressions
than innocent ones (truth tellers), and, except for smiling, all of the other action units
(AUs) and emotions failed to distinguish liars from truth tellers on target responses. As
for smiling, truth tellers consistently smiled more than deceivers. Despite these failures,
the affect coding was 94% accurate in distinguishing truth from deception, but largely
because of the impassivity of deceivers and the smiling of truth tellers, not because other
emotions and AUs were good predictors.
One exception to the lack of reported investigations of emotional expressions in
deception displays is micro-momentary facial expressions: fleeting expressions of emotion,
typically lasting 1/25th to 1/5th of a second, that are usually not noticed with the naked eye but can
be discerned with training. Ekman and colleagues (Ekman, 2009; Ekman & Friesen,
1969) regard micro-expressions as deliberate but suppressed expressions of emotional
states that “leak out” of the body involuntarily, whereas Haggard and Isaacs (1966),
who first discovered them, saw them as unintentional expressions of repressed emo-
tions, a distinction that adumbrates one of the problems associated with treating
micro-expressions as indicators of deception – one cannot infer definitively that they
signal deception. Notwithstanding, Ekman and colleagues, Matsumoto, Givens, and
Frank have trained thousands of law enforcement and intelligence agencies on using
micro-expressions to detect deceit, and such expressions are a staple of the Screening
of Passengers by Observation Techniques (SPOT) program (Maccario, 2013; HSNW, 2010).
The other facial expression that is associated with affect is the smile. As already
noted, there have been mixed findings on whether facial pleasantness and smiling are
associated with deception. Ekman and Friesen (1982; see also Ekman, Davidson, &
Friesen, 1990) introduced an important distinction, based on the work of Duchenne
(1862/1990), between felt and feigned smiles. The Duchenne smile is a genuine smile
and differs in configuration, duration and smoothness from faked smiles. Early on it
was thought that truth tellers would show the genuine smile (also called an enjoyment
smile; Frank, Ekman, & Friesen, 1993) and deceivers would show the fake smile.
But other research (e.g., Gunnery, Hall, & Ruben, 2013) has shown that a deliberate
non-Duchenne smile is commonplace in social interaction and should not be read as
deceptive.
The problems of using signs of emotion as valid and reliable indicators of deception
include the same ones associated with arousal. There is no one-to-one correspondence
between the observable indicator and the experienced psychological state (Fernandez-
Dols et al., 1997). For example, smiles, which are often thought to signify happiness,
may be exhibited by individuals feeling fear (known as an appeasement smile). In
the process of suppressing other body movements, deceivers may also suppress facial
expression and thus exhibit less affect than truth tellers. As well, truthful individuals
may display negative affect, and deceivers may display positive emotional states.
Even if outward manifestations could be trusted as high-fidelity signs of felt emo-
tions, there is no standard set of emotions associated with deceit. Moreover, emo-
tional expressions can be masked, minimized, exaggerated, neutralized, or replaced
with other expressions (Ekman, 1985). Porter, Ten Brinke, and Wallace (2012) found
that although high-intensity emotions often leak out, such expressions could be inhib-
ited when the stakes for deceivers were low. In a separate investigation, they found
that individuals high in the psychopathic trait of interpersonal manipulation showed
shorter durations of unintended emotional “leakage” during deceptive expressions
(Porter et al., 2011).
Additionally, such expressions are rare both in terms of the percentage of people who
display them and the percentage of facial expressions that include them. Ekman (2009)
reported that only about half the individuals they studied showed micro-expressions,
and Porter and Ten Brinke (2013) reported in a study of genuine and falsified emotions
that micro-expressions were exhibited by only 22% of participants. These expressions
appeared in only 2% of all expressions considered in the study, and largely in just the
upper part of the face. (It should be noted that longer emotional expressions – long
enough to be detected by trained observers – were much more frequent in the Porter and
Ten Brinke study; see also Warren, Schertler, and Bull (2009), who found that training
with the Subtle Expression Training Tool (SETT) improved the detection of emotional lies,
whereas training with the Micro Expression Training Tool (METT) did not.)

Cognitive Effort Signals


Research in several domains has established a number of nonverbal signals indicative
of a communicator experiencing greater mental taxation. In the context of deception,
telling a lie is thought to be more difficult than telling the truth. Thus, fabricating a story
should be more demanding than simply omitting details or exaggerating an otherwise
truthful account. Signals related to “thinking hard”, especially in the context of decep-
tion, include longer delays (response latencies) when answering questions or starting
a speaking turn, within-turn hesitations, slower speaking tempo, other speech dysflu-
encies, gaze aversion, temporary cessation of gesturing, suppressed blinking, changes
in facial expressions, vague and repeated details, structured message productions, less
contextual embedding, fewer reproduced conversations, and tense changes (Goldman-
Eisler, 1968; Porter & Ten Brinke, 2013; Sporer & Schwandt, 2006). Only some of these
are visual signals or measurable through imaging methods.

Memory Retrieval Signals


Closely aligned with cognitive effort are signs that a communicator is retrieving actual
memories, comparing them to invented versions of events, reconciling discrepancies,
making decisions about which version to tell, and simultaneously regulating self-
presentation. These are among the processes involved in engaging the central executive
in working memory (Baddeley, 1986, 2000a, 2000b). Longer response latencies, more
hesitations, slower speaking tempo, and temporary cessation of gesturing and gaze aver-
sion are among the signals of memory retrieval.

Strategic Behavior Patterns


Not all nonverbal signals of deceit are involuntary, reactive or uncontrollable. Commu-
nicators are by nature goal-driven and engage in a number of intentional activities to
evade detection, put forward a credible image, and persuade the target to accept the
veracity of their messages. These are strategic behaviors (Buller & Burgoon, 1994), as
distinct from the aforementioned reflexive and uncontrolled signals, which constitute
nonstrategic activity. Burgoon (2005) has identified a number of strategies composed of
constellations of behaviors, three of which we mention here.
Involvement as a strategy consists of multiple, multimodal behaviors that convey
one’s mental, emotional and behavioral engagement in the interaction taking place.
Deceivers, in an attempt to appear normal, may show moderately high to high degrees of
engagement through greater nonverbal immediacy (close proximity, touch, mutual eye
gaze, direct body orientation, forward lean), composure (still posture, lack of adaptor
gestures), expressiveness (more illustrator gestures, facial expressivity, vocal variety),
smooth conversation management (closed turn exchanges, no overlapped speech, inter-
actional synchrony), and moderate relaxation (lack of facial or bodily tension, moder-
ately erect posture) (Burgoon et al., 1996; Coker & Burgoon, 1987).
Increased pleasantness is another strategy to cover up negative affective states and
to promote a favorable image. It is highly correlated with, and encompasses many of
the same indicators as, involvement, to which are added smiling, other positive facial
expressions, affirming head nods (backchannels), postural mirroring, vocal variety, and
resonance (Floyd & Burgoon, 1999; Frank, Ekman, & Friesen, 1993).
Dominance displays can take various forms during deceptive encounters and there-
fore pose special challenges for interpretation. Under some circumstances, deceivers
may adopt an assertive, dominant stance while attempting to persuade another (see
Dunbar et al., 2014). Dominance displays include proxemic, kinesic, and vocalic
patterns of elevation, size, power, centrality, and precedence; under more aggressive
circumstances, they may also include threat gestures (Burgoon & Dunbar, 2006). Under
other circumstances, deceivers may opt for a more submissive, unassuming stance that
deflects perceptions of culpability and allows them to “fly under the radar” (Burgoon,
2005). Examples include submissive and retracted postures, symmetrical and formal
postures, downcast eyes, and subdued, higher-pitched voices.

Functions of Nonverbal Deception Indicators

Functions of communication refer to the purposes of communication – to the goals
communicators must juggle during interaction. They incorporate the social and interac-
tive aspects underpinning many of the observed behavioral patterns. Below we briefly
discuss four functions, namely (i) emotional regulation, (ii) relational communication,
(iii) impression management, and (iv) interaction management.

Emotional Regulation
Whereas emotions are often thought of as expressive rather than communicative behavior,
and can be displayed while alone, emotional regulation is more likely to occur in social
settings, to manage and override spontaneous displays. As already noted, the physio-
logical and subjective experiences of emotions do not have a deterministic influence on
observable behavior. This is partly due to humans’ ability to mask, attenuate, exagger-
ate or simulate emotional states to meet situational demands and their own goals. For
example, grieving in some cultures entails very public displays of weeping and wail-
ing whereas in other cultures it takes the form of stoic reserve. Interpreting observable
behavior must therefore factor in the cultural, social and personal context. Smiling at a
funeral may reflect just such intentional reserve rather than being a sign of happiness.

Relational Communication
Nonverbal signals are one of the main ways that humans define their interpersonal
relationships. They express to other co-present individuals how they feel about the
other (e.g., liked or disliked, trusted or distrusted), the state of the relationship (e.g.,
deep or superficial, intimate or distant, warm or cold, equal or unequal) and them-
selves in the context of the relationship (e.g., together or not, exhilarated or calm) (see
Burgoon & Hale, 1984). Many nonverbal signals do relational “double duty,” send-
ing relational messages while simultaneously serving other functions. For example, eye
gaze may signal attentiveness but also convey attraction and liking. Interpretation prob-
lems arise when the communicator is sending one message (e.g., attentive listening)
and the receiver is “reading” another message (e.g., flirtatious attraction). Relational
messages are among the major causes of communication misunderstandings and mis-
readings because the signals themselves are polysemous in meaning.

Impression Management
Although nonverbal signals can serve very important biometric functions, they do not
always signify a person’s actual identity but may instead convey who they want others
to think they are. This need not go as far as presenting false identities, but humans are
constantly in the process of managing their presentation of “self ” to others in ways that
represent not only a “real” identity (e.g., male or female) but also a preferred identity
(e.g., intelligent, witty). Dynamic behavioral patterns may therefore reflect a blend of
multiple identities. A woman’s gait, for example, may partially serve as a unique iden-
tifier but may also include elements of self-confidence and fearlessness when walking
alone at night or seductiveness when greeting a loved other after a long absence. Here,
as with relational messages, the signals will not have a deterministic meaning; they will
be fraught with “noise” and high variance.

Interaction Management
A final function of observable behavior is to manage the interaction between two or more
people, to signal greetings and departures, to regulate conversational turns at talk, and to
mark topic changes. These interactive behaviors form a coordinated meshing of behav-
iors as individuals adapt to one another. These dynamic behaviors are reflective of the
influences of an interlocutor as well as the speaker and can create a behavioral “conta-
gion” (Burgoon, Stern, & Dillman, 1995; Hess & Bourgeois, 2010; Hatfield, Cacioppo,
& Rapson, 1994). Turn-requesting, turn-yielding, turn-denying, and back-channeling
behaviors draw on many of the same cues listed under other functions and carry with
them the same problems of disambiguation noted before. A head nod of attentive listen-
ing may be misread as assent; a turn-yielding gaze toward an interlocutor may be mis-
read as dominance or ingratiation. The total accumulation of behaviors displayed con-
comitantly or serially provides essential context for making sense of any given signal.
The aforementioned causes of nonverbal signals during interactions, both truthful and
deceptive, are contingent on the environment and moderated by attributes of individuals
in the interactions. The implications for engaging in this type of research are that a
computational model for detecting deceptive social signals will likely not generalize
across settings and cultures. The culture of the speakers and the cultural norms implied
by the setting become particularly germane when attempting to infer emotion or affect
from nonverbal behavior. While outside the scope of this chapter, interested readers
should see Adams and Markus (2004) as a starting point for conceptualizing culture in
psychological research.
One technical approach to overcoming the challenge of cross-cultural or individ-
ual behavioral differences is to develop interaction-specific behavior models. In this
type of model, the interaction serves as its own baseline. Of interest are deviations and
behavioral anomalies that occur concomitantly with interaction events (e.g., greetings,
questions, responses, mention of the critical issue). This model normalizes individuals’
unique behavior dynamics according to their own baseline and threshold for that unique
interaction.
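To make the idea concrete, here is a minimal sketch of such an interaction-as-its-own-baseline model. The feature names, baseline fraction, and z-score threshold are hypothetical illustrations, not validated parameters:

```python
import numpy as np

def interaction_baseline_flags(feature_series, baseline_frac=0.2, z_threshold=2.0):
    """Flag segments of one interaction whose behavior deviates from that same
    interaction's baseline (an illustrative sketch, not a validated detector).

    feature_series : one behavioral measurement per interaction segment,
                     e.g., adaptor-gesture rate or response latency in seconds.
    baseline_frac  : fraction of the interaction (from its start) used to
                     estimate the person's own baseline for this interaction.
    z_threshold    : how many baseline standard deviations count as an anomaly.
    """
    x = np.asarray(feature_series, dtype=float)
    n_base = max(2, int(len(x) * baseline_frac))
    mu, sigma = x[:n_base].mean(), x[:n_base].std(ddof=1)
    sigma = sigma if sigma > 0 else 1e-6        # guard against a flat baseline
    z = (x - mu) / sigma                        # per-segment deviation scores
    return z, np.abs(z) > z_threshold           # scores and anomaly flags

# Hypothetical example: response latencies (s) across the questions of one interview.
latencies = [0.6, 0.7, 0.5, 0.6, 0.8, 0.7, 1.9, 2.2, 0.6]
scores, flags = interaction_baseline_flags(latencies)
```

In practice the baseline segment would be chosen from behaviorally neutral interaction events (e.g., the greeting and background questions), and a separate model would be kept for each behavioral channel.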

Figure 29.1 An example of the automated analysis for human nonverbal communication
computing. Sample snapshots from tracked facial data showing an interviewee (left) and an
interviewer (right). Red dots represent tracked facial landmarks (eyes, eyebrows, etc.), while the
ellipse in the top left corner depicts the estimated 3-D head pose of the subject; the top right
corners show the detected expressions and head gestures for subject and interviewer.

Human Nonverbal Communication Computing – A Review of Motion Analysis Methods

Nonverbal communication research offers high-level principles that might explain how
people organize, display, adapt and understand such behaviors for communicative pur-
poses and social goals. However, the specifics are generally not fully understood, nor is
the way to translate these principles into algorithms and computer-aided communica-
tion technologies such as intelligent agents. To model such complex dynamic processes
effectively, novel computer vision and learning algorithms are needed that take into
account both the heterogeneity and the dynamicity intrinsic to behavior data. As one of
the most active research areas in computer vision, human motion analysis has become
a widely-used tool in this area. It uses image sequences to detect and track people, and
also to interpret human activities. Emerging automated methods for analyzing motion
(Wang, Hu, & Tan, 2003; Metaxas & Zhang, 2013) have been studied and developed to
enable tracking diverse human movements precisely and robustly as well as correlating
multiple people’s movements in interaction. Some of the applications of using motion
analysis methods for nonverbal communication computing include deception detection,
expression recognition, sign language recognition, behavior analysis, and group activity
recognition. In the following we illustrate two examples of nonverbal communication
computing.
Figure 29.1 shows an example of deception detection during interactions using an
automated motion analysis system (Yu et al., 2015). This work investigates how the
degree of interactional synchrony can signal whether an interactant is truthful or
deceptive. The automated, data-driven and unobtrusive framework consists of several
motion analysis methods such as face tracking, gesture detection, facial expression
recognition and interactional synchrony estimation. It is able to automatically track
gestures and analyze expressions of both the target interviewee and the interviewer,
extract normalized, meaningful synchrony features, and learn classification models for
deception detection. The analysis shows that these features reliably capture simultaneous
synchrony, and that synchrony and deception are related, though in a complex way.
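The specific synchrony features of that framework are not reproduced here, but the following sketch shows one simple way a synchrony measure of this kind could be computed: the Pearson correlation between interviewee and interviewer motion signals (e.g., per-frame head-movement magnitudes) within sliding windows. The window length, hop size, and choice of signals are illustrative assumptions, not the feature set of Yu et al. (2015).

```python
import numpy as np

def windowed_synchrony(signal_a, signal_b, win=150, step=30):
    """Pearson correlation between two motion signals in sliding windows.

    signal_a, signal_b : equal-length 1-D arrays, e.g., per-frame head-movement
                         magnitude for interviewee and interviewer.
    win, step          : window length and hop in frames (assumed values).
    Returns one correlation per window; values near +1 suggest simultaneous
    synchrony, values near 0 little coordination, negative values opposing motion.
    """
    a, b = np.asarray(signal_a, float), np.asarray(signal_b, float)
    scores = []
    for start in range(0, len(a) - win + 1, step):
        wa, wb = a[start:start + win], b[start:start + win]
        if wa.std() == 0 or wb.std() == 0:      # skip motionless windows
            scores.append(0.0)
        else:
            scores.append(float(np.corrcoef(wa, wb)[0, 1]))
    return np.array(scores)
```

Window-level scores such as these could then be summarized (mean, variance, lagged maxima) and fed to a classifier, which is the general shape of the cited framework.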
The second example uses an automated motion analysis system to recognize facial
expressions of emotion and signs of fatigue from sleep loss in space flight (Michael et al.,
2011). Specifically, this system was developed as a non-obtrusive, objective means of detecting
and mitigating cognitive performance deficits, stress, fatigue, anxiety and depression
in the operational setting of spaceflight. To do so, a computational 3-D model-based
tracker and an emotion recognizer for the human face were developed to reliably identify
when astronauts are displaying various negative emotional expressions and ocular signs
of fatigue from sleep loss during space flight. Figure 29.2 shows an illustration of using
this system to recognize a facial expression of emotion. This subject had an emotion
of sadness induced by guided recollection of negative memories. The system scored
the video clip over a 2-minute period. “Sad” was the predominant selection for the frames
in the clip. This agreed with the human ratings of sadness as the dominant emotional
expression during this period as well as with the emotion induced.

Figure 29.2 A system for recognizing a specific facial expression of emotion. The system scored
the video clip over a 2-minute period. The lower right graph shows the probabilities (Y axis) for
each of seven emotional expressions (X axis: angry, fear, disgust, happy, sad, surprise, neutral)
for a specific video frame; the upper right graph shows the results over the full 2-minute clip,
with an upward arrow indicating the time at which that frame occurred.
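The clip-level decision illustrated in Figure 29.2 can be thought of as an aggregation of per-frame expression probabilities; the sketch below shows one simple version, averaging the frame-level distributions and taking the most probable label. The label ordering and the averaging rule are assumptions for illustration, not the scoring rule of the system described above.

```python
import numpy as np

EXPRESSIONS = ["angry", "fear", "disgust", "happy", "sad", "surprise", "neutral"]

def clip_level_expression(frame_probs):
    """Aggregate per-frame expression probabilities into a clip-level label.

    frame_probs : array of shape (n_frames, 7), each row a probability
                  distribution over EXPRESSIONS for one frame.
    Returns the mean distribution over the clip and the predominant label.
    """
    probs = np.asarray(frame_probs, dtype=float)
    mean_probs = probs.mean(axis=0)             # average over the whole clip
    return mean_probs, EXPRESSIONS[int(mean_probs.argmax())]
```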
The above examples demonstrate that motion analysis methods such as face tracking
are critical to nonverbal communication computing. We next discuss some technologies
related to nonverbal communication computing that we have developed during the past
20 years. Research in this area falls into two main categories: (a) highly structured
domains, such as American Sign Language (ASL), and (b) less structured domains, which
include applications such as detection of deception, emotional expressions, stress, and
impairments of cognitive and social skills. Both rely on robust motion analysis methods such
as tracking, reconstruction and recognition. In the following, we present a set of motion
analysis methods needed for this line of work and several examples to demonstrate the
complexity of the problems.

Face Tracking
One of the most important cues for nonverbal communication comes from facial
motions. Thus accurately tracking head movements and facial actions is very impor-
tant and has attracted much attention in the computer vision and graphics communities.
Early work typically focused on recognizing expressions of a roughly stationary head
(Terzopoulos & Waters, 1993). In contrast, contemporary face tracking systems need
to track facial features (e.g., eye corners, nosetip, etc.) under both head motion and
varying expressions. The face models and tracking algorithms we have developed in
recent years are based on parametric models and statistical models (Active Shape Mod-
els, Constrained Local Models, and Active Appearance Models) as well as face tracking
from range data.

3-D Deformable Model-based Methods


2-D parametric face models were first explored to track facial features for recovering
and recognizing non-rigid and articulated motion of human faces (Black & Yacoob,
1997). DeCarlo and Metaxas (1996, 2000) introduced a 3-D facial mesh and applied
optical flow as a non-holonomic constraint within a deformable model-based approach
to estimate 3-D head movements and facial expressions. Based on
several further evolutions of their methodology, they have developed a state-of-the-art,
real-time facial tracking system (Yu et al., 2013). An alternative approach is to learn
3-D morphable models from a group of face shapes and textures (Blanz & Vetter, 1999)
which are usually acquired by high accuracy 3-D scans. These 3-D face models can rep-
resent a wide variety of faces and rigid facial motions. On the other hand, such methods
are only as good as the models they have learned and do not generalize well to facial
expressions.
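In generic notation (not the exact formulations of the cited papers), the two ideas can be summarized as follows: brightness constancy relates image motion to intensity change, the deformable-model view writes that motion through a Jacobian of the model parameters, and a morphable model expresses a face shape as a mean plus weighted shape modes:

```latex
\nabla I \cdot \mathbf{u} + I_t = 0, \qquad
\mathbf{u} = L(\mathbf{q})\,\dot{\mathbf{q}}, \qquad
S = \bar{S} + \sum_{i} \alpha_i\, s_i
```

Here u is the image velocity at a tracked point, I_t the temporal intensity derivative, q the vector of rigid and non-rigid model parameters with Jacobian L(q), and the morphable model writes a shape S as the mean shape plus weighted learned shape modes s_i; all symbols are schematic placeholders rather than the cited papers' notation.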

Figure 29.3 Top: The face shape manifold is approximated by piecewise linear sub-regions.
Bottom: This method searches across multiple clusters to find the best local linear model.

Active Shape Model-based Methods


Active Shape Models (ASMs; Cootes et al., 1995) learn statistical distributions of 2-D
feature points, which allow shapes to vary only in ways seen in a training set. Kanaujia
and Metaxas (2006) built a real-time face tracking system based on ASMs. They trained
a mixture of ASMs on pre-aligned faces from different clusters, each corresponding to
a different pose, as shown in Figure 29.3. The target shape is fitted by first searching
for local features along the normal direction and then constraining the global shape
using the most probable cluster.
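As an illustration of the global shape constraint that ASM-style fitting relies on, the sketch below projects a candidate shape (proposed by the local feature search) onto the learned subspace and clamps its coefficients so the result remains a plausible face. The single PCA model and the plus-or-minus three standard-deviation limit are simplifying assumptions; the system described above uses a mixture of pose-specific ASMs.

```python
import numpy as np

def constrain_shape(candidate, mean_shape, modes, eigvals, k=3.0):
    """Project a candidate shape onto an ASM's learned subspace and clamp it.

    candidate  : (2N,) vector of landmark coordinates found by local search.
    mean_shape : (2N,) mean of the aligned training shapes.
    modes      : (2N, m) matrix of principal shape modes (eigenvectors).
    eigvals    : (m,) corresponding eigenvalues.
    k          : allow each mode coefficient to vary by at most k standard
                 deviations, so the result stays a plausible face shape.
    """
    b = modes.T @ (candidate - mean_shape)      # shape parameters
    limit = k * np.sqrt(eigvals)
    b = np.clip(b, -limit, limit)               # keep the shape "legal"
    return mean_shape + modes @ b               # reconstructed, constrained shape
```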
2-D ASM-based methods have also been combined with 3-D face models for
improved accuracy. A framework was developed to integrate both 2-D ASM and 3-
D deformable models (Vogler et al., 2007), which allows robust tracking of faces and
estimation of both rigid and nonrigid motions. Later, a face tracker was built that com-
bined statistical models of both 2-D and 3-D faces (Yang, Wang et al., 2011; Yang et al.,
2012). Shape fitting was performed by minimizing both feature displacement errors
and subspace energy terms with temporal smoothness constraints. Given a limited number
of training samples, traditional statistical shape models may overfit and generalize
poorly to new samples. Instead of building models on the entire face, Huang, Liu, and
Metaxas (2007) built separate ASM models for face components to preserve local shape
deformations. They applied a Markov network to provide global geometry constraints.
Some recent research enhanced the ASM fitting by using sparse displacement errors
(Yang, Huang, & Metaxas, 2011; Zhang et al., 2011). These models are more robust to
outliers and partial occlusions.

Figure 29.4 Sample processed frames showing tracked landmarks, estimated head pose (top left
corner) and predicted facial expression scores.

Facial Expression Recognition


Based on the tracked face region, we can estimate head movements and facial expres-
sions. Facial expression recognition has attracted attention since as early as the 1970s
and has continued to be widely investigated in the past decade (Zeng et al., 2009; Metaxas
& Zhang, 2013), for there remain many open issues due to the complexity and vari-
ety of facial expressions and their appearance. Our previous work introduced several
3-D methods for facial expression analysis (DeCarlo & Metaxas, 1996) and synthe-
sis (Wang et al., 2004) based on the use of deformable models and learning methods.
Figure 29.4 shows an example of estimating facial expressions. The facial motion is
estimated by automatically tracking the landmarks on the faces, and the shape informa-
tion is also integrated into the expression analysis. To further analyze the facial
expressions in the video, encoded dynamic features, which contain both spatial and
temporal information, were developed. Boosting and ranking methods were then used
in the learning phase to estimate the expression intensity for the first time, with state-
of-the-art performance (Yang, Liu, & Metaxas, 2007, 2009).

Moving Towards the Infrared Band for Face-based Motion Analysis

When working with face images captured in the visible range of the electromagnetic
spectrum, i.e., 380–750 nm, several challenges have to be mitigated. For example,
there are situations when face-based human motion analysis needs to deal with harsh
environmental conditions characterized by unfavorable lighting and pronounced shadows.
One such example is low-light environments (Bourlai et al., 2011), where motion
analysis based solely on visible spectral image sequences may not be feasible (Selinger
& Socolinsky, 2004; Ao et al., 2009).
In order to deal with such difficult scenarios, multi-spectral camera sensors can be
used. They have already become very useful in face identification applications (focus
of this section) because they can be used day and night (Bourlai et al., 2012; Bourlai,
2015; Narang & Bourlai, 2015a, 2015b). Thus, face-based human motion analysis can
be moved to the infrared spectrum. The infrared (IR) spectrum is divided into different
spectral bands. The boundaries between these bands can vary depending on the scien-
tific field involved (e.g., optical radiation, astrophysics, or sensor technology; Miller,
1994). The IR bands discussed in this work are based on the response of various detectors.
Specifically, the IR spectrum comprises the active IR band and the thermal
(passive) IR band. The active band (0.7–2.5 µm) is divided into the NIR (near infrared)
and the SWIR (shortwave infrared) bands. Differences in appearance between images
sensed in the visible and the active IR band are due to the properties of the object being
imaged. The passive IR band is further divided into the Mid-Wave (MWIR) and the
Long-Wave InfraRed (LWIR) band. MWIR ranges from 3–5 µm, while LWIR ranges
from 7–14 µm. Both MWIR and LWIR cameras can sense temperature variations across
the face at a distance and produce thermograms in the form of 2-D images. However,
while both pertain to the thermal spectrum, they reveal different image characteristics
of the facial skin. The difference between MWIR and LWIR is that MWIR has both
reflective and emissive properties, whereas LWIR consists primarily of emitted radia-
tion. The importance of MWIR FR has been recently discussed in Abaza et al. (2014)
and some example scenarios will be briefly discussed here. What follows is a descrip-
tion of a set of tasks used for face-based identification, which can also be extended to
face-based human motion analysis. The tasks we will discuss include data collection,
face localization, eye and pupil detection and face normalization, all applied in various
bands, such as the visible, NIR and SWIR.

Face Datasets
For face identification studies we have used the UHDB11 (visible spectrum; uncon-
strained data), the WVU (Near-IR and SWIR spectra; constrained data), and the FRGC2
(visible domain; constrained data) datasets. The WVU and FRGC were used to generate
the pose-variable WVU and FRGC datasets, which are composed of random pose
variations of the original images (i.e., each image from the original databases was randomly
rotated around the z-axis to angles ranging from −45 to +45 degrees in
5-degree increments). The UHDB11 database was used unaltered (realistic scenario).

1. UHDB11: This database consists of 1,602 face images that were acquired from 23
subjects under variable pose and illumination conditions. For each illumination con-
dition, the subjects faced four different points inside the room (their face was rotated
about the Y-axis, the vertical axis through the subject’s head). For each Y-axis
rotation, three images were also acquired with rotations about the Z-axis (which extends
from the back of the head to the nose). Thus, the face images of the database were
acquired under six illumination conditions, with four Y and three Z rotations.
2. WVU: The WVU database consists of images that were acquired using a DuncanTech
MS3100 multi-spectral and a XenICs camera. The MS3100 was used to create the
multispectral dataset of the database. The camera consists of three charge couple
devices (CCDs) and three band-pass prisms behind the lens in order to simultane-
ously capture four different wavelength bands. The IR and red (R) sensors of the
multi-spectral camera have spectral response ranges from 400 nm to 1000 nm. The
green (G) channel has a response from 400 nm to 650 nm, and the blue (B) channel
from 400 nm to 550 nm. The XenICs camera was used for the acquisition of SWIR
face images. The camera has an Indium Gallium Arsenide (InGaAs) 320 × 256 focal
plane array (FPA) with 30 µm pixel pitch, 98% pixel operability and three-stage ther-
moelectric cooling. It has a relatively uniform spectral response from 950–1700 nm
wavelength (lower SWIR band) across which the InGaAs FPA has a largely uniform
quantum efficiency. The spectral response of the camera falls rapidly at wavelengths
lower than 950 nm and near 1700 nm.

Face Localization in Various Bands


Determining which image regions contain a face via a detection or tracking component
is very important. This step is called face localization, where faces are located (their
positions in images or video frames are not known prior to analyzing the data) by dis-
tinguishing facial features from those of the background. Face detection algorithms treat
each image as an independent observation and an algorithm searches for features in the
image that indicate the presence of a face. For the purpose of this work, the main face
detection algorithms that will be discussed are the Viola and Jones and WVU template-
based matching face detection algorithms. The benefits of the WVU algorithm are that
it is (i) scenario-adaptable (it can work on face images captured in different bands and
under different noise conditions), (ii) fast, and (iii) training-free: no training or re-training
is required when the gallery is updated with new face images coming from different
sources (camera sensors).

Traditional Approach
The Viola and Jones algorithm combines a small set of features, selected from a large pool,
to detect faces in images. In the training stage, a weighted ensemble of weak classifiers is trained to distin-
guish faces from other objects (each weak classifier operates on a specific feature). By
utilizing a variant of the AdaBoost learning algorithm, a weighted combination of weak
classifiers is chosen. Therefore, the combination of features that offers the best classi-
fication performance on the training set is chosen as well. Haar-like wavelets (features)
are computed with a small number of operations. Finally, the resulting detector operates
on overlapping windows within input images, determining the approximate locations of
faces.
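Trained Viola and Jones cascades are readily available in common libraries; as a point of reference, the sketch below runs OpenCV's pretrained frontal-face Haar cascade over an image. The file name and detection parameters are illustrative choices, not values from the systems discussed in this chapter.

```python
import cv2

# Load OpenCV's pretrained Viola-Jones (Haar cascade) frontal-face detector.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("frame.png")                 # illustrative input frame
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Scan overlapping windows at multiple scales; each hit is (x, y, w, h).
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(40, 40))
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```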

Figure 29.5 Overview of the proposed fully automated (pre-processing) face detection
methodology: (a) query image; (b) photometric normalization; (c) sample average face templates
(empirically generated and then utilized); (d) face detection.

WVU Approach
The WVU template-based matching face detection algorithm (overview presented in
Figure 29.5) is used when we are dealing with more challenging conditions, i.e.,
variations in illumination and pose, and faces captured by different sensors and in different
bands (visible or infrared). To compensate for these problems, adaptable pre-processing
steps are used: since different techniques bring out unique features that are
beneficial for face detection in images captured under different scenarios, scenario-specific
pre-processing steps are employed. The salient stages of the proposed method, which can be
applied across different operational scenarios, are the following:

1. Photometric normalization (PN). PN is applied to all multi-scenario images. As conventional
techniques (e.g., histogram equalization and homomorphic filtering) do not always
perform well, we follow the approach proposed by Tan and Triggs (2010), which
incorporates a series of algorithmic steps. The steps are chosen in a certain order
to eliminate the effects of illumination variations, local shadowing and highlights,
while still preserving the essential elements of visual appearance for more efficient
face recognition. The approach consists of the following steps: gamma correction,
difference of Gaussian (DoG) filtering, masking, and contrast equalization.
2. Generation of multipose face templates. Different subjects are randomly selected
from each scenario-specific dataset. Then, for each subject, face registration is performed:
a face image is loaded, the coordinates of the eye centers are manually marked,
the image is geometrically normalized (by rotating and scaling the eye positions onto
two fixed points), and face templates are cropped at a fixed resolution. Finally, a
database-specific average face template is generated.
3. Detection of face regions. Template matching is applied by centering each of the
generated face templates at the top left corner of each input image and computing the
Pearson product moment correlation (PPMC) coefficient; the template is then slid
across the entire image, and the procedure is repeated after rotating the original image
to various angles. The position where the generated face template best matches the
input image (i.e., the highest correlation coefficient in the image domain) is the estimated
position of the template within the image (see the sketch after this list). To validate the
performance of our face detection system we use a relative error measure based on the
Euclidean distances between the expected landmark positions (the true coordinates of a
set of facial landmarks acquired by manual annotation) and the actual landmark positions
determined after face detection.
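The sketch below illustrates the preprocessing-plus-template-matching idea behind steps 1 and 3, using a simplified Tan and Triggs-style normalization chain and OpenCV's normalized correlation coefficient (closely related to the PPMC) for matching. The parameter values, rotation grid, and omission of masking are illustrative simplifications, not the tuned WVU pipeline.

```python
import cv2
import numpy as np

def photometric_normalize(gray, gamma=0.2, sigma0=1.0, sigma1=2.0):
    """Simplified Tan & Triggs-style chain: gamma correction, difference of
    Gaussians, and contrast equalization. Parameter values are illustrative."""
    img = np.power(gray.astype(np.float32) / 255.0, gamma)          # gamma correction
    dog = cv2.GaussianBlur(img, (0, 0), sigma0) - cv2.GaussianBlur(img, (0, 0), sigma1)
    dog -= dog.mean()
    dog /= (dog.std() + 1e-6)                                        # contrast equalization
    return cv2.normalize(dog, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def detect_face(query_gray, avg_template, angles=range(-30, 31, 10)):
    """Slide an average face template over rotated copies of the query image and
    keep the location with the highest normalized correlation coefficient."""
    best = (-1.0, None, None)                                        # (score, location, angle)
    q = photometric_normalize(query_gray)
    t = photometric_normalize(avg_template)
    h, w = q.shape
    for angle in angles:
        rot = cv2.warpAffine(q, cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0), (w, h))
        scores = cv2.matchTemplate(rot, t, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(scores)
        if max_val > best[0]:
            best = (max_val, max_loc, angle)
    return best                                  # correlation, top-left corner, rotation angle
```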

Eye and Pupil Detection and Face Normalization


In a typical FR system (that can be extended to face motion analysis on different bands),
one of the main challenges that must be overcome in order to achieve high FR identifica-
tion rates is to determine successfully the eye locations of all face images in a database
that are used for matching. The reason is that eye detection is the fundamental step in
the majority of FR algorithms. However, available eye detection approaches (both commercial
and academic, such as the Viola and Jones algorithm) can perform poorly on face
images captured under variable and unconstrained conditions. More importantly, when
the images are geometrically normalized based on inaccurate eye locations, the resultant
face image is not ready for use in face recognition algorithms because the rotation and
scale of the face are not properly corrected. Thus, we use our efficient eye localization
algorithm on both the enrolled and test datasets. Due to its accuracy in variable conditions,
it is expected to have minimal effect on face recognition results when compared to the
manually annotated eye centers (ground truth).
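For completeness, the sketch below shows the standard eye-based geometric normalization step referred to above: the face is rotated and scaled so the two detected eye centers map onto fixed canonical positions. The output size and canonical eye coordinates are illustrative choices, not those of the systems described in this chapter.

```python
import cv2
import numpy as np

def normalize_face(image, left_eye, right_eye, out_size=(128, 160),
                   canon_left=(38, 64), canon_right=(90, 64)):
    """Rotate, scale and translate a face image so the detected eye centers map
    onto fixed canonical positions (illustrative canonical values)."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))                 # in-plane (roll) angle
    scale = (canon_right[0] - canon_left[0]) / max(np.hypot(rx - lx, ry - ly), 1e-6)
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # Shift so the eye midpoint lands on the canonical midpoint.
    M[0, 2] += (canon_left[0] + canon_right[0]) / 2.0 - center[0]
    M[1, 2] += (canon_left[1] + canon_right[1]) / 2.0 - center[1]
    return cv2.warpAffine(image, M, out_size)
```

If the detected eye centers are wrong, the same warp is still applied, which is why inaccurate eye localization propagates directly into poorly normalized faces (compare Figure 29.6).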
The algorithm is designed to work very well with rolled face images (see Figure 29.6)
and under variable illumination conditions. The proposed method is efficient for the
detection of human pupils using face images acquired under controlled and difficult
(large pose and illumination changes) conditions in variable spectra (i.e. visible, multi-
spectral). The methodology is based on template matching, and is composed of an offline
and an online mode.
During the offline mode, scenario-dependent eye templates are generated for each eye
from the face images of a pre-selected number of subjects. Using the eye templates
that are generated in the offline mode, the online pupil detection mode determines the
locations of the human eyes and the pupils. A combination of texture- and template-
based matching algorithms is used for this purpose. Our method is designed to work
well with challenging data and achieved a significantly high detection rate. In particular,
it yielded an average of 96.38% detection accuracy across different datasets (visible and
IR), which is a 49.2% increase in average detection performance when compared to the
method proposed by Whitelam and Bourlai (2014) (designed to work well only with
frontal face images collected under constrained conditions).
The commercial off-the-shelf (COTS) software used performed well on data acquired under
controlled conditions. However, our method performed consistently better than the COTS software
across all datasets, achieving a 14.4% increase in average pupil detection performance.
Another important achievement of our work was its efficiency when using the original
face images of the UHDB11 dataset, where none of the face images were synthetically
altered with pose and illumination variations. This was the most challenging scenario
to test our method, and we obtained the highest increase in pupil detection accuracy
over both the benchmark and G8 algorithms, i.e., the pupil detection accuracy was (on
average) above 92%, and outperformed G8 by over 20%. Please note that all the aforementioned
computer vision approaches can be extended to video sequences and blended
with other approaches, such as ASMs, so that they can work efficiently in various IR
bands.

Figure 29.6 Illustration of the effect of incorrect eye detection on the geometric normalization of
two face image samples after face detection. (Top) Images after face detection. (Bottom) Images
after face/eye detection and geometric normalization. Note that the eye locations found by both
the academic (green and blue) and commercial (red) detectors are inaccurate and produce poorly
normalized face images.

Conclusions

In this chapter we discussed how understanding the ways people exploit nonverbal
aspects of their communication to coordinate their activities and social relationships is
a fundamental scientific challenge on which significant progress has already been made.
We also discussed that, in general, nonverbal communication research offers high-level
principles that might explain how people organize, display, adapt and understand such
behaviors for communicative purposes and social goals. The main challenges are to
identify those principles and translate them into algorithms and computer-aided com-
munication technologies, such as intelligent agents. These include large-scale data col-
lection and analysis in multiple bands, automated large-scale feature extraction,
sophisticated facial and body modeling, and robust scalable learning and visualization
methods. For example, most datasets currently used are not large-scale due to diffi-
culties inherent in collecting, annotating, and analyzing large quantities of video data.
Therefore, new protocols should be developed for collection, analysis, storage, and
dissemination of high-quality corpora larger in scale and more diverse in content than
those currently available.

References

Abaza, A., Harrison, M. A., Bourlai, T., & Ross, A. (2014). Design and evaluation of photometric
image quality measures for effective face recognition. IET Biometrics, 3(4), 314–324.
Adams, G. & Markus, H. R. (2004). Toward a conception of culture suitable for a social psy-
chology of culture. In M. Schaller & C. S. Crandall (Eds), The Psychological Foundations of
Culture, 335–360. New York: Springer.
Ao, M., Yi, D., Lei, Z., & Li, S. Z. (2009). Handbook of Remote Biometrics for Surveillance and
Security. London: Springer.
Baddeley, A. (1986). Working Memory. Oxford: Clarendon Press.
Baddeley, A. (2000a). Short-term and working memory. In E. Tulving & F. I. M. Craik (Eds), The
Oxford Handbook of Memory (pp. 77–92). Oxford: Oxford University Press.
Baddeley, A. (2000b). The episodic buffer: A new component of working memory? Trends in
Cognitive Sciences, 4(11), 417–423.
Black, M. J. & Yacoob, Y. (1997). Recognizing facial expressions in image sequences using local
parameterized models of image motion. International Journal of Computer Vision, 25(1), 23–
48.
Blanz, V. & Vetter, T. (1999). A morphable model for the synthesis of 3-D faces. In Pro-
ceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques
(pp. 187–194).
Bourlai, T. (2015). Face recognition by using multispectral imagery. In InfraMation, Nashville,
Tennessee, May 12–14 (invited paper).
Bourlai, T., Kalka, N., Cao, D., et al. (2011). Ascertaining human identity in night environments.
In B. Bhanu, C. Ravishankar, A. Roy-Chowdhury, H. Aghajan, & D. Terzopoulos (Eds), Dis-
tributed Video Sensor Networks (pp. 451–468). London: Springer.
Bourlai, T., Narang, N., Cukic, B., & Hornak, L. (2012). SWIR multi-wavelength acquisition
system for simultaneous capture of face images. In Proceedings of SPIE Infrared Technology
and Applications XXXVIII (vol. 8353).
Buck, R. (1988). Nonverbal communication: Spontaneous and symbolic aspects. American
Behavioral Scientist, 31, 341–354.
Buck, R. & VanLear, C. A. (2002). Verbal and nonverbal communication: Distinguishing sponta-
neous, symbolic and pseudo-spontaneous nonverbal behavior. Journal of Communication, 52,
522–541.
Buller, D. B. & Burgoon, J. K. (1994). Deception: Strategic and nonstrategic communication. In
J. A. Daly & J. M. Wiemann (Eds.), Strategic Interpersonal Communication (pp. 191–223).
Hillsdale, NJ: Erlbaum.
Burgoon, J. K. (2005). Nonverbal measurement of deceit. In V. Manusov (Ed.), The Sourcebook
of Nonverbal Measures: Going Beyond Words (pp. 237–250). Hillsdale, NJ: Erlbaum.
Burgoon, J. K., Buller, D. B., Floyd, K., & Grandpre, J. (1996). Deceptive realities: Sender,
receiver, and observer perspectives in deceptive conversations. Communication Research, 23,
724–748.
Burgoon, J. K. & Dunbar, N. E. (2006). Dominance, power and influence. In V. Manusov & M.
Patterson (Eds.), The SAGE Handbook of Nonverbal Communication (pp. 279–298). Thousand
Oaks, CA: SAGE.
Burgoon, J. K. & Floyd, K. (2000). Testing for the motivation impairment effect during deceptive
and truthful interaction. Western Journal of Communication, 64, 243–267.
Burgoon, J. K., Guerrero, L., & Floyd, K. (2010). Nonverbal Communication. Boston: Allyn &
Bacon.
Burgoon, J. K. & Hale, J. L. (1984). The fundamental topoi of relational communication. Com-
munication Monographs, 51, 193–214.
Burgoon, J. K., Kelley, D. L., Newton, D. A., & Keeley-Dyreson, M. P. (1989). The nature of
arousal and nonverbal indices. Human Communication Research, 16, 217–255.
Burgoon, J. K. & Newton, D. A. (1991). Applying a social meaning model to relational mes-
sages of conversational involvement: Comparing participant and observer perspectives. South-
ern Communication Journal, 56, 96–113.
Burgoon, J. K., Schuetzler, R., & Wilson, D. (2014). Kinesic patterns in deceptive and truthful
interactions. Journal of Nonverbal Behavior, 39, 1–24.
Burgoon, J. K., Stern, L. A., & Dillman, L. (1995). Interpersonal Adaptation: Dyadic Interaction
Patterns. New York: Cambridge University Press.
Coker, D. A. & Burgoon, J. K. (1987). The nature of conversational involvement and nonverbal
encoding patterns. Human Communication Research, 13, 463–494.
Cootes, T., Taylor, C., Cooper, D., & Graham, J. (1995). Active shape models: Their training and
application. Computer Vision and Image Understanding, 61(1) 38–59.
Dcosta, M., Shastri, D., Vilalta, R., Pavilidis, I., & Burgoon, J. K. (2015). Perinasal indicators of
deceptive behavior. Paper presented to IEEE International Conference on Automatic Face and
Gesture Recognition, Slovenia.
DeCarlo, D. & Metaxas, D. (1996). The integration of optical flow and deformable models with
applications to human face shape and motion estimation. In Proceedings of IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (pp. 231–238).
DeCarlo, D. & Metaxas, D. (2000). Optical flow constraints on deformable models with applica-
tions to face tracking. International Journal of Computer Vision, 38(2), 99–127.
DePaulo, B. M., Lindsay, J. J., Malone, B. E., et al. (2003). Cues to deception. Psychological
Bulletin, 129, 74–118.
Duchenne, G. B. (1990). The Mechanism Of Human Facial Expression. New York: Cambridge
University Press. (Original work published 1862)
Dunbar, N. E., Jensen, M. L., Bessabarova, E., et al. (2014). Empowered by persuasive decep-
tion: The effects of power and deception on interactional dominance, credibility, and decision-
making. Communication Research, 41, 852–876.
Ekman, P. (1985). Telling Lies: Clues to Deceit in the Marketplace, Marriage, and Politics. New
York: Norton.
Ekman, P. (1992). Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage. New
York: Norton.
Ekman, P. (2009). Lie catching and micro-expressions. In C. Martin (Ed.), The Philosophy of
Deception (pp. 118–133). New York: Oxford University Press.
Ekman, P., Davidson, R. J., & Friesen, W. V. (1990). The Duchenne smile: Emotional
expression and brain physiology: II. Journal of Personality and Social Psychology, 58,
342–353.
Ekman, P. & Friesen, W. V. (1969). Nonverbal leakage and clues to deception. Psychiatry, 32,
88–106.
Ekman, P. & Friesen, W. V. (1982). Felt, false, and miserable smiles. Journal of Nonverbal Behav-
ior, 6, 238–252.
Ekman, P., Friesen, W. V., & Hagar, J. C. (2002). Facial Action Coding System. Salt Lake City,
UT: Network Information Research. (Original work published 1976).
Ekman, P., O’Sullivan, M., Friesen, W. V., & Scherer, K. R. (1991). Invited article: Face, voice,
and body in detecting deceit. Journal of Nonverbal Behavior, 15, 125–135.
Elkins, A., Zafeiriou, S., Pantic, M., & Burgoon, J. K. (2015). Unobtrusive deception detection.
In R. Calvo, S. K. D’Mello, J. Gratch, & A. Kappas (Eds.), The Oxford Handbook of Affective
Computing. Oxford: Oxford University Press.
Fernandez-Dols, J. M., Sanchez, F., Carrera, P., & Ruiz-Belda, M. A. (1997). Are spontaneous
expressions and emotions linked? An experimental test of coherence. Journal of Nonverbal
Behavior, 21, 163–177.
Floyd, K. & Burgoon, J. K. (1999). Reacting to nonverbal expressions of liking: A test of interac-
tion adaptation theory. Communication Monographs, 66, 219–239.
Frank, M. G. & Ekman, P. (1997). The ability to detect deceit generalizes across differ-
ent types of high-stake lies. Journal of Personality and Social Psychology, 72, 1429–
1439.
Frank, M. G. & Ekman, P. (2004). Appearing truthful generalizes across different deception situ-
ations. Journal of Personality and Social Psychology, 86, 486–495.
Frank, M. G., Ekman, P., & Friesen, W. V. (1993). Behavioral markers and recognizability of the
smile of enjoyment. Journal of Personality and Social Psychology, 64, 83–93.
Goldman-Eisler, F. (1968). Psycholinguistics: Experiments in Spontaneous Speech. New York:
Academic Press.
Greenfield, H. D. (2006). Honesty and deception in animal signals. In J. R. Lucas and L. W.
Simmons (Eds), Essays in Animal Behaviour: Celebrating 50 Years of Animal Behaviour
(1st edn, pp. 278–298). New York: Academic Press.
Gunnery, S. D., Hall, J. A., & Ruben, M. A. (2013). The deliberate Duchenne smile: Individual
differences in expressive control. Journal of Nonverbal Behavior, 37, 29–41.
Haggard, E. A. & Isaacs, K. S. (1966). Micromomentary facial expressions as indicators of ego
mechanisms in psychotherapy. In Methods of Research in Psychotherapy (pp. 154–165). New
York: Springer.
Hartwig, M. & Bond, C. F. (2011). Why do lie-catchers fail? A lens model meta-analysis of human
lie judgments. Psychological Bulletin, 137, 643–659.
Hartwig, M. & Bond, C. F. (2014). Lie detection from multiple cues: A meta-analysis. Applied
Cognitive Psychology, 28, 661–676.
Hatfield, E., Cacioppo, J. T., & Rapson, R. L. (1994). Emotional Contagion. New York: Cam-
bridge University Press.
Hess, U. & Bourgeois, P. (2010). You smile – I smile: Emotion expression in social interaction.
Biological Psychology, 84, 514–520.
HSNW (2010). Efficacy of TSA’s behavioral threat detection program questioned. Homeland
Security News Wire, May 25.
Huang, Y., Liu, Q., & Metaxas, D. (2007). A component based deformable model for generalized
face alignment. In Proceedings of IEEE 11th International Conference on Computer Vision
(pp. 1–8).
Kanaujia, A. & Metaxas, D. (2006). Recognizing facial expressions by tracking feature shapes.
In Proceedings of IEEE 18th International Conference on Pattern Recognition (vol. 2,
pp. 33–38).
Knapp, M. L. & Comadena, M. E. (1979). Telling it like it isn’t: A review of theory and research
on deceptive communications. Human Communication Research, 5, 270–285.
Maccario, C. (2013). Screening of passengers by observation techniques (SPOT). Transportation
Security Administration, Department of Homeland Security, May.
Masip, J., Garrido, E., & Herrero, C. (2004). Defining deception. Anales de Psicología, 20, 147–
171.
Metaxas, D. & Zhang, S. (2013). A review of motion analysis methods for human nonverbal
communication computing. Image and Vision Computing (special issue on Machine learning
in motion analysis: New advances), 31(6–7), 421–433.
Michael, N., Yang, F., Metaxas, D., & Dinges, D. (2011). Development of optical computer recog-
nition (OCR) for monitoring stress and emotions in space. In 18th IAA Humans in Space
Symposium.
Miller, J. L. (1994). Principles of Infrared Technology: A Practical Guide to the State of the Art.
Boston: Springer.
Mullin, D. S., King, G. W., Saripalle, S. K., et al. (2014). Deception effects on standing center of
pressure. Human Movement Science, 38, 106–115.
Narang, N. & Bourlai, T. (2015a). Can we match ultraviolet face images against their visible
counterparts? In Proceedings of SPIE, Algorithms and Technologies for Multispectral, Hyper-
spectral, and Ultra-spectral Imagery XXI, Baltimore, MD.
Narang, N. & Bourlai, T. (2015b). Face recognition in the SWIR band when using single sensor
multi-wavelength imaging systems. Journal of Image and Vision Computing, 33, 26–43.
Pentland, S., Burgoon, J. K., & Twyman, N. (2015). Face and head movement analysis using auto-
mated feature extraction software. Proceedings of the 48th Hawaii International Conference
on System Sciences Credibility Assessment Symposium.
Pentland, S., Twyman, N., & Burgoon, J. K. (2014). Automated analysis of guilt and deception
from facial affect in a concealed information test. Presented to the Society for Personality and
Social Psychology, Austin.
Porter, S. & Ten Brinke, L. (2013). Reading between the lies: Identifying concealed and falsified
emotions in universal facial expressions. Psychological Science, 19, 508–514.
Porter, S., Ten Brinke, L., Baker, A., & Wallace, B. (2011). Would I lie to you? “Leakage” in
deceptive facial expressions relates to psychopathy and emotional intelligence. Personality and
Individual Differences, 51(2), 133–137.
Porter, S., Ten Brinke, L., & Wallace, B. (2012). Secrets and lies: Involuntary leakage in deceptive
facial expressions as a function of emotional intensity. Journal of Nonverbal Behavior, 36, 23–
37.
Rockwell, P., Buller, D. B., & Burgoon, J. K. (1997). The voice of deceit: Refining and expanding
vocal cues to deception. Communication Research Reports, 14(4), 451–459.
Schuller, B. (2013). Applications in intelligent speech analysis. In Intelligent audio analysis
(pp. 169–223). Berlin: Springer.
Selinger, A. & Socolinsky, D. A. (2004). Face recognition in the dark. In Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition Workshop (pp. 129–134).
Sporer, S. L. & Schwandt, B. (2006). Paraverbal indicators of deception: A meta-analytic synthe-
sis. Applied Cognitive Psychology, 20(4), 421–446.
Tan, X. & Triggs, B. (2010). Enhanced local texture feature sets for face recogni-
tion under difficult lighting conditions. IEEE Transactions on Image Processing, 19(6),
1635–1650.
Terzopoulos, D. & Waters, K. (1993). Analysis and synthesis of facial image sequences using
physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 15(6), 569–579.
Twyman, N. W., Elkins, A., Burgoon, J. K. & Nunamaker, J. F. (2014). A rigidity detection system
for automated credibility assessment. Journal of Management Information Systems, 31, 173–
201.
Vogler, C., Li, Z., Kanaujia, A., Goldenstein, S., & Metaxas, D. (2007). The best of both worlds:
Combining 3-D deformable models with active shape models. In Proceedings of IEEE 11th
International Conference on Computer Vision (pp. 1–7).
Vrij, A. (2008). Detecting Lies and Deceit: Pitfalls and Opportunities (2nd edn). Chichester, UK:
John Wiley & Sons.
Wang, L., Hu, W., & Tan, T. (2003). Recent developments in human motion analysis. Pattern
Recognition, 36(3), 585–601.
Wang, Y., Huang, X., Lee, C., et al. (2004). High resolution acquisition, learning and transfer of
dynamic 3-D facial expressions. Computer Graphics Forum, 23, 677–686.
Warren, G., Schertler, E., & Bull, P. (2009). Detecting deception from emotional and unemotional
cues. Journal of Nonverbal Behavior, 33, 59–69.
Whitelam, C. & Bourlai, T. (2014). On designing SWIR to visible face matching algorithms.
Intel® Technology Journal (special issue on Biometrics and Authentication), 18(4), 98–118.
Yang, F., Bourdev, L., Shechtman, E., Wang, J., & Metaxas, D. (2012). Facial expression edit-
ing in video using a temporally smooth factorization. In Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition (pp. 861–868).
Yang, F., Huang, J., & Metaxas, D. (2011). Sparse shape registration for occluded facial feature
localization. In Proceedings of IEEE International Conference on Automatic Face & Gesture
Recognition and Workshops (pp. 272–277).
Yang, F., Wang, J., Shechtman, E., Bourdev, L., & Metaxas, D. (2011). Expression flow for 3-D-
aware face component transfer. ACM Transactions on Graphics, 30(4), art. 60.
Yang, P., Liu, Q., & Metaxas, D. (2007). Boosting coded dynamic features for facial action units
and facial expression recognition. In Proceedings of IEEE Conference on Computer Vision
and Pattern Recognition (pp. 1–6).
Yang, P., Liu, Q., & Metaxas, D. (2009). Rankboost with L1 regularization for facial expression
recognition and intensity estimation, In Proceedings of IEEE 12th International Conference on
Computer Vision (pp. 1018–1025).
Yang, P., Liu, Q., & Metaxas, D. (2011). Dynamic soft encoded patterns for facial event analysis.
Computer Vision and Image Understanding, 115(3), 456–465.
Yu, X., Huang, J., Zhang, S., Yan, W., & Metaxas, D. (2013). Pose-free facial landmark fitting
via optimized part mixtures and cascaded deformable shape model. In Proceedings of IEEE
International Conference on Computer Vision (pp. 1944–1951).
Yu, X., Zhang, S., Yan, Z., et al. (2015). Is interactional dissynchrony a clue to deception? Insights
from automated analysis of nonverbal visual cues. IEEE Transactions on Cybernetics, 45(3),
506–520.
Zeng, Z., Pantic, M., Roisman, G., & Huang, T. (2009). A survey of affect recognition meth-
ods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 31(1), 39–58.
Zhang, S., Zhan, Y., Dewan, M., et al. (2011). Sparse shape composition: A new framework for
shape prior modeling. In Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition (pp. 1025–1032).
Zuckerman, M., DePaulo, B. M., & Rosenthal, R. (1981). Verbal and nonverbal communication
of deception. In L. Berkowitz (Ed.), Advances in Experimental Social Psychology (pp. 1–59).
New York: Academic Press.

Further Reading

Bourlai, T. (2016). Face Recognition across the Imaging Spectrum. Cham, Switzerland: Springer.
Bourlai, T., Whitelam, I., & Kakadiaris, I. (2011). Pupil detection under lighting and pose vari-
ations in the visible and active infrared bands. IEEE International Workshop on Information
Forensics and Security, Iguacu Falls, Brazil.
Buller, D. B. & Burgoon, J. K. (1996). Interpersonal deception theory. Communication Theory, 6,
203–242.
Osia, N. & Bourlai, T. (2014). A spectral independent approach for physiological and geometric
based face recognition in the visible, middle-wave and long-wave infrared bands. Journal of
Image and Vision Computing, 32, 847–859.
Pfister, T., Li, X., Zhao, G. & Pietikainen, M. (2011). Recognising spontaneous facial micro-
expressions. In Proceedings of IEEE International Conference on Computer Vision (pp. 1449–
1456).
Whitelam, C. & Bourlai, T. (2015, July). Accurate eye localization in the short waved infrared
spectrum through summation range filters. Elsevier Computer Vision and Image Understand-
ing, 139(C), 59–72.
