
Sound Emblems for Affective Multimodal Output of a Robotic Tutor: A Perception Study

Helen Hastie
School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK
h.hastie@hw.ac.uk

Pasquale Dente, Dennis Küster, Arvid Kappas
Department of Psychology and Methods, Jacobs University, Bremen, Germany
{p.dente, d.kuester, a.kappas}@jacobs-university.de

ABSTRACT

Human and robot tutors alike have to give careful consideration to how feedback is delivered to students in order to provide a motivating yet clear learning context. Here, we performed a perception study to investigate attitudes towards negative and positive robot feedback in terms of perceived emotional valence on the dimensions of 'Pleasantness', 'Politeness' and 'Naturalness'. We find that, indeed, negative feedback is perceived as significantly less polite and pleasant. Unlike humans, who have the capacity to leverage various paralinguistic cues to convey subtle variations of meaning and emotional climate, robots are at present much less expressive. However, they have one advantage: they can combine synthetic robotic sound emblems with verbal feedback. We investigate whether these sound emblems, and their position in the utterance, can be used to modify the perceived emotional valence of robot feedback. We discuss this in the context of an adaptive robotic tutor interacting with students in a multimodal learning environment.

CCS Concepts

• Human-centered computing → Natural language interfaces; Auditory feedback; Empirical studies in interaction design

Keywords

Human-robot interaction, multimodal output, speech synthesis, synthesized sounds

1. INTRODUCTION

In the classroom, teachers use multiple ways and modalities to embed knowledge and provide feedback to students. Similarly, robotic tutors should use a range of modalities, including language, gestures and non-verbal cues. These non-verbal modalities can be used to emphasize, extend or even replace the language output produced by a robot. In addition, non-verbal cues can make the learning experience more novel and enjoyable for the students and can potentially help compensate for the uncanny valley effect.

There is much discussion in the literature on the pedagogical effectiveness of both negative and positive feedback [32]. There is, however, a consensus that whatever feedback is given, it needs to be carefully phrased [21] and communicated in a clear and unambiguous fashion [13].

The study described here has three stages. The first stage investigates whether negative feedback is perceived as having lower emotional valence in terms of 'Pleasantness', 'Politeness' and 'Naturalness'. The second stage investigates whether, if this is the case, the addition of emotive sound emblems can be used to subtly adjust the valence of the feedback. Thirdly, we investigate the effect of the position of the sound emblems: they may be better placed at the end of an utterance, thus emulating punctuation, or at the beginning of an utterance, where the emblem might be perceived as a spontaneous affect burst followed by a verbal elaboration [31].

This work is in the context of an affective robotic or virtual tutor. It has been shown that understanding affect in a learning environment is key to aiding and assisting student progression [22]. While it is understood that too much emotion is bad for rational thinking, recent findings suggest that so too is too little emotion, as intelligent functioning is hindered when basic emotions are missing in the brain [22]. Whilst progress is being made on robots that can manipulate facial expressions, gestures and language to portray emotion, some of the more affordable robotic platforms, such as the NAO robot, have limited facial expressiveness. In addition, endowing virtual agents with natural facial emotions can be computationally expensive and time-consuming to programme. The work discussed here is important to the field as it gives insights into the use of sound emblems, a language- and platform-independent mechanism for expressing affect in a multimodal system.
2. EMPATHIC ROBOT TUTOR

The robotic tutor was designed as part of the FP7 Emote project (http://www.emote-project.eu) to be an empathic tutor that aids students aged between 11 and 13 years in exercising their map-reading skills as part of the geography school curriculum. The set-up is multimodal in that a map-related task is presented to the students on a large interactive touch-table or tablet [2].
The students are presented with a series of tasks, which can be solved by using their skills in reading directions, measuring distances and identifying symbols on a map. For instance, a sample task would be "Go North West for 550 meters and find a museum". Figure 1 gives the architecture whereby the robot uses sensors to determine the user's affect and learner model, and adapts its interaction accordingly through the Interaction Manager [8]. The system generates personalised pedagogical feedback that is aligned with the user's affect; for example, the robot gives hints and encouraging remarks if the user is getting frustrated. Here, we discuss a perception study that has informed the generation of this affective feedback as part of the Multimodal Output Manager module. All modules and other related resources are available from the Emote project website.

Figure 1: Architecture of the Emote system whereby the Multimodal Output Manager is informed by this study
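As a rough illustration of the adaptation loop just described, the minimal sketch below maps the learner's inferred affective state and answer correctness to a feedback utterance. The function, the utterance pools and the hint text are hypothetical simplifications, not the Emote Interaction Manager's actual logic (which is available from the project website); the four state labels are those used by the system (see Section 6), and the utterances are those of Section 4.1.

```python
# Hypothetical sketch of affect-aligned feedback selection; not the actual
# Emote Interaction Manager [8].
import random

POSITIVE = ["Amazing Job", "Excellent!", "Way to go!", "Wow!", "You got it right"]
NEGATIVE = ["That's incorrect", "OK, so that's not quite right"]

def select_feedback(correct: bool, affect: str) -> str:
    """Map answer correctness and the learner's inferred affect (one of
    'frustrated', 'excited', 'bored', 'neutral') to a feedback utterance."""
    if not correct and affect == "frustrated":
        # Soften negative feedback with an encouraging hint (hypothetical text).
        return random.choice(NEGATIVE) + " Try checking the compass rose again."
    if correct and affect == "bored":
        # Re-engage a bored learner with higher-valence praise.
        return random.choice(POSITIVE[:4])
    return random.choice(POSITIVE if correct else NEGATIVE)
```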

3. PREVIOUS WORK

The ultimate goal of the design of socially intelligent robots is the development of systems capable of communicating and interacting with humans at a human-like level of proficiency. This includes the use of natural-sounding human voices or other types of sounds to portray emotional feelings [9]. In previous studies, a positive effect has been observed in robot-child interaction when the NAO robot expresses emotion through voice, posture, whole-body poses, eye colour, and gestures [33]. Other child-robot interaction studies have indicated that empathy facilitates interaction [2], and that human-based expressions can be successfully implemented by robots [18]. Finally, automatic feedback has been closely examined for tutoring systems [6]. However, little work has been done on non-human sounds and the effects they have on interaction, on learning, and on how the robot is perceived.

Some of the research in this field has relied on sounds that are rich in music-like qualities, e.g. [9], who produced a small set of five sounds to represent speech acts. Other researchers have focused on simpler sounds, such as linear modifications of the pitch or duration of monophonic sounds [14, 16, 15]. Conceptually, much of the research in this field has been inspired by the sound design of science-fiction movies. Movie portrayals of robots such as R2-D2 and WALL-E are reflected in recent approaches by several authors to the creation of non-verbal sounds as a means to replace or augment synthetic speech lacking paralinguistic cues [16, 23, 24]. The perception of abstract sounds, like other perceptual processes in psychology, can be expected to depend on further contextual information. As argued by [23], simple non-linguistic utterances can be used to shift the burden of interpretation to the human because such sounds have less semantic content than spoken language. Questions therefore arise as to whether verbal utterances with similar intended valence provide the necessary context to align with the intended valence of the sound emblem, and ultimately whether sound emblems, when combined with verbal utterances, can be used to shape the emotional climate of interaction between robots and humans.

4. METHOD

120 participants (28 female, M_Age = 31.02, SD_Age = 8.03) from 35 countries were recruited on Crowdflower's online platform (http://www.crowdflower.com). Each participant listened to and rated a set of 5 utterances randomly picked from the set of 30 stimuli and was paid approximately US$1.50 on completion of each set. A total of approximately 5,500 ratings were collected.

4.1 The Stimuli

The stimuli were composed of either a text-to-speech (TTS) generated utterance paired with a sound emblem, or a TTS utterance presented in isolation. Half of the utterances were negative tutor feedback and half positive feedback. Both sets of wording were chosen by learning and teaching experts and provide varying syllable lengths and varying levels of valence, e.g. from the high positive valence "Way to go!" to the more sedate "You got it right". The TTS voice used was Jack, a Received Pronunciation (RP) male voice from Cereproc (http://www.cereproc.com). The exact wording is given below:

• Positive: "Amazing Job", "Excellent!", "Way to go!", "Wow!", "You got it right"

• Negative: "No that's 15 metres", "That's incorrect", "OK, so that's not quite right", "No that's a telephone", "That solution is wrong unfortunately"

Each of the above-listed 10 utterances was matched with a sound emblem of similar intended valence. There were 3 versions of each utterance: one with the emblem before the utterance, one with the emblem after, and one with speech only, giving the set of 30 stimuli. The sound emblems are discussed in detail in the following section; they were not evaluated in isolation in this study as they have previously been validated for valence, as reported in [11]. Each stimulus was evaluated on three 7-point Likert scales in order to measure perceived 'Pleasantness', 'Naturalness', and 'Politeness' (the Dependent Variables).

'Pleasantness' was designed to be used as an estimate of emotional valence, to test whether negative robot feedback is perceived as more unpleasant than positive feedback. We decided to investigate 'Politeness' as politeness strategies are commonly employed in one-to-one tutoring interactions and have been shown to both enhance and inhibit learning [21]. Both the 'Naturalness' and 'Politeness' scales have been used in a number of perceptual tests evaluating the output of multimodal output modules and natural language generation systems, in particular to inform surface realisation for different user personalities [19] and preferred styles [3].
Figure 2: Digital version of the Affect Grid adapted from [27] as a single-item measure of valence and arousal. In this example, a judgment (indicated by the cross) has been made on the 9x9 grid, indicating a response of '7' for valence and '8' for arousal

Figure 3: Sound selection with BEST-IDs of positive and negative stimuli from the BEST dataset as a function of their valence and arousal ratings. Error bars represent the standard error of the mean (SEM)
4.2 The Emblems

Sound emblems are short nonverbal utterances that are conceptually separate from vocal "affect bursts" in that they are on the "pull" end of Scherer's [29, 30] push-pull dimension. Scherer derived this conceptual distinction from prior work on the acoustic origin of facial expressions and nonverbal behavior [20, 10, 4], wherein the facial musculature "pushes" the upper vocal tract to produce "raw affect bursts", e.g. laughter [31]. At the other end of this continuum are pull effects, which require the shaping of an expression to conform to socio-cultural models [30]. Vocal emblems, e.g. "Yuck!", are therefore conceptually much closer to verbal interjections than to raw affect bursts [31].

In the present research, we studied nonverbal sound emblems rather than verbal interjections or affect bursts because we aimed to interlink these emblems with the verbal utterances of a robotic tutor, which can be seen as standing in a cultural tradition of robots such as R2-D2 and others that have made prominent appearances in movies. These types of representations have shaped collective cultural expectations about the sounds made by small humanoid-looking robots. We thus use the term sound "emblem" to emphasize the conceptual distance from raw affect bursts, as well as to denote the explicit and culturally shaped signaling aims for which these sounds were originally created.

The sound emblems were selected from BEST, a publicly available sound database [11]. The BEST sounds were generated by 9 expert performers on an iPad via the Bebot app [1], a touch sound synthesizer. The database has been validated by 194 unique judges from 41 different countries. This evaluation comprised judgments of discrete emotions using the Differential Emotions Scale [7], as well as judgments of valence and arousal obtained on the basis of dimensional valence-arousal circumplex models [25, 26, 34]. To avoid subject fatigue during these validation studies, the sounds were randomly distributed across 4 subsets, and each individual sound was evaluated by at least 40 subjects using a digital version of the 9x9 Affect Grid (see Figure 2), adapted from [27] to provide a fast and user-friendly single-item measure of valence and arousal. The Affect Grid is a frequently used single-item measure [28] that has been independently validated and found to be valuable in situations where more time-consuming measures would not be feasible [12].
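To make the response format concrete, the following minimal sketch scores a single Affect Grid judgment, assuming the conventional axis layout (horizontal: displeasure to pleasure; vertical: low to high arousal). The coordinate convention and function are illustrative only, not the instrument used in the BEST validation.

```python
# Hypothetical scoring helper for the 9x9 Affect Grid: a judgment in column x
# and row y (counted from the bottom-left cell) is read as a 1-9 valence score
# and a 1-9 arousal score, e.g. the cross in Figure 2 yields (7, 8).
def score_affect_grid(col: int, row: int) -> tuple[int, int]:
    """Return (valence, arousal) for a single Affect Grid judgment."""
    if not (1 <= col <= 9 and 1 <= row <= 9):
        raise ValueError("Affect Grid cells run from 1 to 9 on both axes")
    return col, row  # x: unpleasant -> pleasant; y: low -> high arousal
```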
Five negative sound emblems and five positive ones were selected from the BEST database (see Figure 3) and assigned randomly to the respective negative and positive TTS utterances. The selection followed two criteria: the duration had to be shorter than 2s, and the sound had to score at the high vs. low extreme on valence. After the selection, the sound files (TTS + acoustic emblems) were equalized and mixed with SoX, as sketched below.
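The sketch below illustrates what this selection-and-mixing step could look like. The ratings table, column names, file names and exact SoX options are assumptions; the paper states only the two selection criteria and that the files were equalized and mixed with SoX.

```python
# Hypothetical reconstruction of the stimulus preparation step; the ratings
# table, column names, file names and SoX options are illustrative assumptions.
import subprocess
import pandas as pd

# Assumed export of the BEST validation ratings: sound_id, valence, duration_s.
ratings = pd.read_csv("best_validation.csv")

# The two stated criteria: duration under 2 s, extreme (high vs. low) valence.
short = ratings[ratings["duration_s"] < 2.0].sort_values("valence")
negative_emblems = short.head(5)["sound_id"].tolist()  # lowest-valence sounds
positive_emblems = short.tail(5)["sound_id"].tolist()  # highest-valence sounds

def build_versions(tts_wav: str, emblem_wav: str, stem: str) -> None:
    """Create the emblem-before, emblem-after and speech-only versions of one
    item. `sox a.wav b.wav out.wav` concatenates its inputs (assuming equal
    sample rates and channel counts); `gain -n -3` normalises to -3 dBFS."""
    for name, parts in {"before": [emblem_wav, tts_wav],
                        "after": [tts_wav, emblem_wav],
                        "only": [tts_wav]}.items():
        subprocess.run(["sox", *parts, f"{stem}_{name}.wav", "gain", "-n", "-3"],
                       check=True)

# e.g. pair the most positive emblem with one positive utterance:
build_versions("way_to_go.wav", f"{positive_emblems[0]}.wav", "pos_way_to_go")
```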
5. RESULTS

We tested three hypotheses for each of the three rating scales (dimensions): I. ('Valence') whether there is indeed a significant difference between the ratings of what we defined as 'positive' and 'negative' stimuli in terms of our three dimensions of 'Pleasantness', 'Politeness' and 'Naturalness'; II. ('Emblem Addition') whether adding an emblem reduces or increases the values on the three dimensions; and III. ('Position Factor') whether the position of the sound emblem, before or after the utterance, produces a significant difference in the judges' attributions. In order to test these hypotheses, we carried out a Repeated-Measures ANOVA for each of the rating scales (Dependent Variables) separately. Furthermore, we conducted a series of post-hoc contrasts, all computed using the Dunnett-Tukey-Kramer (DTK) pairwise multiple comparison test, which adjusts for unequal variances and unequal sample sizes [17].
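A minimal sketch of this analysis pipeline, under assumed column names, is given below. The original post-hoc contrasts used the DTK test in R [17]; statsmodels' Tukey HSD is shown merely as a rough Python stand-in, since it does not adjust for unequal variances.

```python
# Sketch of the analysis under assumed column names; the original used a
# Repeated-Measures ANOVA plus the DTK post-hoc test in R [17].
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("ratings.csv")  # subject, position, pleasantness, politeness, naturalness

# One RM-ANOVA per rating scale; 'position' is the 3-level within factor
# ('sound before', 'sound after', 'utterance only'), averaging each
# subject's repeated ratings per level.
for scale in ("pleasantness", "politeness", "naturalness"):
    fit = AnovaRM(df, depvar=scale, subject="subject",
                  within=["position"], aggregate_func="mean").fit()
    print(scale, "\n", fit.anova_table)

# Pairwise contrasts between the three position levels (rough stand-in for DTK).
print(pairwise_tukeyhsd(df["pleasantness"], df["position"]))
```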
Figure 4 shows the rating results for the three dimensions. Firstly, with regards to Hypothesis I, the results confirm that for 'Valence' there is a significant difference (p < 0.05) between the ratings of negative feedback and positive feedback on two of the three dimensions: 'Pleasantness', F(1, 119) = 4.92, and 'Politeness', F(1, 119) = 4.46. In other words, positive feedback is considered significantly more polite and pleasant than negative feedback, regardless of whether it contains a sound emblem or not. There were no significant effects for 'Naturalness'; that is to say, negative and positive stimuli are considered to have the same 'Naturalness', which is to be expected given that the TTS engine and the selection of sound emblems are the same across stimuli.

Figure 4: Mean ratings on a 7-point Likert scale for 'Pleasantness', 'Politeness' and 'Naturalness'. Means and standard deviations are given

With regards to Hypotheses II and III, for all three dimensions the ANOVA showed a main effect (p < 0.05) of the 'Position Factor' (3 levels: 'sound before', 'sound after', 'utterance only'; F(2, 238) = 28.88 for 'Pleasantness', F(2, 238) = 22.17 for 'Politeness', and F(2, 238) = 32.24 for 'Naturalness'). The DTK post-hoc comparisons show that, for all three dimensions, the significant contrasts (p < 0.05) are between 'utterance only' and 'sound after', as well as between 'utterance only' and 'sound before', but not between 'sound after' and 'sound before'. In other words, our participants perceived the stimuli presented in the 'utterance only' condition as more pleasant, natural and polite than stimuli paired with a sound emblem in either position: adding an emblem to an utterance (be it positive or negative) significantly reduces 'Pleasantness', 'Naturalness' and 'Politeness'. For positive feedback, adding an emblem in either position does not increase the valence (i.e. make it more positive); in fact, surprisingly, it has the opposite effect and makes the feedback be perceived as significantly more negative on all three dimensions (p < 0.05).
6. DISCUSSION AND FUTURE WORK

Our study suggests that, even if every attempt is made to make negative feedback sound pleasant and polite, it will still be deemed less so than positive feedback. This work can therefore be used to inform the design of robotic multimodal output, allowing the system designer to be aware of the emotional valence induced by the various types of feedback and to define strategies that mitigate any negative consequences. Whilst the results concerning the use of sound emblems are somewhat unexpected, it may be that the binary categorisation into negative/positive is too coarse, and that a more fine-grained and sophisticated pairing is required between the utterances, in terms of their content and length, and the intended valence of the sound emblems.

In addition, it is possible that users are so unused to hearing sound emblems of this type in everyday conversation that they were not sure how to judge them. This lack of general consensus on how to interpret the emblems is reflected in the standard deviations given in Figure 4, which are greater for those utterances containing a sound emblem. Furthermore, it should be noted that, while participants in the present study were informed that they would be listening to robot speech, they were given no concrete visual representation of the robot. Therefore, it is possible that basic differences in perceived 'Naturalness' between 'with sound emblem' and 'without sound emblem' stimuli may have been partially responsible for the overall reductions in 'Pleasantness' for the composite stimuli. In an embodied presentation situation, the sounds might thus have been perceived as more congruent with the TTS than was the case in the present online study.

One advantage of sound emblems is that they are language independent. However, further studies would need to investigate cultural differences along the lines of the studies reported in [5]. In addition, future work involves repeating the study with the target age group of children and with the physical embodiment of the robot.

In the Emote system, the multimodal output is currently categorised as binary negative or positive, and the choice of feedback is triggered by the inferred affect of the user in terms of four broad affective states: frustrated, excited, bored and neutral. Future work would involve making the system adaptive at a more fine-grained level, both in terms of the affect it infers from the user and the multimodal output it generates. The study described here moves the field closer to such highly sensitive affective systems.

7. ACKNOWLEDGMENTS

This work was supported by the European Commission (EC) and was funded by the EU FP7 ICT-317923 project Emote. The authors are solely responsible for the content of this publication. It does not represent the opinion of the EC, and the EC is not responsible for any use that might be made of data appearing therein. We would like to acknowledge the other Emote partners for their collaboration in developing the system described here.

8. REFERENCES

[1] R. Black. Bebot - Robot Synth. Normalware, 2014.
[2] G. Castellano, A. Paiva, A. Kappas, R. Aylett, H. Hastie, W. Barendregt, F. Nabais, and S. Bull. Towards empathic virtual and robotic tutors. In Proc. of AIED, 2013.
[3] N. Dethlefs, H. Cuayahuitl, H. Hastie, V. Rieser, and O. Lemon. Cluster-based prediction of user ratings for stylistic surface realisation. In Proc. of EACL, 2014.
[4] P. Ekman and W. V. Friesen. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1):49–98, 1969.
[5] H. Elfenbein and N. Ambady. On the universality and cultural specificity of emotion recognition: A meta-analysis. Psychological Bulletin, 128(2):203–235, 2002.
[6] A. C. Graesser, K. Wiemer-Hastings, P. Wiemer-Hastings, and R. Kreuz. AutoTutor: A simulation of a human tutor. Cognitive Systems Research, 1(1):35–51, 1999.
[7] C. E. Izard. The Psychology of Emotions. Springer Science & Business Media, 1991.
[8] S. Janarthanam, H. Hastie, A. Deshmukh, R. Aylett, and M. E. Foster. A reusable interaction management module: Use case for empathic robotic tutoring. In Proc. of SemDial, 2015.
[9] E.-S. Jee, Y.-J. Jeong, C. H. Kim, and H. Kobayashi. Sound design for emotion and intention expression of socially interactive robots. Intelligent Service Robotics, 3(3):199–206, 2010.
[10] H. G. Johnson, P. Ekman, and W. V. Friesen. Communicative body movements: American emblems. Semiotica, 15(4):335–354, 1975.
[11] A. Kappas, D. Kuester, P. Dente, and C. Basedow. Simply the B.E.S.T.! Creation and validation of the Bremen Emotional Sounds Toolkit. Poster presented at the 1st International Convention of Psychological Science, Amsterdam, the Netherlands, 2015.
[12] W. Killgore. The Affect Grid: A moderately valid, nonspecific measure of pleasure and arousal. Psychological Reports, 83(2):639–642, 1998.
[13] A. N. Kluger and A. DeNisi. The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119(2):254–284, 1996.
[14] T. Komatsu. Can we assign attitudes to a computer based on its beep sounds? In Proc. of the Affective Interactions: The Computer in the Affective Loop Workshop at Intelligent User Interfaces 2005, 2005.
[15] T. Komatsu, K. Kobayashi, S. Yamada, K. Funakoshi, and M. Nakano. Augmenting expressivity of artificial subtle expressions (ASEs): Preliminary design guideline for ASEs. In Proc. of the 5th Augmented Human International Conference, 2014.
[16] T. Komatsu and S. Yamada. How does the agents' appearance affect users' interpretation of the agents' attitudes: Experimental investigation on expressing the same artificial sounds from agents with different appearances. Intl. Journal of Human-Computer Interaction, 27(3):260–279, 2011.
[17] M. K. Lau. DTK: Dunnett-Tukey-Kramer Pairwise Multiple Comparison Test Adjusted for Unequal Variances and Unequal Sample Sizes. R package.
[18] I. Leite, G. Castellano, A. Pereira, C. Martinho, and A. Paiva. Modelling empathic behaviour in a robotic game companion for children: An ethnographic study in real-world settings. In Proc. of HRI, 2012.
[19] F. Mairesse and M. Walker. Personality generation for dialogue. In Proc. of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), 2007.
[20] J. J. Ohala. The acoustic origin of the smile. The Journal of the Acoustical Society of America, 68(S1):S33, 1980.
[21] N. K. Person, R. J. Kreuz, R. A. Zwaan, and A. C. Graesser. Pragmatics and pedagogy: Conversational rules and politeness strategies may inhibit effective tutoring. Cognition and Instruction, 13(2):161–188, 1995.
[22] R. W. Picard, S. Papert, W. Bender, B. Blumberg, C. Breazeal, D. Cavallo, T. Machover, M. Resnick, D. Roy, and C. Strohecker. Affective learning — a manifesto. BT Technology Journal, 22(4):253–269, 2004.
[23] R. Read and T. Belpaeme. How to use non-linguistic utterances to convey emotion in child-robot interaction. In Proc. of the 7th ACM/IEEE International Conference on Human-Robot Interaction, pages 219–220. ACM, 2012.
[24] R. Read and T. Belpaeme. Situational context directs how people affectively interpret robotic non-linguistic utterances. In Proc. of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, pages 41–48. ACM, 2014.
[25] J. A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178, 1980.
[26] J. A. Russell. Emotion, core affect, and psychological construction. Cognition and Emotion, 23(7):1259–1283, 2009.
[27] J. A. Russell, A. Weiss, and G. A. Mendelsohn. Affect Grid: A single-item scale of pleasure and arousal. Journal of Personality and Social Psychology, 57(3):493–502, 1989.
[28] Y. I. Russell and F. Gobet. Sinuosity and the affect grid: A method for adjusting repeated mood scores. Perceptual and Motor Skills, 114(1):125–136, 2012.
[29] K. R. Scherer. On the symbolic functions of vocal affect expression. Journal of Language and Social Psychology, 7(2):79–100, 1988.
[30] K. R. Scherer. Affect bursts. In Emotions: Essays on Emotion Theory, pages 161–196, 1994.
[31] M. Schröder. Experimental study of affect bursts. Speech Communication, 40(1):99–116, 2003.
[32] P. Sharp. Behavior modification in the secondary school: A survey of students' attitudes to rewards and praise. 9:109–112, 1985.
[33] M. Tielman, M. Neerincx, J.-J. Meyer, and R. Looije. Adaptive emotional expression in robot-child interaction. In Proc. of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, pages 407–414. ACM, 2014.
[34] M. Yik, J. A. Russell, and J. H. Steiger. A 12-point circumplex structure of core affect. Emotion, 11(4):705–731, 2011.

