
Robot Duck Debugging: Can Attentive Listening Improve Problem Solving?

Maria Teresa Parreira, Sarah Gillet, Iolanda Leite
KTH Royal Institute of Technology, Stockholm, Sweden

Abstract
While thinking aloud has been reported to positively affect problem-solving, the effects of the presence of an embodied entity (e.g., a social robot) to whom words can be directed remain mostly unexplored. In this work, we investigated the role of a robot in a "rubber duck debugging" setting, by analyzing how a robot's listening behaviors could support a thinking-aloud problem-solving session. Participants completed two different tasks while speaking their thoughts aloud to either a robot or an inanimate object (a giant rubber duck). We implemented and tested two types of listener behavior in the robot: a rule-based heuristic and a deep-learning-based model. In a between-subject user study with 101 participants, we evaluated how the presence of a robot affected users' engagement in thinking aloud, behavior during the task, and self-reported user experience. In addition, we explored the impact of the two robot listening behaviors on those measures. In contrast to prior work, our results indicate that neither the rule-based heuristic nor the deep learning robot conditions improved performance or perception of the task, compared to an inanimate object. We discuss potential explanations and shed light on the feasibility of designing social robots as assistive tools in thinking-aloud problem-solving tasks.

Fig. 1: Overview of the experimental set-up. Participants were invited to think aloud while completing tasks on the tablet. They directed their words to A) a Furhat robot which displays either a heuristic (NaïveL) or learned (DataL) listening behavior or B) an inanimate object (giant rubber duck).

I. INTRODUCTION

The concept of "Rubber Duck Debugging" originated from a story in the book The Pragmatic Programmer [20]. It encompasses the idea that a rubber duck can assist the process of "debugging" a program by serving as an embodied listener when a person explains the code aloud step-by-step. This concept is related to the technique of Think-Aloud Problem Solving (TAPS) [30], which consists of the verbalization of one's thoughts while solving tasks. Prior literature found that verbalization is an effective strategy to improve learning and task performance [53, 8], but this activity takes place as an individual process. Rubber duck debugging reframes the act of problem-solving as a thinking-aloud interaction, where the second interactant is an unresponsive listener.

Social robots can assist people in therapy for rehabilitation [5, 7] and mental health [44, 25], or support children with special needs [50, 11, 32]. More specifically, the presence of a robot in a TAPS session was seen to improve learning gains in young students [45]. Moreover, work by Gratch et al. [15] showed that a responsive listener agent, exhibiting non-verbal behavior, leads subjects to spend more time talking and use more words, when compared to an unresponsive listener agent.

In this work, we explored a "rubber duck problem-solving session" and the impact of replacing the inanimate object with an active listener robot. Prior work has established different methods to generate listener behaviors, formulating rule-based heuristics extracted from human conversation data [54, 40] or using deep learning (DL) models [19, 58]. We aim to examine if listener behaviors generated through either a heuristic or a DL model lead to different outcomes in a TAPS session. This allows us to explore the sensitivity of the user to robot behavior when the focus of the session is on solving the task rather than the social interaction with the robot.

In a between-subject study with three conditions, we evaluated how the two different robot listening behaviors and an inanimate object - a rubber duck - affect the outcome and perception of two distinct think-aloud problem-solving tasks, a deductive logic quiz and an open-ended question. To the best of our knowledge, this is the first study characterizing how the presence of an active listener robot affects the perception of a TAPS session. Considering the reported positive effects of verbalization during problem-solving and the presence of robots in learning scenarios, we explore the potential of robots as "social assistants" in everyday problem-solving tasks. In sum, our main contributions are: (1) investigating the impact of an active listener robot in a thinking-aloud task; (2) developing and evaluating a rule-based listening behavior and a deep-learning based listening behavior system from publicly available human-human dyadic conversational data; (3) exploring how methods and datasets used for developing listener behaviors in one social setting - a conversation - perform in the context of attentive listening to assist a problem-solving session.
II. RELATED WORK

This research is based on previous work in the areas of social robots in think-aloud tasks and generating active listening behavior.

Social Robots in Thinking-aloud tasks: Prior literature has explored how a robot can play an auxiliary role in problem-solving. Wijnen et al. [58] studied how the presence of a social robot can affect explanatory behavior in children, which is associated with deeper learning. The authors concluded that the presence of a robot to which children could direct their words was effective in triggering explanatory behavior (longer duration of utterances, more relevant content). Closer to our work, Ramachandran et al. [45] explored how the presence of a social robot while engaging in a TAPS task could improve learning gains in young students, versus a no-robot condition. The robot would intervene with tutoring notes and reminders to keep thinking aloud. They concluded that the robot platform could support the thinking-aloud process and enhance immediate and persistent learning gains. The authors suggest that the robot represented an embodied social entity towards which students could direct their words, which is in line with the findings from Wijnen et al. [58]. To the best of our knowledge, there is no similar work on adult subjects.

We expand on prior work by comparing the effects of the presence of an active listener robot to those of an inanimate object in a "rubber duck debugging"/TAPS session with adults. In order to assess if different robot listening behaviors lead to different outcomes, we implemented two types of active listener behaviors, both based on previous work.

Generating Listening Behavior: Listening behavior is characterized by backchannels (BCs) [16], which signal attentiveness or emotion to a speaker. These include body language, such as head nods or smiles, and short vocalizations ('uh-huh', 'hmm'). Backchannel placement is an important component of conversations, as it can skew the perception of the interactants [43].

A significant amount of work has studied the generation of listening behavior in robots. One approach is to develop rule-based methods that are informed by prosodic features, such as voice pitch and pauses [54, 55], or by behavior observed in human-human conversational data [40]. As an alternative to designing hand-crafted heuristics, the use of automated methods has been widely adopted [10, 26]. Okato et al. [41] designed a Hidden Markov Model (HMM) based on prosodic patterns to detect when to emit a BC. Morency et al. [33] also used an HMM with multimodal (audiovisual) input features to predict BC timing.

More recently, deep learning techniques have been deployed to inform the design of listening behaviors. Examples include Recurrent Neural Network (RNN) models, which capture the temporal dependencies of continuous signals (i.e., retain "memory" of previous inputs). Ruede et al. [48] used Long Short-Term Memory (LSTM) layers to predict BCs, with acoustic features and word history as input features. Huang et al. [19] also tested LSTMs and Gated Recurrent Unit (GRU) layers for the development of idle behavior in virtual agents, with multimodal input features. Similarly, Murray et al. [36] made use of LSTMs and multimodal input, and suggested a method for data augmentation that positively impacts BC prediction. While the use of RNNs has been widely adopted in sequential decision-making tasks, such as the generation of listener behavior, we highlight other interesting approaches, such as semi-supervised learning [22], or reinforcement learning [21, 47].

In the present work, we developed two different robot listener behavior generation policies - one rule-based, one DL-based - and evaluated how these impacted various aspects of think-aloud problem-solving tasks: engagement with thinking-aloud, task performance, and user experience.

III. IMPLEMENTATION OF LISTENER BEHAVIOR

We set out to study how different listening behaviors can support a "rubber duck" problem-solving session. The listening behavior consisted of the robot's gaze and backchanneling (vocal or non-vocal). For vocal utterances, the specific sound was selected randomly from a set of pre-defined utterances (e.g., 'hmm', 'ahh'). The non-vocal backchannel was realized through a head nod.

For developing backchanneling behavior models, we included two different approaches, as we intended to shed light on the sensitivity of the user to listening behaviors when the interaction with the robot is only auxiliary (not collaborative) to task completion. We tested each of the two models as a different robot listening condition in our user study.

As suggested by prior work [22], the decision-making process for generating a backchannel was divided into two stages: first, the timing of the backchannel was determined; in a second step, the type of backchannel was selected. This separation allows for handling the natural imbalance in conversational data (few examples of each backchannel type and more datapoints for "no backchannel").

Backchannel timing and type were either heuristically determined (naïve listener model, NaiveL) or learned through the application of a deep learning model (data-driven model, DataL). We detail the implementation of these behaviors below.
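For illustration, the following minimal Python sketch (not the authors' code) shows how such a two-stage policy can be run at the 2 Hz decision rate used in this work; `audio_stream`, `extract_features`, `timing_model`, `type_model`, and `robot` are hypothetical placeholders.

import time

ACTIONS = ["vocal", "nod", "both"]     # backchannel types used in this work
STEP_S = 0.5                           # one decision every 0.5 s (2 Hz)

def listening_loop(audio_stream, extract_features, timing_model, type_model, robot):
    # Stage 1 decides *whether* to backchannel, stage 2 decides *which* type.
    while audio_stream.is_open():
        window = audio_stream.read(seconds=STEP_S)      # latest speaker audio
        features = extract_features(window)
        if timing_model.predict(features) == 1:         # stage 1: BC vs. no BC
            bc = ACTIONS[type_model.predict(features)]  # stage 2: vocal / nod / both
            robot.perform_backchannel(bc)
        time.sleep(STEP_S)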
A. Naïve Listener

To develop the robot's behavior for the naïve listener condition, we decided to use a frequently deployed heuristic, originally developed by Ward and Tsukahara [55]. This work used a corpus of human conversational data in English to implement a heuristic that finds the most appropriate timing for a backchannel in a dyadic interaction. The heuristic is based on prosodic cues, i.e. pitch. The authors define a set of conditions for producing a backchannel, which we reproduce verbatim:
(P1) a region of pitch less than the 26th-percentile pitch level and
(P2) continuing for at least 110 milliseconds,
(P3) coming after at least 700 milliseconds of speech,
(P4) providing you have not output backchannel feedback within the preceding 800 milliseconds,
(P5) after 700 milliseconds wait.

To run this model in our user study, we computed the pitch distribution of the user's voice over the last 50 s of audio and calculated the pitch region percentile over the last 10 ms. The backchannel type was randomly selected to be either vocal or non-vocal when the five conditions shown above were met.
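The following hedged Python sketch shows one way to operationalize conditions P1-P5 as we applied them (pitch distribution over the last 50 s, one pitch sample per 10 ms); P5 is approximated by a simple elapsed-time check, and `robot` is a hypothetical placeholder.

import random
import numpy as np

def should_backchannel(pitch_track, low_pitch_ms, speech_ms,
                       since_last_bc_ms, since_candidate_ms):
    """pitch_track: F0 samples (one per 10 ms) over the last 50 s; zeros = unvoiced."""
    voiced = pitch_track[pitch_track > 0]
    if voiced.size == 0:
        return False
    p26 = np.percentile(voiced, 26)                  # P1 threshold over last 50 s
    return (0 < pitch_track[-1] < p26                # P1: pitch below 26th percentile
            and low_pitch_ms >= 110                  # P2: low region for >= 110 ms
            and speech_ms >= 700                     # P3: after >= 700 ms of speech
            and since_last_bc_ms >= 800              # P4: no BC in the last 800 ms
            and since_candidate_ms >= 700)           # P5: 700 ms wait (approximation)

def naive_listener_step(state, robot):
    if should_backchannel(**state):
        robot.perform_backchannel(random.choice(["vocal", "nod"]))  # random type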
B. Data-Driven Listener

We explored the use of machine learning to determine backchannel timing and type. While we maintain the input modality in the form of audio features, as in the naïve condition, we wanted to explore if the learned behavior, by having higher potential to adapt to different speakers, could lead to more positively evaluated interactions [33].

1) Problem Formulation: We formulated the problem of generating backchanneling in a TAPS session as a sequential decision-making problem. At any time-step t, the environment (user's voice features) is captured as state variable s_t ∈ S. The robot takes actions a_t ∈ A. It first chooses between a binary output: performing a BC or doing nothing. If the robot chooses to perform a BC, s_t is again used to inform which type of BC to perform. The potential actions a_t that follow are vocal BC (vocal utterance), non-vocal BC (nodding), or both.

Dataset: The collection of conversational data is frequently the first step to developing listening behaviors. In this work, we wanted to test if we could instead use data collected in a different setting from that in which we deployed our model. Some interesting corpora of dyadic interactions in English include CCDb [3], IEMOCAP [4] and D64 [39], among others [51]. We were looking for unscripted, non-topic-bounded interactions that were publicly available. As per these criteria, the dataset used for training was Cardiff's Conversation Database (CCDb) [3].

The CCDb consists of 30 dyadic conversations, each lasting around five minutes. The 30 conversations are between 16 different speakers (12 M, 4 F), with ages ranging from 25-56 years old. We used the subset of conversations that were annotated for facial expressions and utterances to train the listening behavior policy. We extracted data from the perspective of each participant, totaling 80 minutes of conversational data. We split the individual feature streams into moments where the participant was the speaker or the listener. As we are focusing on listener behavior, we extracted features only in the listener portions of the conversation. As training data, we used the audio features from the speaker (the other participant), while extracting the annotated backchannels performed by the listener participant as ground truth for our model. The conversations from the CCDb dataset include extensive annotations. We filtered the annotations to provide positive training instances in moments where vocal backchannels or head motion (nodding) were present.

State space: As input features for our model, we extracted speech features from the speaker, including 13-dimensional mel-frequency cepstrum coefficients (MFCC) and 4-dimensional prosody features as used in prior works [48, 36, 22]. The MFCC features are computed every 30 ms with a sliding Hamming window of 400 ms. Prosody features include pitch (fundamental frequency) and yin-energy, as well as the first derivative of these variables. The final 34-item feature vector is composed of the mean and standard deviation of each of these features. The dataset frequency is 2 Hz (one sample every 0.5 s) and was normalized before training.
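As an illustration only (assuming the librosa library; this is not the study's exact extraction code), the 34-dimensional state vector described above could be computed roughly as follows, with RMS energy standing in for yin-energy:

import numpy as np
import librosa

def state_vector(y, sr):
    """y: ~0.5 s of speaker audio; returns a 34-dim feature vector."""
    n_fft, hop = int(0.4 * sr), int(0.03 * sr)       # 400 ms window, 30 ms hop
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop, window="hamming")
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                     frame_length=n_fft, hop_length=hop)        # pitch (F0)
    energy = librosa.feature.rms(y=y, frame_length=n_fft,
                                 hop_length=hop)[0]             # stand-in for yin-energy
    t = min(mfcc.shape[1], f0.shape[0], energy.shape[0])
    prosody = np.vstack([f0[:t], energy[:t],
                         np.gradient(f0[:t]), np.gradient(energy[:t])])
    feats = np.vstack([mfcc[:, :t], prosody])                   # (13 + 4, t)
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])  # 34 values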
Action space: The robot could perform the same actions as those performed by the NaïveL robot (Sec. III-A) - vocal utterance, nod of varying amplitude, or a simultaneous combination of both.

2) Training policies for listening behavior: To train sufficiently good behavior policies, we explored different model architectures and techniques such as data augmentation [36]. Below, we explain the models tested and the selection process, as well as model performance.

Data Augmentation: Murray et al. [36] suggest a method to improve the robustness of robot listening behaviors by using audio data augmentation. The authors show that the model trained on this data outperforms a rule-based and a random model, as reported by users. We emulated the proposed method by making use of masking techniques in the time and frequency domains. In our work, training instances (audio features) were partially masked in one or both domains, chosen at random. Both the original and the deformed instances were used for training.
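A minimal sketch of this masking scheme, assuming feature windows of shape (lookback, n_features); the mask widths are illustrative choices rather than the values used in the study:

import numpy as np

rng = np.random.default_rng(0)

def augment(window, max_time_mask=2, max_feat_mask=6):
    """window: (lookback, n_features) array; returns a partially masked copy."""
    x = window.copy()
    domain = rng.choice(["time", "feature", "both"])
    if domain in ("time", "both"):                   # zero out a band of time steps
        t = int(rng.integers(1, max_time_mask + 1))
        t0 = int(rng.integers(0, x.shape[0] - t + 1))
        x[t0:t0 + t, :] = 0.0
    if domain in ("feature", "both"):                # zero out a band of feature dims
        f = int(rng.integers(1, max_feat_mask + 1))
        f0 = int(rng.integers(0, x.shape[1] - f + 1))
        x[:, f0:f0 + f] = 0.0
    return x

# both the original and the masked copies are kept for training, e.g.:
# X_aug = np.concatenate([X, np.stack([augment(w) for w in X])])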
Models: We explored the use of deep learning techniques to develop listening behavior. We separated our model into two components according to the two-staged decision-making process, the Timing and the Type modules. The Timing module learns the appropriate timing to perform backchanneling, and the Type module decides which type of feedback - vocal, non-vocal (nodding) or both - the robot should emit. The first module is a binary classification task (do BC or do nothing). In the case of a positive ('do BC') output, the second module is activated (a multiclass classification model).

The use of neural networks has proven to be a versatile tool to learn relevant features automatically [35]. Multiple authors have approached this problem by making use of models that consider previous internal states. LSTMs are a common approach [48, 36], although these architectures are prone to overfitting. Alternatively, Gated Recurrent Units (GRUs) can be used [19]. These are less complex and usually preferred over LSTMs for small training datasets.

Model Hyperparameter Tuning: In order to determine the best model, we tested different combinations of models and parameters. For both the Timing and Type modules, we trained single-layer LSTM and GRU models, which were followed by a dropout layer and a final dense layer. Each model was either trained on the original or the augmented dataset, considering varying previous timesteps, i.e. lookback sizes (5, 10 and 15, respectively 2.5, 5 or 7.5 seconds of data), activation functions (sigmoid, ReLU and softmax), batch sizes (8, 16, 32), dropout rates (0, 0.2 and 0.4), L2 regularization (0.1, 0.01, 0.001, 0.0001), loss functions (focal loss, MSE, binary cross-entropy or hinge loss) and optimizers (SGD and Adam).
TABLE I: Performance metrics of the models used (averaged over all validation folds). Both models used augmented data for training. Comparison with the performance of the rule-based model (NaïveL condition) on the same dataset.

Module  Model   Hyperparameters                                      Performance
Timing  GRU     Lookback: 5, activation: sigmoid, batch size: 16,    Macro - Accuracy: 0.95, Precision: 0.52, Recall: 0.51, F1: 0.50
                dropout: 0.0, loss: focal, optimizer: Adam           Margin - Accuracy: 0.95, Precision: 0.59, Recall: 0.76, F1: 0.65
                                                                     BC Prediction Deviation: 0.83
Type    GRU     Lookback: 10, activation: sigmoid, batch size: 32,   Macro - Accuracy: 0.64, Precision: 0.37, Recall: 0.39, F1: 0.35
                dropout: 0.2, loss: MSE, optimizer: SGD
Timing  NaïveL  Ward and Tsukahara [55]                              Macro - Accuracy: 0.95, Precision: 0.48, Recall: 0.50, F1: 0.49
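For illustration, a hedged Keras sketch of the single-layer GRU + dropout + dense architecture with the winning Timing-module settings from Table I (lookback 5, sigmoid activation, batch size 16, no dropout, focal loss, Adam); the number of GRU units is not reported above and is an assumption here.

import tensorflow as tf

N_FEATURES = 34   # speaker feature vector (Sec. III-B)
LOOKBACK = 5      # 2.5 s of 2 Hz states

def build_timing_model(units=32, dropout=0.0):
    # units=32 is an assumed size; dropout=0.0 per Table I (Timing module)
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(LOOKBACK, N_FEATURES)),
        tf.keras.layers.GRU(units, activation="sigmoid"),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # BC vs. no BC
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.BinaryFocalCrossentropy(gamma=2.0),
                  metrics=["accuracy", tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])
    return model

# model = build_timing_model()
# model.fit(X_train, y_train, batch_size=16, epochs=100,
#           validation_data=(X_val, y_val),
#           callbacks=[tf.keras.callbacks.EarlyStopping(
#               monitor="val_loss", patience=5, restore_best_weights=True)])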

Evaluation Metrics: Performance on both modules was evaluated based on macro accuracy (number of correct predictions over all predictions), precision (probability that emitted BCs match the ground-truth data), recall (probability that a ground-truth BC is predicted by the model) and F1-score (harmonic mean of precision and recall).

Additionally, for the Timing module and in line with similar work [9, 48], we calculated the above metrics considering a tolerance margin of [−500, 500] ms; finally, we controlled for robot overreaction with a Backchannel frequency deviation metric. For each participant i, we considered the relative difference between the predicted number of backchannels and its original value:

    ∆BC_dev^i = |Y_true^i − Y_pred^i| / Y_true^i        (1)
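The sketch below illustrates the margin-based scoring and Eq. (1), assuming backchannels are represented by their onset times in seconds; the greedy one-to-one matching within ±500 ms is our reading of the tolerance margin, not the authors' exact procedure.

def margin_scores(true_times, pred_times, margin=0.5):
    """true_times / pred_times: backchannel onsets in seconds."""
    remaining = list(pred_times)
    tp = 0
    for t in true_times:
        hits = [p for p in remaining if abs(p - t) <= margin]
        if hits:                                   # match each true BC to at most one prediction
            remaining.remove(min(hits, key=lambda p: abs(p - t)))
            tp += 1
    precision = tp / max(len(pred_times), 1)
    recall = tp / max(len(true_times), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

def bc_deviation(n_true, n_pred):
    return abs(n_true - n_pred) / n_true           # Eq. (1), per participant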
Models' Selection and Performance: For training, data was split into 8 folds, with non-overlapping participant data in a 6:1:1 train-validation-test split. Each model was trained for a maximum of 100 epochs, but early stopping strategies were deployed for both loss and validation loss, in order to prevent overfitting. The use of a dropout layer and L2 regularization served the same purpose.

Each module (Timing and Type) was trained with an analogous methodology. We fine-tuned hyperparameters based on macro accuracy and F1-score. The final candidates for each model type (LSTM or GRU, augmented and not augmented training data) were then trained using k-fold cross-validation. For each module, we deployed the model that was most frequently in the top three best performing models for each of the metrics mentioned above. Both models are single-layer GRUs trained on the augmented dataset (see performance in Table I, along with performance for the NaïveL heuristic on the same dataset). The Timing model outperforms other similar model architectures for BC prediction [48, 36]. Interestingly, the modules have different lookback periods, with the Type model input considering a higher number of past internal states.

C. Gaze Behavior

Gaze behavior was kept constant across the two listening conditions, as direct eye contact was predicted to occur infrequently during the problem-solving sessions. We implemented the robot's gaze behavior based on the work by Garau et al. [13]. They found that adapting an avatar's gaze and head pose to the turn-taking pattern of a conversation outperformed a random gaze agent. The agent alternates between two actions, looking at the user or away from them (gaze aversion). The duration of each action is informed by previous work on dyadic interactions [2, 27] and varies depending on whether the agent is listening or speaking. For this work, we only used the listening timing indications.
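A minimal sketch of such an alternating gaze policy; the duration ranges below are illustrative placeholders rather than the values from [2, 27] used in the study, and `robot` is a hypothetical interface.

import random
import time

LOOK_AT_USER_S = (3.0, 6.0)    # placeholder range while listening
LOOK_AWAY_S = (1.0, 2.0)       # placeholder gaze-aversion range

def gaze_loop(robot, session_active):
    while session_active():
        robot.look_at_user()
        time.sleep(random.uniform(*LOOK_AT_USER_S))
        robot.look_away()                      # gaze aversion
        time.sleep(random.uniform(*LOOK_AWAY_S))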
IV. STUDY DESIGN

We designed a between-subject study with three conditions in order to evaluate the effects of different previously proposed robot listener behaviors on the outcome of a TAPS activity. Two conditions concerned robot listener behaviors: the first, a rule-based heuristic where non-verbal utterances and head nods are crafted according to previous literature (as described in Section III-A); the second, based on deep learning models which learn listener behaviors through data input from human-human dyadic interactions (as described in Section III-B). The two listening behaviors were chosen to gauge if different complexity in the model used to generate the listener behavior affects the outcomes of TAPS activities. We compare these effects with the baseline condition, the presence of an inanimate object - a rubber duck - to whom words are directed. The three conditions are detailed in Section IV-B.

A. Hypotheses

We were interested in evaluating how a robot can affect the think-aloud process in a "rubber duck debugging" session. Thus, we defined our outcomes based on how the user interacted with the task. We measure user engagement in thinking-aloud and user perception of the problem-solving session (or user experience). Engagement with the think-aloud process measures how the user talks when thinking aloud, e.g. number and duration of verbal utterances. User experience addresses how the participant perceives solving the task, including self-assessed cognitive load (or "mental effort") and enjoyment (or "user experience index"). These measures are detailed in Section IV-D.
Wijnen et al. [58] found that the presence of a robot leads to better explaining behavior while learning in children, and Ramachandran et al. [45] found that children engage more in thinking-aloud utterances when a robot is present. Consequently, we hypothesized:

• H1a: Engagement in thinking aloud is higher when thinking aloud to an actively listening robot than to an inanimate object.

Moreover, Morency et al. [33] report that a multimodal probabilistic model is better at predicting backchannels than hand-crafted rules. As such, we expected the data-driven condition to generate listener behavior that leads to a more organic interaction:

• H1b: Listener behaviors learned from data lead to higher degrees of engagement in thinking-aloud, when compared to rule-based heuristics.

Following H1a, H1b, and because the verbalization of one's thoughts during problem-solving has been reported to affect how users solve a task [53, 8], we also hypothesize about user behavior during the task:

• H2a: User behavior during task completion will be different in the conditions where a robot is present. Users will take longer to solve tasks, but also perform better.
• H2b: Following H1b, users in the DL-based listener robot condition will take a longer time completing the tasks than users in the rule-based listener robot condition.

Finally, we wanted to evaluate the subjective experience of the user during the problem-solving sessions. Literature is sparse regarding whether thinking aloud affects how users perceive the task, and whether different listening behaviors skew that perception. As such, we took an exploratory approach to answer the following research questions:

• RQ1a: Is user experience affected by an active listener robot when compared to an inanimate object?
• RQ1b: Is user experience affected by the type of listening behavior the robot exhibits?

B. Conditions

We sought to evaluate how the presence and behavior of a social robot impacted the outcome of a "rubber duck" problem-solving session. We designed a between-subject study with three conditions:

Rubber Duck (RDuck): To reproduce a "rubber duck debugging" scenario, participants in the RDuck condition were instructed to think aloud directing their words to a rubber duck.

Naïve Listener Robot (NaiveL): The social robot displayed listening behavior generated by hand-crafted heuristics. Information on the detailed implementation is given in Section III-A.

Data-driven Listener Robot (DataL): The social robot displayed listening behavior that was learned through machine learning methods using a human-human conversational corpus. The implementation details are provided in Section III-B.

By including two different robot listening behaviors, we explored the transferability of two implementation methods of different complexity into a new social context, as well as the sensitivity of the user to differences between these behaviors.

Fig. 2: Layout example for one of the questions from the deductive logic quiz, from IQ Test Labs.

C. Problem-solving Tasks

We designed two different tasks in order to evaluate distinct contexts for thinking-aloud problem-solving and how the presence of a robot can impact the perception of those tasks and the outcomes.

1) Logic Deduction Quiz (LogicQ): Tasks using deductive logic are common when developing thinking-aloud protocols [31]. The questionnaire used here was adapted from IQ Test Labs (https://www.intelligencetest.com/questions/logic/1.html). For each question, 5 possible answers were shown, as illustrated in Figure 2. Participants were able to skip questions but could not see if their answers were correct. There was no time limit per question.

2) NASA Moon Survival Task (NASA): In the NASA task [17], the participant is placed in a hypothetical life-threatening scenario (spaceship crashed on the moon) and asked to rank a list of 15 available items according to their contribution towards survival.

The LogicQ task allowed for an objective evaluation of performance by counting correct answers to the quiz, whereas the NASA task was framed as an open-ended question involving strategic thinking. We analyzed the effects of each task separately, as they elicit different cognitive processes, and tasks similar to these give rise to different protocols in thinking-aloud studies [52].

D. Measures

We established a set of metrics that explore the outcomes of the "rubber duck" problem-solving session. Engagement in thinking-aloud and Task-related metrics quantify the objective effects of the different conditions on how the user acted, whereas User experience attempts to quantify the subjective experience of the user while solving the tasks.
1) Engagement metrics: To assess H1, we measured the user's Number and Duration of verbal utterances, as well as Speech-to-silence ratio (time spent talking divided by time spent silent, during a task) while solving each task. This provided an objective measure of the user's engagement with the think-aloud process. This data was collected from Voice Activity Detection (VAD) through lapel microphones.
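For illustration, the engagement metrics can be derived from a VAD stream roughly as follows (a sketch assuming 16 kHz, 16-bit mono PCM and the py-webrtcvad package; the utterance grouping is simplified and is not the study's exact pipeline):

import webrtcvad

def engagement_metrics(pcm, sample_rate=16000, frame_ms=30):
    """pcm: 16-bit mono PCM bytes from the lapel microphone."""
    vad = webrtcvad.Vad(2)                                   # aggressiveness 0-3
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2     # 2 bytes per sample
    flags = [vad.is_speech(pcm[i:i + frame_bytes], sample_rate)
             for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]

    # group contiguous speech frames into utterances (no gap bridging, for brevity)
    utterances, run = [], 0
    for speech in flags:
        if speech:
            run += 1
        elif run:
            utterances.append(run)
            run = 0
    if run:
        utterances.append(run)

    speech_s = sum(flags) * frame_ms / 1000
    silence_s = (len(flags) - sum(flags)) * frame_ms / 1000
    return {"number_of_utterances": len(utterances),
            "mean_utterance_duration_s":
                sum(utterances) * frame_ms / 1000 / len(utterances) if utterances else 0.0,
            "speech_to_silence_ratio":
                speech_s / silence_s if silence_s > 0 else float("inf")}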
2) Task-related metrics: For evaluation of H2, we defined metrics that relate to how the user behaved while solving the task. We measured performance in the LogicQ task by monitoring the number of Questions answered (QAns) and number of Correct answers. For the NASA task, we considered the time spent before submitting the final answer.

3) User experience metrics: Finally, to investigate RQ1, we measured self-reported Cognitive load and User experience index (UEI) of the problem-solving session. This was assessed in a questionnaire taken after each task. Cognitive load was measured using the NASA-TLX scale [18]. The perception of the problem-solving session was evaluated with three dimensions from the short version of the User Engagement Scale (focused attention, perceived usability and reward) [38].

4) Other measures: We also collected data on participants' demographics (age, gender identity, country of origin), as well as the personality trait of Extraversion from the Big Five Inventory [23, 46], in the pre-experiment questionnaire. In the post-experiment questionnaire, we assessed the Impact of thinking aloud, which asked users to rate how they thought thinking aloud helped them solve the tasks. This measure was obtained through five questions with a 5-point Likert scale. A Cronbach's Alpha of α = 0.84 indicated good reliability. The exact questions are provided as supplemental material. We further measured Extraversion and Impact of thinking aloud since personality and natural inclination for speaking aloud can impact user behavior during the tasks. Finally, we considered the Order of task presentation, since thinking aloud can suffer from a "cold start effect" [14] (as described in Section IV-E).

Finally, to explore if differences in user behavior could be due to the perception of the robot listening behaviors, participants in the conditions where the robot was present were asked about the robot's Listening behavior and Closeness [36], Social attributes (from RoSAS [6]), as well as previous experience with robots.

E. Experimental procedure

Participants were randomly assigned to one of the three conditions. After giving informed consent, participants read the experiment instructions. In all conditions, they were instructed to speak aloud to "Alex", their "rubber duck", in a study that aims to understand the cognitive processes behind problem-solving tasks. The instructions also clarified that Alex did not have any insight into the solutions of either of the two tasks. Afterwards, participants were asked to fill in the pre-experiment questionnaire (demographic and personality information). Before moving to the experiment room, participants were given 2 minutes to respond to questions from the LogicQ task, in order to get familiar with the types of questions presented in the quiz.

Participants then moved to the experimental room. They sat at the desk, facing a tablet and "Alex". "Alex" was either the rubber duck or a Furhat robot, depending on the assigned condition. Figure 1 illustrates the set-up. Audio features and voice activity were collected through a lapel microphone. Before starting the first task, participants were introduced to "Alex" and were asked to introduce themselves to it. This was prompted by the robot (or by the researcher in the RDuck condition) and aimed to a) eliminate the "cold start effect" of talking aloud [14] and, for the NaïveL and DataL conditions, b) eliminate the novelty effect [28] when first speaking to the robot. Participants were given 7 minutes to complete each task (for the NASA task, they were free to submit their answer before that time). Between the two tasks, they were asked to fill out a questionnaire about the task they had just solved (collection of user experience measures). Task order was counterbalanced between participants, within each condition. After the second task, participants filled out the second task-based questionnaire and a final questionnaire (collection of perception of the robot and impact of thinking aloud measures). After a final debriefing, participants received a voucher valued at around 10 USD.

a) System implementation: As our social robot, we used a Furhat robot (https://furhatrobotics.com/) with the William voice from CereProc (https://cereproc.com/), including its set of backchannels. For the nod movement, the "amplitude" - the range of the up and down movement - was randomly sampled from a uniform distribution. During the nod, the robot paused at the lowest point for 0.5 s. Furhat was controlled through a computer using an NVIDIA GeForce RTX 2080 SUPER and an Intel® Core™ i9.

b) Participants: Participants were recruited through posters, flyers, social media platforms and word of mouth. A total of 101 participants were recruited, with ages ranging from 14-76 years (M = 26.4, SD = 7.6). 53 participants identified as male and 48 as female, with a total of 29 different nationalities.

V. RESULTS

Data from eleven participants had to be excluded due to substantial software or hardware problems. The remaining 90 participants were distributed across conditions with the following demographics: RDuck - N = 30 (14F, 16M), ages 27.8 ± 10.2; NaiveL - N = 29 (9F, 20M), ages 26.3 ± 7.4; DataL - N = 31 (18F, 13M), ages 25.3 ± 5.3. Out of the 60 participants who interacted with the robot, 20 indicated they had interacted with a robot before. In this section, we analyze the listening behaviors developed and report our findings for each measure presented in Section IV.
TABLE II: User Engagement metrics, for both tasks (M ± SD).

Task     Condition   Number        Duration      Speech-to-silence
NASA     RDuck       0.17 ± 0.04   3.50 ± 1.34   1.73 ± 1.12
         NaiveL      0.16 ± 0.04   3.70 ± 1.46   1.73 ± 1.19
         DataL       0.16 ± 0.04   4.37 ± 2.66   2.55 ± 2.44
LogicQ   RDuck       0.16 ± 0.05   2.90 ± 0.87   1.21 ± 0.78
         NaiveL      0.16 ± 0.04   3.28 ± 1.32   1.27 ± 0.85
         DataL       0.17 ± 0.03   3.31 ± 1.10   1.52 ± 1.11

TABLE III: Task Performance metrics, for both tasks (M ± SD).

Task     Condition   QAns          Correct         Time Taken (s)
NASA     RDuck       -             -               360 ± 75
         NaiveL      -             -               349 ± 80
         DataL       -             -               320 ± 99
LogicQ   RDuck       8.41 ± 2.93   0.011 ± 0.005   -
         NaiveL      8.93 ± 3.50   0.010 ± 0.36    -
         DataL       8.58 ± 3.50   0.009 ± 0.005   -

TABLE IV: User Experience metrics, for both tasks (M ± SD).

Task     Condition   Cog. Load     UEI
NASA     RDuck       3.47 ± 0.87   3.42 ± 0.31
         NaiveL      3.53 ± 0.73   3.39 ± 0.31
         DataL       3.11 ± 0.67   3.40 ± 0.33
LogicQ   RDuck       4.18 ± 0.88   3.40 ± 0.29
         NaiveL      4.08 ± 0.81   3.26 ± 0.30
         DataL       4.11 ± 0.84   3.27 ± 0.34

A. User Engagement in Thinking Aloud

We evaluated how much participants engaged in thinking-aloud utterances in two separate tasks with an ANCOVA to examine the effects of condition on the Number and Duration of utterances, as well as Speech-to-silence ratio, after controlling for Extraversion and Impact of Thinking Aloud. We also control for the Order of presentation of tasks, due to the "cold start effect" mentioned in Section IV [14]. Table II summarizes our findings.

1) NASA task: The effect of the condition on the Number of utterances was not significant (F(2, 83) = 0.33, p = 0.71). The covariate Order was a significant predictor for the Duration of utterances and Speech-to-silence ratio (F(1, 83) = 4.16, p = 0.04 and F(1, 83) = 5.84, p = 0.02, resp.). Doing the NASA task first decreased mean duration of utterances (3.30 ± 1.31 s vs. 3.72 ± 1.85 s if doing the LogicQ task first) and speech-to-silence ratio (1.46 ± 1.01 vs. 1.88 ± 1.74).

2) LogicQ task: The ANCOVA test showed no significant effects of the condition on the engagement of the user with the task (Number of utterances, F(2, 83) = 0.21, p = 0.81, Duration of utterances, F(2, 83) = 1.15, p = 0.32, Speech-to-silence ratio, F(2, 83) = 1.18, p = 0.31).

B. Task-related metrics

Table III shows task-related measures across participants within each condition. For the NASA task, we consider Time taken until submission of the final object ranking (in seconds, with a maximum of 420 s). For the LogicQ task, we looked at the Number of Questions Answered (QAns) and Number of Correct Answers as a way to measure how the different conditions affected how the user completes the LogicQ task. We controlled for the Impact of Thinking Aloud in both tasks.

1) NASA task: An ANCOVA revealed no significant effect of condition on Time Taken solving the task (F(2, 84) = 1.67, p = 0.19).

2) LogicQ task: A one-way ANOVA revealed no significant effect of condition on Correct answers (F(2, 84) = 0.21, p = 0.81), or on Number of questions answered (F(2, 84) = 0.14, p = 0.86).

C. User Experience metrics

Participants' self-reported experience solving the tasks was evaluated through measures of Cognitive load and User Experience Index (UEI). We tested the effect of condition on these metrics with an ANCOVA, controlling for the Order of tasks and Impact of Thinking Aloud. See Table IV for more detail.

1) NASA task: The effect of the condition on self-reported Cognitive load is not significant, F(2, 84) = 2.43, p = 0.09. The covariate Order of task presentation significantly affected (decreased) UEI, F(1, 84) = 8.11, p = 0.005 (NASA first, UEI = 3.32 ± 0.31; LogicQ first, UEI = 3.39 ± 0.32).

2) LogicQ task: A one-way ANCOVA showed no significant effects of the condition on User Experience metrics (Cognitive Load, F(2, 86) = 0.05, p = 0.95, UEI, F(2, 86) = 2.05, p = 0.14).

D. Robot Behavior

In addition to the main measures, we also looked into how the robot acted and how it was perceived in the NaïveL and DataL conditions. This helps to illustrate the two listener behavior models implemented, but can also correlate with user behavior during the TAPS sessions under the different conditions. Backchannel frequency per minute (see Fig. 3) is given by the count of backchannels executed by the robot every 60 seconds. It is calculated as a sliding window with hop length 15 s. The NaïveL robot displayed similar values for vocal and non-vocal backchanneling, but the frequency decreased as task time increased. The DataL robot, on the other hand, favors non-vocal backchannels (nodding), which it displayed at a much higher frequency, as well as a combination of both nodding and vocal utterances.

1) User Reporting of Robot Behavior: Fig. 4 shows participants' evaluation of the robot's listening behavior. A one-way ANOVA showed no significant differences between conditions for both the Closeness and Listening behavior dimensions (F(1, 58) = 2.07, p = 0.16 and F(1, 58) = 0.56, p = 0.46, respectively). The robot's social attributes [6] were also not significantly different between the NaiveL and DataL conditions (one-way ANOVA for the Competence dimension, F(1, 58) = 0.46, p = 0.50, and Wilcoxon rank sum tests for the Warmth, W = 543.5, p = 0.10, and Discomfort, W = 410, p = 0.55, dimensions, after a Shapiro–Wilk test of normality revealed that these variables did not have normal distributions, p < 0.05).
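As a concrete illustration of the analyses reported in this section (a sketch under assumed column names, not the authors' analysis scripts), an ANCOVA on an engagement metric with condition as factor and the covariates described above can be run with statsmodels:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def engagement_ancova(df: pd.DataFrame):
    """df: one row per participant with hypothetical columns
    'duration', 'condition', 'extraversion', 'impact_ta', 'order'."""
    model = ols("duration ~ C(condition) + extraversion + impact_ta + C(order)",
                data=df).fit()
    return sm.stats.anova_lm(model, typ=2)   # Type II ANCOVA table (F, p per term)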
Fig. 3: Backchanneling behavior (BCs per minute) over time from the DL-based model (DataL, in red) or the heuristic model (NaïveL, in blue). The different types of BCs emitted (vocal BC, non-vocal BC, both) are discriminated. Data shown is averaged over all participants in that condition. Top - LogicQ task; bottom - NASA task.

Fig. 4: Participants' rating of the robot's listening behavior and closeness to the robot (from [36]), for the NaïveL and DataL conditions.

VI. DISCUSSION

In order to capture the impact of a "robot duck," we analyzed various aspects of a thinking-aloud problem-solving session. Overall, we cannot say that the presence of the robot impacted the "rubber duck" TAPS sessions, nor that the two listening behavior models caused significant differences in task performance or user experience. Our results reject H1a,b - the presence of the robot, in either condition (NaïveL or DataL), did not appear to affect user engagement in thinking aloud. This result is in contrast to the findings in Ramachandran et al. [45], where children thinking aloud to a robot spoke more than those in a no-robot condition. We also reject H2a,b - task performance was not affected, and the time spent to complete the tasks did not significantly change regardless of the condition. Finally, RQ1 was aimed at the impact of the presence of the robot on subjective user experience. Jung et al. [24] found that users in human-robot teams with robots that backchannel show lower cognitive load when solving tasks. However, we found no significant effects of condition on how users self-reported the sessions. Below, we advance potential explanations for the findings in this study.

The data-driven model appeared to greatly favor non-verbal feedback, with rare instances of vocal-only backchannels. However, notably, the users rank both behaviors similarly in the dimensions of Listening behavior, Closeness and social attributes (Competence, Warmth and Discomfort). These findings differ from those reported in Murray et al. [36], which found that a data-driven listener behavior was perceived as a better listener than a rule-based model. This might indicate that the user becomes fully involved with the task, paying little attention to the robot. This phenomenon has been observed in a setting where children playing a game stop listening to the robot's suggestions after some time [1].

High user involvement in task completion could also explain why we do not find significant differences between the RDuck condition and the two robot conditions. Relevant prior work on human dishonesty in the presence of a robot [42] found that there were no differences in how users cheated between being alone or in the presence of an unaware robot. We hypothesize that there was a mismatch between the expectations of the users and robot behavior, leading users to believe that the robot did not truly understand what they were saying. Furhat's anthropomorphic appearance elicits the development of expectations about the system's performance [37], causing cognitive anthropomorphization [49]. Previous studies also showed different reactions to robot failure depending on the robot's appearance [29] and functionality framing [56]. Inadequate backchannel timing thus may have had a detrimental effect on the expectations of the user, leading them to believe Furhat was unaware of the meaning of their utterances - and diluting the differences between the three conditions tested. This is consistent with comments left by the users in the post-experiment questionnaire, e.g., "(...) the robot is responsive however, can't replace a human interaction as I didn't get the emotional feedback." (P72).

The order of presentation of tasks is a significant predictor of measures such as Speech-to-silence ratio and self-reported User Experience. As mentioned in Section IV, an initial inhibition is to be expected when thinking aloud [14]. Further, some participants reported that the two tasks differed in how organic it felt to say their thoughts aloud, with many feeling that it was harder to do so during the LogicQ quiz. Deductive logic skills also vary among humans, and this may explain why
the Impact of thinking aloud is a predictor of involvement measures for the LogicQ task and not the NASA task.

On a final note, we observe that much work in the development of robot listener behavior requires the collection of specific conversational data in chosen human contexts that are then replicated in human-robot interactions [33, 19, 36]. Given the wide range of social settings and contextual characteristics within which HRI can be applied, one might wonder: is it sustainable to continue leveraging social HRI research by collecting task-specific datasets? In this work, we developed two fully-autonomous listening behavior models that were informed by human-human conversational data collected in a contrasting social setting to that in which they were deployed. In spite of this, both behaviors led to successful interactions of the user with the robot, with no significant detriment in any of the aspects of the sessions when compared to an inactive listener. This opens the door to continue exploring transferability of datasets to new social settings.

A. Limitations and Future Work

This study carries some limitations. Functionality misframing, or mismatch of user expectations towards the robot, could be reduced by providing participants with more interaction time with the robot before task completion, in order to adjust expectations. Since participants spent most of the time looking at the tablet, some reactions from the robot may have been missed, which could have impacted the perception of robot behavior. Further, some participants described the robot as "intimidating", which could have impacted their comfort in thinking aloud.

Future work may help elucidate some questions raised by the results reported here. Age and cultural background may impact user behavior, and these aspects were not taken into account. There is also potential to explore if different-looking social robots affect the sessions, or the impact of a human listener. It would be interesting to further investigate if the outcomes of the TAPS session change with a different set of tasks (e.g. programming or debugging tasks).

Regarding listener behavior implementation, we did not explore multimodal data input (such as facial expressions, gaze tracking or head pose). Future work may explore these aspects, in line with what was developed in [19, 36]. The robot could also display more complex listening behavior, such as asking general questions to the user (e.g. Weizenbaum [57]'s ELIZA). Finally, and while we explored transferability of datasets when implementing robot listener behavior, we note that behavior expectations may differ in a human-human versus a human-robot interaction [34, 12]. This is yet another aspect that calls for further exploration.

VII. CONCLUSION

This study explored human-robot interaction in a new setting and broadened knowledge on the role and usefulness (or lack thereof) of social robots as adjacent tools for human task completion. Further, the development and/or implementation of two fully-autonomous robot listener behaviors provided an interesting contribution on the transferability of heuristics and conversational datasets into new interactive contexts. Our findings indicate that the presence of a social robot displaying elementary listener behavior is not sufficient to elicit substantial differences in human behavior and perception of a "rubber duck" think-aloud problem-solving session, when compared to an inanimate object. Additional studies are required to inform the design of socially-assistive robots in problem-solving - as the future brings us closer to social robots, an optimized "robot duck" could play an important role in the problem-solving tasks which populate our days.

REFERENCES

[1] Safinah Ali, Nisha Elizabeth Devasia, and Cynthia Breazeal. Escape!Bot: Social robots as creative problem-solving partners. In Creativity and Cognition, C&C '22, pages 275–283, New York, NY, USA, 2022. Association for Computing Machinery. doi:10.1145/3527927.3532793.
[2] Michael Argyle and Mark Cook. Gaze and Mutual Gaze. Cambridge University Press, 1976.
[3] Andrew J. Aubrey, David Marshall, Paul L. Rosin, Jason Vandeventer, Douglas W. Cunningham, and Christian Wallraven. Cardiff Conversation Database (CCDb): A database of natural dyadic conversations. In 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 277–282, 2013. doi:10.1109/CVPRW.2013.48.
[4] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359, December 2008.
[5] Joanna Butchart, Reema Harrison, Jan Ritchie, Felip Martí, Chris McCarthy, Sarah Knight, and Adam Scheinberg. Child and parent perceptions of acceptability and therapeutic value of a socially assistive robot used during pediatric rehabilitation. Disability and Rehabilitation, 43(2):163–170, 2021.
[6] Colleen M. Carpinella, Alisa B. Wyman, Michael A. Perez, and Steven J. Stroessner. The Robotic Social Attributes Scale (RoSAS). In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, HRI '17, pages 254–262, 2017. doi:10.1145/2909824.3020208.
[7] Jonathan Casas, Emmanuel Senft, Luisa F Gutierrez, Monica Rincon-Rocancio, Marcela Munera, Tony Belpaeme, and Carlos A Cifuentes. Social assistive robots: assessing the impact of a training assistant robot in cardiac rehabilitation. International Journal of Social Robotics, 13(6):1189–1203, 2021.
[8] Michelene T.H. Chi, Nicholas De Leeuw, Mei-Hung Chiu, and Christian Lavancher. Eliciting self-explanations improves understanding. Cognitive Science, 18(3):439–477, 1994. doi:10.1016/0364-0213(94)90016-7.
[9] I.A. de Kok and Dirk K.J. Heylen. A survey on evaluation metrics for backchannel prediction models. In Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog, pages 15–18. University of Texas, September 2012.
[10] Iwan de Kok, Dirk Heylen, and Louis-Philippe Morency. Speaker-adaptive multimodal prediction model for listener responses. In Proceedings of the 15th ACM International Conference on Multimodal Interaction, ICMI '13, pages 51–58, New York, NY, USA, 2013. Association for Computing Machinery. doi:10.1145/2522848.2522866.
[11] Laurie A Dickstein-Fischer, Darlene E Crone-Todd, Ian M Chapman, Ayesha T Fathima, and Gregory S Fischer. Socially assistive robots: current status and future prospects for autism interventions. Innovation and Entrepreneurship in Health, 5:15–25, 2018.
[12] Andrew Gambino, Jesse Fox, and Rabindra A Ratan. Building a stronger CASA: Extending the computers are social actors paradigm. Human-Machine Communication, 1:71–85, 2020.
[13] Maia Garau, Mel Slater, Simon Bee, and Martina Angela Sasse. The impact of eye gaze on communication using humanoid avatars. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '01, pages 309–316, New York, NY, USA, 2001. Association for Computing Machinery. doi:10.1145/365024.365121.
[14] Bob Gibson. Taking the test: Using verbal report data in looking at the processing of cloze tasks. Edinburgh Working Papers in Applied Linguistics, 8:54–62, 1997.
[15] Jonathan Gratch, Anna Okhmatovskaia, Francois Lamothe, Stacy Marsella, Mathieu Morales, R. J. van der Werf, and Louis-Philippe Morency. Virtual Rapport. Springer Berlin Heidelberg, 2006. doi:10.1007/11821830_2.
[16] Victor H. Yngve. On getting a word in edgewise. Pages 567–577. Chicago Linguistic Society, 1970.
[17] Jay Hall and W. H. Watson. The effects of a normative intervention on group decision-making performance. Human Relations, 23(4):299–317, 1970. doi:10.1177/001872677002300404.
[18] Sandra G. Hart and Lowell E. Staveland. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Peter A. Hancock and Najmedin Meshkati, editors, Human Mental Workload, volume 52 of Advances in Psychology, pages 139–183. North-Holland, 1988. doi:10.1016/S0166-4115(08)62386-9.
[19] Hung-Hsuan Huang, Masato Fukuda, and Toyoaki Nishida. Toward RNN based micro non-verbal behavior generation for virtual listener agents. Pages 53–63, 2019. doi:10.1007/978-3-030-21902-4_5.
[20] Andrew Hunt and David Thomas. The Pragmatic Programmer: From Journeyman to Master. Addison-Wesley Longman Publishing Co., Inc., USA, 2000.
[21] Nusrah Hussain, Engin Erzin, T. Metin Sezgin, and Yucel Yemez. Batch recurrent Q-learning for backchannel generation towards engaging agents, 2019. URL https://arxiv.org/abs/1908.02037.
[22] Vidit Jain, Maitree Leekha, Rajiv Ratn Shah, and Jainendra Shukla. Exploring semi-supervised learning for predicting listener backchannels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2021.
[23] Oliver P. John, Eileen M. Donahue, and Robert L. Kentle. The Big Five Inventory—versions 4a and 54, 1991.
[24] Malte F. Jung, Jin Joo Lee, Nick DePalma, Sigurdur O. Adalgeirsson, Pamela J. Hinds, and Cynthia Breazeal. Engaging robots: Easing complex human-robot teamwork using backchanneling. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, CSCW '13, pages 1555–1566, New York, NY, USA, 2013. Association for Computing Machinery. doi:10.1145/2441776.2441954.
[25] Katarzyna Kabacińska, Tony J Prescott, and Julie M Robillard. Socially assistive robots as mental health interventions for children: a scoping review. International Journal of Social Robotics, 13(5):919–935, 2021.
[26] Tatsuya Kawahara, Takashi Yamaguchi, Koji Inoue, Katsuya Takanashi, and Nigel Ward. Prediction and generation of backchannel form for attentive listening systems. Pages 2890–2894, 2016. doi:10.21437/Interspeech.2016-118.
[27] Adam Kendon. Some functions of gaze-direction in social interaction. Acta Psychologica, 26:22–63, 1967. doi:10.1016/0001-6918(67)90005-4.
[28] J. Kennedy, Paul Baxter, Emmanuel Senft, Tony Belpaeme, and S. Lemaignan. From characterising three years of HRI to methodology and reporting recommendations. Volume 2016-April, pages 391–398, 2022. doi:10.1109/HRI.2016.7451777. URL https://uwe-repository.worktribe.com/output/913325.
[29] Dimosthenis Kontogiorgos, Andre Pereira, Boran Sahindal, Sanne van Waveren, and Joakim Gustafson. Behavioural responses to robot conversational failures. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, HRI '20, pages 53–62, New York, NY, USA, 2020. Association for Computing Machinery. doi:10.1145/3319502.3374782.
[30] Kelly Ku and Irene Ho. Metacognitive strategies that enhance critical thinking. Metacognition and Learning, 5:251–267, 2014. doi:10.1007/s11409-010-9060-6.
[31] Jacqueline P. Leighton. Two types of think aloud interviews for educational measurement: Protocol and verbal analysis. National Council on Measurement in Education, 2009.
[32] Ester Martinez-Martin, Felix Escalona, and Miguel Cazorla. Socially assistive robots for older adults and people with autism: An overview. Electronics, 9(2):367, 2020.
[33] Louis-Philippe Morency, Iwan Kok, and Jonathan Gratch. A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems, 20:70–84, 2010. doi:10.1007/s10458-009-9092-y.
[34] Lilia Moshkina, Susan Trickett, and J Gregory Trafton. Social engagement in public places: a tale of one robot. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, pages 382–389, 2014.
[35] Markus Mueller, David Leuschner, Lars Briem, Maria Schmidt, Kevin Kilgour, Sebastian Stueker, and Alex Waibel. Using neural networks for data-driven backchannel prediction: A survey on input features and training techniques. In Masaaki Kurosu, editor, Human-Computer Interaction: Interaction Technologies, pages 329–340, Cham, 2015. Springer International Publishing.
[36] Michael Murray, Nick Walker, Amal Nanavati, Patricia Alves-Oliveira, Nikita Filippov, Allison Sauppe, Bilge Mutlu, and Maya Cakmak. Learning backchanneling behaviors for a social robot via data augmentation from human-human conversations. In Aleksandra Faust, David Hsu, and Gerhard Neumann, editors, Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pages 513–525. PMLR, 2022. URL https://proceedings.mlr.press/v164/murray22a.html.
[37] Clifford Nass and Youngme Moon. Machines and mindlessness: Social responses to computers. Journal of Social Issues, 56(1):81–103, 2000. doi:10.1111/0022-4537.00153.
[38] Heather O'Brien, Paul Cairns, and Mark Hall. A practical approach to measuring user engagement with the refined user engagement scale (UES) and new UES short form. International Journal of Human-Computer Studies, 112, 2018. doi:10.1016/j.ijhcs.2018.01.004.
[39] Catharine Oertel, Fred Cummins, Jens Edlund, Petra Wagner, and Nick Campbell. D64: A corpus of richly recorded conversational interaction. Journal on Multimodal User Interfaces, 7(1):19–28, 2013.
[40] Catharine Oertel, Patrik Jonell, Dimosthenis Kontogiorgos, Kenneth Funes Mora, Jean-Marc Odobez, and Joakim Gustafson. Towards an engagement-aware attentive artificial listener for multi-party interactions. Frontiers in Robotics and AI, 8:189, 2021. doi:10.3389/frobt.2021.555913.
[41] Y. Okato, K. Kato, M. Kamamoto, and S. Itahashi. Insertion of interjectory response based on prosodic information. In Proceedings of IVTTA '96. Workshop on Interactive Voice Technology for Telecommunications Applications, pages 85–88, 1996. doi:10.1109/IVTTA.1996.552766.
[42] Sofia Petisca, Iolanda Leite, Ana Paiva, and Francisco Esteves. Human dishonesty in the presence of a robot: The effects of situation awareness. International Journal of Social Robotics, 14:1–12, 2022. doi:10.1007/s12369-022-00864-3.
[43] Ronald Poppe, Khiet P. Truong, Dennis Reidsma, and Dirk Heylen. Backchannel strategies for artificial listeners. In Proceedings of the 10th International Conference on Intelligent Virtual Agents, IVA'10, pages 146–158, Berlin, Heidelberg, 2010. Springer-Verlag.
[44] Sarah M Rabbitt, Alan E Kazdin, and Brian Scassellati. Integrating socially assistive robotics into mental healthcare interventions: Applications and recommendations for expanded use. Clinical Psychology Review, 35:35–46, 2015.
[45] Aditi Ramachandran, Chien-Ming Huang, Edward Gartland, and Brian Scassellati. Thinking aloud with a tutoring robot to enhance learning. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, HRI '18, pages 59–68, New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3171221.3171250.
[46] Beatrice Rammstedt and Oliver P. John. Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality, 41(1):203–212, 2007. doi:10.1016/j.jrp.2006.02.001.
[47] Ognjen Rudovic, Meiru Zhang, Bjorn Schuller, and Rosalind Picard. Multi-modal active learning from human data: A deep reinforcement learning approach. In 2019 International Conference on Multimodal Interaction, ICMI '19, pages 6–15, New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3340555.3353742.
[48] Robin Ruede, Markus Müller, Sebastian Stüker, and Alex Waibel. Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor. 8th International Workshop on Spoken Dialog Systems, pages 247–258, 2019. doi:10.1007/978-3-319-92108-2_25.
[49] Alessandra Sacino, Francesca Cocchella, Giulia De Vita, Fabrizio Bracco, Francesco Rea, Alessandra Sciutti, and Luca Andrighetto. Human- or object-like? Cognitive anthropomorphism of humanoid robots. PLOS ONE, 17(7):1–19, 2022. doi:10.1371/journal.pone.0270787.
[50] Brian Scassellati, Henny Admoni, and Maja Matarić. Robots for use in autism research. Annual Review of Biomedical Engineering, 14:275–294, 2012.
[51] Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. A survey of available corpora for building data-driven dialogue systems, 2015. URL https://arxiv.org/abs/1512.05742.
[52] Maarten Someren, Yvonne Barnard, and Jacobijn Sandberg. The Think Aloud Method - A Practical Guide to Modelling Cognitive Processes. 1994.
[53] Leif Stinessen. The influence of verbalization on problem-solving. Scandinavian Journal of Psychology, 26(1):342–347, 1985. doi:10.1111/j.1467-9450.1985.tb01173.x.
[54] Khiet Phuong Truong, Ronald Walter Poppe, and Dirk K.J. Heylen. A rule-based backchannel prediction model using pitch and pause information. In Proceedings of Interspeech 2010, pages 3058–3061. International Speech Communication Association (ISCA), 2010.
[55] Nigel Ward and Wataru Tsukahara. Prosodic features which cue back-channel responses in English and Japanese. Journal of Pragmatics, 32(8):1177–1207, 2000. doi:10.1016/S0378-2166(99)00109-5.
[56] Auriel Washburn, Akanimoh Adeleye, Thomas An, and Laurel D. Riek. Robot errors in proximate HRI: How functionality framing affects perceived reliability and trust. Journal of Human-Robot Interaction, 9(3), 2020. doi:10.1145/3380783.
[57] Joseph Weizenbaum. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45, 1966. doi:10.1145/365153.365168.
[58] Frances M. Wijnen, Daniel P. Davison, Dennis Reidsma, Jan Van Der Meij, Vicky Charisi, and Vanessa Evers. Now we're talking: Learning by explaining your reasoning to a social robot. Journal of Human-Robot Interaction, 9(1), 2019. doi:10.1145/3345508.
[Supplementary Material]
Robot Duck Debugging: Can Attentive Listening Improve Problem Solving?
In order to assess how the user perceived the impact of thinking-aloud, we formulated 7 questions with a 5-point Likert
scale (strongly disagree to strongly agree). The questions were asked in the post-experiment questionnaire to participants in
all conditions (RDuck, Naı̈veL, DataL). The questions are as follows:
1) Thinking aloud has helped me complete the tasks.
2) Thinking aloud has not impacted my performance in these tasks.
3) Thinking aloud has improved my performance in these tasks.
4) Thinking aloud has improved my ability to think critically about my decisions during these tasks.
5) Thinking aloud has not influenced my thought process while solving these tasks.
6) Thinking aloud has made me more confident about my decisions during these tasks.
7) Thinking aloud has distracted me from the tasks.
Questions Q2 and Q5 were removed from the final Impact of Thinking Aloud index, as their valence is ambiguous. The
final index is given by (Q1 + Q3 + Q4 + Q6 + [6 − Q7]) / 5.
A Cronbach’s Alpha of α = 0.84 indicated good reliability.
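A short sketch of the index computation and the reliability check (assuming `items` is an array of the seven 1-5 Likert responses per participant; the Cronbach's alpha implementation is ours):

import numpy as np

def impact_of_thinking_aloud(items):
    """items: (n_participants, 7) array of Likert responses Q1-Q7."""
    q = items.astype(float)
    q7_rev = 6 - q[:, 6]                                # reverse-score Q7
    kept = np.column_stack([q[:, 0], q[:, 2], q[:, 3], q[:, 5], q7_rev])
    index = kept.mean(axis=1)                           # (Q1 + Q3 + Q4 + Q6 + (6 - Q7)) / 5
    k = kept.shape[1]                                   # Cronbach's alpha over kept items
    alpha = k / (k - 1) * (1 - kept.var(axis=0, ddof=1).sum()
                           / kept.sum(axis=1).var(ddof=1))
    return index, alpha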
