a dropout layer and a final dense layer. Each model was trained on either the original or the augmented dataset, considering varying numbers of previous timesteps, i.e. lookback sizes (5, 10 and 15, corresponding to 2.5, 5 or 7.5 seconds of data), activation functions (sigmoid, ReLU and softmax), batch sizes (8, 16, 32), dropout rates (0, 0.2 and 0.4), L2 regularization weights (0.1, 0.01, 0.001, 0.0001), loss functions (focal loss, MSE, binary cross-entropy or hinge loss) and optimizers (SGD and Adam).
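For concreteness, this search space can be written down as a plain grid and enumerated exhaustively. The sketch below is illustrative only: the variable names are ours, and the 2 Hz frame rate is inferred from 5 timesteps corresponding to 2.5 s of data.

```python
from itertools import product

# Hypothetical encoding of the hyperparameter search space described above.
search_space = {
    "lookback":   [5, 10, 15],  # timesteps (2.5, 5 or 7.5 s at an inferred 2 Hz)
    "activation": ["sigmoid", "relu", "softmax"],
    "batch_size": [8, 16, 32],
    "dropout":    [0.0, 0.2, 0.4],
    "l2":         [0.1, 0.01, 0.001, 0.0001],
    "loss":       ["focal", "mse", "binary_crossentropy", "hinge"],
    "optimizer":  ["sgd", "adam"],
    "dataset":    ["original", "augmented"],
}

# Enumerate every configuration in the grid.
configs = [dict(zip(search_space, values))
           for values in product(*search_space.values())]
print(len(configs))  # 3*3*3*3 * 4*4 * 2*2 = 5184 candidate configurations
```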
Evaluation Metrics: Performance on both modules was evaluated based on macro accuracy (number of correct predictions over all predictions), precision (probability that emitted BCs match the ground-truth data), recall (probability that a ground-truth BC is predicted by the model) and F1-score (harmonic mean of precision and recall).
Additionally, for the Timing module and in line with similar work [9, 48], we calculated the above metrics considering a tolerance margin of [−500, 500] ms; finally, we controlled for robot overreaction with a Backchannel frequency deviation metric. For each participant i, we considered the relative difference between the predicted number of backchannels and its original value:

$\Delta BC^{i}_{dev} = \left| Y^{i}_{true} - Y^{i}_{pred} \right| / Y^{i}_{true}$   (1)
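The sketch below shows how Eq. (1) and the ±500 ms tolerance matching could be computed. The text does not specify the matching procedure, so greedy one-to-one matching of predicted onsets to ground-truth onsets is our assumption, and all identifiers are illustrative.

```python
def bc_freq_deviation(y_true_count: int, y_pred_count: int) -> float:
    """Relative backchannel-frequency deviation for one participant, Eq. (1)."""
    return abs(y_true_count - y_pred_count) / y_true_count

def hits_within_tolerance(true_ms, pred_ms, tol_ms=500):
    """Count predicted BC onsets within +/- tol_ms of a ground-truth onset.

    true_ms, pred_ms: onset times in milliseconds. Assumes greedy one-to-one
    matching: each ground-truth onset is consumed by at most one prediction.
    """
    remaining = sorted(true_ms)
    hits = 0
    for p in sorted(pred_ms):
        for i, t in enumerate(remaining):
            if abs(p - t) <= tol_ms:
                hits += 1
                del remaining[i]  # consume the matched ground-truth onset
                break
    return hits
```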
Models' Selection and Performance: For training, data was split into 8 folds, with non-overlapping participant data in a 6:1:1 train-validation-test split. Each model was trained for a maximum of 100 epochs, with early stopping on both training and validation loss deployed to prevent overfitting; the dropout layer and L2 regularization served the same purpose.

Each module (Timing and Type) was trained with an analogous methodology. We fine-tuned hyperparameters based on macro accuracy and F1-score. The final candidates for each model type (LSTM or GRU, with augmented or non-augmented training data) were then trained using k-fold cross-validation.
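One possible realization of such a participant-disjoint 6:1:1 split uses scikit-learn's GroupKFold; this is a sketch under the assumption that each input window carries a participant id (the exact tooling is not specified above).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def participant_splits(X, y, participant_ids, n_folds=8):
    """Yield 6:1:1 train/validation/test splits over participant-disjoint folds.

    GroupKFold keeps all windows of a participant inside a single fold; we
    then rotate one fold for validation and the next for testing, training
    on the remaining six.
    """
    gkf = GroupKFold(n_splits=n_folds)
    folds = [test_idx for _, test_idx in gkf.split(X, y, groups=participant_ids)]
    for k in range(n_folds):
        val_idx, test_idx = folds[k], folds[(k + 1) % n_folds]
        train_idx = np.concatenate(
            [f for i, f in enumerate(folds) if i not in (k, (k + 1) % n_folds)])
        yield train_idx, val_idx, test_idx
```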
For each module, we deployed the model that appeared most frequently among the top three best-performing models for each of the metrics mentioned above. Both deployed models are single-layer GRUs trained on the augmented dataset (see performance in Table I, along with the performance of the NaiveL heuristic on the same dataset). The Timing model outperforms other similar model architectures for BC prediction [48, 36]. Interestingly, the two modules have different lookback periods, with the Type model input considering a higher number of past internal states.

C. Gaze Behavior

Gaze behavior was kept constant across the two robot listening conditions, as direct eye contact was predicted to occur infrequently during the problem-solving sessions. We implemented the robot's gaze behavior based on the work by Garau et al. [13], who found that adapting an avatar's gaze and head pose to the turn-taking pattern of a conversation outperformed a random gaze agent. The agent alternates between two actions: looking at the user, or looking away from them (gaze aversion). The duration of each action is informed by previous work on dyadic interactions [2, 27] and varies depending on whether the agent is listening or speaking. For this work, we only used the listening timing indications.
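As a concrete illustration, this alternation can be driven by a simple timed loop. In the sketch below, the duration ranges are placeholders rather than the values from [2, 27], and robot.look_at_user()/robot.look_away() are hypothetical API calls, not Furhat's actual interface.

```python
import random
import time

LOOK_AT_RANGE = (3.0, 6.0)  # seconds looking at the user (placeholder values)
AVERT_RANGE = (1.0, 3.0)    # seconds of gaze aversion (placeholder values)

def listening_gaze_loop(robot, stop_event):
    """Alternate between mutual gaze and gaze aversion while the agent listens."""
    looking_at_user = True
    while not stop_event.is_set():
        if looking_at_user:
            robot.look_at_user()  # hypothetical call
            duration = random.uniform(*LOOK_AT_RANGE)
        else:
            robot.look_away()     # hypothetical call
            duration = random.uniform(*AVERT_RANGE)
        time.sleep(duration)
        looking_at_user = not looking_at_user
```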
IV. STUDY DESIGN

We designed a between-subject study with three conditions in order to evaluate the effects of different previously proposed robot listener behaviors on the outcome of a TAPS activity. Two conditions concerned robot listener behaviors: the first, a rule-based heuristic in which non-verbal utterances and head nods are crafted according to previous literature (as described in Section III-A); the second, based on deep learning models that learn listener behaviors from data of human-human dyadic interactions (as described in Section III-B). The two listening behaviors were chosen to gauge whether different complexity in the model used to generate the listener behavior affects the outcomes of TAPS activities. We compare these effects with a baseline condition: the presence of an inanimate object, a rubber duck, to which words are directed. The three conditions are detailed in Section IV-B.

A. Hypotheses

We were interested in evaluating how a robot can affect the think-aloud process in a "rubber duck debugging" session. Thus, we defined our outcomes based on how the user interacted with the task. We measure user engagement in thinking aloud and user perception of the problem-solving session (or user experience). Engagement with the think-aloud process measures how the user talks when thinking aloud, e.g. the number and duration of verbal utterances. User experience addresses how the participant perceives solving the task, including self-assessed cognitive load (or "mental effort") and enjoyment (or "user experience index"). These measures are detailed in Section IV-D.
Wijnen et al. [58] found that the presence of a robot leads to better explaining behavior in children while learning, and Ramachandran et al. [45] found that children engage more in thinking-aloud utterances when a robot is present. Consequently, we hypothesized:

• H1a: Engagement in thinking aloud is higher when thinking aloud to an actively listening robot than to an inanimate object.
Moreover, Morency et al. [33] report that a multimodal probabilistic model is better at predicting backchannels than hand-crafted rules. As such, we expected the data-driven condition to generate listener behavior that leads to a more organic interaction:
• H1b: Listener behaviors learned from data lead to higher degrees of engagement in thinking aloud, when compared to rule-based heuristics.

Following H1a and H1b, and because the verbalization of one's thoughts during problem-solving has been reported to affect how users solve a task [53, 8], we also hypothesize about user behavior during the task:

• H2a: User behavior during task completion will differ in the conditions where a robot is present. Users will take longer to solve tasks, but will also perform better.

• H2b: Following H1b, users in the DL-based listener robot condition will take longer to complete the tasks than users in the rule-based listener robot condition.

Finally, we wanted to evaluate the subjective experience of the user during the problem-solving sessions. The literature is sparse on whether thinking aloud affects how users perceive the task, and whether different listening behaviors skew that perception. As such, we took an exploratory approach and pose the following research questions:

• RQ1a: Is user experience affected by an active listener robot when compared to an inanimate object?

• RQ1b: Is user experience affected by the type of listening behavior the robot exhibits?
B. Conditions

We sought to evaluate how the presence and behavior of a social robot impacted the outcome of a "rubber duck" problem-solving session. We designed a between-subject study with three conditions:

Rubber Duck (RDuck): To reproduce a "rubber duck debugging" scenario, participants in the RDuck condition were instructed to think aloud, directing their words to a rubber duck.

Naïve Listener Robot (NaiveL): The social robot displayed listening behavior generated by hand-crafted heuristics. The detailed implementation is given in Section III-A.

Data-driven Listener Robot (DataL): The social robot displayed listening behavior learned through machine learning methods from a human-human conversational corpus. The implementation details are provided in Section III-B.

By including two different robot listening behaviors, we explored the transferability of two implementation methods of different complexity into a new social context, as well as the sensitivity of the user to differences between these behaviors.

C. Problem-solving Tasks

We designed two different tasks in order to evaluate distinct contexts for thinking-aloud problem-solving and how the presence of a robot can impact the perception of those tasks and their outcomes.

1) Logic Deduction Quiz (LogicQ): Tasks using deductive logic are common when developing thinking-aloud protocols [31]. The questionnaire used here was adapted from IQ Test Labs1. For each question, 5 possible answers were shown, as illustrated in Figure 2. Participants were able to skip questions but could not see whether their answers were correct. There was no time limit per question.

Fig. 2: Layout example for one of the questions from the deductive logic quiz, from IQ Test Labs.

2) NASA Moon Survival Task (NASA): In the NASA task [17], the participant is placed in a hypothetical life-threatening scenario (a spaceship crashed on the moon) and asked to rank a list of 15 available items according to their contribution towards survival.

The LogicQ task allowed for an objective evaluation of performance by counting correct answers to the quiz, whereas the NASA task was framed as an open-ended question involving strategic thinking. We analyzed the effects of each task separately, as they elicit different cognitive processes, and tasks similar to these give rise to different protocols in thinking-aloud studies [52].

1 https://www.intelligencetest.com/questions/logic/1.html

D. Measures

We established a set of metrics that explore the outcomes of the "rubber duck" problem-solving session. Engagement in thinking-aloud and Task-related metrics quantify the objective effects of the different conditions on how the user acted, whereas User experience attempts to quantify the subjective experience of the user while solving the tasks.
1) Engagement metrics: To assess H1, we measured the user's Number and Duration of verbal utterances, as well as the Speech-to-silence ratio (time spent talking divided by time spent silent during a task) while solving each task. This provided an objective measure of the user's engagement with the think-aloud process. The data was collected via Voice Activity Detection (VAD) on audio from lapel microphones.
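Given VAD output as a list of speech segments, the three engagement metrics reduce to a few lines. This is a sketch under our assumptions about the segment format; the names are illustrative.

```python
def engagement_metrics(vad_segments, task_duration_s):
    """Compute engagement metrics from VAD speech segments.

    vad_segments: list of (start_s, end_s) tuples of detected speech.
    Returns (utterance count, mean utterance duration, speech-to-silence ratio).
    """
    durations = [end - start for start, end in vad_segments]
    speech_time = sum(durations)
    silence_time = task_duration_s - speech_time
    count = len(vad_segments)
    mean_duration = speech_time / count if count else 0.0
    ratio = speech_time / silence_time if silence_time > 0 else float("inf")
    return count, mean_duration, ratio
```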
2) Task-related metrics: For the evaluation of H2, we defined metrics that relate to how the user behaved while solving the task. We measured performance in the LogicQ task by monitoring the number of Questions answered (QAns) and the number of Correct answers. For the NASA task, we considered the Time spent before submitting the final answer.
3) User experience metrics: Finally, to investigate RQ1, we measured the self-reported Cognitive load and User experience index (UEI) of the problem-solving session. This was assessed in a questionnaire taken after each task. Cognitive load was measured using the NASA-TLX scale [18]. The perception of the problem-solving session was evaluated with three dimensions from the short version of the User Engagement Scale (focused attention, perceived usability and reward) [38].
4) Other measures: We also collected data on participants' demographics (age, gender identity, country of origin), as well as the personality trait of Extraversion from the Big Five Inventory [23, 46], in the pre-experiment questionnaire. In the post-experiment questionnaire, we assessed the Impact of thinking aloud, asking users to rate how much they thought thinking aloud helped them solve the tasks. This measure was obtained through five questions on a 5-point Likert scale; a Cronbach's Alpha of α = 0.84 indicated good reliability. The exact questions are provided as supplemental material. We measured Extraversion and Impact of thinking aloud because personality and a natural inclination for speaking aloud can affect user behavior during the tasks. We also considered the Order of task presentation, since thinking aloud can suffer from a "cold start effect" [14] (as described in Section IV-E).
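The reliability check above is straightforward to reproduce; a minimal sketch, assuming an (n_respondents × 5) matrix of Likert scores:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / total-score variance).

    scores: array of shape (n_respondents, n_items), e.g. the five
    Impact-of-thinking-aloud Likert items.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)
```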
Finally, to explore whether differences in user behavior could be due to the perception of the robot's listening behaviors, participants in the two conditions where the robot was present were asked about the robot's Listening behavior and Closeness [36] and its Social attributes (from RoSAS [6]), as well as about their previous experience with robots.
E. Experimental procedure

Participants were randomly assigned to one of the three conditions. After giving informed consent, participants read the experiment instructions. In all conditions, they were instructed to speak aloud to "Alex", their "rubber duck", in a study that aims to understand the cognitive processes behind problem-solving tasks. The instructions also clarified that Alex did not have any insight into the solutions of either task. Afterwards, participants were asked to fill in the pre-experiment questionnaire (demographic and personality information). Before moving to the experiment room, participants were given 2 minutes to respond to questions from the LogicQ task, in order to become familiar with the types of questions presented in the quiz.

Participants then moved to the experiment room. They sat at a desk, facing a tablet and "Alex"; "Alex" was either the rubber duck or a Furhat robot, depending on the assigned condition. Figure 1 illustrates the set-up. Audio features and voice activity were collected through a lapel microphone. Before starting the first task, participants were introduced to "Alex" and asked to introduce themselves to it. This introduction was prompted by the robot (or by the researcher in the RDuck condition) and aimed to a) eliminate the "cold start effect" of talking aloud [14] and, for the NaiveL and DataL conditions, b) eliminate the novelty effect [28] of first speaking to the robot. Participants were given 7 minutes to complete each task (for the NASA task, they were free to submit their answer before that time). Between the two tasks, they were asked to fill out a questionnaire about the task they had just solved (collecting the user experience measures). Task order was counterbalanced between participants within each condition. After the second task, participants filled out the second task-based questionnaire and a final questionnaire (collecting the perception-of-the-robot and impact-of-thinking-aloud measures). After a final debriefing, participants received a voucher valued at around 10 USD.

a) System implementation: As our social robot, we used a Furhat robot2 with the William voice from CereProc3, including its set of backchannels. For the nod movement, the "amplitude" (the range of the up-and-down head movement) was randomly sampled from a uniform distribution. During the nod, the robot paused at the lowest point for 0.5 s. Furhat was controlled through a computer with an NVIDIA GeForce RTX 2080 SUPER GPU and an Intel Core i9 CPU.
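A sketch of this nod generation is shown below. Only the uniform sampling and the 0.5 s pause at the lowest point come from the description above; the amplitude bounds and the robot API calls are hypothetical placeholders, not Furhat's actual interface.

```python
import random

MIN_AMP_DEG, MAX_AMP_DEG = 5.0, 15.0  # hypothetical pitch range in degrees

def perform_nod(robot):
    """Execute one head nod with a randomly sampled amplitude."""
    amplitude = random.uniform(MIN_AMP_DEG, MAX_AMP_DEG)
    robot.pitch_head(-amplitude)  # tilt head down (hypothetical call)
    robot.wait(0.5)               # pause at the lowest point for 0.5 s
    robot.pitch_head(0.0)         # return to the neutral pose
```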
2 https://furhatrobotics.com/
3 https://cereproc.com/

b) Participants: Participants were recruited through posters, flyers, social media platforms and word of mouth. A total of 101 participants were recruited, with ages ranging from 14 to 76 years (M = 26.4, SD = 7.6). 53 participants identified as male and 48 as female, with a total of 29 different nationalities.

V. RESULTS

Data from eleven participants had to be excluded due to substantial software or hardware problems. The remaining 90 participants were distributed across conditions with the following demographics: RDuck: N = 30 (14F, 16M), ages 27.8 ± 10.2; NaiveL: N = 29 (9F, 20M), ages 26.3 ± 7.4; DataL: N = 31 (18F, 13M), ages 25.3 ± 5.3. Of the 60 participants who interacted with the robot, 20 indicated that they had interacted with a robot before. In this section, we analyze the listening behaviors developed and report our findings for each measure presented in Section IV.
TABLE II: User Engagement metrics, for both tasks (M ± SD).

Task    Condition  Number       Duration     Speech-to-silence
NASA    RDuck      0.17 ± 0.04  3.50 ± 1.34  1.73 ± 1.12
NASA    NaiveL     0.16 ± 0.04  3.70 ± 1.46  1.73 ± 1.19
NASA    DataL      0.16 ± 0.04  4.37 ± 2.66  2.55 ± 2.44
LogicQ  RDuck      0.16 ± 0.05  2.90 ± 0.87  1.21 ± 0.78
LogicQ  NaiveL     0.16 ± 0.04  3.28 ± 1.32  1.27 ± 0.85
LogicQ  DataL      0.17 ± 0.03  3.31 ± 1.10  1.52 ± 1.11

TABLE III: Task Performance metrics, for both tasks (M ± SD).

Task    Condition  QAns         Correct        Time Taken (s)
NASA    RDuck      -            -              360 ± 75
NASA    NaiveL     -            -              349 ± 80
NASA    DataL      -            -              320 ± 99
LogicQ  RDuck      8.41 ± 2.93  0.011 ± 0.005  -
LogicQ  NaiveL     8.93 ± 3.50  0.010 ± 0.36   -
LogicQ  DataL      8.58 ± 3.50  0.009 ± 0.005  -