
Computers in Industry 107 (2019) 33–49


Cognitive interaction with virtual assistants: From philosophical foundations to illustrative examples in aeronautics

Denys Bernard a,*, Alexandre Arnold b

a Architecture & Integration, EIVS, AIRBUS S.A.S, 31700 Blagnac, France
b Central R&T, AIRBUS S.A.S, 18 rue Marius Tercé, 31300 Toulouse, France

ARTICLE INFO

Article history:
Received 8 August 2018
Received in revised form 4 December 2018
Accepted 17 January 2019
Available online xxx

Keywords:
Cognitive interaction
Virtual assistant
Cooperation
Shared intention
Natural communication

ABSTRACT

Why do we perceive virtual assistants as something radically new? Our hypothesis is that today's virtual assistants raise an expectation of natural interaction with the human. Human interaction is by nature cognitive and collaborative. Human sciences help to flesh out the ingredients of such cognitive interaction. Uttering a sentence is at the same time: producing sound; making a well-formed sentence; giving meaning; enriching a common background of beliefs and intentions; making something together. In this paper, we recall the basics of human cognitive communication as developed by human sciences, particularly philosophy of mind. We propose a definition of this way of interacting with a computer as 'cognitive interaction', and we summarize the main characteristics of this interaction mode into a layered model. Finally, we develop case studies to illustrate the concepts concretely. We analyze, in light of our theoretical model, three approaches of conversational systems in AI, to illustrate the different available options to implement the pragmatic dimension of cognitive interaction. We analyze first the seminal approach of Grosz and Sidner [20], and then we describe how the now classical approach of discourse structure developed by Asher and Lascarides [5] could capture the pragmatic dimension of interaction with an intelligent virtual assistant. Finally, we wonder whether a state-of-the-art chatbot framework actually implements the needed level of cognitive interaction. The contribution of this paper is: to recall and summarize essential ideas from other disciplines which are relevant to understand what the interaction with virtual assistants should be; to explain why the cooperation with virtual assistants is something special; and to delineate the challenges we have to solve if we are to develop truly collaborative virtual assistants.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

What makes virtual assistants something special lies in human-machine interaction: virtual assistants are intended to interact with human users in a natural way, which is by essence cognitive, linguistic and collaborative. In a nutshell, assisting an operator collaboratively requires recognizing and representing some of the operator's cognitive attitudes related to the current task, such as intentions, assumptions and attention. To distinguish this kind of interaction from usual human-computer interaction patterns where human cognition has no explicit role, it is usually named cognitive interaction. Cognitive interaction is widely studied in the area of human-robot interaction. In the "Handbook of Robotics", Mutlu, Roy and Sabanovic [35] define cognitive human-robot interaction as a "research area that seeks to improve interaction between robots and their users by developing cognitive models for robots and understanding human mental models for robots". A virtual assistant as we consider it in our aeronautical environment differs from a robot in major aspects: it would not have its own "body", and would have limited sensing and action capabilities. Nevertheless, a virtual assistant for aircraft operations would also benefit from natural and cognitive interaction with the operator. The Handbook of Robotics identifies three sub-areas of "cognitive interaction": human models of interaction, robot models of interaction, and interaction models that address the interaction as a joint activity [35]. The work presented here clearly lies in this third category: we try to understand cognitive interaction between operators and virtual assistants as a joint activity, shaped by verbal exchanges. Indeed, solutions for efficient operator-assistant interaction must be inspired by what has been designed for other applications, such as robotics. But in this paper, we do not intend to give solutions: our goal is to propose tracks for a technology-independent specification.

* Corresponding author.
E-mail addresses: denys.bernard@airbus.com (D. Bernard), alexandre.arnold@airbus.com (A. Arnold).
https://doi.org/10.1016/j.compind.2019.01.010
Industry needs such an independent specification to evaluate, down-select, specify and validate the diverse frameworks, toolboxes, or methodologies provided by both the software industry and research. Hence, we need to take a step back from technical solutions to elaborate this neutral view. Our approach is mainly influenced by philosophy of action and philosophy of language. This paper thus proposes a description of what has to be done, while leaving the technological choices open.

This paper is an extension of a previous conference presentation [7] which summarized the specificity of cognitive interaction in a five-layer model. In this paper, we aim at describing the theoretical foundations in more detail, and at showing diverse examples of how cognitive interaction should be deployed for human – virtual assistant interaction.

Industry has good reasons to pay attention to the cognitive dimension of human-assistant interaction: it reduces operator workload, improves human-system collaboration, and finally increases the quality of the human task, with subsequent effects on safety and operational efficiency. Moreover, the urgency of understanding cognitive interaction also comes from the pervasive choice of providers of commercial virtual assistants to privilege natural language as interaction mode.

In this paper, we develop the initial conference paper further in two directions. We first introduce the interaction model and its sources. In particular, we try to clarify the role of the common interaction context, in light of ideas from dynamic semantics (e.g. the recent paper of Craige Roberts [46]). Second, we substantiate those concepts on concrete examples. First we focus on two classical approaches of discourse structure in AI, which help to understand what should be the dynamic elaboration of the linguistic and cognitive context through the interaction: first, the structure of discourse by Grosz and Sidner [20], then Segmented Discourse Representation Theory (SDRT) of Asher and Lascarides [25]. We apply those approaches of dialogue to the same imaginary exchange between a mechanic and a virtual assistant, for a troubleshooting task in aircraft maintenance. Finally, we analyze an existing industrial framework to develop chatbots, in order to imagine how it could be extended to fully cover all the needed layers of interaction.

As an introductory example, consider the following imaginary dialogue between a mechanic (M) and a virtual assistant (A). The task of the mechanic is to make available for flight an aircraft on which a failure occurred (the example is inspired from a real failure case, although it is simplified for the sake of conciseness):
(M1) We have "STABILIZER A2 FAULT"
(A1) The Actuator 2 of the stabilizer has failed.
(M2) So what?
(A2) In principle the aircraft can fly with actuator 1 only, but the fault might cause a no-go failure at next power up.
(M3) Why?
(A3) A short circuit in the control unit of Actuator 2 may produce a brake effect, preventing the whole stabilizer from moving.
(M4) What can we do?
(A4) When is the next flight?
(M5) In 3 hours.
(M6) No sorry . . . at 2PM
(A5) Confirm the fault, then disconnect the faulty control unit if needed
(M7) Let's confirm the fault
(A6) Is the aircraft energized?
(M8) Yes
(A7) Erase the non-volatile memory (NVM) of the flight control system
(M9) How?
(A8) On the maintenance terminal, select "system test & scanning"
(M10) Done
(A9) Select "Flight controls"
(A10) Select "Erase NVM"
(M11) What's next?
(A11) Launch the system test
(M12) Wait a minute . . .
(M13) Still faulty.
(A12) Let's disconnect the control unit.
(The assistant continues guiding the mechanic on how to disconnect the control unit)

Example 1: Imaginary assistant helping a mechanic to fix a failure
This example contains some interaction patterns that today's chatbots or virtual assistant frameworks would not easily implement (except through ad-hoc hard-coded behaviors, with poor capability to generalize beyond a few specific situations). For example:

- At the beginning of the scenario, the assistant correctly interprets utterance (M1) "We have STABILIZER A2 FAULT" as an order, although it is grammatically an assertion.
- Although utterance (M5) "In 3 h" is not a well-formed sentence, the assistant correctly interprets it as referring to the time until the next aircraft flight.
- Even more difficult, (M6) "No sorry . . . at 2PM" intends to correct the previous utterance: its interpretation requires taking the conversational context into consideration to catch the mechanic's intention to correct his previous utterance.
- When interpreting (M13) "Still faulty", the assistant should hold not only that the flight control system is still faulty (which is the explicit content of the utterance), but also that the fault is the same as the initial one (otherwise, the human would have mentioned it to be maximally relevant [55]).

Although it seems very simple, our example conversation is already beyond what current virtual assistant development frameworks can easily implement (note that the capabilities of those frameworks evolve daily, and that statement might need to be revised very soon!). Unfortunately, such context-based interpretation processes are pervasive in operational working situations. The appeal to context by communicating partners is both an opportunity – because co-workers obviously share a common context – and an obligation for the efficiency of collaboration. Some of those mechanisms (maybe not all) must be implemented by virtual assistants.
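As a minimal illustration of what "context-based interpretation" means computationally, the sketch below (purely illustrative, not the architecture advocated in this paper; all names are hypothetical) resolves the elliptical answer (M5) "In 3 h" and the correction (M6) against the pending question raised by (A4):

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical, minimal dialogue state: the only context tracked here is the
# question the assistant has left "open" (e.g. (A4) "When is the next flight?").
@dataclass
class DialogueState:
    pending_question: Optional[str] = None   # slot the next user turn is expected to fill
    slots: dict = field(default_factory=dict)

def interpret(utterance: str, state: DialogueState) -> dict:
    """Very naive context-based interpretation of an elliptical or corrective turn."""
    text = utterance.strip().lower()
    # (M6)-like corrections: reinterpret the turn against the *same* pending question
    # that the previous turn answered, and overwrite the previous value.
    if text.startswith("no sorry"):
        value = text.replace("no sorry", "").strip(" .")
        return {"act": "correction", "slot": state.pending_question, "value": value}
    # (M5)-like elliptical answers: "In 3 h" only makes sense as an answer to the
    # currently pending question, not as a stand-alone assertion.
    if state.pending_question is not None:
        return {"act": "answer", "slot": state.pending_question, "value": text}
    return {"act": "assertion", "value": text}

state = DialogueState(pending_question="next_flight_time")    # set when (A4) is uttered
print(interpret("In 3 hours.", state))            # answer to next_flight_time
print(interpret("No sorry . . . at 2PM", state))  # correction of the previous answer
```

Even this toy example shows that the interpretation of a turn depends on a state shared by both interlocutors, which is precisely what most intent-matching chatbots do not maintain.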
This paper groups and summarizes a usable kit of concepts from a diversity of disciplines interested in human verbal communication, ranging from philosophy of language to natural language understanding in AI. Those concepts help to specify the cognitive functions that would make a computer capable of operating efficiently and fluently with a human being in real operations.

A cognitive assistant is something special because it collaborates with human operators, and conversation is a special kind of human collaboration. Conversation is used to perform many other kinds of collaborative activities; in most cases conversation itself is not the finality of the interaction [58,61]. In operational situations such as the one depicted in the introductory example, it might even be difficult to disentangle communicative actions from other activities of the common task.

In our example dialogue, the assistant had to understand elliptical phrases, to recognize a diversity of speech acts, to use and interpret pronouns or indexical words, to maintain dialogue coherence, and to track the mechanic's expectations. Human capability to act together and conversation have been considered from two different streams of thought in philosophy of mind. The nice thing is that those initially distinct perspectives finally converged towards a consistent picture where conversation is seen as a cooperative activity which helps agents to cooperate for other purposes. Starting from the analogies between both perspectives of action and language, we summarize the cognitive interaction between human users and virtual assistants in a unique multi-layered description. After the philosophical foundations, the paper introduces a series of examples from AI, where part of the ideal of cognitive interaction has been implemented or formally described in an implementable formalism.

First, we analyze two classical approaches of collaborative dialogue in Artificial Intelligence, and then we focus on a state-of-the-art framework to develop chatbots. This analysis should draw a picture of where we are today on cognitive interaction with virtual assistants, and where the next research priorities could be.

2. Towards a layered model of cognitive interaction

A technology-independent representation of cognitive interaction can be derived from classical accounts of human collaboration and communication. This section first introduces the main cognitive notions – group agency, speech acts, and common ground – before introducing the main model itself.
2.1. Joint agency vs. individual agency
What matters for the efficiency and the success of a virtual assistant is that its action plan – including communicative actions – adequately meshes with the operator's action plan. Ideally, the operator and the virtual assistant should jointly track the goals of the operator. Philosophy of action provides some clues to understand the cognitive processes which support joint agency.

From a classical perspective in philosophy of action, human action is explained by individual mental attitudes, which are directed towards the real world where the action is to take place, e.g. beliefs, hopes, intentions, desires, attention, and assumptions [49]. Such attitudes are manifestations of the general ability of our minds to relate to the real world, or "intentionality" [48]. The assumption that action is driven by individual intentional attitudes holds also for group action. Hence, a central problem of philosophy of action is to explain how individual intentional attitudes may drive collective ones.

Intentional attitudes are related to particular aspects of the real world: they have specific "contents" of different types. A belief is defined by the believed fact; the content of a desire is a desired state of affairs, e.g. the possession of a given object; the content of an intention is the targeted state of affairs or an action aiming at obtaining a particular goal. Different intentional attitudes can be related to the same content: I may know that the sky is blue, hope that the sky is blue, imagine that the sky is blue, or perceive that the sky is blue. All those intentional attitudes have the same content – that the sky is blue – but differ by the kind of relation they entertain with that fact. Searle [48] distinguishes in particular two major subgroups depending on their "direction of fit": beliefs succeed if their contents match the real world; they have a "mind-to-world" direction of fit. Conversely, intentions and desires do not need to match the actual world, but they succeed if the world finally aligns to their contents: intention and desire are said to have a "world-to-mind" direction of fit. Although they have the same direction of fit, intentions and desires differ by their conditions of satisfaction: a successful intention actually causes its content to happen. A desire is different: I can desire that it rains tomorrow, even if I will not cause the rain (conversely, it sounds odd to intend that it rains tomorrow).

Intentions have a central role in practical reasoning, leading to the elaboration of plans and the decision to take actions, as theorized in particular by Bratman in [8] through a planning theory of agency. Making a plan consists in forming a cascade of intentions: main goals are progressively cascaded into lower-level intentions, down to instrumental intentions which match the know-how of the agent. Intentions are submitted to specific rationality norms [8]: for example, one has to act according to his declared intentions; intentions must persist as long as they are not realized, if we have no good reason to cancel them; and the set of intentions one holds at a given time must be consistent. Consequently, an intention is a commitment to act: if I declare my intention to make a particular action at a given time and I finally don't make it, I can be charged with irrationality. Because they are also commitments, intentions carry reliable information about the future behavior of the agents. It is reasonable to believe that intentions of rational agents will be realized.

Bratman's planning theory of agency has influenced AI since the 90's. It has been the case in particular for the research on BDI agents, i.e. agents driven by human-like cognitive states: beliefs, desires, intentions [39]. An approach to implement a general system to consistently reason on intentions and beliefs has been proposed by Shoham in [52]. In the continuity of this approach, [22] has elaborated the capability to handle trees of intentions. Note that handling multilevel intentions is necessary to properly represent and reason about agents' intentions in collaborative situations [19,20].

In general, one cannot intend somebody else's action (except in a strong hierarchical framework). Hence, a theory of collective action must explain how the intentions of a group of agents involved in a common action are coordinated. How can individual minds shape collective actions, for example building a house, having a conversation, designing a complex system? This question is addressed later by Bratman through the concept of shared intentions [9]. A shared intention is the cognitive configuration of a group of partners who could say "we are acting together", and act consistently with that statement. Bratman proposed a set of necessary conditions that define shared intention and enable collaborative behavior in a small group of partners. It is not necessary to expose this theory in detail here; nevertheless, it is worth recalling that a virtual assistant must contribute to the fulfilment of the user's intentions, by adjusting its plans to what the user intends, is actually doing, and is about to do.

Except for conversational systems where intentions are conveyed verbally, agents' intentions are not always easy to catch: intention recognition is an active research field, particularly in human-robot interaction (e.g. [14,23,57]). Even more, intention is not the only cognitive attitude to be monitored by a virtual assistant. Several authors, including in the field of human-robot collaboration, insist on the importance of monitoring user attention (e.g. [34,64]).

To summarize, the virtual assistant must not only represent the operational situation, it must also represent the user's cognitive perspective on the task. Such requirements have already been endorsed in diverse application areas, ranging from human-robot interaction to conversational systems [2,14,19,27,57,64]. An example approach to coordinate the plans of several agents is suggested in [19], where each agent, besides his own backlog of actions, manages a backlog of common actions that have been agreed with his partners. Most of the literature agrees with Bratman on the statement that collective agency uses planning "algorithms" similar to those of individual agency, but dedicated to building and maintaining consistent and agreed sub-plans. Less obvious is the assumption that the assistant's rationality must be transparent: virtual assistants must answer and act in ways which are understandable by human users [4], as the result of a rational planning process. The assistant behavior needs to be transparent and predictable according to the user's norms [61].
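To make the BDI picture concrete, here is a minimal sketch (illustrative only, not tied to any specific BDI framework cited above) of an agent whose intentions are treated as commitments: they persist until realized, and a new intention is only adopted if it does not conflict with the current set:

```python
from dataclasses import dataclass, field

@dataclass
class BDIAgent:
    """Toy agent holding beliefs and intentions (desires omitted for brevity)."""
    beliefs: set = field(default_factory=set)
    intentions: list = field(default_factory=list)   # ordered: goals first, then sub-intentions

    def adopt(self, intention: str, conflicts_with=()) -> bool:
        # Consistency norm: refuse an intention that conflicts with a current one.
        if any(i in self.intentions for i in conflicts_with):
            return False
        self.intentions.append(intention)
        return True

    def notice(self, fact: str) -> None:
        # Persistence norm: an intention is dropped only when it is realized
        # (here, crudely, when the corresponding fact enters the beliefs).
        self.beliefs.add(fact)
        self.intentions = [i for i in self.intentions if i not in self.beliefs]

agent = BDIAgent(beliefs={"aircraft_energized"})
agent.adopt("confirm_fault")
agent.adopt("erase_nvm")       # sub-intention serving "confirm_fault"
agent.notice("erase_nvm")      # once realized, the sub-intention is dropped
print(agent.intentions)        # ['confirm_fault'] still persists
```

A real collaborative assistant would additionally have to maintain a second, shared backlog of agreed common actions, as in [19], and keep it consistent with the operator's own plan.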
2.2. Saying and doing

To coordinate their plans, collaborative agents have to exchange about their current beliefs and intentions. From this perspective, communication could be considered as an independent capability, enabling agents' collaboration. More consistently, communication is itself a collaborative activity, rather than a special capability besides joint agency. J.L. Austin gave in the 1950s a series of lectures that inaugurated a radical change in philosophy of language. Since this series of lectures, published under the title "How to do things with words" [6], speech is no longer regarded only as a way of producing descriptions of reality, but also as the performance of "speech acts" which purposefully change reality. Speech acts may succeed or not depending on how the cognitive state of the hearer is finally affected. The logical structure of speech acts parallels the structure of intentional attitudes presented above: a speech act decomposes into content and force. The content of a speech act represents the facts or objects the speech act is about. The force delineates the purpose of the speaker when making this particular speech act. The same content can be pushed with different forces: for example "you are running" is an assertion, "run!" is a prescription, "are you running?" is a question, but they all have the same content.

Speech act purposes are cognitive and social: when giving an order, one intends to change the intentions of the hearer, and when making an assertion one aims primarily at changing the beliefs of the hearer. Commitments or successful prescriptions create new obligations. A successful assertion changes the common ground, i.e. what is openly known to be known among the interlocutors. Because the common ground is the root of agents' actions and their justification, it plays a central role in all collaboration situations, including the interaction between human operators and virtual assistants.

The classification of possible forces in a global model is a hard task, and testifies to the diversity of actions that can be done with words. For example, here are the categories proposed in [40] as a consensual high-level taxonomy of speech acts (Fig. 1). Representative speech acts contain facts that are true or intended to become true. They split into performative and assertive speech acts: an assertion says something about the actual world ("it's raining"). Performative acts aim at making their content true. A declarative act makes true a social fact ("I declare the meeting closed"). A commitment creates obligations for the speaker ("I promise to go to school"). Prescriptive acts make the addressee intend a particular state of affairs or action ("Please, go to school"), provided the speaker has authority upon the hearer. Finally, expressive speech acts communicate emotional states or social relations ("Hello", "thank you").
social relations ("Hello","thank you").
exchange and plan the next steps.
The debate on the classification of speech acts is still alive.
But it can be objected the following to such psychological
The main difficulty is that speech acts often belong to several
approaches of conversation: in daily conversations, hearers are
categories. For example, by saying "Hello" I would not only
not permanently calculating the intentions of speakers, and
express my wish to consolidate social relations, but I also
speaker's utterances are not deliberately designed to trigger
perform the act of greeting someone: it is then a performative
precise sequences of inferences. Recent approaches of speech
act. If I say “I don’t know your email” you might understand
acts inspired by dynamic semantics tend to reduce the
that I’m requesting you to give me your address, it thus can be
psychological dimension of interpretation by understanding
interpreted as a prescription. The usual solution to this problem,
interpretation as a way of managing the conversation context
in accordance with Austin’s distinction between locutionary act,
[17,37,45,46]. Because speech acts are transparent, they can in
illocutionary act and perlocu- tionary act, is to consider that an
principle be understood through their impact on the information
utterance performs speech acts of different kinds at the same
openly shared by the interlocutors. Successful speech acts update
time. In the utterance “I don’t know your email”, the action of
and enrich the common ground. The
making you know that I don’t know your email would be an
illocutionary act, and the action of getting you

Fig. 1. Example taxonomy of speech acts.


The next section enters in more detail into the dual importance of context: on the one hand, utterances can only be understood by appealing to context. That is particularly obvious in operational conversations, where partners interact to jointly act in a common environment. On the other hand, the conversation continuously affects the common ground.

2.3. Common ground

Collaborating agents need to share a wide variety of facts relevant to their current interaction. In our aeronautical applications, common ground might include: the operational situation ("we are repairing such failure of such aircraft, which has to be dispatched on such flight from X to Y, etc."); relevant background knowledge (procedures, affected systems, . . . ), including linguistic knowledge and common sense knowledge; and cognitive facts related to the current interaction ("the mechanic is trying to determine the operational impact of fault X before deciding to fix it"), including assumptions, intentions, or the focus state of the agents. Understanding a sentence is then an inference process anchored into the common ground. The role of contextual interpretation (or pragmatic interpretation) ranges from the determination of word reference to the full determination of the truth conditions of utterances, as intended by the speaker.
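As an illustration of what such a common ground could contain in our maintenance scenario, here is a minimal, purely hypothetical snapshot (the field names are ours, not a proposal from any specific framework):

```python
# Hypothetical snapshot of the common ground at the beginning of the example dialogue.
common_ground = {
    # operational situation
    "situation": {
        "aircraft": "MSN-1234",                      # hypothetical aircraft identifier
        "reported_fault": "STABILIZER A2 FAULT",
        "next_flight": {"departure": "X", "arrival": "Y", "time": "2PM"},
    },
    # background knowledge shared by both partners
    "background": {
        "procedures": ["confirm fault", "erase NVM", "run system test"],
        "affected_system": "flight controls",
    },
    # cognitive facts about the interaction itself
    "cognitive": {
        "mechanic_goal": "determine operational impact of the fault before fixing it",
        "focus": "fault symptom",
        "assumptions": ["aircraft is energized"],
    },
}
print(common_ground["cognitive"]["focus"])
```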
Determining the reference of indexical words such as "I", "here", "now" is most often straightforward, because those words unambiguously point to unambiguous features of the current context. But they are particular cases. Most often, agents use phrases which are easy to understand for the other involved human agents, but which are out of reach for natural language understanding software. For example:

- Deictic use of pronouns ("he", "we", "there"), sometimes reinforced by pointing gestures, to refer to persons or objects of the concrete situation.
- Definite descriptions ("the wheel is flat", "Please, close the door", "if the fault continues . . . ").
- Relative clauses pointing to events observed previously: "This action recalls all the warnings that the flight crew cleared or cancelled during the last flight."
- Understanding quantifiers in context (e.g. "all" in the previous example).
Even more difficult for software virtual assistants are the metaphors or metonymies which are used even in technically constrained operational situations: "Hello cockpit, ground is ready". When you say to your smart phone "Message to John: I'm arriving", the system has to determine in the elliptical first sentence that the user intends to send a message to a certain person named "John", whose precise reference depends on the social context. "I'm arriving" is not to be taken as a raw assertion, but rather as a quotation: the intended content of the message.

Such context-dependent interpretation processes have been studied since the foundation of pragmatics by Austin and his theory of speech acts. Grice's logic of conversation is an attempt to give a general account of how one can draw inferences above the literal meaning ("implicatures"), in a way which is predictable by the speaker [18]. The interpretation of "implicatures" in Grice's theory is based on a few basic principles:

- Conversation is driven by norms, which specify the quality and quantity of information one can expect to find in a sentence.
- Open violations of those norms by the speaker indicate that the addressee has to infer the intended message above the literal meaning.
Sperber & Wilson's Theory of Relevance [55] generalized and simplified Grice's approach of pragmatic inference by introducing the key notion of "relevance". In a nutshell, interpreting an utterance consists in elaborating and refining its meaning up to the point where it becomes relevant in the current context. That presumes that the speaker wants to be relevant and that he knows how to do it. The speaker must know that the hearer has in mind the required context elements to interpret the utterance: here is the role of the common ground. The common ground is what they both know, in a transparent way.

Utterance interpretation does not necessarily pass through some literal meaning which would ground higher-level inferences [42]. If the captain says "birds!" just after take-off, no literal meaning can be assigned to his utterance, since it is not even a well-formed sentence. Nonetheless, the co-pilot will immediately catch that they are about to collide with birds. That seems to be the most reasonable explanation of the captain's utterance, given the operational situation and the presumption that the pilot wants to be relevant in this context.

Like in the previous example, the explanation appeals to general abductive reasoning, applied to finding an answer to the question: how to explain the captain's verbal behavior from his intentions and beliefs? [41]
Recent approaches in dynamic semantics analyze speech acts as "moves" in a game whose "scoreboard" is an information structure centered around the common ground [37,45,46]. The meaning of an utterance is its "Context Change Potential": the context change potential is a function which maps the initial context to the context which would result from the successful performance of the speech act in the initial context. More formally, Roberts [46] defines the "scoreboard" of the language game as a tuple <I, G, M, <CG, QUD>>, where I is the set of players, who pursue determined and open goals (G). M is the ordered set of "moves", i.e. assertions, questions, suggestions. Moves can be accepted or not, with different consequences on the common ground (CG). QUD (accepted questions under discussion) drives the discussions where agents aim at building a shared view on a question by splitting it into relevant sub-questions. By segregating moves from common ground, Roberts' information structure supports the representation of failed or hesitating conversation threads: an utterance which has not been accepted by the hearer is stored as a move, but its content is not integrated into the common ground.
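A minimal sketch of such a scoreboard, assuming a drastically simplified reading of Roberts' tuple <I, G, M, <CG, QUD>> (acceptance is reduced to a boolean flag; everything else is a plain Python container):

```python
from dataclasses import dataclass, field

@dataclass
class Move:
    speaker: str
    kind: str       # "assertion" | "question" | "suggestion"
    content: str

@dataclass
class Scoreboard:
    players: set = field(default_factory=set)            # I
    goals: set = field(default_factory=set)              # G
    moves: list = field(default_factory=list)            # M (ordered)
    common_ground: set = field(default_factory=set)      # CG
    qud: list = field(default_factory=list)              # questions under discussion

    def play(self, move: Move, accepted: bool) -> None:
        # Every move is recorded, but only accepted moves update CG / QUD.
        self.moves.append(move)
        if not accepted:
            return
        if move.kind == "assertion":
            self.common_ground.add(move.content)
            if self.qud and move.content.startswith(self.qud[-1]):
                self.qud.pop()                            # the assertion resolves the current QUD
        elif move.kind == "question":
            self.qud.append(move.content)

board = Scoreboard(players={"mechanic", "assistant"},
                   goals={"make aircraft available for flight"})
board.play(Move("assistant", "question", "next flight time"), accepted=True)
board.play(Move("mechanic", "assertion", "next flight time is 2PM"), accepted=True)
print(board.common_ground, board.qud)
```

The point of the sketch is the separation it makes visible: a rejected move stays in the move history without ever touching the common ground.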
Some more complex representations are needed when analyzing conversation failures: it is then necessary to represent the discrepancies among the representations of the context held by each of the interlocutors [44]. Analogously, Lascarides and Asher propose in [25] a multidimensional extension of their powerful Segmented Discourse Representation Theory (SDRT). The evolution of the background of each interlocutor is represented in parallel streams, hence discrepancies can be easily represented and logical synchronization processes can be clarified. An example of the application of SDRT to task-oriented dialogue is developed in Section 3.2.

2.4. Multi layered cognitive interaction

Based on the notions introduced previously, we can represent cognitive interaction as a consistent stack of actions – including different kinds of speech acts – which will impact the interaction context in specific ways. This representation can be advantageously used to structure a specification of interaction: action at each level is driven by specific requirements, and the success of each level of action depends on specific cognitive capabilities at both the emitter and the receiver sides.

A single utterance in a conversation turn realizes several actions at the same time: producing sound; making a sentence; conveying meaning; producing cognitive effects such as a cascade of inferences.
A layered model of speech acts was first introduced by Austin through the distinction among locutionary acts, illocutionary acts, and perlocutionary acts [6].

- The locutionary act is the pure linguistic act itself: designing a sentence and uttering it through the appropriate words (e.g. uttering "I don't know your Email").
- The illocutionary act is the core speech act: when successful, it modifies the cognitive status of the hearer according to the conventions of language (e.g. making the hearer aware that I don't know his Email).
- The perlocutionary act encompasses the cognitive side effects of the illocutionary act. Typically, Grice's implicatures are outcomes of the perlocutionary act (the addressee infers that I want him to give me his Email).

The original classification by Austin has been widely debated, particularly because of the difficulties to draw the frontiers between the different speech act categories (as an example of the multiple debated questions: can the distinction between locutionary acts and illocutionary acts be reduced to the distinction between sentence meaning and speaker meaning? [47,40]). It seems today out of reach to propose an uncontroversial model of the diverse dimensions of speech acts. Despite those theoretical difficulties, we think that such a model is useful in practice for engineers working on cognitive assistants. Such a conceptual framework not only helps to understand how special the interaction with a cognitive assistant is, but it also sheds light on the different features to be implemented. When considering the technologies at hand today, not all perform well on all the dimensions of cognitive interaction. A layered representation helps to characterize the computing functions needed and to architecture a solution, much as the well-known Open System Interconnection model for network communications helps to design network architectures. We propose a simplified model based on the kind of agent intentions which are fulfilled at each level of interaction.

The five levels in the representation below (partially inspired from the representation of human cooperative communication in [58]) correspond to the different kinds of intentions which drive the speaker's speech act. At the receiver side, a stack of interpretation and inference capabilities needs to be activated for the speech act to succeed (Fig. 2). Here is a short summary of the requirements to be fulfilled at each interaction level in order to obtain a fluent and successful interaction:

To act towards the common goal (cooperation), an agent might have to act on the cognitive state of his partner (pragmatic act): e.g. to ask for support, to alert on some new relevant fact, to elaborate a common plan. The semantic act is the means to obtain the pragmatic effect: the speaking agent informs his partner about his intention (e.g. to get help or to draw his attention), then the receiving agent uptakes the intention by forming the corresponding cognitive attitude, typically an intention or a belief. The semantic act succeeds by the means of lower-level acts: the conventional linguistic act, and the sounds – or any other physical signals – which carry the sentence.

The semantic act succeeds if the receiver catches that the emitter intends to cause a certain mental effect, more precisely, which part of the common ground has to be updated. The pragmatic act succeeds if the receiver actually changes his cognitive state accordingly, depending on the current operational context.

A cognitive assistant must implement both sides of the interaction: not only must it be able to communicate relevant information with appropriate means, but its interpretation processes must also match natural human communication mechanisms. It starts with proper recognition of physical signals, either sound or gesture (physical layer). Then, it has to recognize those signals as symbols of a conventional language (linguistic layer), and derive the communicative intentions of the agents (semantic layer). This communicative intention in the current operational context will lead the assistant to update its knowledge base and its plans. Furthermore, it will infer intended indirect effects, e.g. the implicit request hidden behind a statement. Whatever the type of cognitive effect intended by the speaker (get attention, inform, request . . . ), and under the assumption that the speech act is relevant, the receiver will adjust her sub-plans accordingly. At the end of each interaction turn, the sub-plans of both agents should mesh.

The layered cognitive interaction as described in this paper is technologically agnostic. But the implementation of each layer of the architecture requires the availability of specific cognitive capabilities. Some of those capabilities are already provided by existing tools, some others are still the object of research. It would be too long to detail here all our requirements at each level. The following table gives a summary of the requirements to be fulfilled to implement cognitive interaction (Fig. 3).

Fig. 2. The layered model of cognitive interaction.


Fig. 3. Summary of requirements.
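The following sketch gives one possible, purely illustrative reading of the five-layer stack as a processing pipeline on the receiver side; the layer names follow the model above, while every function body is a hypothetical stub, not an implementation proposal:

```python
# Illustrative pipeline: each layer consumes the output of the previous one.

def physical_layer(audio_bytes: bytes) -> str:
    """Signal recognition, e.g. speech-to-text (stubbed here)."""
    return "we have stabilizer a2 fault"

def linguistic_layer(text: str) -> list:
    """Recognize conventional symbols: tokens, syntax (stubbed as tokenization)."""
    return text.split()

def semantic_layer(tokens: list) -> dict:
    """Derive the communicative intention carried by the sentence."""
    return {"act": "inform", "content": "STABILIZER A2 FAULT is active"}

def pragmatic_layer(intention: dict, common_ground: dict) -> dict:
    """Update the common ground and infer indirect effects (here: an implicit request)."""
    common_ground.setdefault("facts", []).append(intention["content"])
    return {"implicit_request": "assess operational impact and propose a fix"}

def cooperative_layer(effect: dict, own_plan: list) -> list:
    """Adjust the assistant's sub-plan so that it meshes with the operator's plan."""
    own_plan.append(effect["implicit_request"])
    return own_plan

common_ground, plan = {}, []
tokens = linguistic_layer(physical_layer(b"..."))
effect = pragmatic_layer(semantic_layer(tokens), common_ground)
print(cooperative_layer(effect, plan))
```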

This representation helps us to better understand and compare the diverse industrial solutions available to develop virtual assistants, or other systems designed to interact cognitively with human operators. Indeed, it is the case of some robots. For example, consider the architecture of human-robot interaction in project JAST, Joint Action Science and Technology [16]. According to the section dedicated to human-robot interaction in the Handbook of Robotics [35], this architecture is a representative example of the dialog-based paradigm of human-robot interaction, which understands interaction as "joint action". Starting from the bottom, the architecture presented in [16] implements the physical layer through a variety of modules, specialized for the different modalities available: speech recognition and synthesis, face and gaze tracking, production of facial expressions, hand tracking. Regarding the linguistic layer, only the linguistic layer for speech is made explicit in the architecture (the design choice has probably been not to consider facial expressions or hand movements as conventional signs, to be decoded by a specific module). The architecture has a dedicated semantic interpretation module, directly interfaced to a central decision making module, which designs outputs and actions to be done according to the current plan, the discourse history, and the world model. The semantic interpretation module includes the pragmatic interpretation task: it "finds referents for noun phrases, including demonstrative and anaphoric pronouns" and produces "disambiguated propositional content". But the update of the knowledge on the operational context is distributed among the semantic interpretation and the decision making modules (indeed, the current updated plan is part of the interaction context).

The layered representation of interaction also applies to cases where there is no explicit communication. Think of tasks where two agents coordinate their movements or thoughts without speaking: when two people are carrying heavy furniture, or a pilot is executing a procedure and the co-pilot is monitoring the execution of the procedure, or just when you are walking in a crowd. The agents act in a transparent way: they make sure that their gestures are understandable by the other agents and that their intentions are transparent. In such cases, interactions must also happen at physical, linguistic, semantic, pragmatic and cooperative levels, even if the work to be done at linguistic and semantic levels is obviously of a different nature than in the case of spoken language. A recent paper [27] describes architectures for multi-modal human-robot interaction where the knowledge base is fed from diverse directions: on top of a sensorimotor layer, the verbal communication is managed by a specialized natural language processing module, while in parallel situation assessment functions feed the knowledge base with facts about the scene. Such a system builds the human perspective of the scene by recognizing facts such as "the user sees X", "the user looks at X", or "the user points at X". It enhances the capability of robots to act at the pragmatic level by introducing knowledge about the shared physical environment and agents' gestures into the common background of interaction.

A major difficulty to implement cognitive interaction, and particularly the pragmatic layer, is that agents have to build and maintain a proper representation of the common ground. This representation should include facts about the current situation, some information about the cognitive attitudes of the partner, and should track the interaction. The first two sections of the next chapter are dedicated to two now classical approaches of discourse representation in AI. They both are candidate solutions (or part of a solution) to that problem.
3. Managing cognitive context: example solutions from AI

3.1. Grosz and Sidner

In their seminal paper Attention, Intention and the structure of discourse [20], Grosz and Sidner describe the cognitive structure of discourse as composed of three subparts: (1) the actual sequence of utterances, (2) a tree of intentions, (3) a focus stack. Even if this is not explicit in the initial approach, this structure helps to flesh out what happens at the pragmatic level in a conversation. This triple structure is a formalization of what each agent has to know consistently with the other agent. Although their initial approach applies to monologues or texts as well as to conversation, the cited paper of Grosz and Sidner focuses on task-oriented exchanges.

From now on, we try to illustrate this approach with our introductory example. The intentions considered here are typically intentions for cognitive actions, to change the intentions or knowledge state of the interlocutor. They represent in a consistent way primary intentions linked to the task (e.g. get the mechanic to launch an auto-test of the flight control system) and lower-level communicative intentions such as getting the mechanic to know the operational impact of the fault.

The structure of the conversation parallels the upper layers of the intention tree and the structure of the procedure itself, which counts four major sub-sections: first, understand the fault and its consequences (DS1); then decide what to do (DS2); confirm the fault (DS3); fix the fault (DS4) (Fig. 4).

The structure of intentions is restricted to open intentions, i.e. intentions to be recognized by the partners. Single open intentions are attached to each subpart of the conversation (they are called DSPs, for Discourse Segment Purposes). In the example, we distinguish the primary intentions linked to the practical objectives of the discourse segments DS0, DS1, DS2, DS3, DS4 and lower-level communicative intentions contributing to fulfil the primary intentions. The intention tree could look like Fig. 5.

The sequence of communicative intentions corresponds to the sequence of utterances in the example. The branches in the tree represent "dominates" relations: an intention I1 is said to dominate intention I2 if I2 is part of the intentions to be fulfilled in order to fulfill I1.

Not all the intentions are explicitly mentioned in the conversation. For example, the mechanic does not explicitly require help: it is understated by his first utterance: We have "STABILIZER A2 FAULT".

The intention tree – as a shared representation of the conversation – is built progressively along the interaction. At each step, each agent determines the next action to be taken and interprets the speech act of the partner by using the current focus space.

The focus space is a stack of focus states, each focus state representing the intentions and objects of interest introduced by each step of the conversation. For example, Fig. 6 shows what could be the initial values of the focus stack at the beginning of the conversation.

The dialogue first focuses on the fault symptom, then introduces the possible cause of the fault symptom, and then its impact. At each turn, the focus stack contains the focused elements for each of the upper-level intentions of the same branch in the intention tree. The focus stack is enriched with the communicative intentions and the new objects of interest introduced for the current turn.

Fig. 4. Intention related discourse segments.


Fig. 5. Intention tree example.

Fig. 6. Example evolution of the focus stack.


The focus space has two objectives: (1) it helps to determine the intentions which the current utterance can contribute to (or be "dominated" by); (2) it provides the objects of interest which can be referred to in the current utterance. For example, further in the scenario, when the mechanic says (M5) "In 3 h.", to interpret this utterance the assistant needs to "know" that the exchange is currently focused on the next flight, so that "in 3 h" can be interpreted as the remaining time before the next flight.
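A toy illustration of how a focus stack can drive the interpretation of (M5) "In 3 h" (again a hypothetical sketch, not Grosz and Sidner's formalism itself):

```python
# Each focus state records the open intention of a discourse segment and the
# objects currently in focus; the stack top is what the next utterance is
# interpreted against.
focus_stack = [
    {"segment": "DS0", "intention": "make aircraft available for flight",
     "objects": ["stabilizer", "actuator 2 fault"]},
    {"segment": "DS2", "intention": "decide what to do",
     "objects": ["next flight"]},          # pushed when (A4) asks about the next flight
]

def interpret_elliptical(utterance: str) -> str:
    top = focus_stack[-1]
    if "next flight" in top["objects"] and utterance.lower().startswith("in "):
        return f"time remaining before the next flight = {utterance.strip(' .')}"
    return "no contextual reading found"

print(interpret_elliptical("In 3 h."))   # resolved against the focused object "next flight"
```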
The integration of cognitive state modelling and planning into conversation systems has been a point of interest for AI until recently, particularly if the agents are to collaborate with other automata or human beings. In line with Bratman's publications on the role of planning in individual and collective action [8,9], implementing collaborative behaviors requires computing the shared sub-plans, and how those sub-plans commit the agents to act consistently.

For example, the collaborative agents' model developed by Ferguson and Allen in [19] explains how cognitive models elaborated in parallel by several agents can synchronize. It describes in detail how collaborative planning capability coupled with negotiation capability enables an agent to get support, through building joint intentions towards the agreed common goal. In [21], Grosz and Kraus elaborate a logical specification of shared plans and clarify the coordination between individuals in executing complex plans, e.g. how two agents elaborate partial plans consistently through deliberation and contracting.
3.2. SDRT applied to task-oriented dialogue

According to Segmented Discourse Representation Theory (SDRT) [5], the meaning of a discourse can be logically represented based on the split of the discourse into its constituents, or discourse segments. Hence, that is another candidate formalization of the common ground in the conversation. The representation of the meaning takes two forms: first, the propositional content of each segment is represented in predicate logic, with unbound variables that have to be linked to objects referred to in other segments. Second, segments have specific relations ("rhetorical" relations) with other segments; for example, two segments can be part of a same narration, or one segment can provide an explanation of another segment.

More formally, each discourse or constituent of the discourse can be recursively represented as an SDRS, or Segmented Discourse Representation Structure, where:

- The propositional content of sentences is represented in predicate logic;
- The referents used in each segment are made explicit;
- Segments are logically related through different kinds of rhetorical links, e.g. Elaboration, Narration, Explanation, Contrast.

SDRT has been the framework of a large number of publications regarding discourse interpretation, and is based on logical foundations which in particular provide clear semantics to all the constructs used in SDRSs.

The most used rhetorical relations are Elaboration and Narration. Elaboration is a subordinating relation: it typically links a sentence with others which develop its content in more detail. Narration is a coordinating relation, expressing that the events described by two sentences are in sequence.

As an example of this application of SDRT to carry directives, let's consider a fragment of a written procedure associated with our introductory conversation example. The procedure can be described as the narrative of the right way of solving the problem. It could be written like this in a troubleshooting manual (again, the example is very simplified compared to what a real procedure would be):

Energize the aircraft
Erase the non-volatile memory (NVM) of the flight control system:
- on the maintenance terminal, select "system test & scanning"
- select "flight controls"
- select "erase NVM"
Run the system test

That fragment could be represented as the following Segmented Discourse Representation Structure (we use the labels of the utterances to name the representations of their content) (Fig. 7). The representation of the content of each utterance covers both the order and the presuppositions of the order (for example, the order to select the "system test" page on the maintenance terminal presupposes that the terminal affords access to that page). Each free variable is linked to a bound variable in a preceding segment.

Fig. 7. Example Segmented Discourse Representation Structure (SDRS) representing a procedure.

The fragment is hierarchically segmented into three main parts, corresponding to energizing the aircraft, erasing the non-volatile memory of certain computers, and launching a system test. It is possible to represent the rhetorical structure in a more compact way by focusing on the relations (Narration and Elaboration relations are represented as solid lines and dotted lines) (Fig. 8).

In [38], the authors present an extension of SDRT to describe a conversation where one of the interlocutors guides the other one through an itinerary. They had to introduce variants of the rhetorical relations of the initial SDRT to represent some of the cognitive intentions of the interlocutors, in particular that the guiding person is giving orders and recommendations to the guided person, and that the guided person might need to ask questions to understand the plan. The relations plan-sequence and plan-elaborate are sub-relations of the general SDRT Narration and Elaboration relations, which carry information on the intentions of the speaker in the execution of the task: if a1 and a2 represent two actions, Plan-elaborate(a1, a2) is true if a2 contributes to the achievement of a1. Similarly, Plan-sequence(a1, a2) is true if a1 must be achieved before a2. To represent our introductory dialogue in this way, we also use two relations which represent the goals related to speech acts: question-answer-pair (or QAP), and Acknowledge. The rhetorical relations for the fault confirmation part of the example dialogue can be represented in a tree form (Fig. 9). Notice that to build a complete discourse representation, we added a few tacit nodes that do not match any explicit utterance in the conversation (M7', M9'). The dotted lines around (A6), (M8) and (M9), (A8), (A9), (A10) model the contribution of question-answer pairs. (M9') is a typical example of what could be implicit in practical task-oriented conversations at work: no answer is often interpreted as an implicit acknowledgement. Similarly, utterance (M11) "What's next?" implicitly acknowledges (A10) "Select Erase NVM".

Those examples illustrate the diverse types of actions that can be performed on the conversation state, represented as an SDRS, at the pragmatic level of interaction:

- Update of the conversation background according to the previous utterances, including with the implicit information needed to maintain the coherence of the dialogue;
- Determination of references through several utterances, including when abbreviated or elliptical phrases are used;
- Recognition of rhetorical relations among fragments of the interaction (elaboration, acknowledgement, question-answers . . . ).
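A schematic rendering of such a structure, assuming a naive graph encoding (the labels and relation names follow the text above; the encoding itself is ours, not the formal SDRT language):

```python
# Segments are labelled by the corresponding utterances; relations link labels.
segments = {
    "A5": "confirm the fault, then disconnect the faulty control unit if needed",
    "A6": "is the aircraft energized?",
    "M8": "yes",
    "A7": "erase the NVM of the flight control system",
    "A8": "on maintenance terminal, select 'system test & scanning'",
    "A9": "select 'flight controls'",
    "A10": "select 'erase NVM'",
    "M11": "what's next?",
    "A11": "launch the system test",
}

relations = [
    ("QAP", "A6", "M8"),              # question-answer pair
    ("Plan-sequence", "A6", "A7"),    # check energization before erasing the NVM
    ("Plan-elaborate", "A7", "A8"),   # A8..A10 elaborate how to erase the NVM
    ("Plan-elaborate", "A7", "A9"),
    ("Plan-elaborate", "A7", "A10"),
    ("Acknowledge", "A10", "M11"),    # "What's next?" implicitly acknowledges A10
    ("Plan-sequence", "A7", "A11"),   # run the system test after erasing the NVM
]

def children(label: str, relation: str) -> list:
    return [tgt for rel, src, tgt in relations if rel == relation and src == label]

print(children("A7", "Plan-elaborate"))   # ['A8', 'A9', 'A10']
```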

In principle, SDRT and the approach of Grosz and Sidner introduced in Section 3.1 take two different perspectives on discourse structure. Both approaches succeed in showing the coherence of the dialogue through the underlying structure. But in SDRT the focus is put on semantics, whereas in "Attention, Intentions and the structure of discourse" the focus is put on communicative intentions [11]. The goal of SDRT is to represent all the semantic content of a discourse in a unified and theoretically founded representation, which supports the result of all the interpretation processes, including the pragmatic interpretation processes. In this view, the meaning of texts or utterances is made as independent as possible from the speaker's intentions. That does not prevent the process of elaborating this meaning from including assumptions about intentions, but SDRT allows reducing as much as possible the meaning of sentences to truth-evaluable constructs, making the economy of psychological interpretations. At the opposite, the merit of the approach of Grosz & Sidner is to show the dependence of dialogue on the structure of the task and the driving intentions. Nevertheless, this initial distinction blurs a little in later evolutions of SDRT. We have seen that when formalizing a task-oriented dialogue, it has been useful to enrich the set of rhetorical relations of SDRT to represent some intentional aspects: an order or a recommendation carries the intention that the user actually performs the task, a question represents the intention to get to know some information. Even with the "add-ons" developed for task-oriented dialogue, SDRT hardly represents extra-linguistic or cognitive features when they are not verbally expressed. Nevertheless, SDRT has been shown to support the resolution of many classical interpretation issues such as understanding anaphora and pronouns, determining the reference of definite descriptions, or grasping indirect speech acts. For the case of a virtual assistant, even more if it has other feedback than the operator's speech on what the operator is doing, it is useful to maintain a consistent view of the task-oriented intentions of the operator, because the virtual assistant is expected to make suggestions to the operator at any time, without explicit query.

Fig. 8. Example tree representation of a SDRS.

Fig. 9. Segments and rhetorical relations of the introductory example (partial).

intentions of the operator, because the virtual assistant is


capture what a user is saying in detail, as well as to restitute any
expected to make suggestions to the operator at any time, without
sound that the agent needs to synthesize or play. Microphone
explicit query.
3.3. Implementation in a state-of-the-art conversation framework

The use of cognitive assistants for critical applications such as aircraft maintenance or operations has still to find a solution to a potential show stopper, namely how to test cognitive interaction in a way which would guarantee (1) that the risk of human errors linked to the cognitive assistant is acceptable, and (2) that the user can trust the system, which is not only a question of reliability but also depends on system transparency [4]. As usual in aeronautics [43], operational validation based on simulated operational situations is part of the solution but will not be sufficient. Each layer will need to be tested individually through tailored test benches, targeting each of the interpretation capabilities needed for the assistant (e.g. [33,62]).

Many frameworks supporting the creation of virtual assistants or chatbots have emerged in the last years. Among the commercial ones, some are provided by big tech companies such as Google (Dialogflow, previously Api.ai) or IBM (Watson Assistant), whereas others are driven by startups like Snips.ai or Recast.ai. To illustrate how the theoretical concepts introduced so far relate to practical implementations, a few off-the-shelf technologies will be considered, including the state-of-the-art free open source framework proposed by Rasa, a German startup.

As of today, no framework is likely to achieve the ultimate vision described in the first two sections; however, we will see that each of the five cognitive interaction layers is partially addressed with current technology stacks. The higher the layer, the less mature current solutions generally are in addressing it, as one could guess. Limitations and perspectives will be stressed at every stage.

3.3.1. Physical interfaces

This very first layer of cognitive interaction is certainly the only one where technology already matches the required specification for the ideal conversation vision. Indeed, the quality of today's microphones and speakers makes it perfectly possible to capture what a user is saying in detail, as well as to restitute any sound that the agent needs to synthesize or play. Microphone arrays can even support more advanced features like extracting voice input from ambient noise or locating and separating voices in a multi-speaker environment. The latter, also called the "cocktail party problem", is traditionally solved with techniques like ICA (Independent Component Analysis) [51] or more recent ones, e.g. based on deep learning models [12], which sometimes only require a single microphone to work.
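To make the classical baseline concrete, the following minimal sketch (our own illustration, not taken from [51] or [12]) separates two synthetic "voices" captured by two microphones using scikit-learn's FastICA; the signals and mixing matrix are invented for the example, and real cocktail-party audio is of course far harder.

```python
# Illustrative sketch only: classical ICA needs (at least) one microphone per
# speaker, unlike the single-microphone deep learning approaches cited above.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 440 * t)             # "speaker" 1 (pure tone)
s2 = np.sign(np.sin(2 * np.pi * 3 * t))      # "speaker" 2 (square wave)
sources = np.c_[s1, s2]

mixing = np.array([[1.0, 0.5], [0.4, 1.0]])  # unknown "room" mixing
mics = sources @ mixing.T                    # the two microphone signals

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mics)          # estimated independent sources
```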
Sound, however, is not the only medium to be used for conversation: in the bigger picture of multi-modal interactions, other physical interfaces come to mind. The most obvious ones are graphical user interfaces on any kind of display: a history of the conversation can be shown in a chat component; images and videos can be visualized, etc. Sometimes an (animated) avatar of the virtual assistant can help "humanize" it and foster a more natural conversation from the user's perspective. Multi-touch screens enable intuitive interactions with the graphical content. If the agent is implemented in a robot, its appearance and movements can also be important aspects of the human-machine interaction. Indeed, humans are accustomed to using gesture, attitude and eye contact, among other factors, to convey their message in conversation, on top of using their voice. In this area, robotics research still has many challenges to overcome to achieve a human-like experience and build trust with the user.

3.3.2. Linguistic layer

The goal here is to automatically convert the raw byte streams from the physical interfaces into an analyzable representation (typically a string for language) to enable further processing of the user's input, and vice versa to output a message to the user.

For the former task, automated speech recognition (ASR), also known as speech-to-text, has made tremendous progress in the last decade. ASR systems were traditionally decomposed into several components, often manually designed and independently trained on various datasets: an acoustic model (AM), a pronunciation model (PM) and a language model (LM). The role of the AM is to predict, from acoustic features (such as Fourier coefficients of the sound), a set of sub-word units, typically context-dependent or context-independent phonemes. The PM is a hand-designed lexicon that maps a sequence of phonemes to words. Finally, the LM assigns probabilities to word sequences. Over the last years, the growing popularity and success of deep learning models challenged this kind of ASR architecture, and state-of-the-art performance was achieved with new end-to-end models, which attempt to learn these separate components jointly as a single system [13,63,3]. As of today, the best models typically report a word error rate of 5% or below on the usual speech recognition datasets. However, these datasets do not necessarily reflect the difficulty of some operational scenarios, such as communication between pilots and air traffic controllers in the aeronautics domain. In particular, multi-speaker accented speech in (very) noisy environments with domain-specific vocabulary is still hard to tackle with current methods. Challenges are on-going to assess the best approaches and foster progress with academic and commercial providers [1]. While research is continuing, many off-the-shelf solutions are available, as a service in the cloud or on-premise, with licenses ranging from commercial to open source.
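As a hedged illustration of how such an off-the-shelf component can be called, the sketch below uses the open source SpeechRecognition Python package as a thin wrapper around a cloud ASR engine; the audio file name is a placeholder, and this is only one of many possible toolchains, not the one assumed elsewhere in this paper.

```python
# Minimal sketch with the SpeechRecognition package (a wrapper around several
# cloud or on-premise ASR engines); "pilot_request.wav" is a placeholder file.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("pilot_request.wav") as source:
    audio = recognizer.record(source)          # read the whole file

try:
    text = recognizer.recognize_google(audio, language="en-US")
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible for this engine")
```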
For outputting a vocal message to the user, speech synthesis, also known as text-to-speech (TTS), is used. Historically, generating speech with computers mostly relied on concatenative and sometimes parametric methods. The former uses a large database of short speech fragments, recorded from a single speaker, that are recombined to form complete utterances (time-consuming, and difficult when it comes to altering voice, emphasis or emotion). The latter only requires model parameters to be tuned to generate data, often post-processed by algorithms known as vocoders to generate the audio signal, and speech characteristics can be controlled via the inputs (no recording session needed, but the results used to sound less natural than concatenative TTS). Deep learning, however, transformed this domain as well in the last few years, beginning with the WaveNet model introduced by Google DeepMind in 2016 [59]. For the first time, a deep neural network directly modelled the raw waveform of the audio signal, one sample at a time. Since then, improvements have been made in terms of quality and efficiency [60], and new model architectures have surfaced [50]. The last steps towards making synthesized speech as natural as real speech lie in tone, intonation and speech disfluencies (such as "hmm" and "uh"), but these limitations are starting to be addressed seriously, as demonstrated in the recent Google Duplex demo, in which humans talking to a virtual assistant over the phone were fooled into believing they were talking to real humans [29]. As for ASR, many free or commercial solutions are available off-the-shelf today, as cloud services or on-premise.
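On the synthesis side, the following minimal sketch uses the pyttsx3 package, which wraps the operating system's built-in (non-neural) voices; the spoken sentence and speaking rate are arbitrary, and neural TTS such as WaveNet-based services would be accessed through other APIs.

```python
# Minimal sketch with pyttsx3, an offline wrapper around the platform's
# built-in concatenative/parametric voices (not a neural TTS engine).
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 160)     # speaking rate in words per minute
engine.say("Engine one fire procedure completed.")
engine.runAndWait()
```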
As humans do not only use sound to communicate, one could imagine that other data streams might be converted into an analyzable representation to support the agent's comprehension of the conversational context and of the intent of the user, e.g. a video feed of the user's facial expressions, or motion-sensing input devices like Kinect to capture gestures. Although such information is rarely used today in conversational systems, it could in the future nicely complement ASR and TTS. Even humans have found workarounds to support their own exchanges in chats when these streams are lacking, for instance with the use of emojis.
3.3.3. Semantic layer

The semantic layer is where the input message from the user, as captured by the physical interfaces and converted by the linguistic layer, is interpreted to extract a meaningful representation. For a text message, an intent and entities are typically extracted during this process. Intent recognition usually consists in finding the most probable user intent among a set of predefined ones, using for instance machine learning classification or rule-based techniques ("Hi", "Hello", "Good morning" could all match a "greet" intent, whereas "I am looking for an Italian restaurant" or "Do you know any good restaurant?" would rather match a "search_restaurant" intent). State-of-the-art solutions are sometimes able to extract several intents from a single input, as recently demonstrated at the Google I/O 2018 keynote for a home assistant (e.g. "Hi, I am looking for an Italian restaurant" could match both the "greet" and "search_restaurant" intents). Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories, either names of persons, organizations, locations, etc., or expressions of times, quantities, monetary values, percentages, etc. Pre-trained open source and commercial software exists for both types, such as spaCy, based on machine learning, or Facebook duckling, based on language rules (e.g. extracting {value:6, type:"value", "unit":"mile"} from "I walked six miles").
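As an illustration of off-the-shelf NER, the sketch below runs spaCy's small English model on a sentence; the model name assumes the standard en_core_web_sm distribution, and the exact labels returned depend on the model version.

```python
# Sketch of pre-trained named-entity recognition with spaCy.
# The model must be downloaded first: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I walked six miles to the Italian restaurant near Blagnac on Monday.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. "six miles" -> QUANTITY, "Monday" -> DATE
```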
Recently, natural language understanding (NLU) solutions packaging intent and entity recognition have emerged in the open source world, such as Rasa NLU and Snips NLU, facilitating the complete integration of the semantic layer. As far as Rasa NLU is concerned, it relies on a customizable pipeline architecture which is modular and flexible. The backend for natural language processing (NLP), including the transformation of words into word vectors to support higher-level analysis, is selectable (e.g. spaCy or MITIE). By design, these tools do not take any conversational context into account when performing intent and entity recognition; this notion comes into play in the next stages, as we will see in the sections about the pragmatic and cooperation layers. This can however be seen as a limitation, since context sometimes happens to be crucial for a perfect extraction.
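The following sketch is an assumption-laden illustration based on the 2018-era rasa_nlu package (the module paths have changed in later Rasa releases); "nlu.md" and "nlu_config.yml" are placeholders for the annotated examples and the pipeline configuration.

```python
# Sketch of training and using a Rasa NLU interpreter (2018-era rasa_nlu API;
# later Rasa versions expose different modules). File names are placeholders.
from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

training_data = load_data("nlu.md")                 # annotated example sentences
trainer = Trainer(config.load("nlu_config.yml"))    # pipeline (e.g. spaCy backend)
interpreter = trainer.train(training_data)

result = interpreter.parse("I am looking for an Italian restaurant")
print(result["intent"], result["entities"])         # e.g. search_restaurant + cuisine entity
```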
When using intents and entities as a high-level representation of a natural language sentence, a virtual assistant designer has to select the granularity of these elements depending on the domain. For instance, in the scope of a restaurant reservation bot, one could define the intents "inform_cuisine" and "inform_date", but trying to cope with a single "inform" intent is a valid alternative (along with the "cuisine" and "date" entities in both cases, to detect which type of cuisine the user is looking for and what the date of booking should be). Minimizing the number of possible intents is appealing because it decreases complexity while increasing generality; besides, it might reduce the number of annotated examples needed to train the NLU component. Broadly speaking, this means taking a step towards a one-to-one mapping between defined intents and so-called speech acts, as inspired by James Allen's theory (e.g. in [2], page 250): assert (asserting a fact), command (giving an order), yes/no-query (asking a closed question), wh-query (asking an open question), etc. While this level of abstraction is attractive for intents, special care must be taken to avoid pitfalls: it might work for assert/inform in some domains, but may not be precise enough to properly deal with command and wh-query, for example. Indeed, a "search_restaurant" intent is much more informative for understanding the sentences "Find nearby restaurants!" and "What are nearby restaurants?" than knowing that the first is a command and the second a wh-query (even if associated with an entity referring to restaurants).

Because choosing the right level of granularity when defining intents and entities is hard and time-consuming (even more so in broad domains), not to mention the tedious task of creating a training corpus of annotated sentences, alternative methods could be envisaged to abstract the content of a sentence. One idea would be to automatically transform the text input into a vector using a machine learning model trained in an unsupervised way. At word level, this idea was first popularized by the Word2Vec model [30], which is able to learn meaningful vector representations of words from unannotated text corpora using a clever context-prediction trick, basically relying on the fact that similar words appear in similar contexts. By "meaningful", we mean that semantically similar words are transformed into close vectors in the learnt latent space (typically using cosine similarity as the distance). This concept has since been extended to the sentence and even document levels (Doc2Vec) [26]. The perspective of using such models to automatically abstract the text input, in particular the intent inside it, into a meaningful vector has the great benefit of not requiring manual annotations and re-tuning per domain (unless very specific vocabulary is used), but it might become trickier to extract relevant entities and link them to context.
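The idea can be illustrated with spaCy's pretrained vectors, in a minimal sketch assuming the en_core_web_md model (which ships with word vectors); similarity here is the cosine similarity between averaged word vectors, so the exact values depend on the model.

```python
# Sketch of sentence-level similarity from pretrained word vectors.
# The model must be downloaded first: python -m spacy download en_core_web_md
import spacy

nlp = spacy.load("en_core_web_md")
a = nlp("Find a nearby Italian restaurant")
b = nlp("Is there a pizzeria close to here?")
c = nlp("Show the engine failure procedure")

print(a.similarity(b))   # relatively high: both are restaurant searches
print(a.similarity(c))   # lower: unrelated request
```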
3.3.4. Pragmatic layer

The user's input, once interpreted in the semantic layer and potentially transformed into an abstracted form (intent/entities, vector . . . ), often still requires context to disambiguate its true meaning, so that the most relevant action(s) can be selected by the agent in return, e.g. an utterance action, a service call, a knowledge graph query . . . or a listening action where the bot is just waiting for new inputs to proceed. Handling context to this end is the purpose of the pragmatic layer.

Context can be split into two main aspects, both of which can be determining for a smart decision of what to do next. The first one is the full or recent history of the conversation (conversational context): knowing what to do after a "Yes, please" utterance obviously depends on what was asked or said before. The second aspect is the state of the "world", as perceived by the agent, which might impact the decision (context slots): for instance, choosing the right answer to the question "Can my aircraft land at airport XYZ?" typically depends on information like the aircraft type, position, speed and fuel, the runway contamination status, the weather, etc. As for entities before, identifying relevant context slots at the right level of granularity can be a difficult task, since too many slots will slow down the training phase for decision making, while too few is likely to bias it. It is common, but not necessary, to have a context slot for every entity type (as recognized in the semantic layer) in order to store the last-mentioned instance of each, e.g. the name of the user or the aircraft serial number. It is often more efficient to do so, in terms of memory and computation, than to rely on the conversational history only, even when this information could be re-inferred from it. Besides, in specific scenarios, some context slots can be filled directly via channels other than the conversation itself, such as a data stream from an aircraft docking station in an aeronautic context.
When context slots are meant to be aligned with the concepts of an ontology, e.g. a knowledge graph, mapping from recognized entities (in the semantic layer) to context slots (in the pragmatic layer) is not always straightforward, even in an intent/entities recognition setup, as opposed to more far-fetched representations like vectors. For instance, the entity "engine failure" can be extracted as a problem in the sentence "Show procedure for engine failure" but might correspond to an abnormal procedure named "ENG 1 (2) FAIL" in the agent's knowledge graph. Being able to correctly map entities from text to the graph (from the text "engine failure" to the graph node "ENG 1 (2) FAIL" in this case) is known as entity linking (EL) in the literature. There are various ways of dealing with this problem, one of the simplest in our scope being to manually or programmatically build lists of aliases, or "synonyms", for every graph node label ("engine failure" would be one of the aliases of "ENG 1 (2) FAIL"). Then, in an entity recognition post-processing step, fuzzy matching techniques can be used to find the closest alias and thus the relevant graph node. It is also possible to learn the synonyms as part of the entity recognition itself, e.g. by directly extracting the value "ENG 1 (2) FAIL" in the NLU training corpus whenever "engine failure" is encountered, but this is usually much less data-efficient than the post-processing approach mentioned previously. Note that a pre-processing step, such as normalizing the text input or expanding acronyms, might also support entity linking at a later stage, since it may reduce the variability of the expressions and thus shorten the list of aliases.
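A minimal sketch of the alias-plus-fuzzy-matching approach, using only the Python standard library, is given below; apart from "ENG 1 (2) FAIL", the procedure names and the similarity cutoff are invented for the example.

```python
# Minimal alias-based entity linking sketch: map a recognized entity string to
# the closest known graph-node alias with standard-library fuzzy matching.
import difflib

aliases = {
    "engine failure": "ENG 1 (2) FAIL",            # from the example above
    "engine fire": "ENG 1 (2) FIRE",                # hypothetical node
    "loss of cabin pressure": "CAB PR EXCESS ALT",  # hypothetical node
}

def link_entity(text, cutoff=0.6):
    match = difflib.get_close_matches(text.lower(), aliases.keys(), n=1, cutoff=cutoff)
    return aliases[match[0]] if match else None

print(link_entity("engine failures"))   # -> "ENG 1 (2) FAIL"
```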
Rasa Core is a state-of-the-art open source dialogue engine framework, which can be used to illustrate most concepts introduced in this layer and the next one (the cooperation layer). It is particularly interesting for its ability to use machine learning to infer the whole conversation state machine from a limited number of interaction examples called "stories", which is a great advantage compared to the old-fashioned way of manually hard-coding conversation trees (error-prone, time-consuming and non-adaptive). The various options for training the conversation engine, with or without stories, will be detailed in the next section. When using Rasa Core, a domain file is required to define the intents, entities and context slots to be used. The type of every context slot must be specified among default types (text, bool, categorical, float, list . . . ) or customized ones. Utterance actions, along with optional images or buttons, can also be defined via templates in the domain file. More advanced actions, e.g. requiring a service call and/or specific processing, can be customized using code (Python by default).

Since context is crucial for the agent's decision making, special care is needed when transforming it into an exploitable representation. Typically, when using machine learning models such as neural networks (NN), context slots must be transformed into features (often a vector representation) before being injected as input of the model. This process is called featurization and can be customized in frameworks like Rasa Core. By default, Booleans are converted to a 0/1 scalar, floats are passed as-is, categorical fields are transformed into so-called one-hot encoding vectors (e.g. [1, 0, 0] for category A, [0, 1, 0] for category B and [0, 0, 1] for category C), while text and lists are just represented by a 0/1 scalar depending on whether the field is empty or not. Extending the default featurization for any context slot is possible but requires additional code. If some context slots are not deemed important for decision making, they can be declared as "unfeaturized" and are then ignored (they can still be used in custom action code for any kind of processing).
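A minimal, framework-independent sketch of this default featurization scheme is given below; the slot names (runway_contaminated, fuel_remaining_kg, aircraft_category, user_name) are hypothetical and not Rasa identifiers.

```python
# Sketch of the default-style featurization described above: booleans become
# 0/1, floats pass through, categorical slots become one-hot vectors, and
# text slots collapse to a filled/empty flag.
CATEGORIES = ["A", "B", "C"]   # hypothetical categorical slot values

def featurize_slots(slots):
    features = []
    features.append(1.0 if slots.get("runway_contaminated") else 0.0)   # bool
    features.append(float(slots.get("fuel_remaining_kg", 0.0)))          # float
    features.extend(1.0 if slots.get("aircraft_category") == c else 0.0
                    for c in CATEGORIES)                                  # categorical
    features.append(1.0 if slots.get("user_name") else 0.0)              # text: filled or not
    return features

print(featurize_slots({"runway_contaminated": True, "fuel_remaining_kg": 4200.0,
                       "aircraft_category": "B", "user_name": "operator"}))
# -> [1.0, 4200.0, 0.0, 1.0, 0.0, 1.0]
```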
3.3.5. Cooperation layer

In the scope of conversation, the purpose of the cooperation layer is to select and plan actions in the general sense (utterance action, listening action, service call, knowledge graph query . . . ). Sequential decision making, i.e. iteratively selecting the best action given a current state (here characterized by context, intent and entities) so as to optimize a cumulated cost or reward, is thoroughly studied in the planning and reinforcement learning (RL) literature. Both aim at producing a (quasi-)optimal policy, which is a function that maps each state to the best action to perform (or, more generally, to a probability distribution over the available actions). One common hypothesis for this to work is that the domain is Markovian, i.e. the current state encodes everything needed to compute the next steps once an action is selected (a frequent trick is to add some recent history to the state definition when needed, e.g. the last utterances in the conversation, to make a domain artificially Markovian). Various techniques exist to produce such a policy: the ones using a (white-box) model of the environment, like a Markov decision process, usually belong to planning, whereas the ones only interacting with a (black-box) environment are often referred to as RL, with some nuances in between, like model-based RL (rebuilding a model of the environment by sampling it and then planning on it). The environment represents everything that is external to the agent, and its state may change whenever an action is performed by the agent: in a robotics setup, the agent is the robot and the environment is the world in which it evolves; in a chatbot setup, the agent is the bot and the environment is everything else, including above all the user, but potentially also objects that can be influenced by the agent's decisions. Since the user (at least) would be hard to "model", the environment can be considered black-box, so our scope is rather an RL one, assuming the goal is to optimize the conversation as a whole. The recent successes of deep learning dramatically changed the field of RL in the last years by representing the policy function with a deep NN, giving birth to so-called "deep RL": Google DeepMind and OpenAI, among other players, contributed to popularizing the concept to a wide audience with achievements like learning to play ATARI games from raw pixels [31,32] and defeating Go world champions [53,54].

In the Rasa Core framework, the agent's policy for selecting the next actions given the context/intent/entities is customizable. The main one provided by default is a NN, but simpler ones are also available (e.g. a memoization policy which learns "by heart" the actions to select in situations seen during training, without extrapolation), and they can also be combined. The first way of training is based on "stories", captured in a file: stories are examples of expected conversations, with the user's turns expressed with intent names (e.g. "greet"), possibly conditioned on entity values or context slots, and the agent's turns expressed with action names referring either to templates in the domain file (for simple utterance actions) or to those defined in code (for more advanced customized actions). Of course, the goal is to provide just as many conversation examples as necessary and to let, for instance, the NN policy extrapolate to all other scenarios; in other words, the hope is that the whole expected conversation state machine can be correctly inferred by just providing a few execution paths. Since it is hard to know in advance how many examples are needed to achieve the desired robustness, another way of training is proposed to replace or complement the one based on stories: interactive learning. In this mode, after each user utterance, the agent indicates which intent it recognized and which action it was about to select; the user can then validate these decisions or correct any of them if necessary, until the agent becomes robust enough. This mode performs online learning, i.e. the agent improves during the interaction itself, but it can also dump the whole interaction (with corrected intents and actions) into new stories, which may be used offline to quickly retrain an agent from scratch.
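As a hedged illustration, the sketch below trains such an agent with the 2018-era rasa_core package (the APIs have changed in later Rasa releases), combining the memoization policy and the default neural network policy; "domain.yml" and "stories.md" are placeholders for the domain and story files discussed above.

```python
# Sketch of story-based training with the 2018-era rasa_core API (later Rasa
# versions expose different modules). File paths are placeholders.
from rasa_core.agent import Agent
from rasa_core.policies.keras_policy import KerasPolicy
from rasa_core.policies.memoization import MemoizationPolicy

agent = Agent("domain.yml", policies=[MemoizationPolicy(), KerasPolicy()])
training_data = agent.load_data("stories.md")   # example conversations (stories)
agent.train(training_data)                      # infer the conversation state machine
agent.persist("models/dialogue")                # save the trained dialogue model
```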
Both story-based training and interactive training are supervised learning techniques, since the user explicitly states, via stories or interactions, the expected action from the agent in a specific state (context/intent/entities). Another perspective would be to use reinforcement learning, as introduced at the beginning of this section, to automatically optimize complete conversations via rewards, without requiring step-by-step guidance from the user. Rewards could be based on the concept of relevance maximization, i.e. rewarding short conversations that go straight to the expected result (indicated by a good score from the user, or a "like" at the end of the conversation, for instance). Let us draw a parallel with the AlphaGo program, which defeated two Go world champions, to clearly show the difference between supervised learning and reinforcement learning for the policy network. In earlier versions of AlphaGo, a policy network, whose role is to select the best move to play given a Go board state, was first trained on many games from professional Go players (the equivalent of stories for conversation) in a supervised learning fashion: during the training phase, the NN learnt to predict what a Go master would be likely to play in any possible state. However, playing as well as humans may not have been sufficient to beat the world champions, so reinforcement learning algorithms were then used to further improve the policy network parameters to a super-human level through self-play, using only one reward signal for the optimization: 1/0 at the end of each won/lost game. This is incredibly appealing, but RL still presents technical difficulties in a conversational scope, as mentioned by the Rasa team itself when justifying why this approach is not yet proposed out of the box [36]: first, RL is data hungry (thousands of conversations would be needed to learn even simple behaviors with today's algorithms); second, training with real humans in the loop giving rewards can be biased, since human evaluations are notoriously unreliable or inconsistent; alternatively, training against a simulated user restricts the use to problems that can be specified exactly as a reward function, and could lead to unwanted behaviors (e.g. a rude agent that would skip all forms of politeness to go straight to the result if conciseness is rewarded). Even if these obstacles still need to be overcome in RL research, the foundations of the Rasa Core framework are the right ones to transition to such an approach once it is mature enough.
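A minimal sketch of what such a relevance-based reward could look like is given below; the reward values and the per-turn penalty are arbitrary choices for illustration, not values proposed by Rasa or used in this work.

```python
# Hypothetical reward shaping for "relevance maximization": a conversation that
# ends with explicit user approval earns a positive reward, discounted by its
# length so that shorter, to-the-point dialogues score higher.
def conversation_reward(num_turns, user_liked, success_reward=1.0, turn_penalty=0.05):
    reward = success_reward if user_liked else 0.0
    return reward - turn_penalty * num_turns

print(conversation_reward(num_turns=4, user_liked=True))    # 0.8
print(conversation_reward(num_turns=12, user_liked=True))   # 0.4
print(conversation_reward(num_turns=4, user_liked=False))   # -0.2
```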

The figure below summarizes the technical solutions mentioned in this chapter, illustrating the five proposed layers with state-of-the-art tools as of today (Fig. 10).

Fig. 10. Summary of the architecture of cognitive interaction in a state-of-the-art framework.
4. Conclusion: how to progress on cognitive interaction with a virtual assistant?

We have proposed to understand cognitive interaction as a stack of simultaneous actions, at the physical, linguistic, semantic, pragmatic and cooperation layers. This structure helps to specify the operator/virtual assistant interaction we expect to have with intelligent virtual assistants. Through a single utterance, one performs several speech acts at the same time: producing a sound (physical layer); making a sentence (linguistic layer); carrying meaning (semantic layer); changing the cognitive context (pragmatic layer); contributing to a common project (cooperation layer).

To be successful, those interrelated acts must trigger specific cognitive processes on the hearer's side. An intelligent virtual assistant must demonstrate specific action and interpretation capabilities to interact cognitively with a human operator. The most salient difficulties do not come from the physical and linguistic layers: they are today well understood and implemented in a variety of software solutions. At the upper part of the model, the cooperation layer requires not only the capability of the agents to plan the next step of the interaction, but also the capability to define this next action so as to maximize relevance in the current context. The capability to maximize relevance requires a dynamic representation of the context of the interaction, including the cognitive attitudes of the participants: intentions, assumptions, expectations.

To make cognitive interaction happen with virtual assistants, it should be possible to recognize and represent the intentional attitudes of the operator which are directed at the current task. Symmetrically, the human needs to understand the actions of the virtual assistant, which requires interpreting the behavior of the assistant as if it were consistently driven by intentions and beliefs: the assistant must act according to humanly understandable norms. Note that by "humanly understandable norms" we suggest neither that a virtual assistant should perceive emotions, nor that it should display emotions or have a personality. We are staying on strictly rational ground: the actions and decisions of the virtual assistant must stem transparently from openly shared intentions and shared facts [61].

As shown in the above examples, the representation of the cognitive context can be implemented in several ways. Based on the theory of intentionality, one could assume that the cognitive context is a set of shared cognitive attitudes: that is the method taken by Grosz and Sidner, where individual intentions and attention play central roles. Dynamic semantics has tried to reduce the psychological dimension of the cognitive context by formalizing it as a kind of scoreboard, i.e. a shared information structure representing the set of assertions openly shared in the exchange (the background) and the questions under discussion. But this account of conversation from dynamic semantics did not fully get rid of psychology, because it had to include the goals of the partners in the context. Among the theoretical approaches presented above, the most psychologically neutral might be Asher and Lascarides' Segmented Discourse Representation Theory: the context of assertions is now represented by rhetorical relations among sentences ("elaboration", "question-answer", "justification", etc.). Nevertheless, even in this approach, the intentions and beliefs of the interlocutors are not completely absent: first, communicative intentions appear behind rhetorical relations; second, when SDRT is applied to conversations between collaborating agents, the rhetorical relations between discourse segments partly represent the structure of the action to be performed, which is an indirect way to capture practical intentions.

More practically, the survey of the theoretical approaches to conversation in AI, and the analysis of a state-of-the-art framework, suggest that a significant effort has to be made on better understanding, representing, and reasoning about the cognitive context of interaction.

References

[1] Airbus, Air Traffic Control Automatic Speech Recognition Challenge, (2018). https://aigym.airbus.com/challenges/59bbf004dce682131435fe9c.
[2] J. Allen, Natural Language Understanding, 2nd ed., Benjamin-Cummings Publishing Co., 1995.
[3] D. Amodei, et al., Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, CoRR abs/1512.02595, (2015).
[4] D.J. Atkinson, W.J. Clancey, M.H. Clark, Shared awareness, autonomy and trust in human-robot teamwork, Papers from the AAAI Fall Symposium, (2014), pp. 36–38.
[5] N. Asher, A. Lascarides, Logics of Conversation, Cambridge University Press, 2003.
[6] J.L. Austin, How to Do Things with Words, in: Urmson (Ed.), The William James Lectures Delivered at Harvard University in 1955, Oxford, 1962 (French translation: "Quand dire c'est faire", Éditions du Seuil, Paris, 1970, translation by Gilles Lane).
[7] D. Bernard, Cognitive interaction – towards 'cognitivity' requirements for the design of virtual assistants, IEEE International Conference on Systems, Man, and Cybernetics (SMC), (2017).
[8] M. Bratman, Intentions, Plans and Practical Reason, The David Hume Series, CSLI Publications, 1987, reprinted 1999 (originally from Harvard University Press).
[9] M. Bratman, Shared Agency: A Planning Theory of Acting Together, Oxford University Press, 2014.
[10] H. Bunt, DIT++ Taxonomy of Dialogue Acts, from Harry Bunt's site, (2008). https://dit.uvt.nl/.
[11] J. Busquets, L. Vieu, N. Asher, La SDRT: Une approche de la cohérence du discours dans la tradition de la sémantique dynamique, Verbum XXIII (1) (2001) 73–101.
[12] Z. Chen, Y. Luo, N. Mesgarani, Speaker-independent Speech Separation with Deep Attractor Network, arXiv preprint arXiv:1707.03634, (2017).
[13] C.-C. Chiu, et al., State-of-the-Art Speech Recognition with Sequence-to-Sequence Models, arXiv preprint arXiv:1712.01769, (2017).
[14] A. Clodic, M. Ransan, R. Alami, V. Montreuil, A management of mutual belief for human-robot interaction, IEEE Int. Conf. on Systems, Man and Cybernetics, (2007).
[16] M.E. Foster, T. By, M. Rickert, A. Knoll, Human-robot dialogue for joint construction tasks, International Conference on Multimodal Interfaces (ICMI), (2006).
[17] D. Harris, D. Fogal, M. Moss, Speech acts: the contemporary theoretical landscape, in: D. Fogal, M. Moss, D. Harris (Eds.), New Work on Speech Acts, Oxford University Press, Oxford, UK, 2018.
[18] H.P. Grice, Logic and conversation, in: Syntax and Semantics 3: Speech Acts, Academic Press, New York, 1975, pp. 41–58.
[19] G. Ferguson, J. Allen, A cognitive model for collaborative agents, in: Advances in Cognitive Systems: Papers from the AAAI Fall Symposium, (2011), pp. 112–120.
[20] B. Grosz, C. Sidner, Attention, intentions, and the structure of discourse, Comput. Linguist. 12 (1986).
[21] B. Grosz, S. Kraus, Collaborative plans for complex group action, Artif. Intell. 86 (2) (1996) 267–357.
[22] A. Herzig, E. Lorini, L. Perrussel, Z. Xiao, BDI logics for BDI architectures: old problems, new perspectives, KI – Künstliche Intelligenz, Special Issue on Challenges for Reasoning under Uncertainty, Inconsistency, Vagueness, and Preferences, Springer, (2016), pp. 73–83.
[23] E. Horvitz, J. Breese, D. Heckerman, D. Hovel, K. Rommelse, The Lumière project: Bayesian user modeling for inferring the goals and needs of software users, Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, July (1998), pp. 256–265.
[24] D. Jurafsky, Pragmatics and computational linguistics, in: L.R. Horn, G. Ward (Eds.), Handbook of Pragmatics, Blackwell, 2005, pp. 578–604.
[25] A. Lascarides, N. Asher, Agreement, disputes and commitments in dialogue, J. Semant. 26 (2) (2009) 109–158.
[26] Q. Le, T. Mikolov, Distributed representations of sentences and documents, International Conference on Machine Learning, (2014).
[27] S. Lemaignan, M. Warnier, E.A. Sisbot, A. Clodic, R. Alami, Artificial cognition for social human-robot interaction: an implementation, Artif. Intell. (2017).
[29] Y. Leviathan, Y. Matias, Google Duplex: An AI System for Accomplishing Real-World Tasks over the Phone, (2018). https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html.
[30] T. Mikolov, et al., Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, (2013).
[31] V. Mnih, et al., Playing Atari with Deep Reinforcement Learning, arXiv preprint arXiv:1312.5602, (2013).
[32] V. Mnih, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
[33] N. Mostafazadeh, et al., A corpus and Cloze evaluation for deeper understanding of commonsense stories, Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (2016), pp. 839–849.
[34] G. Mehlmann, et al., Modeling gaze mechanisms for grounding in HRI, Proceedings of the Twenty-First European Conference on Artificial Intelligence (ECAI), (2014), pp. 1069–1070.
[35] B. Mutlu, N. Roy, S. Sabanovic, Cognitive human–robot interaction, in: B. Siciliano, O. Khatib (Eds.), Springer Handbook of Robotics, Springer Handbooks, Springer, 2016.
[36] A. Nichol, A New Approach to Conversational Software, (2017). https://medium.com/rasa-blog/a-new-approach-to-conversational-software-2e64a5d05f2a.
[37] P. Portner, Mood, Oxford University Press, 2018.
[38] L. Prévot, P. Muller, P. Denis, L. Vieu, Une approche sémantique et rhétorique du dialogue. Un cas d'étude : l'explication d'un itinéraire, Traitement Automatique des Langues, Numéro spécial "Dialogue" 43 (2) (2002) 43–70.
[39] A.S. Rao, M.P. Georgeff, BDI agents: from theory to practice, Proceedings of the First International Conference on Multiagent Systems, (1995), pp. 312–319.
[40] F. Recanati, Les énoncés performatifs, Les Éditions de Minuit, 1981.
[41] F. Recanati, Philosophie du langage (et de l'esprit), Folio Essais, 2008.
[42] F. Recanati, Truth Conditional Pragmatics, Oxford University Press, 2010.
[43] F. Reuzeau, Flight deck design process, in: Human Factors for Civil Flight Deck Design, Ashgate, 2004, pp. 33–55.
[44] C. Roberts, Context in dynamic interpretation, in: L.R. Horn, G. Ward (Eds.), The Handbook of Pragmatics, Blackwell, 2004, pp. 197–220.
[45] C. Roberts, Information structure in discourse: towards an integrated formal theory of pragmatics, Semant. Pragmat. 5 (2012) 1–69.
[46] C. Roberts, Speech acts in discourse context, in: D. Fogal, D. Harris, M. Moss (Eds.), New Work on Speech Acts, Oxford University Press, 2018.
[47] J.R. Searle, Speech Acts: An Essay in the Philosophy of Language, Cambridge University Press, 1969.
[48] J.R. Searle, Intentionality: An Essay in the Philosophy of Mind, Cambridge University Press, 1983.
[49] J.R. Searle, Mind: A Brief Introduction, Oxford University Press, 2004.
[50] J. Shen, et al., Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, arXiv preprint arXiv:1712.05884, (2017).
[51] J. Shlens, A Tutorial on Independent Component Analysis, arXiv preprint arXiv:1404.2986, (2014).
[52] Y. Shoham, Logical theories of intention and the database perspective, J. Philos. Logic 38 (2009) 633–647.
[53] D. Silver, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (7587) (2016) 484–489.
[54] D. Silver, et al., Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, arXiv preprint arXiv:1712.01815, (2017).
[55] D. Sperber, D. Wilson, Meaning and Relevance, Cambridge University Press, 2012.
[56] A. Stolcke, N. Coccaro, R. Bates, P. Taylor, C. Van Ess-Dykema, K. Ries, E. Shriberg, D. Jurafsky, R. Martin, M. Meteer, Dialogue act modeling for automatic tagging and recognition of conversational speech, Comput. Linguist. 26 (3) (2000) 339–373.
[57] G. Sukthankar, et al. (Eds.), Plan, Activity, and Intention Recognition – Theory and Practice, Elsevier, 2014.
[58] M. Tomasello, Origins of Human Communication, Jean Nicod Lectures, MIT Press, 2008.
[59] A. van den Oord, et al., WaveNet: A Generative Model for Raw Audio, arXiv preprint arXiv:1609.03499, (2016).
[60] A. van den Oord, et al., Parallel WaveNet: Fast High-Fidelity Speech Synthesis, arXiv preprint arXiv:1711.10433, (2017).
[61] D. Vernant, Communication interpersonnelle & communication personnes/systèmes, in: B. Miège (Ed.), Communication personnes/systèmes informationnels, Hermès Sciences, Paris, 2003, pp. 73–92.
[62] J. Weston, A. Bordes, S. Chopra, T. Mikolov, A.M. Rush, B. van Merrienboer, Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks, Facebook Research, (2015). http://arxiv.org/abs/1502.05698.
[63] W. Xiong, et al., The Microsoft 2017 Conversational Speech Recognition System, arXiv preprint arXiv:1708.06073v2, (2017).
[64] Z. Yu, D. Bohus, E. Horvitz, Incremental coordination: attention-centric speech production in a physically situated conversational agent, Proceedings of the SIGDIAL Conference, (2015), pp. 402–406.