
Deep Reinforcement Learning for Dialogue Generation

Jiwei Li1, Will Monroe1, Alan Ritter2, Michel Galley3, Jianfeng Gao3 and Dan Jurafsky1
1 Stanford University, Stanford, CA, USA
2 Ohio State University, OH, USA
3 Microsoft Research, Redmond, WA, USA
{jiweil,wmonroe4,jurafsky}@stanford.edu, ritter.1492@osu.edu
{mgalley,jfgao}@microsoft.com

Abstract

Recent neural models of dialogue generation offer great promise for generating responses for conversational agents, but tend to be short-sighted, predicting utterances one at a time while ignoring their influence on future outcomes. Modeling the future direction of a dialogue is crucial to generating coherent, interesting dialogues, a need which led traditional NLP models of dialogue to draw on reinforcement learning. In this paper, we show how to integrate these goals, applying deep reinforcement learning to model future reward in chatbot dialogue. The model simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that display three useful conversational properties: informativity, coherence, and ease of answering (related to forward-looking function). We evaluate our model on diversity, dialogue length, and with human judges, showing that the proposed algorithm generates more interactive responses and manages to foster a more sustained conversation in dialogue simulation. This work marks a first step towards learning a neural conversational model based on the long-term success of dialogues.

1 Introduction

Neural response generation (Sordoni et al., 2015; Shang et al., 2015; Vinyals and Le, 2015; Li et al., 2016a; Wen et al., 2015; Yao et al., 2015; Luan et al., 2016; Xu et al., 2016; Wen et al., 2016; Li et al., 2016b; Su et al., 2016) is of growing interest. The LSTM sequence-to-sequence (SEQ2SEQ) model (Sutskever et al., 2014) is one type of neural generation model that maximizes the probability of generating a response given the previous dialogue turn. This approach enables the incorporation of rich context when mapping between consecutive dialogue turns (Sordoni et al., 2015) in a way not possible, for example, with MT-based dialogue models (Ritter et al., 2011).

Despite the success of SEQ2SEQ models in dialogue generation, two problems emerge. First, SEQ2SEQ models are trained by predicting the next dialogue turn in a given conversational context using the maximum-likelihood estimation (MLE) objective function. However, it is not clear how well MLE approximates the real-world goal of chatbot development: teaching a machine to converse with humans while providing interesting, diverse, and informative feedback that keeps users engaged. One concrete example is that SEQ2SEQ models tend to generate highly generic responses such as "I don't know" regardless of the input (Sordoni et al., 2015; Serban et al., 2016; Li et al., 2016a). This can be ascribed to the high frequency of generic responses in the training set and their compatibility with a diverse range of conversational contexts. Yet "I don't know" is clearly not a good action to take, since it closes the conversation down.

Another common problem, illustrated in the two sample conversations on the left of Table 1, is that the system becomes stuck in an infinite loop of repetitive responses. This is due to MLE-based SEQ2SEQ models' inability to account for repetition. In example 2 (bottom left), the dialogue falls into an infinite loop after three turns, with both agents generating dull, generic utterances like "i don't know what you are talking about" and "you don't know what you are saying". Looking at the entire conversation, utterance (4) turns out to be a bad action to take because it offers no way of continuing the conversation.[1]

[1] A similar rule is often suggested in improvisational comedy: https://en.wikipedia.org/wiki/Yes,_and...

Baseline mutual information model (Li et al., 2016a)

Example 1:
A: Where are you going? (1)
B: I'm going to the restroom. (2)
A: See you later. (3)
B: See you later. (4)
A: See you later. (5)
B: See you later. (6)
...

Example 2:
A: How old are you? (1)
B: I'm 16. (2)
A: 16? (3)
B: I don't know what you are talking about. (4)
A: You don't know what you are saying. (5)
B: I don't know what you are talking about. (6)
A: You don't know what you are saying. (7)
...

Proposed reinforcement learning model

Example 1:
A: Where are you going? (1)
B: I'm going to the police station. (2)
A: I'll come with you. (3)
B: No, no, no, no, you're not going anywhere. (4)
A: Why? (5)
B: I need you to stay here. (6)
A: I don't know what you are talking about. (7)
...

Example 2:
A: How old are you? (1)
B: I'm 16. Why are you asking? (2)
A: I thought you were 12. (3)
B: What made you think so? (4)
A: I don't know what you are talking about. (5)
B: You don't know what you are saying. (6)
...

Table 1: Left column: dialogue simulation between two agents using a 4-layer LSTM encoder-decoder trained on the OpenSubtitles dataset. The first turn (index 1) is input by the authors. The two agents then take turns conversing, each taking as input the other agent's previously generated turn. The output is generated using the mutual information model (Li et al., 2016a), in which an N-best list is first obtained using beam search based on p(t|s) and then reranked by linearly combining the backward probability p(s|t), where t and s respectively denote targets and sources. Right column: dialogue simulated using the proposed reinforcement learning model. The new model produces more forward-looking utterances (questions like "Why are you asking?" and offers like "I'll come with you") and lasts longer before it falls into conversational black holes.
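To make the reranking step described in the caption concrete, here is a minimal sketch (not the authors' code): an N-best list scored by the forward model log p(t|s) is re-sorted after adding the backward score log p(s|t). The candidate strings, scores, and the interpolation weight lam below are all invented for illustration.

```python
# Toy illustration of mutual-information reranking: candidates from beam
# search (scored by log p(t|s)) are re-ranked by adding the backward
# score log p(s|t). All numbers are made up for illustration.

def mmi_rerank(candidates, lam=0.5):
    """candidates: list of (response, log_p_forward, log_p_backward)."""
    return sorted(candidates, key=lambda c: c[1] + lam * c[2], reverse=True)

nbest = [
    ("I don't know.", -1.2, -9.0),                     # generic: poor backward score
    ("I'm going to the police station.", -2.5, -3.1),  # specific: strong backward score
    ("See you later.", -1.8, -8.2),
]

for response, fwd, bwd in mmi_rerank(nbest):
    print(f"{fwd + 0.5 * bwd:6.2f}  {response}")
```

Under these toy scores, the specific response wins despite its lower forward probability, which is the intended effect of the reranking.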

These challenges suggest we need a conversation framework that has the ability to (1) integrate developer-defined rewards that better mimic the true goal of chatbot development and (2) model the long-term influence of a generated response in an ongoing dialogue.

To achieve these goals, we draw on the insights of reinforcement learning, which have been widely applied in MDP and POMDP dialogue systems (see the Related Work section for details). We introduce a neural reinforcement learning (RL) generation method, which can optimize long-term rewards designed by system developers. Our model uses the encoder-decoder architecture as its backbone, and simulates conversation between two virtual agents to explore the space of possible actions while learning to maximize expected reward. We define simple heuristic approximations to rewards that characterize good conversations: good conversations are forward-looking (Allwood et al., 1992) or interactive (a turn suggests a following turn), informative, and coherent. The parameters of an encoder-decoder RNN define a policy over an infinite action space consisting of all possible utterances. The agent learns a policy by optimizing the long-term developer-defined reward from ongoing dialogue simulations using policy gradient methods (Williams, 1992), rather than the MLE objective defined in standard SEQ2SEQ models.

Our model thus integrates the power of SEQ2SEQ systems to learn compositional semantic meanings of utterances with the strengths of reinforcement learning in optimizing for long-term goals across a conversation. Experimental results (sampled results in the right panel of Table 1) demonstrate that our approach fosters a more sustained dialogue and manages to produce more interactive responses than standard SEQ2SEQ models trained using the MLE objective.

2 Related Work

Efforts to build statistical dialog systems fall into two major categories.
The first treats dialogue generation as a source-to-target transduction problem and learns mapping rules between input messages and responses from a massive amount of training data. Ritter et al. (2011) frame the response generation problem as a statistical machine translation (SMT) problem. Sordoni et al. (2015) improved Ritter et al.'s system by rescoring the outputs of a phrasal SMT-based conversation system with a neural model that incorporates prior context. Recent progress in SEQ2SEQ models has inspired several efforts (Vinyals and Le, 2015) to build end-to-end conversational systems that first apply an encoder to map a message to a distributed vector representing its semantics and then generate a response from the message vector. Serban et al. (2016) propose a hierarchical neural model that captures dependencies over an extended conversation history. Li et al. (2016a) propose mutual information between message and response as an alternative objective function in order to reduce the proportion of generic responses produced by SEQ2SEQ systems.

The other line of statistical research focuses on building task-oriented dialogue systems to solve domain-specific tasks. Efforts include statistical models such as Markov Decision Processes (MDPs) (Levin et al., 1997; Levin et al., 2000; Walker et al., 2003; Pieraccini et al., 2009), POMDP models (Young et al., 2010; Young et al., 2013; Gašic et al., 2013a; Gašic et al., 2014), and models that statistically learn generation rules (Oh and Rudnicky, 2000; Ratnaparkhi, 2002; Banchs and Li, 2012; Nio et al., 2014). This dialogue literature thus widely applies reinforcement learning (Walker, 2000; Schatzmann et al., 2006; Gasic et al., 2013b; Singh et al., 1999; Singh et al., 2000; Singh et al., 2002) to train dialogue policies. But task-oriented RL dialogue systems often rely on carefully limited dialogue parameters, or on hand-built templates with state, action and reward signals designed by humans for each new domain, making the paradigm difficult to extend to open-domain scenarios.

Also relevant is prior work on reinforcement learning for language understanding, including learning from delayed reward signals by playing text-based games (Narasimhan et al., 2015; He et al., 2016), executing instructions for Windows help (Branavan et al., 2011), or understanding dialogues that give navigation directions (Vogel and Jurafsky, 2010).

Our goal is to integrate the SEQ2SEQ and reinforcement learning paradigms, drawing on the advantages of both. We are thus particularly inspired by recent work that attempts to merge these paradigms, including Wen et al. (2016), who train an end-to-end task-oriented dialogue system that links input representations to slot-value pairs in a database, and Su et al. (2016), who combine reinforcement learning with neural generation on tasks with real users, showing that reinforcement learning improves dialogue performance.

3 Reinforcement Learning for Open-Domain Dialogue

In this section, we describe in detail the components of the proposed RL model.

The learning system consists of two agents. We use p to denote sentences generated from the first agent and q to denote sentences from the second. The two agents take turns talking with each other. A dialogue can be represented as an alternating sequence of sentences generated by the two agents: p_1, q_1, p_2, q_2, ..., p_i, q_i. We view the generated sentences as actions that are taken according to a policy defined by an encoder-decoder recurrent neural network language model.

The parameters of the network are optimized to maximize the expected future reward using policy search, as described in Section 4.3. Policy gradient methods are more appropriate for our scenario than Q-learning (Mnih et al., 2013), because we can initialize the encoder-decoder RNN using MLE parameters that already produce plausible responses, before changing the objective and tuning towards a policy that maximizes long-term reward. Q-learning, on the other hand, directly estimates the future expected reward of each action, which can differ from the MLE objective by orders of magnitude, thus making MLE parameters inappropriate for initialization. The components (states, actions, rewards, etc.) of our sequential decision problem are summarized in the following subsections.

3.1 Action

An action a is the dialogue utterance to generate. The action space is infinite, since arbitrary-length sequences can be generated.

3.2 State

A state is denoted by the previous two dialogue turns [p_i, q_i]. The dialogue history is further transformed into a vector representation by feeding the concatenation of p_i and q_i into an LSTM encoder model, as described in Li et al. (2016a).
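As a minimal illustration of the action and state definitions above (a toy stand-in, not the paper's encoder), the sketch below builds the state for turn i by concatenating the previous two turns [p_i, q_i] and reducing the tokens to a fixed-size vector; averaging random token embeddings plays the role that the LSTM encoder plays in the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
_EMBEDDINGS = {}  # toy, lazily created word vectors

def embed(token, dim=16):
    # Stand-in for learned embeddings: a fixed random vector per token.
    if token not in _EMBEDDINGS:
        _EMBEDDINGS[token] = rng.normal(size=dim)
    return _EMBEDDINGS[token]

def encode_state(p_i, q_i):
    """State = previous two dialogue turns [p_i, q_i]; their concatenation
    is reduced to a vector by averaging token embeddings (an LSTM encoder
    does this in the actual model)."""
    tokens = (p_i + " " + q_i).lower().split()
    return np.mean([embed(t) for t in tokens], axis=0)

state = encode_state("How old are you ?", "I'm 16 .")
print(state.shape)  # (16,)
```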
3.3 Policy

A policy takes the form of an LSTM encoder-decoder (i.e., p_RL(p_{i+1} | p_i, q_i)) and is defined by its parameters. Note that we use a stochastic representation of the policy (a probability distribution over actions given states). A deterministic policy would result in a discontinuous objective that is difficult to optimize using gradient-based methods.

3.4 Reward

r denotes the reward obtained for each action. In this subsection, we discuss the major factors that contribute to the success of a dialogue and describe how approximations to these factors can be operationalized in computable reward functions.

Ease of answering  A turn generated by a machine should be easy to respond to. This aspect of a turn is related to its forward-looking function: the constraints a turn places on the next turn (Schegloff and Sacks, 1973; Allwood et al., 1992). We propose to measure the ease of answering a generated turn by the negative log-likelihood of responding to that utterance with a dull response. We manually constructed a list S of 8 dull responses such as "I don't know what you are talking about" and "I have no idea", which we and others have found occur very frequently in SEQ2SEQ models of conversation. The reward function is given as follows:

r_1 = -\frac{1}{N_S} \sum_{s \in S} \frac{1}{N_s} \log p_{\mathrm{seq2seq}}(s \mid a)    (1)

where N_S denotes the cardinality of S and N_s denotes the number of tokens in the dull response s. Although there are of course more ways to generate dull responses than the list can cover, many such responses are likely to fall into similar regions of the vector space computed by the model. A system less likely to generate utterances in the list is thus also less likely to generate other dull responses.

p_seq2seq represents the likelihood output by SEQ2SEQ models. It is worth noting that p_seq2seq is different from the stochastic policy function p_RL(p_{i+1} | p_i, q_i), since the former is learned based on the MLE objective of the SEQ2SEQ model while the latter is the policy optimized for long-term future reward in the RL setting. r_1 is further scaled by the length of the target S.

Information Flow  We want each agent to contribute new information at each turn to keep the dialogue moving and avoid repetitive sequences. We therefore propose penalizing semantic similarity between consecutive turns from the same agent. Let h_{p_i} and h_{p_{i+1}} denote representations obtained from the encoder for two consecutive turns p_i and p_{i+1}. The reward is given by the negative log of the cosine similarity between them:

r_2 = -\log \cos(h_{p_i}, h_{p_{i+1}}) = -\log \frac{h_{p_i} \cdot h_{p_{i+1}}}{\lVert h_{p_i} \rVert \, \lVert h_{p_{i+1}} \rVert}    (2)

Semantic Coherence  We also need to measure the adequacy of responses to avoid situations in which the generated replies are highly rewarded but ungrammatical or incoherent. We therefore consider the mutual information between the action a and previous turns in the history to ensure that the generated responses are coherent and appropriate:

r_3 = \frac{1}{N_a} \log p_{\mathrm{seq2seq}}(a \mid q_i, p_i) + \frac{1}{N_{q_i}} \log p^{\mathrm{backward}}_{\mathrm{seq2seq}}(q_i \mid a)    (3)

where p_seq2seq(a | p_i, q_i) denotes the probability of generating response a given the previous dialogue utterances [p_i, q_i], and p^backward_seq2seq(q_i | a) denotes the backward probability of generating the previous dialogue utterance q_i based on response a. p^backward_seq2seq is trained in a similar way as standard SEQ2SEQ models, with sources and targets swapped. Again, to control the influence of target length, both log p_seq2seq(a | q_i, p_i) and log p^backward_seq2seq(q_i | a) are scaled by the length of their targets.

The final reward for action a is a weighted sum of the rewards discussed above:

r(a, [p_i, q_i]) = \lambda_1 r_1 + \lambda_2 r_2 + \lambda_3 r_3    (4)

where \lambda_1 + \lambda_2 + \lambda_3 = 1. We set \lambda_1 = 0.25, \lambda_2 = 0.25 and \lambda_3 = 0.5. A reward is observed after the agent reaches the end of each sentence.
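Before turning to training, here is a schematic NumPy rendering of Eqs (1)-(4), a sketch under stated assumptions rather than the released implementation: the SEQ2SEQ scorers are passed in as callables returning log-probabilities, the dull-response list is abbreviated, and the weights follow the values given above.

```python
import numpy as np

DULL_RESPONSES = [
    "i don't know what you are talking about",
    "i have no idea",
    # ... the paper's list contains 8 such turns
]

def r1_ease_of_answering(action, log_p_forward):
    """Eq (1): negative average length-normalized log-likelihood of
    answering `action` with a dull response. `log_p_forward(target, source)`
    is assumed to return log p_seq2seq(target | source)."""
    total = sum(log_p_forward(s, action) / len(s.split()) for s in DULL_RESPONSES)
    return -total / len(DULL_RESPONSES)

def r2_information_flow(h_prev, h_curr, eps=1e-8):
    """Eq (2): negative log cosine similarity between the encoder
    representations of two consecutive turns by the same agent."""
    cos = np.dot(h_prev, h_curr) / (np.linalg.norm(h_prev) * np.linalg.norm(h_curr) + eps)
    return -np.log(max(cos, eps))  # the log assumes a positive cosine

def r3_semantic_coherence(action, p_i, q_i, log_p_forward, log_p_backward):
    """Eq (3): length-normalized forward and backward log-probabilities
    linking the action to the preceding turns."""
    return (log_p_forward(action, (p_i, q_i)) / len(action.split())
            + log_p_backward(q_i, action) / len(q_i.split()))

def total_reward(r1, r2, r3, lambdas=(0.25, 0.25, 0.5)):
    """Eq (4): weighted combination with lambda_1 + lambda_2 + lambda_3 = 1."""
    return lambdas[0] * r1 + lambdas[1] * r2 + lambdas[2] * r3
```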
4 Simulation

The central idea behind our approach is to simulate the process of two virtual agents taking turns talking with each other, through which we can explore the state-action space and learn a policy p_RL(p_{i+1} | p_i, q_i) that leads to the optimal expected reward. We adopt an AlphaGo-style strategy (Silver et al., 2016) by initializing the RL system with a general response generation policy that is learned in a fully supervised setting.

4.1 Supervised Learning

For the first stage of training, we build on prior work on predicting a generated target sequence given the dialogue history using a supervised SEQ2SEQ model (Vinyals and Le, 2015). Results from supervised models will later be used for initialization.

We trained a SEQ2SEQ model with attention (Bahdanau et al., 2015) on the OpenSubtitles dataset, which consists of roughly 80 million source-target pairs. We treated each turn in the dataset as a target and the concatenation of the two previous sentences as the source input.

4.2 Mutual Information

Samples from SEQ2SEQ models are often dull and generic, e.g., "i don't know" (Li et al., 2016a). We thus do not want to initialize the policy model using the pre-trained SEQ2SEQ model, because this would lead to a lack of diversity in the RL model's experiences. Li et al. (2016a) showed that modeling mutual information between sources and targets significantly decreases the chance of generating dull responses and improves general response quality. We now show how to obtain an encoder-decoder model that generates maximum-mutual-information responses.

As illustrated in Li et al. (2016a), direct decoding from Eq. (3) is infeasible, since the second term requires the target sentence to be completely generated. Inspired by recent work on sequence-level learning (Ranzato et al., 2015), we treat the problem of generating a maximum-mutual-information response as a reinforcement learning problem in which a reward of mutual information value is observed when the model arrives at the end of a sequence.

Similar to Ranzato et al. (2015), we use policy gradient methods (Sutton et al., 1999; Williams, 1992) for optimization. We initialize the policy model p_RL with a pre-trained p_SEQ2SEQ(a | p_i, q_i) model. Given an input source [p_i, q_i], we generate a candidate list A = {â | â ~ p_RL}. For each generated candidate â, we obtain the mutual information score m(â, [p_i, q_i]) from the pre-trained p_SEQ2SEQ(a | p_i, q_i) and p^backward_SEQ2SEQ(q_i | a). This mutual information score is used as a reward and back-propagated to the encoder-decoder model, tailoring it to generate sequences with higher rewards. We refer the reader to Zaremba and Sutskever (2015) and Williams (1992) for details. The expected reward for a sequence is given by:

J(\theta) = \mathbb{E}[m(\hat{a}, [p_i, q_i])]    (5)

The gradient is estimated using the likelihood ratio trick:

\nabla J(\theta) = m(\hat{a}, [p_i, q_i]) \, \nabla \log p_{RL}(\hat{a} \mid [p_i, q_i])    (6)

We update the parameters of the encoder-decoder model using stochastic gradient descent. A curriculum learning strategy is adopted (Bengio et al., 2009), as in Ranzato et al. (2015), such that for every sequence of length T we use the MLE loss for the first L tokens and the reinforcement algorithm for the remaining T - L tokens. We gradually anneal the value of L to zero. A baseline strategy is employed to decrease the learning variance: an additional neural model takes as inputs the generated target and the initial source and outputs a baseline value, similar to the strategy adopted by Zaremba and Sutskever (2015). The final gradient is thus:

\nabla J(\theta) = \nabla \log p_{RL}(\hat{a} \mid [p_i, q_i]) \, [m(\hat{a}, [p_i, q_i]) - b]    (7)

4.3 Dialogue Simulation between Two Agents

We simulate conversations between the two virtual agents and have them take turns talking with each other. The simulation proceeds as follows: at the initial step, a message from the training set is fed to the first agent. The agent encodes the input message into a vector representation and starts decoding to generate a response output. Combining the immediate output from the first agent with the dialogue history, the second agent updates the state by encoding the dialogue history into a representation and uses the decoder RNN to generate a response, which is subsequently fed back to the first agent, and the process is repeated.
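The control flow just described, together with the episode-level policy-gradient estimate of Eqs (8)-(9) below, can be sketched as follows. `policy.sample` and `policy.grad_log_prob` are assumed interfaces standing in for the encoder-decoder model, and `reward_fn` is the weighted reward of Eq (4); this is a schematic of the procedure, not the actual training code.

```python
def simulate_episode(initial_message, policy, reward_fn, n_turns=2, n_candidates=5):
    """Two agents (copies of the same policy) take turns replying to each
    other, as in Section 4.3. Per the curriculum of Section 4.4, n_turns
    starts at 2 and is grown toward 5, with 5 candidates per step."""
    history = ["", initial_message]     # the last two items play the role of [p_i, q_i]
    actions, rewards = [], []
    for _ in range(n_turns):
        state = (history[-2], history[-1])
        candidates = policy.sample(state, n_candidates)  # assumed: returns utterance strings
        for a in candidates:
            rewards.append(reward_fn(a, state))          # Eq (4), observed at end of sentence
            actions.append((a, state))
        history.append(candidates[0])   # continue the roll-out with one candidate
    return actions, rewards

def policy_gradient_terms(policy, actions, rewards):
    """Likelihood-ratio estimate: each taken action's grad log-prob is
    scaled by the accumulated episode reward (Eq 8); summing the returned
    terms gives the gradient estimate of Eq (9)."""
    episode_return = sum(rewards)
    return [policy.grad_log_prob(a, state) * episode_return for a, state in actions]
```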
[Figure 1: Dialogue simulation between the two agents. Starting from an input message ("How old are you?"), each agent alternately encodes the dialogue history and decodes several candidate responses at each turn (Turn 1, Turn 2, ..., Turn n; e.g., "I'm 16, why are you asking?", "I thought you were 12."), which are then fed to the other agent.]

Optimization  We initialize the policy model p_RL with parameters from the mutual information model described in the previous subsection. We then use policy gradient methods to find parameters that lead to a larger expected reward. The objective to maximize is the expected future reward:

J_{RL}(\theta) = \mathbb{E}_{p_{RL}(a_{1:T})} \left[ \sum_{i=1}^{T} R(a_i, [p_i, q_i]) \right]    (8)

where R(a_i, [p_i, q_i]) denotes the reward resulting from action a_i. We use the likelihood ratio trick (Williams, 1992; Glynn, 1990; Aleksandrov et al., 1968) for gradient updates:

\nabla J_{RL}(\theta) \approx \sum_{i} \nabla \log p(a_i \mid p_i, q_i) \sum_{i=1}^{T} R(a_i, [p_i, q_i])    (9)

We refer readers to Williams (1992) and Glynn (1990) for more details.

4.4 Curriculum Learning

A curriculum learning strategy is again employed, in which we begin by simulating the dialogue for 2 turns and gradually increase the number of simulated turns. We generate at most 5 turns, as the number of candidates to examine grows exponentially in the size of the candidate list. Five candidate responses are generated at each step of the simulation.

5 Experimental Results

In this section, we describe experimental results along with qualitative analysis. We evaluate dialogue generation systems using both human judgments and two automatic metrics: conversation length (the number of turns in the entire session) and diversity.

5.1 Dataset

The dialogue simulation requires high-quality initial inputs to feed to the agent. For example, an initial input of "why ?" is undesirable, since it is unclear how the dialogue could proceed. We take a subset of 10 million messages from the OpenSubtitles dataset and extract 0.8 million sequences with the lowest likelihood of generating the response "i don't know what you are talking about", to ensure that initial inputs are easy to respond to.
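The filtering step of Section 5.1 amounts to scoring every candidate opening message by how likely a dull reply is and keeping the lowest-scoring ones. Below is a small sketch with an invented scorer; `dull_log_likelihood` stands in for the length-normalized SEQ2SEQ score of the dull response given the message.

```python
def select_initial_inputs(messages, dull_log_likelihood, keep=800_000):
    """Keep the `keep` messages least likely to be answered with the dull
    response 'i don't know what you are talking about' (Section 5.1)."""
    return sorted(messages, key=dull_log_likelihood)[:keep]

# Toy usage: the scores below are made up; a real scorer would query the
# pre-trained SEQ2SEQ model.
toy_scores = {"why ?": -0.5, "what are you doing here at this hour ?": -4.2}
picked = select_initial_inputs(list(toy_scores), toy_scores.get, keep=1)
print(picked)  # ['what are you doing here at this hour ?']
```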
5.2 Automatic Evaluation

Evaluating dialogue systems is difficult. Metrics such as BLEU (Papineni et al., 2002) and perplexity have been widely used for dialogue quality evaluation (Li et al., 2016a; Vinyals and Le, 2015; Sordoni et al., 2015), but it is widely debated how well these automatic metrics correlate with true response quality (Liu et al., 2016; Galley et al., 2015). Since the goal of the proposed system is not to predict the highest-probability response, but rather the long-term success of the dialogue, we do not employ BLEU or perplexity for evaluation.[2]

[2] We found that the RL model performs worse on BLEU score. On a random sample of 2,500 conversational pairs, single-reference BLEU scores for the RL model, the mutual information model and the vanilla SEQ2SEQ model are respectively 1.28, 1.44 and 1.17. BLEU is highly correlated with perplexity in generation tasks. Since the RL model is trained based on future reward rather than MLE, it is not surprising that the RL-based models achieve lower BLEU scores.

Length of the dialogue  The first metric we propose is the length of the simulated dialogue. We say a dialogue ends when one of the agents starts generating dull responses such as "i don't know"[3] or two consecutive utterances from the same user are highly overlapping.[4]

[3] We use a simple rule-matching method, with a list of 8 phrases that count as dull responses. Although this can lead to both false positives and false negatives, it works pretty well in practice.

[4] Two utterances are considered to be repetitive if they share more than 80 percent of their words.

The test set consists of 1,000 input messages. To reduce the risk of circular dialogues, we limit the number of simulated turns to be fewer than 8. Results are shown in Table 2. As can be seen, using mutual information leads to more sustained conversations between the two agents. The proposed RL model is first trained with the mutual information objective and thus benefits from it in addition to the RL training. We observe that the RL model with dialogue simulation achieves the best evaluation score.

Model               # of simulated turns
SEQ2SEQ             2.68
mutual information  3.40
RL                  4.48

Table 2: The average number of simulated turns from the standard SEQ2SEQ model, the mutual information model and the proposed RL model.

Diversity  We report the degree of diversity by calculating the number of distinct unigrams and bigrams in generated responses. The value is scaled by the total number of generated tokens to avoid favoring long sentences, as described in Li et al. (2016a). The resulting metric is thus a type-token ratio for unigrams and bigrams.

For both the standard SEQ2SEQ model and the proposed RL model, we use beam search with a beam size of 10 to generate a response to a given input message. For the mutual information model, we first generate n-best lists using p_SEQ2SEQ(t|s) and then linearly re-rank them using p_SEQ2SEQ(s|t). Results are presented in Table 4. We find that the proposed RL model generates more diverse outputs when compared against both the vanilla SEQ2SEQ model and the mutual information model.

Model               Unigram   Bigram
SEQ2SEQ             0.0062    0.015
mutual information  0.011     0.031
RL                  0.017     0.041

Table 4: Diversity scores (type-token ratios) for the standard SEQ2SEQ model, the mutual information model and the proposed RL model.
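As a concrete reading of the two automatic metrics above, the sketch below implements the dialogue-termination test used for dialogue length (a dull-response match or more than 80 percent word overlap between an agent's consecutive turns, footnotes [3]-[4]) and the distinct unigram/bigram ratios reported in Table 4. The dull-phrase list is abbreviated and the overlap test is one reasonable interpretation, not the authors' exact rule.

```python
DULL_PHRASES = {"i don't know", "i don't know what you are talking about"}  # 8 phrases in the paper

def word_overlap(u1, u2):
    w1, w2 = set(u1.lower().split()), set(u2.lower().split())
    return len(w1 & w2) / max(min(len(w1), len(w2)), 1)

def dialogue_ended(utterance, prev_turn_same_agent=None):
    """A simulated dialogue is cut off when an agent produces a dull
    response or largely repeats its own previous turn."""
    if utterance.strip().lower() in DULL_PHRASES:
        return True
    return prev_turn_same_agent is not None and word_overlap(utterance, prev_turn_same_agent) > 0.8

def distinct_ratio(responses, n=1):
    """Distinct n-grams divided by the total number of generated tokens
    (the unigram/bigram diversity scores of Table 4)."""
    ngrams, total = set(), 0
    for r in responses:
        tokens = r.split()
        total += len(tokens)
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    return len(ngrams) / max(total, 1)

print(dialogue_ended("I don't know"))                       # True
print(distinct_ratio(["i don't know", "i don't know"], 1))  # 0.5: low diversity
```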
Human Evaluation  We explore three settings for human evaluation. The first setting is similar to the one described in Li et al. (2016a): we employ crowdsourced judges to evaluate a random sample of 500 items. We present both an input message and the generated outputs to 3 judges and ask them to decide which of the two outputs is better (denoted single-turn general quality). Ties are permitted. Identical strings are assigned the same score. We measure the improvement achieved by the RL model over the mutual information model by the mean difference in scores between the models.

For the second setting, judges are again presented with input messages and system outputs, but are asked to decide which of the two outputs is easier to respond to (denoted single-turn ease to answer). Again we evaluate a random sample of 500 items, each assigned to 3 judges.

For the third setting, judges are presented with simulated conversations between the two agents (denoted multi-turn general quality). Each conversation consists of 5 turns. We evaluate 200 simulated conversations, each assigned to 3 judges, who are asked to decide which of the simulated conversations is of higher quality.

Results for human evaluation are shown in Table 5. The proposed RL system does not introduce a significant boost in single-turn response quality (winning 40 percent of the time and losing 36 percent of the time). This is in line with our expectations, as the RL model is not optimized to predict the next utterance, but rather to increase long-term reward. The RL system produces responses that are significantly easier to answer than the mutual information system, as demonstrated by the single-turn ease to answer setting (winning 52 percent of the time and losing 23 percent of the time), and also significantly higher quality multi-turn dialogues, as demonstrated by the multi-turn general quality setting (winning 72 percent of the time).

Setting                      RL-win  RL-lose  Tie
single-turn general quality  0.40    0.36     0.24
single-turn ease to answer   0.52    0.23     0.25
multi-turn general quality   0.72    0.12     0.16

Table 5: RL gains over the mutual information system based on pairwise human judgments.

Qualitative Analysis and Discussion  We show a random sample of generated responses in Table 3 and simulated conversations in Table 1 at the beginning of the paper. From Table 3, we can see that the RL-based agent indeed generates more interactive responses than the other baselines. We also find that the RL model has a tendency to end a sentence with another question and hand the conversation over to the user. From Table 1, we observe that the RL model manages to produce more interactive and sustained conversations than the mutual information model.

Input Message | Mutual Information Model | RL
How old are you? | I'm 16. | I'm 16, why are you asking?
What's your full name? | I have no idea. | What's yours?
I don't want to go home tonight. | Really? | Why?
Do you have any feelings for me? | I don't know what you are talking about. | I don't want to hurt your feelings.
How much time do you have here? | Not long enough. Sorry, sir. | Ten seconds.
Shall we get started? | Of course! | Yes. We've got a lot of work to do here.
Do you play football? | No, i don't | Yes. I love football!
We'd have to talk to him. | I mean, he's a good guy | About what?
How come you never say it? | Because I don't want to hurt you. | I don't think it's a good idea to say it.

Table 3: Sampled responses generated from the mutual information model and the proposed RL model.

During error analysis, we found that although we penalize repetitive utterances in consecutive turns, the dialogue sometimes enters a cycle with length greater than one, as shown in Table 6. This can be ascribed to the limited amount of conversational history we consider. Another issue observed is that the model sometimes starts a less relevant topic during the conversation. There is a tradeoff between relevance and repetitiveness, as manifested in the reward function we define in Eq. (4).

A: What's your name?
B: Daniel.
A: How old are you?
B: Twelve. What's your name?
A: Daniel.
B: How old are you?
A: Twelve. What's your name?
B: Daniel.
A: How old are you?
B: ...

Table 6: A simulated dialogue with a cycle longer than one.

The fundamental problem, of course, is that the manually defined reward function cannot possibly cover all the crucial aspects that define an ideal conversation. While the heuristic rewards that we defined are amenable to automatic calculation, and do capture some aspects of what makes a good conversation, ideally the system would instead receive real rewards from humans. Another problem with the current model is that we can only afford to explore a very small number of candidates and simulated turns, since the number of cases to consider grows exponentially.

6 Conclusion

We introduce a reinforcement learning framework for neural response generation by simulating dialogues between two agents, integrating the strengths of neural SEQ2SEQ systems and reinforcement learning for dialogue. Like earlier neural SEQ2SEQ models, our framework captures compositional models of the meaning of a dialogue turn and generates semantically appropriate responses. Like reinforcement learning dialogue systems, our framework is able to generate utterances that optimize future reward, successfully capturing global properties of a good conversation. Despite the fact that our model uses very simple, operational heuristics to capture these global properties, the framework generates more diverse, interactive responses that foster a more sustained conversation.
Acknowledgement

We would like to thank Chris Brockett, Bill Dolan and other members of the NLP group at Microsoft Research for insightful comments and suggestions. We also want to thank Kelvin Guu, Percy Liang, Chris Manning, Sida Wang, Ziang Xie and other members of the Stanford NLP group for useful discussions. Jiwei Li is supported by a Facebook Fellowship, which we gratefully acknowledge. This work is partially supported by the NSF via Awards IIS-1514268 and IIS-1464128, and by the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF-15-1-0462. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF, DARPA, or Facebook.

References

V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva. 1968. Stochastic optimization. Engineering Cybernetics, 5:11–16.

Jens Allwood, Joakim Nivre, and Elisabeth Ahlsén. 1992. On the semantics and pragmatics of linguistic feedback. Journal of Semantics, 9:1–26.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.

Rafael E. Banchs and Haizhou Li. 2012. IRIS: a chat-oriented dialogue system based on the vector space model. In Proceedings of the ACL 2012 System Demonstrations, pages 37–42.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM.

S.R.K. Branavan, David Silver, and Regina Barzilay. 2011. Learning to win by reading manuals in a Monte-Carlo framework. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 268–277.

Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In Proc. of ACL-IJCNLP, pages 445–450, Beijing, China, July.

Milica Gašic, Catherine Breslin, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and Steve Young. 2013a. POMDP-based dialogue manager adaptation to extended domains. In Proceedings of SIGDIAL.

Milica Gasic, Catherine Breslin, Mike Henderson, Dongkyu Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and Steve Young. 2013b. On-line policy optimisation of Bayesian spoken dialogue systems via human interaction. In Proceedings of ICASSP 2013, pages 8367–8371. IEEE.

Milica Gašic, Dongho Kim, Pirros Tsiakoulis, Catherine Breslin, Matthew Henderson, Martin Szummer, Blaise Thomson, and Steve Young. 2014. Incremental on-line adaptation of POMDP-based dialogue managers to extended domains. In Proceedings of InterSpeech.

Peter W. Glynn. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84.

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. 2016. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1621–1630, Berlin, Germany, August.

Esther Levin, Roberto Pieraccini, and Wieland Eckert. 1997. Learning dialogue strategies within the Markov decision process framework. In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 72–79. IEEE.

Esther Levin, Roberto Pieraccini, and Wieland Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proc. of NAACL-HLT.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany, August.

Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

Yi Luan, Yangfeng Ji, and Mari Ostendorf. 2016. LSTM based conversation models. arXiv preprint arXiv:1603.09457.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop.
Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941.

Lasguido Nio, Sakriani Sakti, Graham Neubig, Tomoki Toda, Mirna Adriani, and Satoshi Nakamura. 2014. Developing non-goal dialog system based on examples of drama television. In Natural Interaction with Robots, Knowbots and Smartphones, pages 355–361. Springer.

Alice H. Oh and Alexander I. Rudnicky. 2000. Stochastic language generation for spoken dialogue systems. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational Systems, Volume 3, pages 27–32.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Roberto Pieraccini, David Suendermann, Krishna Dayanidhi, and Jackson Liscombe. 2009. Are we there yet? Research in commercial spoken dialog systems. In Text, Speech and Dialogue, pages 3–13. Springer.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Adwait Ratnaparkhi. 2002. Trainable approaches to surface natural language generation and their application to conversational dialog systems. Computer Speech & Language, 16(3):435–455.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of EMNLP 2011, pages 583–593.

Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge Engineering Review, 21(02):97–126.

Emanuel A. Schegloff and Harvey Sacks. 1973. Opening up closings. Semiotica, 8(4):289–327.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of AAAI, February.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of ACL-IJCNLP, pages 1577–1586.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.

Satinder P. Singh, Michael J. Kearns, Diane J. Litman, and Marilyn A. Walker. 1999. Reinforcement learning for spoken dialogue systems. In NIPS, pages 956–962.

Satinder Singh, Michael Kearns, Diane J. Litman, Marilyn A. Walker, et al. 2000. Empirical evaluation of a reinforcement learning spoken dialogue system. In AAAI/IAAI, pages 645–651.

Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker. 2002. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, pages 105–133.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Meg Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of NAACL-HLT.

Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Continuously learning neural dialogue management. arXiv.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Richard S. Sutton, David A. McAllester, Satinder P. Singh, Yishay Mansour, et al. 1999. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proceedings of ICML Deep Learning Workshop.

Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of ACL 2010, pages 806–814.

Marilyn A. Walker, Rashmi Prasad, and Amanda Stent. 2003. A trainable generator for recommendations in multimodal dialog. In Proceedings of INTERSPEECH 2003.

Marilyn A. Walker. 2000. An application of reinforcement learning to dialogue strategy selection in a spoken dialogue system for email. Journal of Artificial Intelligence Research, pages 387–416.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of EMNLP, pages 1711–1721, Lisbon, Portugal.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, and Xiaolong Wang. 2016. Incorporating loose-structured knowledge into LSTM with recall gate for conversation modeling. arXiv preprint arXiv:1605.05110.

Kaisheng Yao, Geoffrey Zweig, and Baolin Peng. 2015. Attention with intention for a neural network conversation model. In NIPS Workshop on Machine Learning for Spoken Language Understanding and Interaction.

Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 24(2):150–174.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Wojciech Zaremba and Ilya Sutskever. 2015. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521.
