
Paraphrase Generation with Deep Reinforcement Learning

Zichao Li1, Xin Jiang1, Lifeng Shang1, Hang Li2
1 Noah’s Ark Lab, Huawei Technologies
{li.zichao, jiang.xin, shang.lifeng}@huawei.com
2 Toutiao AI Lab
lihang.lh@bytedance.com

Abstract

Automatic generation of paraphrases from a given sentence is an important yet challenging task in natural language processing (NLP). In this paper, we present a deep reinforcement learning approach to paraphrase generation. Specifically, we propose a new framework for the task, which consists of a generator and an evaluator, both of which are learned from data. The generator, built as a sequence-to-sequence learning model, can produce paraphrases given a sentence. The evaluator, constructed as a deep matching model, can judge whether two sentences are paraphrases of each other. The generator is first trained by deep learning and then further fine-tuned by reinforcement learning in which the reward is given by the evaluator. For the learning of the evaluator, we propose two methods based on supervised learning and inverse reinforcement learning respectively, depending on the type of available training data. Experimental results on two datasets demonstrate that the proposed models (the generators) can produce more accurate paraphrases and outperform the state-of-the-art methods in paraphrase generation in both automatic evaluation and human evaluation.

1 Introduction

Paraphrases refer to texts that convey the same meaning but with different expressions. For example, “how far is Earth from Sun” and “what is the distance between Sun and Earth” are paraphrases. Paraphrase generation refers to a task in which, given a sentence, the system creates paraphrases of it. Paraphrase generation is an important task in NLP, which can be a key technology in many applications such as retrieval-based question answering, semantic parsing, query reformulation in web search, and data augmentation for dialogue systems. However, due to the complexity of natural language, automatically generating accurate and diverse paraphrases is still very challenging. Traditional symbolic approaches to paraphrase generation include rule-based methods (McKeown, 1983), thesaurus-based methods (Bolshakov and Gelbukh, 2004; Kauchak and Barzilay, 2006), grammar-based methods (Narayan et al., 2016), and statistical machine translation (SMT) based methods (Quirk et al., 2004; Zhao et al., 2008, 2009).

Recently, neural network based sequence-to-sequence (Seq2Seq) learning has achieved remarkable success in various NLP tasks, including machine translation, short-text conversation, text summarization, and question answering (e.g., Cho et al. (2014); Wu et al. (2016); Shang et al. (2015); Vinyals and Le (2015); Rush et al. (2015); Yin et al. (2016)). Paraphrase generation can naturally be formulated as a Seq2Seq problem (Cao et al., 2017; Prakash et al., 2016; Gupta et al., 2018; Su and Yan, 2017). The main challenge in paraphrase generation lies in the definition of the evaluation measure. Ideally, the measure should be able to calculate the semantic similarity between a generated paraphrase and the given sentence. In a straightforward application of Seq2Seq to paraphrase generation, one would use cross entropy as the evaluation measure, which can only be a loose approximation of semantic similarity. To tackle this problem, Ranzato et al. (2016) propose employing reinforcement learning (RL) to guide the training of Seq2Seq and using lexical measures such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) as the reward function. However, these lexical measures may not perfectly represent semantic similarity. It is likely that a correctly generated sequence gets a low ROUGE score due to lexical mismatch. For instance, an input sentence “how far is Earth from Sun” can be paraphrased as “what is the distance between Sun and Earth”, but it will get a very low ROUGE score given “how many miles is it from Earth to Sun” as a reference.
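To make the lexical-mismatch issue concrete, the following self-contained sketch (illustrative only, not the evaluation code used in the experiments) scores the generated paraphrase above against the reference with an LCS-based, ROUGE-L-style F-measure; the beta weighting is a common but simplified convention.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f(candidate, reference, beta=1.2):
    """ROUGE-L-style F-measure based on the longest common subsequence."""
    lcs = lcs_len(candidate, reference)
    p, r = lcs / len(candidate), lcs / len(reference)
    if p == 0 or r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

cand = "what is the distance between sun and earth".split()
ref = "how many miles is it from earth to sun".split()
print(rouge_l_f(cand, ref))   # low score despite being a valid paraphrase
```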
In this work, we propose taking a data-driven approach to train a model that can conduct evaluation in learning for paraphrase generation. The framework contains two modules, a generator (for paraphrase generation) and an evaluator (for paraphrase evaluation). The generator is a Seq2Seq learning model with attention and copy mechanism (Bahdanau et al., 2015; See et al., 2017), which is first trained with cross entropy loss and then fine-tuned by using policy gradient with supervisions from the evaluator as rewards. The evaluator is a deep matching model, specifically a decomposable attention model (Parikh et al., 2016), which can be trained by supervised learning (SL) when both positive and negative examples are available as training data, or by inverse reinforcement learning (IRL) with outputs from the generator as supervisions when only positive examples are available. In the latter setting, for the training of the evaluator using IRL, we develop a novel algorithm based on the max-margin IRL principle (Ratliff et al., 2006). Moreover, the generator can be further trained with non-parallel data, which is particularly effective when the amount of parallel data is small.

We evaluate the effectiveness of our approach using two real-world datasets (Quora question pairs and Twitter URL paraphrase corpus) and we conduct both automatic and human assessments. We find that the evaluator trained by our methods can provide accurate supervisions to the generator, and thus further improve the accuracies of the generator. The experimental results indicate that our models can achieve significantly better performances than the existing neural network based methods.

It should be noted that the proposed approach is not limited to paraphrase generation and can be readily applied to other sequence-to-sequence tasks such as machine translation and generation-based single-turn dialogue. Our technical contribution in this work is three-fold:
1. We introduce the generator-evaluator framework for paraphrase generation, or in general, sequence-to-sequence learning.
2. We propose two approaches to train the evaluator, i.e., supervised learning and inverse reinforcement learning.
3. In the above framework, we develop several training techniques for learning of the generator and evaluator.

Figure 1: Framework of RbM (Reinforced by Matching).

Section 2 defines the models of the generator and evaluator. In Section 3, we formalize the problem of learning the models of the generator and evaluator. In Section 4, we report our experimental results. In Section 5, we introduce related work.

2 Models

This section explains our framework for paraphrase generation, containing two models, the generator and evaluator.

2.1 Problem and Framework

Given an input sequence of words X = [x1, . . . , xS] with length S, we aim to generate an output sequence of words Y = [y1, . . . , yT] with length T that has the same meaning as X. We denote the pair of sentences in paraphrasing as (X, Y). We use Y1:t to denote the subsequence of Y ranging from 1 to t and use Ŷ to denote the sequence generated by a model.

We propose a framework, which contains a generator and an evaluator, called RbM (Reinforced by Matching). Specifically, for the generator we adopt the Seq2Seq architecture with attention and copy mechanism (Bahdanau et al., 2015; See et al., 2017), and for the evaluator we adopt the decomposable attention-based deep matching model (Parikh et al., 2016). We denote the generator as Gθ and the evaluator as Mφ, where θ and φ represent their parameters respectively. Figure 1 gives an overview of our framework. Basically, the generator can generate a paraphrase of a given sentence and the evaluator can judge how semantically similar the two sentences are.

2.2 Generator: Seq2Seq Model

In this work, paraphrase generation is defined as a sequence-to-sequence (Seq2Seq) learning problem. Given input sentence X, the goal is to learn a model Gθ that can generate a sentence
Ŷ = Gθ(X) as its paraphrase. We choose the pointer-generator proposed by See et al. (2017) as the generator. The model is built based on the encoder-decoder framework (Cho et al., 2014; Sutskever et al., 2014), both of which are implemented as recurrent neural networks (RNN). The encoder RNN transforms the input sequence X into a sequence of hidden states H = [h1, . . . , hS]. The decoder RNN generates an output sentence Y on the basis of the hidden states. Specifically, it predicts the next word at each position by sampling from ŷt ∼ p(yt | Y1:t−1, X) = g(st, ct, yt−1), where st is the decoder state, ct is the context vector, yt−1 is the previous word, and g is a feed-forward neural network. The attention mechanism (Bahdanau et al., 2015) is introduced to compute the context vector as the weighted sum of encoder states:

c_t = \sum_{i=1}^{S} \alpha_{ti} h_i, \qquad \alpha_{ti} = \frac{\exp \eta(s_{t-1}, h_i)}{\sum_{j=1}^{S} \exp \eta(s_{t-1}, h_j)},

where αti represents the attention weight and η is the attention function, which is a feed-forward neural network.

Paraphrasing often needs copying words from the input sentence, for instance, named entities. The pointer-generator model allows either generating words from a vocabulary or copying words from the input sequence. Specifically, the probability of generating the next word is given by a mixture model:

p_\theta(y_t | Y_{1:t-1}, X) = q(s_t, c_t, y_{t-1})\, g(s_t, c_t, y_{t-1}) + (1 - q(s_t, c_t, y_{t-1})) \sum_{i: y_t = x_i} \alpha_{ti},

where q(st, ct, yt−1) is a binary neural classifier deciding the probability of switching between the generation mode and the copying mode.
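As a concrete reading of the mixture model above, here is a minimal numpy sketch (not the implementation of See et al. (2017)) of how the generation distribution and the copy distribution can be combined at a single decoding step; the toy shapes, the shared source/output vocabulary ids, and the helper softmax are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pointer_generator_step(gen_logits, attention, src_ids, p_gen):
    """Mix the generation and copy distributions for one decoding step.

    gen_logits : scores over the fixed vocabulary (stand-in for g(s_t, c_t, y_{t-1}))
    attention  : attention weights alpha_{t,i} over the S source positions
    src_ids    : vocabulary id of each source word x_i
    p_gen      : the switch probability q(s_t, c_t, y_{t-1}), in (0, 1)
    """
    p_final = p_gen * softmax(gen_logits)            # generation mode
    for i, word_id in enumerate(src_ids):            # copy mode: add attention mass
        p_final[word_id] += (1.0 - p_gen) * attention[i]
    return p_final                                   # p_theta(y_t | Y_{1:t-1}, X)

# toy check: 3 source words, a 10-word vocabulary; the result is a distribution
rng = np.random.default_rng(0)
dist = pointer_generator_step(rng.normal(size=10), softmax(rng.normal(size=3)),
                              src_ids=[2, 5, 5], p_gen=0.7)
assert abs(dist.sum() - 1.0) < 1e-9
```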
2.3 Evaluator: Deep Matching Model

In this work, paraphrase evaluation (identification) is cast as a problem of learning of sentence matching. The goal is to learn a real-valued function Mφ(X, Y) that can represent the matching degree between the two sentences as paraphrases of each other. A variety of learning techniques have been developed for matching sentences, from linear models (e.g., Wu et al. (2013)) to neural network based models (e.g., Socher et al. (2011); Hu et al. (2014)). We choose a simple yet effective neural network architecture, called the decomposable-attention model (Parikh et al., 2016), as the evaluator. The evaluator can calculate the semantic similarity between two sentences:

M_\phi(X, Y) = H\Big( \sum_{i=1}^{S} G([e(x_i), \bar{x}_i]),\; \sum_{j=1}^{T} G([e(y_j), \bar{y}_j]) \Big),

where e(·) denotes a word embedding, x̄i and ȳj denote inter-attended vectors, and H and G are feed-forward networks. We refer the reader to Parikh et al. (2016) for details. In addition, we add positional encodings to the word embedding vectors to incorporate the order information of the words, following the idea in Vaswani et al. (2017).

3 Learning

This section explains how to learn the generator and evaluator using deep reinforcement learning.

3.1 Learning of Generator

Given training data (X, Y), the generator Gθ is first trained to maximize the conditional log likelihood (negative cross entropy):

L_{Seq2Seq}(\theta) = \sum_{t=1}^{T} \log p_\theta(y_t | Y_{1:t-1}, X). \quad (1)

When computing the conditional probability of the next word as above, we choose the previous word yt−1 in the ground-truth rather than the word ŷt−1 generated by the model. This technique is called teacher forcing.

With teacher forcing, the discrepancy between training and prediction (also referred to as exposure bias) can quickly accumulate errors along the generated sequence (Bengio et al., 2015; Ranzato et al., 2016). Therefore, the generator Gθ is next fine-tuned using RL, where the reward is given by the evaluator.

In the RL formulation, generation of the next word represents an action, the previous words represent a state, and the probability of generation pθ(yt | Y1:t−1, X) induces a stochastic policy. Let rt denote the reward at position t. The goal of RL is to find a policy (i.e., a generator) that maximizes the expected cumulative reward:

L_{RL}(\theta) = E_{p_\theta(\hat{Y}|X)} \sum_{t=1}^{T} r_t(X, \hat{Y}_{1:t}). \quad (2)

We define a positive reward at the end of the sequence (rT = R) and a zero reward at the other positions. The reward R is given by the evaluator Mφ. In particular, when a pair of input sentence X and generated paraphrase Ŷ = Gθ(X) is given, the reward is calculated by the evaluator, R = Mφ(X, Ŷ).

We can then learn the optimal policy by employing policy gradient. According to the policy gradient theorem (Williams, 1992; Sutton et al., 2000), the gradient of the expected cumulative reward can be calculated by

\nabla_\theta L_{RL}(\theta) = \sum_{t=1}^{T} \big[\nabla_\theta \log p_\theta(\hat{y}_t | \hat{Y}_{1:t-1}, X)\big]\, r_t. \quad (3)

The generator can thus be learned with stochastic gradient descent methods such as Adam (Kingma and Ba, 2015).

3.2 Learning of Evaluator

The evaluator works as the reward function in RL of the generator and thus is essential for the task. We propose two methods for learning the evaluator in different settings. When there are both positive and negative examples of paraphrases, the evaluator is trained by supervised learning (SL). When only positive examples are available (usually the same data as the training data of the generator), the evaluator is trained by inverse reinforcement learning (IRL).

Supervised Learning

Given a set of positive and negative examples (paraphrase pairs), we conduct supervised learning of the evaluator with the pointwise cross entropy loss:

J_{SL}(\phi) = -\log M_\phi(X, Y) - \log(1 - M_\phi(X, Y^-)), \quad (4)

where Y− represents a sentence that is not a paraphrase of X. The evaluator Mφ here is defined as a classifier, trained to distinguish the negative example (X, Y−) from the positive example (X, Y).

We call this method RbM-SL (Reinforced by Matching with Supervised Learning). The evaluator Mφ trained by supervised learning can make a judgement on whether two sentences are paraphrases of each other. With a well-trained evaluator Mφ, we further train the generator Gθ by reinforcement learning using Mφ as a reward function. Figure 2a shows the learning process of RbM-SL. The detailed training procedure is shown in Algorithm 1 in Appendix A.

Inverse Reinforcement Learning

Inverse reinforcement learning (IRL) is a sub-problem of reinforcement learning (RL), about learning of a reward function given expert demonstrations, which are sequences of states and actions from an expert (optimal) policy. More specifically, the goal is to find an optimal reward function R∗ with which the expert policy pθ∗(Y | X) really becomes optimal among all possible policies, i.e.,

E_{p_{\theta^*}(Y|X)} R^*(Y) \geq E_{p_\theta(\hat{Y}|X)} R^*(\hat{Y}), \quad \forall \theta.

In the current problem setting, the problem becomes learning of an optimal reward function (evaluator) given a number of paraphrase pairs given by human experts (expert demonstrations).

To learn an optimal reward (matching) function is challenging, because the expert demonstrations might not be optimal and the reward function might not be rigorously defined. To deal with the problem, we employ the maximum margin formulation of IRL inspired by Ratliff et al. (2006).

The maximum margin approach ensures that the learned reward function has the following two desirable properties in the paraphrase generation task: (a) given the same input sentence, a reference from humans should have a higher reward than the ones generated by the model; (b) the margins between the rewards should become smaller when the paraphrases generated by the model get closer to a reference given by humans. We thus specifically consider the following optimization problem for learning of the evaluator:

J_{IRL}(\phi) = \max(0,\, 1 - \zeta + M_\phi(X, \hat{Y}) - M_\phi(X, Y)), \quad (5)

where ζ is a slack variable to measure the agreement between Ŷ and Y. In practice we set ζ = ROUGE-L(Ŷ, Y). Different from RbM-SL, the evaluator Mφ here is defined as a ranking model that assigns higher rewards to more plausible paraphrases.

Once the reward function (evaluator) is learned, it is then used to improve the policy function (generator) through policy gradient. In fact, the generator Gθ and the evaluator Mφ are trained alternately. We call this method RbM-IRL (Reinforced by Matching with Inverse Reinforcement Learning). Figure 2b shows the learning process of RbM-IRL. The detailed training procedure is shown in Algorithm 2 in Appendix A.
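For concreteness, the two evaluator objectives above can be written out as follows (plain Python, single-instance form, not the training code; the example score values are made up).

```python
import math

def sl_loss(m_pos, m_neg, eps=1e-12):
    """Eq. (4): M_phi is a classifier; m_pos = M_phi(X, Y), m_neg = M_phi(X, Y^-)."""
    return -math.log(m_pos + eps) - math.log(1.0 - m_neg + eps)

def irl_loss(m_generated, m_reference, zeta):
    """Eq. (5): M_phi is a ranker; zeta is the slack, set to ROUGE-L(Y_hat, Y)."""
    return max(0.0, 1.0 - zeta + m_generated - m_reference)

# RbM-SL: the classifier should score the true pair high and the negative pair low
print(sl_loss(m_pos=0.9, m_neg=0.1))                          # small loss

# RbM-IRL: a weak generation must be ranked well below the human reference,
# while a near-perfect one (zeta close to 1) needs almost no margin
print(irl_loss(m_generated=0.6, m_reference=0.7, zeta=0.2))   # 0.7
print(irl_loss(m_generated=0.6, m_reference=0.7, zeta=0.95))  # 0.0
```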


Figure 2: Learning Process of RbM models: (a) RbM-SL, (b) RbM-IRL.

We can formalize the whole learning procedure as the following optimization problem:

\min_\phi \max_\theta \; E_{p_\theta(\hat{Y}|X)} J_{IRL}(\phi). \quad (6)

RbM-IRL can make effective use of sequences generated by the generator for training of the evaluator. As the generated sentences become closer to the ground-truth, the evaluator also becomes more discriminative in identifying paraphrases.

It should also be noted that for both RbM-SL and RbM-IRL, once the evaluator is learned, the reinforcement learning of the generator only needs non-parallel sentences as input. This makes it possible to further train the generator and enhance its generalization ability.

3.3 Training Techniques

Reward Shaping

In the original RL of the generator, only a positive reward R is given at the end of the sentence. This provides sparse supervision signals and can make the model greatly degenerate. Inspired by the idea of reward shaping (Ng et al., 1999; Bahdanau et al., 2017), we estimate the intermediate cumulative reward (value function) for each position, that is,

Q_t = E_{p_\theta(Y_{t+1:T} | \hat{Y}_{1:t}, X)} R(X, [\hat{Y}_{1:t}, Y_{t+1:T}]),

by Monte-Carlo simulation, in the same way as in Yu et al. (2017):

Q_t = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} M_\phi(X, [\hat{Y}_{1:t}, \hat{Y}^n_{t+1:T}]), & t < T \\ M_\phi(X, \hat{Y}), & t = T, \end{cases} \quad (7)

where N is the sample size and Ŷn_{t+1:T} ∼ pθ(Y_{t+1:T} | Ŷ_{1:t}, X) denotes simulated sub-sequences randomly sampled starting from the (t + 1)-th word. During training of the generator, the reward rt in the policy gradient (3) is replaced by Qt estimated in (7).

Reward Rescaling

In practice, RL algorithms often suffer from instability in training. A common approach to reduce the variance is to subtract a baseline reward from the value function. For instance, a simple baseline can be a moving average of historical rewards. In RbM-IRL, however, the evaluator keeps updating during training. Thus, keeping track of a baseline reward is unstable and inefficient. Inspired by Guo et al. (2018), we propose an efficient reward rescaling method based on ranking. For a batch of D generated paraphrases {Ŷd}_{d=1}^{D}, each associated with a reward Rd = Mφ(Xd, Ŷd), we rescale the rewards by

\bar{R}^d = \sigma\Big(\delta_1 \cdot \big(0.5 - \tfrac{\mathrm{rank}(d)}{D}\big)\Big) - 0.5, \quad (8)

where σ(·) is the sigmoid function, rank(d) is the rank of Rd in {R1, ..., RD}, and δ1 is a scalar controlling the variance of the rewards. A similar strategy is applied to the estimation of the in-sequence value function for each word, and the final rescaled value function is

\bar{Q}^d_t = \sigma\Big(\delta_2 \cdot \big(0.5 - \tfrac{\mathrm{rank}(t)}{T}\big)\Big) - 0.5 + \bar{R}^d, \quad (9)

where rank(t) is the rank of Qd_t in {Qd_1, ..., Qd_T}.

Reward rescaling has two advantages. First, the mean and variance of Q̄d_t are controlled and hence they make the policy gradient more stable, even with a varying reward function. Second, when the evaluator Mφ is trained with the ranking loss as in RbM-IRL, it is better to inform which paraphrase is better, rather than to provide a scalar reward in a range. In our experiments, we find that this method can bring substantial gains for RbM-SL and RbM-IRL, but not for RL with ROUGE as reward.
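The reward shaping and reward rescaling computations above can be sketched as follows. The rollout and evaluator callables are hypothetical stand-ins for the generator's sampler and Mφ, the rank convention (rank 1 for the largest value) is one possible reading of Eqs. (8)–(9), and the default δ values follow the settings reported later in Section 4.3.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shaped_values(X, Y_hat, rollout, evaluator, N=4):
    """Monte-Carlo estimate of Q_t in Eq. (7); `rollout(X, prefix)` samples a
    completion from the current generator, `evaluator` stands for M_phi."""
    T, Q = len(Y_hat), []
    for t in range(1, T + 1):
        if t < T:
            Q.append(sum(evaluator(X, Y_hat[:t] + rollout(X, Y_hat[:t]))
                         for _ in range(N)) / N)
        else:
            Q.append(evaluator(X, Y_hat))          # terminal reward R
    return Q

def rescale(values, delta):
    """Rank-based squashing used in Eqs. (8) and (9); rank 1 = largest value."""
    v = np.asarray(values, dtype=float)
    ranks = np.empty(len(v), dtype=int)
    ranks[np.argsort(-v)] = np.arange(1, len(v) + 1)
    return sigmoid(delta * (0.5 - ranks / len(v))) - 0.5

def rescaled_value_function(Q_batch, R_batch, delta1=12.0, delta2=1.0):
    """Eq. (9): combine the rescaled sequence reward with in-sequence ranks."""
    R_bar = rescale(R_batch, delta1)               # Eq. (8), one value per sequence
    return [rescale(Q_d, delta2) + R_bar[d] for d, Q_d in enumerate(Q_batch)]

# toy usage with string tokens and a word-overlap "evaluator"
# (a batch of one sequence, purely to exercise the code path)
rng = np.random.default_rng(0)
toy_eval = lambda X, Y: len(set(X) & set(Y)) / max(len(set(X)), 1)
toy_rollout = lambda X, prefix: list(rng.choice(X, size=2))
Q = shaped_values("how far is earth from sun".split(),
                  "what is the distance".split(), toy_rollout, toy_eval)
print(rescaled_value_function([Q], [Q[-1]]))
```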

Curriculum Learning

RbM-IRL may not achieve its best performance if all of the training instances are included in training at the beginning. We employ a curriculum learning strategy (Bengio et al., 2009) for it. During the training of the evaluator Mφ, each example k is associated with a weight wk, i.e.,

J^k_{IRL\text{-}CL}(\phi) = w_k \max\big(0,\, 1 - \zeta^k + M_\phi(X^k, \hat{Y}^k) - M_\phi(X^k, Y^k)\big). \quad (10)

In curriculum learning, wk is determined by the difficulty of the example. At the beginning, the training procedure concentrates on relatively simple examples, and gradually puts more weight on difficult ones. In our case, we use the edit distance E(X, Y) between X and Y as the measure of difficulty for paraphrasing. Specifically, wk is determined by wk ∼ Binomial(pk, 1), and pk = σ(δ3 · (0.5 − rank(E(Xk, Yk))/K)), where K denotes the batch size for training the evaluator. For δ3, we start with a relatively high value and gradually decrease it. In the end, each example will be sampled with a probability around 0.5. In this manner, the evaluator first learns to identify paraphrases with small modifications on the input sentences (e.g. “what ’s” and “what is”). Along with training, it gradually learns to handle more complicated paraphrases (e.g. “how can I” and “what is the best way to”).

4 Experiment

4.1 Baselines and Evaluation Measures

To compare our methods (RbM-SL and RbM-IRL) with existing neural network based methods, we choose five baseline models: the attentive Seq2Seq model (Bahdanau et al., 2015), the stacked Residual LSTM networks (Prakash et al., 2016), the variational auto-encoder (VAE-SVG-eq) (Gupta et al., 2018)¹, the pointer-generator (See et al., 2017), and the reinforced pointer-generator with ROUGE-2 as reward (RL-ROUGE) (Ranzato et al., 2016).

¹ We directly present the results reported in Gupta et al. (2018) on the same dataset and settings.

We conduct both automatic and manual evaluation on the models. For the automatic evaluation, we adopt four evaluation measures: ROUGE-1, ROUGE-2 (Lin, 2004), BLEU (Papineni et al., 2002) (up to at most bi-grams) and METEOR (Lavie and Agarwal, 2007). As pointed out, ideally it would be better not to merely use a lexical measure like ROUGE or BLEU for the evaluation of paraphrasing. We choose to use them for reproducibility of our experimental results by others. For the manual evaluation, we conduct evaluation on the generated paraphrases in terms of relevance and fluency.

4.2 Datasets

We evaluate our methods with the Quora question pair dataset² and the Twitter URL paraphrasing corpus (Lan et al., 2017). Both datasets contain positive and negative examples of paraphrases so that we can evaluate the RbM-SL and RbM-IRL methods. We randomly split the Quora dataset in two different ways, obtaining two experimental settings: Quora-I and Quora-II. In Quora-I, we partition the dataset by question pairs, while in Quora-II, we partition by question ids such that there is no shared question between the training and test/validation datasets. In addition, we sample a smaller training set in Quora-II to make the task more challenging. The Twitter URL paraphrasing corpus contains two subsets, one labeled by human annotators and the other labeled automatically by an algorithm. We sample the test and validation sets from the labeled subset, while using the remaining pairs as the training set. For RbM-SL, we use the labeled subset to train the evaluator Mφ. Compared to Quora-I, it is more difficult to achieve a high performance with Quora-II. The Twitter corpus is even more challenging since the data contains more noise. The basic statistics of the datasets are shown in Table 1.

² https://www.kaggle.com/c/quora-question-pairs

Table 1: Statistics of datasets.
              Generator                      Evaluator (RbM-SL)
Dataset    #Train   #Test   #Validation    #Positive   #Negative
Quora-I    100K     30K     3K             100K        160K
Quora-II   50K      30K     3K             50K         160K
Twitter    110K     5K      1K             10K         40K

4.3 Implementation Details

Generator: We maintain a fixed-size vocabulary of 5K shared by the words in input and output, and truncate all the sentences longer than 20 words. The model architecture, word embedding size and LSTM cell size are the same as reported in See et al. (2017). We use the Adagrad optimizer (Duchi et al., 2011) in the supervised pre-training and the Adam optimizer in the reinforcement learning, with a batch size of 80.
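For reference, the generator settings reported in this paragraph can be collected into a small configuration sketch; the key names are our own and do not come from a released codebase.

```python
# Illustrative configuration summarizing the reported generator settings.
generator_config = {
    "vocab_size": 5000,              # fixed vocabulary shared by input and output
    "max_sentence_len": 20,          # longer sentences are truncated
    "pretrain_optimizer": "Adagrad", # supervised pre-training (Duchi et al., 2011)
    "rl_optimizer": "Adam",          # reinforcement-learning fine-tuning
    "batch_size": 80,
}
```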

We also fine-tune the Seq2Seq baseline models with the Adam optimizer for a fair comparison. In supervised pre-training, we set the learning rate as 0.1 and the initial accumulator as 0.1. The maximum norm of the gradient is set as 2. During the RL training, the learning rate decreases to 1e-5 and the size of the Monte-Carlo sample is 4. To make the training more stable, we use the ground-truth with a reward of 0.1.

Evaluator: We use the pretrained GoogleNews 300-dimension word vectors³ for the Quora dataset and 200-dimension GloVe word vectors⁴ for the Twitter corpus. Other model settings are the same as in Parikh et al. (2016). For the evaluator in RbM-SL we set the learning rate as 0.05 and the batch size as 32. For the evaluator Mφ in RbM-IRL, the learning rate decreases to 1e-2, and we use a batch size of 80.

³ https://code.google.com/archive/p/word2vec/
⁴ https://nlp.stanford.edu/projects/glove/

We use the technique of reward rescaling as mentioned in Section 3.3 in training RbM-SL and RbM-IRL. In RbM-SL, we set δ1 as 12 and δ2 as 1. In RbM-IRL, we keep δ2 as 1 all the time and decrease δ1 from 12 to 3 and δ3 from 15 to 8 during curriculum learning. In RL-ROUGE, we take the exponential moving average of historical rewards as the baseline reward to stabilize the training:

b_m = \lambda Q_{m-1} + (1 - \lambda) b_{m-1}, \quad b_1 = 0,

where bm is the baseline b at iteration m, Q is the mean value of the reward, and we set λ as 0.1 by grid search.

Table 2: Performances on Quora datasets.
                      Quora-I                            Quora-II
Models             ROUGE-1  ROUGE-2   BLEU   METEOR   ROUGE-1  ROUGE-2   BLEU   METEOR
Seq2Seq             58.77    31.47    36.55   26.28    47.22    20.72    26.06   20.35
Residual LSTM       59.21    32.43    37.38   28.17    48.55    22.48    27.32   22.37
VAE-SVG-eq            -        -        -     25.50      -        -        -     22.20
Pointer-generator   61.96    36.07    40.55   30.21    51.98    25.16    30.01   24.31
RL-ROUGE            63.35    37.33    41.83   30.96    54.50    27.50    32.54   25.67
RbM-SL (ours)       64.39    38.11    43.54   32.84    57.34    31.09    35.81   28.12
RbM-IRL (ours)      64.02    37.72    43.09   31.97    56.86    29.90    34.79   26.67

Table 3: Performances on Twitter corpus.
Models             ROUGE-1  ROUGE-2   BLEU   METEOR
Seq2Seq             30.43    14.61    30.54   12.80
Residual LSTM       32.50    16.86    33.90   13.65
Pointer-generator   38.31    21.22    40.37   17.62
RL-ROUGE            40.16    22.99    42.73   18.89
RbM-SL (ours)       41.87    24.23    44.67   19.97
RbM-IRL (ours)      42.15    24.73    45.74   20.18

Table 4: Human evaluation on Quora datasets.
                      Quora-I               Quora-II
Models             Relevance  Fluency    Relevance  Fluency
Pointer-generator    3.23      4.55        2.34      2.96
RL-ROUGE             3.56      4.61        2.58      3.14
RbM-SL (ours)        4.08      4.67        3.20      3.48
RbM-IRL (ours)       4.07      4.69        2.80      3.53
Reference            4.69      4.95        4.68      4.90

4.4 Results and Analysis

Automatic evaluation: Table 2 shows the performances of the models on the Quora datasets. In both settings, we find that the proposed RbM-SL and RbM-IRL models outperform the baseline models in terms of all the evaluation measures. Particularly in Quora-II, RbM-SL and RbM-IRL make significant improvements over the baselines, which demonstrates their higher ability in learning for paraphrase generation. On the Quora dataset, RbM-SL is consistently better than RbM-IRL for all the automatic measures, which is reasonable because RbM-SL makes use of additional labeled data to train the evaluator. The Quora dataset contains a large number of high-quality non-paraphrases, i.e., pairs that are literally similar but semantically different, for instance “are analogue clocks better than digital” and “is analogue better than digital”. Trained with such data, the evaluator tends to become more capable in paraphrase identification. With additional evaluation on Quora data, the evaluator used in RbM-SL can achieve an accuracy of 87% on identifying positive and negative pairs of paraphrases.

Table 3 shows the performances on the Twitter corpus. Our models again outperform the baselines in terms of all the evaluation measures. Note that RbM-IRL performs better than RbM-SL in this case.
The reason might be that the evaluator of RbM-SL is not effectively trained with the relatively small dataset, while RbM-IRL can leverage its advantage in learning of the evaluator with less data.

In our experiments, we find that the training techniques proposed in Section 3.3 are all necessary and effective. Reward shaping is by default employed by all the RL based models. Reward rescaling works particularly well for the RbM models, where the reward functions are learned from data. Without reward rescaling, RbM-SL can still outperform the baselines but with smaller margins. For RbM-IRL, curriculum learning is necessary for its best performance. Without curriculum learning, RbM-IRL only has comparable performance with RL-ROUGE.

Human evaluation: We randomly select 300 sentences from the test data as input and generate paraphrases using different models. The pairs of paraphrases are then aggregated and partitioned into seven random buckets for seven human assessors to evaluate. The assessors are asked to rate each sentence pair according to the following two criteria: relevance (the paraphrase sentence is semantically close to the original sentence) and fluency (the paraphrase sentence is fluent as a natural language sentence, and the grammar is correct). Hence each assessor gives two scores to each paraphrase, both ranging from 1 to 5. To reduce the evaluation variance, there is a detailed evaluation guideline for the assessors in Appendix B. Each paraphrase is rated by two assessors, and the scores are then averaged as the final judgement. The agreement between assessors is moderate (kappa=0.44).

Table 4 shows the average ratings for each model, including the ground-truth references. Our models RbM-SL and RbM-IRL get better scores in terms of relevance and fluency than the baseline models, and their differences are statistically significant (paired t-test, p-value < 0.01). We note that in human evaluation, RbM-SL achieves the best relevance score while RbM-IRL achieves the best fluency score.

Case study: Figure 3 gives some examples of paraphrases generated by the models on Quora-II for illustration. The first and second examples show the superior performances of RbM-SL and RbM-IRL over the other models. In the third example, both RbM-SL and RbM-IRL capture accurate paraphrasing patterns, while the other models wrongly segment and copy words from the input sentence. Compared to RbM-SL, which makes the error of repeating the word scripting, RbM-IRL generates a more fluent paraphrase. The reason is that the evaluator in RbM-IRL is more capable of measuring the fluency of a sentence. In the fourth example, RL-ROUGE generates a totally nonsensical sentence, and the pointer-generator and RbM-IRL cover only half of the content of the original sentence, while RbM-SL successfully rephrases and preserves all the meaning. All of the models fail in the last example, because the word ducking is a rare word that never appears in the training data. The pointer-generator and RL-ROUGE generate totally irrelevant words such as the UNK token or victory, while RbM-SL and RbM-IRL still generate topic-relevant words.

5 Related Work

Neural paraphrase generation has recently drawn attention in different application scenarios. The task is often formalized as a sequence-to-sequence (Seq2Seq) learning problem. Prakash et al. (2016) employ a stacked residual LSTM network in the Seq2Seq model to enlarge the model capacity. Cao et al. (2017) utilize an additional vocabulary to restrict word candidates during generation. Gupta et al. (2018) use a variational auto-encoder framework to generate more diverse paraphrases. Ma et al. (2018) utilize an attention layer instead of a linear mapping in the decoder to pick up word candidates. Iyyer et al. (2018) harness syntactic information for controllable paraphrase generation. Zhang and Lapata (2017) tackle the similar task of sentence simplification with the Seq2Seq model coupled with deep reinforcement learning, in which the reward function is manually defined for the task. Similar to these works, we also pre-train the paraphrase generator within the Seq2Seq framework. The main difference lies in that we use another trainable neural network, referred to as the evaluator, to guide the training of the generator through reinforcement learning.

There is also work on paraphrase generation in different settings. For example, Mallinson et al. (2017) leverage bilingual data to produce paraphrases by pivoting over a shared translation in another language. Wieting et al. (2017); Wieting and Gimpel (2018) use neural machine translation to generate paraphrases via back-translation of bilingual sentence pairs. Buck et al. (2018) and Dong et al. (2017) tackle the problem of QA-specific paraphrasing with the guidance from an external QA system and an associated evaluation metric.
Figure 3: Examples of the generated paraphrases by different models on Quora-II.

Inverse reinforcement learning (IRL) aims to learn a reward function from expert demonstrations. Abbeel and Ng (2004) propose apprenticeship learning, which uses a feature-based linear reward function and learns to match feature expectations. Ratliff et al. (2006) cast the problem as structured maximum margin prediction. Ziebart et al. (2008) propose max entropy IRL in order to solve the problem of expert suboptimality. Recent work involving deep learning in IRL includes Finn et al. (2016b) and Ho et al. (2016). There does not seem to be much work on IRL for NLP. In Neu and Szepesvári (2009), parsing is formalized as a feature expectation matching problem. Wang et al. (2018) apply adversarial inverse reinforcement learning in visual storytelling. To the best of our knowledge, our work is the first that applies deep IRL to a Seq2Seq task.

Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) are a family of unsupervised generative models. A GAN contains a generator and a discriminator, respectively for generating examples from random noise and for distinguishing generated examples from real examples, and they are trained in an adversarial way. There are applications of GAN to NLP, such as text generation (Yu et al., 2017; Guo et al., 2018) and dialogue generation (Li et al., 2017). RankGAN (Lin et al., 2017) is the one most similar to RbM-IRL in that it employs a ranking model as the discriminator. However, RankGAN works for text generation rather than sequence-to-sequence learning, and the training of the generator in RankGAN relies on parallel data while the training of RbM-IRL can use non-parallel data. There are connections between GAN and IRL, as pointed out by Finn et al. (2016a) and Ho and Ermon (2016). However, there are significant differences between GAN and our RbM-IRL model. GAN employs the discriminator to distinguish generated examples from real examples, while RbM-IRL employs the evaluator as a reward function in RL. The generator in GAN is trained to maximize the loss of the discriminator in an adversarial way, while the generator in RbM-IRL is trained to maximize the expected cumulative reward from the evaluator.

6 Conclusion

In this paper, we have proposed a novel deep reinforcement learning approach to paraphrase generation, with a new framework consisting of a generator and an evaluator, modeled as a sequence-to-sequence learning model and a deep matching model respectively. The generator, which is for paraphrase generation, is first trained via sequence-to-sequence learning. The evaluator, which is for paraphrase identification, is then trained via supervised learning or inverse reinforcement learning in different settings. With a well-trained evaluator, the generator is further fine-tuned by reinforcement learning to produce more accurate paraphrases. The experimental results demonstrate that the proposed method can significantly improve the quality of paraphrase generation upon the baseline methods. In the future, we plan to apply the framework and training techniques to other tasks, such as machine translation and dialogue.

Acknowledgments

This work is supported by China National 973 Program 2014CB340301.
References

Pieter Abbeel and Andrew Y Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In ICML.
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In ICLR.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML.
Igor Bolshakov and Alexander Gelbukh. 2004. Synonymous paraphrasing using wordnet and internet. Natural Language Processing and Information Systems, pages 189–200.
Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. 2018. Ask the right questions: Active question reformulation with reinforcement learning. In ICLR.
Ziqiang Cao, Chuwei Luo, Wenjie Li, and Sujian Li. 2017. Joint copying and restricted generation for paraphrase. In AAAI.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. 2017. Learning to paraphrase for question answering. In EMNLP.
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. 2016a. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. NIPS 2016 Workshop on Adversarial Training.
Chelsea Finn, Sergey Levine, and Pieter Abbeel. 2016b. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In AAAI.
Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. In AAAI.
Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In NIPS.
Jonathan Ho, Jayesh Gupta, and Stefano Ermon. 2016. Model-free imitation learning with policy optimization. In ICML.
Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In NIPS.
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In NAACL.
David Kauchak and Regina Barzilay. 2006. Paraphrasing for automatic evaluation. In NAACL.
Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. In EMNLP.
Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation.
Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In EMNLP.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL-04 workshop.
Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In NIPS.
Shuming Ma, Xu Sun, Wei Li, Sujian Li, Wenjie Li, and Xuancheng Ren. 2018. Word embedding attention network: Generating words by querying distributed word representations for paraphrase generation. In NAACL.
Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with neural machine translation. In EACL.
Kathleen R McKeown. 1983. Paraphrasing questions using given and new information. Computational Linguistics, 9(1):1–10.
Shashi Narayan, Siva Reddy, and Shay B Cohen. 2016. Paraphrase generation from latent-variable PCFGs for semantic parsing. In INLG.
Gergely Neu and Csaba Szepesvári. 2009. Training parsers by inverse reinforcement learning. Machine Learning, 77(2):303–337.
Andrew Y Ng, Daishi Harada, and Stuart Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.
Aaditya Prakash, Sadid A Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016. Neural paraphrase generation with stacked residual LSTM networks. In COLING.
Chris Quirk, Chris Brockett, and William Dolan. 2004. Monolingual machine translation for paraphrase generation. In EMNLP.
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.
Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. 2006. Maximum margin planning. In ICML.
Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.
Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.
Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In ACL.
Richard Socher, Eric H Huang, Jeffrey Pennin, Christopher D Manning, and Andrew Y Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS.
Yu Su and Xifeng Yan. 2017. Cross-domain semantic parsing via paraphrasing. In EMNLP.
Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.
Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In NIPS.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
Oriol Vinyals and Quoc Le. 2015. A neural conversational model.
Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL.
John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In ACL.
John Wieting, Jonathan Mallinson, and Kevin Gimpel. 2017. Learning paraphrastic sentence embeddings from back-translated bitext. In EMNLP.
Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
Wei Wu, Zhengdong Lu, and Hang Li. 2013. Learning bilinear model for matching queries and documents. The Journal of Machine Learning Research, 14(1):2519–2548.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016. Neural generative question answering. In IJCAI.
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI.
Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In EMNLP.
Shiqi Zhao, Xiang Lan, Ting Liu, and Sheng Li. 2009. Application-driven statistical paraphrase generation. In ACL.
Shiqi Zhao, Cheng Niu, Ming Zhou, Ting Liu, and Sheng Li. 2008. Combining multiple resources to improve SMT-based paraphrasing model. In ACL.
Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. 2008. Maximum entropy inverse reinforcement learning. In AAAI.
A Algorithms of RbM-SL and RbM-IRL

Algorithm 1: Training Procedure of RbM-SL
Input: A corpus of paraphrase pairs {(X, Y)}, a corpus of non-paraphrase pairs {(X, Y−)}, a corpus of (non-parallel) sentences {X}.
Output: Generator Gθ′
1  Train the evaluator Mφ with {(X, Y)} and {(X, Y−)};
2  Pre-train the generator Gθ with {(X, Y)};
3  Init Gθ′ := Gθ;
4  while not converge do
5      Sample a sentence X = [x1, . . . , xS] from the paraphrase corpus or the non-parallel corpus;
6      Generate a sentence Ŷ = [ŷ1, . . . , ŷT] according to Gθ′ given input X;
7      Set the gradient gθ′ = 0;
8      for t = 1 to T do
9          Run N Monte Carlo simulations: {Ŷ1_{t+1:T}, . . . , ŶN_{t+1:T}} ∼ pθ′(Y_{t+1:T} | Ŷ_{1:t}, X);
10         Compute the value function by Qt = (1/N) Σ_{n=1}^{N} Mφ(X, [Ŷ_{1:t}, Ŷn_{t+1:T}]) if t < T, and Qt = Mφ(X, Ŷ) if t = T. Rescale the reward to Q̄t by (8);
11         Accumulate the θ′-gradient: gθ′ := gθ′ + ∇θ log pθ′(ŷt | Ŷ_{1:t−1}, X) Q̄t;
12     end
13     Update Gθ′ using the gradient gθ′ with learning rate γG: Gθ′ := Gθ′ + γG gθ′;
14 end
15 Return Gθ′

Algorithm 2: Training Procedure of RbM-IRL
Input: A corpus of paraphrase pairs {(X, Y)}, a corpus of (non-parallel) sentences {X}.
Output: Generator Gθ′, evaluator Mφ′
1  Pre-train the generator Gθ with {(X, Y)};
2  Init Gθ′ := Gθ and Mφ′;
3  while not converge do
4      while not converge do
5          Sample a sentence X = [x1, . . . , xS] from the paraphrase corpus;
6          Generate a sentence Ŷ = [ŷ1, . . . , ŷT] according to Gθ′ given input X;
7          Calculate the φ′-gradient: gφ′ := ∇φ J_{IRL-CL}(φ);
8          Update Mφ′ using the gradient gφ′ with learning rate γM: Mφ′ := Mφ′ − γM gφ′;
9      end
10     Train Gθ′ with Mφ′ as in Algorithm 1;
11 end
12 Return Gθ′, Mφ′
B Human Evaluation Guideline

Please judge the paraphrases according to the following two criteria:

(1) Grammar and Fluency: the paraphrase is acceptable as natural language text, and the grammar is correct;

(2) Coherent and Consistent: please view it from the perspective of the original poster: to what extent would the answer to the paraphrase be helpful for you with respect to the original question. Specifically, you can consider the following aspects:
– Relatedness: it should be topically relevant to the original question.
– Type of question: the type of the original question remains the same in the paraphrase.
– Informative: no information loss in the paraphrase.

For each paraphrase, give two separate scores ranging from 1 to 5. The meaning of each score is as follows:

• Grammar and Fluency
– 5: Without any grammatical error;
– 4: Fluent and has one minor grammatical error that does not affect understanding, e.g. what is the best ways to learn programming;
– 3: Basically fluent and has two or more minor grammatical errors or one serious grammatical error that does not have a strong impact on understanding, e.g. what some good book for read;
– 2: Cannot understand what it means but it is still in the form of human language, e.g. what is the best movie of movie;
– 1: Non-sense composition of words and not in the form of human language, e.g. how world war iii world war.

• Coherent and Consistent
– 5: Accurate paraphrase with exactly the same meaning as the source sentence;
– 4: Basically the same meaning as the source sentence but does not cover some minor content, e.g. what are some good places to visit in hong kong during summer → can you suggest some places to visit in hong kong;
– 3: Covers part of the content of the source sentence and has serious information loss, e.g. what is the best love movie by wong ka wai → what is the best movie;
– 2: Topic relevant but fails to cover most of the content of the source sentence, e.g. what is some tips to learn english → when do you start to learn english;
– 1: Topic irrelevant or one cannot even understand what it means.

There is a token [UNK] that stands for an unknown token in a paraphrase. Paraphrases that contain [UNK] should have both grammar and coherent scores lower than 5. The grammar score should depend on the other tokens in the paraphrase. The specific coherent score depends on the impact of [UNK] on that certain paraphrase. Here are some paraphrase examples given the original question how can robot have human intelligence ?:

• paraphrase: how can [UNK] be intelligent ?
coherent score: 1
This token prevents us from understanding the question and giving a proper answer. It causes serious information loss here;

• paraphrase: how can robot [UNK] intelligent ?
coherent score: 3
There is information loss, but the unknown token does not influence our understanding so much;

• paraphrase: how can robot be intelligent [UNK] ?
coherent score: 4
[UNK] basically does not influence understanding.

NOTED:

• Please decouple grammar and coherence as much as you can. For instance, given the sentence is it true that girls like shopping, the paraphrase do girls like go go shopping can get a coherent score of 5 but a grammar score of only 3. But for one you cannot even understand, e.g., how is the go shopping of girls, you should give both a low grammar score and a low coherent score, even if it contains some topic-relevant words.
• Do a Google search when you see any strange entity name, so that you can make a more appropriate judgement.
