
Semantically Coherent and Diversified Text

Generation using Inverse Reinforcement Learning

Rohit Sharma (Microsoft R&D, sharmarohit@microsoft.com)
Avinash Kumar (Microsoft R&D, avinash.kumar@microsoft.com)

Abstract
Generating semantically coherent text from small data on a specific topic is a challenging task in Natural Language Generation (NLG). Models trained via Maximum Likelihood Estimation (MLE) suffer from exposure bias and tend to memorize and generate implausible samples when trained on small data. Reinforcement Learning (RL) based Generative Adversarial Networks (GANs) overcome these shortcomings of MLE but, due to sparse rewards, need a large amount of data to train and to generate diverse texts. Inverse Reinforcement Learning (IRL) produces dense reward signals by inferring the reward function directly from expert demonstrations. However, recovering a precise reward function from a very small number of expert demonstrations that cover only a fraction of the state space is challenging. We use a data-centric approach to generate semantically coherent text from an extremely low number of training samples using IRL. We leverage the fact that related texts share a common structure, which can be used to learn a prior, and infer specific reward functions from limited text samples. Formulating IRL as a graph problem, we show theoretically and intuitively that learning a prior helps remove ambiguity, leading to smoother convergence and richer text generation. Extensive experiments on the BBC Sport News dataset show that the requirement for curated topic-specific data can be drastically reduced to 20-50 samples while still giving a significant improvement of 8.7% in data quality over other generative models.

1 Introduction
Small data is a natural outcome when events are infrequent. Getting a large amount of data is difficult due to time paucity, operational challenges, and cost considerations. Generative models, when trained on small data, memorize the limited number of samples and do not generalize well (Arora et al. [2]). The convergence of the loss during adversarial training is not stable when trained on small data. External sources of data from similar topics may not be completely coherent with the original topic intent and may bias the generative model. The problem of exposure bias is significant for Natural Language Generation with few samples. Methods like Scheduled Sampling (SS) (Bengio et al. [3]) do not solve this problem fundamentally (Huszár [8]). SeqGAN (Yu et al. [13]) addresses the exposure bias problem by modelling the data generator as a stochastic policy in reinforcement learning (RL) and uses policy updates to bypass the generator differentiation problem. The discriminator module of SeqGAN provides the reward signal only after judging the complete sequence, which causes reward sparsity. To overcome the reward sparsity of SeqGAN, Shi et al. [12] used Inverse Reinforcement Learning (IRL). IRL infers the reward function from expert demonstrations, and the generator learns an optimal policy to maximize the expected rewards. The reward function gives an instant reward for each state and action, thereby providing denser reward signals. However, less data leads to inaccurate inference, greater approximation error and reward ambiguity in IRL (Ng and Russell [9]). Recovering the exact reward function and generalizing from a limited number of samples that cover only a fraction of the state space is a difficult task for IRL.

We use a data-centric approach with IRL: we learn a prior on larger data through MLE. Learning a prior helps IRL leverage the common structure across different texts. IRL encodes these structures to infer an expressive reward function from very few text samples. We formulate text generation under IRL as a graph-traversal problem, with the nodes of the graph being the vocabulary, and use spectral graph theory (Chung [4]; Applegate and Kannan, 1993) to prove convergence bounds of the Markov Decision Process (MDP). Extensive experimentation on the BBC Sport News dataset (Greene and Cunningham [7]) shows a significant increase in the quality and diversity of the generated texts. We also create a low-latency pipeline to generate synthetic data using pre-trained language models.

2 Text generation using Inverse Reinforcement Learning


We follow the framework of Shi et al. [12], which uses IRL to improve the quality and diversity of the generated text over SeqGAN (Yu et al. [13]).
The text generation task can be regarded as generating a text sequence x_{1:T} = x_1, x_2, ..., x_T with a parameterized auto-regressive probabilistic model q_θ(x), where x_t is a word in a given vocabulary V. The generation model q_θ(x) is learned from a given dataset D = {x^{(n)}}_{n=1}^{N} with an underlying generating distribution p_data. The text sequence x_{1:T} = x_1, x_2, ..., x_T can be formulated as a trajectory of a Markov Decision Process (MDP), τ = {s_1, a_1, s_2, a_2, ..., s_T, a_T}. At each time step t, the model generates x_t according to a policy π_θ(a_t | s_t), where s_t is the current state given by the previous predictions x_{1:t} and a_t is the action of selecting the next word x_{t+1}. We use the Maximum Entropy IRL framework (Ziebart et al. [14]), in which trajectories are distributed proportionally to their exponentiated rewards; texts in the training set are assumed to be sampled from the distribution p_φ(τ):

p_\phi(\tau) = \frac{1}{Z} \exp\left(R_\phi(\tau)\right) \quad \text{where} \quad Z = \int_\tau \exp\left(R_\phi(\tau)\right) d\tau    (1)
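As a concrete illustration of this formulation, the sketch below (our own toy example, not the paper's code; the tiny vocabulary and per-step reward are assumptions) maps a token sequence to the MDP view of states as prefixes and actions as next words, and evaluates Eq. (1) exactly over a small set of candidate trajectories.

```python
# Minimal sketch: text sequence -> MDP trajectory, and Eq. (1) on a toy space.
# VOCAB and toy_reward are illustrative assumptions, not part of the paper.
import math
from itertools import product

VOCAB = ["<s>", "england", "claim", "series", "win"]

def to_trajectory(tokens):
    """State s_t = prefix x_{1:t}, action a_t = next word x_{t+1}."""
    traj = []
    for t in range(1, len(tokens)):
        state = tuple(tokens[:t])      # s_t: words generated so far
        action = tokens[t]             # a_t: next word chosen
        traj.append((state, action))
    return traj

def toy_reward(state, action):
    # Hypothetical per-step reward: favour continuing with unseen words.
    return 1.0 if action not in state else -1.0

def traj_reward(tokens):
    return sum(toy_reward(s, a) for s, a in to_trajectory(tokens))

# Eq. (1): p_phi(tau) ∝ exp(R_phi(tau)); here the partition function Z is
# computed exactly over all length-3 continuations of "<s>" for illustration.
candidates = [["<s>", *c] for c in product(VOCAB[1:], repeat=3)]
Z = sum(math.exp(traj_reward(c)) for c in candidates)
sample = ["<s>", "england", "claim", "win"]
print("p_phi(sample) =", math.exp(traj_reward(sample)) / Z)
```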

The objective of the reward approximator is to learn a reward function R_φ(τ) that explains expert behaviour by maximizing the log-likelihood of the samples in the training set,

\mathcal{L}_r(\phi) = \frac{1}{N} \sum_{n=1}^{N} \log p_\phi(\tau_n) = \frac{1}{N} \sum_{n=1}^{N} R_\phi(\tau_n) - \log \int_\tau \exp\left(R_\phi(\tau)\right) d\tau    (2)

The objective of the generator is to maximize the expected reward with entropy regularization H(q_θ(τ)),

\mathcal{L}_g(\theta) = \mathbb{E}_{\tau \sim q_\theta(\tau)}\left[R_\phi(\tau)\right] + H\left(q_\theta(\tau)\right) = \mathbb{E}_{\tau \sim q_\theta(\tau)}\left[R_\phi(\tau)\right] - \mathbb{E}_{q_\theta(\tau)}\left[\log q_\theta(\tau)\right]    (3)
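The sketch below is a minimal numpy illustration (our assumption about the estimator, not the released implementation) of how Eqs. (2) and (3) can be estimated from samples: the intractable partition integral in Eq. (2) is approximated by importance sampling from the generator q_θ, and Eq. (3) is a Monte Carlo average over generator samples. All reward and log-probability values are toy numbers standing in for R_φ(τ) and log q_θ(τ).

```python
# Minimal numpy sketch of Monte Carlo estimates of Eqs. (2) and (3).
import numpy as np

rng = np.random.default_rng(0)

def reward_loss(R_expert, R_gen, log_q_gen):
    """Eq. (2): mean expert reward minus log Z, where
    Z = integral of exp(R_phi(tau)) is approximated by importance sampling:
    Z ≈ (1/M) * sum_m exp(R_phi(tau_m)) / q_theta(tau_m), tau_m ~ q_theta."""
    log_w = R_gen - log_q_gen                               # log importance weights
    log_Z = np.logaddexp.reduce(log_w) - np.log(len(log_w))
    return np.mean(R_expert) - log_Z                        # maximise w.r.t. phi

def generator_objective(R_gen, log_q_gen):
    """Eq. (3): E_q[R_phi(tau)] + H(q_theta), estimated from generator samples."""
    return np.mean(R_gen) - np.mean(log_q_gen)

R_expert = rng.normal(1.0, 0.3, size=64)     # toy R_phi on expert texts
R_gen = rng.normal(0.2, 0.5, size=256)       # toy R_phi on generated texts
log_q_gen = rng.normal(-12.0, 1.0, size=256) # toy log q_theta on generated texts

print("L_r(phi)   ≈", reward_loss(R_expert, R_gen, log_q_gen))
print("L_g(theta) ≈", generator_objective(R_gen, log_q_gen))
```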

3 Graphical model for text generation


Consider the transition distribution p_s : S × S × A → [0, 1]. For a finite-length text sequence x_{1:T} = x_1, x_2, ..., x_T the trajectory τ = {s_1, a_1, s_2, a_2, ..., s_T, a_T} is also finite. Consider a weighted directed graph G = {V, E} with the tokens in the vocabulary as nodes V. A walk starting at a source node v_0 over this graph gives a path <v_0, v_1, v_2, ..., v_k> that represents a unique state s_r in IRL. Let Γ_k denote the set of paths starting at the node v_0 and ending at the node v_k. The probability P of a path γ_k is the product of the node transition probabilities of all the transitions in γ_k:

P(\gamma_k) = \prod_{t=1}^{k} p\left(v_t \mid \gamma_{t-1}\right)    (4)

The weight w_{v_i,v_j} of the edge e_{v_i,v_j} is the expected node transition probability from v_i to v_j,

w_{v_i, v_j} = \mathbb{E}_{\gamma_i \sim \Gamma_i}\left[P(\gamma_i) \times p\left(v_j \mid \gamma_i\right)\right]    (5)

where the transition probability P over each node v_i in the graph is

P_{v_i} = \sum_{\forall j : v_j \to v_i} P_{v_j} \times w_{i,j} = \sum_{\forall j : v_j \to v_i} P_{v_j} \times \mathbb{E}_{\gamma_i \sim \Gamma_i}\left[P(\gamma_i) \times p\left(v_j \mid \gamma_i\right)\right]    (6)

A text sequence x_{1:T} = x_1, x_2, ..., x_T, as in (Yu et al. [13]) and (Shi et al. [12]), can be generated by a walk v_{0:T} = <v_0, v_1, ..., v_T> over G starting from the source node v_0 and ending at the sink node v_T.
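A minimal sketch of this graph view is given below (our own illustration, with an assumed toy corpus): tokens become nodes, empirical bigram transitions serve as the directed edge weights in the spirit of Eq. (5), and generating a text corresponds to sampling a walk from the source node to the sink node.

```python
# Minimal sketch of the graph formulation: vocabulary nodes, weighted directed
# bigram edges, and text generation as a source-to-sink walk. Toy corpus only.
from collections import defaultdict
import random

corpus = [
    "<s> england claim historic series win </s>",
    "<s> england claim series win </s>",
    "<s> pakistan claim series win </s>",
]

# Directed edge weights: empirical next-word transition counts, normalised per
# source node (a simple stand-in for the expectation in Eq. (5)).
counts = defaultdict(lambda: defaultdict(int))
for line in corpus:
    toks = line.split()
    for a, b in zip(toks, toks[1:]):
        counts[a][b] += 1

graph = {
    v: {u: c / sum(nbrs.values()) for u, c in nbrs.items()}
    for v, nbrs in counts.items()
}

def walk(graph, source="<s>", sink="</s>", max_len=10, seed=0):
    """Sample a path <v_0, ..., v_T> from source to sink (a generated text)."""
    rng = random.Random(seed)
    path = [source]
    while path[-1] != sink and len(path) < max_len:
        nbrs = graph[path[-1]]
        path.append(rng.choices(list(nbrs), weights=list(nbrs.values()))[0])
    return path

print(walk(graph))
```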

In IRL, the reward R(τ) of a complete trajectory τ_n = {s_1, a_1, ..., s_T, a_T} is the sum of the rewards of the individual state-action pairs in the trajectory, R(τ) = Σ_t r(s_t, a_t). To calculate the expected reward R(τ), we sum the rewards over all possible actions:

\mathcal{L}_r(\phi) = \mathbb{E}_{\tau \sim G}\left[R_\phi(\tau)\right] = \mathbb{E}_{\tau \sim G}\left[\sum_t r_\phi(s_t, a_t)\right] = \sum_{v_i \sim V(G)} \left[\sum_{\forall v_j \in G} \mathbb{E}_{\gamma_i \sim G(\Gamma_i)}\left[r_\phi(\gamma_i, v_j)\right] \times w_{v_i, v_j}\right]    (7)
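As a small worked example of Eq. (7) (the graph, edge weights and expected per-step rewards below are made-up numbers, purely for illustration), the expected reward is a weighted sum over nodes and their outgoing edges:

```python
# Toy evaluation of Eq. (7): expected reward = sum over edges of
# E[r_phi(gamma_i, v_j)] * w_{v_i, v_j}. All numbers are assumed.
edge_weights = {           # w_{v_i, v_j}
    ("<s>", "england"): 0.6, ("<s>", "pakistan"): 0.4,
    ("england", "claim"): 1.0, ("pakistan", "slump"): 1.0,
}
expected_step_reward = {   # E_{gamma_i}[ r_phi(gamma_i, v_j) ]
    ("<s>", "england"): 0.9, ("<s>", "pakistan"): 0.7,
    ("england", "claim"): 1.2, ("pakistan", "slump"): 0.5,
}

L_r = sum(expected_step_reward[e] * w for e, w in edge_weights.items())
print("E[R_phi(tau)] ≈", L_r)   # Eq. (7) for this toy graph
```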

3.1 Mixing of Markov chains

The conductance of an undirected, weighted graph G is defined as

\phi_G = \min_{S \subset V(G)} \frac{\mathrm{cut}(S)\, \mathrm{vol}(G)}{\mathrm{vol}(S)\left(\mathrm{vol}(G) - \mathrm{vol}(S)\right)}    (8)

Although computing φ_G is NP-hard, the second-smallest eigenvalue λ̂_2 of the normalized Laplacian matrix L̂ = I − D^{-1/2} A D^{-1/2} of G bounds φ_G by the relation

\sqrt{2\hat{\lambda}_2} \ge \phi_G \ge \hat{\lambda}_2 / 2    (9)
Let σ = 1 − µ_{n−1} be the spectral gap and t* = 1/σ. The number of steps a Markov chain requires before it reaches a total variation distance ε from the stationary distribution π satisfies:

\left(t^* - 1\right) \log \frac{1}{2\epsilon} \;\le\; \text{mixing time} \;\le\; t^* \log \frac{1}{\epsilon\, \pi_{\min}}    (10)

Equivalent bounds with the Laplacian and Cheeger's inequality (under assumptions) exist for directed graphs G as well, making Markov chains on directed graphs rapidly mixing (Chung [4]).
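The numpy sketch below (our own check on an assumed toy chain, not part of the paper) computes the quantities used in this subsection: the normalized Laplacian and its second-smallest eigenvalue, the Cheeger bounds of Eq. (9), and the mixing-time bounds of Eq. (10) from the spectral gap of a lazy random walk.

```python
# Minimal numpy check of Eqs. (9)-(10) on an assumed small undirected graph.
import numpy as np

# Assumed symmetric, connected, weighted adjacency matrix (toy example).
A = np.array([[0, 2, 1, 0],
              [2, 0, 1, 1],
              [1, 1, 0, 2],
              [0, 1, 2, 0]], dtype=float)

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_hat = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian

lam2 = np.sort(np.linalg.eigvalsh(L_hat))[1]           # second-smallest eigenvalue
print("Cheeger bounds:", lam2 / 2, "<= phi_G <=", np.sqrt(2 * lam2))   # Eq. (9)

# Spectral gap of the lazy random walk P = (I + D^{-1} A) / 2, then Eq. (10).
P = 0.5 * (np.eye(len(A)) + A / deg[:, None])
mu = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]       # eigenvalue moduli, descending
sigma = 1.0 - mu[1]                                    # spectral gap
t_star = 1.0 / sigma                                   # relaxation time
pi = deg / deg.sum()                                   # stationary distribution
eps = 0.01
lower = (t_star - 1) * np.log(1 / (2 * eps))
upper = t_star * np.log(1 / (eps * pi.min()))
print(f"mixing time in [{lower:.1f}, {upper:.1f}] steps (eps={eps})")
```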

3.2 Learning a prior and convergence

We will use the following theorem to show that the Markov chain for IRL in Eq. (7) is mixing.

Theorem 1 (Applegate and Kannan, 1993). Let F(·) be a positive real-valued function defined on {x ∈ R^n | −d ≤ x_i ≤ d} for some positive d, satisfying, for all λ ∈ [0, 1] and some α, β ≥ 0,

|f(x) - f(y)| \le \alpha \left\lVert x - y \right\rVert_\infty
f(\lambda x + (1 - \lambda) y) \ge \lambda f(x) + (1 - \lambda) f(y) - \beta

where f(x) = log F(x). Then the Markov chain induced by PolicyWalk on F rapidly mixes to within ε of F in \tilde{O}\left(n^2 d^2 \alpha^2 e^{2\beta} \log \frac{1}{\epsilon}\right) steps.
We have a dataset D = {τ^{(n)}}_{n=1}^{N} with a small number of samples on a specific topic T, and a dataset D' = {τ'^{(n)}}_{n=1}^{N} on related topics T'. The data S = D' ∪ D is used to learn a prior G_θ via MLE. This inserts new nodes and edges into G to create an expanded graph G' ⊃ G. The prior learned from S bounds the policy to fewer probable actions per state, as shown by the √|E|/|V| values in Table 2 (random transitions would give √|E|/|V| = 1). Considering the generator G as the policy, bounding the number of probable actions per state bounds the reward variance at each state. Taking the disjoint union Γ = ⊔_{i∈V} Γ_i in Equation (7),

\sum_i \mathbb{E}_{\gamma_i \sim \Gamma_i}\left[r_\phi(\gamma_i, v_j)\right] \le |\Gamma| \times \mathbb{E}_{\gamma_i \sim \Gamma_i}\left[r_\phi(\gamma_i, v_j)\right] \le |\Gamma| \times \epsilon    (11)
Using Cauchy-Schwarz and Equation (11), we have

\mathcal{L}_r(\phi) = \mathbb{E}\left[\sum_{\forall v_j \in G} r_\phi(\gamma_i, v_j) \times w_{v_i, v_j}\right] \le \left|\sum_i \mathbb{E}_{\gamma_i \sim \Gamma}\left[r_\phi(\gamma_i, v_j)\right]\right| \times |w| \le |\Gamma| \times \epsilon \times w    (12)

Not only is L_r(φ) bounded, it is also α-Lipschitz in the ℓ_2 norm (and therefore in the ℓ_∞ norm) because of clipped weights (Arjovsky, Chintala, and Bottou [1]). As L_r(φ) is linear in r, we can use Theorem 1 to claim that the Markov chain induced by our PolicyWalk mixes rapidly (Ramachandran and Amir [10]), as shown in Figure 1. The prior decreases the uncertainty over the state-action space (as observed in Table 2). In the adversarial training phase, the reward approximator is trained on the original data D. This allows the generator to learn the common structure across D ∪ D' and to remove the noise in D'.
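The sketch below is a minimal PyTorch rendering of the prior-learning phase as we read it (the tiny LSTM generator and toy headlines are assumptions, and the subsequent adversarial IRL phase on D is omitted): the generator is first pretrained by MLE on the combined data S = D ∪ D', and only then enters adversarial training on the small topic-specific set D.

```python
# Minimal sketch of learning the prior: MLE pretraining of a small generator
# on S = D ∪ D'. The model, data and hyperparameters are illustrative only.
import torch
import torch.nn as nn

D = ["england claim historic series win"]                      # small topic data
D_prime = ["wales secure grand slam", "murray reaches final"]  # related topics
S = D + D_prime

vocab = sorted({w for s in S for w in s.split()} | {"<s>", "</s>"})
stoi = {w: i for i, w in enumerate(vocab)}
enc = [torch.tensor([stoi[w] for w in f"<s> {s} </s>".split()]) for s in S]

class Generator(nn.Module):
    def __init__(self, V, d=32):
        super().__init__()
        self.emb = nn.Embedding(V, d)
        self.rnn = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, V)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)                                     # next-token logits

g = Generator(len(vocab))
opt = torch.optim.Adam(g.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# MLE prior: maximise log q_theta(x_{t+1} | x_{1:t}) over S = D ∪ D'.
for epoch in range(50):
    for seq in enc:
        x, y = seq[:-1].unsqueeze(0), seq[1:]
        logits = g(x).squeeze(0)
        loss = loss_fn(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
# The pretrained g now serves as the policy prior for adversarial IRL on D.
```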

4 Experiments and Results
We experimented with the BBC Sport News dataset (Greene and Cunningham [7]). The dataset has 737 news headlines over 5 sports categories (athletics, cricket, football, rugby, tennis). We use cricket news as the topic T and the 124 cricket news headlines as the data D. The news headlines for T' = (athletics, football, rugby, tennis) form D'. (BERT (Devlin et al. [5]) can also be used to create D'.) Intra-sequence mode collapse (Fedus, Goodfellow, and Dai [6]) is proportional to the number of cycles in G, which can be approximated by the density √|E|/|V|. The density of the state-action graph can be approximated by sampling generated text and using bigrams as directed weighted edges. Density(G) = √|E|/|V| is shown in Table 2. For evaluation, we observe the quality of the most frequent n-grams (Table 2). We measured the mean predictive outcome on generated data from classifiers trained on D and D'. We also measured mean cosine similarity using Sentence-BERT (Reimers and Gurevych [11]), as well as BLEU_f, BLEU_b and BLEU_hf (Shi et al. [12]). Results are in Table 1.
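The sketch below shows one way to compute this density statistic from sampled generations (our assumption about the measurement procedure, with illustrative sample sentences): generations are turned into a directed bigram graph and √|E|/|V| is read off its node and edge counts.

```python
# Minimal sketch of the density statistic sqrt(|E|)/|V| from generated samples.
# The sample sentences are illustrative, not actual model output.
import math

generated = [
    "england claim historic series win",
    "england claim series win",
    "pakistan slump in twenty20 debut",
]

nodes, edges = set(), set()
for line in generated:
    toks = line.split()
    nodes.update(toks)
    edges.update(zip(toks, toks[1:]))      # directed bigram edges

density = math.sqrt(len(edges)) / len(nodes)
print(f"|V|={len(nodes)}  |E|={len(edges)}  sqrt(|E|)/|V|={density:.3f}")
```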

Table 1: Predictive modelling score and similarity metrics

Model            IRL     IRL with Prior
Random           16.8    16.8
SVM              22.5    24.46
KNN              27.40   34.52
Logistic         33.53   33.68
Cos-similarity   0.781   0.783
BLEU_f           0.506   0.572
BLEU_b           0.496   0.571
BLEU_hf          0.5     0.5715

[Figure 1: plot of Generator Loss vs. Adversarial Training Iterations, comparing "Training with Real Small Data" and "Learning the prior".]
Figure 1: Convergence is faster, with better reward at convergence, for IRL trained with a prior.

5 Conclusion
Using Data Centric approach to learn a prior from related datasets, which can be cerated using BERT,
resulted in better generation in terms of quality and diversity. Prior guides the IRL to encode common
structures across texts on similar topics, which improves convergence and generates semantically
coherent text from extremly-small amount of data. Graph G can be use to evaluate metric like
intra-sequence mode-collapse, mixing of markov chains and generation quality. The alternate


Table 2: Top-3 most frequent n-grams. (A random model has √|E|/|V| = 1)

n-gram   IRL (√|E|/|V| = 0.757)     Real Data (√|E|/|V| = 0.012)    IRL with prior (√|E|/|V| = 0.068)
2-gram   to Sri                     South Africa                    South Africa
         Sri Sri                    Sri Lanka                       ban by
         victory Sri                Set to                          Sri Lanka
3-gram   victory Sri Sri            tour World XI                   historic series win
         West to Sri                bowl at Wanderer                England claim historic
         to victory Sri             set to join                     claim historic series
4-gram   to victory Sri Sri         Flintoff fit to bowl            England claim historic series
         given fresh start for      set to join Somerset            claim historic series win
         Tudor given fresh start    action given all-clear          Pakistan slump in Twenty20 debut

References
[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein generative adversarial
networks”. In: International conference on machine learning. PMLR. 2017, pp. 214–223.
[2] Sanjeev Arora et al. “Generalization and equilibrium in generative adversarial nets (gans)”. In:
International Conference on Machine Learning. PMLR. 2017, pp. 224–232.
[3] Samy Bengio et al. “Scheduled sampling for sequence prediction with recurrent neural net-
works”. In: arXiv preprint arXiv:1506.03099 (2015).
[4] Fan Chung. “Laplacians and the Cheeger inequality for directed graphs”. In: Annals of Combi-
natorics 9.1 (2005), pp. 1–19.
[5] Jacob Devlin et al. “Bert: Pre-training of deep bidirectional transformers for language under-
standing”. In: arXiv preprint arXiv:1810.04805 (2018).
[6] William Fedus, Ian Goodfellow, and Andrew M Dai. “Maskgan: better text generation via
filling in the_”. In: arXiv preprint arXiv:1801.07736 (2018).
[7] Derek Greene and Pádraig Cunningham. “Practical Solutions to the Problem of Diagonal
Dominance in Kernel Document Clustering”. In: Proc. 23rd International Conference on
Machine learning (ICML’06). ACM Press, 2006, pp. 377–384.
[8] Ferenc Huszár. “How (not) to train your generative model: Scheduled sampling, likelihood,
adversary?” In: arXiv preprint arXiv:1511.05101 (2015).
[9] Andrew Y Ng, Stuart J Russell, et al. “Algorithms for inverse reinforcement learning.” In:
Icml. Vol. 1. 2000, p. 2.
[10] Deepak Ramachandran and Eyal Amir. “Bayesian Inverse Reinforcement Learning.” In: IJCAI.
Vol. 7. 2007, pp. 2586–2591.
[11] Nils Reimers and Iryna Gurevych. “Sentence-bert: Sentence embeddings using siamese bert-
networks”. In: arXiv preprint arXiv:1908.10084 (2019).
[12] Zhan Shi et al. “Toward diverse text generation with inverse reinforcement learning”. In: arXiv
preprint arXiv:1804.11258 (2018).
[13] Lantao Yu et al. “Seqgan: Sequence generative adversarial nets with policy gradient”. In:
Proceedings of the AAAI conference on artificial intelligence. Vol. 31. 1. 2017.
[14] Brian D Ziebart et al. “Maximum entropy inverse reinforcement learning.” In: Aaai. Vol. 8.
Chicago, IL, USA. 2008, pp. 1433–1438.
