Abstract

Generative models trained with Maximum Likelihood Estimation (MLE) have a tendency to memorize and generate implausible samples when trained on small data. Reinforcement Learning (RL) based Generative Adversarial Networks (GANs) overcome these shortcomings of MLE. However, due to sparse rewards, RL-GAN models need a large amount of data to train and to generate diverse texts. Inverse Reinforcement Learning (IRL) produces dense reward signals by inferring the reward function directly from expert demonstrations. Still, it is challenging to recover a precise reward function from a very low number of expert demonstrations that covers only a fraction of the state space. We generate semantically coherent text from an extremely low number of training samples using IRL. We leverage the fact that related texts share a common structure, which can be used to learn a prior and to infer specific reward functions from limited text samples. Formulating IRL as a graph problem, we show theoretically and intuitively that learning a prior helps remove ambiguity, leading to smoother convergence and rich text generation. Extensive experiments on BBC News data show that the requirement for curated topic-specific data can be drastically reduced to 20-50 samples while still giving a significant improvement of 8.7% in data quality over other generative models. Our framework is multilingual and works on any language.
1 Introduction
Small data is a natural outcome when events are not frequent. Getting a large amount of data is difficult due to time paucity, operational challenges and cost considerations. Generative models, when trained on small data, memorize the limited number of samples and do not generalize well (Arora et al. [2]). Exposure bias compounds this in Language Generation with few samples. Methods like Scheduled Sampling (SS) (Bengio et al. [3]) do not solve this problem fundamentally (Huszár [8]). SeqGAN (Yu et al. [13]) solves the exposure bias problem by modelling the data generator as a stochastic policy in reinforcement learning (RL) and uses policy updates to bypass the generator differentiation problem. The discriminator module of SeqGAN provides the reward signal only after judging the complete sequence, which causes reward sparsity. To overcome the reward sparsity of SeqGAN, Shi et al. [12] used Inverse Reinforcement Learning (IRL). IRL infers the reward function from expert demonstrations, and the generator learns an optimal policy that maximizes the expected reward. The reward function provides an instant reward for each step and action, giving much denser reward signals. However, less data leads to inaccurate inference, greater approximation error and reward ambiguity in IRL (Ng and Russell [9]). Recovering the exact reward function, and generalizing from a limited number of samples that covers only a fraction of the state space, is a difficult task for IRL.

We take a data-centric approach to IRL and learn a prior on larger data through MLE. Learning a prior helps IRL leverage the common structure shared across related texts; IRL encodes these structures to infer an expressive reward function from very few text samples. We formulate text generation with IRL as a graph traversal problem whose nodes are the vocabulary, and use spectral graph theory (Chung [4]; Applegate and Kannan) to prove convergence bounds for the underlying Markov Decision Process (MDP). Extensive experimentation on the BBC Sport News dataset (Greene and Cunningham [7]) shows a significant increase in the quality and diversity of the generated texts. We also build a low-latency pipeline to generate synthetic data using pre-trained Language Models.
2 Text generation using Inverse Reinforcement Learning
We follow the framework of Shi et al. [12], which uses IRL to improve the quality and diversity of the generated text over SeqGAN (Yu et al. [13]). The text generation task can be regarded as the generation of a text sequence $x_{1:T} = x_1, x_2, \cdots, x_T$ with a parameterized auto-regressive probabilistic model $q_\theta(x)$, where $x_t$ is a word in a given vocabulary $\mathcal{V}$. The generation model $q_\theta(x)$ is learned from a given dataset $D = \{x^{(n)}\}_{n=1}^{N}$ with an underlying distribution $p_{data}$. A text sequence $x_{1:T}$ can be viewed as a trajectory $\tau = \{s_1, a_1, \ldots, s_T, a_T\}$, where the state is the prefix generated so far and the action is the next word. Following maximum-entropy IRL (Ziebart et al. [14]), trajectories are distributed proportionally to their exponentiated rewards:

$$p_\phi(\tau) = \frac{\exp(R_\phi(\tau))}{\int_\tau \exp(R_\phi(\tau))\, d\tau} \quad (1)$$

The objective of the reward approximator is to learn a reward function $R_\phi(\tau)$ that explains expert behaviour by maximizing the log-likelihood of the samples in the training set,

$$\mathcal{L}_r(\phi) = \frac{1}{N}\sum_{n=1}^{N} \log p_\phi(\tau_n) = \frac{1}{N}\sum_{n=1}^{N} R_\phi(\tau_n) - \log \int_\tau \exp(R_\phi(\tau))\, d\tau \quad (2)$$

The objective of the generator is to maximize the expected reward with entropy regularization $H(q_\theta(\tau))$:

$$\mathcal{L}_g(\theta) = \mathbb{E}_{\tau \sim q_\theta(\tau)}[R_\phi(\tau)] + H(q_\theta(\tau)) = \mathbb{E}_{\tau \sim q_\theta(\tau)}[R_\phi(\tau)] - \mathbb{E}_{q_\theta(\tau)}[\log q_\theta(\tau)] \quad (3)$$
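The log-partition term in equation 2 is intractable over all text sequences, so in practice it is estimated with samples drawn from the generator. The following PyTorch sketch shows the shape of this update; it is illustrative rather than the authors' implementation, and `RewardNet`, the GRU scorer, and the logsumexp estimator of $\log Z$ are assumptions.

```python
# Minimal sketch of the reward-approximator update of Eq. (2):
# maximize expert reward minus an estimate of log Z, where log Z is
# approximated from generator samples (a common sample-based
# estimator in MaxEnt IRL; not the paper's exact procedure).
import math
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """r_phi(s_t, a_t): scores each step of a token trajectory."""
    def __init__(self, vocab_size: int, emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T) -> per-step rewards of shape (batch, T)
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h).squeeze(-1)

def reward_loss(reward_net: RewardNet,
                expert_tokens: torch.Tensor,
                generated_tokens: torch.Tensor) -> torch.Tensor:
    # R_phi(tau) = sum_t r_phi(s_t, a_t): dense, per-step rewards.
    r_expert = reward_net(expert_tokens).sum(dim=1)    # (B,)
    r_fake = reward_net(generated_tokens).sum(dim=1)   # (B,)
    # log Z ~= log mean exp R_phi over generator samples.
    log_z = torch.logsumexp(r_fake, dim=0) - math.log(r_fake.numel())
    # Negate Eq. (2) so that gradient descent maximizes it.
    return -(r_expert.mean() - log_z)
```

Minimizing `reward_loss` on minibatches of expert and generated trajectories would alternate with generator updates under equation 3.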
3 Graphical model for text generation

Consider the transition distribution $p_s : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$. For a finite-length text sequence $x_{1:T} = x_1, x_2, \cdots, x_T$ the trajectory $\tau = \{s_1, a_1, s_2, a_2, \ldots, s_T, a_T\}$ is also finite. Consider a weighted directed graph $\mathcal{G} = \{V, E\}$ whose nodes $V$ are the tokens of the vocabulary. A walk over this graph starting at a source node $v_0$ gives a path $\langle v_0, v_1, v_2, \cdots, v_k \rangle$ that represents a unique state $s_r$ in IRL. Let $\Gamma_k$ denote the set of paths starting at node $v_0$ and ending at node $v_k$. The probability of a path $\gamma_k$ is the product of the node transition probabilities of all its transitions, $P(\gamma_k) = \prod_{(v_i \to v_j) \in \gamma_k} w_{i,j}$, where $w_{i,j}$ is the node transition probability from $v_i$ to $v_j$. The transition probability $P$ over each node $v_i$ in the graph is

$$P_{v_i} = \sum_{\forall j : v_j \to v_i} P_{v_j} \times w_{j,i} = \sum_{\forall j : v_j \to v_i} P_{v_j} \times \mathbb{E}_{\gamma_i \sim \Gamma_i}\big[P(\gamma_i) \times p(v_j \mid \gamma_i)\big] \quad (6)$$

A text sequence $x_{1:T} = x_1, x_2, \cdots, x_T$, as in Yu et al. [13] and Shi et al. [12], can be generated by a walk $v_{1:T} = \langle v_0, v_1, \cdots, v_T \rangle$ over $\mathcal{G}$ starting at the source node $v_0$ and ending at the sink node $v_T$. In IRL the reward $R(\tau)$ of the complete trajectory $\tau_n = \{s_1, a_1, \ldots, s_T, a_T\}$ is the sum of the rewards of the individual state-action pairs in the trajectory, $R(\tau) = \sum_t r(s_t, a_t)$. To calculate the expected reward we sum the rewards over all possible actions:

$$\mathcal{L}_r(\phi) = \mathbb{E}_{\tau \sim \mathcal{G}}[R_\phi(\tau)] = \mathbb{E}_{\tau \sim \mathcal{G}}\Big[\sum_t r_\phi(s_t, a_t)\Big] = \sum_{\forall v_j \in \mathcal{G}} \mathbb{E}_{\substack{v_i \sim V(\mathcal{G}) \\ \gamma_i \sim \mathcal{G}(\Gamma_i)}}\big[r_\phi(\gamma_i, v_j) \times w_{v_i, v_j}\big] \quad (7)$$
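To make the graph construction concrete, here is a small sketch that builds the weighted token graph $\mathcal{G}$ from a corpus and reports the $\sqrt{|E|}/|V|$ branching statistic used in Table 1. Whitespace tokenization and bigram edges are simplifying assumptions; this is not the paper's implementation.

```python
# Build G = {V, E}: nodes are tokens, a directed edge (a, b) exists
# when token b follows token a, and w_ab is the empirical transition
# probability. sqrt(|E|)/|V| measures branching (1 for a random policy).
from collections import defaultdict
from math import sqrt

def build_token_graph(texts):
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for text in texts:
        tokens = text.split()
        vocab.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    weights = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        weights[a] = {b: c / total for b, c in nxt.items()}
    n_edges = sum(len(nxt) for nxt in counts.values())
    return weights, sqrt(n_edges) / max(len(vocab), 1)

headlines = ["England claim historic series win",
             "Flintoff fit to bowl"]
w, branching = build_token_graph(headlines)
```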
3.1 Mixing of Markov chains

The conductance of an undirected, weighted graph $\mathcal{G}$ is defined as

$$\phi_{\mathcal{G}} = \min_{S \subset V} \frac{w(\partial S)}{\min\big(\mathrm{vol}(S),\ \mathrm{vol}(\mathcal{G}) - \mathrm{vol}(S)\big)} \quad (8)$$

Computing $\phi_{\mathcal{G}}$ is NP-hard. However, the second-smallest eigenvalue $\hat{\lambda}_2$ of the normalized Laplacian matrix $\hat{L} = I - D^{-1/2} A D^{-1/2}$ of $\mathcal{G}$ bounds $\phi_{\mathcal{G}}$ by the relation

$$\sqrt{2\hat{\lambda}_2} \ \geq\ \phi_{\mathcal{G}} \ \geq\ \hat{\lambda}_2 / 2 \quad (9)$$

Let $\sigma = 1 - \mu_{n-1}$ be the spectral gap of the walk's transition matrix and let $t^* = 1/\sigma$. The number of steps a Markov chain requires before it reaches total variation distance $\epsilon$ from the stationary distribution $\pi$ satisfies

$$(t^* - 1)\,\log\frac{1}{2\epsilon} \ \leq\ \epsilon\text{-mixing time} \ \leq\ t^* \log\frac{1}{\epsilon\,\pi_{\min}} \quad (10)$$

Equivalent bounds via the Laplacian and Cheeger's inequality exist (under assumptions) for directed graphs $\mathcal{G}$ as well, making Markov chains on directed graphs rapidly mixing (Chung [4]).
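The spectral quantities above are directly computable for a small graph. The numpy sketch below, assuming a connected undirected graph given by a symmetric weight matrix, computes $\hat{\lambda}_2$ of the normalized Laplacian, the Cheeger bounds of equation 9, and the mixing-time bounds of equation 10.

```python
# Spectral bounds for a random walk on an undirected weighted graph.
import numpy as np

def spectral_mixing_bounds(A: np.ndarray, eps: float = 0.25):
    """A: symmetric weight matrix of a connected undirected graph."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    # Normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt
    lam2 = np.sort(np.linalg.eigvalsh(L))[1]      # second-smallest eigenvalue
    cheeger = (lam2 / 2.0, np.sqrt(2.0 * lam2))   # Eq. (9) bounds

    # Random-walk transition matrix, spectral gap, relaxation time.
    P = np.diag(1.0 / deg) @ A
    mu = np.sort(np.linalg.eigvals(P).real)
    sigma = 1.0 - mu[-2]
    t_star = 1.0 / sigma
    pi = deg / deg.sum()                          # stationary distribution
    t_mix = ((t_star - 1.0) * np.log(1.0 / (2.0 * eps)),  # Eq. (10) lower
             t_star * np.log(1.0 / (eps * pi.min())))     # Eq. (10) upper
    return cheeger, t_mix
```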
3.2 Learning a prior and convergence

We use the following theorem to show that the Markov chain for IRL in equation 7 is rapidly mixing.

Theorem 1 (Ramachandran and Amir [10]). Let $F(x)$ be a positive real-valued function over a cube of side $d$ in $\mathbb{R}^n$ such that $f(x) = \log F(x)$ is $\alpha$-Lipschitz and satisfies

$$f(\lambda x + (1-\lambda)y) \ \geq\ \lambda f(x) + (1-\lambda) f(y) - \beta$$

Then the Markov chain induced by PolicyWalk on $F$ rapidly mixes to within $\epsilon$ of $F$ in $O\big(n^2 d^2 \alpha^2 e^{2\beta} \log\frac{1}{\epsilon}\big)$ steps.

We have a dataset $D = \{\tau^{(n)}\}_{n=1}^{N}$ with a small number of samples on a specific topic $T$, and a dataset $D' = \{\tau'^{(n)}\}_{n=1}^{N}$ on related similar topics $T'$. The data $S = D \cup D'$ is used to learn a prior $G_\theta$ via MLE. This inserts new nodes and edges into $\mathcal{G}$, creating an expanded graph $\mathcal{G}' \supset \mathcal{G}$. The prior from $S$ bounds the policy to fewer probable actions per state, as shown by the $\sqrt{|E|}/|V|$ values in Table 1 (random transitions would give $\sqrt{|E|}/|V| = 1$). Considering the generator $G$ as the policy, bounding the number of probable actions per state bounds the reward variance at each state. Taking the disjoint set union $\Gamma = \sqcup_{i \in V}\, \Gamma_i$ in equation 7,

$$\sum_i \mathbb{E}_{\gamma_i \sim \Gamma_i}\big[r_\phi(\gamma_i, v_j)\big] \ \leq\ |\Gamma| \times \mathbb{E}_{\gamma \sim \Gamma}\big[r_\phi(\gamma, v_j)\big] \ \leq\ |\Gamma| \times \bar{r} \quad (11)$$

where $\bar{r}$ bounds the per-step reward. Using Cauchy-Schwarz and equation 11 we have

$$\mathcal{L}_r(\phi) = \sum_{\forall v_j \in \mathcal{G}} \mathbb{E}\big[r_\phi(\gamma_i, v_j) \times w_{v_i, v_j}\big] \ \leq\ \Big|\sum_i \mathbb{E}_{\gamma_i \sim \Gamma}\, r_\phi(\gamma_i, v_j)\Big| \times |\bar{w}| \ \leq\ |\Gamma| \times \bar{r} \times \bar{w} \quad (12)$$

Not only is $\mathcal{L}_r(\phi)$ bounded; it is also $\alpha$-Lipschitz in $\|\cdot\|_2$ (and therefore in $\|\cdot\|_\infty$) because of the clipped weights (Arjovsky, Chintala, and Bottou [1]). As $\mathcal{L}_r(\phi)$ is linear in $r$, we can use Theorem 1 to claim that the Markov chain induced by our PolicyWalk mixes rapidly (Ramachandran and Amir [10]), as shown in Figure 1. The prior decreases the uncertainty over the state-action space (as observed in Table 1). In the adversarial training phase, the reward approximator is trained on the original data $D$. This allows the generator to learn the common structures across $D \cup D'$ and remove the reward ambiguity.

4 Experiments

We experiment on the BBC Sport News dataset (Greene and Cunningham [7]), which covers five topics (athletics, cricket, football, rugby, tennis). We use cricket news as the topic $T$ and 124 news headlines on cricket as the data $D$. News headlines over $T'$ = (athletics, football, rugby, tennis) form $D'$. BERT (Devlin et al. [5]) can also be used to create such related datasets.
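The paper only notes that BERT can be used to create the related dataset; one plausible recipe, sketched below with Sentence-BERT (Reimers et al. [11]), keeps headlines from neighbouring topics that embed close to the small topic-specific set $D$. The model name and similarity threshold are assumptions.

```python
# Assemble D' from candidate headlines by semantic similarity to D.
from sentence_transformers import SentenceTransformer, util

def select_related(d_headlines, candidate_headlines, threshold=0.35):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
    d_emb = model.encode(d_headlines, convert_to_tensor=True)
    c_emb = model.encode(candidate_headlines, convert_to_tensor=True)
    # For each candidate, its best cosine similarity against D.
    sims = util.cos_sim(c_emb, d_emb).max(dim=1).values.tolist()
    return [h for h, s in zip(candidate_headlines, sims) if s >= threshold]
```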
Table 1: Top-3 most frequent n-grams. (A random policy would have √|E|/|V| = 1.)

(N)-Gram | IRL (√|E|/|V| = 0.757)  | Real Data (√|E|/|V| = 0.012) | IRL with prior (√|E|/|V| = 0.068)
2-Gram   | to Sri                  | South Africa                 | South Africa
         | Sri Sri                 | Sri Lanka                    | ban by
         | victory Sri             | Set to                       | Sri Lanka
3-Gram   | victory Sri Sri         | tour World XI                | historic series win
         | West to Sri             | bowl at Wanderer             | England claim historic
         | to victory Sri          | set to join                  | claim historic series
4-Gram   | to victory Sri Sri      | Flintoff fit to bowl         | England claim historic series
         | given fresh start for   | set to join Somerset         | claim historic series win
         | Tudor given fresh start | action given all-clear       | Pakistan slump in Twenty20 debut
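Statistics like those in Table 1 can be reproduced with a few lines; the helper below (illustrative, with whitespace tokenization assumed) returns the top-k most frequent n-grams of a set of texts.

```python
# Top-k most frequent n-grams, as reported in Table 1.
from collections import Counter

def top_ngrams(texts, n=2, k=3):
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return [" ".join(g) for g, _ in counts.most_common(k)]
```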
[Fragment of a results table recovered from extraction: SVM: 22.5, 24.46; KNN: 27.40, 34.52.]
5 Conclusion

Using a data-centric approach, we learn a prior from related datasets, which can be created using BERT. This drastically reduces the requirement for curated topic-specific data, to 20-50 samples, while giving an 8.7% improvement in data quality over other generative models.
References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks". In: International Conference on Machine Learning. PMLR, 2017, pp. 214-223.

[2] Sanjeev Arora et al. "Generalization and equilibrium in generative adversarial nets (GANs)". In: International Conference on Machine Learning. PMLR, 2017, pp. 224-232.

[3] Samy Bengio et al. "Scheduled sampling for sequence prediction with recurrent neural networks". In: Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 1. 2015, pp. 1171-1179.

[4] Fan Chung. "Laplacians and the Cheeger inequality for directed graphs". In: Annals of Combinatorics 9.1 (2005), pp. 1-19.

[5] Jacob Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 4171-4186.

[6] William Fedus, Ian Goodfellow, and Andrew M. Dai. "MaskGAN: Better Text Generation via Filling in the _". In: International Conference on Learning Representations. 2018.

[7] Derek Greene and Pádraig Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering". In: Proceedings of the 23rd International Conference on Machine Learning (ICML'06). ACM Press, 2006, pp. 377-384.

[8] Ferenc Huszár. "How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?" In: arXiv e-prints (2015), arXiv:1511.

[9] Andrew Y. Ng and Stuart J. Russell. "Algorithms for Inverse Reinforcement Learning". In: Proceedings of the Seventeenth International Conference on Machine Learning. 2000, pp. 663-670.

[10] Deepak Ramachandran and Eyal Amir. "Bayesian inverse reinforcement learning". In: Proceedings of the 20th International Joint Conference on Artificial Intelligence. 2007, pp. 2586-2591.

[11] Nils Reimers et al. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019.

[12] Zhan Shi et al. "Toward diverse text generation with inverse reinforcement learning". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2018, pp. 4361-4367.

[13] Lantao Yu et al. "SeqGAN: Sequence generative adversarial nets with policy gradient". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 31, No. 1. 2017.

[14] Brian D. Ziebart et al. "Maximum entropy inverse reinforcement learning". In: Proceedings of the 23rd National Conference on Artificial Intelligence, Volume 3. 2008, pp. 1433-1438.