You are on page 1of 5

Towards Universal Generation for Low Resource Languages using

Generative Adversarial Network

Anonymous ACL submission

Abstract et al. [2]). The convergence of loss during adver- 043


sarial training is not stable when trained on small 044
001 Generating semantically-coherent text for low data. External sources of data from similar top- 045
002 resource languages with small data is a chal- ics may not be completely coherent with the origi- 046
003 lenging task in Natural Language Generation
nal topic intent and may bias the generative model. 047
004 (NLG). Models trained via Maximum Likeli-
005 hood (MLE) suffers from exposure bias and Problem of exposure bias is significant for Natural 048

006 have a tendency to memorize and generate im- Language Generation with few samples. Methods 049
007 plausible samples when trained on small data. like Schedule Sampling (SS) (Bengio et al. [3]) do 050
008 Reinforcement Learning (RL) based Gener- not solve this problem fundamentally (Huszár [8]). 051
009 ative Adversarial Network (GAN) overcome Sequence-GAN (Yu et al. [13]) solves the Expo- 052
010 these shortcomings of MLE. Due to sparse re- sure bias problem by modelling the data genera- 053
011 wards, RL-GAN models need large amount of
tor as a stochastic policy in reinforcement learning 054
012 data to train and generate diverse texts. Inverse
013 Reinforcement Learning (IRL) produces dense
(RL) and uses policy update to bypasses the gen- 055

014 reward signals by inferring the reward function erator differentiation problem. The discriminator 056
015 directly from experts demonstrations. How- module of the SeqGAN provides the reward signal 057
016 ever, it is a challenging task to recover pre- only after judging the complete sequence, which 058
017 cise reward function from very low number of causes a problem of reward sparsity. To overcome 059
018 expert demonstrations that covers only a frac- the problem of reward sparsity in SeqGAN, (Shi 060
019 tion of state space. We generate semantically et al. [12]) used Inverse Reinforcement Learning 061
020 coherent text from extremely-low number of
(IRL). IRL infers the reward function from experts 062
021 training samples using IRL. We leverage the
022 fact that related texts share a common struc- demonstration and the Generator learns an optimal 063

023 ture, which can be used to learn a prior, and in- policy to maximize the expected rewards. The re- 064
024 fer specific reward functions from limited text ward function is an instant reward of each step and 065
025 samples. Formulating IRL as a graph prob- action thereby providing more dense reward sig- 066
026 lem, we theoretically and intuitively show that nals. However, less data leads to inaccurate in- 067
027 learning a prior helps in removing ambiguity ference, greater approximation error and reward 068
028 leading to smoother convergence and rich text
ambiguity in IRL (Ng and Russell [9]). Recov- 069
029 generation. Extensive experiments on BBC
030 News data show that the requirement for the
ering the exact reward function and generalizing 070

031 curated topic-specific data can be drastically from limited number of samples, that covers only 071
032 reduced to 20-50 samples still giving a signifi- a fraction of state space, is a difficult task for IRL. 072
033 cant improvement of 8.7% in data quality over We use Data Centric approach on IRL, to learn 073
034 the other Generative models. Our framework
035 is multilingual and works on any language.
a prior on larger data through MLE. Learning a 074
prior helps IRL to leverage the common structure 075

036 1 Introduction across different texts. IRL encodes these struc- 076
tures to infer an expressive reward function from 077
037 Small data is a natural outcome when events are very few text samples. We formulate the text gen- 078
038 not frequent. Getting large amount of data is dif- eration from IRL as a graph traversal problem with 079
039 ficult due to time paucity, operational challenges nodes of the graphs as the vocabulary and use 080
040 and cost considerations. Generative models, when spectral graph theory (Chung [5] and Applegate 081
041 trained on small data, memorize the limited num- and Kannan [1]) to prove the convergence bounds 082
042 ber of samples and do not generalize well (Arora of the Markov Decision Process (MDP). Exten- 083

1
084 sive experimentation on BBC Sport News dataset Objective of generator is to maximize the expected 120
085 (Greene and Cunningham [8]) shows a significant reward with entropy regularization H (q(τ )) 121
086 increase in the quality and diversity of the gener-
087 ated texts. We also create a low-latency pipeline Lg (θ) = Eτ ∼qθ (τ ) [Rφ (τ )] + H (qθ (τ ))
088 to generate synthetic data using pre-trained Lan- 122
= Eτ ∼qθ (τ ) [Rφ (τ )] − Eqθ (τ ) [log qθ (τ )]
089 guage Models.
(3)

090 2 Text generation using Inverse 3 Graphical model for text generation 123
091 Reinforcement Learning
Consider the transition distribution ps : S × 124
092 Following the framework in (Shi et al. [12]) which S × A → [0, 1]. For finite length text se- 125
093 uses IRL to improve the quality and diversity of quence x1:T = x1 , x2 , · · · , xT the trajectory 126
094 the generated text over SeqGAN (Yu et al. [13]): τ = {s1 , a1 , s2 , a2 . . . , sT , aT } will also be finite. 127
095 Text generation task can be regarded as the gener- Consider a weighted directed Graph G = {V, E} 128
096 ation of the text sequence x1:T = x1 , x2 , · · · , xT with nodes V as the tokens in vocabulary. A walk 129
097 with a parameterized auto-regressive probabilis- starting at source node v0 over this graph gives 130
098 tic model qθ (x), where xt is a word in a given a path < v0 , v1 , v2 , · · · , vk > that represents a 131
099 vocabulary V. The generation model qθ (x) is unique state sr in IRL. Lets denote Γk as the set of 132
N
learned from a given dataset D = x(n) n=1 with paths starting at the node v0 and ending at the node

100 133

101 an underlying distribution pdata . The text sequence vk . The probability P of a path γk is the product of 134

102 x1:T = x1 , x2 , · · · , xT can be formulated by the node transition probability of all the transitions 135

103 a trajectory of Markov-Decision Process (MDP) in γk 136

104 τ = {s1 , a1 , s2 , a2 . . . , sT , aT }. At each time-


105 step t, the model generates xt according a policy t=k
Y
p vt0 | γt−1

106 πθ (at | st ), where st is the current state of the pre- P (γk ) = (4) 137
t=1
107 vious prediction x1:t and at is the action to select
108 the next word xt+1 . Using the Maximum Entropy
109 IRL Framework (Ziebart et al. [14]) in which the The weight wvi ,vj of the edge evi ,vj is the Expected 138

110 trajectories are distributed proportionally to their node transition probability from vi to vj . 139

111 exponentiated rewards. Texts in training set are


112 assumed to be sampled from distribution pφ (τ ) wvi ,vj = Eγi ∼Γi [P (γi ) × p (vj | γi )] (5) 140

113 where,
where, the transition probability P over each node 141
vi in the Graph is 142

exp (Rφ (τ ))
pφ (τ ) = R (1)
X
114
Pvi = Pvj × wi,j
τ exp (Rφ (τ )) dτ
∀j:vj →vi
X 143
115 The objective of the reward approximator is to = Pvj × [Eγi ∼Γi [P (γi ) × p (vj | γi )]]
∀j:vj →vi
116 learn a reward function Rφ (τ ) that explains ex-
117 pert behaviour by maximizing the log-likehood of (6)
118 samples in training set,
Text Sequence x1:T = x1 , x2 , · · · , xT . in (Yu 144
et al. [13] and Shi et al. [12]) can be generated by 145

N
a walk v1:T =< v0 , v1 , · · · , vT > over G start- 146
1 X ing from Source node v0 and ending at the sink 147
Lr (φ) = log pφ (τn )
N node vT . In IRL The reward R(τ ) of the complete 148
n=1
119 trajectory τn = {s1 , a1 , . . . , sT , aT } is the sum of 149
N Z
1 X
reward of each individual state-action pair in this 150
= Rφ (τn ) − log exp (Rφ (τ )) dτ
N τ
P
trajectory R(τ ) = t r(st , at ). To calculate the 151
n=1
(2) expected reward R(τ ), we will sum the rewards 152

2
N
We have dataset D = τ (n) n=1 with small num-

153 over all possible actions. 181
" # ber of samples on specific topic T and D0 = 182
 0(n) N
on related similar topics T 0 . The Data
X
Lr (φ) = E [Rφ (τ )] = E rφ (st , at ) τ n=1
183
τ ∼G τ ∼G 0
S = D ∪ D is used to learn a prior Gθ via MLE. 184
t
This inserts new nodes and edges in G to create
 
154 185
an expanded graph G 0 ⊃ G. Prior from S bounds
X 
186

=  E rφ (γi , vj ) × wvi ,vj 
∀vj ∈G
vi ∼V(G) the policy √to fewer probable actions per state as 187
γi ∼G(Γi ) |E|
(7) shown by |V| value in table 1, (random transi- 188

|E|
155 3.1 Mixing of markov chains tions will have |V| = 1). Considering generator 189
G as policy, bounding the number of probable ac- 190
156 Conductance of an undirected and weighted graph tions per state, bounds the reward variance at each 191
157 G is defined as, state. Taking disjoint set union, Γ = i∈V Γi in equa- 192
(S) tion 7, 193
158 φG = min(G) . (8)
S⊂G (S)((G) − (S)) X
E rφ (γi , vj ) ≤ |Γ| × E rφ (γi , vj )
159 Although, computing φG is NP-hard. The second γi ∼Γi γi ∼Γi
i 194
160 smallest eigenvalue λ̂2 of The normalized Lapla- ≤ |Γ| × 
161 cian matrix L̂ = − D−1/2 A × D−1/2 of G bounds (11)
162 φG by the relation
q Using Cauchy-Schwartz and equation 11 we have, 195
163 2λ̂2 ≥ φG ≥ λ̂2 /2. (9)  
X 
164 Let σ be the spectral gap σ = 1 − µn−1 , Lr (φ) = E  rφ (γi , vj ) × wvi ,vj 
165 t∗ = 1/σ. The number of steps of a Markov chain ∀vj ∈G 196
required before it has reached a total variation dis-
X
166 ≤| E rφ (γi , vj )| × |w| ≤ |Γ| ×  × w
167 tance  away from the stationary distribution π sat- γi ∼Γ
i
168 isfies: (12)
1 1 ∗
169 (t∗ − 1) log ≤ -mixing time ≤ log t . Not only Lr (φ) is bounded, Lr (φ) is α- 197
2 πmin
Lipschitz in kLk2 (therefore, in kLkinf ) because 198
(10)
of clipped weights. ( Arjovsky, Chintala, and Bot- 199
170 Equivalent bounds with Laplacian and tou [1]). As Lr (φ) is linear in r We can use 200
171 Cheeger’s Inequality (under assumptions) ex- Theorem-1 to claim that Markov Chain induced 201
172 ist for directed Graphs G as well, making Markov by our PolicyWalk mixes rapidly (Ramachandran 202
173 chains on directed graphs rapidly mixing (Chung and Amir [10]), as shown in Figure 1. Prior de- 203
174 [4]). creases the uncertainty over the state-action space 204
(as observed in table 1). In adversarial training 205
175 3.2 Learning a prior and convergence phase, the reward-approximator is trained on orig- 206
176 We will use the following theorem to show that the inal data D. This allows the generator to learn the 207
177 Markov chain for IRL in equation 7 is mixing common structures across D ∪ D0 and remove the 208

Theorem 1 (Applegate and Kannan, 1993) Let noises in D0 . 209

F (·) be a positive real valued function defined


4 Experiments and results 210
on {x ∈ Rn | −d ≤ xi ≤ d} for some positive d,
satisfying for all λ ∈ [0, 1] and ∃ α, β ≥ 0 We experimented with BBC Sports News dataset 211
(Greene and Cunningham [7]). Dataset has 737 212
|f (x) − f (y)| ≤ αkx − yk∞ News Headlines over 5 different sports category 213

f (λx + (1 − λ)y) ≥ λf (x) + (1 − λ)f (y) − β (athletics, cricket, football, rugby, tennis). Us- 214
ing cricket news as topic T and 124 news Head- 215
178 where f (x) = log F (x). Then the Markov chain lines on cricket as data D. News headlines over 216
179 induced by PolicyWalk
 on F rapidly
 mixes to T 0 = (athletics, football, rugby and tennis) is D0 . 217
2 2 2 2β~ 1
180 within  of F in O n d α e log  steps (BERT (Devlin et al. [5]) can also be used to create 218

3

|E|
Table 1: Top-3 most frequent n-grams. (Random Policy will have |V| = 1)
√  √  √ 
|E| |E| |E|
(N)-Gram IRL |V| = 0.757 Real Data |V| = 0.012 IRL with prior |V| = 0.068
to Sri South Africa South Africa
2-Gram Sri Sri Sri Lanka ban by
victory Sri Set to Sri Lanka
victory Sri Sri tour World XI historic series win
3-Gram West to Sri bowl at Wanderer England claim historic
to victory Sri set to join claim historic series
to victory Sri Sri Flintoff fit to bowl England claim historic series
4-Gram given fresh start for set to join Somerset claim historic series win
Tudor given fresh start action given all-clear Pakistan slump in Twenty20 debut

Table 2: Predective Modelling outcome of Classifiers


and similarity metric
−5 Training ith Real Small Data
Learning the prior
Model IRL IRL with Prior −10
Random 16.8 16.8
Generator Loss

−15
SVM 22.5 24.46
KNN 27.40 34.52 −20

Logistic 33.53 33.68 −25


Cos-similarity 0.781 0.783
−30
BLEUf 0.506 0.572
0 20 40 60 80 100 120
BLEUb 0.496 0.571 Adversarial Training Iterations
BLEUhf 0.5 0.5715
Figure 1: Convergence of Generator Loss for Model
trained with prior is faster and more optimal
219 D0 ). Intra-sequence mode-collapse (Fedus, Good-
220 fellow, and Dai [6]) will be proportional to num-
resulted in better generation in terms of quality 241
221 ber of cycles
√ in (G), which can be approximated
|E|
and diversity. Prior guides the IRL to encode 242
222 by D = |V| . Density of the state-action graph common structures across texts on similar top- 243
223 can be approximated by sampling generated text ics, which improves convergence and generates 244
224 and using Bi-grams
√ as directed weighted edges. semantically coherent text from extremly-small 245
|E|
225 Density(G) = is shown in table 1. For eval- amount of data. Graph G can be use to evaluate 246
|V|
226 uation purpose, we observe the quality of most metric like intra-sequence mode-collapse, mixing 247
227 frequent n-grams (Table 1). We measured the of markov chains and generation quality. 248
228 mean predictive outcome on generated data from
229 binary classifiers trained on D and D0 . We also
230 measured mean cosine-similarity using Sentence-
231 BERT (Reimers et al. [11]), BLEUf , BLEUb and
232 BLEUhf (Shi et al. [12]). Results are in (table 2).
233 The predective modelling outcome of Various
234 Machine Learning Classifiers trained on the Data
235 D. The average of prediction probability that the
236 generated data point belongs to "Cricket" class is
237 shown in table 2

238 5 Conclusion
239 Using Data Centric approach to learn a prior from
240 related datasets, which can be created using BERT,

4
249 References [10] Deepak Ramachandran and Eyal Amir. 297
“Bayesian inverse reinforcement learning”. 298
250 [1] Martin Arjovsky, Soumith Chintala, and
In: Proceedings of the 20th international 299
251 Léon Bottou. “Wasserstein generative ad-
joint conference on Artifical intelligence. 300
252 versarial networks”. In: International con-
2007, pp. 2586–2591. 301
253 ference on machine learning. PMLR. 2017,
254 pp. 214–223. [11] Nils Reimers et al. “Sentence-BERT: Sen- 302
tence Embeddings using Siamese BERT- 303
255 [2] Sanjeev Arora et al. “Generalization and
Networks”. In: Proceedings of the 2019 304
256 equilibrium in generative adversarial nets
Conference on Empirical Methods in Nat- 305
257 (gans)”. In: International Conference on
ural Language Processing. Association for 306
258 Machine Learning. PMLR. 2017, pp. 224–
Computational Linguistics. 2019. 307
259 232.
[12] Zhan Shi et al. “Toward diverse text gener- 308
260 [3] Samy Bengio et al. “Scheduled sampling
ation with inverse reinforcement learning”. 309
261 for sequence prediction with recurrent Neu-
In: Proceedings of the 27th International 310
262 ral networks”. In: Proceedings of the 28th
Joint Conference on Artificial Intelligence. 311
263 International Conference on Neural Infor-
2018, pp. 4361–4367. 312
264 mation Processing Systems-Volume 1. 2015,
265 pp. 1171–1179. [13] Lantao Yu et al. “Seqgan: Sequence gener- 313
ative adversarial nets with policy gradient”. 314
266 [4] Fan Chung. “Laplacians and the Cheeger in-
In: Proceedings of the AAAI conference on 315
267 equality for directed graphs”. In: Annals of
artificial intelligence. Vol. 31. 1. 2017. 316
268 Combinatorics 9.1 (2005), pp. 1–19.
[14] Brian D Ziebart et al. “Maximum entropy 317
269 [5] Jacob Devlin et al. “BERT: Pre-training of
inverse reinforcement learning”. In: Pro- 318
270 Deep Bidirectional Transformers for Lan-
ceedings of the 23rd national conference 319
271 guage Understanding”. In: Proceedings of
on Artificial intelligence-Volume 3. 2008, 320
272 the 2019 Conference of the North American
pp. 1433–1438. 321
273 Chapter of the Association for Computa-
274 tional Linguistics: Human Language Tech-
275 nologies, Volume 1 (Long and Short Pa-
276 pers). 2019, pp. 4171–4186.
277 [6] William Fedus, Ian Goodfellow, and An-
278 drew M Dai. “MaskGAN: Better Text Gen-
279 eration via Filling in the _”. In: Interna-
280 tional Conference on Learning Representa-
281 tions. 2018.
282 [7] Derek Greene and Pádraig Cunningham.
283 “Practical Solutions to the Problem of Diag-
284 onal Dominance in Kernel Document Clus-
285 tering”. In: Proc. 23rd International Con-
286 ference on Machine learning (ICML’06).
287 ACM Press, 2006, pp. 377–384.
288 [8] Ferenc Huszár. “How (not) to Train your
289 Generative Model: Scheduled Sampling,
290 Likelihood, Adversary?” In: arXiv e-prints
291 (2015), arXiv–1511.
292 [9] Andrew Y Ng and Stuart J Russell. “Al-
293 gorithms for Inverse Reinforcement Learn-
294 ing”. In: Proceedings of the Seventeenth In-
295 ternational Conference on Machine Learn-
296 ing. 2000, pp. 663–670.

You might also like