Journal of Intelligent & Fuzzy Systems xx (20xx) x–xx
DOI: 10.3233/JIFS-191690
IOS Press

Topic modeling in short-text using non-negative matrix factorization based on deep reinforcement learning

Zeinab Shahbazi*, Faisal Jamil and Yungcheol Byun
Department of Computer Engineering, Jeju National University, Jejusi, Jeju Special Self-Governing Province, Korea

Abstract. Topic modeling for short texts is a challenging and interesting problem in the machine learning and knowledge discovery domains. Nowadays, millions of documents are published on the internet from various sources. Internet websites are full of diverse topics and information, but there is a lot of similarity between topics, contents, and the overall quality of sources, which causes data repetition and gives the user the same information. Another issue is data sparsity and ambiguity: because the length of a short text is limited, results are often unsatisfactory and irrelevant to end-users. All of these issues make short texts an interesting subject for researchers applying machine learning and knowledge discovery techniques to discover the underlying topics in a massive amount of data. In this paper, we propose a combination of deep reinforcement learning (RL) and a semantics-assisted non-negative matrix factorization (SeaNMF) model to extract meaningful underlying topics from short document contents. The main objective of this work is to reduce the problems of repetitive information and data sparsity in short texts and thereby help users obtain meaningful and relevant content. Furthermore, our proposed model reviews an issue of the Seq2Seq approach from the reinforcement learning perspective and combines reinforcement learning with the SeaNMF formulation using the block coordinate descent algorithm. Moreover, we evaluate the model on different real-world datasets and compare it against several state-of-the-art models for short-text topic modeling. Based on the experimental results and comparative analysis, our proposed model outperforms the state-of-the-art techniques in short-document topic modeling.

Keywords: Topic modeling, knowledge discovery, short text, non-negative matrix factorization, machine learning

1. Introduction

In the past two decades, topic modeling for short text documents has become an active area in machine learning and natural language processing. Every day a large quantity of short text documents is distributed on social media and other sources, e.g., emails, search queries, image tags, advertisements, headlines, status messages, etc. These sources provide valuable data for extracting the user's topic interests and driving better decisions. Hence, the automatic identification and extraction of valuable topics from short text data is a fundamental problem in a wide range of applications, such as context analysis [1] and text mining and classification [2]. Nowadays, knowledge discovery has become a fundamental and challenging task that attracts a lot of attention from researchers all over the world [1–5]. The length of a short text is limited, which causes data sparsity and ambiguity. Sparsity and noise in short text documents are major problems in knowledge discovery from a large corpus and make the process of analyzing the underlying knowledge difficult.

*Corresponding author. Zeinab Shahbazi, Department of Computer Engineering, Jeju National University, Jejusi 63243, Jeju Special Self-Governing Province, Korea. E-mail: zeinab.sh@jejunu.ac.kr.

There are different topic identification techniques available to uncover and unearth content from short texts.

Topic identification techniques are used to discover semantically meaningful and valuable topics from a massive amount of data [6–8]. These techniques are capable of identifying underlying hidden patterns (topics) that explain and identify the similarity between documents. Another aspect of topic modeling is the purification of document clustering by an unsupervised machine learning strategy. This process is the opposite of document clustering: in the topic modeling procedure, many topics can occur in an individual document, but topics that are frequent across documents carry more weight in the training set. Topic modeling is frequently divided into two groups, i.e., the first group is known as non-negative matrix factorization (NMF) [9], and the second group is known as latent Dirichlet allocation (LDA) [6]. The NMF topic model is a matrix-based model that directly reads the topics by breaking down the term-document matrix, which represents the bag-of-words of a text corpus, into low-rank matrix factors. A low-rank matrix is a machine learning technique and a way to understand the recommendation system process, e.g., a given matrix Z where Z_{A,B} denotes the score that teacher A gives to student B. LDA is a statistical model for identifying and extracting hidden topics that occur in a set of documents.

To show the prominent performance of NMF-based models, clustering and dimension reduction techniques are used on high-dimensional datasets [10–12]. Regular documents primarily consist of a consistent document structure and compact data contents. This document class exhibits the XML view of relational data. The data range of those documents is within 40 and 60%. Irregular documents mainly consist of materials that have intense, complicated, and improper structures. This document class primarily focuses on evaluating the capability of compressing improperly structured data of XML documents. In the case of the regular type of documents, conventional topic models achieve great results, but this process does not work well for short text document collections. The reason for the high dimensionality of a short document set is that a short text does not carry enough information, so extracting meaningful keywords is very difficult [1, 3]. Various methods have been presented in the recent past to investigate topic modeling of short text messages and to overcome this challenge, and the current methods combine short text documents into a pseudo-type of the document and detect word co-occurrences [3, 4, 13, 14]. But the topics generated by using this strategy are biased by the pseudo-texts, which makes the results exploratory.

In particular, the probability of combining mostly incoherent short texts into pseudo-texts is high. Another type of challenge in short texts is the lack of words, or the limited length of the text. The internal semantic relationship between words has been proposed to extract information from short text documents; based on this strategy, the problem of the lack of words in the text is largely settled. The main reason to propose this strategy is the fact that the semantic information of words in a document can be efficiently obtained with a neural network system, which relies on word embedding techniques, e.g., word2vec [15] and GloVe [16].

Different word embedding techniques have been presented [17–19] to discover underlying hidden topics from short texts based on the usage of words from various sources, e.g., word embedding techniques based on Google News and Wikipedia. The semantic differences between words in Wikipedia and in short text documents cause noise in the topics. Nevertheless, word embedding can be an effective process in modeling short text topics because it captures a huge number of words with similar semantic properties, which is very efficient in achieving better performance for clustering topic models. Another way of raising the performance of topic models is skip-gram with negative sampling (SGNS). This methodology is useful to obtain the relevance between words using a small sliding window [15, 20]. SGNS is very useful because each window can represent an individual document. In that case, word-context semantic relationships are captured by SGNS in an efficient way. The relationship between words can also be seen in the form of word co-occurrence.

This system potentially dominates the difficulty of data sparsity. Existing studies [21, 22] show that the SGNS algorithm is equivalent to factorizing a word-context relationship matrix. From this observation, we raise some questions about this technique: 1) Is it possible to transform the matrix factorization problem into a non-negative matrix factorization problem? 2) Is there any way to combine these outputs with the conventional NMF of the term-document matrix? 3) Is this model suitable to detect underlying topics from short texts?
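As a concrete illustration of the word-context matrix view mentioned above (our own sketch, not part of the proposed system), the snippet below builds a positive pointwise mutual information (PPMI) word-context matrix from a toy corpus, treating each short text as one skip-gram window; this is the kind of non-negative matrix that SGNS is known to implicitly factorize and that an NMF-style solver can decompose. The helper name and the toy corpus are assumptions.

```python
# A minimal sketch (not the authors' implementation): build a PPMI word-context
# matrix in which each short text acts as one skip-gram window.
from collections import Counter
from itertools import permutations
import numpy as np

corpus = ["harvard university of engineering",
          "oxford university of science"]          # toy short texts

docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})
idx = {w: i for i, w in enumerate(vocab)}

pair_counts = Counter()
for doc in docs:
    # every ordered word pair inside one short text co-occurs once
    for w, c in permutations(doc, 2):
        pair_counts[(idx[w], idx[c])] += 1

M = len(vocab)
C = np.zeros((M, M))
for (i, j), n in pair_counts.items():
    C[i, j] = n

total = C.sum()
p_w = C.sum(axis=1, keepdims=True) / total        # word marginals
p_c = C.sum(axis=0, keepdims=True) / total        # context marginals
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (p_w @ p_c))
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)

print(np.round(ppmi, 2))   # non-negative word-context matrix ready for factorization
```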

In this paper, we propose a novel semantics-assisted model based on RL and SeaNMF for short-text topic modeling, which is used to extract topics from a large corpus. This work uses and analyzes four real-world categories of short text documents, namely article headlines, question answers, microblogs, and news. An NMF-based topic modeling technique is used to discover underlying topics from a large volume of data. We analyzed and investigated the semantic relationship between keywords and document information using the skip-gram model, which shows its efficiency in disclosing the semantic relation of words. The reinforcement learning system is applied to obtain the reward function from the algorithm on the training set for input-specific RL and to compare it on the test set with specific RL. Experimental results show the significance of the proposed model compared with existing methods, e.g., in document classification and topic coherence accuracy. Finally, we performed a comparative analysis of the proposed model with state-of-the-art techniques. By comparing the keywords of different topics and analyzing their network, the proposed system shows that the topics extracted from the benchmark datasets are meaningful, and the keywords extracted using this model are more semantically associated. Therefore, the proposed RL+SeaNMF is an effective topic model for short texts.

The remainder of this paper is organized into five sections as follows: Section 2 explains the related work on topic modeling in short texts based on the majority of state-of-the-art topic models. Section 3 presents the detailed system methodology, design architecture, and overall transaction process of the proposed system. Section 4 elaborates on the implementation and results of the proposed system, and Section 5 concludes the paper.

2. Literature review

In this section, we review some of the existing studies on topic modeling. This section is categorized into three parts, i.e., topic modeling for short text, topic modeling for short text using non-negative matrix factorization, and sequence to sequence (Seq2Seq) for topic modeling. At the end of this section, we discuss the novelty of our proposed model.

2.1. Topic modeling for short text

Traditional topic modeling algorithms, e.g., LDA and PLSA, are used to find hidden and meaningful topics in a set of documents using a pattern like word co-occurrence. These approaches have been used in various tasks and work well for normal text. Nevertheless, traditional topic model approaches have many shortcomings in terms of data sparsity, which occurs when the size of the document is too short. Many studies focus on improving the quality of topic discovery in short text to overcome the problem of data sparsity. For instance, in [23] the authors proposed a system that finds hidden topics with the help of a large external data source, e.g., Wikipedia. The aim of the presented framework is to find hidden patterns in sparse and short texts and web segments. Similarly, in [24], Jin et al. presented an approach to cluster short texts in order to find the topically related text from long auxiliary text data. The proposed Dual Latent Dirichlet Allocation (DLDA) approach combined the topic parameters from the long text and short text and removed the data inconsistency between the datasets. These studies use a large set of data in order to improve the quality of topic representation. Numerous ingenious approaches extract latent topics using a large set of data by aggregating short text documents. For example, the study [25] uses an LDA model on short texts produced by the same user. The main aim of this study is to find influential users on social networking and blogging sites, e.g., Twitter. Likewise, Hong and Davison [3] develop two separate aggregated short text approaches that contain each word in the vocabulary of a large set of data, including text authors. In [26], the authors design a scheme to produce pseudo-documents for LDA data preprocessing using tweet pooling. Another study [27] presented a word co-occurrence based approach named WNTM to improve the sparsity and imbalance issues by generating pseudo-documents in a word network. The developed scheme greatly improved the semantic compactness of the data without introducing extra space and time complexity. Zuo et al. [14] also developed a scheme to generate pseudo-documents using the topic model. The presented approach uses the short texts by leveraging fewer pseudo-documents to infer the latent pseudo-documents. Kou et al. [28] presented a multi-feature probabilistic graphical approach to combine short text into long text in order to improve the search task on social networking sites. The developed MFPGM is also used to overcome the problem of semantic sparsity. Also, Jin et al. [24] propose an approach that generates pseudo-documents from short texts that contain URLs. Afterward, the short text is combined with auxiliary information, e.g., named entities, hashtags, and
locations [1, 3, 25, 29]. Kou et al. [30] analyzed temporal and spatial characteristics, which led to the development of an approach named social network short text semantic modeling (STTM). STTM is used to resolve the problem of semantic sparsity for short text in social networks and to exploit the temporal and spatial characteristics.

Nowadays, many researchers and scientists use word co-occurrence information in order to improve topic modeling techniques for short text. For instance, Yan et al. [5] design an approach that generates word pair co-occurrences in the document. The developed technique, the biterm topic model (BTM), collectively uses the information of a large set of biterms, which is further used for topic distribution. Moreover, the biterms are also used to resolve the problem of document sparsity, but the order of words is not considered in the algorithm processing. Lin et al. [31] proposed a dual-sparse topic scheme to investigate the problem of data sparsity in words and topics within the document. The designed approach decouples the sparsity and improves the document in terms of topic-word and document-topic distributions. Similarly, Quan et al. [13] presented a self-aggregated approach for short text. The designed self-aggregation based topic model (SATM) generates pseudo-documents from the sampled short text snippets. The topic generation and the aggregation process are performed jointly so that the aggregation is built based on the topical affinity of texts. Nevertheless, the size of the data also increases the number of parameters for configuring a suitably extensive pseudo-document. Meanwhile, the inference procedure containing both the topic sampling and the text aggregation consumes a large amount of time during processing. Bicalho et al. [32] highlighted a general approach for topic modeling by generating a corpus of pseudo-documents from the actual documents. The proposed scheme used vector representations and word co-occurrence to create pseudo-documents. Afterward, different topic modeling techniques are applied on the pseudo-documents for further analysis.

2.2. Non-negative matrix factorization for topic modeling

During the past few years, researchers and data scientists have paid a lot of attention to the field of topic modeling. Different techniques like non-negative matrix factorization (NMF) are applied to topic modeling in order to improve the knowledge extraction process and to handle data sparsity and repetition [10, 33, 34]. However, only a few studies focus on topic modeling for short text using the NMF approach. For instance, Yan et al. [35] present an asymmetric term correlation matrix based on NMF to extract topics from short text documents. The proposed system uses a quartic non-convex loss function in its processing; however, the designed system fails to work in terms of stability and reliability. To dominate this problem, they present the Sym-NMF model in [15, 17], but this model does not supply a good insight into topic modeling, and furthermore, this model cannot be used to obtain document representations immediately. Similarly, in NMF the latent semantic space obtains the topics in a specific document cluster, and each text document is represented as a mixture of topics [9]. Automatic text segmentation plays an important role in this area. It helps to produce instructive information based on document summaries that capture the significant ideas of the original text documents. The main issue in automatic text summarization is how to calculate and extract the right information from the document. Tian et al. [36] proposed a SeaNMF approach integrated with local word-context correlation, which is used to find topics for short texts. The designed approach uses the block coordinate descent technique to compute the proposed SeaNMF model. Moreover, the developed system is also equipped with a sparse SeaNMF in order to achieve high interpretability.

2.3. Sequence to sequence for topic modeling

Seq2Seq is an approach which is widely used in natural language processing for different purposes like machine translation [37–41], headline generation [41–43], text summarization [44, 45], and speech recognition [46–48]. The Seq2Seq approach is divided into two parts, which take the input and produce the output of the document. The input section is a specified data sequence, and the output section is a sequence of the data module. Generally, Seq2Seq model training is based on the ground-truth sequence using the teacher-forcing technique, where the ground-truth acts as a teacher [49]. During the past few years, many studies have been published using the Seq2Seq approach in order to address the complexity of topic modeling for short texts [50]. Rush et al. [39] present a seq2seq model based on an attention encoder and a neural network language model (NNLM) decoder, building on neural machine translation (NMT), for text summarization. Specifying document topics is one of the important steps of document readability,
which usually comprises three main features, i.e., |total words|, |hiragana ratio|, and |word length|.

Table 1 shows a critical analysis of state-of-the-art techniques based on short text documents that develop probabilistic paradigms for modeling short documents.

Table 1
Critical analysis of sequence to sequence topic modeling

Authors | Proposal | Issues
Wei Li, Andrew McCallum [51] | Detecting word clusters which contain syntactic classes and semantic topics. | Task labeling due to word indeterminacy.
Ilya Sutskever, Oriol Vinyals, Quoc V. Le [52] | Manufacture a sequence learning approach to structure minimal hypotheses of the process. | Sentence optimization problem between sentence and data origin.
Martin Riedl, Chris Biemann [53] | LDA and TopicTiling optimization to improve the accuracy of segmentation. | Segment numbers are not investigated.
Martin Riedl, Chris Biemann [54] | Improving the semantic information of topic models to get better performance in word-based and TextTiling algorithms. | Sparsity issue, which requires different ways to smooth the data.
Dan Fisher, Mark Kozdoba [55] | Using Full Dependence Mixture (FDM) to combine the general moment and second moment of data. | Hyperparameter doesn't contain the topic arrangement.

As stated above, these existing models are not adequately developed for topic modeling from short text documents and also have some shortcomings in terms of lack of information, such as data sparsity and ambiguity. To the best of the authors' knowledge, there has so far been no semantics-based model to extract meaningful topics from short text documents using RL and SeaNMF techniques.

3. System architecture of the proposed RL+SeaNMF model

The proposed approach comprises three main modules, i.e., non-negative matrix factorization (NMF), reinforcement learning, and the sequence to sequence model (Seq2Seq). First, we explain the problem statement in sequence to sequence models and the solutions based on our study. Second, we describe the reinforcement learning steps in short text document topic generalization and give some information about the block coordinate descent methodology. Third, we present the NMF topic modeling system and the SeaNMF model based on a block-coordinate algorithm to display the terms of short text documents. Figure 1 represents the proposed system architecture.

The figure shows the input data (context, word, document) as (X_i), (J_i), (Y_i). (G), (R_c), and T form a vector representation of the proposed system, and each part of T represents a topic in the document. The presented SeaNMF model [36] is able to extract the semantics of short text data using the word-context and word-document relationships as a basis, and the proposed objective function integrates the advantages of both the NMF model for topic modeling and the skip-gram model for obtaining word-context semantic relationships. To solve the optimization problem, we use a block coordinate descent algorithm along with the sparse version of the SeaNMF model to achieve further interpretability. Table 2 summarizes the frequently used notation.

3.1. Sequence to sequence model (Seq2Seq)

The encoder-decoder architecture is applied in different types of natural language processing tasks, e.g., machine translation, dialogue generation, document summarization, and question answering. The Seq2Seq model has been presented with various strategies, e.g., attention, copying, and coverage. Most of the proposed approaches focus on repetitive words and build a fixed-size vocabulary based on words and tokens. Therefore, the efficiency of the model is bounded by vocabulary limitations. Algorithm 1 describes the steps of a simple Seq2Seq model.

Algorithm 1 consists of two phases, namely the training and testing phases. In the training phase, there are two groups of data, X and U.
Fig. 1. Overview of the RL+SeaNMF Model System architecture.

Table 2
Notation used in this paper

Name | Description
Seq2Seq model parameters
X | Input sequence of length Te, X = {x1, x2, ..., xTe}
U | Ground-truth output sequence of length T, U = {u1, u2, ..., uT}
Û | Generated output sequence of length T, Û = {û1, û2, ..., ûT}
Te | Number of encoder steps (input sequence length)
T | Number of decoder steps (output sequence length)
d | Input and output sequence representation size
B | Shared vocabulary of input and output
et | Hidden state of the encoder at time t
jt | Hidden state of the decoder at time t
Pθ | The Seq2Seq model with parameters θ
Reinforcement learning parameters
rt = r(jt, ut) | The reward that the agent receives by taking action ut when the state of the environment is jt
Û = {û1, û2, ..., ûT} | The set of actions that the agent takes over a period of time T; this is similar to the output that the Seq2Seq model generates
φ | The policy that the agent uses to take the action
Q(jt, ut) | The Q-value (under policy φ) that gives the estimated reward of taking action ut in state jt
SeaNMF parameters
B | Word-document matrix
J | Semantic correlation matrix
V | Latent factor matrix of words
Vc | Latent factor matrix of contexts
E | Latent factor matrix of documents
vj | Vector representation of word j
cj | Vector representation of context j
R+ | Non-negative real numbers
D | Number of documents in the corpus
M | Number of distinct words in the vocabulary
Algorithm 1 Simple training of the Seq2Seq model
Input: input sequence (X) and ground-truth (U)
Output: trained Seq2Seq model
Training step:
for <each group X and U> do
    Run encoding on X and get the last encoder state hTe.
    Run decoding by feeding hTe to the first decoder and obtain the sampled output sequence Û.
    Update the parameters of the model.
end for
Testing step:
for <each group X and U> do
    Sample the output Û based on the trained model.
    Estimate the model by applying performance measures.
end for

First, we run encoding on X to get the last encoded state hTe. After getting the last encoded state, we run the decoding process by feeding hTe to the first decoder in order to get a sampled output sequence Û. Finally, we update the parameters and execute the next step. In the testing phase, the input samples are tested based on the trained model to obtain Û. Finally, we evaluate the performance of the tested samples using different performance measures. The proposed Seq2Seq model in this paper contains a convolutional architecture, which comprises words and topics, the reinforcement learning system, and the SeaNMF model. The ground-truth in this algorithm is defined as U, and the model generates Û as its output. In this paper, we used the following measures to evaluate the performance of the proposed model: CIDEr [56], BLEU [57], ROUGE [58], and METEOR [59].

3.1.1. Sequence to sequence model problem statement

The main problem of the current Seq2Seq model is the "cross-entropy minimization output," which does not generate acceptable results when running the network. Hence, applying the cross-entropy (C) loss in the Seq2Seq procedure creates a mismatch between the operation process on the training and testing sets. Equation 1 is defined to evaluate the loss (error) of the output ût. Figure 2 describes the training process of the decoder, which has two inputs.

C_{IN} = -\sum_{t=1}^{T} \log \pi_\theta(\hat{u}_t \mid \hat{u}_1, \ldots, \hat{u}_{t-1}, j_t, j_{t-1}, X)    (1)

The previous output state is defined as j_{t-1}, the ground-truth input is defined as u_t, the current output state is defined as j_t, and the next predicted action is defined as û_t. In the testing set, however, the whole decoding process is based on actions generated from the model distribution in order to anticipate the next action.

As shown in Fig. 2, the left side of the figure presents the LSTM network that corresponds to the encoder e and shows the hidden states of the encoder; it has Te modules, where Te represents the length of the input sequence and the number of encoder steps. The right side of the LSTM network corresponds to the decoder and has T modules, where T represents the length of the output sequence and the number of decoder steps.

Fig. 2. Simple sequence to sequence model.
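To make the two training signals in Algorithm 1 and Equation 1 concrete, the following sketch (our illustration, not the authors' code) implements a tiny LSTM encoder-decoder in PyTorch and computes the teacher-forcing cross-entropy of Equation 1 for one batch; the toy vocabulary size, dimensions, and tensor names are assumptions.

```python
# Minimal encoder-decoder sketch (illustrative only): teacher-forcing loss of Eq. (1).
import torch
import torch.nn as nn

VOCAB, EMB, HID = 50, 32, 64          # assumed toy sizes

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, x, u):
        # encode the input sequence X and keep the last encoder state h_Te
        _, (h, c) = self.encoder(self.embed(x))
        # teacher forcing: the ground-truth tokens u_1..u_{T-1} feed the decoder
        dec_in = self.embed(u[:, :-1])
        dec_out, _ = self.decoder(dec_in, (h, c))
        return self.out(dec_out)          # logits predicting u_2..u_T

model = Seq2Seq()
x = torch.randint(0, VOCAB, (8, 10))      # batch of input sequences (length Te = 10)
u = torch.randint(0, VOCAB, (8, 7))       # ground-truth output sequences (length T = 7)

logits = model(x, u)                       # shape (8, 6, VOCAB)
# Eq. (1): negative log-likelihood of the ground-truth continuation
loss = nn.CrossEntropyLoss()(logits.reshape(-1, VOCAB), u[:, 1:].reshape(-1))
loss.backward()                            # one teacher-forcing update step
print(float(loss))
```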


3.2. Reinforcement learning

Machine learning is an advanced and robust area of artificial intelligence (AI), and reinforcement learning is one of its tasks. The reinforcement learning algorithm is evaluated based on Markov Decision Processes (MDPs). The agent-environment relationship in the reinforcement learning system is divided into three categories named "Agent," "Environment," and "State." The agent is described as the software which produces high-order intentions, and it is also the learner part of the RL system. The environment represents the problem to be solved; it can be a real-world or a simulated environment that communicates with the agent. The last category is the state, which is the position of the agent at a specific point of time in the environment. As a simple explanation, when an agent performs an action, the environment gives the agent a reward and a new state which the agent reaches by acting.

Fig. 3. Work flows of Reinforcement Learning model.

A Markov Decision Process is a tuple (J, B, P, R, T), where J is a set of states and B is a set of actions. The transition function is defined as P : J × B → J, where P(j, b) gives the next state after performing action b in state j. The reward function is defined as R : J × B → R with R(j, b), which gives the immediate reward for performing action b in state j. Terminal states are given by T ⊆ J and mark the end of an episode. Figure 3 shows the total RL process in document similarity.

As presented in Fig. 3, the ground-truth of this architecture can be supplied by humans or by automatic metrics, which generate a similarity measurement between the document output and the reference output. Practically, the Seq2Seq model is applied to generate the function operation and the task specification: in text summarization, the meaning of the function that denotes the next summarization step is extracted, while in the question answering task the start and end points of the action need to be defined in order to get the document answer.

Through reinforcement learning, two output sequences are generated from the input sequence x: u^j is drawn by sampling from the output probability distribution, while û is obtained from the same distribution and serves as a baseline. The rewards of both sequences are compared, and the reinforcement loss is minimized based on Equation 2.

L_{RL} = -(r(u^j) - r(\hat{u})) \log p_\theta(u^j)    (2)

Equation 2 weights the log-likelihood of u^j by the difference between the rewards r(u^j) and r(û) of the two output sequences generated for sequence x.
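A small numerical sketch of the loss in Equation 2 (our illustration, written under the assumption that û acts as a baseline sequence and u^j as the sampled sequence) can be expressed directly in PyTorch; the reward function here is a stand-in for an automatic metric such as ROUGE.

```python
# Sketch of the reinforcement loss of Eq. (2): the reward gap between a sampled
# sequence and a baseline sequence scales the log-likelihood of the sampled one.
import torch

def reward(seq, reference):
    """Stand-in for an automatic metric (e.g., ROUGE): token overlap ratio."""
    return len(set(seq) & set(reference)) / max(len(set(reference)), 1)

reference = [4, 7, 9, 2]                 # ground-truth token ids (toy example)
sampled   = [4, 7, 1, 2]                 # u^j: drawn from the model distribution
baseline  = [4, 3, 1, 5]                 # û: e.g., greedy output used as baseline

# log p_theta(u^j): sum of per-step log-probabilities of the sampled tokens,
# faked here with a differentiable tensor in place of real decoder outputs.
log_probs = torch.tensor([-0.2, -0.9, -1.5, -0.4], requires_grad=True)

advantage = reward(sampled, reference) - reward(baseline, reference)
loss_rl = -advantage * log_probs.sum()   # Eq. (2)
loss_rl.backward()                       # pushes probability mass toward higher-reward samples
print(float(loss_rl))
```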

3.2.1. Reinforcement learning: model-free solution technique

RL essentially deals with how to capture the optimal policy when there is no accessible model. For this issue, MDPs (Markov Decision Processes) focus on estimation under defective information, as proposed earlier; this is a crucial problem statement in short text documents. To overcome this issue, "model-free" reinforcement learning is presented, which does not depend on existing information and reward models. There is no model for this problem, so statistical knowledge is generated by sampling the MDP model. Algorithm 2 presents the general RL procedure based on online document resources.

Algorithm 2 General reinforcement learning algorithm for online short documents
for <each occurrence> do
    j ∈ J is marked as the start point.
    t := 0
    while an action b ∈ B_j is selected do
        Complete action b.
        Perceive the new state ĵ and the received reward r.
        Update T̂, R̂, Q̂ and V̂ using the experience <j, b, r, ĵ>.
        Update the parameters of the model.
        j := ĵ
    end while <ĵ is a goal state>
end for

Algorithm 2 represents the short document topic modeling issue by evaluating the minimum reward of the process.

3.3. SeaNMF model

The SeaNMF model is a novel semantics-assisted NMF model which is proposed to unearth hidden topics from short text documents. This methodology combines the semantic information of the document using a word embedding dictionary in the training model; this dictionary is able to recover word relationships, hidden parts of a document, etc.

3.3.1. Model evaluation

One of the challenges in this paper is how to feed words to the NMF system. Based on the latent matrix V ∈ R+^{M×F} (whose components are all non-negative), non-negative constraints are applied on the words and document vectors. Thus, V ∈ R+^{M×F} and V_c ∈ R+^{M×F} hold. A keyword v_i ∈ V is introduced to the system through its row V_{i,:} = v_i. To capture the semantic information between words and document contents, the matrix V_c with V_{c(j,:)} = c_j for each context c_j is defined. To detect the relationship between keywords and document content (word-context), the matrix approximation J ≈ V V_c^T is defined. These definitions show the relationship between words and contents: V is the latent factor matrix of words, V_c is the latent factor matrix of contexts, v_j is the vector representation of word j, and c_j is the vector representation of context j.

3.3.2. Optimization

Topic modeling on short text documents involves a different type of processing; the one used in this paper is that "each text is defined as a window in the skip-gram model":

- The skip-gram window is equal to the length of the text document.
- The number of short texts is equal to the number of windows.

The block coordinate descent algorithm is used on the term-document matrix B, which represents the words using bag-of-words, and in the next step the semantic correlation matrix J is evaluated for sampling the document contents. Equations 3 and 4 represent the update procedure of the SeaNMF model based on the above explanation.

V_{(F)} \leftarrow V_{(F)} + \frac{(BE)_{(F)} + \alpha (J V_c)_{(F)}}{(E^T E)_{(F,F)} + \alpha (V_c^T V_c)_{(F,F)}} - \frac{(V E^T E)_{(F)} + \alpha (V V_c^T V_c)_{(F)}}{(E^T E)_{(F,F)} + \alpha (V_c^T V_c)_{(F,F)}}    (3)

V_{c(F)} \leftarrow V_{c(F)} + \frac{(J V)_{(F)}}{(V^T V)_{(F,F)}} - \frac{(V_c V^T V)_{(F)}}{(V^T V)_{(F,F)}}    (4)

The block coordinate descent algorithm of the SeaNMF model is presented in Algorithm 3.

Algorithm 3 SeaNMF algorithm for short text documents
Input: term-document matrix B; semantic correlation matrix J; number of topics F; α
Output: V, V_c, E
Initialize: V ≥ 0, V_c ≥ 0, E ≥ 0 with random real numbers; t = 1
while <not converged> do
    for <k = 1, ..., F> do
        Compute V^t_{(:,k)} by Equation 3
        Compute V^t_{c(:,k)} by Equation 4
    end for
    t = t + 1
end while

3.3.3. Intuitive evaluation

To explain the SeaNMF model further, Equations 5 and 6 are derived from Equations 3 and 4 to represent the update procedure of topic modeling.

V^1_{(F)} \leftarrow V_{(F)} + \frac{(BE)_{(F)}}{(E^T E)_{(F,F)}} - \frac{(V E^T E)_{(F)}}{(E^T E)_{(F,F)}}    (5)

V^2_{(F)} \leftarrow V_{(F)} + \frac{(J V_c)_{(F)}}{(V_c^T V_c)_{(F,F)}} - \frac{(V V_c^T V_c)_{(F)}}{(V_c^T V_c)_{(F,F)}}    (6)

Equation 5 corresponds to standard NMF. It groups the terms of short contents into the same category using the term-document matrix. Furthermore, in Equation 6, words that have a similar meaning, or hidden words with the same significance, share common context keywords. Hence, using this model increases the connections between topics; e.g., in Fig. 1 the words (J1 and J4) are in separate documents, but they are connected through J2 based on the same keyword occurring in both document contents. A simple explanation of the SeaNMF model on short text is as follows:

"The skip-gram window is equal to the length of the text document. E.g., 'Harvard university of engineering' and 'Oxford university of science' are two samples of short text documents. 'Harvard' and 'Oxford' do not occur in the same position in the second sentence, and 'engineering' and 'science' do not occur in the first sentence, but the relationship between 'Harvard, university' and 'Oxford, university' is what the standard NMF model captures."

Based on Equation 6, both sentences share the word "university," which is the essence of the SeaNMF model. This model finds the similarity between words as well as hidden parts of documents and the relationships between vectors in short text documents, in order to extract meaningful and valuable hidden topics, which is the main issue for this type of content because of the lack of information.
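The update rules above can be prototyped in a few lines of NumPy. The sketch below is a simplified block-coordinate iteration in the spirit of Equations 3–6 and Algorithm 3, not the authors' exact implementation: it performs column-wise additive updates followed by projection onto the non-negative orthant, with random toy matrices standing in for the term-document matrix B and the semantic correlation matrix J.

```python
# Simplified SeaNMF-style block-coordinate sketch (illustrative, not the paper's exact code).
import numpy as np

rng = np.random.default_rng(0)
M, D, F, alpha = 30, 20, 5, 1.0            # words, documents, topics, weighting factor

B = rng.random((M, D))                      # term-document matrix (stand-in)
J = rng.random((M, M)); J = (J + J.T) / 2   # symmetric semantic correlation matrix (stand-in)

V  = rng.random((M, F))                     # latent word factors
Vc = rng.random((M, F))                     # latent context factors
E  = rng.random((D, F))                     # latent document factors

eps = 1e-9
for _ in range(50):
    BE, JVc = B @ E, J @ Vc
    EtE, VcTVc = E.T @ E, Vc.T @ Vc
    for k in range(F):
        # word factors: combine the document view (B, E) and context view (J, Vc), cf. Eq. (3)
        den = EtE[k, k] + alpha * VcTVc[k, k] + eps
        grad = BE[:, k] + alpha * JVc[:, k] - (V @ EtE)[:, k] - alpha * (V @ VcTVc)[:, k]
        V[:, k] = np.maximum(V[:, k] + grad / den, 0.0)
    JV, VtV = J @ V, V.T @ V
    for k in range(F):
        # context factors from the word-context approximation J ~ V Vc^T, cf. Eq. (4)
        den = VtV[k, k] + eps
        Vc[:, k] = np.maximum(Vc[:, k] + (JV[:, k] - (Vc @ VtV)[:, k]) / den, 0.0)
    BtV = B.T @ V
    for k in range(F):
        # document factors from the term-document approximation B ~ V E^T (simple projected step)
        den = VtV[k, k] + eps
        E[:, k] = np.maximum(E[:, k] + (BtV[:, k] - (E @ VtV)[:, k]) / den, 0.0)

top_words = np.argsort(-V, axis=0)[:5].T    # indices of the 5 strongest words per topic
print(top_words)
```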

3.4. Similarity measurement

Similarity measurement is evaluated based on the word probability distribution and word relationships. Topic similarity can be calculated based on word occurrence and the average PMI. Similarity measurement is divided into two categories named "distribution of topic words" and "semantic space of topic models."

Based on the distribution of topic words, every topic is generated from a Dirichlet distribution of V dimensions, where V is the size of the vocabulary. For every document, a distribution over topics is generated from a Dirichlet distribution of T dimensions, where T is the number of topics in the corpus; for every word in the document, a topic is chosen according to the generated distribution, and a word is chosen according to the distribution corresponding to the chosen topic. So each topic is a probability distribution over the words of the vocabulary.

The semantic space is represented by the topic model, which is applied to propose topics and words. Moreover, each topic is a probability distribution over the words in the training set. Each corpus, as a combination of words and documents, shows their relationship in the topic modeling procedure.
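The generative view described above can be written out in a few lines; the following sketch (ours, with toy sizes) samples topic-word and document-topic distributions from Dirichlet priors and then draws the words of one short document.

```python
# Toy generative process for the "distribution of topic words" view (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
V, T, doc_len = 12, 3, 6                 # vocabulary size, number of topics, words per document

beta = rng.dirichlet(alpha=[0.1] * V, size=T)   # each topic: a distribution over the vocabulary
theta = rng.dirichlet(alpha=[0.5] * T)          # one document: a distribution over topics

doc = []
for _ in range(doc_len):
    z = rng.choice(T, p=theta)                  # choose a topic for this word position
    w = rng.choice(V, p=beta[z])                # choose a word from that topic's distribution
    doc.append(w)

print("sampled word ids:", doc)
```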

4. Implementation and experimental results

4.1. Experimental setting

In this section, we evaluate the performance of our model through substantial experiments on different short document datasets. We describe the dataset information, metrics, and baselines, and present various sets of results. A comparison between the proposed model and latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), the pseudo-document-based topic model (PTM), and the Dirichlet multinomial mixture model (GPUDMM) is presented.

4.2. Development and simulation environment

The development environment of the proposed system is summarized in Table 3. All experiments and results of the system are carried out using an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz processor with 32 GB memory. Moreover, WinPython (v3.6.2), based on the Jupyter notebook, is a prerequisite to develop the topic modeling environment, configure the model, and obtain detailed result information. Topic modeling on short text documents is evaluated based on a combination of three different models in this area, named "Sequence to Sequence," "Reinforcement Learning," and "SeaNMF." The sequence to sequence model is evaluated based on the encoder-decoder architecture to solve the problem of cross-entropy in short texts. The reinforcement learning model is evaluated based on a model-free technique to capture the optimal policy on short content, and finally the matrix-based SeaNMF model evaluates the topics extracted from contents with limited information.

Table 3
Development environment of the proposed short text document topic modeling

Component | Description
Programming language | WinPython 3.6.2
Operating system | Windows 10 64bit
Browser | Google Chrome, Opera
Library and framework | Jupyter notebook
IDE | Anaconda 2019 Release 3
CPU | Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz
Memory | 32 GB
Topic modeling algorithms | Seq2Seq, Reinforcement Learning, SeaNMF
Distribution modeling algorithm | SeaNMF
Reinforcement learning | Model-free
Seq2Seq | Encoder and decoder
SeaNMF | Novel semantic-assisted NMF matrix

4.3. Dataset

The proposed system uses real-world short text document datasets correlated with four related applications, i.e., article headlines, question answers, microblogs, and news. These categories are divided into seven datasets, which are explained in detail below:

Tweet: This type of data is a good resource for short text document topic modeling; it was accumulated and labeled by [60]. Out of a large number of tweet categories, 15 categories were selected for the proposed process, i.e., World, Regional, Art, Science, Health, News, Shopping, Reference, Society, Recreation, Computers, Business, Sport, Game, and Home. Each category contains almost 3000 separate tweets with a minimum of two keywords. Second is the GoogleNews dataset, collected from the GitHub website. This type of short document contains 3 million English words with 300-dimensional latent embeddings obtained by applying the word2vec model. This data type is used for the comparison with GPUDMM [17]. Third is Yahoo.Ans, a dataset extracted from Yahoo Answers Manner Questions, version 2. The collected data comes from 10 various categories related to question subjects, and it also contains Financial Service, Diet, Fitness, etc. Fourth is Tag.News; the selected dataset contains part of the TagMyNews dataset, which includes tweets, news, and snippets. It is divided into seven categories named "Entertainment", "Health", "Sci and Tech", "Sport", "Business", "World" and "US". Each category contains a minimum of 25 keywords. Fifth is DBLP, the computer science bibliography website. The dataset collected from this website is divided into four categories named "Database," "Machine Learning," "Information Retrieval," and "Data Mining." Each category contains the titles of conference papers. Sixth is Yahoo.CA, a dataset extracted from Yahoo Answers Manner Questions, version 2; it contains questions and the most suitable answer to each question. And finally ACM.IS, data collected from the Scientific Literature Dataset Dataverse of Harvard University; it contains published papers' information. The dataset information is summarized in Table 4.

4.4. Metrics estimation

To validate the proposed methodology, topic coherence and document classification procedures are applied in the topic modeling system. More detailed information about each part is explained below.

4.4.1. Topic coherence

To evaluate the metrics in this process, the "pointwise mutual information (PMI)" score is used, which is grounded in information theory and accounts for all possible events occurring in the text data. Equation 7 is used to calculate topic coherence.

C_K = \frac{2}{M(M-1)} \sum_{1 \le i < j \le M} \log \frac{p(v_i, v_j)}{p(v_i)\, p(v_j)}    (7)

Each topic is defined as K, and the number of top words per topic is defined as M. p(v_i, v_j) represents the probability of the two words occurring together in the same document, and p(v_i) the probability of a single word. To measure the topic quality, the PMI score is applied to the topic modeling process. PMI works well on regular-sized text contents, but there are still issues on short types of documents, which means that a gold-standard topic may be assigned a low PMI score.
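A direct way to compute the coherence score of Equation 7 for one topic is sketched below (our code, with a toy document collection and probability estimates based on document co-occurrence counts).

```python
# Sketch of the PMI topic coherence of Eq. (7) over a toy document collection.
import math
from itertools import combinations

docs = [{"jeju", "island", "travel", "beach"},
        {"travel", "flight", "hotel"},
        {"island", "beach", "hotel"},
        {"machine", "learning", "topic", "model"}]

def pmi_coherence(top_words, docs):
    n = len(docs)
    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in docs) / n
    m = len(top_words)
    score, eps = 0.0, 1e-12
    for wi, wj in combinations(top_words, 2):
        score += math.log((p(wi, wj) + eps) / (p(wi) * p(wj) + eps))
    return 2.0 * score / (m * (m - 1))

print(round(pmi_coherence(["island", "beach", "travel"], docs), 3))
```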

To overcome this issue, we first evaluate the PMI on four categories of short text data. Next, the PMI is calculated on two other datasets (Yahoo.Ans and DBLP), which are arranged documents, based on an external corpus. The comparison of the outputs on these datasets presents the effectiveness of the proposed methodology.

Table 4
Dataset statistics

Data Type | Documents | Terms | Aggregation (A) | Aggregation (B) | Document Length | Categories
Tweet | 54524 | 21381 | 1.3855% | 1.1824% | 8.84 | 15
Yahoo.Ans | 51865 | 5445 | 1.2118% | 1.1184% | 5.41 | 10
Tag.News | 39769 | 22636 | 2.3972% | 1.2471% | 29.25 | 7
DBLP | 26112 | 3558 | 1.8714% | 1.3788% | 7.75 | 4
Yahoo.CA | 41797 | 5445 | 6.1642% | 1.8865% | 53.72 | –
ACM.IS | 47413 | 3558 | 5.3778% | 2.1515% | 88.51 | –

4.4.2. Document classification

Document classification is a further efficient topic modeling process used in the proposed system. In this process, classification is only applied to labeled datasets. To calculate the classification performance, five-fold validation is applied to the process and, finally, the LIBLINEAR package is used to determine the quality of the classification. The F-measure ("precision" and "recall") is used to qualify the classification performance.

4.5. Analysis methods

To show the effectiveness of the presented method, we compare the efficiency of our method with other state-of-the-art techniques; a minimal sketch of the two matrix/probabilistic baselines follows this list.

- Latent Dirichlet Allocation (LDA): LDA is a probabilistic topic modeling technique used to extract topics from rich sources of information. LDA extracts topics from contents based on document keywords. The extracted words are the most relevant keywords that occur in a document related to the particular topics mentioned in the content. There are different open-source libraries available in Python, such as Gensim [61], but it does not include NMF. LDA uses a probabilistic mechanism, while NMF relies on linear algebra.
- Non-negative Matrix Factorization (NMF): NMF is an unsupervised learning method used for dimension reduction and clustering. In this paper, NMF is used to implement the block coordinate algorithm. NMF is a linear-algebraic model that is used to reduce high-dimensional vectors to a low dimension.
- Pseudo-document-based Topic Model (PTM): PTM introduces pseudo-documents into topic models for short text documents to extract the document topics.
- GPUDMM: Applied to extract topics based on the Dirichlet Multinomial Mixture model.
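As a sketch of how the LDA and NMF baselines can be fitted in practice (our illustration with scikit-learn, not the exact experimental pipeline of this paper), the snippet below factorizes a small bag-of-words matrix with both models and prints the top keywords per topic; the toy documents and topic count are assumptions.

```python
# Minimal LDA vs. NMF baseline sketch with scikit-learn (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = ["jeju island travel beach hotel",
        "cheap flight hotel travel deal",
        "machine learning topic model training",
        "neural network model training data",
        "island beach weather forecast jeju"]

n_topics = 2

# LDA works on raw term counts (probabilistic mechanism)
counts = CountVectorizer().fit(docs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
lda.fit(counts.transform(docs))

# NMF works well on TF-IDF weights (linear-algebraic mechanism)
tfidf = TfidfVectorizer().fit(docs)
nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0)
nmf.fit(tfidf.transform(docs))

def top_words(model, feature_names, k=4):
    # strongest k words per row of the topic-term matrix
    return [[feature_names[i] for i in comp.argsort()[::-1][:k]]
            for comp in model.components_]

print("LDA:", top_words(lda, counts.get_feature_names_out()))
print("NMF:", top_words(nmf, tfidf.get_feature_names_out()))
```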

4.6. Experimental results

4.6.1. Topic coherence output results

Table 5 represents the topic coherence outputs via a comparison with various methods and shows the effectiveness of the SeaNMF model over standard NMF in topic modeling of short document contents. The comparison with LDA and PTM also provides considerable proof that the SeaNMF model can extract more coherent topics from a document.

Table 5
Topic coherence outputs based on PMI

Method | Tweet | Tag.News | DBLP | Yahoo.ANS
Latent Dirichlet Allocation | 1.3168 | 1.6159 | 0.1457 | 1.3168
Non-negative Matrix Factorization | 1.9156 | 1.7525 | 0.1295 | 1.2415
Pseudo-document-based Topic Model | 1.4856 | 1.7739 | 0.9616 | 1.2422
GPUDMM | 0.1324 | 0.1862 | 0.3926 | 0.6819
SeaNMF | 4.2588 | 3.7429 | 1.7248 | 1.8664
Sparse SeaNMF | 4.2181 | 3.7164 | 1.7341 | 1.7192

Table 6 represents the topic coherence outputs via a comparison on the Yahoo.CA and ACM.IS datasets. Based on the reasons mentioned for PMI in the topic coherence section, we evaluate topic coherence on an external corpus because it gives better performance on short text documents. The output results of the coherent topics are presented in Table 6. Going deeper into the word correlation process, SeaNMF obtains more suitable topics from short text documents.

Table 6
Topic coherence outputs based on Yahoo.CA and ACM.IS

Method | Yahoo.ANS/Yahoo.CA | DBLP/ACM.IS
Latent Dirichlet Allocation | 0.7651 | 0.5393
Non-negative Matrix Factorization | 0.6372 | 0.4737
Pseudo-document-based Topic Model | 0.7615 | 0.5542
GPUDMM | 0.4413 | -0.1261
SeaNMF | 1.2115 | 0.7752
Sparse SeaNMF | 1.1299 | 0.7558

As explained in Tables 5 and 6, by visualizing and comparing the word relationships, the main information of the contents affects the extraction of topics using the Sparse SeaNMF model.

4.6.2. Document classification output results

Furthermore, the metrics estimation is presented in two parts, of which document classification is one part of the proposed system. Table 7 shows the best results of our approach on the short document datasets ("Tweet", "Yahoo.ANS", "Tag.News"). These outputs display that the proposed system is efficient on short content. The comparison with LDA and NMF shows that the presented SeaNMF model has better classification and performance results than the other systems. The presented model output is summarized in Fig. 4.

Fig. 4. Topic coherence and classification performance of SeaNMF model.

Table 7
Comparison of methods on document classification

Method | Tweet (P / R / F) | Tag.News (P / R / F) | Yahoo.ANS (P / R / F) | DBLP (P / R / F)
LDA | 0.4938 / 0.4978 / 0.4957 | 0.8434 / 0.8295 / 0.8363 | 0.6131 / 0.6849 / 0.6469 | 0.7192 / 0.6184 / 0.6648
NMF | 0.4788 / 0.4628 / 0.4704 | 0.7874 / 0.7482 / 0.7672 | 0.7414 / 0.6581 / 0.6972 | 0.7414 / 0.7337 / 0.7374
PTM | 0.4152 / 0.4949 / 0.4524 | 0.8636 / 0.8417 / 0.8524 | 0.7411 / 0.7149 / 0.7277 | 0.7535 / 0.7478 / 0.7505
GPUDMM | 0.4194 / 0.4177 / 0.4183 | 0.8954 / 0.8823 / 0.8887 | 0.6165 / 0.7419 / 0.6732 | 0.7780 / 0.7684 / 0.7731
SeaNMF | 0.5759 / 0.5666 / 0.5712 | 0.8979 / 0.8897 / 0.8937 | 0.7677 / 0.7449 / 0.7560 | 0.7759 / 0.7663 / 0.7709
Sparse SeaNMF | 0.5613 / 0.5679 / 0.5644 | 0.8915 / 0.8912 / 0.8913 | 0.7714 / 0.7471 / 0.7590 | 0.7811 / 0.7724 / 0.7766

A comparison between these systems shows that the skip-gram plays an effective role in the high performance on semantic documents, and also that the NMF model is not good enough compared with LDA. Another effective model in this process is GPUDMM, which performs better than the other baselines. The number of tweets in the various categories is almost the same, which makes it more dependable, although the Tweet dataset suffers from the "imbalanced class" issue. Table 7 shows the improvement of the SeaNMF model, which on average achieves 14 percent higher performance over all the baselines, as presented by precision (P), recall (R), and F-measure (F).

It is observed that a better topic coherence performance does not indicate a better document classification performance. Figure 4 presents the topic coherence and document classification performance of the SeaNMF model. The parameter α indicates the weighting factor used for factorizing the correlation matrix of semantic words, and γ is used as a smoothing factor for the sampling probability. It is evident in Fig. 4 that the F-score decreases as α increases, and it is also noticed that there is no significant change in the F-score with respect to the α value. Therefore, our proposed SeaNMF-based model is stable for modeling topics of short text documents. The parameter f also plays a vital role in constructing the correlation matrix of semantic words. The parameter f affects the data sparsity of the correlation matrix, which causes the semantic words to be less correlated. It is also observed from the graph that the value of the F-score is reduced when the parameter f increases. It is found that the F-score slightly increases and improves when the smoothing factor γ is increased. Hence, in a collection of short documents, there is a difference between a highly coherent topic and a high-quality topic.

4.7. Combination of Seq2Seq and reinforcement learning model

This section presents the combination of the Seq2Seq model with reinforcement learning in every possible technique. The main goal of using these models is to overcome the mismatch problem between document keywords and topics. Generally, this system works based on the reward function for running the objective function, which was explained earlier in the reinforcement learning section. The reinforcement learning algorithm is an effective system for improving state-of-the-art outputs in the Seq2Seq system, although there are many more advanced techniques, e.g., DQN, Actor-Critic models, and DDQN, which have not been applied to this task before. The main issue in this setting is the problem of using Q-learning and formulating it within the Seq2Seq model; text summarization can be an example of this problem, as the system should evaluate the words of the whole vocabulary even at a low stage of training. For this reason, the approach most focused on in this area is the REINFORCE algorithm, which is a suitable method to train the Seq2Seq model. Table 8 presents the "policy", "action", and "reward" functions in various Seq2Seq models.
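As background for the Q-learning difficulty mentioned above, the following toy sketch (ours; a generic tabular example, not tied to text generation) shows the basic Q-value update on a small chain environment. With a text decoder, the "actions" would be every word in the vocabulary, which is what makes value-based methods hard to apply directly and why REINFORCE-style training is preferred here.

```python
# Tabular Q-learning toy sketch on a 5-state chain (illustrative background only).
import random

n_states, actions = 5, [0, 1]           # action 0: move left, action 1: move right
alpha, gamma, eps = 0.5, 0.9, 0.2
Q = [[0.0, 0.0] for _ in range(n_states)]

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0      # reward only at the terminal state
    return s2, r, s2 == n_states - 1

random.seed(0)
for _ in range(200):                             # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = random.choice(actions) if random.random() < eps else max(actions, key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([[round(v, 2) for v in row] for row in Q])
```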

4.8. Topic semantic analysis

Generally, topic semantic analysis represents the effectiveness of the SeaNMF model in extracting meaningful keywords. To discover topics from short text documents, the specified top keywords of the standard NMF model are compared with the SeaNMF system output. The represented datasets extract words depending on the word embedding system and describe the problem statement of topic extraction in short contents. The output covers the set of contents based on word similarity. The topic semantic analysis procedure forms groups of sentences by using the coherent topics allocated to individual sentences, and by applying cross-entropy over the document terms the system is able to extract meaningful topics.

For a start, the "Yahoo.Ans" and "DBLP" data are used in the training process, and the topics extracted from the process are selected based on the PMI score; the higher scores are selected. By analyzing the top keywords, the most similar topics captured with the SeaNMF system are selected.

To display the semantically associated discovered topics of the SeaNMF model, we build a network based on the top keywords of each topic, which is represented in Fig. 5. Several words with great modulation in the corpus have a lower degree or, in other words, are less associated with other keywords. Figure 5 shows the topics extracted using the SeaNMF model, which have minimum noise in the words, and the extracted words are correlated with each other.
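A network such as the one in Fig. 5 can be assembled directly from the top keywords of each topic; the sketch below (our illustration with made-up keywords, using networkx) connects words that appear in the same topic and reports each word's degree, so weakly connected words stand out.

```python
# Sketch: build a keyword co-occurrence network from per-topic top words (toy data).
import networkx as nx

topics = {                         # hypothetical top keywords per extracted topic
    "topic_0": ["island", "beach", "travel", "hotel"],
    "topic_1": ["flight", "travel", "hotel", "deal"],
    "topic_2": ["model", "learning", "topic", "data"],
}

G = nx.Graph()
for words in topics.values():
    for i, w in enumerate(words):
        for v in words[i + 1:]:
            # words sharing a topic are linked; repeated links keep a single edge
            G.add_edge(w, v)

# degree tells how strongly a keyword is associated with the rest of the network
for word, deg in sorted(G.degree, key=lambda x: -x[1]):
    print(f"{word}: degree {deg}")
```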

f
roo
rP
tho
d Au
Fig. 5. Topic Semantic Analysis in SeaNMF model.
cte

Table 8
Policy, action, and reward in various Seq2Seq models

Seq2Seq task                     Policy (P)                      Action (A)                                                  Reward (R)
T.Summarization, H.Generation,   Attention-based models,         Summarize, translate, and generate a headline based         ROUGE, BLEU
M.Translation                    pointer-generators, etc.        on the next token
Q.Generation, Q&A                Seq2Seq                         Vocabulary-based answer selection, or selecting the         F-measure
                                                                 start and end index of the answer in the input document
I.Captioning, V.Captioning       Seq2Seq                         Next token chosen as caption                                CIDEr, SPICE, METEOR
S.Recognition                    Seq2Seq                         Next token chosen as speech                                 Connectionist Temporal Classification (CTC)
D.Generation                     Seq2Seq                         Generate dialogue speech                                    BLEU, dialogue length, dialogue variety
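Table 8 pairs each Seq2Seq task with a policy, an action, and a reward. In a policy-gradient view, the sequence-level reward (for example a ROUGE or F-measure score) simply scales the log-likelihood of the sampled actions. The snippet below is a minimal REINFORCE-style sketch in PyTorch on made-up numbers, not the training code used in this work.

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, reward: float, baseline: float = 0.0) -> torch.Tensor:
    """REINFORCE-style loss: -(reward - baseline) * sum_t log pi(a_t | s_t)."""
    return -(reward - baseline) * log_probs.sum()

# Toy probabilities of three sampled tokens and a sequence-level reward
# (e.g., a ROUGE or F-measure value from the last column of Table 8).
token_probs = torch.tensor([0.40, 0.30, 0.50], requires_grad=True)
log_probs = torch.log(token_probs)
loss = reinforce_loss(log_probs, reward=0.62, baseline=0.50)
loss.backward()  # positive advantage: gradients increase the sampled-token probabilities
print(loss.item(), token_probs.grad)
```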

Further explanation shows the probability of connection between words across the dataset. As the color bar in the figure indicates, light colors mark word pairs that have a low chance of occurring together in the same sentence, while dark colors represent the highest probability of words occurring together.

Figure 6 represents the topic visualization between words in the selected dataset. The number of topics is manually set to eight, and Fig. 6 visualizes the NMF topic modeling output with respect to the short-text content matrix. The topics include the detailed information of the input document, and the word embedding process is applied to find the relationships between individual terms. Each color shows the importance of a topic in the dataset and the similarity of its related contents, in which light colors represent the minimum probability of a word relationship and dark colors show the maximum probability between topic terms.
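A visualization in the spirit of Fig. 6 can be produced by plotting a normalized topic-term weight matrix as a heat map, so that light cells mark low-probability term-topic associations and dark cells mark high-probability ones. The following matplotlib sketch uses random toy weights for eight topics and is only an illustration of the idea.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
terms = ["nmf", "topic", "tweet", "matrix", "short", "model"]
W = rng.random((8, len(terms)))            # toy topic-term weights for 8 topics
W = W / W.sum(axis=1, keepdims=True)       # normalize each topic to a probability row

fig, ax = plt.subplots()
im = ax.imshow(W, cmap="Greys")            # light cells = low probability, dark = high
ax.set_xticks(range(len(terms)))
ax.set_xticklabels(terms, rotation=45)
ax.set_yticks(range(8))
ax.set_yticklabels([f"topic {i}" for i in range(8)])
fig.colorbar(im, ax=ax, label="term weight")
plt.tight_layout()
plt.show()
```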
Fig. 6. Topic distribution in word corpus.

5. Conclusion

In this paper, we present a combination of a reinforcement learning Seq2Seq system and the SeaNMF model to improve the performance of topic modeling on short text documents. The proposed model leverages word-context semantic correlations during training, which is effective for overcoming the issues of short text documents, such as the lack of information for extracting meaningful topics. The semantic relationship between keywords and document information is obtained from the skip-gram model, which efficiently discloses the semantic relationships among words. The reinforcement learning system is applied to obtain the reward function from the algorithm on the training set for input-specific RL and to compare it with the specific RL on the test set, which avoids the difficulty of the document extraction part.

This work used a block coordinate descent algorithm to solve the SeaNMF model and compared the proposed model with other state-of-the-art models on real-world data. We utilized accuracy as a standard measure to evaluate the performance of topic coherence and document classification. The obtained measurements show that the proposed model performs better in terms of the accuracy metric compared to the other models, while the topic coherence and classification performance show the consistency and strength of the model. The results demonstrate that the hidden topics extracted from short text data are meaningful and that the terms are consistent with the topics.
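For reference, the sketch below shows plain two-block coordinate descent for standard NMF, alternating nonnegative least-squares updates of the two factors. It is a simplified stand-in rather than the solver used for SeaNMF, whose objective also includes the word-context factor and the α term omitted here.

```python
import numpy as np
from scipy.optimize import nnls

def nmf_bcd(A, k, n_iter=30, seed=0):
    """Two-block coordinate descent for min ||A - W H||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W, H = rng.random((m, k)), rng.random((k, n))
    for _ in range(n_iter):
        for j in range(n):                      # update H column-wise with W fixed
            H[:, j], _ = nnls(W, A[:, j])
        for i in range(m):                      # update W row-wise with H fixed
            W[i, :], _ = nnls(H.T, A[i, :])
    return W, H

A = np.random.default_rng(1).random((6, 5))     # toy nonnegative data matrix
W, H = nmf_bcd(A, k=2)
print(round(np.linalg.norm(A - W @ H), 4))      # reconstruction error after the updates
```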

The performance of the proposed model can be further enhanced by using a graph-based word embedding technique, since graph embedding overcomes the limitations of the sequential input method used in the proposed system. In future work, we will use a graph-based word embedding technique to improve the performance of topic modeling for short text documents using machine learning techniques.

Acknowledgments

This work was supported by the 'Jeju Industry-University Convergence Foundation' funded by the Ministry of Trade, Industry, and Energy (MOTIE, Korea) [Project Name: 'Jeju Industry-University Convergence Foundation'; Project Number: N0002327].

References

[1] W.X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan and X. Li, Comparing twitter and traditional media using topic models. In European conference on information retrieval, (2011), pp. 338–349. Springer.
[2] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu and M. Demirbas, Short text classification in twitter to improve information filtering. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, (2010), pp. 841–842. ACM.
[3] L. Hong and B.D. Davison, Empirical study of topic modeling in twitter. In Proceedings of the first workshop on social media analytics, (2010), pp. 80–88. ACM.
[4] Z. Wang and H. Wang, Understanding short texts. 2016.
[5] X. Yan, J. Guo, Y. Lan and X. Cheng, A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web, (2013), pp. 1445–1456. ACM.
[6] D.M. Blei, A.Y. Ng and M.I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research 3(Jan) (2003), 993–1022.
[7] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41(6) (1990), 391–407.
[8] T. Hofmann, Probabilistic latent semantic indexing. In ACM SIGIR Forum 51 (2017), pp. 211–218. ACM.
[9] D.D. Lee and H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401(6755) (1999), 788.
[10] J. Choo, C. Lee, C.K. Reddy and H. Park, Weakly supervised nonnegative matrix factorization for user-driven clustering, Data Mining and Knowledge Discovery 29(6) (2015), 1598–1621.
[11] D. Kuang, J. Choo and H. Park, Nonnegative matrix factorization for interactive topic modeling and document clustering. In Partitional Clustering Algorithms (2015), pp. 215–243. Springer.
[12] D. Kuang, S. Yun and H. Park, Symnmf: nonnegative low-rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization 62(3) (2015), 545–574.
[13] X. Quan, C. Kit, Y. Ge and S.J. Pan, Short and sparse text topic modeling via self-aggregation. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[14] Y. Zuo, J. Wu, H. Zhang, H. Lin, F. Wang, K. Xu and H. Xiong, Topic modeling of short texts: A pseudo-document view. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, (2016), pp. 2105–2114. ACM.
[15] T. Mikolov, K. Chen, G. Corrado and J. Dean, Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[16] J. Pennington, R. Socher and C. Manning, Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), (2014), pp. 1532–1543.
[17] C. Li, H. Wang, Z. Zhang, A. Sun and Z. Ma, Topic modeling for short texts with auxiliary word embeddings. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, (2016), pp. 165–174. ACM.
[18] V.K.R. Sridhar, Unsupervised topic modeling for short texts using distributed representations of words. In Proceedings of the 1st workshop on vector space modeling for natural language processing, (2015), pp. 192–200.
[19] G. Xun, V. Gopalakrishnan, F. Ma, Y. Li, J. Gao and A. Zhang, Topic discovery for short texts using word embeddings. In 2016 IEEE 16th international conference on data mining (ICDM), (2016), pp. 1299–1304. IEEE.
[20] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado and J. Dean, Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (2013), pp. 3111–3119.
[21] O. Levy and Y. Goldberg, Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, (2014), pp. 2177–2185.
[22] O. Levy, Y. Goldberg and I. Dagan, Improving distributional similarity with lessons learned from word embeddings, Transactions of the Association for Computational Linguistics 3 (2015), 211–225.
[23] X.-H. Phan, L.-M. Nguyen and S. Horiguchi, Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on World Wide Web, (2008), pp. 91–100.
[24] O. Jin, N.N. Liu, K. Zhao, Y. Yu and Q. Yang, Transferring topical knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20th ACM international conference on Information and knowledge management, (2011), pp. 775–784.
[25] J. Weng, E.-P. Lim, J. Jiang and Q. He, Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the third ACM international conference on Web search and data mining, (2010), pp. 261–270.
[26] R. Mehrotra, S. Sanner, W. Buntine and L. Xie, Improving lda topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, (2013), pp. 889–892.
[27] Y. Zuo, J. Zhao and K. Xu, Word network topic model: a simple but general solution for short and imbalanced texts, Knowledge and Information Systems 48(2) (2016), 379–398.
[28] F. Kou, J. Du, C. Yang, Y. Shi, M. Liang, Z. Xue and H. Li, A multi-feature probabilistic graphical model for social network semantic search, Neurocomputing 336 (2019), 67–78.
[29] D. Ramage, D. Hall, R. Nallapati and C.D. Manning, Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, (2009), pp. 248–256. Association for Computational Linguistics.
[30] F. Kou, J. Du, Z. Lin, M. Liang, H. Li, L. Shi and C. Yang, A semantic modeling method for social network short text based on spatial and temporal characteristics, Journal of Computational Science 28 (2018), 281–293.
[31] T. Lin, W. Tian, Q. Mei and H. Cheng, The dual-sparse topic model: mining focused topics and focused terms in short text. In Proceedings of the 23rd international conference on World wide web, (2014), pp. 539–550.
[32] P. Bicalho, M. Pita, G. Pedrosa, A. Lacerda and G.L. Pappa, A general framework to expand short text for topic modeling, Information Sciences 393 (2017), 66–81.
[33] J. Choo, C. Lee, C.K. Reddy and H. Park, Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization, IEEE Transactions on Visualization and Computer Graphics 19(12) (2013), 1992–2001.
[34] H. Kim, J. Choo, J. Kim, C.K. Reddy and H. Park, Simultaneous discovery of common and discriminative topics via joint nonnegative matrix factorization. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2015), pp. 567–576. ACM.
[35] X. Yan, J. Guo, S. Liu, X. Cheng and Y. Wang, Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In Proceedings of the 2013 SIAM International Conference on Data Mining, (2013), pp. 749–757. SIAM.
[36] T. Shi, K. Kang, J. Choo and C.K. Reddy, Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations. In Proceedings of the 2018 World Wide Web Conference, (2018), pp. 1105–1114.
[37] M.-T. Luong, H. Pham and C.D. Manning, Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[38] Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[39] D. Bahdanau, K. Cho and Y. Bengio, Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[40] S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun and Y. Liu, Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433, 2015.
[41] A.M. Rush, S. Chopra and J. Weston, A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
[42] S. Chopra, M. Auli and A.M. Rush, Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (2016), pp. 93–98.
[43] S.-Q. Shen, Y.-K. Lin, C.-C. Tu, Y. Zhao, Z.-Y. Liu, M.-S. Sun, et al., Recent advances on neural headline generation, Journal of Computer Science and Technology 32(4) (2017), 768–784.
[44] A. See, P.J. Liu and C.D. Manning, Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368, 2017.
[45] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al., Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023, 2016.
[46] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel and Y. Bengio, End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), (2016), pp. 4945–4949. IEEE.
[47] A. Graves and N. Jaitly, Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning, (2014), pp. 1764–1772.
[48] Y. Miao, M. Gowayyed and F. Metze, Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), (2015), pp. 167–174. IEEE.
[49] S. Bengio, O. Vinyals, N. Jaitly and N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems (2015), pp. 1171–1179.
[50] M. Zamanian and P. Heydari, Readability of texts: State of the art, Theory & Practice in Language Studies 2(1) (2012).
[51] W. Li and A. McCallum, Semi-supervised sequence modeling with syntactic topic models. In AAAI 5 (2005), pp. 813–818.
[52] I. Sutskever, O. Vinyals and Q.V. Le, Sequence to sequence learning with neural networks. In Advances in neural information processing systems, (2014), pp. 3104–3112.
[53] M. Riedl and C. Biemann, Text segmentation with topic models, Journal for Language Technology and Computational Linguistics 27(1) (2012), 47–69.
[54] M. Riedl and C. Biemann, How text segmentation algorithms gain from topic models. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (2012), pp. 553–557. Association for Computational Linguistics.
[55] D. Fisher, M. Kozdoba and S. Mannor, Topic modeling via full dependence mixtures. arXiv preprint arXiv:1906.06181, 2019.
[56] R. Vedantam, C.L. Zitnick and D. Parikh, Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, (2015), pp. 4566–4575.
[57] K. Papineni, S. Roukos, T. Ward and W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, (2002), pp. 311–318. Association for Computational Linguistics.
[58] C.-Y. Lin, Looking for a few good metrics: Automatic summarization evaluation-how many samples are enough? In NTCIR, 2004.
[59] S. Banerjee and A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, (2005), pp. 65–72.
[60] A. Zubiaga and H. Ji, Harnessing web page directories for large-scale classification of tweets. In Proceedings of the 22nd international conference on world wide web, (2013), pp. 225–226. ACM.
[61] R. Řehůřek and P. Sojka, Gensim: statistical semantics in Python. Retrieved from gensim.org, 2011.