
Topic Modeling Text Clustering Based on Deep Learning Model

Salman Younus
Department of Computer Science, Lahore University of Management Sciences
18030028@lums.edu.pk

Bilal Shabir
Department of Computer Science, Lahore University of Management Sciences
19030024@lums.edu.pk

Abstract
Today we live in a digital world, and the amount of data we produce every minute is mind-boggling. According to a Forbes article, 2.5 quintillion bytes of data are created each day, and this pace keeps increasing as more people and devices connect to the internet. These days data is a real asset; it is often called the oil of the 21st century. People use the available information to help cure diseases, boost company revenues, shape marketing strategies, and much more. The challenge we now face is how to process such a huge amount of data. According to IBM, an estimated 80% of all information on the internet is unstructured, and text is one of the most common types of unstructured data. Text processing, i.e. analyzing, understanding, and organizing text, is a hard and time-consuming activity. Our work therefore addresses topic modeling.
LDA is the probabilistic model most commonly used in practice for topic modeling and clustering. It performs well on large documents, but challenges have been observed for topic modeling of short texts such as posts on social media platforms and product reviews / customer feedback. Embedding plus clustering could therefore be a good option, so we worked in the direction of using BERT with LDA to achieve the desired results. Our results show that LDA and BERT combined produce better results, creating more balanced and better-separated clusters than either LDA or BERT applied separately.

Introduction
In this era of information technology, a huge amount of data is generated over the internet with every passing moment, in the form of blogs, news, research articles, social networks, web pages, etc. To process this information, which is available in structured or unstructured form, people are working on machine learning techniques so that more information can be processed with less time and effort.

With the use of different social media platforms, text data is a big component of the overall information. People and organizations use different methods and tools for text analysis and processing. Topic modeling is one of the research areas of text analysis. It is an unsupervised machine learning technique capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize those documents. The aim of topic modeling is to discover the themes that run through a corpus by analyzing the words of the original texts. In this research project, our objective is to implement a deep learning based model for topic modeling and clustering and then evaluate its performance.

Based on the literature review, Latent Dirichlet Allocation (LDA) is the technique most commonly used for topic modelling. It is a probabilistic model and works well on large collections of documents, but challenges have been observed when the same technique is applied to short messages on social media platforms. People are working on short-text topic modelling by applying various other techniques alongside LDA. As part of this project, we therefore explore applying LDA + BERT together and observe the behaviour.
Literature Review
In this section we discuss the related work and techniques used for topic modeling, from the Singular Value Decomposition (SVD) based topic models to the latest deep learning techniques for generating topics from collections of documents.

Latent semantic indexing (LSI) [1] served as the origin of topic modeling; it is not a probabilistic model. Then in 2001, building on LSI, Hofmann proposed Probabilistic Latent Semantic Analysis (PLSA) [2], which is a real topic model. After PLSA came the Latent Dirichlet Allocation (LDA) model [3], proposed in 2003 by Blei et al., which is a more complete probabilistic generative model and an extension of PLSA. LDA is a three-level hierarchical Bayesian model; to obtain cluster assignments it uses two probability distributions: word-topic and topic-document. These values are calculated from an initial random assignment and then updated repeatedly for each word in each document to decide its topic assignment.

Instead of the flat set of topics produced by LDA, HLDA [4] is an LDA extension that models a tree of topics. HLDA is a non-parametric Bayesian model in which each node in the tree is associated with a topic, which is a distribution over words. The Author-Topic Model (ATM) [5] is another probabilistic extension of LDA. In this technique each word is connected with two variables, an author and a topic: each author has a distribution over topics and each topic has a distribution over words. To overcome LDA's inability to model the correlation between topics, the Correlated Topic Model (CorrTM) [7] was developed. CorrTM is able to model the complex underlying structure of topics and provides a covariance structure. This model is more effective than LDA for topic exploration and visualization.

In natural language processing, a document is usually represented by a bag of words (BoW), which is effectively a word-document matrix. To move beyond the bag-of-words approach, the Bigram LDA Topic Model (BLTM) was developed [8]. This model is based on N-grams and predicts a word based on the preceding word; it draws distributions over words conditioned on context rather than only topic distributions over words.

To identify local and global topics across documents, Multi-grain LDA (MG-LDA) [9] was introduced as an extension of LDA and PLSA. Within a document, word sampling is based on a mixture of local topic contexts. This model has shown good results for analyzing the rating aspects of online reviews.

Convolutional Neural Networks (CNN) for Sentence Classification [10]: in this paper the authors performed a series of experiments using convolutional neural networks for sentence classification built on top of word2vec, and argued that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.

Efficient Deep Learning Model for Text Classification Based on Recurrent and Convolutional Layers [11]: in order to train an NLP model for text classification, we need to represent words numerically so that a machine can understand them. One way to achieve this is the bag-of-words technique, but it is not a good choice because of two major weaknesses: it ignores word semantics and word order. In this paper the authors proposed a neural language model based on a Convolutional Neural Network (CNN) and a Bidirectional Recurrent Neural Network (BRNN) to reduce the loss of detailed local information and to capture long-term dependencies across input sequences.

In [12], the authors proposed a model based on dynamic semantic representation and a deep neural network for multi-label text classification. The model uses a clustering algorithm and word embeddings to select semantic words; in the classification process, new words and low-frequency words are re-expressed via a sparsity constraint.

The Supervised Citation Network Topic Model (SCNTM) [16] is a non-parametric topic model that incorporates bibliographic analysis of authors, topics, and documents. It generates probability vectors, represented as counts, using a GEM distribution and a base distribution. Sequential Latent Dirichlet Allocation (seqLDA) [17] is hierarchical modeling applied to documents that have multiple segments (e.g. chapters, paragraphs); a Markov chain is used to bind the sequence of LDA models together.

The Kernel Topic Model is a regression model based on document metadata. Deep Belief Nets for Topic Modeling [18] is a generative bag-of-words model for the conceptual meaning of documents. It provides a better representation of documents because of its non-linear dimensionality reduction approach. Training the model is a two-step process: pre-training and fine-tuning.

The Neural Topic Model (NTM) [19] is an unsupervised learning model used to organize a collection of documents into topics based on grouping words via statistical distributions. The topics NTM learns from the documents are characterized by latent representations.

Methodology
Our methodology follows a standard text processing pipeline: we prepare the data, apply topic modeling techniques, and then process and evaluate the results, as shown in the following figure.

Figure 1 - Topic Modeling Process Flow

Based on the literature review, LDA is the probabilistic model most commonly used in practice for topic modeling and clustering. It performs well on large documents, but for topic modeling of short texts, such as posts on social media platforms and product reviews / customer feedback, the following challenges have been observed:

• LDA has a hard time handling short texts where there is not much text to model.
• Reviews usually do not coherently discuss a single topic, making it hard for LDA to identify the main topics of the documents.
• The actual meaning of reviews is largely context-based, so word co-occurrence-based methods like LDA may fail.

The low performance of LDA on short texts can thus be attributed to the lack of context, which can be addressed by embedding the full content of each sentence. Therefore, embedding plus clustering can be a good option, and we work in the direction of using BERT with LDA to achieve the desired results.

BERT is a pre-trained unsupervised natural language processing model. BERT is deeply bidirectional, meaning it looks at the words before and after an entity, and it is pre-trained on Wikipedia, which gives it a richer understanding of language. BERT is an open-source machine learning framework for natural language processing (NLP), designed to help computers understand the meaning of ambiguous language in text by using the surrounding text to establish context. We have divided our work into the following phases:

• Data Pre-Processing
• Train the model for topic modeling and clustering
• Analyze and evaluate the model

In data pre-processing we perform activities such as tokenization, stop-word removal, and stemming. Once the data is prepared, we use it for topic modeling and clustering, and at the end we analyze and evaluate the model. Tokenization is a way of separating a piece of text into smaller units called tokens; tokens can be words, characters, or subwords, and tokenization is the foremost step when modeling text data. A stop word is a commonly used word (such as "the", "a", "an", "in") that is filtered out before or after processing natural language data. Removing stop words decreases the dataset size and, with it, the time needed to train the model. A minimal sketch of this pipeline is shown below.
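As an illustration, the following sketch implements these steps with NLTK; the exact cleaning rules, the choice of the Porter stemmer, and the minimum token length are our assumptions, not fixed by the methodology.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(doc):
    """Lowercase, strip non-letters, tokenize, drop stop words, stem."""
    doc = doc.lower()
    doc = re.sub(r"[^a-z\s]", " ", doc)   # remove symbols, digits, delimiters
    tokens = doc.split()                   # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    return [STEMMER.stem(t) for t in tokens]

print(preprocess("The servers were DOWN again... totally unplayable!!"))
# e.g. ['server', 'total', 'unplay'] (exact stems depend on the stemmer)
```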

The raw text data is pre-processed as a first step in order to bring the dataset into a proper structure so that it can be used efficiently in further operations. Different normalization techniques can be applied during pre-processing to convert the text data into a structured form. For data analysis we use exploratory data analysis (EDA), an approach to analyzing datasets that summarizes their main characteristics, often with visual methods. It helps us get a better understanding of the dataset and its structure, and through EDA we are also able to improve the quality of the data pre-processing.
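A minimal EDA sketch with pandas, under the assumption that the reviews live in a CSV file with a 'review' column (both names hypothetical):

```python
import pandas as pd

df = pd.read_csv("reviews.csv")                        # assumed file name
print(df.shape)                                        # reviews x columns
print(df["review"].str.len().describe())               # character-length stats
print(df["review"].str.split().str.len().describe())   # word-count stats
```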

After pre-processing, we implement the LDA + BERT model for topic modeling (text classification) using Python frameworks, and finally analyze and evaluate the model.

Architecture
In this section we discuss the architecture of the model. As discussed above, the idea is to use LDA along with BERT for topic modeling of short texts to produce better results. The datasets used in this project therefore consist of short texts; details are given in the dataset section. The high-level conceptual architecture is as follows.

Figure 2 - High Level Architecture

We implement our model for topic modeling (text classification) using Python frameworks. The data pipeline of our model is as follows.

Figure 3 - Data Pipeline

For numerical computation we use TensorFlow, an open-source library created by the Google Brain team. For LDA we use Gensim, an open-source library for unsupervised topic modeling and natural language processing, and for clustering we use scikit-learn, the standard machine learning library for Python.
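As a concrete illustration of the Gensim side of this pipeline, the sketch below builds the dictionary and bag-of-words corpus and trains an LDA model; the toy documents and hyperparameters are assumptions for the example, with the topic count set to the K = 10 used later.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# `docs` is a list of pre-processed token lists, e.g. output of preprocess()
docs = [["server", "lag", "crash"], ["great", "game", "fun"],
        ["payment", "refund", "issue"]]

dictionary = Dictionary(docs)                        # token <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
               passes=10, random_state=42)

# Per-document topic probability vector (vector 'A' in the LDA+BERT model)
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))
```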

Dataset
We ran the model on two different datasets. The first dataset consists of customer feedback about an online gaming platform; it contains 400,000+ customer reviews from 2010 to 2019. Some insights into the data are as follows.

Figure 4 – Dataset A – Summary

The second is a Twitter dataset containing 200K tweets.

Figure 5 - Twitter Dataset – Summary

Implementation
We used Python for the implementation, which follows these steps: data loading, data cleaning / preprocessing, data transformation (corpus and dictionary), running the model, evaluating performance, and visualizing the results.

In the data cleaning / preprocessing step, we normalized the sentences by removing letter repetitions, symbols, delimiters, noise text, etc. Then we applied stop-word removal, lowercasing, and stemming to prepare the data for further processing.

Figure 6 - LDA

As discussed in the methodology section, LDA is the probabilistic model most commonly used in practice for topic modeling and clustering. But for topic modeling of short text messages, such as posts on social media platforms and product reviews / customer feedback, challenges have been observed. An alternative approach is therefore to convert the documents into a vector space and use clustering.

Figure 7 - BERT

Therefore, we implemented LDA+BERT and compared its results with those of LDA and BERT separately. We used the Gensim library for the topic modeling activities and scikit-learn for clustering.

Figure 8 - LDA+BERT
In the LDA+BERT implementation, we used LDA to generate the topic probability assignment vector 'A'. Then we generated the sentence embedding vector 'B' using SentenceTransformer and concatenated it with vector 'A' to produce vector 'C'. Since vector 'C' lives in a high-dimensional space, we compressed it into a latent space with an autoencoder deep neural network before clustering. A sketch of this construction follows.
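The sketch below shows one plausible realization of this step with sentence-transformers and Keras; the embedding model name, the gamma weighting, the latent dimension, and the training settings are our assumptions for illustration, and it reuses `lda`, `corpus`, and a list `raw_docs` of the original review strings.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from tensorflow import keras

# Vector 'A': per-document topic probabilities from the trained LDA model
A = np.array([[p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
              for bow in corpus])                  # shape (n_docs, n_topics)

# Vector 'B': sentence embeddings of the original documents (raw_docs assumed)
embedder = SentenceTransformer("all-MiniLM-L6-v2") # assumed model choice
B = embedder.encode(raw_docs)                      # shape (n_docs, 384)

# Vector 'C': weighted concatenation; gamma balances LDA vs BERT information
gamma = 15
C = np.concatenate([A * gamma, B], axis=1)

# Autoencoder: compress C into a low-dimensional latent space for clustering
latent_dim = 32                                    # assumed latent size
inputs = keras.Input(shape=(C.shape[1],))
encoded = keras.layers.Dense(latent_dim, activation="relu")(inputs)
decoded = keras.layers.Dense(C.shape[1], activation="linear")(encoded)
autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(C, C, epochs=50, batch_size=128, verbose=0)

latent = encoder.predict(C)                        # input to K-Means below
```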

Evaluation
We ran our process for multiple scenarios, with the following results.

Scenario I: BERT

We converted the documents into vector space using sentence embeddings and performed K-Means clustering with the cluster count set to 10 (see the sketch of this step below). The results show that the clusters are not very clear.

Figure 9 - BERT Results

Scenario II: LDA

In this scenario, we converted the documents into bag-of-words form and processed them using LDA. The results are again not very clear, because the model did not consider the contextual information.

Figure 10 - LDA Results
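A minimal sketch of the clustering step for Scenario I, reusing the embedding matrix B from the earlier sketch:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
labels = kmeans.fit_predict(B)     # one cluster assignment per document
```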

Scenario III: LDA + BERT


We converted the documents into vector space and then ran the clustering on the latent space representation. Now the results are better: the clusters are more balanced and better separated.

Figure 11 - LDA+BERT Results

Gaming Reviews – Topics using LDA+BERT
The following topics were identified when we ran the process with K = 10 (i.e. the topic count).

Figure 12 - Gaming Reviews Topics


Looking at the above topics, we can derive a theme for each of them because the words within one topic are coherent with each other.
Topic ID Topic Theme
Topic 1 Sport games
Topic 2 Online games
Topic 3 Server Issues
Topic 4 Positive Reviews
Topic 5 Entertainment
Topic 6 Hacking
Topic 7 Online Payment
Topic 8 Shooting games
Topic 9 Game development
Topic 10 Gaming modes
Table 1 - Gaming Review Themes

Twitter Dataset – Topics using LDA+BERT

The following topics were identified when we ran the process with K = 10 (i.e. the topic count).

Figure 13 - Twitter Dataset Topics

Evaluation of a model is very important to determine how good or bad the implemented model is. Evaluating an unsupervised model is always challenging because there is no labeled data, so we sometimes have to rely on human judgement or eyeballing the results. Here we instead use topic coherence for topic modeling and the Silhouette score for clustering.

In topic coherence we calculate the similarity between the top N words within a single topic, so a higher coherence score means better performance of the model. Different methods are available to calculate topic coherence, as listed below.

• C_UCI is based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words.
• C_V is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.
• C_UMass is based on document co-occurrence counts, a one-preceding segmentation, and a logarithmic conditional probability as the confirmation measure.
• C_A is based on a context window, a pairwise comparison of the top words, and an indirect confirmation measure that uses NPMI and cosine similarity.
• C_P is based on a sliding window, a one-preceding segmentation of the top words, and Fitelson's coherence as the confirmation measure.
• C_NPMI is an enhanced version of the C_UCI coherence that uses normalized pointwise mutual information (NPMI).

We use the C_V method, whose score lies between 0 and 1; higher is better. A sketch of computing it is given below, and the table after it shows the C_V scores for the three models.
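A minimal sketch of computing C_V with Gensim's CoherenceModel, reusing the lda model, token lists docs, and dictionary from the earlier sketch:

```python
from gensim.models import CoherenceModel

cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                    coherence="c_v")
print(f"C_V coherence: {cm.get_coherence():.4f}")
```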

Dataset Model Topic Coherence
Gaming Reviews LDA 0.5514
Gaming Reviews BERT 0.4794
Gaming Reviews LDA + BERT 0.5729
Twitter Dataset LDA 0.3218
Twitter Dataset BERT 0.3539
Twitter Dataset LDA + BERT 0.3835
Table 2 - Topic Coherence

K-Means is an unsupervised clustering technique, so we evaluate it based on how balanced and well-separated the resulting clusters are. Different methods are available, but here we use the Silhouette score, which is based on the degree of separation between clusters. For a sample i it is defined as

s(i) = (b_i - a_i) / max(a_i, b_i)

where
• a_i = the average distance from sample i to all other data points in the same cluster
• b_i = the average distance from sample i to all data points in the closest other cluster

The score takes values in the interval [-1, 1], interpreted as follows (a computation sketch follows the list):


• 0 means the clusters are very close to each other
• 1 means the clusters are very far apart (well separated)
• -1 means samples have been assigned to the wrong clusters
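A minimal sketch of the clustering and Silhouette computation with scikit-learn, reusing the latent vectors from the autoencoder sketch:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
labels = kmeans.fit_predict(latent)    # latent = autoencoder output
print(f"Silhouette score: {silhouette_score(latent, labels):.4f}")
```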

The following table shows the Silhouette scores.


Dataset Model Silhouette Score
Gaming Reviews BERT 0.0632
Gaming Reviews LDA + BERT 0.1469
Twitter Dataset BERT 0.0467
Twitter Dataset LDA + BERT 0.1619
Table 3 - Silhouette Score

We have also evaluated our model with different gamma values (i.e., the weight that balances the relative importance of the information from LDA and BERT).

Effect of Gamma
In the LDA+BERT model, we use a gamma variable when concatenating the vector representations from LDA and BERT; its purpose is to balance the relative importance of the information in the two vectors. As we increased the value of gamma from 1 to 15, we obtained better results, as shown below: LDA+BERT achieves the highest coherence and silhouette scores compared both to the other models and to its own variants.
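A sketch of the gamma sweep, reusing the vectors A and B from earlier; for brevity it clusters C directly rather than retraining the autoencoder and clustering in latent space at each setting, as our full pipeline does:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# gamma weights the LDA topic vector A relative to the BERT embedding B
for gamma in [1, 5, 10, 15]:
    C = np.concatenate([A * gamma, B], axis=1)
    labels = KMeans(n_clusters=10, random_state=42, n_init=10).fit_predict(C)
    print(f"gamma={gamma:2d}  silhouette={silhouette_score(C, labels):.4f}")
```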

Figure 14 - Gamma Effect

Conclusion
Latent Dirichlet Allocation (LDA) is the technique most commonly used for topic modelling. It is a probabilistic model and works well on large collections of documents, but challenges have been observed when the same technique is applied to short messages on social media platforms. People are working on short-text topic modelling by applying various other techniques alongside LDA. As part of this project we applied LDA + BERT together and observed the results. When we processed our datasets using LDA alone, the results were not promising because LDA did not consider the contextual information; the BERT-only results were not very good either, since clusters formed but were very close to each other and unbalanced. The LDA+BERT results, however, are good, and we were able to generate more balanced topic modeling clusters. Topic modeling is a very broad research area and we believe there is a lot of room for further development; in particular, more work can be done on the gamma effect and on defining efficient measures to evaluate such models.

References
[1]. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic
analysis. J Am Soc Inf Sci 41(6):391
[2]. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42(1–
2):177–196
[3]. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(Jan):993–1022
[4]. Blei, David M., Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. "Hierarchical topic models and the nested Chinese restaurant process." Advances in Neural Information Processing Systems 16 (2004): 17.
[5]. Rosen-Zvi, Michal, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. "The author-topic model for
authors and documents." In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp.
487-494. AUAI Press, 2004.
[6]. Rosen-Zvi, Michal, Chaitanya Chemudugunta, Thomas Griffiths, Padhraic Smyth, and Mark Steyvers.
"Learning author-topic models from text corpora." ACM Transactions on Information Systems (TOIS) 28, no.
1 (2010): 4.
[7]. Blei, David, and John Lafferty. "Correlated topic models." Advances in neural information processing systems
18 (2006): 147.
[8]. Wallach, Hanna M. "Topic modeling: beyond bag-of-words." In Proceedings of the 23rd international
conference on Machine learning, pp. 977-984. ACM, 2006
[9]. Titov, Ivan, and Ryan McDonald. "Modeling online reviews with multi-grain topic models." In Proceedings of
the 17th international conference on World Wide Web, pp. 111-120. ACM, 2008
[10]. Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification, Published in EMNLP 2014
[11]. Abdalraouf Hassan et al. 2017. Efficient Deep Learning Model for Text Classification Based on Recurrent and Convolutional Layers. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA).
[12]. Tianshi Wang et al. 2020. A multi-label text classification method via dynamic semantic representation model and deep neural network. Applied Intelligence, 2020.
[13]. Hu, Minqing, and Bing Liu. "Mining and summarizing customer reviews." In Proceedings of the 10th ACM SIGKDD, Washington DC, 2004.
[14]. Jey Han Lau. 2010. Best Topic Word Selection for Topic Labelling. Published in COLING 2010
[15]. X. Mao. 2016. A Novel Fast Framework for Topic Labeling Based on Similarity-preserved Hashing.
Published in COLING 2016
[16]. Lim, Kar Wai, and Wray Buntine. "Bibliographic analysis on research publications using authors,
categorical labels and the citation network." Machine Learning 103, no. 2 (2016): 185-213.
[17]. Du, Lan, Wray Buntine, Huidong Jin, and Changyou Chen. "Sequential latent Dirichlet allocation."
Knowledge and information systems 31, no. 3 (2012): 475-503.
[18]. Maaloe, Lars, Morten Arngren, and Ole Winther. "Deep belief nets for topic modeling." arXiv preprint
arXiv:1501.04325 (2015).
[19]. Cao, Ziqiang, Sujian Li, Yang Liu, Wenjie Li, and Heng Ji. "A Novel Neural Topic Model and Its Supervised
Extension." In AAAI, pp. 2210-2216. 2015.
[20]. Budhaditya Saha, Sanal Lisboa, Shameek Ghosh. "Understanding patient complaint characteristics using contextual clinical BERT embeddings." In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC).
[21]. Mingmin Jin, X. Luo, Hankui Zhuo. "Combining Deep Learning and Topic Modeling for Review Understanding in Context-Aware Recommendation." In NAACL-HLT 2018.
