
AMMAR KHALIL

RESEARCH METHODS

ASSESSMENT 1
Nowadays the internet is used in daily life for a wide range of social and commercial
activities. It provides vast amounts of information on virtually any topic; one might say the
internet is an ocean of information, and we can learn about any subject simply by querying it.
There is therefore a need for more efficient strategies and tools that can help identify and
examine content in online social networks, especially those that rely on user-generated content
as a source of information. Moreover, there is a need to extract valuable, hidden information
from the many online sources stored as text and written in natural language within the social
network landscape (e.g., Twitter, LinkedIn, and Facebook).
Natural language processing (NLP) is a field that combines the power of computational
linguistics and machine learning to enable machines to understand, analyze, and derive the
meaning of natural human text. Topic modeling, in particular, has proven effective at
summarizing long texts such as news articles, tweets, and comments. Topic modeling techniques
have been established in text mining because identifying topics manually is neither efficient nor
scalable given the large volume of text generated by humans. Various topic modeling methods
can automatically extract topics from short texts (Cheng et al., 2014) and standard long-text data
(Xie and Xing, 2013). Such methods, including probabilistic latent semantic analysis (PLSA),
latent semantic analysis (LSA), and latent Dirichlet allocation (LDA) (Blei et al., 2003), provide
reliable results in numerous text analysis domains.
Probabilistic topic models are unsupervised machine learning algorithms that were designed to
explore large corpora of data and reveal their hidden topical structure. Beyond uncovering this
hidden structure, topic models also allow us to examine the relationships between topics and
how those topics evolve over time. However, many issues arise when applying TM approaches
to short textual data from OSN platforms, such as slang, data sparsity, spelling and grammatical
errors, unstructured data, insufficient word co-occurrence information, and non-meaningful or
noisy words. For example, Gao et al. (2019) addressed the problem of word sense
disambiguation using local and global semantic correlations obtained from a word embedding
model. Yan et al. (2013) developed a short-text TM method called the biterm topic model
(BTM), which uses word correlations or embeddings to advance TM. The figure below shows
the steps involved in a text mining process (Kaur and Singh, 2019).

The steps involved in a text mining process (Kaur and Singh, 2019).
BACKGROUND AND RELATED WORK
Topic modeling offers a way to gain insight into large bodies of text. This section introduces its
applications in machine learning and the use of topic models in natural language processing.

Topic Modeling
Topic modeling is an unsupervised method of machine learning in which algorithms are used to
uncover the latent thematic structure in document collections. These methods are used for
analyzing large sets of unlabeled text; topic models help to organize, understand, and summarize
vast archives (Blei et al., 2003).
In natural language processing, topic models show positive results when applied to tasks such as
document classification, topic tracking, event detection, word sense disambiguation, and POS
tagging (Panichella et al., 2013).
Many researchers have begun to apply Information Retrieval (IR) methods to analyze the
textual information in software artifacts (Binkley et al., 2014). Among the many prominent
information retrieval techniques, topic modeling based on LDA is also used. Initially, LDA was
applied only to natural language processing tasks involving large amounts of textual data.

How does a topic model work?


Having seen what topic models are, let us look at how they work. A topic model accepts text as
input. The user then specifies the desired number of topics to be recovered from the given text.
The topic model examines the text and finds the topics that best describe it.
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a basic and most popular topic modeling algorithm used for
topic modeling. It is a three level progressive Bayesian model in which the generative model
presents the formation of text from the dataset. Each item of the set is modeled as a finite
mixture over an underlying set of topic probabilities. D. M. Blei et al (2003).
In LDA, the data take the form of a collection of documents, and each document is treated as a
collection of words. The algorithm assumes that each document is represented as a mixture of
latent topics and each topic is represented as a distribution over words. The generative process
in the LDA model is (Blei, 2012):
 For every topic, sample a distribution over words from a Dirichlet prior
 For every document, sample a distribution over topics from a Dirichlet prior
 For every word in the document:
o Sample a topic from the document’s topic distribution
o Sample a word from that topic’s word distribution
o Observe the word
Evaluating topic models
The use of topic modeling has been expanding altogether for analyzing large content text.
Numerous models are being developed dependent on the utilization and relevance.

A large body of related literature is available on topic modeling. For the research process, I
filtered and refined it to reach the final set of articles using the following inclusion and
exclusion criteria.
Inclusion criteria:
 Articles published between 2003 and 2020
 Only research articles, journal articles, and conference papers
 Documents written in English
Due to the strict timeline, I did not include literature in other languages (e.g., Chinese blogs)
because of the time required to validate and translate such articles.
Exclusion criteria:
 Articles that are unavailable as full text documents
 Articles that are not peer-reviewed
 Studies that lack an evaluative perspective on topic models

Research Area:
Topic Modeling
Requirements and Quality Criteria:
As topic models are advancing rapidly and their use is expanding daily, they will be evaluated
under the following criteria. Interpretability, generalizability, and coherence are the three major
criteria, among several others, used to evaluate topic models. For almost every criterion, several
measurement methods exist.
Coefficient of determination:
The coefficient of determination is a popular, intuitive, and easily interpretable goodness-of-fit
measure. The coefficient of determination, denoted R2, is most commonly associated with
ordinary least squares (OLS) regression.
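As an illustrative sketch (not the exact formulation from the literature), a geometric R2 for a fitted topic model can be computed by comparing observed word counts against the counts the model reconstructs; the toy corpus and every name below are assumptions:

```python
# Sketch: geometric R2 for a topic model as goodness of fit
# (illustrative construction; toy corpus, not the canonical formula).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import r2_score

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets rose sharply", "investors bought shares today"]
X = CountVectorizer().fit_transform(docs).toarray()

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)  # document-topic mixtures
# Normalize components_ rows into topic-word probability distributions.
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Expected word counts under the model: doc length times p(word | doc).
expected = (theta @ phi) * X.sum(axis=1, keepdims=True)

# Geometric R2: 1 - SS_residual / SS_total over all count cells.
r2 = r2_score(X.ravel(), expected.ravel())
print(round(r2, 3))
```

A value near 1 means the reconstructed counts closely match the observed ones; negative values would indicate severe misspecification, mirroring the bounded-scale property discussed in the conclusion.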

Interpretability:
The interpretability of topics obtained from topic modeling began to be actively studied in
2009, when a method for human assessment of topic interpretability called word intrusion was
proposed. Intuitively, assessing topic interpretability asks whether a person can understand how
the words representing a topic are related to each other and what general concept they relate to
(Mavrin et al., 2018).
Generalizability:
The generalizability coefficient is analogous to classical test theory's reliability coefficient: the
ratio of the universe-score variance to the expected observed-score variance (an intraclass
correlation). For relative decisions and a p × I × O random effects design, the generalizability
coefficient is the ratio of the universe-score variance to the sum of the universe-score variance
and the relative error variance.
Coherence:
Topic coherence measures score a single topic by measuring the degree of semantic similarity
between its high-scoring words. These measures help distinguish topics that are semantically
interpretable from topics that are mere artifacts of statistical inference.
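One widely used coherence measure, UMass coherence, scores a topic by the document-level co-occurrence of its top words. A self-contained sketch over a toy corpus; the corpus and the example word lists are illustrative assumptions:

```python
# UMass-style topic coherence over a toy corpus (illustrative).
import math

# Each document represented as its set of words (document co-occurrence).
docs = [
    {"cat", "dog", "pet"},
    {"cat", "mat"},
    {"dog", "pet", "food"},
    {"market", "stock", "share"},
    {"stock", "share", "investor"},
]

def doc_freq(word):
    """Number of documents containing the word."""
    return sum(1 for d in docs if word in d)

def co_doc_freq(w1, w2):
    """Number of documents containing both words."""
    return sum(1 for d in docs if w1 in d and w2 in d)

def umass_coherence(topic_words):
    """Sum of log((D(wi, wj) + 1) / D(wj)) over ordered word pairs."""
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((co_doc_freq(wi, wj) + 1) / doc_freq(wj))
    return score

coherent = umass_coherence(["cat", "dog", "pet"])       # words that co-occur
incoherent = umass_coherence(["cat", "stock", "food"])  # words that do not
print(coherent > incoherent)  # → True
```

A semantically interpretable topic (words that appear together in documents) scores higher than a topic assembled from unrelated words, which is exactly the distinction the measure is meant to capture.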

Criteria                        Methods
Coefficient of determination    R2
Interpretability                Point-wise Mutual Information (PMI), NPMI, Annealed
                                importance sampling
Generalizability                Perplexity measure, Relevance computation, Chib-style
                                estimation
Coherence                       Directed acyclic graph, Relevance computation
Clustering                      Cosine distance, K-center clustering
Relative performance of model   Monte Carlo EM estimation, Predictive accuracy
Stability                       KL-divergence calculation

Each criterion above is applied to the same five algorithms: Latent Dirichlet Allocation (LDA),
Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Parallel Latent
Dirichlet Allocation (PLDA), and the Pachinko Allocation Model (PAM).
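Perplexity, the generalizability measure listed above, can be computed directly on held-out documents. A minimal sketch using scikit-learn's LatentDirichletAllocation; the corpus split and parameters are illustrative assumptions:

```python
# Held-out perplexity as a generalizability check (toy corpus; illustrative).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train = ["the cat sat on the mat", "dogs and cats are pets",
         "stock markets rose sharply", "investors bought shares"]
held_out = ["the dog sat on the mat", "markets and shares rose"]

vec = CountVectorizer()
X_train = vec.fit_transform(train)
X_test = vec.transform(held_out)  # held-out docs under the training vocabulary

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)

# Lower held-out perplexity suggests the model generalizes better.
print(lda.perplexity(X_test))
```

Comparing perplexity across candidate models (or across numbers of topics) is a common, if imperfect, model-selection heuristic; as the conclusion notes, such measures can be hard to interpret and compare across corpora.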

CONCLUSION:
R2 has many advantages over standard goodness of fit measures commonly used in topic
modeling. Current goodness of fit measures are difficult to interpret, compare across corpora,
and explain to lay audiences. R2 does not have any of these issues. Its scale is effectively
bounded between 0 and 1, as negative values (though possible) are rare and indicate extreme
model misspecification. R2 may be used to compare models of different corpora, if necessary.
Scientifically-literate lay audiences are almost uniformly familiar with R2 in the context of linear
regression; the topic model R2 has a similar interpretation, making it an intuitive measure.
The standard (geometric) interpretation of R2 is preferred to McFadden’s pseudo R2. The
effective upper bound for McFadden’s R2 is considerably smaller than 1. A scale correction
measure is needed. Also, it is debatable which likelihood calculation(s) are most appropriate.
These issues make McFadden’s R2 complicated and subjective. However, a primary motivation
for deriving a topic model R2 is to remove the complications that currently hinder evaluating and
communicating the fidelity with which topic models represent observed text. Most
problematically, McFadden’s R2 varies with the number of true topics in the data. It is therefore
unreliable in practice, where the true number of topics is unknown.
LIMITATIONS:
More research on topic modeling is needed on the relationship between goodness of fit and
document length, the number of documents in a corpus, and vocabulary size. Results reported in
various papers demonstrate that document length is a considerable factor in model fit, whereas
the number of documents (above 1,000) is not. If robust, this result indicates that the topic
modeling community may need to shift its focus away from scaling estimation algorithms to
large corpora; instead, more effort should be put towards obtaining high-quality data. Also,
studying the relationship between these parameters, along with the number of topics, may
facilitate the development of an adjusted R2, guarding against model overfit.
Because topic modeling is sensitive to the input data and analysis choices, changes (such as
adding new documents or applying text preprocessing steps such as tokenization and stemming)
can generate completely different topics. The resulting topics are therefore often an amalgam,
and it is difficult to establish the truth of their interpretation and validation for a given text
dataset.
RESEARCH PROBLEM:
Various solutions and algorithms are available that can extract topic information or topic models
from a given text. However, no existing solution performs well on both long and short text
corpora.
RESEARCH QUESTION:
How can we design a solution that works for both long and short text corpora?

1. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet
allocation." Journal of Machine Learning Research 3.Jan (2003): 993-1022.
2. Panichella, Annibale, et al. "How to effectively use topic models for software engineering
tasks? an approach based on genetic algorithms." 2013 35th International Conference on
Software Engineering (ICSE). IEEE, 2013.
3. Mavrin, Andrey, Andrey Filchenkov, and Sergei Koltcov. "Four keys to topic interpretability in topic
modeling." Conference on Artificial Intelligence and Natural Language. Springer, Cham, 2018.
4. Binkley, David, et al. "ORBS: Language-independent program slicing." Proceedings of
the 22nd ACM SIGSOFT International Symposium on Foundations of Software
Engineering. 2014.
5. Y. Ding and S. Yan, “Topic Optimization Method Based on Pointwise Mutual
Information,” in Neural Information Processing, S. Arik, T. Huang, W. K. Lai, and Q.
Liu, Eds. Springer International Publishing, 2015, pp. 148–155.
6. Y. Wu, Y. Ding, X. Wang, and J. Xu, “A comparative study of topic models for topic
clustering of Chinese web news,” in 2010 3rd IEEE International Conference on
Computer Science and Information Technology (ICCSIT), 2010, vol. 5, pp. 236–240.
7. J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei, “Reading tea leaves:
How humans interpret topic models,” in Advances in neural information processing
systems, 2009, pp. 288–296.
