
AMMAR KHALIL

RESEARCH METHODS

ASSESSMENT 2
How can we propose a solution that works for both long and short text corpora?
3.1 Research path selection using decision tree
I passed my research question through the research decision tree and found that my research is
applied experimental research: I will compare the algorithms and make decisions according to
the results obtained. First, I will consider all the algorithms and evaluate them with different
parameters such as coherence and generalizability.
3.2 Research type
Our research type is applied experimental research, as we are finding a solution to a real-world
problem: we are comparing the existing solutions and evaluating them on different benchmarks. It
includes a hypothesis ("machine learning algorithms that are not normally used for NLP can be
adapted to topic modeling"), an algorithm that can be manipulated by the researcher, and
variables that can be measured, calculated, and compared.

3.3 Research Methodology:


3.3.1 Tools and techniques

Our research strategy is to provide a solution that performs well on both long and short
corpora during topic modeling. On the basis of the research question, my research is applied
experimental research. The tools required for this research are a system with a powerful GPU and
many CPU cores, for which I do not have enough resources, so I decided to perform the
computation on the Google Colab platform, which is open to developers free of cost. First, I will
preprocess both the short and the long corpora. My approach is to use algorithms such as Latent
Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis
(LSA), Parallel Latent Dirichlet Allocation (PLDA), and the Pachinko Allocation Model (PAM),
compare their results on both long and short corpora, and then pass these algorithms to a voting
classifier, which is a general machine learning algorithm. The voting classifier was not made for
NLP tasks; I want to modify it for NLP so that it gives better results on both short and long
corpora.
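The preprocessing step can be sketched in plain Python; the stop-word list and the two sample documents below are illustrative assumptions, not the actual datasets:

```python
import re

# Small illustrative stop-word list; a real run would use a fuller list,
# e.g. NLTK's or scikit-learn's built-in English stop words.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}

def preprocess(document):
    """Lowercase, keep alphabetic tokens, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

# Hypothetical short- and long-corpus samples, for illustration only.
short_doc = "The weather is sunny and warm"
long_doc = ("Topic modeling is an unsupervised technique used to discover "
            "the abstract topics occurring in a collection of documents.")

print(preprocess(short_doc))  # ['weather', 'sunny', 'warm']
print(preprocess(long_doc))
```

The same function would be applied to every document in both corpora before any of the models below are fitted.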

3.3.2 Algorithm for research


The research algorithms which I am going to use are Latent Dirichlet Allocation (LDA),
Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Parallel Latent
Dirichlet Allocation (PLDA), and the Pachinko Allocation Model (PAM).

3.3.2.1 Latent Dirichlet Allocation (LDA)


LDA is a statistical and graphical model used to uncover connections between the various
documents in a corpus. It is fitted using the Variational Expectation Maximization (VEM)
algorithm, which obtains a maximum likelihood estimate over the entire corpus of text. Topics
could naively be found by picking out the top few words in a bag-of-words representation, but
that approach completely ignores the semantics of the sentences. This model instead follows the
idea that each document can be described by a probabilistic distribution over topics, and each
topic can be described by a probabilistic distribution over words.

3.3.2.2 Non-Negative Matrix Factorization (NMF)

NMF is a matrix factorization strategy in which we ensure that the components of the factorized
matrices are non-negative. Consider the document-term matrix obtained from a corpus after
eliminating the stop words. This matrix can be factorized into two matrices: a term-topic matrix
and a topic-document matrix. There are numerous optimization methods to perform the matrix
factorization; Alternating Least Squares is a quick and effective approach, in which the
factorization proceeds by updating one column at a time while keeping the other columns
constant.
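A minimal sketch of NMF with scikit-learn on a toy corpus follows; the documents are illustrative assumptions, and note that scikit-learn's solver uses coordinate descent rather than the ALS update described above, though the non-negative factorization it produces is the same kind of object:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy corpus (illustrative only).
docs = [
    "cats and dogs are popular pets",
    "dogs love playing in the park",
    "the stock market rose sharply today",
    "investors watch the stock market",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Factorize X ~ W @ H with all entries of W and H non-negative.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # document-topic matrix
H = nmf.components_        # topic-term matrix
print(W.shape, H.shape)
```

Here `W` plays the role of the topic-document matrix and `H` the term-topic matrix from the description above, and non-negativity makes both directly interpretable as (unnormalized) topic weights.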
3.3.2.3 Latent Semantic Analysis (LSA)

Latent Semantic Analysis is also an unsupervised learning method, used to extract relationships
between different words in a pile of documents. This aids us in choosing the correct documents
required. It essentially acts as a dimensionality reduction method, reducing the dimension of a
huge corpus of text data; the discarded dimensions mostly carry noise that would otherwise
obscure the correct insights from the data.
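LSA is commonly implemented as a truncated SVD of the TF-IDF matrix; a minimal sketch with scikit-learn, using the same illustrative toy corpus as an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus (illustrative only).
docs = [
    "cats and dogs are popular pets",
    "dogs love playing in the park",
    "the stock market rose sharply today",
    "investors watch the stock market",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Truncated SVD of the TF-IDF matrix is LSA; keep 2 latent dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(X)  # each document becomes a 2-d vector
print(reduced.shape)  # (4, 2)
```

The reduction from the full vocabulary size down to two latent dimensions is exactly the noise-discarding dimensionality reduction described above.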

3.3.2.4 Parallel Latent Dirichlet Allocation (PLDA)

Parallel Latent Dirichlet Allocation (PLDA) assumes that there exists a set of n labels, and each
of these labels is associated with a topic of the given corpus. The individual topics are then
represented as probabilistic distributions over the whole corpus, similar to LDA. Optionally,
there can also be a global topic assigned to every document, so that there are l global topics,
where l is the number of individual documents in the corpus. The method also assumes that there
exists only one label for every topic in the corpus. Because the labels are given before the
model is built, this process is very quick and precise compared to the methods above.

3.3.2.5 Pachinko Allocation Model (PAM)

The Pachinko Allocation Model (PAM) is an improvement on the Latent Dirichlet Allocation model.
LDA brings out the correlation between words by identifying topics based on the thematic
relationships between the words present in the corpus; PAM improves on this by also modeling the
correlations between the generated topics. This gives the model greater power to determine
semantic relationships precisely, since it takes the relations between topics into account as
well. The model is named after Pachinko, a popular game in Japan. It uses a Directed Acyclic
Graph (DAG), a finite directed graph with no cycles, to represent how the topics are related.

These are famous and widely used algorithms for topic modeling. They all share a drawback:
none of them performs well on both long and short datasets; each performs well on either long
or short datasets only.

3.4 Proposed solution:


Following are the milestones that I want to meet in my research.

3.4.1 Milestone 1:

First, I will collect two datasets, one short and one long, each of which carries some coherent topical content.

3.4.2 Milestone 2:

Then I will evaluate each algorithm/model on both datasets.
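One evaluation parameter named in Section 3.1 is coherence. A hand-rolled sketch of the standard UMass coherence formula is shown below; the tokenized toy documents are illustrative assumptions, and a real evaluation would use a library implementation such as gensim's CoherenceModel:

```python
import math
from itertools import combinations

def umass_coherence(topic_words, tokenized_docs):
    """UMass topic coherence: sum over ordered word pairs (w_i, w_j), i < j,
    of log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents that
    contain the given word(s). Scores closer to zero indicate a more
    coherent topic."""
    def doc_count(*words):
        return sum(1 for doc in tokenized_docs if all(w in doc for w in words))

    score = 0.0
    for w_i, w_j in combinations(topic_words, 2):
        score += math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j))
    return score

# Toy tokenized documents (illustrative only, not the actual datasets).
docs = [{"cat", "dog", "pet"}, {"cat", "pet"}, {"dog", "fish"}]
print(umass_coherence(["cat", "dog"], docs))
```

Computing the same score for the top words of every model, on both the short and the long dataset, gives directly comparable numbers for the next milestone.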


3.4.3 Milestone 3:

Then, according to the evaluated models, I will build a comparison matrix and analyze the results.

3.4.4 Milestone 4:
Prepare the voting classifier, or adapt an algorithm that can be used for NLP tasks such as topic
modeling, and implement KNN and the voting classifier on both datasets.
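This milestone could be prototyped with scikit-learn's VotingClassifier combining KNN with other base learners. The synthetic feature matrix below is an illustrative stand-in for document-topic vectors produced by the models above, not real data, and the choice of base learners besides KNN is an assumption:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for document-topic feature vectors with topic labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # majority vote over the three base models
)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print("held-out accuracy:", acc)
```

Hard voting takes the majority class across the base models; with `voting="soft"` the averaged predicted probabilities would be used instead, which may suit topic distributions better.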

3.4.5 Milestone 5:

Compare the results of the voting classifier with the other algorithms, build a resultant matrix,
and deliver a report that explains the results.

Milestones 4 and 5 are optional: if I get enough time I will try to implement them; otherwise my
research work will go up to Milestone 3.
