Ammar Khalil Research Methods Assessment2
RESEARCH METHODS
ASSESSMENT2
How can we propose a solution that works for both long and short text corpora?
3.1 Research path selection using decision tree
I passed my research question through the research decision tree and found that my work is
applied experimental research: I will compare the algorithms and draw conclusions from the
results obtained. First, I will consider all the candidate algorithms and evaluate them on
criteria such as coherence and generalizability.
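Coherence, one of the evaluation criteria named above, can be computed in several ways. The following is a minimal sketch that assumes the UMass variant (document co-occurrence counts with add-one smoothing) and a corpus already tokenized into word lists. The function name and the toy documents are illustrative assumptions, not part of the planned experiments.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence for one topic.

    top_words: the topic's top words, most probable first.
    documents: list of token lists (the preprocessed corpus).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):  # l < m
        earlier, later = top_words[l], top_words[m]
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_freq(later, earlier) + 1) / denom)
    return score


# Toy corpus (assumed data, for illustration only).
docs = [
    ["topic", "model", "corpus"],
    ["topic", "model", "short", "text"],
    ["football", "match", "goal"],
]
coherent = umass_coherence(["topic", "model"], docs)
incoherent = umass_coherence(["topic", "goal"], docs)
```

Words that co-occur in the same documents score higher than words that never appear together, so `coherent` is larger than `incoherent` here.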
3.2 Research type
Our research type is applied experimental research: we are seeking a solution to a real-world
problem by comparing the existing solutions and evaluating them on different benchmarks. It
includes a hypothesis ("machine learning algorithms that were not designed for NLP can be
applied to the NLP task of topic modeling"), an algorithm that can be manipulated by the
researcher, and variables that can be measured, calculated, and compared.
Our research strategy is to provide a solution that performs well on both long and short
corpora during topic modeling. Based on the research question, my research is applied
experimental research. The work requires a system with a powerful GPU and CPU, which I do not
have, so I decided to perform the computation on the Google Colab platform, which is available
to developers free of charge. First, I will preprocess both the short and the long corpora. My
approach is to apply Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization
(NMF), Latent Semantic Analysis (LSA), Partially Labeled Dirichlet Allocation (PLDA), and the
Pachinko Allocation Model (PAM), compare their results on both the long and the short corpora,
and then pass these algorithms to a voting classifier, which is a machine learning algorithm.
The voting classifier was not designed for NLP tasks, so I want to adapt it for topic modeling
in the expectation that it will give better results on both short and long corpora.
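The preprocessing step that precedes all of these algorithms can be sketched as follows. This is a minimal illustration using only the Python standard library. The stop-word list, the function names, and the two toy corpora are assumptions made for the example, and a real run would likely use a fuller stop-word list and a proper tokenizer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumed); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "on", "in", "to", "for"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def document_term_matrix(texts):
    """Return (vocabulary, matrix) where matrix[d][t] is a raw term count."""
    tokenized = [preprocess(t) for t in texts]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


# Toy stand-ins for the short and long corpora (assumed data).
short_corpus = ["Topic models on tweets", "The match is on"]
long_corpus = ["Topic modeling is the task of finding the topics in a corpus of documents."]
vocab_short, dtm_short = document_term_matrix(short_corpus)
vocab_long, dtm_long = document_term_matrix(long_corpus)
```

The same function is applied to both corpora, so every algorithm in the comparison would receive input prepared in the same way.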
NMF is a matrix factorization technique in which all entries of the factor matrices are
constrained to be non-negative. Consider the document-term matrix obtained from a corpus after
removing the stop words. This matrix can be factorized into two matrices: a document-topic
matrix and a topic-term matrix. There are numerous optimization methods for performing the
matrix factorization. Alternating Least Squares is a quick and effective way to perform NMF.
Here the factorization proceeds by updating each column in turn while holding the other
columns constant.
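To make the factorization concrete, here is a minimal NumPy sketch. Note that it does not use Alternating Least Squares: for brevity it swaps in the multiplicative update rules of Lee and Seung, which also keep both factors non-negative. The function name, the toy matrix, and the choice of two topics are assumptions for illustration. In the code, V is a document-term matrix, W is document-topic, and H is topic-term.

```python
import numpy as np


def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (docs x terms) as W @ H.

    W (docs x topics) and H (topics x terms) stay non-negative because
    each update multiplies by a ratio of non-negative quantities.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics)) + eps
    H = rng.random((n_topics, n_terms)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update W with H fixed
    return W, H


# Toy document-term counts (assumed data): two documents per theme,
# with no vocabulary shared between the themes.
V = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)
W, H = nmf_multiplicative(V, n_topics=2)
reconstruction_error = np.linalg.norm(V - W @ H)
```

Each row of W can be read as a document's weights over topics, and each row of H as a topic's weights over terms.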
3.3.2.3 Latent Semantic Analysis (LSA)
Latent Semantic Analysis is also an unsupervised learning method, used to extract relationships
between different words in a collection of documents. This helps us select the relevant
documents. It acts as a dimensionality reduction method that reduces the dimension of a large
corpus of text data. The discarded dimensions act as noise when determining the correct
insights from the data.
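The dimensionality reduction behind LSA is commonly carried out with a truncated singular value decomposition. The sketch below assumes that approach and uses NumPy only. The function names, the toy matrix, and the choice of two components are illustrative assumptions.

```python
import numpy as np


def lsa(doc_term, n_components):
    """Project documents and terms onto the top singular directions."""
    U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
    doc_vectors = U[:, :n_components] * s[:n_components]  # docs x k
    term_vectors = Vt[:n_components].T                    # terms x k
    return doc_vectors, term_vectors


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy document-term counts (assumed data).
# Columns stand for the terms "topic", "model", "goal", "match".
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)
docs_2d, terms_2d = lsa(X, n_components=2)
same_theme = cosine(docs_2d[0], docs_2d[1])
different_theme = cosine(docs_2d[0], docs_2d[2])
```

In the reduced space, the two documents that share a theme have a cosine similarity close to 1, while documents from different themes have a similarity close to 0.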
Partially Labeled Dirichlet Allocation (PLDA) assumes that there exists a set of n labels and
that each of these labels is associated with topics of the given corpus. The individual topics
are then represented as probability distributions over the whole corpus, similar to LDA.
Optionally, a global topic can also be assigned to every document, so that there are l
global topics, where l is the number of individual documents in the corpus. The method also
assumes that there exists only one label for every topic in the corpus. Because the labels are
given before the model is built, this process can be quicker and more precise than the methods
above.
The Pachinko Allocation Model (PAM) is an improved version of the Latent Dirichlet Allocation
model. LDA captures the correlation between words by identifying topics based on the thematic
relationships between the words in the corpus. PAM improves on this by also modeling the
correlation between the generated topics. This gives the model greater power in determining
semantic relationships precisely, because it also takes the relations between topics into
account. The model is named after Pachinko, a popular game in Japan. The model uses a Directed
Acyclic Graph (DAG), a finite directed graph with no cycles, to represent how the topics are
related.
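The role of the DAG can be illustrated with a small generative sketch. It assumes the common four-level arrangement (root, super-topics, sub-topics, words). The topic names, the vocabulary, and the uniform choice of words within a sub-topic are all simplifying assumptions. It shows only how a word is generated by walking down the graph, not how the model is fitted.

```python
import random

# Hypothetical four-level DAG: root -> super-topics -> sub-topics -> words.
DAG = {
    "root": ["sports", "science"],
    "sports": ["football", "health"],
    "science": ["health", "physics"],  # "health" has two parents: a DAG, not a tree
}
WORDS = {
    "football": ["goal", "match", "team"],
    "health": ["diet", "injury", "sleep"],
    "physics": ["energy", "particle", "wave"],
}


def sample_dirichlet(rng, size, alpha=1.0):
    """Draw one probability vector from a symmetric Dirichlet."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, n_words):
    # Per document, each interior node draws its own distribution over children.
    child_probs = {node: sample_dirichlet(rng, len(kids)) for node, kids in DAG.items()}
    doc = []
    for _ in range(n_words):
        node = "root"
        path = [node]
        while node in DAG:  # walk down until a sub-topic is reached
            node = rng.choices(DAG[node], weights=child_probs[node])[0]
            path.append(node)
        word = rng.choice(WORDS[node])  # uniform word choice keeps the sketch short
        doc.append((word, tuple(path)))
    return doc


rng = random.Random(0)
document = generate_document(rng, n_words=8)
```

Because the sub-topic "health" is reachable from both super-topics, the graph encodes a correlation between topics that a flat set of LDA topics cannot express.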
These are well-known and widely used algorithms for topic modeling. All of them share a
drawback: they do not perform well on both long and short datasets, but perform well on either
long or short datasets only.
3.4.1 Milestone 1:
First, I will collect two datasets, one short and one long, each containing text with meaningful topics.
3.4.2 Milestone 2:
Then, based on the evaluated models, build a comparison matrix and analyze the results.
3.4.4 Milestone 4:
Prepare the voting classifier, or implement an algorithm that can be used for NLP tasks such as
topic modeling. Implement KNN and the voting classifier on both datasets.
3.4.5 Milestone 5:
Compare the results of the voting classifier with those of the other algorithms, build a
results matrix, and deliver a report that explains the results.
Milestones 4 and 5 are optional: if I have enough time, I will try to implement them;
otherwise my research work will extend only to Milestone 3.