
PARALLEL AND DISTRIBUTED COMPUTING

REVIEW-2

TEXT MINING AND CHATBOT REPLY
PARALLELIZATION USING TF-IDF AND COSINE DISTANCE

Harsh Bhardwaj (20BCE0198)


Clifford Christopher (20BCE2352)
Sarthak Giri (20BCE2913)

Introduction:
Natural language processing (NLP) has become an increasingly popular field in recent years, as advances in technology and the availability of large datasets have made it possible to train models capable of handling complex language tasks, such as text mining and chatbot reply generation. However, one major challenge facing the development and deployment of NLP applications is the processing time required to analyze large datasets. To address this challenge, this project proposes a solution that leverages the power of parallel processing and term frequency-inverse document frequency (TF-IDF) to speed up the processing time and improve the accuracy of the results.

TF-IDF is a widely used technique in NLP that assigns a weight to each term in a document based on its frequency in the document and its rarity in the corpus as a whole. This helps to identify important words and phrases in a document and is often used for tasks such as document classification and information retrieval. By parallelizing the TF-IDF algorithm, this project aims to significantly reduce the processing time required to analyze large datasets. Parallel processing involves breaking a task down into smaller, more manageable parts that can be processed simultaneously on multiple processors or cores. This allows the overall processing time to be reduced, as the workload is distributed across multiple processors.

The performance metrics used in this project include processing time and accuracy. By utilizing parallel processing and TF-IDF, it is expected that the processing time will be significantly reduced, making NLP applications more accessible to the wider community. Additionally, the use of TF-IDF is expected to improve the accuracy of the results, as important words and phrases are given greater weight in the analysis.

In summary, this project aims to provide a solution to the challenge of slow processing time for large datasets in text mining and chatbot reply generation. By leveraging the power of parallel processing and TF-IDF, the processing time is expected to be significantly reduced, while improving the accuracy of the results. This has the potential to make NLP applications more accessible and efficient, enabling the development of more sophisticated and useful language-based tools.

Literature Surveys:

1. Petuum: A New Platform for Distributed Machine Learning on Big Data
Observing that many ML systems are inherently optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions, the authors propose Petuum, a general-purpose framework that systematically handles data- and model-parallel issues in large-scale ML. This offers exceptional possibilities for
an integrated system design, including dynamic
scheduling based on ML programme structure and
bounded-error network synchronization. The
effectiveness of these system designs has been
compared to well-known implementations of
contemporary ML methods, demonstrating that Petuum
enables ML programmes to run in a significantly shorter
amount of time and at significantly larger model sizes,
even on modestly-sized computing clusters.
2. Frequent itemset mining on multiprocessor
systems

The paper proposes two techniques to improve the performance of frequent pattern mining on a modern
multi-core machine. The first technique is a
cache-conscious FP-array that improves data locality
performance by utilizing hardware and software
prefetching. The second technique is a lock-free dataset
tiling parallelization mechanism that eliminates locks
and improves temporal data locality performance. The
authors demonstrate that with these proposed
techniques, the overall FP-tree algorithm achieves a
24-fold speedup on an 8-core machine. The paper
concludes that the proposed techniques can be applied to
other data mining tasks as well with the prevalence of
multi-core processors, providing an efficient solution for
high-performance data mining in modern computing
environments.

3. An Improved Text Sentiment Classification Model Using TF-IDF and Next Word Negation
The paper describes a proposed model for sentiment
analysis using three different text representation models.
The first model is a simple binary bag of words model
that represents only the existence of words and doesn't
consider the importance of specific words in a
document. The second model is the bag of words model
with term frequency-inverse document frequency
scores, which assigns scores for each of the words in the
document. The third model is a negation strategy that
modifies the words right after the negation word. The
content provides examples and tables to explain the
different models. The proposed model aims to improve
the accuracy of sentiment analysis, especially in
sentences containing negations.

4. Automatic Multi-Document Summarization for Indonesian Documents Using Hybrid Abstractive-Extractive Summarization Technique
The methods used here combine extractive and abstractive summarization with LSA, drawing on the WordNet dataset, for multi-document summarization. It has been demonstrated that this hybrid abstractive-extractive summarization technique can effectively summarize many documents and produce a quick, readable, well-compressed summary.

5. Scalable Machine Learning on Popular Analytic Languages with Parallel Data Summarization
In contrast to earlier work, in this research the authors generalize a data summarization matrix to produce one or more summaries, which benefits a wider class of models. Their solution performs well on a shared-nothing architecture, the gold standard in large data analytics, with popular programming languages like R and Python. They present an algorithm that computes machine learning models in three phases: Phase 0 prepares the data set and distributes it to the parallel processing nodes; Phase 1 computes one or more data summaries concurrently; and Phase 2 computes a model based on such data set summaries on a single machine.

6. A Distributed HOSVD Method With Its Incremental Computation for Big Data in Cyber-Physical-Social Systems
The paper discusses the problem of text preprocessing
for text classification and proposes an Improved Gini
Index (IGI) algorithm for feature selection and feature
weighting. The algorithm is applied to a sensitive
information identification system and achieves better
precision and recall than traditional algorithms. Feature
selection and weighting are important for text
preprocessing, and the IGI algorithm improves the
performance of classifiers. The paper also compares the
IGI algorithm to other common algorithms for feature
selection and weighting.

7. Cross-Language Information Retrieval based on Parallel Texts and Automatic Mining of Parallel Texts from the Web
This paper discusses the use of a probabilistic translation model for cross-language information retrieval (CLIR) and compares its performance with that of machine translation (MT). The authors point out that MT and IR have different concerns, with MT systems focused on producing syntactically correct sentences and selecting one translation from many possible choices, while IR is usually based on single words and benefits from query expansion by synonyms or related words. The authors argue that using a probabilistic translation model based on parallel texts has advantages over MT and bilingual dictionaries or terminology bases, as it does not require the acquisition or compilation of such resources and makes word translations sensitive to the domain. The paper also investigates the automatic gathering of parallel texts from the web for CLIR training purposes and shows that such a corpus can be as good as a manually constructed one. The experimental results demonstrate that the probabilistic model can achieve comparable performances to the best MT systems.

8. Survey on Parallel Comparison of Text Document with Input Data Mining and VizSFP
Data mining refers to the process of extracting valuable knowledge or information from large datasets. It involves using data analysis tools and techniques to discover patterns, relationships, and insights from the data. Data mining has become an important research topic in recent years, especially with the increasing amount of data available on the web. There are three main categories of web mining: web usage mining, web content mining, and web structure mining.

Web usage mining involves analyzing and discovering interesting patterns in user behavior on the web. Web content mining involves using data and text mining techniques to extract useful information from web pages. Web structure mining involves analyzing the node and connection structure of a website using graph theory.

Data mining is often used in combination with side information, such as hyperlinks, non-textual attributes, document provenance information, and user-access behavior from web logs. The process of data mining involves several steps, including data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.

9. iTextMine: integrated text-mining system for large-scale knowledge extraction from the literature
This paper describes the iTextMine system, an automated workflow that integrates multiple text-mining tools to extract knowledge from large-scale biomedical text. The system employs parallel processing with dockerized text-mining tools and a standardized JSON output format to solve text discrepancy issues for result integration. The iTextMine system integrates four relation extraction tools and incorporates results from PubTator for gene and entity normalization. The paper also compares iTextMine with existing natural language processing frameworks and discusses the challenges faced while integrating different tools for large-scale processing and knowledge integration. The iTextMine system is demonstrated with two use cases involving the genes PTEN and SATB1 and their relation to breast cancer. The iTextMine system provides a user-friendly interface for searching and browsing a wide range of bio-entities and relations.

10. Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents
The paper discusses the process of collecting and preprocessing data from 20 different websites belonging to 4 domains to implement the TF-IDF algorithm. The collected data contained HTML/CSS and stop words, which were removed, and a list of 500 stop words was used to filter out the unnecessary words. The TF-IDF algorithm was implemented by counting the total number of words and their occurrences in all documents, calculating the TF, the IDF, and finally the TF-IDF. The algorithm can be implemented in any programming language, and for this paper it was implemented using PHP for simplicity. The paper also includes a diagram showing all the major and minor steps required for implementing the TF-IDF algorithm using computer programming.

Explanation of the Algorithm

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used in text mining and information retrieval to evaluate the importance of a term in a document relative to a corpus of documents. The intuition behind TF-IDF is that if a term occurs frequently in a document, it is important to that document, but if it occurs frequently in many documents, it may not be as informative.

TF-IDF is calculated for each term in each document in the corpus by multiplying the term frequency (TF) and the inverse document frequency (IDF) values.

Term frequency (TF) measures how frequently a term appears in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in the document.
Inverse document frequency (IDF) measures how important
a term is to the entire corpus of documents. It is calculated as
the logarithm of the total number of documents in the corpus
divided by the number of documents containing the term.
This value is then multiplied by the term frequency to obtain
the TF-IDF score.
TF-IDF is commonly used in information retrieval to rank
documents that are relevant to a user's query. By assigning a
higher score to documents that contain the query terms, and a
lower score to documents that do not contain them, it helps
to identify the most relevant documents.
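
To make these definitions concrete, the following is a minimal C++ sketch of the two factors and their product. It is an illustration only, not the project's implementation; the function names and the representation of a document as a vector of tokens are assumptions.

// Illustrative TF-IDF computation (not the project's exact code):
//   tf(t, d)    = count of t in d / total terms in d
//   idf(t)      = log(total documents / documents containing t)
//   tfidf(t, d) = tf(t, d) * idf(t)
#include <cmath>
#include <string>
#include <vector>

using Document = std::vector<std::string>;   // a tokenized document

double termFrequency(const std::string& term, const Document& doc) {
    double count = 0;
    for (const auto& w : doc)
        if (w == term) ++count;
    return doc.empty() ? 0.0 : count / doc.size();
}

double inverseDocumentFrequency(const std::string& term,
                                const std::vector<Document>& corpus) {
    double containing = 0;
    for (const auto& doc : corpus)
        for (const auto& w : doc)
            if (w == term) { ++containing; break; }
    return containing == 0 ? 0.0
                           : std::log(corpus.size() / containing);
}

double tfidf(const std::string& term, const Document& doc,
             const std::vector<Document>& corpus) {
    return termFrequency(term, doc) * inverseDocumentFrequency(term, corpus);
}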

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. In the context of text mining, the inner product space is the vector space representation of the documents.

Cosine similarity calculates the cosine of the angle between two vectors, which can be interpreted as a measure of their similarity. A value of 1 indicates that the vectors point in the same direction (maximally similar), while a value of 0 indicates that they are orthogonal and share no terms.

To calculate cosine similarity between two documents, we first represent each document as a vector of TF-IDF scores for each term in the corpus. Then, we calculate the cosine of the angle between the two vectors.
Cosine similarity is commonly used in information retrieval
and recommendation systems to identify documents or items
that are similar to a user's query or preferences.
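
As a companion to the TF-IDF sketch above, the following is a minimal C++ sketch of cosine similarity over sparse TF-IDF vectors. Representing a document vector as a term-to-weight map is an assumption made for illustration.

// Cosine similarity between two sparse TF-IDF vectors (illustrative sketch).
#include <cmath>
#include <map>
#include <string>

using TfIdfVector = std::map<std::string, double>;   // term -> TF-IDF weight

double cosineSimilarity(const TfIdfVector& a, const TfIdfVector& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& [term, weight] : a) {
        normA += weight * weight;
        auto it = b.find(term);
        if (it != b.end()) dot += weight * it->second;   // shared terms only
    }
    for (const auto& entry : b) normB += entry.second * entry.second;
    return (normA == 0.0 || normB == 0.0)
               ? 0.0
               : dot / (std::sqrt(normA) * std::sqrt(normB));
}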
Architecture for TF-IDF Parallelization and Text Mining

Proposed Algorithm for Text Mining
This method is based on term frequency and document frequency. For document frequency, we maintain a model file that stores the document frequency of each term along with the stopword frequencies. For our test document, we tokenize at the sentence level for sentence extraction and at the word level for calculating TF-IDF. Parallelism is applied in this tokenization step itself, with multiple threads performing sentence tokenization as well as word tokenization. We then perform a sanitization step, which removes punctuation, and calculate the term frequency of each word. Term frequency is computed from the words stored in the dictionary/map, and document frequency is taken from our initial model. After that, we calculate TF-IDF, i.e., the term frequency multiplied by the inverse document frequency derived from the model. Once these steps are complete, a priority queue is used: for each sentence, we compute a score by summing the TF-IDF values of its words, scale the result by 100, and insert it into the priority queue. Finally, the least important sentences are removed from the priority queue, and the remaining k sentences, in order, form our summary. The value of k can vary.
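
A serial sketch of this scoring and selection step is shown below. It assumes whitespace tokenization and a precomputed term-to-TF-IDF map, and is only meant to illustrate the priority-queue logic described above.

#include <map>
#include <queue>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Score every sentence as the sum of its words' TF-IDF values (scaled by
// 100) and return the k highest-scoring sentences as the summary.
std::vector<std::string> summarize(const std::vector<std::string>& sentences,
                                   const std::map<std::string, double>& tfidf,
                                   std::size_t k) {
    std::priority_queue<std::pair<double, std::string>> pq;   // max-heap
    for (const auto& sentence : sentences) {
        double score = 0.0;
        std::istringstream words(sentence);
        std::string word;
        while (words >> word) {
            auto it = tfidf.find(word);
            if (it != tfidf.end()) score += it->second;
        }
        pq.push({score * 100.0, sentence});
    }
    std::vector<std::string> summary;
    while (!pq.empty() && summary.size() < k) {
        summary.push_back(pq.top().second);
        pq.pop();
    }
    return summary;
}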
SCOPE OF PARALLELISM IN THE ABOVE
PROCEDURE:

1. Searching for Top K sentences

Each tokenized and cleaned sentence is passed to the function that computes its TF-IDF score. A large corpus can contain thousands of sentences, so distributing this task among multiple threads speeds up the computation; this is the scope for parallelism we exploit here. Because all threads share a common priority queue, updates to the queue must be synchronized, for which we use the synchronization constructs of OpenMP, as in the sketch below.
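
A minimal OpenMP sketch of this idea follows. The sentenceScore helper is assumed (one possible definition is given in the next section's sketch); only the push into the shared priority queue needs to be serialized.

#include <omp.h>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Assumed helper: sums the TF-IDF values of a sentence's words and scales
// the result by 100, as in the serial sketch above.
double sentenceScore(const std::string& sentence,
                     const std::map<std::string, double>& tfidf);

// Each thread scores a chunk of sentences; only the update of the shared
// priority queue is protected with an OpenMP critical section.
void scoreSentencesParallel(
        const std::vector<std::string>& sentences,
        const std::map<std::string, double>& tfidf,
        std::priority_queue<std::pair<double, std::string>>& pq) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(sentences.size()); ++i) {
        double score = sentenceScore(sentences[i], tfidf);
        #pragma omp critical
        pq.push({score, sentences[i]});
    }
}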

2. Summing up the TF-IDF score

After the sentences have been divided among threads, the computation within each sentence can also be parallelized. A sentence consists of multiple tokens, so the work over the set of tokens can be divided as well, with every thread adding its tokens' scores to a shared variable (sum) that accumulates the sentence's overall score. A minimal sketch of this reduction follows.
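
The sketch below is one possible definition of the scoring helper declared in the previous sketch: the sentence is tokenized, the tokens are split across threads, and each token's TF-IDF value is combined into the shared sum through an OpenMP reduction. In practice one would usually exploit only one of the two parallel levels (over sentences or over tokens), or enable nested parallelism explicitly.

#include <omp.h>
#include <map>
#include <sstream>
#include <string>
#include <vector>

double sentenceScore(const std::string& sentence,
                     const std::map<std::string, double>& tfidf) {
    // Tokenize on whitespace (assumed tokenization for illustration).
    std::vector<std::string> tokens;
    std::istringstream words(sentence);
    std::string word;
    while (words >> word) tokens.push_back(word);

    // The shared variable `sum` is combined safely by the reduction clause.
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(tokens.size()); ++i) {
        auto it = tfidf.find(tokens[i]);
        if (it != tfidf.end()) sum += it->second;
    }
    return sum * 100.0;   // scaled by 100 as in the scoring step
}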

EXTENSION OF TF-IDF TO CHATBOT REPLY WITH
PARALLELIZATION IN COSINE DISTANCE AND DATASET SEARCH

In general, a chatbot converts the user's natural language query into a system query and compares it against the dataset to generate the output. Whenever the user inputs a sentence (a natural language query), we tokenize and clean the text and compute its TF-IDF scores, where the parallelism proposed above can be reused. The chatbot then compares the query with all the patterns available in the dataset, and the response of the pattern with the highest cosine similarity to the user's text is returned as the output. Here we apply data decomposition: the cosine similarity search over the dataset is divided among multiple threads, the pattern with the highest similarity (least distance) is selected, and the corresponding response is returned to the user.

Parallelism in:

1. Dataset searching (see the sketch below)
2. TF-IDF score calculation
3. Matrix multiplication in computing cosine distance
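
A minimal OpenMP sketch of the parallel dataset search (item 1 above) follows: each thread scans a share of the patterns, keeps its local best match by cosine similarity, and the global best is chosen at the end. The cosineSimilarity function is the one sketched earlier, and the TF-IDF vector representation is an assumption for illustration.

#include <omp.h>
#include <map>
#include <string>
#include <vector>

using TfIdfVector = std::map<std::string, double>;   // term -> TF-IDF weight

// Defined in the cosine similarity sketch above.
double cosineSimilarity(const TfIdfVector& a, const TfIdfVector& b);

// Return the index of the pattern most similar to the query; the chatbot
// then replies with that pattern's stored response.
int bestMatchingPattern(const TfIdfVector& query,
                        const std::vector<TfIdfVector>& patterns) {
    int bestIndex = -1;
    double bestScore = -1.0;
    #pragma omp parallel
    {
        int localIndex = -1;
        double localScore = -1.0;
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(patterns.size()); ++i) {
            double s = cosineSimilarity(query, patterns[i]);
            if (s > localScore) {
                localScore = s;
                localIndex = static_cast<int>(i);
            }
        }
        // Merge each thread's local best into the global best match.
        #pragma omp critical
        if (localScore > bestScore) {
            bestScore = localScore;
            bestIndex = localIndex;
        }
    }
    return bestIndex;
}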
Architecture for Chatbot Reply Parallelization
References:
1. Xing, Eric P., Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. "Petuum: A new platform for distributed machine learning on big data." IEEE Transactions on Big Data 1, no. 2 (2015): 49-67.

2. Li Liu, Eric Li, Yimin Zhang, and Zhizhong Tang. "Optimization of frequent itemset mining on multiple-core processor." In VLDB.

4. Yapinus, Glorian, Alva Erwin, Maulhikmah Galinium, and Wahyu Muliady. "Automatic multi-document summarization for Indonesian documents using hybrid abstractive-extractive summarization technique." In 2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 1-5. IEEE, 2014.

5. Al-Amin, Sikder Tahsin, and Carlos Ordonez. "Scalable machine learning on popular analytic languages with parallel data summarization." In International Conference on Big Data Analytics and Knowledge Discovery, pp. 269-284. Springer, Cham, 2020.

6. "A Distributed HOSVD Method With Its Incremental Computation for Big Data in Cyber-Physical-Social Systems." IEEE Transactions on Computational Social Systems, PP(99): 1-12, May 2018. DOI: 10.1109/TCSS.2018.2813320.

7. Nie, Jian-Yun, Michel Simard, Pierre Isabelle, and Richard Durand. "Cross-Language Information Retrieval Based on Parallel Texts and Automatic Mining of Parallel Texts from the Web." 1999, pp. 74-81. DOI: 10.1145/312624.312656.

8. L. Ballasteros and W. B. Croft. "Resolving ambiguity for cross-language retrieval." ACM-SIGIR, pp. 64-71, 1998; R. D. Brown. "Automatically-extracted thesauri for cross-language IR: When better is worse." 1st Workshop on Computational Terminology (Computerm), pp. 15-21, 1998.

9. I. S. Dhillon, J. Fan, and Y. Guan. "Efficient clustering of very large document collections." In R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu, editors, Data Mining for Scientific and Engineering Applications, pages 357-381. Kluwer Academic Publishers, 2001.
10. Jia Ren, Gang Li, Karen Ross, Cecilia Arighi, Peter McGarvey, Shruti Rao, Julie Cowart, Subha Madhavan, K. Vijay-Shanker, and Cathy H. Wu. "iTextMine: integrated text-mining system for large-scale knowledge extraction from the literature." Database, Volume 2018, 2018, bay128.
