DECLARATION
I hereby declare that the thesis entitled “LLM based news research tool
using langchain” submitted by me, for the award of the degree of Bachelor of
Technology in Computer Science and Engineering, Vellore Institute of
Technology, Chennai is a record of bonafide work carried out by me under the
supervision of Dr. Subbulakshmi P.
I further declare that the work reported in this thesis has not been
submitted and will not be submitted, either in part or in full, for the award of any
other degree or diploma in this institute or any other institute or university.
Place: Chennai
Date:
Signature of the Candidate
School of Computer Science and Engineering
CERTIFICATE
This is to certify that the report entitled “LLM based news research
tool using langchain”, prepared and submitted by Khushi
Habbu (20BCE1612) to Vellore Institute of Technology, Chennai, in partial
fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering, is a bonafide
record of work carried out under my guidance. The project fulfills the requirements as per
the regulations of this University and in my opinion meets the necessary standards
for submission. The contents of this report have not been submitted and will not
be submitted either in part or in full, for the award of any other degree or diploma,
and the same is certified.
Date:
(Seal of SCOPE)
ABSTRACT
People find it harder and harder to effectively navigate and absorb news content in the
digital age due to the abundance of news sources and information outlets. As a result,
there's a rising need for sophisticated technologies that help with the categorization,
distillation, and extraction of insights from massive volumes of news data. This abstract
describes a news research tool built with LangChain, a framework that leverages
Large Language Model (LLM) technology.
A framework designed specifically for Large Language Models (LLMs), LangChain
has a number of features that are necessary for document processing. First of all, it has
multiple text loaders that can handle different formats, including text files, CSVs, and
URLs. Character and recursive text splitters are among the tools that LangChain
offers for separating documents once they have been loaded.
LangChain uses the Hugging Face and OpenAI modules to encode text into numeric
representations that can be stored in vector databases. These libraries provide
transformations and embeddings for words, turning them into vectors of numbers.
The tool also draws on the basics of classical information retrieval (IR) for tasks like
retrieval, summarization, search, and keyword extraction; TF-IDF is a primary
example. LangChain also interfaces with FAISS, a library for similarity search and
clustering of dense vectors, which is used to effectively store and retrieve data from
FAISS indexes. Users can thus pose queries that return words or search results that
are similar to one another. Later on in the process, LangChain integrates the
RetrievalQA with sources chain, a crucial step wherein information chunks are
repeatedly run through LLMs. Filtered chunks serve as triggers for additional
processing and are continuously improved. These sections are combined using
summarization techniques, which improves the effectiveness of the queries that
follow. In the end, the final response is obtained from the condensed and refined
portions after the input query is iteratively processed.
ACKNOWLEDGEMENT
It is with gratitude that I extend my thanks to our visionary leader and Honorable
Chancellor, Dr. G. Viswanathan; Mr. Sankar Viswanathan, Dr. Sekar Viswanathan,
and Dr. G V Selvam, Vice-Presidents; Dr. Sandhya Pentareddy, Executive Director;
Ms. Kadhambari S. Viswanathan, Assistant Vice-President; Dr. V. S. Kanchana
Bhaaskaran, Vice-Chancellor i/c & Pro-Vice Chancellor, VIT Chennai; and Dr. P. K.
Manoharan, Additional Registrar, for providing an exceptional working environment
and inspiring all of us during the tenure of the course.
My sincere thanks to all the faculty and staff at Vellore Institute of Technology,
Chennai, who helped me acquire the requisite knowledge. I would like to thank my
parents for their support. It is indeed a pleasure to thank my friends who
encouraged me to take up and complete this task.
Place: Chennai
CONTENTS
ABSTRACT……………………………………………………………………………… iii
ACKNOWLEDGEMENT………………………………………………………………… iv
CONTENTS……………………………………………………………………………… v
CHAPTER 1
INTRODUCTION
CHAPTER 2
LITERATURE SURVEY……………………………………………………….8
CHAPTER 3
BACKGROUND
CHAPTER 4
MODULES
CHAPTER 5
METHODOLOGY
CHAPTER 6
IMPLEMENTATION AND RESULTS
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 CONCLUSION…………………………………………………………47
7.2 FUTURE WORK……………………………………………………….48
REFERENCES………………………………………………………………...49
LIST OF FIGURES
Fig. 6.1(m) word count per document
LIST OF ACRONYMS
TR TextRank
PR PageRank
NB Naïve Bayes
ML Machine Learning
DL Deep Learning
LR Linear Regression
RP Random Projection
NLP Natural Language Processing
US United States
DT Decision Tree
RF Random Forest
IR Information Retrieval
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
The proposed LangChain-based LLM news research tool has a great chance of
fundamentally altering how people interact with and understand news content.
The LangChain framework provides a structured approach to leverage LLM
capabilities, ensuring more accurate, language-specific, and context-sensitive news
article insights. This has implications for improving the range and efficiency of
information extraction across different language settings.
The project incorporates a Large Language Model (LLM) for accurate analysis and
utilizes LangChain's modular architecture for seamless integration and enhanced
natural language processing. Data is processed to tidy and arrange news items,
ensuring accuracy and relevance for analysis. Language processing is used to develop
entity-recognition, concept-extraction, and comprehension algorithms for search and
retrieval. These algorithms enable the efficient use of advanced search tools to obtain
relevant news using natural language queries. For the user interface, an intuitive
dashboard with movable widgets enhances the professional user experience.
1.2 RESEARCH CHALLENGES
To build precise and efficient algorithms for keyword extraction, and to ask a query
and retrieve an answer from news articles, researchers must overcome numerous
challenges that span various aspects of natural language processing and
domain-specific knowledge.
Ambiguity and Polysemy: Words and phrases sometimes have more than one meaning,
which leads to polysemy and ambiguity. Accurate keyword extraction and query
understanding depend on the ability to interpret a word or phrase in context.
Named Entity Recognition (NER): To comprehend the content of news items, named
entities such as individuals, groups, places, dates, and numerical expressions must be
recognized and categorized. However, because of differences in entity references and
interpretations that depend on context, NER can be difficult.
Domain-Specific Terminology: Words and phrases that are not found in conventional
language models may be used in news articles. Accurate results require keyword
extraction and query interpretation algorithms that incorporate domain-specific knowledge.
Length and Structure of the Document: News pieces come in a variety of forms, from
short news snippets to lengthy investigative reports. Scalable and successful keyword
extraction and retrieval need algorithms that can handle documents of varying length
and structure.
Temporal Dynamics: News items are time-sensitive, and over time, search terms and
queries may become less relevant. To provide current and relevant results, keyword
extraction and retrieval algorithms must take temporal information into account, as
well as the recency of articles.
Multimodal Content: In addition to text, news stories may also include pictures, videos,
and other types of multimedia content. The effectiveness and coverage of keyword
extraction and query retrieval systems can be improved by incorporating multimodal
content.
Cross-Lingual Challenges: News stories may be published in more than one language,
necessitating the use of algorithms capable of comprehending queries and extracting
keywords across languages. Serving a variety of user communities requires overcoming
linguistic obstacles and making sure that translations and interpretations are accurate.
User Intent and Context: Relevant and customized keyword recommendations and
search results depend on an understanding of the user's intent and context. Enhancing
the precision and usability of keyword extraction and retrieval systems can be achieved
by integrating user feedback, preferences, and browsing history.
Scalability and Efficiency: In order to handle massive numbers of news items and user
inquiries in real-time, keyword extraction and query retrieval algorithms need to be
both scalable and efficient.
Through the resolution of these issues, scientists can create algorithms for query-based
information retrieval from news articles and keyword extraction that are more precise,
effective, and intuitive to use.
Creation of the LangChain Framework: Create and put into use the LangChain
framework specifically for LLM-based news analysis. Using LLMs, create a solid
architecture that combines the many parts needed for news analysis.
Multiple Source Text Loaders: Create text loaders that can handle a variety of document
types, including CSV files, text files, and URLs. Ensure effective textual data extraction
and processing from a variety of sources for the LangChain architecture.
Text Splitter with Recursive Characters: Use recursive and character-based splitters to
divide documents into more manageable sections. Semantic integrity and coherence
are preserved during the splitting process for further analysis.
Integration of Hugging Face and OpenAI for Embeddings: Combine the Hugging Face
and OpenAI tools to create embeddings from textual data.
Modern pre-trained language models can be used to convert sentences into numerical
representations for additional analysis.
Combining LLMs to Perform Multi-pass Analysis: Combine LLMs with the Langchain
framework to analyze retrieved text chunks across several passes. Make use of LLMs
to produce summaries, extract important data, and incrementally improve query replies.
Evaluation of the Tool: Evaluate the tool across a range of news datasets and research
scenarios with respect to accuracy, efficiency, scalability, and usability.
Verification using User Input and Case Studies: Use user input and real-world case
studies to validate the tool's usefulness and efficacy. To improve usability and
practicality, iterate on the design and functionality depending on user experiences and
requirements.
Our goal is to create a comprehensive LLM project covering a real-world industrial use
case for research analysts. We also want to create a news research tool that allows users
to provide the URLs of numerous news stories, and when they ask a question, the
system will pull the response from the filtered portions of those articles.
Simply using a keyword search is insufficient; for instance, when we search for "apple"
on Google, we get information about both the Apple company and the apple fruit. To
address this, our project uses vector databases and embeddings to precisely understand
the context of the search and retrieve pertinent sections from the URLs, which are then
used to look for answers.
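The context-aware retrieval described above rests on comparing embedding vectors with cosine similarity. The sketch below illustrates the idea with tiny made-up three-dimensional vectors; the numbers are purely illustrative, since real embeddings from OpenAI or Hugging Face models have hundreds of dimensions:

```python
import math

def cosine_similarity(v1, v2):
    # cos(v1, v2) = (v1 . v2) / (||v1|| * ||v2||)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy 3-dimensional "embeddings" (hypothetical values for illustration only).
query_apple_company = [0.9, 0.1, 0.3]   # query about the Apple company
doc_iphone_launch   = [0.8, 0.2, 0.4]   # article on an iPhone launch
doc_fruit_farming   = [0.1, 0.9, 0.2]   # article on apple orchards

print(cosine_similarity(query_apple_company, doc_iphone_launch))  # high
print(cosine_similarity(query_apple_company, doc_fruit_farming))  # low
```

Because the query vector points in nearly the same direction as the technology article and away from the farming article, the retriever can return the contextually relevant sections rather than every document that merely contains the word "apple".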
In many text processing tasks, such as text categorization, information mining, and text
retrieval, keyword extraction is essential. Because of its ease of use and ability to
accurately determine term relevance based on word frequency, the TF-IDF algorithm
has become one of the most popular established approaches. On the other hand,
depending exclusively on word frequency data may have drawbacks, especially in
situations when words might not appropriately convey their importance within the page.
TextRank and RAKE (Rapid Automatic Keyword Extraction) are two more
sophisticated methods that you can use to enhance keyword extraction beyond the
capabilities of standard TF-IDF.
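To make the TF-IDF scoring concrete, here is a minimal pure-Python sketch of frequency-based keyword extraction. The documents are toy examples; a production system would typically use scikit-learn's TfidfVectorizer with proper tokenization and stop-word handling:

```python
import math
from collections import Counter

def tfidf_keywords(documents, top_k=3):
    """Score terms in each document by TF-IDF and return the top-k keywords."""
    n_docs = len(documents)
    tokenised = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))
    results = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # TF-IDF(t, d) = (count(t, d) / |d|) * log(N / df(t))
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        results.append(top)
    return results

docs = [
    "stock market rises as bank shares rally",
    "cricket team wins the final match of the series",
    "bank announces new stock buyback programme",
]
print(tfidf_keywords(docs))
```

Note how a term that appears in every document receives log(N/N) = 0 and is never selected, which is exactly the weakness-of-raw-frequency problem that TF-IDF corrects.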
Text loaders with many sources can handle a variety of document types, including text
files, CSV files, and URLs. Text Splitter with Recursive Characters splits documents
into digestible chunks while maintaining semantic consistency by using recursion and
character splitters. Within the Langchain framework, this division makes additional
analysis easier. Using contemporary pre-trained language models, Hugging Face and
OpenAI tools are integrated to build embeddings from textual data. This makes it
possible to translate language into numerical representations, which improves the
system's analysis capabilities.
Information chunks are repeatedly passed through LLMs using the map-reduce method
rather than the stuff method; that is, the drawback of exceeding the token limit (or of
consuming the API's token budget) is solved in the important step where LangChain
integrates the RetrievalQA with sources chain later in the process. Filtered pieces are
constantly refined and act as catalysts for further processing. Summarization
techniques are used to combine these portions, making the following queries more
effective. After processing the input query iteratively, the final response is derived
from the condensed and refined components.
Casual readers: News article text retrieval and keyword extraction can help casual
readers rapidly assess an article's relevance and content. Users can obtain the answers
they require from a list of three linked URLs on a certain issue without having to read
each article in full. This feature improves the user experience by helping the user make
better selections when choosing which articles to read.
News analysts and journalists: Using this tool, journalists can swiftly compile pertinent
news pieces, examine trends, and derive important information to aid in their reporting
and investigative work. News analysts can use the program to track public opinion
toward particular subjects or events, do in-depth research of news material, and
discover emergent topics.
Teachers and Pupils: Teachers can use the application to teach students about news
research methodology, natural language processing techniques, and information
retrieval in journalism, media studies, and data analytics courses.
The application can be used by students for academic assignments, research papers, and
theses that deal with sentiment analysis, media bias detection, and news analysis.
The People in General: The tool can help members of the general public stay up to date
on news developments, hot themes, and current events by compiling and summarizing
news sources. Using the technology, citizen journalists and bloggers may curate news
stories, evaluate public discourse, and provide their audiences with insightful content.
Overall, the LLM-based news research tool with Lang chain meets the needs of a wide
range of users and offers useful tools for compiling, evaluating, and interpreting news
sources in a thorough and effective way.
CHAPTER 2
LITERATURE SURVEY
Natural language processing, or NLP, has emerged as a key instrument for news item
analysis and summary. It helps with tasks including sentiment analysis, trend analysis,
summarization, and categorization. An overview of recent research articles using
natural language processing (NLP) approaches for news analysis and summarization is
given in this literature review.
Paper 1: Using NLP to Classify Bengali Crime News Based on Newspaper Headlines
Using NLP approaches, Khan et al. present a system for categorizing Bengali crime
news based on newspaper headlines. The authors want to offer news organizations and
law enforcement agencies useful information by automatically classifying crime-
related news pieces.
Paper 2: Using Natural Language Processing to Analyze Stock Market Trends on Indian
Financial News Headlines Using NLP, Saxena et al. analyze stock market trends based
on Indian financial news headlines. The writers hope to provide decision-making
assistance to investors and financial experts by revealing market sentiment and trends.
Paper 3: Examining News Articles on Cybersecurity Using the Topic Modeling Method
Ghasiya and Okamura use topic modeling techniques to look at cybersecurity news
stories. The authors want to provide important insights for security professionals and
policymakers by identifying important themes and trends from cybersecurity news
through the application of NLP-based topic modeling techniques.
Paper 5: Using LDA Based Topic Modeling to Observe Bangla News Trends
A study on trend detection in Bangla news with LDA-based topic modeling is presented
by Alam et al. The authors want to get insights into societal trends and interests by
using natural language processing (NLP) techniques to identify and analyze subjects
that frequently appear in Bangla news stories.
Paper 7: Text Recap: Tamil Online Sports News Text Summary
Priyadharshan and Sumathipala use natural language processing (NLP) to create a
text summary method
for Tamil online sports news. In order to give readers concise and educational content,
the authors automatically generate summaries of sports news stories in Tamil.
Paper 8: Brief Text Synopsis of News Stories
A technique for in-short text summarization of news stories is presented by Deny et
al. The authors hope that by using NLP approaches,
they will be able to provide succinct summaries of news stories that will help users
quickly understand the key ideas.
Paper 9: Sentiment Analysis Based on NLP and Text Summarization for News Text Analysis
Using text summarization and sentiment analysis based on NLP approaches,
Mishra et al. investigate news text analysis. The authors want to offer thorough insights
into news items, enabling better comprehension and interpretation, by fusing sentiment
analysis and text summary.
Paper 11: A Survey of Text Summarization for Product Reviews Based on Natural Language Processing
A survey on NLP-based text summary for product reviews is
conducted by Boorugu and Ramesh. The authors hope to shed light on the difficulties
and cutting-edge methods involved in using natural language processing (NLP) to
summarize product reviews by examining current methods.
Paper 15: Using NLP Technologies to Extract Keywords from Tweets in Order to Gather Pertinent News
Jayasiriwardene and Ganegoda provide a way, based on NLP technologies, for
extracting keywords from tweets in order to gather pertinent news. The authors hope
to help users stay up to date on current events by using tweets' keywords to find
pertinent news articles.
In summary, the reviewed works show how NLP may be used extensively for news
summarization and analysis in a variety of languages and fields. In the ever-expanding
world of digital news, natural language processing (NLP) techniques provide useful
tools for extracting insights and improving information consumption in a variety of
applications, from crime classification and stock market trend research to
cybersecurity analysis and product review summarization.
CHAPTER 3
BACKGROUND
An enormous number of news stories has been produced and disseminated across
numerous channels in recent years due to the exponential rise of digital media and
online news platforms. It becomes very necessary to have effective tools and methods
to browse, comprehend, and extract useful insights from news material in this enormous
sea of information. Because of the size, complexity, and diversity of contemporary
news data, traditional approaches to news analysis frequently prove inadequate. By
utilizing large language models, including Hugging Face's transformers and OpenAI's
GPT, the LLM-based news research tool overcomes these difficulties. Due to their
training on enormous datasets with billions of words, these models are able to
understand the complex patterns, semantic linkages, and contextual subtleties seen in
natural language processing.
Emphasizing information retrieval (IR) methods is one of the tool's primary features.
IR algorithms are essential for activities including content retrieval, keyword
extraction, summarization, and search. Through the use of cutting-edge information
retrieval techniques, the application helps users to effectively search through enormous
databases of news stories, extract pertinent information, and find insightful information
concealed within the data.
Algorithms that capture the semantic relationships between words are necessary to
tackle the issues that synonyms and polysemy present. Word embeddings and semantic
networks, for example, can be used to represent words in a high-dimensional space
where synonyms and related phrases are closer together. By using these semantic
representations, algorithms are able to discern between several word meanings and
identify relevant terms based on their contextual usage. By considering the underlying
semantic relationships between words, this strategy increases the accuracy of keyword
extraction.
In the context of information retrieval and machine learning, TF-IDF (Term Frequency-
Inverse Document Frequency) and Faiss (Facebook AI Similarity Search) are not
directly comparable because they have different functions and operate in various areas.
It is the result of multiplying two metrics: how often a term occurs in a document
(term frequency, TF) and a measure of how rare the term is across the total corpus
(inverse document frequency, IDF).
Information retrieval, document categorization, and keyword extraction are among the
common tasks for which TF-IDF is employed. It aids in finding key terms in
documents and evaluating how relevant they are to particular searches.
TF(t, d) = count(t, d) / |d|

where |d| is the total number of terms in document d and count(t, d) is the number of
times term t appears in document d. Analogously, IDF(t) = log(N / df(t)), where N is
the number of documents in the corpus and df(t) is the number of documents
containing term t; the TF-IDF score is the product TF(t, d) × IDF(t).
Facebook AI Similarity Search, or Faiss:
Facebook AI Research created the Faiss package, which makes massive dataset
clustering and similarity search more effective.
Unlike TF-IDF, Faiss does not have a set formula. Rather, it uses a variety of similarity
search methods, the most popular of which is cosine similarity. The formula for
calculating the cosine similarity between two vectors, v1 and v2, is:

cos(v1, v2) = (v1 ⋅ v2) / (∥v1∥ ∥v2∥)

where the Euclidean norm is represented by ∥⋅∥ and the dot product by ⋅.
Although Faiss lacks a precise formula like TF-IDF, it offers effective implementations
of similarity search algorithms, including k-nearest neighbours (KNN) and
approximate nearest neighbour (ANN) searches.
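What a flat (exact, non-approximate) FAISS index computes can be shown with a brute-force k-nearest-neighbour search in plain Python. This is only a conceptual sketch of the search that FAISS accelerates, not the FAISS API itself, and the vectors are toy values:

```python
import math

def knn_search(index_vectors, query, k=2):
    """Exact k-nearest-neighbour search by Euclidean distance,
    conceptually what a flat FAISS index computes over its stored vectors."""
    distances = [
        (math.dist(query, vec), i) for i, vec in enumerate(index_vectors)
    ]
    distances.sort()  # smallest distance = most similar
    return [(i, d) for d, i in distances[:k]]

# Four stored "embeddings" and one query vector (illustrative 2-D values).
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.2, 0.9]]
print(knn_search(vectors, [1.0, 0.9], k=2))
```

FAISS performs the same kind of ranking, but with optimized index structures so that searching millions of high-dimensional vectors remains fast.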
CHAPTER 4
MODULES
The dataset used is Articles.csv from Kaggle, totaling 2692 news articles. The source of
this dataset is the website https://www.thenews.com.pk. It features business and
sports-related news pieces from 2015 onward, and includes each article's heading,
content, and date. The location where the comment or article was published is also
included in the content. We used this dataset to gain an understanding of how
information is retrieved from URLs or CSV files (i.e., documents) and to study the
similarity, relevance, and extraction among them. After processing the input query
iteratively, the final response is derived from the condensed and refined components;
the dataset thus helped us understand relevance scores and cosine similarity using
TF-IDF, BM25, and FAISS.
Python has a built-in module called os that offers many functions to facilitate
interaction with the operating system.
NumPy is a core package for scientific computing in Python that supports strong
matrices and multi-dimensional arrays and offers a set of mathematical operations to
effectively work with these arrays. NumPy is the cornerstone for many other Python
scientific libraries, including deep learning packages, thanks to its array-oriented
computing capabilities. High performance and scalability are ensured by the
performance-critical operations of NumPy being implemented in C and Fortran, which
makes it a vital tool for numerical calculation and data manipulation in deep learning
processes.
Scikit-learn: Built on top of NumPy, SciPy, and matplotlib, Scikit-learn is a flexible
machine learning framework that provides easy-to-use tools for data mining and
analysis. While scikit-learn is mostly concerned with classical machine learning
algorithms, including dimensionality reduction, regression, clustering, and
classification, it also offers a neural network submodule for simple neural network-
based learning applications. For the purpose of prototyping and implementing machine
learning models in diverse applications, data scientists and machine learning
practitioners frequently choose scikit-learn due to its comprehensive documentation,
well-defined and uniform API, and focus on code readability.
Matplotlib: A complete Python library for producing static, interactive, and animated
visualizations is called Matplotlib. With its extensive range of customization choices
and plotting features, users can produce figures for data analysis and presentation that
are suitable for publication. Matplotlib is extensively utilized in scientific computing,
data visualization, and machine learning applications due to its object-oriented API and
support for multiple output formats, such as PNG, PDF, and SVG. Its ability to be
integrated with other Python libraries, such pandas and NumPy, makes it a vital tool for
deriving and sharing insights from data.
Streamlit: a library for building interactive web apps for data science and machine
learning projects.
Pickle: another built-in Python module, used for serializing and deserializing Python
objects.
The langchain package provides several submodules:
Chains: Classes and functions for creating and utilizing NLP chains, like question-
answering chains, are probably included in this submodule.
text_splitter: It appears that this submodule is responsible for text splitting, perhaps for
dividing lengthy papers into manageable chunks.
document_loaders: Unstructured text data from many sources, including URLs, is most
likely loaded by this submodule.
embeddings: Classes for creating embeddings from text data are probably contained in
this submodule.
vectorstores: This submodule could manage tasks pertaining to the archiving and
retrieval of text data represented as vectors.
Pandas is a robust Python toolkit for analysis and data manipulation. It offers high-level
data structures that are effective in managing structured data, like DataFrame and
Series. Pandas is a popular tool for data cleaning, transformation, filtering, grouping,
and aggregating operations in data science and analysis processes.
nltk (import nltk): Natural Language Toolkit, or NLTK for short, is a comprehensive
Python toolkit for activities related to natural language processing (NLP). Text
preparation, tokenization, stemming, lemmatization, part-of-speech tagging, parsing,
and semantic analysis are among the functions it provides. For textual data-related
machine learning, information retrieval, and text mining applications, NLTK is
extensively utilized.
re (import re): Regular expressions, which are effective tools for text manipulation and
pattern matching, are supported by Python's re module. Based on predefined patterns,
you can use it to search for, extract from, replace, and edit text. In activities involving
text processing, information extraction, and data cleaning, regular expressions are
frequently utilized.
punkt: The punkt module of NLTK offers tokenizers that have already been trained to
separate text into words and sentences. It is helpful for NLP tokenization jobs,
particularly when handling material with complicated sentence patterns or text in many
languages.
wordnet: The wordnet module of NLTK is a lexical database for the English language
that includes relationships, synonyms, antonyms, and definitions of words. WordNet is
frequently used for lexicography, semantic analysis, and developing natural language
processing (NLP) systems that need to know word meanings and relationships.
CHAPTER 5
METHODOLOGY
Multiple Source Text Loaders: Create text loaders that can handle a variety of document
types, including CSV files, text files, and URLs. Ensure effective textual data extraction
and processing from a variety of sources for the LangChain architecture.
Text Splitter with Recursive Characters: Use recursive and character-based splitters to
divide documents into more manageable sections. Semantic integrity and coherence
are preserved during the splitting process for further analysis.
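The recursive splitting step can be sketched in plain Python. This is a simplified stand-in loosely modelled on LangChain's RecursiveCharacterTextSplitter; the chunk size, separator order, and greedy merging strategy below are illustrative assumptions, not the library's exact behaviour:

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", ". ", " ")):
    """Try the coarsest separator first and fall back to finer ones only
    for oversized pieces, then greedily merge small neighbours back together."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-cut the text at chunk_size boundaries.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            pieces.append(piece)
        else:
            pieces.extend(recursive_split(piece, chunk_size, rest))
    # Merge neighbouring pieces so chunks stay close to (but never over) chunk_size.
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current.strip():
                chunks.append(current)
            current = piece
    if current.strip():
        chunks.append(current)
    return chunks

article = ("LangChain loads articles from URLs.\n\n"
           "The text is then split into chunks. Each chunk is embedded "
           "and stored in a vector index for similarity search.")
for chunk in recursive_split(article, chunk_size=60):
    print(repr(chunk))
```

Splitting on paragraph breaks before sentences, and sentences before words, is what preserves semantic coherence: a chunk is only cut mid-sentence when no coarser boundary fits.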
Integration of Hugging Face and OpenAI for Embeddings: Combine the Hugging Face
and OpenAI tools to create embeddings from textual data.
Modern pre-trained language models can be used to convert sentences into numerical
representations for additional analysis.
Combining LLMs to Perform Multi-pass Analysis: Combine LLMs with the Langchain
framework to analyze retrieved text chunks across several passes. Make use of LLMs
to produce summaries, extract important data, and incrementally improve query responses.
Fig. 5.3(a) summarization of chunks
There are two methods to summarize chunks: the stuff method (simple append method)
and the map-reduce method. A drawback of the stuff method is that it can exceed the
token limit and use more space, as the combined summary chunk is huge. We can solve
this with the map-reduce method: based on the splitters and separators, we obtain
chunks via similarity search. Say we get 4 chunks; all 4 are sent again through LLMs,
and the chunks are filtered down to only the necessary information using
keyword-extraction similarity search with FAISS. The filtered chunks are summarized
and, along with the input query, sent to a final LLM, which returns the answer. In this
way we keep the token limit from being exceeded, and it is easier to get a precise
answer at a lower cost (as cost is calculated from token size). The figure below
illustrates the map-reduce method.
Fig. 5.3(b) map reduce method
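The control flow of the map-reduce method can be sketched as follows. The LLM call is replaced here by a trivial placeholder (first-sentence extraction) so the flow is runnable without an API key; in the actual tool, each call would go through LangChain to an OpenAI or Hugging Face model:

```python
def fake_llm_summarize(text: str) -> str:
    """Placeholder for an LLM call: keep only the first sentence.
    In the real pipeline this would be a model invocation via LangChain."""
    return text.split(". ")[0].strip(". ") + "."

def map_reduce_answer(chunks, query):
    # Map step: summarize each retrieved chunk independently, so no single
    # LLM call has to fit all the chunks inside its token limit.
    partial_summaries = [fake_llm_summarize(c) for c in chunks]
    # Reduce step: combine the partial summaries and answer the query
    # from the condensed context in one final call.
    combined = " ".join(partial_summaries)
    return fake_llm_summarize(f"Answer to '{query}' based on: {combined}")

chunks = [
    "Shares of the bank rose 4% on Monday. Analysts cited strong results.",
    "The bank also announced a buyback. Markets reacted positively.",
]
print(map_reduce_answer(chunks, "How did the bank's shares move?"))
```

The stuff method would instead concatenate every chunk into one prompt; the map step is what keeps each individual call small.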
5.4 RETRIEVALQAWITHSOURCESCHAIN
Question Understanding: makes use of LLMs to interpret the meaning behind user
inquiries, enabling more precise information retrieval.
Source Integration: Provides support for integration with many information sources,
including text documents, databases, and websites, allowing for thorough search
functions.
Answer extraction: Takes into account context, relevance, and coherence while using
sophisticated approaches to extract responses from sources that have been retrieved.
Customization: Provides adaptability to various domains and use cases, enabling users
to modify the retrieval and question-answering procedure in accordance with their needs.
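The retrieve-then-answer-with-sources behaviour can be illustrated with a minimal stand-in in plain Python. Simple word overlap replaces both the embedding-based retrieval and the LLM answer step, and the documents and URLs below are invented for the example:

```python
def retrieve_with_sources(question, documents, k=1):
    """Minimal stand-in for a retrieval-with-sources step: rank documents
    by word overlap with the question and return the best passages together
    with the URL they came from. The real chain would additionally pass the
    retrieved text through an LLM to phrase the final answer."""
    q_words = set(question.lower().split())
    scored = []
    for doc in documents:
        overlap = len(q_words & set(doc["text"].lower().split()))
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"answer_passage": doc["text"], "source": doc["source"]}
        for overlap, doc in scored[:k]
    ]

docs = [
    {"text": "the rupee fell against the dollar on tuesday",
     "source": "https://example.com/markets"},
    {"text": "the cricket team won the series decider",
     "source": "https://example.com/sports"},
]
print(retrieve_with_sources("why did the rupee fall against the dollar", docs))
```

Carrying the source URL alongside each passage is the key design point: the final answer can always cite which article it came from.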
Python Interface: FAISS offers a Python interface that is simple to integrate into your
machine learning pipelines and works well with well-known libraries like NumPy and
scikit-learn.
Pre-trained Models: FAISS makes it simple to install and integrate into applications by
providing pre-trained models and indexes for typical use cases.
All things considered, FAISS is a strong library that can be used to do clustering and
similarity search in high-dimensional spaces. It is extensively utilized in many different
applications, such as content-based retrieval systems, recommendation systems, and
search engines.
CHAPTER 6
IMPLEMENTATION AND RESULTS
Dictionary filtering: It removes tokens that are present in more than 50% of the
documents or in fewer than 20 documents. This helps eliminate words that are either
exceedingly rare or overly common and might not be useful for topic modelling.
Creation of the Corpus: Each document is represented as a list of (word ID, word frequency) tuples, giving a bag-of-words representation of the corpus.
LDA Model Training: Using the corpus and the filtered dictionary, it trains a Latent Dirichlet Allocation (LDA) model. LDA is a generative statistical model that explains sets of observations through unobserved groups, or topics.
Topic Analysis: It prints the top words for each topic produced by the LDA model. Every topic is a distribution over words, and the most probable terms for each topic are shown as its top words.
Coherence Score Computation: Using the 'c_v' coherence measure, it calculates the LDA model's coherence score. Coherence measures the semantic similarity of the high-scoring words in each topic.
To sum up, the code takes a set of headlines, uses LDA to find topics within them, and then uses coherence to assess the quality of those topics.
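The dictionary and bag-of-words corpus steps described above can be sketched without the gensim dependency. This is an illustrative reimplementation of what gensim's `Dictionary` and `doc2bow` produce, kept self-contained; the function names are ours.

```python
from collections import Counter

def build_dictionary(docs: list[list[str]]) -> dict[str, int]:
    """Assign an integer ID to every distinct token, in first-seen order."""
    word2id: dict[str, int] = {}
    for doc in docs:
        for tok in doc:
            word2id.setdefault(tok, len(word2id))
    return word2id

def doc2bow(doc: list[str], word2id: dict[str, int]) -> list[tuple[int, int]]:
    """Represent one document as sorted (word ID, frequency) tuples."""
    counts = Counter(word2id[t] for t in doc if t in word2id)
    return sorted(counts.items())
```

The (word ID, frequency) tuples are exactly the corpus format the LDA trainer consumes; the frequency-based filtering described above would simply drop entries from `word2id` before `doc2bow` is applied.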
Fig. 6.1(b) simple BM25 code
Fig. 6.1(c) simple BM25 code
Fig. 6.1(e) simple BM25 output with score
The output shows the relevance score for the question asked, computed by BM25 (related to TF-IDF, but working over an inverted index): you ask a query and, among the URLs, it returns the top 5 documents ranked from most to least relevant by score. Now let's look at a few visualizations to understand these extraction methods better.
Fig. 6.1(g) relevance score of top documents in bar plot
In the second-to-last code cell we add the computed TF-IDF vectors to a FAISS index. The process before that is: load the dataset, preprocess the text data, initialise the TF-IDF vectorizer, convert the TF-IDF matrix to a NumPy array, initialise the FAISS index, and add the calculated vectors to it. We then run a query and get the 5 news articles most relevant to it, along with their heading and type, and also compute cosine similarity, one of the most accurate similarity measures.
Fig. 6.1(h) simple TF-IDF code with FAISS
Fig. 6.1(i) simple TF-IDF code with FAISS
Fig. 6.1(j) simple TF-IDF code with FAISS output
Loading Data: A dataset is loaded into a pandas DataFrame from a CSV file called "Articles.csv". The dataset contains articles with headings and additional data.
Preprocessing Text: The heading data is extracted from the DataFrame; this textual data is what the similarity search is run over. fit_transform() converts the text data into a TF-IDF matrix, which is then converted to a NumPy array.
FAISS Index Initialization: To perform the similarity search, it initializes a FAISS index (IndexFlatL2). IndexFlatL2 is an index type that performs exact L2 (Euclidean) distance search.
Vectors Added to the FAISS Index: It adds the TF-IDF vectors, converted to float32, to the FAISS index.
Vectorizing the Query: It vectorizes the query text using the same TF-IDF vectorizer as the articles, and converts the vectorized query to a float32 NumPy array.
Similarity Lookup: Using the FAISS index, it finds the k nearest neighbours (k=5) of the query vector and retrieves the indices of the most similar documents.
Printing Related Documents: The retrieved articles are printed with their titles and further details.
Cosine Similarity Computation: The cosine similarity between the query vector and each document vector in the TF-IDF matrix is calculated: the query vector is normalized, dot products are computed against all document vectors, and the similarity scores for the retrieved documents are obtained.
Printing Similarity Scores: The cosine similarity scores of the retrieved documents are printed.
To summarize, this method takes a query from the user, finds the documents in the dataset most similar to it by TF-IDF vector similarity, and outputs those documents along with the corresponding similarity scores, also computing cosine similarity between the retrieved documents and the query.
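The TF-IDF and cosine-similarity steps just described can be condensed into a pure-Python sketch. This is illustrative only: TfidfVectorizer and FAISS do the same work at scale, and the smoothed IDF variant used here (log(N/df) + 1) is one common formulation, not necessarily the exact one in the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[list[str]]):
    """Build a dense TF-IDF vector per tokenized document."""
    vocab = sorted({t for d in docs for t in d})
    N = len(docs)
    # IDF: rarer terms get higher weight.
    idf = {t: math.log(N / sum(1 for d in docs if t in d)) + 1 for t in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append([tf[t] * idf[t] for t in vocab])
    return vocab, vecs

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: normalized dot product of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Ranking documents by `cosine(query_vec, doc_vec)` reproduces the similarity scores printed alongside the retrieved articles.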
Now let's look at a few visualizations to understand these extraction methods better.
Fig. 6.1(m) word count per document
Fig. 6.1(n) distribution of word count
The decision between the two methods (BM25 with an inverted index, or TF-IDF with FAISS) depends on several factors, such as the size and kind of your dataset and the particular needs of your application. Here are some considerations for each approach:
Appropriate for big datasets: TF-IDF vectors with FAISS enable fast similarity searches in high-dimensional spaces, making the approach effective for big datasets.
Scalability: FAISS is appropriate for applications where scalability is a concern because
of its capacity to manage extremely large-scale datasets well.
Vector-based similarity: Similarity is computed over the documents' vector representations, which can capture similarities between them.
Preprocessing is necessary for TF-IDF: steps such as tokenization and TF-IDF vectorization can be computationally costly for very big datasets.
Efficient and simple: BM25 with an inverted index is a straightforward technique for information retrieval tasks, and it is easier to understand and implement.
Ideal for smaller datasets: Although BM25 can manage huge datasets, the overhead of
maintaining the inverted index may cause it to become less effective than FAISS for
very large datasets.
Flexibility in queries: BM25 offers additional flexibility because it takes the query terms and their relevance to the documents into account.
Requires tuning: Depending on the dataset and the particular task, BM25 parameters such as k1 and b may need to be tuned for best results.
In summary, TF-IDF with FAISS can be the better option if you have a very large dataset and scalability is your top priority.
Based on our project requirements, with the RetrievalQAWithSourcesChain, an OpenAI API key, and the FAISS vector index being the key components, we start the implementation.
Embeddings are created using OpenAI embeddings, and the documents and embeddings are passed in to create the FAISS vector index.
Now, with the retrieval-with-sources chain, we ask a query, which works over the summaries and prompts.
Fig. 6.3(d) retrievalQA sources chain and llm prompt
Using the same methodology as our proposed design, after understanding all the sub-parts (loading, splitting, embedding, FAISS indexing, and retrieval), we combine everything, using various modules and packages along with the algorithm, into our final implementation: the QA-Bot, implemented using Streamlit.
Fig. 6.4(a) final project screenshots
Fig. 6.4(b) final project screenshots
After writing app.py, we run and host it on the local network on port 8501; you can see the bot at the URL provided, and the final result is the LLM-based news research tool using LangChain.
Explanation:
Here you can see that we have built a Streamlit application combining NLP and LangChain, integrating all the above methods to finally produce this interface.
Fig. 6.4(e) QA-bot (news research tool screenshot)
Thus, based on the above algorithms, libraries, and packages, a news tool has been created: a QA-Bot that successfully retrieves a correct answer to your question. This is a student-budget-friendly project in the NLP/AI/ML domain, where applications like ChatGPT are booming in the market; building such projects with current technology trends adds value and growth for an engineering student.
CHAPTER 7
7.1 CONCLUSION
In this work, we presented a new model specifically tailored to extract answers from filtered chunks, using keyword search and extraction techniques, from news articles in a variety of categories spanning a range of fields.
Efficient Textual Data Extraction: The application includes several source text loaders that handle various document formats, such as text files, CSV files, and URLs, ensuring thorough data collection from multiple sources.
7.2 FUTURE WORK
Future studies can examine how to optimize the proposed system by integrating it with
deep learning techniques. They can also explore how the model can improve tasks like
topic modeling, document retrieval, and summarization, as well as open up new
avenues for utilizing news content across different languages and domains, such as
regional news retrieval.
Improved User Interface: Create an intuitive user interface that makes it easier for users
to interact with the tool, enabling them to input queries, examine results, and visualize
insights.
Real-Time Data Processing: Enable continuous monitoring and analysis of news articles as they are released by implementing real-time data processing capabilities, ensuring that timely insights and updates are provided.
Domain-Specific Customization: Provide users with the ability to modify the tool to
suit their own needs and preferences by adapting it to particular domains or industries.
Integration with External APIs: To enhance the analysis and provide a more thorough
knowledge of news events, integrate with external APIs and data sources. Examples of
these sources include social media, financial, and geopolitical data.