Thesis Final

A project report on
LLM based news research tool using langchain
Submitted in partial fulfillment for the award of the degree of
Bachelor of Technology in Computer Science

and Engineering
by
KHUSHI HABBU (20BCE1612)
SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

April,2024
LLM based news research tool using langchain
Submitted in partial fulfillment for the award of the degree of
Bachelor of Technology in Computer Science

and Engineering
by
KHUSHI HABBU (20BCE1612)
SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

April, 2024
DECLARATION
I hereby declare that the thesis entitled “LLM based news research tool
using langchain” submitted by me, for the award of the degree of Bachelor of
Technology in Computer Science and Engineering, Vellore Institute of
Technology, Chennai is a record of bonafide work carried out by me under the
supervision of Dr. Subbulakshmi P.
I further declare that the work reported in this thesis has not been
submitted and will not be submitted, either in part or in full, for the award of any
other degree or diploma in this institute or any other institute or university.
Place: Chennai
Date: Signature of the Candidate
School of Computer Science and Engineering
CERTIFICATE
This is to certify that the report entitled “LLM based news research
tool using langchain” is prepared and submitted by Khushi
Habbu(20BCE1612) to Vellore Institute of Technology, Chennai, in partial
fulfillment of the requirement for the award of the degree of Bachelor of
Technology in Computer Science and Engineering program is a bonafide
record carried out under my guidance. The project fulfills the requirements as per
the regulations of this University and in my opinion meets the necessary standards
for submission. The contents of this report have not been submitted and will not
be submitted either in part or in full, for the award of any other degree or diploma,
and the same is certified.
Signature of the Guide:
Name: Dr. Subbulakshmi P
Date:
Signature of the Internal Examiner Signature of the External Examiner

Name: Name:
Date: Date:
Approved by the Head of Department,

B.Tech. CSE
Name: Dr. Nithyanandam P
Date:
(Seal of SCOPE)
ABSTRACT
People find it harder and harder to effectively navigate and absorb news content in the
digital age due to the abundance of news sources and information outlets. As a result,
there's a rising need for sophisticated technologies that help with the categorization,
distillation, and extraction of insights from massive volumes of news data. This abstract
describes a novel approach called LangChain that uses blockchain infrastructure and
Language Model (LM) technology to create a news research tool.
A framework designed specifically for Large Language Models (LLMs), Lang chain
has a number of features that are necessary for document processing. First of all, it has
multiple text loaders that can handle different formats, including text files, CSVs, and
URLs in particular. Character and recursive text splitters are among the tools that Lang
chain offers for separating documents once they have been loaded.
Lang chain uses the Hugging Face and OpenAI modules to encode text into numeric
representations that can be stored in vector databases. These libraries provide
transformations and embeddings for words, turning them into vectors of numbers.
Recognize the basics of classical information retrieval (IR) for tasks like retrieval,
summarization, search, and keyword extraction. Lang chain also interfaces with FAISS,
a library for similarity search and clustering of dense vectors, which is used to
effectively store and retrieve data from FAISS indexes. A primary example of this is
TF-IDF etc. Users can now prompt inquiries that produce words or search results that
are similar to one other.Later on in the process, Lang chain integrates RetrievalQA with
sources chain, which is a crucial step wherein information chunks are repeatedly run
through LLMs. Filtered chunks serve as triggers for additional processing and are
continuously improved. These sections are combined using summarization techniques,
which improves the effectiveness of the queries that follow. In the end, the final
response is obtained from the condensed and refined portions after the input query is
iteratively processed.
iii
ACKNOWLEDGEMENT
It is my pleasure to express with a deep sense of gratitude to Dr. Subbulakshmi P,

Assistant Professor Senior Grade 2, School of Computer Science and Engineering,
Vellore Institute of Technology, Chennai, for her constant guidance, continual
encouragement, and understanding; more than all, she taught me patience in my
endeavor. My association with her is not confined to academics only, but it is a
great opportunity on my part to work with an intellectual and expert in the field of
Machine Learning, Artificial Intelligence and Natural Language Processing
It is with gratitude that I would like to extend my thanks to the visionary leader Dr.
G. Viswanathan our Honorable Chancellor, Mr. Sankar Viswanathan, Dr. Sekar
Viswanathan, Dr. G V Selvam Vice Presidents, Dr. Sandhya Pentareddy,
Executive Director, Ms. Kadhambari S. Viswanathan, Assistant Vice-President Dr.
V. S. Kanchana Bhaaskaran Vice-Chancellor i/c & Pro-Vice Chancellor, VIT
Chennai and Dr. P. K. Manoharan, Additional Registrar for providing an
exceptional working environment and inspiring all of us during the tenure of the
course.
Special mention to Dr. Ganesan R, Dean, Dr. Parvathi R, Associate Dean

Academics, Dr. Geetha S, Associate Dean Research, School of Computer Science
and Engineering, Vellore Institute of Technology, Chennai for spending their
valuable time and efforts in sharing their knowledge and for helping us in every
aspect.
In a jubilant state, I express ingeniously my whole-hearted thanks to Dr.

Nithyanandam P, Head of the Department, B.Tech. CSE and the Project
Coordinators for their valuable support and encouragement to take up and
complete the thesis.
My sincere thanks to all the faculties and staff at Vellore Institute of Technology,
Chennai who helped me acquire the requisite knowledge. I would like to thank my
parents for their support. It is indeed a pleasure to thank my friends who
encouraged me to take up and complete this task.
Place: Chennai
Date: Khushi Habbu
iv
CONTENTS
ABSTRACT……………………………………………………………………………… iii
AKNOELEDGEMENT………………………………………………… ……………… iv
CONTENTS……………………………………………………………………………… v
LIST OF FIGURES………………………………………………………………………. vii
LIST OF ACRONYMS …………………………………………………………….......... ix
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION ……………………………………………………......1

1.2 RESEARCH CHALLENGES………………….………………................2
1.3 RESEARCH OBJECTIVE …………………………...………………......3
1.4 STATEMENT OF PROBLEM ………………………….……………….5
1.5 SCOPE OF THE PROJECT ……………………………..……………….6
CHAPTER 2
LITERATURE SURVEY……………………………………………………….8
CHAPTER 3
BACKGROUND
3.1 BACKGROUND STUDY ……………………………………................12

3.2 TF-IDF AND FAISS …………………………………………………….13
CHAPTER 4
MODULES
4.1 SAMPLE DATA …………………………………………………………15

4.2 PYTHON PACKAGES…………………………………………………...15
CHAPTER 5
v
METHODOLOGY
5.1 PROPOSED SYSTEM DIAGRAM……………. ………………………19

5.2 EXPLANATION OF THE PROPOSED SYSTEM DIAGRAM………..19
5.3 MAP REDUCE METHOD ……………………………………………...20
5.4 RETRIEVALQAWITHSOURCESCHAIN……………………………...22
CHAPTER 6
IMPLEMENTATION AND RESULTS
6.1 RESULTS OF FEW TRADITIONAL IR ALGORITHMS……………...25

6.2 COMPARING BM25 AND TF-IDF WITH FAISS……………….…….37
6.3 CREATING A QA PROMPT A SAMPLE LLM……………….………38
6.4 FINAL IMPLEMENTATION (QA-Bot) ……………….………………41
CHAPTER 7
CONCLUSION AND FUTURE WORK
8.1 CONCLUSION…………………………………………………………47
8.2 FUTURE WORK……………………………………………………….48
REFERENCES………………………………………………………………...49
vi
LIST OF FIGURES
Fig. 5.1 proposed system diagram
Fig. 5.3(a) summarization of chunks
Fig. 5.3(b) map reduce method
Fig. 5.4 FAISS index
Fig. 6.1(a) simple LDA code
Fig. 6.1(b) simple BM25 code
Fig. 6.1(c) simple BM25 code
Fig. 6.1(d)simple BM25 output
Fig. 6.1(e)simple BM25 output with score
Fig. 6.1(f) scores of the top ranked documents
Fig. 6.1(g) relevance score of top documents in bar plot
Fig. 6.1(h)simple T-IDF code with FAISS
Fig. 6.1(i)simple TF-IDF code with FAISS
Fig. 6.1(j)simple TF-IDF code with FAISS output
Fig. 6.1(k) similarity scores of retrieved documents
Fig. 6.1(l) word cloud of retrieved documents headings
vii
Fig. 6.1(m) word count per document
Fig. 6.1(n) distribution of word count
Fig. 6.3(a) loaders and splitters screenshot
Fig. 6.3(b) embeddings
Fig. 6.3(c) retrievalQA sources chain and llm prompt
Fig. 6.3(d) retrievalQA sources chain and llm prompt
Fig. 6.4(a) final project screenshots
Fig. 6.4(b) final project screenshots
Fig. 6.4(c) final project screenshots
Fig. 6.4(d) QA-bot (news research tool screenshot)
Fig. 6.4(e) QA-bot (news research tool screenshot)
Fig. 6.4(f) QA-Bot tool screenshot
viii
LIST OF ACRONYMS
TF-IDF Term Frequency – Inverse Document Frequency
KNN k Nearest Neighbours
TR TextRank
PR PageRank
NB Naïve Bayes
SVM Support Vector Machine
ML Machine Learning
DL Deep Learning
MAUI Multi-purpose automatic topic indexing
KEA Automatic Keyphrase Extraction
LSTM Long short term memory
CRF Conditional Random Field
LR Linear Regression
TF-IDFIS Term Frequency-Inverse Document Frequence
LDA Latent Drichlet Allocation
NMF Non-negative Matrix Factorization
RP Random Projection
MLR Multiple Linear Regression
ix
NLP Natural LanguageProcessing
NLTK Natural Language Toolkit
US United States
DT Decision Tree
RF Random Forest
IR Information Retrieval
x
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
By combining state-of-the-art Language Model capabilities—such as OpenAI's GPT

(Generative Pre-trained Transformer)—with blockchain technology, LangChain
provides clients with a dependable and comprehensive news analysis platform.
Advanced natural language processing (NLP) functionalities are offered by LangChain.
These characteristics enable readers to quickly identify key information in news
articles, identify emerging trends, and ascertain popular opinion on certain topics or
events.
There is a great chance that the proposed LangChain-based LLM-based news research
tool would fundamentally alter how people interact with and understand news content.
The Lang Chain framework provides a structured approach to leverage LLM
capabilities, ensuring more accurate, language-specific, and context-sensitive news
article insights. This has implications for improving the range and efficiency of
information extraction across different language situations.
Include a Legal Language Model (LLM) in the project for accurate analysis.
Utilize LangChain's modular architecture for seamless integration and enhanced natural
language processing. Processing data to ensure accuracy and relevance for analysis,
tidy and arrange news. Using Language Processing to Develop entities-recognition,
concept-extracting, and comprehension algorithms for search and retrieval. These
algorithms will enable the efficient use of advanced search tools to obtain relevant legal
news using natural language queries. Design of User Interface Make an intuitive
dashboard with moveable widgets to enhance the professional user experience.
1
1.2 RESEARCH CHALLENGES
To build precise and efficient algorithms for keyword extraction and to ask a query and
to retrive answer from news articles, researchers must overcome numerous challenges
that span various aspects of natural language processing and domain-specific
knowledge.
It is necessary to address a number of issues in natural language processing and

information retrieval in order to develop accurate and effective algorithms for query-
based information retrieval from news articles and keyword extraction, following are
the issues
Ambiguity and Polysemy: Words and phrases sometimes have more than one meaning,
which causes polysemy and ambiguity. Accurate keyword extraction and query
understanding depend on the ability to recognize a word or phrase in context
Named Entity Recognition (NER): To comprehend the content of news items, named
entities such as individuals, groups, places, dates, and numerical expressions must be
recognized and categorized. However, because of differences in entity references and
interpretations that depend on context, NER can be difficult.
Domain-Specific Terminology: Words and phrases that are not found conventional
language models may be used in news articles. Accurate results require keyword
extraction and query interpretation algorithms that incorporate domain-specific results
Length and Structure of the Document: News pieces come in a variety of forms, from
short news snippets to lengthy investigative investigations. Scalable and successful
keyword extraction and retrieval need algorithms that can handle documents
Temporal Dynamics: News items are subject to time constraints, and over time, search
terms and queries may become less relevant. To provide current and relevant results,
2
keyword extraction and retrieval algorithms must take temporal information into
account as well as the recentness of articles.
Multimodal Content: In addition to text, news stories may also include pictures, videos,
and other types of multimedia content. The effectiveness and coverage of keyword
extraction and query retrieval systems can be improved by incorporating multimodal
content
Cross-Lingual Challenges: News stories may be published in more than one language,
necessitating the use of algorithms capable of comprehending queries and extracting
keywords across languages. Serving a variety of user communities requires overcoming
linguistic obstacles and making sure that translations and interpretations are accurate.
User Intent and Context: Relevant and customized keyword recommendations and
search results depend on an understanding of the user's intent and context. Enhancing
the precision and usability of keyword extraction and retrieval systems can be achieved
by integrating user feedback, preferences, and browsing history.
Scalability and Efficiency: In order to handle massive numbers of news items and user
inquiries in real-time, keyword extraction and query retrieval algorithms need to be
both scalable and efficient.
Through the resolution of these issues, scientists can create algorithms for query-based
information retrieval from news articles and keyword extraction that are more precise,
effective, and intuitive to use.
1.3 RESEARCH OBJECTIVES
Creation of the Langchain Framework: Create and put into use the Langchain
framework specifically for LLM-based news analysis. Using LLMs, create a solid
architecture that combines the many parts needed for news analysis.
3
Multiple Source Text Loaders: Create text loaders that can handle a variety of document
types, including CSV files, text files, and URLs.Assure effective textual data extraction
and processing from a variety of sources for the Langchain architecture.
Text Splitter with Recursive Characters: Use recursion and character splitters to divide
documents into more manageable sections. During the splitting process, semantic
integrity and coherence are preserved for further analysis.
Integration of Hugging Face for Embeddings with OpenAI: Combine the Hugging Face
and OpenAI tools to create embeddings from textual data.
Modern pre-trained language models can be used to convert sentences into numerical
representations for additional analysis.
Investigation of Algorithms for Information Retrieval: Examine several information

retrieval (IR) algorithms that can be used for retrieval, summarization, search, and
keyword extraction.Analyze the performance of various algorithms in relation to
research and news analysis activities.
FAISS Indexing and Retrieval Implementation: Use FAISS (Facebook AI Similarity

Search) to store, index, and retrieve embeddings more quickly. Create indexes with
FAISS to facilitate scalable and quick similarity searches for news items and related
queries.
RetrievalQA Development Using Sources Chain: Create and execute RetrievalQA

using Sources Chain, the foundational element of the news research instrument. Create
algorithms that handle and filter obtained chunks iteratively, adding feedback loops to
improve search results.
Combining LLMs to Perform Multi-pass Analysis: Combine LLMs with the Langchain
framework to analyze retrieved text chunks across several passes. Make use of LLMs
to produce summaries, extract important data, and incrementally improve query replies.
Assessment and Comparative Analysis: Carry out a thorough assessment and

comparison of the created news research instrument. Evaluate its performance on a
4
range of news datasets and research scenarios with respect to accuracy, efficiency,
scalability, and usability.
Verification using User Input and Case Studies: Use user input and real-world case
studies to validate the tool's usefulness and efficacy. To improve usability and
practicality, iterate on the design and functionality depending on user experiences and
requirements.
Record-keeping and Distribution: Provide thorough documentation that addresses the

news research tool's conception, application, and use. To encourage collaboration and
adoption within the research community, disseminate study findings through academic
papers, conference presentations, and open-source contributions.
1.4 STATEMENT OF PROBLEM
Our goal is to create a comprehensive LLM project covering a real-world industrial use
case for research analysts. We also want to create a news research tool that allows users
to provide the URLs of numerous news stories, and when they ask a question, the
system will pull the response from the filtered portions of those articles.
Simply using a key word search is insufficient; for instance, when we search for "apple"
on Google, we get both the apple company and standard apple fruit information. To
address this, our project will use vector databases and embedding to precisely
understand the context of the search and retrieve pertinent sections from the urls that
we will then use to look for answers in it
In many text processing tasks, such as text categorization, information mining, and text
retrieval, keyword extraction is essential. Because of its ease of use and ability to
accurately determine term relevance based on word frequency, the TF-IDF algorithm
has become one of the most popular established approaches. On the other hand,
depending exclusively on word frequency data may have drawbacks, especially in
situations when words might not appropriately convey their importance within the page.
5
TextRank and RAKE (Rapid Automatic Keyword Extraction) are two more
sophisticated methods that you can use to enhance keyword extraction beyond the
capabilities of standard TF-IDF.
The process of locating pertinent information within a sizable collection of text

documents is called text retrieval, or information retrieval (IR). It entails locating
documents pertinent to an information request or query made by a user.
Text loaders with many sources can handle a variety of document types, including text
files, CSV files, and URLs. Text Splitter with Recursive Characters splits documents
into digestible chunks while maintaining semantic consistency by using recursion and
character splitters. Within the Langchain framework, this division makes additional
analysis easier. Using contemporary pre-trained language models, Hugging Face and
OpenAI tools are integrated to build embeddings from textual data. This makes it
possible to translate language into numerical representations, which improves
analyzing skills.
Facebook developed an open-source library called FAISS (Facebook AI Similarity

Search) to efficiently search for similarities and cluster massive datasets. It is especially
made to function well with high-dimensional vectors, which makes it appropriate for
jobs involving large-scale vector similarity computations, such as natural language
processing and picture similarity search.
Information chunks are repeatedly passed through LLMs using map reduce method
over stuff method i.e the drawback od exceeding token limit or usage of token limits in
api is solved in the important step where Lang chain integrates RetrievalQA with
sources chain later in the process. Filtered pieces are constantly refined and act as
catalysts for further processing. Summarization techniques are used to combine these
portions, making the following queries more effective. After processing the input query
iteratively, the final response is derived from the condensed and refined components
1.5 SCOPE OF THE PROJECT
6
One alternative for casual readers is to browse through a news story: News article text
retrieval and keyword extraction can help casual readers rapidly assess an article's
relevance and content. It is simple to avoid reading the entire information when users
highlight the answers they require from a list of three linked URLs on a certain issue.
This feature improves the user experience by helping the user make better selections
when choosing which articles to read.
News analysts and journalists: Using this tool, journalists can swiftly compile pertinent
news pieces, examine trends, and derive important information to aid in their reporting
and investigative work. News analysts can use the program to track public opinion
toward particular subjects or events, do in-depth research of news material, and
discover emergent topics.
Scholars & Researchers: Scholars specializing in media studies, communication, and

linguistics can employ this technology to gather and examine news content for scholarly
investigations. Scholars can investigate how the tool might be used to research language
trends, media biases, and the distribution of information among various news sources.
Teachers and Pupils: Teachers can use the application to teach students about news
research methodology, natural language processing techniques, and information
retrieval in journalism, media studies, and data analytics courses.
The application can be used by students for academic assignments, research papers, and
theses that deal with sentiment analysis, media bias detection, and news analysis.
The People in General: The tool can help members of the general public stay up to date
on news developments, hot themes, and current events by compiling and summarizing
news sources. Using the technology, citizen journalists and bloggers may curate news
stories, evaluate public discourse, and provide their audiences with insightful content.
Overall, the LLM-based news research tool with Lang chain meets the needs of a wide
range of users and offers useful tools for compiling, evaluating, and interpreting news
sources in a thorough and effective way
7
CHAPTER 2
LITERATURE SURVEY
Natural language processing, or NLP, has emerged as a key instrument for news item
analysis and summary. It helps with tasks including sentiment analysis, trend analysis,
summarization, and categorization. An overview of recent research articles using
natural language processing (NLP) approaches for news analysis and summarization is
given in this literature review.
Paper 1: Using NLP to Classify Bengali Crime News Based on Newspaper Headlines
Using NLP approaches, Khan et al. present a system for categorizing Bengali crime
news based on newspaper headlines. The authors want to offer news organizations and
law enforcement agencies useful information by automatically classifying crime-
related news pieces.
Paper 2: Using Natural Language Processing to Analyze Stock Market Trends on Indian
Financial News Headlines Using NLP, Saxena et al. analyze stock market trends based
on Indian financial news headlines. The writers hope to provide decision-making
assistance to investors and financial experts by revealing market sentiment and trends
Paper 3: Examining News Articles on Cybersecurity Using the Topic Modeling Method
Ghasiya and Okamura use topic modeling techniques to look at cybersecurity news
stories. The authors want to provide important insights for security professionals and
policymakers by identifying important themes and trends from cybersecurity news
through the application of NLP-based topic modeling techniques.
Paper 4: Myanmar Language Extractive Summarization An extraction summarizing

technique specifically designed for the Myanmar language is put forth by Lwin and
Nwet. The authors want to make it easier for people who speak Myanmar to find and
consume information by using natural language processing (NLP) techniques to
automatically create succinct summaries of news articles written in Myanmar.
8
Paper 5: Using LDA Based Topic Modeling to Observe Bangla News Trends
A study on trend detection in Bangla news with LDA-based topic modeling is presented
by Alam et al. The authors want to get insights into societal trends and interests by
using natural language processing (NLP) techniques to identify and analyze subjects
that frequently appear in Bangla news stories.
Paper 6: Serbian-language automatic text summarization of news articles

An automatic text summary method for Serbian news items is proposed by Kosmajac
and Kešelj. The authors hope to make it easier for Serbian-speaking audiences to
consume information by using natural language processing (NLP) techniques to
provide succinct summaries of news stories written in Serbian.
Paper 7: Text Recap: Tamil Online Sports News Text Summary Priyadharshan and
Sumathipala use natural language processing (NLP) to create a text summary method
for Tamil online sports news. In order to give readers concise and educational content,
the authors automatically generate summaries of sports news stories in Tamil.
Paper 8: Brief Text Synopsis of News Story A technique for inshort text summarizing
news stories is presented by Deny et al. The authors hope that by using NLP approaches,
they will be able to provide succinct summaries of news stories that will help users
quickly understand the key ideas.
Paper 9: Sentiment analysis based on NLP and text summarization for news text
analysis Using text summarization and sentiment analysis based on NLP approaches,
Mishra et al. investigate news text analysis. The authors want to offer thorough insights
into news items, enabling better comprehension and interpretation, by fusing sentiment
analysis and text summary.
Paper 10: Improving Text Summarization Through a Hybrid Extractive-Abstractive

Framework with Pre- and Post-Processing Techniques A hybrid extractive-abstractive
paradigm for text summarization is proposed by Habu et al. The scientists hope that by
combining pre- and post-processing methods, text summarizing systems will work
better and produce summaries that are more precise and educational.
9
Paper 11: A Survey of Text Summarization for Product Reviews Based on Natural
Language Processing A survey on NLP-based text summary for product reviews is
conducted by Boorugu and Ramesh. The authors hope to shed light on the difficulties
and cutting-edge methods involved in using natural language processing (NLP) to
summarize product reviews by examining current methods.
Paper 10: Improving Text Summarization Through a Hybrid Extractive-Abstractive

Framework with Pre- and Post-Processing Techniques A hybrid extractive-abstractive
paradigm for text summarization is proposed by Habu et al. The scientists hope that by
combining pre- and post-processing methods, text summarizing systems will work
better
Paper 11: A Survey of Text Summarization for Product Reviews Based on Natural
Language Processing A survey on NLP-based text summary for product reviews is
conducted by Boorugu and Ramesh. The authors hope to shed light on the difficulties
and cutting-edge methods involved in using natural language processing (NLP) to
summarize
Paper 12: Using Natural Language Processing Tools to Automatically Summarize

News from Inform.kz Using NLP technologies, Kynabay et al. create a system for
automatically summarizing news from Inform.kz. The authors want to make it easier
for users to consume information efficiently by using natural language processing
Paper 13: Using Support Vector Machines to Classify ChatGPT Generated News and
True News Mayopu et al. suggest utilizing support vector machines to categorize news
that is generated by ChatGPT from genuine news. The authors hope to ensure news
sources are reliable by using machine learning techniques to distinguish between
created and real news stories.
Paper 14: Investigating NLP-Based Techniques to Produce Qualitative Codebooks for

Engineering Ethics Assessment Hingle et al. investigate NLP-based techniques for
producing qualitative codebooks for engineering ethics assessments. The authors hope
to expedite research in this area by automating the creation of codebooks for evaluating
10
Paper 15: Using NLP technologies to extract keywords from tweets in order to gather
pertinent news. Using NLP technologies, Jayasiriwardene and Ganegoda provide a way
for extracting keywords from tweets in order to gather pertinent news. The authors hope
to help users stay up to date on current events by using tweets' keywords to find
pertinent news articles.
In summary, the reviewed works show how NLP may be used extensively for news
summarization and analysis in a variety of languages and fields. In the ever-expanding
world of digital news, natural language processing (NLP) techniques provide useful
tools for extracting insights and improving information consumption in a variety of
applications, from criminal classification and stock market trend research to cybercrime
investigation and product review summarization.
11
CHAPTER 3
BACKGROUND
3.1 BACKGROUND STUDY
An enormous amount of news stories has been produced and disseminated across
numerous channels in recent years due to the exponential rise of digital media and
online news platforms. It becomes very necessary to have effective tools and methods
to browse, comprehend, and extract useful insights from news material in this enormous
sea of information. Because of the size, complexity, and diversity of contemporary
news data, traditional approaches of news analysis frequently prove inadequate. By
utilizing huge language models, including Hugging Face's transformers and OpenAI's
GPT, the LLM-based news research tool overcomes these difficulties. Due to their
training on enormous datasets with billions of words, these models are able to
understand the complex patterns, semantic linkages, and contextual subtleties seen in
natural language processing.
Emphasizing information retrieval (IR) methods is one of the tool's primary features.
IR algorithms are essential for activities including content retrieval, keyword
extraction, summarization, and search. Through the use of cutting-edge information
retrieval techniques, the application helps users to effectively search through enormous
databases of news stories, extract pertinent information, and find insightful information
concealed within the data.
There is a great chance that the proposed LangChain-based LLM-based news research
tool would fundamentally alter how people interact with and understand news content.
Algorithms that capture the semantic relationships between words are necessary to
tackle the issues that synonyms and polysemy present. Word embeddings and semantic
networks, for example, can be used to represent words in a high-dimensional space
where synonyms and related phrases are closer together. By using these semantic
representations, algorithms are able to discern between several word meanings and
identify relevant terms based on their contextual usage. By considering the underlying
12
semantic relationships between words, this strategy increases the accuracy of keyword
extraction.
3.2 TF-IDF AND FAISS
In the context of information retrieval and machine learning, TF-IDF (Term Frequency-
Inverse Document Frequency) and Faiss (Facebook AI Similarity Search) are not
directly comparable because they have different functions and operate in various areas.
Term Frequency-Inverse Document Frequency, or TF-IDF:

A statistical metric called TF-IDF is used to assess a term's significance within a
document in relation to a corpus of documents.
It is the result of multiplying two metrics: the number of times a term occurs in a
document (term frequency, TF) and the number of times a term appears in the total
corpus (inverse document frequency, IDF).
Information retrieval, document categorization, and keyword extraction are among the
common tasks for which TF-IDF is employed. aids in finding key terms in documents
and evaluating how relevant they are to particular searches.
The formula for calculating the TF of a phrase t in a document d is tf(t,d)=count(t,d)/|d|

in mathematics.
where |𝑑| is the total number of terms in document d and count(t,d) is the number of
times term appears in document d.
A term's rarity or uniqueness within a set of documents is gauged by its inverse

document frequency. When compared to terms that appear in fewer papers, terms that
appear in more documents are regarded as being less significant or instructive. In order
to account for terms that might not occur in the corpus, IDF is typically computed as
the logarithm of the ratio of the total number of documents in the corpus to the number
of documents containing the term.
13
Facebook AI Similarity Search, or Faiss:
Facebook AI Research created the Faiss package, which makes massive dataset
clustering and similarity search more effective.
It offers implementations of a number of similarity search methods, including as

clustering algorithms like k-means, k-nearest neighbors (KNN) search, and
approximate nearest neighbor (ANN) search.
Tasks like data clustering, content-based recommendation systems, and similarity

search are the main applications of Faiss. It makes it possible to quickly and efficiently
retrieve clusters of related objects from huge databases.
Unlike TF-IDF, Faiss does not have a set formula. Rather, it uses a variety of similarity
search methods, the most popular of which being cosine similarity. The formula for
calculating the cosine similarity between two vectors, v1 and v2, is:
Cosine Similarity (v1,v2)=(v1.v2)/(|v1| x |v2|)
where the Euclidean norm is represented by ∥⋅∥ and the dot product by ⋅.
Although Faiss lacks a precise formula like TF-IDF, it offers effective implementations
of similarity search algorithms, including k-closest neighbours (KNN) and
approximation nearest neighbour (ANN) searches.
14
CHAPTER 4
MODULES
4.1 SAMPLE DATA
The Articles.csv from Kaggle totaling 2692 news articles The source of this dataset is
the website https://www.thenews.com.pk. It features business and sports-related news
pieces from 2015 onward. It includes the specific article's heading, content, and date.
The location where the comment or article was published is also included in the content.
We just used this to get understanding on how information is retrieved from URLs or
csv files i.e. documents and understand the similarity, relevance, extraction among them
After processing the input query iteratively, the final response is derived from the
condensed and refined components. Thus, understanding how this work and what is
relevance score and cosine similarity with using tf-idf, bm25 and faiss
4.2 PYTHON PACKAGES/MODULES
The modules/packages listed below are in use:
Python has an integrated module called os that offers many functions to facilitate
communication with the operating system.
NumPy is a core package for scientific computing in Python that supports strong
matrices and multi-dimensional arrays and offers a set of mathematical operations to
effectively work with these arrays. NumPy is the cornerstone for many other Python
scientific libraries, including deep learning packages, thanks to its array-oriented
computing capabilities. High performance and scalability are ensured by the
performance-critical operations of NumPy being implemented in C and Fortran, which
makes it a vital tool for numerical calculation and data manipulation in deep learning
processes.
15
Scikit-learn: Built on top of NumPy, SciPy, and matplotlib, Scikit-learn is a flexible
machine learning framework that provides easy-to-use tools for data mining and
analysis. While scikit-learn is mostly concerned with classical machine learning
algorithms, including dimensionality reduction, regression, clustering, and
classification, it also offers a neural network submodule for simple neural network-
based learning applications. For the purpose of prototyping and implementing machine
learning models in diverse applications, data scientists and machine learning
practitioners frequently choose scikit-learn due to its comprehensive documentation,
well-defined and uniform API, and focus on code readability.
Matplotlib: A complete Python library for producing static, interactive, and animated
visualizations is called Matplotlib. With its extensive range of customization choices
and plotting features, users can produce figures for data analysis and presentation that
are suitable for publication. Matplotlib is extensively utilized in scientific computing,
data visualization, and machine learning applications due to its object-oriented API and
support for multiple output formats, such as PNG, PDF, and SVG. Its ability to be
integrated with other Python libraries, such pandas and NumPy, makes it a vital tool for
deriving and sharing insights from data.
Streamlit: For data science and machine learning projects, interactive web apps may be
constructed using this library.
Pickle: Another built-in Python package for serializing and deserializing Python objects
is called pickle.
Time: An additional built-in module for handling tasks involving time.
Langchain: This is probably an external or custom library/module used for activities

related to natural language processing. It has a number of classes and submodules for
NLP functions.
OpenAI: Functionalities pertaining to OpenAI's NLP models may be found in this
16
langchain submodule.
Chains: Classes and functions for creating and utilizing NLP chains, like question-
answering chains, are probably included in this submodule.
text_splitter: It appears that this submodule is responsible for text splitting, perhaps for
dividing lengthy papers into manageable chunks.
document_loaders: Unstructured text data from many sources, including URLs, is most
likely loaded by this submodule.
embeddings: Classes for creating embeddings from text data are probably contained in
this submodule.
vectorstores: This submodule could manage tasks pertaining to the archiving and
retrieval of text data represented as vectors.
"Sentence-transformers”: is used to create sentence embeddings. Sentences or

paragraphs can be transformed into dense numerical vectors that represent the text's
semantic meaning using these embeddings.
Pandas is a robust Python toolkit for analysis and data manipulation. It offers high-level
data structures that are effective in managing structured data, like DataFrame and
Series. Pandas is a popular tool for data cleaning, transformation, filtering, grouping,
and aggregating operations in data science and analysis processes.
nltk (import nltk): Natural Language Toolkit, or NLTK for short, is a comprehensive
Python toolkit for activities related to natural language processing (NLP). Text
preparation, tokenization, stemming, lemmatization, part-of-speech tagging, parsing,
and semantic analysis are among the functions it provides. For textual data-related
machine learning, information retrieval, and text mining applications, NLTK is
extensively utilized.
re (import re): Regular expressions, which are effective tools for text manipulation and
17
pattern matching, are supported by Python's re module. Based on predefined patterns,
you can use it to search for, extract from, replace, and edit text. In activities involving
text processing, information extraction, and data cleaning, regular expressions are
frequently utilized.
stopwords from nltk.corpus (import stopwords from nltk.corpus):A list of common

stopwords for several languages may be found in the NLTK stopwords corpus. In order
to concentrate on relevant content, stopwords—words that appear frequently in a
language, such as "the," "is," and "and"—are commonly eliminated during text
preparation. For applications like topic modeling, sentiment analysis, and text
categorization, these stopwords are helpful.
From nltk.tokenize import word_tokenize (word_tokenize from nltk.tokenize):

Text can be tokenized into individual words or tokens using the word_tokenize function
of NLTK. Tokenization is an essential NLP preprocessing procedure that divides text
into meaningful chunks for processing and analysis later on. For tasks like syntactic
analysis, feature extraction, and text normalization, word tokenization is crucial.
stopwords: As previously indicated, this alludes to the stopwords corpus of NLTK. It

offers a list of frequently used stopwords for text processing applications.
punkt: The punkt module of NLTK offers tokenizers that have already been trained to
separate text into words and sentences. It is helpful for NLP tokenization jobs,
particularly when handling material with complicated sentence patterns or text in many
languages.
wordnet: The wordnet module of NLTK is a lexical database for the English language
that includes relationships, synonyms, antonyms, and definitions of words. WordNet is
frequently used for lexicography, semantic analysis, and developing natural language
processing (NLP) systems that need-to-know word meanings and relationships.
18
CHAPTER 5
METHODOLOGY
5.1 PROPOSED SYSTEM DIAGRAM
Fig. 5.1 proposed system diagram
5.2 EXPLANATION OF THE PROPOSED SYSTEM DIAGRAM
Multiple Source Text Loaders: Create text loaders that can handle a variety of document
types, including CSV files, text files, and URLs.Assure effective textual data extraction
and processing from a variety of sources for the Langchain architecture.
Text Splitter with Recursive Characters: Use recursion and character splitters to divide
documents into more manageable sections. During the splitting process, semantic
integrity and coherence are preserved for further analysis.
Integration of Hugging Face for Embeddings with OpenAI: Combine the Hugging Face
19
and OpenAI tools to create embeddings from textual data.
Modern pre-trained language models can be used to convert sentences into numerical
representations for additional analysis.
Investigation of Algorithms for Information Retrieval: Examine several information

retrieval (IR) algorithms that can be used for retrieval, summarization, search, and
keyword extraction.Analyze the performance of various algorithms in relation to
research and news analysis activities.
FAISS Indexing and Retrieval Implementation: Use FAISS (Facebook AI Similarity

Search) to store, index, and retrieve embeddings more quickly. Create indexes with
FAISS to facilitate scalable and quick similarity searches for news items and related
queries.
RetrievalQA Development Using Sources Chain: Create and execute RetrievalQA

using Sources Chain, the foundational element of the news research instrument. Create
algorithms that handle and filter obtained chunks iteratively, adding feedback loops to
improve search results.
Combining LLMs to Perform Multi-pass Analysis: Combine LLMs with the Langchain
framework to analyze retrieved text chunks across several passes. Make use of LLMs
to produce summaries, extract important data, and incrementally improve query replies
5.3 MAP REDUCE METHOD
20
Fig. 5.3(a) summarization of chunks
There are two methods to summarize chunks ,stuff method (simple append method ) or
map reduce method , a drawback in stuff method being that it exceeds token limit and
usage of more space as summary chunk is huge so we can solve it by using map reduce
method where based on splitter’s and separators we get chunks with similarity search ,
let’s take if we get 4 chunks all those 4 are sent again through LLMs and filtered chunks
with necessary info only based on keyword extraction similarity search with faiss and
those filtered chunks are summarized and along with input query ,sent in final LLM
and we get an answer , in this way we save token limit from exceeding its easier to get
precise answer with lesser cost (as token size is calculated ) following below is the
figure to understand map reduce method
21
Fig. 5.3(b) map reduce method
5.4 RETRIEVALQAWITHSOURCESCHAIN
One of the Langchain framework's components, RetrievalQAWithSourcesChain, is

intended for question-answering (QA) tasks, specifically when it comes to obtaining
pertinent data from several sources. Pre-trained language models (LLMs) are used in
this chain to comprehend and process queries, search across a variety of sources,
including documents and databases, and extract pertinent responses.
Among the RetrievalQAWithSourcesChain's salient characteristics are:
Question Understanding: makes use of LLMs to interpret the meaning behind user
inquiries, enabling more precise information retrieval.
Source Integration: Provides support for integration with many information sources,
22
including text documents, databases, and websites, allowing for thorough search
functions.
Answer extraction: Takes into account context, relevance, and coherence while using
sophisticated approaches to extract responses from sources that have been retrieved.
Scalability: Capable of efficiently managing massive amounts of data, ensuring

scalability for managing intricate quality assurance activities across vast datasets.
Customization: Provides adaptability to various domains and use cases, enabling users
to modify the retrieval and quality assurance procedure in accordance with their needs.
All things considered, the RetrievalQAWithSourcesChain improves the efficacy and

efficiency of Q&A systems through the utilization of LLMs and the integration of many
information sources to provide precise and thorough answers.
Fig. 5.4 FAISS index
23
Python Interface: FAISS offers a Python interface that is simple to add into your
machine learning pipelines and works well with well-known libraries like NumPy and
scikit-learn.
Integration with Embedding Libraries: For applications involving natural language

processing or computer vision, FAISS is frequently utilized in conjunction with
embedding libraries, such as word embeddings.
Pre-trained Models: FAISS makes it simple to install and integrate into applications by
providing pre-trained models and indexes for typical use cases.
All things considered, FAISS is a strong library that can be used to do clustering and
similarity search in high-dimensional spaces. It is extensively utilized in many different
applications, such as content-based retrieval systems, recommendation systems, and
search engines.
24
CHAPTER 6
IMPLEMENTATION AND RESULTS
6.1 RESULTS OF FEW TRADITIONAL IR ALGORITHMS
1. A generative statistical model called Latent Dirichlet Allocation (LDA) is

frequently employed for topic modelling. Lets look into that first
Fig. 6.1(a) simple LDA code
Dictionary Creation: The pre-processed tokens are represented in a dictionary. A

dictionary is a mapping between words and their numeric IDs in this context.
25
Dictionary filtering: It removes tokens that are present in more than 50% of the
documents or fewer than 20 documents. This aids in eliminating words that are either
exceedingly uncommon or common and might not be useful for subject modelling.
Creation of the Corpus: Each document in the corpus is represented as a list of (word
ID, word frequency) tuples, resulting in a representation of the corpus that is similar to
a bag of words.
LDA Model Training: Using the corpus and the filtered dictionary, it trains a Latent
Dirichlet Allocation (LDA) model. LDA is a generative statistical model that enables
unobserved groupings, or themes, to explain sets of observations.
subject Analysis: It publishes the top words for each subject produced by the LDA
model. Every topic has a word distribution, with the most likely terms for each topic
shown by the top words.
Coherence Score Computation: Using the 'c_v' coherence measure, it calculates the
LDA model's coherence score. The semantic closeness of high-scoring terms in the
topic is measured by coherence.
To sum up, the code takes a set of headlines, uses LDA to find subjects within them,
and then uses coherence to assess how good the topics are.
2. Now a ranking algorithm like BM25 with inverted index
26
Fig. 6.1(b) simple BM25 code
27
Fig. 6.1(c)simple BM25 code
Fig. 6.1(d)simple BM25 output
28
Fig. 6.1(e)simple BM25 output with score
You can see the relevance score as per question asked if given as per
similarity/relevance using bm25 (also tf-idf) but inverted , you ask a query and among
the URLs it gives top 5 documents and most relevant in them as per the score now lets
look into few visualizations to understand this extraction methods better
Fig. 6.1(f) scores of the top ranked documents
29
Fig. 6.1(g) relevance score of top documents in bar plot
3. TF-IDF with FAISS
previous to last code here we add the tf-idf calculated vectors to faiss index , but before
that the process would be we load the dataset ,preprocess the text data ,initiliase the tf
-idf vectorizer and the convert the tf-idf matrix to a numpy array ,initiliase the faiss
index and add the calculated vectors to faiss index , perform a query and get the 5 most
relevant news articles to our query along with heading and and type and also get the
similarity by cosine similarity which one of the most accurate similarity measures
30
Fig. 6.1(h)simple TF-IDF code with FAISS
31
Fig. 6.1(i)simple TF-IDF code with FAISS
32
Fig. 6.1(j)simple TF-IDF code with FAISS output
Filling Data: A dataset is loaded into a pandas Data Frame from a CSV file called
"Articles.csv." It's conceivable that the dataset includes articles with headers and
additional data.
Preprocessing Text: From the Data Frame, the heading data is extracted. A similarity
search will be conducted using this textual data.
Vectorization of TF-IDF: Using TfidfVectorizer from scikit-learn, it initializes a TF-

IDF (Term Frequency-Inverse Document Frequency) vectorizer. The TF-IDF is a
statistical measure that quantifies a word's significance in a document in relation to a
group of documents.
Fit_transform() is used to convert the text data into a TF-IDF matrix representation.
The TF-IDF matrix is transformed into a NumPy array.
33
FAISS Index Initialization: In order to do a similarity search, it initializes an FAISS
index (IndexFlatL2). One kind of index that supports L2 (Euclidean) is called
IndexFlatL2.
Vectors added to the FAISS Index: To the FAISS index, it adds TF-IDF vectors that
have been transformed to float32.
User Inquiry: The user is prompted to enter a text query.
Vectorizing Inquiry: Using the same TF-IDF vectorizer as the articles, it vectorizes the
query text. The vectorized query is transformed into a float32 NumPy array.
Similarity Lookup: Using the FAISS index, it conducts a similarity search to identify
the query vector's k nearest neighbours (k=5). The indices of the documents that are
most comparable are retrieved.
Printing Related Documents: The articles are printed with their titles and further details
included.
Computing Cosine Similarity:
The cosine similarity between each document vector in the TF-IDF matrix and the
query vector is calculated.
It computes dot products with all document vectors and normalizes the query vector.
It obtains the scores of similarities for the documents that are retrieved.
Scores for Printing Similarity:
The cosine similarity scores of the papers that were retrieved are printed.
To summarize, this method takes a query from the user, finds the documents in the
dataset that are the most similar based on TF-IDF vector similarity, and outputs the
documents along with the scores that correlate to those similarity. It also calculates the
cosine similarity scores between the documents that are retrieved and the query.
34
now let’s look into few visualizations to understand this extraction methods better
Fig. 6.1(k) similarity scores of retrieved documents
Fig. 6.1(l) word cloud of retrieved documents headings
35
Fig. 6.1(m) word count per document
36
Fig. 6.1(n) distribution of word count
6.2 COMPARING BM25 AND TF-IDF WITH FAISS
The decision between the two methods—using BM25 with inverted index or TF-IDF
with FAISS—is based on a number of variables, such as the size and kind of your
dataset as well as the particular needs of your application. Here are some things to think
about with each strategy:
TF-IDF in conjunction with FAISS:
Appropriate for big datasets: Since TF-IDF vectors with FAISS enable quick similarity
searches in high-dimensional spaces, it is effective for big datasets.
37
Scalability: FAISS is appropriate for applications where scalability is a concern because
of its capacity to manage extremely large-scale datasets well.
Similarity that is based on embedding: This type of similarity is calculated using the
documents' embedded representation, which can identify semantic similarities between
them.
Preprocessing is necessary for TF-IDF. Tokenization and TF-IDF vectorization are two
examples of these procedures, which can be computationally costly for very big
datasets.
BM25 with the index reversed:
Efficient and basic: BM25 with inverted index is a straightforward technique for
information retrieval jobs. simpler to comprehend and put into practice.
Ideal for smaller datasets: Although BM25 can manage huge datasets, the overhead of
maintaining the inverted index may cause it to become less effective than FAISS for
very large datasets.
Flexibility in queries: BM25 gives you additional options because it takes terms in the
query and their relevance to the documents into account.
require tuning: Depending on the dataset and the particular task, BM25 parameters like
k1 and b may need to be adjusted for best results.
In summary, TF-IDF with FAISS can be a better option if you have a very large
dataset and scalability, is your top priority.
6.3 CREATING A QA PROMPT A SAMPLE LLM
38
Based on our project requirements using retrievalQA Sources chain and open ai key,
faiss vector index being key things we start the implantation
as we know there are loaders and splitters for our documents
Fig. 6.3(a) loaders and splitters screenshot
Fig. 6.3(b) embeddings
39
Embeddings are created using OpenAI embeddings and passed the documents and
embeddings in order to create faiss vector index
Now with retrieval sources chain, we ask query which has summaries and prompts
Fig. 6.3(c) retrievalQA sources chain and llm prompt
40
Fig. 6.3(d) retrievalQA sources chain and llm prompt
And we get the answer

We created a LLM prompt using Lang chain with all the modules which gives us
answers after having chunks and then filtering them and then having prompts and within
that it gives an answer
6.4 FINAL IMPLEMENTATION (QA-Bot)
Using the above same methodology as our proposed design, after understanding all the
sub parts like loading, splitting, embedding, faiss index, retrieval, we combine
everything using various modules and packages along with algo into one to get our final
implementation that is our QA-Bot, it is implemented using streamlit
41
Fig. 6.4(a) final project screenshots
42
Fig. 6.4(b) final project screenshots
Fig. 6.4(c) final project screenshots
43
After writing app.py, we run or host it on local network and port 8501 and it you can
see the bot in the URL provided and the final result of the LLM based news research
tool using Lang chain
Fig. 6.4(d) QA-bot (news research tool screenshot)
Explanation:
Here u can see that we have made streamlit application combining NLP using nm and
Lang chain and integrating all the above methods to finally make this interface
44
Fig. 6.4(e) QA-bot (news research tool screenshot)
Fig. 6.4(f) QA-Bot tool screenshot
45
Thus, based on above algorithms /libraries/packages, A news tool i.e. a QA Bot which
gives /retrieves a successfully correct answer to your question is created, this is a
student budget friendly project in the domain of nlp/ai/ml where applications like chat
gpt is booming in the market, with such projects in the current trends of technology, it
adds value and growth in terms of an engineering student
46
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 CONCLUSION
In this work, we present a new model that is specifically tailored to extract answers
from filtered chunks using key word search and extraction techniques from news
articles in a variety of categories. These groups cover a range of fields.
Through FAISS and TF-IDF, we improve keyword search accuracy by utilizing

linguistic patterns and domain-specific knowledge by customizing our model for each
category. TF-IDF approaches could miss domain-specific subtleties and be unable to
fully represent the complexities of various topic areas. As an alternative, our model
optimizes and adds domain-specific features to better capture the essence of responses
extracted from news items in each category.
Efficient Textual Data Extraction: The application includes several source text loaders
that can handle various document formats, such as text files, CSV files, and URLs. This
guarantees thorough data collecting from multiple sources.
Semantic Integrity Preservation: Documents can be divided into manageable portions

while maintaining semantic integrity and coherence by using a text splitter that uses
recursive characters. This makes it easier to analyze and comprehend the content at a
more detailed level.
Advanced Embedding Generation: By combining Hugging Face with OpenAI

technologies, embeddings from textual data may be produced using state-of-the-art pre-
trained language models. These embeddings make it easier to express sentences
numerically for additional processing and analysis.
Improved User Interface: Create an intuitive user interface that makes it easier for users
to interact with the tool, enabling them to input queries
47
7.2 FUTURE WORK
Future studies can examine how to optimize the proposed system by integrating it with
deep learning techniques. They can also explore how the model can improve tasks like
topic modeling, document retrieval, and summarization, as well as open up new
avenues for utilizing news content across different languages and domains, such as
regional news retrieval.
Improved User Interface: Create an intuitive user interface that makes it easier for users
to interact with the tool, enabling them to input queries, examine results, and visualize
insights.
Enable continuous monitoring and analysis of news articles as they are released by
implementing real-time data processing capabilities. This will ensure that timely
insights and updates are provided.
Domain-Specific Customization: Provide users with the ability to modify the tool to
suit their own needs and preferences by adapting it to particular domains or industries.
Integration with External APIs: To enhance the analysis and provide a more thorough
knowledge of news events, integrate with external APIs and data sources. Examples of
these sources include social media, financial, and geopolitical data.
48
REFERENCES
[1] N. Khan, M. S. Islam, F. Chowdhury, A. S. Siham and N. Sakib, "Bengali Crime

News Classification Based on Newspaper Headlines using NLP," 2022 25th
International Conference on Computer and Information Technology (ICCIT), Cox's
Bazar, Bangladesh, 2022, pp. 194-199, doi: 10.1109/ICCIT57492.2022.10055391.
[2] A. Saxena, V. Vijay Bhagat and A. Tamang, "Stock Market Trend Analysis on
Indian Financial News Headlines with Natural Language Processing," 2021 Asian
Conference on Innovation in Technology (ASIANCON), PUNE, India, 2021, pp. 1-5,
doi: 10.1109/ASIANCON51346.2021.9544965.
[3] P. Ghasiya and K. Okamura, "Investigating Cybersecurity News Articles by

Applying Topic Modeling Method," 2021 International Conference on Information
Networking (ICOIN), Jeju Island, Korea (South), 2021, pp. 432-438, doi:
10.1109/ICOIN50884.2021.9333952.
[4] S. S. Lwin and K. T. Nwet, "Extractive Summarization for Myanmar

Language," 2018 International Joint Symposium on Artificial Intelligence and
Natural Language Processing (iSAI-NLP), Pattaya, Thailand, 2018, pp. 1-6, doi:
10.1109/iSAI-NLP.2018.8692976.
[5] K. M. Alam, M. T. H. Hemel, S. M. Muhaiminul Islam and A. Akther, "Bangla

News Trend Observation using LDA Based Topic Modeling," 2020 23rd International
Conference on Computer and Information Technology (ICCIT), DHAKA,
Bangladesh, 2020, pp. 1-6, doi: 10.1109/ICCIT51783.2020.9392719.
[6] D. Kosmajac and V. Kešelj, "Automatic Text Summarization of News Articles in

Serbian Language," 2019 18th International Symposium INFOTEH-JAHORINA
(INFOTEH), East Sarajevo, Bosnia and Herzegovina, 2019, pp. 1-6, doi:
10.1109/INFOTEH.2019.8717655.
[7] T. Priyadharshan and S. Sumathipala, "Text Summarization for Tamil Online
49
Sports News Using NLP," 2018 3rd International Conference on Information
Technology Research (ICITR), Moratuwa, Sri Lanka, 2018, pp. 1-5, doi:
10.1109/ICITR.2018.8736154.
[8] J. Deny, S. Kamisetty, H. V. R. Thalakola, J. Vallamreddy and V. K. Uppari,

"Inshort Text Summarization of News Article," 2023 7th International Conference on
Intelligent Computing and Control Systems (ICICCS), Madurai, India, 2023, pp.
1104-1108, doi: 10.1109/ICICCS56967.2023.10142549.
[9] A. Mishra, A. Sahay, M. a. Pandey and S. S. Routaray, "News text Analysis using
Text Summarization and Sentiment Analysis based on NLP," 2023 3rd International
Conference on Smart Data Intelligence (ICSMDI), Trichy, India, 2023, pp. 28-31,
doi: 10.1109/ICSMDI57622.2023.00014.
[10] R. Habu, R. Ratnaparkhi, A. Askhedkar and S. Kulkarni, "A Hybrid Extractive-

Abstractive Framework with Pre & Post-Processing Techniques To Enhance Text
Summarization," 2023 13th International Conference on Advanced Computer
Information Technologies (ACIT), Wrocław, Poland, 2023, pp. 529-533, doi:
10.1109/ACIT58437.2023.10275584.
[11] R. Boorugu and G. Ramesh, "A Survey on NLP based Text Summarization for
Summarizing Product Reviews," 2020 Second International Conference on Inventive
Research in Computing Applications (ICIRCA), Coimbatore, India, 2020, pp. 352-
356, doi: 10.1109/ICIRCA48905.2020.9183355.
[12] B. Kynabay, A. Aldabergen and A. Zhamanov, "Automatic Summarizing the

News from Inform.kz by Using Natural Language Processing Tools," 2021 IEEE
International Conference on Smart Information Systems and Technologies (SIST),
Nur-Sultan, Kazakhstan, 2021, pp. 1-4, doi: 10.1109/SIST50301.2021.9465885.
[13] R. G. Mayopu, V. Nalluri and L. -S. Chen, "Classification ChatGPT Generated

News and True News Using Support Vector Machines," 2023 12th International
Conference on Awareness Science and Technology (iCAST), Taichung, Taiwan, 2023,
pp. 228-232, doi: 10.1109/iCAST57874.2023.10359313.
50
[14] A. Hingle, A. Katz and A. Johri, "Exploring NLP-Based Methods for Generating
Engineering Ethics Assessment Qualitative Codebooks," 2023 IEEE Frontiers in
Education Conference (FIE), College Station, TX, USA, 2023, pp. 1-8, doi:
10.1109/FIE58773.2023.10342985
[15] T. D. Jayasiriwardene and G. U. Ganegoda, "Keyword extraction from Tweets

using NLP tools for collecting relevant news," 2020 International Research
Conference on Smart Computing and Systems Engineering (SCSE),
Colombo, Sri Lanka, 2020, pp. 129-135, doi: 10.1109/SCSE49731.2020.9313024.
51

Thesis Final

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Thesis Final

Uploaded by

Copyright:

Available Formats

A project report on

LLM based news research tool using langchain

Submitted in partial fulfillment for the award of the degree of

Bachelor of Technology in Computer Science

KHUSHI HABBU (20BCE1612)

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

Submitted in partial fulfillment for the award of the degree of

Bachelor of Technology in Computer Science

KHUSHI HABBU (20BCE1612)

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

Signature of the Guide:

Name: Dr. Subbulakshmi P

Signature of the Internal Examiner Signature of the External Examiner

Approved by the Head of Department,

It is my pleasure to express with a deep sense of gratitude to Dr. Subbulakshmi P,

Special mention to Dr. Ganesan R, Dean, Dr. Parvathi R, Associate Dean

In a jubilant state, I express ingeniously my whole-hearted thanks to Dr.

Date: Khushi Habbu

LIST OF FIGURES………………………………………………………………………. vii

LIST OF ACRONYMS …………………………………………………………….......... ix

1.1 INTRODUCTION ……………………………………………………......1

3.1 BACKGROUND STUDY ……………………………………................12

4.1 SAMPLE DATA …………………………………………………………15

5.1 PROPOSED SYSTEM DIAGRAM……………. ………………………19

6.1 RESULTS OF FEW TRADITIONAL IR ALGORITHMS……………...25

Fig. 5.1 proposed system diagram

Fig. 5.3(a) summarization of chunks

Fig. 5.3(b) map reduce method

Fig. 5.4 FAISS index

Fig. 6.1(a) simple LDA code

Fig. 6.1(b) simple BM25 code

Fig. 6.1(c) simple BM25 code

Fig. 6.1(d)simple BM25 output

Fig. 6.1(e)simple BM25 output with score

Fig. 6.1(f) scores of the top ranked documents

Fig. 6.1(g) relevance score of top documents in bar plot

Fig. 6.1(h)simple T-IDF code with FAISS

Fig. 6.1(i)simple TF-IDF code with FAISS

Fig. 6.1(j)simple TF-IDF code with FAISS output

Fig. 6.1(k) similarity scores of retrieved documents

Fig. 6.1(l) word cloud of retrieved documents headings

Fig. 6.1(n) distribution of word count

Fig. 6.3(a) loaders and splitters screenshot

Fig. 6.3(b) embeddings

Fig. 6.3(c) retrievalQA sources chain and llm prompt

Fig. 6.3(d) retrievalQA sources chain and llm prompt

Fig. 6.4(a) final project screenshots

Fig. 6.4(b) final project screenshots

Fig. 6.4(c) final project screenshots

Fig. 6.4(d) QA-bot (news research tool screenshot)

Fig. 6.4(e) QA-bot (news research tool screenshot)

Fig. 6.4(f) QA-Bot tool screenshot

TF-IDF Term Frequency – Inverse Document Frequency

KNN k Nearest Neighbours

SVM Support Vector Machine

MAUI Multi-purpose automatic topic indexing

KEA Automatic Keyphrase Extraction

LSTM Long short term memory

CRF Conditional Random Field

TF-IDFIS Term Frequency-Inverse Document Frequence

LDA Latent Drichlet Allocation

NMF Non-negative Matrix Factorization

MLR Multiple Linear Regression

NLTK Natural Language Toolkit