
School of Computer Science and Engineering

VIT Chennai
Vandalur - Kelambakkam Road, Chennai - 600 127

Final Review Report

Programme: M. Tech (Int) Software Engineering

Course: SWE1017

Slot: G1

Faculty: Dr. M. Premalatha

Component:

Title: Infolink - Connecting Citizens to Government Services

Team Members:

S Subhitcha (20MIS1020)

Deepti Kannan (20MIS1061)

Lakshya S (20MIS1103)
LIST OF FIGURES

S. NO FIGURE NAME

FIG 1 FAISS Vector Database

FIG 2 Training phase

FIG 3 Backend & Frontend Representation

FIG 4 Search Engine Logic

FIG 5 FAISS index

FIG 6 Installation of Libraries

FIG 7 Working

FIG 8 Frontend

FIG 9 Result
ABSTRACT

In the dynamic landscape of artificial intelligence and natural language processing (NLP), this initiative seeks to leverage advanced language models to transform government services for the citizens of India. The convergence of NLP and cutting-edge language models, such as GPT-3.5, presents a unique opportunity to elevate accessibility, efficiency, and user experience in public service interactions.

NLP acts as the crucial link between human communication and computational
intelligence, offering a wide array of applications, from language translation to
sentiment analysis and interactive chatbot development. Within this context,
Language Models (LM) play a pivotal role, with GPT-3.5 showcasing the
capabilities of large-scale models in understanding and generating human-like
text. Additionally, the project recognizes the significance of other architectural
advancements in the field, aiming to harness the power of innovative models
beyond BERT.

The primary goal of this project is to streamline government services, enhancing accessibility and user-friendliness for the citizens of India. This abstract lays the groundwork for a thorough exploration of the potential of NLP technologies, underscoring the importance of meticulous planning to address data quality, privacy concerns, and the intricate technical challenges associated with deploying sophisticated language models. The journey unfolds with a commitment to utilizing state-of-the-art AI techniques to enhance the efficiency and responsiveness of public services, ultimately benefiting the people of India.
KEYWORDS:

1. Large Language Model (LLM)
2. Artificial Intelligence (AI)
3. Query Preprocessing
4. Chatbot
5. GPT-3.5
6. Bidirectional Encoder Representations from Transformers (BERT)
7. Facebook AI Similarity Search (FAISS)
8. Question Answering (QA)
9. Sentiment Analysis
10. LangChain
11. OpenAI Embeddings
12. ChatOpenAI Model
13. FAISS Vector Database

INTRODUCTION:

In an era marked by the continuous evolution of artificial intelligence, the incorporation of Natural Language Processing (NLP) emerges as a pivotal force reshaping our interactions with information and services. NLP, a dynamic subset of artificial intelligence[1], is dedicated to facilitating seamless communication between computers and humans through natural language. This project embarks on a transformative journey, leveraging the capabilities of advanced language models[2] like GPT-3.5 to revolutionize government services and enhance accessibility for the citizens of India.

The significance of NLP is evident in its diverse applications, ranging from language translation to sentiment analysis and the development of interactive chatbots. At the core of NLP are Language Models (LM), sophisticated algorithms trained on extensive text data to comprehend and generate human-like text. GPT-3.5 stands out as a testament to the immense potential of large-scale language models, showcasing their ability to understand and generate text on a human level.

Our project envisions a future where government services[3] in India become more
intuitive, user-centric, and efficient through the seamless integration of advanced
NLP techniques. By exploring the capabilities of language models like GPT-3.5[4],
we aim to streamline user interactions, provide context-aware information, and
enhance overall service delivery. However, achieving this vision requires careful
consideration of data quality, privacy concerns, and the intricate technical
challenges associated with deploying and maintaining sophisticated NLP models.

This introduction sets the stage for an in-depth exploration of how NLP
technologies[5], combined with powerful language models, can elevate the
accessibility and effectiveness of government services. The project represents a
commitment to harnessing cutting-edge artificial intelligence to empower citizens
and facilitate more intuitive interactions with public services in the digital age.

LITERATURE REVIEW:

1. Extracting Answers To Natural Language Questions From Large-Scale Corpus:

The paper delves into the realm of open-domain question answering in artificial intelligence, fueled by advancements in information retrieval, information extraction, and natural language processing. It introduces a three-component architecture that integrates NLP techniques, external resources like WordNet, and web information retrieval to enhance the accuracy of question answering. This architecture serves as the cornerstone, strategically bridging user queries and scattered information across extensive text repositories.

The three-tiered architecture begins with query preprocessing[6], where essential tasks such as keyword extraction, expansion, and answer type prediction are performed. Synonyms from WordNet enrich the system's understanding, and machine learning predicts answer types based on user queries. The information retrieval component utilizes SMART IR[7] for document retrieval, Google's ranking algorithm for web retrieval, and a System Similarity Model for passage retrieval, ensuring effective navigation through diverse information sources.
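The query-preprocessing stage described above can be sketched in miniature. The stopword list and synonym table below are illustrative stand-ins for the paper's actual resources (WordNet lookups and learned answer-type prediction), not its real implementation.

```python
# Hypothetical sketch of query preprocessing: keyword extraction via
# stopword filtering, plus simple synonym expansion. SYNONYMS stands
# in for WordNet; all entries here are invented for illustration.
import re

STOPWORDS = {"what", "is", "the", "a", "an", "of", "for", "to", "in", "how"}

SYNONYMS = {  # stand-in for WordNet synonym lookups
    "car": ["automobile", "vehicle"],
    "purchase": ["buy", "acquire"],
}

def preprocess_query(query: str) -> dict:
    """Extract keywords from a query and expand them with synonyms."""
    tokens = re.findall(r"[a-z]+", query.lower())
    keywords = [t for t in tokens if t not in STOPWORDS]
    expanded = set(keywords)
    for kw in keywords:
        expanded.update(SYNONYMS.get(kw, []))
    return {"keywords": keywords, "expanded": sorted(expanded)}

result = preprocess_query("What is the purchase price of a car?")
print(result["keywords"])   # ['purchase', 'price', 'car']
print(result["expanded"])
```

Expanding keywords before retrieval increases the chance that a relevant document is found even when it uses different wording than the query.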

At the core, the answer extraction component involves named entity recognition,
shallow parsing, answer ranking, and selection. Empirical ranking formulas
consider various features, leading to rigorous evaluation and ranking of candidate
answers. Experimental evaluations using the TREC-8 question-answering track[8]
as a benchmark provide valuable insights, highlighting the system's effectiveness
and suggesting areas for improvement. The systematic integration of NLP
techniques, external resources, and web retrieval shows promise in delivering
precise answers, reflecting the current state of the art in NLP and paving the way
for future research.

2. A Deep Learning Model[9] Based on BERT and Sentence Transformer for Semantic Keyphrase Extraction on Big Social Data

In the rapidly evolving landscape of natural language processing (NLP) and social
media analytics, the significance of efficient keyword extraction cannot be
overstated. This article delves into the critical realm of social media, particularly
Twitter, where the sheer volume of information makes manual keyword mining
impractical. The increasing prevalence of inappropriate information on Twitter
underscores the need for automated solutions to enhance content extraction,
search capabilities, decision-making processes, and various NLP tasks, including
text classification, sentiment analysis, and name recognition.

Acknowledging the challenges posed by the vast amount of data generated daily
on Twitter, the article introduces the Semkey-BERT model[10] as a powerful
solution. This model harnesses the capabilities of deep learning, specifically
BERT and sentence transformation, to extract meaningful content from Twitter.
The methodology involves a systematic process, starting with information
collection, preprocessing for clean data, and culminating in automated keyword
extraction and scoring. The Semkey-BERT model's effectiveness is validated
through comparisons with established unsupervised models, demonstrating its
prowess in improving search results and facilitating various NLP activities.
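The final step of the methodology, automated keyword extraction and scoring, can be illustrated with a toy example. Here keywords are ranked by plain term frequency; the actual Semkey-BERT model uses BERT and sentence-transformer embeddings rather than counts, so this is only a conceptual sketch.

```python
# Toy keyword extraction with scoring: rank non-stopword tokens by
# frequency. A frequency count stands in for the semantic scoring
# that Semkey-BERT derives from BERT embeddings.
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "for", "on", "your"}

def score_keywords(text: str, top_k: int = 3) -> list[tuple[str, int]]:
    """Return the top_k most frequent non-stopword tokens with scores."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    return Counter(tokens).most_common(top_k)

tweets = ("The vaccine rollout continues. Vaccine centers open today. "
          "Book your vaccine slot online today.")
print(score_keywords(tweets))  # → [('vaccine', 3), ('today', 2), ('rollout', 1)]
```

Even this crude scorer shows how automated extraction makes large tweet volumes searchable without manual keyword mining.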

As the digital landscape continues to evolve, the article navigates the complexities
of automated keyword extraction, presenting a solution that not only addresses the
challenges posed by excessive data but also opens avenues for refining search
functionalities and advancing natural language processing tasks.

3. COVID-19 vaccine sentiment analysis using public opinions on Twitter

The proposed method for evaluating opinions on the COVID-19 vaccine expressed on Twitter involves several key steps. It begins with data collection from various sources, including Twitter, Kaggle, and GitHub, to compile a comprehensive repository of tweets representing diverse thoughts and feelings. The subsequent tokenization step, utilizing NLTK's treebank tokenizer[11], breaks down tweets into single words for further analysis. Preliminary data cleaning involves stop word elimination, lemmatization, stemming, and the removal of special characters and numbers to prepare the text for sentiment analysis.
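The cleaning pipeline just described can be sketched with standard-library tools. A plain regex tokenizer stands in for NLTK's treebank tokenizer, and the stopword list is a small invented sample, so treat this as illustrative only.

```python
# Illustrative tweet-cleaning sketch: strip URLs and mentions,
# tokenize, drop numbers/special characters and stopwords.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "i", "my", "and", "to", "of"}

def clean_tweet(tweet: str) -> list[str]:
    """Tokenize a tweet and drop noise, numbers, and stopwords."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet)  # URLs and mentions
    text = re.sub(r"\d+\w*", " ", text)              # numbers like '2nd'
    tokens = re.findall(r"[a-z]+", text.lower())     # drops special characters
    # in practice, stemming/lemmatization (e.g. via NLTK) would follow here
    return [t for t in tokens if t not in STOPWORDS]

print(clean_tweet("Getting my 2nd dose today! #vaccine @who https://t.co/x"))
# → ['getting', 'dose', 'today', 'vaccine']
```

Note that the hashtag symbol is discarded but its word is kept, since hashtag content often carries the topic of the tweet.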

Sentiment analysis[12] forms the core of the method, with a decision tree algorithm
chosen for its simplicity, interpretability, and efficiency in classifying tweets into
sentiment categories such as positive, negative, or neutral. The final step focuses
on evaluating the effectiveness of the decision tree algorithm using metrics like
accuracy, precision, recall, and F1 score. This experiment aims to measure how
well the model understands public opinion on the COVID-19 vaccine from
Twitter, providing insights into the distribution of opinions and contributing to a
better understanding of public sentiment.

In essence, this method presents a systematic approach to analyzing public
opinions on COVID-19 vaccines expressed on Twitter, offering valuable insights
for researchers seeking to comprehend and monitor public sentiment in the realm
of social media discussions.
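The decision-tree classification step can be pictured as a small tree of if/else splits. The hand-written tree and the two sentiment lexicons below are invented for illustration; a trained tree (for example scikit-learn's DecisionTreeClassifier) would learn its splits from labeled data instead.

```python
# A deliberately tiny stand-in for a decision-tree sentiment
# classifier: each if/else is one split over a lexicon feature.
POSITIVE = {"effective", "safe", "grateful", "good", "relieved"}
NEGATIVE = {"scared", "unsafe", "side", "bad", "worried"}

def classify(tweet: str) -> str:
    """Classify a tweet as positive, negative, or neutral."""
    words = set(tweet.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:          # root split: positive evidence dominates
        return "positive"
    if neg > pos:          # second split: negative evidence dominates
        return "negative"
    return "neutral"       # leaf: no dominant signal

print(classify("grateful the vaccine is safe and effective"))  # → positive
print(classify("worried about side effects"))                  # → negative
```

The interpretability the paper values in decision trees is visible here: each prediction can be traced through a short, readable chain of splits.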

DATA SET DESCRIPTION:

1. Budget Speech Dataset

The Budget Speech 2023-24 dataset encapsulates the financial address by Finance
Minister Nirmala Sitharaman on February 1, 2023. It delves into economic
priorities, initiatives, and achievements for the fiscal year.

Content Overview:

- The dataset includes detailed insights into revenue sources, expenditure categories, and economic indicators presented in the budget speech.

- Textual content is organized into sections, subsections, and paragraphs, providing a structured representation of the information.

Metadata:

- Common attributes such as date, location, and government department are included to facilitate easy categorization and retrieval of information.

2. Financial Bill 2023 Highlights Dataset

The dataset highlights key amendments proposed in the Finance Bill 2023 by
Finance Minister Smt. Nirmala Sitharaman during the Union Budget presentation.
It covers crucial aspects related to tax rates, deductions, exemptions, benefits,
business, capital gains, trusts, assessments, TDS/TCS, penalties, and
miscellaneous provisions.

Highlight Details:

- Information on changes to tax rates, new fiscal policies, and their anticipated impact on the economy is captured.

- The dataset provides a concise summary of the most significant aspects of the
Financial Bill for quick reference.

Metadata:

Table of Contents: 11 categories including Tax Changes, Deductions, Business, and more
Key Fields: Category, Subcategory, Summary, Section Reference, Date
Data Volume: Concise information on Finance Bill 2023 amendments
Intended Use: Quick overview for understanding and analysis of key budgetary changes

3. How to Apply for Driving License Dataset

The "How to Apply for Driving License" dataset is designed to guide individuals
through the process of obtaining a driving license. This dataset includes:

Application Procedures:

- Detailed information on the steps involved in applying for a driving license, from document submission to testing procedures.

- Any recent updates or changes to the application process are documented for user awareness.

Eligibility and Requirements:

- Clear guidelines on eligibility criteria and necessary documents are outlined to assist individuals in preparing for the application process.
Metadata:

Data Format: Text
Contents: Application Procedures, Eligibility, and Requirements
Key Fields: Application Steps, Updates, Eligibility Criteria, Necessary Documents
Data Volume: Detailed information for obtaining a driving license
Intended Use: Guiding individuals through the driving license application process

4. Vector DataStores

The textual content within each dataset is transformed into numerical vectors
using advanced natural language processing techniques. These vectors are then
indexed using Faiss, a high-performance similarity search library, to enable
efficient data retrieval.
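The transformation above, embedding text as vectors and indexing them for similarity search, can be sketched in miniature. This stand-in uses a toy hashed bag-of-words embedding and a brute-force cosine search in place of the project's OpenAI Embeddings and FAISS index; the embedding scheme and sample documents are purely illustrative.

```python
# Minimal vector-store sketch: hashed bag-of-words embeddings plus
# brute-force cosine similarity, standing in for OpenAI Embeddings
# and a FAISS index.
import math

DIM = 64  # toy embedding dimension

def embed(text: str) -> list[float]:
    """Hash each token into a fixed-size, L2-normalized vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class VectorStore:
    """In-memory index over document vectors (FAISS stand-in)."""
    def __init__(self, docs: list[str]):
        self.docs = docs
        self.vectors = [embed(d) for d in docs]

    def search(self, query: str, k: int = 1) -> list[str]:
        """Return the k documents most similar to the query."""
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, v)), d)
                  for v, d in zip(self.vectors, self.docs)]
        scored.sort(key=lambda s: -s[0])
        return [d for _, d in scored[:k]]

store = VectorStore([
    "The budget speech outlines revenue sources and expenditure.",
    "Apply for a driving license by submitting documents online.",
])
print(store.search("how to apply for driving license"))
```

FAISS replaces the brute-force loop here with optimized index structures, which is what makes similarity search fast enough at real dataset sizes.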

Figure 1: FAISS Vector Database

5. LangChain Framework Integration

The LangChain framework is utilized to enhance specific functionalities within the project, such as document loading, ChatOpenAI, OpenAIEmbeddings, and RetrievalQA.
URL Links Of Datasets

https://www.indiabudget.gov.in/doc/budget_speech.pdf

https://incometaxindia.gov.in/news/finance-bill-2023-highlights.pdf

https://www.bankbazaar.com/driving-licence.html#:~:text=Step%201%3A%20Visit%20the%20official,have%20to%20select%20your%20state

IMPLEMENTATION:

The Retrieval QA (Question-Answering)[13] is the NLP level for the identified topic, which entails developing a system for obtaining pertinent information from documents based on user queries.

Retrieval QA Level

The main goal of the Retrieval QA level is to respond to user queries by retrieving
pre-existing information from a knowledge base or collection of documents.
Instead of producing answers on the fly, the system uses its stored knowledge to
produce pertinent responses.

Document Representation: To capture semantic meaning and relationships between words and phrases, the text content of documents is converted into numerical representations (embeddings)[14].

Indexing: To effectively arrange and store these document representations, an indexing mechanism is used. Based on the similarity between the user query and the document content, this indexing enables quick and precise retrieval of pertinent documents.

Processing User Queries: User queries are processed, and their meaning is deciphered in relation to the documents that have been indexed. Based on the similarity between the query and the stored document representations, the system finds the documents that are most relevant.

Answer Retrieval: The user is shown the pertinent data or responses that have been extracted from the selected documents.
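The four steps above can be sketched end to end in a few lines. The word-overlap scoring and the two sample documents are illustrative stand-ins; the project itself scores similarity with embeddings rather than shared words.

```python
# Miniature retrieval QA flow: represent documents as word sets,
# match the query against them, and return the most relevant
# sentence of the best document as the answer.
import re

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

DOCS = [
    "A driving license application needs proof of age and address. "
    "The test is booked after document verification.",
    "The budget speech lists revenue sources. It also lists expenditure categories.",
]

def answer(query: str) -> str:
    q = tokenize(query)
    # steps 1-3: represent documents, index them, match the query
    best_doc = max(DOCS, key=lambda d: len(q & tokenize(d)))
    # step 4: retrieve the most relevant sentence as the answer
    sentences = [s.strip() + "." for s in best_doc.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q & tokenize(s)))

print(answer("what documents do I need for a driving license?"))
```

The key property this preserves from the real system is that answers are retrieved from stored documents rather than generated from scratch.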

Selection of Retrieval QA:

The Retrieval QA level is appropriate for applications that must provide precise responses or information from a predefined set of documents. The application supports the features of a Retrieval QA system by using OpenAI Embeddings for document representation and FAISS for efficient similarity search. The ChatOpenAI model enhances the user experience by managing user inquiries and providing responses in a conversational fashion, adding to the interactive element.

Machine Learning Models

OpenAI Embeddings

Functionality: OpenAI Embeddings are used for generating vector representations of words in the text, capturing semantic information.

Role in the Application: In the Retrieval QA system, OpenAI Embeddings play a vital role in converting textual information into numerical vectors. These vectors encode the semantic meaning of words and phrases, facilitating similarity comparisons between user queries and document content during the retrieval process.

FAISS (Facebook AI Similarity Search)

Functionality: FAISS[16] is a library for efficient similarity search and clustering of dense vectors.

Role in the Application: FAISS is employed to build an index from the vector
representations generated by OpenAI Embeddings. This index allows for fast and
effective similarity searches, enabling the system to retrieve relevant documents
quickly based on the similarity between the user's query and the content of the
indexed documents. FAISS enhances the efficiency of the retrieval process, which
is crucial for real-time performance.

Figure 2: Training phase

Figure 3: Backend & Frontend Representation

Figure 4: Search Engine Logic

Figure 5: FAISS index

Figure 6: Installation of Libraries

[Pypdf - to extract contents from PDF
Faiss-cpu - content and embedding storage (in memory)
Langchain - training and search logic]

Figure 7 : Working

Figure 8 : Frontend
Figure 9: Working of CIVICINFOBOT

RESULTS AND DISCUSSION:

Qualitative Analysis

Performance:

Strengths

The use of FAISS for similarity search and OpenAI Embeddings for document
representation likely contributes to the system's ability to efficiently retrieve
relevant information.

Weaknesses

The effectiveness of the system heavily depends on the quality of embeddings and the initial training of ChatOpenAI.

Model Selection:

OpenAI Embeddings and FAISS

OpenAI Embeddings are chosen for their ability to capture semantic meaning, crucial for understanding user queries and document content. FAISS is selected for efficient similarity search, which is vital for quickly retrieving relevant documents.

Evaluation Metrics Selection:

The assessment of system performance was aligned with predefined goals, emphasizing accuracy and relevance in response to user queries. Precision, recall, and F1 score were selected as suitable evaluation metrics. The implementation of robust code allowed for the calculation of these metrics, with adjustments to thresholds made to balance precision and recall. Evaluation on a separate validation set provided valuable insights into the system's generalization.
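The three metrics named above can be computed directly from binary relevance labels. The example labels below are invented for illustration; the project's own validation data is not shown here.

```python
# Precision, recall, and F1 from binary labels:
# 1 = relevant answer returned, 0 = not.
def precision_recall_f1(predicted: list[int], actual: list[int]) -> tuple:
    tp = sum(p == a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# hypothetical labels for five queries
p, r, f1 = precision_recall_f1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.67 0.67 0.67
```

Adjusting the retrieval threshold trades these values against each other: a stricter threshold tends to raise precision and lower recall, which is why F1 is used to balance the two.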

Conclusion

The optimization of hyperparameters and the selection of appropriate evaluation metrics contribute significantly to the overall effectiveness of the Retrieval QA system. Future work may involve continuous refinement of hyperparameters, exploration of the hyperparameter space, and potential adjustments to system goals based on evolving requirements.

Limitations

Cost of GPT API:

If the GPT API is used, it could be costly, especially if the application has a high volume of queries. This financial aspect could limit the scalability of the system.

Data Dependence:

The system's effectiveness relies heavily on the quality and relevance of the training data. It may struggle with out-of-distribution queries or topics not well represented in the training set[17].

CONCLUSION:

In essence, the Retrieval QA level in Natural Language Processing (NLP) is designed to provide precise responses to user queries by extracting information from a predefined set of documents. This process involves converting text into numerical representations, creating an index for efficient retrieval, and using advanced machine learning models such as OpenAI Embeddings and FAISS.

OpenAI Embeddings play a vital role in encoding the semantic meaning of words,
while FAISS facilitates fast and effective similarity searches, enhancing the
efficiency of the retrieval process. Together, these components form a powerful
foundation for applications requiring accurate and rapid access to information.

The proposed ChatOpenAI model exemplifies the practical application of these techniques, offering a conversational user experience. This approach showcases the potential of advanced NLP in providing interactive and efficient solutions for real-time user interactions[18].

FUTURE WORK:

In future work, the Retrieval QA system can be strengthened through additional fine-tuning of the ChatOpenAI model on domain-specific data, diversifying the training set to enhance robustness, and conducting further experiments to optimize hyperparameters in OpenAI Embeddings and FAISS. Cost optimization strategies, such as caching or pre-processing, should be explored to reduce the expenses associated with external APIs. Improving the user interaction aspect and addressing system limitations, including biases and specific query challenges, are essential considerations. Incorporating user feedback, evaluating on diverse scenarios, exploring advanced NLP techniques, and scaling for large datasets will collectively contribute to the system's continuous refinement, adaptability, and overall effectiveness.
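One of the cost-optimization ideas mentioned above, caching, can be sketched briefly. Here `call_model` is a hypothetical stand-in for a billable GPT API request, reduced to a call counter for illustration.

```python
# Query-response caching sketch: repeated questions are answered from
# an in-memory cache so the (paid) model API is called only once per
# distinct query. call_model is a hypothetical stand-in, not a real API.
from functools import lru_cache

CALLS = {"n": 0}

def call_model(query: str) -> str:
    """Stand-in for a billable GPT API request."""
    CALLS["n"] += 1
    return f"answer to: {query}"

@lru_cache(maxsize=1024)
def cached_answer(query: str) -> str:
    # identical queries hit the cache instead of the paid API
    return call_model(query)

cached_answer("How do I apply for a driving license?")
cached_answer("How do I apply for a driving license?")  # served from cache
print(CALLS["n"])  # → 1
```

In practice the cache key would first be normalized (lowercasing, stripping punctuation) so that trivially different phrasings of the same question also share a cached answer.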

REFERENCES:

[1] Woods, W., & Kaplan, R. (1977). Lunar rocks in natural English: explorations in natural language question answering. Linguistic structures processing. In Fundamental Studies in Computer Science, 5, 521–569.
[2] Lehnert, W. (1978). The Process of Question Answering: A computer simulation of cognition. Lawrence Erlbaum.
[3] Voorhees, E. (2004). Overview of the TREC Question Answering Track. The Thirteenth Text Retrieval Conference, 2004.
[4] Pasca, M. A. (2001). High-performance, open-domain question answering from large text collections [PhD Thesis]. Southern Methodist University.
[5] Lehnert, W. (1978). The Process of Question Answering: A computer simulation of cognition. Lawrence Erlbaum.
[6] Voorhees, E. (2004). Overview of the TREC Question Answering Track. The Thirteenth Text Retrieval Conference, 2004.
[7] Pasca, M. A. (2001). High-performance, open-domain question answering from large text collections [PhD Thesis]. Southern Methodist University.
[8] Salton, G. (Ed.). (1969). The SMART retrieval system. Englewood Cliffs, NJ: Prentice Hall.
[9] Collobert, R., & Weston, J. (July 2008). 'A unified architecture for natural language processing: Deep neural networks with multitask learning,' in Proc. 25th Int. Conf. Mach. Learn. (pp. 160–167).
[10] Kabir, M. F., Abdullah-Al-Mamun, K., & Huda, M. N. (May 2016). 'Deep learning-based parts of speech tagger for Bengali,' in Proc. 5th Int. Conf. Inform., Electron. Vis. (ICIEV) (pp. 26–29).
[11] Merrouni, Z. A., Frikh, B., & Ouhbi, B. (May 2019). Automatic keyphrase extraction: A survey and trends. Journal of Intelligent Information Systems, 54, 391–424.
[12] Bougouin, A., Boudin, F., & Daille, B. (October 2013). TopicRank: Graph-based topic ranking for keyphrase extraction. In Proceedings of the Int. Joint Conf. Natural Lang. Process. (IJCNLP) (pp. 543–551).
[13] Wan, X., & Xiao, J. (2008). Single document keyphrase extraction using neighborhood knowledge. In Proceedings of the AAAI, 8 (pp. 855–860).
[14] Yousefinaghani, S., Dara, R., Mubareka, S., Papadopoulos, A., & Sharif, S. (2021). An analysis of COVID-19 vaccine sentiments and opinions on Twitter. International Journal of Infectious Diseases, 108, 256–262. doi:10.1016/j.ijid.2021.05.059
[15] Dang, N. C., Moreno-García, M. N., & De la Prieta, F. (2020). Sentiment analysis based on deep learning: A comparative study. Electronics, 9(3), 483. doi:10.3390/electronics9030483
[16] Marcec, R., & Likic, R. (2021). Using Twitter for sentiment analysis towards AstraZeneca/Oxford, Pfizer/BioNTech and Moderna COVID-19 vaccines. Postgraduate Medical Journal [Online first]. doi:10.1136/postgradmedj-2021-140685
[17] Mehedi Shamrat, F. M. J., Chakraborty, S., Imran, M. M., Muna, J. N., Billah, M. M., Das, P., & Rahman, M. O. (2021). Sentiment analysis on Twitter tweets about COVID-19 vaccines using NLP and supervised KNN classification algorithm, 23(1) (pp. 463–470).
[18] Ansari, M. T. J., & Khan, N. A. (2021). Worldwide COVID-19 vaccines sentiment analysis through Twitter content. Electronic Journal of General Medicine, 18(6). doi:10.29333/ejgm/11316
