
School of Computer Science and Engineering

VIT Chennai
Vandalur - Kelambakkam Road, Chennai - 600 127

Final Review Report

Programme: M. Tech (Int) Software Engineering

Course: SWE1017

Slot: G1

Faculty: Dr. M. Premalatha

Component:

Title: Infolink - Connecting Citizens to Government Services

Team Members:

S Subhitcha (20MIS1020)

Deepti Kannan (20MIS1061)

Lakshya S (20MIS1103)
LIST OF FIGURES

S. NO FIGURE NAME

FIG 1 FAISS Vector Database

FIG 2 Training phase

FIG 3 Backend & Frontend Representation

FIG 4 Search Engine Logic

FIG 5 FAISS index

FIG 6 Installation of Libraries

FIG 7 Working

FIG 8 Frontend

FIG 9 Result
ABSTRACT

In the dynamic landscape of artificial intelligence and natural language processing (NLP), this initiative seeks to leverage advanced language models to transform government services for the citizens of India. The convergence of NLP and cutting-edge language models, such as GPT-3.5, presents a unique opportunity to elevate accessibility, efficiency, and user experience in public service interactions.

NLP acts as the crucial link between human communication and computational
intelligence, offering a wide array of applications, from language translation to
sentiment analysis and interactive chatbot development. Within this context,
Language Models (LM) play a pivotal role, with GPT-3.5 showcasing the
capabilities of large-scale models in understanding and generating human-like
text. Additionally, the project recognizes the significance of other architectural
advancements in the field, aiming to harness the power of innovative models
beyond BERT.

The primary goal of this project is to streamline government services, enhancing accessibility and user-friendliness for the citizens of India. This abstract lays the groundwork for a thorough exploration of the potential of NLP technologies, underscoring the importance of meticulous planning to address data quality, privacy concerns, and the intricate technical challenges associated with deploying sophisticated language models. The journey unfolds with a commitment to utilizing state-of-the-art AI techniques to enhance the efficiency and responsiveness of public services, ultimately benefiting the people of India.
KEYWORDS:

1. Large Language Model (LLM)
2. Artificial Intelligence (AI)
3. Query Preprocessing
4. Chatbot
5. GPT-3.5
6. Bidirectional Encoder Representations from Transformers (BERT)
7. Facebook AI Similarity Search (FAISS)
8. Question Answering (QA)
9. Sentiment Analysis
10. LangChain
11. OpenAI Embeddings
12. ChatOpenAI Model
13. FAISS Vector Database

INTRODUCTION:

In an era marked by the continuous evolution of artificial intelligence, the incorporation of Natural Language Processing (NLP) emerges as a pivotal force reshaping our interactions with information and services. NLP, a dynamic subset of artificial intelligence[1], is dedicated to facilitating seamless communication between computers and humans through natural language. This project embarks on a transformative journey, leveraging the capabilities of advanced language models[2] like GPT-3.5 to revolutionize government services and enhance accessibility for the citizens of India.

The significance of NLP is evident in its diverse applications, ranging from language translation to sentiment analysis and the development of interactive chatbots. At the core of NLP are Language Models (LM), sophisticated algorithms trained on extensive text data to comprehend and generate human-like text. GPT-3.5 stands out as a testament to the immense potential of large-scale language models, showcasing their ability to understand and generate text on a human level.

Our project envisions a future where government services[3] in India become more
intuitive, user-centric, and efficient through the seamless integration of advanced
NLP techniques. By exploring the capabilities of language models like GPT-3.5[4],
we aim to streamline user interactions, provide context-aware information, and
enhance overall service delivery. However, achieving this vision requires careful
consideration of data quality, privacy concerns, and the intricate technical
challenges associated with deploying and maintaining sophisticated NLP models.

This introduction sets the stage for an in-depth exploration of how NLP
technologies[5], combined with powerful language models, can elevate the
accessibility and effectiveness of government services. The project represents a
commitment to harnessing cutting-edge artificial intelligence to empower citizens
and facilitate more intuitive interactions with public services in the digital age.

LITERATURE REVIEW:

1. Extracting Answers To Natural Language Questions From Large-Scale Corpus:

The paper delves into the realm of open-domain question answering in artificial intelligence, fueled by advancements in information retrieval, information extraction, and natural language processing. It introduces a three-component architecture that integrates NLP techniques, external resources like WordNet, and web information retrieval to enhance the accuracy of question answering. This architecture serves as the cornerstone, strategically bridging user queries and scattered information across extensive text repositories.

The three-tiered architecture begins with query preprocessing[6], where essential tasks such as keyword extraction, expansion, and answer type prediction are performed. Synonyms from WordNet enrich the system's understanding, and machine learning predicts answer types based on user queries. The information retrieval component utilizes SMART IR[7] for document retrieval, Google's ranking algorithm for web retrieval, and a System Similarity Model for passage retrieval, ensuring effective navigation through diverse information sources.
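The query-preprocessing stage described above can be sketched in miniature. The stopword list and synonym table below are illustrative stand-ins for the paper's actual resources (WordNet lookups and learned answer-type prediction), not its real implementation.

```python
# Hypothetical sketch of query preprocessing: keyword extraction via
# stopword filtering, plus simple synonym expansion. SYNONYMS stands
# in for WordNet; all entries here are invented for illustration.
import re

STOPWORDS = {"what", "is", "the", "a", "an", "of", "for", "to", "in", "how"}

SYNONYMS = {  # stand-in for WordNet synonym lookups
    "car": ["automobile", "vehicle"],
    "purchase": ["buy", "acquire"],
}

def preprocess_query(query: str) -> dict:
    """Extract keywords from a query and expand them with synonyms."""
    tokens = re.findall(r"[a-z]+", query.lower())
    keywords = [t for t in tokens if t not in STOPWORDS]
    expanded = set(keywords)
    for kw in keywords:
        expanded.update(SYNONYMS.get(kw, []))
    return {"keywords": keywords, "expanded": sorted(expanded)}

result = preprocess_query("What is the purchase price of a car?")
print(result["keywords"])   # ['purchase', 'price', 'car']
print(result["expanded"])
```

Expanding keywords before retrieval increases the chance that a relevant document is found even when it uses different wording than the query.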

At the core, the answer extraction component involves named entity recognition,
shallow parsing, answer ranking, and selection. Empirical ranking formulas
consider various features, leading to rigorous evaluation and ranking of candidate
answers. Experimental evaluations using the TREC-8 question-answering track[8]
as a benchmark provide valuable insights, highlighting the system's effectiveness
and suggesting areas for improvement. The systematic integration of NLP
techniques, external resources, and web retrieval shows promise in delivering
precise answers, reflecting the current state of the art in NLP and paving the way
for future research.

2. A Deep Learning Model[9] Based on BERT and Sentence Transformer for Semantic Keyphrase Extraction on Big Social Data

In the rapidly evolving landscape of natural language processing (NLP) and social
media analytics, the significance of efficient keyword extraction cannot be
overstated. This article delves into the critical realm of social media, particularly
Twitter, where the sheer volume of information makes manual keyword mining
impractical. The increasing prevalence of inappropriate information on Twitter
underscores the need for automated solutions to enhance content extraction,
search capabilities, decision-making processes, and various NLP tasks, including
text classification, sentiment analysis, and name recognition.

Acknowledging the challenges posed by the vast amount of data generated daily
on Twitter, the article introduces the Semkey-BERT model[10] as a powerful
solution. This model harnesses the capabilities of deep learning, specifically
BERT and sentence transformation, to extract meaningful content from Twitter.
The methodology involves a systematic process, starting with information
collection, preprocessing for clean data, and culminating in automated keyword
extraction and scoring. The Semkey-BERT model's effectiveness is validated
through comparisons with established unsupervised models, demonstrating its
prowess in improving search results and facilitating various NLP activities.
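The final step of the methodology, automated keyword extraction and scoring, can be illustrated with a toy example. Here keywords are ranked by plain term frequency; the actual Semkey-BERT model uses BERT and sentence-transformer embeddings rather than counts, so this is only a conceptual sketch.

```python
# Toy keyword extraction with scoring: rank non-stopword tokens by
# frequency. A frequency count stands in for the semantic scoring
# that Semkey-BERT derives from BERT embeddings.
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "for", "on", "your"}

def score_keywords(text: str, top_k: int = 3) -> list[tuple[str, int]]:
    """Return the top_k most frequent non-stopword tokens with scores."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    return Counter(tokens).most_common(top_k)

tweets = ("The vaccine rollout continues. Vaccine centers open today. "
          "Book your vaccine slot online today.")
print(score_keywords(tweets))  # → [('vaccine', 3), ('today', 2), ('rollout', 1)]
```

Even this crude scorer shows how automated extraction makes large tweet volumes searchable without manual keyword mining.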

As the digital landscape continues to evolve, the article navigates the complexities
of automated keyword extraction, presenting a solution that not only addresses the
challenges posed by excessive data but also opens avenues for refining search
functionalities and advancing natural language processing tasks.

3. COVID-19 vaccine sentiment analysis using public opinions on Twitter

The proposed method for evaluating opinions on the COVID-19 vaccine expressed on Twitter involves several key steps. It begins with data collection from various sources, including Twitter, Kaggle, and GitHub, to compile a comprehensive repository of tweets representing diverse thoughts and feelings. The subsequent tokenization step, utilizing NLTK's treebank tokenizer[11], breaks down tweets into single words for further analysis. Preliminary data cleaning involves stop word elimination, lemmatization, stemming, and the removal of special characters and numbers to prepare the text for sentiment analysis.
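The cleaning pipeline just described can be sketched with standard-library tools. A plain regex tokenizer stands in for NLTK's treebank tokenizer, and the stopword list is a small invented sample, so treat this as illustrative only.

```python
# Illustrative tweet-cleaning sketch: strip URLs and mentions,
# tokenize, drop numbers/special characters and stopwords.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "i", "my", "and", "to", "of"}

def clean_tweet(tweet: str) -> list[str]:
    """Tokenize a tweet and drop noise, numbers, and stopwords."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet)  # URLs and mentions
    text = re.sub(r"\d+\w*", " ", text)              # numbers like '2nd'
    tokens = re.findall(r"[a-z]+", text.lower())     # drops special characters
    # in practice, stemming/lemmatization (e.g. via NLTK) would follow here
    return [t for t in tokens if t not in STOPWORDS]

print(clean_tweet("Getting my 2nd dose today! #vaccine @who https://t.co/x"))
# → ['getting', 'dose', 'today', 'vaccine']
```

Note that the hashtag symbol is discarded but its word is kept, since hashtag content often carries the topic of the tweet.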

Sentiment analysis[12] forms the core of the method, with a decision tree algorithm
chosen for its simplicity, interpretability, and efficiency in classifying tweets into
sentiment categories such as positive, negative, or neutral. The final step focuses
on evaluating the effectiveness of the decision tree algorithm using metrics like
accuracy, precision, recall, and F1 score. This experiment aims to measure how
well the model understands public opinion on the COVID-19 vaccine from
Twitter, providing insights into the distribution of opinions and contributing to a
better understanding of public sentiment.

In essence, this method presents a systematic approach to analyzing public
opinions on COVID-19 vaccines expressed on Twitter, offering valuable insights
for researchers seeking to comprehend and monitor public sentiment in the realm
of social media discussions.
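The decision-tree classification step can be pictured as a small tree of if/else splits. The hand-written tree and the two sentiment lexicons below are invented for illustration; a trained tree (for example scikit-learn's DecisionTreeClassifier) would learn its splits from labeled data instead.

```python
# A deliberately tiny stand-in for a decision-tree sentiment
# classifier: each if/else is one split over a lexicon feature.
POSITIVE = {"effective", "safe", "grateful", "good", "relieved"}
NEGATIVE = {"scared", "unsafe", "side", "bad", "worried"}

def classify(tweet: str) -> str:
    """Classify a tweet as positive, negative, or neutral."""
    words = set(tweet.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:          # root split: positive evidence dominates
        return "positive"
    if neg > pos:          # second split: negative evidence dominates
        return "negative"
    return "neutral"       # leaf: no dominant signal

print(classify("grateful the vaccine is safe and effective"))  # → positive
print(classify("worried about side effects"))                  # → negative
```

The interpretability the paper values in decision trees is visible here: each prediction can be traced through a short, readable chain of splits.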

DATA SET DESCRIPTION:

1. Budget Speech Dataset

The Budget Speech 2023-24 dataset encapsulates the financial address by Finance
Minister Nirmala Sitharaman on February 1, 2023. It delves into economic
priorities, initiatives, and achievements for the fiscal year.

Content Overview:

- The dataset includes detailed insights into revenue sources, expenditure categories, and economic indicators presented in the budget speech.

- Textual content is organized into sections, subsections, and paragraphs, providing a structured representation of the information.

Metadata:

- Common attributes such as date, location, and government department are included to facilitate easy categorization and retrieval of information.

2. Financial Bill 2023 Highlights Dataset

The dataset highlights key amendments proposed in the Finance Bill 2023 by
Finance Minister Smt. Nirmala Sitharaman during the Union Budget presentation.
It covers crucial aspects related to tax rates, deductions, exemptions, benefits,
business, capital gains, trusts, assessments, TDS/TCS, penalties, and
miscellaneous provisions.

Highlight Details:

- Information on changes to tax rates, new fiscal policies, and their anticipated impact on the economy is captured.

- The dataset provides a concise summary of the most significant aspects of the
Financial Bill for quick reference.

Metadata:

Table of Contents: 11 categories including Tax Changes, Deductions, Business, and more
Key Fields: Category, Subcategory, Summary, Section Reference, Date
Data Volume: Concise information on Finance Bill 2023 amendments
Intended Use: Quick overview for understanding and analysis of key budgetary changes

3. How to Apply for Driving License Dataset

The "How to Apply for Driving License" dataset is designed to guide individuals
through the process of obtaining a driving license. This dataset includes:

Application Procedures:

- Detailed information on the steps involved in applying for a driving license, from document submission to testing procedures.

- Any recent updates or changes to the application process are documented for user awareness.

Eligibility and Requirements:

- Clear guidelines on eligibility criteria and necessary documents are outlined to assist individuals in preparing for the application process.
Metadata:

Data Format: Text
Contents: Application Procedures, Eligibility, and Requirements
Key Fields: Application Steps, Updates, Eligibility Criteria, Necessary Documents
Data Volume: Detailed information for obtaining a driving license
Intended Use: Guiding individuals through the driving license application process

4. Vector DataStores

The textual content within each dataset is transformed into numerical vectors
using advanced natural language processing techniques. These vectors are then
indexed using Faiss, a high-performance similarity search library, to enable
efficient data retrieval.
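The transformation above, embedding text as vectors and indexing them for similarity search, can be sketched in miniature. This stand-in uses a toy hashed bag-of-words embedding and a brute-force cosine search in place of the project's OpenAI Embeddings and FAISS index; the embedding scheme and sample documents are purely illustrative.

```python
# Minimal vector-store sketch: hashed bag-of-words embeddings plus
# brute-force cosine similarity, standing in for OpenAI Embeddings
# and a FAISS index.
import math

DIM = 64  # toy embedding dimension

def embed(text: str) -> list[float]:
    """Hash each token into a fixed-size, L2-normalized vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class VectorStore:
    """In-memory index over document vectors (FAISS stand-in)."""
    def __init__(self, docs: list[str]):
        self.docs = docs
        self.vectors = [embed(d) for d in docs]

    def search(self, query: str, k: int = 1) -> list[str]:
        """Return the k documents most similar to the query."""
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, v)), d)
                  for v, d in zip(self.vectors, self.docs)]
        scored.sort(key=lambda s: -s[0])
        return [d for _, d in scored[:k]]

store = VectorStore([
    "The budget speech outlines revenue sources and expenditure.",
    "Apply for a driving license by submitting documents online.",
])
print(store.search("how to apply for driving license"))
```

FAISS replaces the brute-force loop here with optimized index structures, which is what makes similarity search fast enough at real dataset sizes.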

Figure 1: FAISS Vector Database

5. LangChain Framework Integration

The LangChain framework is utilized to enhance specific functionalities within the project, such as document loading, ChatOpenAI, OpenAIEmbeddings, and RetrievalQA.
URL Links Of Datasets

https://www.indiabudget.gov.in/doc/budget_speech.pdf

https://incometaxindia.gov.in/news/finance-bill-2023-highlights.pdf

https://www.bankbazaar.com/driving-licence.html#:~:text=Step%201%3A%20Visit%20the%20official,have%20to%20select%20your%20state

IMPLEMENTATION:

The Retrieval QA (Question-Answering)[13] is the NLP level for the identified topic, which entails developing a system for obtaining pertinent information from documents based on user queries.

Retrieval QA Level

The main goal of the Retrieval QA level is to respond to user queries by retrieving
pre-existing information from a knowledge base or collection of documents.
Instead of producing answers on the fly, the system uses its stored knowledge to
produce pertinent responses.

Document Representation: To capture semantic meaning and relationships between words and phrases, the text content of documents is converted into numerical representations (embeddings)[14].

Indexing: To effectively arrange and store these document representations, an indexing mechanism is used. Based on the similarity between the user query and the document content, this indexing enables quick and precise retrieval of pertinent documents.

Processing User Queries: User queries are processed, and their meaning is deciphered in relation to the documents that have been indexed. Based on the similarity between the query and the stored document representations, the system finds the documents that are most relevant.

Answer Retrieval: The user is shown the pertinent data or responses that have been extracted from the selected documents.
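The four steps above can be sketched end to end in a few lines. The word-overlap scoring and the two sample documents are illustrative stand-ins; the project itself scores similarity with embeddings rather than shared words.

```python
# Miniature retrieval QA flow: represent documents as word sets,
# match the query against them, and return the most relevant
# sentence of the best document as the answer.
import re

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

DOCS = [
    "A driving license application needs proof of age and address. "
    "The test is booked after document verification.",
    "The budget speech lists revenue sources. It also lists expenditure categories.",
]

def answer(query: str) -> str:
    q = tokenize(query)
    # steps 1-3: represent documents, index them, match the query
    best_doc = max(DOCS, key=lambda d: len(q & tokenize(d)))
    # step 4: retrieve the most relevant sentence as the answer
    sentences = [s.strip() + "." for s in best_doc.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q & tokenize(s)))

print(answer("what documents do I need for a driving license?"))
```

The key property this preserves from the real system is that answers are retrieved from stored documents rather than generated from scratch.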

Selection of Retrieval QA:

The Retrieval QA level is appropriate for applications that must provide precise responses or information from a predefined set of documents. The application supports the features of a Retrieval QA system by using OpenAI Embeddings for document representation and FAISS for efficient similarity search. The ChatOpenAI model enhances the user experience by managing user inquiries and providing responses in a conversational fashion, adding to the interactive element.

Machine Learning Models

OpenAI Embeddings

Functionality: OpenAI Embeddings are used for generating vector representations of words in the text, capturing semantic information.

Role in the Application: In the Retrieval QA system, OpenAI Embeddings play a vital role in converting textual information into numerical vectors. These vectors encode the semantic meaning of words and phrases, facilitating similarity comparisons between user queries and document content during the retrieval process.

FAISS (Facebook AI Similarity Search)

Functionality: FAISS[16] is a library for efficient similarity search and clustering of dense vectors.

Role in the Application: FAISS is employed to build an index from the vector
representations generated by OpenAI Embeddings. This index allows for fast and
effective similarity searches, enabling the system to retrieve relevant documents
quickly based on the similarity between the user's query and the content of the
indexed documents. FAISS enhances the efficiency of the retrieval process, which
is crucial for real-time performance.

Figure 2: Training phase

Figure 3: Backend & Frontend Representation

Figure 4: Search Engine Logic

Figure 5: FAISS index

Figure 6: Installation of Libraries

[Pypdf - to extract contents from PDF
Faiss-cpu - content and embedding storage (in memory)
Langchain - training and search logic]

Figure 7 : Working

Figure 8 : Frontend
Figure 9: Working of CIVICINFOBOT

RESULTS AND DISCUSSION:

Qualitative Analysis

Performance:

Strengths

The use of FAISS for similarity search and OpenAI Embeddings for document
representation likely contributes to the system's ability to efficiently retrieve
relevant information.

Weaknesses

The effectiveness of the system heavily depends on the quality of embeddings and the initial training of ChatOpenAI.

Model Selection:

OpenAI Embeddings and FAISS

OpenAI Embeddings are chosen for their ability to capture semantic meaning, crucial for understanding user queries and document content. FAISS is selected for efficient similarity search, which is vital for quickly retrieving relevant documents.

Evaluation Metrics Selection:

The assessment of system performance was aligned with predefined goals, emphasizing accuracy and relevance in response to user queries. Precision, recall, and F1 score were selected as suitable evaluation metrics. The implementation of robust code allowed for the calculation of these metrics, with adjustments to thresholds made to balance precision and recall. Evaluation on a separate validation set provided valuable insights into the system's generalization.
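The three metrics named above can be computed directly from binary relevance labels. The example labels below are invented for illustration; the project's own validation data is not shown here.

```python
# Precision, recall, and F1 from binary labels:
# 1 = relevant answer returned, 0 = not.
def precision_recall_f1(predicted: list[int], actual: list[int]) -> tuple:
    tp = sum(p == a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# hypothetical labels for five queries
p, r, f1 = precision_recall_f1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.67 0.67 0.67
```

Adjusting the retrieval threshold trades these values against each other: a stricter threshold tends to raise precision and lower recall, which is why F1 is used to balance the two.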

Conclusion

The optimization of hyperparameters and the selection of appropriate evaluation metrics contribute significantly to the overall effectiveness of the Retrieval QA system. Future work may involve continuous refinement of hyperparameters, exploration of the hyperparameter space, and potential adjustments to system goals based on evolving requirements.

Limitations

Cost of GPT API:

If the GPT API is used, it could be costly, especially if the application has a high volume of queries. This financial aspect could limit the scalability of the system.

Data Dependence:

The system's effectiveness relies heavily on the quality and relevance of the training data. It may struggle with out-of-distribution queries or topics not well represented in the training set[17].

CONCLUSION:

In essence, the Retrieval QA level in Natural Language Processing (NLP) is designed to provide precise responses to user queries by extracting information from a predefined set of documents. This process involves converting text into numerical representations, creating an index for efficient retrieval, and using advanced machine learning models such as OpenAI Embeddings and FAISS.

OpenAI Embeddings play a vital role in encoding the semantic meaning of words,
while FAISS facilitates fast and effective similarity searches, enhancing the
efficiency of the retrieval process. Together, these components form a powerful
foundation for applications requiring accurate and rapid access to information.

The proposed ChatOpenAI model exemplifies the practical application of these techniques, offering a conversational user experience. This approach showcases the potential of advanced NLP in providing interactive and efficient solutions for real-time user interactions[18].

FUTURE WORK:

In future work, the Retrieval QA system can be strengthened through additional fine-tuning of the ChatOpenAI model on domain-specific data, diversifying the training set to enhance robustness, and conducting further experiments to optimize hyperparameters in OpenAI Embeddings and FAISS. Cost optimization strategies, such as caching or pre-processing, should be explored to reduce the expenses associated with external APIs. Improving the user interaction aspect and addressing system limitations, including biases and specific query challenges, are essential considerations. Incorporating user feedback, evaluating on diverse scenarios, exploring advanced NLP techniques, and scaling for large datasets will collectively contribute to the system's continuous refinement, adaptability, and overall effectiveness.
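One of the cost-optimization ideas mentioned above, caching, can be sketched briefly. Here `call_model` is a hypothetical stand-in for a billable GPT API request, reduced to a call counter for illustration.

```python
# Query-response caching sketch: repeated questions are answered from
# an in-memory cache so the (paid) model API is called only once per
# distinct query. call_model is a hypothetical stand-in, not a real API.
from functools import lru_cache

CALLS = {"n": 0}

def call_model(query: str) -> str:
    """Stand-in for a billable GPT API request."""
    CALLS["n"] += 1
    return f"answer to: {query}"

@lru_cache(maxsize=1024)
def cached_answer(query: str) -> str:
    # identical queries hit the cache instead of the paid API
    return call_model(query)

cached_answer("How do I apply for a driving license?")
cached_answer("How do I apply for a driving license?")  # served from cache
print(CALLS["n"])  # → 1
```

In practice the cache key would first be normalized (lowercasing, stripping punctuation) so that trivially different phrasings of the same question also share a cached answer.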

REFERENCES:

[1] Woods, W., & Kaplan, R. (1977). Lunar rocks in natural English: explorations in natural language question answering. Linguistic structures processing. In Fundamental Studies in Computer Science, 5, 521–569.
[2] Lehnert, W. (1978). The Process of Question Answering: A computer simulation of cognition. Lawrence Erlbaum.
[3] Voorhees, E. (2004). Overview of the TREC Question Answering Track. The Thirteenth Text Retrieval Conference, 2004.
[4] Pasca, M. A. (2001). High-performance, open-domain question answering from large text collections [PhD Thesis]. Southern Methodist University.
[5] Lehnert, W. (1978). The Process of Question Answering: A computer simulation of cognition. Lawrence Erlbaum.
[6] Voorhees, E. (2004). Overview of the TREC Question Answering Track. The Thirteenth Text Retrieval Conference, 2004.
[7] Pasca, M. A. (2001). High-performance, open-domain question answering from large text collections [PhD Thesis]. Southern Methodist University.
[8] Salton, G. (Ed.). (1969). The SMART retrieval system. Englewood Cliffs, NJ: Prentice Hall.
[9] Collobert, R., & Weston, J. (July 2008). 'A unified architecture for natural language processing: Deep neural networks with multitask learning,' in Proc. 25th Int. Conf. Mach. Learn. (pp. 160–167).
[10] Kabir, M. F., Abdullah-Al-Mamun, K., & Huda, M. N. (May 2016). 'Deep learning-based parts of speech tagger for Bengali,' in Proc. 5th Int. Conf. Inform., Electron. Vis. (ICIEV) (pp. 26–29).
[11] Merrouni, Z. A., Frikh, B., & Ouhbi, B. (May 2019). Automatic keyphrase extraction: A survey and trends. Journal of Intelligent Information Systems, 54, 391–424.
[12] Bougouin, A., Boudin, F., & Daille, B. (October 2013). TopicRank: Graph-based topic ranking for keyphrase extraction. In Proceedings of the Int. Joint Conf. Natural Lang. Process. (IJCNLP) (pp. 543–551).
[13] Wan, X., & Xiao, J. (2008). Single document keyphrase extraction using neighborhood knowledge. In Proceedings of the AAAI, 8 (pp. 855–860).
[14] Yousefinaghani, S., Dara, R., Mubareka, S., Papadopoulos, A., & Sharif, S. (2021). An analysis of COVID-19 vaccine sentiments and opinions on Twitter. International Journal of Infectious Diseases, 108, 256–262. doi:10.1016/j.ijid.2021.05.059
[15] Dang, N. C., Moreno-García, M. N., & De la Prieta, F. (2020). Sentiment analysis based on deep learning: A comparative study. Electronics, 9(3), 483. doi:10.3390/electronics9030483
[16] Marcec, R., & Likic, R. (2021). Using Twitter for sentiment analysis towards AstraZeneca/Oxford, Pfizer/BioNTech and Moderna COVID-19 vaccines. Postgraduate Medical Journal [Online first]. doi:10.1136/postgradmedj-2021-140685
[17] Mehedi Shamrat, F. M. J., Chakraborty, S., Imran, M. M., Muna, J. N., Billah, M. M., Das, P., & Rahman, M. O. (2021). Sentiment analysis on Twitter tweets about COVID-19 vaccines using NLP and supervised KNN classification algorithm, 23(1) (pp. 463–470).
[18] Ansari, M. T. J., & Khan, N. A. (2021). Worldwide COVID-19 vaccines sentiment analysis through Twitter content. Electronic Journal of General Medicine, 18(6). doi:10.29333/ejgm/11316
