VIT Chennai
Vandalur - Kelambakkam Road, Chennai - 600 127
Course: SWE1017
Slot: G1
Component:
Team Members:
S Subhitcha (20MIS1020)
Lakshya S (20MIS1103)
LIST OF FIGURES
S. NO FIGURE NAME
FIG 7 Working
FIG 8 Frontend
FIG 9 Result
ABSTRACT
NLP acts as the crucial link between human communication and computational
intelligence, offering a wide array of applications, from language translation to
sentiment analysis and interactive chatbot development. Within this context,
Language Models (LMs) play a pivotal role, with GPT-3.5 showcasing the
capabilities of large-scale models in understanding and generating human-like
text. The project also recognizes the significance of other architectural
advances in the field, aiming to harness innovative models beyond BERT.
INTRODUCTION:
Our project envisions a future where government services[3] in India become more
intuitive, user-centric, and efficient through the seamless integration of advanced
NLP techniques. By exploring the capabilities of language models like GPT-3.5[4],
we aim to streamline user interactions, provide context-aware information, and
enhance overall service delivery. However, achieving this vision requires careful
consideration of data quality, privacy concerns, and the intricate technical
challenges associated with deploying and maintaining sophisticated NLP models.
This introduction sets the stage for an in-depth exploration of how NLP
technologies[5], combined with powerful language models, can elevate the
accessibility and effectiveness of government services. The project represents a
commitment to harnessing cutting-edge artificial intelligence to empower citizens
and facilitate more intuitive interactions with public services in the digital age.
LITERATURE REVIEW:
The first paper reviewed delves into the realm of open-domain question answering in artificial
intelligence, fueled by advancements in information retrieval, information
extraction, and natural language processing. It introduces a three-component
architecture that integrates NLP techniques, external resources like WordNet, and
web information retrieval to enhance the accuracy of question answering. This
architecture serves as the cornerstone, strategically bridging user queries and
scattered information across extensive text repositories.
At the core, the answer extraction component involves named entity recognition,
shallow parsing, answer ranking, and selection. Empirical ranking formulas
consider various features, leading to rigorous evaluation and ranking of candidate
answers. Experimental evaluations using the TREC-8 question-answering track[8]
as a benchmark provide valuable insights, highlighting the system's effectiveness
and suggesting areas for improvement. The systematic integration of NLP
techniques, external resources, and web retrieval shows promise in delivering
precise answers, reflecting the current state of the art in NLP and paving the way
for future research.
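The answer-ranking step described above can be illustrated with a small sketch. The features and weights below are purely illustrative assumptions, not the paper's actual empirical ranking formula; they only show the pattern of scoring candidate answers by a weighted sum of features and selecting the best.

```python
# Toy candidate-answer ranking by a weighted feature score. The feature
# names and weights are illustrative assumptions, not the paper's formula.

def rank_answers(candidates, weights):
    """Score each candidate by a weighted sum of its features and
    return the candidates sorted best-first."""
    def score(cand):
        return sum(weights[f] * cand["features"].get(f, 0.0) for f in weights)
    return sorted(candidates, key=score, reverse=True)

weights = {
    "entity_type_match": 2.0,   # answer type matches the question's expected type
    "keyword_overlap": 1.0,     # overlap with the query's keywords
    "proximity": 0.5,           # closeness of query terms in the source passage
}

candidates = [
    {"text": "1969", "features": {"entity_type_match": 1, "keyword_overlap": 0.6, "proximity": 0.8}},
    {"text": "Apollo", "features": {"entity_type_match": 0, "keyword_overlap": 0.9, "proximity": 0.9}},
]

best = rank_answers(candidates, weights)[0]
print(best["text"])  # the date candidate wins on its type match
```

In a real system the named entity recognizer and shallow parser would supply these feature values; weighting answer-type agreement heavily, as here, reflects the intuition that a "when" question should prefer date-typed entities.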
In the rapidly evolving landscape of natural language processing (NLP) and social
media analytics, the significance of efficient keyword extraction cannot be
overstated. This article delves into the critical realm of social media, particularly
Twitter, where the sheer volume of information makes manual keyword mining
impractical. The increasing prevalence of inappropriate information on Twitter
underscores the need for automated solutions to enhance content extraction,
search capabilities, decision-making processes, and various NLP tasks, including
text classification, sentiment analysis, and named entity recognition.
Acknowledging the challenges posed by the vast amount of data generated daily
on Twitter, the article introduces the Semkey-BERT model[10] as a powerful
solution. This model harnesses the capabilities of deep learning, specifically
BERT and sentence transformation, to extract meaningful content from Twitter.
The methodology involves a systematic process, starting with information
collection, preprocessing for clean data, and culminating in automated keyword
extraction and scoring. The Semkey-BERT model's effectiveness is validated
through comparisons with established unsupervised models, demonstrating its
prowess in improving search results and facilitating various NLP activities.
As the digital landscape continues to evolve, the article navigates the complexities
of automated keyword extraction, presenting a solution that not only addresses the
challenges posed by excessive data but also opens avenues for refining search
functionalities and advancing natural language processing tasks.
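The similarity-based keyword scoring at the heart of such models can be sketched as follows. This is not the Semkey-BERT model itself: a BERT-style encoder would supply the embeddings in practice, and a simple bag-of-words vector stands in for it here so the idea of ranking candidates by their similarity to the whole document stays visible.

```python
import math
from collections import Counter

# Stand-in "embedding": a bag-of-words count vector. A real system would
# use a BERT sentence encoder here instead.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score_keywords(document, candidates):
    """Rank candidate keywords by their similarity to the whole document."""
    doc_vec = embed(document)
    scored = [(cand, cosine(embed(cand), doc_vec)) for cand in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)

doc = "deep learning models extract keywords from twitter posts for search"
ranked = score_keywords(doc, ["keyword extraction", "cooking recipes", "twitter search"])
print(ranked[0][0])
```

Note that the bag-of-words stand-in only rewards exact word matches; the point of using BERT embeddings, as the article argues, is precisely that semantically related candidates score well even without exact overlap.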
In another study, sentiment analysis[12] forms the core of the method, with a decision tree algorithm
chosen for its simplicity, interpretability, and efficiency in classifying tweets into
sentiment categories such as positive, negative, or neutral. The final step focuses
on evaluating the effectiveness of the decision tree algorithm using metrics like
accuracy, precision, recall, and F1 score. This experiment aims to measure how
well the model understands public opinion on the COVID-19 vaccine from
Twitter, providing insights into the distribution of opinions and contributing to a
better understanding of public sentiment.
In essence, this method presents a systematic approach to analyzing public
opinions on COVID-19 vaccines expressed on Twitter, offering valuable insights
for researchers seeking to comprehend and monitor public sentiment in the realm
of social media discussions.
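The evaluation step relies on standard classification metrics, which can be computed directly. The labels and predictions below are toy values, not data from the tweet study; the sketch only shows how accuracy, precision, recall, and F1 are derived for the "positive" class.

```python
# Computing accuracy, precision, recall, and F1 for one class from
# toy predictions. The labels here are illustrative, not real tweet data.

def evaluate(y_true, y_pred, positive="positive"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = ["positive", "negative", "neutral", "positive", "negative", "positive"]
y_pred = ["positive", "negative", "positive", "positive", "neutral", "negative"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

In a multi-class setting like positive/negative/neutral, precision, recall, and F1 are computed per class this way and then averaged.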
The Budget Speech 2023-24 dataset encapsulates the financial address by Finance
Minister Nirmala Sitharaman on February 1, 2023. It delves into economic
priorities, initiatives, and achievements for the fiscal year.
Content Overview:
Metadata:
The dataset highlights key amendments proposed in the Finance Bill 2023 by
Finance Minister Smt. Nirmala Sitharaman during the Union Budget presentation.
It covers crucial aspects related to tax rates, deductions, exemptions, benefits,
business, capital gains, trusts, assessments, TDS/TCS, penalties, and
miscellaneous provisions.
Highlight Details:
- Information on changes to tax rates, new fiscal policies, and their anticipated
impact on the economy is captured.
- The dataset provides a concise summary of the most significant aspects of the
Financial Bill for quick reference.
Metadata:
The "How to Apply for Driving License" dataset is designed to guide individuals
through the process of obtaining a driving license. This dataset includes:
Application Procedures:
- Any recent updates or changes to the application process are documented for
user awareness.
4. Vector DataStores
The textual content within each dataset is transformed into numerical vectors
using advanced natural language processing techniques. These vectors are then
indexed using Faiss, a high-performance similarity search library, to enable
efficient data retrieval.
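The store-then-search pattern described above can be sketched with a minimal stand-in. In the actual system, OpenAI Embeddings would produce the vectors and FAISS (for example a flat L2 index) would index them; the pure-Python class below, with toy 3-dimensional vectors and made-up document IDs, only illustrates the idea of indexing vectors and retrieving nearest neighbors.

```python
import math

# Pure-Python stand-in for the vector datastore. Real vectors would come
# from OpenAI Embeddings and the index would be a FAISS index.
class TinyVectorStore:
    def __init__(self):
        self.vectors = []   # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self.vectors.append((doc_id, vector))

    def search(self, query_vec, k=1):
        """Return the k nearest documents by Euclidean (L2) distance."""
        def dist(v):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(query_vec, v)))
        return sorted(self.vectors, key=lambda item: dist(item[1]))[:k]

store = TinyVectorStore()
store.add("budget_speech", [0.9, 0.1, 0.0])
store.add("finance_bill", [0.8, 0.3, 0.1])
store.add("driving_license", [0.0, 0.2, 0.9])

hits = store.search([0.1, 0.1, 0.8], k=1)
print(hits[0][0])
```

FAISS performs the same nearest-neighbor search, but over thousands of high-dimensional vectors with optimized index structures rather than this linear scan.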
https://www.indiabudget.gov.in/doc/budget_speech.pdf
https://incometaxindia.gov.in/news/finance-bill-2023-highlights.pdf
https://www.bankbazaar.com/driving-licence.html#:~:text=Step%201%3A%20Visit%20the%20official,have%20to%20select%20your%20state
IMPLEMENTATION:
QA Level of Retrieval
The main goal of the Retrieval QA level is to respond to user queries by retrieving
pre-existing information from a knowledge base or collection of documents.
Instead of generating answers from scratch, the system draws on its stored
knowledge to return pertinent responses.
Processing User Queries: User queries are processed, and their meaning is
deciphered in relation to the documents that have been indexed. Based on the
similarity between the query and the stored document representations, the system
finds the documents that are most relevant.
Answer Retrieval: The user is shown the pertinent data or responses that have
been extracted from the selected documents.
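The query-to-answer flow above can be sketched end to end. The documents and the word-overlap scoring below are illustrative assumptions; the real system matches the query against documents by embedding similarity (OpenAI Embeddings with a FAISS index) rather than shared words.

```python
# Minimal sketch of the Retrieval QA flow: match the query against
# stored documents, pick the most similar one, and return its content
# as the answer. The toy corpus and word-overlap score are assumptions.

DOCS = {
    "license_steps": "Visit the official transport site, select your state, and fill the form.",
    "budget_summary": "The budget speech outlines tax rates, deductions, and fiscal policies.",
}

def tokens(text):
    return set(text.lower().replace(",", " ").replace(".", " ").split())

def overlap(query, text):
    # Crude relevance score: number of shared words.
    return len(tokens(query) & tokens(text))

def retrieve_answer(query):
    best_id = max(DOCS, key=lambda d: overlap(query, DOCS[d]))
    return DOCS[best_id]

print(retrieve_answer("how do I fill the form for a license in my state"))
```

Swapping the overlap score for cosine similarity between embedding vectors turns this sketch into the retrieval behavior the system actually uses.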
OpenAI Embeddings and FAISS
Role in the Application: FAISS is employed to build an index from the vector
representations generated by OpenAI Embeddings. This index allows for fast and
effective similarity searches, enabling the system to retrieve relevant documents
quickly based on the similarity between the user's query and the content of the
indexed documents. FAISS enhances the efficiency of the retrieval process, which
is crucial for real-time performance.
Figure 7 : Working
Figure 8 : Frontend
Figure 9: Working of CIVICINFOBOT
Qualitative Analysis
Performance:
Strengths
The use of FAISS for similarity search and OpenAI Embeddings for document
representation likely contributes to the system's ability to efficiently retrieve
relevant information.
Weaknesses
The effectiveness of the system heavily depends on the quality of embeddings and
the initial training of ChatOpenAI.
Model Selection:
Limitations
If the GPT API is used, it could be costly, especially for applications with a
high volume of queries.
This cost could limit the scalability of the system.
Data Dependence:
The system's effectiveness relies heavily on the quality and relevance of the
training data.
It may struggle with out-of-distribution queries or topics not well-represented
in the training set[17].
CONCLUSION:
OpenAI Embeddings play a vital role in encoding the semantic meaning of words,
while FAISS facilitates fast and effective similarity searches, enhancing the
efficiency of the retrieval process. Together, these components form a powerful
foundation for applications requiring accurate and rapid access to information.
FUTURE WORK:
REFERENCES:
[1] Woods, W., & Kaplan, R. (1977). Lunar rocks in natural English: explorations in natural
language question answering. Linguistic structures processing. In Fundamental
Studies in Computer Science, 5, 521–569.
domain question answering from large text collections [PhD Thesis]. Southern
Methodist University.
[8] Salton, G. (Ed.). (1969). The SMART retrieval system. Englewood Cliffs,
NJ: Prentice Hall.
[9] Collobert, R., & Weston, J. (July 2008). ‘A unified architecture for natural language
processing: Deep neural networks with multitask learning,’ in Proc. 25th Int. Conf.
Mach. Learn. (pp. 160–167).
[10] Kabir, M. F., Abdullah-Al-Mamun, K., & Huda, M. N. (May 2016). ‘Deep
learning-based parts of speech tagger for Bengali,’ in Proc. 5th Int. Conf. Inform.,
Electron. Vis. (ICIEV) (pp. 26–29).
[11] Merrouni, Z. A., Frikh, B., & Ouhbi, B. (May 2019). Automatic keyphrase
extraction: A survey and trends. Journal of Intelligent Information Systems, 54, 391–
424.
[12] Bougouin, A., Boudin, F., & Daille, B. (October 2013). TopicRank: Graph-based
topic ranking for keyphrase extraction. In Proceedings of the Int. Joint Conf. Natural
Lang. Process. (IJCNLP) (pp. 543–551).
[13] Wan, X., & Xiao, J. (2008). Single document keyphrase extraction using
neighborhood knowledge. In Proceedings of the AAAI, 8(July) (pp. 855–860).
[14] Yousefinaghani, S., Dara, R., Mubareka, S., Papadopoulos, A., & Sharif,