
Anatomy of Long-form Content to KBQA System and QA Generator

Abstract

The focus of this paper is to enhance the relevance, accuracy, and brevity of responses derived from long-form content when answering predefined or automatically generated questions. This is an element of the field of explainable AI (XAI), which has faced challenges in finding algorithms that balance predictive precision with explainability and transparency. For instance, some NLP learning techniques, such as neural networks and deep learning algorithms, have prioritized enhancing prediction accuracy, while Bayesian belief nets and decision trees have demonstrated superiority in promoting transparency and explainability. Additionally, pre-trained language models, such as BERT and GPT, require fine-tuning on task-specific datasets to produce a final model.
Introduction

Question answering (QA) systems are a fundamental area of natural language processing (NLP) research. These systems are designed to understand the meaning and context of a question, and then use a combination of techniques such as information retrieval, machine reading comprehension, and knowledge base retrieval to generate a complete and accurate answer. The ability to answer questions in a natural and human-like manner is a key goal of NLP research, as it has the potential to greatly enhance the ability of computers to understand and respond to natural language.

Over the past few years, there has been significant progress in the development of QA systems, thanks to advancements in deep learning and pre-training techniques. These techniques have allowed for the development of QA systems that can handle the complexity and variability of natural language and provide accurate and detailed answers to a wide range of questions.

In this paper, we will review the state of the art in QA systems and examine the current challenges and opportunities in this field. We will also discuss the various real-world applications of QA systems, such as customer service, knowledge management, and e-commerce, and their potential impact on different industries. Overall, this paper aims to provide an overview of the current state of QA systems in NLP, and to highlight the potential of this technology to revolutionize the way computers understand and respond to natural language.

One of the most important advancements in QA systems is the use of pre-training methods such as BERT (Bidirectional Encoder Representations from Transformers) and GPT-3 (Generative Pre-trained Transformer 3), which have set new benchmarks in the field, allowing models to understand the meaning and context of a question more accurately. These models are trained on large amounts of unannotated data and fine-tuned on specific tasks, making them more robust and versatile. Another important advancement is the use of machine reading comprehension (MRC) models, which are able to understand the meaning of a text and provide answers to questions. These models use attention mechanisms to focus on the relevant parts of the text when producing an answer. Additionally, the integration of external knowledge sources such as knowledge graphs, Wikipedia, and Common Crawl has also greatly improved the performance of QA systems. These external sources provide a wealth of structured and unstructured data that can be used to generate more accurate and detailed answers. In recent years, reinforcement learning has also been explored in QA systems to improve the efficiency of the search process and the diversity of the answers provided, allowing models to learn from interactions with the environment and improve over time.

Question-answer pair generation from text is an important task in the field of natural language processing (NLP) because it can be used to create a wide range of applications that require understanding and generating human language. Some examples of the importance of question-answer pair generation from text include:

- Educational applications: Automatically generating question-answer pairs from a text can be used to create quizzes, flashcards, and other educational materials that can help students learn and retain information more effectively.
- Search engines: Generating question-answer pairs from text can help improve search engines by providing users with more specific and accurate answers to their queries.
- Virtual assistants: Generating question-answer pairs can enable virtual assistants to understand and respond to more complex questions and requests.
- Chatbots: Generating question-answer pairs can be used to improve the conversation flow of chatbots and enable them to understand and respond to more complex questions and requests.
- Summarization: Generating question-answer pairs can be used to summarize the main points of a text, making it more accessible and easier to understand.
- Information Retrieval: Generating question-answer pairs can be used to retrieve relevant information from a large corpus of text, making retrieval more efficient and effective.

In general, question-answer pair generation is a key task in NLP that can be used to improve the ability of machines to understand and generate human language, making them more useful and user-friendly.

Literature survey

1. Generative Long-form Question Answering: Relevance, Faithfulness and Succinctness [1]: This study aims to improve the relevance, faithfulness, and succinctness of long-form question answering (LFQA). The authors created the CAiRE-COVID framework with a Document Retriever, a Relevant Snippet Selector, and a Query-focused Multi-Document Summarizer. In response to a user inquiry, the algorithm first chooses the most pertinent high-coverage documents from the CORD-19 dataset. It then uses question-answering (QA) models in the Snippet Selector module to highlight the answers or evidence (text spans) for the query within the pertinent paragraphs. Additionally, they propose a query-focused Multi-Document Summarizer that produces abstractive and extractive replies linked to the question from the various retrieved answer-related paragraph fragments, in order to effectively communicate COVID-19-related information to the user. By optimizing pre-trained language models for QA and summarization, they make the most of these models' generalization abilities and offer their own strategies adapted to the COVID-19 task.

2. Web Pages Credibility Scores for Improving Accuracy of Answers in Web-Based Question Answering Systems [2]: This study presented a credibility assessment algorithm that scores credibility according to seven categories (correctness, authority, currency, professionalism, popularity, impartiality, and quality), each of which is composed of a number of different criteria. To rate answers based on the credibility of Web pages, a credibility assessment module is implemented on top of an existing QA system: the system rates answers based on the legitimacy of the Web pages from which they were collected. The research carried out thorough quantitative checks on 211 factoid questions collected from TREC QA data. According to the results of the study, credibility factors including correctness, professionalism,
objectivity, and quality greatly increased the accuracy of answers. Through this study, experts and researchers should be able to use the Web credibility assessment model to increase the correctness of information systems.

3. An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks [3]: The aim of the paper is to answer factual questions using a large collection of documents on diversified topics. The authors replace current approaches, which frequently rely either on parametric models that store knowledge in their parameters or on retrieval-augmented models that have access to external knowledge sources, with an Efficient Memory-Augmented Transformer (EMAT), which encodes external knowledge into a key-value memory and takes advantage of fast maximum inner product search for memory querying. The model integrates the best aspects of both strategies, retaining characteristics comparable to parametric and retrieval-augmented models in terms of computational efficiency and predictive accuracy. The authors propose pre-training tasks that enable EMAT to learn an implicit strategy for integrating numerous memory slots into the transformer and to encode useful key-value representations. They also conducted studies on a variety of knowledge-intensive tasks, including dialogue and question-answering datasets, which demonstrate that simply augmenting parametric models (T5-base) with this approach yields more accurate answers while maintaining high throughput. EMAT runs faster overall and generates more accurate results on WoW and ELI5 compared to retrieval-augmented models.

4. SONDHAN: A Comparative Study of Two Proficiency Language Bangla-English on Question-Answer Using Attention Mechanism [4]: In this study, the authors compare question-answer domains based on international general knowledge, Bangladeshi general knowledge, and science and technology in both Bangla and English. A sequence-to-sequence LSTM-based question-and-answer system with an attention mechanism is proposed, trained on a total of 10,000 data points, with accuracy rates of 99.91 and 99.48 percent for the Bangla and English data, respectively. Overall, the best-performing Q&A model is the LSTM, which works well for both Bangla and English.

5. Improving Neural Question Answering with Retrieval and Generation [5]: Text-based question answering has advanced thanks to the use of neural networks, the creation of large training datasets, and unsupervised pre-training. However, large amounts of hand-annotated data are still needed, it can be difficult to use the provided knowledge correctly, and costly computation is still necessary at run time. To address these three problems with natural language generation and information retrieval approaches, this study removes the reading comprehension (RC) task's need for "in-domain hand-annotated training data":
1. RC capabilities can be induced using the following technique, without the need for hand-annotated RC instances.
   a. RAG-Sequence model: generates the entire sequence prior to marginalization using the same retrieved document. Technically, it uses a top-K approximation to obtain the seq2seq probability p(y|x), treating the retrieved document as a single latent variable that is marginalized out.
   b. The generator is BART, and the retriever is DPR. The complete procedure is as follows: RAG models combine an end-to-end fine-tuned seq2seq model (the generator) with a pre-trained retriever (query encoder + document index). The top-K documents c are located using maximum inner product search for query x. For the final prediction y, the model marginalizes over the seq2seq predictions given the various documents, treating c as a latent variable.
2. Examining open-domain QA (ODQA) and considering how to create models that best utilize the information in a Wikipedia text corpus.
   a. The study shows that retrieval augmentation significantly enhances large pretrained language models' factual predictions in unsupervised settings.
   b. The strength and adaptability of this kind of retrieval-augmented generator model were then demonstrated across a variety of knowledge-intensive NLP applications, including ODQA.

Building on these observations, the paper offers a class of ODQA models based on the idea that knowledge can be represented as question-answer pairs, and it shows how, using question generation, these models can produce predictions with high calibration, quick inference, and accuracy.

6. CQACD: A Concept Question-Answering System for Intelligent Tutoring Using a Domain Ontology With Rich Semantics [6]: In this study, a Concept Question Answering system applied to the Computer Domain (CQACD) for intelligent tutoring is proposed. This system is a dialogue-based Intelligent Tutoring System (ITS) that allows the tutor and student, using mixed initiative and natural language, to ask each other questions concerning the basic computer knowledge in the Computer Basics course. CQACD is based on constructivist principles and encourages the learner to construct knowledge rather than merely receive it. It has the following characteristics: (a) the system employs a domain ontology with rich semantic relationships to model the basic computer knowledge and build a concept-centric knowledge model; (b) it uses a limited set of 80 input templates with description logics to capture the intention of questions posed by students; (c) a textual entailment algorithm with semantic
technologies is proposed to match the input template and assess the student's contribution, improving the flexibility of the system; and (d) an ontology-driven dialogue management mechanism is proposed, which can quickly form the conversational content and conversational sequence.

7. Hindsight: Posterior-guided training of retrievers for improved open-ended generation [7]: Many retrievers may not find relevant passages even among the top 10, and the generator may not learn a preference to ground its generated output in them. This paper aims to provide all possible relevant answers for open-ended generation tasks. The authors utilize a second, guide retriever that is permitted to train on the target output and "in retrospect" retrieve pertinent passages. They train the standard retriever, the generator, and the guide retriever together by maximizing the evidence lower bound (ELBO) in expectation over Q, where the guide retriever models the posterior distribution Q of passages given the input and the intended output. With posterior-guided training, the retriever finds passages from the Wizard of Wikipedia dataset that are more relevant in the top 10 (23% relative improvement), the generator's responses are more grounded in the retrieved passage (19% relative improvement), and the end-to-end system generates better overall output (6.4% relative improvement) for informative conversations.

8. Comparative Analysis of Information Retrieval Models on Quran Dataset in Cross-Language Information Retrieval Systems [8]: Even though English is a universal language used for communication, many people still struggle to read, write, understand, or speak it. On the other hand, native English speakers may find it difficult to understand the vast amount of information available on the World Wide Web in many other languages. Cross-Language Information Retrieval (CLIR) systems, which deal with document retrieval tasks across many languages, are suggested as a way to overcome these obstacles. The performance assessment of several Information Retrieval (IR) models in a CLIR system using the Quran dataset is the main objective of this work. The work also looked into query length and query expansion models for efficient retrieval. The findings indicate that varying query lengths affect how efficiently the retrieval techniques perform.

9. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [9]: It has been demonstrated that large pre-trained language models can be adapted to produce state-of-the-art outcomes on downstream NLP tasks by storing factual knowledge in their parameters. Their capacity to accurately access and manipulate knowledge, however, is still constrained, which causes them to perform less well than task-specific architectures on knowledge-intensive tasks. Furthermore, providing provenance for their decisions and updating their knowledge of the outside world are still open research issues. The use of trained models with differentiable access to explicit non-parametric memory had only been
studied thus far for downstream extractive tasks.

In this paper, retrieval-augmented generation (RAG) models combine pre-trained parametric and non-parametric memory for language generation. The research develops a general-purpose recipe for fine-tuning RAG models. It presents RAG models in which the non-parametric memory is a dense vector index of Wikipedia accessed by a pre-trained neural retriever, and the parametric memory is a pre-trained seq2seq model. The authors then contrast two RAG formulations: one conditions on the same retrieved passages across the whole generated sequence, while the other can use different passages per token. The models are then fine-tuned, evaluated, and established as the state of the art on three open-domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures.

Problem Statement & Use cases

Question answering (QA) systems are an important area of natural language processing (NLP) research because they provide a way to test and evaluate the ability of computers to understand and generate natural language. These systems are designed to understand the meaning and context of a question, and then use a combination of techniques such as information retrieval, machine reading comprehension, and knowledge base retrieval to generate a complete and accurate answer.

In real-world applications, question answering systems have a wide range of uses. They can be used to improve customer service by providing accurate and detailed answers to customer inquiries, and to automate knowledge management tasks by providing quick and easy access to information. Additionally, they are widely used in e-commerce, providing answers to customer queries that help customers choose the right product or service.

In healthcare, question answering systems can assist doctors and nurses in providing accurate and up-to-date information to patients, as well as help researchers and pharmaceutical companies identify new drug targets and treatments.

In education, question answering systems can help students find answers to their questions and assist teachers in creating lesson plans and quizzes.

In research, question answering systems can help scientists and researchers find relevant information and assist in the discovery of new knowledge.

Overall, question answering systems are an important area of NLP research because of their potential to greatly enhance the ability of computers to understand and respond to natural language, and they can be applied in many fields to improve efficiency and provide better service.

Proposed QA System Architecture

The four components of the whole QA system are:

1. question processing
2. document retrieval
3. passage retrieval
4. answer extraction
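The flow through these four components can be sketched as follows. This is a minimal, illustrative Python sketch, not the system's actual implementation: it uses a toy stop-word list, bag-of-words vectors with inner-product scoring as a stand-in for the learned question and document encoders, and the best-matching passage as a stand-in for a trained span-extraction model. All function names and the example corpus are our own, hypothetical choices.

```python
import re
from collections import Counter

# Toy stop-word list; a real system would use a fuller list or a
# part-of-speech filter, as described in the Question Processing section.
STOP_WORDS = {"what", "is", "the", "a", "an", "of", "in", "to", "how", "does"}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def process_question(question):
    """1. Question processing: drop stop words, keep content terms as the query."""
    return [t for t in tokenize(question) if t not in STOP_WORDS]

def to_vector(tokens):
    """Stand-in for a learned encoder: a sparse bag-of-words vector."""
    return Counter(tokens)

def inner_product(v1, v2):
    # Counter returns 0 for missing terms, so this is a sparse dot product.
    return sum(count * v2[term] for term, count in v1.items())

def retrieve_documents(query, corpus, k=2):
    """2. Document retrieval: rank whole documents against the query vector."""
    qv = to_vector(query)
    scored = [(inner_product(qv, to_vector(tokenize(d))), d) for d in corpus]
    scored.sort(key=lambda sd: sd[0], reverse=True)
    return [d for score, d in scored if score > 0][:k]

def retrieve_passages(query, documents, k=2):
    """3. Passage retrieval: split documents into sentence-level passages, rank them."""
    qv = to_vector(query)
    passages = [p.strip() for d in documents for p in d.split(".") if p.strip()]
    scored = [(inner_product(qv, to_vector(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for score, p in scored if score > 0][:k]

def extract_answer(query, passages):
    """4. Answer extraction: a trained reader would return a span offset and score;
    here the top-ranked passage stands in for the extracted answer."""
    return passages[0] if passages else None

def answer(question, corpus):
    query = process_question(question)         # question processing
    docs = retrieve_documents(query, corpus)   # document retrieval
    passages = retrieve_passages(query, docs)  # passage retrieval
    return extract_answer(query, passages)     # answer extraction

corpus = [
    "BERT is a pre-trained language model. It is based on transformers.",
    "Paris is the capital of France. The city lies on the Seine.",
]
print(answer("What is the capital of France?", corpus))
# -> Paris is the capital of France
```

Replacing `to_vector` with a neural sentence encoder and `extract_answer` with an MRC reader that returns span offsets and scores would turn this skeleton into the pipeline described above, without changing the overall control flow.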
Fig 2.0: Architecture diagram of the proposed system

Question Processing

The process of question processing turns a question into a search query. Usually, stop words and words with particular parts of speech are eliminated. Another technique is to turn the question into a vector and use it as the query, since deep learning technology has recently advanced to the point where it can generate a vector that effectively expresses the meaning of a sentence.

Document Retrieval

Using the given query, document retrieval finds the required documents. Since the answer will be extracted from these documents during subsequent processing, it is necessary to search as many of the documents as possible for the solution. Deep learning technology enables us to turn documents into vectors in a manner similar to question processing.

Passage Retrieval

The documents are broken up into smaller parts (passages), such as sentences and paragraphs, and the passages most likely to contain the solution are chosen. If the document is brief, this step is not essential, but if it is lengthy, passage selection works well because we are unsure of the exact location where the answer occurs. Selecting passages also has the advantage of accelerating the system, as the subsequent answer extraction procedure typically takes a long time to finish.

Answer Extraction

Answer extraction takes the passages and extracts the answers. The question and passage are given as input, and the answer extraction model produces the answer offset along with a score. After ranking the answers according to their scores, the one with the highest score is presented to the user as the final response.

Dataset, Domain and Scope