
The reason or goal for this work

The major goal of this work is to help users get concise, cohesive, and coherent answers to complex
questions (non-factoid questions). The problem that this research work aims to solve is clearly captured in
the passages below. The major question that informs this work is, how can we provide users with a single,
concise, cohesive, and coherent answer to their complex questions (non-factoid questions)? This work
seeks to improve on the automatic text summarization (ATS) architecture and its results, and thereafter to bring the improved ATS into the non-factoid QAS architecture to generate a single, concise, cohesive, and coherent answer to users' complex questions (non-factoid questions). The aim and objectives of this study, together with a tentative proposed methodology, are given further below.

“The limitation of the analyzed studies is related to how they provide the answer. The majority of studies
focus on selecting a few passages from different documents and ranking them according to their
usefulness to answer a question. However, it is common for non-factoid questions to have several
restrictions, narrowing the search space down to a specific answer. For instance, for the question “How
should I treat measles in a 12-year-old boy?” the ideal passage to be used as an answer should cover
“treatment”, “measles”, “12-year-old” and “boy”, which is very unlikely and there may not be a ready-
made passage in the knowledge base containing all the information needed. In this case, the ideal system
must search for different information pieces in different documents and merge them to compose a single
answer. However, this is challenging and still an open research problem.”

“Automatic generation of answers based on multiple passages is a critical issue for developing full end-
to-end non-factoid question-answer systems. The problem emerges from the fact that the automatic
generation of coherent and cohesive text – especially for long passages – is still an open research question
(Bau et al., 2020). Broadly speaking, coherence and cohesion refer to how a text is organized so that it
can hold together. In a coherent answer, concepts are connected meaningfully and logically by using
grammatical and lexical cohesive devices.”

“The future directions in the non-factoid QA should concern methods that generate natural language and
use several information sources to compose complex answers instead of using a simple extracted
sentence”

Cortes, E. G., Woloszyn, V., Barone, D., Möller, S., & Vieira, R. (2022). A systematic review of
question answering systems for non-factoid questions. Journal of Intelligent Information Systems,
58(3), 453–480. https://doi.org/10.1007/s10844-021-00655-8
https://drive.google.com/file/d/1v_iuRQk6AJgw1txmZq4QQevBmr6GM2mm/view?usp=sharing

Aim of the Study

To innovate in the field of deep learning-based text analysis by constructing and thoroughly assessing hybrid deep learning systems for document summarization and non-factoid question answering. Through the creation of novel deep learning architectures and proofs of concept, this research aims to enhance the conciseness, functionality, and accuracy of automated systems in interpreting multiple documents and answering non-factoid questions.

Statement of the Problem

The rapid growth of textual content on the internet and the increasing demand for efficient and accurate
information retrieval systems have spurred interest in non-factoid question-answering (QA) systems.
These systems aim to provide comprehensive, contextually accurate, and concise responses to complex
questions, synthesizing information from a variety of sources (Nazari & Mahdavi, 2019; Widyassari et
al., 2022). Despite advancements in natural language processing and question-answering techniques,
crafting coherent and comprehensive answers to non-factoid questions remains an open research
challenge (Breja & Jain, 2022; Cortes et al., 2022).

Existing non-factoid QA models tend to focus primarily on the selection and ranking of passages based
on their relevance to the query. As a result, they often fail to adequately address the inherent
complexities of non-factoid questions, which frequently require integrating information from
diverse sources (Cortes et al., 2022). Moreover, existing automatic text summarization (ATS) approaches
have achieved only limited success in generating answers from multiple passages due to the presence of
redundant information and a lack of effective techniques to integrate various information snippets
cohesively (Sharma & Sharma, 2022; Wang et al., 2022; Yadav et al., 2022).

Within the scope of ATS, abstractive summarization of long documents remains a relatively
underexplored area when compared to its counterpart involving shorter documents. This disparity
becomes more pronounced in cases where very long documents, such as scientific articles, need to be
summarized (Alomari et al., 2022). Recent advancements in massive pre-trained language models
(PTLMs) and the availability of new long-document datasets have begun to steer research focus towards
summarizing lengthy documents containing thousands of words (Alomari et al., 2022).

Furthermore, the abstractive type of ATS, which necessitates a semantic understanding of the text and
real-world knowledge, often produces more meaningful summaries akin to those crafted by humans
(Suleiman & Awajan, 2020; Wazery et al., 2022). However, it is also associated with higher computational
costs and has the propensity to generate summaries riddled with factual errors, redundant information,
and significant information loss (Alomari et al., 2022).

In response to these challenges, the present work proposes an innovative approach at the intersection of
non-factoid QA systems and ATS. Instead of the traditional answer extraction and answer ranking in non-
factoid QA systems, this approach employs ATS to provide the final answer to the user. The intended
result is a comprehensive, coherent, and non-redundant summary that effectively answers the user's
question. This is achieved by designing a novel deep learning-based framework that integrates extractive,
abstractive, and compressive-based methods (Lewis et al., 2020; See et al., 2017). Such an approach not
only addresses a significant gap in the current research landscape but also signifies a promising avenue
towards addressing the complexities of non-factoid question answering.

Objectives

Objective 1: To design and implement a hybrid deep learning system for efficient document
summarization.

Objective 2: To assess the performance and accuracy of the developed hybrid deep learning document
summarization system through suitable evaluation techniques.

Objective 3: To design and develop a novel deep learning-based system for non-factoid question
answering, culminating in the creation of a comprehensive proof of concept.

Objective 4: To systematically evaluate the functionality, accuracy, and efficiency of the developed non-
factoid question-answering system using the developed proof of concept.

Sample Methodology Breakdown

Objective 1: To design and implement a hybrid deep learning system for efficient document
summarization

Methodology 1

- Design of the Hybrid Summarization Architecture

- The model type to be used

- Given the complexity of the problem, we will utilize variations of the Transformer model: a fine-tuned BERT for the extractive part, GPT for the abstractive part, and, given its ability to handle long sequences effectively, Longformer for the compressive part.

- The design specifics

- The integration of the extractive, abstractive, and compressive techniques into a single model will be handled through a multi-step architecture that feeds the output of one stage into the next.

- The extractive part will be implemented using BERT, which will be fine-tuned to
assign relevance scores to sentences in the document. The sentences with the
highest scores will be selected as informative.

- Load a pre-trained BERT model using the HuggingFace Transformers library.

- Create a sentence scoring model by adding a dense layer on top of the BERT model.

- The dense layer will output a score for each sentence in the input
document.

- Fine-tune the model on the dataset, where the target is to assign higher
scores to sentences that are part of the human-written summary.

- Use the model to score the sentences in a new document. Select the top-
scoring sentences as the extractive summary.

- The abstractive part will involve using GPT, which will generate a new,
abstractive summary from the informative sentences selected by BERT.
- Load a pre-trained GPT model using the HuggingFace Transformers
library.

- Fine-tune the GPT model on the extractive summaries. The target is the
human-written summary.

- Use the fine-tuned GPT model to generate a summary for a new document by feeding it the extractive summary.

- The compressive part will involve using Longformer, which can handle long
sequences. Longformer will be used to condense the selected sentences without
losing key information.

- Load a pre-trained Longformer model using the HuggingFace Transformers library.

- Fine-tune the Longformer model on the dataset, with the target being the
human-written summary.

- Use the fine-tuned Longformer to generate a compressed summary for a new document.

- The entire process forms a pipeline, with each step relying on the output from the previous step; a minimal code sketch of this pipeline is given below.
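The pipeline described above can be illustrated with the following minimal sketch in Python. It assumes the publicly available bert-base-uncased, gpt2, and allenai/led-base-16384 checkpoints as stand-ins for the fine-tuned extractive, abstractive, and compressive models (the LED checkpoint is a Longformer encoder-decoder); the linear scoring head, sentence counts, and generation settings are illustrative placeholders rather than the final design.

# Minimal sketch of the extractive -> abstractive -> compressive pipeline.
# Checkpoints and hyperparameters are placeholders; each stage would be
# fine-tuned as described above before being used in practice.
import torch
from transformers import (BertTokenizer, BertModel,
                          GPT2Tokenizer, GPT2LMHeadModel,
                          LEDTokenizer, LEDForConditionalGeneration)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Extractive stage: score each sentence with BERT plus a dense (linear) layer.
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
scorer = torch.nn.Linear(bert.config.hidden_size, 1).to(device)  # fine-tuned in practice

def extractive_summary(sentences, top_k=5):
    scores = []
    with torch.no_grad():
        for sent in sentences:
            enc = bert_tok(sent, return_tensors="pt", truncation=True,
                           max_length=512).to(device)
            cls = bert(**enc).last_hidden_state[:, 0]   # [CLS] representation
            scores.append(scorer(cls).item())           # relevance score per sentence
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_k])                       # keep original document order
    return " ".join(sentences[i] for i in keep)

# Abstractive stage: GPT generates new text conditioned on the extract.
gpt_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

def abstractive_summary(extract, max_new_tokens=120):
    enc = gpt_tok(extract, return_tensors="pt", truncation=True,
                  max_length=768).to(device)
    out = gpt.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False,
                       pad_token_id=gpt_tok.eos_token_id)
    return gpt_tok.decode(out[0, enc["input_ids"].shape[1]:],
                          skip_special_tokens=True)

# Compressive stage: LED (Longformer encoder-decoder) condenses the result.
led_tok = LEDTokenizer.from_pretrained("allenai/led-base-16384")
led = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384").to(device).eval()

def compressive_summary(text, max_length=128):
    enc = led_tok(text, return_tensors="pt", truncation=True,
                  max_length=4096).to(device)
    out = led.generate(**enc, max_length=max_length, num_beams=4)
    return led_tok.decode(out[0], skip_special_tokens=True)

def hybrid_summarize(sentences):
    # Each stage feeds its output to the next, forming the pipeline above.
    return compressive_summary(abstractive_summary(extractive_summary(sentences)))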

- Implementation of the Hybrid Summarization Architecture

- Language and framework

- Use Python programming language.

- For deep learning tasks, use the PyTorch library.

- Use the HuggingFace Transformers library for loading and fine-tuning transformer models.

- Coding the model

- Implement the designed architecture using best software engineering practices, including version control (Git) and writing modular, reusable, and well-commented code.

- Implement both unit tests and functional tests to ensure that individual
components of the architecture are working correctly.

- Training of the Hybrid Summarization Model

- Data collection
- The CNN / Daily Mail dataset will be used, as it is a standard benchmark for
document summarization tasks.

- Data preprocessing

- The BERT tokenizer will be used for tokenizing the text for the extractive part and the GPT tokenizer for the abstractive part. Special tokens like start, end, and pad tokens will be handled accordingly. All sequences will be padded to a fixed length - the maximum length supported by each model. Sequences shorter than this length will receive padding tokens until they reach the maximum length, and sequences longer than the maximum length will be truncated.

- Use the BERT tokenizer for tokenizing the text for the extractive part.

- Use the GPT tokenizer for the abstractive part.

- For the compressive part, utilize the Longformer tokenizer. As Longformer is designed for long text, we can manage longer documents with it.

- Handle special tokens like start, end, and pad tokens appropriately for
each of the models.

- For each of the models, pad all sequences to a fixed length - the
maximum length allowed by the model (BERT, GPT, Longformer have
different sequence length limits).

- If a sequence is shorter than this length, add padding tokens until it reaches the maximum length.

- If a sequence is longer than the maximum length, truncate it.
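As an illustration of these padding and truncation rules, the HuggingFace tokenizers can enforce each model's limit directly; the checkpoints and maximum lengths below (512 for BERT, 1024 for GPT-2, 4096 for LED/Longformer) are assumed defaults used for illustration, not project decisions.

# Sketch: tokenization with padding/truncation to each model's maximum length.
from transformers import BertTokenizer, GPT2Tokenizer, LEDTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
gpt_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt_tok.pad_token = gpt_tok.eos_token   # GPT-2 has no pad token by default
led_tok = LEDTokenizer.from_pretrained("allenai/led-base-16384")

text = "An example document to be summarized ..."

bert_inputs = bert_tok(text, padding="max_length", truncation=True,
                       max_length=512, return_tensors="pt")
gpt_inputs = gpt_tok(text, padding="max_length", truncation=True,
                     max_length=1024, return_tensors="pt")
led_inputs = led_tok(text, padding="max_length", truncation=True,
                     max_length=4096, return_tensors="pt")  # Longformer tokenizer handles long inputs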

- Training the model

- The Adam optimizer will be used with an initial learning rate of 0.001. After a
few epochs, if the validation loss does not decrease, the learning rate will be
reduced by a factor of 10 to refine the optimization process.

- Use the Adam optimizer with an initial learning rate of 0.001.

- Monitor the validation loss after each epoch.

- If the validation loss does not decrease after a few epochs, reduce the
learning rate by a factor of 10.
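The schedule described above corresponds directly to PyTorch's ReduceLROnPlateau with factor=0.1 (the "reduce by a factor of 10" rule). The sketch below uses a toy linear model and random tensors purely to show the mechanics; the real loop would iterate over the summarization model and its data loaders.

# Sketch: Adam with plateau-based learning-rate reduction (factor 0.1 = divide by 10).
import torch

model = torch.nn.Linear(10, 1)            # toy stand-in for the summarization model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3)   # wait a few epochs, then reduce
loss_fn = torch.nn.MSELoss()

x_train, y_train = torch.randn(64, 10), torch.randn(64, 1)
x_val, y_val = torch.randn(16, 10), torch.randn(16, 1)

for epoch in range(20):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val)
    scheduler.step(val_loss)              # lowers the lr when validation loss plateaus
    print(f"epoch {epoch}: val_loss={val_loss.item():.4f}, "
          f"lr={optimizer.param_groups[0]['lr']:.1e}")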

- Evaluating the model


- The model's performance will be evaluated on the validation set using ROUGE
scores, and the loss and accuracy will be monitored during training.

- Evaluate the model's performance on the validation set using ROUGE scores.

- Monitor the loss and accuracy during training.

- Hyperparameter Tuning

- A grid search or random search strategy will be used to explore the performance
of different hyperparameters. The learning rate, batch size, and the number of
training epochs will be given particular attention.

- Use a grid search or random search strategy to explore the performance of different hyperparameters.

- Give particular attention to the learning rate, batch size, and the number
of training epochs.

- Fine-tune these hyperparameters based on the performance on the validation set; a simple search sketch follows this list.
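A grid search over these three hyperparameters can be as simple as the nested loop sketched below. The value ranges are illustrative, and evaluate_config is a placeholder standing in for "train the hybrid model with these settings and return its validation score".

# Sketch: simple grid search over learning rate, batch size, and number of epochs.
import itertools
import random

learning_rates = [1e-3, 1e-4, 1e-5]
batch_sizes = [8, 16, 32]
epoch_counts = [2, 3, 4]

def evaluate_config(lr, batch_size, epochs):
    # Placeholder: in the real pipeline this would train the model with the
    # given settings and return, e.g., the validation ROUGE-L F1 (higher is better).
    return random.random()

best_score, best_config = float("-inf"), None
for lr, bs, ep in itertools.product(learning_rates, batch_sizes, epoch_counts):
    score = evaluate_config(lr, bs, ep)
    if score > best_score:
        best_score, best_config = score, (lr, bs, ep)

print("best (learning rate, batch size, epochs):", best_config)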

Objective 2: To assess the performance and accuracy of the developed hybrid deep learning document
summarization system through suitable evaluation techniques

Methodology 2

- Evaluation of the Hybrid Summarization Model

- Quantitative Evaluation

- Rouge Score Calculation: We will use the rouge library in Python to calculate the Rouge scores. The Rouge scores will include Rouge-1 (comparing unigram overlaps), Rouge-2 (bigram overlaps), and Rouge-L (longest common subsequence). This will give us a statistical measure of how well the model performed in terms of precision, recall, and F1 score.

- Import the Rouge module from the rouge library in Python (from rouge
import Rouge).

- Initialize an instance of the Rouge class (rouge = Rouge()).

- For each generated summary and corresponding reference summary in your dataset, calculate the Rouge scores as follows:
- Use the get_scores method of your Rouge instance to compute the scores
(scores = rouge.get_scores(hypothesis, reference)). The 'hypothesis' is
your generated summary and 'reference' is the human-written summary.

- The result will be a dictionary containing Rouge-1, Rouge-2, and Rouge-L scores, each including precision, recall, and F1 values.

- Calculate the average Rouge scores over all summaries in your dataset for a comprehensive measure of the model's performance (a code sketch follows below).
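The steps above correspond to the short sketch below using the rouge package (pip install rouge); note that the library reports each Rouge variant under the keys 'r', 'p', and 'f', and the summaries shown are placeholders.

# Sketch: average ROUGE scores over a set of generated/reference summary pairs.
from rouge import Rouge

hypotheses = ["the model generated summary of the first article",
              "another machine generated summary"]
references = ["the human written summary of the first article",
              "another human written reference summary"]

rouge = Rouge()
# avg=True averages the scores over all (hypothesis, reference) pairs.
avg_scores = rouge.get_scores(hypotheses, references, avg=True)
print(avg_scores["rouge-1"])   # {'r': recall, 'p': precision, 'f': F1}
print(avg_scores["rouge-2"])
print(avg_scores["rouge-l"])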

- BLEU Score Calculation: Besides Rouge, we can also compute BLEU (Bilingual Evaluation Understudy) scores using the NLTK (Natural Language Toolkit) library in Python. BLEU is a metric for evaluating a generated sentence against a reference sentence, and it can also serve as a good quantitative measure for our summarization task.

- Import the sentence_bleu function from the nltk.translate.bleu_score module in Python (from nltk.translate.bleu_score import sentence_bleu).

- For each generated summary and corresponding reference summary, calculate the BLEU score:

- Tokenize both summaries into words.

- Use the sentence_bleu function to calculate the score (score = sentence_bleu([reference], hypothesis)). Here 'reference' is the tokenized human-written summary and 'hypothesis' is the tokenized machine-generated summary.

- Calculate the average BLEU score over all summaries in your dataset (a code sketch follows below).
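A corresponding NLTK sketch is shown below; a smoothing function is added because short summaries often have no higher-order n-gram overlap, which would otherwise drive the sentence-level score to zero. The example summaries are placeholders.

# Sketch: average sentence-level BLEU with NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

generated = ["the model generated summary of the article",
             "another machine generated summary"]
references = ["the human written summary of the article",
              "another human written reference summary"]

smooth = SmoothingFunction().method1
scores = []
for hyp, ref in zip(generated, references):
    hyp_tokens, ref_tokens = hyp.split(), ref.split()   # simple whitespace tokenization
    scores.append(sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smooth))

print("average BLEU:", sum(scores) / len(scores))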

- Qualitative Evaluation

- Human Evaluation: Human evaluation will be performed to assess the coherence, relevance, and information completeness of the generated summaries. A team of human evaluators, preferably experts in the field, will rate the summaries based on defined criteria. Each evaluator will be given clear instructions and a rubric to follow for scoring the summaries.

- Assemble a team of human evaluators, preferably experts in the field.

- Give each evaluator a set of summaries produced by the model, along with the original texts from which the summaries were generated.

- Provide evaluators with clear instructions and a scoring rubric that includes criteria for coherence, relevance, and information completeness.

- Ask each evaluator to read the summary and the original text, then rate
the summary based on the provided criteria.
- Inter-Annotator Agreement: Finally, to ensure the validity of the human evaluations, we will calculate the inter-annotator agreement among the human evaluators using metrics such as Fleiss's kappa. This step is necessary to ensure that the scoring is not merely subjective and that there is consensus among evaluators.

- Collect the scores given by all the evaluators.

- Calculate the inter-annotator agreement using a statistical measure like Fleiss's kappa.

- You can use NLTK's agreement module to do this (from nltk.metrics import agreement; ratingtask = agreement.AnnotationTask(data=your_data); print('Fleiss\'s Kappa:', ratingtask.multi_kappa())). Here 'your_data' is a list of tuples, each containing the evaluator ID, item ID, and score; a fuller sketch follows this list.

- If the kappa score is high (close to 1), it means there's strong agreement
between the evaluators, which validates the human evaluation results.
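The agreement computation can be sketched with NLTK's agreement module as below; the ratings are made-up placeholder data in (evaluator ID, item ID, score) form.

# Sketch: inter-annotator agreement with NLTK's AnnotationTask.
from nltk.metrics import agreement

ratings = [
    ("eval1", "summary1", "4"), ("eval2", "summary1", "4"), ("eval3", "summary1", "5"),
    ("eval1", "summary2", "2"), ("eval2", "summary2", "3"), ("eval3", "summary2", "2"),
    ("eval1", "summary3", "5"), ("eval2", "summary3", "5"), ("eval3", "summary3", "5"),
]

task = agreement.AnnotationTask(data=ratings)
print("Fleiss's (multi-)kappa:", task.multi_kappa())
print("Krippendorff's alpha:", task.alpha())   # complementary agreement measure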

Objective 3: To design and develop a novel deep learning-based system for non-factoid question
answering, culminating in the creation of a comprehensive proof of concept.

Methodology 3

- Design of Question-Answering Architecture

- Defining the model

- The Question-Answering (QA) system will be designed with three core modules:
a question processing module, a document retrieval module (Dense Passage
Retrieval - DPR), and a summarization module (our hybrid summarization model
from Objective 1).

- Design specifics

- Question Processing Module: This is the initial phase where we receive the user's
question. We will use the BERT model from the HuggingFace Transformers
library to encode the user's question into a meaningful vector representation.

- Document Retrieval Module: We'll use the Dense Passage Retrieval (DPR)
model for this step. This model will use the output from the question processing
module (the vector representation of the user's question) to retrieve the most
relevant documents from the given dataset.

- Summarization Module: Here, we will input the retrieved documents into our
pre-trained hybrid summarization model from Objective 1. The summarization
model will generate a concise and accurate summary of the information in the
documents, which directly answers the user's question.

- Implementation of Question-Answering System

- Language and framework

- We'll use Python for implementation because of its extensive libraries for NLP
and ML tasks.

- The HuggingFace Transformers library will be used for the DPR module, and
PyTorch for our summarization module.

- The implementation of the model

- Question Processing Module: The role of this module is to convert the user's
question into a suitable format for the document retrieval module.

- Use the HuggingFace Transformers library to access pre-trained BERT models.

- Tokenize the input question using BERT's tokenizer.

- Add required special tokens ([CLS], [SEP]).

- Create an attention mask to indicate the presence of non-padding tokens.

- Document Retrieval Module: This module will take the processed question and
find the most relevant documents from our corpus.

- Use HuggingFace's DPR implementation, which includes pre-trained models and tokenizers.

- Feed the processed question from the BERT model into the DPR.

- Use the cosine similarity of the question and document embeddings to retrieve the most relevant documents from the corpus (a retrieval sketch is given after this list).

- Summarization Module: This module will take the retrieved documents and generate a summary that directly answers the user's question.

- Implement the hybrid summarization model in PyTorch. The model takes the documents retrieved by DPR and summarizes them.

- Apply the necessary pre-processing to the document text, similar to the steps in the question processing module.

- Feed the processed documents into the model and retrieve the generated
summaries.
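The question-processing and document-retrieval steps can be sketched as follows with HuggingFace's DPR encoders. The checkpoints are the publicly released single-nq DPR models, the three-document corpus is made up for illustration, and cosine similarity is used to match the design above (DPR itself was originally trained with a dot-product score).

# Sketch: dense retrieval with DPR over a toy in-memory corpus.
import torch
from transformers import (DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
                          DPRContextEncoder, DPRContextEncoderTokenizer)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base").eval()
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").eval()

corpus = [  # placeholder documents standing in for the indexed knowledge base
    "Measles is managed with supportive care such as rest, fluids and fever control.",
    "The Longformer model uses sparse attention to process long documents.",
    "Vitamin A supplementation is recommended for children with measles.",
]

with torch.no_grad():
    ctx_inputs = c_tok(corpus, padding=True, truncation=True, return_tensors="pt")
    ctx_emb = c_enc(**ctx_inputs).pooler_output          # (num_docs, dim)

def retrieve(question, top_k=2):
    with torch.no_grad():
        q_inputs = q_tok(question, return_tensors="pt")
        q_emb = q_enc(**q_inputs).pooler_output          # (1, dim)
    sims = torch.nn.functional.cosine_similarity(q_emb, ctx_emb)
    top = torch.topk(sims, k=top_k).indices.tolist()
    return [corpus[i] for i in top]   # documents passed on to the summarization module

print(retrieve("How should I treat measles in a 12-year-old boy?"))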
- Training of Question-Answering Model

- Data Collection

- Use the MS MARCO dataset, a large-scale, real-world dataset designed specifically for information retrieval tasks.

- Data Preprocessing

- Tokenize the text from the dataset.

- Handle special tokens required by the models.

- Ensure sequences are of fixed length by applying padding or truncation as required.

- Model Training

- Use the AdamW optimizer for training the DPR model. This variant of the Adam
optimizer includes weight decay, which can help prevent overfitting.

- Monitor the model's validation loss and adjust the learning rate using PyTorch's
ReduceLROnPlateau scheduler if the loss plateaus.

- Hyperparameter Tuning

- Use Scikit-learn's GridSearchCV (via a scikit-learn-compatible wrapper such as skorch for the PyTorch models) to perform hyperparameter tuning. This will help identify the learning rate, batch size, and number of training epochs that give the best model performance.

- Validation

- Evaluate the performance of the model on a separate validation set from the MS
MARCO dataset.

- Use Precision@K, Recall@K, and F1-Score for the document retrieval model
evaluation. These metrics measure the quality of the documents retrieved in
terms of relevance.

- Use ROUGE scores for the summarization model evaluation. This metric
measures the quality of the summaries generated in terms of similarity with the
reference summaries.
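Precision@K and Recall@K can be computed directly from the ranked list of retrieved document IDs and the set of relevant IDs for a query; the helper below is a stand-alone sketch with made-up IDs.

# Sketch: Precision@K, Recall@K and F1 for the document-retrieval module.
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def f1_score(precision, recall):
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Toy example: ranked retrieval output vs. the ground-truth relevant set for one query.
retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d5"}
p, r = precision_recall_at_k(retrieved, relevant, k=3)
print(f"P@3={p:.2f}  R@3={r:.2f}  F1={f1_score(p, r):.2f}")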

Objective 4: To systematically evaluate the functionality, accuracy, and efficiency of the developed non-
factoid question-answering system using the developed proof of concept.

Methodology 4
- Evaluation of Question-Answering System

- Setup Evaluation Benchmarks

- Identify state-of-the-art non-factoid question-answering models to serve as the benchmark. These can include models such as the Transformer-based T5 or BERT-based question-answering systems.

- Obtain or re-implement these models for evaluation. We will use the HuggingFace Transformers library, which provides easy access to many pre-trained models, including T5 and BERT-based models.

- Quantitative Evaluation

- Use the same dataset (e.g., MS MARCO) to evaluate all models under
consideration. This ensures a fair comparison.

- Evaluate the Document Retrieval Module: Use Precision@K, Recall@K, and F1-
Score to evaluate the performance of document retrieval. These metrics measure
the relevance and comprehensiveness of the retrieved documents.

- Evaluate the Summarization Module: Use ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores to evaluate the performance of the summarization. These metrics measure the quality of the summaries generated.

- Evaluate the Overall Question-Answering System: Use the BLEU (Bilingual Evaluation Understudy) score, which measures the similarity between the system's output and the ground-truth answers.

- Qualitative Evaluation

- Conduct a qualitative evaluation of a subset of answers from each model. This could involve human evaluators assessing the relevance, accuracy, and comprehensiveness of the answers.

- Use Python to code the infrastructure for the qualitative evaluation. Use an
interface that allows evaluators to easily rate the answers generated by each
model according to the criteria.

- Recruit a small panel of human evaluators to perform this task. They would rate the responses based on clarity, relevance, and correctness.

- Comparison and Analysis

- Compare the performance of our model with the benchmark models based on
both quantitative and qualitative evaluation results.

- Use Python's Matplotlib or Seaborn libraries to visualize the results for easier comparison and interpretation (a plotting sketch is given after this list).
- Analyze areas where our model outperforms or underperforms compared to the
benchmarks. Identify potential areas for improvement.
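The comparison can be visualized with a grouped bar chart such as the sketch below; the model names and scores are placeholder values used only to show the plotting pattern.

# Sketch: grouped bar chart comparing models on ROUGE-L and BLEU (placeholder numbers).
import numpy as np
import matplotlib.pyplot as plt

models = ["Proposed hybrid QAS", "T5 baseline", "BERT QA baseline"]
rouge_l = [0.41, 0.38, 0.35]   # placeholder scores
bleu = [0.29, 0.27, 0.24]      # placeholder scores

x = np.arange(len(models))
width = 0.35

fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(x - width / 2, rouge_l, width, label="ROUGE-L")
ax.bar(x + width / 2, bleu, width, label="BLEU")
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylabel("Score")
ax.set_title("Benchmark comparison (placeholder values)")
ax.legend()
plt.tight_layout()
plt.show()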

Summary of what is being solved and what is being implemented

- Exploring deep learning methods for multiple-document summarization to answer users' complex questions.

- Exploring different combinations of abstractive, extractive, and compressive summarization

- Incorporating the best-performing model into QAS to answer user questions

What is expected for each objective

- Objective 1

- We would combine and implement different summarization techniques (extractive, abstractive, compressive) for multiple-document summarization - first two of them, then all three.

- The goal is then to arrive at a suitable architecture for automatic multiple-document summarization.

- We would be implementing and documenting the architecture

- Objective 2

- We would evaluate the performance of the architecture obtained from Objective 1 by benchmarking it against state-of-the-art architectures.

- Objective 3

- The goal here is to infuse our summarization architecture from Objective 1 into the QAS architecture.

- Then we implement our hybrid architecture for QAS

- Objective 4

- We evaluate the performance of the QAS from Objective 3 by benchmarking it against state-of-the-art architectures.
