Question Answering Finale
Chapter 1
INTRODUCTION
1.1 Overview
Text is a prominent and direct source of information, and detecting, extracting and
recognizing the context of the text present in a given paragraph is a central task. Developing a
computer system with the ability to automatically read and infer from a given text is therefore
an immense necessity. Text content lexically embedded in complex contextualization is
analysed so as to obtain specific patterns or sequences that cannot easily be extracted by an
average human reader. Generally, users find it a very tedious and time-consuming job to sift
through many documents.
In the current world scenario, data is generated at a rate that is incomprehensible
to the common man. This is where the art and science of Machine Learning and Artificial
Intelligence come into the picture. Summarization and context recognition play a key role in
this process.
With the rise of free and open-source technologies, the computing world has been
lifted to new heights. In the present world, people from various communities interact in
a multi-cultural environment to develop solutions for man's never-ending problems and
needs. One of the notable contributions of the open-source community to the scientific world
is Python. [1]
1.2 Scope
The Question-Answer System has become a very important technology in our
daily life because of the ever-increasing number of documents and informational content,
which makes it impossible for humans to fully manage and monitor them. There are many
fields where humans face challenges such as monitoring documents, extracting the abstract
or gist of some particular data, and so on. However, the question-answer system
is a very challenging technology to put into practice due to the diversity of the information
formats of different documents, the different meanings and contexts, and also due to the
prevalence of non-uniform conditions during the process of document acquisition. This
project mainly introduces an Automatic Question-Answer System which can be deployed in
industries that generate a lot of textual data and can help answer specific questions based on
the text data being provided. This can save a lot of time and resources which were earlier
spent on manual parsing.
1.3 Assumptions
The objective of our project is to process a large amount of natural language data to
analyze the context and answer questions based on the given context. Our application is
based on the following assumptions:
1. The text must follow standard NLP rules.
2. The input must not contain any non-textual data (image, audio, etc.).
3. Emoticons or Emoji must not be used amidst the text.
4. The file format must be a text file (.txt).
1.4 Existing System
There are numerous NLP systems available for use today. These systems are
based on different methodologies, but context recognition remains a really challenging task,
which leads to lower performance rates and makes such systems difficult to implement. [3]
Rule-based approaches are the oldest approaches to NLP. These methods tend to
focus on pattern matching or parsing and are low-precision, high-recall, i.e. they can perform
well in specific use cases, but often suffer performance degradation when generalized.
Importantly, both neural network and non-neural network approaches can be useful
for contemporary NLP in their own right; they can also be used or studied in tandem for
maximum potential benefit.
The next task is to come up with an appropriate neural network architecture. This is an
important task, as the selected architecture affects the overall performance of the system in
terms of speed, accuracy and efficiency. The knowledge graph, in OWL or RDF format, forms
the base from which essential or critical attributes can be extracted using various feature
engineering techniques. A convolutional neural network fits best in scenarios that involve
image processing, but we make use of a recurrent neural network, as it feeds information from
previous inputs back into the network and redefines the weights in order to improve the
classification of the original sentence. The above steps, when put together, help us obtain the
optimum response and also structure it in a grammatically correct format.
Chapter 2
LITERATURE SURVEY
2.1 Semantic Parsing via Staged Query Graph Generation
Organizing the world’s facts and storing them in a structured database, large-scale
knowledge bases (KB) like DBPedia (Auer et al., 2007) and Freebase (Bollacker et al., 2008)
have become important resources for supporting open-domain question answering (QA).
Most state-of-the-art approaches to KB-QA are based on semantic parsing, where a question
(utterance) is mapped to its formal meaning representation (e.g., logical form) and then
translated to a KB query. The answers to the question can then be retrieved simply by
executing the query. The semantic parse also provides a deeper understanding of the
question, which can be used to justify the answer to users, as well as to provide easily
interpretable information to developers for error analysis.
However, most traditional approaches for semantic parsing are largely
decoupled from the knowledge base, and thus are faced with several challenges when
adapted to applications like QA. We first define a query graph that can be
straightforwardly mapped to a logical form in λ-calculus and is semantically closely
related to λ-DCS (Liang, 2013). Semantic parsing is then reduced to query graph
generation, formulated as a search problem with staged states and actions. Each state is a
candidate parse in the query graph representation and each action defines a way to grow
the graph. The representation power of the semantic parse is thus controlled by the set of
legitimate actions applicable to each state. In particular, we stage the actions into three
main steps: locating the topic entity in the question, finding the main relationship between
the answer and the topic entity, and expanding the query graph with additional constraints
that describe properties the answer needs to have, or relationships between the answer and
other entities in the question.
The main focus of this work was to develop a semantic parsing framework that
maps a natural language question to a logical form query, which can be executed against a
knowledge base to retrieve the answers. Question answer pairs were taken from the
WEBQUESTIONS dataset and the model was evaluated against this dataset. It was
demonstrated that the model could retrieve the right answers with an accuracy of 52.5%.
This approach resulted in a model that performed much better than previous approaches.
The advanced entity linking system played a significant role in improving the
performance of the system as it helped reduce the number of entities suggested by half.
This paper presented a semantic parsing framework for question answering using a
knowledge base. A query graph is defined as the meaning representation that can be directly
mapped to a logical form. Semantic parsing is reduced to query graph generation, formulated
as a staged search problem. With the help of an advanced entity linking system and a deep
convolutional neural network model that matches questions and predicate sequences, our
system outperforms previous methods substantially on the WEBQUESTIONS dataset. [24]
2.2 The NarrativeQA Reading Comprehension Challenge
Natural language understanding seeks to create models that read and comprehend
text. A common strategy for assessing the language understanding capabilities of
comprehension models is to demonstrate that they can answer questions about documents
they read, akin to how reading comprehension is tested in children when they are learning to
read. After reading a document, a reader usually cannot reproduce the entire text from
memory, but often can answer questions about underlying narrative elements of the
document: the salient entities, events, places, and the relations between them. Thus, testing
understanding requires creation of questions that examine high-level abstractions instead of
just facts occurring in one sentence at a time.
We have introduced a new dataset and a set of tasks for training and evaluating
reading comprehension systems, born from an analysis of the limitations of existing datasets
and tasks. While our QA task resembles tasks provided by existing datasets, it exposes new
challenges because of its domain: fiction. Fictional stories, in contrast to news stories, are
self-contained and describe a richer set of entities, events, and the relations between them. We have
a range of tasks, from simple (which requires models to read summaries of books and movie
scripts, and generate or rank fluent English answers to human-generated questions) to more
complex (which requires models to read the full stories to answer the questions, with no
access to the summaries). [21]
2.3 Natural Language Processing Techniques in Information Retrieval
Overall, we see a modest benefit of NLP techniques in IR. However, this benefit
comes with large computational costs, and non-NLP techniques tend to yield greater
improvements. Small positive effects often seem to be a superposition of positive and
negative effects, and automatically separating positive and negative instances would help a lot.
Such a separation would require a joint focus on NLP and retrieval, rather than building an NLP
system and then applying it to retrieval more or less as a black box. Processing techniques that
were developed directly for information retrieval tend to be more successful than techniques
that were developed independently on purely linguistic grounds. The Porter stemming algorithm
is very fast and tailored for normalization in retrieval systems. Similarly, statistical "phrases" as
investigated in the retrieval community collide with linguistic knowledge, but they are
optimized for the retrieval task and are therefore successful. [23]
2.4 Preprocessing Techniques for Text Mining
Preprocessing is an important task and a critical step in text mining, Natural Language
Processing (NLP) and Information Retrieval (IR). In the area of text mining, data
preprocessing is used for extracting interesting and non-trivial knowledge from
unstructured text data. Information Retrieval is essentially a matter of deciding which
documents in a collection should be retrieved to satisfy a user's need for information. The
user's need for information is represented by a query or profile, and contains one or more
search terms, plus some additional information such as the weights of the words. Hence, the
retrieval decision is made by comparing the terms of the query with the index terms
(important words or phrases) appearing in the document itself. The decision may be binary
(retrieve/reject), or it may involve estimating the degree of relevance that the document has to
the query. Unfortunately, the words that appear in documents and in queries often have many
structural variants. So, before information retrieval from the documents, data preprocessing
techniques are applied to the target data set to reduce the size of the data set, which increases
the effectiveness of the IR system. The objective of this study is to analyze the issues of
preprocessing methods such as tokenization, stop word removal and stemming for text
documents.
Need for Text Processing in NLP System:
1. To reduce the indexing file size of the text documents by removing stop words, which
account for 20-30% of the total word count, and also by stemming.
2. To improve the efficiency and effectiveness of the Information Retrieval system. [22]
Chapter 3
SYSTEM REQUIREMENTS
System requirements are the configuration that a system must have for a hardware or
software application to run easily and proficiently. If these requirements are not satisfied,
they can lead to installation or performance problems. Installation problems may prevent a
device or an application from getting installed. Performance problems may cause a product to
malfunction or perform below expectation or even hang or crash. [5]
3.1 Hardware Requirements
The section of hardware configuration is an important task related to the software
development insufficient random access memory may affect adversely on the speed and
efficiency of the entire system. The process should be powerful to handle the entire
operations. The hard disk should have sufficient capacity to store the file and application.
3.2 Software Requirements
A major element in building a system is the selection of compatible software, since the
software available in the market is growing in geometric progression. The selected software
should be acceptable to the firm and the end user, as well as feasible for the system.
3.2.1 NLTK
NLTK is intended to support research and teaching in NLP and closely related areas,
including empirical linguistics, cognitive science, artificial intelligence, information retrieval,
and machine learning. NLTK has been used successfully as a teaching tool, as an individual
study tool, and as a platform for prototyping and building research systems. There are 32
universities in the US and 25 countries using NLTK in their courses. NLTK supports
classification, tokenization, and stemming, tagging, parsing, and semantic reasoning
functionalities.
The Natural Language Toolkit (NLTK) is a platform used for building Python
programs that work with human language data, for application in statistical natural language
processing (NLP).
NLTK includes more than 50 corpora and lexical sources such as the Penn Treebank
Corpus, Open Multilingual Wordnet, Problem Report Corpus, and Lin’s Dependency
Thesaurus. [6]
Figure 3.1: NLTK Hierarchy
3.2.2 TensorFlow
TensorFlow is a free and open-source software library for dataflow and
differentiable programming across a range of tasks. It is a symbolic math library, and is also
used for machine learning applications such as neural networks. It is used for both research
and production. It is a computational framework for building machine learning models.
A tensor can originate from the input data or from the result of a computation. In
TensorFlow, all the operations are conducted inside a graph. The graph is a set of
computations that take place successively. Each operation is called an op node, and the op
nodes are connected to each other.
The graph outlines the ops and connections between the nodes. However, it does not
display the values. The edge of the nodes is the tensor, i.e., a way to populate the operation
with data.
Graphs
TensorFlow makes use of a graph framework. The graph gathers and describes all
the series of computations done during training. The graph has several advantages:
1. It is designed to run on multiple CPUs or GPUs, and even on mobile operating systems.
2. The portability of the graph allows the computations to be preserved for immediate or
later use; the graph can be saved and executed in the future.
3. All the computations in the graph are done by connecting tensors together.
A tensor has a node and an edge. The node carries the mathematical operation and
produces endpoint outputs. The edges explain the input/output relationships between nodes.
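As an illustration of the graph-and-session model described above, the following is a minimal sketch using the TensorFlow 1.x style API (available through tf.compat.v1 in TensorFlow 2.x): two constant tensors are connected by an op node, and the graph is then run inside a session. The tensor names and values are arbitrary.

# Minimal sketch of the TensorFlow graph/session model (TF 1.x style API).
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Building the graph: each operation becomes an "op node" connected by tensors.
a = tf.constant(3.0, name="a")
b = tf.constant(4.0, name="b")
c = tf.add(a, b, name="sum")   # op node consuming the tensors a and b

# The graph only describes the computation; a session produces the values.
with tf.Session() as sess:
    print(sess.run(c))         # 7.0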
3.2.3 SciPy
SciPy is a free and open-source Python library used for scientific computing and
technical computing. SciPy contains modules for optimization, linear algebra, integration,
interpolation, special functions, FFT, signal and image processing, ODE solvers and other
tasks common in science and engineering.
SciPy builds on the NumPy array object and is part of the NumPy stack which
includes tools like Matplotlib, pandas and SymPy, and an expanding set of scientific
computing libraries. This NumPy stack has similar users to other applications such as
MATLAB, GNU Octave, and Scilab. The NumPy stack is also sometimes referred to as the
SciPy stack.
The SciPy library is currently distributed under the BSD license, and its
development is sponsored and supported by an open community of developers. It is also
supported by NumFOCUS, a community foundation for supporting reproducible and
accessible science. [8]
The basic data structure used by SciPy is a multidimensional array provided by the
NumPy module. NumPy provides some functions for linear algebra, Fourier transforms, and
random number generation, but not with the generality of the equivalent functions in SciPy.
NumPy can also be used as an efficient multidimensional container of data with arbitrary
datatypes. This allows NumPy to seamlessly and speedily integrate with a wide variety of
databases. Older versions of SciPy used Numeric as an array type, which is now deprecated
in favor of the newer NumPy array code.
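As a brief illustration of how SciPy builds on NumPy arrays, the following minimal sketch solves a small linear system with scipy.linalg and minimises a simple quadratic function with scipy.optimize; the matrix, vector and function are arbitrary examples.

# Minimal sketch: SciPy routines operating on NumPy arrays.
import numpy as np
from scipy import linalg, optimize

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)          # solve Ax = b (LAPACK under the hood)
print(x)                        # [2. 3.]

# Minimise f(v) = (v - 2)^2 starting from v = 0.
result = optimize.minimize(lambda v: (v[0] - 2.0) ** 2, x0=[0.0])
print(result.x)                 # approximately [2.]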
3.2.4 Numpy
NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was
originally created by Jim Hugunin with contributions from several other developers. In 2005,
Travis Oliphant created NumPy by incorporating features of the competing Numarray into
Numeric, with extensive modifications. NumPy is open-source software and has many
contributors.
Python bindings of the widely used computer vision library OpenCV utilize NumPy
arrays to store and operate on data. Since images with multiple channels are simply
represented as three-dimensional arrays, indexing, slicing or masking with other arrays are
very efficient ways to access specific pixels of an image. The NumPy array as universal data
structure in OpenCV for images, extracted feature points, filter kernels and many more vastly
simplifies the programming workflow and debugging. [7]
The core functionality of NumPy is its "ndarray", for n-dimensional array, data
structure. These arrays are strided views on memory. In contrast to Python's built-in list data
structure (which, despite the name, is a dynamic array), these arrays are homogeneously
typed: all elements of a single array must be of the same type.
Such arrays can also be views into memory buffers allocated by C/C++, Cython, and Fortran
extensions to the CPython interpreter without the need to copy data around, giving a degree
of compatibility with existing numerical libraries. This functionality is exploited by the SciPy
package, which wraps a number of such libraries (notably BLAS and LAPACK). NumPy has
built-in support for memory-mapped ndarrays.
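The following minimal sketch illustrates the homogeneously typed ndarray and its strided views described above; the shapes and values are arbitrary.

# Minimal sketch of NumPy's homogeneously typed ndarray and strided views.
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)   # 3x4 array, all float64
row = a[1]        # a view onto the second row; no data is copied
col = a[:, 2]     # a strided view onto the third column
a[1, 0] = 99.0    # writing through the array is reflected in its views
print(a.dtype, a.shape, row[0])                      # float64 (3, 4) 99.0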
Limitations
Algorithms that are not expressible as a vectorized operation will typically run slowly
because they must be implemented in "pure Python", while vectorization may increase
memory complexity of some operations from constant to linear, because temporary arrays
must be created that are as large as the inputs. Runtime compilation of numerical code has
been implemented by several groups to avoid these problems; open source solutions that
interoperate with NumPy include scipy.weave, numexpr and Numba. Cython and Pythran are
static-compiling alternatives to these.
3.2.5 Scikit-learn
Scikit-learn is a free and open-source machine learning library for the Python programming
language. It features various classification, regression and clustering algorithms and is
designed to interoperate with the NumPy and SciPy numerical libraries.
3.2.6 Pandas
In 2008, developer Wes McKinney started developing pandas out of the need for a
high-performance, flexible tool for the analysis of data.
Prior to Pandas, Python was mainly used for data munging and preparation. It had very
little contribution towards data analysis. Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin
of the data: load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of academic and commercial domains,
including finance, economics, statistics, analytics, etc. [9]
The key features of Pandas include:
1. A fast and efficient DataFrame object with default and customized indexing.
2. Tools for loading data into in-memory data objects from different file formats.
Building and handling arrays of two or more dimensions is a tedious task, since the burden is
placed on the user to consider the orientation of the data set when writing functions. Using
Pandas data structures, this mental effort on the part of the user is reduced.
For example, with tabular data (DataFrame) it is semantically more helpful to think of
the index (the rows) and the columns rather than of axis 0 and axis 1.
Mutability
All Pandas data structures are value mutable (can be changed) and except Series all are size
mutable. Series is size immutable. DataFrame is widely used and one of the most important
data structures. Panel is used much less.
Series
Series is a one-dimensional array-like structure with homogeneous data. For example, the
following series is a collection of the integers 10, 23, 56, …
10 23 56 17 52 61 73 90 26
Key Points
Homogeneous data
Size Immutable
DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For example,
The table represents the data of a sales team of an organization with their overall
performance rating. The data is represented in rows and columns. Each column represents an
attribute and each row represents a person.
Column Type
Name String
Age Integer
Gender String
Rating Float
Key Points
Heterogeneous data
Size Mutable
Data Mutable
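The following minimal sketch illustrates the Series and DataFrame structures described above; the column names mirror the example table and the values are purely illustrative.

# Minimal sketch of the pandas Series and DataFrame structures.
import pandas as pd

# A Series: one-dimensional, homogeneous data.
s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26])

# A DataFrame: two-dimensional, with heterogeneous column types.
df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky"],       # String
    "Age": [25, 26, 25],                     # Integer
    "Gender": ["M", "M", "F"],               # String
    "Rating": [4.23, 3.24, 3.98],            # Float
})

print(s.dtype)                   # int64: all elements share one type
print(df.dtypes)                 # one dtype per column
print(df[df["Rating"] > 3.5])    # rows where the rating exceeds 3.5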
Panel
Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent
a panel graphically, but it can be illustrated as a container of DataFrames.
Key Points
Heterogeneous data
Size Mutable
Data Mutable
3.2.7 PyTorch
Autograd Module
PyTorch uses a technique called automatic differentiation. A recorder records the
operations that have been performed, and then replays them backward to compute the
gradients. This technique is especially powerful when building neural networks, as it saves
time on each epoch by calculating the differentiation of the parameters at the forward pass itself.
Optim Module
torch.optim is a module that implements various optimization algorithms used for
building neural networks. Most of the commonly used methods are already supported, so
there is no need to build them from scratch.
nn Module
PyTorch autograd makes it easy to define computational graphs and take gradients,
but raw autograd can be a bit too low-level for defining complex neural networks. This is
where the nn module can help.
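The following minimal sketch shows how the autograd, optim and nn modules work together in a single training step; the layer sizes, data and learning rate are arbitrary.

# Minimal sketch of PyTorch autograd, nn and optim in one training step.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 1)                              # a single fully connected layer (nn module)
optimizer = optim.SGD(model.parameters(), lr=0.01)   # optimizer from torch.optim
loss_fn = nn.MSELoss()

x = torch.randn(8, 4)                                # a batch of 8 random inputs
y = torch.randn(8, 1)                                # random targets

optimizer.zero_grad()
loss = loss_fn(model(x), y)                          # forward pass; autograd records the operations
loss.backward()                                      # replay backward to compute the gradients
optimizer.step()                                     # update the parameters
print(loss.item())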
3.2.8 Pickle
The pickle module implements binary protocols for serializing and de-serializing a
Python object structure. “Pickling” is the process whereby a Python object hierarchy is
converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte
stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or
“flattening”; however, to avoid confusion, the terms used here are “pickling” and
“unpickling”. [9]
Data stream format
The data format used by pickle is Python-specific. This has the advantage that there are no
restrictions imposed by external standards such as JSON or XDR (which can’t represent
pointer sharing); however it means that non-Python programs may not be able to reconstruct
pickled Python objects.
By default, the pickle data format uses a relatively compact binary representation. If you need
optimal size characteristics, you can efficiently compress pickled data. The module
pickletools contains tools for analyzing data streams generated by pickle. pickletools source
code has extensive comments about opcodes used by pickle protocols.
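A minimal sketch of pickling and unpickling a Python object to and from a binary file is given below; the dictionary contents and the file name are arbitrary.

# Minimal sketch of serializing and de-serializing an object with pickle.
import pickle

weights = {"layer1": [0.1, 0.2, 0.3], "layer2": [0.4, 0.5]}

# Pickling: convert the object hierarchy into a byte stream on disk.
with open("model.pkl", "wb") as f:
    pickle.dump(weights, f)

# Unpickling: reconstruct the object hierarchy from the byte stream.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == weights)   # True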
Chapter 4
SYSTEM ARCHITECTURE
4.2.1 Tokenization
Tokenization is a step which splits longer strings of text into smaller pieces, or
tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized
into words, etc. Further processing is generally performed after a piece of text has been
appropriately tokenized. Tokenization is also referred to as text segmentation or lexical
analysis. Sometimes segmentation is used to refer to the breakdown of a large chunk of text
into pieces larger than words (e.g. paragraphs or sentences), while tokenization is reserved
for the breakdown process which results exclusively in words. [12]
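A minimal sketch of sentence- and word-level tokenization using NLTK is shown below; it assumes that the punkt tokenizer models have been downloaded, and the input text is arbitrary.

# Minimal sketch of tokenization with NLTK (assumes punkt models are available).
import nltk
nltk.download("punkt", quiet=True)                 # tokenizer models, fetched once
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Text is a prominent source of information. It can be split into tokens."
sentences = sent_tokenize(text)                    # segment the text into sentences
words = word_tokenize(sentences[0])                # split a sentence into word tokens
print(sentences)
print(words)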
4.3.2 Pre-process
Before the actual recognition and extraction of the context, it is necessary to enhance
the quality of the text content. This aids the process of context recognition and
extraction. The pre-processing steps can be defined as tokenization, removal of stop words,
removal of accented characters and contractions, stemming and lemmatization. After the
implementation of all the above-mentioned steps, one obtains the cleaned/pre-processed
text.
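The sketch below strings the listed steps together with NLTK (stop word removal, stemming and lemmatization); it assumes the punkt, stopwords and wordnet resources have been downloaded, and it shows only one possible ordering of the steps, not the project's exact implementation.

# Minimal sketch of the pre-processing pipeline with NLTK.
import nltk
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                    # tokenization
    tokens = [t for t in tokens if t.isalpha()]             # drop punctuation and digits
    tokens = [t for t in tokens if t not in stop_words]     # stop word removal
    stems = [stemmer.stem(t) for t in tokens]               # stemming
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]      # lemmatization
    return stems, lemmas

print(preprocess("The documents were being analysed for answering the questions."))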
4.3.3 Localize
This step is responsible for identifying the zone or location where the actual context
is present with respect to the input question. The paragraphs are scored based on their
TF-IDF (Term Frequency-Inverse Document Frequency) values. This score measures the
contextual relatedness between a specific paragraph and the underlying context of the question
being asked. The paragraphs are then ranked based on their TF-IDF scores, and this helps us
localize the region of the document with the highest contextual similarity.
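A minimal sketch of this paragraph ranking step using scikit-learn's TfidfVectorizer and cosine similarity is given below; the paragraphs and question are placeholders, and the deployed system may compute and combine the scores differently.

# Minimal sketch of ranking paragraphs by TF-IDF similarity to the question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [
    "Python is a popular open source programming language.",
    "The hard disk should have sufficient capacity for the application.",
    "NLTK is a platform for building Python programs that process language.",
]
question = "Which language is used to build NLP programs?"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(paragraphs + [question])   # last row is the question
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()  # question vs. each paragraph

# Rank paragraphs by decreasing similarity to localize the relevant region.
ranking = scores.argsort()[::-1]
print(ranking, scores[ranking])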
4.3.5 Segment
The extracted components are the parts of the document that are of use to us for
further text processing. Here we apply a specific method where the entire extracted
component is scanned from top to bottom. The best span is then identified, which gives
us the vector positions of the starting and ending words of the sentence with the highest
confidence score.
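A minimal, hypothetical sketch of this best-span selection is given below: given per-token probabilities of being the start and the end of the answer (as produced by the reading comprehension model), the span with the highest combined score under a length limit is chosen. The probability arrays and the length limit are placeholders.

# Minimal sketch of best-span selection over start/end probabilities.
import numpy as np

def best_span(start_probs, end_probs, max_len=15):
    # Return (start, end) maximizing start_probs[i] * end_probs[j] with i <= j.
    best, best_score = (0, 0), 0.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Placeholder probabilities for a six-token extracted component.
start = np.array([0.05, 0.60, 0.10, 0.10, 0.10, 0.05])
end = np.array([0.05, 0.10, 0.15, 0.55, 0.10, 0.05])
print(best_span(start, end))    # ((1, 3), approximately 0.33)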
Chapter 5
IMPLEMENTATION
5.1 Introduction
The implementation of the proposed system involves a sequence of simple steps as given
below:
1. Firstly, generate a pickle file which contains all the necessary weights and other
parameters from the SQuAD dataset using neural networks.
2. Then import the pickle file of the SQuAD dataset as the model into our system.
4. Perform all the pre-processing techniques on the document and on the question to get
cleaned text.
5. Now split the paragraphs into sentences and then into words. Rate the paragraphs based
on their contextual match to the question.
6. Now build the TensorFlow session and encode all the weights into NumPy arrays.
7. Then generate the answer using the selected model and display it on the user's screen
(a simplified sketch of this pipeline is given below).
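The following high-level sketch shows how these steps fit together. The helper bodies are simplified placeholders for illustration only: the file names are hypothetical, the model loaded from the pickle file is not actually consumed here, and the real system runs the neural network inside a TensorFlow session in steps 6 and 7.

# Hedged, high-level sketch of the answering pipeline described above.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_paragraphs(paragraphs, question):
    # Step 5 placeholder: order paragraphs by TF-IDF similarity to the question.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(paragraphs + [question])
    scores = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    return [paragraphs[i] for i in scores.argsort()[::-1]]

def answer_question(document_path, question, model_path="squad_model.pkl"):
    with open(document_path, encoding="utf-8") as f:           # read the .txt input
        paragraphs = [p for p in f.read().split("\n\n") if p.strip()]
    with open(model_path, "rb") as f:                          # step 2: import the pickled model
        model = pickle.load(f)
    best_paragraph = rank_paragraphs(paragraphs, question)[0]  # steps 4-5 (simplified)
    # Steps 6-7 would feed best_paragraph and the question to the neural model
    # inside a TensorFlow session; this sketch simply returns the selected paragraph.
    return best_paragraph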
As described earlier, any mistakes in selecting the programming language may lead
to the failure of the entire system. Hence, the programming language for any code must be
chosen with proper knowledge about the design of the proposed system. The programming
language chosen is Python. [2]
5.2.1 Python
The main reasons for using Python in our proposed system are:
2. Text processing is easier as Python includes libraries that support text processing
from the basic to advanced level. Many of the functions for text segmentation,
feature extraction, context detection, etc. are implicitly available in the libraries.
An activity diagram is similar to a flowchart but here the flow is represented from one
activity to another activity. An activity is nothing but an operation of the system. This
diagram describes the dynamic behaviour of the system. These are usually constructed using
forward and reverse engineering techniques. The following Figure 5.3 describes the activity
diagram of our proposed system. [17]
Figure 5.3: Activity diagram
A use case diagram is a simple way of representing the interaction of a user with the
system. It shows the complete relationship between the user and system with the use cases. It
is also known as a behaviour diagram as it depicts the behaviour of the system with the
external actors (users). The following Figure 5.4 shows the use case diagram for our system.
Here the user is considered as the external actor that interacts with the system in order to
retrieve the answer.
Figure 5.4: Use case diagram
A sequence diagram depicts the active processes that live simultaneously as vertical
lines. The horizontal arrows show the messages that are being transferred from one live
process or object to another. These messages are given in the order that they are exchanged
from the top to the bottom of the sequence diagram. These sequence diagrams are also called
event diagrams or event scenarios as they show the various events that occur in the system in
the proper order. The Figure 5.5 below shows the sequence diagram for our question
answering system.
Figure 5.5: Sequence diagram
Chapter 6
SYSTEM STUDY
Chapter 7
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to
discover every likely fault or weakness in the system. It offers a way to inspect the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising the software with the objective of ensuring that the software system
meets its requirements and user expectations and does not fail in an unacceptable manner.
There are various types of tests, each addressing a specific testing requirement. [18]
Chapter 8
RESULTS AND DISCUSSIONS
Figure 8.6 depicts a question-answer pair wherein the system has to summarize to
identify the gist of the input text to answer the question.
Figure 8.7: Answers with listings
Figure 8.7 depicts a question-answer pair wherein the system has to list out certain
names to answer the question.
Figure 8.11 depicts a question-answer pair wherein the system has to summarize to identify
the gist of the input text to answer the question.
Figure 8.12: Answer based on numerical context
Figure 8.12 depicts a question-answer pair wherein the system has to compare
numerical values to answer the question.
Chapter 9
9.1 Conclusion
The system developed effectively presents a question answering system which provides
answers to the given question using the user document fed into the system. When using a
paragraph-level question answering model across multiple paragraphs, the training method
of sampling non-answer-containing paragraphs while using a shared-norm objective function
can be very beneficial. Combining this with our suggestions for paragraph selection, using
the summed training objective, and our model design allows us to make a large stride forward
on SQuAD. The system can be applied and extended in the following ways:
1. Using our Question Answering System to sift through thousands of documents and
enable academic researchers and legal workers to retrieve the information they require
without having a massive time overhead.
2. The Question Answering System also finds its way in medical transcripts, where
spoken language is converted to written language and later analyzed.
3. Businesses can deploy question answering system in chat bots for 24x7 customer
support.
4. Support for various other languages can also be implemented in the system.
References