Question-Answering System

Chapter 1
INTRODUCTION

1.1 Overview
Text is a prominent and direct source of information. Detecting, extracting and
recognizing the context of the text present in a given paragraph is therefore an important task,
and an attempt to develop a computer system with the ability to automatically read and infer
from a given text is an immense necessity. Text content lexically embedded in complex
contextualization can be analysed so as to obtain specific patterns or sequences that cannot
easily be found by an average human reader. Generally, users feel that it is a very tedious and
time-consuming job to sift through many documents.

In the current world scenario, data is generated at a rate that is incomprehensible
to the common man. This is where the art and science of Machine Learning and Artificial
Intelligence come into the picture. Summarization and context recognition play a key role in
this process.

With the rise of free and open-source technologies, the computing world has been
lifted to new heights. In the present world, people from various communities interact in a
multicultural environment to develop solutions for man's never-ending problems and needs.
One of the notable contributions of the open-source community to the scientific world is
Python. [1]

The exponential growth of available electronic data is almost useless without
efficient tools to retrieve the right information at the right time. This is especially crucial in
the context of decision making (e.g. for politicians), innovative development (e.g. for
scientists and industrialists) or economic development (e.g. for market or competition studies).
It is now widely acknowledged that information retrieval systems (IRS) need to take
semantics into account. Academics and researchers benefit greatly from using automatic text
summarization systems as a tool to lessen the amount of time spent manually extracting the
chief ideas from large documents. Many semantic-based methodologies have been designed
to efficiently retrieve and exploit information. Some of them, based
on terminologies, are fitted to open contexts, dealing with heterogeneous and unstructured
data, while others, based on taxonomies or ontologies, are semantically richer but require a
formal knowledge representation of the studied domain.
The aim is to develop a Question-Answer System which can be deployed in
industries that generate a lot of textual data, to help answer specific questions based on the
text data being provided. On a much larger scale, this concept would help crunch entire
documents or books while forming a back-end analysis. This analysis would then serve as the
base for the front-end Question-Answer System. Systems like these would help improve the
efficiency of certain phases of a large-scale automation process and thus reduce the existing
requirement of resources. [1]

1.2 Scope
The Question-Answer System has become a very important concept and technology in our
daily life because of the ever-increasing number of documents and amount of informational
content, which makes it impossible for humans to fully manage and monitor them. There are
many fields where humans face challenges such as monitoring documents, getting the abstract
or gist of some particular data, etc. However, the question-answer system is a very challenging
technology to put into practice due to the diversity of the information formats of different
documents, different meanings and contexts, and also due to the prevalence of non-uniform
conditions during the process of document acquisition. This project mainly introduces an
Automatic Question-Answer System which can be deployed in industries that generate a lot of
textual data and help answer specific questions based on the text data being provided. This can
help save a lot of time and resources which were earlier spent on manual parsing.

1.3 Assumptions
The objective of our project is to process a large amount of natural language data to
analyze the context and answer questions based on the given context. Our application is
based on the following assumptions:
1. The text must follow standard NLP rules.
2. The input must not contain any non-textual data (image, audio, etc.).
3. Emoticons or Emoji must not be used amidst the text.
4. The file format must be a text file (.txt).

1.4 Existing System


There are numerous NLP systems available for use today. These systems are
based on different methodologies, but the process of context recognition remains a really
challenging task, which leads to a lower performance rate and makes such systems difficult to
implement. [3]

Rule-based approaches are the oldest approaches to NLP. These methods tend to
focus on pattern matching or parsing: they can achieve high performance in the specific use
cases they are designed for, but often suffer performance degradation when generalized.

"Traditional" machine learning approaches include probabilistic modelling,


likelihood maximization, and linear classifiers. This is good for sequence labeling (using
probabilistic modeling), some ideas in neural networks are very similar to earlier methods
(word2vec similar in concept to distributional semantic methods) and use methods from
traditional approaches to improve neural network approaches (for example, word alignments
and attention mechanisms are similar).

Importantly, both neural network and non-neural network approaches can be useful
for contemporary NLP in their own right; they can also be used or studied in tandem for
maximum potential benefit.

1.5 Proposed System


A Question-Answer System can be deployed in industries that generate a lot of textual
data and help answer specific questions based on the text data being provided. Question-
Answer systems are generally divided into five sub-steps:
1. Text preprocessing
2. Text parsing
3. Training with SQuAD
4. Feature engineering
5. Modeling / pattern mining


The first step is to obtain clean and usable data from an unstructured format using various
techniques such as tokenization, lemmatization and other data exploratory techniques
(text normalization). The information obtained from the data input is represented in the form of
a graph, which makes it easier for the programmer to understand the underlying context of the
textual data. Since the computer cannot recognize the graphical representation directly, we need
to convert this graph into a machine-appropriate version. Our paragraph selection method chooses
the paragraph that has the smallest TF-IDF cosine distance to the question. After
initializing the model and the document, the text must go through a pre-processing stage.
Then, a TensorFlow session is built that encodes the various paragraph pairs into numpy arrays.
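As an illustration of the paragraph-selection step described above, the following sketch ranks
paragraphs by TF-IDF cosine similarity to the question using scikit-learn. The function and
variable names are placeholders for illustration, not the project's actual code.

    # Illustrative sketch: pick the paragraph with the smallest TF-IDF cosine
    # distance (i.e., the largest cosine similarity) to the question.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def select_paragraph(paragraphs, question):
        vectorizer = TfidfVectorizer(stop_words="english")
        # Fit on the paragraphs and the question together so they share a vocabulary.
        matrix = vectorizer.fit_transform(list(paragraphs) + [question])
        paragraph_vecs, question_vec = matrix[:-1], matrix[-1]
        similarities = cosine_similarity(paragraph_vecs, question_vec).ravel()
        return paragraphs[similarities.argmax()]

    if __name__ == "__main__":
        docs = ["The cat sat on the mat.", "TF-IDF weighs rare terms more heavily."]
        print(select_paragraph(docs, "How does TF-IDF weight terms?"))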

The next task is to come up with an appropriate neural network architecture. This is an
important task, as the selected architecture affects the overall performance of the system in
terms of speed, accuracy and efficiency. The knowledge graph in OWL or RDF format forms the
base from which essential or critical attributes can be extracted using various feature
engineering techniques. A convolutional neural network fits best in scenarios that involve
image processing, but we will be making use of a recurrent neural network, as it feeds back
information from previous inputs and updates the weights so as to improve the classification of
the original sentence. The above steps, when put together, help us obtain the optimum response
and also structure it in a grammatically correct format.
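For concreteness, the sketch below shows one plausible shape of such a recurrent reader in
PyTorch: a bidirectional GRU that scores candidate answer-span start and end positions. The
class name and layer sizes are illustrative assumptions, not the exact architecture used in this
project.

    # A minimal, illustrative recurrent reader that scores answer-span boundaries.
    import torch
    import torch.nn as nn

    class RecurrentReader(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # A bidirectional GRU re-reads the sequence in both directions.
            self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.start_scorer = nn.Linear(2 * hidden_dim, 1)  # score for answer start
            self.end_scorer = nn.Linear(2 * hidden_dim, 1)    # score for answer end

        def forward(self, token_ids):
            states, _ = self.encoder(self.embed(token_ids))       # (batch, seq, 2 * hidden)
            start_logits = self.start_scorer(states).squeeze(-1)  # (batch, seq)
            end_logits = self.end_scorer(states).squeeze(-1)
            return start_logits, end_logits

    # Score a toy "paragraph" of 12 random token ids.
    model = RecurrentReader(vocab_size=5000)
    start, end = model(torch.randint(0, 5000, (1, 12)))
    print(start.shape, end.shape)  # torch.Size([1, 12]) torch.Size([1, 12])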

1.6 Problem Statement


We will be making use of various tools, techniques, and algorithms to
process a large amount of natural language data and to analyse the context of the data which
is fed into the program. This data is assumed to be unstructured and so requires a large
amount of preprocessing. A structured graph should be constructed to represent the entities
and the relationships between them. Thus, using the processed data and analysis, the system
must be capable of providing a grammatically structured response to the questions posed by
the user, within the context of the data that was initially fed as input.


Chapter 2
LITERATURE SURVEY

2.1 Semantic Parsing via Staged Query Graph Generation

In this paper, a semantic parser framework is created which aims to:

1. Map a natural language question to a logical-form query, which can be executed against a
knowledge base to retrieve answers.
2. Reduce semantic parsing to the generation of a query graph that resembles subgraphs of the
knowledge base, formulated as a staged search problem.
3. Apply an advanced entity linking system and a deep CNN model that matches questions and
predicate sequences.

This paper proposes a semantic parsing framework for question answering using a
knowledge base. It defines a query graph that resembles subgraphs of the knowledge base
and can be directly mapped to a logical form. Semantic parsing is reduced to query graph
generation, formulated as a staged search problem. Unlike traditional approaches, this
method leverages the knowledge base in an early stage to prune the search space and thus
simplifies the semantic matching problem. By applying an advanced entity linking system
and a deep convolutional neural network model that matches questions and predicate
sequences, the system outperforms previous methods substantially, and achieves an F1
measure of 52.5% on the WEBQUESTIONS dataset.

Organizing the world’s facts and storing them in a structured database, large-scale
knowledge bases (KB) like DBPedia (Auer et al., 2007) and Freebase (Bollacker et al., 2008)
have become important resources for supporting open-domain question answering (QA).
Most state-of-the-art approaches to KB-QA are based on semantic parsing, where a question
(utterance) is mapped to its formal meaning representation (e.g., logical form) and then
translated to a KB query. The answers to the question can then be retrieved simply by
executing the query. The semantic parse also provides a deeper understanding of the
question, which can be used to justify the answer to users, as well as to provide easily
interpretable information to developers for error analysis.


However, most traditional approaches for semantic parsing are largely
decoupled from the knowledge base, and thus are faced with several challenges when
adapted to applications like QA. We first define a query graph that can be
straightforwardly mapped to a logical form in λ-calculus and is semantically closely
related to λ-DCS (Liang, 2013). Semantic parsing is then reduced to query graph
generation, formulated as a search problem with staged states and actions. Each state is a
candidate parse in the query graph representation and each action defines a way to grow
the graph. The representation power of the semantic parse is thus controlled by the set of
legitimate actions applicable to each state. In particular, we stage the actions into three
main steps: locating the topic entity in the question, finding the main relationship between
the answer and the topic entity, and expanding the query graph with additional constraints
that describe properties the answer needs to have, or relationships between the answer and
other entities in the question.

The main focus of this work was to develop a semantic parsing framework that
maps a natural language question to a logical form query, which can be executed against a
knowledge base to retrieve the answers. Question-answer pairs were taken from the
WEBQUESTIONS dataset and the model was evaluated against this dataset. It was
demonstrated that the model could retrieve the right answers with an F1 measure of 52.5%.
This approach resulted in a model that performed much better than previous approaches.
The advanced entity linking system played a significant role in improving the
performance of the system as it helped reduce the number of entities suggested by half.

This paper presented a semantic parsing framework for question answering using a
knowledge base. A query graph is defined as the meaning representation that can be directly
mapped to a logical form. Semantic parsing is reduced to query graph generation, formulated
as a staged search problem. With the help of an advanced entity linking system and a deep
convolutional neural network model that matches questions and predicate sequences, the
system outperforms previous methods substantially on the WEBQUESTIONS dataset. [24]


2.2 The NarrativeQA Reading Comprehension Challenge

Reading comprehension (RC), in contrast to information retrieval, requires integrating
information and reasoning about events, entities, and their relations across a full document.
Question answering is conventionally used to assess RC ability, in both artificial agents and
children learning to read. However, existing RC datasets and tasks are dominated by
questions that can be solved by selecting answers using superficial information (e.g., local
context similarity or global term frequency); they thus fail to test for the essential integrative
aspect of RC. To encourage progress on deeper comprehension of language, we present a
new dataset and set of tasks in which the reader must answer questions about stories by
reading entire books or movie scripts. These tasks are designed so that successfully
answering their questions requires understanding the underlying narrative rather than relying
on shallow pattern matching or salience. We show that although humans solve the tasks
easily, standard RC models struggle on the tasks presented here. We provide an analysis of
the dataset and the challenges it presents.

Natural language understanding seeks to create models that read and comprehend
text. A common strategy for assessing the language understanding capabilities of
comprehension models is to demonstrate that they can answer questions about documents
they read, akin to how reading comprehension is tested in children when they are learning to
read. After reading a document, a reader usually cannot reproduce the entire text from
memory, but often can answer questions about underlying narrative elements of the
document: the salient entities, events, places, and the relations between them. Thus, testing
understanding requires creation of questions that examine high-level abstractions instead of
just facts occurring in one sentence at a time.

We have introduced a new dataset and a set of tasks for training and evaluating
reading comprehension systems, born from an analysis of the limitations of existing datasets
and tasks. While our QA task resembles tasks provided by existing datasets, it exposes new
challenges because of its domain: fiction. Fictional stories, in contrast to news stories, are self-
contained and describe a richer set of entities, events, and the relations between them. We have
a range of tasks, from simple (which requires models to read summaries of books and movie
scripts, and generate or rank fluent English answers to human-generated questions) to more
complex (which requires models to read the full stories to answer the questions, with no
access to the summaries). [21]

2.3 Natural Language Processing in Information Retrieval

Many Natural Language Processing (NLP) techniques, including stemming, part-of-speech
tagging, compound recognition, de-compounding, chunking, word sense
disambiguation and others, have been used in Information Retrieval (IR). The core IR task we
are investigating here is document retrieval. Several other IR tasks use very similar
techniques, e.g. document clustering, filtering, new event detection, and link detection, and
they can be combined with NLP in a way similar to document retrieval.

In most cases, researchers work on using existing NLP components (stemmers,
taggers, etc.), apply them to an IR data set and queries, and then use standard IR techniques. This
out-of-the-box use of NLP components that are not geared towards IR might be one reason
why NLP techniques are only moderately successful when compared to state-of-the-art non-NLP
retrieval techniques. The moderate success contradicts the intuition that NLP should
help IR, which is shared by a large number of researchers.

Overall, we see a modest benefit of NLP techniques in IR. However, this benefit
comes with large computational costs, and non-NLP techniques tend to yield greater
improvements. Small positive effects often seem to be a superposition of positive and
negative effects. Automatically separating positive and negative instances would help a lot.
Such a separation would require a joint focus on NLP and retrieval, rather than building an NLP
system and then applying it to retrieval more or less as a black box. Processing techniques that
were developed directly for information retrieval tend to be more successful than techniques
that were developed independently based on linguistics. The Porter stemming algorithm is
very fast and tailored for normalization in retrieval systems. Similarly, statistical “phrases” as
investigated in the retrieval community may collide with linguistic knowledge, but they are
optimized for the retrieval task and are therefore successful. [23]


2.4 Preprocessing Techniques for Text Mining

Preprocessing is an important task and critical step in text mining, Natural Language
Processing (NLP) and Information Retrieval (IR). In the area of text mining, data
preprocessing is used for extracting interesting, non-trivial knowledge from
unstructured text data. Information Retrieval (IR) is essentially a matter of deciding which
documents in a collection should be retrieved to satisfy a user's need for information. The
user's need for information is represented by a query or profile, and contains one or more
search terms, plus some additional information such as the weight of the words. Hence, the
retrieval decision is made by comparing the terms of the query with the index terms
(important words or phrases) appearing in the document itself. The decision may be binary
(retrieve/reject), or it may involve estimating the degree of relevance that the document has to
the query. Unfortunately, the words that appear in documents and in queries often have many
structural variants. So, before information is retrieved from the documents, data
preprocessing techniques are applied to the target data set to reduce its size, which
increases the effectiveness of the IR system. The objective of this study is to analyze the
issues of preprocessing methods such as tokenization, stop word removal and stemming for
text documents.
The need for text processing in an NLP system:
1. To reduce the indexing file size of the text documents by removing stop words, which
account for 20-30% of total word counts, and also by stemming.
2. To improve the efficiency and effectiveness of the Information Retrieval system. [22]


Chapter 3
SYSTEM REQUIREMENTS

System requirements are the configuration that a system must have for a hardware or
software application to run easily and proficiently. If these requirements are not satisfied,
they can lead to installation or performance problems. Installation problems may prevent a
device or an application from getting installed. Performance problems may cause a product to
malfunction or perform below expectation or even hang or crash. [5]

3.1 Hardware Requirements


The selection of the hardware configuration is an important task related to software
development; insufficient random access memory may adversely affect the speed and
efficiency of the entire system. The processor should be powerful enough to handle all the
operations, and the hard disk should have sufficient capacity to store the files and the application.

Processor : Intel Core i5 or better
RAM : 4 GB or more
Hard disk : 10 GB or more (depending on input file size)
Peripherals : Keyboard, compatible mouse
Cache Memory : L2, 1 MB
GPU : Intel HD Graphics, or an Nvidia chip for better performance
Monitor Resolution : 1024*768, 1366*768 or 1280*1024

3.2 Software Requirements

A major element in building a system is the selection of compatible software, since the
software available in the market is growing in geometric progression. The selected software
should be acceptable to the firm and the end user, as well as feasible for the system.

This document gives a detailed description of the software requirement
specification. The study of the requirement specification is focused especially on the functioning
of the system. It allows the developer or analyst to understand the system, the functions to be
carried out, the performance level to be obtained and the corresponding interfaces to be
established. [5]

3.2.1 NLTK (Natural Language Toolkit)


The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and
programs for symbolic and statistical natural language processing (NLP) for English written
in the Python programming language. It was developed by Steven Bird and Edward Loper in
the Department of Computer and Information Science at the University of Pennsylvania.
NLTK includes graphical demonstrations and sample data. It is accompanied by a book that
explains the underlying concepts behind the language processing tasks supported by the
toolkit, plus a cookbook.

NLTK is intended to support research and teaching in NLP or closely related areas,
including empirical linguistics, cognitive science, artificial intelligence, information retrieval,
and machine learning. NLTK has been used successfully as a teaching tool, as an individual
study tool, and as a platform for prototyping and building research systems. There are 32
universities in the US and 25 countries using NLTK in their courses. NLTK supports
classification, tokenization, stemming, tagging, parsing, and semantic reasoning
functionalities.
The Natural Language Toolkit (NLTK) is a platform used for building Python
programs that work with human language data, for application in statistical natural language
processing (NLP).

It contains text processing libraries for tokenization, parsing, classification, stemming,


tagging and semantic reasoning. It also includes graphical demonstrations and sample data
sets, and is accompanied by a cookbook and a book which explain the principles behind
the underlying language processing tasks that NLTK supports.

NLTK includes more than 50 corpora and lexical sources such as the Penn Treebank
Corpus, Open Multilingual Wordnet, Problem Report Corpus, and Lin’s Dependency
Thesaurus. [6]
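As a brief illustration of the toolkit, the snippet below tokenizes text and tags parts of speech
with NLTK; it assumes the punkt and averaged_perceptron_tagger resources have been
downloaded as shown.

    # Illustrative NLTK usage: sentence/word tokenization and POS tagging.
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    text = "NLTK supports tokenization, tagging and parsing. It ships with many corpora."
    sentences = nltk.sent_tokenize(text)      # split the text into sentences
    words = nltk.word_tokenize(sentences[0])  # split a sentence into word tokens
    print(words)
    print(nltk.pos_tag(words))                # part-of-speech tag for each token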


Figure 3.1: NLTK Hierarchy


3.2.2 TensorFlow
TensorFlow is a free and open-source software library for dataflow and
differentiable programming across a range of tasks. It is a symbolic math library, and is also
used for machine learning applications such as neural networks. It is used for both research
and production. It is a computational framework for building machine learning models.

TensorFlow provides a variety of different toolkits that allow you to construct
models at your preferred level of abstraction. You can use lower-level APIs to build models
by defining a series of mathematical operations. Alternatively, you can use higher-level APIs
(like tf.estimator) to specify predefined architectures, such as linear regressors or neural
networks. [25]

Figure 3.2: TensorFlow toolkit hierarchy


Tensor

TensorFlow's name is directly derived from its core framework: the tensor. In
TensorFlow, all the computations involve tensors. A tensor is a vector or matrix of n-
dimensions that represents all types of data. All values in a tensor hold identical data type
with a known (or partially known) shape. The shape of the data is the dimensionality of the
matrix or array. [21]

A tensor can originate from the input data or the result of a computation. In
TensorFlow, all the operations are conducted inside a graph. The graph is a set of
computation that takes place successively. Each operation is called an op node and is
connected to each other.

The graph outlines the ops and connections between the nodes. However, it does not
display the values. The edge of the nodes is the tensor, i.e., a way to populate the operation
with data.

Graphs

TensorFlow makes use of a graph framework. The graph gathers and describes all
the series of computations done during the training. The graph has several advantages:

 It was designed to run on multiple CPUs or GPUs and even on mobile operating systems.
 The portability of the graph allows preserving the computations for immediate or later
use. The graph can be saved to be executed in the future.
 All the computations in the graph are done by connecting tensors together.
 A tensor has a node and an edge. The node carries the mathematical operation and
produces endpoint outputs. The edges explain the input/output relationships between nodes.
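A minimal sketch of this graph-and-session workflow is shown below, using the TensorFlow
1.x-style API (exposed as tf.compat.v1 in TensorFlow 2.x); the placeholder names are
illustrative only.

    # Build a tiny dataflow graph and run it inside a session.
    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()

    # Two placeholder tensors and an op node that adds them.
    a = tf.placeholder(tf.float32, name="a")
    b = tf.placeholder(tf.float32, name="b")
    total = tf.add(a, b, name="total")

    # The session populates the graph with data and executes it.
    with tf.Session() as sess:
        print(sess.run(total, feed_dict={a: 3.0, b: 4.0}))  # 7.0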

High Level APIs


 Keras, TensorFlow's high-level API for building and training deep learning models.
 Eager Execution, an API for writing TensorFlow code imperatively, like you would
use Numpy.

 Importing Data, easy input pipelines to bring your data into your TensorFlow
program.
 Estimators, a high-level API that provides fully-packaged models ready for large-
scale training and production.

Low Level APIs


 Introduction, which introduces the basics of how you can use TensorFlow outside of
the high Level APIs.
 Tensors, which explains how to create, manipulate, and access Tensors--the
fundamental object in TensorFlow.
 Variables, which details how to represent shared, persistent state in your program.
 Graphs and Sessions, which explains:
o dataflow graphs, which are TensorFlow's representation of
computations as dependencies between operations.
o sessions, which are TensorFlow's mechanism for running dataflow
graphs across one or more local or remote devices. If you are
programming with the low-level TensorFlow API, this unit is
essential. If you are programming with a high-level TensorFlow API
such as Estimators or Keras, the high-level API creates and manages
graphs and sessions for you, but understanding graphs and sessions
can still be helpful.
 Save and Restore, which explains how to save and restore variables and models.
 Ragged Tensors, which explains how to use Ragged Tensors to encode nested
variable-length lists.

3.2.3 SciPy
SciPy is a free and open-source Python library used for scientific computing and
technical computing. SciPy contains modules for optimization, linear algebra, integration,
interpolation, special functions, FFT, signal and image processing, ODE solvers and other
tasks common in science and engineering.

SciPy builds on the NumPy array object and is part of the NumPy stack which
includes tools like Matplotlib, pandas and SymPy, and an expanding set of scientific
computing libraries. This NumPy stack has similar users to other applications such as
MATLAB, GNU Octave, and Scilab. The NumPy stack is also sometimes referred to as the
SciPy stack.

The SciPy library is currently distributed under the BSD license, and its
development is sponsored and supported by an open community of developers. It is also
supported by NumFOCUS, a community foundation for supporting reproducible and
accessible science. [8]

Available sub-packages include:

 constants: physical constants and conversion factors (since version 0.7.0)
 cluster: hierarchical clustering, vector quantization, K-means
 fftpack: Discrete Fourier Transform algorithms
 integrate: numerical integration routines
 interpolate: interpolation tools
 io: data input and output
 lib: Python wrappers to external libraries
 linalg: linear algebra routines
 misc: miscellaneous utilities (e.g. image reading/writing)
 ndimage: various functions for multi-dimensional image processing
 optimize: optimization algorithms including linear programming
 signal: signal processing tools
 sparse: sparse matrix and related algorithms
 spatial: KD-trees, nearest neighbors, distance functions
 special: special functions
 stats: statistical functions
 weave: tool for writing C/C++ code as Python multiline strings
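As a small illustration, the snippet below exercises two of the sub-packages listed above,
scipy.linalg and scipy.optimize.

    # Solve a linear system with scipy.linalg and minimise a function with scipy.optimize.
    import numpy as np
    from scipy import linalg, optimize

    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([9.0, 8.0])
    print(linalg.solve(A, b))  # [2. 3.]

    result = optimize.minimize_scalar(lambda v: (v - 1.0) ** 2)
    print(result.x)            # approximately 1.0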

The basic data structure used by SciPy is a multidimensional array provided by the
NumPy module. NumPy provides some functions for linear algebra, Fourier transforms, and
random number generation, but not with the generality of the equivalent functions in SciPy.
NumPy can also be used as an efficient multidimensional container of data with arbitrary
datatypes. This allows NumPy to seamlessly and speedily integrate with a wide variety of

Dept. of CSE, JSSATE Page 15


Question-Answering System

databases. Older versions of SciPy used Numeric as an array type, which is now deprecated
in favor of the newer NumPy array code.

3.2.4 Numpy

NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was
originally created by Jim Hugunin with contributions from several other developers. In 2005,
Travis Oliphant created NumPy by incorporating features of the competing Numarray into
Numeric, with extensive modifications. NumPy is open-source software and has many
contributors.

NumPy targets the CPython reference implementation of Python, which is a non-optimizing
bytecode interpreter. Mathematical algorithms written for this version of Python
often run much slower than compiled equivalents. NumPy addresses the slowness problem
partly by providing multidimensional arrays and functions and operators that operate
efficiently on arrays, requiring rewriting some code, mostly inner loops using NumPy.

Python bindings of the widely used computer vision library OpenCV utilize NumPy
arrays to store and operate on data. Since images with multiple channels are simply
represented as three-dimensional arrays, indexing, slicing or masking with other arrays are
very efficient ways to access specific pixels of an image. Using the NumPy array as the universal
data structure in OpenCV for images, extracted feature points, filter kernels and much more vastly
simplifies the programming workflow and debugging. [7]

The ndarray data structure

The core functionality of NumPy is its "ndarray", for n-dimensional array, data
structure. These arrays are strided views on memory. In contrast to Python's built-in list data
structure (which, despite the name, is a dynamic array), these arrays are homogeneously
typed: all elements of a single array must be of the same type.


Such arrays can also be views into memory buffers allocated by C/C++, Cython, and Fortran
extensions to the CPython interpreter without the need to copy data around, giving a degree
of compatibility with existing numerical libraries. This functionality is exploited by the SciPy
package, which wraps a number of such libraries (notably BLAS and LAPACK). NumPy has
built-in support for memory-mapped ndarrays.

Limitations

Inserting or appending entries to an array is not as trivially possible as it is with
Python's lists. The np.pad(...) routine to extend arrays actually creates new arrays of the
desired shape and padding values, copies the given array into the new one and returns it.
NumPy's np.concatenate([a1,a2]) operation does not actually link the two arrays but returns a
new one, filled with the entries from both given arrays in sequence. Reshaping the
dimensionality of an array with np.reshape(...) is only possible as long as the number of
elements in the array does not change. These circumstances originate from the fact that
NumPy's arrays must be views on contiguous memory buffers. A replacement package called
Blaze attempts to overcome this limitation.
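The behaviour described above can be seen in a short example: np.concatenate and np.pad
return new arrays rather than extending the originals, and reshape only works when the element
count is preserved.

    import numpy as np

    a1 = np.array([1, 2, 3])
    a2 = np.array([4, 5, 6])

    joined = np.concatenate([a1, a2])               # new array [1 2 3 4 5 6]; a1, a2 unchanged
    grid = joined.reshape(2, 3)                     # allowed because 2 * 3 == joined.size
    padded = np.pad(a1, (1, 1), constant_values=0)  # new padded array [0 1 2 3 0]
    print(joined, grid, padded, sep="\n")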

Algorithms that are not expressible as a vectorized operation will typically run slowly
because they must be implemented in "pure Python", while vectorization may increase
memory complexity of some operations from constant to linear, because temporary arrays
must be created that are as large as the inputs. Runtime compilation of numerical code has
been implemented by several groups to avoid these problems; open source solutions that
interoperate with NumPy include scipy.weave, numexpr and Numba. Cython and Pythran are
static-compiling alternatives to these.

3.2.5 Scikit-learn

Scikit-learn (formerly scikits.learn) is a free software machine learning library for
the Python programming language. It features various classification, regression and clustering
algorithms including support vector machines, random forests, gradient boosting, k-means
and DBSCAN, and is designed to interoperate with the Python numerical and scientific
libraries NumPy and SciPy.


Scikit-learn is largely written in Python, with some core algorithms written in
Cython to achieve performance. Support vector machines are implemented by a Cython
wrapper around LIBSVM; logistic regression and linear support vector machines by a similar
wrapper around LIBLINEAR. [8]

3.2.6 Pandas

Pandas is an open-source Python library providing high-performance data
manipulation and analysis tools using its powerful data structures. The name Pandas is
derived from the term Panel Data – an econometrics term for multidimensional data.

In 2008, developer Wes McKinney started developing pandas when he needed a
high-performance, flexible tool for the analysis of data.

Prior to Pandas, Python was majorly used for data munging and preparation. It had very
little contribution towards data analysis. Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin
of data — load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of fields including academic and commercial
domains including finance, economics, Statistics, analytics, etc. [9]

Key Features of Pandas

 Fast and efficient DataFrame object with default and customized indexing.

 Tools for loading data into in-memory data objects from different file formats.

 Data alignment and integrated handling of missing data.

 Reshaping and pivoting of data sets.

 Label-based slicing, indexing and subsetting of large data sets.

 Columns from a data structure can be deleted or inserted.

 Group by data for aggregation and transformations.

 High performance merging and joining of data.

 Time Series functionality.


Dimension & Description


The best way to think of these data structures is that the higher dimensional data structure is
a container of its lower dimensional data structure. For example, DataFrame is a container of
Series, Panel is a container of DataFrame.

Data Structure    Dimensions    Description

Series            1             1D labeled homogeneous array, size-immutable.

Data Frames       2             General 2D labeled, size-mutable tabular structure with
                                potentially heterogeneously typed columns.

Panel             3             General 3D labeled, size-mutable array.

Building and handling arrays of two or more dimensions is a tedious task, since the burden is
placed on the user to consider the orientation of the data set when writing functions. But using
Pandas data structures, the mental effort of the user is reduced.

For example, with tabular data (DataFrame) it is more semantically helpful to think of
the index (the rows) and the columns rather than axis 0 and axis 1.

Mutability
All Pandas data structures are value mutable (can be changed) and except Series all are size
mutable. Series is size immutable. DataFrame is widely used and one of the most important
data structures. Panel is used much less.

Series
Series is a one-dimensional array-like structure with homogeneous data. For example, the
following series is a collection of integers 10, 23, 56, …

10 23 56 17 52 61 73 90 26


Key Points

 Homogeneous data

 Size Immutable

 Values of Data Mutable

DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For example,

Name Age Gender Rating

Steve 32 Male 3.45

Lia 28 Female 4.6

Vin 45 Male 3.9

Katie 38 Female 2.78

The table represents the data of a sales team of an organization with their overall
performance rating. The data is represented in rows and columns. Each column represents an
attribute and each row represents a person.
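The sales-team table above can be built directly as a pandas DataFrame; the short example
below also constructs a Series, mirroring the structures described in this section.

    import pandas as pd

    ratings = pd.Series([3.45, 4.6, 3.9, 2.78], name="Rating")  # 1D, homogeneous

    team = pd.DataFrame({
        "Name": ["Steve", "Lia", "Vin", "Katie"],
        "Age": [32, 28, 45, 38],
        "Gender": ["Male", "Female", "Male", "Female"],
        "Rating": [3.45, 4.6, 3.9, 2.78],
    })  # 2D, heterogeneously typed columns

    print(team.dtypes)               # per-column data types
    print(team[team["Rating"] > 3])  # boolean (label-based) row selection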

Data Type of Columns


The data types of the four columns are as follows −

Column Type

Name String

Age Integer

Gender String

Rating Float

Key Points

 Heterogeneous data

 Size Mutable

 Data Mutable

Panel
Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent
the panel in graphical representation. But a panel can be illustrated as a container of
DataFrame.

Key Points

 Heterogeneous data

 Size Mutable

 Data Mutable

3.2.7 PyTorch

PyTorch is an open-source machine learning library for Python, based on Torch,
used for applications such as natural language processing. It is primarily developed by
Facebook's artificial-intelligence research group, and Uber's "Pyro" Probabilistic
programming language software is built on it. [9]

PyTorch provides two high-level features:

 Tensor computation (like NumPy) with strong GPU acceleration
 Deep neural networks built on a tape-based autodiff system


In terms of programming, Tensors can simply be considered multidimensional arrays.
Tensors in PyTorch are similar to NumPy arrays, with the addition being that Tensors can
also be used on a GPU that supports CUDA. PyTorch supports various types of Tensors.

Autograd Module
PyTorch uses a technique called automatic differentiation. A recorder records what
operations have been performed, and then replays them backward to compute the gradients. This
technique is especially powerful when building neural networks, as it saves time in each
epoch by calculating the differentiation of the parameters during the forward pass itself.
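A minimal example of this recording-and-replay behaviour is shown below: the gradient of
y = x^2 + 3x is computed automatically by calling backward().

    import torch

    x = torch.tensor(2.0, requires_grad=True)  # track operations on x
    y = x ** 2 + 3 * x                         # the forward pass is recorded
    y.backward()                               # replay backward to compute dy/dx
    print(x.grad)                              # tensor(7.) since dy/dx = 2x + 3 = 7 at x = 2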

Optim Module
torch.optim is a module that implements various optimization algorithms used for
building neural networks. Most of the commonly used methods are already supported, so
there is no need to build them from scratch.

nn Module
PyTorch autograd makes it easy to define computational graphs and take gradients,
but raw autograd can be a bit too low-level for defining complex neural networks. This is
where the nn module can help.

3.2.8 Pickle

The pickle module implements binary protocols for serializing and de-serializing a
Python object structure. “Pickling” is the process whereby a Python object hierarchy is
converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte
stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or
“flattening”; however, to avoid confusion, the terms used here are “pickling” and
“unpickling”. [9]
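A short illustration of pickling and unpickling is given below; the file name model_params.pkl
is only an example, not a file produced by this project.

    import pickle

    params = {"weights": [0.1, 0.5, -0.3], "vocab_size": 5000}

    with open("model_params.pkl", "wb") as f:
        pickle.dump(params, f)        # serialize the object hierarchy to a byte stream

    with open("model_params.pkl", "rb") as f:
        restored = pickle.load(f)     # reconstruct the object hierarchy
    print(restored == params)         # True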


Data stream format

The data format used by pickle is Python-specific. This has the advantage that there are no
restrictions imposed by external standards such as JSON or XDR (which can’t represent
pointer sharing); however it means that non-Python programs may not be able to reconstruct
pickled Python objects.

By default, the pickle data format uses a relatively compact binary representation. If you need
optimal size characteristics, you can efficiently compress pickled data. The module
pickletools contains tools for analyzing data streams generated by pickle. pickletools source
code has extensive comments about opcodes used by pickle protocols.


Chapter 4
SYSTEM ARCHITECTURE

4.1 System Structure


The proposed approach divides the whole system into three modules. The first module
deals with taking a document and question as input. The second module performs various
pre-processing techniques on the document. This converts the raw textual data into a
structured format. The third module makes use of a neural network model that applies
confidence-based methods on the data to search for the text with the highest confidence score.
The text with the highest confidence score is displayed as the answer.
The following Figure 4.1 shows the overall system structure. The document along
with a question are fed into the system. The document analyser will select a model specific to
the application. Once the processed text is run on the neural network, the answer is fetched
based on a ranking system. The system then returns the sentence with the highest confidence
score. [10]

Figure 4.1: Overall System Structure


4.2 System Design


Our Question Answering system basically has three modules as shown in Figure 4.2.
They are:
 Document Collection – Captures and stores the input text document for further
processing.
 Natural Language Processing – Pre-processing of textual data to obtain clean and
structured data.
 Answer Generation - Used to generate a semantically correct answer. [11]

Figure 4.2: System Modules


The text processing module is further divided into five sub-modules. They are
briefly described below.

4.2.1 Tokenization
Tokenization is a step which splits longer strings of text into smaller pieces, or
tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized
into words, etc. Further processing is generally performed after a piece of text has been
appropriately tokenized. Tokenization is also referred to as text segmentation or lexical
analysis. Sometimes segmentation is used to refer to the breakdown of a large chunk of text
into pieces larger than words (e.g. paragraphs or sentences), while tokenization is reserved
for the breakdown process which results exclusively in words. [12]

4.2.2 Removing Stop Words


Removing stop words also involves removing punctuation and extra whitespace along with the
default stop words. Stop words are those words which are filtered out before further
processing of text, since these words contribute little to overall meaning, given that they are
generally the most common words in a language. For instance, "the," "and," and "a," while
all required words in a passage, don't generally contribute greatly to one's understanding of
content.

4.2.3 Removing Accented Characters


Usually in any text corpus, the text might contain accented characters/letters,
especially if only the English language is to be analysed. Hence, these characters need to be
converted and standardized into ASCII characters. A simple example — converting é to e.
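A simple sketch of this conversion using Python's standard unicodedata module is shown below.

    import unicodedata

    def remove_accents(text):
        # Decompose characters into base letters plus combining marks, then drop the marks.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

    print(remove_accents("café résumé"))  # "cafe resume"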

4.2.4 Removing Contractions


Contractions are shortened versions of words or syllables. They often exist in either
written or spoken forms in the English language. These shortened versions or contractions of
words are created by removing specific letters and sounds. In case of English contractions,
they are often created by removing one of the vowels from the word. Examples would be, do
not to don’t and I would to I’d. Converting each contraction to its expanded, original form
helps with text standardization. [13]

4.2.5 Stemming and Lemmatization


Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, and
circumfixes) from a word in order to obtain a word stem. For example, the word ‘running’
will be converted to ‘run’ after stemming.
Lemmatization is related to stemming, differing in that lemmatization is able to
capture canonical forms based on a word’s lemma. For example, stemming the word ‘better’
would fail to return its citation form; lemmatization would result in ‘good’.
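The snippet below reproduces the 'running' to 'run' and 'better' to 'good' examples with
NLTK's PorterStemmer and WordNetLemmatizer; it assumes the WordNet corpus has been
downloaded as shown.

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("running"))              # "run"
    print(stemmer.stem("better"))               # "better" (the stem misses the citation form)
    print(lemmatizer.lemmatize("better", "a"))  # "good" (adjective lemma via WordNet)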


4.3 Design Description


Text processing is an integral part of the entire proposed system. Text
processing happens in the six main steps shown in Figure 4.3.

Figure 4.3: Main steps in text processing


4.3.1 Capture
The input document and the questions w.r.t. the document are captured or retrieved
through the command line. The contents of the input document are stored in a text file
locally. Thus, the relative address or the directory of this text file is mentioned explicitly in
the command line. The question, on the other hand, is included directly in the command. [14]

4.3.2 Pre-process
Before the actual recognition and extraction of the context, it is necessary to enhance
the quality of the text content. This will aid in the process of context recognition and
extraction. The pre-processing steps can be defined as tokenization, removal of stop words,
removal of accented characters and contractions, and stemming and lemmatization. After the
implementation of all the above-mentioned steps, one can obtain the cleaned/pre-processed
text.

4.3.3 Localize
This step is responsible for identifying the zone or location wherein the actual context
is present w.r.t. the input question. The paragraphs are marked based on the TF-IDF (Term
frequency-Inverse Document Frequency) values. This score gives the relativity in context
between the specific paragraph and the underlying context of the question being asked.

Hence, these paragraphs are ranked based on their TF-IDF scores and this will help us localize
the region in the document with highest context similarity.

4.3.4 Connected Component Analysis


The next step is to perform the connected component analysis. This is done to
eliminate the unwanted parts of the text. The entire text is traversed first to find all the
connected components. The context has connected components and is hence recognisable.
These connected components are then extracted.

4.3.5 Segment
The extracted components are the parts of the document that are of use to us for
further text processing. Here we apply a specific method where the entire extracted
component is scanned from top to bottom. Then the best span is identified, which gives
us the vector values of the starting and ending word of a sentence with the highest confidence
score.

4.3.6 Context Recognition


The connected components are then sent to a neural network, which has been run
through a pickle file that has all the network parameters stored as a python object. Here each
component is compared with a stored object. Then the resulting sentence is returned. [15]


Chapter 5

IMPLEMENTATION

5.1 Introduction

Implementation is the realization of the technical specification or algorithm as a
program. The implementation phase of any project is the most crucial phase as it is here that
the project finally takes shape. This phase is implemented keeping the end user in mind. It is
the implementation of the project that yields the final solution, which solves the problem in
hand. The implementation phase involves the actual materialization of the ideas, which are
expressed in the analysis document, and the development of the system in a suitable
programming language necessary to achieve the final product. Often a product is ruined due
to the incorrect choice of the programming language or unsuitable method of programming. It
is hence better for the coding phase to be directly linked to the design phase.
The implementation stage consists of proper planning, detailed study of the existing
system and its constraints and designing of any alternative methods and their evaluation. [37]

The implementation of the proposed system involves a sequence of simple steps as given
below:

1. Firstly, generate a pickle file which contains all the necessary weights and other
parameters from the SQuAD dataset using neural networks.

2. Then import the pickle file of the SQuAD dataset as the model into our system.

3. Import document and the question to be answered into the system.

4. Perform all the pre-processing techniques on the document and on the question to get
cleaned text.

5. Now split the paragraphs into sentences and then into words. Rate each paragraph based
on its contextual match to the question.

6. Now build the TensorFlow session, and encode all the weights into numpy arrays.


7. Then generate the answer using the selected model and display it on the user's screen.
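The toy sketch below strings these steps together end to end. The preprocessing, ranking and
answering functions are deliberately simple word-overlap stand-ins for the project's actual
SQuAD-trained neural model, shown only to illustrate the flow of steps 4 to 7.

    import re

    def preprocess(text):
        # Step 4 stand-in: crude cleaning - lowercase and tokenize into word "features".
        return set(re.findall(r"[a-z]+", text.lower()))

    def rank_paragraphs(paragraphs, question):
        # Step 5 stand-in: rate each paragraph by word overlap with the question.
        q = preprocess(question)
        return max(paragraphs, key=lambda p: len(preprocess(p) & q))

    def answer(paragraph, question):
        # Steps 6-7 stand-in: return the best-matching sentence of the chosen paragraph.
        q = preprocess(question)
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        return max(sentences, key=lambda s: len(preprocess(s) & q))

    document = ("Python was created by Guido van Rossum. It is widely used for NLP.\n\n"
                "TensorFlow builds dataflow graphs. Sessions run those graphs.")
    question = "Who created Python?"
    best = rank_paragraphs(document.split("\n\n"), question)
    print(answer(best, question))  # "Python was created by Guido van Rossum."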

5.2 Programming Language Selection

As described earlier, any mistake in selecting the programming language may lead
to the failure of the entire system. Hence, the programming language for any code must be
chosen with proper knowledge about the design of the proposed system. The programming
language chosen is Python. [2]

5.2.1 Python

Python is an object-oriented scripting language which is very easy to learn; hence it
is also commonly called a beginner's language. It is an interpreted and interactive
language. This language was created by Guido van Rossum and its implementation began in
1989 at Centrum Wiskunde & Informatica (CWI), in the Netherlands. Python is developed under
an open-source license which has been approved by the OSI. Hence, this language is free
for use and distribution (even for commercial use). [3]

The main reasons for using Python in our proposed system are:

1. Easy to learn and use.

2. Text processing is easier as Python includes libraries that support text processing
from the basic to advanced level. Many of the functions for text segmentation,
feature extraction, context detection, etc. are implicitly available in the libraries.

3. Python provides interfaces to almost all commercial databases.

4. The Python language is portable and scalable. [4]


5.3 Data Flow Diagram

A flowchart is a diagram which depicts an algorithm and/or process as a sequence.
The various parts of the algorithm are represented in boxes, where there is a different kind of
box for different kinds of statements. The order of the process is shown by arrows connecting
the boxes in the appropriate direction. A data flow diagram depicts the flow of data in the
system in an organized manner.
The following Figure 5.1 shows the Level 0 data flow diagram which is an abstract
data flow diagram. [16]

Figure 5.1: Level 0 dataflow diagram


The Level 1 data flow diagram is given in Figure 5.2. Here the text processing
module is described in detail. Also it denotes the top paragraph in which the probability of
finding the answer is maximum.

Figure 5.2: Level 1 dataflow diagram

5.4 Activity Diagram

An activity diagram is similar to a flowchart but here the flow is represented from one
activity to another activity. An activity is nothing but an operation of the system. This
diagram describes the dynamic behaviour of the system. These are usually constructed using
forward and reverse engineering techniques. The following Figure 5.3 describes the activity
diagram of our proposed system. [17]


Figure 5.3: Activity diagram

5.5 Use Case Diagram

A use case diagram is a simple way of representing the interaction of a user with the
system. It shows the complete relationship between the user and system with the use cases. It
is also known as a behaviour diagram as it depicts the behaviour of the system with the
external actors (users). The following Figure 5.4 shows the use case diagram for our system.
Here the user is considered as the external actor that interacts with the system in order to
retrieve the answer.


Figure 5.4: Use case diagram

5.6 Sequence Diagram

A sequence diagram depicts the active processes that live simultaneously as vertical
lines. The horizontal arrows show the messages that are being transferred from one live
process or object to another. These messages are given in the order that they are exchanged
from the top to the bottom of the sequence diagram. These sequence diagrams are also called
event diagrams or event scenarios as they show the various events that occur in the system in
the proper order. The Figure 5.5 below shows the sequence diagram for our question
answering system.


Figure 5.5: Sequence diagram


Chapter 6
SYSTEM STUDY

6.1 Feasibility Study


In this phase, the feasibility of the project is evaluated and business scheme is put
forth with a general plan for the project and cost estimations. The feasibility study of the
proposed system is to be performed during system analysis. This is to ensure that the
proposed system is not a burden to the company. It evaluates the project’s potential for
success.
The feasibility study summarizes and analyzes several methods to achieve success. It
helps in narrowing the scope of the project and recognizing the best business scenario. The
intention of this study is to identify the probability of one or more solutions satisfying the
specified business requirements. For feasibility analysis, understanding of the major
requirements for the system is necessary. [26]
Three key aspects involved in the feasibility analysis are:
1. Economic Feasibility
2. Technical Feasibility
3. Social Feasibility

6.1.1 Economic Feasibility


This study is carried out to check the economic impact that the system will have on
the organization, that is, the amount of funds that the company can pour into the research and
development of the system. It also serves as an independent project assessment and improves
system reliability by helping decision makers determine the positive economic benefits to the
organization that the proposed system will provide.
The economic feasibility study projects how much start-up capital is needed, sources
of capital, returns on investment, and other financial considerations. It looks at how much
cash is needed, where it will come from, and how it will be spent. Thus the developed system
should be within the budget and this can be achieved because most of the technologies used
are freely available. Only the customized products had to be purchased. Thus, the
expenditures must be justified. [26]


6.1.2 Technical Feasibility


This study is carried out to check the technical feasibility, that is to say the technical
necessities of the system. It also concentrates on acquiring and understanding the modern
technical resources and their applicability to fulfill the needs of the proposed system. It is an
evaluation of the hardware and software potential in meeting the needs of the proposed
system.
Any system developed must not have a high requirement for the available technical
resources which will lead to high demands being placed on the client. The developed system
must have modest requirements, as only negligible or no changes are required for
implementing the system. The technical feasibility study should basically support the
financial statistics of an organization. It calculates the aspects of how you intend to furnish a
product or service to customers. [26]

6.1.3 Social Feasibility


This study is carried out to check the extent of acceptance of the system by the user.
This includes the process of training the user to use the system effectively. The user must not
feel threatened by the system, but must instead accept it as a necessity. A social feasibility
study also covers environmental impacts at the project location and in associated areas,
evaluating the effects of alterations or pollutants on environmental resources.
Social feasibility describes the effect of the new system on users, taking into account whether
the workforce will need to be retrained. The level of acceptance by the users depends largely
on the methods employed to educate them about the system. Their level of confidence must
be raised so that they are able to offer constructive criticism. Social feasibility also describes
how user co-operation will be ensured before changes are introduced. [26]


Chapter 7
SYSTEM TESTING

The purpose of testing is to discover errors. Testing is the process of trying to
uncover every likely fault or weakness in the system. It offers a way to inspect the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising the software with the objective of ensuring that the software system
meets its requirements and user expectations and does not fail in an unacceptable manner.
There are various types of test, each addressing a specific testing requirement. [18]

7.1 Types of testing


The types of testing employed are described in the following sections.

7.1.1 Unit testing


Unit testing involves the design of test cases that validate the internal program logic
and verify that program inputs produce valid outputs. All decision branches and internal code
flow should be verified. It is the testing of individual software units of the application, done
after the completion of a distinct unit and before integration. This is structural testing that
relies on knowledge of the unit's construction and is therefore invasive. Unit tests perform
basic checks at the component level and test a specific application, business process and/or
system configuration. Unit tests ensure that each unique path of a process performs in
accordance with the documented specifications and has clearly defined inputs and expected
results. [19]
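As an illustration of unit testing at the component level, a minimal sketch using Python's built-in unittest module is given below. The tokenize_question helper and its expected behaviour are assumptions made purely for demonstration and are not part of the implemented system.

import unittest

def tokenize_question(text):
    # Hypothetical helper: lower-case a question, drop the trailing '?', split into tokens.
    return text.lower().rstrip("?").split()

class TestTokenizeQuestion(unittest.TestCase):
    def test_valid_input_is_accepted(self):
        # A well-formed question should yield lower-cased tokens without the '?'.
        self.assertEqual(tokenize_question("Who directed the movie?"),
                         ["who", "directed", "the", "movie"])

    def test_empty_input_yields_no_tokens(self):
        # An empty string should produce an empty token list.
        self.assertEqual(tokenize_question(""), [])

if __name__ == "__main__":
    unittest.main()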

7.1.2 Integration testing


Software integration testing is the process of testing two or more integrated software
components on a single platform to uncover failures caused by interface defects. [19]
Integration tests are designed to test combined software components to determine
whether they run correctly as a single program. Integration tests demonstrate that, although
the components were individually satisfactory, as shown by successful unit testing, their
combination is correct and consistent. Integration testing is explicitly aimed at revealing the
problems that arise from the assembly of components. [20]
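As an illustrative sketch, the example below tests two hypothetical components together (a sentence splitter and a sentence ranker) to check that their combination behaves correctly; the component names and behaviour are assumptions for demonstration only.

import unittest

def split_sentences(text):
    # Hypothetical component 1: split a document into lower-cased sentences.
    return [s.strip().lower() for s in text.split(".") if s.strip()]

def rank_by_overlap(sentences, question):
    # Hypothetical component 2: order sentences by word overlap with the question.
    q_words = set(question.lower().rstrip("?").split())
    return sorted(sentences,
                  key=lambda s: len(q_words & set(s.split())),
                  reverse=True)

class TestSplitterAndRankerTogether(unittest.TestCase):
    def test_components_combined(self):
        doc = "The film was directed by Nolan. It was released in 2010."
        ranked = rank_by_overlap(split_sentences(doc), "Who directed the film?")
        # The sentence mentioning the director should be ranked first.
        self.assertIn("directed", ranked[0])

if __name__ == "__main__":
    unittest.main()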


7.1.3 Functional testing


Functional tests offer organized demonstrations that indicate whether the functions
tested are available as stated by the business and technical requirements, user manuals and
system documentation. Functional testing is focused on the following items as shown in
Table 7.1.

Table 7.1: Functional testing items and their corresponding functions

Test Items           | Functions
Valid Input          | identifies classes of valid inputs that must be accepted
Invalid Input        | identifies classes of invalid inputs that must be rejected
Functions            | identifies functions to be applied
Output               | identifies classes of outputs that must be executed
Systems/Procedures   | identifies interfacing systems or procedures that must be invoked

Organization and preparation of functional tests is focused on requirements, key
functions, or special test cases. In addition, business process flows, data fields, predefined
processes, and successive processes must be considered for systematic test coverage. Before
functional testing is complete, additional tests are identified and the effective value of the
current tests is determined. [20]

7.1.4 System Testing


System testing guarantees that the complete integrated software system meets the
requirements. It tests a configuration to ensure known and predictable results. One example
of system testing is the configuration-oriented system integration test. System testing is
based on process descriptions and flows, emphasizing pre-driven process links and
integration points. System testing falls into the category of black-box testing and hence does
not require any knowledge of the inner logic or design of the code. It tests not only the design
but also the behaviour of the system, and even the customer's expectations. It is also expected
to test up to and beyond the limits specified in the software or hardware requirements
specification. [20]


7.1.5 Black Box Testing


Black box testing, also known as behavioural testing, is testing the software without
prior knowledge of the inner workings, structure or language of the module being tested.
Black box tests must be written from a definitive source document, such as a specification or
requirements document. It is testing in which the software under consideration is treated as a
black box: the test provides inputs and responds to outputs without considering how the
software works. [20]

7.1.6 White Box Testing


White box testing, also known as structural testing, is testing in which the software
tester has knowledge of the inner workings, language and structure of the software, or at least
its purpose. White-box testing can be employed at the unit, integration and system levels of
the software testing process. It helps in finding errors, but may miss unimplemented parts of
the specification or missing requirements. It is used to test regions that cannot be reached
from a black-box level. [20]

7.1.7 Acceptance Testing


User acceptance testing is a level of software testing in which a system is tested for
acceptability. This is a critical phase of any project and requires significant participation by
the end user. The main purpose of this test is to assess whether the system is in accordance
with the business requirements and to evaluate whether it is acceptable for delivery. It also
confirms that the product meets the functional requirements. The acceptance test may need to
be performed several times, as all of the test cases may not be executed within a single test
iteration. [20]

7.2 Test cases

Test case ID | Test Case Description/Steps | Expected Result | Actual Result | Status
TC_ID_01 | Start the command prompt | The command prompt should respond and start running | Command prompt is running successfully | Pass
TC_ID_02 | Enter the address of the directory in which the system is present | The directory of the command prompt should shift to the entered address | Successfully shifts directories | Pass
TC_ID_03 | Enter the command with all the necessary parameters | Accepts the command and starts running the system | Successfully runs the system and provides output | Pass
TC_ID_04 | Change, in the command, the model on which the system has to run | Accepts the command and runs the system again | Successfully runs the system and provides output | Pass
TC_ID_05 | Change, in the command, the document on which the system has to run | Accepts the command and runs the system again | Successfully runs the system and provides output | Pass
TC_ID_06 | Change, in the command, the question on which the system has to run | Accepts the command and runs the system again | Successfully runs the system and provides output | Pass
TC_ID_07 | Change the question in the command, providing a question with a grammatical mistake | Accepts the command and runs the system again | Successfully runs the system and provides output | Pass
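To make the parameters exercised by test cases TC_ID_03 to TC_ID_07 concrete, a minimal sketch of a command-line entry point is shown below. The script structure, the flag names (--model, --document, --question) and the answer_question placeholder are assumptions for illustration only and do not reflect the project's actual entry point.

import argparse

def answer_question(model_name, document_path, question):
    # Placeholder for the actual QA pipeline; assumed to return an answer string.
    return "[answer from %s for: %s]" % (model_name, question)

def main():
    parser = argparse.ArgumentParser(description="Run the question-answering system.")
    parser.add_argument("--model", required=True, help="model to run, e.g. a BERT checkpoint")
    parser.add_argument("--document", required=True, help="path to the input text document")
    parser.add_argument("--question", required=True, help="question to answer from the document")
    args = parser.parse_args()
    print(answer_question(args.model, args.document, args.question))

if __name__ == "__main__":
    main()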


Chapter 8
RESULTS AND DISCUSSIONS

The analysis is done by implementing the three algorithms in Python in a virtual
environment. Given below are some screenshots which depict the working of the models. A
classification report is also printed for each algorithm, summarizing the evaluation results of
the model based on the evaluation parameters.

Figure 8.1: Using NLP rules (Brute Force Method)


Figure 8.1 shows the basic working of the system implemented using a standard set
of NLP rules. It takes in simple sentences, compares them to the specified NLP rules and
arrives at a conclusion by ranking the possible outcomes.
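A minimal sketch of this kind of rule-based ranking is given below, assuming NLTK is available; it approximates the idea of matching a question against candidate sentences and ranking the outcomes, and is not the exact rule set used by the system.

import nltk  # may require: nltk.download("punkt")

def rank_candidate_sentences(paragraph, question):
    # Split the paragraph into sentences and score each by word overlap with the question.
    sentences = nltk.sent_tokenize(paragraph)
    q_words = {w.lower() for w in nltk.word_tokenize(question) if w.isalnum()}
    scored = []
    for sent in sentences:
        s_words = {w.lower() for w in nltk.word_tokenize(sent) if w.isalnum()}
        scored.append((len(q_words & s_words), sent))
    # Highest-overlap sentence first; ties keep document order.
    return [sent for score, sent in sorted(scored, key=lambda x: -x[0])]

paragraph = "Inception was directed by Christopher Nolan. It was released in 2010."
print(rank_candidate_sentences(paragraph, "Who directed Inception?")[0])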

Figure 8.2: Grammatical changes in the input statement


Figure 8.2 depicts the importance of certain grammatical changes being made in the
input statement. Since only a specific set of NLP rules is present, the system is unsure about
the answer, as it is not able to match the input to any of the pre-written rules.

Figure 8.3: Input paragraph used for analysis

Figure 8.4: Answer Generation based on the input text document


Figure 8.3 is the input text document to the system and Figure 8.4 depicts the answer
provided by the system based on the analysis of the text document and the question being
posed by the user.

Figure 8.5: System answering indirect questions


Figure 8.5 depicts the working of the system when the question is posed by the user in
an indirect manner. The input document does not mention the total number of movies
explicitly, but the system is capable of counting the movies based on unique movie titles
mentioned within the document.

Figure 8.6: Generating answer by summarizing the input data

Figure 8.6 depicts a question-answer pair wherein the system has to summarize the
input text and identify its gist in order to answer the question.


Figure 8.7: Answers with listings


Figure 8.7 depicts a question-answer pair wherein the system has to list out certain
names to answer the question.

Figure 8.8: Answer in a single word/phrase


Figure 8.8 depicts a question-answer pair where the answer is limited to a single word or phrase.

Figure 8.9: Answer with underlying context


Figure 8.9 depicts a question-answer pair where the underlying context is also
mentioned along with the answer sentence.

Figure 8.10: Different input text document

Figure 8.11: Generating answer by summarizing the input data

Figure 8.10 shows a different input text document, and Figure 8.11 depicts a question-answer
pair wherein the system has to summarize this input text and identify its gist in order to
answer the question.


Figure 8.12: Answer based on numerical context


Figure 8.12 depicts a question-answer pair wherein the system has to compare
numerical values to arrive at the answer.


Chapter 9

CONCLUSION AND FUTURE ENHANCEMENTS

9.1 Conclusion
The developed system effectively implements question answering, providing answers to a
given question from the user document fed into the system. When a paragraph-level question
answering model is used across multiple paragraphs, the training method of sampling
non-answer-containing paragraphs together with a shared-norm objective function can be
very beneficial. Combining this with the suggested paragraph selection strategy, the summed
training objective, and the model design allows the system to make substantial gains on
SQuAD.
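To make the shared-norm idea concrete, the sketch below (a simplification of the objective described in reference [14]) normalizes start-of-answer scores jointly across all paragraphs sampled for a question, instead of within each paragraph separately. The scores used are illustrative numbers, not actual model outputs.

import numpy as np

def shared_norm_start_probs(paragraph_scores):
    # Apply a single softmax over the start scores of all paragraphs jointly.
    all_scores = np.concatenate(paragraph_scores)
    exp = np.exp(all_scores - all_scores.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()
    # Split back into per-paragraph probability arrays.
    sizes = [len(s) for s in paragraph_scores]
    return np.split(probs, np.cumsum(sizes)[:-1])

# Illustrative token-level start scores for two paragraphs retrieved for one question.
scores_answer_paragraph = np.array([1.2, 0.3, 2.5])
scores_sampled_negative = np.array([0.1, 0.4])
p1, p2 = shared_norm_start_probs([scores_answer_paragraph, scores_sampled_negative])
print(p1.sum() + p2.sum())  # the probabilities sum to 1 across both paragraphs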

9.2 Future Scope


This project has tremendous potential to gain wide acceptance. Through this project we
intend to incorporate the system into various industries, thus eliminating the tedious task of
manually going through documents to find the answers to questions. The future scope of the
project is vast, and the system can be used in several ways:

1. Using the Question Answering System to sift through thousands of documents,
enabling academic researchers and legal workers to retrieve the information they require
without a massive time overhead.

2. The Question Answering System also finds application in medical transcription, where
spoken language is converted to written language and later analysed.

3. Businesses can deploy the question answering system in chatbots for 24x7 customer
support.

4. Support for various other languages can also be added to the system.


References

[1] Introduction to Machine Learning - https://towardsdatascience.com/introduction-to-machinelearning-db7c668822c4
[2] Aldo Gangemi, Valentina Presutti, Diego Reforgiato Recupero, Andrea Giovanni Nuzzolese, Francesco Draicchio, Misael Mongiovì, "Semantic Web Machine Reading with FRED", Semantic Web Journal 8(6):873-893, 2017.
[3] Valentina Presutti, Francesco Draicchio, Aldo Gangemi, "Knowledge Extraction Based on Discourse Representation Theory and Linguistic Frames", EKAW 2012: 114-129.
[4] Boris Katz, Gary Borchardt and Sue Felshin, "Syntactic and Semantic Decomposition Strategies for Question Answering from Multiple Resources".
[5] Anaconda - https://www.anaconda.com/why-anaconda/
[6] Natural Language Toolkit - https://www.nltk.org/
[7] NumPy - https://www.numpy.org/
[8] Scikit-learn - https://scikit-learn.org/stable/
[9] Pandas - https://pandas.pydata.org/
[10] DocumentQA - https://github.com/allenai/document-qa
[11] SQuAD Dataset - https://rajpurkar.github.io/SQuAD-explorer/
[12] ReadAI - https://github.com/ayoungprogrammer/readAI
[13] ReadAI Blog - http://blog.ayoungprogrammer.com/2015/09/a-simple-artificial-intelligence.html/
[14] Simple and Effective Multi-Paragraph Reading Comprehension - https://arxiv.org/abs/1710.10723
[15] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - https://arxiv.org/abs/1810.04805
[16] System Design in Software Development - https://medium.com/the-andela-way/system-designin-software-development-f360ce6fcbb9
[17] Designing Use Cases for a Project - https://www.geeksforgeeks.org/designing-use-cases-for-aproject/
[18] Unit Testing - http://softwaretestingfundamentals.com/unit-testing/
[19] Integration Testing - http://softwaretestingfundamentals.com/integration-testing/
[20] System Testing - http://softwaretestingfundamentals.com/system-testing/
[21] Question Answering with TensorFlow - https://www.oreilly.com/ideas/question-answering-with-tensorflow
[22] A Practitioner's Guide to Natural Language Processing - https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72
[23] Natural Language Processing in Information Retrieval - https://pdfs.semanticscholar.org/8721/f2a087ff35318a056a5814ba287a37df0ec8.pdf
[24] Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base - http://www.aclweb.org/anthology/P15-1128
[25] TensorFlow - https://www.tensorflow.org/guide/
[26] Feasibility Study - https://www.sqa.org.uk/e-learning/SDM02CD/page_11.htm
