Louis BALIGAND
louis.baligand@epfl.ch
August 2018
Acknowledgements
First and foremost, I would like to express my deepest gratitude to my supervisor, David Schmalz, who has been constantly eager to guide me towards the realization of my thesis. He has been an exceptional advisor, and this applies far beyond the span of my thesis. I would also like to thank my colleagues, and more specifically Debmalya Biswas, without whom this opportunity would not have been possible.
I would also like to thank Prof. Karl Aberer and Hamza Harkous from EPFL, who have never failed at giving me great insights and sharing their expertise in an exciting and vibrant area.
Last but not least, I am grateful to my family and friends, who have provided me with unique support and motivation throughout the whole of my studies.
Abstract
Modern search-driven engines leverage Natural Language Processing (NLP) and Machine Learning (ML) to improve the relevance of results, as well as to reduce the friction between the querying human and the assisting application. The emergence of Information Retrieval (IR) algorithms and deep learning models has shifted search beyond simple Boolean keyword search.
In this work, we investigate two different use cases at Philip Morris International (PMI), each of which presented different pain points. For instance, operators on factory production floors often look for answers manually (or with obsolete search engines), which leads to considerable time losses; in the second use case, the challenge lies in searching highly confidential unstructured information.
To improve this situation, we designed a search engine that aims at answering industry domain-specific questions. We combined best-of-breed information retrieval approaches with deep neural networks, building on DrQA, a system for Machine Reading Comprehension (MRC) applied to open-domain question answering, proposed by the Facebook research group (Chen, et al., 2017).
On the retrieval side, following unsuccessful attempts to improve relevance with Latent Dirichlet Allocation and granular inverted index matrices, we integrated and customized Elasticsearch to perform this task. We collected a test set, on which our retriever outperformed the initial settings by 8% precision for the top 5 pages and 6% for full documents, starting from 76% and 90% respectively (precision at 5). In addition, we improved the reader performance by 11% F1 score (initially 28% with the original settings) after changing the word embedding model as well as refining and extending the file-system parsing process. Final results suggest that the reader model we used has a benefit over (1) a random guess of a text segment over the whole page and (2) a non-ML, IR-based sliding window. Although we tackle a question-answering problem in this project, our tool has been built in such a way that it also functions as full-text search.
Keywords
Information Retrieval, Question Answering, Full-text search, Machine Reading Comprehension
Contents
Acknowledgements
Abstract
Keywords
List of Equations
Introduction
Implementation
5.1 Dataset
5.3 Results
References
List of Figures
Figure 15 Q: "What is the inner frame transversal transport belt Focke 550 for?" Result of DrQA highlighted in yellow and expected answer in green
List of Tables
Table 1 Comparison of two PMI use cases
Table 3 Results from DrQA with the initial setting (line 1), the granular TF-IDF (line 2), the initial and granular TF-IDF on a directory containing all the files of the department (only the subfolder contains the relevant files) (lines 3 and 4), and finally the reader only (line 5)
Table 11 Percentage of relevant answers by DrQA when retrieving different numbers of pages
List of Equations
Equation 1
Equation 2
Equation 3
Equation 4
Equation 5
Equation 6
Equation 7
Equation 8
Equation 9
Equation 10
Equation 11
Chapter 1 Introduction
Introduction
In this chapter, we first present the background of the project (Section 1.1) in order to better understand the motivations (Section 1.2). We then define the problem we faced during the thesis (Section 1.3), and finally we give a short but complete summary of the research process (Section 1.4).
1.1 Context
Over the last decade, the rate at which information is created has increased a thousand-fold, and most predictions target a continued exponential increase in the coming years (Eriksen, 2001). Digital giants such as Google, Bing and Yahoo! have created entirely new business models based on the expanding need to find relevant knowledge, by refining their information indexing and retrieval capabilities as well as leveraging personal data to contextualize results.
Like the majority of large-scale corporations, PMI produces a plethora of data year after year and is looking for innovative ways to effectively tackle its employees' information overload. Such innovation initiatives are in the scope of the PMI Enterprise Architecture team, part of the Information Systems Function, which works closely with various internal departments and external experts in order to grasp the corporate need for adequate cutting-edge technologies. These range from Blockchain to Artificial Intelligence to the Internet of Things; Enterprise Search, which has a very broad scope of action, is a major concern for Enterprise Architecture. This practice challenges companies on the following points:
automatic speech recognition, image processing and translation using Machine Learning and Natural Language Processing (NLP). Hence, as the possibilities scale up, so do the amount of noise in search results and user expectations. Achieving results as accurate as web search is practically a very tough task in the field of enterprise search: compared to public search engines, the amount of information and user activity is too small, and semantic relationships between documents are not tracked.
2) Variety of sources:
There is no single place to search, and end-users must proactively know what they are looking for by issuing the right keywords on the right platform. Indeed, the range of services through which information can be accessed is broad: workstations' hard drives, emails, shared folders and collaboration platforms like SharePoint each provide information that can be retrieved using their respective search tools. Being able to search through the entire set of the company's documents, and even beyond (combining it with public information), is a long-term goal of the Enterprise Search capability.
3) Unstructured data:
Another challenge stems from the nature of the documents: users have to dig through various file formats, and most of the files are unstructured, which makes indexing a non-trivial task. Some of the repositories in use contain unclassified data, and keeping track of tags and metadata is a task that has often been neglected.
4) Confidentiality:
In large-scale companies such as PMI, not all information can be shared within the firm, and hence great caution must be taken about what can be uploaded to the cloud. Does the user have access to this data? If not, what are his or her options for reaching the information needed? These are a few of the questions that must be taken into account in the field of Enterprise Search. In particular, we were asked to work with a financial department whose main reason for building an application internally at PMI was to be able to search through highly confidential data: such documents can be seen by only a very restricted number of employees before they are released publicly.
1.2 Motivation
There are plenty of reasons to be interested in the Entreprise Search. The first and
main reason that comes to the mind is productivity gains and even though it is very difficult
to measure the gain and effectivness of a search tool it is obvious that several minutes
gained per search is highly beneficial if applied on thousands of employees. To put it into
perspective for the reader, if we estimate the number of cigarettes produced by one factory
by 24 billions per year, it results to roughly 10 millions cigarettes produced per hour and if
we estimate the net benefit of one cigarette by 70 cents, then one minute of down-time
would result in the loss of 143 000 USD. According to a McKinsey report1: employees
spend 1.8 hours every day-20% of the time-searching and gathering informations. In other
words, 4 employees out of 5 are working and the last one is looking for answers. Thus, it is
still essential that companies minimizes the time employees spend looking for information
to complete their on-going activities. This pre-supposes that the retrieved information is
accurate and relevant. Beside the importance of time, it is also crucial to not miss infor-
mation and being able to leverage the value of documents that demand a lot of ressources
in order to be created.
1 https://www.mckinsey.com/industries/high-tech/our-insights/the-social-economy, MGI report, July 2012
The nature of the documents varies depending on the department in which we operate. For example, the Trainings and Operations domain looks for answers concerning specific machines and how to use them under their respective task descriptions, such as "How can I remove the knife from the cutting unit?". In the finance department, by contrast, the focus is on financial communications (i.e. via emails) and numerical data in spreadsheets, with a greater concentration of questions such as "What is the market share price of PMI in September 2014?". Whether it is full documents, a page or a short
text span to be retrieved, the end-user needs the freedom to explore the data and look for a satisfying answer. Keeping these factors in mind, the company strives to create a tool flexible enough to cover all of these categories.
Problem statement: Can the current pain points be addressed using a combination
of classical information retrieval and machine learning techniques? Can the solution be
scaled/deployed to other corpora of documents? Is the resulting search interface more
relevant and intuitive for end-users? How does a pure information retrieval approach com-
pare with our combined proposal?
In this work, we attempt to answer these questions by building a search tool based on previous public and internal research. First, we make sure that the existing home-grown system works on the provided data. Following this, we break down the problem and quantify the performance of each step of the system. The most crucial element of the project is then to extend the product in every direction and push it forward along the pipeline. To optimise the product, we followed an iterative approach, testing and measuring new solutions to progressively find the best option. It is imperative that we understand the data we are working with and analyse the results we obtain critically and concretely. The tool must also scale well and handle a large amount of data. Figure 1, which shows how Enterprise Search at PMI is articulated, highlights the flexibility required of the independent "Dedicated Search" system, as well as which parts of the system must be automated and which must be tailored to the specific requirements of each customer.
Finally, in order for the system to be used and further tested, we need to build a complete user interface and bring the design to a maturity such that employees make concrete use of it. Depending on the use case, documents are more or less confidential; we designed our system with this in mind, although confidentiality was not the main subject of the thesis. Indeed, in our case we deal with a specific corpus of documents, which amounts to a very vertical search, as opposed to the horizontal search over multiple sources of documents commonly seen in Enterprise Search. Last but not least, we address this problem as natural language search; yet the need for natural language queries has not been fully validated with end-users, and we assume, for the rest of the thesis, that it is more effective than other approaches (such as guided search, or bots that elicit intent and contextual information).
1) Initial State
PMI has developed a home-grown application based on the work of the Facebook AI research team (Chen, et al., 2017) called DrQA, which we describe in Section 4.1. The original tool built by Facebook aims at answering factoid questions by reading Wikipedia articles. As stated in their repository, "DrQA treats Wikipedia as a generic collection of articles and does not rely on its internal graph structure. As a result, DrQA can be straightforwardly applied to any collection of documents". The architecture was thus applied to the Operations use case.
From this initial stage, a few shortcuts had been taken: only two documents were indexed, and no measurements had been made, as no test set existed for this purpose.
2) Contributions
a) Pagewise Crawler
In order to parse all documents, we created a crawler, described in Section 4.2. We then indexed the totality of the documents related to the field and collected a test set (Section 5.1) to obtain the first results, presented in Sections 5.3.1 and 5.3.2. For the first part, the retrieval of relevant documents, we used the original bigram TF-IDF method, which initially retrieved the correct page in the top 5 in 76% of cases. The fetched pages were then fed into the second part, which uses a pre-trained RNN model; we obtained an Exact Match (EM) score of 8% and an F1 score of 28% for the first predicted answer. We also tested the reader itself by giving it the document containing the answer as input, which gave us an upper bound of 42.05% for the F1 score.
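The crawler's page-splitting behaviour can be illustrated as follows (a minimal sketch only, assuming plain-text exports in which pages are separated by form-feed characters; the real crawler described in Section 4.2 handles richer formats, and the names here are hypothetical):

```python
import os

PAGE_SEP = "\f"  # form-feed: a common page delimiter in plain-text exports

def crawl_pages(root):
    """Walk a directory tree and yield (path, page_number, text) triples,
    one per page, so each page can be indexed as its own retrieval unit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as fh:
                content = fh.read()
            for page_no, page_text in enumerate(content.split(PAGE_SEP), start=1):
                if page_text.strip():  # skip blank pages
                    yield path, page_no, page_text
```

Indexing pages rather than whole documents keeps each retrieval unit small, which is what makes the page-level precision figures above measurable in the first place.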
b) ElasticSearch Integration
We replaced the classic TF-IDF approach for document retrieval with the option to integrate an Elasticsearch server, as shown in Section 4.3.3. As discussed in Section 5.3.6, we started with a P@1 of 40% and a P@5 of 74% for whole-document retrieval. P@n here is defined as the percentage of questions for which the answer segment appears in one of the top n documents.
After troubleshooting and tweaking the weights of n-grams and various other parameters, we reached a P@1 of 70% and a P@5 of 94% for documents, and a P@1 of 48% and a P@5 of 86% for page retrieval. We then refined the measurement of DrQA's results by counting the predicted answer span (±10 tokens) as a success if it contains at least one of the expected keywords; again we call this P@n for the top n predictions. We obtained a P@1 of 58% and a P@5 of 86% for pages.
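The P@n measure used throughout these results can be computed as follows (a minimal sketch; the function name is hypothetical):

```python
def precision_at_n(results_per_question, n):
    """P@n: fraction of questions for which a relevant result
    appears among the top-n retrieved results.

    results_per_question: one list of booleans per question, ordered
    by rank (True = that result contains the expected answer)."""
    hits = sum(1 for ranked in results_per_question if any(ranked[:n]))
    return hits / len(results_per_question)
```

For instance, with four questions whose per-rank relevance is `[[False, True], [True, False], [False, False], [True, True]]`, P@1 is 0.5 and P@2 is 0.75.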
d) Parameter customization
We then compared how the RNN re-ranked the retrieved pages (Section 5.3.8): the first page fetched by the retriever contained the answer for 72% of the questions, whereas the page on which the RNN predicted its first answer contained the expected answer for only 54% of the questions. This means that we should not let the RNN rank answers across different pages; instead, we present a list of documents to the user, who decides which pages the machine should read.
3) Trials
a) Granular TF-IDF
Seeing that the original approach was indexing documents page per page for memory allo-
cation facilities, one first guess was to build an invert index for both documents and pages. We first
wanted to filter the top relevant manuals and then retrieve the pages among them as described in
Section 4.3.1. It slightly improved the performance of the retriever by 2% however it did not change
the final results.
By tweeking the number of documents retrieved, i.e. 3, we improved the results using the
double TFIDF method to an EM score of 10% (+2% from original) and F1 of 31.12% (+3% from
original) (see Section 5.3.4).
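The document-then-page filtering idea can be sketched as follows (a minimal illustration with a pluggable scoring function; the names and the exact TF-IDF scoring used in the thesis are not reproduced here):

```python
def two_stage_retrieve(query_terms, documents, score, top_docs=3, top_pages=5):
    """Granular retrieval sketch: first rank whole documents, then rank
    only the pages belonging to the top-scoring documents.

    documents: dict mapping doc_id -> list of pages, each page a list of terms.
    score: function (query_terms, terms) -> float (e.g. a TF-IDF similarity)."""
    # Stage 1: score each document on its concatenated pages.
    doc_scores = {
        doc_id: score(query_terms, [t for page in pages for t in page])
        for doc_id, pages in documents.items()
    }
    best_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_docs]
    # Stage 2: score individual pages within the filtered documents only.
    page_scores = [
        ((doc_id, page_no), score(query_terms, page))
        for doc_id in best_docs
        for page_no, page in enumerate(documents[doc_id], start=1)
    ]
    page_scores.sort(key=lambda kv: kv[1], reverse=True)
    return [key for key, _ in page_scores[:top_pages]]
```

The design choice is that stage 1 cheaply prunes the candidate set, so the finer-grained page scoring in stage 2 only runs over a handful of documents.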
c) LDA
We attempted to infer topics from documents using the unsupervised Latent Dirichlet Allocation (LDA) approach and to filter out every document that did not share the query's topic, as described in Section 4.3.2. However, the results were exactly the same with or without LDA (Section 5.3.5).
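The filtering logic can be sketched independently of the LDA implementation itself (a minimal illustration; in practice the topic distributions would come from a trained LDA model such as gensim's, and the function names here are hypothetical):

```python
def dominant_topic(topic_dist):
    """Index of the most probable topic in a topic distribution."""
    return max(range(len(topic_dist)), key=topic_dist.__getitem__)

def filter_by_topic(query_topics, doc_topics):
    """Keep only documents whose dominant LDA topic matches the query's.

    query_topics: topic distribution inferred for the query.
    doc_topics: dict doc_id -> topic distribution inferred for that doc."""
    target = dominant_topic(query_topics)
    return [doc_id for doc_id, dist in doc_topics.items()
            if dominant_topic(dist) == target]
```

If the dominant topic rarely discriminates between the candidate documents, as the results in Section 5.3.5 suggest, this filter simply passes (almost) everything through, which would explain the unchanged scores.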
d) Solr
We then considered integrating a search engine based on Lucene in order to investigate the document retrieval part further. After comparing Solr and Elasticsearch (Section 4.3.3), we decided not to use Solr.
e) Sliding Window
One main concern was to determine whether the generic RNN model actually helps extract insight from the documents. We attempted a non-ML method based on query-document relevance ranking functions to replace the RNN. It consists of creating an inverted index from as much text as possible related to PMI and the specific field we expect to search in. We then sample the text with a sliding window that we shift along the whole page, and keep the window with the top BM25 score. The full description can be found in Section 4.4.2. If we input the correct document into the model, we obtain a P@1 (and P@5 respectively) of 70% (and 90%) for BM25, compared to 94% (and 94%) for the modified RNN method. P@n here is defined as the percentage of questions for which the answer segment (30 tokens for the sliding window; ±10 tokens around the predicted span for the RNN) contains at least one keyword of the expected answer. This shows that a pre-trained generic model performs better than a simple non-ML method.
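The sliding-window baseline can be sketched as follows (a minimal illustration of the idea, not the thesis code; the function names are hypothetical, and the BM25 variant shown uses the standard k1/b parametrization with document-frequency statistics from a background corpus):

```python
import math
from collections import Counter

def bm25_score(query_terms, window, df, n_docs, avg_len, k1=1.5, b=0.75):
    """Okapi BM25 score of one candidate window against the query.
    df: document frequency per term, from the background inverted index."""
    tf = Counter(window)
    score = 0.0
    for term in query_terms:
        if term not in tf or term not in df:
            continue
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * len(window) / avg_len)
        score += idf * num / den
    return score

def best_window(query_terms, page_tokens, df, n_docs, avg_len,
                size=30, stride=5):
    """Slide a fixed-size window along the page and return the
    top-scoring segment, as a non-ML answer extractor."""
    best, best_s = page_tokens[:size], float("-inf")
    for start in range(0, max(1, len(page_tokens) - size + 1), stride):
        window = page_tokens[start:start + size]
        s = bm25_score(query_terms, window, df, n_docs, avg_len)
        if s > best_s:
            best, best_s = window, s
    return best
```

The returned 30-token segment plays the same role as the RNN's predicted span, which is what makes the keyword-based P@n comparison above possible.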
f) Multi-language
In Section 4.5, we explain how we handled the variety of languages after the stakeholders requested that the project be extended to more languages.
Finally, in Section 4.6, we propose APIs and a user interface prototype for the actors of this project.
Chapter 2 Related Work
Related Work
There has been tremendous research on Information Retrieval (IR) (Manning, et al., 2008) and Knowledge Base (KB) based Question Answering (Jurafsky & Martin, 2018) (Liu, et al., 2015) over the last decades. Progress has been such that IBM Watson was able to beat human performance (Ferrucci, et al., 2010), followed by similar milestones on other tasks, as the SQuAD1.1 leaderboard suggests2 (Rajpurkar, et al., 2016). This progress is driven by the growth of competitions and conferences such as the annual Text REtrieval Conference (TREC) competition and ACM SIGIR. TREC has built a variety of large test collections, including cross-language, speech and domain-specific collections; since its beginning in 1992, retrieval effectiveness has approximately doubled (Ellen M. Voorhees, 2005). With the development of KBs such as WebQuestions (Berant, et al., 2013), Wikidata (Vrandečić & Krötzsch, 2014) and DBpedia (Auer, et al., 2007), combined with (semi-)automatic KB extraction (Fan, et al., 2012), KB-based QA, which aims at translating a free-text user query into a structured query (such as SPARQL or a lambda expression), has accelerated. Nevertheless, it has inherent limitations: it is not flexible (missing information, rigid schema, etc.), and a lot of manpower and field expertise is needed. Thus, more recent research focuses on answering questions from raw text, a practice called machine reading comprehension (MRC), which has proved particularly successful thanks to new deep learning architectures such as attention-based and augmented Recurrent Neural Networks (RNN) (Raison, et al., 2018) (Bahdanau, et al., 2014) (Weston, et al., 2014) (Graves, et al., 2014), convolutional networks (Yu, et al., 2018) and Stochastic Answer Networks (SAN) (Liu, et al., 2017).
This considerable progress on MRC is largely due to the availability of large-scale datasets such as MCTest (Richardson, et al., 2013), QACNN/DailyMail (Hermann, et al., 2015), CBT (Hermann, et al., 2015), WikiQA (Yang, et al., 2015), bAbI (Weston, et al., 2014), SQuAD
2 https://rajpurkar.github.io/SQuAD-explorer/
(Rajpurkar, et al., 2016) and QAngaroo (Welbl, et al., 2017), as well as to initiatives to group them together, like ParlAI (Miller, et al., 2017). However, these tasks assume that a supporting document is provided with the question, which cannot be assumed in our case. New resources combine queries with text retrieved from search engines, such as MSMARCO (Nguyen, et al., 2016), and we attempt to follow the same path using the open-source search engine Elasticsearch. We use a two-step architecture similar to the one proposed in (Chen, et al., 2017) to tackle the full pipeline problem. Our approach differs from the latter by integrating Elasticsearch and putting more focus on the document retrieval stage. In addition, further studies on the Reinforced Reader-Ranker (Wang, et al., 2017) have shown that FastText (Joulin, et al., 2016) improves the performance of the feature vectors in DrQA, which originally used GloVe (Pennington, et al., 2014).
Learning-to-rank models, which aim at re-ranking the top-k retrieved documents with a machine learning method, have also seen great progress (Turnbull, 2016). Even though our project would benefit greatly from this approach, it relies on specifically labelled data, and it is hence left for future work, as building such a dataset would take a significant amount of time. In the unsupervised setting, Latent Dirichlet Allocation (LDA) is heavily cited in the literature and has been shown to be a promising method for IR (Wei & Croft, 2006). However, reported results on LDA's efficiency are mixed, so we want to see whether it could bring context to the machine and add value to the already existing IR method.
One simple way to retrieve potentially relevant documents is Boolean retrieval, which consists in formulating a query with AND, OR and NOT expressions. Given a query such as "knife AND drums", it returns the documents containing both terms, following set semantics. It is still a frequently used approach, and it is very intuitive because we know exactly how each result was selected. However, there is no way to rank the results, and we may end up with either a huge number of results or none at all. One way to overcome these limitations is to vectorise documents and queries in a D-dimensional space and compute a similarity between the two. A query-document scoring function that has been widely used in the literature is Okapi BM25, which has proved successful (Robertson & Zaragoza, 2009). We use this method to build a sliding window that
aims at doing IR at a fine-grained granularity of the text. (Moen, et al., 2015) outperformed Lucene on the task of domain-specific IR using sliding windows of word vectors to capture the domain specificity in the semantic model. In our case, instead of using it for IR, the idea is to benchmark the generic machine reading model against a non-ML method.
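The Boolean model described above can be sketched with set operations over an inverted index (a minimal illustration; the index layout and function names are assumptions):

```python
def boolean_and(index, terms):
    """Documents containing ALL query terms (e.g. "knife AND drums")."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, terms):
    """Documents containing ANY of the query terms."""
    return set().union(*(index.get(t, set()) for t in terms))
```

Note how the result is an unordered set: this is exactly the ranking limitation that motivates the move to vector-space scoring functions such as BM25.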
Our research contributes by applying state-of-the-art IR to a specific industrial real-life domain and performing machine reading on such a collection of documents. The research centred on adapting the said pipeline to the different use cases and automatically building replicas for other collections. We also explore a variety of metrics in order to assess the performance of the model, and, with the recent progress in multi-language translation models (Yu, et al., 2018), we extend our research to French and Italian.
Chapter 3 Context & Customer needs
Having worked closely with managers steeped in the nitty-gritty of manufacturing, and making use of their extensive knowledge, we were able to understand the current situation at hand and identify their needs and expectations. The research methods currently undertaken by employees can be deemed slow and inefficient: to find the relevant information, an employee must crawl through the training programme's directory, find the document that matches his or her query, and then locate the page with the relevant information using the classic "Ctrl-F" tool. This basic search process could be largely superseded by a much simpler full-text keyword search; however, our aim is to improve on and go beyond even this. We would like to make the
search tool accessible to even the newest of employees. To this end, we first create a search engine with natural human interaction; it is imperative to note, however, that we do not wish to build a chatbot. The idea is to communicate with the search tool using natural language and to refine the results with pre- and post-processing. The search tool thus follows a question-answer style that gives the user access to the potentially relevant documents. The Operations department provided a set of 50 Frequently Asked Questions with answers; this was very helpful, first for a series of troubleshooting rounds, and then to test, benchmark and measure the performance of our solutions. A full description of this set can be found in Section 5.1.
The second use case, however, differs from the previous one in two major respects. The already existing tool indexes documents available outside the realm of PMI, including analysis, research and news in various domains from The Guardian, the Financial Times, JP Morgan and the FDA. The body of documents is much smaller here: we are looking at roughly a thousand documents, with about one hundred more added per year. Furthermore, we do not focus on answering natural language questions but rather on the formulation of queries with keywords and parameters such as dates, topics, locations, etc. This can be understood with the following example: the user requests
"CAGNY 2017" related to the past 90 days under the topic Philip Morris. Moreover, the search tool required the addition of metadata for each document added to the corpus.
We could now expect the following queries if they were asked in a more natural-language way:
Note that Question 3 ("What are the latest IQOS conversion rates?") represents the kind of question DrQA could successfully answer, under the constraint that the precise answer is contextualized in the surrounding text. In the finance KB, however, the answer may be located in a spreadsheet of numeric data, and it is inconvenient to retrieve data of that nature using our system. Another issue is the nature of the questions that may be input into the search engine: if a question is not content-specific and expects a set of pages as its answer, the original system is not suitable. Moreover, for a question like Question 2 ("What is the latest status on Applications for the IQOS in the U.S.?"), we would like to report the documents or paragraphs that best match the query.
PMI holds confidential information that cannot be shared with a third party such as the commercial tool currently in use. The financial department therefore needs an internal tool that offers a service at the same level as, or better than, the commercial one. We consequently proposed to solve the problem with a pure document retrieval system. Indeed, as opposed to the previous use case, a natural language query here only adds noise over a keyword search, whereas in the Operations use case a question starting with, for example, "how long" tells us that we are looking for a duration. From the thesis point of view, we decided to set this use case aside, since this process is the first stage of our full pipeline and is hence covered by the first use case.
Chapter 4 Implementation
Implementation
In this chapter, we describe our trials to find a solution matching the problem described in the previous chapters. We first describe the implementation of the state of the art as it stood at the beginning of the master thesis, and then we present our attempts to improve the existing prototypes.
The pipeline can be split into two modules: the Document Retriever and the Document Reader.
Firstly, the term frequency can be computed in a range of variants: binary (whether the term is contained in the document or not), raw (simply counting the number of occurrences of the term in the document), or normalized. In DrQA, a "0.5" max-normalization variant is used, as follows, for a term $t$ in a document $d$:

$$TF_{t,d} = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}} \qquad (1)$$
Following this, the inverse document frequency allows us to capture the importance of one specific term 𝑡𝑖 appearing in a document compared to the entire corpus. It is computed using the following formula:

𝐼𝐷𝐹𝑖 = log ( |𝐷| / |{𝑑𝑗 : 𝑡𝑖 ∈ 𝑑𝑗}| ) (2)

where |𝐷| is the total number of documents in the corpus and |{𝑑𝑗 : 𝑡𝑖 ∈ 𝑑𝑗}| represents the number of documents in which the term 𝑡𝑖 appears.
Finally, the weight is simply the product of the two measures:

𝑇𝐹𝐼𝐷𝐹𝑡,𝑑 = 𝑇𝐹𝑡,𝑑 × 𝐼𝐷𝐹𝑡 (3)

Thus, the above results in a sparse matrix with rows containing all possible terms in our corpus and one column per document. This statistical approach is efficient, as the weights are computed only once, at indexing time. However, this also means that we need to recompute the TFIDF matrix every time we add a new document.
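The product weight described above can be sketched in a few lines of Python, assuming pre-tokenized documents (function and variable names are illustrative, not taken from the DrQA code base):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute 0.5-max-normalized TF * IDF weights for a list of
    tokenized documents. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_f = max(counts.values())
        w = {}
        for term, f in counts.items():
            tf = 0.5 + 0.5 * f / max_f          # Equation (1)
            idf = math.log(n_docs / df[term])   # Equation (2)
            w[term] = tf * idf                  # product weight
        weights.append(w)
    return weights
```

Note that a term appearing in every document receives an IDF of log(1) = 0 and is effectively ignored, which is the desired behaviour for ubiquitous terms.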
In order to prevent the computations from being redundant and to ensure their efficiency, we create a so-called inverted index. It can be seen as a dictionary where the keys are all possible terms and the values are the documents in which each term appears. For example, the pair term1 → {doc2, doc5, doc8} would mean that term1 appears in doc2, doc5 and doc8.
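A minimal sketch of such an inverted index, assuming documents are already tokenized (the dict-of-sets layout is an illustrative simplification):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids in which it appears.

    `docs` is a dict of {doc_id: list of tokens}."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index
```

At query time, only the documents listed under the query terms need to be scored, instead of the whole corpus.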
We shall later see how we pre-process each source of information in order to obtain an exploitable list of documents with terms from which to create said inverted index. For now, we hold the assumption that we have collected a list of documents to search through, that they are readable (i.e. they consist of a decoded character sequence) and that they are ready for indexation.
We use the Stanford CoreNLP Tokenizer for the following steps (there is no specific reason for this choice, as a number of other tokenizers such as spaCy or NLTK would perform equally well for this task). Firstly, we must tokenize the text in order to determine the vocabulary of terms. This task consists of breaking a text down into pieces called tokens. For example, the phrase "the black cat!" is tokenized into the following list: "the", "black", "cat", "!". Following this, we must remove stopwords, punctuation, compound endings and special characters such as "\n" for line breaks. Stopwords are common terms that appear frequently in all documents, such as "the" or "a"; removing them causes no loss of information in our case. Note that in this context we could add words such as "pmi" or "tobacco" to our stopword list, but we keep the general case for now. In addition, we delete possible question words such as "what", "how", etc.; this is a questionable decision but does not affect the validity of our analysis.
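A toy version of this pre-processing, using a simple regex tokenizer as a stand-in for CoreNLP (the stopword list here is a small illustrative sample, not the one actually used):

```python
import re

# Illustrative sample only; the real list also includes further question
# words and, in our context, domain words such as "pmi" or "tobacco".
STOPWORDS = {"the", "a", "an", "what", "how", "where", "pmi"}

def tokenize(text):
    """Break text into lowercase word and punctuation tokens
    (a simple regex stand-in for the Stanford CoreNLP tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def filter_tokens(tokens):
    """Drop stopwords, punctuation and special characters."""
    return [t for t in tokens if t.isalnum() and t not in STOPWORDS]
```

For example, `tokenize("the black cat!")` yields the four tokens of the example above, and filtering then keeps only "black" and "cat".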
DrQA improves upon a standard TFIDF with the addition of bi-grams; the bi-grams are included in the list of terms as an attempt to take local order into account. For instance, if one wishes to index a document containing the sentence "the black cat", the list of terms would not only contain "the", "black" and "cat", but also "the_black" and "black_cat". This allows for greater emphasis on the order of the words in the phrase, and hence on the meaning of the phrase itself, which depends on its context. "Sue ate an alligator" and "an alligator ate Sue" are two sentences containing exactly the same words, which would result in identical answers given the same query; however, they have completely different meanings. If the query is "Who ate an alligator?", both the aforementioned phrases would be given equal weight in the case of unigrams. Bi-grams, on the other hand, would give a higher match weight to the sentence "Sue ate an alligator", due to terms such as "ate_an" and "an_alligator", than to the phrase "an alligator ate Sue". Since storing all bi-grams explicitly is inefficient in both speed and RAM, the feature hashing of (Weinberger, et al., 2009) is used to map the bi-grams to 2^24 bins with an unsigned murmur3 hash. Therefore, it is only required to store the hashed values in a sparse matrix (scipy.sparse library in Python), which allows lookups in O(1).
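The hashing step can be sketched as follows; note that zlib.crc32 is used here only as a stdlib stand-in for the unsigned murmur3 hash actually used by DrQA:

```python
import zlib

N_BINS = 2 ** 24  # uni- and bi-grams are hashed into 2^24 bins

def hash_ngrams(tokens):
    """Map each unigram and bigram of a token list to a bin id.

    DrQA uses an unsigned murmur3 hash; zlib.crc32 plays the same
    role here so the sketch stays stdlib-only."""
    grams = list(tokens) + [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return [zlib.crc32(g.encode("utf-8")) % N_BINS for g in grams]
```

Hash collisions between distinct n-grams are possible but rare enough at 2^24 bins to have little effect on retrieval quality.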
The top 5 closest documents are retrieved by computing the dot product between the query and the documents in the TFIDF-weighted (3) word vector space (note again that bi-grams are included) and selecting the documents for which the product is highest. This retrieval technique is the one originally used in (Chen, et al., 2017) and we initially used exactly the same settings.
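The ranking step can be sketched with plain dictionaries standing in for the scipy sparse rows (names are illustrative):

```python
def top_k_documents(query_vec, doc_vecs, k=5):
    """Rank documents by the dot product of sparse TFIDF vectors
    (dicts mapping term -> weight) and return the top k doc ids."""
    def dot(u, v):
        # Iterate over the smaller vector for efficiency.
        if len(u) > len(v):
            u, v = v, u
        return sum(w * v.get(t, 0.0) for t, w in u.items())
    scores = {doc_id: dot(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In the real system the same computation is a single sparse matrix-vector product.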
- The PLC
- The drums”
We need to read the first two paragraphs concatenated in order to respond correctly to the question "Where are located the PLC?".
Thus, for the two reasons above, we deal with entire pages rather than paragraphs as input, as PMI's manuals contain a great number of tables and pictures.
The logic behind an RNN, compared to a standard neural network, is that the output of one cell is fed into the next cell's input. This implies that into one cell we feed an encoded word, which approximately corresponds to a "vectorized" word, together with the output of the previous cell. Using LSTM units instead of standard units (which only have one layer, the activation function) is a very good fit for question answering: an LSTM manages to memorize information not only from the local order (as with n-grams) but can also retain information coming earlier in the text, combined with a layer in the LSTM unit that computes whether or not to discard what has been read before. LSTMs have proven to be successful in similar tasks such as machine reading and comprehension (Hermann, et al., 2015) (Chen, et al., 2016) and, in the same spirit of text understanding, in sentiment analysis (Li & Qian, 2016). A bidirectional RNN is the superposition of two RNNs with the concatenation of both outputs, as shown on Figure 4. This is useful to access information at any point in time from both the past and the future (Graves, 2012). We use the default settings for the RNN parameters: 128 hidden LSTM units with 3 hidden layers, a dropout rate of 0.4 and a learning rate of 0.1 (for Stochastic Gradient Descent optimization).
We use the same encoding as described in DrQA, which consists of a word embedding using Stanford GloVe (Pennington, et al., 2014), 300-dimensional, trained on 840B tokens of Web crawl data. A word embedding is a representation of a word as a vector in d-dimensional space. The trained embedding model has two particularities: first, words that have similar meanings are close to each other in terms of distance (l2, or Euclidean, distance), and very common words, such as indefinite and definite articles like "a" or "the", are close to the origin. An unknown word is set to the origin and, as we have a number of specific terms designating machines which are unknown to GloVe, it would be valuable to fine-tune the embedding matrix on our corpus; however, the benefits would be too low to see significant result improvements. Moreover, we managed to assemble around 15 million tokens in total, and around 1 billion would be necessary to re-train a whole embedding matrix.
In addition to the word embedding, we add a few features to the vector: whether the word is an exact match with a question word; the part of speech, i.e. whether it is a noun, verb, etc.; the named entity tag, i.e. the category of a word such as Person, Company, Country, etc.; and the normalized term frequency. Last but not least, we add an aligned question embedding computed as follows:

where the attention score 𝑎𝑖,𝑗 aims at measuring how close in meaning a word of the paragraph is to a word of the question, and E(𝑞𝑗) is the encoding (embedding with word parameters) of the word 𝑞𝑗. The attention score can be computed as:
(6)
where α(x) is the activation function, defined as the ReLU function max(0, 𝑥).
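The body of Equation (6) was lost in extraction; for reference, in the original DrQA model (Chen, et al., 2017), which this thesis follows, the attention score takes the following softmax form over dot products of ReLU-transformed embeddings:

```latex
a_{i,j} = \frac{\exp\big(\alpha(\mathbf{E}(p_i)) \cdot \alpha(\mathbf{E}(q_j))\big)}
               {\sum_{j'} \exp\big(\alpha(\mathbf{E}(p_i)) \cdot \alpha(\mathbf{E}(q_{j'}))\big)}
```

where E(𝑝𝑖) is the embedding of paragraph word 𝑝𝑖 and α is the ReLU defined just above.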
We also have another RNN (biLSTM) for the query in which we only feed the em-
bedded question words and compute the weighted average of each hidden cell output
such that:
𝑞 = ∑𝑗 𝑏𝑗 𝑞𝑗 (7)
with 𝑏𝑗 capturing the importance of each question word; it can be computed as follows, with w being a parameter (a weight vector) to learn:
(8)
We finally have to aggregate the results in order to find the most probable answer. We compute, for all possible starting and ending words:

such that 𝑃𝑠𝑡𝑎𝑟𝑡(𝑖) × 𝑃𝑒𝑛𝑑(𝑖′) is maximized over 𝑖, 𝑖′, with 𝑊𝑠 and 𝑊𝑒 being weight matrices to be learned. Then we select the answer span with the highest value among all the selected pages. Note that there is a limit on the maximum number of tokens to be predicted (15 by default), but for our case we set it to 90, since we sometimes have much longer answers than just a few words.
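The start and end probability formulas referenced here were lost in extraction; for reference, in the original DrQA reader (Chen, et al., 2017) they take a bilinear form over the paragraph vectors 𝐩𝑖 and the question vector 𝐪:

```latex
P_{start}(i) \propto \exp\left(\mathbf{p}_i \, \mathbf{W}_s \, \mathbf{q}\right),
\qquad
P_{end}(i') \propto \exp\left(\mathbf{p}_{i'} \, \mathbf{W}_e \, \mathbf{q}\right)
```

and the span (𝑖, 𝑖′) with 𝑖 ≤ 𝑖′ ≤ 𝑖 + max_len maximizing 𝑃𝑠𝑡𝑎𝑟𝑡(𝑖) 𝑃𝑒𝑛𝑑(𝑖′) is selected, consistent with the maximization described in the text.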
The model that we use for the document reader is trained jointly on the SQuAD dataset and various distant supervision training sets (a setup called multitask). The Stanford Question Answering Dataset (SQuAD) (Rajpurkar, et al., 2016) is a set of 100k questions and answers, together with the paragraphs containing the answers, over a set of Wikipedia articles. It is currently the largest general purpose dataset for text comprehension, and hence also for question answering; 87k Q&As can be used to train a model. Crowdworkers had to write a question and an answer given a Wikipedia paragraph. We use the model that joins several additional datasets such as WebQuestions (Berant, et al., 2013) and WikiMovies (Miller, et al., 2016). For these added datasets there is no paragraph associated with each question-answer pair, hence distant supervision must be used: it automatically extracts a paragraph related to a given pair of question and answer (Mintz, et al., 2009). As an initial choice for our system, we use the pretrained model as it is very generic; however, we expect it to be less powerful given that we have a quite specific task.
Extend the parsable extensions to doc, docx, xls, xlsx, ppt, pptx, txt, msg, pdf, jpg, png, zip, aspx. Note that we still need to make sure to parse documents page by page without losing the page layout.
The metadata that is not directly shown to the user, such as the document name, the extension, the author, the date of creation or the file size, is information that could be useful when searching for a document and must also be extracted alongside the sequence of words.
There are PDF files that are formatted as images, and a typical parser would not be able to read their content. We need to convert the text contained in images into the corresponding sequence of words using an Optical Character Recognition (OCR) engine.
We aim to detect the language of the documents in order to do a selection in the first place.
As shown on Figure 5, we first recursively scan the directory and select documents whose extensions match, case-insensitively. Note that we not only parse Microsoft Word Documents (.docx) but also Microsoft 97 – 2003 Documents (.doc), which have a completely different format.
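The scan described above can be sketched as follows (the helper names are ours):

```python
import os

PARSABLE = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".txt",
            ".msg", ".pdf", ".jpg", ".png", ".zip", ".aspx"}

def is_parsable(filename):
    """True if the file extension is parsable, matched case-insensitively
    (so e.g. REPORT.PDF is kept as well as report.pdf)."""
    return os.path.splitext(filename)[1].lower() in PARSABLE

def find_parsable_files(root):
    """Recursively scan a directory and return all parsable file paths."""
    return [os.path.join(dirpath, name)
            for dirpath, _dirnames, filenames in os.walk(root)
            for name in filenames
            if is_parsable(name)]
```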
Apache Tika faces no problems with regard to the extraction of text for the aforementioned extensions. However, it does face issues with the splitting of pages. Recalling the previous discussion on the topic of memory allocation: since we cannot feed an entire manual into the RNN, the alternative is to select a shorter span of text, i.e. a page. In the case of PDF or PPT, the pagination is part of the XML format of the document, which makes the task fairly simple: for each page or slide we extract content and metadata. However, in the case of documents such as DOC and DOCX, the pagination is created while the file is being opened (computed from the font style, size, etc.), which makes it considerably more difficult for the parser to extract the content of each page independently. A possible solution is to convert the file to a PDF after saving the metadata, and then execute the content extraction. We convert the DOCX files using the Apache POI library, and for DOC we aim to convert it into DOCX using docx4j3. However, if this fails (for example, due to image conversion) we must "manually" read every paragraph as a String and concatenate them in order to create the PDF. The main issue, however, is not resolved by the latter method, as we sometimes risk losing the page layout of the DOC, thus hindering the presentation of results.
3 https://www.docx4java.org/trac/docx4j
Finally, we use tesseract OCR v3.054 before saving every page's content and metadata in JSON files. UTF-8 is used for the text encoding, as it is the most common today5. We also compressed the contents by removing some unicode and character escapes which were causing trouble for the JSON loader in the RNN.
4 https://github.com/tesseract-ocr/tesseract
5 https://w3techs.com/technologies/overview/character_encoding/all
As depicted, we first fetch a set of relevant documents and then a set of pages, using two different inverted indexes. Note that we use exactly the same TFIDF customizations as for the page-only index in DrQA.
4.3.2 LDA
Initially, the only information given by the user is a query. From this small piece of information we need to infer an answer that is assumed to be somewhere inside a directory. The task is tedious for a machine, as searching for the answer in the entire corpus is very expensive; hence we aim to reduce the amount of text to be read as much as possible.
A common approach to Question Answering is to analyse the structure and the semantics of the question in order to formulate a structured language-based query (Liu, et al., 2015). This task makes use of a Knowledge Base, in contrast to our case, where we need to find the answer in free text. However, one common goal is to understand what the user is looking for. In order to do this, the machine must comprehend the domain and the context of the industry and the user.
One way to provide insight to the machine is to create a structured hierarchy of terms describing the different groups of the domain: a taxonomy. To understand this further, consider the Operator trainings. Documents are divided into four main categories: cigarette making equipment, packaging equipment, filter making equipment and tobacco processing. Each machine in the production line has a respective manual that instructs the user about the product; these include the Protos, M8, GD 121 and MK9. Each machine has different versions and languages available. This allows for easier access, as the documents are already classified given the path hierarchy, yet the required taxonomy at lower levels is missing: we lack knowledge of the materials and tools within the machines themselves and of the technical and mechanical jargon used to describe them. All of this could be done manually; however, we voluntarily did not want to create this taxonomy by hand, since we did not have the expertise to know the exact terminology and, more importantly, we wanted to be able to automatically replicate the tool on another corpus in a completely different context.
𝑓(𝑥1, …, 𝑥𝐾−1; 𝛼) = ( Г(𝛼𝐾) / Г(𝛼)^𝐾 ) ∏_{𝑖=1}^{𝐾} 𝑥𝑖^(𝛼−1) (10)

with {𝑥1, …, 𝑥𝐾−1} being the weights of each topic and Г(𝑥) = (𝑥 − 1)! being the gamma function.
This generates a random topic model distribution that we attempt to enhance through a learning process. For every term w in each document d we compute the following:
1. P(topic t | document d): the probability that document d is assigned to topic t.
2. P(word w | topic t): the probability that topic t is assigned the word w.
We then choose the topic by computing the product of the two probabilities, which corresponds to the probability that topic t generates the word w in document d. This is intuitively understood as taking one word and updating its topic to the one most likely to be generated by the document, while "fixing" the assignments of all other terms (which we assume to be correct).
An optimistic guess is that we could automatically detect the four groups described previously, each represented as a weighted average of terms with the highest weights on the terms that describe the topic best. In this case, as we know how many topics must be derived, LDA is more valuable than the Hierarchical Dirichlet Process (HDP), a variant of LDA which does not require the number of topics as input. If we successfully inferred the topic of each document, we could use the same model to analyse the topic of the query and filter out any document not related to this topic.
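The update step described above can be sketched as follows; this greedy variant picks the argmax instead of sampling, and the count-matrix layout is an illustrative simplification:

```python
def best_topic(word, doc, doc_topic_counts, topic_word_counts):
    """Pick the topic maximizing P(topic | document) * P(word | topic),
    estimated from the current count matrices. A greedy stand-in for the
    sampling step of collapsed Gibbs inference for LDA."""
    topics = list(topic_word_counts)
    scores = {}
    for t in topics:
        n_dt = doc_topic_counts[doc].get(t, 0)    # words of doc assigned to t
        n_tw = topic_word_counts[t].get(word, 0)  # times t generated word
        p_t_d = n_dt / max(1, sum(doc_topic_counts[doc].values()))
        p_w_t = n_tw / max(1, sum(topic_word_counts[t].values()))
        scores[t] = p_t_d * p_w_t
    return max(topics, key=scores.get)
```

A full implementation would also subtract the current word's own assignment from the counts and add Dirichlet smoothing terms before normalizing.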
4.3.3 ElasticSearch
Before answering questions, we want the user to be able to explore the indexed data, and the search to be as relevant as possible. In order to build an efficient custom document ranker, we chose to make use of a database management system. Among the many available, we compared two of the three most popular ones according to the DB-Engines ranking6: ElasticSearch (ES) and Solr. Both are based on Lucene and open source, with an HTTP layer to interact with the server; they also have rich frameworks, ecosystems and support. We integrated and tested both for document retrieval and, given the similarity of the results and the better scalability and popularity of ES, which is important for a potential industrialization of the tool, we decided to make use of ES.
6 https://db-engines.com/en/ranking/search+engine
7 https://github.com/dadoonet/fscrawler
We customized the text analysis with stemming, bi-gram and tri-gram analysis, and an added list of synonyms. On Figure 7, we illustrate the pipeline for document retrieval with the integration of ES.
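As an illustration, the customizations above roughly correspond to index settings of the following shape (the analyzer name and synonym list are placeholders, not the configuration actually deployed):

```python
# Sketch of Elasticsearch index settings matching the customizations above;
# "manual_text" and the synonym entries are illustrative placeholders.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "domain_synonyms": {"type": "synonym",
                                    "synonyms": ["plc, programmable logic controller"]},
                # Shingles of size 2 and 3 play the role of bi- and tri-grams.
                "bi_tri_grams": {"type": "shingle",
                                 "min_shingle_size": 2,
                                 "max_shingle_size": 3},
            },
            "analyzer": {
                "manual_text": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stemmer",
                               "domain_synonyms", "bi_tri_grams"],
                }
            },
        }
    }
}
```

Such a settings body would be passed when creating the index over the ES HTTP API.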
The scoring function Okapi BM25 is (since 2015) the default method for Lucene to score text and hence the default one used in Elasticsearch. As presented in section 11.4.3 of (Manning, et al., 2008), Okapi BM25 is a TFIDF-like function that takes into account term frequency and document length. One of its variants is computed as follows:

𝑆𝑐𝑜𝑟𝑒(𝑞, 𝑑) = ∑𝑡∈𝑞 log[𝑁/𝐷𝐹𝑡] · ( (𝑘1+1) 𝑇𝐹𝑡,𝑑 ) / ( 𝑘1 ((1−𝑏) + 𝑏 (𝐿𝑑/𝐿𝑎𝑣𝑒)) + 𝑇𝐹𝑡,𝑑 ) (11)

with k1 and b being free parameters, empirically set to k1 = 1.6 and b = 0.75; k1 calibrates the document term frequency scaling and b determines the scaling by document length. N is the total number of documents, 𝐷𝐹𝑡 is the number of documents containing term t, Ld is the length of document d (in tokens) and Lave is the average length of the documents in the corpus.
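A direct transcription of Equation (11) into Python, for illustration (the corpus here is a list of token lists; names are ours, not Lucene's):

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.6, b=0.75):
    """Okapi BM25 as in Equation (11): score one query against one
    tokenized document, given the full corpus for DF and L_ave."""
    n = len(corpus)
    l_d = len(doc_tokens)
    l_ave = sum(len(d) for d in corpus) / n
    score = 0.0
    for t in set(query_tokens):
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue  # term absent from the corpus contributes nothing
        tf = doc_tokens.count(t)
        idf = math.log(n / df)
        score += idf * (k1 + 1) * tf / (k1 * ((1 - b) + b * l_d / l_ave) + tf)
    return score
```

The saturation in the TF fraction means repeated occurrences of a term yield diminishing returns, unlike raw TF counting.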
A good way to illustrate the process of going from word embeddings to document distances is shown in Figure 8, taken from (Kusner, et al., 2015). Word representation, given its success, has many applications and guarantees high scalability with massive amounts of data. As shown by (Mikolov, et al., 2017), fastText performs better than GloVe on the SQuAD dataset with the original DrQA model, so we try to improve our system by incorporating fastText.
For the whole pipeline, even if we aim at answering questions, we could let the user read one sentence or a small segment where we believe the answer is most likely to be. This can be done following the same model as for document retrieval. The approach consists in creating a sliding window which shifts along the text of a page; it can be seen in the same way as the windowing of a speech signal in automatic speech processing, where here the samples are the tokens. We compute a similarity score between the query and the selected segment with so-called Query-Document Relevance (QDR) ranking functions. An overview of the process is shown on Figure 9.
As a first step we needed to create our model, which requires as much text as possible relevant to the specific field. We used an open source implementation of QDR8 to do the training which, like word embedding training, is unsupervised, meaning we only need unlabelled raw text. It supports the following scoring functions: TFIDF, Okapi BM25 and Language Model. Thus, what we call the model is simply the corpus statistics supporting these ranking functions.
After the model is trained we can compute the ranks; we arbitrarily decided to use BM25 as the ranking function to select the top segments. Note that the BM25 variant described in Equation (11) is the one used for our QDR function.
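The sliding-window ranking can be sketched as follows, with the scoring function passed in as a parameter (window width and step are illustrative defaults, not the values used in the thesis):

```python
def best_segments(query_tokens, page_tokens, scorer, width=30, step=10, top=3):
    """Slide a fixed-width token window along a page and rank segments by a
    query-segment relevance scorer (e.g. a BM25-style QDR function)."""
    segments = [page_tokens[i:i + width]
                for i in range(0, max(1, len(page_tokens) - width + 1), step)]
    ranked = sorted(segments,
                    key=lambda seg: scorer(query_tokens, seg),
                    reverse=True)
    return ranked[:top]
```

With a step smaller than the width, consecutive windows overlap, so an answer straddling a window boundary is still fully contained in some segment.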
4.5 Multilanguage
Even though we did all the experiments with documents written in English, a tool in English only is of limited use at PMI. Indeed, people working in production speak diverse languages and most of the time they are not able to speak English, so the manuals are translated by a specialist into their mother tongue. We indexed two additional non-English main sets of documents: one in French and one in Italian. These are related to the Operations technical training domain previously described, and the nature of the documents was no different from before.
8 https://github.com/matt-peters/qdr/
The first step is fairly simple since we are making use of ElasticSearch, so we configured each index accordingly. Note that the main change lay in the list of stopwords. We chose our stopword lists from ranks.nl, an online analysis Web site which checks for proper usage of keywords. The lists are complete but not too long and, most importantly, they include question words.
The second step is more problematic, since we are using a generic model trained on English data, and hence it would fail at answering non-English questions. One solution that we implemented is to send the text to the translate.google.com servers using the Googletrans API in order to obtain the English translation; we do the same at the endpoint to translate the answer as well. Another solution is to bypass the RNN with the sliding window, using the scorer trained on the given corpus.
In the final pipeline, one can choose whether to use DrQA's model or the sliding window with the API but, as shown on Figure 10, if documents are in English it uses DrQA; otherwise it finds the right segment using the sliding window trained a priori.
The multi-language feature has been created for the sake of functionality, but no research has been done on it for our case. However, today's attention-based neural machine translation (NMT) models (Bahdanau, et al., 2014) have demonstrated excellent quality (Wu, et al., 2016), which would not add too much noise for the reader. On the other hand, since there are many words which have a particular meaning in PMI's industry, one possibility is to train our own model with PMI-related documents. Indeed, the majority of the non-English manuals come from an expert's translation from the source language, English, and hence this constitutes a training set. Finally, if the non-English documents are a subset of the English documents, we could instead create an automatic mapping between every language variant of a document and only translate the query. Under this condition, this method would considerably reduce the noise created by the translation.
We created two main RESTful APIs with the use of Flask-RESTPlus9. One was created to call the full pipeline, with the query as input and the predicted answer with its context as output. What we call the context is the text in which the model has found the answer, with the predicted segment highlighted; one can also choose to retrieve a window of l tokens around the prediction. Note that all these parameters can be set in the API, and the mechanism is depicted on Figure 11. The second API was created for the machine reading stage; one can choose to use the RNN or the sliding window method. It handles English, French and Italian, but it has been optimized for English and has not been tested for the other languages. For the top-k document retrieval part, since we are using Elasticsearch, a complete RESTful API is already available.
For the UI we made use of Reactive Search10, an open source React and React Native UI components library for Elasticsearch. Note that there were a variety of other similar projects that could fulfill this task, such as SearchKit, DejaVu or InstantSearch. However, ReactiveSearch was a better match for full text search, with an active community, development and maintenance. We created our own new React components in order to add the machine reading feature as well as the document preview. The graph on Figure 11 also describes the pipeline of the application, since it follows the same steps.
9 https://github.com/noirbizarre/flask-restplus/
10 https://github.com/appbaseio/reactivesearch
The application can be decomposed into four main components, and a screenshot is shown on Figure 12:
The search bar to formulate the query, on top of the app; note that it suggests content (according to document names) as the user types the query.
The filters on the left, which could easily be extended depending on the metadata of the docs. It currently has a nested list to filter documents in the hierarchy. For files such as in the Operations case, this is a very important feature which can lead to a massive gain of time, but it supposes that files are structured in categories, which is not always the case. Other filters include the extension, the creation date, the language and the authors.
The result list with the retrieval score and metadata. One can click « Show » to unwrap the text of the page and click « Read » in order to run the prediction for the best span in the text. It gives a set of 5 predictions and, as the user moves the mouse over one of the predictions, it highlights the corresponding segment in the text. One can also click the path of the file to open the document at the specific page.
The document preview, which shows the corresponding document at the right page.
Most of the choices made for the interface have been motivated by discussions with the stakeholders. The idea is to have the possibility to proceed in both a top-down and a bottom-up approach: either we can start by filtering out documents and become more and more precise in the query formulation, or we can have a precise question in mind and directly call the reader from the results.
Chapter 5 Experiment Results
5.1 Dataset
Initially we did not have any example questions with which to see how each step of the system was performing. We hence asked the most interested stakeholders, Operations, for a sample of question and answer pairs. They originally gave us a set of 10 Q&As, which was barely enough to get a general overview of DrQA's performance, so they created a set of 50 Q&As for us. The test set consists of questions asked in a natural language manner, each paired with an answer that is an exact match of a sequence of words contained in the indexed corpus. The first task was to clean the dataset by correcting formulation errors and misspelled words; we also paired every Q&A with the corresponding document and its page. Note that for each pair we initially suppose that only one document is relevant and contains the exact answer, which makes the measurements much easier; however, this is not a correct assumption, since there are duplicates of the manuals (Word and PDF versions, for instance).
{"question": "What is the capacity of the glue jet tank for a Protos 70 ?", "answer": ["approx. 13 litres"], "document": "Operator course Protos 70.pdf", "page": "51"}
{"question": "Where are located the PLC ?", "answer": ["Into switch cabinet"], "document":
"Operator course Protos 70.pdf", "page": "65"}
{"question": "How to check the strength of the cigarette seam ?", "answer": ["Select at random one cigarette and hold the cigarette horizontally at the filter with your left hand in such a way that the tobacco ends is facing yourself. Then turn the rod 45° to the right and 45° to the left and check if the seam has opened due to the torsion. If that is the case, check all
{"question": "How to adjust the MAX PM drums ?", "answer": ["any drum can be chosen as a reference point. Usually it is convinient to begin ajusting from filter feed drum"], "document": "OT. 01.11.04.05-2 (MAX PM) en.v2.doc", "page": "28"}
{"question": "How to adjust the GD121P garniture clappers ?", "answer": ["Follow the operation below for performing the adjustment of the clapper"], "document": "GD12P Mehcanical Manual.pdf", "page": "134"}
As shown in the five examples above, there are a few remarks to be made about this test set:
The names of the machines are not very explicit, i.e. Protos 70, MAX PM, GD121P, and there are terms that have a particular meaning in this specific field, such as "glue", "seam", "drums", "clappers". The names of the machines are tokens that do not exist in our dictionary, and the domain-specific words have a semantics that might not fit their typical meaning. For example, in the case of "knife", it describes an actual piece of a machine and hence the meaning is captured in the embedding; however, "drum" does not refer to the musical instrument but to a part of a tool in a machine.
Questions can often be ambiguous; for instance, "Where are located the PLC?" or "How do I replace the V-Belts?" can have several answers, since a "PLC" can be situated at different places: we do not know which production line or which machine the operator is working on.
5.2 Metrics
The choice of metrics to assess the performance of the different approaches turned out not to be as trivial as expected. As a first guess, and as many other research papers in question answering suggest, we should aim at maximizing two scores:
The Exact Match (EM) score, which is simply the average of predictions that are equal to the true expected answer. It is a binary result for each question.
The F1 score, which is the harmonic mean between precision and recall, that is:

𝐹1 = 2 / (1/𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 1/𝑟𝑒𝑐𝑎𝑙𝑙)

It lies between 0 and 1, with 1 being the best value. The precision is defined as 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃 / (𝑇𝑃 + 𝐹𝑃), where a true positive (TP) is a predicted word that appears in the expected answer and a false positive (FP) a predicted word that does not. For instance, if the expected answer contains three words and we predict two of them, the two words we retrieved are correct and there is no incorrect word, so the precision is 1. The recall is defined as 𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃 / (𝑇𝑃 + 𝐹𝑁), where a false negative (FN) is an expected word that was wrongly not retrieved; in the same example we retrieved 2 of the 3 correct words, giving a recall of 2/3. Note that we do not take order into account, which can be misleading. Considering again the same example, these two results would lead to an F1 score of 0.8.
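The two metrics can be sketched as follows; this token-set variant ignores order and duplicates, matching the description above:

```python
def exact_match(pred, gold):
    """1 if the normalized prediction equals the gold answer, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def f1_score(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer string,
    ignoring word order."""
    pred_tokens = set(pred.lower().split())
    gold_tokens = set(gold.lower().split())
    tp = len(pred_tokens & gold_tokens)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)
    recall = tp / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting two of the three gold words (with no wrong word) gives precision 1 and recall 2/3, hence F1 = 0.8, the value computed in the text.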
There are few limitations concerning these metrics, firstly for each question we
would need to have an exact match of the answer in the documents which was not the
case for all our questions. Moreover it is sometimes subjective to give an answer to a
question, where should we start and where should we stop. If we look at the answer “Fol-
low the operation below for performing the adjustment of the clapper“, one could expect
the operation coming next as an answer or simply “Follow the operation bellow”. As dis-
cussed earlier, the question might be too vague and we cannot always expect one and
only one answer to be correct. In our context we need to ask ourselves what we want to
maximize: the answer is not given in speech manner or in a very short period of time, in-
52
Chapter 5 Expirement Results
stead, the user has the possibility to explore documents. An answer can only be validated
if the context is given in addition. For these reasons, we augmented the context of the
span retrieved by 10 tokens on both sides of the answer, and we assume it to be correct
if it contains at least one keyword (i.e. not a stopword) of the expected answer. For the
sake of simplicity, we call P@n the percentage of questions for which an answer is
correct (as described) in one of the top n documents; this is not to be confused with
precision in the context of document retrieval.
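A minimal sketch of this widened-context check (the stopword list and the inclusive span indices are illustrative assumptions, not the project's actual implementation):

```python
# Illustrative stopword list; the project's actual list is not reproduced here.
STOPWORDS = {"the", "a", "an", "of", "for", "to", "in", "and", "or"}

def is_correct(doc_tokens, span_start, span_end, expected_answer, window=10):
    """Judge a predicted span [span_start, span_end] (inclusive) correct if,
    once widened by `window` tokens on both sides, it contains at least one
    non-stopword keyword of the expected answer."""
    lo = max(0, span_start - window)
    hi = min(len(doc_tokens), span_end + 1 + window)
    context = {t.lower() for t in doc_tokens[lo:hi]}
    keywords = {t.lower() for t in expected_answer.split()} - STOPWORDS
    return bool(context & keywords)
```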
Concerning the document retrieval part, for the choice of metrics we first assumed
that only one document was relevant, and we defined P@n as the percentage of
questions for which the expected document is retrieved. However, we refined this
measurement along the thesis by checking whether the answer segment appears in one
of the top n documents. Note that it is different from the precision commonly described in
information retrieval, which is the one we described at the start of this section.
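Once each ranked result has been judged correct or not, both variants of P@n reduce to the same computation; a minimal sketch:

```python
def precision_at_n(judged_results, n):
    """P@n as used here: the fraction of questions for which at least one of
    the top-n ranked results was judged to contain the answer.
    judged_results: one list of booleans per question, in ranked order."""
    hits = sum(1 for ranked in judged_results if any(ranked[:n]))
    return hits / len(judged_results)
```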
5.3 Results
This section presents the performance of our iterative approach; we will see the
results that motivated the decisions we took.
Chapter 5 Experiment Results
As shown on Figure 14, the token distribution is right-skewed because the system
generally gives shorter answers than expected. The example shown on Figure 15 gives a
good picture of how the model has been trained: the answers have to be as short as
possible and, if the query is of the form « What is {m} ? » and {m} is followed by
parentheses, it is most likely that what is in the parentheses is the information we are
looking for. From an engineering point of view we could easily retrieve a bigger segment,
and from the user's point of view we could easily locate the exact answer we are looking
for with this kind of answer.
Figure 15 Q: "What is the inner frame transversal transport belt Focke 550 for?" Result of DrQA highlighted in yellow and
expected answer in green
We tested it on the 50-QA test set and the results we obtained are shown in the
first and third lines of Table 3. Compared to the expected distribution, it confirms the
assumption that DrQA answers with as short an answer as possible. The results for the
retriever are comparable to those stated in (Chen, et al., 2017), which reported 78% with
the same measurements on the SQuAD dataset. However the RNN performance is lower:
we initially reached a 28% F1 score (for the full pipeline) as opposed to 79% (for the
reader part only), and 8% EM compared to 29.5% (both on the full pipeline). The results
obviously get worse as we add irrelevant documents to the inverted index, since this
adds noise to the retrieval step.
Table 3 Results from DrQA with the initial setting (line 1), the granular TFIDF (line 2), the initial and granular TFIDF on a
directory containing all the files of the department (only the subfolder contains the relevant files) (lines 3 and 4) and finally
the reader only (line 5)
Note that results get worse as we feed more text into the RNN. Indeed, we could
not evaluate it without the retriever because of memory allocation issues; however, we
illustrate the idea by retrieving 500 pages that we input into the RNN, and we reached an
F1 score of 12.35% and an EM of 4%, which is very low. Hence there is an obvious
benefit in performing the document retrieval step beforehand: it is a considerable gain in
time and precision.
Besides, the results for the reader stage made us question two aspects:
        ElasticSearch   Solr
P@1     54%             54%
P@3     80%             80%
P@5     88%             84%
Table 6 ES and Solr comparison results
As we can see there is no considerable gap between the two, hence we prioritized
the easiest to use with the biggest support and popularity: Elasticsearch. We set
ourselves the goal of reaching at least the original performance of TFIDF, i.e. a P@5 of
90%. Table 7 shows the different customizations we applied. The way we proceeded was
simply to look at each question that was causing issues, and this guided us to the desired
results. For instance, in « When should the pump cleaning be performed ? », the two
tokens « pump » and « cleaning » articulate together and they have a different meaning if
they are not together, hence it makes sense to use bigrams. We also have similar
questions such as « What is the LES inspection system ? », where the three tokens
« LES », « inspection » and « system » articulate together; hence we gained 12% in P@1
by simply adding a trigram analyzer. Another improvement was made using a list of
synonyms: for example, if we take « What does GSR mean ? », « GSR » is not a token
that appears in any of the documents in the corpus, and one way to fix this is to add a
synonym that does appear in the corpus, i.e. « Gas to Solids Ratio » in this example.
With the combination of uni-, bi- and trigrams and synonyms, we reach a P@5 of 94%.
By tweaking the importance of the different fields we managed to improve the P@5 by
2%: indeed, we squared the importance of the bigrams as well as the trigrams. Last but
not least, in the case of questions such as « How temperature is influencing the
tobacco's taste and monitored ? », the answer appears where only the root of
« influencing » occurs, hence we used English stemming. It improved P@3 by 10%.
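As an illustration, the customizations above (shingle-based bi/trigrams, acronym synonyms, English stemming, and squared field boosts) might be declared in Elasticsearch roughly as follows; all field and analyzer names are assumptions, not the project's actual mapping:

```python
# Hypothetical Elasticsearch index settings sketching the analyzer chain
# described in the text. The shingle filters produce bigrams/trigrams, the
# synonym filter maps acronyms such as GSR to their expansion, and the
# stemmer reduces words like "influencing" to their root.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "bigrams": {"type": "shingle",
                            "min_shingle_size": 2, "max_shingle_size": 2,
                            "output_unigrams": False},
                "trigrams": {"type": "shingle",
                             "min_shingle_size": 3, "max_shingle_size": 3,
                             "output_unigrams": False},
                "acronyms": {"type": "synonym",
                             "synonyms": ["GSR => gas to solids ratio"]},
                "english_stemmer": {"type": "stemmer", "language": "english"},
            },
            "analyzer": {
                "uni": {"tokenizer": "standard",
                        "filter": ["lowercase", "acronyms", "english_stemmer"]},
                "bi": {"tokenizer": "standard",
                       "filter": ["lowercase", "bigrams"]},
                "tri": {"tokenizer": "standard",
                        "filter": ["lowercase", "trigrams"]},
            },
        }
    }
}

# Query-time boosting: squaring the weight of bigram and trigram matches.
query = {
    "query": {
        "multi_match": {
            "query": "When should the pump cleaning be performed?",
            "fields": ["content.uni", "content.bi^2", "content.tri^2"],
        }
    }
}
```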
Note that we obtained these results doing full-document retrieval. We applied the
best-performing analyzer to the indexed pages and the results are in Table 8.
We have a P@5 of 84% compared to 78% with the original TFIDF implemented by
Chen et al., even though they stated in their paper that it performed better than
Elasticsearch. Another comment is that this percentage only represents the proportion of
questions where we retrieved the expected document, but it is highly likely that even
though the page is not the one expected, it does contain a meaningful answer. Thus, as
mentioned in the Measurement section, we refined this P@n metric by checking each
document manually in order to see whether a meaningful answer was in the top n
retrieved. By meaningful we mean that the answer is exactly the same as the expected
span, or deviates by a few synonyms, and that the user could still infer the answer from
the document. The results are presented in Table 9; note that since we had to do these
measurements manually, for the P@5 we looked at the retrieved response only, whereas
for P@1 we looked at the whole page.
                                                                      Average
Best predicted span in a meaningful page (“P@1”)                      58%
Top 5 predicted spans giving the correct answer or part of it (“P@5”) 86%
Table 9 Results of ES with the refined measurement
F1 Score
Original DrQA with GloVe 5 Pages 32.01%
DrQA with fastText 5 Pages 35.29%
DrQA with fastText 3 Pages 39.49%
DrQA with fastText 1 Page 38.19%
Table 10 Results comparison with GloVe and fastText
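GloVe and fastText both ship their pre-trained vectors in a simple text format, so swapping one for the other mainly means loading a different file; a simplified loader (an illustration, not DrQA's actual loading code) could look like:

```python
def load_vectors(path, vocab=None):
    """Load word vectors stored in the textual GloVe/fastText format:
    one word per line followed by its vector components. fastText .vec
    files start with a "num_words dim" header line; GloVe files do not."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            parts = line.rstrip().split(" ")
            if i == 0 and len(parts) == 2:
                continue  # skip the fastText header line
            word, values = parts[0], parts[1:]
            # Optionally restrict to the task vocabulary to save memory.
            if vocab is None or word in vocab:
                vectors[word] = [float(v) for v in values]
    return vectors
```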
Average
P@1 Retriever 72%
P@1 Reader – Top5 Pages Retrieved 58%
P@1 Reader – Top3 Pages Retrieved 70%
Table 11 % of relevant answers by DrQA when we retrieve different numbers of pages
We can see that if we input more than one page into the RNN, it is more likely to
find an answer in an irrelevant document. Hence, it raised the question introduced in
Section 4.4.2: is the model that reads at scale helping to gain insight, or would a simple
non-ML method outperform this reading step? Note also that the highest P@1 is obtained
when we read only the first page retrieved, which shows that we should rather read one
page at a time; this has been taken into account in the UI.
The code for the implementation of the sliding window is in the Jupyter notebook
called BM25vsDrQA.ipynb. It also contains the comparison with the best DrQA model.
7’246’565 tokens (in 43k documents) are used to train the QDR scoring function model,
coming from the original training manuals plus an additional dataset acquired at the end
of the thesis that contains webpages on Reduced-Risk Products. Of course, these are two
different fields; however, they are still related to cigarettes, PMI and tobacco, whereas
the corpus dealing with the Finance KB contains too many off-topic terms to include it as
well.
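The notebook itself is not reproduced here, but the sliding-window baseline could be sketched roughly as follows (window size, stride and BM25 parameters are illustrative assumptions):

```python
import math
from collections import Counter

def bm25_sliding_window(page_tokens, query_tokens, window=20, stride=10,
                        k1=1.5, b=0.75):
    """Slide a fixed-size window over a page, score each window against the
    query with BM25 (windows play the role of documents), and return the
    best-scoring window as the candidate answer segment."""
    starts = range(0, max(1, len(page_tokens) - window + 1), stride)
    windows = [page_tokens[i:i + window] for i in starts]
    n = len(windows)
    avgdl = sum(len(w) for w in windows) / n
    # Document frequency of each query term over the windows.
    df = {t: sum(1 for w in windows if t in w) for t in query_tokens}

    def idf(t):
        return math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))

    def score(w):
        tf = Counter(w)
        norm = k1 * (1 - b + b * len(w) / avgdl)
        return sum(idf(t) * tf[t] * (k1 + 1) / (tf[t] + norm)
                   for t in query_tokens if tf[t])

    return max(windows, key=score)
```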
Note also that we input a relevant page directly into the model. The results are
shown in Table 12. These results show that DrQA is much more precise, and it clearly
exceeds both a random guess and a simple non-ML method based on a BM25 sliding
window.
Chapter 6 Conclusion
6.1 Result Discussion
Throughout our research we have seen, first, that there is a considerable difference
in performance between our enterprise use case and the results obtained by Chen et al.
with DrQA: a 42% F1 score versus 79% for the reader only, and 8% EM against 29% on
the full pipeline. Despite the good results of the Document Retriever relative to Chen et
al. - with 76% P@5 for pages and 90% P@5 for full documents versus 78% P@5 for
Wikipedia articles - we decided to focus on this step, as in the case of a search engine it
is the priority, and the performance of everything that follows in the pipeline relies on it.
We then conducted our research on how to adapt the generic MRC model to our
domain-specific use case. We managed to reach a 39% F1 score (+10% over the original
settings of DrQA) by troubleshooting and following the settings of recent research
(Raison, et al., 2018). Along the way, we also refined the measurement methods used,
because the purpose of our project was search and gaining insight on enterprise-related
documents, which is less rigid than giving a very short and exact answer span. The final
results suggest that the deep learning (DL) model we used has a benefit over (1) a
random guess of a text
segment over the whole page and (2) a non-ML, IR-based sliding window. Even though
the performance is not as good as that of models answering open-domain questions in
the literature, recent breakthroughs in DL for NLP and MRC have a positive impact and
make the technology ready for certain enterprise use cases. Moreover, in the context of
vertical search, where operators were searching through a hierarchical filtering process
of the files and keyword search over manual names, there is a high potential in terms of
time gained for the end user of the system created. Indeed, the search engine was
designed to handle both a top-down and a bottom-up approach: whether the user knows
exactly what he is looking for and what to ask, or whether he needs support with the
exploration of the results and the metadata filters.
This project dispelled the myths surrounding deep learning by clarifying what can
be done and what is not achievable in terms of IR and MRC on a relatively specific
domain such as the production machines' manuals. There are high expectations in the
industry towards this technology and results are promising, but it often needs a
considerable amount of resources in order to be customized to the use case. Finally,
even though we attempted to tackle a QA task, the top-k document retrieval stage of our
pipeline, which demonstrated excellent results, is non-ML, and it remains the most
reliable (and best-scaling) feature for the end user to perform full-text search.
Another angle of attack is to do post-processing and constantly refine the results
shown to the user, for instance through incremental training. Another option would be to
cache already successfully answered questions and propose one if a query is similar
enough, in terms of F1 distance, to a question correctly answered before. This is a key
aspect, as in natural language search users often do not know how to formulate their
questions or what to ask. In the same perspective, we could also focus on reinforcement
learning, where the user can give feedback on whether the content proposed is relevant
for each query.
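The caching idea could be sketched as follows (the token-level F1 similarity and the 0.8 threshold are illustrative assumptions):

```python
from collections import Counter

def token_f1(a, b):
    """Token-overlap F1 between two questions (order-insensitive)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(ca.values()), overlap / sum(cb.values())
    return 2 * p * r / (p + r)

class AnswerCache:
    """Cache of successfully answered questions; a new query reuses a
    cached answer when its token F1 with a past question is high enough."""
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (question, answer) pairs

    def add(self, question, answer):
        self.entries.append((question, answer))

    def lookup(self, query):
        best = max(self.entries,
                   key=lambda qa: token_f1(query, qa[0]),
                   default=None)
        if best and token_f1(query, best[0]) >= self.threshold:
            return best[1]
        return None
```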
For a larger-scale project, and in a more horizontal search perspective, we could
also aim at leveraging the context of the user who is searching by looking at his location,
department, colleagues, etc. Of course this brings confidentiality into the picture, which
requires the knowledge graph of employees' read, write and edit rights. This is an
essential concept that has not been tackled here.
Last but not least, we did not collect any question-answer pairs in other languages
in order to test how the model performs in a non-English setup. As a next step, one could
collect real-life logs so that an in-depth analysis of the tool and its equivalents in other
languages can be made. While most successful models manage to train on ever more
data, we hope that this technology will tend towards being less data-demanding in order
to train case-specific models.
References
Auer, S. et al., 2007. DBpedia: A Nucleus for a Web of Open Data. Berlin, Springer-Verlag, pp. 722-735.
Bahdanau, D., Cho, K. & Bengio, Y., 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, Volume
abs/1409.0473.
Berant, J., Chou, A., Frostig, R. & Liang, P., 2013. Semantic parsing on Freebase from question-answer pairs. pp. 1533-1544.
Chen, D., Bolton, J. & Manning, C. D., 2016. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. CoRR,
Volume abs/1606.02858.
Chen, D., Fisch, A., Weston, J. & Bordes, A., 2017. Reading Wikipedia to Answer Open-Domain Questions. CoRR, Volume
abs/1704.00051.
Voorhees, E. M. & Harman, D. K., 2005. TREC: Experiment and Evaluation in Information Retrieval. s.l.: MIT Press.
Fan, J., Kalyanpur, A., Gondek, D. C. & Ferrucci, D. A., 2012. Automatic knowledge extraction from documents. IBM Journal of
Research and Development, 5, Volume 56, pp. 5:1--5:10.
Ferrucci, D. et al., 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 7, Volume 31, p. 59.
Graves, A., 2012. Supervised Sequence Labelling with Recurrent Neural Networks. s.l.:Springer Berlin Heidelberg.
Graves, A., Wayne, G. & Danihelka, I., 2014. Neural Turing Machines. CoRR, Volume abs/1410.5401.
Hermann, K. M. et al., 2015. Teaching Machines to Read and Comprehend. CoRR, Volume abs/1506.03340.
Hill, F., Cho, K. & Korhonen, A., 2016. Learning Distributed Representations of Sentences from Unlabelled Data. CoRR, Volume
abs/1602.03483.
Joulin, A. et al., 2016. FastText.zip: Compressing text classification models. CoRR, Volume abs/1612.03651.
Jurafsky, D. & Martin, J. H., 2018. Speech and Language Processing: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd draft ed. Upper Saddle River, NJ, USA: Prentice Hall PTR.
Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K., 2015. From Word Embeddings To Document Distances. Lille, PMLR, pp. 957-966.
Li, D. & Qian, J., 2016. Text sentiment analysis based on long short-term memory. s.l., IEEE.
Lindström, M., 2017. Commentary on Wang et al. (2017): Differing patterns of short-term transitions of nondaily smokers for
different indicators of socioeconomic status (SES). Addiction, 4, Volume 112, pp. 873-874.
Liu, K., Zhao, J., He, S. & Zhang, Y., 2015. Question Answering over Knowledge Bases. IEEE Intelligent Systems, 9, Volume 30, pp. 26-
35.
Liu, X., Shen, Y., Duh, K. & Gao, J., 2017. Stochastic Answer Networks for Machine Reading Comprehension. CoRR, Volume
abs/1712.03556.
Manning, C. D., Raghavan, P. & Schütze, H., 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University
Press.
Mikolov, T., Chen, K., Corrado, G. & Dean, J., 2013. Efficient Estimation of Word Representations in Vector Space. CoRR, Volume
abs/1301.3781.
Mikolov, T. et al., 2017. Advances in Pre-Training Distributed Word Representations. CoRR, Volume abs/1712.09405.
Miller, A. C. et al., 2017. A genetic basis for molecular asymmetry at vertebrate electrical synapses. eLife, 5, Volume 6.
Miller, A. H. et al., 2016. Key-Value Memory Networks for Directly Reading Documents. CoRR, Volume abs/1606.03126.
Mintz, M., Bills, S., Snow, R. & Jurafsky, D., 2009. Distant Supervision for Relation Extraction Without Labeled Data. Stroudsburg,
Association for Computational Linguistics, pp. 1003-1011.
Moen, H. et al., 2015. Care episode retrieval: distributional semantic models for information retrieval in the clinical domain. BMC
Medical Informatics and Decision Making, 6, Volume 15.
Nguyen, T. et al., 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. CoRR, Volume
abs/1611.09268.
Pennington, J., Socher, R. & Manning, C., 2014. Glove: Global Vectors for Word Representation. s.l., Association for Computational
Linguistics.
Raison, M., Mazaré, P.-E., Das, R. & Bordes, A., 2018. Weaver: Deep Co-Encoding of Questions and Documents for Machine
Reading. CoRR, Volume abs/1804.10490.
Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P., 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. CoRR,
Volume abs/1606.05250.
Richardson, A., Gardner, R. G. & Prelich, G., 2013. Physical and genetic associations of the Irc20 ubiquitin ligase with Cdc48 and
SUMO. PLoS ONE, 8(10), p. e76424.
Robertson, S. & Zaragoza, H., 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr., 4, Volume
3, pp. 333-389.
Vrandečić, D. & Krötzsch, M., 2014. Wikidata. Communications of the ACM, 9, Volume 57, pp. 78-85.
Wang, S. et al., 2017. R^3: Reinforced Reader-Ranker for Open-Domain Question Answering. CoRR, Volume abs/1709.00023.
Weinberger, K. Q. et al., 2009. Feature Hashing for Large Scale Multitask Learning. CoRR, Volume abs/0902.2206.
Wei, X. & Croft, W. B., 2006. LDA-based Document Models for Ad-hoc Retrieval. New York, NY, USA, ACM, pp. 178-185.
Welbl, J., Stenetorp, P. & Riedel, S., 2017. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. CoRR,
Volume abs/1710.06481.
Weston, J., Chopra, S. & Bordes, A., 2014. Memory Networks. CoRR, Volume abs/1410.3916.
Wu, Y. et al., 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR,
Volume abs/1609.08144.
Yu, A. W. et al., 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. CoRR, Volume
abs/1804.09541.