

École Polytechnique Fédérale de Lausanne

Faculté Informatique et Communication

Philip Morris International Management S.A

Master Thesis in Communication Systems

Natural Language Search Engine to Answer Enterprise Domain-Specific Questions

Louis BALIGAND
louis.baligand@epfl.ch

Under the supervision of

David Schmalz (David.Schmalz@pmi.com), Prof. Karl Aberer (karl.aberer@epfl.ch), Hamza Harkous (hamza.harkous@epfl.ch)

August 2018

Acknowledgements
First and foremost, I would like to express my deepest gratitude to my supervisor, David Schmalz, who has been constantly eager to guide me towards the realization of my thesis. He has been an exceptional advisor, and this applies far beyond the span of my thesis. I would also like to thank my colleagues, and more specifically Debmalya Biswas, without whom this opportunity would not have been possible.

I would also like to thank Prof. Karl Aberer and Hamza Harkous from EPFL, who have never failed to give me great insights and to share their expertise in an exciting and vibrant area.

Last but not least, I am grateful to my family and friends who have provided me with
unique support and motivation throughout the whole of my studies.

Lausanne, 17 August 2018

Abstract
Modern search-driven engines leverage Natural Language Processing (NLP) and Machine Learning (ML) to improve the relevance of results, as well as to reduce the friction between the querying human and the assisting application. The emergence of Information Retrieval (IR) algorithms and deep learning models has shifted search beyond simple Boolean keyword matching.

In this work, we investigate two different use cases at Philip Morris International (PMI), each of which presented different pain points. For instance, operators on factory production floors often look for answers manually (or with obsolete search engines), which leads to considerable time losses; a financial department, on the other hand, needs to search through highly confidential, unstructured information.

To improve this situation, we designed a search engine that aims at answering industry domain-specific questions. We combined best-of-breed information retrieval approaches with deep neural networks, based on DrQA, a system for Machine Reading Comprehension (MRC) applied to open-domain question answering proposed by Facebook's research group (Chen, et al., 2017).

On the retrieval side, following unsuccessful attempts to improve relevance with Latent Dirichlet Allocation and granular inverted index matrices, we integrated and customized ElasticSearch to perform this task. We collected a test set, and our retriever outperformed the initial settings by 8% precision for the top 5 pages and 6% for full documents, starting from respectively 76% and 90% (precision at 5). In addition, we improved the reader performance by 11% F1 score (initially 28% with the original settings) after changing the word embedding model as well as refining and extending the file system parsing process. The final results suggest that the reader model we used has a benefit over (1) a random guess of a text segment over the whole page and (2) a non-ML, IR-based sliding window. Even though we tackle a question-answering problem in this project, our tool has been built in such a way that it is also functional for full-text search.


Keywords
Information Retrieval, Question Answering, Full text search, Machine Reading Comprehension


Contents
Acknowledgements
Abstract
Keywords
List of Figures
List of Tables
List of Equations
Introduction
  1.1 Context
  1.2 Motivation
  1.3 Problem Formulation
  1.4 Thesis Overview
Related Work
Context & Customer needs
  3.1 Operation Technical Trainings
  3.2 Finance Knowledge Base
Implementation
  4.1 Initial State : DrQA
    4.1.1 Document Retriever
    4.1.2 Document Reader
  4.2 Pre Processing
  4.3 Document Retrieval
    4.3.1 Granular TF-IDFs
    4.3.2 LDA
    4.3.3 ElasticSearch
  4.4 Document Reader
    4.4.1 Word Embedding
    4.4.2 Sliding window
  4.5 Multilanguage
  4.6 API and User Interface
Chapter 5 Experiment Results
  5.1 Dataset
  5.2 Metrics
  5.3 Results
    5.3.1 Initial DrQA
    5.3.2 Reader only
    5.3.3 Granular TF-IDF
    5.3.4 Parameter Optimization
    5.3.5 Automatic Topic Modelling classifier
    5.3.6 ElasticSearch Integration Results
    5.3.7 Word vectorization
    5.3.8 Sliding Window
Chapter 6 Conclusion
  6.1 Result Discussion
  6.2 Future development
References
List of Figures

Figure 1 Graph of the modules describing the search at PMI
Figure 2 Full pipeline of the Thesis' system
Figure 3 Full pipeline of DrQA as described in (Chen, et al., 2017)
Figure 4 Graph of Bi-LSTM units articulated together in an RNN
Figure 5 Crawling process description
Figure 6 Pipeline of our granular TFIDF method
Figure 7 Graph of the ingestion, index and search process in ES
Figure 8 Taken from the original paper
Figure 9 Graph of the sliding window process
Figure 10 Selection process of the method in the multi-language case
Figure 11 Pipeline of both API and UI
Figure 12 Screenshot of the UI
Figure 13 Distribution of the number of tokens per answer
Figure 14 Distribution of the lengths of segments answered by DrQA
Figure 15 Q: "What is the inner frame transversal transport belt Focke 550 for?" Result of DrQA highlighted in yellow and expected answer in green
Figure 16 Visualisation of the results in Table 3
Figure 17 Visualization of the results of Table 4

List of Tables
Table 1 Comparison of two PMI use cases
Table 2 Sample of answers from DrQA with initial settings
Table 3 Results from DrQA with initial settings (line 1), the granular TFIDF (line 2), the initial and granular TFIDF on a directory containing all the files of the department (only the subfolder contains the relevant files) (lines 3 and 4), and finally the reader only (line 5)
Table 4 Results when tweaking the number of documents retrieved
Table 5 LDA results on 4 topics
Table 6 ES and Solr comparison results
Table 7 ES results with different settings
Table 8 Results of ES on full documents compared to pages
Table 9 Results of ES with the refined measurement
Table 10 Results comparison with GloVe and FastText
Table 11 Percentage of relevant answers by DrQA when retrieving different numbers of pages
Table 12 Comparison of the random, Sliding Window and DrQA approaches


List of Equations
Equation 1
Equation 2
Equation 3
Equation 4
Equation 5
Equation 6
Equation 7
Equation 8
Equation 9
Equation 10
Equation 11


Introduction
In this Chapter, we first present the background of the project (Section 1.1) in order to better understand its motivations (Section 1.2). We then define the problem we faced during the thesis (Section 1.3), and finally we give a short but complete summary of the research process (Section 1.4).

1.1 Context

Over the last decade, the rate at which information is created has increased a thousand-fold, and most predictions target a continued exponential increase in the upcoming years (Eriksen, 2001). Digital giants such as Google, Bing and Yahoo! have created entirely new business models based on the expanding need to find relevant knowledge, by refining their information indexing and retrieval capabilities as well as by leveraging personal data to contextualize results.

Like the majority of large-scale corporations, PMI produces a plethora of data year on year and is looking for innovative ways to effectively tackle its employees' information overload. Such innovation initiatives fall within the scope of the PMI Enterprise Architecture team, which is part of the Information Systems function. The team works closely with various internal departments and external experts in order to grasp the corporate need for adequate cutting-edge technologies, ranging from Blockchain to Artificial Intelligence to the Internet of Things. Enterprise Architecture has a very broad scope of action, and Enterprise Search is one of its major concerns. This practice challenges companies on the following points:

1) Breakthroughs in capabilities, but an amount of information not comparable to the World Wide Web:
Traditional approaches to Enterprise Search have shifted from simple keyword queries to leveraging cognitive analysis. This is due to the emergence of tools such as automatic speech recognition, image processing and translation using Machine Learning and Natural Language Processing (NLP). Hence, as the possibilities scale up, so do the amount of noise in search results and the expectations of users. Achieving results as accurate as web search is, in practice, a very tough task in the field of enterprise search. Indeed, compared to public search engines, the amount of information and user activity is too small, and semantic relationships between documents are not tracked.

2) Variety of sources:
There is no single place to search, and end-users must pro-actively know what they are looking for by issuing the right keywords on the right platform. Indeed, the range of services through which information can be accessed is broad. These include workstations' hard drives, emails, shared folders and collaboration platforms like SharePoint, each providing information that can be retrieved using its respective search tools. Being able to search through the entire set of documents of the company and even beyond, combining it with public information, is a long-term goal of the Enterprise Search capability.

3) Unstructured data:
Another challenge comes from the nature of the documents: users have to dig through a variety of file formats, and most of the files are unstructured, which makes indexing them a non-trivial task. Some of the repositories in use contain unclassified data, and keeping track of tags and metadata is a task that has often been neglected.

4) Confidentiality:
In large-scale companies such as PMI, not all information can be shared within the firm, and hence a lot of caution has to be taken about what can be uploaded to the cloud. Does the user have access to this data? If not, what are his or her options for reaching the information needed? These are a few of the questions that have to be taken into account in the field of Enterprise Search. In particular, we have been asked to work with a financial department, and the main reason they were interested in building an application internally at PMI is to be able to search through highly confidential data. These kinds of documents can be seen by only a very restricted number of employees before they get released publicly.


1.2 Motivation
There are plenty of reasons to be interested in Enterprise Search. The first and main reason that comes to mind is productivity gains: even though it is very difficult to measure the gain and effectiveness of a search tool, it is obvious that several minutes saved per search are highly beneficial when applied to thousands of employees. To put this into perspective, if we estimate the number of cigarettes produced by one factory at 24 billion per year, this results in roughly 10 million cigarettes produced per hour, and if we estimate the net benefit of one cigarette at 70 cents, then one minute of down-time would result in a loss of 143,000 USD. According to a McKinsey report (https://www.mckinsey.com/industries/high-tech/our-insights/the-social-economy, Report MGI, July 2012), employees spend 1.8 hours every day, about 20% of their time, searching for and gathering information. In other words, four employees out of five are working while the last one is looking for answers. Thus, it is essential that companies minimize the time employees spend looking for information to complete their on-going activities. This pre-supposes that the retrieved information is accurate and relevant. Besides the importance of time, it is also crucial not to miss information and to be able to leverage the value of documents that demand a lot of resources to be created.

Instead of looking at Enterprise Search as a whole, i.e. horizontally, we focus on more precise, vertical problems related to search at PMI. Two departments expressed their interest in enhancing their search capabilities: Operations Engineering and a financial department. We briefly compare the two problems in Table 1, and we further describe these two use cases in Chapter 3.

Use case: Operations Technical Trainings vs. Finance Knowledge Base (KB)

Type of users. Operations: operators learning how to use machines and in the process of training. Finance: analysts who are experts in their domain and know the terminology involved.

Number of users. Operations: hundreds to thousands. Finance: dozens.

Current search engine. Operations: SharePoint search bar and hierarchical classification of documents. Finance: commercial tool aggregating public and private documents with metadata filters, but metadata is added manually.

Current pain points. Operations: answers are looked for manually; long technical manuals; prior manual work of categorizing documents but no automatic extraction of metadata; urgency to quickly find the relevant information (example of down-time). Finance: highly confidential data cannot be uploaded to the current commercial search tool's cloud; unable to search through emails.

Nature of information sources. Operations: mostly SharePoints with directories in different languages; no aggregation of sources. Finance: both public articles and news, and internal SharePoints with highly confidential data; aggregation of sources desired.

Type of queries issued. Operations: "how to" questions, such as how to perform a specific task, and factoid questions; the operator works on a precise machine and production line. Finance: requests for all communications, news and research done on a subject, an event, a time period, etc., ranked according to their relevance.

Example of question. Operations: "How many knives are there on the drums?" Finance: "What is the latest status of PMI's share?"

Expected result length. Operations: a few words or a page. Finance: a list of documents; a report about an event, a time, a company, a place.

Table 1 Comparison of two PMI use cases

The nature of the documents can vary depending on the department in which we operate. For example, the Trainings and Operations domain looks for answers concerning specific machines and how to use them in their respective task descriptions, such as "How can I remove the knife from the cutting unit?". In the finance department, by contrast, we focus on financial communications (e.g. via emails) and numerical data in spreadsheets, where there is a greater concentration of questions such as "What is the market share price of PMI in September 2014?". Whether it is full documents, a page or a short text span to be retrieved, the end-user needs the freedom to explore the data and look for the satisfying answer. Keeping these factors in mind, the company strives to create a tool flexible enough to cover all of these categories.

1.3 Problem Formulation


In the previous section, we presented the current situation of the two use cases that motivated the following research.

Problem statement: Can the current pain points be addressed using a combination of classical information retrieval and machine learning techniques? Can the solution be scaled/deployed to other corpora of documents? Is the resulting search interface more relevant and intuitive for end-users? How does a pure information retrieval approach compare with our combined proposal?

In this work, we attempt to answer these questions by building a search tool that is based on previous public and internal research. First, we must make sure that the existing home-grown system works on the provided data. Following this, we break down the problem and quantify the performance of each step of the system. Then, the most crucial element of the project is to extend the product in every sense and push it forward along the pipeline. In order to optimise the product, we followed an iterative approach, testing and measuring new solutions to progressively find the best option. It is imperative that we understand the data we are working with and provide critical, concrete analysis of the results we obtain. The tool must be able to scale well and handle a large amount of data. Figure 1, which shows how Enterprise Search at PMI is organized, highlights the flexibility of the company's independent "Dedicated Search" systems, as well as which parts of the system must be automated and which areas must be tailored to the specific requirements of each customer.


Figure 1 Graph of the modules describing the search at PMI

Finally, in order for the system to be utilized and further tested, we need to build a complete user interface and bring the concept to a level of maturity such that employees make concrete use of it. Depending on the use case, documents are more or less confidential; we design our system with this in mind, but overall confidentiality was not the main subject of the thesis. Indeed, in our case we are dealing with a specific corpus of documents, which amounts to a very vertical search compared to the horizontal search indexing multiple sources of documents as commonly seen in Enterprise Search. Last but not least, we address this problem as natural language search; yet the need for natural language queries has not been fully validated with end-users, and we assume, for the rest of the Thesis, that it is more effective than other approaches (like guided search, or bots that elicit intent and contextual information).


1.4 Thesis Overview


In this Section, we give a brief yet complete overview of our study, covering the "as-is" situation when joining the company, the core contributions of the aforementioned research, as well as the trials that led to the decisions taken.

This Thesis is organized along the information retrieval pipeline described in Figure 2, which is articulated in separate blocks: the Document Retriever and the Document Reader (the Post-Processing part is left for future work). For a complete description of the metrics used to assess the performance of the different components, see Section 5.2.

Figure 2 Full pipeline of the Thesis' system

1) Initial State

PMI has developed a home-grown application based on the work of the Facebook AI research team (Chen, et al., 2017), called DrQA, which we describe in Section 4.1. The original tool built by Facebook aims at answering factoid questions by reading Wikipedia articles. As stated on their repository, "DrQA treats Wikipedia as a generic collection of articles and does not rely on its internal graph structure. As a result, DrQA can be straightforwardly applied to any collection of documents". Thus the architecture had been applied to the Operations use case.

At this initial stage, a few shortcuts had been taken: only two documents were indexed, and no measurements had been made, as no test set existed for this purpose.

2) Contributions

a) Pagewise Crawler

In order to parse all documents, we created a crawler, as described in Section 4.2. We then indexed the totality of the documents related to the field and collected a test set (Section 5.1) in order to obtain the first results, presented in Sections 5.3.1 and 5.3.2. The first part being the retrieval of relevant documents, we initially used the original bigram TFIDF method, which retrieved the correct page in the top 5 in 76% of the cases. The fetched pages were then fed into the second part, which uses a pre-trained RNN model; we obtained an Exact Match (EM) score of 8% and an F1 score of 28% for the first answer predicted. We tested the reader itself by giving the document containing the answer as input, which gave us an upper bound of 42.05% for the F1 score.

b) ElasticSearch Integration

We replaced the classic TFIDF approach for document retrieval with the possibility to integrate an Elasticsearch server, which is shown in Section 4.3.3. As discussed in the results Section 5.3.6, we started with a P@1 of 40% and a P@5 of 74% for whole-document retrieval. P@n here is defined as the percentage of questions for which the answer segment appears in one of the top n documents.
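As an illustration of this metric, the following minimal sketch computes P@n over a set of evaluated questions; the data layout and field names are hypothetical, not the actual evaluation code used in the thesis:

```python
def precision_at_n(eval_results, n):
    """P@n: fraction of questions whose expected answer segment
    appears in one of the top-n retrieved documents."""
    hits = 0
    for question in eval_results:
        # question["retrieved"]: ranked list of document ids (best first)
        # question["relevant"]: ids of documents containing the answer segment
        if any(doc_id in question["relevant"] for doc_id in question["retrieved"][:n]):
            hits += 1
    return hits / len(eval_results)

# Hypothetical usage over a collected test set:
# print(precision_at_n(results, 1), precision_at_n(results, 5))
```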

After troubleshooting and tweaking the weights of n-grams and various other parameters, we managed to reach a P@1 of 70% and a P@5 of 94% for documents, and a P@1 of 48% and a P@5 of 86% for page retrieval. We then refined the measurement of DrQA's results by considering the predicted answer span (plus or minus 10 tokens) a success if it contains at least one of the expected keywords; again we call it P@n for the top n predictions. We obtained a P@1 of 58% and a P@5 of 86% for pages.

c) Word Embedding for the Reader

We improved DrQA's performance by changing the word embedding matrix (described in Section 4.4.1) from GloVe, with a 32.01% F1 and 10% EM score, to FastText, with a 39.49% F1 and 12% EM score, as the results in Section 5.3.7 suggest.


d) Parameter customization

We then analysed the way the RNN re-ranks the retrieved pages (Section 5.3.8): the first page fetched by the retriever contained the answer for 72% of the questions, whereas the page on which the RNN predicted its first answer contained the expected answer for only 54% of the questions. This means that we should not let the RNN rank answers across different pages; instead, we present a list of documents to the user, who decides which pages the machine has to read.

3) Trials

a) Granular TF-IDF

Seeing that the original approach indexed documents page by page to ease memory allocation, a first idea was to build an inverted index for both documents and pages. We first wanted to filter the top relevant manuals and then retrieve the pages among them, as described in Section 4.3.1. This slightly improved the performance of the retriever, by 2%; however, it did not change the final results.

b) Number of pages retrieved

By tweaking the number of documents retrieved, i.e. using 3, we improved the results of the double-TFIDF method to an EM score of 10% (+2% over the original) and an F1 of 31.12% (+3% over the original) (see Section 5.3.4).

c) LDA

We attempted to infer topics from documents using the unsupervised Latent Dirichlet Allocation (LDA) approach and to filter out every document that did not have the same topic as the query, as described in Section 4.3.2. However, it gave exactly the same results with or without LDA (Section 5.3.5).

d) Solr

We then decided to integrate a search engine based on Lucene in order to investigate the document retrieval part further. After a comparison between Solr and Elasticsearch (Section 4.3.3), we decided not to use Solr.

e) Sliding Window

One main concern was to see whether the generic RNN model was actually helping to give some insight into the document. We attempted a non-ML method based on query-document relevance ranking functions to replace the RNN. It consists of creating an inverted index with as much text as possible related to PMI and the specific field we expect to search in. We then sample the text with a sliding window that we shift along the whole page, and compute the top BM25 score. The full description can be found in Section 4.4.2. If we input a correct document into the model, we obtain a P@1 (and P@5 respectively) of 70% (and 90%) for BM25, compared to 94% (and 94%) for the modified RNN method. P@n here is defined as the percentage of questions for which the answer segment (30 tokens for the sliding window, and plus or minus 10 tokens around the predicted span for the RNN) contains at least one keyword of the expected answer. This suggests that using a pre-trained generic model performs better than a simple non-ML method.
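As a rough sketch of this trial, the fragment below scores fixed-size windows of a page against the query with the rank_bm25 package; the window size, stride and library are illustrative assumptions, and unlike the actual setup (Section 4.4.2) the IDF statistics here come from the windows themselves rather than from a larger PMI corpus:

```python
from rank_bm25 import BM25Okapi  # assumption: third-party BM25 implementation

def best_window(page_tokens, query_tokens, window=30, stride=10):
    """Slide a fixed-size window over a tokenized page and return the
    window with the highest BM25 score for the query."""
    windows = [page_tokens[i:i + window]
               for i in range(0, max(1, len(page_tokens) - window + 1), stride)]
    bm25 = BM25Okapi(windows)              # each window is treated as a tiny document
    scores = bm25.get_scores(query_tokens)
    best = max(range(len(windows)), key=lambda i: scores[i])
    return " ".join(windows[best]), scores[best]

# Hypothetical usage:
# segment, score = best_window(page_tokens, ["knife", "drums"])
```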

f) Multi-language

In Section 4.5, we explain how we handled the variety of languages after the stakeholders requested to extend the project to more languages.

g) API & User Interface

Finally, in Section 4.6, we propose APIs and a User Interface prototype for the actors of this
project.


Related Work
There has been tremendous research around Information Retrieval (IR) (Manning, et al., 2008) and Knowledge Base (KB) (Jurafsky & Martin, 2018) (Liu, et al., 2015) based Question Answering over the last decades. It has experienced such progress that IBM Watson was able to beat human performance (Ferrucci, et al., 2010), followed by other similar tasks, as the SQuAD 1.1 leaderboard (https://rajpurkar.github.io/SQuAD-explorer/) suggests (Rajpurkar, et al., 2016). This progress is driven by the growth of competitions and conferences such as the annual Text Retrieval Conference (TREC) competition and ACM SIGIR. TREC has built a variety of large test collections, including cross-language, speech and domain-specific collections. Since the beginning of TREC in 1992, retrieval effectiveness has approximately doubled (Ellen M. Voorhees, 2005). With the development of KBs such as WebQuestions (Berant, et al., 2013), Wikidata (Vrandečić & Krötzsch, 2014) or DBpedia (Auer, et al., 2007), combined with (semi-)automatic KB extraction (Fan, et al., 2012), KB-based QA, which aims at translating a free-text user query into a structured query (such as SPARQL or a lambda expression), has accelerated. Nevertheless, it has inherent limitations because it is not flexible (missing information, rigid schema, etc.) and a lot of manpower and field expertise is needed. Thus, more recent research focuses on answering questions from raw text; we call this practice machine reading comprehension (MRC), and it has been demonstrated to be particularly successful thanks to new deep learning architectures such as attention-based and augmented Recurrent Neural Networks (RNN) (Raison, et al., 2018) (Bahdanau, et al., 2014) (Weston, et al., 2014) (Graves, et al., 2014), convolutional networks (Yu, et al., 2018) and Stochastic Answer Networks (SAN) (Liu, et al., 2017).

This considerable progress on MRC is largely due to the large-scale datasets available, such as MCTest (Richardson, et al., 2013), QACNN/DailyMail (Hermann, et al., 2015), CBT (Hermann, et al., 2015), WikiQA (Yang, et al., 2015), bAbI (Weston, et al., 2014), SQuAD (Rajpurkar, et al., 2016) or QAngaroo (Welbl, et al., 2017), as well as initiatives to group them together like ParlAI (Miller, et al., 2017). However, these tasks expect that a supporting document is provided with the question, which cannot be assumed in our case. There have been new resources that combine queries with text retrieved from search engines, such as MSMARCO (Nguyen, et al., 2016), and we attempt to follow the same path using the open-source search engine Elasticsearch. We attempt to use a similar two-step architecture as proposed in (Chen, et al., 2017) to tackle the full pipeline problem. Our approach differs from the latter by integrating Elasticsearch and putting more focus on the document retrieval stage. In addition, further studies on the Reinforced Reader-Ranker (Wang, et al., 2017) have shown that FastText (Joulin, et al., 2016) improves the performance of the feature vectors in DrQA, which originally used GloVe (Pennington, et al., 2014).

Learning-to-rank models, which aim at re-ranking the top-k retrieved documents with a machine learning method, have also seen great progress (Turnbull, 2016). Even though our project would benefit greatly from this task, it relies on specifically labelled data, and it is hence left for future work, as building the dataset would take a significant amount of time. In the context of unsupervised data, Latent Dirichlet Allocation (LDA) is heavily cited in the literature and has been shown to be a promising method for IR (Wei & Croft, 2006). However, the reported efficiency of LDA is mixed, so we want to see whether it can bring context to the machine and add value to the already existing IR method.

One simple way to retrieve potentially relevant documents is the use of Boolean retrieval, which consists in formulating a query with AND, OR and NOT expressions. Given a query such as "knife AND drums", it returns the documents containing the terms mentioned, according to set theory. It is still a frequently used approach, and it is very intuitive as we know exactly how the result is selected. However, there is no way to rank results, and we might either get a large number of results or no result at all. One way to overcome these limitations is to vectorise documents and queries in a D-dimensional space and compute a similarity between the two. One query-document scoring function that has been widely used in the literature is Okapi BM25, which has proven successful (Robertson & Zaragoza, 2009). We use this method to build a sliding window that aims at doing IR at a fine-grained granularity of the text. (Moen, et al., 2015) outperformed Lucene on the task of domain-specific IR using sliding windows of word vectors to capture the domain specificity in the semantic model. In our case, instead of using it for IR, the idea is to benchmark the generic machine reading model against a non-ML method.

Our research aims to contribute by applying state-of-the-art IR to a specific, real-life industrial domain and by performing machine reading on such a collection of documents. The research was centered on adapting the said pipeline to the different use cases and automatically building replicas for other collections. We also explore a variety of metrics in order to assess the performance of the model, and with the recent progress in multi-language translation models (Yu, et al., 2018) we extend our research to French and Italian.


Context & Customer needs


As previously discussed, the aim of this Master Thesis is to eventually create a search engine that can be extended to a panel of use cases requiring a specific, dedicated search. The ideal goal is to have one search engine that can respond to a number of different customers, with a customizable architecture that is flexible in terms of the corpora it is based on. Therefore, we must first focus on individual customers, which itself leads to different systems. The final result is almost the same for all searches, that is, finding the correct level of information within a set of structured or unstructured documents. However, the research and the expected answer format are different. It is therefore necessary to identify and categorize the different needs of customers, which is the purpose of this Chapter. We first describe the Operation Technical Trainings use case in Section 3.1 and, in Section 3.2, the Finance Knowledge Base use case.

3.1 Operation Technical Trainings


The aim of the Operations department is to supervise the design and control of the production of goods and services. One example of the tasks PMI undertakes in this part of the management line concerns the Technical Trainings branch, whose role is to ensure that employees have the right tools and competency level to work in the factory. In order to illustrate and understand the workings of Technical Trainings, we must first understand the use of manuals: workers have a set of manuals at their disposal that support the understanding of concepts such as cigarette and filter making, cigarette packing and the tobacco processing equipment. The manuals provide a concise yet clear task description for every individual machine, for instance how these various machines should be used, how they should be cleaned and the duration of each machinery task.

Having worked closely with managers well-versed in the details of manufacturing, and making use of their extensive knowledge, we were able to understand the current situation and to capture their needs and expectations. The way employees currently search for information can be deemed slow and inefficient: the employee must go through a number of steps in order to find the relevant information, crawling inside the training programme's directory, finding the document that matches his or her query, and then finding the page with the relevant information using the classic "Ctrl-F" tool. This basic process could already be improved by a simple full-text keyword search; however, our aim is to go beyond this. We would like to make the search tool accessible to even the newest of employees. To this end, we would first like to create a search engine with natural human interaction; it is, however, imperative to note that we do not wish to build a chatbot. The idea here is to communicate with the search tool using natural language and to tweak the results with pre- and post-processing steps. The search tool thus works in a question-answer style, giving the user access to the potentially relevant documents. The Operations department provided a set of 50 Frequently Asked Questions with answers, which was very helpful first for a series of troubleshooting rounds and then to test, benchmark and measure the performance of our solutions. A full description of it can be found in Section 5.1.

3.2 Finance Knowledge Base


A Finance department is currently making use of a commercial tool, a search engine that retrieves a set of documents based on keyword queries. It returns results that directly point at relevant information or paragraphs inside a document. Keeping this concept in mind, we applied it to our prototype with DrQA: first retrieving the document, then reading it with NLP techniques.

However, there are two major differences from the previous use case. First, the already existing tool indexes documents that are available outside the realm of PMI; these include analyses, research and news in various domains from The Guardian, the Financial Times, JP Morgan or the FDA. The body of documents is also much smaller here: we are looking at roughly a thousand documents, to which around one hundred more would be added per year. Furthermore, we do not focus on answering natural language questions but rather on the formulation of queries with keywords and parameters such as dates, topics, locations, etc. This can be understood with the following example: the user requests "CAGNY 2017" related to the past 90 days under the topic Philip Morris. Moreover, the search tool requires the addition of metadata for each document added to the corpus.

We could now expect the following queries if they were asked in a more natural language way:

1) What are the terms to sell IQOS in the U.S.?


2) What is the latest status on Applications for IQOS in the U.S.?
3) What are the latest IQOS conversion rates?
4) What was the heated tobacco unit capacity for 2017? What have we projected for
2018?
5) What is the latest status on HeatSticks inventory levels in *country*?
6) What’s IQOS devices and related accessories’ contribution to our net revenues?
7) What is the current volume of illicit trade? Target / goal on illicit trade reduction?

Note that Question 3 ("What are the latest IQOS conversion rates?") represents the kind of question DrQA could successfully answer, under the constraint that the precise answer is contextualized in the surrounding text. In the finance KB, however, the answer can be located in a spreadsheet of numeric data, and it is inconvenient to retrieve data of that nature using our system. Another issue is the nature of the questions that may be input into the search engine: if the question is not content-specific and expects a set of pages as the answer, then the original system is not suitable. For instance, for Question 2 ("What is the latest status on Applications for IQOS in the U.S.?") we would like to report the documents or paragraphs that best match the query.

There is confidential information at PMI that cannot be shared with a third party such as the commercial tool currently in use. Thus, the financial department needs an internal tool that offers a service at the same level as, or better than, the commercial tool. We therefore proposed to solve the problem with a pure Document Retrieval system. Indeed, as opposed to the previous use case, a natural language query here just adds noise over a keyword search, whereas in the Operations use case, if a question starts with "how long", for example, we know that we are looking for a duration. From the Thesis point of view, we decided to set this use case aside, since pure document retrieval is the first stage of our full pipeline and is hence covered by the first use case.


Implementation
In this Chapter, we describe the trials conducted to find a solution matching the problem described in the previous chapters. We first describe the implementation of the state of the art as it stood at the beginning of the master thesis, and then we show our attempts to improve the existing prototype.

4.1 Initial State : DrQA


Document retrieval Question Answering (DrQA) is presented in (Chen, et al., 2017) as a system that answers ad-hoc questions, i.e. questions that do not fall into standard categories, meaning that there exist many ways to formulate a query, from a large collection of unstructured data. The system is designed to answer open-domain questions based on Wikipedia articles. Using Wikipedia as the source of knowledge is not the aim of this thesis, but it provides insight into how the system works and into the possible areas of use and improvement for the subject of our thesis. An overview is shown in Figure 3.

Figure 3 Full Pipeline of DrQA as described in (Chen et al 2017)


The pipeline can be split into two modules: the Document Retriever and the Document Reader.

4.1.1 Document Retriever


The former, as its name suggests, aims at retrieving a set of documents potentially containing the answer. A Term Frequency-Inverse Document Frequency (TFIDF) scheme is used: it is an intuitive yet simple (low-complexity), non-machine-learning method for ranking documents according to a query. This statistical measure evaluates the importance of a term contained in a document relative to the corpus of documents.

A detailed description of how the weights are computed is given below:

Firstly, the term frequency can be computed with a range of variants: binary (the term is contained in the document or not), raw (simply counting the number of occurrences of the term in the document) or normalized. In DrQA, a "0.5" max-normalization variant is used, defined as follows for a term t in a document d:

$TF_{t,d} = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$   (1)

Following this, the inverse document frequency allows us to capture the importance of one specific term t_i appearing in a document compared to the entire corpus. It is computed using the following formula:

$IDF_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|}$   (2)

where $|D|$ is the total number of documents in the corpus and $|\{d_j : t_i \in d_j\}|$ represents the number of documents in which the term t_i appears.

Finally, the weight is simply the product between the two measures:

$TFIDF_{i,j} = TF_{i,j} \cdot IDF_i$   (3)

Thus, the above results in a sparse matrix with rows corresponding to all possible terms in our corpus and one column per document. This statistical approach is efficient because the weights are computed only once, at indexing time. However, this also means that we need to recompute the TFIDF matrix every time we add a new document. In order to prevent the computations from being redundant and to ensure their efficiency, we create a so-called inverted index. It can be seen as a dictionary where the keys are all possible terms and the values are the documents in which each term appears. For example, the entry term1: {doc2, doc5, doc8} would mean that term1 appears in doc2, doc5 and doc8.
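To make the preceding definitions concrete, here is a minimal sketch of equations (1) to (3) and of the inverted index, written with plain Python dictionaries rather than the optimized sparse-matrix implementation actually used in DrQA:

```python
import math
from collections import Counter, defaultdict

def tf(term, doc_tokens):
    # Equation (1): 0.5 max-normalization of the raw term frequency
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, corpus):
    # Equation (2): log of total documents over documents containing the term
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tfidf(term, doc_tokens, corpus):
    # Equation (3): product of the two measures
    return tf(term, doc_tokens) * idf(term, corpus)

def inverted_index(corpus):
    # term -> set of ids of the documents in which the term appears
    index = defaultdict(set)
    for doc_id, doc_tokens in enumerate(corpus):
        for term in set(doc_tokens):
            index[term].add(doc_id)
    return index

corpus = [["black", "cat"], ["plc", "drums", "drums"], ["knife", "drums"]]
print(tfidf("drums", corpus[1], corpus))   # weight of "drums" in document 1
print(inverted_index(corpus)["drums"])     # {1, 2}
```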

We shall later see how we pre-process each source of information in order to obtain an exploitable list of documents and terms from which to create said inverted index. For now, we assume that we have collected a list of documents to search through, that they are readable (i.e. they consist of a decoded character sequence) and that they are ready for indexation.

We use the Stanford CoreNLP tokenizer for the following steps (there is no specific reason for this choice, as a number of other tokenizers, such as spaCy or NLTK, would perform equally well for this task). Firstly, we must tokenize the text in order to determine the vocabulary of terms. This task consists of breaking a text down into pieces called tokens. For example, the phrase "the black cat!" is tokenized into the following list: "the", "black", "cat", "!". Following this, we remove stopwords, punctuation, compound endings and special characters such as "\n" for line breaks. Stopwords are common terms that appear frequently in all documents, such as "the" or "a"; removing them is not a concern in our case, as there is no loss of information. Note that in this context we could add words such as "pmi" or "tobacco" to our stopword list, but we keep the general case for now. In addition, we delete possible question words such as "what", "how", etc.; this is a questionable decision but does not affect the validity of our analysis.
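A minimal sketch of this normalization step is shown below; it uses NLTK purely for illustration, whereas the system itself relies on the Stanford CoreNLP tokenizer, and the stopword and question-word lists are simplified assumptions:

```python
import string
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize    # requires nltk.download("punkt")

STOP = set(stopwords.words("english")) | {"what", "how", "where", "when", "why", "who"}

def normalize(text):
    """Tokenize, lowercase, and drop stopwords, punctuation and line breaks."""
    tokens = word_tokenize(text.replace("\n", " ").lower())
    return [t for t in tokens if t not in STOP and t not in string.punctuation]

print(normalize("How can I remove the knife from the cutting unit?"))
# -> ['remove', 'knife', 'cutting', 'unit']
```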

DrQA improves upon a standard TFIDF with the addition of bigrams: bigrams are included in the list of terms in an attempt to take local word order into account. For instance, if one wishes to index a document containing the sentence "the black cat", the list of terms would not only contain "the", "black" and "cat", but also "the_black" and "black_cat". This places greater emphasis on the order of the words in a phrase, and on the words of the phrase itself, which depends on its context. "Sue ate an alligator" and "an alligator ate Sue" are two sentences containing exactly the same words, which would be scored identically for the same query, yet they have completely different meanings. If the query is "Who ate an alligator?", both of the aforementioned phrases would hold equal weight with unigrams alone. Bigrams, on the other hand, give a higher match weight to the sentence "Sue ate an alligator", due to terms such as "ate_an" and "an_alligator", than to the phrase "an alligator ate Sue". For speed and RAM efficiency, the feature hashing of (Weinberger, et al., 2009) is used to map the bigrams to 2^24 bins with an unsigned murmur3 hash. Therefore, it is only required to store hashed values in a sparse matrix (scipy.sparse library in Python), which allows lookups in O(1).
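The hashed bigram features can be sketched as follows; the helper names are ours, but the unsigned murmur3 hash mapped to 2^24 bins mirrors what is described above (scikit-learn's murmurhash3_32 is used here as the hash function):

```python
from sklearn.utils import murmurhash3_32

NUM_BUCKETS = 2 ** 24

def ngrams(tokens, n=2):
    """Return unigrams plus bigrams, bigrams joined with '_'."""
    grams = list(tokens)
    grams += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def hash_feature(gram):
    # Unsigned murmur3 hash mapped to one of 2^24 bins
    return murmurhash3_32(gram, positive=True) % NUM_BUCKETS

print(ngrams(["sue", "ate", "an", "alligator"]))
# ['sue', 'ate', 'an', 'alligator', 'sue_ate', 'ate_an', 'an_alligator']
features = [hash_feature(g) for g in ngrams(["sue", "ate", "an", "alligator"])]
# Each value is a column index of the sparse term-document matrix.
```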

The top 5 closest documents are retrieved by computing the dot product between the query and the documents in the TFIDF-weighted (equation (3)) word vector space (note again that bigrams are included) and selecting the documents for which the product is the highest, i.e.:

$\operatorname{argmax}_{j} \sum_i TFIDF_{i,j} \cdot TFIDF_{i,q}$   (4)

This retrieval technique is the one originally used in (Chen, et al., 2017) and we initially used exactly the same settings.
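The scoring of equation (4) then reduces to a sparse matrix-vector product; a schematic sketch with numpy/scipy, where the matrix shapes and variable names are illustrative:

```python
import numpy as np

def top_k_documents(doc_matrix, query_vec, k=5):
    """doc_matrix: sparse (num_docs x num_bins) TFIDF matrix (scipy.sparse),
    query_vec:  sparse (1 x num_bins) TFIDF vector of the query.
    Returns the indices of the k documents maximizing equation (4)."""
    scores = doc_matrix.dot(query_vec.T).toarray().ravel()
    return np.argsort(-scores)[:k]

# Hypothetical usage:
# top_pages = top_k_documents(tfidf_matrix, vectorize(question), k=5)
```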

4.1.2 Document Reader


The document reader is a Recurrent Neural Network (RNN) that uses bi-directional Long Short-Term Memory (LSTM) units. In the general case of open-domain questions, the source of knowledge is Wikipedia and the input to the RNN is an encoded version of each paragraph of the top articles. In our case, however, we are mainly searching for answers in manuals of hundreds of pages; on the one hand it would be too memory-expensive to read the whole book, and on the other hand we frequently have answers that are either longer than a paragraph or that require reading the previous paragraph, which introduces the answer given in the next one. In short, if we only read paragraphs independently, we might often miss crucial context. For example, consider the following:

“Into the switch cabinet are located:

- The PLC

- The drums”

We need to read the first two paragraphs concatenated in order to answer the question "Where is the PLC located?" correctly.


Thus, for the two reasons above, and because PMI's manuals contain a great number of tables and pictures, we deal with entire pages rather than paragraphs as input.

The logic behind an RNN, compared to a standard neural network, is that the output of one cell is fed into the next cell's input. This implies that into each cell we feed an encoded word, which roughly corresponds to a "vectorized" word, together with the output of the previous cell. Using LSTM units instead of standard units (which only have one layer, the activation function) is a very good fit for question answering: they manage to memorize information not only from the local order (as with n-grams) but can also retain information coming earlier in the text, combined with a layer in the LSTM unit that decides whether or not to discard what has been said before. LSTMs have proven successful in similar tasks such as machine reading and comprehension (Hermann, et al., 2015) (Chen, et al., 2016), and in the same spirit of text understanding, in sentiment analysis (Li & Qian, 2016). A bidirectional RNN is the superposition of two RNNs with the concatenation of both outputs, as shown in Figure 4. This is useful to access information at any point in time from both the past and the future (Graves, 2012). We use the default settings for the RNN parameters: 128 hidden LSTM units, 3 hidden layers, a dropout rate of 0.4 and a learning rate of 0.1 (for Stochastic Gradient Descent optimization).

Figure 4 Graph of Bi-LSTM units articulated together in an RNN
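For reference, a stacked bidirectional LSTM encoder with these default hyper-parameters can be expressed in a few lines of PyTorch; this is a schematic sketch, not the exact DrQA implementation:

```python
import torch
import torch.nn as nn

class ParagraphEncoder(nn.Module):
    """3-layer bidirectional LSTM over a sequence of token vectors."""
    def __init__(self, input_dim, hidden=128, layers=3, dropout=0.4):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden, num_layers=layers,
                           bidirectional=True, dropout=dropout, batch_first=True)

    def forward(self, x):            # x: (batch, seq_len, input_dim)
        output, _ = self.rnn(x)      # (batch, seq_len, 2 * hidden): forward and backward states
        return output

encoder = ParagraphEncoder(input_dim=300)
p = encoder(torch.randn(1, 40, 300))   # e.g. a 40-token page with 300-d token vectors
```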

We use the same encoding as described in DrQA, which consists of a word embedding using Stanford GloVe (Pennington, et al., 2014), 300-dimensional and trained on 840B tokens of Web crawl data. A word embedding is a representation of a word as a vector in a d-dimensional space. Two properties hold after the embedding model has been trained: first, words that have similar meanings are close to each other in terms of distance (L2 or Euclidean distance), and second, very common words such as indefinite and definite articles like "a" or "the" are close to the origin. An unknown word is mapped to the origin; as we have a number of specific terms designating machines which are unknown to GloVe, it would be valuable to fine-tune the embedding matrix on our corpus, but the benefits would be too low to yield significant result improvements. Moreover, we managed to assemble around 15 million tokens in total, whereas around 1 billion would be necessary to re-train a whole embedding matrix.

In addition to the word embedding, we add a few features to the vector: whether
the word is an exact match with a question word, the part-of-speech (i.e. whether
it is a noun, verb, etc.), the named entity recognition (i.e. the category of a word such as
Person, Company, Country, etc.) and the normalized term frequency. Last but not
least, we add an aligned question embedding computed as follows:

$\sum_{j} a_{i,j}\, E(q_j)$    (5)

where the attention score $a_{i,j}$ aims at measuring how close in meaning a paragraph word is to
each question word, and $E(q_j)$ is the encoding (embedding with word parameters) of the
word $q_j$. Following DrQA, the attention score can be computed as:

$a_{i,j} = \dfrac{\exp\big(\alpha(E(p_i)) \cdot \alpha(E(q_j))\big)}{\sum_{j'} \exp\big(\alpha(E(p_i)) \cdot \alpha(E(q_{j'}))\big)}$    (6)

where $\alpha(x)$ is the activation function defined as the ReLU function: $\max(0, x)$ for all $x$.

We also have another RNN (bi-LSTM) for the query, in which we only feed the em-
bedded question words, and we compute the weighted average of each hidden cell output
such that:

$q = \sum_{j} b_j\, q_j$    (7)

with $b_j$ capturing the importance of each question word; it can be computed as fol-
lows, with $w$ being a parameter (a weight vector) to learn:

$b_j = \dfrac{\exp(w \cdot q_j)}{\sum_{j'} \exp(w \cdot q_{j'})}$    (8)

We finally have to aggregate the results in order to find the most probable answer.
We compute for all possible starting and ending words:

$P_{start}(i) \propto \exp(p_i\, W_s\, q)$

$P_{end}(i') \propto \exp(p_{i'}\, W_e\, q)$    (9)

We then choose the span $(i, i')$ that maximizes $P_{start}(i) \times P_{end}(i')$, with $W_s$ and $W_e$ being weight matrices to be
learned.

Then we select the answer span with the highest value among all selected pag-
es. Note that there is a limit on the maximum number of tokens to be predicted (15 by default), but
for our case we set it to 90 since we sometimes have much longer answers than just a few words.
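A hedged sketch of the span scoring of Equation (9), with bilinear start/end scores and selection of the best (start, end) pair under the maximum span length; all dimensions and tensors are illustrative assumptions, not the thesis code.

```python
# Hedged sketch of Eq. (9): bilinear start/end scores and best-span selection.
import torch

hidden = 256                       # 2 * 128 from the bi-LSTM above
seq_len, max_span = 40, 90

p = torch.randn(seq_len, hidden)   # paragraph token encodings
q = torch.randn(hidden)            # weighted question encoding (Eq. 7)
W_s = torch.randn(hidden, hidden)  # learned start weight matrix
W_e = torch.randn(hidden, hidden)  # learned end weight matrix

p_start = torch.softmax(p @ W_s @ q, dim=0)  # P_start(i)
p_end = torch.softmax(p @ W_e @ q, dim=0)    # P_end(i')

# Best span (i, i') with i <= i' <= i + max_span
scores = p_start.unsqueeze(1) * p_end.unsqueeze(0)
scores = torch.triu(scores) - torch.triu(scores, diagonal=max_span + 1)
start, end = divmod(int(scores.argmax()), seq_len)
print(start, end)
```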

The model that we use for the document reader is trained jointly on the SQuAD dataset
and various distant supervision training sets (called multitask). The Stanford Ques-
tion Answering Dataset (SQuAD) (Rajpurkar, et al., 2016) is a set of 100k questions, an-
swers and the paragraphs containing the answers, built on a set of Wikipedia articles. It is currently
the largest general-purpose dataset for text comprehension and hence also for Ques-
tion Answering; 87k Q&As can be used to train a model. Crowdworkers had to write a
question and an answer given a Wikipedia paragraph. We use the model that has
been trained jointly with several additional datasets such as WebQuestions (Berant, et al., 2013) and WikiMovies
(Miller, et al., 2016). For the three added datasets, there is no paragraph associated with
each question-answer pair, hence distant supervision must be used. Distant supervision
allows a paragraph related to a pair of question and answer to be extracted automatically
(Mintz, et al., 2009). As an initial setup for our system, we use the pretrained model as it
is very generic. However, we expect it to be less powerful given that we have a quite spe-
cific task.

This is the detailed description of DrQA as we initially implemented it for the enter-
prise use case. Given its structure, we could easily break down the pipeline and improve the
most critical parts.

4.2 Pre Processing


In the context of Enterprise Search, we recurrently listen to different sources, but to
simply illustrate the problem, let us say we are accessing only one SharePoint folder. This
contains a large range of different document extensions. We need to design a crawler that
scans a directory; we define this task as ingestion. Given that it is the very first step of the pro-
cess, it is important to ensure the use of a good quality parser. The existing parser,
which used the Apache POI library in Java, had created the database from a few extensions:
PDF, Word DOCX and PowerPoint PPTX files. We decided to rewrite the crawler with the Java
library called Apache Tika for the following reasons:

 Extend the parsable extensions to doc, docx, xls, xlsx, ppt, pptx, txt, msg, pdf, jpg,
png, zip, aspx. Note that we still need to make sure to parse documents page by
page without losing the page layout.

 Metadata that is not directly shown to the user, such as the document name,
the extension, the author, the date of creation or the file size, is information that
could be useful when searching for a document and must also be extracted along-
side the sequence of words.

 There are PDF files that are formatted as images, and a typical parser would not be
able to read their content. We need to convert the text contained in images in-
to the corresponding sequence of words using an Optical Character Recognition
(OCR) engine.

 We aim to detect the language of the documents in order to filter them in the first
place.


As shown on Figure 5, we first recursively scan the directory and select documents with
the matching extensions (case-insensitively). Note that we not only parse Microsoft Word Documents
(.docx) but also Microsoft 97–2003 Documents (.doc), which have a completely different
format.

Figure 5 Crawling process description

Apache Tika faces no problems with regards to the extraction of text from the aforemen-
tioned extensions. However, it does face issues in relation to the splitting of pages.
Recalling the earlier discussion about memory allocation, since we cannot feed an entire manual
into the RNN, the alternative is to select a shorter span of text, i.e. a page. In the case of PDF or PPT the
pagination is part of the XML format of the document, which makes the task fairly simple: for all pages or slides
we extract content and metadata. However, in the case of documents such as DOC and
DOCX, the pagination is created whilst the file is being opened (the layout is computed from the font
style, size, etc.), which makes it much more difficult for the parser to extract the content
of each page independently. A possible solution is to convert the file to a PDF after sav-
ing the metadata and then execute content extraction. We convert the DOCX files using the
Apache POI library, and for DOC we aim to convert it into DOCX using docx4j3. However, if
this fails (for example, due to image conversion) we must “manually” read every paragraph
as a String and concatenate them in order to create the PDF. The main issue, however, is
not resolved when we use the latter method, as we sometimes risk losing the page layout
of the DOC, which hinders the presentation of results.

3 https://www.docx4java.org/trac/docx4j

Finally, we use tesseract OCR v3.054 before saving each page’s content and
metadata in JSON files. UTF-8 is used for the text encoding, which is the most common
today5. Lastly, we compressed the contents by removing some Unicode characters and character
escapes which were causing trouble for the JSON loader in the RNN.
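The following is a minimal Python sketch of the same per-page extraction idea (text layer first, OCR fallback, JSON output); it is not the Java/Apache Tika crawler described above, and the library choices (pypdf, pdf2image, pytesseract) and file names are assumptions.

```python
# Hedged sketch of per-page ingestion with OCR fallback and JSON output.
import json
from pathlib import Path

from pypdf import PdfReader              # text-layer extraction
from pdf2image import convert_from_path  # render pages for OCR
import pytesseract                        # Tesseract OCR binding

def ingest_pdf(path, out_dir="index"):
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    reader = PdfReader(path)
    meta = {"name": Path(path).name, "extension": ".pdf",
            "author": (reader.metadata or {}).get("/Author")}
    for i, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if len(text.strip()) < 20:        # image-only page: fall back to OCR
            image = convert_from_path(path, first_page=i, last_page=i)[0]
            text = pytesseract.image_to_string(image)
        record = {"page": i, "content": text, **meta}
        out = Path(out_dir) / f"{Path(path).stem}_p{i}.json"
        out.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")

# ingest_pdf("Operator course Protos 70.pdf")  # illustrative file name
```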

4.3 Document Retrieval


How can a machine answer a question whose answer is not even contained in the text it has
read? Before working on the next steps, we needed to make sure that the documents retrieved
contained the expected answer. This was actually a bigger concern here than in a common
Question Answering task, because for search purposes we want the ability to ex-
plore documents efficiently before answering a very specific question.

4.3.1 Granular TF-IDFs


Intuitively, we would want to retrieve a whole document and then read it to find the
answer related to our query; however, such an approach is not feasible in our case, as we
have seen above. We attempted to bypass this limitation by creating an additional layer of
document retrieval, as illustrated in Figure 6.

4 https://github.com/tesseract-ocr/tesseract
5 https://w3techs.com/technologies/overview/character_encoding/all

Figure 6 Pipeline of our granular TFIDF method

As depicted, we first fetch a set of relevant documents and then a set of pages, using
two different inverted indexes. Note that we use exactly the same TF-IDF customizations as
for the page-only retrieval in DrQA.

4.3.2 LDA
Initially, the only information given by the user is a query. From this small piece of
information we need to infer an answer that is assumed to be inside a directory.
The task is tedious for a machine, as it must find the answer in the entire corpus, which
introduces a lot of noise; hence we aim to reduce the text to be read as much as possible.

A common approach to Question Answering is to analyse the structure and the se-
mantics of the question in order to formulate a structured language-based query (Liu, et al.,
2015). This task makes use of a Knowledge Base, in contrast to our case, where we need to
find the answer in text. However, one common goal is to understand what the user is
looking for. In order to do this, the machine must comprehend the domain and the context
of the industry and the user.


One way insight can be provided to the machine is to create a structured hierarchy
of terms describing the different groups of the domain: a taxonomy. In order to understand
this further, we will take the Operator trainings into consideration. Documents are divid-
ed into four main categories: cigarette making equipment, packaging equipment, filter
making equipment and tobacco processing. Each machine has a respective manual that
instructs the user about the product in the production line; these include: Protos, M8, GD
121, MK9. Each machine has different versions and languages available, allowing for eas-
ier access as the documents are already classified given the path hierarchy, yet the re-
quired taxonomy at lower levels is missing. It is essential that we have access to
knowledge with regards to the materials and tools within the machines themselves and the technical
and mechanical jargon used to describe the machine; all of this could be done manually.
However, we voluntarily did not want to create this taxonomy manually, since we did not
have the expertise to know the exact terminology and, more importantly, we wanted to be
able to automatically replicate the tool on another corpus in a completely different context.

We attempted to apply a fully automated solution, which consists of building topics
using a probabilistic model called Latent Dirichlet Allocation (LDA). The only input provided
is the number K of topics we want the model to learn and the documents containing raw
text. We initially attribute a topic to each word according to a Dirichlet distribution $\mathrm{Dir}(\alpha)$
based on K topics and a concentration parameter $\alpha$. The probability density function of the
Dirichlet distribution can be expressed as:

$f(x_1, \dots, x_{K-1}; \alpha) = \dfrac{\Gamma(\alpha K)}{\Gamma(\alpha)^K} \prod_{i=1}^{K} x_i^{\alpha - 1}$    (10)

with $\{x_1, \dots, x_{K-1}\}$ being the weights of each topic and $\Gamma(x) = (x-1)!$ (for positive integers) being the
gamma function.

This generates a random topic model distribution that we attempt to enhance
through a learning process. For all terms w in each document d we compute the follow-
ing:

1. P(topic t | document d): probability that the document d is assigned to the topic t.

2. P(word w | topic t): probability that the topic t is assigned the word w.

We then choose the topic by computing the product of the two probabilities,
which corresponds to the probability that topic t generates word w in document d.

This is intuitively understood as taking one word and updating its topic to the one
most likely to be generated by the document, while “fixing” the topic assignments of all other terms
(which we assume to be correct).

An optimistic guess is that we could automatically detect the four groups described
previously and hence obtain a weighted average of terms with the highest weights on the terms
that describe said topics the most. In this case, as we know how many topics must be de-
rived, LDA is more suitable than the Hierarchical Dirichlet Process (HDP), a variant of
LDA that does not require the number of topics as input. If we successfully inferred the
topic of each document, we could use the same model to analyse the topic of the query
and filter out any document not related to this topic.
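To make the procedure concrete, the following is a minimal sketch, assuming gensim as the LDA implementation and a toy corpus; it is not the thesis code and the example tokens are illustrative.

```python
# Hedged sketch: LDA topic model used to infer the topic of a query.
from gensim import corpora, models

docs = [
    ["cigarette", "maker", "drum", "knife", "adjustment"],
    ["packer", "carton", "glue", "jet", "tank"],
    ["filter", "rod", "tow", "plasticizer"],
    ["tobacco", "drying", "cut", "blend"],
]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# K = 4 topics, matching the four equipment categories described above.
lda = models.LdaModel(bow, num_topics=4, id2word=dictionary,
                      alpha="symmetric", passes=10, random_state=0)

query = dictionary.doc2bow("how to adjust the drum knife".split())
print(lda.get_document_topics(query))  # topic distribution of the query
print(lda.print_topics(num_words=5))   # weighted terms per topic
```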

4.3.3 ElasticSearch
Before answering questions, we want the user to be able to explore the indexed data,
and the search to be as relevant as possible. In order to customize an efficient document
ranker, we chose to make use of a database management system. Among many of them,
we compared two of the three most popular ones according to the DB-Engines ranking6,
namely ElasticSearch (ES) and Solr. Both are open source, based on Lucene, with an
HTTP layer to interact with the server; they also have rich frameworks, ecosystems and
support. We integrated and tested them both for document retrieval and, given the simi-
larity of the results and the better scalability and popularity of ES (which is important for a po-
tential industrialization of the tool), we decided to make use of ES.

ElasticSearch is a very powerful tool to manage and search through a database; it
is easy to customize the tokenizers, the analysers and the filters. We started our analysis
with a crawler that indexes documents entirely7 and then used the ingestion we designed
ourselves, as proposed in the section above. For both techniques we used an OCR with
metadata extraction. We initially started with a standard analyser and progressively cus-
tomized the parameters: augmenting the stopword list, doing some troubleshooting (ana-
lysing the text the same way at indexing and searching time, fixing the dataset, etc.), lowercasing,
stemming, bigram and trigram analysis, and adding a list of synonyms. On Figure 7, we illus-
trate the pipeline for document retrieval with the integration of ES.

6 https://db-engines.com/en/ranking/search+engine
7 https://github.com/dadoonet/fscrawler

Figure 7 Graph of the Ingestion, index and search process in ES
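For illustration, the following is a hedged sketch of the kind of index settings and query described here, sent through Elasticsearch's REST API; the field names, the synonym list and the index name are illustrative and not the actual configuration used at PMI.

```python
# Hedged sketch: custom analyzers (stopwords, synonyms, English stemming,
# 2-/3-gram shingles) and a boosted multi-field query, via the ES REST API.
import requests

ES = "http://localhost:9200"
index_body = {
    "settings": {"analysis": {
        "filter": {
            "my_synonyms": {"type": "synonym",
                            "synonyms": ["GSR => gas to solids ratio"]},
            "my_stemmer": {"type": "stemmer", "language": "english"},
            "bigrams": {"type": "shingle", "min_shingle_size": 2,
                        "max_shingle_size": 2},
            "trigrams": {"type": "shingle", "min_shingle_size": 3,
                         "max_shingle_size": 3}},
        "analyzer": {
            "uni": {"tokenizer": "standard",
                    "filter": ["lowercase", "stop", "my_synonyms", "my_stemmer"]},
            "bi": {"tokenizer": "standard", "filter": ["lowercase", "bigrams"]},
            "tri": {"tokenizer": "standard", "filter": ["lowercase", "trigrams"]}}}},
    "mappings": {"properties": {"content": {
        "type": "text", "analyzer": "uni",
        "fields": {"bigram": {"type": "text", "analyzer": "bi"},
                   "trigram": {"type": "text", "analyzer": "tri"}}}}},
}
requests.put(f"{ES}/manual-pages", json=index_body)

# Query all three sub-fields, boosting bi- and tri-gram matches
# (the "weight tweaks" evaluated in the results chapter).
query = {"query": {"multi_match": {
    "query": "When should the pump cleaning be performed?",
    "fields": ["content", "content.bigram^2", "content.trigram^2"]}}}
print(requests.post(f"{ES}/manual-pages/_search", json=query).json())
```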

The scoring function Okapi BM25 is the default method (since 2015) for Lucene to
score text and hence the default one used in Elasticsearch. As presented in section 11.4.3
of (Manning, et al., 2008), Okapi BM25 is a TF-IDF-like function that pays attention to term frequen-
cy and document length. One of its variants is computed as follows.

$\mathrm{Score}(q, d) = \sum_{t \in q} \log\!\left[\dfrac{N}{DF_t}\right] \cdot \dfrac{(k_1 + 1)\, TF_{t,d}}{k_1\left((1-b) + b \times \dfrac{L_d}{L_{ave}}\right) + TF_{t,d}}$    (11)

with $k_1$ and $b$ being free parameters, empirically set to $k_1 = 1.6$ and $b = 0.75$; $k_1$ cali-
brates the document term frequency scaling and $b$ determines the scaling by document
length. $N$ is the number of documents in the corpus, $DF_t$ is the number of documents containing term t, $L_d$ is the length of document d (in to-
kens) and $L_{ave}$ is the average document length in the corpus.
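The variant in Equation (11) can be transcribed directly; the following minimal sketch (with a toy corpus as an assumption) is for illustration only.

```python
# Direct transcription of the BM25 variant in Eq. (11).
import math

def bm25(query_terms, doc_terms, corpus, k1=1.6, b=0.75):
    """Score one document (token list) against a query, given the corpus
    (list of token lists) for document frequencies and average length."""
    N = len(corpus)
    L_ave = sum(len(d) for d in corpus) / N
    L_d = len(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue
        tf = doc_terms.count(t)
        idf = math.log(N / df)
        score += idf * (k1 + 1) * tf / (k1 * ((1 - b) + b * L_d / L_ave) + tf)
    return score

corpus = [["pump", "cleaning", "procedure"], ["glue", "jet", "tank", "capacity"]]
print(bm25(["pump", "cleaning"], corpus[0], corpus))
```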

4.4 Document Reader

4.4.1 Word Embedding


As briefly explained in the first section of this chapter, a word embedding is a
vector representation of a word. It is commonly learned by training log-bilinear models based on either
skip-gram or continuous bag-of-words (CBOW). Two main examples that use such an
implementation are word2vec (Mikolov, et al., 2013) and fastText (Joulin, et al., 2016). The
process of going from word embeddings to document distances is illustrated in Figure 8, taken from (Kusner, et al., 2015).

Figure 8 From word embeddings to document distances, taken from (Kusner, et al., 2015)

Word representation, owing to its success, has many applications and guarantees high scala-
bility with massive amounts of data. As shown by (Mikolov, et al., 2017), fastText performs
better than GloVe on the SQuAD dataset with the original DrQA model, so we try to improve
our system by incorporating fastText.
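A hedged sketch of swapping the embedding source: loading pre-trained fastText vectors and mapping tokens to 300-dimensional vectors, with unknown machine names sent to the origin as described earlier; the file name and the use of gensim as the loader are assumptions.

```python
# Hedged sketch: load pre-trained fastText vectors and embed tokens,
# sending out-of-vocabulary terms (e.g. machine names) to the origin.
import numpy as np
from gensim.models import KeyedVectors

# Loading the full 2M x 300 release is slow; shown here for illustration.
vectors = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec", binary=False)

def embed(tokens, dim=300):
    out = np.zeros((len(tokens), dim), dtype=np.float32)
    for i, tok in enumerate(tokens):
        if tok in vectors:            # OOV terms such as "GD121P" stay at 0
            out[i] = vectors[tok]
    return out

print(embed(["adjust", "the", "GD121P", "garniture", "clappers"]).shape)
```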

4.4.2 Sliding window


One of the concerns we had after implementing all of the ideas above was
whether we could apply a generic ML model to answer domain-specific questions and whether it
would perform well. Indeed, we had focused our research on the retrieval part, where we
worked on the granularity of the context retrieved.

For the whole pipeline, even if we aim at answering questions, we could let the user
read one sentence or a small segment where we believe the answer is most likely to be.
This can be done following the same model as for document retrieval. The approach con-
sists of creating a sliding window which shifts along the text of a page. It can be seen the
same way as the preprocessing of a speech signal in automatic speech processing,
where here the samples are the tokens. We compute a similarity score between the query
and the selected segment with so-called Query-Document Relevance (QDR) ranking func-
tions. An overview of the process is shown on Figure 9.


Figure 9 Graph of the sliding window process

As a first step we needed to create our model, which requires as much text relevant
to the specific field as possible. We used an open source implementation of QDR (https://github.com/matt-peters/qdr/) in order to
do the training which, like word embedding, is unsupervised, meaning we only need unla-
belled raw text. It supports TFIDF, Okapi BM25 and a Language Model as scoring functions.
Thus, what we call the model is simply the following, supporting the rank-
ing functions:

 A unigram language model: token -> total count in corpus

 Corpus document counts: token -> total documents in corpus

 Total number of documents in the corpus (for TFIDF)

 Average document length: total words / total documents (for BM25)

After the model is trained, we can compute the ranks, and we arbitrarily decided to use
BM25 as the ranking function to select the top segments.

Note that the BM25 variant described in Equation (11) is the one used for our QDR
function.
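A minimal sketch of the sliding-window re-ranking idea (not the qdr library used here): shift a fixed-size token window over a page and rank the windows against the query with BM25, here via the rank_bm25 package; the collection statistics come from the windows themselves rather than from a large training corpus, which is an assumption of this sketch.

```python
# Hedged sketch of the sliding-window segment ranking with BM25.
from rank_bm25 import BM25Okapi

def top_segments(page_tokens, query_tokens, window=30, stride=10, k=5):
    windows = [page_tokens[i:i + window]
               for i in range(0, max(1, len(page_tokens) - window + 1), stride)]
    scores = BM25Okapi(windows).get_scores(query_tokens)
    ranked = sorted(zip(scores, windows), key=lambda s: -s[0])[:k]
    return [" ".join(w) for _, w in ranked]

page = ("into the switch cabinet are located the plc and the drums "
        "the glue jet tank has a capacity of approx 13 litres").split()
print(top_segments(page, "where are located the plc".split(), window=10)[0])
```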

4.5 Multilanguage
Even though we did all the experiments with documents written in English, an
English-only tool at PMI is pointless. Indeed, people working in production speak diverse lan-
guages and, most of the time, they are not able to speak English, so the manuals are trans-
lated by a specialist into their mother tongue. We indexed two additional main sets of non-English
documents: one in French and one in Italian. These are related to the Operation
technical training domain previously described, and the nature of the documents was not dif-
ferent from before.

In the first step, the solution is fairly simple: since we are making use of Elas-
ticSearch, we configured each index accordingly. Note that the main change lay in
the list of stopwords. We chose our stopword lists from ranks.nl, an online analysis
Web site which checks for proper usage of keywords. The list is complete but not too long
and, most importantly, it includes question words.

The second step is more problematic since we are using a generic model trained on
English data, and hence it would fail to answer non-English questions. One solution that
we implemented is to send the text to the translate.google.com servers using the Goog-
letrans API in order to obtain the English translation, and we do the same at the endpoint to
translate the answer as well. Another solution is to bypass the RNN with the sliding window
whose scorer is trained on the given corpus.
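A hedged sketch of this translate-read-translate-back path using the unofficial googletrans client (which can be unstable, as it relies on an undocumented Google endpoint); answer_question stands for the English QA pipeline and is a placeholder, not an actual function of our code base.

```python
# Hedged sketch: translate the query to English, answer, translate back.
from googletrans import Translator

translator = Translator()

def answer_non_english(question, answer_question):
    detected = translator.detect(question)
    if detected.lang == "en":
        return answer_question(question)
    question_en = translator.translate(question, dest="en").text
    answer_en = answer_question(question_en)
    return translator.translate(answer_en, dest=detected.lang).text

# Example with a dummy English pipeline:
print(answer_non_english("Où se trouve le PLC ?",
                         lambda q: "Into the switch cabinet"))
```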

In the final pipeline, one can choose whether to use DrQA’s model or the sliding
window with the API but, as shown on Figure 10, if documents are in English it uses DrQA;
otherwise it finds the right segment using the sliding window trained a priori.

Figure 10 Selection process of the method in case of multi language

The multi-language feature has been created for the sake of functionality, but no
research has been done on it for our case. However, today’s attention-based neural ma-
chine translation (NMT) (Bahdanau, et al., 2014) models have demonstrated excellent
quality (Wu, et al., 2016), which would not add too much noise for the reader. On the other
hand, even though there are many words which have a particular meaning in PMI’s indus-
try, one possibility is to train our own model with PMI-related documents. Indeed, the ma-
jority of the non-English manuals come from the translation of an expert from the source
language, English, and hence they constitute a training set. Finally, if the non-English docu-
ments are a subset of the English documents, we could instead create an automatic map-
ping between every language variant of a document and only translate the query. Under
this condition, this method would considerably reduce the noise created by the translation.

4.6 API and User Interface


In order to go further with the project, we had to collect practical logs, which can be
done only if the system has reached a level of maturity such that employees see an actual
benefit in using our tool. Even though the User Interface (UI) was not the main focus of the
project, we still explored the way we wanted to show results, and for this task we used var-
ious tools.

We created two main RESTful APIs with the use of Flask-RESTPlus9. One has been
created to call the full pipeline, with the query as input and the predicted answer with its
context as output. What we call the context is the text in which the model has found the
answer, with the predicted segment highlighted; one can also choose to retrieve a win-
dow of l tokens around the prediction. Note that all these parameters can be set in the API,
and the mechanism is depicted on Figure 11. The second API was created for the ma-
chine reading stage, and one can choose to use the RNN or the sliding window method. It
handles English, French and Italian, but it has been optimized for English and it has not
been tested for other languages. For the top-k document retrieval part, since we are using
Elasticsearch, a complete RESTful API is already available.
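For illustration, a minimal sketch of the kind of Flask-RESTPlus endpoint described above; run_pipeline and the parameter names are placeholders and do not reflect the thesis’ actual API.

```python
# Hedged sketch of a Flask-RESTPlus endpoint wrapping the QA pipeline.
from flask import Flask
from flask_restplus import Api, Resource, reqparse

app = Flask(__name__)
api = Api(app, title="NL Search Engine API")

parser = reqparse.RequestParser()
parser.add_argument("query", type=str, required=True)
parser.add_argument("context_window", type=int, default=10)  # l tokens around span

def run_pipeline(query, context_window):
    # Placeholder: retrieve top pages from Elasticsearch, read them with the
    # RNN (or sliding window) and return the predicted span plus its context.
    return {"answer": "approx. 13 litres", "context": "...", "page": 51}

@api.route("/ask")
class Ask(Resource):
    def get(self):
        args = parser.parse_args()
        return run_pipeline(args["query"], args["context_window"])

if __name__ == "__main__":
    app.run(port=5000)
```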

For the UI, we made use of ReactiveSearch10, which is an open source React and
React Native UI components library for Elasticsearch. Note that there were a variety of
other similar projects to fulfill this task, such as SearchKit, DejaVu or InstantSearch. Howe-
ver, ReactiveSearch was a better match to do full-text search, with an active community,
development and maintenance. We created our own React components in order to
add the machine reading feature as well as the document preview. The graph on Figure
11 also describes the pipeline of the application, since it follows the same steps.

9 https://github.com/noirbizarre/flask-restplus/
10 https://github.com/appbaseio/reactivesearch

Figure 11 Pipeline of both API and UI

The application can be decomposed into four main components and a screenshot is
shown on Figure 12:

 The search bar to formulate the query, on top of the app; note that it suggests con-
tent (according to document names) as the user types the query.

 The filters on the left, which could easily be extended depending on the metadata of
the docs. It currently has a nested list to filter documents in the hierarchy. For files
such as in the Operations case, this is a very important feature which can lead to a
massive gain of time, but it supposes that files are structured in categories, which is
not always the case. Other filters include the extension, the creation date, the lan-
guage and the authors.

 The result list with the retrieval score and metadata. One can click « Show » to
unwrap the text of the page and click « Read » in order to run the prediction for the
best span in the text. It gives a set of 5 predictions and, as the user moves the mouse
over one of the predictions, it highlights the corresponding segment in the text. One
can also click the path of the file to open the document at the specific page.

 The document preview which shows the corresponding document at the right page.

Most of the choices made for the interface have been motivated by discussions with
the stakeholders. The idea is to have the possibility to proceed in both a top-down and a
bottom-up approach: either we can start by filtering out documents and become more and more
precise in the query formulation, or we can have a precise question in mind and direct-
ly call the reader from the results.

Figure 12 Screenshot of the UI

Chapter 5 Experiment Results

In this chapter, we first present the test set we collected in Section 5.1; we then
propose the metrics (Section 5.2) to assess the performance of each stage and of the modifica-
tions of our system, as described in Section 5.3.

5.1 Dataset
Initially, we did not have any example questions with which to see how each step of
the system was performing. We hence asked the most interested stakeholders, being
Operations, for sample question and answer pairs. They originally gave us a set of
10 Q&As, which was barely enough to have a general overview of DrQA’s performance, so
they created a set of 50 Q&As for us. The test set consists of questions asked in a natural
language manner, each paired with an answer that is an exact match of a sequence of words
contained in the indexed corpus. The first task was to clean the dataset by correcting the
errors of formulation and misspelled words; we also paired every Q&A with the corre-
sponding document and its page. Note that for each pair we initially suppose that only one
document is relevant and contains the exact answer, which makes the measurements
much easier; however, it is not a correct assumption, since there are duplicates of manuals
(Word and PDF, for instance).

{"question": "What is the capacity of the glue jet tank for a Protos 70 ?", "answer": ["ap-
prox. 13 litres"], "document": "Operator course Protos 70.pdf", "page": "51"}

{"question": "Where are located the PLC ?", "answer": ["Into switch cabinet"], "document":
"Operator course Protos 70.pdf", "page": "65"}

{"question": "How to check the strength of the cigarette seam ?", "answer": ["Select at ran-
dom one cigarette and hold the cigarette horizontally at the filter with your left hand in such
a way that the tobacco ends is facing yourself. Then turn the rod 45° to the right and 45° to
the left and check if the seam has opened due to the torsion. If that is the case, check all

remaining cigarettes of the sample."], "document": "Course_manual_maker_Update_09_Eng.pdf", "page": "28"}

{"question": "How to adjust the MAX PM drums ?", "answer": ["any drum can be chosen as
a reference point. Usually it is convinient to begin ajusting from filter feed drum"], "docu-
ment": "OT. 01.11.04.05-2 (MAX PM) en.v2.doc", "page": "28"}

{"question": "How to adjust the GD121P garniture clappers ?", "answer": ["Follow the oper-
ation below for performing the adjustment of the clapper"], "document": "GD12P Mehcani-
cal Manual.pdf", "page": "134"}

As shown in the five examples above, there are a few remarks to be made about this
test set:

 Names of machines are not very explicit, e.g. Protos 70, MAX PM, GD121P, and there
are terms that have a particular meaning in this particular field such as “glue”,
“seam”, “drums”, “clappers”. The names of the machines are tokens that do not
exist in our dictionary, and the domain-specific words have semantics that might not
fit their typical meaning. For example, in the case of “knife”, it describes an actual
piece on a machine and hence the meaning is captured in the embedding. However,
“drum” does not refer to the musical instrument but to a part of a tool in a machine.

 If we plot the distribution of the number of tokens in the answers, as shown on Figure
13, one can observe a mixture of two different Gaussians: one of mean 3.71
and standard deviation 1.71 and another one of mean 26.89 and standard deviation
16.03. This means we have short-answer questions, with four tokens on aver-
age, where the user wants to know a fact such as: “What is the capacity…”, “Where
…”, “How much time…”, “When…”. The second type of question is charac-
terized by long answers describing definitions and tasks: “What is … for/role?”,
“What is the goal/purpose/function of…”, “How to …”.

Figure 13 Distribution of the number of tokens in the answers

 Questions can often be ambiguous; for instance, “Where are located the PLC?” or
“How do I replace the V-Belts?” can have different answers, since the PLC can be sit-
uated at different places. We do not know on which production line or on which
machine the user is working.

The questions are related to a subfolder of Training Operations on conventional
cigarette machines, which contains 2’179 diverse files; the whole SharePoint contains 3’182
files. We created one index for the subdirectory and one for the main directory; however, most of the
documents in the other subdirectories are replicas of what can be found in the one con-
taining 2’179 files.


5.2 Metrics
The choice of metrics to assess the performance of the different approaches appeared
to not be as trivial as expected. As a first guess, and as many other research papers in
question answering suggest, we should aim at maximizing two scores:

 The Exact Match (EM) score, which is simply the proportion of predicted answers
that are exactly equal to the true expected answer. It is a binary result for each question.

 The F1 score, which is the harmonic mean of precision and recall, that is:
$F_1 = \dfrac{2}{\frac{1}{precision} + \frac{1}{recall}}$. It is between 0 and 1, with 1 being the best value. The precision is
defined as $Precision = \dfrac{TP}{TP + FP}$, where TP (True Positive) is an item correctly re-
trieved and FP (False Positive) is an item wrongly retrieved; applied to the predicted answer tokens, it measures
the exactness of the words predicted. For instance, if we expected the answer « ap-
prox 13 liters » and we retrieved « 13 liters », the precision score is $\frac{2}{2+0} = 1$, since
the two words we retrieved are correct and there is no incorrect word. The recall is
defined as $Recall = \dfrac{TP}{TP + FN}$, where FN (False Negative) is an item wrongly not re-
trieved; it measures the completeness of the words expected. Taking the same
example as for the precision, the recall score is $\frac{2}{2+1} = \frac{2}{3}$, because we re-
trieved 2 of the 3 correct words. Note that we do not take order into account, which
can be misleading. Considering again the same example, these two results
lead to an F1 score of 0.8. A minimal computation sketch of both metrics follows this list.
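The sketch below computes EM and token-level F1 as defined above, assuming whitespace tokenization and no answer normalization (both assumptions of this illustration).

```python
# Minimal sketch of the EM and token-level F1 metrics defined above.
def exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction, truth):
    pred, true = prediction.lower().split(), truth.lower().split()
    common = sum(min(pred.count(t), true.count(t)) for t in set(true))
    if common == 0:
        return 0.0
    precision = common / len(pred)   # TP / (TP + FP)
    recall = common / len(true)      # TP / (TP + FN)
    return 2 / (1 / precision + 1 / recall)

print(exact_match("13 litres", "approx. 13 litres"))           # 0
print(round(f1_score("13 litres", "approx. 13 litres"), 2))    # 0.8
```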

There are a few limitations concerning these metrics. Firstly, for each question we
would need an exact match of the answer in the documents, which was not the
case for all our questions. Moreover, it is sometimes subjective where an answer to a
question should start and where it should stop. If we look at the answer “Fol-
low the operation below for performing the adjustment of the clapper“, one could expect
the operation coming next as an answer, or simply “Follow the operation below”. As dis-
cussed earlier, the question might be too vague and we cannot always expect one and
only one answer to be correct. In our context we need to ask ourselves what we want to
maximize: the answer is not given as speech or within a very short period of time; in-
stead, the user has the possibility to explore documents. An answer can only be validated
if the context is given in addition. For these reasons, we augmented the context of the span
retrieved by 10 tokens on both sides of the answer, and we assume it to be correct if it con-
tains at least one keyword (not a stopword) of the expected answer. For the sake of sim-
plicity, we call P@n the percentage of questions for which an answer is correct (as described) in
one of the top n documents; this is not to be confused with the precision in the context of
document retrieval.

On the document retrieval part, we first assumed that only one
document was relevant and defined P@n as the percentage of questions for which the
expected document is retrieved; however, we refined this measurement along the thesis by
checking whether the answer segment appears in one of the top n documents. Note that it is dif-
ferent from the precision commonly described in information retrieval, which is the one we
described at the start of this section.

5.3 Results
This section presents the performance of our iterative approach; we will see the re-
sults that motivated the decisions we took.

5.3.1 Initial DrQA


As a starting point, we ran our system on the Operations documents (both the full direc-
tory and the subfolder). We present a sample of results (queries going through the
whole pipeline) on the subfolder, with documents split into pages, in Table 2.

QUESTION | TRUE ANSWER | DRQA ANSWER PREDICTED
What is the capacity of the glue jet tank for a Protos 70 ? | approx. 13 litres | 13 litres
Where are located the PLC ? | Into switch cabinet | B13
How to check the strength of the cigarette seam ? | Select at random one cigarette and hold the cigarette horizontally at the filter with your left hand in such a way that the tobacco ends is facing yourself. Then turn the rod 45° to the right and 45° to the left and check if the seam has opened due to the torsion. If that is the case, check all remaining cigarettes of the sample. | 1/3
How to adjust the MAX PM drums ? | any drum can be chosen as a reference point. Usually it is convinient to begin ajusting from filter feed drum | a reference point
How to adjust the GD121P garniture clappers ? | Follow the operation below for performing the adjustment of the clapper | Repeat the adjustment if necessary after the final check

Table 2 Sample of answers from DrQA with initial settings

As shown on Figure 14, the token distribution is right-skewed because the
system gives, in general, shorter answers than expected. The example shown on Figure
15 gives a good picture of how the model has been trained: the answers have to
be as short as possible and, if the query is of the form « What is {m} ? » and there is {m}
with parentheses in the text, it is most likely that what is in parentheses is the information we are look-
ing for. From an engineering point of view we could easily retrieve a bigger segment, and from
the user point of view we could easily locate the exact answer we are looking for with this
kind of answer.


Figure 14 Distribution of segments length answered by DrQA

Figure 15 Q: "What is the inner frame transversal transport belt Focke 550 for?" Result of DrQA highlighted in yellow and
expected answer in green


We tested it on the 50 Q&As test set and the results we obtained are shown in the
first and third lines of Table 3. Compared to the expected distribution, it confirms the
assumption that DrQA answers with as short an answer as possible. The results for the retriever
are comparable to those stated in (Chen, et al., 2017), which reached 78% with the same
measurements on the SQuAD dataset. However, the RNN performance is lower: we
initially reached a 28% F1 score (for the full pipeline) as opposed to 79% (for the reader part
only) and 8% EM compared to 29.5% (both on the full pipeline). The results obviously
get worse as we add irrelevant documents to the inverted index, as this adds noise to
the retrieval step.

Table 3 Results from DrQA with initial settings (line 1), the granular TFIDF (line 2), the initial and granular TFIDF on a directory containing
all the files of the department (only the subfolder contains the relevant files) (lines 3 and 4) and finally the reader only (line 5)

5.3.2 Reader only


Testing the reader alone corresponds to the best performance we could achieve by giving
the right page to the reader; as shown in Table 3, the upper bound in F1 score is 42%.
This is low considering the 79% reference stated by (Chen, et al., 2017).

5.3.3 Granular TF-IDF


In the same Table 3 we show the results of the combination of TFIDFs, filtering out
documents first and then pages, as described in Section 4.3.1. For N=5 documents retrieved,
the P@5 for the retrieval part is slightly better (+2%) but the results over the whole pipeline
are the same (28% F1 score for both approaches).

Combinations can be visualized on Figure 16.


Figure 16 Visualisation of the results in Table 3

5.3.4 Parameter Optimization


One of the concerns we had was whether we should give a big bulk of text to the read-
er or one as small as possible. We hence tested the system with various numbers of pages
fed to the reader. The results are shown in both Table 4 and Figure 17.

Table 4 Results when tweaking the number of documents retrieved

Figure 17 Visualization of the results of Table 4

Note that results get worse as we feed more text into the RNN. Indeed, we could not
evaluate it without the retriever because of memory allocation issues; however, we illustrate
the idea by retrieving 500 pages that we input into the RNN, and we reached an F1 score of
12.35% and an EM of 4%, which is very low. Hence there is an obvious benefit in the doc-
ument retriever step beforehand: it is a considerable gain of time and precision.

5.3.5 Automatic Topic Modelling classifier

In Table 5, we present an example of LDA results with 4 different topics. We then
apply the additional filtering with our best model so far and evaluate performances. We
obtained exactly the same results as without the topic modelling filtering: 78% P@5, 10% EM
and 31.12% F1 score. This was expected, because LDA bases its topic model-
ling on word frequency in documents, as described in Chapter 4, and hence it is highly cor-
related with the TFIDF approach.

Topic id | Weighted average of first 10 terms
0 | 0.011*"machine" + 0.008*"fig" + 0.006*"position" + 0.006*"set" + 0.005*"type" + 0.005*"operator" + 0.005*"international" + 0.005*"safety" + 0.004*"filter" + 0.004*"line"
1 | 0.074*"morris" + 0.072*"philip" + 0.022*"page" + 0.022*"task" + 0.017*"izhora" + 0.016*"version" + 0.016*"qsmp" + 0.013*"date" + 0.011*"effective" + 0.009*"intertaba"
2 | 0.019*"training" + 0.012*"machine" + 0.008*"task" + 0.008*"manual" + 0.008*"drum" + 0.008*"page" + 0.007*"wheel" + 0.007*"adjustment" + 0.007*"screws" + 0.006*"filter"
3 | 0.021*"doc" + 0.018*"focke" + 0.018*"pos" + 0.015*"versione" + 0.008*"het" + 0.008*"machine" + 0.008*"van" + 0.008*"copyright" + 0.008*"sono" + 0.008*"numero"

Table 5 LDA results on 4 topics

Besides, the results for the reader stage made us question ourselves on two aspects:

- Is a generic ML model well suited for such a specific enterprise case?

- Are the measurements usually maximized for a Question Answering task (i.e.
EM and F1) the right ones to use?

5.3.6 ElasticSearch Integration Results


Solr has been tested on 1200 full documents, in English only and with a standard
analyzer. For ES, however, we have indexed every single document (zip, pictures, videos,
etc.), i.e. 2179 documents. Initially we started with a P@1 of 40% and a P@5 of 74% (for ES),
but after some troubleshooting in both the test set and the measurement computations we
obtained the results shown in Table 6.

P@n | ElasticSearch | Solr
P@1 | 54% | 54%
P@3 | 80% | 80%
P@5 | 88% | 84%
Table 6 ES and Solr comparison results

As we can see, there is no considerable gap between the two, and hence we priori-
tized the easiest to use with the biggest support and popularity, namely Elasticsearch. We
set ourselves the goal to reach at least the original performance of TFIDF, i.e. a P@5 of
90%. Table 7 shows the different customizations we followed. The way we proceeded
was simply to look at each question that was causing issues, and this guided us to the
desired results. For instance, in « When should the pump cleaning be performed ? » the two
tokens « pump » and « cleaning » articulate together and have a different meaning if
they are not together, hence it makes sense to use bigrams. We also have similar ques-
tions such as « What is the LES inspection system ? », where the three tokens « LES »,
« inspection » and « system » articulate together; hence we gained 12% in P@1 by
simply adding a trigram analyzer. Another improvement was made using a list of synonyms:
for example, if we take « What does GSR means ?», « GSR » is not a token that appears
in any of the documents in the corpus, and one way to fix it is to add a synonym that ap-
pears in the corpus, i.e. « Gas to Solids Ratio » in this example. If we have the combina-
tion of uni-, bi- and trigrams and synonyms, we reach a P@5 of 94%. By tweaking the im-
portance of the different fields we managed to improve the P@5 by 2%; indeed, we squared the
importance of the bigrams as well as the trigrams. Last but not least, in the case of questions
such as « How temperature is influencing the tobacco's taste and monitored ? », the an-
swer appears where only the root of « influencing » is situated, and hence we used English
stemming. It improved the P@3 by 10%.

P@n | 2-grams + unigram | 3-grams + unigram | Synonyms | Weight tweaks | Stemming
P@1 | 66% | 66% | 68% | 68% | 70%
P@3 | 80% | 82% | 84% | 84% | 94%
P@5 | 90% | 92% | 94% | 96% | 96%

Table 7 ES results with different settings

Note that we obtained these results doing full-document retrieval. We applied the
best-performing analyzer to the indexed pages, and the results are in Table 8.

P@n | Full document indexed | Page indexed
P@1 | 70% | 52%
P@3 | 94% | 76%
P@5 | 96% | 84%
Table 8 Results of ES on full documents compared to pages

We have a P@5 of 84%, compared to 78% with the original TFIDF implemented by
Chen et al., even though they stated in their paper that it performed better than Elas-
ticSearch. Another comment is that this percentage only gives the proportion of ques-
tions for which we retrieved the expected document, but it is highly likely that, even though the
page is not the one expected, it does contain a meaningful answer. Thus, as mentioned in
the Metrics section, we refined this P@n metric by checking each document manu-
ally in order to see if a meaningful answer was in the top n retrieved. By meaningful we
mean that the answer is exactly the same as the expected span, or deviates by a few syno-
nyms, and that the user could still infer the answer from the document. The results are pre-
sented in Table 9; note that, since we had to do these measurements manually, for the
P@5 we looked at the retrieved response only, whereas for P@1 we looked at the whole
page.

Measurement | Average
Best predicted span in a meaningful page (“P@1”) | 58%
Top 5 predicted spans giving the correct answer or part of it (“P@5”) | 86%
Table 9 Results of ES with the refined measurement

5.3.7 Word vectorization


We modified the original DrQA model by making use of fastText embeddings instead
of GloVe. We used the model trained on Common Crawl (600B tokens), which contains
2 million word vectors in 300 dimensions. Results are presented in Table 10.

Setting | F1 Score
Original DrQA with GloVe, 5 pages | 32.01%
DrQA with fastText, 5 pages | 35.29%
DrQA with fastText, 3 pages | 39.49%
DrQA with fastText, 1 page | 38.19%
Table 10 Results comparison between GloVe and fastText

As suspected, fastText vectorization outperforms GloVe by 7% of F1 score. It is important
to note that there is only a 1% drop if we search in the top 1 page instead of the top 3.

5.3.8 Sliding Window


The RNN takes as input a set of documents and tries to find the answer given this
text, independently of how well the retrieval part performed. Thus it « reranks » the
documents according to the predicted span results. In Table 11 we computed the P@1
for the page both at the end of the retriever stage and at the end of the reader stage (the page con-
taining the best prediction when reading the top 3 and top 5 pages). Note that, as at the end of the
ElasticSearch results section, we consider a page to be relevant when it contains the
answer (not only if it is the expected page).

Measurement | Average
P@1 Retriever | 72%
P@1 Reader – Top 5 pages retrieved | 58%
P@1 Reader – Top 3 pages retrieved | 70%
Table 11 % of relevant answers by DrQA when we retrieve different numbers of pages

We can see that if we input more than one page into the RNN, it is more likely to
find an answer in an irrelevant document. Hence, it raised the question introduced in Sec-
tion 4.4.2: is the model that reads at scale helping to gain insight, or would a simple non-
ML method outperform this reading step? Note also that the highest P@1 is reached when we read
only the first page retrieved, which shows that we should rather read one page at a time;
this has been taken into account in the UI.

The code for the implementation of the sliding window is in the Jupyter notebook
called BM25vsDrQA.ipynb. It also contains the comparison with the best DrQA model.
7’246’565 tokens (in 43k documents) are used to train the QDR scoring function model,
coming from the original training manuals plus an additional dataset acquired at
the end of the thesis that contains webpages on Reduced Risk Products. Of course, these
are two different fields; however, they are still related to cigarettes, PMI and tobacco,
whereas the corpus dealing with the Finance KB contains too many off-topic terms to
include it as well.

In order to compare the two approaches, we decided to use a slightly different
measurement than before, as we did for the retriever part. Instead of computing the distance
between two spans, we use a binary classification that is true when at least one of the
keywords of the expected answer is contained in the selected segment. Note first that we
added a small context of 10 tokens on both sides of the answer predicted by DrQA,
and note also that we directly input a relevant page into the model. The results
are shown in Table 12.

P@n | 30-token window, random | 30-token window, ranked with BM25 | DrQA span ± 10 tokens
P@1 (top 1 predicted segment) | 28% | 70% | 94%
P@5 (top 5 predicted segments) | 64% | 90% | 94%
Table 12 Comparison of random, Sliding Window and DrQA approaches

These results show that DrQA is much more precise: it clearly exceeds both a ran-
dom guess and a simple non-ML method based on a BM25 sliding window.

Chapter 6 Conclusion
6.1 Result Discussion
Throughout our research we have seen, first, that there is a considerable difference in per-
formance between our enterprise use case and the results obtained by Chen et al. with
DrQA: 42% F1 score versus 79% for the reader only, and 8% EM against 29% on the full
pipeline. Despite the good results from the Document Retriever relative to Chen et al.
(76% P@5 for pages and 90% P@5 for full documents versus 78% P@5 for Wikipedia
articles), we decided to focus on this step since, in the case of a search engine, it is the priority
and the performance of the rest of the pipeline relies on it.

We first attempted to improve the performance of the retriever by classifying the
documents in an unsupervised manner with LDA, but this approach was not successful and did not
lead to any improvement. However, by adding one layer of TFIDF and filtering documents
and then pages, we managed to gain a slight improvement, which is not enough (+2%
P@5 for pages).

We then proposed two main contributions. First, we created a scanner of a directory
(Section 4.2); the particularities of the scanner are that it parses documents into pages while
keeping the page layout, and that it handles many extensions with the use of an OCR and the
extraction of metadata. Secondly, we integrated Elasticsearch into the (Chen, et al., 2017) pipe-
line and managed to obtain 84% (+8%) P@5 for pages and 96% (+6%) P@5 for full
documents.

We then conducted our research on how to adapt the generic MRC model to our
domain-specific use case. We managed to reach a 39% F1 score (+10% from the original
settings of DrQA) by troubleshooting and following the settings of recent research (Raison,
et al., 2018). Along the way, we also refined the measurement methods used, because the
purpose of our project was search and gaining insight into enterprise-related documents,
which is less rigid than giving a very short and exact answer span. Final results suggest
that the deep learning (DL) model we used has a benefit over (1) a random guess of a text
segment over the whole page and (2) an IR-based, non-ML sliding window. Even though
performances are not as good as those of models that answer open-domain questions in the litera-
ture, recent breakthroughs in DL for NLP and MRC have a positive impact and make the
technologies ready for certain enterprise use cases. Moreover, in the context of vertical
search, where operators were searching through a hierarchical filtering process of the files
and keyword search of manual names, there is a high potential in terms of time gained for
the end user of the system created. Indeed, the search engine was designed so
that it handles both a top-down and a bottom-up approach: whether the user knows exactly
what he is looking for and what to ask, or whether he needs support with the exploration
of the results and the metadata filters.

This project dispelled the myths surrounding deep learning by clarifying what can be
done and what is not achievable in terms of IR and MRC on a relatively specific domain
such as production machines’ manuals. There are high expectations in the industry
towards this technology and the results are promising, but it often needs a considerable
amount of resources in order to be customized to the use case. Finally, even though we
attempted to tackle a QA task, the top-k document retrieval stage of our pipeline, which
demonstrated excellent results, is non-ML and it represents the most reliable (and most scal-
able) feature for the end user to perform full-text search.

6.2 Future development


Future work may entail improving the core of the system by designing a new archi-
tecture for the reading side, or performing end-to-end training across the document re-
triever and the reader instead of training both independently, as suggested in (Chen, et al.,
2017). One could retrain a completely different architecture as suggested in (Raison, et al.,
2018), which might require a GPU for training. In addition, refactoring and lowering the com-
plexity of a few noted parts of the code would be necessary for the deployment and produc-
tionization of the tool. Moreover, the UI has been built for the sake of demonstration and func-
tionality and might need improvement in further work.

Another angle of attack is post-processing, constantly refining the results shown to
the user, for instance through incremental training. Another option would be to cache
already successfully answered questions and propose them if a query is similar enough, in
terms of F1 distance, to one correctly answered. This is a key aspect, as in natural
language search users often do not know how to formulate their questions or what to ask.
In the same perspective, we could also focus on reinforcement learning, where the user
can give feedback on whether the content proposed is relevant for each query.

For a larger-scale project and in a more horizontal search perspective, we could
also aim at leveraging the context of the user who is searching by looking at his location,
department, colleagues, etc. Of course, this brings confidentiality into the picture, which
requires the knowledge graph of employees’ read, write and edit rights. This is an essen-
tial concept that has not been tackled here.

Last but not least, we did not collect any question and answer pairs in other lan-
guages in order to test how the model performs in a non-English setup. As a next step,
one could collect real-life logs so that an in-depth analysis of the tool and
its equivalents in other languages can be made. While most successful models manage to
train on ever more data, we hope that this technology will tend towards being less data-
demanding in order to train case-specific models.

References
Auer, S. et al., 2007. DBpedia: A Nucleus for a Web of Open Data. Berlin, Springer-Verlag, pp. 722-735.

Bahdanau, D., Cho, K. & Bengio, Y., 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, Volume
abs/1409.0473.

Berant, J., Chou, A., Frostig, R. & Liang, P., 2013. Semantic parsing on freebase from question-answer pairs. 1.pp. 1533-1544.

Chen, D., Bolton, J. & Manning, C. D., 2016. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. CoRR,
Volume abs/1606.02858.

Chen, D., Fisch, A., Weston, J. & Bordes, A., 2017. Reading Wikipedia to Answer Open-Domain Questions. CoRR, Volume
abs/1704.00051.

Ellen M. Voorhees, D. K. H., 2005. Trec: Experiment and Evaluation in Information Retrieval. s.l.:MIT PR.

Eriksen, T. H., 2001. Tyranny of the Moment. s.l.:Pluto Press.

Fan, J., Kalyanpur, A., Gondek, D. C. & Ferrucci, D. A., 2012. Automatic knowledge extraction from documents. IBM Journal of
Research and Development, 5, Volume 56, pp. 5:1--5:10.

Ferrucci, D. et al., 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 7, Volume 31, p. 59.

Graves, A., 2012. Supervised Sequence Labelling with Recurrent Neural Networks. s.l.:Springer Berlin Heidelberg.

Graves, A., Wayne, G. & Danihelka, I., 2014. Neural Turing Machines. CoRR, Volume abs/1410.5401.

Hermann, K. M. et al., 2015. Teaching Machines to Read and Comprehend. CoRR, Volume abs/1506.03340.

Hill, F., Cho, K. & Korhonen, A., 2016. Learning Distributed Representations of Sentences from Unlabelled Data. CoRR, Volume
abs/1602.03483.

Joulin, A. et al., 2016. FastText.zip: Compressing text classification models. CoRR, Volume abs/1612.03651.

Jurafsky, D. & Martin, J. H., 2018. Speech and Language Processing: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd draft ed. Upper Saddle River, NJ, USA: Prentice Hall PTR.

Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K., 2015. From Word Embeddings To Document Distances. Lille, PMLR, pp. 957-966.

Li, D. & Qian, J., 2016. Text sentiment analysis based on long short-term memory. s.l., IEEE.

Lindström, M., 2017. Commentary on Wanget al. (2017): Differing patterns of short-term transitions of nondaily smokers for
different indicators of socioeconomic status (SES). Addiction, 4, Volume 112, pp. 873-874.

Liu, K., Zhao, J., He, S. & Zhang, Y., 2015. Question Answering over Knowledge Bases. IEEE Intelligent Systems, 9, Volume 30, pp. 26-
35.

Liu, X., Shen, Y., Duh, K. & Gao, J., 2017. Stochastic Answer Networks for Machine Reading Comprehension. CoRR, Volume
abs/1712.03556.

Manning, C. D., Raghavan, P. & Schütze, H., 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University
Press.

Mikolov, T., Chen, K., Corrado, G. & Dean, J., 2013. Efficient Estimation of Word Representations in Vector Space. CoRR, Volume
abs/1301.3781.


Mikolov, T. et al., 2017. Advances in Pre-Training Distributed Word Representations. CoRR, Volume abs/1712.09405.

Miller, A. C. et al., 2017. A genetic basis for molecular asymmetry at vertebrate electrical synapses.. eLife, 5.Volume 6.

Miller, A. H. et al., 2016. Key-Value Memory Networks for Directly Reading Documents. CoRR, Volume abs/1606.03126.

Mintz, M., Bills, S., Snow, R. & Jurafsky, D., 2009. Distant Supervision for Relation Extraction Without Labeled Data. Stroudsburg,
Association for Computational Linguistics, pp. 1003-1011.

Moen, H. et al., 2015. Care episode retrieval: distributional semantic models for information retrieval in the clinical domain. BMC
Medical Informatics and Decision Making, 6.Volume 15.

Nguyen, T. et al., 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. CoRR, Volume
abs/1611.09268.

Pennington, J., Socher, R. & Manning, C., 2014. Glove: Global Vectors for Word Representation. s.l., Association for Computational
Linguistics.

Raison, M., Mazaré, P.-E., Das, R. & Bordes, A., 2018. Weaver: Deep Co-Encoding of Questions and Documents for Machine
Reading. CoRR, Volume abs/1804.10490.

Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P., 2016. SQuAD: 100, 000+ Questions for Machine Comprehension of Text. CoRR,
Volume abs/1606.05250.

Richardson, A., Gardner, R. G. & Prelich, G., 2013. Physical and genetic associations of the Irc20 ubiquitin ligase with Cdc48 and
SUMO.. PloS one, 8(10), p. e76424.

Robertson, S. & Zaragoza, H., 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr., 4, Volume
3, pp. 333-389.

Turnbull, D., 2016. Relevant Search. s.l.:Manning.

Vrandečić, D. & Krötzsch, M., 2014. Wikidata. Communications of the ACM, 9, Volume 57, pp. 78-85.

Wang, S. et al., 2017. R^3: Reinforced Reader-Ranker for Open-Domain Question Answering. CoRR, Volume abs/1709.00023.

Weinberger, K. Q. et al., 2009. Feature Hashing for Large Scale Multitask Learning. CoRR, Volume abs/0902.2206.

Wei, X. & Croft, W. B., 2006. LDA-based Document Models for Ad-hoc Retrieval. New York, NY, USA, ACM, pp. 178-185.

Welbl, J., Stenetorp, P. & Riedel, S., 2017. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. CoRR,
Volume abs/1710.06481.

Weston, J., Chopra, S. & Bordes, A., 2014. Memory Networks. CoRR, Volume abs/1410.3916.

Wu, Y. et al., 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR,
Volume abs/1609.08144.

Yang, L. et al., 2015. Yang et al. 2015 Supplementary materials. s.l.:s.n.

Yu, A. W. et al., 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. CoRR, Volume
abs/1804.09541.
