
Internship Report

Report presented in partial fulfillment of the requirements for graduation as:

Business Intelligence Engineer

Field: Software Engineering

An Intelligent MCQ System Based on NLP and Gradient Boosting Machines

Louly Adam

National School Of Applied Sciences

Supervised by:

Pr. Karim El Moutaouakil (ENSA AL-Hoceima)

M. Badr El Houari (xHub)



Abstract

For quite some time now, artificial intelligence (AI) and its subset, machine learning, have been a hot topic. Countless industries apply this technology in various ways to automate and optimize all kinds of processes, and the field of recruitment and HR is no exception.

The recruitment process is one of the most painful processes for the HR department. Reading all applicants' resumes, classifying them by skills and level of expertise, scheduling the online technical test, and deciding whether a profile should be accepted for an in-person interview takes a great deal of time for each applicant.

The traditional recruitment process also fails candidates, companies and recruiters: it revolves around the human interpretation of complex data, which is too sensitive to prejudice and mental shortcuts.

In this project, we used machine learning and artificial intelligence techniques to build an automatic hiring system that helps the HR department by handling the recruitment process automatically from A to Z.

The project consists of three main components. The first is a resume parser that uses natural language processing to parse a newly submitted resume and extract the applicant's profile, skills and level of expertise. The second is a flexible technical test that uses Gradient Boosting Machines (GBMs) to classify questions and natural language processing to predict the next question. The third is a profiling system that analyses the CV and the technical-test performance and produces a summary of the applicant, giving the hiring manager better insight.

Keywords: Natural Language Processing, Machine Learning, Automatic process, Gradient Boosting Machines, Artificial intelligence, Hiring System

Résumé

Artificial intelligence (AI) and its subset, machine learning, have been a hot topic for some time now. Many industries apply this technology in various ways to automate and optimize all kinds of processes, in particular in the fields of recruitment and HR.

The recruitment process is one of the most painful processes for the human resources department. Reading all the candidates' CVs, classifying them by skills and level of expertise, scheduling the online technical test and approving whether or not a profile is accepted for an in-person interview takes a great deal of time for each applicant.

Moreover, the traditional recruitment process fails candidates, companies and recruiters. It revolves around the human interpretation of complex data, which is too sensitive to prejudice and mental shortcuts.

In this project, we use machine learning and artificial intelligence techniques to create an automatic hiring system, very useful for the human resources department, that handles the recruitment process automatically from A to Z.

The project consists of three main components: a resume parser that uses natural language processing to analyse a new candidate's CV and extract their profile, skills and level of expertise; a flexible technical test that uses Gradient Boosting Machines (GBMs) to classify questions and natural language processing to predict the next question; and a profiling system that analyses the CV and the technical-test performance and provides a summary of the candidate, giving the hiring manager better insight.

Keywords: Natural Language Processing, Machine Learning, Automatic process, Gradient Boosting Machines, Artificial intelligence, Hiring System

Table of content

Abstract 2

Résumé 3

Table of content 4

Table of figures 8

Dedications 10

Thanks 11

General Introduction 12

Chapter 1: General context of the project 13

Introduction 14

1. Presentation of the host organization “xHub” 14

1.1. General Presentation 14


1.2. IT Events 15
1.3. IT Expertise 15
1.4. Outsourcing 16
1.5. Development of innovative products 16
2. Presentation of the Mother project “DevPeer” 16

2.1. Vision and motivation of the project 16


2.2. The « DevPeer » Platform 17
2.3. Objectives of the project 17
2.4. Project Management Methodology Adopted 18

2.4.1. Justifications for the use of Scrum 18


2.4.2. Project planning 19
Conclusion 19

Chapter 2 : State of the art 21

Introduction 22

1. Natural language processing 22

1.1. What is NLP? 22


1.2. Levels of NLP 24
1.2.1. Phonology 24
1.2.2. Morphology 25
1.2.3. Lexical 25
1.2.4. Syntactic 25
1.2.5. Semantic 26
1.2.6. Discourse 26
1.2.7. Pragmatic 27
1.3. Natural Language Generation 27
1.4. Applications of NLP 29
1.4.1. Machine Translation 29
1.4.2. Text Categorization 29
1.4.3. Spam Filtering 30
1.4.4. Information Extraction 31
1.4.5. Summarization 32
1.5. Fundamental concepts of NLP 33
1.5.1. Word embeddings 33
1.5.2. Importance of context 34
1.5.3. Word2Vec & Global Vectors for Word Representations 35
1.6. State of the art architectures 36
1.6.1. Embeddings from Language Models (ELMo) 36
1.6.2. Bidirectional Encoder Representations from Transformers (BERT) 38
1.6.3. Multi-Task Deep Neural Networks (MT-DNN) 40

2. Gradient Boosting Machines 41

2.1. What are GBMs? 41


2.2. Fundamental concepts of GBMS 42
2.2.1. Ensemble 42
2.2.2. Bagging 43
2.2.3. Boosting 43
2.3. Gradient Boosting algorithm 44
2.4. Types of GBMs 45
2.4.1. CatBoost 45
2.4.2. LightGBM 45
2.4.3. XGBoost 45
3. Performance Measures 46

3.1. Classification Accuracy 46


3.2. Confusion Matrix 46
3.3. Area Under Curve 47
3.4. F1 Score 49
3.5. Mean Absolute Error 50
Conclusion 50

Chapter 3 : Implementation 51

Introduction 52

1. Work Environment 52

1.1. Data Environment 52


1.1.1. Kaggle Kernels 52
1.1.2. Python 53
1.1.3. Python Packages & Libraries 53
1.1.4. Keras 54
1.1.5. PyTorch 54
1.1.6. XGBoost 54
1.1.7. Software Environment 54

1.1.7.1. Flask (Backend) 54


1.1.7.2. ReactJs (Frontend) 55
1.1.7.3. Docker (Deployment) 55
1.1.7.4. Jenkins (CI/CD) 55
1.2. Model 1: Resume Parser 56

1.2.1. Data Collection 56


1.2.2. Data Preparation 56
1.2.2.1. Tokenization, Part-of-speech Tagging, and Parsing 56
1.2.2.2. Lemmatization and Stemming 57
1.2.2.3. Stopword Removal 57
1.2.3. Modelling 58
1.3. Model 2: Next Difficulty Level 58

1.3.1. Data Collection 58


1.3.2. Data Preparation 59
1.3.3. Modelling 59
1.4. Model 3: Next Question Prediction 59

1.5. Software Implementation 60

1.5.1. General workflow of the application 61


1.5.2. System performance 61
1.5.3. Demonstration 62
2. Conclusion 65

General conclusion 66

References 67

Table of figures
Figure 1: xHub's logo .................................................................................................................... 14

Figure 2: DevoxxMa & Devoxx4Kids logo.................................................................................. 15

Figure 3: Scrum methodology ...................................................................................................... 18

Figure 4: Natural language processing types ................................................................................ 23

Figure 5: Natural language processing levels ............................................................................... 24

Figure 6: Natural language processing components ..................................................................... 28

Figure 7: Text embeddings ........................................................................................................... 33

Figure 8: PCA on word vectors .................................................................................................... 33

Figure 9: Example corpus of vectorization ................................................................................... 35

Figure 10: GloVe implementation ................................................................................................ 36

Figure 11: ELMo example ............................................................................................................ 36

Figure 12: Bidirectional language model example ....................................................................... 37

Figure 13: Bidirectional language model example 2 .................................................................... 38

Figure 14: BERT ........................................................................................................................... 39

Figure 15: Encoder used in BERT ................................................................................................ 39

Figure 16: How BERT was trained ............................................................................................... 40

Figure 17: Multi-Task Deep neural networks architecture ........................................................... 41

Figure 18: The history of GBMs ................................................................................................... 42



Figure 19: Ensembling on Gradient boosting machines ............................................................... 43

Figure 20: Bagging and Boosting on GBMs................................................................................. 44

Figure 21: Lemma and stem ......................................................................................................... 57

Figure 22: Global Architecture of the Application ....................................................................... 60

Figure 23: Choose profile page ..................................................................................................... 62

Figure 25: Upload resume page .................................................................................................... 63

Figure 26: Rate yourself page ....................................................................................................... 63

Figure 27: General test page ......................................................................................................... 64

Figure 28: Chosen profile test ....................................................................................................... 64

Figure 29: Test done page ............................................................................................................. 65



Dedications

To My Dear Mother

An inexhaustible source of tenderness, patience and sacrifice. Your prayers and your blessing have been of great help to me throughout my life. Whatever I may say or write, I could not express my great affection and my deep gratitude. I hope never to disappoint you, nor to betray your trust and your sacrifices.

May Almighty God preserve you and give you health, long life and happiness.

To My Dear Father

No dedication can express my respects, my gratitude and my deep love, and I hope to realize one
of your dreams.

You have inspired me throughout my life and will always be a great inspiration to me for as long as I live. May God preserve you and grant you health and happiness.

To My Brothers

Mohammed and Zayd, you are among the blessings for which I thank God; no dedication can express the depth of the fraternal feelings, love and attachment I feel for you.

I dedicate this work to you as a testimony of my deep affection and in memory of the unwavering bond that has been woven between us over the days. May God protect you, and keep and strengthen our brotherhood.

To My Family

Louly’s family, without you I would not have achieved this success.

I will be grateful to you all my life for your unconditional support.

To My Friends

It would be hard for me to name you all; you are all fondly in my heart.

Thanks
I sincerely thank Allah for giving me the courage and the will to complete this work.

My gratitude, more than sincere and profound, goes to Professor Karim El Moutaouakil, who supervised, supported and accompanied me throughout this end-of-studies internship project. His advice, his remarks and the many exchanges we had were of immense help in carrying out this work. At the end of this scientific adventure, I would also like to thank him for his human qualities, his scientific skills and his teaching.

All my thanks to Mr. Badr El Houari, my internship supervisor at xHub, who welcomed me warmly into his team. I thank him for his daily support, his always sharp remarks and his encouragement.

My thanks also go to the members of the jury, who kindly accepted to evaluate this work.

Finally, I would like to extend my affectionate thanks to the entire school and administration of ENSAH for providing me with all the necessary knowledge during my three years of study, in a pleasant atmosphere of camaraderie and respect.

General Introduction
Automation is the technology by which a process or procedure is performed without
human assistance. Automation or automatic control is the use of various control systems for
operating equipment such as machinery, processes in factories, boilers and heat treating ovens,
switching on telephone networks, steering and stabilization of ships, aircraft and other
applications and vehicles with minimal or reduced human intervention. Some processes have
been completely automated.

Automation covers applications ranging from a household thermostat controlling a boiler to a large industrial control system with tens of thousands of input measurements and output control signals. In terms of control complexity, it can range from simple on-off control to multi-variable high-level algorithms.

During my internship, my mission was to combine automation and machine learning to build a smart, automatic MCQ system designed to imitate a recruiter and cover most of the recruitment process automatically.

The system is designed to parse and analyse resumes, and to classify each profile based on the skills, profile type and level of expertise extracted from the resume using natural language processing techniques such as BERT (Bidirectional Encoder Representations from Transformers) and Multinomial Naive Bayes. After analysing the resume and classifying the profile, a technical test is auto-generated from the output of this first component. The technical test is a flexible and adaptive MCQ system in which the next question is predicted based on the previous questions and the answers submitted by the applicant.

At the end of the technical test, an analyser combines the applicant's behaviour, resume and submitted answers, and generates a summary of the applicant that helps the recruiter decide whether to invite the applicant to an in-person interview.

Chapter 1: General context of the project

Summary:

This chapter presents, in a general way, the context and objectives of our graduation project.

We begin by presenting the company "xHub" as the host organization. The second part is dedicated to the presentation of our project: its framework, its context, the study of the existing situation and its objectives.

Introduction

In principle, one should not start a project without having acquired a thorough knowledge of the human and economic environment in which it will take place.

Indeed, the analysis of the project context is a basic element of project management, intended to frame solutions within their social, economic and environmental contexts. It is an initial analysis tool that is part of a chain of reflection and project definition. In addition, it is an excellent communication tool when preparing documents to be presented to various stakeholders, including potential project funders.

1. Presentation of the host organization “xHub”

1.1. General Presentation

Figure 1: xHub's logo

XHUB is an internationally renowned IT consulting and services firm. The primary mission of XHUB is to provide real solutions to all types of issues related to the field of information technology. With offices in Morocco, Spain and Canada and a vast network of affiliated experts around the world, XHUB's mission is to provide its customers with fast, responsive, high value-added service regardless of where they are.

XHUB's experts are highly qualified IT engineers who have been selected primarily for their positive and dynamic mindset. Their technical talents, combined with their very good collaboration skills (listening, responsiveness and speed of execution), are real assets that allow XHUB to build, with confidence, real long-term partnerships with its customers. In addition, XHUB aims to be a true catalyst for IT talent in the African region.

By organizing various large-scale events in Morocco ("Devoxx Morocco"), in Algeria ("Voxxed Algiers") and in Senegal ("Voxxed Dakar"), XHUB contributes to fostering the emergence of a true community of talented and motivated African developers. Such a community is essential for developing high-quality cooperation and business relations in the IT field in Africa, a continent in the midst of its digital transformation.

Finally, XHUB wants to prepare the next generation of Moroccan developers by sponsoring the Devoxx4Kids association each year. This day, entirely dedicated to children, aims to let them discover the fabulous world of software development in a fun way and to introduce them to video game programming and robotics.

XHUB offers four main services: the organization of international IT events, the provision of high-level IT expertise, outsourcing (staff placement and fixed-price projects), and the development of innovative products.

1.2. IT Events

XHUB organizes the Devoxx Morocco event, a first for Africa and the Middle East. Devoxx is the largest independent IT conference series in the world specializing in information technology. Its themes are presented by high-level speakers covering current topics such as Big Data, Cloud, Internet of Things, Mobile, the DevOps methodology and Agility, IT entrepreneurship, and Security.

Figure 2: DevoxxMa & Devoxx4Kids logo

1.3. IT Expertise

XHUB wants to position itself as the reference expert in Morocco. Indeed, XHUB is able
to offer highly specialized expertise services for Moroccan and African companies. As it
organizes international conferences (with international and local partners), it has access to highly
specialized and globally recognized experts. It is also up-to-date on the latest technologies and
can therefore easily answer complex IT issues. It offers both consulting and auditing services for
any IT department wishing to make a digital transformation or acquire high-performance IT
solutions.

1.4. Outsourcing

XHUB's positioning in the developer community, where it helps to promote and enhance talent, gives it a strong connection with developers. The message conveyed when this community gathers consistently highlights the talents that have become central resources in companies' information systems organizations. XHUB can therefore offer its customers talented resources that perfectly meet their needs. These resources are placed either at the customer's site or in fixed-price project mode, backed by XHUB's technical support.

1.5. Development of innovative products

XHUB continually seeks to create added value by developing innovative, cloud-ready technology products. In this spirit, XHUB has developed Umbreo.com, a cloud-based DevOps platform that gives developers, IT operations teams and system administrators the power to quickly and easily set up IT infrastructures.

2. Presentation of the Mother project “DevPeer”

2.1. Vision and motivation of the project

As part of xHub's growth strategy, one of whose main objectives is to stand out from other startups and consulting firms by organizing many events (the most important being DevoxxMA) and by carrying out ever more innovative and modern internal projects, a social network named "DevPeers" was imagined. It serves as the successor to "developpeur.ma", a former xHub project consisting of a collaborative platform for Moroccan developers.

That platform unfortunately met with mixed success, due in particular to a user experience that did not respond to the needs of its user base, and to an architecture that, although effective, lagged technologically behind the competition.

That is why the "DevPeers" project comes into play, aiming to correct the errors of its predecessor, developpeur.ma, by focusing on the data processing and analysis aspect.

2.2. The « DevPeer » Platform

Thus, while the platform is still in its infancy, our team was responsible for designing and implementing data-driven software components inspired by the latest technologies on the market, to lay the foundations on which the platform will be able to grow, and thus guarantee its evolution from a simple website to a highly scalable "data-driven website" that responds to the data generated and consumed by its users.

As a result, this project is as research-oriented as it is development-oriented, because additional reflection had to be invested in defining the specifications, instead of starting the project from a set of well-defined specifications.

2.3. Objectives of the project

Among the tasks to be carried out for the project to progress well, we cite:

• Suggesting a data model for a data-driven platform.

• Participating in the selection and development of the necessary tools.

• Exploring and visualizing data.

• Performing statistical analysis and experiments to derive business insights.

• Studying and proposing algorithms for data processing.

• Developing machine learning and deep learning algorithms.



2.4. Project Management Methodology Adopted

The development process is a determining factor in the success of a project, because it


frames its different phases and characterizes the main features of its conduct. As part of this
project, the Scrum methodology was used.

2.4.1. Justifications for the use of Scrum

Scrum is an agile methodology designed to greatly improve team productivity. It is adopted by


the company XHub in all its projects and has largely helped to shape its corporate culture based on
daily collaboration, team spirit and continuous delivery.

This method is simple to set up and greatly improves the follow-up of coaching as well as the fluidity of the team's internal and external communication.

Thanks to its incremental nature, it also makes it possible to anticipate, and thus correct early, any kind of deviation of the project. This is particularly useful in our case, especially during the first phase of the project, which consists of analysing and designing our first components, a crucial step with a high risk of derailing the project from its original purpose.

Figure 3: Scrum methodology



2.4.2. Project planning

Before starting the project, a first sprint, named sprint 0, took place under the supervision
of the Product Owner, who at the same time took the role of Scrum Master for this occasion, to
document the current situation, to conduct a preliminary study on the circumstances of the
project as well as to train us in the Scrum methodology.

At the end of this sprint 0, our team had a clear vision of the situation, and separate, specialized tasks were distributed among the members.

Five components were envisioned and became the subject of our work during the following sprints:

• A recommendation system.

• A sentiment analysis module.

• A resume parser.

• A smart technical test.

• A chatbot.

The next three-week sprint was the starting point of the work. I decided to work on the technical test, and the backlog of my tasks was as follows:

• Design of a first data schema on which our components will be based

• Creation / Generation of the training / testing set for the first model of the components

• Study and test frameworks specialized in the development of Machine learning models

• Deliver a prototype of the components as a Rest API

Conclusion

This first chapter was devoted to the presentation of our host organization "xHub" and to the detailed presentation of our graduation project. In what follows, we begin the state of the art of our project, carrying out a theoretical study of the concepts and practices related to our subject.

Chapter 2 : State of the art

Summary:

This chapter presents the state of the art of the machine learning techniques that we used in our MCQ system. First, it introduces the notion of Natural Language Processing (NLP) and details its levels. Then we discuss gradient boosting machines, which we used for classification problems.

Introduction

Natural language processing (NLP) has recently gained much attention for representing and analysing human language computationally. Its applications have spread to various fields such as machine translation, email spam detection, information extraction, summarization, medicine, and question answering. This chapter distinguishes the main phases by discussing the different levels of NLP and the components of Natural Language Generation (NLG), before presenting the NLP applications that took place in our project.

1. Natural language processing

1.1. What is NLP?

Natural Language Processing (NLP) is a branch of Artificial Intelligence and Linguistics devoted to making computers understand statements or words written in human languages. Natural language processing came into existence to ease the user's work and to satisfy the wish to communicate with the computer in natural language. Since not all users may be well-versed in machine-specific languages, NLP caters to those users who do not have the time to learn or master them. A language can be defined as a set of rules and a set of symbols: symbols are combined and used for conveying information, and they are governed by the rules. Natural Language Processing can basically be classified into two parts, Natural Language Understanding and Natural Language Generation, which cover the tasks of understanding and generating text.

Figure 4: Natural language processing types

The goal of Natural Language Processing is to accommodate one or more specialities of an algorithm or system, and the metrics by which an NLP system is assessed allow for the integration of language understanding and language generation. NLP is even used in multilingual event detection: Rospocher et al. proposed a novel modular system for cross-lingual event extraction from English, Dutch and Italian texts, using different pipelines for different languages. The system incorporates a modular set of foremost multilingual NLP tools. The pipeline integrates modules for basic NLP processing as well as more advanced tasks such as cross-lingual named entity linking, semantic role labelling and time normalization. Thus, the cross-lingual framework allows for the interpretation of events, participants, locations and times, as well as the relations between them. The output of these individual pipelines is intended to be used as input for a system that builds event-centric knowledge graphs. All modules behave like UNIX pipes: they take standard input, perform some annotation, and produce standard output, which in turn is the input for the next module. The pipelines are built as a data-centric architecture so that modules can be adapted and replaced. Furthermore, the modular architecture allows for different configurations and for dynamic distribution.

1.2. Levels of NLP

The 'levels of language' are one of the most explanatory methods for representing Natural Language Processing; they help to generate NLP text through the Content Planning, Sentence Planning and Surface Realization phases.

Figure 5: Natural language processing levels

Linguistics is the science of language, covering the meaning of language, language context and the various forms of language. The most important levels of Natural Language Processing are:

1.2.1. Phonology

Phonology is the part of linguistics which refers to the systematic arrangement of sound. The term phonology comes from Ancient Greek: the root phono- means voice or sound, and the suffix -logy refers to word or speech. Nikolai Trubetzkoy defined phonology as "the study of sound pertaining to the system of language", whereas Lass (1998) wrote that phonology deals broadly with the sounds of language; as a sub-discipline of linguistics it can be described as follows: "phonology proper is concerned with the function, behaviour and organization of sounds as linguistic items." Phonology includes the semantic use of sound to encode the meaning of any human language.

1.2.2. Morphology

The different parts of a word represent its smallest units of meaning, known as morphemes. Morphology, which concerns the nature of words, is built from morphemes. As an example, the word precancellation can be morphologically analysed into three separate morphemes: the prefix pre-, the root cancella, and the suffix -tion. The interpretation of a morpheme stays the same across all words, so to understand the meaning of an unknown word humans can break it down into its morphemes. For example, adding the suffix -ed to a verb conveys that the action of the verb took place in the past. Words that cannot be divided and have meaning by themselves are called lexical morphemes (e.g. table, chair). Elements (e.g. -ed, -ing, -est, -ly, -ful) that are combined with a lexical morpheme are known as grammatical morphemes (e.g. worked, consulting, smallest, likely, useful). Grammatical morphemes that only occur in combination are called bound morphemes (e.g. -ed, -ing). Grammatical morphemes can be divided into bound morphemes and derivational morphemes.

1.2.3. Lexical

At the lexical level, humans, as well as NLP systems, interpret the meaning of individual words. Several types of processing contribute to word-level understanding, the first of these being the assignment of a part-of-speech tag to each word. In this processing, words that can act as more than one part of speech are assigned the most probable part-of-speech tag based on the context in which they occur. At the lexical level, words that have a single meaning can be replaced by a semantic representation of that meaning. In an NLP system, the nature of this representation varies according to the semantic theory deployed.

1.2.4. Syntactic

This level aims to analyse the words in a sentence so as to uncover the grammatical structure of the sentence. Both a grammar and a parser are required at this level. The output of this level of processing is a representation of the sentence that reveals the structural dependency relationships between the words. There are various grammars that can be used, which in turn affect the choice of parser. Not all NLP applications require a full parse of sentences; therefore the remaining challenges in parsing, such as prepositional phrase attachment and conjunction scoping, no longer hinder applications for which phrasal and clausal dependencies are sufficient.

Syntax conveys meaning in most languages because order and dependency contribute to
connotation. For example, the two sentences: ‘The cat chased the mouse.’ and ‘The mouse
chased the cat.’ differ only in terms of syntax, yet convey quite different meanings.

1.2.5. Semantic

Semantics is the level at which most people think meaning is determined; in fact, however, all the levels contribute to meaning. Semantic processing determines the possible meanings of a sentence by focusing on the interactions among word-level meanings in the sentence. This level of processing can include the semantic disambiguation of words with multiple senses, in a way analogous to how the disambiguation of words that can function as multiple parts of speech is handled at the syntactic level. For example, among other meanings, 'file' as a noun can mean a binder for gathering papers, a tool to shape one's fingernails, or a line of individuals in a queue (Elizabeth D. Liddy, 2001). The semantic level examines words not only for their dictionary definitions, but also for the meaning they derive from the context of the sentence. Semantic context recognizes that most words have more than one meaning, but that we can identify the appropriate one by looking at the rest of the sentence.

1.2.6. Discourse

While syntax and semantics work with sentence-length units, the discourse level of NLP works with units of text longer than a sentence; that is, it does not interpret multi-sentence texts as just a sequence of sentences, each of which can be interpreted on its own. Rather, discourse focuses on the properties of the text as a whole that convey meaning by making connections between component sentences. Two of the most common discourse-level tasks are Anaphora Resolution, which replaces words such as pronouns, semantically empty on their own, with the pertinent entity to which they refer, and Discourse/Text Structure Recognition, which determines the functions of sentences in the text and, in turn, adds to the meaningful representation of the text.

1.2.7. Pragmatic

Pragmatics is concerned with the purposeful use of language in situations, and uses context over and above the content of the text to understand goals and to explain how extra meaning is read into texts without being literally encoded in them. This requires much world knowledge, including the understanding of intentions, plans and goals. For example, resolving an anaphoric term such as 'they' across two sentences may require pragmatic or world knowledge.

1.3. Natural Language Generation

Natural Language Generation (NLG) is the process of producing meaningful phrases, sentences and paragraphs from an internal representation. It is a part of Natural Language Processing and happens in four phases: identifying the goals, planning how the goals may be achieved by evaluating the situation and the available communicative resources, and realizing the plans as text. It is the opposite of Natural Language Understanding.

Components of NLG are as follows:

● Speaker and Generator – To generate a text, we need a speaker or an application, and a generator or program that renders the application's intentions into fluent phrases relevant to the situation.

● Components and Levels of Representation – The process of language generation involves the following interwoven tasks. Content selection: information should be selected and included in the set; depending on how this information is parsed into representational units, parts of the units may have to be removed while others may be added by default. Textual organization: the information must be textually organized according to the grammar; it must be ordered both sequentially and in terms of linguistic relations such as modifications. Linguistic resources: to support the information's realization, linguistic resources must be chosen; in the end these resources come down to choices of particular words, idioms, syntactic constructs, etc. Realization: the selected and organized resources must be realized as an actual text or voice output.

Figure 6: Natural language processing components

● Application or Speaker – This component only maintains the model of the situation. The speaker just initiates the process and does not take part in the language generation itself. It stores the history, structures the potentially relevant content and deploys a representation of what it actually knows. All of this forms the situation, from which a subset of the speaker's propositions is selected. The only requirement is that the speaker must make sense of the situation.

1.4. Applications of NLP

Natural Language Processing can be applied in various areas such as machine translation, email spam detection, information extraction, summarization, question answering, etc.

1.4.1. Machine Translation

As most of the world is online, making data accessible and available to all is a challenge, and a major obstacle is the language barrier: there is a multitude of languages with different sentence structures and grammars. Machine translation consists of translating phrases from one language to another with the help of a statistical engine like Google Translate. The challenge with machine translation technologies is not translating words directly but keeping the meaning of sentences intact, along with grammar and tenses. Statistical machine translation systems gather as much data as they can find that seems to be parallel between two languages and crunch it to find the likelihood that something in language A corresponds to something in language B. As for Google, in September 2016 it announced a new machine translation system based on artificial neural networks and deep learning. In recent years, various methods have been proposed to automatically evaluate machine translation quality by comparing hypothesis translations with reference translations. Examples of such methods are word error rate, position-independent word error rate, generation string accuracy, multi-reference word error rate, BLEU score and NIST score. All these criteria try to approximate human assessment and often achieve an astonishing degree of correlation with human subjective evaluation of fluency and adequacy.

1.4.2. Text Categorization

Categorization systems take as input a large flow of data such as official documents, military casualty reports, market data and newswires, and assign them to predefined categories or indices. For example, the Carnegie Group's Construe system takes Reuters articles as input and saves much time by doing work that would otherwise be done by staff or human indexers. Some companies use categorization systems to categorize trouble tickets or complaint requests and route them to the appropriate desks. Another application of text categorization is email spam filtering. Spam filters are becoming increasingly important as the first line of defence against unwanted emails. The false negative and false positive issues of spam filters are at the heart of NLP technology; they come down to the challenge of extracting meaning from strings of text. A filtering solution applied to an email system uses a set of protocols to determine which incoming messages are spam and which are not. Several types of spam filters are available:

• Content filters: review the content of the message to determine whether it is spam.

• Header filters: review the email header looking for fake information.

• General blacklist filters: stop all emails from blacklisted senders.

• Rules-based filters: use user-defined criteria, such as stopping mail from a specific person or stopping mail that includes a specific word.

• Permission filters: require anyone sending a message to be pre-approved by the recipient.

• Challenge-response filters: require anyone sending a message to enter a code in order to gain permission to send email.

1.4.3. Spam Filtering

Spam filtering works using text categorization, and in recent times various machine learning techniques have been applied to text categorization and anti-spam filtering, such as rule learning, Naïve Bayes, memory-based learning, support vector machines, decision trees and maximum entropy models, sometimes combining different learners. Using these approaches is preferable because the classifier is learned from training data rather than built by hand. Naïve Bayes is often preferred because of its performance despite its simplicity. In text categorization, two types of models have been used; both assume that a fixed vocabulary is present. In the first model, a document is generated by first choosing a subset of the vocabulary and then using the selected words any number of times, at least once, irrespective of order. This is called the multi-variate Bernoulli model: it captures which words are used in a document, irrespective of how many times or in what order. In the second model, a document is generated by choosing a set of word occurrences and arranging them in any order. This model is called the multinomial model; in addition to what the multi-variate Bernoulli model captures, it also records how many times a word is used in a document. Most text categorization approaches to anti-spam email filtering have used the multi-variate Bernoulli model.
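To make the two models concrete, here is a minimal sketch (with made-up toy emails, not the project's data) that contrasts the multinomial and multi-variate Bernoulli Naive Bayes classifiers using scikit-learn:

```python
# Illustrative sketch: multinomial vs multi-variate Bernoulli Naive Bayes for spam filtering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

emails = [
    "win money now claim your prize",
    "cheap meds win win win",
    "meeting agenda for monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham (toy labels)

vectorizer = CountVectorizer().fit(emails)
X = vectorizer.transform(emails)

# Multinomial model: uses how many times each word occurs in a document
multinomial = MultinomialNB().fit(X, labels)

# Multi-variate Bernoulli model: only uses whether each word occurs at all
bernoulli = BernoulliNB().fit(X, labels)

test = vectorizer.transform(["claim your free prize now"])
print(multinomial.predict(test), bernoulli.predict(test))
```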

1.4.4. Information Extraction

Information extraction is concerned with identifying phrases of interest in textual data. For many applications, extracting entities such as names, places, events, dates, times and prices is a powerful way of summarizing the information relevant to a user's needs. In the case of a domain-specific search engine, the automatic identification of important information can increase the accuracy and efficiency of a directed search. Hidden Markov models (HMMs), for example, have been used to extract the relevant fields of research papers; these extracted text segments are used to allow searches over specific fields, to provide effective presentation of search results and to match references to papers. A familiar example is the pop-up ads on websites showing, with discounts, the recent items you may have viewed in an online store. In information retrieval, the same two types of models described above (multi-variate Bernoulli and multinomial) have been used.


Knowledge discovery has become an important area of research over recent years. Knowledge discovery research uses a variety of techniques to extract useful information from source documents, such as part-of-speech (POS) tagging, chunking or shallow parsing, stop-word removal (removing frequent keywords before processing documents), and stemming (mapping words to some base form; it has two methods, dictionary-based stemming and Porter-style stemming, where the former has higher accuracy but a higher implementation cost, while the latter has a lower implementation cost and is usually insufficient for IR). Other techniques include compound or statistical phrases (indexing multi-token units instead of single tokens) and word sense disambiguation (determining the correct sense of a word in context; when used for information retrieval, terms are replaced by their senses in the document vector).

The extracted information can be applied to a variety of purposes, for example to prepare a summary, to build databases, to identify keywords, or to classify text items according to predefined categories. For example, CONSTRUE was developed for Reuters and is used to classify news stories. It has been suggested that while many IE systems can successfully extract terms from documents, acquiring relations between the terms is still a difficulty. PROMETHEE is a system that extracts lexico-syntactic patterns relative to a specific conceptual relation. IE systems should work at many levels, from word recognition to discourse analysis at the level of the complete document. One application of the Blank Slate Language Processor (BSLP) approach is the analysis of a real-life natural language corpus consisting of responses to open-ended questionnaires in the field of advertising.

There is also a system called MITA (MetLife's Intelligent Text Analyzer) that extracts information from life insurance applications, and Ahonen et al. suggested a mainstream framework for text mining that uses pragmatic and discourse-level analyses of text.

1.4.5. Summarization

Information overload is a real issue in this digital age: our reach and access to knowledge and information already exceed our capacity to understand it. This trend is not slowing down, so the ability to summarize data while keeping the meaning intact is highly valuable. This is important not only for recognizing and understanding the important information in a large set of data; it is also used to capture deeper emotional meanings. For example, a company can determine the general sentiment on social media about its latest product offering, which makes this application a valuable marketing asset.

The type of text summarization depends on the number of documents, and the two important categories are single-document summarization and multi-document summarization. Summaries can also be of two types: generic or query-focused. The summarization task can be either supervised or unsupervised; training data is required in a supervised system for selecting relevant material from the documents.

1.5. Fundamental concepts of NLP

In this section, we’ll talk about some core concepts of natural language processing.

1.5.1. Word embeddings

The concept of word embedding plays a critical role in the realisation of transfer learning for
NLP tasks. Word embeddings are essentially fixed-length vectors used to encode and represent a
piece of text.

Figure 7: Text embeddings

The vectors represent words as multidimensional continuous numbers where semantically similar
words are mapped to proximate points in geometric space. You can see here below how the
vectors for sports like “tennis”, “badminton”, and “squash” get mapped very close together.

Figure 8: PCA on word vectors



A key benefit of representing words as vectors is that mathematical operations can be performed
on them, such as:

king − man + woman ≈ queen
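As an illustration of such vector arithmetic, the short sketch below (not part of the original work) performs the analogy with pretrained GloVe vectors loaded through gensim's downloader; the choice of the "glove-wiki-gigaword-50" vectors is an assumption made for the example.

```python
# Illustrative sketch: analogies and similarities on pretrained word vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained GloVe model, downloaded on first use

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# semantically similar sports end up close together in the vector space
print(vectors.similarity("tennis", "badminton"))
```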

1.5.2. Importance of context

To get a good representation of text data, it is crucial to capture both context and semantics. For example, the word "minute" can refer to a unit of time or to something very small: whilst its spelling is the same in both cases, the meanings are very different.

In addition to this, the same words can also have different meanings based on their context. For
example, “good” and “not good” convey two very different sentiments.

By incorporating the context of words we can achieve high performance for downstream real-
world tasks such as sentiment analysis, text classification, clustering, summarisation, translation
etc.

Traditional approaches to NLP such as one-hot encoding and bag of words models do not
capture information about a word’s meaning or context. However, neural network-based
language models aim to predict words from neighbouring words by considering their sequences
in the corpus.

Context can be incorporated by constructing a co-occurrence matrix. This is computed simply by


counting how two or more words occur together in a given corpus.
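A minimal sketch of this idea is shown below, using a toy corpus and scikit-learn's CountVectorizer; it counts co-occurrence within each sentence rather than within a sliding window, which is a simplifying assumption made for illustration.

```python
# Illustrative sketch: building a word co-occurrence matrix from a toy corpus.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat chased the mouse",
    "the mouse chased the cat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # document-term count matrix

cooc = (X.T @ X).toarray()             # term-term co-occurrence counts (per sentence)
np.fill_diagonal(cooc, 0)              # ignore a word co-occurring with itself

print(vectorizer.get_feature_names_out())
print(cooc)
```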

Figure 9: Example corpus of vectorization

1.5.3. Word2Vec & Global Vectors for Word Representations

Word2Vec and GloVe are both implementations that can be used to produce word embeddings from co-occurrence information. Word2Vec is a predictive, neural-network-based approach that only takes local context windows into account. By contrast, GloVe builds a global co-occurrence matrix and decomposes it into more expressive and dense word vectors: matrix factorization is performed to yield a lower-dimensional matrix of words and features, where each row yields a vector representation for a word.
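For illustration, the sketch below trains a tiny Word2Vec model with gensim on a toy corpus; with so little data the resulting similarities are meaningless, so this only demonstrates the API, not the project's actual training setup.

```python
# Illustrative sketch: training a small skip-gram Word2Vec model with gensim.
from gensim.models import Word2Vec

sentences = [
    ["resume", "parser", "extracts", "skills"],
    ["technical", "test", "predicts", "next", "question"],
    ["gradient", "boosting", "classifies", "questions"],
]

# skip-gram (sg=1), 50-dimensional vectors, small context window
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("skills", topn=3))
```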

Figure 10: GloVe implementation

1.6. State of the art architectures

We will now present some architectures which leverage these fundamental concepts and have
recently achieved some State Of The Art (SOTA) results on NLP tasks.

1.6.1. Embeddings from Language Models (ELMo)

Figure 11: ELMo example

ELMo aims to provide an improved word representation for NLP tasks in different contexts by producing multiple word embeddings per single word, across different scenarios. For example, the word "minute" has multiple meanings (homonyms), so it gets represented by multiple embeddings with ELMo, whereas with other models such as GloVe each instance would have the same representation regardless of its context.

ELMo uses a bidirectional language model (biLM) to learn both word and linguistic context. At
each word, the internal states from both the forward and backward pass are concatenated to
produce an intermediate word vector. As such, it is the model’s bidirectional nature that gives it a
hint not only as to the next word in a sentence but also the words that came before.

Figure 12: Bidirectional language model example

Another feature of ELMo is that it uses language models comprised of multiple layers, forming a
multilayer RNN. The intermediate word vector produced by layer 1 is fed up to layer 2. The
more layers that are present in the model, the more the internal states get processed and as such
represent more abstract semantics such as topics and sentiment. By contrast, lower layers
represent less abstract semantics such as short phrases or parts of speech.

Figure 13: Bidirectional language model example 2

In order to compute the word embeddings that get fed into the first layer of the biLM, ELMo
uses a character-based CNN. The input is computed purely from combinations of characters
within a word. This has two key benefits:

● It is able to form representations of words outside the vocabulary it was trained on. For example, the model could determine that "Mum" and "Mummy" are somewhat related before even considering the context in which they are used. This is particularly useful as it can help detect misspelled words through context.

● It continues to perform well when it encounters a word that was absent from the training
dataset.

1.6.2. Bidirectional Encoder Representations from Transformers (BERT)

BERT, introduced in a recent paper published by Google, incorporates an attention mechanism (the Transformer) that learns contextual relations between the words in a text. Unlike models such as ELMo, where the text input is read sequentially (left-to-right or right-to-left), here the entire sequence of words is read at once: one could actually describe BERT as non-directional.

Essentially, BERT is a trained transformer encoder stack where results are passed up from one
encoder to the next.

Figure 14: BERT

At each encoder, self-attention is applied and this helps the encoder to look at other words in the
input sentence as it encodes each specific word, so helping it to learn correlations between the
words. These results then pass through a feed-forward network.

Figure 15: Encoder used in BERT



BERT was trained on Wikipedia text data and uses masked modelling rather than sequential
modelling during training. It masks 15% of the words in each sequence and tries to predict the
original value based on the context. This involves the following:

● Adding a classification layer on top of the encoder output.

● Multiplying the output vectors by the embedding matrix, transforming them into the
vocabulary dimension.

● Calculating the probability of each word in the vocabulary with softmax.

Figure 16: How BERT was trained

Because the BERT loss function only takes into consideration the prediction of the masked values, it converges more slowly than directional models. This drawback, however, is offset by its increased awareness of context.
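As a hedged illustration of masked-word prediction (not the exact setup used in the project), the Hugging Face transformers library exposes this behaviour directly through a fill-mask pipeline:

```python
# Illustrative sketch: BERT predicting a masked word from its left and right context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The applicant passed the technical [MASK] with a high score."):
    print(candidate["token_str"], round(candidate["score"], 3))
```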

1.6.3. Multi-Task Deep Neural Networks (MT-DNN)

MT-DNN extends BERT to achieve even better results on NLP problems. It uses multi-task (parallel) learning instead of transfer (sequential) learning, computing the loss across different tasks while simultaneously applying the updates to the model.

As with BERT, MT-DNN tokenises sentences and transforms them into the initial word, segment
and position embeddings. The multi-directional transformer is then used to learn the contextual
word embeddings.

Figure 17: Multi-Task Deep neural networks architecture

2. Gradient Boosting Machines

2.1. What are GBMs?

Gradient boosting machines (GBMs) are a family of powerful machine-learning techniques that have shown considerable success in a wide range of practical applications. They are highly customizable to the particular needs of the application, for example by being learned with respect to different loss functions.

In what follows, the theory is complemented with descriptive examples and illustrations covering the stages of gradient boosting model design, and considerations on handling model complexity are discussed.

Figure 18: The history of GBMs

2.2. Fundamental concepts of GBMS

2.2.1. Ensemble

When we try to predict a target variable using any machine learning technique, the main causes of the difference between actual and predicted values are noise, variance and bias. Ensembling helps to reduce these factors (except noise, which is irreducible error).

An ensemble is simply a collection of predictors that come together (e.g. the mean of all predictions) to give a final prediction. The reason we use ensembles is that many different predictors trying to predict the same target variable will do a better job than any single predictor alone. Ensembling techniques are further classified into bagging and boosting.

Figure 19: Ensembling on Gradient boosting machines

2.2.2. Bagging

Bagging is a simple ensembling technique in which we build many independent predictors/models/learners and combine them using some model-averaging technique (e.g. weighted average, majority vote or simple average).

We typically take a random sub-sample/bootstrap of the data for each model, so that all the models are slightly different from each other. Each observation is chosen with replacement to be used as input for each of the models, so each model sees different observations based on the bootstrap process. Because this technique combines many uncorrelated learners into a final model, it reduces error by reducing variance. An example of a bagging ensemble is the Random Forest model.

2.2.3. Boosting

Boosting is an ensemble technique in which the predictors are not made independently, but
sequentially.

This technique employs the logic that the subsequent predictors learn from the mistakes of the
previous predictors. Therefore, the observations have an unequal probability of appearing in
subsequent models, and the ones with the highest error appear most often (so the observations are
not chosen through a bootstrap process, but based on the error). The predictors can be chosen
from a range of models like decision trees, regressors, classifiers etc. Because new predictors
learn from mistakes committed by previous predictors, it takes fewer iterations to get close to
the actual predictions, but we have to choose the stopping criteria carefully or it could lead
to overfitting on the training data. Gradient Boosting is an example of a boosting algorithm.

Figure 20: Bagging and Boosting on GBMs
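
As a rough illustration of this sequential logic (a toy sketch with scikit-learn, using squared error as in the next section, not the project's actual model), each new shallow tree is fit on the residual errors of the ensemble built so far:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

learning_rate, n_rounds = 0.1, 50
prediction = np.zeros_like(y)              # start from a trivial prediction
trees = []

for _ in range(n_rounds):
    residuals = y - prediction             # mistakes of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # correct the ensemble a little
    trees.append(tree)

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))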

2.3. Gradient Boosting algorithm

The objective of any supervised learning algorithm is to define a loss function and minimize it.
Let's see how the maths works out for the Gradient Boosting algorithm, taking the mean squared
error (MSE) as our loss.

We want our predictions to be such that our loss function (MSE) is at its minimum. By using
gradient descent and updating our predictions based on a learning rate, we can find the values
for which the MSE is minimum; the loss and the corresponding update rule are written out below.

So, we are basically updating the predictions such that the sum of our residuals is close to 0 (or
minimal) and the predicted values are sufficiently close to the actual values.
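
In symbols, with $y_i$ the actual values, $\hat{y}_i$ the current predictions, $n$ the number of observations and $\alpha$ the learning rate:

L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

\hat{y}_i \leftarrow \hat{y}_i - \alpha \cdot \frac{\partial L}{\partial \hat{y}_i}

which becomes:

\hat{y}_i \leftarrow \hat{y}_i + \alpha \cdot \frac{2}{n} (y_i - \hat{y}_i)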

2.4. Types of GBMs

2.4.1. CatBoost

CatBoost gives you the flexibility of passing the indices of categorical columns so that they can
be encoded with one-hot encoding, controlled by one_hot_max_size (one-hot encoding is used for all
features with a number of distinct values less than or equal to the given parameter value). If you
don't pass anything in the cat_features argument, CatBoost will treat all the columns as numerical
variables.

2.4.2. LightGBM

Similar to CatBoost, LightGBM can also handle categorical features by taking the feature names as
input. It does not convert them to one-hot encoding, and is much faster than one-hot encoding.
LightGBM uses a special algorithm to find the split value of categorical features.

2.4.3. XGBoost

Unlike CatBoost or LightGBM, XGBoost cannot handle categorical features by itself; it only accepts
numerical values, similar to Random Forest. Therefore, one has to perform an encoding such as
label encoding, mean encoding or one-hot encoding before supplying categorical data to XGBoost.
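
The difference can be illustrated with a tiny, hypothetical example (the column names and data are invented for the sketch; catboost and xgboost are assumed to be installed):

import pandas as pd
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

# Hypothetical data: one categorical column and one numerical column.
df = pd.DataFrame({
    "skill": ["python", "java", "python", "react"],
    "years": [3, 5, 1, 2],
    "hired": [1, 1, 0, 0],
})

# CatBoost: categorical columns are passed directly via cat_features.
cat_model = CatBoostClassifier(iterations=10, verbose=0)
cat_model.fit(df[["skill", "years"]], df["hired"], cat_features=["skill"])

# XGBoost: categorical columns must be encoded first (here, one-hot encoding).
encoded = pd.get_dummies(df[["skill", "years"]], columns=["skill"], dtype=int)
xgb_model = XGBClassifier(n_estimators=10)
xgb_model.fit(encoded, df["hired"])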

3. Performance Measures

3.1. Classification Accuracy

Classification Accuracy is what we usually mean when we use the term accuracy. It is the ratio of
the number of correct predictions to the total number of input samples.
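
Written as a formula:

\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}}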

It works well only if there are an equal number of samples belonging to each class.

For example, consider that there are 98% samples of class A and 2% samples of class B in our
training set. Then our model can easily get 98% training accuracy by simply predicting that every
training sample belongs to class A.

When the same model is tested on a test set with 60% samples of class A and 40% samples of class
B, the test accuracy drops down to 60%. Classification Accuracy is easy to use, but it can give us
a false sense of achieving high performance.

The real problem arises when the cost of misclassifying the minority class samples is very high.
If we deal with a rare but fatal disease, the cost of failing to diagnose the disease of a sick
person is much higher than the cost of sending a healthy person for more tests.

3.2. Confusion Matrix

The Confusion Matrix, as the name suggests, gives us a matrix as output and describes the complete
performance of the model.

Let's assume we have a binary classification problem, with samples belonging to two classes: YES
or NO. We also have our own classifier, which predicts a class for a given input sample. On
testing our model on 165 samples, we get the following result.

There are 4 important terms:

● True Positives: the cases in which we predicted YES and the actual output was also YES.

● True Negatives: the cases in which we predicted NO and the actual output was NO.

● False Positives: the cases in which we predicted YES and the actual output was NO.

● False Negatives: the cases in which we predicted NO and the actual output was YES.

Accuracy for the matrix can be calculated by summing the values lying on the "main diagonal"
(TP + TN) and dividing by the total number of samples.

The Confusion Matrix forms the basis for the other types of metrics.
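
As a small illustration with scikit-learn (hypothetical labels, not taken from our test set):

from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical actual and predicted classes.
y_true = ["YES", "YES", "NO", "NO", "YES", "NO"]
y_pred = ["YES", "NO",  "NO", "YES", "YES", "NO"]

# With this label order the matrix reads [[TP, FN], [FP, TN]].
print(confusion_matrix(y_true, y_pred, labels=["YES", "NO"]))

# Accuracy = (TP + TN) / total number of samples.
print(accuracy_score(y_true, y_pred))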

3.3. Area Under Curve

Area Under Curve (AUC) is one of the most widely used metrics for evaluation. It is used for
binary classification problems. The AUC of a classifier is equal to the probability that the
classifier will rank a randomly chosen positive example higher than a randomly chosen negative
example.

Before defining AUC, let us understand two basic terms:



● True Positive Rate (Sensitivity): the True Positive Rate is defined as TP / (TP + FN). It
corresponds to the proportion of positive data points that are correctly considered as positive,
with respect to all positive data points.

● False Positive Rate (1 − Specificity): the False Positive Rate is defined as FP / (FP + TN). It
corresponds to the proportion of negative data points that are mistakenly considered as positive,
with respect to all negative data points.



False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR and TPR are
both computed at threshold values such as (0.00, 0.02, 0.04, ..., 1.00) and a graph is drawn. AUC
is the area under the curve obtained by plotting the True Positive Rate against the False Positive
Rate at these different thresholds.
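
A short scikit-learn sketch of the computation (illustrative scores only):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                    # actual classes
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]     # predicted probabilities for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # (FPR, TPR) pairs at each threshold
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_scores))       # area under the ROC curve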

3.4. F1 Score

The F1 Score is the harmonic mean between precision and recall. The range for the F1 Score is
[0, 1]. It tells you how precise your classifier is (how many instances it classifies correctly),
as well as how robust it is (it does not miss a significant number of instances).

High precision but low recall gives you an extremely accurate classifier, but one that misses a
large number of instances that are difficult to classify. The greater the F1 Score, the better the
performance of our model. Mathematically, it can be expressed in terms of:

● Precision: the number of correct positive results divided by the number of positive results
predicted by the classifier.

● Recall: the number of correct positive results divided by the number of all relevant samples
(all samples that should have been identified as positive).

F1 Score tries to find the balance between precision and recall.
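
In symbols, with TP, FP and FN as defined in the confusion matrix section above:

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}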



3.5. Mean Absolute Error

Mean Absolute Error is the average of the absolute difference between the original values and the
predicted values. It gives us a measure of how far the predictions were from the actual output.
However, it doesn't give us any idea of the direction of the error, i.e. whether we are
under-predicting or over-predicting the data. Mathematically, it is represented as:
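
In symbols, with $y_j$ the original values, $\hat{y}_j$ the predicted values and $N$ the number of samples:

\text{MAE} = \frac{1}{N} \sum_{j=1}^{N} \left| y_j - \hat{y}_j \right|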

Conclusion

This chapter has allowed us to frame the subject of our work, to introduce the techniques we have
used in our project, and to review the terms and concepts related to it. The next chapter will be
dedicated to the analysis and specification of needs.

Chapter 3 : Implementation

Summary:

After having detailed the design adapted to our project, we will devote this last chapter of
the report to the realization part. For this, we will first present the hardware and software
development environment, then describe the development phase, including the technologies adopted
to carry out the project and the tests carried out, and we will end with some graphical interfaces
illustrating the work done.

Introduction

In the previous chapters, we tried to follow a logical sequence that allowed us to develop our
project. We now come to the realization phase, which is the completion phase of the project.

The purpose of this chapter is not to describe the lines of the source code one after the other;
this would be tedious and deeply boring for the reader. It is rather about presenting the work
environment, the user interfaces of the application, as well as the final product evaluation
tests.

1. Work Environment

1.1. Data Environment

In this part, we will present the environment used mainly to train models and deal with data.

1.1.1. Kaggle Kernels

Kaggle Kernels are essentially Jupyter notebooks in the browser that can be run in the cloud, free
of charge.

We used Kaggle Kernels for our data processing, as well as to train the models.

Technical specifications :

● 9 hours execution time

● 5 Gigabytes of auto-saved disk space (/kaggle/working)

● 16 Gigabytes of temporary, scratchpad disk space (outside /kaggle/working)

CPU Specifications

● 4 CPU cores

● 17 Gigabytes of RAM

GPU Specifications

● 2 CPU cores

● 14 Gigabytes of RAM

1.1.2. Python

Python is an interpreted, high-level, general-purpose programming language. Created by Guido
van Rossum and first released in 1991, Python's design philosophy emphasizes code readability
with its notable use of significant whitespace. Its language constructs and object-oriented
approach aim to help programmers write clear, logical code for small and large-scale projects.

Python has become the language of choice for data analytics. One of the major reasons for this is
the availability of easy and fun-to-work-with libraries in Python, which make it convenient to
work with and analyse large data sets.

1.1.3. Python Packages & Libraries

Some of the packages that we have used in the machine learning & data analytics part are:

● NumPy is used for fast mathematical operations such as matrix multiplications and many other
mathematical functions for computations on arrays of any dimension.

● SciPy is also a scientific computing library that adds a collection of algorithms and high-level
commands for manipulating and visualizing data. It also contains modules for optimization, linear
algebra, integration, Fast Fourier Transform, signal and image processing and much more.

● Pandas provides easy-to-use data analysis tools and contains functions designed to make data
analysis fast and easy.

● Matplotlib is a Python library that supports 2D and 3D graphics. It is used to produce
publication figures like histograms, power spectra, bar charts, box plots, pie charts and scatter
plots with just a few lines of code. Matplotlib easily integrates with Pandas dataframes to make
visualisations quickly and conveniently.

1.1.4. Keras

Keras is an open-source neural-network library written in Python. It is capable of running on top
of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML. Designed to enable fast
experimentation with deep neural networks, it focuses on being user-friendly, modular, and
extensible. It was developed as part of the research effort of project ONEIROS (Open-ended
Neuro-Electronic Intelligent Robot Operating System), and its primary author and maintainer is
François Chollet, a Google engineer.

1.1.5. PyTorch

PyTorch is a machine learning library for the Python programming language, based on the Torch
library, used for applications such as natural language processing. It is primarily developed by
Facebook's artificial-intelligence research group, and Uber's Pyro probabilistic programming
language is built on it. It is free and open-source software released under one of the BSD
licenses.

1.1.6. XGBoost

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient,
flexible and portable. It implements machine learning algorithms under the Gradient Boosting
framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many
data science problems in a fast and accurate way. The same code runs on major distributed
environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

1.1.7. Software Environment

1.1.7.1. Flask (Backend)

Flask is a micro web framework written in Python. It is classified as a microframework because
it does not require particular tools or libraries. It has no database abstraction layer, form
validation, or any other components where pre-existing third-party libraries provide common
functions. However, Flask supports extensions that can add application features as if they were
implemented in Flask itself.
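
As an illustration of how such an endpoint can look, here is a minimal Flask sketch in the spirit of our resume-upload step (the route name and the parse_resume helper are hypothetical placeholders, not the project's actual code):

from flask import Flask, request, jsonify

app = Flask(__name__)

def parse_resume(pdf_bytes):
    # Hypothetical placeholder for the NLP resume parser described later.
    return {"skills": ["python", "flask"], "profile": "Backend"}

@app.route("/api/resume", methods=["POST"])
def upload_resume():
    uploaded = request.files["resume"]        # PDF file sent by the client
    result = parse_resume(uploaded.read())    # extract skills and profile
    return jsonify(result)                    # JSON object consumed by the frontend

if __name__ == "__main__":
    app.run(debug=True)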

1.1.7.2. ReactJs (Frontend)

React (also known as React.js or ReactJS) is a JavaScript library for building user interfaces. It is
maintained by Facebook and a community of individual developers and companies.

React can be used as a base in the development of single-page or mobile applications, as it is
optimal for fetching rapidly changing data that needs to be recorded. However, fetching data is
only the beginning of what happens on a web page, which is why complex React applications
usually require the use of additional libraries for state management, routing, and interaction
with an API.

1.1.7.3. Docker (Deployment)

Docker is a set of coupled software-as-a-service and platform-as-a-service products that use
operating-system-level virtualization to develop and deliver software in packages called
containers. The software that hosts the containers is called Docker Engine. It was first released
in 2013 and is developed by Docker, Inc. The service has both free and premium tiers.

1.1.7.4. Jenkins (CI/CD)

Jenkins is an open source automation server written in Java. Jenkins helps to automate the non-
human part of the software development process, with continuous integration and facilitating
technical aspects of continuous delivery. It is a server-based system that runs in servlet
containers such as Apache Tomcat. It supports version control tools, including AccuRev, CVS,
Subversion, Git, Mercurial, Perforce, TD/OMS, ClearCase and RTC, and can execute Apache
Ant, Apache Maven and sbt based projects as well as arbitrary shell scripts and Windows batch
commands. The creator of Jenkins is Kohsuke Kawaguchi. Released under the MIT License,
Jenkins is free software.

1.2. Model 1: Resume Parser

This model was made in order to parse a resume and extract information about the applicant using
rule-based and machine learning techniques.

To do so, we need to apply NLP and classification algorithms on labeled data, since it is a
supervised learning problem.

1.2.1. Data Collection

We collected data from freelance websites and hiring websites using web scraping.

The dataset contains only English resumes, all in PDF format.

1.2.2. Data Preparation

This phase is one of the most important phases in building our model. We applied RegEx and some
rules to extract as much information as possible, and then we applied other text analysis
techniques such as:

1.2.2.1. Tokenization, Part-of-speech Tagging, and Parsing

Tokenization refers to the process of breaking up a string of characters into semantically
meaningful parts that can be analyzed (e.g., words) while discarding meaningless chunks (e.g.,
whitespace).

The examples below show two different ways in which one could tokenize the string 'Analyzing
text is not that hard'.

(Incorrect): Analyzing text is not that hard. = [“Analyz”, “ing text”, “is n”, “ot that”, “hard.”]
(Correct): Analyzing text is not that hard. = [“Analyzing”, “text”, “is”, “not”, “that”, “hard”, “.”]

Once the tokens have been recognized, it's time to categorize them. Part-of-speech tagging refers
to the process of assigning a grammatical category, such as noun, verb, etc. to the tokens that
have been detected.

Here are the PoS tags of the tokens from the sentence above:

“Analyzing”: VERB, “text”: NOUN, “is”: VERB, “not”: ADV, “that”: ADV, “hard”: ADJ, “.”:
PUNCT
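
A minimal sketch of these two steps with spaCy (assuming the small English model en_core_web_sm is installed; the actual toolkit used in the project may differ):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Analyzing text is not that hard.")

# Each token carries its part-of-speech tag, e.g. "Analyzing" -> VERB.
for token in doc:
    print(token.text, token.pos_)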

1.2.2.2. Lemmatization and Stemming

Stemming and Lemmatization both refer to the process of removing all of the affixes (i.e.
suffixes, prefixes, etc.) attached to a word in order to keep its lexical base, also known as root or
stem or its dictionary form or lemma. The main difference between these two processes is that
stemming is usually based on rules that trim word beginnings and endings (and sometimes lead
to somewhat weird results), whereas lemmatization makes use of dictionaries and a much more
complex morphological analysis.

Figure 21: Lemma and stem
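
A short NLTK sketch of the difference (the wordnet corpus download is assumed; illustrative only):

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # "studi": rule-based trimming
print(lemmatizer.lemmatize("studies", pos="v"))  # "study": dictionary-based lemma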

1.2.2.3. Stopword Removal

To provide a more accurate automated analysis of the text, it is important that we remove from
play all the words that are very frequent but provide very little semantic information or no
meaning at all. These words are also known as stopwords.

There are many different lists of stopwords for every language. However, it is important to
understand that you might need to add words to or remove words from those lists depending on
the texts you would like to analyze and the analyses you would like to perform.
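
A minimal NLTK sketch of this step (the stopwords corpus download is assumed):

import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = "Analyzing text is not that hard".split()

# Keep only the tokens that carry semantic content.
print([t for t in tokens if t.lower() not in stop_words])   # ['Analyzing', 'text', 'hard']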

1.2.3. Modelling

The model was made to extract skills from a resume and to classify it into a profile (Backend,
Frontend, Data Scientist, etc.).

We therefore trained a Naive Bayes model that can predict whether a word in a corpus is a
technical skill or not, and also whether a text belongs to a backend developer or a frontend
developer.

We used word2vec and doc2vec to transform the data into vectors so that the model can process it
properly.

After several rounds of model tuning, we achieved good results, which we will present in the
demonstration part.
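
The idea can be sketched as follows with gensim and scikit-learn (a toy corpus and labels, not our real training data or tuning):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.naive_bayes import GaussianNB

# Toy corpus: (text, profile label) pairs.
texts = [("built rest apis with spring boot", "backend"),
         ("designed react components and css layouts", "frontend")]

docs = [TaggedDocument(words=text.split(), tags=[i]) for i, (text, _) in enumerate(texts)]
d2v = Doc2Vec(docs, vector_size=20, min_count=1, epochs=40)

X = [d2v.infer_vector(text.split()) for text, _ in texts]   # documents as vectors
y = [label for _, label in texts]

clf = GaussianNB().fit(X, y)
print(clf.predict([d2v.infer_vector("java spring microservices".split())]))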

1.3. Model 2: Next Difficulty Level

1.3.1. Data Collection

The data was collected from an anonymous source; there were 5 columns of independent variables
and one dependent variable, which was -1, 0 or 1:

● -1 : the next level will be lower than the current level

● 0 : the next level will be the same as the current level

● 1 : the next level will be greater than the current level

The other 5 columns contain only 1 or 0:

● 1 : the question was answered correctly by the user

● 0 : the question was answered incorrectly by the user



1.3.2. Data Preparation

The data was already mostly prepared, so little additional preparation was needed.

We used 80% of the data in the training phase and 20% in the test phase, and we applied 5-fold
cross-validation with a 0.8/0.2 split at each iteration.

1.3.3. Modelling

In this phase we trained an XGBoost classifier, which predicts one of the 3 classes (-1, 0 or 1).

Model specifications:

● 5-fold cross-validation

● Hyperparameter tuning using GridSearch

● No stacking or blending was applied

As the dataset was of good quality and well prepared, the results were good and the accuracy was
around 96%.
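
A sketch of this training setup (the feature matrix and the hyperparameter grid below are illustrative, not the actual dataset or the grid we searched):

import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative data: 5 binary answer columns, 3 classes standing for -1, 0 and +1.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(300, 5))
y = rng.randint(0, 3, size=300)

param_grid = {"max_depth": [3, 5], "learning_rate": [0.05, 0.1], "n_estimators": [100, 200]}

# 5-fold cross-validation combined with an exhaustive grid search.
search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))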

1.4. Model 3: Next Question Prediction

In this part, we used a pre-trained model, BERT-base, which we presented in the previous chapter.

BERT makes use of the Transformer, an attention mechanism that learns contextual relations
between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate
mechanisms: an encoder that reads the text input and a decoder that produces a prediction for
the task. Since BERT's goal is to generate a language model, only the encoder mechanism is
necessary. The detailed workings of the Transformer are described in a paper by Google.

We used BERT to predict the next question to be generated, based on the user's answer. This is
meant to simulate a real interview:

● 0 (wrong answer) : the next question should be different from the previous one

● 1 (correct answer) : the next question should follow on from the previous one

The model is still new and generic, so we had to implement our own way of calculating the next
sentence prediction, based on the PyTorch & FastAI implementations.
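
As an illustration, the next-sentence-prediction head of a pre-trained BERT can be queried as follows with the Hugging Face transformers library (this is only a sketch; our implementation relied on PyTorch & FastAI and differs in the details):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

prompt = "What does the garbage collector do in Java?"
candidate = "How would you tune the JVM heap size?"

inputs = tokenizer(prompt, candidate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index 0 = "candidate follows the prompt", index 1 = "it does not".
probs = torch.softmax(logits, dim=1)
print("P(candidate follows prompt) =", probs[0, 0].item())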

1.5. Software Implementation

We will now present the way we developed our application, which is an intelligent quiz system.

Here is an overview of the global architecture of our application.

Figure 22: Global Architecture of the Application



1.5.1. General workflow of the application

In this part, we will explain the general workflow of the application by presenting every step of
the application and its corresponding action on the server side.

● A user uploads his resume.

● The resume is sent to the backend using a POST request.

● The resume is parsed in the backend using the models we built before.

● A JSON object containing the extracted skills and the profile is sent back to the client side.

● The client side asks the user to rate himself for each extracted skill.

● The ratings are sent to the backend again, a session is created and the test starts.

● The user then passes 5 general questions, mainly about general concepts and problem solving.

● When the user finishes the general test, the client side asks the backend for the next question.

● The backend predicts the next question and sends it to the client side.

● The user keeps answering the quiz; each result is sent to the backend to predict the next
question, until the end of the quiz.

1.5.2. System performance


When we were creating the model that predicts the next level of difficulty, we tried multiple
models, but the best results were achieved using XGBoost.

Model / Metric      F1       RMSE (Train)   RMSE (Test)
XGBoost             94.13    97.08          95.64
SVM                 88.36    95.33          82.12
Random Forest       91.07    96.11          90.20

SVM results were good in the training phase, but in the testing phase we can see that it overfits
and its test score drops by more than 13 points.

Random Forest and XGBoost results were close, but Random Forest also overfits a little.

We chose XGBoost for its promising results.

1.5.3. Demonstration

In this section, we will present a demonstration of our application using screenshots.

● When you first visit the website, a page appears listing the positions that are open at the
company.

Figure 23: Choose profile page

● After choosing the profile you want to apply for, a page with a form appears, where you fill in
your name and upload your resume in PDF format.

Figure 24: Upload resume page

● After uploading the resume, the script starts parsing your resume and returns a list of your
skills for you to rate.

Figure 25: Rate yourself page

● Once you are done rating your skills, a general test starts.



Figure 26: General test page

● After finishing the general test, the test associated with the chosen profile starts.

Figure 27: Chosen profile test



● The user keeps answering the questions; after each question, the script calculates the
difficulty level and the next question, 15 times in total. When you finish answering all the
questions, a congratulations screen pops up.

Figure 28: Test done page

2. Conclusion

In this last chapter, we presented the work environment and the development process. We explained
the services performed and showed the results of the development using screen previews. Finally,
we closed the chapter with the functional tests of the work done.

General conclusion
During this project, we contributed to the establishment of an automatic and adaptive MCQ system
based on natural language processing and GBMs.

To carry out this work, we first presented the general context of the project, where we introduced
the host company as well as the general framework of the project. Then, we carried out a state of
the art of the notions treated in our end-of-studies project, and we described the management and
conduct of our project, before proceeding to the analysis and specification of the needs of our
application. Finally, this analysis was used in the design and implementation of the new
application.

This internship was beneficial to us as it allowed us to apply our theoretical knowledge through
the practice of new technologies. In addition, we had the opportunity to use some techniques that
we had never heard about before, such as BERT.

To be able to carry out this project, we integrated the professional environment of xHub, which
taught us to evolve by responding to technical requirements and constraints, and which instilled
in us notions and habits of professionalism.

This project was also a great human experience for us, in contact with coaches who gave us their
advice and let us benefit from their experience.

This work meets the needs previously set out, but it can obviously be improved and optimized by
integrating new datasets of higher quality than the one handed to us.

The profiling system is also still weak and needs more research. Our next goal is to build a
profiling system that gives a summary of the candidate after finishing the test, based on his
answers and his resume.

References

1- Understanding Gradient Boosting Machines - Harshdeep Singh
https://towardsdatascience.com/understanding-gradient-boosting-machines-9be756fe76ab

2- Gradient Boosting from scratch - Prince Grover
https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

3- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
https://arxiv.org/abs/1810.04805

4- Achieving State-Of-The-Art Results In Natural Language Processing - Laura Mitchell
https://badootech.badoo.com/achieving-state-of-the-art-results-in-natural-language-processing-d6fd25954a90

5- Natural Language Processing: State of The Art, Current Trends and Challenges - Sukhdev Singh
https://www.researchgate.net/publication/319164243_Natural_Language_Processing_State_of_The_Art_Current_Trends_and_Challenges

6- Deep contextualized word representations - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner
https://aclweb.org/anthology/N18-1202
