Louly Adam
Supervised by:
Abstract
For quite some time now, artificial intelligence (AI) and its subset, machine learning,
have been a hot topic. Countless industries apply this technology to automate and optimize
all kinds of processes, and the field of recruitment and HR is no exception.
Recruitment is one of the most painful processes for the HR department: reading all
applicants' resumes, classifying them by skills and level of expertise, scheduling the online
technical test, and approving whether a profile should be accepted for an in-person interview.
This process takes a great deal of time for each applicant.
The traditional recruitment process also fails candidates, companies, and recruiters alike. It
revolves around the human interpretation of complex data, which is highly sensitive to prejudice
and mental shortcuts.
In this project, we used machine learning and artificial intelligence techniques to build an
automatic hiring system that handles the recruitment process from A to Z, which will be very
helpful for the HR department.
The project consists of three main components. The first is a resume parser that uses natural
language processing to parse the resume submitted by a new applicant and extract their profile,
skills, and level of expertise. The second is a flexible technical test that uses Gradient
Boosting Machines (GBMs) to classify questions and natural language processing to predict the
next question. The third is a profiling system that analyses the CV and the performance on the
technical test, and produces a summary of the applicant that gives the hiring manager better
insight.
Résumé
The recruitment process is one of the most painful processes for the human resources
department. Reading all the candidates' CVs, classifying them by skills and level of expertise,
scheduling the online technical test, and approving whether or not a profile is accepted for an
in-person interview: this process takes a great deal of time for each applicant.
Moreover, the traditional recruitment process fails candidates, companies, and recruiters alike.
It revolves around the human interpretation of complex data that is highly sensitive to
prejudice and mental shortcuts.
Table of contents
Abstract 2
Résumé 3
Table of content 4
Table of figures 8
Dedications 10
Acknowledgments 11
General Introduction 12
Introduction 14
Introduction 22
Chapter 3 : Implementation 51
Introduction 52
1. Work Environment 52
General conclusion 66
References 67
End of studies internship report 2018-2019 8
Table of figures
Figure 1: xHub's logo .................................................................................................................... 14
Dedications
To My Dear Mother
An inexhaustible source of tenderness, patience, and sacrifice, your prayer and your blessing
have been of great help to me throughout my life. Whatever I may say or write, I could not
express my great affection and my deep gratitude. I hope never to disappoint you, nor to betray
your trust and your sacrifices.
May Almighty God preserve you and grant you health, long life, and happiness.
To My Dear Father
No dedication can express my respect, my gratitude, and my deep love, and I hope to realize one
of your dreams.
You have inspired me throughout my life and will always be a great inspiration to me for as
long as I live. May God preserve you and bring you health and happiness.
To My Brothers
Mohammed, Zayd, you are among the blessings for which I thank God. No dedication can express
the depth of the fraternal feelings, love, and attachment I feel for you.
I dedicate this work to you as a testimony of my deep affection and in memory of the
unwavering bond that has been woven between us over the days. May God protect you, keep you,
and strengthen our fraternity.
To My Family
Louly’s family, without you I would not have achieved this success.
To My Friends
It would be hard for me to name you all; you are all fondly in my heart.
Acknowledgments
I sincerely thank Allah for giving me the courage and the will to complete this work.
My gratitude, more than sincere and profound, goes to Professor El Karim El Moutaouakil, who
supervised, supported, and accompanied me during this end-of-studies internship project. His
advice, his remarks, and the many exchanges we had were of immense help in carrying out this
work. At the end of this scientific adventure, I would also like to thank him for his human
qualities, his scientific skills, and his teaching.
All my thanks to Mr. EL Houari Badr, my internship supervisor at xHub, who welcomed me and
supported me warmly in his department. Thank you for his daily support, his always sharp
remarks, and his encouragement.
My thanks also go to the members of the jury for agreeing to evaluate this work and its
implications.
Finally, I would also like to extend my affectionate thanks to the entire school and
administration of ENSAH for giving me all the necessary knowledge during my three years of
study, in a pleasant atmosphere of complicity and respect.
General Introduction
Automation is the technology by which a process or procedure is performed without
human assistance. Automation or automatic control is the use of various control systems for
operating equipment such as machinery, processes in factories, boilers and heat treating ovens,
switching on telephone networks, steering and stabilization of ships, aircraft and other
applications and vehicles with minimal or reduced human intervention. Some processes have
been completely automated.
During my internship, my mission was to combine automation and machine learning to build an
automatic, smart MCQ system. This system was designed to imitate a recruiter and cover most of
the recruitment process automatically.
The system is designed to parse resumes, analyse them, and classify each profile based on the
skills and level of expertise extracted from the resume using natural language processing
techniques such as BERT (Bidirectional Encoder Representations from Transformers) and
Multinomial Naive Bayes. After the resume is analysed and the profile classified, a technical
test is auto-generated from the output of the first component. The technical test is a flexible
and adaptive MCQ system: the next question is predicted based on the previous questions and the
answers submitted by the applicant.
After the technical test ends, an analyser comes into play, combines the applicant's behaviour,
resume, and submitted answers, and generates a summary about the applicant that helps the
recruiter decide whether the applicant is accepted for an in-person interview or not.
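To illustrate the classification step, the following is a minimal, self-contained sketch of a multinomial Naive Bayes classifier assigning a profile label to a resume snippet. The training snippets, labels, and keywords are invented for illustration; the actual system uses BERT representations and much larger data.

```python
import math
from collections import Counter, defaultdict

# Toy training data: resume snippets labelled with a profile class
# (hypothetical examples, not the project's real dataset).
TRAIN = [
    ("java spring hibernate maven", "backend"),
    ("python django flask api", "backend"),
    ("html css javascript react", "frontend"),
    ("vue css responsive design", "frontend"),
]

def train_nb(examples):
    """Fit a multinomial Naive Bayes model (word counts per class)."""
    word_counts = defaultdict(Counter)   # class -> word frequency
    class_counts = Counter()             # class -> number of documents
    vocab = set()
    for text, label in examples:
        words = text.split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Pick the class maximizing log prior + Laplace-smoothed likelihoods."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

model = train_nb(TRAIN)
print(predict(model, "react javascript css"))  # -> frontend
print(predict(model, "python flask api"))      # -> backend
```

The same scoring rule scales to real vocabularies; only the feature extraction (here a whitespace split) changes.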
Summary:
This chapter presents, in a general way, the context and the objectives of our graduation
project. We begin by presenting the company xHub as the host organization. The second part is
dedicated to the presentation of our project: its framework, its context, the study of the
existing situation, and its objectives.
Introduction
In theory, one should not start an action without having acquired thorough knowledge of the
human or economic environment in which the project will intervene.
Indeed, the analysis of the project context is a basic element of project management, intended
to create solutions in social, economic, and environmental contexts. It is an initial analysis
tool that is part of a chain of reflection and definition of projects. In addition, it is an
excellent communication tool when preparing documents to be presented to various stakeholders,
including potential project funders.
XHUB's experts are highly qualified IT engineers who have been selected primarily for their
positive and dynamic mindset. Their technical talents, combined with their very good
collaboration skills (listening, responsiveness, and speed of execution), are real assets that
allow XHUB to build, with confidence, real long-term partnerships with its customers. In
addition, XHUB aims to be a true catalyst for IT talent in the African region.
Finally, XHUB wants to prepare the next generation of Moroccan developers by sponsoring the
association Devoxx4Kids each year. This day, entirely dedicated to children, aims to let them
discover, in a fun way, the fabulous world of software development and to introduce them to
video game programming and robotics.
XHUB offers four main services: the organization of international IT events, the provision of
high-level IT expertise, outsourcing (staff augmentation and fixed-price projects), and the
development of innovative products.
1.2. IT Events
XHUB is the organizer of the Devoxx Morocco event, a first in Africa and the Middle East.
Devoxx is the largest independent IT conference in the world, specializing in information
technology. Different themes are presented by high-level speakers who make the daily news, such
as Big Data, Cloud, Internet of Things, Mobile, DevOps and Agility, IT entrepreneurship, and
Security.
1.3. IT Expertise
XHUB wants to position itself as the reference expert in Morocco. Indeed, XHUB is able to offer
highly specialized expertise services to Moroccan and African companies. Since it organizes
international conferences (with international and local partners), it has access to highly
specialized and globally recognized experts. It also stays up to date on the latest
technologies and can therefore readily address complex IT issues. It offers both consulting and
auditing services to any IT department wishing to undergo a digital transformation or acquire
high-performance IT solutions.
1.4. Outsourcing
XHUB's positioning in the developer community, where it helps to promote and enhance talent,
gives it an important connection with developers. The message conveyed when gathering this
community systematically promotes the talents who have become central resources in companies'
information-system organizations. Thus, XHUB can offer its customers talented resources that
will perfectly meet their needs. These resources are positioned either on-site at the client or
on fixed-price projects where XHUB provides additional technical support.
As part of xHub's growth strategy, whose main objectives include distinguishing itself from
other startups and consulting firms by organizing many events (the most important being
DevoxxMA) and by carrying out ever more innovative and modern internal projects, a social
network named "DevPeers" was imagined, serving as a successor to "developpeur.ma", a former
xHub project consisting of a collaborative platform for Moroccan developers.
Unfortunately, this platform had mixed success, due in particular to a user experience that did
not respond to the needs of its user base and to an architecture that, while effective, lagged
behind the competition technologically.
That is why the "DevPeers" project comes into play, aiming to correct the errors of its
predecessor, developpeur.ma, by focusing on data processing and analysis.
Thus, while the platform is still in its infancy, our team was responsible for designing and
implementing data-driven software components inspired by the latest technologies on the market,
to lay the foundation on which the platform will be able to develop, and thus guarantee its
evolution from a simple website to a highly scalable "data-driven website" responsive to the
data generated and consumed by its users.
Among the tasks to be carried out for the smooth progress of the project, we cite:
This method is simple to set up and greatly improves the follow-up of coaching as well as the
fluidity of the team's internal and external communication.
Thanks to its incremental nature, it also makes it possible to anticipate, and thus correct in
advance, any kind of deviation of the project. This is particularly useful in our case,
especially during the first phase of the project, which consists of analysing and designing our
first components and represents a crucial step with a high risk of derailing the project from
its original purpose.
Before starting the project, a first sprint, named sprint 0, took place under the supervision
of the Product Owner, who also took on the role of Scrum Master for the occasion, to document
the current situation, conduct a preliminary study of the circumstances of the project, and
train us in the Scrum methodology.
At the end of this sprint 0, our team was able to have a clear vision of the situation and
separate and specialized tasks were distributed among the members.
Five components were imagined and were the subject of our work during the following sprints:
• A recommendation system.
• A resume parser.
• A chatbot.
The next three-week sprint was the starting point of the work. I decided to work on the
technical test; the backlog of my tasks was as follows:
• Creation / Generation of the training / testing set for the first model of the components
• Study and test frameworks specialized in the development of Machine learning models
Conclusion
This first chapter was devoted to the presentation of our host organization xHub and to the
detailed presentation of our graduation project. In what follows, we begin the state of the art
of our project, carrying out a theoretical study of the concepts and fields related to our
subject.
Summary:
This chapter presents the state of the art of the machine learning techniques we used in our
MCQ system. First, it introduces the notion of natural language processing (NLP) and details
its levels. Then we discuss gradient boosting machines, which were used for the classification
problems.
Introduction
Natural language processing (NLP) has recently gained much attention for representing and
analysing human language computationally. Its applications have spread across various fields
such as machine translation, email spam detection, information extraction, summarization,
medicine, and question answering. This chapter distinguishes the main phases by discussing the
different levels of NLP and the components of Natural Language Generation (NLG), followed by a
presentation of the natural language processing applications used in our project.
The "levels of language" are one of the most explanatory methods for representing natural
language processing; they help generate NLP text by realising the Content Planning, Sentence
Planning, and Surface Realization phases.
Linguistics is the science that studies the meaning of language, language context, and the
various forms of language. The most important terminologies of natural language processing are
the following:
1.2.1. Phonology
Phonology is the branch of linguistics concerned with the systematic arrangement of sounds. The
term comes from Ancient Greek: phono- means voice or sound, and the suffix -logy refers to word
or speech. Nikolai Trubetzkoy stated that phonology is "the study of sound pertaining to the
system of language", whereas Lass in 1998 wrote that phonology refers broadly to the
subdiscipline of linguistics concerned with the sounds of language; in narrower terms,
"phonology proper is concerned with the function, behaviour and organization of sounds as
linguistic items". Phonology includes the semantic use of sound to encode the meaning of any
human language.
1.2.2. Morphology
The different parts of a word represent the smallest units of meaning, known as morphemes.
Morphology, which studies the structure of words, is built on morphemes. For example, the word
"precancellation" can be morphologically analysed into three separate morphemes: the prefix
pre-, the root cancella, and the suffix -tion. The interpretation of a morpheme stays the same
across all words, so to understand the meaning of an unknown word, humans can break it into its
morphemes. For example, adding the suffix -ed to a verb conveys that the action of the verb
took place in the past. Words that cannot be divided and that have meaning by themselves are
called lexical morphemes (e.g., table, chair). Morphemes such as -ed, -ing, -est, -ly, and -ful
that are combined with a lexical morpheme are known as grammatical morphemes (e.g., worked,
consulting, smallest, likely, useful). Grammatical morphemes that occur only in combination are
called bound morphemes (e.g., -ed, -ing). Grammatical morphemes can be divided into bound
morphemes and derivational morphemes.
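As a toy illustration of splitting grammatical morphemes from a lexical root, here is a minimal, hypothetical suffix-stripper. Real morphological analysers, such as the Porter stemmer, use far more elaborate rules; the suffix list and length check below are illustrative assumptions.

```python
# Grammatical morphemes discussed above, ordered longest first so the
# greedy match prefers "-ing" over "-g", etc. (illustrative subset).
SUFFIXES = ["ing", "est", "ful", "ed", "ly"]

def split_morphemes(word):
    """Return (root, suffix) if a known grammatical morpheme is found,
    otherwise treat the word as a lexical morpheme on its own."""
    for suffix in SUFFIXES:
        # Require a reasonably long remainder so "red" is not split as "r"+"ed".
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, ""

print(split_morphemes("worked"))    # ('work', 'ed')
print(split_morphemes("smallest"))  # ('small', 'est')
print(split_morphemes("table"))     # ('table', '')
```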
1.2.3. Lexical
At the lexical level, humans, as well as NLP systems, interpret the meaning of individual
words. Several types of processing contribute to word-level understanding, the first being the
assignment of a part-of-speech tag to each word. In this processing, words that can act as more
than one part of speech are assigned the most probable part-of-speech tag based on the context
in which they occur. At the lexical level, words that have a single meaning can be replaced by
their semantic representations. In an NLP system, the nature of the representation varies
according to the semantic theory deployed.
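The idea of assigning each word its most probable part-of-speech tag can be sketched with a tiny most-frequent-tag baseline. The tagged corpus below is invented for illustration; real taggers are trained on large annotated corpora and also exploit the surrounding context.

```python
from collections import Counter, defaultdict

# Hypothetical tagged corpus; note "file" appears both as NOUN and VERB.
TAGGED = [("the", "DET"), ("file", "NOUN"), ("is", "VERB"),
          ("on", "ADP"), ("the", "DET"), ("desk", "NOUN"),
          ("they", "PRON"), ("file", "VERB"), ("reports", "NOUN"),
          ("a", "DET"), ("file", "NOUN"), ("cabinet", "NOUN")]

# Count how often each word receives each tag.
tag_counts = defaultdict(Counter)
for word, tag in TAGGED:
    tag_counts[word][tag] += 1

def most_frequent_tag(word):
    """Assign the most probable tag seen in training ("UNK" if unseen)."""
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else "UNK"

print([most_frequent_tag(w) for w in ["the", "file", "is"]])
# -> ['DET', 'NOUN', 'VERB']  ("file" was seen more often as a NOUN)
```

This baseline ignores context entirely, which is exactly the limitation that context-aware taggers address.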
1.2.4. Syntactic
This level focuses on analysing the words in a sentence so as to uncover the grammatical
structure of the sentence. Both a grammar and a parser are required at this level. The output
of this level of processing is a representation of the sentence that reveals the structural
dependency relationships between the words. There are various grammars that can be used, and
the choice of grammar in turn affects the choice of parser. Not all NLP applications require a
full parse of sentences; therefore, the remaining challenges in parsing, such as
prepositional-phrase attachment and conjunction scoping, no longer hinder applications for
which phrasal and clausal dependencies are adequate.
Syntax conveys meaning in most languages because order and dependency contribute to meaning.
For example, the two sentences "The cat chased the mouse." and "The mouse chased the cat."
differ only in terms of syntax, yet convey quite different meanings.
1.2.5. Semantic
Semantics is the level at which most people think meaning is determined; however, all the
levels contribute to meaning. Semantic processing determines the possible meanings of a
sentence by focusing on the interactions among word-level meanings in the sentence. This level
of processing can incorporate the semantic disambiguation of words with multiple senses, in a
way analogous to how words that can function as multiple parts of speech are disambiguated at
the syntactic level. For example, among other meanings, "file" as a noun can mean a binder for
gathering papers, a tool to shape one's fingernails, or a line of individuals in a queue
(Elizabeth D. Liddy, 2001). The semantic level examines words not only for their dictionary
definitions but also for the meaning they derive from the context of the sentence. Semantics
recognizes that most words have more than one meaning but that we can identify the appropriate
one by looking at the rest of the sentence.
1.2.6. Discourse
While syntax and semantics work with sentence-length units, the discourse level of NLP works
with units of text longer than a sentence; that is, it does not interpret multi-sentence texts
as just a sequence of sentences, each of which can be interpreted singly. Rather, discourse
focuses on the properties of the text as a whole that convey meaning by making connections
between component sentences. Two of the most common discourse-level tasks are anaphora
resolution and discourse/text structure recognition. Anaphora resolution is the replacing of
words such as pronouns, which are semantically empty on their own, with the pertinent entity to
which they refer. Discourse/text structure recognition determines the functions of sentences in
the text, which, in turn, adds to the meaningful representation of the text.
1.2.7. Pragmatic
Pragmatics is concerned with the purposeful use of language in situations; it uses context over
and above the content of the text to understand the goal and to explain how extra meaning is
read into texts without being literally encoded in them. This requires much world knowledge,
including the understanding of intentions, plans, and goals. For example, resolving the
anaphoric term "they" in many sentence pairs requires pragmatic or world knowledge rather than
syntax or semantics alone.
Natural Language Generation (NLG) is the process of producing meaningful phrases, sentences,
and paragraphs from an internal representation. It is a part of natural language processing and
happens in four phases: identifying the goals, planning how the goals may be achieved by
evaluating the situation and the available communicative resources, and realizing the plans as
text. It is the opposite of natural language understanding.
According to the grammar, the content must be ordered both sequentially and in terms of
linguistic relations like modifications.
● Linguistic Resources – To support the information's realization, linguistic resources must be
chosen. In the end these resources come down to choices of particular words, idioms, syntactic
constructs, etc.
● Realization – The selected and organized resources must be realized as an actual text or
voice output.
● Application or Speaker – This is only for maintaining the model of the situation. Here the
speaker just initiates the process and does not take part in the language generation. It stores
the history, structures the potentially relevant content, and deploys a representation of what
it actually knows. All of these form the situation while a subset of the propositions the
speaker has is selected. The only requirement is that the speaker has to make sense of the
situation.
Natural language processing can be applied in various areas such as machine translation, email
spam detection, information extraction, summarization, and question answering.
Since most of the world is online, making data accessible and available to all is a challenge,
and a major barrier to accessibility is language. There is a multitude of languages with
different sentence structures and grammars. Machine translation translates phrases from one
language to another, for example with the help of a statistical engine like Google Translate.
The challenge with machine translation technologies is not translating words directly but
keeping the meaning of sentences intact, along with grammar and tenses. Statistical machine
learning gathers as much data as it can find that seems to be parallel between two languages
and crunches that data to find the likelihood that something in language A corresponds to
something in language B. In September 2016, Google announced a new machine translation system
based on artificial neural networks and deep learning. In recent years, various methods have
been proposed to automatically evaluate machine translation quality by comparing hypothesis
translations with reference translations. Examples of such methods are word error rate,
position-independent word error rate, generation string accuracy, multi-reference word error
rate, the BLEU score, and the NIST score. All these criteria try to approximate human
assessment and often achieve an astonishing degree of correlation to human subjective
evaluation of fluency and adequacy.
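As a concrete example of such evaluation criteria, the clipped unigram precision below is the basic building block of the BLEU score. This is only a sketch: the full BLEU metric also combines higher-order n-gram precisions and a brevity penalty, and handles multiple references.

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Clipped unigram precision: the fraction of hypothesis words that
    also appear in the reference, clipping each word's credit at its
    count in the reference so repetitions are not over-rewarded."""
    hyp = hypothesis.split()
    ref_counts = Counter(reference.split())
    matched = sum(min(count, ref_counts[word])
                  for word, count in Counter(hyp).items())
    return matched / len(hyp)

hyp = "the cat is on the mat"
ref = "there is a cat on the mat"
print(unigram_precision(hyp, ref))  # 5 of 6 hypothesis words match -> 0.833...
```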
Categorization systems take as input a large flow of data, such as official documents, military
casualty reports, market data, and newswires, and assign items to predefined categories or
indices. For example, the Carnegie Group's Construe system takes Reuters articles as input and
saves much time by doing work that would otherwise be done by human indexers. Some companies
have been using categorization systems to categorize trouble tickets or complaint requests and
route them to the appropriate desks. Another application of text categorization is email spam
filtering. Spam filters are becoming important as the first line of defence against unwanted
email. The false-negative and false-positive issues of spam filters are at the heart of NLP
technology, which comes down to the challenge of extracting meaning from strings of text. A
filtering solution applied to an email system uses a set of protocols to determine which
incoming messages are spam and which are not. Several types of spam filters are available.
Content filters review the content of the message to determine whether it is spam. Header
filters review the email header looking for fake information. General blacklist filters stop
all emails from blacklisted senders. Rules-based filters use user-defined criteria, such as
stopping mail from a specific person or mail containing a specific word. Permission filters
require anyone sending a message to be pre-approved by the recipient. Challenge-response
filters require anyone sending a message to enter a code in order to gain permission to send
email.
Spam filtering works using text categorization, and in recent times various machine learning
techniques have been applied to text categorization and anti-spam filtering, such as rule
learning, Naive Bayes, memory-based learning, support vector machines, decision trees, and
maximum entropy models, sometimes combining different learners. Using these approaches is
preferable because the classifier is learned from training data rather than built by hand.
Naive Bayes is often preferred because of its performance despite its simplicity. In text
categorization, two types of models have been used; both assume that a fixed vocabulary is
present. In the first model, a document is generated by first choosing a subset of the
vocabulary and then using each selected word at least once, irrespective of order. This is
called the multivariate Bernoulli model: it captures which words are used in a document,
irrespective of their counts and order. In the second model, a document is generated by
choosing a set of word occurrences and arranging them in any order. This is called the
multinomial model; in addition to what the multivariate Bernoulli model captures, it also
records how many times each word is used in a document. Most text categorization approaches to
anti-spam email filtering have used the multivariate Bernoulli model.
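The difference between the two document models can be made concrete by comparing the feature vectors they extract from the same email. The vocabulary and message below are invented for illustration.

```python
from collections import Counter

# A tiny fixed vocabulary, as both models assume (illustrative only).
VOCAB = ["free", "money", "meeting", "project", "now"]

def multinomial_features(text):
    """Multinomial model: how many times each vocabulary word occurs."""
    counts = Counter(text.split())
    return [counts[w] for w in VOCAB]

def bernoulli_features(text):
    """Multivariate Bernoulli model: only whether each word occurs."""
    present = set(text.split())
    return [1 if w in present else 0 for w in VOCAB]

email = "free money free money now"
print(multinomial_features(email))  # [2, 2, 0, 0, 1] - keeps the counts
print(bernoulli_features(email))    # [1, 1, 0, 0, 1] - presence only
```

A classifier trained on the first representation can exploit repetition (spam often repeats trigger words), while the second discards it.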
Information extraction is concerned with identifying phrases of interest in textual data. For
many applications, extracting entities such as names, places, events, dates, times, and prices
is a powerful way of summarizing the information relevant to a user's needs. In the case of a
domain-specific search engine, the automatic identification of important information can
increase the accuracy and efficiency of a directed search. Hidden Markov models (HMMs) have
been used to extract the relevant fields of research papers; the extracted text segments are
used to allow searches over specific fields, to provide effective presentation of search
results, and to match references to papers. An everyday example is the pop-up ads on websites
showing recent items you might have looked at in an online store, with discounts. In
information retrieval, the same two document models described in the previous section, the
multivariate Bernoulli model and the multinomial model, have been used.
Discovery of knowledge has become an important area of research in recent years. Knowledge
discovery research uses a variety of techniques to extract useful information from source
documents, such as part-of-speech (POS) tagging; chunking or shallow parsing; stop-word removal
(stop-words are frequent keywords that must be removed before processing documents); and
stemming (mapping words to a base form, with two main methods, dictionary-based stemming and
Porter-style stemming: the former has higher accuracy but a higher implementation cost, while
the latter has a lower implementation cost but is usually insufficient for IR). Other
techniques include compound or statistical phrases (compounds and statistical phrases index
multi-token units instead of single tokens) and word sense disambiguation (the task of
determining the correct sense of a word in context; when used for information retrieval, terms
are replaced by their senses in the document vector).
The extracted information can be applied for a variety of purposes, for example to prepare a
summary, build databases, identify keywords, or classify text items according to predefined
categories. For example, CONSTRUE was developed for Reuters and is used to classify news
stories. It has been suggested that while many IE systems can successfully extract terms from
documents, acquiring the relations between the terms is still difficult. PROMETHEE is a system
that extracts lexico-syntactic patterns relative to a specific conceptual relation. IE systems
should work at many levels, from word recognition to discourse analysis at the level of the
complete document. One application of the Blank Slate Language Processor (BSLP) approach is the
analysis of a real-life natural language corpus consisting of responses to open-ended
questionnaires in the field of advertising. There is also a system called MITA (MetLife's
Intelligent Text Analyzer) that extracts information from life insurance applications. Ahonen
et al. suggested a mainstream framework for text mining that uses pragmatic and discourse-level
analyses of text.
1.4.5. Summarization
Information overload is a real problem in this digital age; already our reach and access to
knowledge and information exceed our capacity to understand it. This trend is not slowing down,
so the ability to summarize data while keeping the meaning intact is highly required. This is
important not just for allowing us to recognize and understand the important information in a
large set of data; it is also used to understand deeper emotional meanings. For example, a
company may determine the general sentiment on social media about its latest product offering,
which makes this application a valuable marketing asset.
The types of text summarization depend on the number of documents, and the two important
categories are single-document summarization and multi-document summarization. Summaries can
also be of two types, generic or query-focused, and the summarization task can be either
supervised or unsupervised. A supervised system requires training data for selecting relevant
material from the documents.
In this section, we’ll talk about some core concepts of natural language processing.
The concept of word embedding plays a critical role in the realisation of transfer learning for
NLP tasks. Word embeddings are essentially fixed-length vectors used to encode and represent a
piece of text.
The vectors represent words as multidimensional continuous numbers where semantically similar
words are mapped to proximate points in geometric space. You can see here below how the
vectors for sports like “tennis”, “badminton”, and “squash” get mapped very close together.
A key benefit of representing words as vectors is that mathematical operations can be performed on them, such as vector addition and subtraction; the classic example is that the vector for "king", minus "man", plus "woman", lands close to the vector for "queen".
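As an illustration, this word-analogy arithmetic can be sketched with toy vectors. The values below are invented for the example (real embeddings are learned and typically have 100 to 300 dimensions):

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made-up values for illustration only).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.2, 0.1]),
    "woman": np.array([0.5, 0.2, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land near queen.
result = vectors["king"] - vectors["man"] + vectors["woman"]
print(cosine_similarity(result, vectors["queen"]))  # close to 1.0
```

In a real embedding space the nearest neighbour of the resulting vector is found by comparing it against the whole vocabulary in the same way.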
To get a good representation of text data it is crucial for us to be able to capture both the context
and semantics. For example, consider the two sentences below, whilst the spelling of the word
“minute” is the same in both cases, their meanings are very different.
In addition to this, the same words can also have different meanings based on their context. For
example, “good” and “not good” convey two very different sentiments.
By incorporating the context of words we can achieve high performance for downstream real-
world tasks such as sentiment analysis, text classification, clustering, summarisation, translation
etc.
Traditional approaches to NLP such as one-hot encoding and bag of words models do not
capture information about a word’s meaning or context. However, neural network-based
language models aim to predict words from neighbouring words by considering their sequences
in the corpus.
Word2Vec and GloVe are both implementations which can be used to produce word embeddings from co-occurrence information. Word2Vec is a neural, predictive method and only takes local context windows into account. By contrast, GloVe works on global co-occurrence statistics: matrix factorization is performed on the word co-occurrence matrix to yield a lower-dimensional matrix of words and features, where each row yields a dense, expressive vector representation for each word.
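To make the notion of co-occurrence information concrete, here is a minimal sketch of counting co-occurrences within a context window. The tiny corpus and window size are invented for illustration; real models use huge corpora:

```python
from collections import Counter

# A tiny toy corpus (invented for illustration).
corpus = [
    "i like deep learning",
    "i like nlp",
    "i enjoy flying",
]

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` positions of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pair = tuple(sorted((w, tokens[j])))  # order-independent pair
                counts[pair] += 1
    return counts

counts = cooccurrence_counts(corpus)
print(counts[("i", "like")])  # "i like" occurs in two sentences -> 2
```

GloVe-style methods factorize the matrix built from counts like these; Word2Vec instead slides over the same windows and trains a predictor.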
We will now present some architectures which leverage these fundamental concepts and have
recently achieved some State Of The Art (SOTA) results on NLP tasks.
ELMo aims to provide an improved word representation for NLP tasks in different contexts by
producing multiple word embeddings per single word, across different scenarios. In the example
below, the word “minute” has multiple meanings (homonyms) so gets represented by multiple
embeddings with ELMo. However, with other models such as GloVe, each instance would have
the same representation regardless of its context.
ELMo uses a bidirectional language model (biLM) to learn both word and linguistic context. At
each word, the internal states from both the forward and backward pass are concatenated to
produce an intermediate word vector. As such, it is the model’s bidirectional nature that gives it a
hint not only as to the next word in a sentence but also the words that came before.
Another feature of ELMo is that it uses language models comprised of multiple layers, forming a
multilayer RNN. The intermediate word vector produced by layer 1 is fed up to layer 2. The
more layers that are present in the model, the more the internal states get processed and as such
represent more abstract semantics such as topics and sentiment. By contrast, lower layers
represent less abstract semantics such as short phrases or parts of speech.
In order to compute the word embeddings that get fed into the first layer of the biLM, ELMo
uses a character-based CNN. The input is computed purely from combinations of characters
within a word. This has two key benefits:
● It is able to form representations of words external to the vocabulary it was trained on.
For example, the model could determine that "Mum" and "Mummy" are somewhat
related before even considering the context in which they are used. This is particularly
useful as it can help detect misspelled words through context.
● It continues to perform well when it encounters a word that was absent from the training
dataset.
Essentially, BERT is a trained transformer encoder stack where results are passed up from one
encoder to the next.
Figure 14: BERT
At each encoder, self-attention is applied and this helps the encoder to look at other words in the
input sentence as it encodes each specific word, so helping it to learn correlations between the
words. These results then pass through a feed-forward network.
BERT was trained on Wikipedia text data and uses masked modelling rather than sequential
modelling during training. It masks 15% of the words in each sequence and tries to predict the
original value based on the context. This involves the following:
● Adding a classification layer on top of the encoder output.
● Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
● Calculating the probability of each word in the vocabulary with softmax.
As the BERT loss function only takes the prediction of the masked values into consideration, the model converges more slowly than directional models. This drawback, however, is offset by its increased awareness of context.
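The masking step itself can be sketched independently of the model. The helper below is our own illustration of hiding roughly 15% of a sequence; real BERT preprocessing additionally replaces some chosen tokens with random or unchanged tokens rather than always using [MASK]:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace ~15% of the tokens with [MASK], returning the masked
    sequence and the positions the model would be asked to predict."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = mask_token
    return masked, positions

tokens = "the model tries to predict the original value of each hidden word".split()
masked, positions = mask_tokens(tokens)
print(masked)
```

During training, the loss is computed only on the returned `positions`, which is why convergence is slower than in left-to-right language models.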
MT-DNN extends the function of BERT to achieve even better results in NLP problems. It uses
multi-task (parallel) learning instead of transfer (sequential) learning, computing the loss across different tasks while simultaneously applying the changes to the model.
As with BERT, MT-DNN tokenises sentences and transforms them into the initial word, segment and position embeddings. The bidirectional transformer is then used to learn the contextual word embeddings.
Gradient boosting machines are a family of powerful machine-learning techniques that have
shown considerable success in a wide range of practical applications. They are highly
customizable to the particular needs of the application, like being learned with respect to
different loss functions.
2.2.1. Ensemble
When we try to predict the target variable using any machine learning technique, the main causes of difference between actual and predicted values are noise, variance, and bias. Ensembling helps to reduce these factors (except noise, which is irreducible error).
An ensemble is just a collection of predictors which come together (e.g. by taking the mean of all predictions) to give a final prediction. The reason we use ensembles is that many different predictors trying to predict the same target variable will do a better job than any single predictor alone. Ensembling techniques are further classified into bagging and boosting.
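The "mean of all predictions" combination can be sketched in a few lines. The three toy models below are invented for illustration; their individual biases cancel when averaged:

```python
def ensemble_predict(predictors, x):
    """Combine several predictors by averaging their individual predictions."""
    predictions = [predict(x) for predict in predictors]
    return sum(predictions) / len(predictions)

# Three deliberately different toy "models" for the same target y = 2x.
predictors = [
    lambda x: 2.0 * x,        # unbiased model
    lambda x: 2.0 * x + 0.3,  # model that overshoots a little
    lambda x: 2.0 * x - 0.3,  # model that undershoots a little
]
print(ensemble_predict(predictors, 5.0))  # the individual errors cancel: 10.0
```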
2.2.2. Bagging
We typically take a random sub-sample/bootstrap of the data for each model, so that all the models are a little different from each other. Each observation is chosen with replacement to be used as input for each of the models, so each model sees a different set of observations based on the bootstrap process. Because this technique combines many uncorrelated learners into a final model, it reduces error by reducing variance. An example of a bagging ensemble is the Random Forest model.
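Bootstrap sampling with replacement can be sketched as follows (the data is a toy list; real bagging draws one such sample per base learner):

```python
import random

def bootstrap_sample(data, seed):
    """Draw a sample of the same size as `data`, with replacement:
    some observations repeat, others are left out entirely."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(len(data))]

data = list(range(10))
samples = [bootstrap_sample(data, seed) for seed in range(3)]
for s in samples:
    print(sorted(s))  # typically contains duplicates and omissions
```

Each of the three samples would train one base model; their predictions are then averaged as in the ensemble sketch.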
2.2.3. Boosting
Boosting is an ensemble technique in which the predictors are not made independently, but
sequentially.
This technique employs the logic that subsequent predictors learn from the mistakes of the previous predictors. Therefore, the observations have an unequal probability of appearing in subsequent models, and the ones with the highest error appear most often. (So the observations are not chosen by a bootstrap process, but based on their error.) The predictors can be chosen from a range of models like decision trees, regressors, classifiers etc. Because new predictors learn from the mistakes committed by previous predictors, it takes fewer iterations to get close to the actual predictions. But we have to choose the stopping criteria carefully, or it could lead to overfitting on the training data. Gradient Boosting is an example of a boosting algorithm.
The objective of any supervised learning algorithm is to define a loss function and minimize it.
Let’s see how the maths works out for the Gradient Boosting algorithm. Say we have mean squared error (MSE) as the loss, defined as:

Loss = MSE = Σ (y_i - y_i^p)^2

where y_i is the i-th actual value and y_i^p the corresponding predicted value. We want our predictions to be such that our loss function (MSE) is minimal. By using gradient descent and updating our predictions based on a learning rate α, we can find the values where MSE is minimal:

y_i^p = y_i^p - α · ∂Loss/∂y_i^p

Which becomes:

y_i^p = y_i^p + 2α (y_i - y_i^p)

So, we are basically updating the predictions such that the sum of our residuals is close to 0 (or minimal) and the predicted values are sufficiently close to the actual values.
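This update rule can be illustrated numerically: starting from a constant prediction, repeated gradient steps drive each residual toward zero. The data and learning rate below are invented for the sketch; real gradient boosting fits a new weak learner (e.g. a tree) to the residuals at each step instead of updating predictions directly:

```python
# Toy illustration of the update y^p <- y^p + 2*alpha*(y - y^p).
y = [3.0, 5.0, 8.0]                 # actual values (invented toy data)
preds = [sum(y) / len(y)] * len(y)  # initial prediction: the mean

alpha = 0.1
for _ in range(100):
    # gradient of (y_i - p_i)^2 w.r.t. p_i is -2 * (y_i - p_i),
    # so the gradient-descent step adds 2 * alpha * (y_i - p_i)
    preds = [p + 2 * alpha * (yi - p) for yi, p in zip(y, preds)]

residuals = [yi - p for yi, p in zip(y, preds)]
print(residuals)  # each residual has shrunk toward 0
```

Each iteration multiplies every residual by (1 - 2α), so with α = 0.1 the residuals decay geometrically.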
2.4.1. CatBoost
CatBoost offers the flexibility of giving indices of categorical columns so that they can be encoded using one-hot encoding via one_hot_max_size (one-hot encoding is used for all features with a number of distinct values less than or equal to the given parameter value). If you don’t pass anything in the cat_features argument, CatBoost will treat all the columns as numerical variables.
2.4.2. LightGBM
Similar to CatBoost, LightGBM can also handle categorical features by taking the input of feature names. It does not convert them to one-hot encoding, which makes it much faster than one-hot-based approaches. LGBM uses a special algorithm to find the split value of categorical features.
2.4.3. XGBoost
Unlike CatBoost or LGBM, XGBoost cannot handle categorical features by itself; it only accepts numerical values, similar to Random Forest. One therefore has to perform an encoding such as label encoding, mean encoding or one-hot encoding before supplying categorical data to XGBoost.
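The two simplest of those encodings can be sketched without any library (the `colors` column is a toy example):

```python
def label_encode(values):
    """Map each distinct category to an integer (XGBoost needs numbers)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Expand a categorical column into one binary column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "blue", "green"]
encoded, mapping = label_encode(colors)
print(encoded)                 # [2, 1, 0, 1]
print(one_hot_encode(colors))  # one binary column per distinct color
```

In practice one would use a library encoder fitted on the training set only, so that unseen categories at test time are handled consistently.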
3. Performance Measures
3.1. Classification Accuracy
Classification Accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples. It works well only if there are an equal number of samples belonging to each class.
For example, consider that there are 98% samples of class A and 2% samples of class B in our training set. Then our model can easily get 98% training accuracy by simply predicting every sample as belonging to class A. When the same model is tested on a test set with 60% samples of class A and 40% samples of class B, the test accuracy drops down to 60%. Classification Accuracy is great, but it can give us a false sense of achieving high performance.
The real problem arises when the cost of misclassifying the minority class samples is very high. If we deal with a rare but fatal disease, the cost of failing to diagnose the disease of a sick person is much higher than the cost of sending a healthy person for more tests.
3.2. Confusion Matrix
Confusion Matrix, as the name suggests, gives us a matrix as output and describes the complete performance of the model.
Let’s assume we have a binary classification problem, with samples belonging to two classes: YES or NO. We also have our own classifier which predicts a class for a given input sample. Testing our model on 165 samples, we get the following result.
● True Positives : The cases in which we predicted YES and the actual output was also
YES.
● True Negatives : The cases in which we predicted NO and the actual output was NO.
● False Positives : The cases in which we predicted YES and the actual output was NO.
● False Negatives : The cases in which we predicted NO and the actual output was YES.
Accuracy for the matrix can be calculated by dividing the sum of the values lying on the "main diagonal" (true positives plus true negatives) by the total number of samples.
The Confusion Matrix forms the basis for the other types of metrics.
3.3. Area Under Curve
Area Under Curve (AUC) is one of the most widely used metrics for evaluation. It is used for binary classification problems. The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example.
● True Positive Rate (Sensitivity): the True Positive Rate is defined as TP / (FN + TP). It corresponds to the proportion of positive data points that are correctly considered as positive, with respect to all positive data points.
● False Positive Rate (1 − Specificity): the False Positive Rate is defined as FP / (FP + TN). It corresponds to the proportion of negative data points that are mistakenly considered as positive, with respect to all negative data points.
False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR and TPR are both computed at threshold values such as (0.00, 0.02, 0.04, …, 1.00) and a graph is drawn. AUC is the area under the curve of the plot of True Positive Rate against False Positive Rate at these different thresholds.
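The probabilistic definition above gives a direct way to compute AUC without drawing the curve: count, over all positive/negative pairs, how often the positive is scored higher. The scores below are an invented toy example:

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC = probability that a random positive is ranked above a
    random negative (ties count as half a win)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy classifier scores (invented): positives mostly score higher.
pos = [0.9, 0.8, 0.4]
neg = [0.5, 0.3, 0.2]
print(auc_from_scores(pos, neg))  # 8 of 9 pairs are ranked correctly
```

A perfect ranker scores 1.0, a random one about 0.5; library implementations compute the same quantity more efficiently from sorted scores.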
3.4. F1 Score
F1 Score is the harmonic mean between precision and recall. The range for F1 Score is [0, 1]. It tells you how precise your classifier is (how many of its positive predictions are correct), as well as how robust it is (whether it misses a significant number of instances).
High precision but low recall gives you an extremely accurate classifier, but one that misses a large number of instances that are difficult to classify. The greater the F1 Score, the better the performance of our model.
● Precision: the number of correct positive results divided by the number of positive results predicted by the classifier.
● Recall: the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
3.5. Mean Absolute Error
Mean Absolute Error is the average of the absolute differences between the original values and the predicted values. It gives us a measure of how far the predictions were from the actual output. However, it doesn’t give us any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data.
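All of the metrics in this section follow directly from the four confusion-matrix cells (and, for MAE, from the raw predictions). A small self-contained sketch with invented counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Example counts (invented): 100 TP, 50 TN, 10 FP, 5 FN.
acc, prec, rec, f1 = classification_metrics(tp=100, tn=50, fp=10, fn=5)
print(round(prec, 3), round(rec, 3), round(f1, 3))
print(mean_absolute_error([3.0, 5.0], [2.5, 5.5]))  # 0.5
```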
Conclusion
This chapter has allowed us to frame the subject of our work, to get closer to the techniques we used in our project, and to introduce all the terms and concepts related to it. The next chapter will be dedicated to the analysis and specification of needs.
Chapter 3 : Implementation
Summary:
After having detailed the design adapted to our project, we will devote this last chapter of the report to the realization part. For this we will first present the hardware and software development environment, then we will describe the development phase, including the technologies adopted to carry out the project and the tests carried out, and we will end with some graphical interfaces illustrating the work done.
Introduction
In previous chapters, we tried to follow a logical sequence that allowed us to develop our project. We now come to the realization phase, which is the finalization and completion phase of the project.
The purpose of this chapter is not to describe the lines of the source code one after the other; this would be tedious and deeply boring for the reader. It is rather about presenting the work environment, the user interfaces of the application, as well as the final product evaluation tests.
1. Work Environment
In this part, we will present the environment used mainly to train models and deal with data.
1.1.1. Kaggle Kernels
Kaggle Kernels are essentially Jupyter notebooks in the browser that can be run in the cloud, free of charge. We used Kaggle Kernels for our data processing, as well as to train the models.
Technical specifications:
CPU Specifications
● 4 CPU cores
● 17 Gigabytes of RAM
GPU Specifications
● 2 CPU cores
● 14 Gigabytes of RAM
1.1.2. Python
Python has become the language of choice for data analytics. One of the major reasons for this is the availability of easy and fun-to-work-with libraries in Python which make it convenient to work with and analyse large data sets.
Some of the packages that we used in the machine learning & data analytics part are:
● SciPy is also a scientific computing library that adds a collection of algorithms and high
level commands for manipulating and visualizing data. It also contains modules for
optimization, linear algebra, integration, Fast Fourier Transform, signal and image
processing and much more.
● Pandas provides easy to use data analysis tools and contains functions designed to make
data analysis fast and easy.
1.1.4. Keras
Keras is an open-source neural-network library written in Python. It can run on top of TensorFlow and other backends, and was designed to enable fast experimentation with deep neural networks.
1.1.5. PyTorch
PyTorch is a machine learning library for the Python programming language, based on the Torch library, used for applications such as natural language processing. It is primarily developed by Facebook's artificial-intelligence research group, and Uber's Pyro probabilistic programming language software is built on it. It is free and open-source software released under one of the BSD licenses.
1.1.6. XGBoost
XGBoost is an open-source library providing an optimized gradient boosting framework, with interfaces for Python and several other languages.
React (also known as React.js or ReactJS) is a JavaScript library for building user interfaces. It is
maintained by Facebook and a community of individual developers and companies.
Jenkins is an open source automation server written in Java. Jenkins helps to automate the non-
human part of the software development process, with continuous integration and facilitating
technical aspects of continuous delivery. It is a server-based system that runs in servlet
containers such as Apache Tomcat. It supports version control tools, including AccuRev, CVS,
Subversion, Git, Mercurial, Perforce, TD/OMS, ClearCase and RTC, and can execute Apache
Ant, Apache Maven and sbt based projects as well as arbitrary shell scripts and Windows batch
commands. The creator of Jenkins is Kohsuke Kawaguchi. Released under the MIT License,
Jenkins is free software.
This model was made in order to parse a resume and extract information about the applicant using rule-based and machine learning techniques.
To do so, we need to apply NLP and classification algorithms on labeled data, since it is a supervised learning problem.
We collected data from freelance and hiring websites using web scraping. The data set contains only English resumes, all in PDF format.
This phase is one of the most important phases in building our model: we applied RegEx and some rules to extract as much information as possible.
The examples below show two different ways in which one could tokenize the string 'Analyzing
text is not that hard'.
(Incorrect): Analyzing text is not that hard. = [“Analyz”, “ing text”, “is n”, “ot that”, “hard.”]
(Correct): Analyzing text is not that hard. = [“Analyzing”, “text”, “is”, “not”, “that”, “hard”, “.”]
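A correct word-level tokenization can be approximated with a short regular expression. This is a simplified sketch; production tokenizers (spaCy, NLTK, etc.) handle many more edge cases such as contractions and abbreviations:

```python
import re

def tokenize(text):
    """Split text into word tokens and separate punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Analyzing text is not that hard."))
# -> ['Analyzing', 'text', 'is', 'not', 'that', 'hard', '.']
```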
Once the tokens have been recognized, it's time to categorize them. Part-of-speech tagging refers
to the process of assigning a grammatical category, such as noun, verb, etc. to the tokens that
have been detected.
Here are the PoS tags of the tokens from the sentence above:
“Analyzing”: VERB, “text”: NOUN, “is”: VERB, “not”: ADV, “that”: ADV, “hard”: ADJ, “.”:
PUNCT
Stemming and Lemmatization both refer to the process of removing all of the affixes (i.e.
suffixes, prefixes, etc.) attached to a word in order to keep its lexical base, also known as root or
stem or its dictionary form or lemma. The main difference between these two processes is that
stemming is usually based on rules that trim word beginnings and endings (and sometimes lead
to somewhat weird results), whereas lemmatization makes use of dictionaries and a much more
complex morphological analysis.
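The rule-based, trimming character of stemming can be sketched with a deliberately tiny stemmer. The suffix list below is our own toy choice; real stemmers such as Porter's apply many ordered rules, and lemmatizers consult dictionaries instead:

```python
def naive_stem(word):
    """A tiny rule-based stemmer: trim a few common English suffixes,
    keeping at least a 3-letter base (can produce non-words, as noted)."""
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("analyzing", "analyzed", "classes", "hard"):
    print(w, "->", naive_stem(w))
```

Note how "analyzing" and "analyzed" both collapse to the non-word "analyz", which is exactly the kind of "somewhat weird result" mentioned above; a lemmatizer would return "analyze" instead.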
To provide a more accurate automated analysis of the text, it is important that we remove from
play all the words that are very frequent but provide very little semantic information or no
meaning at all. These words are also known as stopwords.
There are many different lists of stopwords for every language. However, it is important to
understand that you might need to add words to or remove words from those lists depending on
the texts you would like to analyze and the analyses you would like to perform.
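Stopword removal itself is a simple filter. The list below is a small invented subset; libraries ship fuller per-language lists, which, as noted, you may still need to adapt:

```python
# A small stopword list (a toy subset for illustration).
STOPWORDS = {"is", "not", "that", "the", "a", "an", "of", "to", "and"}

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["Analyzing", "text", "is", "not", "that", "hard"]
print(remove_stopwords(tokens))  # ['Analyzing', 'text', 'hard']
```

Beware that dropping "not" (as here) destroys negation, which is why sentiment-oriented pipelines often keep such words in their lists.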
1.2.3. Modelling
The model was made to be able to extract skills from a resume and to classify it into a profile (Backend, Frontend, Data Scientist, etc.).
We trained a Naive Bayes model that can predict whether a word in a corpus is a technical skill or not, and also whether a text belongs to a backend or a frontend developer.
We used word2vec and doc2vec to transform the data into vectors so the model can make good use of it.
After several rounds of model tuning, we achieved good results, which we will see in the demonstration part.
The data was collected from an anonymous source; there were 5 columns of independent variables and one dependent variable, whose value was -1, 0 or 1. The data came already prepared, so we did not need to preprocess it.
1.3.3. Modelling
In this phase we trained an XGBoost classifier, which predicts one of the 3 values (-1, 0 or 1).
Model specifications:
As the dataset consisted of good-quality, well-prepared data, the results were good, with an accuracy of around 96%.
In this part, we used a pre-trained model, BERT-base, which we mentioned before in the environment section.
BERT makes use of Transformer, an attention mechanism that learns contextual relations
between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate
mechanisms — an encoder that reads the text input and a decoder that produces a prediction for
the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is
necessary. The detailed workings of Transformer are described in a paper by Google.
We used BERT to predict the next question to be generated, based on the answer of the user.
● 0 (false answer): the next question should be different from the previous one.
The model is still new and general, so we had to implement our own way to calculate the next sentence prediction, based on the PyTorch & FastAI implementations.
We will now present the way we developed our application, which is an intelligent quiz system. In this part, we will explain the general workflow of the application by presenting every step and its corresponding action on the server side.
● The resume is parsed in the backend using the models we built before.
● A JSON object containing the extracted skills and profile is sent to the client side.
● The client side asks the user to rate himself for each extracted skill.
● The ratings are sent back to the backend, a session is created and the test starts.
● The user then passes 5 general questions, mainly about general concepts and problem solving.
● When the user finishes the general test, the client side asks the backend for the next question.
● The backend predicts the next question and sends it to the client side.
● The user keeps on answering the quiz; each result is sent to the backend to predict the next question, until the end of the quiz.
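The server-side session behind these steps can be sketched as a small state machine. Everything here (class and method names, the question counts beyond the 5 general ones, the placeholder "kinds") is our own illustration, not the actual project code; in the real backend a model picks the next question:

```python
class QuizSession:
    """Minimal sketch of per-applicant quiz state on the server side."""
    GENERAL_QUESTIONS = 5   # the 5 general questions served first
    TOTAL_QUESTIONS = 20    # assumed overall quiz length (illustrative)

    def __init__(self, skills, ratings):
        self.skills = skills    # skills extracted from the resume
        self.ratings = ratings  # self-ratings sent by the client
        self.answered = 0

    def next_question(self, last_answer_correct=None):
        """Return a placeholder for the next question; a trained model
        would normally pick it based on the previous answer."""
        if self.answered < self.GENERAL_QUESTIONS:
            kind = "general"
        elif last_answer_correct is False:
            kind = "different-topic"  # wrong answer: switch question
        else:
            kind = "follow-up"
        self.answered += 1
        return {"number": self.answered, "kind": kind}

session = QuizSession(skills=["python", "sql"], ratings={"python": 4, "sql": 3})
print([session.next_question() for _ in range(5)])       # the general phase
print(session.next_question(last_answer_correct=False))  # adaptive phase starts
```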
We trained multiple models, but the best results were achieved using XGBoost. SVM results were good in the training phase, but in the testing phase we can see that it overfits. Random Forest and XGBoost results were close, but Random Forest also overfits a little bit.
1.5.3. Demonstration
When you first get to the website, a page appears containing the available positions that are open at the company.
After choosing the profile you want to apply to, a page appears with a form where you fill in your name and upload your resume in PDF format.
After uploading the resume, the script starts parsing your resume and returns a list of your skills for you to rate.
After finishing the general test, the test associated with the chosen profile starts. The user keeps on answering the questions; after each question the script calculates the difficulty level and the next question, for 15 questions in total, and when you finish answering all of them the test ends.
2. Conclusion
In this last chapter, we were able to present the work environment and the development process. We explained the services implemented and presented the results of the development using screen previews. Finally, we closed the chapter with functional tests of the work done.
General conclusion
During this project, we contributed to the establishment of an automatic & adaptive MCQ system based on natural language processing and GBMs.
To carry out this work, we first presented the general context of the project where we presented
the host company as well as the general framework of the project. Then, we carried out a state of
the art of the notions treated in our final studies project, then we described the management and conduct of our project, before proceeding to the analysis and specification of the needs of our application. Finally, this analysis was used in the design and implementation of the new application.
This internship was beneficial to us as it allowed us to apply our theoretical knowledge through
the practice of new technologies. In addition, we had the opportunity to use some techniques that
we have never heard about before, such as BERT.
To be able to carry out this project, we have integrated the professional environment in which
xHub has taught us to evolve by responding to technical requirements and constraints; and
instilling in us notions and habits of professionalism.
This project was a great human experience for us, in contact with the coaches who gave us their
advice and who made us benefit from their experiences.
This work meets the needs previously set out, but it can obviously be improved and optimized by integrating new datasets of higher quality than the one handed to us.
The profiling system is also still weak, and more research is needed on this part. Our next goal is to make a profiling system that gives a summary about the candidate after finishing the test, based on his answers and his resume.
References
5- Natural Language Processing: State of The Art, Current Trends and Challenges -
Sukhdev Singh
https://www.researchgate.net/publication/319164243_Natural_Language_Processing_State_of_The_Art_Current_Trends_and_Challenges
https://aclweb.org/anthology/N18-1202